Welcome to Always Bee Tracing!
If you haven’t already, please clone the repository of your choice:
▸ Golang (into your $GOPATH):
git clone git@github.com:honeycombio/tracing-workshop-go.git
▸ Node:
git clone git@github.com:honeycombio/tracing-workshop-node.git
Please also accept your invites to the "Always Bee Tracing" Honeycomb team and our Slack channel
Always Bee Tracing
A Honeycomb Tracing workshop
A bit of history
▸ We used to have "one thing" (monolithic application)
▸ Then we started to have "more things" (splitting monoliths into services)
▸ Now we have "yet more things", or even "Death Star" architectures (microservices, containers, serverless)
A bit of history
▸ Now we have N² problems (one slow service bogs down everything, etc.)
▸ 2010 - Google releases the Dapper paper describing how they improve on existing tracing systems
▸ Key innovations: use of sampling, common client libraries decoupling app code from tracing logic
Why should GOOG have all the fun?
▸ 2012 - Zipkin was developed at Twitter for use with Thrift RPC
▸ 2015 - Uber releases Jaeger (also OpenTracing)
▸ Better sampling story, better client libraries, no Scribe/Kafka
▸ Various proprietary systems abound
▸ 2019 - Honeycomb is the best available due to best-in-class queries ;)
A word on standards
▸ Standards for tracing exist: OpenTracing, OpenCensus, etc.
▸ Pros: collaboration, preventing vendor lock-in
▸ Cons: slower innovation, political battles/drama
▸ Honeycomb has integrations to bridge standard formats with the Honeycomb event model
How Honeycomb fits in
Understand how your production systems are behaving, right now
[Product diagram: Query Builder, Interactive Visuals, Raw Data, Traces, BubbleUp + Outliers; Beelines (automatic instrumentation + tracing APIs); a data store for high-cardinality, high-dimensionality data with efficient storage]
Tracing is…
▸ For software engineers who need to understand their code
▸ Better when visualized (preferably first in aggregate)
▸ Best when layered on top of existing data streams (rather than adding another data silo to your toolkit)
Instrumentation (and tracing)
Our path today
▸ Establish a baseline: send simple events (see the sketch below)
▸ Customize: enrich with custom fields and extend into traces
▸ Explore: learn to query a collection of traces, to find the most interesting one
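For the Go track, the baseline step looks roughly like this: initialize the Beeline and wrap the HTTP handler so every request emits one event. This is a minimal sketch, assuming the Go Beeline (github.com/honeycombio/beeline-go); the write key, dataset name, and handler body are placeholders, not the workshop's actual wall service code.

package main

import (
	"fmt"
	"log"
	"net/http"

	beeline "github.com/honeycombio/beeline-go"
	"github.com/honeycombio/beeline-go/wrappers/hnynethttp"
)

func main() {
	// Initialize the Beeline; write key and dataset name are placeholders.
	beeline.Init(beeline.Config{
		WriteKey: "YOUR_API_KEY",
		Dataset:  "tracing-workshop",
	})
	defer beeline.Close()

	// Wrapping the handler sends one event per HTTP request: the baseline.
	http.HandleFunc("/", hnynethttp.WrapHandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "hello from the wall")
	}))
	log.Fatal(http.ListenAndServe(":8080", nil))
}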
[Diagram: a third-party dependency; a black-box service]
EXERCISE: Run the wall service
Go: go run ./wall.go
Node: node ./wall.js
‣ Open up http://localhost:8080 in your browser and post some messages to your wall
‣ Try writing messages like these:
‣ "hello #test #hashtag"
‣ "seems @twitteradmin isn’t a valid username but @honeycombio is"
→ let’s see what we’ve got
Go
→ let’s see what we’ve got
Custom Instrumentation
▸ Identify metadata that will help you isolate unexpected behavior in custom logic (see the sketch below):
▸ Bits about your infrastructure (e.g. which host)
▸ Bits about your deploy (e.g. which version/build, which feature flags)
▸ Bits about your business (e.g. which customer, which shopping cart)
▸ Bits about your execution (e.g. payload characteristics, sub-timers)
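In the Go Beeline, custom fields like these are attached to the current span with beeline.AddField. A minimal sketch, assuming a handler that already has an instrumented request context; the function name and all field names except app.username (which later queries in this deck use) are illustrative, not part of the workshop code.

package wall

import (
	"context"
	"os"

	beeline "github.com/honeycombio/beeline-go"
)

// handleWallPost is a hypothetical helper; ctx comes from the
// already-instrumented HTTP request.
func handleWallPost(ctx context.Context, user string, payload []byte) {
	hostname, _ := os.Hostname()
	beeline.AddField(ctx, "app.hostname", hostname)              // infrastructure: which host
	beeline.AddField(ctx, "app.build_id", os.Getenv("BUILD_ID")) // deploy: which version/build
	beeline.AddField(ctx, "app.username", user)                  // business: which customer
	beeline.AddField(ctx, "app.payload_bytes", len(payload))     // execution: payload characteristics
}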
EXERCISE: Find Checkpoint 1
Go
Node
→ let’s see what we’ve got
TRACE 1
EVENT ID: A
EVENT ID: B, PARENT ID: A
EVENT ID: C, PARENT ID: B
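In Beeline terms, each event above is a span, and the parent ID is filled in automatically when spans are nested through the context. A minimal sketch, assuming the Go Beeline; check_twitter matches the span name used in later queries, while the other names are illustrative.

package wall

import (
	"context"

	beeline "github.com/honeycombio/beeline-go"
)

func handleRequest(ctx context.Context) {
	// Event A: the root span of this trace (no parent ID).
	ctx, spanA := beeline.StartSpan(ctx, "handle_request")
	defer spanA.Send()

	// Event B: parent ID = A, because it was started from A's context.
	ctx, spanB := beeline.StartSpan(ctx, "parse_message")
	defer spanB.Send()

	// Event C: parent ID = B.
	_, spanC := beeline.StartSpan(ctx, "check_twitter")
	defer spanC.Send()
}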
EXERCISE: Find Checkpoint 2
‣ Try writing messages like these:
‣ "seems @twitteradmin isn’t a valid username but @honeycombio is"
‣ "have you tried @honeycombio for @mysql #observability?"
→ let’s see what we’ve got
Our first, simple trace
→ let’s see what we’ve got
Checkpoint 2 Takeaways
▸ Events can be used to trace across functions within a service just as easily as across services ("distributed" tracing)
▸ Store useful metadata on any event in a trace — and query against it!
▸ To aggregate per trace, filter to trace.parent_id does-not-exist (or break down by unique trace.trace_id values)
EXERCISE: ID sources of latency
▸ Who’s experienced the longest delay when talking to Twitter?
▸ Hint: app.username, MAX(duration_ms), and name = check_twitter
▸ Who’s responsible for the most cumulative time talking to Twitter?
▸ Hint: Use SUM(duration_ms) instead
[Diagram: a third-party dependency]
EXERCISE: Run the analysis service
Go: go run ./analysis.go
Node: node ./analysis.js
‣ Open up http://localhost:8080 in your browser and post some messages to your wall
‣ Try these:
‣ "everything is awesome!"
‣ "the sky is dark and gloomy and #winteriscoming"
→ let’s see what we’ve got
EXERCISE: Find Checkpoint 3
Go
Node
→ let’s see what we’ve got
Break
Mosey back to seats, please :)
[Diagram: a third-party dependency; a black-box service]
EXERCISE: Find Checkpoint 4
Go
Node
→ let’s see what we’ve got
Checkpoint 4 Takeaways
▸ Working with a black box? Instrument from the perspective of the code you can control (see the sketch below)
▸ Similar to identifying test cases in TDD: capture fields that let you refine your understanding of the system
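From the Go side, "instrumenting the code you can control" might look like the sketch below: wrap the outbound call in a span and record what your code can observe about the black box. This is a sketch only; the URL and function name are made up, but request.host, request.content_length, and response.status_code match the fields the later exercises query.

package wall

import (
	"bytes"
	"context"
	"net/http"

	beeline "github.com/honeycombio/beeline-go"
)

// callAnalysis is a hypothetical wrapper around the black-box service.
func callAnalysis(ctx context.Context, body []byte) error {
	ctx, span := beeline.StartSpan(ctx, "call_analysis")
	defer span.Send()

	// The URL here is a placeholder for the black-box endpoint.
	req, err := http.NewRequestWithContext(ctx, http.MethodPost,
		"https://example.amazonaws.com/analyze", bytes.NewReader(body))
	if err != nil {
		return err
	}

	// Record what we can see from our side of the call.
	beeline.AddField(ctx, "request.host", req.URL.Host)
	beeline.AddField(ctx, "request.content_length", len(body))

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		beeline.AddField(ctx, "error", err.Error())
		return err
	}
	defer resp.Body.Close()

	beeline.AddField(ctx, "response.status_code", resp.StatusCode)
	return nil
}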
EXERCISE: Who’s knocking over my black box?
▸ First: what does "knocking over" mean? We know that we talk to our black box via an HTTP call. What are our signals of health?
▸ What’s the "usual worst" latency for this call out to AWS? (Explore different calculations: P95 = 95th percentile, MAX, HEATMAP)
▸ Hint: P95(duration_ms) and request.host contains aws
Puzzle Time
Scenario #1
Symptoms: we pulled in that last POST in order to persist messages somewhere, but we’re hearing from customer support that behavior has felt buggy lately, like it works sometimes but not always. What’s going on?
Think about:
▸ those failing requests
▸ Useful fields: response.status_code, request.content_length
▸ HEATMAPs are great :)
Scenario #2
Symptoms: everything feels slowed down, but more importantly the persistence behavior seems completely broken. What gives?
Think about:
▸ How do we stop the bleeding? What might we need to find out to answer that question?
▸ Useful fields: response.status_code, app.username
Scenario #3
Symptoms: persistence seems fine, but all requests seem to have slowed down to a snail’s pace. What could be impacting our overall latency so badly?
Prompts:
▸ Add more instrumentation if you’d like to capture more about the characteristics of your payload
▸ Useful fields: response.status_code, request.host contains aws (amazonaws.com)
Thank you & Office Hours