Welcome to Always Bee Tracing!
If you haven’t already, please clone the repository of your choice:
▸ Golang (into your $GOPATH):
git clone git@github.com:honeycombio/tracing-workshop-go.git
▸ Node:
git clone git@github.com:honeycombio/tracing-workshop-node.git
Please also accept your invites to the "Always Bee Tracing" Honeycomb team and our Slack channel
Always Bee Tracing
A Honeycomb Tracing workshop
A bit of history
▸ We used to have "one thing" (monolithic application)
▸ Then we started to have "more things" (splitting monoliths into services)
▸ Now we have "yet more things", or even "Death Star" architectures (microservices, containers, serverless)
A bit of history
▸ Now we have N² problems (one slow service bogs down everything, etc.)
▸ 2010 - Google releases the Dapper paper describing how they improve on existing tracing systems
▸ Key innovations: use of sampling, common client libraries decoupling app code from tracing logic
Why should GOOG have all the fun?
▸ 2012 - Zipkin was developed at Twitter for use with Thrift RPC
▸ 2015 - Uber releases Jaeger (also OpenTracing)
▸ Better sampling story, better client libraries, no Scribe/Kafka
▸ Various proprietary systems abound
▸ 2019 - Honeycomb is the best available due to best-in-class queries ;)
A word on standards
▸ Standards for tracing exist: OpenTracing, OpenCensus, etc.
▸ Pros: collaboration, preventing vendor lock-in
▸ Cons: slower innovation, political battles/drama
▸ Honeycomb has integrations to bridge standard formats with the Honeycomb event model
How Honeycomb fits in
Understand how your production systems are behaving, right now
[Product diagram: Query Builder, Interactive Visuals, Raw Data, Traces, BubbleUp + Outliers; Beelines (automatic instrumentation + tracing APIs); a data store for high-cardinality, high-dimensionality data with efficient storage]
Tracing is…
▸ For software engineers who need to understand their code
▸ Better when visualized (preferably first in aggregate)
▸ Best when layered on top of existing data streams (rather than adding another data silo to your toolkit)
Instrumentation (and tracing)
Our path today
▸ Establish a baseline: send simple events (see the sketch below)
▸ Customize: enrich with custom fields and extend into traces
▸ Explore: learn to query a collection of traces, to find the most interesting one
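For the Go track, the baseline step looks roughly like this: initialize the Beeline and wrap the HTTP handler so every request emits one event. This is a minimal sketch, assuming the Go Beeline (github.com/honeycombio/beeline-go); the write key, dataset name, and handler body are placeholders, not the workshop's actual wall service code.

package main

import (
	"fmt"
	"log"
	"net/http"

	beeline "github.com/honeycombio/beeline-go"
	"github.com/honeycombio/beeline-go/wrappers/hnynethttp"
)

func main() {
	// Initialize the Beeline; write key and dataset name are placeholders.
	beeline.Init(beeline.Config{
		WriteKey: "YOUR_API_KEY",
		Dataset:  "tracing-workshop",
	})
	defer beeline.Close()

	// Wrapping the handler sends one event per HTTP request: the baseline.
	http.HandleFunc("/", hnynethttp.WrapHandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "hello from the wall")
	}))
	log.Fatal(http.ListenAndServe(":8080", nil))
}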
[Diagram: a third-party dependency; a black-box service]
EXERCISE: Run the wall service
Go: go run ./wall.go
Node: node ./wall.js
‣ Open up http://localhost:8080 in your browser and post some messages to your wall
‣ Try writing messages like these:
‣ "hello #test #hashtag"
‣ "seems @twitteradmin isn’t a valid username but @honeycombio is"
→ let’s see what we’ve got
Go
→ let’s see what we’ve got
Custom Instrumentation
▸ Identify metadata that will help you isolate unexpected behavior in custom logic (see the sketch below):
▸ Bits about your infrastructure (e.g. which host)
▸ Bits about your deploy (e.g. which version/build, which feature flags)
▸ Bits about your business (e.g. which customer, which shopping cart)
▸ Bits about your execution (e.g. payload characteristics, sub-timers)
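In the Go Beeline, custom fields like these are attached to the current span with beeline.AddField. A minimal sketch, assuming a handler that already has an instrumented request context; the function name and all field names except app.username (which later queries in this deck use) are illustrative, not part of the workshop code.

package wall

import (
	"context"
	"os"

	beeline "github.com/honeycombio/beeline-go"
)

// handleWallPost is a hypothetical helper; ctx comes from the
// already-instrumented HTTP request.
func handleWallPost(ctx context.Context, user string, payload []byte) {
	hostname, _ := os.Hostname()
	beeline.AddField(ctx, "app.hostname", hostname)              // infrastructure: which host
	beeline.AddField(ctx, "app.build_id", os.Getenv("BUILD_ID")) // deploy: which version/build
	beeline.AddField(ctx, "app.username", user)                  // business: which customer
	beeline.AddField(ctx, "app.payload_bytes", len(payload))     // execution: payload characteristics
}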
EXERCISE: Find Checkpoint 1
Go
Node
→ let’s see what we’ve got
TRACE 1
EVENT ID: A
EVENT ID: B, PARENT ID: A
EVENT ID: C, PARENT ID: B
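In Beeline terms, each event above is a span, and the parent ID is filled in automatically when spans are nested through the context. A minimal sketch, assuming the Go Beeline; check_twitter matches the span name used in later queries, while the other names are illustrative.

package wall

import (
	"context"

	beeline "github.com/honeycombio/beeline-go"
)

func handleRequest(ctx context.Context) {
	// Event A: the root span of this trace (no parent ID).
	ctx, spanA := beeline.StartSpan(ctx, "handle_request")
	defer spanA.Send()

	// Event B: parent ID = A, because it was started from A's context.
	ctx, spanB := beeline.StartSpan(ctx, "parse_message")
	defer spanB.Send()

	// Event C: parent ID = B.
	_, spanC := beeline.StartSpan(ctx, "check_twitter")
	defer spanC.Send()
}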
EXERCISE: Find Checkpoint 2
‣ Try writing messages like these:
‣ "seems @twitteradmin isn’t a valid username but @honeycombio is"
‣ "have you tried @honeycombio for @mysql #observability?"
→ let’s see what we’ve got
Our first, simple trace
→ let’s see what we’ve got
Checkpoint 2 Takeaways
▸ Events can be used to trace across functions within a service just as easily as across services ("distributed" tracing)
▸ Store useful metadata on any event in a trace — and query against it!
▸ To aggregate per trace, filter to trace.parent_id does-not-exist (or break down by unique trace.trace_id values)
EXERCISE: ID sources of latency
▸ Who’s experienced the longest delay when talking to Twitter?
▸ Hint: app.username, MAX(duration_ms), and name = check_twitter
▸ Who’s responsible for the most cumulative time talking to Twitter?
▸ Hint: Use SUM(duration_ms) instead
[Diagram: a third-party dependency]
EXERCISE: Run the analysis service
Go: go run ./analysis.go
Node: node ./analysis.js
‣ Open up http://localhost:8080 in your browser and post some messages to your wall
‣ Try these:
‣ "everything is awesome!"
‣ "the sky is dark and gloomy and #winteriscoming"
→ let’s see what we’ve got
EXERCISE: Find Checkpoint 3
Go
Node
→ let’s see what we’ve got
Break
Mosey back to seats, please :)
[Diagram: a third-party dependency; a black-box service]
EXERCISE: Find Checkpoint 4
Go
Node
→ let’s see what we’ve got
Checkpoint 4 Takeaways
▸ Working with a black box? Instrument from the perspective of the code you can control (see the sketch below)
▸ Similar to identifying test cases in TDD: capture fields that let you refine your understanding of the system
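From the Go side, "instrumenting the code you can control" might look like the sketch below: wrap the outbound call in a span and record what your code can observe about the black box. This is a sketch only; the URL and function name are made up, but request.host, request.content_length, and response.status_code match the fields the later exercises query.

package wall

import (
	"bytes"
	"context"
	"net/http"

	beeline "github.com/honeycombio/beeline-go"
)

// callAnalysis is a hypothetical wrapper around the black-box service.
func callAnalysis(ctx context.Context, body []byte) error {
	ctx, span := beeline.StartSpan(ctx, "call_analysis")
	defer span.Send()

	// The URL here is a placeholder for the black-box endpoint.
	req, err := http.NewRequestWithContext(ctx, http.MethodPost,
		"https://example.amazonaws.com/analyze", bytes.NewReader(body))
	if err != nil {
		return err
	}

	// Record what we can see from our side of the call.
	beeline.AddField(ctx, "request.host", req.URL.Host)
	beeline.AddField(ctx, "request.content_length", len(body))

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		beeline.AddField(ctx, "error", err.Error())
		return err
	}
	defer resp.Body.Close()

	beeline.AddField(ctx, "response.status_code", resp.StatusCode)
	return nil
}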
EXERCISE: Who’s knocking over my black box?
▸ First: what does "knocking over" mean? We know that we talk to our black box via an HTTP call. What are our signals of health?
▸ What’s the "usual worst" latency for this call out to AWS? (Explore different calculations: P95 = 95th percentile, MAX, HEATMAP)
▸ Hint: P95(duration_ms) and request.host contains aws
Puzzle Time
Scenario #1
Symptoms: we pulled in that last POST in order to persist messages somewhere, but we’re hearing from customer support that behavior has felt buggy lately, like it works sometimes but not always. What’s going on?
Think about:
▸ those failing requests
▸ Useful fields: response.status_code, request.content_length
▸ HEATMAPs are great :)
Scenario #2
Symptoms: everything feels slowed down, but more importantly the persistence behavior seems completely broken. What gives?
Think about:
▸ How do we stop the bleeding? What might we need to find out to answer that question?
▸ Useful fields: response.status_code, app.username
Scenario #3
Symptoms: persistence seems fine, but all requests seem to have slowed down to a snail’s pace. What could be impacting our overall latency so badly?
Prompts:
▸ Add more instrumentation if you’d like to capture more about the characteristics of your payload
▸ Useful fields: response.status_code, request.host contains aws (amazonaws.com)
Thank you & Office Hours