What Next-generation APM Means for Your Business
March 2019
Observability is the only way to proactively manage production systems. Complex systems are the top challenge facing DevOps teams. Your customers depend upon you to deliver high reliability without slowing development productivity. You must invest in shortening outage durations and eliminating wasted developer time.
Practitioners of DevOps and business leaders alike are beginning to understand that in order to scale and operate a service that drives growth and competitive edge, you must invest in the right tools and approach. Production system performance and uptime is just one aspect that directly impacts the customer experience. When you continuously deliver and integrate new features, systems become more complex and, unless tightly managed, business risk goes up. Observability is a critical requirement that enables teams to level up and manage ever-increasing complexity.
Distributed systems architectures are inherently complex, and the addition of continuous integration and continuous delivery (CI/CD) raises the stakes.
Visibility and control are central to success, yet as delivery systems become automated, everything becomes more opaque and therefore harder to proactively manage. Add to this the abstraction layers of containers or a serverless infrastructure, and the team feels farther removed from being in control. As a result, the number of potential causes for any given issue increases, while pinpointing any single cause becomes much harder.
Debugging in production is a requirement for modern teams, especially teams who ship frequently. DevOps teams need the best tools to debug issues when they come up, not just hope they can catch everything in staging. Our customers tell us that before Honeycomb, they frequently experienced incidents where problem sources were never identified. Teams can no longer rely on simple metrics alone to provide the level of insight they need to diagnose and resolve issues, especially at scale.
Observable production systems enable you to move beyond locating gnarly bugs
or fixing a problematic incident or outage. Designing your systems to include observability from the point at which a feature is released allows teams to
immediately learn how it behaves in production and adjust before a critical outage occurs.
Performance Analysis
When a new feature is shipped, can you clearly see the impact it has on your systems? As load climbs and you have to choose between adding capacity and optimizing code, do you know where to focus in order to make the most impact and keep your most important customers happy?
Intercom used Honeycomb to evaluate performance across all the
dimensions required to understand how different users and types of usage affected the performance of a given endpoint. They were able both to identify the portions of the code needing refactoring and to document concrete examples of how they'd improve performance.
How Intercom sped up their busiest endpoint (by as much as 50%)
Incident Response
When a user misuses your service, maliciously or otherwise, are you able to locate the vulnerability in your codebase and then address the problem before others notice? Do your tools have the power to isolate the source of an attack, or determine how many users it may be impacting?
When hackers tried to DoS their service, carwow needed the ability to query at a level of granularity that their traditional APM tools couldn't manage, so they turned to Honeycomb.
Preventing Bad Actors from Spoiling the Show at carwow
Visibility into 3rd-party Services
If your product relies on external API calls and responses, can you identify the source of a service slowdown? Do you have the ability to sift through the information coming from your database, your cache, your load balancers, and your own code quickly and reliably, so you know whether you should be looking to 3rd-party providers to resolve the issue?
Behaviour Interactive (BHVR) had been using a classic APM approach for some time to troubleshoot latency issues in their flagship multiplayer video game, but were unable to identify the source of a service
slowdown—was it in the caching, the database, or somewhere in one of the numerous external calls? With Honeycomb, they found the issues in just minutes.
Gamers Won't Wait: Dead By Daylight Gets Some Sweet Attention
Addressing Technical Debt
As your organization scales and your product's footprint grows, are you able to maintain clear sight-lines across your infrastructure as complexity increases? Can you evaluate systems performance using distributed tracing views and better understand the interactions among an increasing number of services?
While growing as fast as possible to meet their business demands,
carwow used Honeycomb to follow a request through its entire life-cycle and understand its impact on different subsystems in the code, leveraging Honeycomb's cross-team collaboration features to solve issues:
Honeycomb Tracing Drives Efficiencies as carwow Scales
User Happiness and Product Management
Do you understand how the end user experiences your product? Do you notice when they use features in unexpected ways and can you capture that data for your product team to investigate?
Using Honeycomb, Intercom discovered that one of their users was trying so hard to use their product in ways they hadn't anticipated that it was impacting the overall experience of many others—a discovery that informed future product planning for that feature:
Intercom <3 Honeycomb
Key capabilities to move the needle on your observability practice
If you experience any of the following, then you must adopt an observability approach. This will involve cultural and process-centric changes, but for this document we will focus on the technology tooling that DevOps teams require in order to fully understand production, debug faster, and spend less time fighting technical debt.
● Increased frequency of code ships or feature releases
● Increase in volume of users / customers
● More questions for engineers from on-call teams
● Customer complaint issues on the rise
● Pressure to get new features into production faster
An observability practice requires the following technology tooling capabilities. Without these, it is extremely difficult to answer the questions that matter about your production system.
Here are some key capabilities and why they matter:
● Automatic instrumentation of events and traces across popular languages
Most developers don't enjoy instrumenting their code, yet everyone on the team needs that telemetry for the context and meaning required to ultimately achieve observability.
Does your solution provide drop-in, immediate, automatic instrumentation to jump-start your observability practice?
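As a rough illustration of what drop-in instrumentation can look like, here is a minimal sketch modeled on Honeycomb's Python Beeline for a Flask service. The write key, dataset name, and route are placeholders, and exact module paths or parameters may differ between Beeline versions.

```python
# Minimal sketch of drop-in instrumentation, modeled on Honeycomb's Python
# Beeline for Flask. Credentials and names below are placeholders.
import beeline
from beeline.middleware.flask import HoneyMiddleware
from flask import Flask

app = Flask(__name__)

# One init call plus one middleware wrapper: every request then emits a
# structured event (and trace span) with no per-endpoint instrumentation.
beeline.init(
    writekey="YOUR_WRITE_KEY",   # placeholder credential
    dataset="my-service",        # hypothetical dataset name
    service_name="my-service",
)
HoneyMiddleware(app, db_events=True)  # also capture database query events

@app.route("/health")
def health():
    # Custom fields can still be attached to the auto-generated event.
    beeline.add_context_field("deploy.version", "2019.03.1")
    return "ok"
```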
● Query performance suitable for rapid iteration and debugging in production
When your team is in a firefight, the last thing you want to wait for is query results, whether due to ETL delay, slow search performance, or worse, no access to the data set.
Does your solution provide the ability to slice and dice your data across a number of dimensions in a frictionless way, as well as the ability to
backtrack and try new theories, and still get to the problem source fast?
● Support for flexible queries over many dimensions
More and more, the questions that need to be asked to move the business
forward require the ability to drill down across many aspects of your system data, but most tools aggregate data, removing the detail required.
Does your solution support fast query response across many fields
containing an arbitrary number of unique values so you can ask the
important questions?
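To make the contrast with pre-aggregated metrics concrete, here is a small, self-contained Python sketch (with made-up event data and field names) that groups raw request events by an arbitrary pair of fields and computes a per-group p99, detail that cannot be recovered once the data has been rolled up into a single global average.

```python
# Toy illustration: slicing raw, wide events by arbitrary high-cardinality
# fields. The events and field names below are invented for the example.
from collections import defaultdict

events = [
    {"endpoint": "/search", "customer_id": "acme", "duration_ms": 38},
    {"endpoint": "/search", "customer_id": "acme", "duration_ms": 1450},
    {"endpoint": "/search", "customer_id": "globex", "duration_ms": 41},
    {"endpoint": "/checkout", "customer_id": "acme", "duration_ms": 95},
]

def p99(values):
    """Nearest-rank 99th percentile of a list of numbers."""
    ordered = sorted(values)
    rank = max(0, int(round(0.99 * len(ordered))) - 1)
    return ordered[rank]

# Group by any combination of fields after the fact: no need to have
# predeclared (endpoint, customer_id) as a metric tag pair up front.
groups = defaultdict(list)
for event in events:
    key = (event["endpoint"], event["customer_id"])
    groups[key].append(event["duration_ms"])

for key, durations in sorted(groups.items()):
    print(key, "p99:", p99(durations))
```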
● Intuitive query interface
It seems like every new tool requires learning yet another query language, along with the slowdown of the associated learning curve.
Does your solution offer a graphical query interface designed to speed
users to insights and intelligent, data-driven suggestions for investigation paths when troubleshooting any issue, large or small?
● Next-generation anomaly detection
When something in the data looks unusual, it can take a lot of false starts to get at what might be the cause and to determine how big an impact it may have. When production is impacted, speed matters.
Does your solution allow investigators to simply select outliers and
accelerate their validation process by seeing the most likely suspects right away? Does your solution have a high signal-to-noise ratio that aids debuggers rather than slowing them down with red herrings?
● Distributed tracing visualizations in context
Context switching is a known productivity killer.
Does your solution force you to switch tools and re-orient yourself if you
want to try a different visualization? Will that other tool have the same
data-set? Does it give you everything you need to keep investigating and
easily switch to a tracing waterfall diagram with the same data?
● Smart sampling to retain key data points
Relying on metrics means having little or no control over what gets averaged out
of your data.
Can you precisely manage the level and application of sampling so you can control costs without missing the events that are important to track for your business?
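One common approach, sketched below as the general technique rather than any vendor's specific implementation, is to sample routine successful traffic heavily while keeping every error, and to stamp each kept event with its sample rate so totals can still be reconstructed at query time. The field and function names are illustrative.

```python
# Sketch of weighted sampling: keep every error, keep 1 in N successes,
# and record the sample rate on each kept event so counts can be
# re-weighted later. Field names are illustrative.
import random

SUCCESS_SAMPLE_RATE = 20  # keep roughly 1 in 20 successful requests

def emit(event):
    # Stand-in for the real transport that ships the event to your backend.
    print("sending:", event)

def maybe_sample(event):
    """Return (keep, sample_rate) for a request event."""
    if event.get("status_code", 500) >= 500 or event.get("error"):
        return True, 1                      # always keep failures
    keep = random.randint(1, SUCCESS_SAMPLE_RATE) == 1
    return keep, SUCCESS_SAMPLE_RATE

def send(event):
    keep, rate = maybe_sample(event)
    if keep:
        event["sample_rate"] = rate         # lets the backend re-weight counts
        emit(event)

send({"status_code": 200, "duration_ms": 35})
send({"status_code": 503, "duration_ms": 1200, "error": "upstream timeout"})
```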
● Fine-tunable data retention policies
Multiple data sources mean multiple legal or business requirements for
retention.
Does your solution allow you to rebalance storage allocation and determine how long you want to keep data in order to meet business needs as you
scale over time?
● Support for open standards such as OpenCensus
Ease of getting data into your tools is critical, and some organizations want to adhere to open source standards where available.
Does your solution support and provide capabilities for getting data in via OpenCensus libraries that collect key metrics and distributed traces from your system services?
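As a rough sketch of what this looks like on the producing side, the snippet below creates nested spans with the OpenCensus Python library. The span names and attribute are invented, and the exporter wiring (to Honeycomb or any other backend) is omitted because it depends on your setup; module paths may also differ between library versions.

```python
# Minimal OpenCensus tracing sketch (Python). Span names are invented;
# exporter configuration is left out and depends on your backend.
from opencensus.trace.samplers import AlwaysOnSampler
from opencensus.trace.tracer import Tracer

tracer = Tracer(sampler=AlwaysOnSampler())

with tracer.span(name="handle_request") as parent:
    parent.add_attribute("customer_id", "acme")   # extra context on the span
    with tracer.span(name="query_database"):
        pass  # database call would go here
    with tracer.span(name="render_response"):
        pass  # response rendering would go here
```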
● The ability to fully leverage your best talent, past and present
Collaboration is more than just an idea or process approach.
Does your solution encourage each team member to discover past queries from their teammates and learn by following in their footsteps at every step of query construction?
Does this seem like a tall order? In many ways, observability is equal parts process, culture, and tooling. Building your own stack or leveraging a myriad of different tools has drawbacks and costs to the business overall.
If you're thinking of building your own observability stack
With a homegrown solution, you might start small and expand from there.
Someone on your team provisions a couple of AWS EC2 instances for Logstash, the cluster nodes, and Kibana, and you might get some useful data flowing into ELK for your team to troubleshoot problems.
But if you're doing it *right*, more people will want to use the system, and so you will need to either limit what data you ingest or scale it up, and that will dramatically increase your costs year over year.
Consider the breakdown one IT manager put together after investigating how much time would be spent on the tasks required to deploy and run an ELK stack for a year. These relatively conservative estimates come out to 344 hours a year, the equivalent of more than eight work-weeks of a full-time engineer's time.
Doing the same calculation for deploying Honeycomb, a purpose-built technology for system observability, reduces the time to just 40 hours—a single work-week per year and a significantly smaller investment of time and money.
And of course, this doesn't take into account the cost of the hardware you would need to run a home-grown solution—whether on-premise or in the cloud.
What about traces?
Beyond being costly to run, the Elastic stack doesn't let you see the full distributed context of the requests that generated log messages. You could of course ask your team to use a second tool to access tracing event data, but in doing so they would have to re-orient their analysis and investigation, drilling into a different UI with a different approach and mindset—which not only derails and slows down an investigation but also makes it less accurate and reliable.
This is undesirable when time-to-understanding (and ultimately, resolution) is critical. Honeycomb provides distributed tracing built in, and with Honeycomb Beelines, your code is instrumented automatically, including traces.
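For the parts of the code worth calling out explicitly, such as an external API call, a span can be added inside the trace the Beeline already created for the request. The snippet below is a hedged sketch based on the Python Beeline's tracing helpers; the function, URL, and field names are illustrative.

```python
# Sketch: wrapping an external call in its own span inside the trace that
# the Beeline already started for the request. Names are illustrative.
import beeline
import requests

def fetch_inventory(sku):
    # Creates a child span so the trace waterfall shows how long the
    # 3rd-party call took relative to the rest of the request.
    with beeline.tracer(name="inventory_api.lookup"):
        beeline.add_context_field("inventory.sku", sku)
        response = requests.get(f"https://inventory.example.com/skus/{sku}")
        beeline.add_context_field("inventory.status_code", response.status_code)
        return response.json()
```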
“The Beelines include insightful and valuable traces by default, they're built by people who know what is useful. The magic that is there makes sense.”
- Alex Newman, Co-founder, hCaptcha
Why not offload all that operational overhead to Honeycomb? We are your first
If you're thinking of using a vendor to achieve observability
Metrics tools do well at counting discrete gauges and counters such as the number of jobs queued, host-level resource usage, the number of requests served by the system, and so on. Searching is often fast because these systems aggregate, or average, the data by a fixed set of buckets (e.g. host, region, or version). But for resolving and preventing problematic production issues, your team needs the rich details and context that Honeycomb provides—and although metrics vendors claim to offer context, they charge a hefty fee to deliver it.
With standard metrics tools, engineers and ops teams typically work with
counters that track numbers as they change over time. Some common things to measure are the number of HTTP requests a given app is serving, average latency, and the rate of errors. Each one of these counters or gauges will generate
one unique time series on disk.
That’s good for a basic start at detecting when problems are potentially
occurring, but you need much more context to find out what is going wrong, and
that context is not present in the data if there is only one global counter.
As a result, engineers try to add the context they need using tags or labels, which generate new time series to track the counter's value summed across each unique set of tags or labels, and those time series add up fast.
And the multiplicative effect gets greater and greater as you add more tags and labels. If you want to know which user was associated with a given request, rather than just what host or container it came from, the cost of storing all that additional data goes up exponentially. Likewise, to track percentiles in addition to just averages, you need to create multiples of the metric(s) in question. The number of time series quickly becomes very expensive to store for tools that aren't designed to handle it, much less read it back quickly enough to help you figure out where issues lie.
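A quick back-of-the-envelope calculation, using made-up but plausible tag counts, shows how fast the multiplication grows:

```python
# Back-of-the-envelope time series count for one metric (e.g. request
# latency) under a hypothetical tagging scheme. All numbers are invented
# to illustrate the multiplication, not measurements of any real system.
hosts = 200
endpoints = 50
status_codes = 5
percentiles_tracked = 4          # e.g. p50, p90, p99, max alongside the mean

series_without_user_tag = hosts * endpoints * status_codes * percentiles_tracked
print(series_without_user_tag)   # 200,000 time series for a single metric

users = 10_000                   # add a user_id tag and it multiplies again
series_with_user_tag = series_without_user_tag * users
print(series_with_user_tag)      # 2,000,000,000 time series
```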
Metrics vendors typically limit the number of tags you can define, or simply charge you the exponentially increasing cost for each combination of tag values you send. Features that are built into Honeycomb's solution, such as unlimited queryable fields/columns, tracing, and anomaly detection, are typically an extra cost with most metrics providers. These vendors have to charge more to provide