Engineering Reliable Mobile Applications
Strategies for Developing Resilient Client-Side Applications
Kristine Chen, Venkat Patnala,
Devin Carraway & Pranjal Deo
with Jessie Yang
Engineering Reliable Mobile Applications
by Kristine Chen, Venkat Patnala, Devin Carraway, and Pranjal Deo, with Jessie Yang
Copyright © 2019 O’Reilly Media. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.
O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.
Acquisition Editor: Nikki McDonald
Development Editor: Virginia Wilson
Production Editor: Deborah Baker
Copyeditor: Bob Russell, Octal Publishing, LLC
Proofreader: Matthew Burgoyne
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Rebecca Demarest
June 2019: First Edition
Revision History for the First Edition
2019-06-17: First Release
The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Engineering Reliable Mobile Applications, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.
While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.
This work is part of a collaboration between O’Reilly and Google. See our statement of editorial independence.
Table of Contents
Engineering Reliable Mobile Applications
How to SRE a Mobile Application
Case Studies
SRE: Hope Is Not a Mobile Strategy
Engineering Reliable Mobile Applications
Modern mobile apps are complex systems. They mix multitiered server architectures run in datacenters, messaging stacks, and networks with sophisticated on-device functionality, both foreground and background. However elaborate, users perceive the reliability of the service through the devices in their hands. Did the application do what was expected, quickly and flawlessly? At Google, the shift to a mobile focus brought SRE to emphasize the true end-to-end user experience and the specific reliability problems presented on mobile. We’ve seen a number of production incidents in which server-side instrumentation taken by itself would have shown no trouble, but where a view inclusive of the user experience reflected end-user problems. For example:
• Your serving stack is successfully returning what it thinks are perfectly valid responses, but users of your app see blank screens.
• Users opening your maps app in a new city for the first time would see a crash, before the servers received any requests at all.
• After your application receives an update, although nothing has visibly changed, users experience significantly worse battery life from their devices than before.
These are all issues that cannot be detected by just monitoring our servers and datacenters. For many products, the user experience (UX) does not start or reach the server at all; it starts at the mobile application that the user employs to address their particular use case, such as finding a good restaurant in the vicinity. A server having five 9’s of availability is meaningless if your mobile application can’t access it. In our experience, it became increasingly important to not just focus our efforts on server reliability, but to also expand reliability principles to our first-party mobile applications.
This report is for people interested in learning how to build and manage reliable native mobile applications. In the sections that follow, we share our experiences and learnings from supporting and developing first-party native mobile applications at Google, including:
• Core concepts that are critical to engineering reliable native mobile applications. Although the content in this report primarily addresses native mobile applications, many concepts are not unique to these applications and are often shared with all types of applications.
How to SRE a Mobile Application
We can compare a mobile application to a distributed system that has billions of machines—a size three to four orders of magnitude larger than a typical large company’s footprint. This scale is just one of the many unique challenges of the mobile world. Things we take for granted in the server world today become very complicated to accomplish in the mobile world, if not impossible for native mobile applications. Here are just some of the challenges:
Scale
There are billions of devices and thousands of device models, with hundreds of apps running on them, each app with multiple versions. It becomes more difficult to accurately attribute degrading UX to unreliable network connections, service unreliability, or external factors.
Control
On servers, we can change binaries and update configurations on demand. In the mobile world, this power lies with the user. In the case of native apps, after an update is available to users, we cannot force a user to download a new binary or configuration. Users might consider upgrades to be an indication of poor-quality software and assume that all the upgrades are simply bug fixes. Upgrades also have tangible cost—for example, metered network usage—to the end user. On-device storage might be constrained, and data connection might be sparse or nonexistent.
Monitoring
We need to tolerate potential inconsistency in the mobile world because we’re relying on a piece of hardware that’s beyond our control. There’s very little we can do when an app is in a state in which it can’t send information back to you.
In this diverse ecosystem, the task of monitoring every single metric has many possible dimensions, with many possible values; it’s infeasible to monitor every combination independently.
We also must consider the effect of logging and monitoring on the end user, given that they pay the price of resource usage—battery and network, for example.
Change management
If there’s a bad change, one immediate response is to roll it back. We can quickly roll back servers, and we know that users will no longer be on the bad version after the rollback is complete. On the other hand, it is impossible to roll back a binary for a native mobile application on Android and iOS. Instead, the current standard is to roll forward and hope that the affected users will upgrade to the newest version. Considering the scale and lack of control in the mobile environment, managing changes in a safe and reliable manner is arguably one of the most critical pieces of managing a reliable mobile application.
In the following sections, we take a look at what it means to be an SRE for a native mobile application and learn how to apply the core tenets of SRE outside of our datacenters to the devices in our users’ pockets.
Is My App Available?
Availability is one of the most important measures of reliability. In fact, we set Service-Level Objectives (SLOs) with a goal of being available for a certain number of 9’s (e.g., 99.9% available). SLOs are an important tool for SREs to make data-driven decisions about reliability, but first we need to define what it means for a mobile application to be “available.” To better understand availability, let’s take a look at what unavailability looks like.
Think about a time when this happened to you:
• You tapped an app icon, and the app was about to load when it immediately vanished.
• A message displayed saying “application has stopped” or “application not responding.”
• You tapped a button, and the app made no sign of responding to your tap. When you tried again, you got the same response.
• An empty screen displayed, or a screen with old results, and you had to refresh.
• You waited for something to load, and eventually abandoned it by clicking the back button.
These are all examples of an application being effectively “unavailable” to you. You, the user, interacted with the application (e.g., loaded it from the home screen) and it did not perform in a way you expected, such as the application crashing. One way to think about mobile application reliability is its ability to be available, servicing interactions consistently well relative to the user’s expectations. Users are constantly interacting with their mobile apps, and to understand how available these apps are, we need on-device, client-side telemetry to measure and gain visibility. As a well-known saying goes, “If you can’t measure it, you can’t improve it.”
Crash reports
When an app is crashing, the crash is a clear signal of possible unavailability. A user’s experience might be interrupted with a crash dialog, the application might close unexpectedly, or the user might be prompted to report a bug. Crashes can occur for a number of reasons when an exception is not caught, such as a null-pointer dereference, an issue with locally cached data, or an invalid server response, thereby causing the app to terminate. Whatever the reason, it’s critical to monitor and triage these issues right away.
Crash reporting solutions such as Firebase Crashlytics can help collect data on crashes from devices, cluster them based on the stack trace, and alert you of anomalies. On a wide enough install base, you might find crashes that occur only on particular app or platform versions, from a particular locale, on a certain device model, or according to a peculiar combination of factors. In most cases, a crash is triggered by some change, either binary, configuration, or external dependency. The stack trace should give you clues as to where in the code the exception occurred and whether the issue can be mitigated by pausing a binary rollout, rolling back a configuration flag, or changing a server response.
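For instance, beyond automatic crash collection, you can attach dimensions to reports so that clustering and triage are easier. A minimal sketch using the Firebase Crashlytics Android API (the helper, key names, and the cache scenario are illustrative):

```kotlin
import com.google.firebase.crashlytics.FirebaseCrashlytics

// Record a handled (non-fatal) error with extra dimensions; fatal crashes
// are captured by Crashlytics automatically.
fun reportCacheReadFailure(e: Exception, cacheName: String, appVersion: String) {
    val crashlytics = FirebaseCrashlytics.getInstance()
    // Custom keys appear alongside the stack trace and support filtering
    // by app version, device model, locale, and so on.
    crashlytics.setCustomKey("cache_name", cacheName)
    crashlytics.setCustomKey("app_version", appVersion)
    crashlytics.recordException(e)
}
```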
Service-Level Indicators
As defined in Site Reliability Engineering, by Betsy Beyer, Chris Jones, Jennifer Petoff, and Niall Richard Murphy (O’Reilly, 2016), a Service-Level Indicator (SLI) is “a carefully defined quantitative measure of some aspect of the level of service that is provided.” Considering our previous statement about servicing users and their expectations, a key SLI for an app might be the availability or latency of a user interaction. However, an SLI is a metric, and usually an aggregation of events. For example, a possible definition of an availability SLI for the “search” interaction might be as follows:
$$\text{Availability SLI}_{\text{search}} = \frac{\text{events}_{\text{search},\,\text{code}=\text{OK}}}{\text{events}_{\text{search}}}$$
Such SLI events are logged on the device, uploaded, and then aggregated and presented for production monitoring or analytics. As with other user data, monitoring data should be collected and stored in a privacy-compliant way—for example, to adhere to policies such as the European Union’s General Data Protection Regulation (GDPR). Here is an example log event, to capture the performance of a voice search interaction:
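The original figure with the example event is not reproduced here; as a hedged reconstruction, an event of that kind might carry fields along these lines (a Kotlin sketch; all field names are illustrative, and real schemas vary by logging pipeline):

```kotlin
// Hypothetical client-side log event for one voice search interaction.
data class InteractionLogEvent(
    val interactionId: String,     // e.g., "search.voice"
    val status: Status,            // outcome as the user experienced it
    val latencyMs: Long,           // end-to-end latency observed on-device
    val appVersion: String,
    val platform: String,          // "android" or "ios"
    val country: String,
    val experimentIds: List<Long>, // active experiment arms, for slicing
)

enum class Status { OK, ERROR, TIMEOUT }
```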
To derive an SLI metric from an event, we need a formal definition of which set of events to include (e.g., the critical user interactions from versions within a support horizon) and the event success criteria (e.g., code = OK). This model allows the telemetry to be durable to changes, where events logged on clients can contribute to one or more SLIs. These logged events can also apply to different success criteria. It supports more advanced use cases, such as slicing an SLI metric along dimensions like country, locale, app version, and so on. Furthermore, it allows for more sophisticated definitions of reliability; that is, an availability definition in which success is “code = OK AND latency < 5s” is more consistent with user-perceived availability and thresholds for abandonment.
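As an illustrative sketch of that derivation, reusing the InteractionLogEvent type above and hardcoding the formal definition for brevity:

```kotlin
// Availability SLI for voice search: the fraction of interactions that
// succeeded within the user-perceived threshold ("code = OK AND latency < 5s").
fun voiceSearchAvailability(events: List<InteractionLogEvent>): Double {
    val searchEvents = events.filter { it.interactionId == "search.voice" }
    if (searchEvents.isEmpty()) return 1.0 // no traffic, nothing counts against us
    val good = searchEvents.count { it.status == Status.OK && it.latencyMs < 5_000 }
    return good.toDouble() / searchEvents.size
}
```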
After you have high-quality SLI metrics, you might consider setting the overall reliability goals in the form of SLOs—the number of 9’s you’d expect to deliver. SLOs can be very useful in the context of real-time monitoring and change management, and they also can help set engineering priorities. If an application is not meeting its reliability goals, the time spent in feature development can be diverted to performance and reliability work (or vice versa, when an app is consistently exceeding its goals).
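As a worked example with illustrative numbers: a 99.9% availability SLO for an app that serves ten million search interactions in the SLO window leaves an error budget of

$$(1 - 0.999) \times 10^{7} = 10^{4} \ \text{failed interactions}$$

Exhausting that budget is the data-driven signal to divert effort from features to reliability; a comfortable surplus argues for taking more release risk.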
Real-Time Monitoring
SRE teams love real-time metrics: the faster we see a problem, the more quickly we can alert on it, and the quicker we can begin investigating. Real-time metrics help us see the results of our efforts to fix things quickly and get feedback on production changes. Over the years, many postmortems at Google have looked at the early symptoms of an incident and called for improvements in monitoring to quickly detect a fault that, in retrospect, was pretty obvious.
Let’s look at server monitoring as an example. For this example, most incidents have a minimum resolution time that is driven more by humans organizing themselves around problems and determining how to fix them than by the promptness of the alerting. For mobile, the resolution time is also affected by the speed with which fixes, when necessary, can be pushed to devices. Most mobile experimentation and configuration at Google is polling oriented, with devices updating themselves during opportune times of battery and bandwidth conservation. This means that even after submitting a fix, it might be several hours before client-side metrics can be expected to normalize.
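On Android, for example, one way to implement such opportunistic polling (sketched here with androidx WorkManager; the worker and the 12-hour interval are illustrative) is to constrain the sync to battery- and bandwidth-friendly conditions:

```kotlin
import android.content.Context
import androidx.work.Constraints
import androidx.work.CoroutineWorker
import androidx.work.NetworkType
import androidx.work.PeriodicWorkRequestBuilder
import androidx.work.WorkManager
import androidx.work.WorkerParameters
import java.util.concurrent.TimeUnit

// Polls for new configuration/experiment state in the background.
class ConfigSyncWorker(ctx: Context, params: WorkerParameters) : CoroutineWorker(ctx, params) {
    override suspend fun doWork(): Result {
        // fetchLatestConfig() would go here; placeholder for your config client.
        return Result.success()
    }
}

fun scheduleConfigSync(context: Context) {
    // Only sync when it is cheap for the user: battery not low, unmetered network.
    // This conservatism is exactly why a pushed fix can take hours to propagate.
    val constraints = Constraints.Builder()
        .setRequiresBatteryNotLow(true)
        .setRequiredNetworkType(NetworkType.UNMETERED)
        .build()
    val request = PeriodicWorkRequestBuilder<ConfigSyncWorker>(12, TimeUnit.HOURS)
        .setConstraints(constraints)
        .build()
    WorkManager.getInstance(context).enqueue(request)
}
```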
In spite of the latency just discussed, we do find that on widely deployed apps, even if the majority of the installed population doesn’t see the effects of a fix for hours, client telemetry is constantly arriving. Therefore, it becomes a near certainty that some of the devices that have picked up a fix will also, by chance, upload their telemetry shortly afterward. This leads to two general approaches for actionable feedback:
• Design low-latency error ratios with high-confidence denominators (to control for normal traffic fluctuation), so after pushing a fix, you can immediately begin looking for a change in error rates. There’s a small cognitive shift here: an SRE looking at an error ratio curve needs to mentally or programmatically scale the change they see by the fix uptake rate. This shift comes naturally with time.
• Design metrics such that the metrics derived from device telemetry include the configuration state as a dimension. Then, you can constrain your view of error metrics to consider only the devices that are using your fix, as shown in the sketch after this list. This becomes easier under experiment-based change regimes, in which all client changes are rolled out through launch experiments, given that problems almost always occur in one or another experiment population, and the ID of the experiment that caused the problem (or that contains a fix) is consistently a dimension in your monitoring.
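A minimal sketch of the second approach, reusing the InteractionLogEvent type from earlier (the experiment-ID dimension is illustrative):

```kotlin
// Error ratio restricted to devices reporting a given experiment arm, i.e.,
// only devices that have actually picked up the change (or the fix).
fun errorRatioForExperiment(
    events: List<InteractionLogEvent>,
    experimentId: Long,
): Double? {
    val inArm = events.filter { experimentId in it.experimentIds }
    if (inArm.isEmpty()) return null // no telemetry from this arm yet
    return inArm.count { it.status != Status.OK }.toDouble() / inArm.size
}
```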
We typically employ two kinds of real-time monitoring, white-box and black-box, which are used together to alert us of critical issues affecting mobile apps in a timely manner.
White-box monitoring
When developing an app, we have the luxury of opening a debug console and looking at fine-grained log statements to inspect app execution and state. However, when it is deployed to an end user’s device, we have visibility into only what we have chosen to measure and transport back to us. Measuring counts of attempts, errors, states, or timers in the code—particularly around key entry/exit points or business logic—provides indications of the app’s usage and correct functioning.
We have already alluded to several standard types of monitoring, including crash reports and SLI metrics. We also can instrument custom metrics in the app to monitor business logic. These are examples of white-box monitoring, or monitoring of metrics exposed by the internals of the app. This class of monitoring can produce very clear signals of an app’s observed behaviors in the wild.
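A minimal on-device sketch of this kind of instrumentation (names are illustrative; a production telemetry library would also batch, persist, and upload these counters):

```kotlin
import java.util.concurrent.ConcurrentHashMap
import java.util.concurrent.atomic.AtomicLong

// White-box metrics: named counters incremented around key code paths.
object Metrics {
    private val counters = ConcurrentHashMap<String, AtomicLong>()

    fun increment(name: String) {
        counters.getOrPut(name) { AtomicLong() }.incrementAndGet()
    }

    fun snapshot(): Map<String, Long> = counters.mapValues { it.value.get() }
}

// Wrap a business-logic call to count attempts and errors and time it.
inline fun <T> timed(name: String, block: () -> T): T {
    Metrics.increment("$name.attempts")
    val startNanos = System.nanoTime()
    try {
        return block()
    } catch (e: Exception) {
        Metrics.increment("$name.errors")
        throw e
    } finally {
        val elapsedMs = (System.nanoTime() - startNanos) / 1_000_000
        // A real implementation would record elapsedMs into a latency histogram.
        Metrics.increment("$name.latency_ms_${if (elapsedMs < 5_000) "lt5s" else "ge5s"}")
    }
}
```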
Black-box monitoring
Black-box monitoring—testing external, visible behaviors of the app as if the user performed those actions—is complementary to white-box monitoring. Generally, probing is a kind of black-box monitoring in which a regularly scheduled “test” is performed. For applications, this entails starting a current version of the binary on a real or emulated device, inputting actions as a user would, and asserting that certain properties hold true throughout the test. For example, to exercise the search “user journey,” a UI test probe would be written to install and launch the app on an emulator, select the input box, type a search term, click the “search” button, and then verify that there are results on the page.
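A sketch of such a probe written as an Espresso UI test (MainActivity and the view IDs are illustrative; a scheduled runner would execute this on an emulator and export pass/fail signals):

```kotlin
import androidx.test.espresso.Espresso.onView
import androidx.test.espresso.action.ViewActions.click
import androidx.test.espresso.action.ViewActions.typeText
import androidx.test.espresso.assertion.ViewAssertions.matches
import androidx.test.espresso.matcher.ViewMatchers.isDisplayed
import androidx.test.espresso.matcher.ViewMatchers.withId
import androidx.test.ext.junit.rules.ActivityScenarioRule
import org.junit.Rule
import org.junit.Test

class SearchJourneyProbe {
    // Launches the app's main activity before each probe run.
    @get:Rule
    val activityRule = ActivityScenarioRule(MainActivity::class.java)

    @Test
    fun searchReturnsResults() {
        onView(withId(R.id.search_input)).perform(typeText("good restaurants"))
        onView(withId(R.id.search_button)).perform(click())
        // The journey succeeds only if results actually render for the user.
        onView(withId(R.id.search_results)).check(matches(isDisplayed()))
    }
}
```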
Black-box monitoring gives us a signal about the UX. A continuously running prober can give us successes or failures for particular user journeys in the application that can be attributed to the application itself or any number of external factors that affect it, such as a dependent server endpoint. Even though black-box monitoring has more coverage of failure modes, it does not easily indicate whether the failure is internal to the application. This is why we view white-box and black-box monitoring to be complementary and why we use them together for better coverage of the app.
Performance and Efficiency
When was the last time your phone ran out of battery at the most inconvenient time? The battery is arguably the most valuable resource of a mobile device—in fact, it is what enables the device to be “mobile” in the first place. Mobile applications on a device share precious resources, such as the battery, network, storage, CPU, and memory. When one application abuses or wastes those shared resources, it does not go unnoticed. No application wants to be at the top of the battery or network usage list and attract negative reviews. Efficiency is particularly important if you expect your app to be used on lower-end devices and in markets where metered network and other resources have nontrivial cost to the user.
The platform as a whole suffers when shared resources are misused, and, increasingly, the OS places limitations to prevent abuse. In turn, platforms provide tools (such as Android Vitals) to help attribute and pinpoint problems in your own app. Google’s internal applications have incentives to care about system health factors that affect user happiness, such as battery life. In fact, feature launches that would have decreased mean device battery life by as little as 0.1% have been blocked from launching precisely because of their negative effect on user happiness. Many small regressions across applications, features, and platforms create a tragedy of the commons and a poor overall experience. Antipatterns that lead to issues with performance and efficiency, dubbed “known bads,” are identified and published in internal and external developer guides (e.g., the Android Developer Guide).
Teams at Google are required to identify their use cases—both when the screen is on and when it is off—that might affect performance metrics. These include use cases such as key user flows, frequently encountered flows, and the use cases expected to have high resource usage. Teams do a variety of internal testing to collect statistics on mobile system components such as battery, memory, and binary size. Any unexpected regressions are triaged and fixed before launch, and any expected regressions beyond predefined thresholds are approved only after careful consideration of the trade-offs vis-à-vis the benefits a feature user would acquire. As a result of this process, much of the system health testing is automated and reports are easily prepared for review.

Change Management
We recommend a number of best practices when releasing client applications. Many of these practices are based upon the principles expressed in Site Reliability Engineering. Best practices are particularly important to SRE because client rollbacks are near impossible and problems found in production can be irrecoverable (see “Case Studies”). This makes it especially important to take extra care when releasing client changes.
Release safety is especially critical because a broken new version can erode user trust, and that user might decide to never update again.
Staged rollout
A staged rollout is a term used in Android development that refers to releasing an update to a percentage of your users that you increase over time. iOS refers to this practice as Phased Releases. All changes should go through some sort of staged rollout before releasing fully to external users. This allows you to gradually gather production feedback on your release rather than blasting the release to all users at once. Figure 1-1 shows an example of a release life cycle, including a staged rollout.
Figure 1-1. Release life cycle with a staged rollout in which 1% of users receive the new version, then 10% of users, then 50% of users, before releasing to all users.
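Release tooling varies by team and platform, but the gating logic behind each stage can be as simple as this sketch (the stage fractions match Figure 1-1; the error-ratio tolerance is illustrative):

```kotlin
// Decide whether a staged rollout may advance, comparing the new version's
// error ratio against the current production baseline.
sealed interface RolloutDecision
object Halt : RolloutDecision                 // regression detected: stop and investigate
object Done : RolloutDecision                 // already released to all users
data class Advance(val fraction: Double) : RolloutDecision

val stages = listOf(0.01, 0.10, 0.50, 1.00)

fun decide(
    currentFraction: Double,
    newVersionErrorRatio: Double,
    baselineErrorRatio: Double,
    tolerance: Double = 0.001,
): RolloutDecision {
    if (newVersionErrorRatio > baselineErrorRatio + tolerance) return Halt
    val next = stages.firstOrNull { it > currentFraction } ?: return Done
    return Advance(next)
}
```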
Internal testing and dogfooding (using your own product internally) is rarely enough to fully qualify a release because developer devices are not typically representative of the actual user population. If the platform supports it, it can be useful to add an extra stage between internal and external users in which you release to a subset of external users that choose to receive beta versions of your application (open beta in Android or TestFlight in iOS). This widens the pool of devices and users, meaning that you can test on a larger set of users while not affecting the entire world, and adds an extra step of release qualification.
Experimentation
Unlike traditional services, client applications tend to exist in a very diverse ecosystem. Each user’s device differs from the next, depending on platform version, application version, locale, network, and many other factors. One issue we’ve noticed when conducting a staged rollout is that some metrics, such as latency, can look great on the newest version for the first few days but then change significantly for the worse after a week. Users who have better networks and/or devices tend to upgrade earlier, whereas those who have worse networks and/or devices upgrade later. These factors make it less straightforward to compare metrics of the newest version to the metrics of the previous version and tend to result in manual inspection of graphs, with low confidence in the correctness of the signal. When possible, we recommend releasing all changes via experiments and conducting an A/B analysis, as shown in Figure 1-2, over a staged-rollout process. This helps reduce the noise significantly and enables simpler automation. Control and treatment group selection should be randomized for every change to ensure that the same group of users are not repeatedly updating their applications.
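A common way to get per-change randomization is to bucket users with a hash salted by the change’s identifier, as in this sketch (a production system would use a stronger, stable hash than hashCode):

```kotlin
// Deterministic, per-change group assignment: salting by changeId reshuffles
// the control/treatment split for every change, so the same users are not
// always the early adopters.
fun isInTreatment(userId: String, changeId: String, treatmentFraction: Double): Boolean {
    val bucket = Math.floorMod("$userId:$changeId".hashCode(), 10_000) / 10_000.0
    return bucket < treatmentFraction
}
```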
Figure 1-2. A/B experiment analysis for mobile change
New code is released through the binary release and should be disabled by using a feature flag by default.
Feature flags are another tool to manage production changes, and you need to use them according to the relevant distribution platform’s rules (e.g., Apple’s App Store has specific rules on allowed changes).
Releasing code through the binary release makes the feature available on all users’ devices, and launching the feature with a feature flag enables it for a smaller set of users that is controlled by the developer, as illustrated in Figure 1-3. Feature flags are generally the preferred way to make changes to the client because they give the developer more control in determining the initial launch radius and timeline. Rolling back the feature flag is as simple as ramping the launch back down to 0%, instead of rebuilding an entire binary with the fix to release to the world.
When using feature flag-driven development, it is especially important to test that rolling back the flag does not break the application. You should perform this testing before the flag launch begins.
Figure 1-3. Example stages of a feature flag ramp: a feature flag’s functionality is tested first on internal users before rolling out in stages to production users. If an issue is found on a subset of production users, it can be rolled back and the code is fixed before ramping the feature flag to 100%.
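A sketch of flag-guarded client code, here using Firebase Remote Config as the flag delivery mechanism (the flag name and render functions are illustrative; your flag system may differ):

```kotlin
import com.google.firebase.ktx.Firebase
import com.google.firebase.remoteconfig.ktx.remoteConfig

// New code ships dark inside the binary; the flag controls who sees it.
// "Rolling back" means ramping the flag to 0%, not shipping a new binary.
fun renderSearchPage() {
    if (Firebase.remoteConfig.getBoolean("enable_new_search_ui")) {
        renderNewSearchUi()    // flag-guarded new code path
    } else {
        renderLegacySearchUi() // default-off path must keep working
    }
}

// Placeholders for the two UI implementations.
fun renderNewSearchUi() { /* ... */ }
fun renderLegacySearchUi() { /* ... */ }
```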
Upgrade side effects and noise
Some changes can introduce side effects simply by the process of upgrading to the newest version of the code. For example, if the change requires a restart to the application, the experiment group (the group that received the change) needs to take into account the side effects of restarting an application, such as the latency increase caused by cold caches.
One way to address side effects is to create something like a placebo change, in which the control group receives a no-op update, and the users go through much of the same behavior as the experiment group.