Antifragile Systems and Teams
by Dave Zwieback
Antifragile Systems and Teams
by Dave Zwieback
Copyright © 2014 O’Reilly Media, Inc. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.
O’Reilly books may be purchased for educational, business, or sales promotional use.
Online editions are also available for most titles (http://my.safaribooksonline.com). For
more information, contact our corporate/institutional sales department: 800-998-9938
or corporate@oreilly.com.
Editor: Mike Loukides
April 2014: First Edition
Revision History for the First Edition:
2014-04-21: First release
2015-03-24: Second release
Nutshell Handbook, the Nutshell Handbook logo, and the O’Reilly logo are registered
trademarks of O’Reilly Media, Inc. Antifragile Systems and Teams and related trade
dress are trademarks of O’Reilly Media, Inc.
Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and O’Reilly Media, Inc., was aware of a trademark claim, the designations have been printed in caps or initial caps.
While every precaution has been taken in the preparation of this book, the publisher and authors assume no responsibility for errors or omissions, or for damages resulting from the use of the information contained herein.
ISBN: 978-1-491-94796-8
[LSI]
Table of Contents
Antifragile Systems and Teams
Impermanence
Fragile, Robust, and Beyond
Antifragility
DevOps and Antifragility
Culture, Part 1: Managing the Downside
Culture, Part 2: Skin In the Game
Culture, Part 3: Tinkering (or Continual Experimentation and Learning)
Automation and Measurement: A Cautionary Tale
Sharing
Is DevOps Antifragile?
Acknowledgments
Antifragile Systems and Teams
Impermanence
Systems break because of change, and we often go to great lengths to prevent or manage change with heavy-handed, bureaucratic change-management processes. What we forget is that systems also function because of change.
For instance, computer systems function through state transition: 0s changing to 1s and back again. More fundamentally, computers work because 0s can change into 1s, because computer systems are changeable.
This is a somewhat subtle point, one that’s easy to overlook in search of more tangible conditions required for systems to function or malfunction. But it’s worth repeating: the fundamental reason that systems start, stop, or continue working (the one root cause of all functioning systems and all system outages) is their changeable, impermanent nature.
More broadly, impermanence is a fundamental property of all compounded things, i.e., those that consist of two or more parts. All of the systems that we work with certainly have two or more parts (how about five billion parts for the upcoming Xbox chip?).
So how can this theoretical and philosophical understanding of impermanence help us? How can we make impermanence usable and useful?
First, impermanence is useful because it reminds us that all functioning systems will eventually break down. Understanding impermanence frees us from looking for the “single root cause” of outages, and from the mistaken belief that there is one. (Sidney Dekker proclaims that “What you call ‘root cause’ is simply the place where you stop looking any further.”[1]) The single root cause of all outages, and of all functioning, is impermanence, the changeable nature of all systems.

[1] Dekker, Sidney. The Field Guide to Understanding Human Error: Second Edition. Farnham, Surrey, UK: Ashgate Publishing, 2006.

Second, having accepted impermanence, we (as engineers) cannot accept that things break or function entirely randomly. If we mixed two parts hydrogen with one part oxygen and sometimes got water and at other times ice cream, we might as well give up our chosen profession! While we may not be able to identify all the conditions required to build or keep the system running or all those that result in malfunctions, we can certainly identify some of the conditions. Engineering, above all, is the discipline of studying conditions and their effects.

Most important, we can identify some of the conditions that we can actually impact. While a butterfly fluttering its wings in Africa may be one of the conditions of a hurricane resulting in a power outage in Northern Virginia, we don’t have any control over the butterfly, but at least we have some say in whether our data centers have backup power generators.
Fragile, Robust, and Beyond
While all systems break eventually (because they’re all breakable, impermanent), they differ in their robustness to stress. For instance, imagine you have a very important document that you want to protect and share widely. Let’s say you decide to store it on a physical hard disk (the spinning platters variety) in a server connected to the Internet via a single network connection. This particular system would be fragile, since a failure of any single component (disk, server, network, etc.) would make the file unavailable, possibly forever. Moreover, if the file became widely popular, its popularity would negatively impact its availability, as any of the subsystems could easily be saturated.

As we can see, fragile systems dislike and are harmed by even small amounts of stress (a.k.a. volatility, randomness, disorder, or impermanence). One way to reduce fragility is to increase redundancy. We could build a more robust system by removing single points of failure: mirroring the disks, setting up multiple servers in high-availability configurations and distributing them geographically, adding multiple higher-bandwidth Internet links, etc. While this system would be more robust and increase the availability of the document, it would remain vulnerable to somewhat rare but catastrophic conditions like the leap second bug (because all servers ran the same version of the OS) or equipment seizure by the government. Moreover, this is a considerably more complex system than a single-server one, and it will at times function and break in unexpected and unpredictable ways.
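The effect of redundancy on availability can be made concrete with a little arithmetic. The sketch below is only an illustration: the 99% availability figures are hypothetical, the function names are made up for this example, and the calculation assumes that components fail independently of one another.

```python
from math import prod


def series_availability(components):
    """Every component must work, so the availabilities multiply."""
    return prod(components)


def redundant_availability(availability, copies):
    """At least one of `copies` independent replicas must work."""
    return 1 - (1 - availability) ** copies


# Hypothetical figures: disk, server, and network link each up 99% of the time.
single_chain = series_availability([0.99, 0.99, 0.99])

# The same chain with every component mirrored once.
mirrored_pair = redundant_availability(0.99, copies=2)
mirrored_chain = series_availability([mirrored_pair] * 3)

print(f"single chain:   {single_chain:.4%}")
print(f"mirrored chain: {mirrored_chain:.4%}")
```

Under these assumptions the mirrored chain is unavailable far less often than the single chain. The independence assumption is exactly what a condition like the leap second bug violates: when every replica runs the same OS version, the replicas fail together and the extra copies add nothing.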
We usually think of robust systems as the opposite of fragile ones because they don’t care too much about (i.e., neither like nor dislike) stress. In fact, robust systems are merely less fragile. The true opposite of a fragile system would be one that actually benefits from stress. Nassim Nicholas Taleb calls such systems antifragile.
In the case of the aforementioned file, imagine we stored it in a system that made it more available the higher the demand for it was. In fact, BitTorrent is precisely this type of system: the more our file is requested, the more robust to failure and available it becomes because parts of it are stored on a progressively larger number of computers. In fact, the best-case scenario for a file stored on BitTorrent is that every person on earth would want to download it and then share it. This scenario is exactly the absolute worst case for both fragile and robust systems. Note also that our cost of distributing this file would remain constant, which is not so for the cost of making systems more robust to anticipate higher demand or improve resiliency.
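A toy model can show why demand hurts one design and helps the other. This is not a description of the BitTorrent protocol; the capacities, the per-peer upload rate, the 50% peer uptime, and all function names are assumptions chosen purely for illustration.

```python
def server_fraction_served(demand, capacity=100):
    """Fixed-capacity server: the share of requests it can satisfy."""
    return min(1.0, capacity / demand)


def swarm_fraction_served(demand, upload_per_peer=1.0, seed_capacity=100):
    """Swarm in which every requester also uploads, so capacity grows with demand."""
    capacity = seed_capacity + upload_per_peer * demand
    return min(1.0, capacity / demand)


def chance_some_copy_survives(peer_uptime, peers):
    """Probability that at least one of `peers` independent copies is reachable."""
    return 1 - (1 - peer_uptime) ** peers


for demand in (50, 1_000, 10_000):
    print(demand, server_fraction_served(demand), swarm_fraction_served(demand))

print(chance_some_copy_survives(peer_uptime=0.5, peers=10))
```

In this model the fixed server serves a shrinking share of requests as demand climbs, while the swarm keeps up because each new requester brings capacity with it. Even unreliable peers make the file hard to lose once there are enough of them.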
Antifragility
The main property of antifragile systems is that the potential downside due to stress (and its retinue) is lower than the potential upside, up to a point. Taleb defines this asymmetry as follows: you are antifragile to event intensity between x_low and x_high if you are better off after the event than before (up to x_high). The rare, entirely unpredictable event (a.k.a. the Black Swan event) in which every person on earth wants to download a copy of a file distributed on BitTorrent would actually make the file maximally available, robust to failure, and practically impossible to delete.
BitTorrent is similar to another human/social system with antifragile properties: information shared via gossip. The more someone (e.g., a government or a self-righteous group) tries to suppress, criticize, or disprove some information (e.g., an idea or a book), the more widespread and persistent it becomes. Critics of Miley Cyrus did her a great favor with their outraged responses to her 2013 VMA performance; a smarter strategy would have been to ignore the train wreck entirely. In fact, Brendan Behan’s famous observation that “there’s no bad publicity except an obituary”[2] provides a nice summary of antifragility: the downside to gossip and negative publicity (e.g., damage to one’s reputation) is smaller than the upside (e.g., selling more records), up to a point (e.g., death).

[2] McCann, Sean. The World of Brendan Behan. London: Twayne Publishers, 1966. 56.
The longevity and popularity of some books is due in part to being banned at one time or another; in fact, 46 of the 20th century’s top 100 novels were targets of ban attempts. From my childhood in Soviet Russia, I remember my parents and their friends staying up nights to read a samizdat version of The Gulag Archipelago, despite the fact that those found in possession of the banned manuscript faced long prison sentences. This illustrates another often overlooked property of antifragile systems: antifragility by layers. Being banned may help books become antifragile, while reading banned books can make individuals fragile.
More generally, complex systems contain myriad components, sometimes organized in hierarchies or layers. As in the above example, sometimes antifragility is achieved at a certain layer of a complex system at the cost of fragility of another. Another example of antifragility by layers is vaccination: while vaccination appears to benefit populations, vaccines are known to cause serious side effects for some individuals. That is, vaccination makes a population antifragile because the downside (a small number of individuals having negative side effects) is small in comparison to the upside (an entire population gaining immunity to a disease). However, vaccination certainly makes the specific individuals who wind up having side effects, or who die from them, fragile.
DevOps and Antifragility
DevOps is a modern organizational philosophy, which can be applied to improve the antifragility of complex systems. (“Systems” includes people as well as computers and software.) Let’s explore how the layers of DevOps (Culture, Automation, Measurement, and Sharing) contribute to the overall antifragility of organizations that adopt this philosophy.
Culture, Part 1: Managing the Downside
Etsy is known for deploying to production 40+ times per day. Netflix routinely terminates production instances at random or introduces latency into their application. An architect working on HP printer firmware “would purposely make changes that would break the code until it was architecturally correct.”[3] What makes all these examples acceptable or even possible within organizations? Fundamentally, these organizations have realized that the potential downside of frequent changes (otherwise known as volatility, variance, etc.) is smaller than the potential upside. This knowledge has become part of the culture at these companies. This is also the classic asymmetry of pain versus gain required to make these companies and their systems antifragile.

[3] Gruver, Gary, Mike Young, and Pat Fulghum. A Practical Approach to Large-Scale Agile Development: How HP Transformed LaserJet FutureSmart Firmware. Boston: Addison-Wesley, 2012.
Figure 1. The antifragile asymmetry between potential downside and upside
The major upside of embracing higher volatility is that customers receive higher-quality products and services (i.e., value) faster and at a lower cost than is possible with traditional, risk- and volatility-averse approaches. The often overlooked and critical point is that this is possible not only because these companies have identified potential upside, but also because they have become experts at minimizing the downside:

At Etsy “this isn’t license to break stuff, quickly. Engineer-driven QA and solid unit testing are integral parts of the process.”[4]

Netflix “is known for being bold in its rapid pursuit of innovation and high availability, but not to the point of callousness. It is careful to avoid any noticeable impact to customers from these failure-induction exercises.”[5]

At HP “we could tell when [the architect] had found and plugged a hole because a particular test area would have a huge drop in passing tests or the build would break completely.”[6]

[4] http://slidesha.re/1fsqzHB
[5] http://bit.ly/1e9i40t
[6] Gruver, Gary, Mike Young, and Pat Fulghum. A Practical Approach to Large-Scale Agile Development: How HP Transformed LaserJet FutureSmart Firmware. Boston: Addison-Wesley, 2012.
It is the robust testing, quality assurance, and (continuous) deployment practices that enable these companies to reduce the possibility of any individual change causing a catastrophic event. Both Etsy and Netflix have become experts at quickly going back to a previously working version of software should the newly deployed one be found defective. Amazon takes this up a notch, performing “rollbacks” automatically in the event that a deploy causes problems. This is one of the reasons that only about 0.001% of software deployments at Amazon cause outages, even with the mean time between deployments a staggering 11.6 seconds! At the same time, in all these companies, some of the potentially riskier changes (e.g., database schema or OS updates on core routers) are certainly not performed frequently or continuously.
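To get a feel for what the Amazon figures imply, the two numbers quoted above can be combined. This is a back-of-the-envelope sketch: it assumes deployments arrive at the same mean rate around the clock and treats “about 0.001%” as exact, neither of which the source states.

```python
# Figures quoted in the text; "about 0.001%" is approximate.
MEAN_SECONDS_BETWEEN_DEPLOYS = 11.6
OUTAGE_FRACTION = 0.001 / 100  # 0.001% expressed as a fraction

SECONDS_PER_DAY = 24 * 60 * 60

deploys_per_day = SECONDS_PER_DAY / MEAN_SECONDS_BETWEEN_DEPLOYS
outage_deploys_per_day = deploys_per_day * OUTAGE_FRACTION
days_between_outage_deploys = 1 / outage_deploys_per_day

print(f"{deploys_per_day:,.0f} deployments per day")
print(f"{outage_deploys_per_day:.3f} outage-causing deployments per day")
print(f"one outage-causing deployment every {days_between_outage_deploys:.1f} days")
```

Under these assumptions, roughly 7,400 deployments a day translate into an outage-causing deployment about once every two weeks, which shows how much work the low failure rate is doing at that deployment frequency.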
We should be careful not to attribute the antifragility of the above companies entirely to effective testing or QA, or to the frequency and the skill with which they release to production. For comparison, consider Knight Capital, which was a large global financial services firm until trading losses forced it to be acquired in a fire sale. Knight appears to have had reasonable testing and QA processes and an agile-like, iterative software development and deployment approach similar to the one practiced by Etsy and Netflix:
1. Initial development phase with no market interaction