
Antifragile Systems and Teams

by Dave Zwieback

Printed in the United States of America.

Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use.

Online editions are also available for most titles (http://my.safaribooksonline.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Editor: Mike Loukides

April 2014: First Edition

Revision History for the First Edition:

2014-04-21: First release

2015-03-24: Second release

Nutshell Handbook, the Nutshell Handbook logo, and the O’Reilly logo are registered trademarks of O’Reilly Media, Inc. Antifragile Systems and Teams and related trade dress are trademarks of O’Reilly Media, Inc.

Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and O’Reilly Media, Inc., was aware of a trademark claim, the designations have been printed in caps or initial caps.

While every precaution has been taken in the preparation of this book, the publisher and authors assume no responsibility for errors or omissions, or for damages resulting from the use of the information contained herein.

ISBN: 978-1-491-94796-8

[LSI]


Table of Contents

Antifragile Systems and Teams
Impermanence
Fragile, Robust, and Beyond
Antifragility
DevOps and Antifragility
Culture, Part 1: Managing the Downside
Culture, Part 2: Skin In the Game
Culture, Part 3: Tinkering (or Continual Experimentation and Learning)
Automation and Measurement: A Cautionary Tale
Sharing
Is DevOps Antifragile?
Acknowledgments


Antifragile Systems and Teams

Impermanence

Systems break because of change, and we often go to great lengths to prevent or manage change with heavy-handed, bureaucratic change-management processes. What we forget is that systems also function because of change.

For instance, computer systems function through state transition: 0s changing to 1s and back again. More fundamentally, computers work because 0s can change into 1s, because computer systems are changeable.

This is a somewhat subtle point, one that’s easy to overlook in search of more tangible conditions required for systems to function or malfunction. But it’s worth repeating: the fundamental reason that systems start, stop, or continue working—the one root cause of all functioning systems and all system outages—is their changeable, impermanent nature.

More broadly, impermanence is a fundamental property of all compounded things, i.e., those that consist of two or more parts. All of the systems that we work with certainly have two or more parts (how about five billion parts for the upcoming Xbox chip?).

So how can this theoretical and philosophical understanding of impermanence help us? How can we make impermanence usable and useful?

First, impermanence is useful because it reminds us that all functioning systems will eventually break down. Understanding impermanence frees us from looking for the “single root cause” of outages, and from the mistaken belief that there is one. (Sidney Dekker proclaims that “What you call ‘root cause’ is simply the place where you stop looking any further.”1) The single root cause of all outages—and all functioning—is impermanence, the changeable nature of all systems.

1. Dekker, Sidney. The Field Guide to Understanding Human Error: Second Edition. Farnham, Surrey, UK: Ashgate Publishing, 2006.

Second, having accepted impermanence, we (as engineers) cannot accept that things break or function entirely randomly. If we mixed two parts hydrogen with one part oxygen and sometimes got water and at other times ice cream, we might as well give up our chosen profession! While we may not be able to identify all the conditions required to build or keep the system running or all those that result in malfunctions, we can certainly identify some of the conditions. Engineering, above all, is the discipline of studying conditions and their effects.

Most important, we can identify some of the conditions that we can actually impact. A butterfly fluttering its wings in Africa may be one of the conditions of a hurricane resulting in a power outage in Northern Virginia. We don’t have any control over the butterfly, but at least we have some say in whether our data centers have backup power generators.

Fragile, Robust, and Beyond

While all systems break eventually (because they’re all breakable, impermanent), they differ in their robustness to stress. For instance, imagine you have a very important document that you want to protect and share widely. Let’s say you decide to store it on a physical hard disk (the spinning platters variety) in a server connected to the Internet via a single network connection. This particular system would be fragile, since a failure of any single component (disk, server, network, etc.) would make the file unavailable, possibly forever. Moreover, if the file became widely popular, its popularity would negatively impact its availability, as any of the subsystems could easily be saturated.

As we can see, fragile systems dislike and are harmed by even small amounts of stress (a.k.a. volatility, randomness, disorder, or impermanence). One way to reduce fragility is to increase redundancy: we could build a more robust system by removing single points of failure by mirroring the disks, setting up multiple servers in high-availability configurations and distributing them geographically, adding multiple higher-bandwidth Internet links, etc. While this system would be more robust and increase the availability of the document, it would remain vulnerable to somewhat rare but catastrophic conditions like the leap second bug (because all servers ran the same version of the OS) or equipment seizure by the government. Moreover, this is a considerably more complex system than a single-server one, and it will at times function and break in unexpected and unpredictable ways.
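The effect of removing single points of failure can be sketched with basic availability arithmetic. The sketch below assumes that components fail independently, and the per-component availability figures are hypothetical, chosen only for illustration. It is a rough model, not a sizing tool.

```python
from math import prod


def series(*availabilities: float) -> float:
    """Availability of a chain in which every component must work."""
    return prod(availabilities)


def parallel(availability: float, copies: int) -> float:
    """Availability of `copies` redundant components when one suffices.

    Assumes failures are independent, which is exactly what a shared bug
    (e.g., every server running the same OS version) violates.
    """
    return 1 - (1 - availability) ** copies


# Hypothetical per-component availabilities, for illustration only.
DISK, SERVER, LINK = 0.99, 0.995, 0.99

# Fragile setup: one disk, one server, one network link.
single = series(DISK, SERVER, LINK)

# More robust setup: each component duplicated.
redundant = series(parallel(DISK, 2), parallel(SERVER, 2), parallel(LINK, 2))

print(f"single chain: {single:.4f}")
print(f"mirrored:     {redundant:.6f}")
```

Under these assumptions the mirrored setup comes out far more available than the single chain. The independence assumption is the weak spot: a correlated failure, such as one bug present on every server, takes out all copies at once, and the formula in `parallel` no longer applies.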

We usually think of robust systems as the opposite of fragile ones because they don’t seem to care too much about (i.e., neither like nor dislike) stress. In fact, robust systems are merely less fragile. The true opposite of a fragile system would be one that actually benefits from stress. Nassim Nicholas Taleb calls such systems antifragile.

In the case of the aforementioned file, imagine we stored it in a system that made it more available the higher the demand for it was. In fact, BitTorrent is precisely this type of system: the more our file is requested, the more robust to failure and available it becomes because parts of it are stored on a progressively larger number of computers. In fact, the best-case scenario for a file stored on BitTorrent is that every person on earth would want to download it and then share it. This scenario is exactly the absolute worst case for both fragile and robust systems. Note also that our cost of distributing this file would remain constant—not so for the cost of making systems more robust to anticipate higher demand or improve resiliency.
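The contrast between the two kinds of systems under growing demand can be illustrated with a toy model. This is a deliberately simplified sketch, not a description of the BitTorrent protocol: the function names, the fixed server capacity, and the assumption that every downloader keeps sharing are all hypothetical, and the model ignores pieces, peer churn, and bandwidth limits.

```python
from typing import List


def server_fraction_served(demand: int, capacity: int) -> float:
    """Fraction of requests a fixed-capacity server can satisfy."""
    if demand == 0:
        return 1.0
    return min(1.0, capacity / demand)


def swarm_copies(initial_seeders: int, rounds: int,
                 uploads_per_peer: int = 1) -> List[int]:
    """Number of peers holding the file after each round.

    Assumes every peer that has the file uploads it to `uploads_per_peer`
    new peers per round, and every downloader keeps sharing afterwards.
    """
    copies = [initial_seeders]
    for _ in range(rounds):
        copies.append(copies[-1] * (1 + uploads_per_peer))
    return copies


# A fixed server degrades as demand outgrows its (hypothetical) capacity.
for demand in (10, 100, 1000):
    served = server_fraction_served(demand, capacity=100)
    print(f"demand={demand:5d}  fraction served={served:.2f}")

# In the swarm model, more demand means more copies to download from.
print(swarm_copies(initial_seeders=1, rounds=10))
```

In this model the server's share of satisfied requests shrinks once demand exceeds capacity, while the number of places the file can be fetched from grows with every round of downloads.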

Antifragility

The main property of antifragile systems is that the potential downside due to stress (and its retinue) is lower than the potential upside, up to a point. Taleb defines this asymmetry as follows: you are antifragile to event intensity between x_low and x_high if you are better off after the event than before (up to x_high). The rare, entirely unpredictable event (a.k.a. the Black Swan event) in which every person on earth wants to download a copy of a file distributed on BitTorrent would actually make the file maximally available, robust to failure, and practically impossible to delete.
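The definition above can be turned into a simple numerical check. The sketch below is one possible reading, under stated assumptions: `response(x)` is a hypothetical function giving the net change in well-being after an event of intensity x, and the lower and upper intensity bounds from the definition appear as the parameters `x_low` and `x_high`. The example response curve is invented for illustration.

```python
from typing import Callable


def is_antifragile_between(
    response: Callable[[float], float],
    x_low: float,
    x_high: float,
    samples: int = 100,
) -> bool:
    """True if `response` is positive at every sampled intensity strictly
    between x_low and x_high.

    `response(x)` is the net change in well-being after an event of
    intensity x: positive means better off than before.
    """
    step = (x_high - x_low) / samples
    return all(response(x_low + i * step) > 0 for i in range(1, samples))


# Hypothetical response curve: gains grow with stress at first,
# then turn into losses once intensity exceeds 10.
def example_response(x: float) -> float:
    return x * (10 - x)


print(is_antifragile_between(example_response, 0, 10))
print(is_antifragile_between(example_response, 0, 12))
```

Sampling only approximates the condition, so a narrow dip between sample points would go unnoticed. The point of the sketch is the shape of the claim: antifragility is always relative to a range of intensities, and the same system can be harmed once the upper bound is exceeded.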

BitTorrent is similar to another human/social system with antifragile properties: information shared via gossip. The more someone (e.g., a government or a self-righteous group) tries to suppress, criticize, or disprove some information (e.g., an idea or a book), the more widespread and persistent it becomes. Critics of Miley Cyrus did her a great favor with their outraged responses to her 2013 VMA performance—a smarter strategy would have been to ignore the train wreck entirely.

In fact, Brendan Behan’s famous observation that “there’s no bad publicity except an obituary”2 provides a nice summary of antifragility: the downside to gossip and negative publicity (e.g., damage to one’s reputation) is smaller than the upside (e.g., selling more records), up to a point (e.g., death).

2. McCann, Sean. The World of Brendan Behan. London: Twayne Publishers, 1966. 56.

The longevity and popularity of some books is due in part to being banned at one time or another; in fact, 46 of the 20th century’s top 100 novels were targets of ban attempts. From my childhood in Soviet Russia, I remember my parents and their friends staying up nights to read a samizdat version of The Gulag Archipelago, despite the fact that those found in possession of the banned manuscript faced long prison sentences. This illustrates another often overlooked property of antifragile systems: antifragility by layers. Being banned may help books become antifragile, while reading banned books can make individuals fragile.

More generally, complex systems contain myriad components, sometimes organized in hierarchies or layers. As in the above example, sometimes antifragility is achieved at a certain layer of a complex system at the cost of fragility of another. Another example of antifragility by layers is vaccination: while vaccination appears to benefit populations, vaccines are known to cause serious side effects for some individuals. That is, vaccination makes a population antifragile because the downside (a small number of individuals having negative side effects) is small in comparison to the upside (an entire population gaining immunity to a disease). However, vaccination certainly makes fragile the specific individuals who wind up having side effects or dying from it.

DevOps and Antifragility

DevOps is a modern organizational philosophy, which can be applied to improve the antifragility of complex systems. (“Systems” includes people as well as computers and software.) Let’s explore how the layers of DevOps (Culture, Automation, Measurement, and Sharing) contribute to the overall antifragility of organizations that adopt this philosophy.

Culture, Part 1: Managing the Downside

Etsy is known for deploying to production 40+ times per day. Netflix routinely terminates production instances at random or introduces latency into their application. An architect working on HP printer firmware “would purposely make changes that would break the code until it was architecturally correct.”3 What makes all these examples acceptable or even possible within organizations? Fundamentally, these organizations have realized that the potential downside of frequent changes (otherwise known as volatility, variance, etc.) is smaller than the potential upside. This knowledge has become part of the culture at these companies. This is also the classic asymmetry of pain versus gain required to make these companies and their systems antifragile.

3. Gruver, Gary, Mike Young, and Pat Fulghum. A Practical Approach to Large-Scale Agile Development: How HP Transformed LaserJet FutureSmart Firmware. Boston: Addison-Wesley, 2012.

Figure 1. The antifragile asymmetry between potential downside and upside

The major upside of embracing higher volatility is that customers receive higher-quality products and services (i.e., value) faster and at a lower cost than is possible with traditional, risk- and volatility-averse approaches. The often overlooked and critical point is that this is possible not only because these companies have identified potential upside, but also because they have become experts at minimizing the downside:

At Etsy “this isn’t license to break stuff, quickly. Engineer-driven QA and solid unit testing are integral parts of the process.”4

Netflix “is known for being bold in its rapid pursuit of innovation and high availability, but not to the point of callousness. It is careful to avoid any noticeable impact to customers from these failure-induction exercises.”5

At HP “we could tell when [the architect] had found and plugged a hole because a particular test area would have a huge drop in passing tests or the build would break completely.”6

4. http://slidesha.re/1fsqzHB

5. http://bit.ly/1e9i40t

6. Gruver, Gary, Mike Young, and Pat Fulghum. A Practical Approach to Large-Scale Agile Development: How HP Transformed LaserJet FutureSmart Firmware. Boston: Addison-Wesley, 2012.

It is the robust testing, quality assurance, and (continuous) deployment practices that enable these companies to reduce the possibility of any individual change causing a catastrophic event. Both Etsy and Netflix have become experts at quickly going back to a previously working version of software should the newly deployed one be found defective. Amazon takes this up a notch, performing “rollbacks” automatically in the event that a deploy causes problems. This is one of the reasons that only about 0.001% of software deployments at Amazon cause outages, even with the mean time between deployments a staggering 11.6 seconds! At the same time, in all these companies, some of the potentially riskier changes (e.g., database schema or OS updates on core routers) are certainly not performed frequently or continuously.
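The idea of rolling back automatically when a deploy causes problems can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a description of how Amazon, Etsy, or Netflix actually do it: the function `deploy_with_rollback`, the `history` list, and the `is_healthy` check are all hypothetical names introduced here.

```python
from typing import Callable, List


def deploy_with_rollback(
    history: List[str],
    new_version: str,
    is_healthy: Callable[[str], bool],
) -> str:
    """Deploy `new_version`; revert to the previous version if unhealthy.

    `history` holds previously deployed versions, newest last.
    Returns the version left running.
    """
    previous = history[-1] if history else None
    history.append(new_version)
    if is_healthy(new_version):
        return new_version
    # The new version failed its check: roll back automatically.
    history.pop()
    if previous is None:
        raise RuntimeError("no previous version to roll back to")
    return previous


# Hypothetical health check that rejects one known-bad version.
def example_check(version: str) -> bool:
    return version != "v2"


deployed = ["v1"]
print(deploy_with_rollback(deployed, "v2", example_check))
print(deploy_with_rollback(deployed, "v3", example_check))
print(deployed)
```

The design choice worth noting is that the downside of each deploy is capped: a bad version costs one failed health check and a revert, rather than an extended outage. As a rough back-of-the-envelope reading of the figures above, and assuming both describe the same period, one deployment every 11.6 seconds is about 7,400 deployments per day (86,400 / 11.6), and 0.001% of that is roughly 0.07 outage-causing deployments per day, or about one every two weeks.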

We should be careful not to attribute the antifragility of the above companies entirely to effective testing or QA, or to the frequency and the skill with which they release to production. For comparison, consider Knight Capital, which was a large global financial services firm until trading losses forced it to be acquired in a fire sale. Knight appears to have had reasonable testing and QA processes and an agile-like, iterative software development and deployment approach similar to the one practiced by Etsy and Netflix:

1 Initial development phase with no market interaction
