Scheduling the Future at Cloud Scale
David K. Rensin
by David Rensin
Copyright © 2015 O’Reilly Media, Inc. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.
O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://safaribooksonline.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.
Editor: Brian Anderson
Production Editor: Matt Hacker
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Rebecca Demarest
June 2015: First Edition
Revision History for the First Edition
responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.
978-1-491-93188-2
[LSI]
Chapter 1. In The Beginning…
Cloud computing has come a long way.
Just a few years ago there was a raging religious debate about whether people and projects would migrate en masse to public cloud infrastructures. Thanks to the success of providers like AWS, Google, and Microsoft, that debate is largely over.
Introduction
In the “early days” (three years ago), managing a web-scale application meant doing a lot of tooling on your own. You had to manage your own VM images, instance fleets, load balancers, and more. It got complicated fast. Then, orchestration tools like Chef, Puppet, Ansible, and Salt caught up to the problem and things got a little bit easier.
A little later (approximately two years ago) people started to really feel the pain of managing their applications at the VM layer. Even under the best circumstances it takes a brand new virtual machine at least a couple of minutes to spin up, get recognized by a load balancer, and begin handling traffic. That’s a lot faster than ordering and installing new hardware, but not quite as fast as we expect our systems to respond.
Then came Docker.
This is a non-trivial problem.
In this book, I will introduce you to one of the solutions to this challenge—Kubernetes. It’s not the only way to skin this cat, but getting a good grasp on what it is and how it works will arm you with the information you need to make good choices later.
Who I Am
Full disclosure: I work for Google.
Specifically, I am the Director of Global Cloud Support and Services. As you might imagine, I very definitely have a bias towards the things my employer uses and/or invented, and it would be pretty silly for me to pretend otherwise.
That said, I used to work at their biggest competitor—AWS—and before that, I wrote a book for
O’Reilly on Cloud Computing, so I do have some perspective.
I’ll do my best to write in an evenhanded way, but it’s unlikely I’ll be able to completely stamp out my biases for the sake of perfectly objective prose. I promise to keep the preachy bits to a minimum and keep the text as non-denominational as I can muster.
If you’re so inclined, you can see my full bio here.
Finally, you should know that the words you read are completely my own. This paper does not reflect the views of Google, my family, friends, pets, or anyone I now know or might meet in the future. I speak for myself and nobody else. I own these words.
So that’s me. Let’s chat a little about you…
Who I Think You Are
For you to get the most out of this book, I need you to have accomplished the following basic things:
1. Spun up at least three instances in somebody’s public cloud infrastructure—it doesn’t matter whose. (Bonus points if you’ve deployed behind a load balancer.)
2. Have read and digested the basics about Docker and containers.
3. Have created at least one local container—just to play with.
If any of those things are not true, you should probably wait to read this paper until they are. If you don’t, then you risk confusion.
The Problem
Containers are really lightweight. That makes them super flexible and fast. However, they are designed to be short-lived and fragile. I know it seems odd to talk about system components that are designed to not be particularly resilient, but there’s a good reason for it.
Instead of making each small computing component of a system bullet-proof, you can actually make the whole system a lot more stable by assuming each compute unit is going to fail and designing your overall process to handle it.
All the scheduling and orchestration systems gaining mindshare now—Kubernetes or others—are designed first and foremost with this principle in mind. They will kill and re-deploy a container in a cluster if it even thinks about misbehaving!
This is probably the thing people have the hardest time with when they make the jump from VM-backed instances to containers. You just can’t have the same expectation for isolation or resiliency with a container as you do for a full-fledged virtual machine.
The comparison I like to make is between a commercial passenger airplane and the Apollo Lunar Module (LM).
An airplane is meant to fly multiple times a day and ferry hundreds of people long distances. It’s made to withstand big changes in altitude, the failure of at least one of its engines, and seriously violent winds. Discovery Channel documentaries notwithstanding, it takes a lot to make a properly maintained commercial passenger jet fail.
The LM, on the other hand, was basically made of tin foil and balsa wood. It was optimized for weight and not much else. Little things could (and did during design and construction) easily destroy the thing. That was OK, though. It was meant to operate in a near vacuum and under very specific conditions. It could afford to be lightweight and fragile because it only operated under very orchestrated conditions.
Any of this sound familiar?
VMs are a lot like commercial passenger jets. They contain full operating systems—including firewalls and other protective systems—and can be super resilient. Containers, on the other hand, are like the LM. They’re optimized for weight and therefore are a lot less forgiving.
In the real world, individual containers fail a lot more than individual virtual machines. To compensate for this, containers have to be run in managed clusters that are heavily scheduled and orchestrated. The environment has to detect a container failure and be prepared to replace it immediately. The environment has to make sure that containers are spread reasonably evenly across physical machines (so as to lessen the effect of a machine failure on the system) and manage overall network and memory resources for the cluster.
It’s a big job and well beyond the abilities of normal IT orchestration tools like Chef, Puppet, etc…
Chapter 2. Go Big or Go Home!
If having to manage virtual machines gets cumbersome at scale, it probably won’t come as a surprise to you that it was a problem Google hit pretty early on—nearly ten years ago, in fact. If you’ve ever had to manage more than a few dozen VMs, this will be familiar to you. Now imagine the problems when managing and coordinating millions of VMs.
At that scale, you start to re-think the problem entirely, and that’s exactly what happened. If your plan for scale was to have a staggeringly large fleet of identical things that could be interchanged at a moment’s notice, then did it really matter if any one of them failed? Just mark it as bad, clean it up, and replace it.
Using that lens, the challenge shifts from configuration management to orchestration, scheduling, and isolation. A failure of one computing unit cannot take down another (isolation), resources should be reasonably well balanced geographically to distribute load (orchestration), and you need to detect and replace failures near instantaneously (scheduling).
Introducing Kubernetes—Scaling through Scheduling
Pretty early on, engineers working at companies with similar scaling problems started playing around with smaller units of deployment, using cgroups and kernel namespaces to create process separation. The net result of these efforts over time became what we commonly refer to as containers.
Google necessarily had to create a lot of orchestration and scheduling software to handle isolation, load balancing, and placement. That system is called Borg, and it schedules and launches approximately 7,000 containers a second on any given day.
With the initial release of Docker in March of 2013, Google decided it was finally time to take the most useful (and externalizable) bits of the Borg cluster management system, package them up, and publish them via Open Source.
Kubernetes was born. (You can browse the source code here.)
Applications vs. Services
It is regularly said that in the new world of containers we should be thinking in terms of services (and sometimes micro-services) instead of applications. That sentiment is often confusing to a newcomer, so let me try to ground it a little for you. At first this discussion might seem a little off topic. It isn’t. I promise.
Danger—Religion Ahead!
To begin with, I need to acknowledge that the line between the two concepts can sometimes get blurry, and people occasionally get religious in the way they argue over it. I’m not trying to pick a fight over philosophy, but it’s important to give a newcomer some frame of reference. If you happen to be a more experienced developer and already have well-formed opinions that differ from mine, please know that I’m not trying to provoke you.
A service is a process that:
1. is designed to do a small number of things (often just one)
2. has no user interface and is invoked solely via some kind of API
An application, on the other hand, is pretty much the opposite of that. It has a user interface (even if it’s just a command line) and often performs lots of different tasks. It can also expose an API, but that’s just bonus points in my book.
It has become increasingly common for applications to call several services behind the scenes. The web UI you interact with at https://www.google.com actually calls several services behind the scenes.
Where it starts to go off the rails is when people refer to the web page you open in your browser as a web application. That’s not necessarily wrong so much as it’s just too confusing. Let me try to be more precise.
Your web browser is an application. It has a user interface and does lots of different things. When you tell it to open a web page, it connects to a web server. It then asks the web server to do some stuff via the HTTP protocol.
The web server has no user interface, only does a limited number of things, and can only be interacted with via an API (HTTP in this example). Therefore, in our discussion, the web server is really a service—not an application.
This may seem a little too pedantic for this conversation, but it’s actually kind of important. A Kubernetes cluster does not manage a fleet of applications. It manages a cluster of services. You might run an application (often your web browser) that communicates with these services, but the two concepts should not be confused.
A service running in a container managed by Kubernetes is designed to do a very small number of discrete things. As you design your overall system, you should keep that in mind. I’ve seen a lot of well-meaning websites fall over because they made their services do too much. That stems from not keeping this distinction in mind when they designed things.
If your services are small and of limited purpose, then they can more easily be scheduled and re-arranged as your load demands. Otherwise, the dependencies become too much to manage and either your scale or your stability suffers.
The Master and Its Minions
At the end of the day, all cloud infrastructures resolve down to physical machines—lots and lots of machines that sit in lots and lots of data centers scattered all around the world. For the sake of explanation, here’s a simplified (but still useful) view of the basic Kubernetes layout.
Bunches of machines sit networked together in lots of data centers. Each of those machines is hosting one or more Docker containers. Those worker machines are called nodes.
NOTE
Nodes used to be called minions, and you will sometimes still see them referred to in this way. I happen to think they should have kept that name because I like whimsical things, but I digress…
Other machines run special coordinating software that schedules containers on the nodes. These machines are called masters. Collections of masters and nodes are known as clusters.
Figure 2-1 The Basic Kubernetes Layout
That’s the simple view. Now let me get a little more specific.
Masters and nodes are defined by which software components they run.
The Master runs three main items:
1. API Server—nearly all the components on the master and nodes accomplish their respective tasks by making API calls. These are handled by the API Server running on the master.
2. Etcd—Etcd is a service whose job is to keep and replicate the current configuration and run state of the cluster. It is implemented as a lightweight distributed key-value store and was developed inside the CoreOS project.
3. Scheduler and Controller Manager—These processes schedule containers (actually, pods—but more on them later) onto target nodes. They also make sure that the correct numbers of these things are running at all times.
A node usually runs three important processes:
1. Kubelet—A special background process (daemon) that runs on each node, whose job is to respond to commands from the master to create, destroy, and monitor the containers on that host.
2. Proxy—This is a simple network proxy that’s used to separate the IP address of a target container from the name of the service it provides. (I’ll cover this in depth a little later.)
3. cAdvisor (optional)—Container Advisor (cAdvisor) (http://bit.ly/1izYGLi) is a special daemon that collects, aggregates, processes, and exports information about running containers. This information includes information about resource isolation, historical usage, and key network statistics.
These various parts can be distributed across different machines for scale or all run on the same host for simplicity. The key difference between a master and a node comes down to who’s running which set of processes.
Figure 2-2 The Expanded Kubernetes Layout
If you’ve read ahead in the Kubernetes documentation, you might be tempted to point out that I glossed over some bits—particularly on the master. You’re right, I did. That was on purpose. Right now, the important thing is to get you up to speed on the basics. I’ll fill in some of the finer details a little later.
At this point in your reading I am assuming you have some basic familiarity with containers and have created at least one simple one with Docker. If that’s not the case, you should stop here and head over to the main Docker site and run through the basic tutorial.
NOTE
I have taken great care to keep this text “code free.” As a developer, I love program code, but the purpose of this book is to introduce the concepts and structure of Kubernetes. It’s not meant to be a how-to guide to setting up a cluster.
For a good introduction to the kinds of configuration files used for this, you should look here.
That said, I will very occasionally sprinkle in a few lines of sample configuration to illustrate a point. These will be written in YAML because that’s the format Kubernetes expects for its configurations.
Pods
A pod is a collection of containers and volumes that are bundled and scheduled together because they share a common resource—usually a filesystem or IP address.
Figure 2-3 How Pods Fit in the Picture
Kubernetes introduces some simplifications with pods vs. normal Docker. In the standard Docker configuration, each container gets its own IP address. Kubernetes simplifies this scheme by assigning a shared IP address to the pod. The containers in the pod all share the same address and communicate with one another via localhost. In this way, you can think of a pod a little like a VM because it basically emulates a logical host to the containers in it.
This is a very important optimization. Kubernetes schedules and orchestrates things at the pod level, not the container level. That means if you have several containers running in the same pod, they have to be managed together. This concept—known as shared fate—is a key underpinning of any clustering system.
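To make this concrete, here is a minimal sketch of a two-container pod written against the v1 API schema. The pod name and images are my own illustrations, not anything from this book; the point is simply that both containers share the pod’s IP address and can reach each other over localhost:

apiVersion: v1
kind: Pod
metadata:
  name: web-pod              # illustrative name
spec:
  containers:
    - name: web
      image: nginx           # listens on port 80 inside the pod
      ports:
        - containerPort: 80
    - name: sidecar
      image: my-sidecar      # hypothetical image; reaches "web" at localhost:80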
At this point you might be thinking that things would be easier if you just ran processes that need to talk to each other in the same container.
You can do it, but I really wouldn’t. It’s a bad idea.
If you do, you undercut a lot of what Kubernetes has to offer. Specifically:
1. Management Transparency—If you are running more than one process in a container, then you are responsible for monitoring and managing the resources each uses. It is entirely possible that one misbehaved process can starve the others within the container, and it will be up to you to detect and fix that. On the other hand, if you separate your logical units of work into separate containers, Kubernetes can manage that for you, which will make things easier to debug and fix.
2. Deployment and Maintenance—Individual containers can be rebuilt and redeployed by you whenever you make a software change. That decoupling of deployment dependencies will make your development and testing faster. It also makes it super easy to roll back in case there’s a problem.
3. Focus—If Kubernetes is handling your process and resource management, then your containers can be lighter. You can focus on your code instead of your overhead.
Another key concept in any clustering system—including Kubernetes—is lack of durability. Pods are not durable things, and you shouldn’t count on them to be. From time to time (as the overall health of the cluster demands), the master scheduler may choose to evict a pod from its host. That’s a polite way of saying that it will delete the pod and bring up a new copy on another node.
You are responsible for preserving the state of your application.
That’s not as hard as it may seem. It just takes a small adjustment to your planning. Instead of storing your state in memory in some non-durable way, you should think about using a shared data store like Redis, Memcached, Cassandra, etc.
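As a sketch of what that looks like in practice, the pod below runs a stateless front end that is told where to find an external Redis instance through an environment variable. Every name here (frontend-pod, my-frontend, redis.example.internal) is hypothetical:

apiVersion: v1
kind: Pod
metadata:
  name: frontend-pod
spec:
  containers:
    - name: frontend
      image: my-frontend                 # hypothetical image
      env:
        - name: REDIS_HOST               # session state lives in Redis,
          value: redis.example.internal  # not inside this container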
That’s the architecture cloud vendors have been preaching for years to people trying to build super-scalable systems—even with more long-lived things like VMs—so this ought not come as a huge surprise.
There is some discussion in the Kubernetes community about trying to add migration to the system. In that case, the current running state (including memory) would be saved and moved from one node to another when an eviction occurs. Google introduced something similar recently called live migration to its managed VM offering (Google Compute Engine), but at the time of this writing, no such mechanism exists in Kubernetes.
Sharing and preserving state between the containers in your pod, however, has an even easier solution: volumes.
Volumes
Those of you who have played with more than the basics of Docker will already be familiar with Docker volumes. In Docker, a volume is a virtual filesystem that your container can see and use.
An easy example of when to use a volume is if you are running a web server that has to have ready access to some static content. The easy way to do that is to create a volume for the container and pre-populate it with the needed content. That way, every time a new container is started, it has access to a local copy of the content.
So far, that seems pretty straightforward.
Kubernetes also has volumes, but they behave differently. A Kubernetes volume is defined at the pod level—not the container level. This solves a couple of key problems.
1. Durability—Containers die and are reborn all the time. If a volume is tied to a container, it will also go away when the container dies. If you’ve been using that space to write temporary files, you’re out of luck. If the volume is bound to the pod, on the other hand, then the data will survive the death and rebirth of any container in that pod. That solves one headache.
2. Communication—Since volumes exist at the pod level, any container in the pod can see and use them. That makes moving temporary data between containers super easy.
Figure 2-4 Containers Sharing Storage
Because they share the same generic name—volume—it’s important to always be clear when discussing storage. Instead of saying “I have a volume that has…,” be sure to say something like “I have a container volume,” or “I have a pod volume.” That will make talking to other people (and getting help) a little easier.
Kubernetes currently supports a handful of different pod volume types—with many more in various stages of development in the community. Here are the three most popular types.
EmptyDir
The most commonly used type is EmptyDir.
This type of volume is bound to the pod and is initially always empty when it’s first created. (Hence the name!) Since the volume is bound to the pod, it only exists for the life of the pod. When the pod is evicted, the contents of the volume are lost.
For the life of the pod, every container in the pod can read and write to this volume—which makes sharing temporary data really easy. As you can imagine, however, it’s important to be diligent and store data that needs to live more permanently some other way.
In general, this type of storage is known as ephemeral. Storage whose contents survive the life of its host is known as persistent.
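Here is a minimal sketch of an EmptyDir pod volume shared by two containers; the pod, image, and path names are all made up for illustration:

apiVersion: v1
kind: Pod
metadata:
  name: scratch-pod
spec:
  volumes:
    - name: scratch            # pod-level volume; empty when the pod starts
      emptyDir: {}
  containers:
    - name: writer
      image: my-writer         # hypothetical image that writes temp files
      volumeMounts:
        - name: scratch
          mountPath: /data     # both containers see the same files here
    - name: reader
      image: my-reader         # hypothetical image that reads them
      volumeMounts:
        - name: scratch
          mountPath: /data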
Network File System (NFS)
Recently, Kubernetes added the ability to mount an NFS volume at the pod level. That was a particularly welcome enhancement because it meant that containers could store and retrieve important file-based data—like logs—easily and persistently, since NFS volumes exist beyond the life of the pod.
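The volumes section of a pod definition that mounts such a share might look like the following sketch; the server name and export path are placeholders you would replace with your own:

volumes:
  - name: logs
    nfs:
      server: nfs.example.com   # hypothetical NFS server
      path: /exports/logs       # directory exported by that server
      readOnly: false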
GCEPersistentDisk (PD)
Google Cloud Platform (GCP) has a managed Kubernetes offering named GKE. If you are using Kubernetes via GKE, then you have the option of creating a durable network-attached storage volume called a persistent disk (PD) that can also be mounted as a volume on a pod. You can think of a PD as a managed NFS service. GCP will take care of all the lifecycle and process bits, and you just worry about managing your data. They are long-lived and will survive as long as you want them to.
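A pod volume backed by a PD looks much like the NFS case. The sketch below assumes a disk named my-data-disk that you created ahead of time in GCP:

volumes:
  - name: data
    gcePersistentDisk:
      pdName: my-data-disk      # hypothetical disk; must already exist
      fsType: ext4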
From Bricks to House
Those are the basic building blocks of your cluster. Now it’s time to talk about how these things assemble to create scale, flexibility, and stability.
Chapter 3. Organize, Grow, and Go
Once you start creating pods, you’ll quickly discover how important it is to organize them. As your clusters grow in size and scope, you’ll need to use this organization to manage things effectively. More than that, however, you will need a way to find pods that have been created for a specific purpose and route requests and data to them. In an environment where things are being created and destroyed with some frequency, that’s harder than you think!
Better Living through Labels, Annotations, and Selectors
Kubernetes provides two basic ways to document your infrastructure—labels and annotations.
Labels
A label is a key/value pair that you assign to a Kubernetes object (a pod in this case). You can use pretty well any name you like for your label, as long as you follow some basic naming rules. In this case, the label will decorate a pod and will be part of the pod.yaml file you might create to define your pods and containers.
Let’s use an easy example to demonstrate. Suppose you wanted to identify a pod as being part of the front-end tier of your application. You might create a label named tier and assign it a value of frontend—like so:
"labels": {
  "tier": "frontend"
}
The text "tier" is the key, and the text "frontend" is the value.
Keys are a combination of zero or more prefixes followed by a “/” character followed by a name string. The prefix and slash are optional. Two examples:
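tier                     (a bare name, no prefix)
game.example.com/tier    (the same name with an illustrative DNS prefix)

(Both of these keys are my own illustrations, not ones Kubernetes defines.)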
The prefix part of the key can be one or more DNS Labels separated by “.” characters. The total length of the prefix (including dots) cannot exceed 253 characters.
Values have the same rules but cannot be any longer than 63 characters.
Neither keys nor values may contain spaces.
Um…That Seems a Little “In the Weeds”
I’m embarrassed to tell you how many times I’ve tried to figure out why a certain request didn’t get properly routed to the right pod, only to discover that my label was too long or had an invalid character. Accordingly, I would be remiss if I didn’t at least try to keep you from suffering the same pain!
Label Selectors
Labels are queryable—which makes them especially useful in organizing things. The mechanism for this query is a label selector.
Heads Up!
You will live and die by your label selectors. Pay close attention here!
A label selector is a string that identifies which labels you are trying to match.
There are two kinds of label selectors—equality-based and set-based.
An equality-based test is just an “IS/IS NOT” test. For example:
tier = frontend
will return all pods that have a label with the key “tier” and the value “frontend”. On the other hand, if we wanted to get all the pods that were not in the frontend tier, we would say:
tier != frontend
You can also combine requirements with commas like so:
tier != frontend, game = super-shooter-2
This would return all pods that were part of the game named “super-shooter-2” but were not in its front-end tier.
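Selectors are not just for ad hoc queries; they are also how other Kubernetes objects find pods. As one hedged sketch, the service below (its name and port are invented for this example) targets every pod carrying both of the labels in its selector. Note that a service selector uses the equality style:

apiVersion: v1
kind: Service
metadata:
  name: shooter-backend        # hypothetical service name
spec:
  selector:                    # matches pods labeled with BOTH pairs
    game: super-shooter-2
    tier: backend
  ports:
    - port: 80                 # port the service exposes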