ABOUT THIS BOOK / BLURB
This is a small book with a single purpose: to tell you all about Cloud Native - what it is, what it's for, who's using it and why.

Go to any software conference and you'll hear endless discussion of containers, orchestrators and microservices. Why are they so fashionable? Are there good reasons for using them? What are the trade-offs, and do you have to take a big-bang approach to adoption? We step back from the hype, summarize the key concepts, and interview some of the enterprises who've adopted Cloud Native in production.
Take copies of this book and pass them around, or just zoom in to increase the text size and ask your colleagues to read over your shoulder. Horizontal and vertical scaling are fully supported.

The only hard thing about this book is that you can't assume anyone else has read it, and the narrator is notoriously unreliable.
What did you think of this book? We'd love to hear from you with feedback, or if you need help with a Cloud Native project: email info@container-solutions.com.

This book is available in PDF form from the Container Solutions website at www.container-solutions.com.
First published in Great Britain in 2017 by Container Solutions Publishing, a division of Container Solutions Ltd.

Copyright © Anne Berger (née Currie) and Container Solutions Ltd 2017.

Chapter 7, "Distributed Systems Are Hard", first appeared in
ABOUT THE AUTHORS

Anne Currie
Anne Currie has been in the software industry for over 20 years, working on everything from large-scale servers and distributed systems in the '90s, to early ecommerce platforms in the '00s, to cutting-edge operational tech in the '10s. She has regularly written, spoken and consulted internationally. She firmly believes in the importance of the technology industry to society, and fears that we often forget how powerful we are. She is currently working with Container Solutions.
Container Solutions
As experts in Cloud Native strategy and technology, Container Solutions support their clients with migrations to the cloud. Their unique approach starts with understanding each customer's specific needs. Then, together with your team, they design and implement custom solutions that last. Container Solutions' diverse team of experts is equipped with a broad range of Cloud Native skills, with a focus on distributed system development.

Container Solutions have a global perspective, with offices in the Netherlands, the United Kingdom, Switzerland, Germany and Canada.
CONTENTS
ARE CASE STUDIES EVER USEFUL?
CASE STUDY / THE FINANCIAL TIMES
CASE STUDY / SKYSCANNER
CASE STUDY / ASOS
CASE STUDY / STARLING BANK
CASE STUDY / ITV
CASE STUDY / CONTAINER SOLUTIONS
DO THOSE CASE STUDIES TELL US ANYTHING?
APPENDIX / THE CONTAINER SOLUTIONS METHOD
ARE CASE STUDIES EVER USEFUL?
What I wanted from the interviews was to understand:
• What was their aim?
• What issues and roadblocks did they hit?
• Did they get what they wanted?
Early adopter case studies are usually only moderately useful. Successful businesses are unique, with their own goals and risk profiles, and early adopters of Cloud Native will usually have a different attitude to risk than folk starting out now. However, at least these folk are more realistic role models for the average enterprise than Google or Netflix.
These case studies did give me a general idea of what industry pioneers have done, how difficult it was, and whether the path has become any easier over time.
CASE STUDY / THE FINANCIAL TIMES
“Our goal of becoming a technologically agile company was a major success - the teams moved from deploys taking 120 days to only 15 minutes”
Sarah Wells
Based in London, The Financial Times has an average worldwide daily readership of 2.2 million. Its paid circulation, including both print and digital, is 856K; three quarters of its subscribers are digital. The FT was a pioneer of content paywalls and was the first mainstream UK newspaper to report earning more from digital subscriptions than print sales. They are also unusual in earning more from content than from advertising.
The FT have been gradually adopting microservices, continuous delivery, containers and orchestrators for three years. Like Skyscanner (who I'll talk about next), their original motivation was to be able to move faster and respond more quickly to changes in the marketplace.
As Sarah Wells, the high-profile tech lead of the content platform, points out, "our goal of becoming a technologically agile company was a major success - the teams moved from deploys taking 120 days to only 15 minutes". In the process, according to senior project manager Victoria Morgan-Smith, "the teams were completely liberated".

So how did they achieve all this? Broadly speaking, they made incremental but constant improvements.
The FT have moved an increasing share of their infrastructure into the cloud (IaaS). Six years in, they use off-the-shelf, cloud-based services like databases-as-a-service (including AWS Aurora) and queues-as-a-service wherever possible. Again, this is because operating this functionality in house is "not a differentiator" for the company.
Within the FT as a whole there was a strong inclination to move to a microservices-oriented architecture, but different parts of the company took different approaches. The FT have three big programmes of work where they implemented a new system as a set of microservices. One of those (subscription services) incrementally migrated their monolithic server to a microservice architecture by slowly carving off key components. However, the remaining two projects (the new content platform and the new website) essentially built duplicates of their respective monoliths right from the start using microservices. Interestingly, both of those approaches worked successfully for the FT, suggesting that there is no one correct way to do a monolith-to-microservice migration.
After nearly three years the content platform has moved from a monolith to around 150 microservices, each of which broadly "does one thing". However, they have not followed the popular "Conway's law" approach where one or more microservices represent the responsibilities of each team (many services to one team). Instead, multiple teams support each microservice (many to many). This helps maximize parallelism, but is mostly because
They found that, in Wells' words, "infrastructure-as-code was necessary for microservices", and they evolved a strong culture of automation and CD. According to Wells, "There is a fair amount of diversity within the FT, with some teams running a home-grown continuous delivery system based on Puppet, while others wrap and deploy their services in Docker containers on the container-friendly Linux operating system CoreOS, with yet others deploying to Heroku.
Basically, we have at least:

1. A home-grown, Puppet-based platform, currently hosted on AWS without containers,
2. A Heroku-hosted PaaS,
3. A Docker container-based environment using CoreOS, hosted on AWS."
All of these environments work well; they are each evolving, and were each chosen by the relevant tech team to meet their own needs at the time. Again, the FT's experience suggests there is more than one way to successfully implement an architectural vision that is microservice-oriented and runs in a cloud-based environment with continuous delivery.
Finally, the FT's content platform team found that containers were the gateway to orchestration. The content folk have been orchestrating their Docker-containerized processes in production for several years, with the original motivation being server density - more efficient resource utilization. By using large AWS instances to host multiple containerized processes, controlled with an orchestrator, they reduced their hosting costs by around 75%. As very early users of orchestration they created their own orchestrator from several open source tools, but are now evaluating the latest off-the-shelf products, in particular Kubernetes.
So what unexpected results came out of this Cloud Native evolution for the FT? They anticipated that the shift to faster deployments would increase risk. In fact, they have moved from a 20% deployment rollback rate to ~0.1%, i.e. a two-order-of-magnitude reduction in their error rate. They ascribe this to the ability to release small changes more often with microservices. They have invested heavily in monitoring and A/B testing, again building their own tools for the latter, and they have replaced traditional pre-deployment acceptance tests with automated monitoring of key functionality in production.
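As an illustration of that last idea (not the FT's actual tooling, which is largely home-grown), a production functionality check can be as small as a scheduled script that makes a business-level assertion against a live endpoint. The URL and the assertion below are hypothetical:

```python
import sys
import urllib.request

# Hypothetical synthetic check: assert a key piece of functionality
# (here, that a known article is being served) directly in production,
# instead of relying only on pre-deployment acceptance tests.
CHECK_URL = "https://api.example.com/content/known-article-id"  # hypothetical

def content_is_served() -> bool:
    try:
        with urllib.request.urlopen(CHECK_URL, timeout=5) as resp:
            body = resp.read().decode("utf-8")
            # A business-level assertion, not just "the server returned 200".
            return resp.status == 200 and "headline" in body
    except OSError:
        return False

if __name__ == "__main__":
    # A scheduler (cron, or the monitoring platform itself) runs this
    # every minute and alerts someone if it fails repeatedly.
    sys.exit(0 if content_is_served() else 1)
```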
How have they handled the complexity of distributed systems? They chose to make heavy use of asynchronous queues-as-a-service, which simplified their distributed architecture by limiting the knock-on effects of a single microservice outage (although this does increase system latency, a tradeoff they accepted). They also limit the use of chained synchronous calls, to avoid cascading failures where one failed service holds up a whole chain of services waiting on outstanding synchronous requests. They also struggled with issues around the order of microservice instantiation, and are contemplating rules that microservices should exit if pre-requisite services are not yet available, allowing the orchestrator to automatically re-start them (by which point their pre-requisite service should hopefully have appeared). Basically, it was difficult, but they learned and improved as they went.
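A minimal sketch of that exit-if-prerequisites-missing pattern, with a hypothetical "user-store" dependency standing in for whatever a real service would need; the orchestrator's restart policy supplies the retry loop:

```python
import socket
import sys

# Hypothetical prerequisite: this service cannot do useful work until
# the "user-store" service (an assumed dependency) is reachable.
PREREQ_HOST, PREREQ_PORT = "user-store", 8080

def prerequisite_available() -> bool:
    try:
        with socket.create_connection((PREREQ_HOST, PREREQ_PORT), timeout=2):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    if not prerequisite_available():
        # Exit non-zero rather than limping along; the orchestrator
        # re-starts the container, by which time the prerequisite
        # should hopefully have appeared.
        sys.exit(1)
    # ... start serving requests here ...
```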
According to project manager Victoria Morgan-Smith, "our goal throughout was to de-risk experimentation", but that involved "training, tools and trust".

The FT invested heavily in internal, on-the-job training, with an explicit remit for their devops teams to disseminate the new operational knowledge to developers and operations. They learned that their teams could be trusted to make good judgments if they were informed, given responsibility and had the right tools. For example, initially their IaaS bills were very high, but once developers were given training, access to billing tools and guidance on budgets, the bills reduced.
In common with many other early adopters, the FT experimented and built in-house, and were prepared to accept a level of uncertainty and risk. Sometimes their tech teams needed to re-assess as the world changed, as with their move from private to public cloud, but they were persistent and trusted to make the occasional readjustment in a rapidly changing environment. Trust was a key factor in their progress.
CASE STUDY / SKYSCANNER

Launched in 2003 and headquartered in Scotland, Skyscanner is a global travel search site with more than 50 million monthly users. Their self-built technology, which includes websites like www.skyscanner.com and a mobile app, supports over 30 languages and 150 currencies.

Skyscanner successfully use some highly advanced Cloud Native strategies in a mixed environment: they have a monolithic core system and a fast-growing range of supplementary microservices. Part of their estate is hosted on their own servers and part is in the cloud on AWS.
Skyscanner have now been using containers and orchestrators in production for around two years, and their new code is generally "Cloud Native", i.e. microservice-based, containerized and orchestrated. The decision to move towards Cloud Native was made jointly by their operations and development teams, and their motivation was speed. The company wanted to increase their deployment frequency and velocity. "We saw the ability to move and adapt as a strategic asset", said Stuart Davidson, who runs the enterprise's build and deployment teams. According to visionary ex-Amazon CTO Bryan Dove, Skyscanner's goal is to "react at the speed of the internet". His bold ambition is "10,000 releases every day", and they are moving rapidly towards achieving it.
According to Davidson, "Back in 2014 we were making around 100 deploys a month. We adopted continuous delivery and that helped us deploy more quickly, but it didn't solve all our problems - developers were still limited by which libraries and frameworks were supported in production. Moving to containerization plus CD was the game-changer. It increased our deployment rate by a factor of 500."

Skyscanner's goal was to achieve "idea to user inside an afternoon", and they have mostly achieved this. In around two years their delivery time has dropped from 6-8 weeks to a few hours. In common with many early Cloud Native adopters, Skyscanner achieved this as an entirely internal project using off-the-shelf or self-built tooling.
Supporting thousands of microservices within a single environment involves defining some simplifying constraints. Skyscanner have developed a microservice shell that provides useful standard defaults for some low-level operational behaviours - network interface use, for example. They also specify contracts and mandate contract consistency for any new microservices. The key is that any constraints make the engineers' lives easier but don't limit them: "batteries are included but removable".
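Skyscanner's shell is internal and its details aren't public, but as a rough sketch of the idea, such a shell might be a small base class that standardizes operational defaults (bind interface, port, a health endpoint) while letting each team override only what it needs - batteries included, but removable. Everything below is illustrative:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Illustrative only: a tiny "service shell" that bakes in consistent
# operational defaults for every new microservice.
class ServiceShell:
    host = "0.0.0.0"   # standard default: which network interface to bind
    port = 8080        # standard default: service port

    def handle(self, path: str) -> tuple:
        # Default contract: every service answers a health check.
        if path == "/health":
            return 200, "ok"
        return 404, "not found"

    def run(self) -> None:
        shell = self

        class Handler(BaseHTTPRequestHandler):
            def do_GET(self):
                status, body = shell.handle(self.path)
                self.send_response(status)
                self.end_headers()
                self.wfile.write(body.encode())

        HTTPServer((self.host, self.port), Handler).serve_forever()

# A team's service overrides only what it needs ("removable batteries").
class FlightSearchService(ServiceShell):
    def handle(self, path):
        if path == "/search":
            return 200, "results..."
        return super().handle(path)  # keep the standard defaults
```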
For Skyscanner, the initial motivation for adopting microservices and containerization was deployment speed. However, once they had successfully containerized, they rapidly started to use an orchestrator in production to reduce their operating costs. "For us", said Davidson, "containers were the enabler for Cloud Native".
Skyscanner are an excellent example of evolving strategy:

• Several years ago, the team identified their first goal as increased deployment speed, and their first step as continuous delivery. To achieve this they successfully developed a CD pipeline using the CI tool TeamCity.

• They then identified another bottleneck: environmental limitations. Developers wanted to use the latest library versions, but were limited to what was currently supported in the build system and production instances. The ops team set a goal to remove this limitation by allowing developers to bundle their chosen environment into their production deployments using containerization. As a step in this process they moved to a more container-friendly build tool (Drone).

• Once they had successfully containerized, the Skyscanner team moved again. They decided to improve their resilience and reduce costs by using a container orchestrator in part of their production environment. They initially chose the easiest orchestrator for them to try out at the time - the newly launched Amazon Elastic Container Service (ECS). They were happy with it, and it achieved the margin improvement they were looking for. As a result they have continued to extend orchestrator use in their production environment.
Having met all their goals so far, Skyscanner are now considering their next challenges, which include handling many different production environments and making microservices even smaller in order to move even faster.
Skyscanner's voyage has been a continuous, iterative process, and by no means easy. According to Davidson, "The bumps and scars make you more skeptical. Many of the tools we tried did not live up to their hype." They had to experiment constantly, and sometimes abandon one tool entirely and move to a new one as their needs changed or the tool proved inadequate at their growing scale. They correctly didn't view this as failure, but as a valuable learning process that is only possible in the reduced-risk environment of the cloud.
However, according to Davidson, within his own team this learning came at a cost too: "every migration we had to do, every time we had to make our engineers change how they were doing something, made us lose a little credibility that we knew what we were doing. As much as the engineering community in Skyscanner were awesome about this, I was always really aware of the level of change fatigue we introduced." To make this rate of change work, several of the company's techies took the initiative to upgrade their change management skills with a course at Edinburgh Business School.
Conceptually, the move to Cloud Native has been a positive one for Skyscanner. Their developers have embraced their new responsibilities to drive and test their own releases, ops no longer impose unnecessarily restrictive environmental constraints on the development team, and deployment speed has improved 500-fold. However, they don't believe their operational journey is at, or will ever reach, an end. Like their customers, their technology teams are keen to keep moving forward to new destinations.
CASE STUDY / ASOS

"The ability to provide fast response times is key to our business"

David Green
Founded in 2000, ASOS is a highly successful, global eCommerce fashion retailer. Across their various mobile and web platforms they had 800 million visits in the first half of 2017. They have 21 million social media followers, and their retail sales were just under £1B in H1 (the first half of) 2017 [8]. ASOS's mission is to become "the world's number-one online shopping destination for fashion-loving 20-somethings".
Since their inception, ASOS have been tech visionaries who built their own platform in-house to meet their specific needs, resisting the urge felt by many retailers to go for off-the-shelf eCommerce products. To advance their overall objectives, ASOS have identified a number of strategic goals, including: faster feature velocity (getting functionality from idea to user more quickly), improved scalability to handle peaks like Black Friday, and the ever-faster site response times that are famously key to online conversion. ASOS are in a slightly different technical position from our other case studies: unlike Skyscanner and the FT, ASOS's services run on Windows, not Linux.
Several years ago, ASOS determined that a key factor in achieving their goals would be to transition from on-premises, owned and self-managed servers to cloud hosting. They have been gradually moving all of their services from their own data centres to Azure, with the stated aim of 100% cloud within 2 years.

One of their aims for the cloud was to significantly reduce the operational load on their teams. They decided they would rather have their technical folk focused on areas of greater business advantage, like features. ASOS therefore chose to run on Azure's "Cloud Services" PaaS, i.e. they use fully managed VMs provided, monitored, patched and supported by Microsoft. ASOS just deploy applications to those VMs, using them as "units of isolation". They found this did indeed reduce their operational overheads, and so they went even further, transitioning to fully managed databases wherever possible (aka "database-as-a-service"). They now host their own stateful services only when Azure does not offer a fully managed alternative.
As well as moving to the cloud, ASOS embraced a Cloud Native-style approach with heavy use of microservices. A microservice-oriented architecture on flexible cloud infrastructure has been core to their improved feature velocity. However, they have sometimes chosen to prioritize other goals. Like the FT and Skyscanner, ASOS create their microservices to "do one thing". Unlike the other two, however, in some instances ASOS group, link and deploy multiple microservices as a single process, with each microservice a Windows library.

The majority of the ASOS estate consists of the more usual discrete services communicating over (typically) REST, but ASOS group and deploy their services together where performance is particularly critical. This improves the responsiveness of those service groups, which is good, but the increased coupling has a negative impact on ASOS's agility and their ability to change those particular services, which is bad. In most cases, therefore, ASOS choose to prioritize agility and feature velocity over performance, and deploy using the more common single-decoupled-microservice model.
Why is grouping services more performant anyway? As we discussed in an earlier chapter, microservices talking across multiple VM instances can potentially introduce significant intercommunication latency. Grouping microservices can make deployment trickier and increase coupling, which slows feature velocity (time from idea to deployment). However, it can significantly improve execution speed, which ASOS judge to be an important priority for some of their services. According to their Enterprise Architect David Green, "the ability to provide fast response times is key to our business".
{Aside - for some Linux orchestrators you can achieve a similar result using the "Affinity" feature, which tells the orchestrator to make sure that some services are always co-located on the same VM instance or "node".}
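For example, in Kubernetes (one orchestrator with such a feature) this kind of co-location can be declared as pod affinity. A minimal sketch using the Kubernetes Python client, with hypothetical service names:

```python
from kubernetes import client

# Hypothetical: schedule this pod onto the same node as any pod
# labelled app=checkout, so calls between them never cross VMs.
co_locate_with_checkout = client.V1Affinity(
    pod_affinity=client.V1PodAffinity(
        required_during_scheduling_ignored_during_execution=[
            client.V1PodAffinityTerm(
                label_selector=client.V1LabelSelector(
                    match_labels={"app": "checkout"}
                ),
                # "hostname" topology means the same node (VM instance).
                topology_key="kubernetes.io/hostname",
            )
        ]
    )
)

# The affinity object is then set on the pod spec, e.g.:
# pod_spec = client.V1PodSpec(affinity=co_locate_with_checkout,
#                             containers=[...])
```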
This is a good demonstration that microservice experts still make judgements and balance tradeoffs on how they will implement a Cloud Native approach - and that this is true even on a service-by-service basis.
{Aside - interestingly, in the ASOS architecture the majority (though obviously not all!) of their service communication is with the (remote) client, so cross-VM latency isn't actually as big an issue as we might think for most of their services.}
Their interest in execution speed is also reflected in ASOS's data architecture. They keep data as close to users as possible and make extensive use of NoSQL databases and caching. One of the attractions of a microservice architecture for ASOS is the ability to make more granular choices about how and where data is maintained, all of which helps with their critical response times.
Another, completely different, aspect of a microservice architecture that particularly appealed to ASOS was the ability to parallelize teams and reduce handovers and blockages. This also helped them improve their feature velocity.
ASOS are extremely happy with the progress they have made using cloud and microservices. Last year's huge Black Friday beat all previous records for scale and responsiveness across their applications. So, what next technically for ASOS? As Azure continues to add managed stateful services, ASOS will transition to use them. They would also like to improve their server density (effective resource utilization). To achieve this they are likely to investigate containers and orchestration, but that tooling is still less mature on Windows than on Linux.
Overall, a cloud (PaaS) and microservice-focussed strategy has worked very well for ASOS, and they intend to continue on their current path.
CASE STUDY / STARLING BANK

Starling Bank was founded in 2014. Based in London, it has been licensed and operating since July 2016. The bank is a successful part of the British Fintech scene, which is a spin-off from the UK's strong financial services sector.

Starling are a mobile-only, challenger bank who describe themselves as a "tech business with a banking licence". They provide a full current account, solely accessed from Android and iOS mobile devices. They received $70m of investment in early 2016.
Starling's tech comprises a cloud-hosted back-end system talking to apps on users' mobile phones and to third-party services. As well as a full current account, the bank provides Mastercard debit cards (customers spend money on their SB debit card, and the authorizations and debits arrive at Starling's servers through third-party systems). They also support direct debits, standing orders and faster payments, which are again provided by back-end integrations with other third-party systems.
Starting in 2016, Starling created their core infrastructure on Amazon Web Services (AWS) inside just 12 months. Their highly articulate CTO Greg Hawkins likes to say, "we built a bank in a year".
In common with everyone I've interviewed for this series of case studies, Starling use a microservices architecture of independent services interacting via clearly defined APIs. As of March 2018 they have ~20 Java microservices, and that number will increase. Every service can be developed on by multiple teams. They operate this way because they can; as Hawkins puts it, "we're taking advantage of the flexibility we get from our small size - we can reconfigure ourselves very quickly." As they continue to grow, Greg recognizes that they will lose some of that flexibility ("it won't last forever") and will then adopt smaller microservices and a more Conway-like model.
In terms of deployment and operations, whilst services can be deployed individually, for convenience Starling usually use a simultaneous deployment approach where all services in the back-end are deployed at once. This is a tradeoff that has evolved between minimizing the small amount of overhead around releases and keeping release frequency up. They built a rudimentary orchestrator themselves to drive rolling deploys based on version-number changes (scale up AWS, create new services on the new instances, expose those new services instead of the old ones, turn off the old ones and scale down their AWS instances).
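That parenthetical sequence reduces to a simple loop. A sketch of the steps follows; the helper functions are hypothetical stand-ins for AWS and load-balancer calls, not Starling's in-house tooling:

```python
# Hypothetical sketch of a version-number-driven rolling deploy.
# The helpers are stubs standing in for AWS / load-balancer calls.

def provision_instances(count: int) -> list:
    # Stand-in for scaling up AWS.
    return [f"new-instance-{i}" for i in range(count)]

def start_services(instance: str, version: str) -> None:
    print(f"starting services v{version} on {instance}")

def switch_traffic(old: list, new: list) -> None:
    # Stand-in for a load-balancer update.
    print(f"exposing {new} instead of {old}")

def retire(instance: str) -> None:
    # Stop services and scale the instance back down.
    print(f"stopping services and terminating {instance}")

def rolling_deploy(current: list, new_version: str) -> None:
    new = provision_instances(len(current))   # 1. scale up AWS
    for inst in new:
        start_services(inst, new_version)     # 2. create the new services
    switch_traffic(current, new)              # 3. expose new instead of old
    for inst in current:
        retire(inst)                          # 4. turn off old, scale down

rolling_deploy(["instance-a", "instance-b"], new_version="2018.03.1")
```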
Starling generally redeploy their whole estate 4-5 times per day to production. So, new functionality reaches prod rapidly, and it's business-as-usual to apply security patches fast when necessary.
As always, API management is a tough challenge for frequent deployments You could argue (naively) that simultaneous deployment makes this easier because you are always re-deploying both sides of your API at once, but this isn’t really true for several reasons:
• Starling don’t mandate simultaneous