2 Compute & Storage
    1 Computer architecture for Hadoopers
        1 Commodity servers
        2 Non-Uniform Memory Access
        3 Server CPUs & RAM
        4 The Linux Storage Stack
    2 Server Form Factors
        1 Other Form Factors
    4 Cluster Configurations and Node Types
        1 Master Nodes
        2 Worker Nodes
        3 Utility Nodes
        4 Edge Nodes
        5 Small Cluster Configurations
        6 Medium Cluster Configurations
        7 Large Cluster Configurations
3 High Availability
    1 Planning for Failure
    2 What do we mean by High Availability?
        1 Lateral or Service HA
        2 Vertical or Systemic HA
        3 Automatic or Manual Failover
    3 How available does it need to be?
        Service Level Objectives
Hadoop in the Enterprise: Architecture
A Guide to Successful Integration
Jan Kunigk, Lars George, Paul Wilkinson, Ian Buss
Hadoop in the Enterprise: Architecture
by Jan Kunigk, Lars George, Paul Wilkinson, and Ian Buss
Copyright © 2017 Jan Kunigk, Lars George, Ian Buss, and Paul Wilkinson. All rights reserved.
Printed in the United States of America
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.
O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com/safari). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.
Editor: Nicole Tache
Production Editor: FILL IN PRODUCTION EDITOR
Copyeditor: FILL IN COPYEDITOR
Proofreader: FILL IN PROOFREADER
Indexer: FILL IN INDEXER
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Rebecca Demarest
September 2017: First Edition
Revision History for the First Edition
2017-03-22: First Early Release
See http://oreilly.com/catalog/errata.csp?isbn=9781491969274 for release details.
The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Hadoop in the Enterprise: Architecture, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.
While the publisher and the author(s) have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author(s) disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.
978-1-491-96927-4
[FILL IN]
Chapter 1. Clusters
Big Data and Apache Hadoop are by no means trivial in practice, as there are many moving parts and each requires its own set of considerations. In fact, each component in Hadoop, for example HDFS, supplies distributed processes that have their own peculiarities and a long list of configuration parameters, all of which may have an impact on your cluster and use case. Or maybe not. You need to whittle everything down in painstaking trial-and-error experiments, or consult what documentation you can find. In addition, new releases of Hadoop, but also your own data pipelines built on top of it, require careful retesting and verification that everything holds true and works as expected. We will discuss practical solutions to these and many other issues throughout this book, invoking what the authors have learned (and are still learning) about implementing Hadoop clusters and Big Data solutions at enterprises, both large and small.
One thing though is obvious: Hadoop is a global player, and the leading software stack when it comes to Big Data storage and processing. No matter where you are in the world, you may well struggle with the same basic questions around Hadoop, its setup, and its subsequent operation. By the time you have finished reading this book, you should be much more confident in conceiving a Hadoop-based solution that can be applied to various and exciting new use cases.
In this chapter, we kick things off with a discussion about cluster environments, a topic often overlooked because it is assumed that the successful proof-of-concept cluster delivering the promised answers is also the production environment running the new solution at scale, automated, reliable, and maintainable, which is often far from the truth.
Building Solutions
Developing for Hadoop is quite unlike common software development, as you are mostly concerned with building not a single, monolithic application but rather a concerted pipeline of distinctive pieces, which in the end are to deliver the final result. Often this is insight into the data that was collected, on which further products are built, such as recommendation or other real-time decision-making engines. Hadoop itself is lacking graphical data representation tools, though there are some ways to visualize information during discovery and data analysis, for example using Apache Zeppelin or similar tools with charting support built in.
In other words, the main task in building Hadoop-based solutions is to apply Big Data Engineering principles, which comprise the selection (and, optionally, creation) of suitable:
hardware and software components,
data sources and preparation steps,
processing algorithms,
access and provisioning of resulting data, and
automation of processes for production
As outlined in Figure 1-1, the Big Data engineer is building a data pipeline, which might include more traditional software development, for example writing an Apache Spark job that uses the supplied MLlib library to apply a linear regression algorithm to the incoming data. But there is much more that needs to be done to establish the whole chain of events that leads to the final result, or the wanted insight.
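For illustration, such a Spark job might be submitted to the cluster along the following lines; the class name, JAR, and HDFS paths are hypothetical placeholders.

# Hypothetical submission of an MLlib-based regression job to YARN.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class com.example.RegressionJob \
  /opt/jobs/regression-job.jar \
  hdfs:///data/incoming hdfs:///data/model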
Figure 1-1. Big Data Engineering
A data pipeline comprises, in very generic terms:
the task of ingesting the incoming data and staging it for processing,
processing the data itself in an automated fashion, triggered by time or data events, and
delivering the final results (as in, new or enriched datasets) to the consuming systems.
These tasks are embedded into an environment, one that defines the boundaries and constraints in which to develop the pipeline (see Figure 1-2). In practice the structure of this environment is often driven by the choice of Hadoop distribution, placing an emphasis on the included Apache projects that form the platform. In recent times, distribution vendors are more often going their own way and selecting components that are similar to others, but are not interchangeable (for example, choosing Apache Ranger vs. Apache Sentry for authorization within the cluster). This does result in vendor dependency, no matter if all the tools are open source or not.
Figure 1-2. Solutions are part of an environment
The result is that an environment is usually a cluster with a specific Hadoop distribution (see [Link to Come]), running one or more data pipelines on top of it, which represent the solution architecture. Each solution is embedded into further rules and guidelines, for example the broader topic of governance, which includes backup (see [Link to Come]), metadata and data management, lineage, security, auditing, and other related tasks. During development though, or during rapid prototyping, say for a proof-of-concept project, it is common that only parts of the pipeline are built. For example, it may suffice to stage the source data in HDFS, but not devise a fully automated ingest setup. Or the final provisioning of the results is covered by integration testing assertions, but not connected to the actual consuming systems.
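For example, during a proof of concept the staging step might be nothing more than a manual copy into HDFS; the paths below are placeholder assumptions.

# Minimal manual staging, often sufficient for early prototyping.
hdfs dfs -mkdir -p /staging/source
hdfs dfs -put /data/export/*.csv /staging/source/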
No matter what the focus of the development is, in the end a fully planned data pipeline is a must to be able to deploy the solution in the production environment. It is common for all of the other environments before that to reflect the same approach, making the deployment process more predictable.
Figure 1-3 summarizes the full Big Data Engineering flow, where a mixture of engineers work on each major stage of the solution, including the automated ingest and processing, as well as the final delivery of the results. The solution is then bundled into a package that also contains metadata, determining how governance should be applied to the included data and processes.
Figure 1-3. Developing data pipelines
Ideally, the deployment and handling is backed by common development techniques, such as continuous integration, automating the testing of changes after they are committed by developers, and of new releases after they have been sanctioned by the Big Data engineers. The remaining question is: do you need more than one environment, or, in other words, cluster?
Single vs. Many Clusters
When adding Hadoop to an existing IT landscape, a very common question is: how many clusters are needed? Especially in the established and common software development process1 we see sandboxed environments that allow separate teams to do their work without interrupting each other. We are now confronted with two competing issues:
rollout of new and updated applications and data pipelines, and
rollout of new platform software releases.
The former is about making sure that new business logic performs as expected while it is developed, tested, and eventually deployed. Then there is the latter, which is needed when the platform itself changes, for example with a new Hadoop release. Updating the application code is obviously the easier of the two, as it relies on (or rather, often implies) all environments running the same platform version. Rolling out a new platform version requires careful planning and might interrupt or delay the application development, since it may require the code to be compiled against a newer version of the dependent libraries. So what do we see in practice? Pretty much everything!
Indeed, we have seen users with a single cluster for everything, all the way to a separate cluster for three or four of the development process stages, including development, testing, quality assurance (QA), user acceptance (UA), staging/pre-production, and production. The driving factors are mostly cost versus convenience and features: it requires many more servers to build all of the mentioned environments, and that might be a prohibitive factor. The following list shows typical combinations:
Single Cluster for Everything
Everything on one cluster, no room for errors, and possibly downtime when platform upgrades are needed.
This is in practice usually not an option. What is obvious though is that there is often an assumption that having a proof-of-concept (PoC) cluster that worked well is the same as having a production cluster, not to mention all the other possible environments. A single PoC cluster built from scrap servers, or on an insufficient number of nodes (two or three servers do not make a Hadoop cluster; they start at five or more machines), is not going to suffice. Proper planning and implementation have to go into setting up Hadoop clusters, where networking is usually the greatest cost factor and often overlooked.
Two Clusters (Dev/Test/QA/Pre-Production, and Production)
Separates everything else from production, and allows testing of new releases and platform versions before rollout. Difficult to roll back, if at all possible.
Having two properly planned clusters is the minimal setup to run successful Big Data solutions, with reduced business impact compared to having only a single cluster. But you are overloading a single environment and will have significant congestion between, for example, the development and testing teams.
Three Clusters (Dev, Test/QA/PreProd, and Prod)
Basic setup with most major roles separated, allowing flexibility between development, staging, and production.
With three clusters the interdependencies are greatly reduced, but not fully resolved, as there are still situations where the shared resources have to be scheduled exclusively for one or another team.
Four Clusters (Dev, Test, PreProd/QA, and Prod)
Provides the greatest flexibility, as every team is given its own resources.
If the goal is to allow the engineering groups to do their work in a timely manner, you will have to have four cluster environments set up and available at all times. Everything else, while possible, includes minor to major compromises and/or restrictions. Figure 1-4 shows all of the various environments that might be needed for Big Data engineering.
Figure 1-4. Environments needed for Big Data Engineering
A common option to reduce cost is to specify the clusters according to their task, for example, as shown here:
Development
Could be a local development machine, or a small instance with four to five virtual machines (see “Cloud Services”). Only functional tests are performed.
One further consideration for determining how many clusters your project needs is how and what data is provided to each environment. This is mainly a question of getting production data to the other clusters so that they can perform their duty. This most likely entails data governance and security decisions, as PII (personally identifiable information) might need to be secured and/or redacted (for example, crossing out digits in Social Security numbers). In regards to controlling costs, it is also quite often the case that non-production clusters only receive a fraction of the complete data. This reduces storage and, with it, processing needs, but also means that only the production environment is exposed to the full workload, making earlier load tests in the smaller environments more questionable or at least difficult.
Note
It is known from Facebook, which uses Hadoop to a great extent, that live traffic can be routed to a test cluster and even amplified to simulate any anticipated seasonal or growth-related increase. This implies that the test environment is at least as powerful as the existing production environment. Of course, this could also be used to perform a validation (see [Link to Come]) of a new, and possibly improved, production platform.
The latest trend is to fold together some of those environments and make use of the multitenancy features of Hadoop. For example, you could use the “two cluster” setup above, but shift the pre-production role onto the production cluster. This helps to utilize the cluster better if there is enough spare capacity in terms of all major resources, that is, disk space, I/O, memory, and CPU. On the other hand, you are now forced to handle pre-production very carefully so as not to impact the production workloads.
Finally, a common question is how to extrapolate cluster performance based on smaller non-production environments. While it is true that Hadoop mostly scales linearly for its common workloads, there is also some initial cost to get true parallelization going. This manifests itself in that very small “clusters” (we have seen three-node clusters installed with the entire Hadoop stack) are often much more fickle than expected. You may see issues that do not show up at all when you have, say, 10 nodes or more. As for extrapolation of performance, testing a smaller cluster with a subset of the data will give you some valuable insight, and you should be able to determine from there what to expect of the production cluster. But since Hadoop is a complex, distributed system, with many moving parts, scarce resources such as CPU, memory, network, disk space, and general I/O, as well as possibly being shared across many tenants, you have to once again be very careful evaluating your predictions. Only if you had equally sized test/QA/pre-production and production clusters, mimicking the same workloads closely, would you have more certainty.
Overall these possibilities have to be carefully evaluated, as “going back and forth” is often not possible after the cluster reaches a certain size, or is tied into a production pipeline that should not be disturbed. Plan early and with plenty of due diligence. Plan also for the future, as in: ask yourself how you will grow the solution as the company starts to adopt Big Data use cases.
Having mentioned sharing a single cluster, in an attempt to reduce the number of environments needed, by means of Hadoop's built-in multitenancy features, we also have to discuss the caveats. The fact is that Hadoop is a fairly young software stack, having just turned 10 years old in 2016. It is also a fact that the majority of users run Hadoop with very few use cases, and if they have more, those use cases are of a very similar (if not the same) nature. For example, it is no problem today to run a Hadoop cluster red-hot with MapReduce and Spark jobs, using YARN as the only cluster resource manager. This is a very common setup and is used in many large enterprises throughout the world. In addition, one can enable control groups (cgroups) to further isolate CPU and I/O resources from one YARN application to another. So what is the problem?
With the growth and adoption of Hadoop in the enterprise, the list of requested features led to a state where Hadoop is stretching itself to cover other workloads as well, for example MPP-style query engines or search engines. These compete for the resources controlled by YARN, and it may happen that they collide in the process. Shoehorning long-running processes, commonly known as services, into a mostly batch-oriented architecture is difficult, to say the least. Looking at efforts such as Llama2 or the more recent LLAP3 shows how non-managed resources are carved out of the larger resource pool to be ready for low-latency, ad hoc requests, which is something quite different from scheduled job requirements.
Add to that the fact that HDFS has no accounting features built in, which makes colocated service handling nearly impossible. For example, HBase uses the same HDFS resources as Hive, MapReduce, or Spark. Building a charge-back model on top of missing accounting is futile, leaving you with no choice but to eventually separate low-latency use cases from batch or other, higher-latency interactive ones. The multitenancy features in Hadoop are mainly focused on authorization of requests, not on dynamic resource tracking. When you run a MapReduce job as user foobar that reads from HBase, which in turn reads from HDFS, it is impossible to limit the I/O for that specific user, as HDFS only sees hbase causing the traffic.
Some distributions allow the static separation of resources into cgroups at the process level. For example, you could allocate 40% of I/O and CPU to YARN, and the rest to HBase. If you only read from HBase using YARN applications, this separation is useless for the above reasons. If you then further mix this with Hadoop applications that natively read HDFS but may or may not use impersonation (a feature that makes the actual user for whom the job is executed visible to the lower-level systems, such as HDFS), the outcome of trying to mix workloads is rather unpredictable. While Hadoop is improving over time, this particular deficiency has not seen much support from the larger Hadoop contributor groups.
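To make the static separation idea concrete, a minimal sketch at the operating-system level might look as follows; the cgroup names, weights, and paths are assumptions for illustration, and in practice distributions configure this through their management tools (often called static service pools).

# Illustrative 40/60 split of CPU and block I/O weight between YARN and HBase (cgroups v1).
mkdir -p /sys/fs/cgroup/cpu/yarn /sys/fs/cgroup/cpu/hbase
echo 400 > /sys/fs/cgroup/cpu/yarn/cpu.shares       # relative CPU weight (~40%)
echo 600 > /sys/fs/cgroup/cpu/hbase/cpu.shares      # relative CPU weight (~60%)
mkdir -p /sys/fs/cgroup/blkio/yarn /sys/fs/cgroup/blkio/hbase
echo 400 > /sys/fs/cgroup/blkio/yarn/blkio.weight   # relative block I/O weight
echo 600 > /sys/fs/cgroup/blkio/hbase/blkio.weight
# Processes are then assigned by writing their PIDs into the respective cgroup.procs files,
# for example the NodeManager PID into /sys/fs/cgroup/cpu/yarn/cgroup.procs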
You are left with the need to possibly partition your cluster to physically separate specific workloads (see Figure 1-5). This can be done if enough resources in terms of server hardware are available. If not, you will have to spend extra budget to provision such a setup. You now also have another environment to take care of and to make part of the larger maintenance process. In other words, you may be forced to replicate the same split setup in the earlier environments, such as in pre-production, testing, or even development.
Figure 1-5. Workloads may force you to set up separate production clusters
Backup & Disaster Recovery
Once you get started with Hadoop, there comes the point where you ask yourself: if I want to keep my data safe, what works when dealing with multiple petabytes of data? The answer is as varied as the question of how many environments you need for engineering Big Data solutions, and yet again we have seen all kinds, from “no backup at all” to “cluster to cluster” replication. For starters, volume is an issue at some point, but so is one of the other “V”s of Big Data: velocity. If you batch-load large chunks of data, you can handle backup differently from when you receive updates in micro-batches, for example using Flume or Kafka landing events separately. Do you have all the data already and then decide to back it up? Or are you about to get started with loading data and can devise a backup strategy upfront?
The most common combinations we see are these:
Keep in mind that the backup strategy might be orthogonal to the development environments discussed above; i.e., you may have a dev, test/QA/preprod, and prod cluster, and another one just for the backup. Or you could save money (at the cost of features and flexibility) and reuse, for example, the pre-production cluster as the standby cluster for backups.
How is data actually copied or backed up? When you have a backup cluster with the same platform software, you may be able to use the provided tools, such as distcp combined with Apache Oozie for automation, or use the proprietary tools that some vendors ship in addition to the platform itself, for example Cloudera’s BDR, which allows you to schedule regular backups between clusters. A crucial part of the chosen strategy is to do incremental backups once the core data is synchronized.
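As an illustration, an incremental cluster-to-cluster copy with distcp might look like the following sketch; the NameNode addresses and paths are placeholders, and in practice such a command would be wrapped into an Oozie workflow or another scheduler.

# Hypothetical incremental backup run: -update copies only new or changed files,
# -delete removes files on the target that no longer exist on the source,
# and -m limits the number of parallel copy tasks.
hadoop distcp -update -delete -m 50 \
  hdfs://prod-nn:8020/data/warehouse \
  hdfs://backup-nn:8020/data/warehouse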
If you stream data into the cluster, you could also consider teeing off the data and landing it in both clusters, maybe in combination with Kafka to buffer the data for less stable connections between the two locations (a minimal sketch follows Figure 1-6). This setup also allows you to batch together updates and efficiently move them across at the speed of the shared interconnection. But considering a true backup and disaster recovery strategy, you will need at least one more environment to hold the same amount of data, bringing the total now to more than five or six (including the above best-case environment count and also accounting for low-latency use cases, as shown in Figure 1-6).
Figure 1-6. Environments needed including backup & disaster recovery
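For the streaming option just described, Kafka’s MirrorMaker is one way to tee topics into the second cluster; the configuration file names and topic pattern below are assumptions.

# Hypothetical invocation: mirror all topics matching "ingest.*" from the production
# Kafka cluster into the backup cluster; the .properties files hold the respective
# bootstrap servers and client settings.
kafka-mirror-maker.sh \
  --consumer.config prod-consumer.properties \
  --producer.config backup-producer.properties \
  --whitelist 'ingest.*'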
Cloud Services
Another option is to host the non-production clusters in a hosted environment, that is, a cloud instance (be it internal or external; see [Link to Come]). That allows quick setup of these environments as needed, or recreating them for new platform releases. Many Hadoop vendors have some tool on offer that can help make this really easy. Of course, this does require careful planning on where the data resides, since moving large quantities of data in and out of an external cloud might be costly. That is where a private cloud is a good choice.
Overall, using virtualized environments helps with two aspects concerning Hadoop clusters:
Utilization of hardware resources, and
provisioning of clusters
The former is about what we have discussed so far, that is, reducing the number of physical servers needed. With virtual machines you could run the development and testing environments (and QA, etc.) on the same nodes. This may save a lot of Capex-type cost (capital expenditure) upfront and turn running these clusters into an Opex-type cost (operational expenditure). Of course, the drawbacks are as expected: shared hardware may not be as powerful as dedicated hardware, making certain kinds of tests (for example, extreme performance tests) impractical. Figure 1-7 shows the environments that could be virtualized.
Figure 1-7. Some environments could be hosted in a cloud
The advantage of cloud services is usually the latter item, that is, ease of provisioning. We will discuss this next.
Once you have decided how many environments you want to use, and on what infrastructure, you have to devise a plan for deploying the Hadoop distribution of your choice on top of it (see [Link to Come] for details). For that, the following approaches are common options found in practice:
Configuration Management
Use a configuration management (CM) framework, such as Ansible, Chef, or Puppet, to install the distribution’s management tool (Cloudera Manager or Apache Ambari), which includes security, and even provision other, auxiliary tools, such as applications or ingest processes. The greatest advantage of using a CM system is that it automatically forces you to document all the necessary steps in its recipes or playbooks, which in turn are usually versioned in a version control system (VCS). A minimal sketch of this approach follows this list.
Cloud
As discussed in “Cloud Services”, using virtual machines to deploy prefabricated images onto shared (or dedicated) server hardware is a convenient mechanism to bootstrap complex, distributed systems such as Hadoop. It is very likely though that these images were initially set up using the above approach of employing a configuration management framework, such as Ansible or similar.
Appliances
No matter if you run Hadoop on bare metal, in the private or public cloud, or on engineered hardware in the form of prefabricated appliances, you must consider existing IT policies, potential preferences, or budget restrictions. In the end Hadoop will work on any of those infrastructure platforms, with some effects on the workloads. There are different variations of the engineered solutions that might impact the application design in different ways, so it is best to follow the often jointly developed reference architectures published by the Hadoop or appliance vendor. Those are proven configurations that have been tested and verified to work best with CDH.
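As a sketch of the configuration management approach listed above, the cluster setup steps might live in version control and be applied with a tool such as Ansible; the repository URL, inventory, and tag names are assumptions.

# Hypothetical flow: fetch the versioned playbooks and apply the Cloudera Manager roles.
git clone https://git.example.com/ops/hadoop-cluster-playbooks.git
cd hadoop-cluster-playbooks
ansible-playbook -i inventories/prod site.yml --tags cloudera-manager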
A new release should be rolled out with no downtime, no data loss, and ideally a rollback feature in case testing of the release did not catch a severe deficiency. Running a single Hadoop cluster itself is not trivial and should be supported by automatic, reproducible procedures and processes. Having to do that across many environments further complicates this task, but it is an unavoidable burden. Consider running some environments on a private or public cloud infrastructure to offset cost and to be able to stage and rotate releases using the provided mechanisms. Otherwise, plan to build out the described environments for software engineering, and in due course disaster recovery as well as a partitioned production cluster.
1 See, for example, Software Development Process for general information on the topic.
2 Originally provided by Cloudera as a GitHub repository.
3 Added by Hortonworks to Hive under HIVE-7926.
Chapter 2. Compute & Storage
In this chapter we will cover every piece of IT infrastructure that is required to build a Hadoop cluster. We start by talking about rather low-level details of computer architecture and how they are used by the Linux operating system. We then talk about server form factors, before we finally talk about cluster configurations, both in physical and virtual environments.
You may not be required to know the facts in this chapter by heart, but Hadoop is here to stay, and the standards for building rock-solid architectures and being able to deploy them like a true champion are growing. As Hadoop matures from PoC to production, it remains a distributed system, which at times poses utterly complex problems related to the underlying software and hardware stack on the individual server. The goal of this chapter is thus to supply you, as an architect and/or engineer, with the knowledge to size the cluster’s servers and how they connect to the surrounding infrastructure, and to learn what is going on underneath in order to simplify problem analysis.
During the initial years of enterprise adoption, the requirements for Hadoop IT infrastructure were simple, yet disruptive: practitioners essentially only recommended running Hadoop on dedicated commodity servers with local storage. These requirements are at odds both with state-of-the-art compute and storage virtualization and with the emergence of cloud environments for Hadoop. However, since its beginnings as a large-scale backend batch framework for the big Web 2.0 content providers, Hadoop has been evolving into a versatile framework required to perform in heterogeneous IT environments. While Hadoop’s paradigm of colocating compute and storage is still mandatory to excel in performance and efficiency, providers of Hadoop technology are in the meantime investing intensively to support Hadoop in cloud environments, in order to participate in the global growth of enterprise IT in the cloud.
Most of the concepts in this chapter are developed by reference to on-premise Hadoop infrastructure. While the number of cloud-based Hadoop deployments is rapidly growing, on-premise installations are still the dominant form of deploying Hadoop. We start this chapter with an in-depth discussion of relevant concepts in computer architecture and the Linux operating system, before we introduce common server form factors for on-premise deployments. We further discuss special architectures where compute and storage are separated. The chapter concludes with a discussion of hardware errors and reliability.
Computer architecture for Hadoopers
Commodity servers
It is widely understood that Hadoop, like most commercial computation workloads today, runs on commodity servers, which over the last 10 years have simply become a commodity. That being said, most modern servers are very powerful and complex machines that need to keep up with the ever-increasing demand of IT and computational needs in the enterprise and consumer sectors. The lion’s share of today’s datacenters is comprised of x86-64 architecture-based systems, which feature up to 24 processing cores per processor. However, with many cores come many challenges: writing an application that fully takes advantage of this degree of parallelism is often far from trivial. The drastic increase in the number of processing cores during recent years is a technical necessity to maintain the growth of computational capability of server processors; due to the physical limits of scaling a single core’s frequency, scaling cores is the only alternative. As we will see, the majority of Hadoop clusters are implemented with two processors per system, while it is also possible to use servers that feature four or even eight processors per system. The concept of using multiple cores per CPU and/or using multiple CPUs is referred to as Symmetric Multi-Processing (SMP).
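On a Linux system you can quickly inspect this topology; the following is illustrative output for a hypothetical two-socket machine with 12 cores per processor.

# Show socket, core, and NUMA layout (output values are representative, not measured).
lscpu | grep -E 'Socket|Core|Thread|NUMA'
Thread(s) per core:    1
Core(s) per socket:    12
Socket(s):             2
NUMA node(s):          2
NUMA node0 CPU(s):     0-11
NUMA node1 CPU(s):     12-23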
Figure 2-1 shows a simplified block diagram of the relevant hardware components in a commodity server. In this example two CPUs are interconnected via a coherent inter-processor link. The CPU cores on each processor have separate L1 and L2 caches and typically share an L3 cache. Each processor implements a memory controller to attach DDR3/4 DRAM memory, which, as we describe in detail in the following section, makes this a so-called NUMA system. Input/output operations are implemented via a PCI-Express root complex, which attaches downstream I/O controllers for SATA/USB/Ethernet connections, etc. All CPU-internal components (cores, L3 cache, memory controller, PCI-Express root complex, and the interconnect unit) are themselves interconnected via an on-die interconnect bus. All commodity servers today abide by the general structure illustrated in the figure. Commodity servers that feature more than two CPUs will typically be organized in a ring topology via the CPU interconnect, but otherwise adhere to the same general structure as illustrated in Figure 2-1. While it is always possible to populate a two-socket server with only a single CPU, there are rarely any commodity servers with only a single socket today.
Figure 2-1. A modern computer
Non-Uniform Memory Access
The most important takeaway from the discussion around Symmetric Multi-Processing (SMP) is an understanding of the concept of Non-Uniform Memory Access (NUMA) that it incurs. When multiple processors share the memory in the system, the mechanism by which it is made accessible becomes an important factor in the overall system design. In some early multi-processor computer designs, all memory was exposed to the processors equally on a common bus or via a crossbar switch.
Today, this approach is mostly not practical. CPUs need to accommodate DRAM with bus speeds beyond 2 GHz, and because CPUs are considered modular, pluggable entities, each processor directly implements an interface to the DRAM. As a consequence, any program running on a given CPU that needs to access memory attached to another CPU must first traverse the inter-processor link. While the speed of this connection is in the multi-gigatransfers/s range and individual requests complete very quickly, running from another processor’s memory introduces a significant overhead when compared to running on the processor’s local memory. This distinction between local and distant memory is called Non-Uniform Memory Access (NUMA). A common example could be a process of a Hadoop service that is allowed to be very large and may actually be allocated in a memory range that physically must span both processors. In this scenario multiple threads could be running on both physical CPUs, trying to access a location of memory which is distant for some of these threads and local to others. This memory would, however, reside in each processor’s L3/L2/L1 caches to improve access speed. Upon each update the processors’ caches must reflect that update everywhere coherently; i.e., an update by a thread on processor 1 of a memory location that represents an integer number must materialize on processor 2 atomically before any thread on processor 2 reads the shared memory location. If processor 2 were to increase the value of the shared integer without seeing the prior update, that update would be lost and the value would be wrong. This coherence is maintained by the processor’s cache coherency protocol.
In order to expose information about the memory topology to programmers and users, and to provide a means to optimize runtime behavior on NUMA architectures, Linux provides tools and interfaces via which users and programs can influence NUMA behavior directly. Most importantly, this allows requesting optimal placement for applications on a given processor, which in NUMA terminology is called a NUMA node, not to be confused with a core or a thread within a processor.
As a Hadoop architect or engineer, the likelihood that you will deal with NUMA directly is fairly low, since the Linux NUMA subsystem makes mostly sensible decisions and modern Hadoop distributions explicitly request placement of processes on NUMA nodes. There are, however, problems with system performance that you will encounter which are related to NUMA, especially when you dedicate large amounts of memory to single Hadoop services such as reporting systems based on Hive/Impala or HBase. As a programmer, however, you should generally be aware of NUMA. If you know that your query, your Spark job, or your own framework will need more memory than is available on a single NUMA node, you should make conscious decisions about the NUMA policy that you run it with.
Let us briefly review how information about NUMA for a process in Linux can be obtained and influenced via the numactl command. Assume that we have a system with two processors, as indicated in Figure 2-1, and that each of the processors controls 128 GB of memory. Let’s start with displaying the available NUMA nodes, i.e., the processors on the system.
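An illustrative numactl --hardware listing for such a two-node machine could look like the following; the exact CPU lists and memory values are representative assumptions.

numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11
node 0 size: 130946 MB
node 0 free: 120290 MB
node 1 cpus: 12 13 14 15 16 17 18 19 20 21 22 23
node 1 size: 131072 MB
node 1 free: 122750 MB
node distances:
node   0   1
  0:  10  21
  1:  21  10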
In the first row of the output, we see the number of available NUMA nodes. Next, the amount of attached and free memory is shown per node, before finally the output lists a table of NUMA distances. Linux assigns a score of 10 for access to the local processor and 21 for an adjacent processor. Higher costs may be associated with topologies where there is no direct connection between the originating processor and the target processor, in which case access occurs by traversing an adjacent processor in order to reach the target. In the example above we see that most memory on this machine is not allocated and that existing allocations are fairly evenly distributed.
In Linux you can also display NUMA information for a process via the proc filesystem, as shown in the simplistic example below. Here we see how a YARN NodeManager maps the gcc runtime library:
cat /proc/<process-id>/numa_maps | grep libgcc
7f527fa8e000 prefer:1 file=/usr/lib64/libgcc_s-4.8.5-20150702.so.1 mapped=3 N0=3
7f527faa3000 prefer:1 file=/usr/lib64/libgcc_s-4.8.5-20150702.so.1
7f527fca2000 prefer:1 file=/usr/lib64/libgcc_s-4.8.5-20150702.so.1 anon=1 dirty=1 active=0 N0=1
7f527fca3000 prefer:1 file=/usr/lib64/libgcc_s-4.8.5-20150702.so.1 anon=1 dirty=1 active=0 N0=1
Let us analyze this output (there are more possible fields in this output; see the referenced documentation1):
<address>
The first entry shows us the starting address of the mapped region in the process’s virtual memory address space.
prefer:1
Shows the NUMA placement policy of the memory. It is always best practice to prefer a specific NUMA node, so that reads from distant memory are minimized. For processes that consume lots of memory there will be a point where the preference can no longer be fulfilled. This can easily happen for certain processes on Hadoop worker nodes, such as Impala daemons or HBase RegionServers.
file=
Shows which file backs this mapping. Often multiple disjoint mappings are created for a file, and often only part of the file is mapped.
N<node>=<number of mapped pages>
Shows how many pages are mapped on a certain node. This is what you should look out for, and it may indicate a performance problem when you see entries for many nodes (e.g., N0=50000 N1=50000).
Linux allows you to control the NUMA characteristics when a process is launched via the numactl command, which we have already seen above. numactl provides options that, on the one hand, control on which NUMA node a process runs and, on the other hand, where its memory is allocated. For example:
numactl --preferred=0 <process>
This will launch <process> and allocate its memory on node 0, but if memory allocation is not possible there, fall back to other nodes. When you launch a process this way, all of its children will inherit the same NUMA policy. In the numa_maps example above, which shows actual NUMA mappings, all entries have inherited their preference from the original command that started the NodeManager.
Hadoop distributions today may also leverage numactl to optimize the NUMA configuration for processes that are launched by their management tools, such as Cloudera Manager or Apache Ambari.
As illustrated in Figure 2-1, access to I/O hardware also occurs in a non-uniform fashion. Each processor implements its southbound I/O fabric via PCI-Express, which is a high-speed point-to-point communications protocol. This means that the I/O chip, which connects further southbound bus systems like SATA/SAS/Ethernet, can only connect to a single upstream PCI-Express root complex. For apparent reasons there is typically only a single I/O chipset, such that all but one of the processors are required to communicate via the inter-processor link before they can reach I/O. Even though I/O completion time may increase by up to 130% due to the additional hop, this overhead must be accepted, since all processors need to communicate with the outside world via a single I/O hub. However, when profiling workloads and debugging performance issues it is necessary to be mindful of NUMA for both computation and I/O.