2 Compute & Storage
    1 Computer architecture for Hadoopers
        1 Commodity servers
        2 Non-Uniform Memory Access
        3 Server CPUs & RAM
        4 The Linux Storage Stack
    2 Server Form Factors
        1 Other Form Factors
    4 Cluster Configurations and Node Types
        1 Master Nodes
        2 Worker Nodes
        3 Utility Nodes
        4 Edge Nodes
        5 Small Cluster Configurations
        6 Medium Cluster Configurations
        7 Large Cluster Configurations
3 High Availability
    1 Planning for Failure
    2 What do we mean by High Availability?
        1 Lateral or Service HA
        2 Vertical or Systemic HA
        3 Automatic or Manual Failover
    3 How available does it need to be?
        Service Level Objectives
Hadoop in the Enterprise: Architecture
A Guide to Successful Integration
Jan Kunigk, Lars George, Paul Wilkinson, Ian Buss
Hadoop in the Enterprise: Architecture
by Jan Kunigk, Lars George, Paul Wilkinson, and Ian Buss
Copyright © 2017 Jan Kunigk, Lars George, Ian Buss, and Paul Wilkinson. All rights reserved.
Printed in the United States of America
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.
O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com/safari). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.
Editor: Nicole Tache
Production Editor: FILL IN PRODUCTION EDITOR
Copyeditor: FILL IN COPYEDITOR
Proofreader: FILL IN PROOFREADER
Indexer: FILL IN INDEXER
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Rebecca Demarest
September 2017: First Edition
Revision History for the First Edition
2017-03-22: First Early Release
See http://oreilly.com/catalog/errata.csp?isbn=9781491969274 for release details.
The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Hadoop in the Enterprise: Architecture, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.
While the publisher and the author(s) have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author(s) disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.
978-1-491-96927-4
[FILL IN]
Chapter 1. Clusters
Big Data and Apache Hadoop are by no means trivial in practice, as there are many moving parts and each requires its own set of considerations. In fact, each component in Hadoop, for example HDFS, supplies distributed processes that have their own peculiarities and a long list of configuration parameters, all of which may have an impact on your cluster and use case. Or maybe not. You need to whittle everything down in painstaking trial-and-error experiments, or consult what documentation you can find. In addition, new releases of Hadoop, but also your own data pipelines built on top of it, require careful retesting and verification that everything holds true and works as expected. We will discuss practical solutions to these and many other issues throughout this book, invoking what the authors have learned (and are still learning) about implementing Hadoop clusters and Big Data solutions at enterprises, both large and small.
One thing though is obvious: Hadoop is a global player, and the leading software stack when it comes to Big Data storage and processing. No matter where you are in the world, you may well struggle with the same basic questions around Hadoop, its setup, and its subsequent operation. By the time you have finished reading this book, you should be much more confident in conceiving a Hadoop-based solution that can be applied to various and exciting new use cases.
In this chapter, we kick things off with a discussion about cluster environments, a topic often overlooked because it is assumed that the successful proof-of-concept cluster delivering the promised answers is also the production environment running the new solution at scale, automated, reliable, and maintainable, which is often far from the truth.
Building Solutions
Developing for Hadoop is quite unlike common software development, as you are mostly concerned with building not a single, monolithic application but rather a concerted pipeline of distinctive pieces, which in the end are to deliver the final result. Often this is insight into the data that was collected, on which further products are built, such as recommendation or other real-time decision-making engines. Hadoop itself is lacking graphical data representation tools, though there are some ways to visualize information during discovery and data analysis, for example using Apache Zeppelin or similar tools with charting support built in.
In other words, the main task in building Hadoop-based solutions is to apply Big Data Engineering principles, which comprise the selection (and, optionally, creation) of suitable:
hardware and software components,
data sources and preparation steps,
processing algorithms,
access and provisioning of resulting data, and
automation of processes for production
As outlined in Figure 1-1, the Big Data engineer is building a data pipeline, which might include more traditional software development, for example writing an Apache Spark job that uses the supplied MLlib library to apply a linear regression algorithm to the incoming data. But there is much more that needs to be done to establish the whole chain of events that leads to the final result, or the wanted insight.
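For illustration, such a Spark job might be submitted to the cluster along the following lines; the class name, JAR, and HDFS paths are hypothetical placeholders.

# Hypothetical submission of an MLlib-based regression job to YARN.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class com.example.RegressionJob \
  /opt/jobs/regression-job.jar \
  hdfs:///data/incoming hdfs:///data/model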
Figure 1-1. Big Data Engineering
A data pipeline comprises, in very generic terms:
the task of ingesting the incoming data and staging it for processing,
processing the data itself in an automated fashion, triggered by time or data events, and
delivering the final results (as in, new or enriched datasets) to the consuming systems.
These tasks are embedded into an environment, one that defines the boundaries and constraints in which to develop the pipeline (see Figure 1-2). In practice the structure of this environment is often driven by the choice of Hadoop distribution, placing an emphasis on the included Apache projects that form the platform. In recent times, distribution vendors are more often going their own way and selecting components that are similar to others, but are not interchangeable (for example, choosing Apache Ranger vs. Apache Sentry for authorization within the cluster). This does result in vendor dependency, no matter if all the tools are open source or not.
Figure 1-2. Solutions are part of an environment
The result is that an environment is usually a cluster with a specific Hadoop distribution (see [Link to Come]), running one or more data pipelines on top of it, which represent the solution architecture. Each solution is embedded into further rules and guidelines, for example the broader topic of governance, which includes backup (see [Link to Come]), metadata and data management, lineage, security, auditing, and other related tasks. During development though, or during rapid prototyping, say for a proof-of-concept project, it is common that only parts of the pipeline are built. For example, it may suffice to stage the source data in HDFS, but not devise a fully automated ingest setup. Or the final provisioning of the results is covered by integration testing assertions, but not connected to the actual consuming systems.
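For example, during a proof of concept the staging step might be nothing more than a manual copy into HDFS; the paths below are placeholder assumptions.

# Minimal manual staging, often sufficient for early prototyping.
hdfs dfs -mkdir -p /staging/source
hdfs dfs -put /data/export/*.csv /staging/source/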
No matter what the focus of the development is, in the end a fully planned data pipeline is a must to be able to deploy the solution in the production environment. It is common for all of the other environments before that to reflect the same approach, making the deployment process more predictable.
Figure 1-3 summarizes the full Big Data Engineering flow, where a mixture of engineers work on each major stage of the solution, including the automated ingest and processing, as well as the final delivery of the results. The solution is then bundled into a package that also contains metadata, determining how governance should be applied to the included data and processes.
Figure 1-3. Developing data pipelines
Ideally, the deployment and handling is backed by common development techniques, such as continuous integration, automating the testing of changes after they are committed by developers, and of new releases after they have been sanctioned by the Big Data engineers. The remaining question is: do you need more than one environment, or, in other words, cluster?
Single vs. Many Clusters
When adding Hadoop to an existing IT landscape, a very common question is: how many clusters are needed? Especially in the established and common software development process1 we see sandboxed environments that allow separate teams to do their work without interrupting each other. We are now confronted with two competing issues:
rollout of new and updated applications and data pipelines, and
rollout of new platform software releases.
The former is about making sure that new business logic performs as expected while it is developed, tested, and eventually deployed. Then there is the latter, which is needed when the platform itself changes, for example with a new Hadoop release. Updating the application code is obviously the easier of the two, as it relies on (or rather, often implies) all environments running the same platform version. Rolling out a new platform version requires careful planning and might interrupt or delay the application development, since it may require the code to be compiled against a newer version of the dependent libraries. So what do we see in practice? Pretty much everything!
Indeed, we have seen users with a single cluster for everything, all the way to a separate cluster for three or four of the development process stages, including development, testing, quality assurance (QA), user acceptance (UA), staging/pre-production, and production. The driving factors are mostly cost versus convenience and features: it requires many more servers to build all of the mentioned environments, and that might be a prohibitive factor. The following list shows typical combinations:
Single Cluster for Everything
Everything on one cluster, no room for errors, and possibly downtime when platform upgrades are needed.
This is in practice usually not an option. What is obvious though is that there is often an assumption that having a proof-of-concept (PoC) cluster that worked well is the same as having a production cluster, not to mention all the other possible environments. A single PoC cluster built from scrap servers, or on an insufficient number of nodes (two or three servers do not make a Hadoop cluster; they start at five or more machines), is not going to suffice. Proper planning and implementation have to go into setting up Hadoop clusters, where networking is usually the greatest cost factor and often overlooked.
Two Clusters (Dev/Test/QA/Pre-Production, and Production)
Separates everything else from production, and allows testing of new releases and platform versions before rollout. Difficult to roll back, if at all possible.
Having two properly planned clusters is the minimal setup to run successful Big Data solutions, with reduced business impact compared to having only a single cluster. But you are overloading a single environment and will have significant congestion between, for example, the development and testing teams.
Three Clusters (Dev, Test/QA/PreProd, and Prod)
Basic setup with most major roles separated, allowing flexibility between development, staging, and production.
With three clusters the interdependencies are greatly reduced, but not fully resolved, as there are still situations where the shared resources have to be scheduled exclusively for one or another team.
Four Clusters (Dev, Test, PreProd/QA, and Prod)
Provides the greatest flexibility, as every team is given its own resources.
If the goal is to allow the engineering groups to do their work in a timely manner, you will have to have four cluster environments set up and available at all times. Everything else, while possible, includes minor to major compromises and/or restrictions. Figure 1-4 shows all of the various environments that might be needed for Big Data engineering.
Figure 1-4. Environments needed for Big Data Engineering
A common option to reduce cost is to specify the clusters according to their task, for example, as shown here:
Development
Could be a local development machine, or a small instance with four to five virtual machines (see “Cloud Services”). Only functional tests are performed.
One further consideration for determining how many clusters your project needs is how and what data is provided to each environment. This is mainly a question of getting production data to the other clusters so that they can perform their duty. This most likely entails data governance and security decisions, as PII (personally identifiable information) might need to be secured and/or redacted (for example, crossing out digits in Social Security numbers). In regards to controlling costs, it is also quite often the case that non-production clusters only receive a fraction of the complete data. This reduces storage and, with it, processing needs, but also means that only the production environment is exposed to the full workload, making earlier load tests in the smaller environments more questionable or at least difficult.
Note
It is known from Facebook, which uses Hadoop to a great extent, that live traffic can be routed to a test cluster and even amplified to simulate any anticipated seasonal or growth-related increase. This implies that the test environment is at least as powerful as the existing production environment. Of course, this could also be used to perform a validation (see [Link to Come]) of a new, and possibly improved, production platform.
The latest trend is to fold together some of those environments and make use of the multitenancy features of Hadoop. For example, you could use the “two cluster” setup above, but shift the pre-production role onto the production cluster. This helps to utilize the cluster better if there is enough spare capacity in terms of all major resources, that is, disk space, I/O, memory, and CPU. On the other hand, you are now forced to handle pre-production very carefully so as not to impact the production workloads.
Finally, a common question is how to extrapolate cluster performance based on smaller non-production environments. While it is true that Hadoop mostly scales linearly for its common workloads, there is also some initial cost to get true parallelization going. This manifests itself in that very small “clusters” (we have seen three-node clusters installed with the entire Hadoop stack) are often much more fickle than expected. You may see issues that do not show up at all when you have, say, 10 nodes or more. As for extrapolation of performance, testing a smaller cluster with a subset of the data will give you some valuable insight, and you should be able to determine from there what to expect of the production cluster. But since Hadoop is a complex, distributed system, with many moving parts, scarce resources such as CPU, memory, network, disk space, and general I/O, as well as possibly being shared across many tenants, you have to once again be very careful evaluating your predictions. Only if you had equally sized test/QA/pre-production and production clusters, mimicking the same workloads closely, would you have more certainty.
Overall these possibilities have to be carefully evaluated, as “going back and forth” is often not possible after the cluster reaches a certain size, or is tied into a production pipeline that should not be disturbed. Plan early and with plenty of due diligence. Plan also for the future, as in: ask yourself how you will grow the solution as the company starts to adopt Big Data use cases.
Having mentioned sharing a single cluster, in an attempt to reduce the number of environments needed, by means of Hadoop's built-in multitenancy features, we also have to discuss the caveats. The fact is that Hadoop is a fairly young software stack, having just turned 10 years old in 2016. It is also a fact that the majority of users run Hadoop with very few use cases, and if they have more, those use cases are of a very similar (if not the same) nature. For example, it is no problem today to run a Hadoop cluster red-hot with MapReduce and Spark jobs, using YARN as the only cluster resource manager. This is a very common setup and is used in many large enterprises throughout the world. In addition, one can enable control groups (cgroups) to further isolate CPU and I/O resources from one YARN application to another. So what is the problem?
With the growth and adoption of Hadoop in the enterprise, the list of requested features led to a state where Hadoop is stretching itself to cover other workloads as well, for example MPP-style query engines or search engines. These compete for the resources controlled by YARN, and it may happen that they collide in the process. Shoehorning long-running processes, commonly known as services, into a mostly batch-oriented architecture is difficult, to say the least. Looking at efforts such as Llama2 or the more recent LLAP3 shows how non-managed resources are carved out of the larger resource pool to be ready for low-latency, ad hoc requests, which is something quite different from scheduled job requirements.
Add to that the fact that HDFS has no accounting features built in, which makes colocated service handling nearly impossible. For example, HBase uses the same HDFS resources as Hive, MapReduce, or Spark. Building a charge-back model on top of missing accounting is futile, leaving you with no choice but to eventually separate low-latency use cases from batch or other, higher-latency interactive ones. The multitenancy features in Hadoop are mainly focused on authorization of requests, not on dynamic resource tracking. When you run a MapReduce job as user foobar that reads from HBase, which in turn reads from HDFS, it is impossible to limit the I/O for that specific user, as HDFS only sees hbase causing the traffic.
Some distributions allow the static separation of resources into cgroups at the process level. For example, you could allocate 40% of I/O and CPU to YARN, and the rest to HBase. If you only read from HBase using YARN applications, this separation is useless for the above reasons. If you then further mix this with Hadoop applications that natively read HDFS but may or may not use impersonation (a feature that makes the actual user for whom the job is executed visible to the lower-level systems, such as HDFS), the outcome of trying to mix workloads is rather unpredictable. While Hadoop is improving over time, this particular deficiency has not seen much support from the larger Hadoop contributor groups.
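To make the static separation idea concrete, a minimal sketch at the operating-system level might look as follows; the cgroup names, weights, and paths are assumptions for illustration, and in practice distributions configure this through their management tools (often called static service pools).

# Illustrative 40/60 split of CPU and block I/O weight between YARN and HBase (cgroups v1).
mkdir -p /sys/fs/cgroup/cpu/yarn /sys/fs/cgroup/cpu/hbase
echo 400 > /sys/fs/cgroup/cpu/yarn/cpu.shares       # relative CPU weight (~40%)
echo 600 > /sys/fs/cgroup/cpu/hbase/cpu.shares      # relative CPU weight (~60%)
mkdir -p /sys/fs/cgroup/blkio/yarn /sys/fs/cgroup/blkio/hbase
echo 400 > /sys/fs/cgroup/blkio/yarn/blkio.weight   # relative block I/O weight
echo 600 > /sys/fs/cgroup/blkio/hbase/blkio.weight
# Processes are then assigned by writing their PIDs into the respective cgroup.procs files,
# for example the NodeManager PID into /sys/fs/cgroup/cpu/yarn/cgroup.procs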
You are left with the need to possibly partition your cluster to physically separate specific workloads (see Figure 1-5). This can be done if enough resources in terms of server hardware are available. If not, you will have to spend extra budget to provision such a setup. You now also have another environment to take care of and to make part of the larger maintenance process. In other words, you may be forced to replicate the same split setup in the earlier environments, such as in pre-production, testing, or even development.
Figure 1-5. Workloads may force you to set up separate production clusters
Backup & Disaster Recovery
Once you get started with Hadoop, there comes the point where you ask yourself: if I want to keep my data safe, what works when dealing with multiple petabytes of data? The answer is as varied as the question of how many environments you need for engineering Big Data solutions, and yet again we have seen all kinds, from “no backup at all” to “cluster to cluster” replication. For starters, volume is an issue at some point, but so is one of the other “V”s of Big Data: velocity. If you batch-load large chunks of data, you can handle backup differently from when you receive updates in micro-batches, for example using Flume or Kafka landing events separately. Do you have all the data already and then decide to back it up? Or are you about to get started with loading data and can devise a backup strategy upfront?
The most common combinations we see are these:
Keep in mind that the backup strategy might be orthogonal to the development environments discussed above; i.e., you may have a dev, test/QA/preprod, and prod cluster, and another one just for the backup. Or you could save money (at the cost of features and flexibility) and reuse, for example, the pre-production cluster as the standby cluster for backups.
How is data actually copied or backed up? When you have a backup cluster with the same platform software, you may be able to use the provided tools, such as distcp combined with Apache Oozie for automation, or use the proprietary tools that some vendors ship in addition to the platform itself, for example Cloudera’s BDR, which allows you to schedule regular backups between clusters. A crucial part of the chosen strategy is to do incremental backups once the core data is synchronized.
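As an illustration, an incremental cluster-to-cluster copy with distcp might look like the following sketch; the NameNode addresses and paths are placeholders, and in practice such a command would be wrapped into an Oozie workflow or another scheduler.

# Hypothetical incremental backup run: -update copies only new or changed files,
# -delete removes files on the target that no longer exist on the source,
# and -m limits the number of parallel copy tasks.
hadoop distcp -update -delete -m 50 \
  hdfs://prod-nn:8020/data/warehouse \
  hdfs://backup-nn:8020/data/warehouse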
If you stream data into the cluster, you could also consider teeing off the data and landing it in both clusters, maybe in combination with Kafka to buffer the data for less stable connections between the two locations (a minimal sketch follows Figure 1-6). This setup also allows you to batch together updates and efficiently move them across at the speed of the shared interconnection. But considering a true backup and disaster recovery strategy, you will need at least one more environment to hold the same amount of data, bringing the total now to more than five or six (including the above best-case environment count and also accounting for low-latency use cases, as shown in Figure 1-6).
Figure 1-6. Environments needed including backup & disaster recovery
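For the streaming option just described, Kafka’s MirrorMaker is one way to tee topics into the second cluster; the configuration file names and topic pattern below are assumptions.

# Hypothetical invocation: mirror all topics matching "ingest.*" from the production
# Kafka cluster into the backup cluster; the .properties files hold the respective
# bootstrap servers and client settings.
kafka-mirror-maker.sh \
  --consumer.config prod-consumer.properties \
  --producer.config backup-producer.properties \
  --whitelist 'ingest.*'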
Cloud Services
Another option is to host the non-production clusters in a hosted environment, that is, a cloud instance (be it internal or external; see [Link to Come]). That allows quick setup of these environments as needed, or recreating them for new platform releases. Many Hadoop vendors have some tool on offer that can help make this really easy. Of course, this does require careful planning on where the data resides, since moving large quantities of data in and out of an external cloud might be costly. That is where a private cloud is a good choice.
Overall, using virtualized environments helps with two aspects concerning Hadoop clusters:
Utilization of hardware resources, and
provisioning of clusters
The former is about what we have discussed so far, that is, reducing the number of physical servers needed. With virtual machines you could run the development and testing environments (and QA, etc.) on the same nodes. This may save a lot of Capex-type cost (capital expenditure) upfront and turn running these clusters into an Opex-type cost (operational expenditure). Of course, the drawbacks are as expected: shared hardware may not be as powerful as dedicated hardware, making certain kinds of tests (for example, extreme performance tests) impractical. Figure 1-7 shows the environments that could be virtualized.
Figure 1-7. Some environments could be hosted in a cloud
The advantage of cloud services is usually the latter item, that is, ease of provisioning. We will discuss this next.
Once you have decided how many environments you want to use, and on what infrastructure, you have to devise a plan for deploying the Hadoop distribution of your choice on top of it (see [Link to Come] for details). For that, the following approaches are common options found in practice:
Configuration Management
Use a configuration management (CM) framework, such as Ansible, Chef, or Puppet, to install the distribution’s management tool (Cloudera Manager or Apache Ambari), which includes security, and even provision other, auxiliary tools, such as applications or ingest processes. The greatest advantage of using a CM system is that it automatically forces you to document all the necessary steps in its recipes or playbooks, which in turn are usually versioned in a version control system (VCS). A minimal sketch of this approach follows this list.
Cloud
As discussed in “Cloud Services”, using virtual machines to deploy prefabricated images onto shared (or dedicated) server hardware is a convenient mechanism to bootstrap complex, distributed systems such as Hadoop. It is very likely though that these images were initially set up using the above approach of employing a configuration management framework, such as Ansible or similar.
Appliances
No matter if you run Hadoop on bare metal, in the private or public cloud, or on engineered hardware in the form of prefabricated appliances, you must consider existing IT policies, potential preferences, or budget restrictions. In the end Hadoop will work on any of those infrastructure platforms, with some effects on the workloads. There are different variations of the engineered solutions that might impact the application design in different ways, so it is best to follow the often jointly developed reference architectures published by the Hadoop or appliance vendor. Those are proven configurations that have been tested and verified to work best with CDH.
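As a sketch of the configuration management approach listed above, the cluster setup steps might live in version control and be applied with a tool such as Ansible; the repository URL, inventory, and tag names are assumptions.

# Hypothetical flow: fetch the versioned playbooks and apply the Cloudera Manager roles.
git clone https://git.example.com/ops/hadoop-cluster-playbooks.git
cd hadoop-cluster-playbooks
ansible-playbook -i inventories/prod site.yml --tags cloudera-manager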
A new release should be rolled out with no downtime, no data loss, and ideally a rollback feature in case testing of the release did not catch a severe deficiency. Running a single Hadoop cluster itself is not trivial and should be supported by automatic, reproducible procedures and processes. Having to do that across many environments further complicates this task, but it is an unavoidable burden. Consider running some environments on a private or public cloud infrastructure to offset cost and to be able to stage and rotate releases using the provided mechanisms. Otherwise, plan to build out the described environments for software engineering, and in due course disaster recovery as well as a partitioned production cluster.
1 See, for example, Software Development Process for general information on the topic.
2 Originally provided by Cloudera as a GitHub repository.
3 Added by Hortonworks to Hive under HIVE-7926.
Chapter 2. Compute & Storage
In this chapter we will cover every piece of IT infrastructure that is required to build a Hadoop cluster. We start by talking about rather low-level details of computer architecture and how they are used by the Linux operating system. We then talk about server form factors, before we finally talk about cluster configurations, both in physical and virtual environments.
You may not be required to know the facts in this chapter by heart, but Hadoop is here to stay, and the standards for building rock-solid architectures and being able to deploy them like a true champion are growing. As Hadoop matures from PoC to production, it remains a distributed system, which at times poses utterly complex problems related to the underlying software and hardware stack on the individual server. The goal of this chapter is thus to supply you, as an architect and/or engineer, with the knowledge to size the cluster’s servers and how they connect to the surrounding infrastructure, and to learn what is going on underneath in order to simplify problem analysis.
During the initial years of enterprise adoption, the requirements for Hadoop IT infrastructure were simple, yet disruptive: practitioners essentially only recommended running Hadoop on dedicated commodity servers with local storage. These requirements are at odds both with state-of-the-art compute and storage virtualization and with the emergence of cloud environments for Hadoop. However, since its beginnings as a large-scale backend batch framework for the big Web 2.0 content providers, Hadoop has been evolving into a versatile framework required to perform in heterogeneous IT environments. While Hadoop’s paradigm of colocating compute and storage is still mandatory to excel in performance and efficiency, providers of Hadoop technology are in the meantime investing intensively to support Hadoop in cloud environments, in order to participate in the global growth of enterprise IT in the cloud.
Most of the concepts in this chapter are developed by reference to on-premise Hadoop infrastructure. While the number of cloud-based Hadoop deployments is rapidly growing, on-premise installations are still the dominant form of deploying Hadoop. We start this chapter with an in-depth discussion of relevant concepts in computer architecture and the Linux operating system, before we introduce common server form factors for on-premise deployments. We further discuss special architectures where compute and storage are separated. The chapter concludes with a discussion of hardware errors and reliability.
Computer architecture for Hadoopers
Commodity servers
It is widely understood that Hadoop, like most commercial computation workloads today, runs on commodity servers, which over the last 10 years have simply become a commodity. That being said, most modern servers are very powerful and complex machines that need to keep up with the ever-increasing demand of IT and computational needs in the enterprise and consumer sectors. The lion’s share of today’s datacenters is comprised of x86-64 architecture-based systems, which feature up to 24 processing cores per processor. However, with many cores come many challenges: writing an application that fully takes advantage of this degree of parallelism is often far from trivial. The drastic increase in the number of processing cores during recent years is a technical necessity to maintain the growth of computational capability of server processors; due to the physical limits of scaling a single core’s frequency, scaling cores is the only alternative. As we will see, the majority of Hadoop clusters are implemented with two processors per system, while it is also possible to use servers that feature four or even eight processors per system. The concept of using multiple cores per CPU and/or using multiple CPUs is referred to as Symmetric Multi-Processing (SMP).
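On a Linux system you can quickly inspect this topology; the following is illustrative output for a hypothetical two-socket machine with 12 cores per processor.

# Show socket, core, and NUMA layout (output values are representative, not measured).
lscpu | grep -E 'Socket|Core|Thread|NUMA'
Thread(s) per core:    1
Core(s) per socket:    12
Socket(s):             2
NUMA node(s):          2
NUMA node0 CPU(s):     0-11
NUMA node1 CPU(s):     12-23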
Figure 2-1 shows a simplified block diagram of the relevant hardware components in a commodity server. In this example two CPUs are interconnected via a coherent inter-processor link. The CPU cores on each processor have separate L1 and L2 caches and typically share an L3 cache. Each processor implements a memory controller to attach DDR3/4 DRAM memory, which, as we describe in detail in the following section, makes this a so-called NUMA system. Input/output operations are implemented via a PCI-Express root complex, which attaches downstream I/O controllers for SATA/USB/Ethernet connections, etc. All CPU-internal components (cores, L3 cache, memory controller, PCI-Express root complex, and the interconnect unit) are themselves interconnected via an on-die interconnect bus. All commodity servers today abide by the general structure illustrated in the figure. Commodity servers that feature more than two CPUs will typically be organized in a ring topology via the CPU interconnect, but otherwise adhere to the same general structure as illustrated in Figure 2-1. While it is always possible to populate a two-socket server with only a single CPU, there are rarely any commodity servers with only a single socket today.
Figure 2-1. A modern computer
Non-Uniform Memory Access
The most important takeaway from the discussion around Symmetric Multi-Processing (SMP) is an understanding of the concept of Non-Uniform Memory Access (NUMA) that it incurs. When multiple processors share the memory in the system, the mechanism by which it is made accessible becomes an important factor in the overall system design. In some early multi-processor computer designs, all memory was exposed to the processors equally on a common bus or via a crossbar switch.
Today, this approach is mostly not practical. CPUs need to accommodate DRAM with bus speeds beyond 2 GHz, and because CPUs are considered modular, pluggable entities, each processor directly implements an interface to the DRAM. As a consequence, any program running on a given CPU that needs to access memory attached to another CPU must first traverse the inter-processor link. While the speed of this connection is in the multi-gigatransfers/s range and individual requests complete very quickly, running from another processor’s memory introduces a significant overhead when compared to running on the processor’s local memory. This distinction between local and distant memory is called Non-Uniform Memory Access (NUMA). A common example could be a process of a Hadoop service that is allowed to be very large and may actually be allocated in a memory range that physically must span both processors. In this scenario multiple threads could be running on both physical CPUs, trying to access a location of memory which is distant for some of these threads and local to others. This memory would, however, reside in each processor’s L3/L2/L1 caches to improve access speed. Upon each update the processors’ caches must reflect that update everywhere coherently; i.e., an update by a thread on processor 1 of a memory location that represents an integer number must materialize on processor 2 atomically before any thread on processor 2 reads the shared memory location. If processor 2 were to increase the value of the shared integer without seeing the prior update, that update would be lost and the value would be wrong. This coherence is maintained by the processor’s cache coherency protocol.
In order to expose information about the memory topology to programmers and users, and to provide a means to optimize runtime behavior on NUMA architectures, Linux provides tools and interfaces via which users and programs can influence NUMA behavior directly. Most importantly, this allows requesting optimal placement for applications on a given processor, which in NUMA terminology is called a NUMA node, not to be confused with a core or a thread within a processor.
As a Hadoop architect or engineer, the likelihood that you will deal with NUMA directly is fairly low, since the Linux NUMA subsystem makes mostly sensible decisions and modern Hadoop distributions explicitly request placement of processes on NUMA nodes. There are, however, problems with system performance that you will encounter which are related to NUMA, especially when you dedicate large amounts of memory to single Hadoop services such as reporting systems based on Hive/Impala or HBase. As a programmer, however, you should generally be aware of NUMA. If you know that your query, your Spark job, or your own framework will need more memory than is available on a single NUMA node, you should make conscious decisions about the NUMA policy that you run it with.
Let us briefly review how information about NUMA for a process in Linux can be obtained and influenced via the numactl command. Assume that we have a system with two processors, as indicated in Figure 2-1, and that each of the processors controls 128 GB of memory. Let’s start with displaying the available NUMA nodes, i.e., the processors on the system.
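An illustrative numactl --hardware listing for such a two-node machine could look like the following; the exact CPU lists and memory values are representative assumptions.

numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11
node 0 size: 130946 MB
node 0 free: 120290 MB
node 1 cpus: 12 13 14 15 16 17 18 19 20 21 22 23
node 1 size: 131072 MB
node 1 free: 122750 MB
node distances:
node   0   1
  0:  10  21
  1:  21  10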
In the first row of the output, we see the number of available NUMA nodes. Next, the amount of attached and free memory is shown per node, before finally the output lists a table of NUMA distances. Linux assigns a score of 10 for access to the local processor and 21 for an adjacent processor. Higher costs may be associated with topologies where there is no direct connection between the originating processor and the target processor, in which case access occurs by traversing an adjacent processor in order to reach the target. In the example above we see that most memory on this machine is not allocated and that existing allocations are fairly evenly distributed.
In Linux you can also display NUMA information for a process via the proc filesystem, as shown in the simplistic example below. Here we see how a YARN NodeManager maps the gcc runtime library:
cat /proc/<process-id>/numa_maps | grep libgcc
7f527fa8e000 prefer:1 file=/usr/lib64/libgcc_s-4.8.5-20150702.so.1 mapped=3 N0=3
7f527faa3000 prefer:1 file=/usr/lib64/libgcc_s-4.8.5-20150702.so.1
7f527fca2000 prefer:1 file=/usr/lib64/libgcc_s-4.8.5-20150702.so.1 anon=1 dirty=1 active=0 N0=1
7f527fca3000 prefer:1 file=/usr/lib64/libgcc_s-4.8.5-20150702.so.1 anon=1 dirty=1 active=0 N0=1
Let us analyze this output (there are more possible fields in this output; see the referenced documentation1):
<address>
The first entry shows us the starting address of the mapped region in the process’s virtual memory address space.
prefer:1
Shows the NUMA placement policy of the memory. It is always best practice to prefer a specific NUMA node, so that reads from distant memory are minimized. For processes that consume lots of memory there will be a point where the preference can no longer be fulfilled. This can easily happen for certain processes on Hadoop worker nodes, such as Impala daemons or HBase RegionServers.
file=
Shows which file backs this mapping. Often multiple disjoint mappings are created for a file, and often only part of the file is mapped.
N<node>=<number of mapped pages>
Shows how many pages are mapped on a certain node. This is what you should look out for, and it may indicate a performance problem when you see entries for many nodes (e.g., N0=50000 N1=50000).
Linux allows you to control the NUMA characteristics when a process is launched via the numactl command, which we have already seen above. numactl provides options that, on the one hand, control on which NUMA node a process runs and, on the other hand, where its memory is allocated. For example:
numactl --preferred=0 <process>
This will launch <process> and allocate its memory on node 0, but if memory allocation is not possible there, fall back to other nodes. When you launch a process this way, all of its children will inherit the same NUMA policy. In the numa_maps example above, which shows actual NUMA mappings, all entries have inherited their preference from the original command that started the NodeManager.
Hadoop distributions today may also leverage numactl to optimize the NUMA configuration for processes that are launched by their management tools, such as Cloudera Manager or Apache Ambari.
As illustrated in Figure 2-1, access to I/O hardware also occurs in a non-uniform fashion. Each processor implements its southbound I/O fabric via PCI-Express, which is a high-speed point-to-point communications protocol. This means that the I/O chip, which connects further southbound bus systems like SATA/SAS/Ethernet, can only connect to a single upstream PCI-Express root complex. For apparent reasons there is typically only a single I/O chipset, such that all but one of the processors are required to communicate via the inter-processor link before they can reach I/O. Even though I/O completion time may increase by up to 130% due to the additional hop, this overhead must be accepted, since all processors need to communicate with the outside world via a single I/O hub. However, when profiling workloads and debugging performance issues it is necessary to be mindful of NUMA for both computation and I/O.