Application Performance Optimization with 10 Gigabit Ethernet
BRKAPP-2011
High Performance Computing Clusters
HPC Applications, Parallel Applications
Data Delivery: NAS, Clustered NAS, Block/File Parallel File Systems, Object-Based Parallel File Systems
Designs
10 Gigabit Ethernet
10 GE has historically been a higher-priced interconnect, used primarily for inter-switch links within the data center, within the campus backbone, and for uplinks from high-density closets
Implementation at the server has been price prohibitive, with the exception of a few very specific applications, while the primary server interconnect has remained Gigabit Ethernet
Many changes are driving the move toward 10GE as a commonplace interconnect for server platforms, with significant benefits to many applications even though the full bandwidth is not being utilized
Per-port costs have dropped from near $12,000 two to three years ago to as low as $500 today
10GE adapter costs have been reduced to 25% of what they were in the recent past
10 Gigabit Ethernet – Host Changes
… and eight-way quad-core motherboards
Bus frequencies, memory bandwidth frequencies, kernel processing, context switching and message copies
… that does not allow the necessary throughput rates to feed more than four cores
10 GE Adapter Vendors
Applications
Database Cluster: Oracle Implementation with 10 Gigabit Ethernet
Oracle RAC in the Data Center
Multiple interconnects: Heartbeat, IPC, Cache Fusion, data load/unload
Oracle RAC Optimization
DB IPC communication acceleration
DB to app tier: potential acceleration
Oracle 11 has the ability to leverage SDP in asynchronous I/O mode using iWARP for IB and Ethernet with OFED 1.2
Oracle 10g uses UDP
Oracle 11g RAC – RDS is standard within OFED 1.4, for RDS over RDMA (iWARP) and RDS over TCP
IPC
5 cables – 1 Gbps max throughput for any single session (Storage Interconnect, Fail-Over Cluster Interconnect)
14 cables – 1 Gbps max throughput for any single session (Storage Interconnect, Fail-Over Cluster Interconnect)
5 cables – 10 Gbps max throughput for any single session (Storage Interconnect, Fail-Over Cluster Interconnect)
Financial Trading & Compute Clusters
Financial Trading and Compute Clusters
Banking world:
Algorithmic trading – up to 100s of machines
End-to-end latency is king – but not just low latency; latency deviation is just as critical
Compute machines for pricing and risk analysis – 10,000s to 100,000s of machines
Algorithmic Trading
'In any gun fight, it's not enough just to shoot fast or to shoot straight. Survival depends on being able to do both… The lone gun-slinger of the open-outcry trading … trading systems which are more akin to robots with …'
Deterministic Performance
The #1 problem in financial trading environments
Financials don't care about MIN(latency) or AVG(latency), but about STDDEV(latency) at the application level
A single frame dropped in a switch or adapter causes significant impact on performance:
TCP NACK delayed by up to 125 μs with most NICs when interrupt throttling is enabled
TCP window shortened; TCP retransmit timeout is 500 ms in the standard, usually 200 ms in implementations
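As a rough illustration (the latency samples below are hypothetical), a minimal Python sketch of the statistic the slide says matters – the deviation of application-level latency, which a single retransmit-triggering drop inflates far more than it moves the minimum or the average:

```python
# Minimal sketch with hypothetical sample data: trading firms watch the
# deviation of application-level latency, not its minimum or average.
import statistics

def latency_report(samples_us):
    return {
        "min_us": min(samples_us),
        "avg_us": round(statistics.mean(samples_us), 1),
        "stddev_us": round(statistics.pstdev(samples_us), 1),  # the number that matters
    }

clean = [30, 31, 29, 32, 30, 31]
# One dropped frame forcing a ~200 ms TCP retransmit barely moves min/avg
# but blows up the standard deviation.
with_retransmit = clean + [200_000]
print(latency_report(clean))
print(latency_report(with_retransmit))
```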
Why Is Latency / Performance a Problem?
[Diagram: trading data path – Exchange Systems, Trade Price, Market Data Supplier, Distribution Platform, Trading Engine, Risk Software, Exchange Systems, Exec Trade]
Response to changing market conditions is delayed by system latency, which creates significant loss of opportunity for trade execution and affects trading strategies.
The goal of low latency is to provide the required level of capacity to support …
Traffic Growth Next 12 Months
[Chart: gb_received and gb_sent with linear trend lines]
The Trading Challenge
[Chart: updates per second and mean latency]
Market Data – Algorithmic Trading
Financial Compute Cluster
… feedback mechanisms increase …
High-Performance Computing Clusters
HPC Network Communication
Access Network, Management Network, IPC Network, Storage Network
Two Key Concepts – Terminology
The ability to provide predictable (and large) computation throughput for experimentation, production runs and testing
The ability to provide peak power for a specific amount of time so as to solve a problem within a guaranteed time window
The two require different architectural solutions, but in practice the same infrastructure must deliver both. This leads naturally to concepts like virtualization, grid computing and dynamic provisioning.
Application Based Design Criteria
[Diagram: application requirements plotted on High–Low scales for Network Traffic, Bandwidth and Latency, alongside CPU Architecture (AMD/Intel), Storage System and Operating System]
Flexible Infrastructure
Tightly Coupled vs. Loosely Coupled
Job Mix
Running the same application with different inputs – parametric execution; parametric execution is widely used in HPC and accounts for more than 70% of cluster usage (see the sketch below)
Running multiple serial applications on one node, or one core per serial application run
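As a loose sketch of parametric execution (assuming a Slurm-style scheduler; the script name and parameter values are hypothetical), the same serial application is simply submitted once per input value:

```python
# Loose sketch of parametric execution, assuming a Slurm-style scheduler.
# "run_model.sh" and the pressure values are hypothetical placeholders.
import subprocess

pressures = [1.0, 1.5, 2.0, 2.5, 3.0]   # one independent serial run per input

for p in pressures:
    subprocess.run(
        ["sbatch", "--ntasks=1", "run_model.sh", f"--pressure={p}"],
        check=True,
    )
```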
Determine Network Topology
Non-Star: equivalent-sized non-blocking switch "building blocks"; sometimes combined with a Star architecture to provide a hybrid
Fat Tree
Fluent – CFD
Computational Fluid Dynamics
Uses parallel I/O for cell data (MPI-IO/ROMIO)
MPI communication is to nearest neighbors using 64K – 128K MPI messages
Benchmark performance is shown as a rating: the number of times the benchmark can be run in a 24-hour time frame
Rating calculated as follows: 24 hours is broken down into total seconds and multiplied by the number of cores used in the job to get total available CPU seconds; divide that total by the benchmark's CPU seconds (wall-clock time × cores)
24 × 60 × 60 = 86,400; 86,400 × 128 = 11,059,200; 11,059,200 / 229,112.27 = 48.3
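A short sketch (the helper name is ours, not Fluent's) reproducing the rating arithmetic shown above:

```python
# Reproduce the Fluent "rating" arithmetic from the slide. The helper name is
# ours; only the formula and the example numbers come from the presentation.
def fluent_rating(wall_clock_seconds: float, cores: int) -> float:
    """How many times the benchmark fits in a 24-hour window."""
    available_core_seconds = 24 * 60 * 60 * cores      # 86,400 s/day x cores
    used_core_seconds = wall_clock_seconds * cores     # one benchmark run
    return available_core_seconds / used_core_seconds  # equals 86,400 / wall clock

# Slide example: 128 cores, 229,112.27 core-seconds per run -> rating ~48.3
print(round(fluent_rating(229_112.27 / 128, 128), 1))
```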
Fluent – CFD
111 million cells – very large benchmark
Rating of 11.7 on GE; rating of 48.3 on 10GE – 412% better performance on 10GE (Intel Oplin, classical 10GE with no RDMA)
128 cores / 2 GB memory per core; parallel I/O with ROMIO/MPI-IO support in Fluent 12.0
Fluent – CFD – 10GE versus IB
111 million cells – very large benchmark
Rating of 48.3 on 10GE; rating of 51.8 over DDR IB – less than 7% performance variation (Intel Oplin, classical 10GE with no RDMA)
128 cores / 2 GB memory per core; parallel I/O with ROMIO/MPI-IO support in Fluent 12.0
High-Performance Computing Solution
Management and I/O Network
Used for job scheduling and network monitoring; TCP or UDP based – benefits from Quality of Service and Multicast
NetFlow reporting; NSF/SSO for high availability
Data Delivery
Storage Access Protocols and Technologies

Storage Type | Block or File Access | Server Access                  | Back-End Storage Access
NAS          | File                 | Ethernet or InfiniBand gateway | SCSI or Fiber Channel
Cluster NAS  | File                 | Ethernet or InfiniBand gateway | SCSI or Fiber Channel
SAN          | Block                | FC or InfiniBand gateway       | Fiber Channel, InfiniBand, or iSCSI
Network Attached Storage (NAS)
Attaches via connections to the network using Gigabit and 10 Gigabit Ethernet
Primarily uses NFS (the only standards-based file system in this space)
Performs well for small clusters but does not scale well
Single point of access and single point of failure
Clustered NAS
Attaches via connections to the network using Gigabit and 10 Gigabit Ethernet
Where the traditional NAS or NFS solution uses a single filer or server, a clustered NAS solution utilizes several heads, with storage connected directly to the heads or via some type of storage network (Fibre Channel)
Each filer head can only access the storage assigned to it, not the storage assigned to other filers
Clustered NAS
Access is limited to assigned storage
All filers have knowledge of the location of data, regardless of which storage and filer the data is located on
Depending on the implementation, data access occurs either via a process which moves data from one filer to another, or via an NFS gateway process with a parallel file system on the back end
Parallel File Systems
Attach via connections to the network using Gigabit Ethernet, 10 Gigabit Ethernet and InfiniBand
Provide multiple or parallel access to storage nodes, also known as I/O nodes
PFS nodes have access to direct-attached storage
Implementations are file/block based and/or object based
Parallel File Systems
For file/block-based implementations, the metadata service is one of the key bottlenecks to scalability
Example: file write requests are made to the metadata server, which allocates the block(s); the compute node then sends the data to the metadata server, which sends the data to the file system and then to disk
Metadata services are either dedicated or shared/clustered implementations
Parallel File Systems – Object Based
… functionality, as the nodes are not just storage bricks
The metadata server then passes a list of which storage …
Trang 44Parallel File Systems – Object Based
service is removed from the file operation as the node
will then write the data directly to an I/O node and then
to allow the location of the data to current within the
metadata records.
upon any number of variable The choice of network
upon any number of variable The choice of network
architecture, interconnect and switch fabric can and will have a significant impact to the performance
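A hedged, pseudocode-style contrast of the two write paths described above (every class and method name here is hypothetical, purely to make the data flows explicit):

```python
# Hypothetical sketch contrasting the two parallel-file-system write paths
# described above; none of these names come from any real PFS API.

def block_based_write(metadata_server, data):
    # Block/file-based path: the metadata server sits in the data path,
    # which is why it becomes the scalability bottleneck.
    blocks = metadata_server.allocate_blocks(len(data))
    metadata_server.write_through(blocks, data)        # data relayed to disk via MDS

def object_based_write(metadata_server, data):
    # Object-based path: the metadata server only hands out a node list;
    # the client writes directly to the I/O nodes.
    io_nodes = metadata_server.get_storage_node_list(len(data))
    chunks = split(data, len(io_nodes))
    for node, chunk in zip(io_nodes, chunks):
        node.write(chunk)                               # direct client -> I/O node
    metadata_server.update_locations(io_nodes)          # keep metadata records current

def split(data, n):
    step = max(1, len(data) // n)
    return [data[i:i + step] for i in range(0, len(data), step)]
```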
Data Latency – Seismic Processing
[Chart: interconnects – GbE, 10GbE, InfiniBand, FC; access methods – NFS, Parallel NFS, NAS, Clustered, Parallel File System]
Data Latency
… interconnects will drive more 10GbE interconnects than node connects until prices are at or near IB pricing
… more than 50% versus 10GbE …
Low Latency and Data Delivery
IB-connected Parallel File System:
– Dark Matter application (Gig to 10 Gbps): 35-hour run time with data delivered over GbE
O&G Seismic Processing, 120,000+ cores – Oil and Gas exploration (Gig to 10GbE): small jobs, 2x reduction in wall clock; large jobs, 16x reduction in wall clock
The issue is how to deliver tens of Gigabits/s of I/O to a large number of clusters
DOD/DOE labs share peta-scale storage systems across clusters with 10GbE
Data Delivery in HPC
Use of NFS filers and parallel file systems; direct-attached storage and FC SAN with a parallel file system
Lower wall-clock times are required in research and business
PFS and large data sets are driving 10GE for I/O node interconnect
Data latency impacts more applications than compute latency
… bandwidth fabric will affect less than 30% of many applications
Application Performance – Design and Latency
1. The time for encoding the packet for transmission and transmitting it,
2. the time for that serial data to traverse the network equipment between the nodes, and
3. the time to get the data off the circuit.
The lower bound on latency is determined by the distance between communicating devices and the speed at which the signal propagates in the circuits (typically 70-95% of the speed of light).
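A quick back-of-the-envelope calculation of that propagation lower bound (the distances and the 70% propagation factor below are illustrative assumptions):

```python
# Back-of-the-envelope lower bound on one-way latency from propagation alone.
# The distances and the 0.70 velocity factor are illustrative assumptions.
C = 299_792_458  # speed of light in vacuum, m/s

def propagation_latency_us(distance_m: float, velocity_factor: float = 0.70) -> float:
    return distance_m / (C * velocity_factor) * 1e6

print(f"{propagation_latency_us(100):.2f} us across ~100 m inside a data center")
print(f"{propagation_latency_us(100_000):.0f} us across ~100 km of metro fiber")
```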
End user of the data:
Is it a core in a parallel application?
Is it another application?
Is it an end user? 3+ ms one-way latency cost from the data center
It takes a batsman in cricket 400 ms to decide where and how to hit the ball when the bowler releases it.
It takes a normal human 250 ms just to recognize that data has been delivered to their screen – not to mention …
Sources of Latency in the Network Today
Blade switch ~3 us, Core ~7 us, ToR ~3 us
Ping/pong latency: 25 – 30 us
The problem needs to be solved end-to-end: applications, NIC, …
Latency Effects in an Ethernet World
100 us E2E at 1 GbE reduces throughput by 15-20%
100 us E2E at 10 GbE reduces throughput by 20-25%
Thank our good friend TCP for that (see the sketch below)
Storage-heavy applications: load/unload operations
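A minimal sketch of the mechanism behind those numbers, assuming a fixed TCP window: throughput is bounded by window/RTT (the bandwidth-delay product). The window size and the exact percentages depend on the workload and TCP tuning, so the figures below are illustrative rather than a reproduction of the slide's measurements.

```python
# Illustrative only: with a fixed window, TCP throughput <= window / RTT.
# The 64 KiB window and 100 us RTT are assumptions for the illustration.
def tcp_throughput_cap_bps(window_bytes: int, rtt_s: float) -> float:
    return window_bytes * 8 / rtt_s

def window_needed_bytes(link_bps: float, rtt_s: float) -> float:
    return link_bps * rtt_s / 8

rtt = 100e-6                       # 100 us end to end
for link in (1e9, 10e9):           # 1 GbE and 10 GbE
    need_kib = window_needed_bytes(link, rtt) / 1024
    cap_gbps = tcp_throughput_cap_bps(64 * 1024, rtt) / 1e9
    print(f"{link/1e9:.0f} GbE: ~{need_kib:.0f} KiB must be in flight; "
          f"a 64 KiB window caps flow rate at {cap_gbps:.2f} Gb/s")
```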
Traditional Server I/O Architecture
Access to the I/O resource is handled by the BIOS
A data packet is typically copied three to four times
CPU interrupts; bus bandwidth constrained; memory bus constrained
Adapter and Protocol Considerations
Fundamental part of any solution
Great advantage of Ethernet: a highly competitive and open marketplace
On-loading vs. off-loading camps, …
Linux vs. Windows vs. Solaris
iWARP – RDMA, RDDP, DDP
Single-sided offload with zero-copy kernel bypass
TCP over lossless Ethernet is just the beginning; alternative protocols are being considered
Kernel Bypass/Zero Copy Architecture
[Diagram: CPU, bypass-capable adapter, application memory, I/O memory pool]
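As a loose illustration of the "avoid extra copies" half of this picture only (true kernel bypass additionally needs an RDMA/iWARP-capable adapter and a verbs stack, which plain sockets cannot show), the sketch below uses os.sendfile so the kernel moves file pages straight into the socket without a user-space copy; the host, port and file path are hypothetical:

```python
# Loose illustration of zero copy only, not of RDMA kernel bypass: os.sendfile
# lets the kernel move file pages directly to the socket, skipping the usual
# read()-into-user-buffer copy. Host, port and path are hypothetical.
import os
import socket

def send_file_zero_copy(path: str, host: str, port: int) -> None:
    with socket.create_connection((host, port)) as sock, open(path, "rb") as f:
        size = os.fstat(f.fileno()).st_size
        offset = 0
        while offset < size:
            offset += os.sendfile(sock.fileno(), f.fileno(), offset, size - offset)

# send_file_zero_copy("/data/results.bin", "10.0.0.5", 9000)
```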
Latency Performance Comparison
[Chart: latency for TCP/UDP, SDP (OFED 1.3) and MPI (OFED 1.3) over Gigabit Ethernet, 10 GE, 10 GE zero copy, 10G LLE, SDR IB and DDR IB; MPI stacks: MVAPICH, OMPI]
Switch Architecture Value
[Chart annotation: 20 μs]
Implemented and Reference Designs
Non-blocking at each switch
Suited for nearest-neighbor communication
120 Gbps of bandwidth to each neighbor
4 neighbors for each switch node
2D-Torus – 2400 Nodes
480 servers; non-blocking at each switch
120 Gbps of bandwidth to each neighbor
Total of 48 x 10GE uplinks; 12 VLANs / 4-port bundles from …
4 neighbors for each switch node
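A quick arithmetic check of the uplink budget above, assuming the 48 x 10GE uplinks per switch are split evenly across the four torus neighbors (that even split is our assumption, not stated on the slide):

```python
# Sanity check of the 2D-torus uplink budget, assuming the 48 x 10GE uplinks
# per switch are split evenly across the four neighbors (our assumption).
uplinks = 48          # 10GE uplinks per switch
link_gbps = 10
neighbors = 4
vlans = 12            # port bundles / VLANs
bundle_ports = 4      # ports per bundle

total_gbps = uplinks * link_gbps              # 480 Gbps of uplink capacity
per_neighbor_gbps = total_gbps / neighbors    # 120 Gbps to each neighbor
assert vlans * bundle_ports == uplinks        # 12 bundles x 4 ports = 48 uplinks
print(total_gbps, per_neighbor_gbps)          # 480 120.0
```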
Ring Design with L3 ECMP
240 Gbps BW
Ring: Notes on Design
10GE connections
[Diagram: server groups apollo 01-16 through apollo 113-128, connected via fex101]
L2MP technologies
Layer 2 Multipath
EtherChannel
VSS (Virtual Switching System)
vPC (virtual Port Channel)
EHV (Ethernet Host Virtualizer)
Cisco DCE
TRILL
Going Beyond Spanning Tree
Today Ethernet forwarding is done according to Spanning Tree
In trees, going from the root toward the leaves, branches get smaller
In 2009/2010 data centers, most of the links will be 10GE – single-size branches
There is a concrete need to go beyond Spanning Tree