Application Performance Optimization with 10 Gigabit Ethernet
BRKAPP-2011
High Performance Computing Clusters
HPC Applications, Parallel Applications
Data Delivery: NAS, Clustered NAS, Block/File Parallel File Systems, Object-Based Parallel File Systems
Designs
10 Gigabit Ethernet
10 GE has historically been a higher-priced interconnect, used primarily for inter-switch links within the data center, within the campus backbone, and for uplinks from high-density closets
Implementation at the server has been price prohibitive, with the exception of a few very specific applications, while the primary server interconnect has remained Gigabit Ethernet
Many changes are driving the move toward 10GE as a commonplace interconnect for server platforms, with significant benefits to many applications even though the full bandwidth is not being utilized
Per-port costs have dropped from near $12,000 two to three years ago to as low as $500 today
10GE adapter costs have been reduced to 25% of what they were in the recent past
10 Gigabit Ethernet – Host Changes
… and eight-way quad-core motherboards
Bus frequencies, memory bandwidth frequencies, kernel processing, context switching and message copies
… that does not allow the necessary throughput rates to feed more than four cores
10 GE Adapter Vendors
Applications
Database Cluster: Oracle Implementation with 10 Gigabit Ethernet
Oracle RAC in the Data Center
Multiple interconnects: Heartbeat, IPC, Cache Fusion, data load/unload
Oracle RAC Optimization
DB IPC communication acceleration
DB to app tier: potential acceleration
Oracle 11 has the ability to leverage SDP in asynchronous I/O mode using iWARP for IB and Ethernet with OFED 1.2
Oracle 10g uses UDP
Oracle 11g RAC – RDS is standard within OFED 1.4, for RDS over RDMA (iWARP) and RDS over TCP
IPC
5 cables – 1 Gbps max throughput for any single session (Storage Interconnect, Fail-Over Cluster Interconnect)
14 cables – 1 Gbps max throughput for any single session (Storage Interconnect, Fail-Over Cluster Interconnect)
5 cables – 10 Gbps max throughput for any single session (Storage Interconnect, Fail-Over Cluster Interconnect)
Financial Trading & Compute Clusters
Financial Trading and Compute Clusters
Banking world:
Algorithmic trading – up to 100s of machines
End-to-end latency is king – but not just low latency; latency deviation is just as critical
Compute machines for pricing and risk analysis – 10,000s to 100,000s of machines
Algorithmic Trading
'In any gun fight, it's not enough just to shoot fast or to shoot straight. Survival depends on being able to do both… The lone gun-slinger of the open-outcry trading … trading systems which are more akin to robots with …'
Deterministic Performance
The #1 problem in financial trading environments
Financials don't care about MIN(latency) or AVG(latency), but about STDDEV(latency) at the application level
A single frame dropped in a switch or adapter causes significant impact on performance:
TCP NACK delayed by up to 125 μs with most NICs when interrupt throttling is enabled
TCP window shortened; TCP retransmit timeout is 500 ms in the standard, usually 200 ms in implementations
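As a rough illustration (the latency samples below are hypothetical), a minimal Python sketch of the statistic the slide says matters – the deviation of application-level latency, which a single retransmit-triggering drop inflates far more than it moves the minimum or the average:

```python
# Minimal sketch with hypothetical sample data: trading firms watch the
# deviation of application-level latency, not its minimum or average.
import statistics

def latency_report(samples_us):
    return {
        "min_us": min(samples_us),
        "avg_us": round(statistics.mean(samples_us), 1),
        "stddev_us": round(statistics.pstdev(samples_us), 1),  # the number that matters
    }

clean = [30, 31, 29, 32, 30, 31]
# One dropped frame forcing a ~200 ms TCP retransmit barely moves min/avg
# but blows up the standard deviation.
with_retransmit = clean + [200_000]
print(latency_report(clean))
print(latency_report(with_retransmit))
```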
Why Is Latency / Performance a Problem?
[Diagram: trading data path – Exchange Systems, Trade Price, Market Data Supplier, Distribution Platform, Trading Engine, Risk Software, Exchange Systems, Exec Trade]
Response to changing market conditions is delayed by system latency, which creates significant loss of opportunity for trade execution and affects trading strategies.
The goal of low latency is to provide the required level of capacity to support …
Traffic Growth Next 12 Months
[Chart: gb_received and gb_sent with linear trend lines]
The Trading Challenge
[Chart: updates per second and mean latency]
Market Data – Algorithmic Trading
Financial Compute Cluster
… feedback mechanisms increase …
High-Performance Computing Clusters
HPC Network Communication
Access Network, Management Network, IPC Network, Storage Network
Two Key Concepts – Terminology
The ability to provide predictable (and large) computation throughput for experimentation, production runs and testing
The ability to provide peak power for a specific amount of time so as to solve a problem within a guaranteed time window
The two require different architectural solutions, but in practice the same infrastructure must deliver both. This leads naturally to concepts like virtualization, grid computing and dynamic provisioning.
Application Based Design Criteria
[Diagram: application requirements plotted on High–Low scales for Network Traffic, Bandwidth and Latency, alongside CPU Architecture (AMD/Intel), Storage System and Operating System]
Flexible Infrastructure
Tightly Coupled vs. Loosely Coupled
Job Mix
Running the same application with different inputs – parametric execution; parametric execution is widely used in HPC and accounts for more than 70% of cluster usage (see the sketch below)
Running multiple serial applications on one node, or one core per serial application run
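As a loose sketch of parametric execution (assuming a Slurm-style scheduler; the script name and parameter values are hypothetical), the same serial application is simply submitted once per input value:

```python
# Loose sketch of parametric execution, assuming a Slurm-style scheduler.
# "run_model.sh" and the pressure values are hypothetical placeholders.
import subprocess

pressures = [1.0, 1.5, 2.0, 2.5, 3.0]   # one independent serial run per input

for p in pressures:
    subprocess.run(
        ["sbatch", "--ntasks=1", "run_model.sh", f"--pressure={p}"],
        check=True,
    )
```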
Determine Network Topology
Non-Star: equivalent-sized non-blocking switch "building blocks"; sometimes combined with a Star architecture to provide a hybrid
Fat Tree
Fluent – CFD
Computational Fluid Dynamics
Uses parallel I/O for cell data (MPI-IO/ROMIO)
MPI communication is to nearest neighbors using 64K – 128K MPI messages
Benchmark performance is shown as a rating: the number of times the benchmark can be run in a 24-hour time frame
Rating calculated as follows: 24 hours is broken down into total seconds and multiplied by the number of cores used in the job to get total available CPU seconds; divide that total by the benchmark's CPU seconds (wall-clock time × cores)
24 × 60 × 60 = 86,400; 86,400 × 128 = 11,059,200; 11,059,200 / 229,112.27 = 48.3
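A short sketch (the helper name is ours, not Fluent's) reproducing the rating arithmetic shown above:

```python
# Reproduce the Fluent "rating" arithmetic from the slide. The helper name is
# ours; only the formula and the example numbers come from the presentation.
def fluent_rating(wall_clock_seconds: float, cores: int) -> float:
    """How many times the benchmark fits in a 24-hour window."""
    available_core_seconds = 24 * 60 * 60 * cores      # 86,400 s/day x cores
    used_core_seconds = wall_clock_seconds * cores     # one benchmark run
    return available_core_seconds / used_core_seconds  # equals 86,400 / wall clock

# Slide example: 128 cores, 229,112.27 core-seconds per run -> rating ~48.3
print(round(fluent_rating(229_112.27 / 128, 128), 1))
```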
Fluent – CFD
111 million cells – very large benchmark
Rating of 11.7 on GE; rating of 48.3 on 10GE – 412% better performance on 10GE (Intel Oplin, classical 10GE with no RDMA)
128 cores / 2 GB memory per core; parallel I/O with ROMIO/MPI-IO support in Fluent 12.0
Fluent – CFD – 10GE versus IB
111 million cells – very large benchmark
Rating of 48.3 on 10GE; rating of 51.8 over DDR IB – less than 7% performance variation (Intel Oplin, classical 10GE with no RDMA)
128 cores / 2 GB memory per core; parallel I/O with ROMIO/MPI-IO support in Fluent 12.0
High-Performance Computing Solution
Management and I/O Network
Used for job scheduling and network monitoring; TCP or UDP based – benefits from Quality of Service and Multicast
NetFlow reporting; NSF/SSO for high availability
Data Delivery
Storage Access Protocols and Technologies

Storage Type | Block or File Access | Server Access                  | Back-End Storage Access
NAS          | File                 | Ethernet or InfiniBand gateway | SCSI or Fiber Channel
Cluster NAS  | File                 | Ethernet or InfiniBand gateway | SCSI or Fiber Channel
SAN          | Block                | FC or InfiniBand gateway       | Fiber Channel, InfiniBand, or iSCSI
Network Attached Storage (NAS)
Attaches via connections to the network using Gigabit and 10 Gigabit Ethernet
Primarily uses NFS (the only standards-based file system in this space)
Performs well for small clusters but does not scale well
Single point of access and single point of failure
Clustered NAS
Attaches via connections to the network using Gigabit and 10 Gigabit Ethernet
Where the traditional NAS or NFS solution uses a single filer or server, a clustered NAS solution utilizes several heads, with storage connected directly to the heads or via some type of storage network (Fibre Channel)
Each filer head can only access the storage assigned to it, not the storage assigned to other filers
Clustered NAS
Access is limited to assigned storage
All filers have knowledge of the location of data, regardless of which storage and filer the data is located on
Depending on the implementation, data access occurs either via a process which moves data from one filer to another, or via an NFS gateway process with a parallel file system on the back end
Parallel File Systems
Attach via connections to the network using Gigabit Ethernet, 10 Gigabit Ethernet and InfiniBand
Provide multiple or parallel access to storage nodes, also known as I/O nodes
PFS nodes have access to direct-attached storage
Implementations are file/block based and/or object based
Parallel File Systems
For file/block-based implementations, the metadata service is one of the key bottlenecks to scalability
Example: file write requests are made to the metadata server, which allocates the block(s); the compute node then sends the data to the metadata server, which sends the data to the file system and then to disk
Metadata services are either dedicated or shared/clustered implementations
Parallel File Systems – Object Based
… functionality, as the nodes are not just storage bricks
The metadata server then passes a list of which storage …
Trang 44Parallel File Systems – Object Based
service is removed from the file operation as the node
will then write the data directly to an I/O node and then
to allow the location of the data to current within the
metadata records.
upon any number of variable The choice of network
upon any number of variable The choice of network
architecture, interconnect and switch fabric can and will have a significant impact to the performance
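A hedged, pseudocode-style contrast of the two write paths described above (every class and method name here is hypothetical, purely to make the data flows explicit):

```python
# Hypothetical sketch contrasting the two parallel-file-system write paths
# described above; none of these names come from any real PFS API.

def block_based_write(metadata_server, data):
    # Block/file-based path: the metadata server sits in the data path,
    # which is why it becomes the scalability bottleneck.
    blocks = metadata_server.allocate_blocks(len(data))
    metadata_server.write_through(blocks, data)        # data relayed to disk via MDS

def object_based_write(metadata_server, data):
    # Object-based path: the metadata server only hands out a node list;
    # the client writes directly to the I/O nodes.
    io_nodes = metadata_server.get_storage_node_list(len(data))
    chunks = split(data, len(io_nodes))
    for node, chunk in zip(io_nodes, chunks):
        node.write(chunk)                               # direct client -> I/O node
    metadata_server.update_locations(io_nodes)          # keep metadata records current

def split(data, n):
    step = max(1, len(data) // n)
    return [data[i:i + step] for i in range(0, len(data), step)]
```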
Data Latency – Seismic Processing
[Chart: interconnects – GbE, 10GbE, InfiniBand, FC; access methods – NFS, Parallel NFS, NAS, Clustered, Parallel File System]
Data Latency
… interconnects will drive more 10GbE interconnects than node connects until prices are at or near IB pricing
… more than 50% versus 10GbE …
Low Latency and Data Delivery
IB-connected Parallel File System:
– Dark Matter application (Gig to 10 Gbps): 35-hour run time with data delivered over GbE
O&G Seismic Processing, 120,000+ cores – Oil and Gas exploration (Gig to 10GbE): small jobs, 2x reduction in wall clock; large jobs, 16x reduction in wall clock
The issue is how to deliver tens of Gigabits/s of I/O to a large number of clusters
DOD/DOE labs share peta-scale storage systems across clusters with 10GbE
Data Delivery in HPC
Use of NFS filers and parallel file systems; direct-attached storage and FC SAN with a parallel file system
Lower wall-clock times are required in research and business
PFS and large data sets are driving 10GE for I/O node interconnect
Data latency impacts more applications than compute latency
… bandwidth fabric will affect less than 30% of many applications
Application Performance – Design and Latency
1. The time for encoding the packet for transmission and transmitting it,
2. the time for that serial data to traverse the network equipment between the nodes, and
3. the time to get the data off the circuit.
The lower bound on latency is determined by the distance between communicating devices and the speed at which the signal propagates in the circuits (typically 70-95% of the speed of light).
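A quick back-of-the-envelope calculation of that propagation lower bound (the distances and the 70% propagation factor below are illustrative assumptions):

```python
# Back-of-the-envelope lower bound on one-way latency from propagation alone.
# The distances and the 0.70 velocity factor are illustrative assumptions.
C = 299_792_458  # speed of light in vacuum, m/s

def propagation_latency_us(distance_m: float, velocity_factor: float = 0.70) -> float:
    return distance_m / (C * velocity_factor) * 1e6

print(f"{propagation_latency_us(100):.2f} us across ~100 m inside a data center")
print(f"{propagation_latency_us(100_000):.0f} us across ~100 km of metro fiber")
```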
End user of the data:
Is it a core in a parallel application?
Is it another application?
Is it an end user? 3+ ms one-way latency cost from the data center
It takes a batsman in cricket 400 ms to decide where and how to hit the ball when the bowler releases it.
It takes a normal human 250 ms just to recognize that data has been delivered to their screen – not to mention …
Sources of Latency in the Network Today
Blade switch ~3 us, Core ~7 us, ToR ~3 us
Ping/pong latency: 25 – 30 us
The problem needs to be solved end-to-end: applications, NIC, …
Latency Effects in an Ethernet World
100 us E2E at 1 GbE reduces throughput by 15-20%
100 us E2E at 10 GbE reduces throughput by 20-25%
Thank our good friend TCP for that (see the sketch below)
Storage-heavy applications: load/unload operations
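A minimal sketch of the mechanism behind those numbers, assuming a fixed TCP window: throughput is bounded by window/RTT (the bandwidth-delay product). The window size and the exact percentages depend on the workload and TCP tuning, so the figures below are illustrative rather than a reproduction of the slide's measurements.

```python
# Illustrative only: with a fixed window, TCP throughput <= window / RTT.
# The 64 KiB window and 100 us RTT are assumptions for the illustration.
def tcp_throughput_cap_bps(window_bytes: int, rtt_s: float) -> float:
    return window_bytes * 8 / rtt_s

def window_needed_bytes(link_bps: float, rtt_s: float) -> float:
    return link_bps * rtt_s / 8

rtt = 100e-6                       # 100 us end to end
for link in (1e9, 10e9):           # 1 GbE and 10 GbE
    need_kib = window_needed_bytes(link, rtt) / 1024
    cap_gbps = tcp_throughput_cap_bps(64 * 1024, rtt) / 1e9
    print(f"{link/1e9:.0f} GbE: ~{need_kib:.0f} KiB must be in flight; "
          f"a 64 KiB window caps flow rate at {cap_gbps:.2f} Gb/s")
```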
Traditional Server I/O Architecture
Access to the I/O resource is handled by the BIOS
A data packet is typically copied three to four times
CPU interrupts; bus bandwidth constrained; memory bus constrained
Adapter and Protocol Considerations
Fundamental part of any solution
Great advantage of Ethernet: a highly competitive and open marketplace
On-loading vs. off-loading camps, …
Linux vs. Windows vs. Solaris
iWARP – RDMA, RDDP, DDP
Single-sided offload with zero-copy kernel bypass
TCP over lossless Ethernet is just the beginning; alternative protocols are being considered
Kernel Bypass/Zero Copy Architecture
[Diagram: CPU, bypass-capable adapter, application memory, I/O memory pool]
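As a loose illustration of the "avoid extra copies" half of this picture only (true kernel bypass additionally needs an RDMA/iWARP-capable adapter and a verbs stack, which plain sockets cannot show), the sketch below uses os.sendfile so the kernel moves file pages straight into the socket without a user-space copy; the host, port and file path are hypothetical:

```python
# Loose illustration of zero copy only, not of RDMA kernel bypass: os.sendfile
# lets the kernel move file pages directly to the socket, skipping the usual
# read()-into-user-buffer copy. Host, port and path are hypothetical.
import os
import socket

def send_file_zero_copy(path: str, host: str, port: int) -> None:
    with socket.create_connection((host, port)) as sock, open(path, "rb") as f:
        size = os.fstat(f.fileno()).st_size
        offset = 0
        while offset < size:
            offset += os.sendfile(sock.fileno(), f.fileno(), offset, size - offset)

# send_file_zero_copy("/data/results.bin", "10.0.0.5", 9000)
```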
Latency Performance Comparison
[Chart: latency for TCP/UDP, SDP (OFED 1.3) and MPI (OFED 1.3) over Gigabit Ethernet, 10 GE, 10 GE zero copy, 10G LLE, SDR IB and DDR IB; MPI stacks: MVAPICH, OMPI]
Switch Architecture Value
[Chart annotation: 20 μs]
Implemented and Reference Designs
Non-blocking at each switch
Suited for nearest-neighbor communication
120 Gbps of bandwidth to each neighbor
4 neighbors for each switch node
2D-Torus – 2400 Nodes
480 servers; non-blocking at each switch
120 Gbps of bandwidth to each neighbor
Total of 48 x 10GE uplinks; 12 VLANs / 4-port bundles from …
4 neighbors for each switch node
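A quick arithmetic check of the uplink budget above, assuming the 48 x 10GE uplinks per switch are split evenly across the four torus neighbors (that even split is our assumption, not stated on the slide):

```python
# Sanity check of the 2D-torus uplink budget, assuming the 48 x 10GE uplinks
# per switch are split evenly across the four neighbors (our assumption).
uplinks = 48          # 10GE uplinks per switch
link_gbps = 10
neighbors = 4
vlans = 12            # port bundles / VLANs
bundle_ports = 4      # ports per bundle

total_gbps = uplinks * link_gbps              # 480 Gbps of uplink capacity
per_neighbor_gbps = total_gbps / neighbors    # 120 Gbps to each neighbor
assert vlans * bundle_ports == uplinks        # 12 bundles x 4 ports = 48 uplinks
print(total_gbps, per_neighbor_gbps)          # 480 120.0
```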
Ring Design with L3 ECMP
240 Gbps BW
Ring: Notes on Design
10GE connections
[Diagram: server groups apollo 01-16 through apollo 113-128, connected via fex101]
L2MP technologies
Layer 2 Multipath
EtherChannel
VSS (Virtual Switching System)
vPC (virtual Port Channel)
EHV (Ethernet Host Virtualizer)
Cisco DCE
TRILL
Going Beyond Spanning Tree
Today Ethernet forwarding is done according to Spanning Tree
In trees, going from the root toward the leaves, branches get smaller
In 2009/2010 data centers, most of the links will be 10GE – single-size branches
There is a concrete need to go beyond Spanning Tree