APEX 2020 Technical Requirements
Lawrence Berkeley National Laboratory is operated by the University of California for the U.S.
Department of Energy under contract No. DE-AC02-05CH11231.
Los Alamos National Laboratory, an affirmative action/equal opportunity employer, is operated
by Los Alamos National Security, LLC, for the National Nuclear Security Administration of the
U.S. Department of Energy under contract DE-AC52-06NA25396. LA-UR-15-28541. Approved for
public release; distribution is unlimited.
Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia
Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S.
Department of Energy's National Nuclear Security Administration under contract
DE-AC04-94AL85000. SAND2016-4325 O.
2.3 Product Roadmap Description
3 Targets for System Design, Features, and Performance Metrics
3.1 Scalability
3.2 System Software and Runtime
3.3 Software Tools and Programming Environment
3.4 Platform Storage
3.5 Application Performance
3.6 Resilience, Reliability, and Availability
3.7 Application Transition Support and Early Access to APEX Technologies
3.8 Target System Configuration
3.9 System Operations
3.10 Power and Energy
3.11 Facilities and Site Integration
5.1 Upgrades, Expansions and Additions
5.2 Early Access Development System
5.3 Test Systems
5.4 On Site System and Application Software Analysts
5.5 Deinstallation
5.6 Maintenance and Support
6.1 Pre-delivery Testing
6.2 Site Integration and Post-delivery Testing
6.3 Acceptance Testing
8.1 Documentation
8.2 Training
Appendix B: LANS/UC Specific Project Management Requirements
1 Introduction
Los Alamos National Security, LLC (LANS), in furtherance of its participation
in the Alliance for Computing at Extreme Scale (ACES), a collaboration
between Los Alamos National Laboratory and Sandia National Laboratories;
in coordination with the Regents of the University of California (UC), which
operates the National Energy Research Scientific Computing (NERSC) Center
residing within the Lawrence Berkeley National Laboratory (LBNL), is
releasing a joint Request for Proposal (RFP) for two next generation systems,
Crossroads and NERSC-9, under the Alliance for application Performance at
EXtreme scale (APEX), to be delivered in the 2020 time frame.
The successful Offeror will be responsible for delivering and installing the
Crossroads and NERSC-9 systems at their respective locations. The targets/
requirements in this document are predominantly joint targets/
requirements for the two systems; however, where differences between the
systems are described, Offerors should provide clear and complete details
showing how their proposed Crossroads and NERSC-9 systems differ.
Each response/proposed solution within this document shall clearly describe
the role of any lower-tier subcontractor(s) and the technology or
technologies, both hardware and software, and value added that the
lower-tier subcontractor(s) provide(s), where appropriate.
The scope of work and technical specifications for any subcontracts resulting
from this RFP will be negotiated based on this Technical Requirements
Document and the successful Offeror's responses/proposed solutions.
Crossroads and NERSC-9 each have maximum funding limits over their
system lives, to include all design and development, site preparation,
maintenance, support and analysts. Total ownership costs will be considered
in system selection. The Offeror must respond with a configuration and
pricing for both systems.
Application performance and workflow efficiency are essential to these
procurements. Success will be defined as meeting APEX 2020 mission needs
while at the same time serving as a pre-exascale system that enables our
applications to begin to evolve using yet to be defined next generation
programming models. The advanced technology aspects of the APEX systems
will be pursued both by fielding first of a kind technologies on the path to
exascale as part of system build and by selecting and participating in
strategic NRE projects with the Offeror and applicable technology providers.
A compelling set of NRE projects will be crucial for the success of these
platforms, by enabling the deployment of first of a kind technologies in such a
way as to maximize their utility. The NRE areas of collaboration should
provide substantial value to the Crossroads and NERSC-9 systems with the
goals of:
Increasing application performance
Increasing workflow efficiency
Increasing the resilience and reliability of the system
The details of the NRE are more completely described in section 4.
To support the goals of application performance and workflow efficiency, an
accompanying whitepaper, “APEX Workflows,” is provided that describes
how application teams use High Performance Computing (HPC) resources
today to advance scientific goals. The whitepaper is designed to provide a
framework for reasoning about the optimal solution to these challenges (The
Crossroads/NERSC-9 workflows document can be found on the APEX
website.)
1.1 Crossroads
The Department of Energy (DOE) National Nuclear Security Administration
(NNSA) Advanced Simulation and Computing (ASC) Program requires a
computing system be deployed in 2020 to support the Stockpile Stewardship
Program. In the 2020 timeframe, Trinity, the first ASC Advanced Technology
System (ATS-1), will be nearing the end of its useful lifetime. Crossroads, the
proposed ATS-3 system, provides a replacement, tri-lab computing resource
for existing simulation codes and provides a larger resource for
ever-increasing computing requirements to support the weapons program. The
Crossroads system, to be sited at Los Alamos, NM, is projected to provide a
large portion of the ATS resources for the NNSA ASC tri-lab simulation
community: Los Alamos National Laboratory (LANL), Sandia National
Laboratories (SNL), and Lawrence Livermore National Laboratory (LLNL),
during the 2021-2025 timeframe.
In order to fulfill its mission, the NNSA Stockpile Stewardship Program
requires higher performance computational resources than are currently
available within the Nuclear Security Enterprise (NSE). These capabilities are
required for supporting stockpile stewardship certification and assessments
to ensure that the nation’s nuclear stockpile is safe, reliable, and secure.
The ASC Program faces significant challenges from the ongoing
technology revolution. It must continue to meet the mission needs of the
current applications but also adapt to radical change in technology in order
to continue running the most demanding applications in the future. The ASC
Program recognizes that the simulation environment of the future will be
transformed with new computing architectures and new programming
models that will take advantage of the new architectures. Within this context,
ASC recognizes that ASC applications must begin the transition to the new
simulation environment or they may become obsolete as a result of not
leveraging technology driven by market trends. With this challenge of
technology change, it is a major programmatic driver to provide an
architecture that keeps ASC moving forward and allows applications to fully
explore and exploit upcoming technologies, in addition to meeting NNSA
Defense Programs’ mission needs. It is possible that major modifications to
the ASC simulation tools will be required in order to take full advantage of
the new technology. However, codes running on NNSA Advanced Technology
Systems (Trinity and Sierra) in the 2019 timeframe are expected to run on
Crossroads. In some cases, new applications also may need to be developed.
Crossroads is expected to help technology development for the ASC Program
to meet the requirements of future systems with greater computational
performance or capability. Crossroads will serve as a technology path for
future ASC systems in the next decade.
To directly support the ASC Roadmap, which states that “work in this
timeframe will establish a strong technological foundation to build toward
exascale computing environments, which predictive capability may demand,”
it is critical for the ASC Program to both explore the rapidly changing
technology of future systems and to provide systems with higher
performance and more memory capacity for predictive capability. Therefore,
a design goal of Crossroads is to achieve a balance between usability of
current NNSA ASC simulation codes and adaptation to new computing
technologies.
1.2 NERSC-9
The DOE Office of Science (SC) requires a high performance production
computing system in the 2020 timeframe to provide a significant upgrade to
the current computational and data capabilities that support the basic and
applied research programs that help accomplish the mission of DOE SC.
The system also needs to provide a firm foundation for future exascale
systems in 2023 and beyond, a need identified in the DOE’s Strategic
Plan 2014-2018, which calls for “advanced scientific computing to analyze,
model, simulate and predict complex phenomena, including the scientific
potential that exascale simulation and data will provide in the future.”
The NERSC Center supports nearly 6000 users and about 600 different
application codes from a broad range of science disciplines covering all six
program offices in SC. The scientific goals are well summarized in the
2012-2014 series of requirements reviews commissioned by the Advanced
Scientific Computing Research (ASCR) office that brought together
application scientists, computer scientists, applied mathematicians, DOE
program managers and NERSC personnel. The 2012-2014 requirements
reviews indicated that compute-intensive research and research that
attempts scientific discovery through the analysis of experimental and
observational data both have a clear need for major increases in
computational capability and capacity in the 2017 timeframe and beyond. In
addition, several science areas also have a burgeoning need for HPC
resources that satisfy an increased compute workload and provide strong
support for data-centric workflows and real-time observational science.
More details about the DOE SC application requirements are in the reviews
located at: http://www.nersc.gov/science/hpc-requirements-reviews/
NERSC has already begun transitioning the SC user base to energy efficient
architectures, with the procurement of the NERSC-8 “Cori” system. In the
2020 time frame, NERSC also expects a need to address early exascale
hardware and software technologies, including the areas of processor
technology, memory hierarchies, networking technology, and programming
models.
The NERSC-9 system is expected to run for 4-6 years and will be housed in
Wang Hall (Building 59) at LBNL, which currently houses the “Cori” system
and other resources that NERSC supports. The system must integrate into
the NERSC environment and provide high bandwidth access to existing data
stored by continuing research projects. For more information about NERSC
and the current systems, environment, and support provided for our users,
see http://www.nersc.gov
1.3 Schedule
The following is the tentative schedule for the Crossroads and NERSC-9
systems.
Table 1 Crossroads/NERSC-9 Schedule
2 System Description
2.1 Architectural Description
The Offeror shall provide a detailed full system architectural description of
both the Crossroads and NERSC-9 systems, including diagrams and text
describing the following details as they pertain to the Offeror’s system
architecture(s):
Component architecture – details of all processor(s), memory
technologies, storage technologies, network interconnect(s) and any
other applicable components.
Node architecture(s) – details of how components are combined into the
node architecture(s). Details shall include bandwidth and latency
specifications (or projections) between components.
Board and/or blade architecture(s) – details of how the node
architecture(s) is integrated at the board and/or blade level. Details
should include all inter-node and inter-board/blade communication
paths and any additional board/blade level components.
Rack and/or cabinet architecture(s) – details of how board and/or blades
are organized and integrated into racks and/or cabinets. Details should
include all inter rack/cabinet communication paths and any additional
rack/cabinet level components.
Platform storage – details of how storage is integrated with the system,
including a platform storage architectural diagram.
System architecture – details of how rack or cabinets are combined to
produce system architecture, including the high-speed interconnects and
network topologies (if multiple) and platform storage.
Proposed floor plan – including details of the physical footprint of the
system and all of the supporting components.
2.2 Software Description
The Offeror shall provide a detailed description of the proposed software
eco-system, including a high-level software architectural diagram, the
provenance of each software component (for example, open source or
proprietary), and the support mechanism for each (for the lifetime of the
system, including updates).
2.3 Product Roadmap Description
The Offeror shall describe how the system does or does not fit into the
Offeror’s long-term product roadmap and a potential follow-on system
acquisition in the 2025 and beyond timeframe.
3 Targets for System Design, Features, and
Performance Metrics
This section contains targets for detailed system design, features and
performance metrics. It is desirable that the Offeror’s proposal meet or
exceed the targets outlined in this section If a target cannot be met, it is
desirable that the Offeror provide a development and deployment plan,
including a schedule, to satisfy the target.
The Offeror may also propose any hardware and/or software architectural
features that will provide improvements for any aspect of the system.
3.1 Scalability
The scale of the system necessary to meet the application requirements of
the APEX laboratories adds significant challenges. The Offeror should propose
a system that enables application performance up to the full scale of the
system. Additionally, the proposed system should provide functionality that
assists users in obtaining performance at up to full scale. Scalability
features, both hardware and software, that benefit both current and future
programming models are essential.
1.1.1 The system should support running jobs up to and including the full scale of
the system
1.1.2 The system should support launching an application at full system scale in
less than 30 seconds. The Offeror shall describe factors (such as executable
size) that could potentially affect application launch time.
1.1.3 The Offeror shall describe how application launch scales with the number of
concurrent launch requests (per second) and the scale of each launch request
(resources requested, such as the number of schedulable units, etc.),
including information such as:
All system-level and node-level overhead in the process startup including
how overhead scales with node count for parallel applications, or how
overhead scales with the application count for large numbers of serial
applications
Any limitations for processes on compute nodes from interfacing with an
external workflow manager, external database, or message queue
system
1.1.4 The system should support thousands of concurrent users and more than
20,000 concurrent batch jobs. The system should allow a mix of application
or user identity wherein at least a subset of nodes can run multiple
independent applications from multiple users. The Offeror shall describe
details, including limitations of their proposed support for this requirement.
1.1.5 The Offeror shall describe all areas of the system in which node-level
resource usage (hardware and software) increases as a job scales up (node,
core or thread count)
1.1.6 The system should utilize an optimized job placement algorithm to reduce
job runtime, lower variability, minimize latency, etc. The Offeror shall
describe in detail how the algorithm is optimized to the system architecture.
1.1.7 The system should include an application programming interface to allow
applications access to the physical-to-logical mapping information of the
job’s node allocation – including a mapping between MPI ranks and network
topology coordinates, and core, node and rack identifiers
1.1.8 The system software solution should provide a low jitter environment for
applications and should provide an estimate of a compute node operating
system’s noise profile, both while idle and while running a non-trivial MPI
application. If core specialization is used, the Offeror shall describe the
system software activity that remains on the application cores.
1.1.9 The system should provide correct numerical results and consistent
runtimes (i.e., wall clock time) that do not vary more than 3% from run to run
in dedicated mode and 5% in production mode. The Offeror shall describe
strategies for minimizing runtime variability.
1.1.10 The system’s high speed interconnect should support a high messaging
bandwidth, high injection rate, low latency, high throughput, and
independent progress. The Offeror shall describe:
The system interconnect in detail, including any mechanisms for adapting
to heavy loads or inoperable links, as well as a description of how
different types of failures will be addressed
How the interface will allow all cores in the system to simultaneously
communicate synchronously or asynchronously with the high speed
interconnect
How the interconnect will enable low-latency communication for one-
and two-sided paradigms
1.1.11 The Offeror shall describe how both hardware and software components of
the interconnect support effective computation and communication overlap
for both point-to-point operations and collective operations (i.e., the ability
of the interconnect subsystem to progress outstanding communication
requests in the background of the main computation thread).
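As an illustration of the overlap pattern described above, the following is a
minimal sketch using standard MPI nonblocking point-to-point calls; the
neighbor ranks, message size, and local work routine are placeholders, and
effective overlap depends on the independent progress capability this target
asks the Offeror to describe.

    /* Minimal sketch of communication/computation overlap using standard
       MPI nonblocking calls. The neighbor ranks, message size, and local
       work routine are placeholders; effective overlap depends on the
       interconnect progressing the requests in the background. */
    #include <mpi.h>

    static void compute_on_interior(void) { /* local work not touching the halos */ }

    void exchange_and_compute(double *sendbuf, double *recvbuf, int n,
                              int left, int right, MPI_Comm comm)
    {
        MPI_Request reqs[2];

        MPI_Irecv(recvbuf, n, MPI_DOUBLE, left,  0, comm, &reqs[0]);
        MPI_Isend(sendbuf, n, MPI_DOUBLE, right, 0, comm, &reqs[1]);

        compute_on_interior();   /* overlapped local computation */

        /* With independent progress, the transfers complete while the
           computation above runs and this wait returns quickly. */
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    }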
1.1.12 The Offeror shall report or project the proposed system’s node
injection/ejection bandwidth
1.1.13 The Offeror shall report or project the proposed system’s bit error rate of the
interconnect in terms of time period between errors that interrupt a job
running at the full scale of the system
1.1.14 The Offeror shall describe how the interconnect of the system will provide
Quality of Service (QoS) capabilities (e.g., in the form of virtual channels or
other sub-system QoS capabilities), including but not limited to:
An explanation of how these capabilities can be used to prevent core
communication traffic from interfering with other classes of
communication, such as debugging and performance tools or with I/O
traffic
An explanation of how these capabilities allow efficient adaptive routing
as well as a capability to prevent traffic from different applications
interfering with each other (either through QoS capabilities or
appropriate job partitioning)
An explanation of any sub-system QoS capabilities (e.g., platform storage
QoS features).
1.1.15 The Offeror shall describe specialized hardware or software features of the
system that accelerate workflows or components of workflows such as data
analysis or visualization, and describe any limits to their scalability on the
system. The hardware should be on the same high speed network as the
main compute resources and should have equal access to other compute
resources (e.g., file systems and platform storage). It is desirable that the
hardware have the same node level architecture as the main compute
resources, but could, for example, have more memory per node.
3.2 System Software and Runtime
The system should include a well-integrated and supported system software
environment. The overall imperative is to provide users with a productive,
high-performing, reliable, and scalable system software environment that
enables efficient use of the full capability of the system.
1.1.16 The system should include a full-featured Linux operating system
environment on all user visible service partitions (e.g., front-end nodes,
service nodes, I/O nodes). The Offeror shall describe the proposed
full-featured Linux operating system environment.
1.1.17 The system should include an optimized compute partition operating system
that provides an efficient execution environment for applications running up
to full-system scale. The Offeror shall describe any HPC-relevant
optimizations made to the compute partition operating system.
1.1.18 The Offeror shall describe the security capabilities of the operating systems
proposed in targets 1.1.16 and 1.1.17
1.1.19 The system should include efficient support for dynamic shared libraries,
both at job load time and during runtime. The Offeror shall describe how
applications using shared libraries will execute at full system scale with
minimal performance overhead compared to statically linked applications.
1.1.20 The system should include resource management functionality, including job
migration, backfill, targeting of specified resources (e.g., platform storage),
advance and persistent reservations, job preemption, job accounting,
architecture-aware job placement, power management, job dependencies
(e.g., workload management), and resilience management. The Offeror may
propose multiple solutions for a vendor-supported resource manager and
should describe the benefits of each.
1.1.21 The system should support jobs consisting of multiple individual applications
running simultaneously (inter-node or intra-node) and cooperating as part of
an overall multi-component application (e.g., a job that couples a simulation
application to an analysis application). The Offeror shall describe in detail
how this will be supported by the system software infrastructure (e.g., user
interfaces, security model, and inter-application communication).
1.1.22 The system should include a mechanism that will allow users to provide
containerized software images without requiring privileged access to the
system or allowing a user to escalate privilege. The startup time for
launching a parallel application in a containerized software image at full
system scale should not greatly exceed the startup time for launching a
parallel application in the vendor-provided image.
1.1.23 The system should include a mechanism for dynamically configuring external
IPv4/IPv6 connectivity to and from compute nodes, enabling special
connectivity paths for subsets of nodes on a per-batch-job basis, and allowing
fully routable interactions with external services
1.1.24 The Successful Offeror should provide access to source code, and necessary
build environment, for all software except for firmware, compilers, and third
party products. The Successful Offeror should provide updates of source
code, and any necessary build environment, for all software over the life of
the subcontract.
3.3 Software Tools and Programming Environment
The primary programming models used in production applications in this
time frame are the Message Passing Interface (MPI), for inter-node
communication, and OpenMP, for fine-grained on-node parallelism. While
MPI+OpenMP will be the majority of the workload, the APEX laboratories
expect some new applications to exercise emerging asynchronous
programming models. System support that would accelerate these
programming models/runtimes and benefit MPI+OpenMP is desirable.
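For reference, the hybrid model referred to above combines the two standards
in a single application; the following minimal sketch (not one of the APEX
benchmarks) uses MPI across ranks and OpenMP within a rank.

    /* Minimal MPI+OpenMP sketch: OpenMP for on-node parallelism, MPI for
       communication between ranks. Illustrative only. */
    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int provided, rank;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double local = 0.0;
        #pragma omp parallel for reduction(+:local)
        for (int i = 1; i <= 1000000; i++)
            local += 1.0 / (double)i;            /* fine-grained on-node work */

        double global = 0.0;
        MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("global sum = %f\n", global);
        MPI_Finalize();
        return 0;
    }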
1.1.25 The system should include an implementation of the MPI version 3.1 (or
most current) standard specification. The Offeror shall provide a detailed
description of the MPI implementation (including specification version) and
support for features such as accelerated collectives, and shall describe any
limitations relative to the MPI standard.
1.1.26 The Offeror shall describe at what parallel granularity the system can be
utilized by MPI-only applications
1.1.27 The system should include optimized implementations of collective
operations utilizing both inter-node and intra-node features where
appropriate, including MPI_Barrier, MPI_Allreduce, MPI_Reduce,
MPI_Allgather, and MPI_Gather
1.1.28 The Offeror shall describe the network transport layer of the system
including support for OpenUCX, Portals, libfabric, libverbs, and any other
transport layer including any optimizations of their implementation that will
benefit application performance or workflow efficiency
1.1.29 The system should include a complete implementation of the OpenMP
version 4.1 (or most current) standard including, if applicable, accelerator
directives, as well as a supporting programming environment. The Offeror
shall provide a detailed feature description of the OpenMP
implementation(s) and describe any expected deviations from the OpenMP
standard.
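A minimal sketch of the accelerator directives mentioned above follows. It
assumes an OpenMP 4.x compiler; whether a discrete device is present is an
assumption, and on a homogeneous node the region simply runs on the host.

    /* Minimal OpenMP 4.x accelerator-directive sketch. If no device is
       present the region executes on the host. Illustrative only. */
    #include <omp.h>

    void scale(double *x, double a, int n)
    {
        #pragma omp target teams distribute parallel for map(tofrom: x[0:n])
        for (int i = 0; i < n; i++)
            x[i] *= a;
    }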
1.1.30 The Offeror shall provide a description of how OpenMP 3.1 applications will
be compiled and executed on the system
1.1.31 The Offeror shall provide a description of any proposed hardware or
software features that enable OpenMP performance optimizations
1.1.32 The Offeror shall list any PGAS languages and/or libraries that are supported
(e.g., UPC, SHMEM, CAF, Global Arrays) and describe any hardware and/or
programming environment software that optimizes any of the listed PGAS
languages supported on the system. The system should include a mechanism
to compile, run, and debug UPC applications. The Offeror shall describe
interoperability with MPI+OpenMP.
1.1.33 The Offeror shall describe and list support for any emerging programming
models such as asynchronous task/data models (e.g., Legion, STAPL, HPX, or
OCR) and describe any system hardware and/or programming environment
software it will provide that optimizes any of the supported models. The
Offeror shall describe interoperability with MPI+OpenMP.
1.1.34 The Offeror shall describe the proposed hardware and software environment
support for:
Fast thread synchronization of subsets of execution threads
Atomic add, fetch-and-add, multiply, bitwise operations, and
compare-and-swap operations over integer, single-precision, and double-precision
operands
Atomic compare-and-swap operations over 16-byte wide operands that
comprise two double precision values or two memory pointer operands
Fast context switching or task-switching
Fast task spawning for unique and identical tasks with data dependencies
Support for active messages
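As an illustration of the operations listed above, the following sketch uses
C11 atomics and the GCC __atomic builtins; availability of a native 16-byte
compare-and-swap is an assumption about the underlying hardware (for
example, requiring -mcx16 on x86-64), not a statement about the proposed
system.

    /* Sketch of atomic fetch-and-add, a bitwise atomic, and a 16-byte
       compare-and-swap over two double-precision operands, using C11
       atomics and GCC __atomic builtins. Hardware support for the
       double-width CAS is assumed, not guaranteed. */
    #include <stdatomic.h>

    typedef struct { double lo; double hi; } pair16;   /* 16-byte operand */

    long fetch_add(_Atomic long *ctr)
    {
        return atomic_fetch_add(ctr, 1);                /* atomic fetch-and-add */
    }

    unsigned long set_bits(_Atomic unsigned long *w, unsigned long mask)
    {
        return atomic_fetch_or(w, mask);                /* atomic bitwise OR */
    }

    int cas16(pair16 *p, pair16 *expected, pair16 desired)
    {
        /* 16-byte compare-and-swap over two double-precision values. */
        return __atomic_compare_exchange(p, expected, &desired, 0,
                                         __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
    }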
1.1.35 The Offeror shall describe in detail all programming APIs, languages,
compilers and compiler extensions, etc., other than MPI and OpenMP (e.g.,
OpenACC, CUDA, OpenCL, etc.) that will be supported by the system. It is
desirable that instances of all programming models provided be
interoperable and efficient when used within a single process or single job
running on the same compute node.
1.1.36 The system should include support for the C, C++ (including
complete C++11/14/17), Fortran 77, Fortran 90, and Fortran 2008
programming languages. Providing multiple compilation environments is
highly desirable. The Offeror shall describe any limitations that can be
expected in meeting full C++17 support based on current expectations.
1.1.37 The system should include a Python implementation that will run on the
compute partition with optimized MPI4Py, NumPy, and SciPy libraries
1.1.38 The system should include a programming toolchain(s) that enables runtime
coexistence of threading in C, C++, and Fortran, from within applications and
any supporting libraries using the same toolchain. The Offeror shall describe
the interaction between OpenMP and native parallelism expressed in
language standards.
1.1.39 The system should include C++ compiler(s) that can successfully build the
Boost C++ library (http://www.boost.org). The Offeror shall support the most
recent stable version of Boost.
1.1.40 The system should include optimized versions of libm, libgsl, BLAS levels 1, 2
and 3, LAPACK, ScaLAPACK, HDF5, NetCDF, and FFTW It is desirable for
these to efficiently interoperate with applications that utilize OpenMP. The
Offeror shall describe all other optimized libraries that will be supported,
including a description of the interoperability of these libraries with the
programming environments proposed.
1.1.41 The system should include a mechanism that enables control of task and
memory placement within a node for efficient performance. The Offeror
shall provide a detailed description of controls provided and any limitations
that may exist.
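As one illustration (assuming a Linux environment and an OpenMP 4.5
runtime), the sketch below reports where each thread actually executes,
which can be used to verify the effect of whatever placement controls the
Offeror provides.

    /* Reports the OpenMP place and Linux CPU on which each thread runs,
       to verify task placement controls. Assumes Linux (sched_getcpu)
       and OpenMP 4.5 place queries. Illustrative only. */
    #define _GNU_SOURCE
    #include <omp.h>
    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
        #pragma omp parallel
        printf("thread %d of %d: place %d, cpu %d\n",
               omp_get_thread_num(), omp_get_num_threads(),
               omp_get_place_num(), sched_getcpu());
        return 0;
    }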
1.1.42 The system should include a comprehensive software development
environment with configuration and source code management tools. On
heterogeneous systems, a mechanism (e.g., an upgraded autoconf) should be
provided to create configure scripts to build cross-compiled applications on
login nodes.
1.1.43 The system should include an interactive parallel debugger with an
X11-based graphical user interface. The debugger should provide a single point of
control that can debug applications in all supported languages using all
granularities of parallelism (e.g., MPI+X) and programming environments
provided and scale up to 25% of the system.
1.1.44 The system should include a suite of tools for detailed performance analysis
and profiling of user applications. At least one tool should support all
granularities of parallelism in mixed MPI+OpenMP programs and any
additional programming models supported on the system The tool suite
must provide the ability to support multi-node integrated profiling of
on-node parallelism and communication performance analysis. The Offeror shall
describe all proposed tools and the scalability limitations of each. The Offeror
shall describe tools for measuring I/O behavior of user applications.
1.1.45 The system should include event-tracing tools. Event tracing of interest
includes: message-passing event tracing, I/O event tracing, floating point
exception tracing, and message-passing profiling The event-tracing tool API
should provide functions to activate and deactivate event monitoring during
execution from within a process.
1.1.46 The system should include single- and multi-node stack-tracing tools. The
tool set should include a source-level stack trace back, including an API that
allows a running process or thread to query its current stack trace.
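For illustration, the glibc backtrace facility provides the kind of
in-process stack-trace query described above; it is used here only as a
stand-in for whatever API the Offeror proposes.

    /* In-process stack-trace query using the glibc backtrace facility,
       as a stand-in for the Offeror-provided stack-tracing API. */
    #include <execinfo.h>
    #include <stdio.h>
    #include <stdlib.h>

    void print_my_stack(void)
    {
        void *frames[64];
        int n = backtrace(frames, 64);
        char **symbols = backtrace_symbols(frames, n);
        for (int i = 0; i < n; i++)
            printf("%s\n", symbols[i]);
        free(symbols);
    }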
1.1.47 The system should include tools to assist the programmer in introducing
limited levels of parallelism and data structure refactoring to codes using any
proposed programming models and languages. Tool(s) should additionally
be provided to assist application developers in the design and placement of
the data structures with the goal of optimizing data movement/placement
for the classes of memory proposed in the system.
1.1.48 The system should include software licenses to enable the following number
of simultaneous users on the system:
Crossroads NERSC-9
3.4 Platform Storage
Platform storage is certain to be one of the advanced technology areas
included in any system delivered in this timeframe. The APEX laboratories
anticipate these emerging technologies will enable new usage models. With
this in mind, an accompanying whitepaper, “APEX Workflows,” is provided
that describes how application teams use HPC resources today to advance
scientific goals. The whitepaper is designed to provide a framework for
reasoning about the optimal solution to these challenges. The whitepaper is
intended to help an Offeror develop a platform storage architecture response
that accelerates the science workflows while minimizing the total number of
platform storage tiers. The Crossroads/NERSC-9 workflows document can be
found on the APEX website.
1.1.49 The system should include platform storage capable of retaining all
application input, output, and working data for 12 weeks (84 days),
estimated at a minimum of 36% of baseline system memory per day.
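For reference, retaining 84 days of data at 36% of baseline system memory
per day corresponds to roughly 0.36 × 84 ≈ 30.2 times the baseline memory
capacity, consistent with the platform storage target of greater than 30X
baseline memory in Table 2.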
1.1.50 The system should include platform storage with an appropriate durability
or a maintenance plan such that the platform storage is capable of absorbing
approximately four times the system’s baseline memory per day for the life of
the system.
1.1.51 The Offeror shall describe how the system provides sufficient bandwidth to
support a JMTTI/Delta-Ckpt ratio of greater than 200 (where Delta-Ckpt is
less than 7.2 minutes).
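As a worked example (the memory capacity used here is hypothetical, not a
stated requirement): checkpointing 2 PB of baseline memory within a
Delta-Ckpt of 7.2 minutes (432 seconds) implies a sustained platform storage
bandwidth of roughly 2 PB / 432 s ≈ 4.6 TB/s, and with the 24-hour JMTTI
target of Section 3.6 the ratio is 24 h / 7.2 min = 200.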
1.1.52 The Offeror shall describe the projected characteristics of all platform
storage devices for the system, including but not limited to:
Usable capacity, access latencies, platform storage interfaces (e.g., NVMe,
PCIe), expected lifetime (warranty period, MTTF, total writes, etc.), and
media and device error rates
Relevant software/firmware features
Compression technologies used by the platform storage devices, the
resources used to implement the compression/decompression
algorithms, the expected compression rates, and all
compression/decompression-related performance impacts
1.1.53 The Offeror shall describe all available interfaces to platform storage for the
system, including but not limited to:
POSIX
APIs
Exceptions to POSIX compliance
Time to consistency and any potential delays for reliable data
consumption
Any special requirements for users to achieve performance and/or
consistent data
1.1.54 The Offeror shall describe the reliability characteristics of platform storage,
including but not limited to:
Any single point of failure for all proposed platform storage tiers (note
any component failure that will lead to temporary or permanent loss of
data availability)
Mean time to data loss for each platform storage tier provided
Enumerate platform storage tiers that are designed to be less reliable or
do not use data protection techniques (e.g., replication, erasure coding)
The magnitudes and duration of performance and reliability degradation
brought about by a single or multiple component failures for each reliable
platform storage tier
Vendor supplied mechanisms to ensure data integrity for each platform
storage tier (e.g., data scrubbing processes, background checksum
verification, etc.)
Enumerate any platform storage failures that potentially impact
scheduled or currently executing jobs that impact the platform storage or
system performance and/or availability
Login or interactive node access to platform storage when the compute
nodes are unavailable
1.1.55 The Offeror shall describe system features for platform storage tier
management designed to accelerate workflows, including but not limited to:
Mechanisms for migrating data between platform storage tiers, including
manual, scheduled, and/or automatic data migration to include
rebalancing, draining, or rewriting data across devices within a tier
How platform storage will be instantiated with each job if it needs to be,
and how platform storage may be persisted across jobs
The capabilities provided to define per-user policies and automate data
movement between different tiers of platform storage or external storage
resources (e.g., archives)
The ability to serialize namespaces no longer in use (e.g., snapshots)
The ability to restore namespaces needed for a scheduled job that is not
currently available
The ability to integrate with or act as a site-wide scheduling resource
A mechanism to incrementally add capacity and bandwidth to a
particular tier of platform storage without requiring a tier-wide outage
Capabilities to manage or interface platform storage with external
storage resources or archives (e.g., fast storage layers or HPSS)
1.1.56 The Offeror shall describe software features that allow users to optimize I/O
for the workflows of the system, including but not limited to:
Batch data movement capabilities, especially when data resides on
multiple tiers of platform storage
Methods for users to create and manage platform storage allocations
Any ability to directly write to or read from a tier not directly (logically)
adjacent to the compute resources
Locality-aware job/data scheduling
I/O utilization for reservations
Features to prevent data duplication on more than one platform storage tier
1.1.57 The Offeror shall describe the method for walking the entire platform storage
metadata, and describe any special capabilities that would mitigate user
performance issues for daily full-system namespace walks; expect at least 1
billion objects
1.1.58 The Offeror shall describe any capabilities to comprehensively collect
platform storage usage data (in a scalable way), for the system, including but
not limited to:
Per client metrics and frequency of collection, including but not limited
to: the number of bytes read or written, number of read or write
invocations, client cache statistics, and metadata statistics such as
number of opens, closes, creates, and other system calls of relevance to
the performance of platform storage
Job level metrics, such as the number of sessions each job initiates with
each platform storage tier, session duration, total data transmitted
(separated as reads and writes) during the session, and the number of
total platform storage invocations made during the session
Platform storage tier metrics and frequency of collection, such as the
number of bytes read, number of bytes written, number of read
invocations, number of write invocations, bytes deleted/purged, number
of I/O sessions established, and periods of outage/unavailability
Job level metrics describing usage of a tiered platform storage hierarchy,
such as how long files are resident in each tier, hit rate of file pages in
each tier (i.e., whether pages are actually read and how many times data
is re-read), fraction of data moved between tiers because of a) explicit
programmer control and b) transparent caching, and time interval
between accesses to the same file (e.g., how long until an analysis
program reads a simulation generated output file)
1.1.59 The Offeror shall propose a method for providing access to platform storage
from other systems at the facility. In the case of tiered platform storage, at
least one tier must satisfy this requirement.
1.1.60 The Offeror shall describe the capability for platform storage tiers to be
repaired, serviced, and incrementally patched/upgraded while running
different versions of software or firmware without requiring a storage
tier-wide outage. The Offeror shall describe the level of performance degradation,
if any, anticipated during the repair or service interval.
1.1.61 The Offeror shall specify the minimum number of compute nodes required to
read and write the following data sets from/to platform storage:
A 1 TB data set of 20 GB files in 2 seconds.
A 5 TB data set of any chosen file size in 10 seconds. The Offeror shall report
the file size chosen.
A 1 PB data set of 32 MB files in 1 hour.
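For reference, these targets correspond to sustained rates of roughly
1 TB / 2 s = 500 GB/s, 5 TB / 10 s = 500 GB/s, and 1 PB / 3600 s ≈ 280 GB/s;
dividing by an assumed (purely illustrative) per-node injection bandwidth of
10 GB/s would suggest on the order of 50, 50, and 28 nodes respectively,
before accounting for small-file and metadata overheads.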
3.5 Application Performance
Assuring that real applications perform well on both the Crossroads and
NERSC-9 systems is key for their success. Because the full applications are
large, often with millions of lines of code, and in some cases are export
controlled, a suite of benchmarks has been developed for RFP response
evaluation and system acceptance. The benchmark codes are representative
of the workloads of the APEX laboratories but often smaller than the full
applications.
The performance of the benchmarks will be evaluated as part of both the RFP
response and system acceptance. Final benchmark acceptance performance
targets will be negotiated after a final system configuration is defined. All
performance tests must continue to meet negotiated acceptance criteria
throughout the lifetime of the system.
System acceptance for Crossroads will also include an ASC Simulation Code
Suite comprised of at least two (2) but no more than four (4) ASC
applications from the three NNSA laboratories, Sandia, Los Alamos and
Lawrence Livermore.
The Crossroads/NERSC-9 benchmarks, information regarding the Crossroads
acceptance codes, and supplemental materials can be found on the APEX
website.
1.1.62 The Offeror shall provide responses to the benchmarks (SNAP, PENNANT,
HPCG, MiniPIC, UMT, MILC, MiniDFT, GTC, and Meraculous) provided on the
Crossroads/NERSC-9 benchmarks link on the APEX website. All modifications
or new variants of the benchmarks (including makefiles, build scripts, and
environment variables) are to be supplied in the Offeror’s response.
The results of all problem sizes (baseline and optimized) should be
provided in the Offeror's Scalable System Improvement (SSI)
spreadsheets. SSI is the calculation used for measuring improvement and
is documented on the APEX website, along with the SSI spreadsheets. If
predicted or extrapolated results are provided, the methodology used to
derive them should be documented.
The Offeror shall provide licenses for the system for all compilers,
libraries, and runtimes used to achieve benchmark performance.
1.1.63 The Offeror shall provide performance results for the system that may be
benchmarked, predicted, and/or extrapolated for the baseline MPI+OpenMP
(or UPC for Meraculous) variants of the benchmarks. The Offeror may modify
the benchmarks to include extra OpenMP pragmas as required, but the
benchmark must remain a standard-compliant program that maintains
existing output subject to the validation criteria described in the benchmark
run rules.
1.1.64 The Offeror shall optionally provide performance results from an Offeror
optimized variant of the benchmarks. The Offeror may modify the
benchmarks, including the algorithm and/or programming model used to
demonstrate high system performance. If algorithmic changes are made, the
Offeror shall provide an explanation of why the results may deviate from
validation criteria described in the benchmark run rules.
1.1.65 For the Crossroads system only: in addition to the Crossroads/NERSC-9
benchmarks, an ASC Simulation Code Suite representing the three NNSA
laboratories will be used to judge performance at time of acceptance. The
Crossroads system should achieve a minimum of 6 times (6X)
improvement over the ASC Trinity system (Knights Landing partition) for
each code, measured using SSI. The Offeror shall specify a baseline
performance greater than or equal to 6X at time of response. Final
acceptance performance targets will be negotiated after a final system
configuration is defined. Information regarding ASC Simulation Code Suite
run rules and acceptance can be found on the APEX website. Source code will
be provided to the Offeror but will require compliance with export control
laws and no-cost licensing agreements.
1.1.66 The Offeror shall report or project the number of cores necessary to saturate
the available node baseline memory bandwidth as measured by the
Crossroads/NERSC-9 memory bandwidth benchmark found on the APEX
website.
If the node contains heterogeneous cores, the Offeror shall report the
number of cores of each architecture necessary to saturate the available
baseline memory bandwidth
If multiple tiers of memory are available, the Offeror shall report the
above for every functional combination of core architecture and baseline
or extended memory tier.
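The loop below is a minimal STREAM-triad-style sketch of the kind of
per-node measurement this target refers to; it is not the Crossroads/NERSC-9
memory bandwidth benchmark, which is distributed on the APEX website.

    /* Minimal triad-style memory bandwidth sketch (not the APEX benchmark).
       Counts 24 bytes of traffic per element (read b, read c, write a). */
    #include <omp.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        const long n = 1L << 27;                  /* ~1 GiB per array */
        double *a = malloc(n * sizeof *a);
        double *b = malloc(n * sizeof *b);
        double *c = malloc(n * sizeof *c);
        const double s = 3.0;

        #pragma omp parallel for
        for (long i = 0; i < n; i++) { a[i] = 0.0; b[i] = 1.0; c[i] = 2.0; }

        double t = omp_get_wtime();
        #pragma omp parallel for
        for (long i = 0; i < n; i++)
            a[i] = b[i] + s * c[i];               /* triad kernel */
        t = omp_get_wtime() - t;

        printf("triad bandwidth: %.1f GB/s\n", 24.0 * n / t / 1e9);
        free(a); free(b); free(c);
        return 0;
    }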
1.1.67 The Offeror shall report or project the sustained dense matrix multiplication
performance on each type of processor core (individually and/or in parallel)
of the system node architecture(s) as measured by the Crossroads/NERSC-9
multithreaded DGEMM benchmark found on the APEX website.
The Offeror shall describe the percentage of theoretical double-precision
(64-bit) computational peak that the benchmark GFLOP/s rate
achieves for each type of compute core/unit in the response, and describe
how this is calculated.
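As an illustration (all numbers hypothetical): a core with a 2.0 GHz clock
and two 8-wide double-precision FMA units has a theoretical peak of
2.0 GHz × 2 units × 8 lanes × 2 FLOPs/FMA = 64 GFLOP/s, so a measured DGEMM
rate of 48 GFLOP/s on that core would be reported as 48 / 64 = 75% of peak.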
1.1.68 The Offeror shall report, or project, the MPI two-sided message rate of the
nodes in the system under the following conditions measured by the
communication benchmark specified on the APEX website:
Using a single MPI rank per node with MPI_THREAD_SINGLE.
Using two, four, and eight MPI ranks per node with
MPI_THREAD_SINGLE
Using one, two, four, and eight MPI ranks per node and multiple threads
per rank with MPI_THREAD_MULTIPLE
The Offeror may additionally choose to report on other configurations.
1.1.69 The Offeror shall report, or project, the MPI one-sided message rate of the
nodes in the system for all passive synchronization RMA methods with both
pre-allocated and dynamic memory windows under the following conditions
measured by the communication benchmark specified on the APEX website
using:
A single MPI rank per node with MPI_THREAD_SINGLE
Two, four, and eight MPI ranks per node with MPI_THREAD_SINGLE
One, two, four, and eight MPI ranks per node and multiple threads per
rank with MPI_THREAD_MULTIPLE
The Offeror may additionally choose to report on other configurations.
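A minimal sketch of the passive-target, pre-allocated-window case measured
above follows; the target rank and counts are placeholders, and the
dynamic-window variant would use MPI_Win_create_dynamic and MPI_Win_attach
instead.

    /* Passive-target one-sided sketch with a pre-allocated window
       (collective over comm). The dynamic variant would use
       MPI_Win_create_dynamic/MPI_Win_attach. Illustrative only. */
    #include <mpi.h>

    void put_to_rank(const double *data, int count, int target, MPI_Comm comm)
    {
        double *winbuf;
        MPI_Win win;

        /* Allocate and expose the window memory in one call. */
        MPI_Win_allocate((MPI_Aint)count * sizeof(double), sizeof(double),
                         MPI_INFO_NULL, comm, &winbuf, &win);

        /* Passive synchronization: the target makes no matching calls. */
        MPI_Win_lock(MPI_LOCK_SHARED, target, 0, win);
        MPI_Put(data, count, MPI_DOUBLE, target, 0, count, MPI_DOUBLE, win);
        MPI_Win_unlock(target, win);

        MPI_Win_free(&win);
    }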
1.1.70 The Offeror shall report, or project, the time to perform the following
collective operations for full, half, and quarter machine size in the system and
report on core occupancy during the operations measured by the
communication benchmark specified on the APEX website for:
An 8 byte MPI_Allreduce operation
An 8 byte per rank MPI_Allgather operation
1.1.71 The Offeror shall report, or project, the minimum and maximum off-node
latency of the system for MPI two-sided messages using the following
threading modes measured by the communication benchmark specified on
the APEX website:
MPI_THREAD_SINGLE with a single thread per rank
MPI_THREAD_MULTIPLE with two or more threads per rank
1.1.72 The Offeror shall report, or project, the minimum and maximum off-node
latency for MPI one-sided messages of the system for all passive
synchronization RMA methods with both pre-allocated and dynamic memory
windows using the following threading modes measured by the
communication benchmark specified on the APEX website:
MPI_THREAD_SINGLE with a single thread per rank
MPI_THREAD_MULTIPLE with two or more threads per rank
1.1.73 The Offeror shall provide an efficient implementation of
MPI_THREAD_MULTIPLE. Bandwidth, latency, and message throughput
measurements using the MPI_THREAD_MULTIPLE thread support level
should have no more than a 10% performance degradation when compared
to using the MPI_THREAD_SINGLE support level as measured by the
communication benchmark specified on the APEX website.
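The sketch below shows how an application requests and verifies the
MPI_THREAD_MULTIPLE support level whose overhead this target bounds; each
OpenMP thread then makes its own concurrent MPI call (a self-exchange keyed
by thread id, chosen only for illustration).

    /* Requests MPI_THREAD_MULTIPLE and lets every OpenMP thread call MPI
       concurrently (a self-exchange tagged by thread id). Illustrative only. */
    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int provided, rank;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (provided >= MPI_THREAD_MULTIPLE) {
            #pragma omp parallel
            {
                int tid = omp_get_thread_num(), out = tid, in = -1;
                MPI_Sendrecv(&out, 1, MPI_INT, rank, tid,
                             &in,  1, MPI_INT, rank, tid,
                             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            }
        } else {
            printf("MPI_THREAD_MULTIPLE not available (provided=%d)\n", provided);
        }

        MPI_Finalize();
        return 0;
    }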
1.1.74 The Offeror shall report, or project, the maximum I/O bandwidths of the
system as measured by the IOR benchmark specified on the APEX website
1.1.75 The Offeror shall report, or project, the metadata rates of the system as
measured by the MDTEST benchmark specified on the APEX website
1.1.76 The Successful Offeror shall be required at time of acceptance to meet
specified targets for acceptance benchmarks, and mission codes for
Crossroads, listed on the APEX website
1.1.77 The Offeror shall describe how the system may be configured to support a
high rate and bandwidth of TCP/IP connections to external services both
from compute nodes and directly to and from the platform storage, including:
Compute node external access should allow all nodes to each initiate 1
connection concurrently within a 1 second window
Transfer of data over the external network to and from the compute
nodes and platform storage at 100 GB/s per direction of a 1 TB dataset
comprised of 20 GB files in 10 seconds
3.6 Resilience, Reliability, and Availability
The ability to achieve the APEX mission goals hinges on the productivity of
system users. System availability is therefore essential and requires a
system-wide focus to achieve a resilient, reliable, and available system. For each
metric specified below, the Offeror must describe how it arrived at its
estimates.
1.1.78 Failure of the system management and/or RAS system(s) should not cause a
system or job interrupt. This requirement does not apply to a RAS system
feature that automatically shuts down the system for safety reasons, such
as an overheating condition.
1.1.79 The minimum System Mean Time Between Interrupt (SMTBI) should be
greater than 720 hours
1.1.80 The minimum Job Mean Time To Interrupt (JMTTI) should be greater than 24
hours. Automatic restarts do not mitigate a job interrupt for this metric.
1.1.81 The ratio of JMTTI/Delta-Ckpt should be greater than 200. This metric is a
measure of the system’s ability to make progress over a long period of time
and corresponds to an efficiency of approximately 90%. If, for example, the
JMTTI requirement is not met, the target JMTTI/Delta-Ckpt ratio ensures this
minimum level of efficiency.
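For reference, the approximately 90% figure is consistent with a standard
checkpoint/restart efficiency model (this derivation is illustrative, not part
of the requirement): if checkpoints are taken at the optimal interval, the
fraction of time lost to checkpointing and rework after failures is roughly
sqrt(2 × Delta-Ckpt / JMTTI), so a JMTTI/Delta-Ckpt ratio of 200 gives a loss
of about sqrt(2/200) = 0.1, i.e., roughly 90% efficiency.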
1.1.82 An immediate re-launch of an interrupted job should not require a complete
resource reallocation. If a job is interrupted, there should be a mechanism
that allows re-launch of the application using the same allocation of resources
(e.g., compute nodes) that it had before the interrupt or an augmented
allocation when part of the original allocation experiences a hard failure.
1.1.83 A complete system initialization should take no more than 30 minutes. The
Offeror shall describe the full system initialization sequence and timings.
1.1.84 The system should achieve 99% scheduled system availability. System
availability is defined in the glossary.
1.1.85 The Offeror shall describe the resilience, reliability, and availability
mechanisms and capabilities of the system including, but not limited to:
Any condition or event that can potentially cause a job interrupt
Resiliency features to achieve the availability targets
Single points of failure (hardware or software), and the potential effect on
running applications and system availability
How a job maintains its resource allocation and is able to relaunch an
application after an interrupt
A system-level mechanism to collect failure data for each kind of
component
3.7 Application Transition Support and Early Access to APEX
Technologies
The Crossroads and NERSC-9 systems will include numerous pre-exascale
technologies. The Offeror shall include in its proposal a plan to effectively
utilize these technologies and assist in transitioning the mission workflows
to the systems. For the Crossroads system only, the Successful Offeror shall
support efforts to transition the Advanced Technology Development and
Mitigation (ATDM) codes to the systems. ATDM codes are currently being
developed by the three NNSA weapons laboratories, Sandia, Los Alamos, and
Lawrence Livermore. These codes may require compliance with export
control laws and no-cost licensing agreements. Information about the ATDM
program can be found on the NNSA website.
1.1.86 The Successful Offeror should provide a vehicle for supporting the successful
demonstration of the application performance requirements and the
transition of key applications to the Crossroads and NERSC-9 systems (e.g., a
Center of Excellence). Support should be provided by the Offeror and all of
its key advanced technology providers (e.g., processor vendors, integrators,
etc.). The Successful Offeror should provide experts in the areas of
application porting and performance optimization in the form of staff
training, general user training, and deep-dive interactions with a set of
application code teams. Support should include compilers to enable timely
bug fixes as well as to enable new functionality. Support should be provided
from the date of subcontract execution through two (2) years after final
acceptance of the systems.
1.1.87 The Offeror shall describe which of the proposed APEX hardware and
software technologies (physical hardware, emulators, and/or simulators)
will be available for access before system delivery and in what timeframe.
The proposed technologies should provide value in advanced preparation for
the delivery of the final APEX system(s) for pre-system-delivery application
porting and performance assessment activities.
3.8 Target System Configuration
APEX determined the following targets for Crossroads and NERSC-9 system
configurations. Offerors shall state projections for their proposed system
configurations relative to these targets.
Table 2 Target System Configuration
Baseline Memory Capacity (excludes all levels of on-die CPU cache)
Benchmark SSI increase over Edison system
Platform Storage: > 30X Baseline Memory (Crossroads); > 30X Baseline Memory (NERSC-9)
Calculated for a single job running in the entire system.

3.9 System Operations
System management should be an integral feature of the overall system and
should provide the ability to effectively manage system resources with high
utilization and throughput under a workload with a wide range of
concurrencies. The Successful Offeror should provide system administrators,
security officers, and user-support personnel with productive and efficient
system configuration management capabilities and an enhanced diagnostic
environment.
1.1.88 The system should include scalable integrated system management
capabilities that provide human interfaces and APIs for system configuration
and its ability to be automated, software management, change management,
local site integration, and system configuration backup and recovery
1.1.89 The system should include a means for tracking and analyzing all software
updates, software and hardware failures, and hardware replacements over
the lifetime of the system
1.1.90 The system should include the ability to perform rolling upgrades and
rollbacks on a subset of the system while the balance of the system remains
in production operation. The Offeror shall describe the mechanisms,
capabilities, and limitations of rolling upgrades and rollbacks. No more than
half the system partition should be required to be down for rolling upgrades
and rollbacks.
1.1.91 The system should include an efficient mechanism for reconfiguring and
rebooting compute nodes. The Offeror shall describe in detail the compute
node reboot mechanism, differentiating types of boots (warmboot vs.
coldboot) required for different node features, as well as how the time
required to reboot scales with the number of nodes being rebooted.
1.1.92 The system should include a mechanism whereby all monitoring data and
logs captured are available to the system owner, and will support an open
monitoring API to facilitate lossless, scalable sampling and data collection for
monitored data. Any filtering that may need to occur will be at the option of
the system manager. The system will include a sampling and connection
framework that allows the system manager to configure independent
alternative parallel data streams to be directed off the system to
site-configurable consumers.
1.1.93 The system should include a mechanism to collect and provide metrics and
logs which monitor the status, health, and performance of the system,
including, but not limited to:
Environmental measurement capabilities for all systems and peripherals
and their sub-systems and supporting infrastructure, including power
and energy consumption and control
Internal HSN performance counters, including measures of network
congestion and network resource consumption
All levels of integrated and attached platform storage
The system as a whole, including hardware performance counters for
metrics for all levels of integrated and attached platform storage
1.1.94 The Offeror shall describe what tools it will provide for the collection,
analysis, integration, and visualization of metrics and logs produced by the
system (e.g., peripherals, integrated and attached platform storage, and
environmental data, including power and energy consumption)
1.1.95 The Offeror shall describe the system configuration management and
diagnostic capabilities of the system that address the following topics:
Detailed description of the system management support
Any effect or overhead of software management tool components on the
CPU or memory available on compute nodes
Release plan, with regression testing and validation for all system related
software and security updates
Support for multiple simultaneous or alternative system software
configurations, including estimated time and effort required to install
both a major and a minor system software update
User activity tracking, such as audit logging and process accounting
Unrestricted privileged access to all hardware components delivered
with the system
3.10 Power and Energy
Power, energy, and temperature will be critical factors in how the APEX
laboratories manage systems in this time frame and must be an integral part
of overall Systems Operations. The solution must be well integrated into
other intersecting areas (e.g., facilities, resource management, runtime
systems, and applications). The APEX laboratories expect a growing number
of use cases in this area that will require a vertically integrated solution.
1.1.96 The Offeror shall describe all power, energy, and temperature measurement
capabilities (system, rack/cabinet, board, node, component, and
sub-component level) for the system, including control and response times,
sampling frequency, accuracy of the data, and timestamps of the data for
individual points of measurement and control.
1.1.97 The Offeror shall describe all control capabilities it will provide to affect
power or energy use (system, rack/cabinet, board, node, component, and
sub-component level).
1.1.98 The system should include system-level interfaces that enable measurement
and dynamic control of power and energy relevant characteristics of the
system, including but not limited to:
- AC measurement capabilities at the system or rack level
- System-level minimum and maximum power settings (e.g., power caps)
- System-level power ramp up and down rate
- Scalable collection and retention of all measurement data, such as:
  - point-in-time power data
  - energy usage information
  - minimum and maximum power data
1.1.99 The system should include resource manager interfaces that enable
measurement and dynamic control of power and energy relevant
characteristics of the system, including but not limited to:
- Job and node level minimum and maximum power settings
- Job and node level power ramp up and down rate
- Job and node level processor and/or core frequency control
- System and job level profiling and forecasting, e.g., prediction of hourly
  power averages >24 hours in advance with a 1 MW tolerance (an
  illustrative check of this kind of constraint is sketched after this list)
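For illustration only, the fragment below shows the kind of check a resource manager or site tool might perform through such interfaces: averaging power samples over an hour and comparing the result against a negotiated power band (compare the Crossroads "Maximum Power Rate of Change" target in Table 3). The sample values and band limits are hypothetical, and no vendor measurement interface is assumed.

/*
 * Illustrative only: check an hourly average of system power samples
 * against a negotiated power band. The samples and band limits below
 * are hypothetical; no vendor measurement interface is assumed.
 */
#include <stdio.h>
#include <stddef.h>

/* Returns 1 if the mean of n power samples (in MW) lies within
 * [band_lo_mw, band_hi_mw], and 0 otherwise. */
static int hourly_average_within_band(const double *samples_mw, size_t n,
                                       double band_lo_mw, double band_hi_mw)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += samples_mw[i];
    double avg = (n > 0) ? sum / (double)n : 0.0;
    return avg >= band_lo_mw && avg <= band_hi_mw;
}

int main(void)
{
    /* In practice this would be, e.g., 3600 one-second samples for the hour. */
    double samples_mw[] = { 11.8, 12.4, 12.1, 12.9, 12.2 };
    int ok = hourly_average_within_band(samples_mw, 5, 11.0, 13.0); /* 2 MW wide band */
    printf("hourly average within negotiated band: %s\n", ok ? "yes" : "no");
    return 0;
}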
1.1.100 The system should include application and runtime system interfaces
that enable measurement and dynamic control of power and energy relevant
characteristics of the system, including but not limited to:
- Node level minimum and maximum power settings
- Node level processor and/or core frequency control
- Node level application hints, such as the application entering a serial,
  parallel, computationally intense, I/O intense, or communication intense
  phase
1.1.101 The system should include an integrated API for all levels of
measurement and control of power relevant characteristics of the system. It
is preferable that the provided API complies with the High Performance
Computing Power Application Programming Interface Specification
(http://powerapi.sandia.gov).
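As a minimal sketch of how such an integrated API could be exercised from user or system code, the fragment below reads instantaneous power and requests a node-level power cap through the Power API C interface. The function and attribute names follow version 1.x of the specification published at powerapi.sandia.gov; they should be verified against the implementation the Offeror actually provides, and the 300 W cap is purely illustrative.

/*
 * Minimal sketch against the HPC Power API C interface (powerapi.sandia.gov).
 * Function and attribute names follow version 1.x of the specification and
 * should be verified against the Offeror's implementation; the 300 W cap is
 * illustrative only.
 */
#include <stdio.h>
#include <pwr.h>   /* header name as used by the reference implementation */

int main(void)
{
    PWR_Cntxt cntxt;
    PWR_Obj   self;
    PWR_Time  ts;
    double    watts;
    double    cap = 300.0;   /* illustrative node-level cap, in watts */

    if (PWR_CntxtInit(PWR_CNTXT_DEFAULT, PWR_ROLE_APP, "apex-example", &cntxt) != PWR_RET_SUCCESS)
        return 1;
    PWR_CntxtGetEntryPoint(cntxt, &self);

    /* Read instantaneous power at this object (e.g., the local node). */
    if (PWR_ObjAttrGetValue(self, PWR_ATTR_POWER, &watts, &ts) == PWR_RET_SUCCESS)
        printf("current power: %.1f W\n", watts);

    /* Request a node-level power cap, if the platform exposes this control. */
    PWR_ObjAttrSetValue(self, PWR_ATTR_POWER_LIMIT_MAX, &cap);

    PWR_CntxtDestroy(cntxt);
    return 0;
}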
1.1.102 The Offeror shall project (and report) the Wall Plate, Peak, Nominal,
and Idle Power of the system.
1.1.103 The Offeror shall describe any controls available to enforce or limit
power usage below wall plate power and the reaction time of this mechanism
(e.g., for what duration and by what magnitude power usage can exceed the
imposed limits).
1.1.104 The Offeror shall describe the status of the system when in an Idle
State (describe all Idle States if multiple are available) and the time to
transition from the Idle State (or each Idle State if there are multiple) to the
start of job execution.
3.11 Facilities and Site Integration
1.1.105 The system should use 3-phase 480V AC. Other system infrastructure
components (e.g., disks, switches, login nodes, and mechanical subsystems
such as CDUs) must use either 3-phase 480V AC (strongly preferred),
3-phase 208V AC (second choice), or single-phase 120/240V AC (third choice).
The total number of individual branch circuits and the phase load imbalance
should be minimized.
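For illustration only (using one common measure of imbalance, with hypothetical numbers): if the three phases of a feed were to draw 100 A, 90 A, and 110 A, the average is 100 A and the imbalance is (110 - 100) / 100 = 10%; distributing branch circuits evenly across the phases keeps this figure small.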
1.1.106 All equipment and power control hardware of the system should be
Nationally Recognized Testing Laboratories (NRTL) certified and bear
appropriate NRTL labels.
1.1.107 Every rack, network switch, interconnect switch, node, and disk
enclosure should be clearly labeled with a unique identifier visible from the
front of the rack and/or the rear of the rack, as appropriate, when the rack
door is open. These labels will be of high quality so that they do not fall off,
fade, disintegrate, or otherwise become unusable or unreadable during the
lifetime of the system. Nodes will be labeled from the rear with a unique
serial number for inventory tracking. It is desirable that motherboards also
have a unique serial number for inventory tracking. Serial numbers shall be
visible without having to disassemble the node, or they must be able to be
queried from the system management console.
1.1.108 Table 3 below shows target facility requirements identified by APEX
for the Crossroads and NERSC-9 systems. The Offeror shall describe the
features of its proposed systems relative to site integration at the respective
facilities, including:
- Description of the physical packaging of the system, including
  dimensioned drawings of individual cabinet types and the floor layout of
  the entire system
- Remote environmental monitoring capabilities of the system and how it
  would integrate into facility monitoring
- Emergency shutdown capabilities
- Detailed descriptions of power and cooling distributions throughout the
  system, including power consumption for all subsystems
- Description of parasitic power losses within the Offeror's equipment, such
  as fans, power supply conversion losses, power-factor effects, etc. For the
  computational and platform storage subsystems separately, give an
  estimate of the total power and parasitic power losses (whose difference
  should be the power used by computational or platform storage
  components) at the minimum and maximum ITUE, which is defined as the
  ratio of total equipment power over power used by computational or
  platform storage components. Describe the conditions (e.g., "idle") at
  which the extrema occur. (A worked example of the ITUE calculation
  follows this list.)
- OS distributions or other client requirements to support off-system
  access to the platform storage (e.g., LANL File Transfer Agents)
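For illustration only (the numbers are hypothetical, not targets): if a computational cabinet were to draw 90 kW in total, of which 81 kW reaches the computational components, then ITUE = 90 kW / 81 kW = 1.11 (approximately); the remaining 9 kW, about 10% of the total, would be parasitic loss from fans, power-supply conversion, and similar overheads.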
Table 3 Crossroads and NERSC-9 Facility Requirements
Location
  Crossroads: Los Alamos National Laboratory, Los Alamos, NM. The system will be housed in the Strategic Computing Complex (SCC), Building 2327.
  NERSC-9: National Energy Research Scientific Computing Center, Lawrence Berkeley National Laboratory, Berkeley, CA. The system will be housed in Wang Hall, Building 59 (formerly known as the Computational Theory and Research Facility).

Seismic requirements
  Crossroads: …
  NERSC-9: … seismic isolation floor. System cabinets should have an attachment mechanism that will enable them to be firmly attached to each other and the isolation floor. When secured via these attachments, the cabinets should withstand seismic design accelerations per the California Building Code and LBNL Lateral …

Cooling water
  Crossroads: The system must operate in conformance with ASHRAE Class W2 guidelines (dated 2011). The facility will provide operating water temperature that nominally varies between 60-75°F, at up to 35 PSI differential pressure at the system cabinets. However, the Offeror should note if the system is capable of operating at higher temperatures. Note: the LANL facility will provide inlet water at a nominal 75°F. It may go as low as 60°F based on facility and/or environmental factors. Total flow requirements may not exceed 9600 GPM.
  NERSC-9: Same. Note: the NERSC facility will provide inlet water at a nominal 65°F. It may go as high as 75°F based on facility and/or environmental factors. Total flow requirements may not exceed 9600 GPM.

Water chemistry
  Crossroads: The system must operate with facility water meeting basic ASHRAE water chemistry. Special chemistry water is not available in the main building loop and would require a separate tertiary loop provided with the system. If tertiary loops are included in the system, the Offeror shall describe their operation and maintenance, including coolant chemistry, pressures, and flow controls. All coolant loops within the system should have reliable leak detection, temperature, and flow alarms, with automatic protection and notification mechanisms.
  NERSC-9: Same.

Air cooling
  Crossroads: The system must operate with supply air at 76°F or below, with a relative humidity from 30%-70%. The rate of airflow is between 800-1500 CFM per floor tile. No more than 3 MW of heat should be removed by air cooling.
  NERSC-9: The system must operate with supply air at 76°F or below, with a relative humidity from 30%-80%. The current facility can support up to 60K CFM of airflow and remove 500 KW of heat. Expansion is possible to 300K CFM and 1.5 MW, but at added expense.

Maximum Power Rate of Change
  Crossroads: The hourly average in system power should not exceed the 2 MW wide power band negotiated at least 2 hours in advance.
  NERSC-9: N/A

Power quality
  Crossroads: The system must be resilient to incoming power fluctuations at least to the level guaranteed by the ITIC power quality curve.
  NERSC-9: Same.

Ceiling height
  Crossroads: … 6" ceiling plenum.
  NERSC-9: 17'10" ceiling; however, maximum cabinet height is 9'5".

Maximum Footprint
  Crossroads: 8000 square feet; 80 feet long and 100 feet deep.
  NERSC-9: 64' x 92', or 5888 square feet (inclusive of compute, platform storage, and service aisles). This area is itself surrounded by a minimum 4' aisle that can be used in the system layout. It is preferred that cabinet rows run parallel to the short dimension.

Shipment Dimensions and Weight
  Crossroads: No restrictions.
  NERSC-9: For delivery, system components should weigh less than 7000 pounds and should fit into an elevator whose door is 6 ft 6 in wide and 9 ft 0 in high and whose depth is 8 ft 3 in. Clear internal width is 8 ft 4 in.

Floor loading
  Crossroads: The floor loading over the effective area should be no more than 300 pounds per square foot. The effective area is the actual loading area plus at most a foot of surrounding fully unloaded area. A maximum limit of 300 pounds per square foot also applies to all loads during installation. The Offeror shall describe how the weight will be distributed over the footprint of the rack (point loads, line loads, or evenly distributed over the entire footprint). A point load applied on a one square inch area should not exceed 1500 pounds. A dynamic load using a CISCA Wheel 1 size should not exceed 1250 pounds (CISCA Wheel 2: 1000 pounds).
  NERSC-9: The floor loading should not exceed a uniform load of 500 pounds per square foot. Raised floor tiles are ASM FS400 with an isolated point load of 2000 pounds and a rolling load of 1200 pounds.

Cabling and water connections
  Crossroads: Water connections should be below the access floor. It is preferable that all other cabling (e.g., system interconnect) is above floor and integrated into the system cabinetry. Under-floor cables (if unavoidable) should be plenum rated and comply with NEC 300.22 and NEC 645.5. All communications cables, wherever installed, should be source/destination labeled at both ends. All communications cables and fibers over 10 meters in length and installed under the floor should also have a unique serial number and dB loss data document (or equivalent) delivered at time of installation for each cable, if a method of measurement exists for the cable type.
  NERSC-9: Same.

External network connectivity
  Crossroads: 1Gb, 10Gb, 40Gb, 100Gb
  NERSC-9: Same.

External bandwidth on/off the system for general TCP/IP connectivity
  Crossroads: > 100 GB/s per direction
  NERSC-9: Same.

External bandwidth on/off the system for accessing the system's PFS
  Crossroads: …
  NERSC-9: …

External bandwidth on/off the system for accessing external, site supplied file systems (e.g., GPFS, NFS)
  Crossroads: …
  NERSC-9: …
4 Non-Recurring Engineering
The APEX team expects to award two (2) Non-Recurring Engineering (NRE)
subcontracts, separate from the two (2) system subcontracts. It is expected
that Crossroads and NERSC personnel will collaborate in both NRE
subcontracts. It is anticipated that the NRE subcontracts will be
approximately 10%-15% of the combined Crossroads and NERSC-9 system
budgets. The Offeror is encouraged to provide proposals for areas of
collaboration they feel provide substantial value to the Crossroads and
NERSC-9 systems with the goals of:
- Increasing application performance
- Increasing workflow performance
- Increasing the resilience and reliability of the system
Proposed collaboration areas should focus on topics that provide added
value beyond planned roadmap activities. Proposals should not focus on
one-off point solutions or on gaps created by their proposed design that should
otherwise be provided as part of a vertically integrated solution. It is expected
that NRE collaborations will have impact on both the Crossroads and
NERSC-9 systems and follow-on systems procured by the U.S. Department of
Energy's NNSA and Office of Science.
NRE topics of interest include, but are not limited to, the following:
- Development and optimization of hardware and software capabilities to
  increase the performance of MPI+OpenMP and future task-based
  asynchronous programming models
- Development and optimization of hardware and software capabilities to
  increase the performance of application workflows, including
  consideration of consistency requirements, data-migration needs, and
  system-wide resource management
- Development of scalable system management capabilities to enhance the
  reliability, resilience, power, and energy usage of Crossroads/NERSC-9
The APEX team expects to have future requirements for system upgrades
and/or additional quantities of components based on the configurations
proposed in response to this solicitation. The Offeror should address any
technical challenges foreseen with respect to scaling and any other
production issues. Proposals should be as detailed as possible.
5.1 Upgrades, Expansions and Additions
1.1.109 The Offeror shall propose and separately price upgrades, expansions
or procurement of additional system configurations by the following
fractions of the system as measured by the Sustained System Improvement
1.1.110 The Offeror shall propose a configuration or configurations which
double the baseline memory capacity.
1.1.111 The Offeror shall propose upgrades, expansions or procurement of
additional platform storage capacity (per tier if multiple tiers are present) in
increments of 25%.
5.2 Early Access Development System
To allow for early and/or accelerated development of applications or development
of functionality required as a part of the statement of work, the Offeror shall
propose options for early access development systems. These systems can be in
support of the baseline requirements or any proposed options.
1.1.112 The Offeror shall propose an Early Access Development System. The
primary purpose is to expose the application to the same programming
environment as will be found on the final system. It is acceptable for the early
access system to not use the final processor, node, or high-speed
interconnect architectures. However, the programming and runtime
environment must be sufficiently similar that a port to the final system is
trivial. The early access system shall contain functionality similar to that of
the final system, including file systems, but scaled down to the appropriate
configuration. The Offeror shall propose an option for each of the following
configurations, based on the size of the final Crossroads/NERSC-9 systems:
- 2% of the compute partition
- 5% of the compute partition
- 10% of the compute partition
1.1.113 The Offeror shall propose development test bed systems that will
reduce risk and aid the development of any advanced functionality that is
exercised as a part of the statement of work (for example, any topics
proposed for NRE).
5.3 Test Systems
The Offeror shall propose the following test systems. The systems shall contain all
the functionality of the main system, including file systems, but scaled down to the
appropriate configuration. Multiple test systems may be awarded.
1.1.114 The Offeror shall propose an Application Regression test system,
which should contain at least 200 compute nodes.
1.1.115 The Offeror shall propose a System Development test system, which
should contain at least 50 compute nodes.
5.4 On Site System and Application Software Analysts
1.1.116 The Offeror shall propose and separately price two (2) System
Software Analysts and two (2) Applications Software Analysts for each site.
Offerors shall presume each analyst will be utilized for four (4) years. For
Crossroads, these positions require a DOE Q-clearance for access.
5.5 Deinstallation
The Offeror shall propose to deinstall, remove and/or recycle the system and
supporting infrastructure at end of life. Storage media shall be wiped or
destroyed to the satisfaction of ACES and NERSC, and/or returned to ACES
and NERSC at their request.