APEX 2020 Technical Requirements
Lawrence Berkeley National Laboratory is operated by the University of California for the U.S.
Department of Energy under contract No. DE-AC02-05CH11231.
Los Alamos National Laboratory, an affirmative action/equal opportunity employer, is operated
by Los Alamos National Security, LLC, for the National Nuclear Security Administration of the
U.S. Department of Energy under contract DE-AC52-06NA25396. LA-UR-15-28541. Approved for
public release; distribution is unlimited.
Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia
Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S.
Department of Energy's National Nuclear Security Administration under contract
DE-AC04-94AL85000. SAND2016-4325 O.
2.3 Product Roadmap Description
3 Targets for System Design, Features, and Performance Metrics
3.1 Scalability
3.2 System Software and Runtime
3.3 Software Tools and Programming Environment
3.4 Platform Storage
3.5 Application Performance
3.6 Resilience, Reliability, and Availability
3.7 Application Transition Support and Early Access to APEX Technologies
3.8 Target System Configuration
3.9 System Operations
3.10 Power and Energy
3.11 Facilities and Site Integration
5.1 Upgrades, Expansions and Additions
5.2 Early Access Development System
5.3 Test Systems
5.4 On Site System and Application Software Analysts
5.5 Deinstallation
5.6 Maintenance and Support
6.1 Pre-delivery Testing
6.2 Site Integration and Post-delivery Testing
6.3 Acceptance Testing
8.1 Documentation
8.2 Training
Appendix B: LANS/UC Specific Project Management Requirements
1 Introduction
Los Alamos National Security, LLC (LANS), in furtherance of its participation
in the Alliance for Computing at Extreme Scale (ACES), a collaboration
between Los Alamos National Laboratory and Sandia National Laboratories;
in coordination with the Regents of the University of California (UC), which
operates the National Energy Research Scientific Computing (NERSC) Center
residing within the Lawrence Berkeley National Laboratory (LBNL), is
releasing a joint Request for Proposal (RFP) for two next generation systems,
Crossroads and NERSC-9, under the Alliance for application Performance at
EXtreme scale (APEX), to be delivered in the 2020 time frame.
The successful Offeror will be responsible for delivering and installing the
Crossroads and NERSC-9 systems at their respective locations. The targets/
requirements in this document are predominantly joint targets/
requirements for the two systems; however, where differences between the
systems are described, Offerors should provide clear and complete details
showing how their proposed Crossroads and NERSC-9 systems differ.
Each response/proposed solution within this document shall clearly describe
the role of any lower-tier subcontractor(s) and the technology or
technologies, both hardware and software, and value added that the
lower-tier subcontractor(s) provide(s), where appropriate.
The scope of work and technical specifications for any subcontracts resulting
from this RFP will be negotiated based on this Technical Requirements
Document and the successful Offeror's responses/proposed solutions.
Crossroads and NERSC-9 each have maximum funding limits over their
system lives, to include all design and development, site preparation,
maintenance, support and analysts. Total ownership costs will be considered
in system selection. The Offeror must respond with a configuration and
pricing for both systems.
Application performance and workflow efficiency are essential to these
procurements. Success will be defined as meeting APEX 2020 mission needs
while at the same time serving as a pre-exascale system that enables our
applications to begin to evolve using yet to be defined next generation
programming models. The advanced technology aspects of the APEX systems
will be pursued both by fielding first of a kind technologies on the path to
exascale as part of system build and by selecting and participating in
strategic NRE projects with the Offeror and applicable technology providers.
A compelling set of NRE projects will be crucial for the success of these
platforms, by enabling the deployment of first of a kind technologies in such a
way as to maximize their utility. The NRE areas of collaboration should
provide substantial value to the Crossroads and NERSC-9 systems with the
goals of:
Increasing application performance
Increasing workflow efficiency
Increasing the resilience and reliability of the system
The details of the NRE are more completely described in section 4.
To support the goals of application performance and workflow efficiency, an
accompanying whitepaper, “APEX Workflows,” is provided that describes
how application teams use High Performance Computing (HPC) resources
today to advance scientific goals. The whitepaper is designed to provide a
framework for reasoning about the optimal solution to these challenges (The
Crossroads/NERSC-9 workflows document can be found on the APEX
website.)
1.1 Crossroads
The Department of Energy (DOE) National Nuclear Security Administration
(NNSA) Advanced Simulation and Computing (ASC) Program requires a
computing system be deployed in 2020 to support the Stockpile Stewardship
Program. In the 2020 timeframe, Trinity, the first ASC Advanced Technology
System (ATS-1), will be nearing the end of its useful lifetime. Crossroads, the
proposed ATS-3 system, provides a replacement, tri-lab computing resource
for existing simulation codes and provides a larger resource for
ever-increasing computing requirements to support the weapons program. The
Crossroads system, to be sited at Los Alamos, NM, is projected to provide a
large portion of the ATS resources for the NNSA ASC tri-lab simulation
community: Los Alamos National Laboratory (LANL), Sandia National
Laboratories (SNL), and Lawrence Livermore National Laboratory (LLNL),
during the 2021-2025 timeframe.
In order to fulfill its mission, the NNSA Stockpile Stewardship Program
requires higher performance computational resources than are currently
available within the Nuclear Security Enterprise (NSE). These capabilities are
required for supporting stockpile stewardship certification and assessments
to ensure that the nation’s nuclear stockpile is safe, reliable, and secure.
The ASC Program faces significant challenges from the ongoing
technology revolution. It must continue to meet the mission needs of the
current applications but also adapt to radical change in technology in order
to continue running the most demanding applications in the future. The ASC
Program recognizes that the simulation environment of the future will be
transformed with new computing architectures and new programming
models that will take advantage of the new architectures. Within this context,
ASC recognizes that ASC applications must begin the transition to the new
simulation environment or they may become obsolete as a result of not
leveraging technology driven by market trends. With this challenge of
technology change, it is a major programmatic driver to provide an
architecture that keeps ASC moving forward and allows applications to fully
explore and exploit upcoming technologies, in addition to meeting NNSA
Defense Programs’ mission needs. It is possible that major modifications to
the ASC simulation tools will be required in order to take full advantage of
the new technology. However, codes running on NNSA Advanced Technology
Systems (Trinity and Sierra) in the 2019 timeframe are expected to run on
Crossroads. In some cases, new applications also may need to be developed.
Crossroads is expected to help technology development for the ASC Program
to meet the requirements of future systems with greater computational
performance or capability. Crossroads will serve as a technology path for
future ASC systems in the next decade.
To directly support the ASC Roadmap, which states that “work in this
timeframe will establish a strong technological foundation to build toward
exascale computing environments, which predictive capability may demand,”
it is critical for the ASC Program to both explore the rapidly changing
technology of future systems and to provide systems with higher
performance and more memory capacity for predictive capability. Therefore,
a design goal of Crossroads is to achieve a balance between usability of
current NNSA ASC simulation codes and adaptation to new computing
technologies.
1.2 NERSC-9
The DOE Office of Science (SC) requires a high performance production
computing system in the 2020 timeframe to provide a significant upgrade to
the current computational and data capabilities that support the basic and
applied research programs that help accomplish the mission of DOE SC.
The system also needs to provide a firm foundation for future exascale
systems in 2023 and beyond, a need identified in the DOE’s Strategic
Plan 2014-2018, which calls for “advanced scientific computing to analyze,
model, simulate and predict complex phenomena, including the scientific
potential that exascale simulation and data will provide in the future.”
The NERSC Center supports nearly 6000 users and about 600 different
application codes from a broad range of science disciplines covering all six
program offices in SC. The scientific goals are well summarized in the
2012-2014 series of requirements reviews commissioned by the Advanced
Scientific Computing Research (ASCR) office that brought together
application scientists, computer scientists, applied mathematicians, DOE
program managers and NERSC personnel. The 2012-2014 requirements
reviews indicated that compute-intensive research and research that
attempts scientific discovery through the analysis of experimental and
observational data both have a clear need for major increases in
computational capability and capacity in the 2017 timeframe and beyond. In
addition, several science areas also have a burgeoning need for HPC
resources that satisfy an increased compute workload and provide strong
support for data-centric workflows and real-time observational science.
More details about the DOE SC application requirements are in the reviews
located at: http://www.nersc.gov/science/hpc-requirements-reviews/
NERSC has already begun transitioning the SC user base to energy efficient
architectures, with the procurement of the NERSC-8 “Cori” system. In the
2020 time frame, NERSC also expects a need to address early exascale
hardware and software technologies, including the areas of processor
technology, memory hierarchies, networking technology, and programming
models.
The NERSC-9 system is expected to run for 4-6 years and will be housed in
Wang Hall (Building 59) at LBNL, which currently houses the “Cori” system
and other resources that NERSC supports. The system must integrate into
the NERSC environment and provide high bandwidth access to existing data
stored by continuing research projects. For more information about NERSC
and the current systems, environment, and support provided for our users,
see http://www.nersc.gov
1.3 Schedule
The following is the tentative schedule for the Crossroads and NERSC-9
systems.
Table 1 Crossroads/NERSC-9 Schedule
2 System Description
2.1 Architectural Description
The Offeror shall provide a detailed full system architectural description of
both the Crossroads and NERSC-9 systems, including diagrams and text
describing the following details as they pertain to the Offeror’s system
architecture(s):
Component architecture – details of all processor(s), memory
technologies, storage technologies, network interconnect(s) and any
other applicable components.
Node architecture(s) – details of how components are combined into the
node architecture(s). Details shall include bandwidth and latency
specifications (or projections) between components.
Board and/or blade architecture(s) – details of how the node
architecture(s) is integrated at the board and/or blade level. Details
should include all inter-node and inter-board/blade communication
paths and any additional board/blade level components.
Rack and/or cabinet architecture(s) – details of how board and/or blades
are organized and integrated into racks and/or cabinets. Details should
include all inter rack/cabinet communication paths and any additional
rack/cabinet level components.
Platform storage – details of how storage is integrated with the system,
including a platform storage architectural diagram.
System architecture – details of how rack or cabinets are combined to
produce system architecture, including the high-speed interconnects and
network topologies (if multiple) and platform storage.
Proposed floor plan – including details of the physical footprint of the
system and all of the supporting components.
2.2 Software Description
The Offeror shall provide a detailed description of the proposed software
eco-system, including a high-level software architectural diagram, the
provenance of each software component (for example, open source or
proprietary), and the support mechanism for each (for the lifetime of the
system, including updates).
2.3 Product Roadmap Description
The Offeror shall describe how the system does or does not fit into the
Offeror’s long-term product roadmap and a potential follow-on system
acquisition in the 2025 and beyond timeframe.
3 Targets for System Design, Features, and
Performance Metrics
This section contains targets for detailed system design, features and
performance metrics. It is desirable that the Offeror’s proposal meet or
exceed the targets outlined in this section If a target cannot be met, it is
desirable that the Offeror provide a development and deployment plan,
including a schedule, to satisfy the target.
The Offeror may also propose any hardware and/or software architectural
features that will provide improvements for any aspect of the system.
3.1 Scalability
The scale of the system necessary to meet the application requirements of
the APEX laboratories adds significant challenges. The Offeror should propose
a system that enables application performance up to the full scale of the
system. Additionally, the proposed system should provide functionality that
assists users in obtaining performance at up to full scale. Scalability
features, both hardware and software, that benefit both current and future
programming models are essential.
1.1.1 The system should support running jobs up to and including the full scale of
the system
1.1.2 The system should support launching an application at full system scale in
less than 30 seconds. The Offeror shall describe factors (such as executable
size) that could potentially affect application launch time.
1.1.3 The Offeror shall describe how application launch scales with the number of
concurrent launch requests (per second) and the scale of each launch request
(resources requested, such as the number of schedulable units, etc.),
including information such as:
All system-level and node-level overhead in the process startup including
how overhead scales with node count for parallel applications, or how
overhead scales with the application count for large numbers of serial
applications
Any limitations for processes on compute nodes from interfacing with an
external workflow manager, external database, or message queue
system
1.1.4 The system should support thousands of concurrent users and more than
20,000 concurrent batch jobs. The system should allow a mix of application
or user identity wherein at least a subset of nodes can run multiple
independent applications from multiple users. The Offeror shall describe
details, including limitations of their proposed support for this requirement.
1.1.5 The Offeror shall describe all areas of the system in which node-level
resource usage (hardware and software) increases as a job scales up (node,
core or thread count)
1.1.6 The system should utilize an optimized job placement algorithm to reduce
job runtime, lower variability, minimize latency, etc. The Offeror shall
describe in detail how the algorithm is optimized to the system architecture.
1.1.7 The system should include an application programming interface to allow
applications access to the physical-to-logical mapping information of the
job’s node allocation – including a mapping between MPI ranks and network
topology coordinates, and core, node and rack identifiers
1.1.8 The system software solution should provide a low jitter environment for
applications and should provide an estimate of a compute node operating
system’s noise profile, both while idle and while running a non-trivial MPI
application. If core specialization is used, the Offeror shall describe the
system software activity that remains on the application cores.
1.1.9 The system should provide correct numerical results and consistent
runtimes (i.e., wall clock time) that do not vary more than 3% from run to run
in dedicated mode and 5% in production mode. The Offeror shall describe
strategies for minimizing runtime variability.
1.1.10 The system’s high speed interconnect should support a high messaging
bandwidth, high injection rate, low latency, high throughput, and
independent progress. The Offeror shall describe:
The system interconnect in detail, including any mechanisms for adapting
to heavy loads or inoperable links, as well as a description of how
different types of failures will be addressed
How the interface will allow all cores in the system to simultaneously
communicate synchronously or asynchronously with the high speed
interconnect
How the interconnect will enable low-latency communication for one-
and two-sided paradigms
1.1.11 The Offeror shall describe how both hardware and software components of
the interconnect support effective computation and communication overlap
for both point-to-point operations and collective operations (i.e., the ability
of the interconnect subsystem to progress outstanding communication
requests in the background of the main computation thread).
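As an illustration of the overlap pattern described above, the following is a
minimal sketch using standard MPI nonblocking point-to-point calls; the
neighbor ranks, message size, and local work routine are placeholders, and
effective overlap depends on the independent progress capability this target
asks the Offeror to describe.

    /* Minimal sketch of communication/computation overlap using standard
       MPI nonblocking calls. The neighbor ranks, message size, and local
       work routine are placeholders; effective overlap depends on the
       interconnect progressing the requests in the background. */
    #include <mpi.h>

    static void compute_on_interior(void) { /* local work not touching the halos */ }

    void exchange_and_compute(double *sendbuf, double *recvbuf, int n,
                              int left, int right, MPI_Comm comm)
    {
        MPI_Request reqs[2];

        MPI_Irecv(recvbuf, n, MPI_DOUBLE, left,  0, comm, &reqs[0]);
        MPI_Isend(sendbuf, n, MPI_DOUBLE, right, 0, comm, &reqs[1]);

        compute_on_interior();   /* overlapped local computation */

        /* With independent progress, the transfers complete while the
           computation above runs and this wait returns quickly. */
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    }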
1.1.12 The Offeror shall report or project the proposed system’s node
injection/ejection bandwidth
1.1.13 The Offeror shall report or project the proposed system’s bit error rate of the
interconnect in terms of time period between errors that interrupt a job
running at the full scale of the system
1.1.14 The Offeror shall describe how the interconnect of the system will provide
Quality of Service (QoS) capabilities (e.g., in the form of virtual channels or
other sub-system QoS capabilities), including but not limited to:
An explanation of how these capabilities can be used to prevent core
communication traffic from interfering with other classes of
communication, such as debugging and performance tools or with I/O
traffic
An explanation of how these capabilities allow efficient adaptive routing
as well as a capability to prevent traffic from different applications
interfering with each other (either through QoS capabilities or
appropriate job partitioning)
An explanation of any sub-system QoS capabilities (e.g., platform storage
QoS features).
1.1.15 The Offeror shall describe specialized hardware or software features of the
system that accelerate workflows or components of workflows such as data
analysis or visualization, and describe any limits to their scalability on the
system. The hardware should be on the same high speed network as the
main compute resources and should have equal access to other compute
resources (e.g., file systems and platform storage). It is desirable that the
hardware have the same node level architecture as the main compute
resources, but could, for example, have more memory per node.
3.2 System Software and Runtime
The system should include a well-integrated and supported system software
environment. The overall imperative is to provide users with a productive,
high-performing, reliable, and scalable system software environment that
enables efficient use of the full capability of the system.
1.1.16 The system should include a full-featured Linux operating system
environment on all user visible service partitions (e.g., front-end nodes,
service nodes, I/O nodes). The Offeror shall describe the proposed
full-featured Linux operating system environment.
1.1.17 The system should include an optimized compute partition operating system
that provides an efficient execution environment for applications running up
to full-system scale. The Offeror shall describe any HPC-relevant
optimizations made to the compute partition operating system.
1.1.18 The Offeror shall describe the security capabilities of the operating systems
proposed in targets 1.1.16 and 1.1.17
1.1.19 The system should include efficient support for dynamic shared libraries,
both at job load time and during runtime. The Offeror shall describe how
applications using shared libraries will execute at full system scale with
minimal performance overhead compared to statically linked applications.
1.1.20 The system should include resource management functionality, including job
migration, backfill, targeting of specified resources (e.g., platform storage),
advance and persistent reservations, job preemption, job accounting,
architecture-aware job placement, power management, job dependencies
(e.g., workload management), and resilience management. The Offeror may
propose multiple solutions for a vendor-supported resource manager and
should describe the benefits of each.
1.1.21 The system should support jobs consisting of multiple individual applications
running simultaneously (inter-node or intra-node) and cooperating as part of
an overall multi-component application (e.g., a job that couples a simulation
application to an analysis application). The Offeror shall describe in detail
how this will be supported by the system software infrastructure (e.g., user
interfaces, security model, and inter-application communication).
1.1.22 The system should include a mechanism that will allow users to provide
containerized software images without requiring privileged access to the
system or allowing a user to escalate privilege. The startup time for
launching a parallel application in a containerized software image at full
system scale should not greatly exceed the startup time for launching a
parallel application in the vendor-provided image.
1.1.23 The system should include a mechanism for dynamically configuring external
IPv4/IPv6 connectivity to and from compute nodes, enabling special
connectivity paths for subsets of nodes on a per-batch-job basis, and allowing
fully routable interactions with external services
1.1.24 The Successful Offeror should provide access to source code, and necessary
build environment, for all software except for firmware, compilers, and third
party products. The Successful Offeror should provide updates of source
code, and any necessary build environment, for all software over the life of
the subcontract.
3.3 Software Tools and Programming Environment
The primary programming models used in production applications in this
time frame are the Message Passing Interface (MPI), for inter-node
communication, and OpenMP, for fine-grained on-node parallelism. While
MPI+OpenMP will be the majority of the workload, the APEX laboratories
expect some new applications to exercise emerging asynchronous
programming models. System support that would accelerate these
programming models/runtimes and benefit MPI+OpenMP is desirable.
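For reference, the hybrid model referred to above combines the two standards
in a single application; the following minimal sketch (not one of the APEX
benchmarks) uses MPI across ranks and OpenMP within a rank.

    /* Minimal MPI+OpenMP sketch: OpenMP for on-node parallelism, MPI for
       communication between ranks. Illustrative only. */
    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int provided, rank;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double local = 0.0;
        #pragma omp parallel for reduction(+:local)
        for (int i = 1; i <= 1000000; i++)
            local += 1.0 / (double)i;            /* fine-grained on-node work */

        double global = 0.0;
        MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("global sum = %f\n", global);
        MPI_Finalize();
        return 0;
    }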
1.1.25 The system should include an implementation of the MPI version 3.1 (or
most current) standard specification. The Offeror shall provide a detailed
description of the MPI implementation (including specification version) and
support for features such as accelerated collectives, and shall describe any
limitations relative to the MPI standard.
1.1.26 The Offeror shall describe at what parallel granularity the system can be
utilized by MPI-only applications
1.1.27 The system should include optimized implementations of collective
operations utilizing both inter-node and intra-node features where
appropriate, including MPI_Barrier, MPI_Allreduce, MPI_Reduce,
MPI_Allgather, and MPI_Gather
1.1.28 The Offeror shall describe the network transport layer of the system
including support for OpenUCX, Portals, libfabric, libverbs, and any other
transport layer including any optimizations of their implementation that will
benefit application performance or workflow efficiency
1.1.29 The system should include a complete implementation of the OpenMP
version 4.1 (or most current) standard including, if applicable, accelerator
directives, as well as a supporting programming environment. The Offeror
shall provide a detailed feature description of the OpenMP
implementation(s) and describe any expected deviations from the OpenMP
standard.
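A minimal sketch of the accelerator directives mentioned above follows. It
assumes an OpenMP 4.x compiler; whether a discrete device is present is an
assumption, and on a homogeneous node the region simply runs on the host.

    /* Minimal OpenMP 4.x accelerator-directive sketch. If no device is
       present the region executes on the host. Illustrative only. */
    #include <omp.h>

    void scale(double *x, double a, int n)
    {
        #pragma omp target teams distribute parallel for map(tofrom: x[0:n])
        for (int i = 0; i < n; i++)
            x[i] *= a;
    }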
1.1.30 The Offeror shall provide a description of how OpenMP 3.1 applications will
be compiled and executed on the system
1.1.31 The Offeror shall provide a description of any proposed hardware or
software features that enable OpenMP performance optimizations
1.1.32 The Offeror shall list any PGAS languages and/or libraries that are supported
(e.g., UPC, SHMEM, CAF, Global Arrays) and describe any hardware and/or
programming environment software that optimizes any of the listed PGAS
languages supported on the system. The system should include a mechanism
to compile, run, and debug UPC applications. The Offeror shall describe
interoperability with MPI+OpenMP.
1.1.33 The Offeror shall describe and list support for any emerging programming
models such as asynchronous task/data models (e.g., Legion, STAPL, HPX, or
OCR) and describe any system hardware and/or programming environment
software it will provide that optimizes any of the supported models. The
Offeror shall describe interoperability with MPI+OpenMP.
1.1.34 The Offeror shall describe the proposed hardware and software environment
support for:
Fast thread synchronization of subsets of execution threads
Atomic add, fetch-and-add, multiply, bitwise operations, and
compare-and-swap operations over integer, single-precision, and double-precision
operands
Atomic compare-and-swap operations over 16-byte wide operands that
comprise two double precision values or two memory pointer operands
Fast context switching or task-switching
Fast task spawning for unique and identical tasks with data dependencies
Support for active messages
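As an illustration of the operations listed above, the following sketch uses
C11 atomics and the GCC __atomic builtins; availability of a native 16-byte
compare-and-swap is an assumption about the underlying hardware (for
example, requiring -mcx16 on x86-64), not a statement about the proposed
system.

    /* Sketch of atomic fetch-and-add, a bitwise atomic, and a 16-byte
       compare-and-swap over two double-precision operands, using C11
       atomics and GCC __atomic builtins. Hardware support for the
       double-width CAS is assumed, not guaranteed. */
    #include <stdatomic.h>

    typedef struct { double lo; double hi; } pair16;   /* 16-byte operand */

    long fetch_add(_Atomic long *ctr)
    {
        return atomic_fetch_add(ctr, 1);                /* atomic fetch-and-add */
    }

    unsigned long set_bits(_Atomic unsigned long *w, unsigned long mask)
    {
        return atomic_fetch_or(w, mask);                /* atomic bitwise OR */
    }

    int cas16(pair16 *p, pair16 *expected, pair16 desired)
    {
        /* 16-byte compare-and-swap over two double-precision values. */
        return __atomic_compare_exchange(p, expected, &desired, 0,
                                         __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
    }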
1.1.35 The Offeror shall describe in detail all programming APIs, languages,
compilers and compiler extensions, etc., other than MPI and OpenMP (e.g.,
OpenACC, CUDA, OpenCL, etc.) that will be supported by the system. It is
desirable that instances of all programming models provided be
interoperable and efficient when used within a single process or single job
running on the same compute node.
1.1.36 The system should include support for the C, C++ (including
complete C++11/14/17), Fortran 77, Fortran 90, and Fortran 2008
programming languages. Providing multiple compilation environments is
highly desirable. The Offeror shall describe any limitations that can be
expected in meeting full C++17 support based on current expectations.
1.1.37 The system should include a Python implementation that will run on the
compute partition with optimized MPI4Py, NumPy, and SciPy libraries
1.1.38 The system should include a programming toolchain(s) that enables runtime
coexistence of threading in C, C++, and Fortran, from within applications and
any supporting libraries using the same toolchain. The Offeror shall describe
the interaction between OpenMP and native parallelism expressed in
language standards.
1.1.39 The system should include C++ compiler(s) that can successfully build the
Boost C++ library (http://www.boost.org). The Offeror shall support the most
recent stable version of Boost.
1.1.40 The system should include optimized versions of libm, libgsl, BLAS levels 1, 2
and 3, LAPACK, ScaLAPACK, HDF5, NetCDF, and FFTW It is desirable for
these to efficiently interoperate with applications that utilize OpenMP. The
Offeror shall describe all other optimized libraries that will be supported,
including a description of the interoperability of these libraries with the
programming environments proposed.
1.1.41 The system should include a mechanism that enables control of task and
memory placement within a node for efficient performance. The Offeror
shall provide a detailed description of controls provided and any limitations
that may exist.
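As one illustration (assuming a Linux environment and an OpenMP 4.5
runtime), the sketch below reports where each thread actually executes,
which can be used to verify the effect of whatever placement controls the
Offeror provides.

    /* Reports the OpenMP place and Linux CPU on which each thread runs,
       to verify task placement controls. Assumes Linux (sched_getcpu)
       and OpenMP 4.5 place queries. Illustrative only. */
    #define _GNU_SOURCE
    #include <omp.h>
    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
        #pragma omp parallel
        printf("thread %d of %d: place %d, cpu %d\n",
               omp_get_thread_num(), omp_get_num_threads(),
               omp_get_place_num(), sched_getcpu());
        return 0;
    }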
1.1.42 The system should include a comprehensive software development
environment with configuration and source code management tools. On
heterogeneous systems, a mechanism (e.g., an upgraded autoconf) should be
provided to create configure scripts to build cross-compiled applications on
login nodes.
1.1.43 The system should include an interactive parallel debugger with an
X11-based graphical user interface. The debugger should provide a single point of
control that can debug applications in all supported languages using all
granularities of parallelism (e.g., MPI+X) and programming environments
provided and scale up to 25% of the system.
1.1.44 The system should include a suite of tools for detailed performance analysis
and profiling of user applications. At least one tool should support all
granularities of parallelism in mixed MPI+OpenMP programs and any
additional programming models supported on the system The tool suite
must provide the ability to support multi-node integrated profiling of
on-node parallelism and communication performance analysis. The Offeror shall
describe all proposed tools and the scalability limitations of each. The Offeror
shall describe tools for measuring I/O behavior of user applications.
1.1.45 The system should include event-tracing tools. Event tracing of interest
includes: message-passing event tracing, I/O event tracing, floating point
exception tracing, and message-passing profiling The event-tracing tool API
should provide functions to activate and deactivate event monitoring during
execution from within a process.
1.1.46 The system should include single- and multi-node stack-tracing tools. The
tool set should include a source-level stack trace back, including an API that
allows a running process or thread to query its current stack trace.
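For illustration, the glibc backtrace facility provides the kind of
in-process stack-trace query described above; it is used here only as a
stand-in for whatever API the Offeror proposes.

    /* In-process stack-trace query using the glibc backtrace facility,
       as a stand-in for the Offeror-provided stack-tracing API. */
    #include <execinfo.h>
    #include <stdio.h>
    #include <stdlib.h>

    void print_my_stack(void)
    {
        void *frames[64];
        int n = backtrace(frames, 64);
        char **symbols = backtrace_symbols(frames, n);
        for (int i = 0; i < n; i++)
            printf("%s\n", symbols[i]);
        free(symbols);
    }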
1.1.47 The system should include tools to assist the programmer in introducing
limited levels of parallelism and data structure refactoring to codes using any
proposed programming models and languages. Tool(s) should additionally
be provided to assist application developers in the design and placement of
the data structures with the goal of optimizing data movement/placement
for the classes of memory proposed in the system.
1.1.48 The system should include software licenses to enable the following number
of simultaneous users on the system:
Crossroads NERSC-9
3.4 Platform Storage
Platform storage is certain to be one of the advanced technology areas
included in any system delivered in this timeframe. The APEX laboratories
anticipate these emerging technologies will enable new usage models. With
this in mind, an accompanying whitepaper, “APEX Workflows,” is provided
that describes how application teams use HPC resources today to advance
scientific goals. The whitepaper is designed to provide a framework for
reasoning about the optimal solution to these challenges. The whitepaper is
intended to help an Offeror develop a platform storage architecture response
that accelerates the science workflows while minimizing the total number of
platform storage tiers. The Crossroads/NERSC-9 workflows document can be
found on the APEX website.
1.1.49 The system should include platform storage capable of retaining all
application input, output, and working data for 12 weeks (84 days),
estimated at a minimum of 36% of baseline system memory per day.
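For reference, retaining 84 days of data at 36% of baseline system memory
per day corresponds to roughly 0.36 × 84 ≈ 30.2 times the baseline memory
capacity, consistent with the platform storage target of greater than 30X
baseline memory in Table 2.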
1.1.50 The system should include platform storage with an appropriate durability
or a maintenance plan such that the platform storage is capable of absorbing
approximately four times the system’s baseline memory per day for the life of
the system.
1.1.51 The Offeror shall describe how the system provides sufficient bandwidth to
support a JMTTI/Delta-Ckpt ratio of greater than 200 (where Delta-Ckpt is
less than 7.2 minutes).
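As a worked example (the memory capacity used here is hypothetical, not a
stated requirement): checkpointing 2 PB of baseline memory within a
Delta-Ckpt of 7.2 minutes (432 seconds) implies a sustained platform storage
bandwidth of roughly 2 PB / 432 s ≈ 4.6 TB/s, and with the 24-hour JMTTI
target of Section 3.6 the ratio is 24 h / 7.2 min = 200.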
1.1.52 The Offeror shall describe the projected characteristics of all platform
storage devices for the system, including but not limited to:
Usable capacity, access latencies, platform storage interfaces (e.g., NVMe,
PCIe), expected lifetime (warranty period, MTTF, total writes, etc.), and
media and device error rates
Relevant software/firmware features
Compression technologies used by the platform storage devices, the
resources used to implement the compression/decompression
algorithms, the expected compression rates, and all
compression/decompression-related performance impacts
1.1.53 The Offeror shall describe all available interfaces to platform storage for the
system, including but not limited to:
POSIX
APIs
Exceptions to POSIX compliance
Time to consistency and any potential delays for reliable data
consumption
Any special requirements for users to achieve performance and/or
consistent data
1.1.54 The Offeror shall describe the reliability characteristics of platform storage,
including but not limited to:
Any single point of failure for all proposed platform storage tiers (note
any component failure that will lead to temporary or permanent loss of
data availability)
Mean time to data loss for each platform storage tier provided
Enumerate platform storage tiers that are designed to be less reliable or
do not use data protection techniques (e.g., replication, erasure coding)
The magnitudes and duration of performance and reliability degradation
brought about by a single or multiple component failures for each reliable
platform storage tier
Vendor supplied mechanisms to ensure data integrity for each platform
storage tier (e.g., data scrubbing processes, background checksum
verification, etc.)
Enumerate any platform storage failures that potentially impact
scheduled or currently executing jobs that impact the platform storage or
system performance and/or availability
Login or interactive node access to platform storage when the compute
nodes are unavailable
1.1.55 The Offeror shall describe system features for platform storage tier
management designed to accelerate workflows, including but not limited to:
Mechanisms for migrating data between platform storage tiers, including
manual, scheduled, and/or automatic data migration to include
rebalancing, draining, or rewriting data across devices within a tier
How platform storage will be instantiated with each job if it needs to be,
and how platform storage may be persisted across jobs
The capabilities provided to define per-user policies and automate data
movement between different tiers of platform storage or external storage
resources (e.g., archives)
The ability to serialize namespaces no longer in use (e.g., snapshots)
The ability to restore namespaces needed for a scheduled job that is not
currently available
The ability to integrate with or act as a site-wide scheduling resource
A mechanism to incrementally add capacity and bandwidth to a
particular tier of platform storage without requiring a tier-wide outage
Capabilities to manage or interface platform storage with external
storage resources or archives (e.g., fast storage layers or HPSS)
1.1.56 The Offeror shall describe software features that allow users to optimize I/O
for the workflows of the system, including but not limited to:
Batch data movement capabilities, especially when data resides on
multiple tiers of platform storage
Methods for users to create and manage platform storage allocations
Any ability to directly write to or read from a tier not directly (logically)
adjacent to the compute resources
Locality-aware job/data scheduling
I/O utilization for reservations
Features to prevent data duplication on more than one platform storage tier
1.1.57 The Offeror shall describe the method for walking the entire platform storage
metadata, and describe any special capabilities that would mitigate user
performance issues for daily full-system namespace walks; expect at least 1
billion objects
1.1.58 The Offeror shall describe any capabilities to comprehensively collect
platform storage usage data (in a scalable way), for the system, including but
not limited to:
Per client metrics and frequency of collection, including but not limited
to: the number of bytes read or written, number of read or write
invocations, client cache statistics, and metadata statistics such as
number of opens, closes, creates, and other system calls of relevance to
the performance of platform storage
Job level metrics, such as the number of sessions each job initiates with
each platform storage tier, session duration, total data transmitted
(separated as reads and writes) during the session, and the number of
total platform storage invocations made during the session
Platform storage tier metrics and frequency of collection, such as the
number of bytes read, number of bytes written, number of read
invocations, number of write invocations, bytes deleted/purged, number
of I/O sessions established, and periods of outage/unavailability
Job level metrics describing usage of a tiered platform storage hierarchy,
such as how long files are resident in each tier, hit rate of file pages in
each tier (i.e., whether pages are actually read and how many times data
is re-read), fraction of data moved between tiers because of a) explicit
programmer control and b) transparent caching, and time interval
between accesses to the same file (e.g., how long until an analysis
program reads a simulation generated output file)
1.1.59 The Offeror shall propose a method for providing access to platform storage
from other systems at the facility. In the case of tiered platform storage, at
least one tier must satisfy this requirement.
1.1.60 The Offeror shall describe the capability for platform storage tiers to be
repaired, serviced, and incrementally patched/upgraded while running
different versions of software or firmware without requiring a storage
tier-wide outage. The Offeror shall describe the level of performance degradation,
if any, anticipated during the repair or service interval.
1.1.61 The Offeror shall specify the minimum number of compute nodes required to
read and write the following data sets from/to platform storage:
A 1 TB data set of 20 GB files in 2 seconds.
A 5 TB data set of any chosen file size in 10 seconds. The Offeror shall report
the file size chosen.
A 1 PB data set of 32 MB files in 1 hour.
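For reference, these targets correspond to sustained rates of roughly
1 TB / 2 s = 500 GB/s, 5 TB / 10 s = 500 GB/s, and 1 PB / 3600 s ≈ 280 GB/s;
dividing by an assumed (purely illustrative) per-node injection bandwidth of
10 GB/s would suggest on the order of 50, 50, and 28 nodes respectively,
before accounting for small-file and metadata overheads.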
3.5 Application Performance
Assuring that real applications perform well on both the Crossroads and
NERSC-9 systems is key for their success. Because the full applications are
large, often with millions of lines of code, and in some cases are export
controlled, a suite of benchmarks has been developed for RFP response
evaluation and system acceptance. The benchmark codes are representative
of the workloads of the APEX laboratories but often smaller than the full
applications.
The performance of the benchmarks will be evaluated as part of both the RFP
response and system acceptance. Final benchmark acceptance performance
targets will be negotiated after a final system configuration is defined. All
performance tests must continue to meet negotiated acceptance criteria
throughout the lifetime of the system.
System acceptance for Crossroads will also include an ASC Simulation Code
Suite comprised of at least two (2) but no more than four (4) ASC
applications from the three NNSA laboratories, Sandia, Los Alamos and
Lawrence Livermore.
The Crossroads/NERSC-9 benchmarks, information regarding the Crossroads
acceptance codes, and supplemental materials can be found on the APEX
website.
1.1.62 The Offeror shall provide responses to the benchmarks (SNAP, PENNANT,
HPCG, MiniPIC, UMT, MILC, MiniDFT, GTC, and Meraculous) provided on the
Crossroads/NERSC-9 benchmarks link on the APEX website. All modifications
or new variants of the benchmarks (including makefiles, build scripts, and
environment variables) are to be supplied in the Offeror’s response.
The results of all problem sizes (baseline and optimized) should be
provided in the Offeror's Scalable System Improvement (SSI)
spreadsheets. SSI is the calculation used for measuring improvement and
is documented on the APEX website, along with the SSI spreadsheets. If
predicted or extrapolated results are provided, the methodology used to
derive them should be documented.
The Offeror shall provide licenses for the system for all compilers,
libraries, and runtimes used to achieve benchmark performance.
1.1.63 The Offeror shall provide performance results for the system that may be
benchmarked, predicted, and/or extrapolated for the baseline MPI+OpenMP
(or UPC for Meraculous) variants of the benchmarks. The Offeror may modify
the benchmarks to include extra OpenMP pragmas as required, but the
benchmark must remain a standard-compliant program that maintains
existing output subject to the validation criteria described in the benchmark
run rules.
1.1.64 The Offeror shall optionally provide performance results from an Offeror
optimized variant of the benchmarks. The Offeror may modify the
benchmarks, including the algorithm and/or programming model used to
demonstrate high system performance. If algorithmic changes are made, the
Offeror shall provide an explanation of why the results may deviate from
validation criteria described in the benchmark run rules.
1.1.65 For the Crossroads system only: in addition to the Crossroads/NERSC-9
benchmarks, an ASC Simulation Code Suite representing the three NNSA
laboratories will be used to judge performance at time of acceptance. The
Crossroads system should achieve a minimum of 6 times (6X)
improvement over the ASC Trinity system (Knights Landing partition) for
each code, measured using SSI. The Offeror shall specify a baseline
performance greater than or equal to 6X at time of response. Final
acceptance performance targets will be negotiated after a final system
configuration is defined. Information regarding ASC Simulation Code Suite
run rules and acceptance can be found on the APEX website. Source code will
be provided to the Offeror but will require compliance with export control
laws and no-cost licensing agreements.
1.1.66 The Offeror shall report or project the number of cores necessary to saturate
the available node baseline memory bandwidth as measured by the
Crossroads/NERSC-9 memory bandwidth benchmark found on the APEX
website.
If the node contains heterogeneous cores, the Offeror shall report the
number of cores of each architecture necessary to saturate the available
baseline memory bandwidth
If multiple tiers of memory are available, the Offeror shall report the
above for every functional combination of core architecture and baseline
or extended memory tier.
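The loop below is a minimal STREAM-triad-style sketch of the kind of
per-node measurement this target refers to; it is not the Crossroads/NERSC-9
memory bandwidth benchmark, which is distributed on the APEX website.

    /* Minimal triad-style memory bandwidth sketch (not the APEX benchmark).
       Counts 24 bytes of traffic per element (read b, read c, write a). */
    #include <omp.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        const long n = 1L << 27;                  /* ~1 GiB per array */
        double *a = malloc(n * sizeof *a);
        double *b = malloc(n * sizeof *b);
        double *c = malloc(n * sizeof *c);
        const double s = 3.0;

        #pragma omp parallel for
        for (long i = 0; i < n; i++) { a[i] = 0.0; b[i] = 1.0; c[i] = 2.0; }

        double t = omp_get_wtime();
        #pragma omp parallel for
        for (long i = 0; i < n; i++)
            a[i] = b[i] + s * c[i];               /* triad kernel */
        t = omp_get_wtime() - t;

        printf("triad bandwidth: %.1f GB/s\n", 24.0 * n / t / 1e9);
        free(a); free(b); free(c);
        return 0;
    }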
1.1.67 The Offeror shall report or project the sustained dense matrix multiplication
performance on each type of processor core (individually and/or in parallel)
of the system node architecture(s) as measured by the Crossroads/NERSC-9
multithreaded DGEMM benchmark found on the APEX website.
The Offeror shall describe the percentage of theoretical double-precision
(64-bit) computational peak that the benchmark GFLOP/s rate
achieves for each type of compute core/unit in the response, and describe
how this is calculated.
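As an illustration (all numbers hypothetical): a core with a 2.0 GHz clock
and two 8-wide double-precision FMA units has a theoretical peak of
2.0 GHz × 2 units × 8 lanes × 2 FLOPs/FMA = 64 GFLOP/s, so a measured DGEMM
rate of 48 GFLOP/s on that core would be reported as 48 / 64 = 75% of peak.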
1.1.68 The Offeror shall report, or project, the MPI two-sided message rate of the
nodes in the system under the following conditions measured by the
communication benchmark specified on the APEX website:
Using a single MPI rank per node with MPI_THREAD_SINGLE.
Using two, four, and eight MPI ranks per node with
MPI_THREAD_SINGLE
Using one, two, four, and eight MPI ranks per node and multiple threads
per rank with MPI_THREAD_MULTIPLE
The Offeror may additionally choose to report on other configurations.
1.1.69 The Offeror shall report, or project, the MPI one-sided message rate of the
nodes in the system for all passive synchronization RMA methods with both
pre-allocated and dynamic memory windows under the following conditions
measured by the communication benchmark specified on the APEX website
using:
A single MPI rank per node with MPI_THREAD_SINGLE
Two, four, and eight MPI ranks per node with MPI_THREAD_SINGLE
One, two, four, and eight MPI ranks per node and multiple threads per
rank with MPI_THREAD_MULTIPLE
The Offeror may additionally choose to report on other configurations.
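A minimal sketch of the passive-target, pre-allocated-window case measured
above follows; the target rank and counts are placeholders, and the
dynamic-window variant would use MPI_Win_create_dynamic and MPI_Win_attach
instead.

    /* Passive-target one-sided sketch with a pre-allocated window
       (collective over comm). The dynamic variant would use
       MPI_Win_create_dynamic/MPI_Win_attach. Illustrative only. */
    #include <mpi.h>

    void put_to_rank(const double *data, int count, int target, MPI_Comm comm)
    {
        double *winbuf;
        MPI_Win win;

        /* Allocate and expose the window memory in one call. */
        MPI_Win_allocate((MPI_Aint)count * sizeof(double), sizeof(double),
                         MPI_INFO_NULL, comm, &winbuf, &win);

        /* Passive synchronization: the target makes no matching calls. */
        MPI_Win_lock(MPI_LOCK_SHARED, target, 0, win);
        MPI_Put(data, count, MPI_DOUBLE, target, 0, count, MPI_DOUBLE, win);
        MPI_Win_unlock(target, win);

        MPI_Win_free(&win);
    }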
1.1.70 The Offeror shall report, or project, the time to perform the following
collective operations for full, half, and quarter machine size in the system and
report on core occupancy during the operations measured by the
communication benchmark specified on the APEX website for:
An 8 byte MPI_Allreduce operation
An 8 byte per rank MPI_Allgather operation
1.1.71 The Offeror shall report, or project, the minimum and maximum off-node
latency of the system for MPI two-sided messages using the following
threading modes measured by the communication benchmark specified on
the APEX website:
MPI_THREAD_SINGLE with a single thread per rank
MPI_THREAD_MULTIPLE with two or more threads per rank
1.1.72 The Offeror shall report, or project, the minimum and maximum off-node
latency for MPI one-sided messages of the system for all passive
synchronization RMA methods with both pre-allocated and dynamic memory
windows using the following threading modes measured by the
communication benchmark specified on the APEX website:
MPI_THREAD_SINGLE with a single thread per rank
MPI_THREAD_MULTIPLE with two or more threads per rank
1.1.73 The Offeror shall provide an efficient implementation of
MPI_THREAD_MULTIPLE. Bandwidth, latency, and message throughput
measurements using the MPI_THREAD_MULTIPLE thread support level
should have no more than a 10% performance degradation when compared
to using the MPI_THREAD_SINGLE support level as measured by the
communication benchmark specified on the APEX website.
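The sketch below shows how an application requests and verifies the
MPI_THREAD_MULTIPLE support level whose overhead this target bounds; each
OpenMP thread then makes its own concurrent MPI call (a self-exchange keyed
by thread id, chosen only for illustration).

    /* Requests MPI_THREAD_MULTIPLE and lets every OpenMP thread call MPI
       concurrently (a self-exchange tagged by thread id). Illustrative only. */
    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int provided, rank;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (provided >= MPI_THREAD_MULTIPLE) {
            #pragma omp parallel
            {
                int tid = omp_get_thread_num(), out = tid, in = -1;
                MPI_Sendrecv(&out, 1, MPI_INT, rank, tid,
                             &in,  1, MPI_INT, rank, tid,
                             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            }
        } else {
            printf("MPI_THREAD_MULTIPLE not available (provided=%d)\n", provided);
        }

        MPI_Finalize();
        return 0;
    }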
1.1.74 The Offeror shall report, or project, the maximum I/O bandwidths of the
system as measured by the IOR benchmark specified on the APEX website
1.1.75 The Offeror shall report, or project, the metadata rates of the system as
measured by the MDTEST benchmark specified on the APEX website
1.1.76 The Successful Offeror shall be required at time of acceptance to meet
specified targets for acceptance benchmarks, and mission codes for
Crossroads, listed on the APEX website
1.1.77 The Offeror shall describe how the system may be configured to support a
high rate and bandwidth of TCP/IP connections to external services both
from compute nodes and directly to and from the platform storage, including:
Compute node external access should allow all nodes to each initiate 1
connection concurrently within a 1 second window
Transfer of data over the external network to and from the compute
nodes and platform storage at 100 GB/s per direction of a 1 TB dataset
comprised of 20 GB files in 10 seconds
3.6 Resilience, Reliability, and Availability
The ability to achieve the APEX mission goals hinges on the productivity of
system users. System availability is therefore essential and requires a
system-wide focus to achieve a resilient, reliable, and available system. For each
metric specified below, the Offeror must describe how it arrived at its
estimates.
1.1.78 Failure of the system management and/or RAS system(s) should not cause a
system or job interrupt. This requirement does not apply to a RAS system
feature that automatically shuts down the system for safety reasons, such
as an overheating condition.
1.1.79 The minimum System Mean Time Between Interrupt (SMTBI) should be
greater than 720 hours
1.1.80 The minimum Job Mean Time To Interrupt (JMTTI) should be greater than 24
hours. Automatic restarts do not mitigate a job interrupt for this metric.
1.1.81 The ratio of JMTTI/Delta-Ckpt should be greater than 200. This metric is a
measure of the system’s ability to make progress over a long period of time
and corresponds to an efficiency of approximately 90%. If, for example, the
JMTTI requirement is not met, the target JMTTI/Delta-Ckpt ratio ensures this
minimum level of efficiency.
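For reference, the approximately 90% figure is consistent with a standard
checkpoint/restart efficiency model (this derivation is illustrative, not part
of the requirement): if checkpoints are taken at the optimal interval, the
fraction of time lost to checkpointing and rework after failures is roughly
sqrt(2 × Delta-Ckpt / JMTTI), so a JMTTI/Delta-Ckpt ratio of 200 gives a loss
of about sqrt(2/200) = 0.1, i.e., roughly 90% efficiency.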
1.1.82 An immediate re-launch of an interrupted job should not require a complete
resource reallocation. If a job is interrupted, there should be a mechanism
that allows re-launch of the application using the same allocation of resources
(e.g., compute nodes) that it had before the interrupt or an augmented
allocation when part of the original allocation experiences a hard failure.
1.1.83 A complete system initialization should take no more than 30 minutes. The
Offeror shall describe the full system initialization sequence and timings.
1.1.84 The system should achieve 99% scheduled system availability. System
availability is defined in the glossary.
1.1.85 The Offeror shall describe the resilience, reliability, and availability
mechanisms and capabilities of the system including, but not limited to:
Any condition or event that can potentially cause a job interrupt
Resiliency features to achieve the availability targets
Single points of failure (hardware or software), and the potential effect on
running applications and system availability
How a job maintains its resource allocation and is able to relaunch an
application after an interrupt
A system-level mechanism to collect failure data for each kind of
component
3.7 Application Transition Support and Early Access to APEX
Technologies
The Crossroads and NERSC-9 systems will include numerous pre-exascale
technologies. The Offeror shall include in its proposal a plan to effectively
utilize these technologies and assist in transitioning the mission workflows
to the systems. For the Crossroads system only, the Successful Offeror shall
support efforts to transition the Advanced Technology Development and
Mitigation (ATDM) codes to the systems. ATDM codes are currently being
developed by the three NNSA weapons laboratories, Sandia, Los Alamos, and
Lawrence Livermore. These codes may require compliance with export
control laws and no-cost licensing agreements. Information about the ATDM
program can be found on the NNSA website.
1.1.86 The Successful Offeror should provide a vehicle for supporting the successful
demonstration of the application performance requirements and the
transition of key applications to the Crossroads and NERSC-9 systems (e.g., a
Center of Excellence). Support should be provided by the Offeror and all of
its key advanced technology providers (e.g., processor vendors, integrators,
etc.). The Successful Offeror should provide experts in the areas of
application porting and performance optimization in the form of staff
training, general user training, and deep-dive interactions with a set of
application code teams. Support should include compilers to enable timely
bug fixes as well as to enable new functionality. Support should be provided
from the date of subcontract execution through two (2) years after final
acceptance of the systems.
1.1.87 The Offeror shall describe which of the proposed APEX hardware and
software technologies (physical hardware, emulators, and/or simulators)
will be available for access before system delivery and in what timeframe.
The proposed technologies should provide value in advanced preparation for
the delivery of the final APEX system(s) for pre-system-delivery application
porting and performance assessment activities.
3.8 Target System Configuration
APEX determined the following targets for Crossroads and NERSC-9 system
configurations. Offerors shall state projections for their proposed system
configurations relative to these targets.
Table 2 Target System Configuration
Baseline Memory Capacity (excludes all levels of on-die CPU cache)
Benchmark SSI increase over Edison system
Platform Storage: > 30X Baseline Memory (Crossroads); > 30X Baseline Memory (NERSC-9)
Calculated for a single job running in the entire system.

3.9 System Operations
System management should be an integral feature of the overall system and
should provide the ability to effectively manage system resources with high
utilization and throughput under a workload with a wide range of
concurrencies. The Successful Offeror should provide system administrators,
security officers, and user-support personnel with productive and efficient
system configuration management capabilities and an enhanced diagnostic
environment.
1.1.88 The system should include scalable integrated system management
capabilities that provide human interfaces and APIs for system configuration
and its ability to be automated, software management, change management,
local site integration, and system configuration backup and recovery
1.1.89 The system should include a means for tracking and analyzing all software
updates, software and hardware failures, and hardware replacements over
the lifetime of the system
1.1.90 The system should include the ability to perform rolling upgrades and
rollbacks on a subset of the system while the balance of the system remains
in production operation. The Offeror shall describe the mechanisms,
capabilities, and limitations of rolling upgrades and rollbacks. No more than
half the system partition should be required to be down for rolling upgrades
and rollbacks.
1.1.91 The system should include an efficient mechanism for reconfiguring and
rebooting compute nodes. The Offeror shall describe in detail the compute
node reboot mechanism, differentiating types of boots (warmboot vs.
coldboot) required for different node features, as well as how the time
required to reboot scales with the number of nodes being rebooted.
1.1.92 The system should include a mechanism whereby all monitoring data and
logs captured are available to the system owner, and will support an open
monitoring API to facilitate lossless, scalable sampling and data collection for
monitored data. Any filtering that may need to occur will be at the option of
the system manager. The system will include a sampling and connection
framework that allows the system manager to configure independent
alternative parallel data streams to be directed off the system to
site-configurable consumers.
1.1.93 The system should include a mechanism to collect and provide metrics and
logs which monitor the status, health, and performance of the system,
including, but not limited to:
Environmental measurement capabilities for all systems and peripherals
and their sub-systems and supporting infrastructure, including power
and energy consumption and control
Internal HSN performance counters, including measures of network
congestion and network resource consumption
All levels of integrated and attached platform storage
The system as a whole, including hardware performance counters for
metrics for all levels of integrated and attached platform storage
1.1.94 The Offeror shall describe what tools it will provide for the collection,
analysis, integration, and visualization of metrics and logs produced by the
system (e.g., peripherals, integrated and attached platform storage, and
environmental data, including power and energy consumption)
1.1.95 The Offeror shall describe the system configuration management and
diagnostic capabilities of the system that address the following topics:
Detailed description of the system management support
Any effect or overhead of software management tool components on the
CPU or memory available on compute nodes
Release plan, with regression testing and validation for all system related
software and security updates
Support for multiple simultaneous or alternative system software
configurations, including estimated time and effort required to install
both a major and a minor system software update
User activity tracking, such as audit logging and process accounting
Unrestricted privileged access to all hardware components delivered
with the system
3.10 Power and Energy
Power, energy, and temperature will be critical factors in how the APEX
laboratories manage systems in this time frame and must be an integral part
of overall Systems Operations. The solution must be well integrated into
other intersecting areas (e.g., facilities, resource management, runtime
systems, and applications). The APEX laboratories expect a growing number
of use cases in this area that will require a vertically integrated solution.
1.1.96 The Offeror shall describe all power, energy, and temperature measurement
capabilities (system, rack/cabinet, board, node, component, and
sub-component level) for the system, including control and response times,
sampling frequency, accuracy of the data, and timestamps of the data for
individual points of measurement and control.
1.1.97 The Offeror shall describe all control capabilities it will provide to affect
power or energy use (system, rack/cabinet, board, node, component, and
sub-component level).
1.1.98 The system should include system-level interfaces that enable measurement
and dynamic control of power and energy relevant characteristics of the
system, including but not limited to:
- AC measurement capabilities at the system or rack level
- System-level minimum and maximum power settings (e.g., power caps)
- System-level power ramp up and down rate
- Scalable collection and retention of all measurement data, such as:
  - point-in-time power data
  - energy usage information
  - minimum and maximum power data
1.1.99 The system should include resource manager interfaces that enable
measurement and dynamic control of power and energy relevant
characteristics of the system, including but not limited to:
- Job and node level minimum and maximum power settings
- Job and node level power ramp up and down rate
- Job and node level processor and/or core frequency control
- System and job level profiling and forecasting, e.g., prediction of hourly
  power averages >24 hours in advance with a 1 MW tolerance (an
  illustrative check of this kind of constraint is sketched after this list)
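For illustration only, the fragment below shows the kind of check a resource manager or site tool might perform through such interfaces: averaging power samples over an hour and comparing the result against a negotiated power band (compare the Crossroads "Maximum Power Rate of Change" target in Table 3). The sample values and band limits are hypothetical, and no vendor measurement interface is assumed.

/*
 * Illustrative only: check an hourly average of system power samples
 * against a negotiated power band. The samples and band limits below
 * are hypothetical; no vendor measurement interface is assumed.
 */
#include <stdio.h>
#include <stddef.h>

/* Returns 1 if the mean of n power samples (in MW) lies within
 * [band_lo_mw, band_hi_mw], and 0 otherwise. */
static int hourly_average_within_band(const double *samples_mw, size_t n,
                                       double band_lo_mw, double band_hi_mw)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += samples_mw[i];
    double avg = (n > 0) ? sum / (double)n : 0.0;
    return avg >= band_lo_mw && avg <= band_hi_mw;
}

int main(void)
{
    /* In practice this would be, e.g., 3600 one-second samples for the hour. */
    double samples_mw[] = { 11.8, 12.4, 12.1, 12.9, 12.2 };
    int ok = hourly_average_within_band(samples_mw, 5, 11.0, 13.0); /* 2 MW wide band */
    printf("hourly average within negotiated band: %s\n", ok ? "yes" : "no");
    return 0;
}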
1.1.100 The system should include application and runtime system interfaces
that enable measurement and dynamic control of power and energy relevant
characteristics of the system, including but not limited to:
- Node level minimum and maximum power settings
- Node level processor and/or core frequency control
- Node level application hints, such as the application entering a serial,
  parallel, computationally intense, I/O intense, or communication intense
  phase
1.1.101 The system should include an integrated API for all levels of
measurement and control of power relevant characteristics of the system. It
is preferable that the provided API complies with the High Performance
Computing Power Application Programming Interface Specification
(http://powerapi.sandia.gov).
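As a minimal sketch of how such an integrated API could be exercised from user or system code, the fragment below reads instantaneous power and requests a node-level power cap through the Power API C interface. The function and attribute names follow version 1.x of the specification published at powerapi.sandia.gov; they should be verified against the implementation the Offeror actually provides, and the 300 W cap is purely illustrative.

/*
 * Minimal sketch against the HPC Power API C interface (powerapi.sandia.gov).
 * Function and attribute names follow version 1.x of the specification and
 * should be verified against the Offeror's implementation; the 300 W cap is
 * illustrative only.
 */
#include <stdio.h>
#include <pwr.h>   /* header name as used by the reference implementation */

int main(void)
{
    PWR_Cntxt cntxt;
    PWR_Obj   self;
    PWR_Time  ts;
    double    watts;
    double    cap = 300.0;   /* illustrative node-level cap, in watts */

    if (PWR_CntxtInit(PWR_CNTXT_DEFAULT, PWR_ROLE_APP, "apex-example", &cntxt) != PWR_RET_SUCCESS)
        return 1;
    PWR_CntxtGetEntryPoint(cntxt, &self);

    /* Read instantaneous power at this object (e.g., the local node). */
    if (PWR_ObjAttrGetValue(self, PWR_ATTR_POWER, &watts, &ts) == PWR_RET_SUCCESS)
        printf("current power: %.1f W\n", watts);

    /* Request a node-level power cap, if the platform exposes this control. */
    PWR_ObjAttrSetValue(self, PWR_ATTR_POWER_LIMIT_MAX, &cap);

    PWR_CntxtDestroy(cntxt);
    return 0;
}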
1.1.102 The Offeror shall project (and report) the Wall Plate, Peak, Nominal,
and Idle Power of the system.
1.1.103 The Offeror shall describe any controls available to enforce or limit
power usage below wall plate power and the reaction time of this mechanism
(e.g., for what duration and by what magnitude power usage can exceed the
imposed limits).
1.1.104 The Offeror shall describe the status of the system when in an Idle
State (describe all Idle States if multiple are available) and the time to
transition from the Idle State (or each Idle State if there are multiple) to the
start of job execution.
3.11 Facilities and Site Integration
1.1.105 The system should use 3-phase 480V AC. Other system infrastructure
components (e.g., disks, switches, login nodes, and mechanical subsystems
such as CDUs) must use either 3-phase 480V AC (strongly preferred),
3-phase 208V AC (second choice), or single-phase 120/240V AC (third choice).
The total number of individual branch circuits and the phase load imbalance
should be minimized.
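For illustration only (using one common measure of imbalance, with hypothetical numbers): if the three phases of a feed were to draw 100 A, 90 A, and 110 A, the average is 100 A and the imbalance is (110 - 100) / 100 = 10%; distributing branch circuits evenly across the phases keeps this figure small.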
1.1.106 All equipment and power control hardware of the system should be
Nationally Recognized Testing Laboratories (NRTL) certified and bear
appropriate NRTL labels.
1.1.107 Every rack, network switch, interconnect switch, node, and disk
enclosure should be clearly labeled with a unique identifier visible from the
front of the rack and/or the rear of the rack, as appropriate, when the rack
door is open. These labels will be of high quality so that they do not fall off,
fade, disintegrate, or otherwise become unusable or unreadable during the
lifetime of the system. Nodes will be labeled from the rear with a unique
serial number for inventory tracking. It is desirable that motherboards also
have a unique serial number for inventory tracking. Serial numbers shall be
visible without having to disassemble the node, or they must be able to be
queried from the system management console.
1.1.108 Table 3 below shows target facility requirements identified by APEX
for the Crossroads and NERSC-9 systems. The Offeror shall describe the
features of its proposed systems relative to site integration at the respective
facilities, including:
- Description of the physical packaging of the system, including
  dimensioned drawings of individual cabinet types and the floor layout of
  the entire system
- Remote environmental monitoring capabilities of the system and how it
  would integrate into facility monitoring
- Emergency shutdown capabilities
- Detailed descriptions of power and cooling distributions throughout the
  system, including power consumption for all subsystems
- Description of parasitic power losses within the Offeror's equipment, such
  as fans, power supply conversion losses, power-factor effects, etc. For the
  computational and platform storage subsystems separately, give an
  estimate of the total power and parasitic power losses (whose difference
  should be the power used by computational or platform storage
  components) at the minimum and maximum ITUE, which is defined as the
  ratio of total equipment power over power used by computational or
  platform storage components. Describe the conditions (e.g., "idle") at
  which the extrema occur. (A worked example of the ITUE calculation
  follows this list.)
- OS distributions or other client requirements to support off-system
  access to the platform storage (e.g., LANL File Transfer Agents)
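For illustration only (the numbers are hypothetical, not targets): if a computational cabinet were to draw 90 kW in total, of which 81 kW reaches the computational components, then ITUE = 90 kW / 81 kW = 1.11 (approximately); the remaining 9 kW, about 10% of the total, would be parasitic loss from fans, power-supply conversion, and similar overheads.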
Table 3 Crossroads and NERSC-9 Facility Requirements
Location
  Crossroads: Los Alamos National Laboratory, Los Alamos, NM. The system will be housed in the Strategic Computing Complex (SCC), Building 2327.
  NERSC-9: National Energy Research Scientific Computing Center, Lawrence Berkeley National Laboratory, Berkeley, CA. The system will be housed in Wang Hall, Building 59 (formerly known as the Computational Theory and Research Facility).

Seismic requirements
  Crossroads: …
  NERSC-9: … seismic isolation floor. System cabinets should have an attachment mechanism that will enable them to be firmly attached to each other and the isolation floor. When secured via these attachments, the cabinets should withstand seismic design accelerations per the California Building Code and LBNL Lateral …

Cooling water
  Crossroads: The system must operate in conformance with ASHRAE Class W2 guidelines (dated 2011). The facility will provide operating water temperature that nominally varies between 60-75°F, at up to 35 PSI differential pressure at the system cabinets. However, the Offeror should note if the system is capable of operating at higher temperatures. Note: the LANL facility will provide inlet water at a nominal 75°F. It may go as low as 60°F based on facility and/or environmental factors. Total flow requirements may not exceed 9600 GPM.
  NERSC-9: Same. Note: the NERSC facility will provide inlet water at a nominal 65°F. It may go as high as 75°F based on facility and/or environmental factors. Total flow requirements may not exceed 9600 GPM.

Water chemistry
  Crossroads: The system must operate with facility water meeting basic ASHRAE water chemistry. Special chemistry water is not available in the main building loop and would require a separate tertiary loop provided with the system. If tertiary loops are included in the system, the Offeror shall describe their operation and maintenance, including coolant chemistry, pressures, and flow controls. All coolant loops within the system should have reliable leak detection, temperature, and flow alarms, with automatic protection and notification mechanisms.
  NERSC-9: Same.

Air cooling
  Crossroads: The system must operate with supply air at 76°F or below, with a relative humidity from 30%-70%. The rate of airflow is between 800-1500 CFM per floor tile. No more than 3 MW of heat should be removed by air cooling.
  NERSC-9: The system must operate with supply air at 76°F or below, with a relative humidity from 30%-80%. The current facility can support up to 60K CFM of airflow and remove 500 KW of heat. Expansion is possible to 300K CFM and 1.5 MW, but at added expense.

Maximum Power Rate of Change
  Crossroads: The hourly average in system power should not exceed the 2 MW wide power band negotiated at least 2 hours in advance.
  NERSC-9: N/A

Power quality
  Crossroads: The system must be resilient to incoming power fluctuations at least to the level guaranteed by the ITIC power quality curve.
  NERSC-9: Same.

Ceiling height
  Crossroads: … 6" ceiling plenum.
  NERSC-9: 17'10" ceiling; however, maximum cabinet height is 9'5".

Maximum Footprint
  Crossroads: 8000 square feet; 80 feet long and 100 feet deep.
  NERSC-9: 64' x 92', or 5888 square feet (inclusive of compute, platform storage, and service aisles). This area is itself surrounded by a minimum 4' aisle that can be used in the system layout. It is preferred that cabinet rows run parallel to the short dimension.

Shipment Dimensions and Weight
  Crossroads: No restrictions.
  NERSC-9: For delivery, system components should weigh less than 7000 pounds and should fit into an elevator whose door is 6 ft 6 in wide and 9 ft 0 in high and whose depth is 8 ft 3 in. Clear internal width is 8 ft 4 in.

Floor loading
  Crossroads: The floor loading over the effective area should be no more than 300 pounds per square foot. The effective area is the actual loading area plus at most a foot of surrounding fully unloaded area. A maximum limit of 300 pounds per square foot also applies to all loads during installation. The Offeror shall describe how the weight will be distributed over the footprint of the rack (point loads, line loads, or evenly distributed over the entire footprint). A point load applied on a one square inch area should not exceed 1500 pounds. A dynamic load using a CISCA Wheel 1 size should not exceed 1250 pounds (CISCA Wheel 2: 1000 pounds).
  NERSC-9: The floor loading should not exceed a uniform load of 500 pounds per square foot. Raised floor tiles are ASM FS400 with an isolated point load of 2000 pounds and a rolling load of 1200 pounds.

Cabling and water connections
  Crossroads: Water connections should be below the access floor. It is preferable that all other cabling (e.g., system interconnect) is above floor and integrated into the system cabinetry. Under-floor cables (if unavoidable) should be plenum rated and comply with NEC 300.22 and NEC 645.5. All communications cables, wherever installed, should be source/destination labeled at both ends. All communications cables and fibers over 10 meters in length and installed under the floor should also have a unique serial number and dB loss data document (or equivalent) delivered at time of installation for each cable, if a method of measurement exists for the cable type.
  NERSC-9: Same.

External network connectivity
  Crossroads: 1Gb, 10Gb, 40Gb, 100Gb
  NERSC-9: Same.

External bandwidth on/off the system for general TCP/IP connectivity
  Crossroads: > 100 GB/s per direction
  NERSC-9: Same.

External bandwidth on/off the system for accessing the system's PFS
  Crossroads: …
  NERSC-9: …

External bandwidth on/off the system for accessing external, site supplied file systems (e.g., GPFS, NFS)
  Crossroads: …
  NERSC-9: …
4 Non-Recurring Engineering
The APEX team expects to award two (2) Non-Recurring Engineering (NRE)
subcontracts, separate from the two (2) system subcontracts. It is expected
that Crossroads and NERSC personnel will collaborate in both NRE
subcontracts. It is anticipated that the NRE subcontracts will be
approximately 10%-15% of the combined Crossroads and NERSC-9 system
budgets. The Offeror is encouraged to provide proposals for areas of
collaboration they feel provide substantial value to the Crossroads and
NERSC-9 systems with the goals of:
- Increasing application performance
- Increasing workflow performance
- Increasing the resilience and reliability of the system
Proposed collaboration areas should focus on topics that provide added
value beyond planned roadmap activities. Proposals should not focus on
one-off point solutions or on gaps created by their proposed design that should
otherwise be provided as part of a vertically integrated solution. It is expected
that NRE collaborations will have impact on both the Crossroads and
NERSC-9 systems and follow-on systems procured by the U.S. Department of
Energy's NNSA and Office of Science.
NRE topics of interest include, but are not limited to, the following:
- Development and optimization of hardware and software capabilities to
  increase the performance of MPI+OpenMP and future task-based
  asynchronous programming models
- Development and optimization of hardware and software capabilities to
  increase the performance of application workflows, including
  consideration of consistency requirements, data-migration needs, and
  system-wide resource management
- Development of scalable system management capabilities to enhance the
  reliability, resilience, power, and energy usage of Crossroads/NERSC-9
The APEX team expects to have future requirements for system upgrades
and/or additional quantities of components based on the configurations
proposed in response to this solicitation. The Offeror should address any
technical challenges foreseen with respect to scaling and any other
production issues. Proposals should be as detailed as possible.
5.1 Upgrades, Expansions and Additions
1.1.109 The Offeror shall propose and separately price upgrades, expansions
or procurement of additional system configurations by the following
fractions of the system as measured by the Sustained System Improvement
1.1.110 The Offeror shall propose a configuration or configurations which
double the baseline memory capacity.
1.1.111 The Offeror shall propose upgrades, expansions or procurement of
additional platform storage capacity (per tier if multiple tiers are present) in
increments of 25%.
5.2 Early Access Development System
To allow for early and/or accelerated development of applications or development
of functionality required as a part of the statement of work, the Offeror shall
propose options for early access development systems. These systems can be in
support of the baseline requirements or any proposed options.
1.1.112 The Offeror shall propose an Early Access Development System. The
primary purpose is to expose the application to the same programming
environment as will be found on the final system. It is acceptable for the early
access system to not use the final processor, node, or high-speed
interconnect architectures. However, the programming and runtime
environment must be sufficiently similar that a port to the final system is
trivial. The early access system shall contain functionality similar to that of
the final system, including file systems, but scaled down to the appropriate
configuration. The Offeror shall propose an option for each of the following
configurations, based on the size of the final Crossroads/NERSC-9 systems:
- 2% of the compute partition
- 5% of the compute partition
- 10% of the compute partition
1.1.113 The Offeror shall propose development test bed systems that will
reduce risk and aid the development of any advanced functionality that is
exercised as a part of the statement of work (for example, any topics
proposed for NRE).
5.3 Test Systems
The Offeror shall propose the following test systems. The systems shall contain all
the functionality of the main system, including file systems, but scaled down to the
appropriate configuration. Multiple test systems may be awarded.
1.1.114 The Offeror shall propose an Application Regression test system,
which should contain at least 200 compute nodes.
1.1.115 The Offeror shall propose a System Development test system, which
should contain at least 50 compute nodes.
5.4 On Site System and Application Software Analysts
1.1.116 The Offeror shall propose and separately price two (2) System
Software Analysts and two (2) Applications Software Analysts for each site.
Offerors shall presume each analyst will be utilized for four (4) years. For
Crossroads, these positions require a DOE Q-clearance for access.
5.5 Deinstallation
The Offeror shall propose to deinstall, remove and/or recycle the system and
supporting infrastructure at end of life. Storage media shall be wiped or
destroyed to the satisfaction of ACES and NERSC, and/or returned to ACES
and NERSC at their request.