Just-In-Time Workload Management: Scalable Resource Sharing on the Open Science Grid



For the period July 1, 2006 – June 30, 2009

Lead Principal Investigator

San Diego, CA
Ph: 858 774 7035
fkw@fnal.gov

DOE/Office of Science Program Office: 

Office of Advanced Scientific Computing Research

DOE/Office of Science Program Office Technical Contact:

Mary Anne Scott, (301) 903-6368, scott@er.doe.gov


Table of Contents


The Large Hadron Collider (LHC) experiments are collaborating with computer scientists to develop grid computing as the enabler for data intensive, globally distributed computing at the unprecedented scales required by LHC science. The first LHC physics run in 2008 is expected to provide sensitivity to exciting new physics within weeks. The 2008 run will deliver 10PB each to ATLAS and CMS, requiring 100MSpecInt2k at 50 centers worldwide to support physics analysis by communities of order 1000 physicists. LHC's collaboration with the grid computing community is taking place within projects that also involve other applications sciences. US projects like the Trillium projects, Grid3 and recently Open Science Grid (OSG), collaborating with others overseas, have delivered a grid infrastructure and applications that are now in production in many experiments, layered over a common baseline grid infrastructure that is now being hardened as the foundation for systems scalable to LHC requirements. We propose a three year partnership between ATLAS and CMS collaborators and computer scientists from the Condor project to build on this foundation to develop and deploy a workload management system capable of meeting key science-driven requirements in opportunistic resource utilization, diverse usage modes from production to analysis, managing dynamic workloads, and automation, which are relevant both for the LHC and the broader science community of the OSG. The partners involved bring demonstrated expertise in developing and successfully deploying systems and supporting middleware following the highly scalable “just in time” workflow design the project will employ.


A Project Narrative

In the following sections we describe the proposed project technically and organizationally. Section A.1 provides some background on the HEP computing challenge at the Large Hadron Collider (LHC) [1], the approach the LHC experiments have taken to addressing this challenge through collaboration with computer science on data-intensive grid computing, the status of this program, and the motivation and significance of this proposal in this context. Section A.2 describes work done to date that motivates and informs the objectives and workplan of this proposal, and establishes the partners involved as highly qualified to carry the program to success. Section A.3 describes the specific program of work: our architectural and technical approach to workflow management and the advantages thereof; the form our application/computer science partnership will take; how we will organize ourselves and apply manpower; and milestones/deliverables. Finally, Section A.4 gives the specifics of the relationship between this project and the Open Science Grid Consortium [2].

A.1 Background and Significance

The Large Hadron Collider (LHC) experiments are collaborating with computer scientists to develop grid computing as the enabler for data intensive, globally distributed computing at the unprecedented scales required by the LHC science program. The LHC computing challenge is immense and has no margin for error. The luminosity expected in the first LHC physics run in 2008 is sufficient to provide sensitivity to new physics such as supersymmetry within a few weeks. The computing systems must be ready, and at scale: the 2008 run will deliver about 10PB each to ATLAS and CMS (the volume for a nominal LHC year), requiring about 100MSpecInt2k [3] at about 50 centers worldwide to analyze in order to understand detector performance and extract the physics. There will be 500-1000 physicists per experiment actively engaged worldwide in data analysis and detector performance studies; the computing infrastructure must support this individual usage equally well with managed production. The priorities and workloads will change frequently and urgently and must be accommodated quickly. Computing will be resource-limited, demanding a system capable of opportunistically exploiting a diverse and dynamic array of computing sites and resources. Operations manpower will be very limited, so the system must be highly automated and robust against instabilities and failures.

LHC's collaboration with the grid computing community to meet these challenges is well-established, and is taking place within projects that involve other applications sciences as well. While the LHC's requirements differ in scale from other domains, in most respects they do not differ in kind, and these projects have established tools and systems used in common across domains. In the US, these projects include the Trillium projects (PPDG [4], GriPhyN [5], iVDGL [6]), Grid3 [7] and most recently the Open Science Grid. They have delivered a grid computing infrastructure and grid-capable applications that are now in production in many experiments.

Both in the US and in Europe the efforts to date have established a common baseline grid computing infrastructure. This is now being consolidated and hardened to serve as the foundation for systems able to scale to LHC requirements. The work has also established an invaluable experience base to inform the remaining effort to deliver systems meeting all the requirements of LHC computing.

We propose to develop improvements in workload management that are necessary to meet the LHC requirements described above. The targeted requirements (opportunistic resource utilization, good support for diverse usage modes from managed production to individual analysis, flexible and fast management of dynamic workloads, extensive automation to simplify operations, and needed scalability) are relevant to all or most grid users, not just the LHC. For this reason we anticipate that this work will be widely beneficial within the OSG consortium.


Experience with existing systems (both our own and others) has demonstrated the advantages of a "late binding" or "just-in-time" workload management system. In conventional workload management, the 'payload' of a job dispatched to the grid (the processing task to be performed) is sent as an intrinsic part of the job at time of submission to the grid. In a late binding scheme, the submitted job is merely a container for a payload to be acquired later, once the job successfully launches on a worker node. When the container ('pilot') job launches on a worker node, it contacts a queue manager and requests a task. In this way work is pulled to worker nodes that are acquired by a resource harvesting system that is largely decoupled from the application's workload management.
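To make the pull model concrete, the following is a minimal sketch of a pilot's main loop, assuming a hypothetical HTTP dispatcher endpoint and job-description fields; the actual Panda protocol and field names differ.

```python
import json
import subprocess
import urllib.request

DISPATCHER = "https://dispatcher.example.org/getJob"  # hypothetical endpoint

def report_resources():
    # A real pilot would probe the worker node (CPU, memory, disk, installed
    # releases) so the dispatcher can match an appropriate task to it.
    return {"site": "EXAMPLE_SITE", "mem_mb": 2048, "disk_gb": 20}

def main():
    # Ask the VO-managed queue for a task matched to this node's capabilities.
    req = urllib.request.Request(
        DISPATCHER, data=json.dumps(report_resources()).encode())
    with urllib.request.urlopen(req) as resp:
        task = json.loads(resp.read())

    if not task:            # nothing appropriate for this site right now
        return              # the pilot exits immediately, freeing the slot

    # Late binding: the payload is bound to this worker node only now.
    subprocess.run(task["command"], shell=True, check=False)

if __name__ == "__main__":
    main()
```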

This scheme offers a number of benefits:

• It enables opportunistic acquisition of resources on any site or grid to which pilot jobs can be delivered, with the VO then able to deliver work appropriate to the capabilities of the resource (as reported by the pilot).

• It allows the VO to flexibly and quickly adjust workflows to changing requests and priorities, since all tasks are held in a VO-managed queue until the moment of their release (to a pilot) for processing.

• It provides robustness against site and worker node failures; sites and worker nodes that do not successfully launch a communicating pilot will be ignored by workload management.

• It is capable of very short job launch latencies, not bounded by the latency of submitting a grid job, because the workload management sees a steady stream of communicating pilots to which it can dispatch tasks immediately as they arrive. This is particularly valuable for supporting distributed interactive analysis.

• It maximizes uniformity across heterogeneous grids and scheduling systems, as seen by the VO's workload management; the heterogeneity is primarily isolated in the harvesting system, with the queue management and pilot interaction systems common across all environments.

• It allows sophisticated, dynamic, VO-managed brokerage to take place in the decision process by which the workload manager selects a task to deliver to a requesting pilot. For example, data placement constraints can easily be applied; the brokerage may require that input data be pre-positioned at a site to avoid data transfer latencies and failure modes. VO policies such as user priorities and quotas can also be imposed in a homogeneous way across heterogeneous resources.

These benefits are not expectations in the abstract for such a system; they are seen in deployed systems, as will be described in the next section. The deployed systems to date are however experiment specific; this proposal seeks to provide a broadly usable system deployed to and supported on the OSG.

A.2 Preliminary Studies and Project Drivers

The late binding approach has been used in various forms by LHCb [8], ALICE [9], and CDF [10] in the DIRAC [11], AliEn, and GlideCAF systems respectively, with very good success. This approach is also closely related to the way Condor employs matchmaking to bind resources and consumers. This proposal leverages a recent entry among the late binding systems, the Panda system developed by US ATLAS [12], together with Condor and GlideCAF from CDF/CMS. In the following subsections we describe these systems as they exist today and as they relate to the workload management objectives of this proposal.


Dependable and effective access to large amounts of sustained computing power, often referred to as high-throughput computing, is critical to today's scientists and engineers. The Condor Project at the University of Wisconsin-Madison (UW-Madison) Department of Computer Sciences has been engaged in research, software development and software deployment in this area since 1985 [13] and consists of a team of about 35 staff and students. The Condor High Throughput Computing System software ("Condor") is an established distributed workload management system for large scale and compute intensive applications, with facilities for resource monitoring and management, job scheduling, priority schemes, and workflow supervision. Condor provides easy access to large amounts of dependable and reliable computational power over prolonged periods of time by effectively harnessing all available resources, including both dedicated compute clusters and non-dedicated machines under the control of interactive users or autonomous batch systems. Condor's unique architecture and mechanisms enable it to perform particularly well in opportunistic environments [14]. Opportunistic mechanisms such as process checkpoint/migration [15] and redirected I/O allow Condor to effectively harness non-dedicated desktop workstations as well as dedicated compute clusters, and Condor's special attention to fault-tolerance enables it to provide large amounts of computational throughput over prolonged periods of time or in a grid environment that crosses different administrative domains.

Originally, the Condor job submission agent could launch jobs only upon Condor-managed resources. In 2001, Condor-G [16] was developed as an enhanced submission agent that can launch and supervise jobs upon resources controlled by a growing list of workload management and grid middleware systems, permitting computing environments that cross administrative boundaries – a primary requirement for grid computing. Condor-G (which stands for Condor to Grid) was originally developed to submit jobs to Globus Toolkit (GT) 2.x middleware via the GRAM protocol [17], but has since been extended to support submission to GT4, NorduGrid, PBS, and others. Used as a front-end to a computational grid, Condor-G can manage thousands of jobs destined to run at distributed sites. Condor-G provides job monitoring, logging, notification, policy enforcement, fault tolerance, credential management, and it can handle complex job interdependencies. Of course, Condor can also launch jobs upon remote Condor pools; this is sometimes referred to as Condor-C (which stands for Condor to Condor), thereby allowing multiple Condor sites to work together.

It is not uncommon for both Condor-G and Condor to be utilized within a computational grid deployment. For example, Condor-G can be used as the reliable submission and job management service for one or more sites, the Condor High Throughput Computing system can be used as the fabric management service (a grid "generator") for one or more sites, and finally Globus Toolkit services can be used as the bridge between them.

One disadvantage of common grid job submission protocols today such as GRAM is that they usually result in either over-subscription, by submitting jobs to multiple queues at once, or under-subscription, by submitting jobs to potentially long queues. This problem can be solved with a technique called Condor GlideIn. To take advantage of both the powerful reach of general-purpose protocols such as GRAM and the full Condor machinery, a personal Condor pool may be carved out of remote resources [18]. This requires three steps. In the first step, Condor-G is used to submit the standard Condor servers as jobs to remote batch systems. From the remote system's perspective, the Condor servers are ordinary jobs with no special privileges. In the second step, the servers begin executing and contact a personal Condor matchmaker started by the user. In step three, the user may submit normal jobs to the Condor agent, which are then "just-in-time" matched to and executed on the remote resources.
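For illustration, step one of such a glidein can be driven by a short script that submits the Condor daemons through Condor-G's grid universe; the gatekeeper address and startup wrapper below are placeholders, and only standard submit-description commands are used.

```python
import subprocess
import textwrap

# Step 1: submit the Condor servers as ordinary grid jobs via Condor-G.
# "gatekeeper.example.edu" and "glidein_startup.sh" are placeholders; the
# wrapper would unpack and start the Condor daemons configured to report
# to the user's personal matchmaker.
submit_description = textwrap.dedent("""\
    universe      = grid
    grid_resource = gt2 gatekeeper.example.edu/jobmanager-pbs
    executable    = glidein_startup.sh
    output        = glidein.out
    error         = glidein.err
    log           = glidein.log
    queue 10
""")

with open("glidein.submit", "w") as f:
    f.write(submit_description)

subprocess.run(["condor_submit", "glidein.submit"], check=True)

# Steps 2-3: once the glideins start and join the personal pool, ordinary jobs
# submitted to that pool are matched "just in time" to the remote slots.
```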

The term "Condor" has become an umbrella term that refers to the services collectively offered by Condor, Condor-G, and Condor-C.


A.2.b The Panda Workload Manager

US ATLAS launched development of the Panda (Production ANd Distributed Analysis) system in August 2005 with an architectural approach driven by the need for major improvements in system throughput, scalability, operations manpower requirements, and efficient integrated data/processing management relative to the previous generation production system, in order to meet LHC requirements. Key design elements are:

• Support for a full range of usages from managed production to group and user level production to individual interactive distributed analysis

• Just-in-time workload delivery to pilots on processing nodes, as described above

• System-wide task queue holds all jobs until pilot dispatch, allowing very flexible and dynamic brokerage coupled to data distribution policy and real-time resource availability

• Data-driven system design, with data management playing a central role and data pre-placement a prerequisite to workload dispatch to a site

• Tight integration with the ATLAS distributed data management system Don Quixote 2 (DQ2) [19], using DQ2's model of datasets (file collections) and subscriptions to them as the basis of data management and movement

• Major attention to monitoring and automation to keep the operations workload low and problem diagnostics rapid

• Pilot job delivery subsystem able to employ multiple job scheduler implementations transparently to the rest of the system (currently Condor-G and PBS)

• Minimal requirements to include a site: pilot delivery (via grid or local queue), outbound HTTP, and data management support

• Lightweight, highly scalable communication protocols based on REST-style [20] HTTP communication

The Panda architecture is shown in Figure 1. Within the ATLAS production system Panda functions as a regional (OSG) 'executor' interacting with an ATLAS production system 'supervisor', Eowyn, to receive and report production work. Panda's support for a multiplicity of workload sources and types is reflected in a number of 'regional usage interfaces' in addition to the ATLAS interface (all supported by a common Panda client interface) to submit OSG regional production, user jobs, and distributed analysis jobs. The Panda server receives work from these front ends into a job queue, upon which a brokerage module operates to prioritize and assign work on the basis of job type, priority, input data and their locality, and available CPU resources. Allocation of job blocks to sites is followed by the dispatch of input data to those sites, handled by a data service interacting with the distributed data management system; jobs are not released for processing until the data arrives. When data dispatch completes, jobs are made available to a job dispatcher. An independent subsystem manages the scheduling of pilot jobs to deliver them to worker nodes via a range of scheduling systems. A pilot, upon launching on a worker node, contacts the dispatcher and receives an available job appropriate to the site. (If no appropriate job is available, the pilot immediately exits.) An important attribute of this scheme for interactive analysis, where minimal latency from job submission to launch is important, is that the pilot dispatch mechanism bypasses any latencies in the scheduling system for submitting and launching the pilot itself.
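As an illustration of the brokerage step, the sketch below selects a job for a site using data locality and priority; the queue layout and field names are hypothetical simplifications of the brokerage module described above.

```python
from dataclasses import dataclass

@dataclass
class QueuedJob:
    user: str
    priority: int            # VO-assigned priority; higher runs first
    input_dataset: str

def broker(queue, site, datasets_at_site, user_within_quota):
    """Pick the highest-priority job whose input data is already at the site
    and whose owner is within quota; return None if nothing is appropriate."""
    candidates = [
        j for j in queue
        if j.input_dataset in datasets_at_site and user_within_quota(j.user)
    ]
    return max(candidates, key=lambda j: j.priority) if candidates else None
```

Because this decision runs inside the VO-managed queue, policies of this kind can be changed at any time without touching site-level schedulers.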


Figure 1: Panda architecture

In Figure 2 the current implementation of Panda is shown, in which all components of the architecture are realized. Implemented front ends are the ATLAS production system, a command line equivalent of the same, and two distributed analysis systems: pathena (an interface to the ATLAS offline software framework, Athena) and the DIAL distributed analysis system. The Panda server containing the central components of the system is implemented in Python (as are all components of Panda, and the DDM system DQ2) and runs under Apache as a web service (in the REST sense; communication is based on HTTP GET/POST with the messaging contained in the URL and optionally a message payload of various formats). MySQL databases implement the job queue and all metadata and monitoring repositories. Condor-G and PBS are the implemented schedulers in the pilot scheduling (resource harvesting) subsystem. A monitoring server works with the MySQL DBs, including a logging DB populated by system components recording incidents via a simple web service behind the standard Python logging module, to provide web browser based monitoring and browsing of the system and its jobs and data.
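As a sketch of such a REST-style service, the operation can be named in the URL and the message carried as GET/POST data; the endpoint and parameter names below are illustrative, not the Panda server's actual interface, and a stand-alone WSGI server substitutes here for Apache.

```python
from urllib.parse import parse_qs
from wsgiref.simple_server import make_server

def application(environ, start_response):
    op = environ["PATH_INFO"].strip("/")              # e.g. "getJob"
    size = int(environ.get("CONTENT_LENGTH") or 0)
    params = parse_qs(environ["wsgi.input"].read(size).decode())

    if op == "getJob":
        site = params.get("siteName", ["unknown"])[0]
        body = f"job spec for site {site}"            # a real server would query MySQL here
    else:
        body = "unknown operation"

    start_response("200 OK", [("Content-Type", "text/plain")])
    return [body.encode()]

if __name__ == "__main__":
    make_server("", 8080, application).serve_forever()
```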


Figure 2: Current Panda implementation

Figure 3 shows Panda's DQ2-based automated data handling. All data handling is at the dataset level (file collections, with a 'data block' being an immutable dataset). Sites are subscribed to datasets to trigger automated dataflow, and distributed (HTTP URL) callbacks provide notification of transfer completion and are used to trigger job release on data arrival and other chained operations. This automated dataflow, together with enforced data pre-placement as a precondition to job dispatch, has been key to minimizing operational manpower and maximizing robustness against transfer failures and SE problems.

Figure 3: Panda dataflow
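The subscription-and-callback pattern can be sketched as follows; the dataset names, site names, and job states are hypothetical stand-ins for the DQ2/Panda interfaces, in which the callback arrives as an HTTP request from the DDM system.

```python
# Hypothetical in-memory model of callback-driven job release.
jobs = [
    {"id": 1, "site": "SITE_A", "input": "dataset.X", "state": "waiting"},
    {"id": 2, "site": "SITE_A", "input": "dataset.Y", "state": "waiting"},
]

def transfer_complete_callback(dataset, site):
    """Invoked when 'dataset' has finished transferring to 'site' (an HTTP
    callback in the deployed system); releases jobs waiting on that data."""
    for job in jobs:
        if job["state"] == "waiting" and job["input"] == dataset and job["site"] == site:
            job["state"] = "activated"   # now eligible for dispatch to a pilot

transfer_complete_callback("dataset.X", "SITE_A")
print(jobs)   # job 1 activated, job 2 still waiting on dataset.Y
```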


Panda has evolved rapidly from a proof-of-concept prototype to a deployed system that took over US ATLAS production responsibilities in December 2005. Panda is now operating as an integral part of the ATLAS production system, managing US production at five Tier 1 and Tier 2 centers. It has processed 11,000 jobs/day peak to date, limited by available CPU resources; no Panda scaling limits have been seen, and the Panda design target is ~100k jobs/day. In ATLAS Computing System Commissioning production, Panda has processed 30% of the 113k ATLAS total through early February. In January it exceeded its target efficiency (job failure rate not arising from the workload itself) for this early phase of development, 90%, and by late February the failure rate was typically <2-5%. Operations manpower is much less than the previous system and within our targets (1 shift person part-time), despite the still rapid evolution of the system. While Panda currently processes only ATLAS data, it is designed to be flexible in the workloads it can accept and the resources it can utilize. Through its client library and API supporting job submission it currently accepts work from the ATLAS production system, a command-line production job interface, and two distributed analysis tools (pathena and DIAL). A further interface to GANGA is in development. The scheduler subsystem's Condor-G implementation supports most sites, and PBS supports one. The Panda monitor [21] is designed as a unified and comprehensive web interface to all usage modes, clients, and the internals of the system itself, as well as the closely allied data management system. While Panda and the distributed data management system DQ2 are closely coupled, DDM interfaces are localized and could be generalized to support systems other than DQ2 that have a similar architecture of automated, dataset-based data organization and movement.

Panda was reported at CHEP 2006 [22]. The Panda program has been reviewed positively in the context of a US ATLAS internal computing review in Jan 2006 and a DOE/NSF review of US ATLAS computing in Feb 2006. The program also has the support of ATLAS computing management.

CMS identified the need for a just-in-time workload management system in its Computing Technical Design Report [23]. CMS is particularly interested in this technology in order to flexibly and quickly adjust workflows to changing requests and priorities of the experiment. Within US CMS, responsibility for integrating this functionality into CMS operations on the Open Science Grid falls to the "Distributed Computing Tools" (DCT) group in US CMS, led by Würthwein. Within CMS this area has so far received little attention, largely because Würthwein's group in CDF has focused on it, and is presently operating a production system referred to as GlideCAF on OSG and EGEE resources.

There are two important features that the existing CDF system does not address. At present, CDF operates under a security exemption at FNAL which is expected to expire within the first year of this proposal. After that, FNAL will require the Condor GlideIn to authenticate against some site-controlled mechanism prior to launching a user application. An initial design for an appropriate mechanism exists, and will be delivered as a UCSD contribution in the present proposal. As part of the Particle Physics Data Grid project, UCSD-CMS has participated in design and implementation of the OSG authorization framework and is thus in a perfect position to contribute in this area.

The second shortcoming of the GlideCAF system involves data placement. GlideCAF does not address collocation of data and CPU in any way. It is used by CDF either as a pure Monte Carlo submission framework across the OSG [24], or by requiring users to choose the site based on the data they need. Given the nature of the interaction rates, bunch spacings, and time constants of detector components at the LHC, any realistic Monte Carlo requires reading of so-called "pile-up" data during the "digitization" stage of the Monte Carlo workflow. The problem of collocation of data and CPU is thus fundamental to virtually all CMS use cases, and needs to be addressed by the just-in-time workload management system. Initial design discussions with the Condor team have started. We expect this to be addressed in the context of enhancing Condor Matchmaking to include matchmaking based on bitmaps.

In addition, there are overall scalability concerns. The Condor GlideIn concept as deployed in GlideCAF is based on creating a Condor pool across all resources harvested via the pilot jobs, or GlideIns. The system thus has the same scaling properties as a regular Condor pool. The UCSD CDF group has worked with Condor on increasing the total number of running jobs supportable in a Condor pool via a single schedd. Various solutions including Condor-C are being pursued to exceed the present limitation of about 2000 jobs in order to meet the LHC requirements of more than 10,000 simultaneously running jobs. The UCSD CDF group is operating a Condor testbed parasitically in conjunction with its 1500 CPU production cluster at FNAL. This testbed has been used to understand scalability of various Condor components.

In summary, we expect CMS to focus on integration of just-in-time workload management into the Open Science Grid authorization infrastructure, design and implementation of a mechanism to support collocation of data and CPU, and system tests addressing the overall scalability of the just-in-time workload management system. The on-project effort is only a small portion of the overall effort CMS expects to devote to this, the remainder coming from CMS DCT and DISUN, as explained in Section D.

The principals on this proposal have played leading roles in the existing work described and are well positioned to ensure both that this project is able to draw effectively on the resources and expertise they represent, and that this project in its direction and deliverables is fully in line with the requirements and expectations of the participants. The PI and SAP lead on the science application side, Torre Wenaus, co-leads the ATLAS Panda project and is US ATLAS manager for distributed software. He co-leads the Applications Area of the OSG. Within ATLAS he co-leads the Database and Data Management Project and has overseen the development of ATLAS's Distributed Data Management System Don Quixote 2. He recently completed a three year term as Applications Area Manager in the LHC Computing Grid Project, developing common physics applications software for the LHC experiments. Miron Livny, SAP lead for CS, leads the Condor project and is one of the principal architects of grid computing. He leads the Virtual Data Toolkit project which provides the grid middleware foundation for the major grids in the US and Europe. He leads the Facilities Area of the OSG. Frank Würthwein chaired the OSG interim executive board and now co-leads the Applications Area of the OSG. He leads the Distributed Computing Tools group within US CMS and is the Technical Coordinator of the Data Intensive Science University Network (DISUN), an NSF funded cyberinfrastructure proposal. Würthwein's CDF group has worked with the Condor team on a number of issues over the last few years, including scheduling policy (adding the concept of a "group"), Generic Connection Broker, Condor-C, Computing on Demand, and krb5 security context. CDF's role in all of this has been to provide input on requirements, and feedback on testing, integrating, and large scale operations of the various middleware components within Condor. The bulk of the work in CDF has traditionally been done by graduate students, with supervision from faculty, Research Scientists or Computing Professionals. We will emulate this for the SA-CMS portion of the present proposal (proposed SA/SAP components are described below).


A.3 Research Design and Methods

The present Panda system has demonstrated many of the benefits of the just-in-time approach but it is far from being an at-scale LHC solution, or a solution readily usable outside ATLAS. We plan through the work proposed here to implement and extend the functionality demonstrated by Panda using common grid middleware components, building on existing middleware and specifically Condor, and targeting the scaling and functionality requirements of the LHC. Our deliverable will be a workload management system based on late binding that is not specific to ATLAS and is based on integration of and refinements to common grid middleware. We will deploy this system on the OSG such that consortium members can make use of either the complete system or components thereof.

CMS, like ATLAS, is following a late binding approach to workload management and will participate in this work as already outlined. The participation of two experiments will help ensure and validate that deliverables are in fact experiment-neutral and effectively usable by others.

As described, Panda currently uses Condor in the resource harvesting component of the system. Discussions with the Condor team in recent months have identified many possible ways in which Condor's role in Panda could be extended in order to increase the capacity and functionality of the system while decreasing the code base, and thereby the development and support costs, of Panda itself. Expanding Condor's role is also in line with our objective of delivering an experiment-neutral system employing common middleware for deployment on the OSG. One reason there is so much scope for leveraging Condor is that Condor supports design principles central to systems like Panda: just-in-time matchmaking in support of late binding between work payload and resource; data placement as a first class player in the design of workflow and brokerage; and a master-worker architecture.

This proposal puts forward a partnership between communities on the HEP applications and grid computing sides of large scale data-intensive distributed processing development, to pool complementary software and expertise and develop, in a science-driven program, new capability in workflow management that will excel in scalability, capability and scope of application. The core of the program we propose is this direct HEP-CS partnership, evolving Panda through integration with Condor. Our program also includes science application (SA) work that supports and complements the core effort. These two aspects are discussed below.

The core effort we see as a perfect match to the SciDAC Scientific Application Partnerships (SAPs) described in the call for proposals. Through a direct HEP-CS partnership we seek to inject specific grid middleware components from a proven middleware provider, Condor, to provide new functionality in a specific and existing science application, Panda, such that Condor in clearly identified ways can enable Panda to reach its science-driven objective of LHC-scale workflow management. Furthermore, an integral part of the program is to turn Panda into a generic system in the course of this integration, deploying and supporting it on the OSG for the use of that community, thus bringing new middleware-enabled capability in the form of a science-driven tool to a broad scientific community.

We accordingly seek identification and funding of part of this proposal as a Science Application Partnership. Specifically, a full-FTE person at Madison on the Condor team, and a full-FTE person at BNL on the Panda team, will constitute the funded component of this Partnership. The Madison and BNL contingents will partner in the shared goal of developing and deploying a Condor-enabled, experiment-neutral workflow management system, as an evolution of Panda, on the OSG. The prioritized technical objectives of this Panda/Condor integration program are described below in Section A.3.c.

This core Partnership is complemented in this proposal by specifically SA activities, for which 1 FTE funding is requested (0.5 BNL-ATLAS, 0.5 CMS-UCSD). SA activities include the following, and will be undertaken primarily by SA-designated FTEs, with the support of SAP FTEs:

• The present Panda system must be initially adapted from an ATLAS-specific tool to a generic system; this will be done in the first year.

• Over the course of the Panda/Condor integration work the uninterrupted viability of Panda as a production processing and analysis system must be supported; this requires additional incremental effort over and above developing and supporting baseline Panda, and accordingly must receive support through this proposal, since no such manpower is otherwise available.

• Finally, and importantly, we seek through this proposal to leverage not only HEP-CS partnership but HEP-HE(N)P partnership.

On the latter point, the program of this proposal in developing a generic workflow management system offers the opportunity to leverage and integrate capabilities in this area across experiments. ATLAS and CMS collaboration will be enabled through their joint participation on this program to identify and pursue the adaptation of in-house developments into common tools. We will apply program resources to identify and exploit opportunities for cross-experiment common development in workflow management. Pursuing common efforts is a resource drain in the near term; it requires an incremental effort over and above what an in-house program requires because of expanded cross-experiment requirements and collaborative overheads. This was demonstrated for example by the experience of the LCG Applications Area, led for three years by Wenaus, which successfully produced common physics applications software systems, but only because of a substantial investment in common manpower over and above individual experiment teams. The 1 FTE/year (half ATLAS, half CMS) SA allocation distinct from the SAP component of this proposal's program will provide a modest level of resources to make common effort possible.

In addition to ATLAS-CMS collaboration we will seek to leverage RHIC and in particular STAR work, benefiting from STAR's collocation at BNL with the ATLAS effort and STAR's extensive activities in grid computing. An area identified for possible collaboration is in generic high level job description, needed by Panda to remove ATLAS specificity.

Specific areas identified for Panda/Condor integration that we expect to constitute the core of the program of work are the following. Preliminary priorities (Priority 1 is top priority) are defined for pursuing these areas, and the program of work following these priorities is then discussed.

i. Forwarding of certificate/identity to switch worker node identities to match the user, particularly for analysis jobs, to address security/authentication concerns. (Priority 1) Condor-G will manage the storage, delegation, and refreshing of a user's credentials in order to enable the task running at the worker node to impersonate the submitting user; Condor-G will leverage the services of a MyProxy [25] credential wallet if it is available. This service is important to enable the pilot job to authenticate as the user to third-party services such as storage appliances, or may be required by some sites for secure auditing purposes.


ii. Use of the Condor Startd agent as the basis for pilot jobs, providing robust machine and pilot monitoring, job removal and multiple job management capabilities. (Priority 1) The Condor Startd agent runs on the worker node in a typical Condor installation. It provides an extensible framework called HawkEye [26] for monitoring the node, and extensions have already been implemented to discover and report over 100 different machine characteristics (such as load average, swap usage, available memory, free disk space, process table info, etc.). This information can be used for auditing, troubleshooting, and scheduling. In addition, after a task is launched, the Condor Startd can provide usage snapshots about the task itself including all process ids spawned by the task (needed for proper cleanup), image size, CPU usage, network activity, and many other characteristics. Finally, the Startd is able to carve up the physical resources of a machine into multiple "virtual machines", in order to enable the management and scheduling of multiple jobs in a time-sharing fashion on one physical resource.

iii. Condor VM for managing analysis pilots, which must be continuously ready to accept latency-sensitive analysis work. (Priority 1) See previous point. Analysis pilots could run in parallel with production jobs within a single pilot, via the VM capability of the Startd.

iv. Condor GlideIn for opportunistic access to Globus-managed resources, and for localizing pilot job management to sites for robustness and scalability. (Priority 2)

v. Use of the Master-Worker framework to add a job management hierarchy and thereby greater scalability, with support for HTTPS as the protocol. (Priority 1) Condor is a robust system for high throughput computing. This robustness, though, comes at the cost of longer latency for job startup. While this works well for the many scientific workloads which consist of thousands of jobs each running for hours, it does not work so well for workloads which may consist of millions of jobs running for short periods. When these latter workloads are dynamic, or have inter-job dependencies, the simple solution of batching smaller tasks into larger jobs for scheduling purposes will not work.

Condor's Master-Worker framework (MW) solves this problem by creating a "task" abstraction, and multiplexes the running of these tasks inside the context of a normal Condor job [27]. In the same way that operating system threads are a lighter-weight construct riding on top of a heavyweight process, many MW tasks can run in succession on top of one Condor job. This amortizes the time it takes Condor to acquire a machine across many short-running tasks. MW allows a user to build one master process to create and manage tasks which run on many intermittently available workers. This framework has been used for many applications across different disciplines, but most notably to solve the longstanding NUG30 Quadratic Assignment Problem [28].

Today, MW has a hard limit of one master process per computation. This limits scalability to roughly two thousand concurrent workers. We propose to fix this by removing the two main limitations to running large MW applications in a grid environment. The first limitation is security-related. Many networks across a grid are protected by firewalls. By enhancing MW to allow communication via HTTPS, firewall traversal and use of http proxies become possible. The second limitation, raw scalability, will be solved by allowing a tree hierarchy of masters. Each master, possibly running on a distinct machine, will manage one or two thousand workers, and master-to-master communication will load-balance tasks along the tree.
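A minimal illustration of the task-multiplexing idea follows (MW itself is a C++ framework; this sketch shows only the concept): a long-lived worker, corresponding to one acquired Condor job, repeatedly pulls short tasks from a master queue, so the cost of acquiring the machine is amortized over many tasks.

```python
import queue
import threading

tasks = queue.Queue()
for i in range(20):
    tasks.put(f"task-{i}")              # in MW these would be application tasks

def worker(name):
    # One long-lived worker per acquired slot; it runs tasks in succession
    # until the master's queue is drained.
    while True:
        try:
            t = tasks.get_nowait()
        except queue.Empty:
            return
        print(f"{name} ran {t}")        # execute the short task
        tasks.task_done()

threads = [threading.Thread(target=worker, args=(f"worker-{w}",)) for w in range(3)]
for th in threads:
    th.start()
for th in threads:
    th.join()
```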

vi. Condor Stork for scheduling reliable data movement. (Priority 2) The large amount of data involved with the experiments and the distributed nature of the compute resources requires moving it, raising the problem of efficient and reliable data placement. A typical approach to solve the data placement problem is employing simple scripts before and after the computation to explicitly move the data; these scripts frequently lack needed automation and fault tolerance capabilities.

Condor's Stork technology strives to make data placement activities a first class citizen in the grid, similar to computational jobs. With Stork, data placement activities can be queued, scheduled, monitored, managed, and checkpointed [29]. Most importantly, Stork will ensure that data transfers complete successfully and without any human interaction. In addition, Stork assists working in a heterogeneous data storage environment by enabling transfers to, from, and between different types of storage systems and protocols. Stork supports numerous emerging and established grid storage systems including GridFTP, NeST, SRB, dCache, UniTree, and others.
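For illustration, a data placement request can be written as a small declarative description and queued like a job; the field names, URLs, and submit command below are schematic assumptions, so the Stork documentation should be consulted for the exact syntax.

```python
import subprocess
import textwrap

# Schematic Stork-style data placement request: the transfer is described
# declaratively and retried by the data placement scheduler rather than by
# ad hoc pre/post scripts. URLs are placeholders.
dap_request = textwrap.dedent("""\
    [
        dap_type = "transfer";
        src_url  = "gsiftp://source.example.org/data/file.root";
        dest_url = "file:///scratch/file.root";
    ]
""")

with open("transfer.stork", "w") as f:
    f.write(dap_request)

# Hand the request to the data placement scheduler (command name assumed).
subprocess.run(["stork_submit", "transfer.stork"], check=False)
```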

vii. Explore common ground between the pilot mechanism and the similar Condor Grid Exerciser; use pilots as the payload carried by the Exerciser, e.g. for scaling tests. (Priority 2) The Grid Exerciser is an application developed and run by the Condor team to validate the availability and usability of sites in the Open Science Grid (OSG). It is an automated regression test process that runs jobs through the batch queue on a site, testing the authentication, file staging, and job execution environment. As the number of failures detected by the Exerciser decreases, the number of jobs and sites tested is increased, consequently moving the Exerciser from testing availability to testing scalability. The Grid Exerciser could be leveraged to be a valuable tool for performing tests of the availability, scalability and fault tolerance of our tools for ATLAS and CMS by being retooled to carry Panda pilots as the payload.

viii. ClassAds for workflow description and management logic. (Priority 2) For many years, Condor has successfully used ClassAd technology to describe resources and jobs [30]. A job ClassAd uses a simple but powerful language to describe characteristics about itself as well as characteristics about required resources. Because ClassAds use a semi-structured data model without a fixed schema, they are well adapted to work naturally in a heterogeneous environment. Expressions within the ClassAd describe job resource requirements, job resource preferences, and job failure semantics. Because the ClassAd language enables job policy to be expressed within the ad itself, individual jobs are able to state their own private policies for failure semantics without hard-coded logic in the scheduler itself. This enables limitless job policies to be expressed without the need to ever change the underlying middleware software. For instance, in Condor arbitrary logic expressions in job ClassAds can control when a job should be killed, suspended, held, removed, or rescheduled.
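As a small illustration (written here as Python constants holding ClassAd expressions), a job ad can combine a hard requirement, a soft preference, and a failure-handling policy; the particular attributes and thresholds are examples only.

```python
# Hard constraint on where the job may run.
requirements = '(OpSys == "LINux".upper() and "") or (OpSys == "LINUX") && (Memory >= 2048)'

# The line above is intentionally simple in spirit; a plain ClassAd form is:
requirements = '(OpSys == "LINUX") && (Memory >= 2048) && (Disk > 10000000)'

# Soft preference used to rank matching machines.
rank = "KFlops"

# Failure policy carried by the job itself rather than hard-coded in the
# scheduler: hold the job if it exits abnormally, and remove it if it has
# been held (JobStatus == 5) for more than a day.
on_exit_hold = "ExitBySignal || (ExitCode != 0)"
periodic_remove = "(JobStatus == 5) && (time() - EnteredCurrentStatus > 86400)"

job_ad = {
    "Requirements": requirements,
    "Rank": rank,
    "OnExitHold": on_exit_hold,
    "PeriodicRemove": periodic_remove,
}
print(job_ad)
```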

ix. Condor leases to manage worker node retention. (Priority 2) Condor implements a lease-based algorithm [31] to provide fault-tolerance in the event that the submission node reboots, or in the event of network failure between the submission and worker node. The worker node obtains a lease from the submission agent, and periodically refreshes the lease time. If the lease expires, the worker node is relinquished and the submission node will attempt to restart the job on a different worker. The lease implementation is compatible with both Condor and WS-Resource Framework Globus Toolkit 4.x middleware.

x. Support for access to heterogeneously managed resources, e.g. Condor GlideIn for Globus-managed resources and schedd support for local batch queues. (Priority 1) Seamless access to resources across a wide variety of administrative domains requires support for a number of local job workload managers as well as network-based grid middleware. The challenge is to preserve semantic behaviors and guarantees across job management systems, such as run at-most once, run at-least once, and job failure semantics. Jobs submitted to Condor's workload manager can be launched by utilizing client interfaces to middleware stacks from Condor, Globus 2.x (GRAM), Globus 4.x (managed job service), Unicore, and NorduGrid.

xi. Use of Condor Quill for relational database based monitoring of pilots. (Priority 1) Accelerated by a three year grant from the National Science Foundation awarded in July 2005 [32], database researchers at UW-Madison have been investigating the application of relational databases towards batch and workflow-driven computing [33]. The first phase of the work, to instrument Condor with a data model for the underlying support of Condor's operational data while preserving the semi-structured nature of how Condor represents heterogeneous tasks and resources, has already produced tangible results. One result of this initial work is the condor_quill service. The condor_quill service has been released in recent versions of Condor and safely mirrors all Condor job state data into the open source PostgreSQL relational database management system (RDBMS) [34]. By using the condor_quill technology, scalability and integration are enhanced. Integration is enhanced because monitoring systems can use well-understood SQL and widely implemented database APIs to query the state of the system.
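For example, a monitoring query against the mirrored database is a few lines of SQL; the table and column names below are hypothetical placeholders, since the actual Quill schema is defined by the Condor release in use.

```python
import psycopg2  # client for the PostgreSQL database that Quill mirrors into

conn = psycopg2.connect(host="quilldb.example.org", dbname="quill", user="monitor")
cur = conn.cursor()

# Example: count pilot jobs per site and status (schema names are placeholders).
cur.execute("""
    SELECT site, job_status, COUNT(*)
    FROM jobs_history
    WHERE owner = %s
    GROUP BY site, job_status
""", ("panda_pilot",))

for site, status, count in cur.fetchall():
    print(site, status, count)

cur.close()
conn.close()
```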

xii. Use of the 'global meta-matchmaker' planned for Condor as a basis for high-level task queue management. (Priority 1) Although currently Condor can interface with a wide variety of batch workload and grid middleware systems (see x above), its ability to seamlessly delegate jobs across a heterogeneous grid consisting of multiple system types is limited. For instance, in some cases the submitting user needs to know a priori if the available worker node will be managed by Globus or Condor middleware. We propose to work with the Condor team to specify, test, and deploy the Condor Project's planned next generation scheduler that will remove the need for any a priori knowledge on the part of the submitter.

Liaisons on the Condor and Panda teams have been established to begin addressing these areas.

The first priority Panda/Condor integration items described above will be given immediate attention on project launch to define specific integration objectives, prioritize within the top priority list, and establish a concrete workplan that we expect will occupy us into the second year. At the end of the first year, the workplan will be reviewed together with the second priority items indicated above, and a workplan for the second and third years established. The slate of items above, together with modifications and additions as we gain experience, is expected to occupy at least the first two years of the program to complete a first-pass integration.

The integration work will involve not just direct application of current Condor to the workflow system, but also requirements/capability review and prototyping to identify Condor extensions and Panda adaptations required to mate the two. Integration experience will reveal further requirements and refinements, requiring iterations on both Condor and Panda components. These iterations must be done in the context of a continuously functional production Panda system in the hands of users; initially ATLAS users, but OSG users also within 12 months of project launch, when the work of removing ATLAS specificity from Panda is done and initial OSG deployment takes place. Given the iterations bound to be required to arrive at a hardened and production quality integrated product, we expect development activity to require the full three years of the project. Nonetheless, even after the first year the manpower distribution will begin a migration from development to community support for our deliverables, and this migration will continue through the life of the project. We anticipate the community support workload to be 20-30% across all FTEs by the end of the first year and 50-60% in the third year.


A.3.d Management and Organization

This proposal represents a partnership and collaboration among ATLAS, CMS and the Condor team to develop common software components to be integrated as part of the OSG middleware stack and deployed for the use of the broad OSG scientific community. At the same time the work will deliver capability that is important to the workload management needs of ATLAS and CMS. Accordingly, this project will be managed in close alliance with the OSG, while also ensuring that US ATLAS and US CMS computing play close managerial roles to ensure their requirements are met. This ostensibly complex management is in fact natural and practical by virtue of the roles of the principals. Wenaus and Würthwein co-lead the Applications Area of the OSG, which encompasses middleware extensions including workload management enhancements. They are also responsible for distributed software tools within US ATLAS and US CMS respectively. Livny as leader of the Facilities Area of the OSG is responsible for the OSG software stack (VDT). These three principals will constitute the management team of the project. Wenaus as PI and co-leader of the Panda project will be the management lead responsible for ensuring the coherence of the program across all activities and individuals. Given their OSG and project roles this team can achieve the close integration with OSG we seek in this work, while remaining confident that the specific deliverables of this project, to ATLAS, CMS and Condor as well as OSG, will be met. Direct line management of BNL, Madison and UCSD personnel will come from Wenaus, Livny and Würthwein respectively.

The overall program of this proposal is to on the one hand leverage the long-established Condor middleware, and on the other to evolve currently experiment-specific software to experiment-neutral software available to the community via OSG. Accordingly, Condor-directed work within our program will use the established Condor software development procedures, and will lead to software released by Condor. On the experiment side, initial work will begin in the also well-established software development contexts of ATLAS and CMS, and will evolve with the software itself to the software development, QA, release management and distribution processes of the OSG. The end result of the work will be software added to and managed as part of the OSG's VDT software stack. All software development among the involved parties is open source.

The manpower support we seek for each year is the following, reflected in the budget summary below and the attached budget sheets:

• 1 SAP FTE, science application, BNL (SAP-BNL)

• 1 SAP FTE, computer science, UW Madison (SAP-Madison)

• 0.5 SA FTE, BNL ATLAS (SA-ATLAS)

• 1.0 SA graduate student FTE plus 10% effort of a Computing Professional, UC San Diego (SA-CMS)

These funded FTEs will be complemented by existing personnel: 4-5 FTEs on the ATLAS Panda team at BNL, UT Arlington, U Chicago and Oklahoma U; roughly 21 FTEs on the Condor team at UW-Madison; and roughly 3 FTEs within the CMS Distributed Tools group. SAP-BNL and SAP-Madison will constitute the core of a distributed team working on the Panda-Condor integration program described above. SA-ATLAS will work on removing ATLAS specificity from Panda, including working with SA-CMS and BNL STAR to identify, adapt and incorporate common workflow management related components originating in CMS, CDF and STAR. SA-CMS will share work in the latter from the CMS side, and


We note that the funding we request is far below that required to deliver the work proposed here. The work will leverage the substantial in-house support within US ATLAS for Panda development, as well as Condor core team efforts. The essential addition this proposal seeks is manpower specifically charged with partnering with Condor on developing an integrated system, and supporting some of the incremental manpower costs associated with HENP experiment partnering and removing ATLAS specificity from Panda, as well as benefiting from experiences gained with the CDF GlideCAF. Without the resources proposed here, the existing work is manpower-constrained to continue as largely experiment-specific efforts.

High level milestones and deliverables are shown here, based on the workplan discussed above (month, responsible party, milestone):

Month 1 (Joint): Review and finalize first priority Condor integration priorities
Month 6 (UCSD): Integration into OSG authorization infrastructure to satisfy security rules at FNAL
Month 9 (BNL): Deliver experiment-neutral Panda
Month 12 (BNL): Deploy generic Panda on OSG as supported workload management toolkit
Month 12 (BNL, UW): Deliver Panda/Condor integration phase 1, targeting security, GlideIn & schedd for pilot management, and scalability; deploy to OSG as a supported system 2-3 months later
Month 12 (Joint): Benchmark scalability and define scaling strategy for years 2-3
Month 12 (Joint): Review and define workplan for years 2-3
Month 24 (Joint): Deliver Panda/Condor integration phase 2, targeting generalized data-CPU collocation mechanism, scalability enhancements, matchmaking and workflow logic
Month 36 (Joint): Deliver and deploy final development iteration focused on addressing production feedback, robustness and hardening, scaling improvements, and analysis workload improvements

A.4 Consortium Arrangements – OSG Collaboration

The proposed program of work is strongly aligned with that of the Open Science Grid and will be organized in close coordination with it. The Open Science Grid (OSG) is a US grid computing infrastructure and development community that supports scientific computing via an open collaboration of science researchers, software developers and computing, storage and network providers. The OSG Consortium builds and operates the OSG as a coherent distributed computing infrastructure built on the foundation of a robust common middleware infrastructure. It brings resources and researchers from universities and national laboratories together domestically, and cooperates with other efforts internationally, to give scientists from many fields access to shared resources worldwide. The OSG is also developing science-driven, targeted programs to extend the capability of the common middleware stack in key areas, of which workload management is one. This is where the present proposal fits naturally as a contribution to the OSG. See Appendix/Section G for a letter of support from OSG for this project.


B Biographical Sketches of Investigators

Brief biographical sketches for M. Livny, T. Wenaus and F. Würthwein follow.


1995 Professor, Computer Sciences Department, University of Wisconsin-Madison

1989 Associate Professor, Computer Sciences Department, University of Wisconsin-Madison

1984 Assistant Professor, Computer Sciences Department, University of Wisconsin-Madison

1983 Instructor, Computer Sciences Department, University of Wisconsin-Madison

1979 Research Assistant, Hebrew University, Center for Agricultural Economy, Rehovot, Israel
