Slide 1: Berkeley RAD Lab:
Research in Internet-Scale Computing Systems
Randy H. Katz, randy@cs.berkeley.edu
28 March 2007
Slide 2: Five-Year Mission
• Observation: Internet systems are complex, fragile, manually managed, and evolving rapidly
– To scale eBay, you must build an eBay-sized company
– To scale YouTube, you get acquired by a Google-sized company
• Mission: enable a single person to create, evolve, and operate the next-generation IT service
– "The Fortune 1 Million," by enabling rapid innovation
• Approach: create core technology spanning systems, networking, and machine learning
• Focus: make the datacenter easier to manage, so that one person can analyze, deploy, and operate a scalable IT service
Slide 3: Jan 07 Announcements by Microsoft and Google
• Microsoft and Google race to build next-gen DCs
– Microsoft announces a $550 million DC in TX
– Google confirms plans for a $600 million site in NC
– Google plans two more DCs in SC that may cost another $950 million, with about 150,000 computers each
• Internet DCs are the next computing platform
• Power availability drives deployment decisions
Slide 4: The Datacenter Is the Computer
• Google program == Web search, Gmail, …
• Google computer == warehouse-sized facilities; such workloads are likely to become more common
– Luiz Barroso's talk at RAD Lab, 12/11/06
• Sun Project Blackbox (10/17/06): compose a datacenter from 20-ft containers!
– Power/cooling for 200 KW
– External taps for electricity, network, cold water
– 250 servers, 7 TB DRAM, or 1.5 PB disk in 2006
– 20% energy savings
– 1/10th? the cost of a building
Slide 5: Datacenter Programming System
• Ruby on Rails: an open-source Web framework optimized for programmer happiness and sustainable productivity
– Convention over configuration
– Scaffolding: automatic, Web-based UI to stored data
– Program the client: write browser-side code in Ruby, compile to JavaScript
– "Duck Typing"/"Mix-Ins"
• Proven expressiveness
– Lines of code, Java vs. RoR: 3:1
– Lines of configuration, Java vs. RoR: 10:1
• More than a fad
– Java on Rails, Python on Rails, …
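The "convention over configuration" bullet above can be sketched in plain Ruby. This is a toy illustration, not Rails itself: the names `ToyRecord`, `Order`, and `attribute` are invented, but the mechanism (deriving the table name from the class name, and generating accessors via metaprogramming) mirrors what ActiveRecord does, which is why no mapping file is needed.

```ruby
# A toy illustration (not real Rails) of convention over configuration.
class ToyRecord
  # Convention: class Order -> table "orders" (naive pluralization).
  def self.table_name
    name.downcase + "s"
  end

  # Metaprogramming: declare attributes once; readers/writers are generated.
  def self.attribute(*names)
    names.each { |n| attr_accessor n }
  end
end

class Order < ToyRecord
  attribute :item, :quantity
end

order = Order.new
order.item = "server rack"
order.quantity = 2
```

The point of the sketch is that `Order` never states its table name or lists getters/setters; both fall out of conventions plus Ruby's open classes.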
Slide 6: Datacenter Synthesis + OS
• Synthesis: change the DC via a written specification
– DC Spec Language compiled to a logical configuration
• OS: allocate, monitor, and adjust during operation
– A Director using machine learning; Drivers send commands
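The slide does not show the DC Spec Language itself, so here is a hypothetical sketch of the compile-to-logical-configuration step as an internal Ruby DSL; `DCSpec`, `tier`, and `compile` are all invented names, not the project's actual interface.

```ruby
# Hypothetical DC spec as an internal Ruby DSL (names invented).
class DCSpec
  attr_reader :tiers

  def initialize(&block)
    @tiers = {}
    instance_eval(&block)  # run the spec block in this object's context
  end

  # Declare a tier: how many replicas, and which image each one runs.
  def tier(name, replicas:, image:)
    @tiers[name] = { replicas: replicas, image: image }
  end

  # "Compile" the spec to a flat logical configuration: one entry per
  # server instance, which a Driver could then act on.
  def compile
    @tiers.flat_map do |name, t|
      (1..t[:replicas]).map { |i| { node: "#{name}-#{i}", image: t[:image] } }
    end
  end
end

spec = DCSpec.new do
  tier :web, replicas: 3, image: "rails-app"
  tier :db,  replicas: 1, image: "mysql"
end

config = spec.compile
```

The design choice the sketch illustrates: the operator writes *what* the datacenter should look like; the compiler expands it into the per-node configuration that the OS layer operates on.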
Slide 7: "System" Statistical Machine Learning
• S2ML strengths
– Handles SW churn: train, rather than hand-write, the logic
– Beyond queuing models: learns how to handle/make policy between steady states
– Beyond control theory: copes with complex cost functions
– Discovery: finds trends, needles in the data haystack
– Exploits cheap processing advances: fast enough to run online
• S2ML as an integral component of the DC OS
Slide 8: Datacenter Monitoring
• S2ML needs data to analyze
• DC components come with sensors already
– CPUs (performance counters)
– Disks (SMART interface)
• Add sensors to software
– Log files
– DTrace on Solaris, Mac OS
• Trace 10K++ nodes within and between DCs
– *Trace: app-oriented path-recording framework
– X-Trace: cross-layer/cross-domain, including the network layer
Slide 9: Middleboxes in Today's DC
• Middleboxes inserted on the physical path
– Policy via plumbing
– Weakest link: a single point of failure and bottleneck
– Expensive to upgrade and to introduce new functionality
• Identity-based Routing Layer: policy, not plumbing, to route classified packets over a high-speed network to the appropriate middlebox services
Slide 10: DC Energy Conservation
• Bringing processor resources on/off-line: dynamic environment, complex cost function, measurement-driven decisions
• Preserve 100% of Service Level Agreements
• Don't hurt hardware reliability
• Then conserve energy
• Conserve energy and improve reliability
– MTTF: the stress of on/off cycles vs. the benefits of off-hours
Slide 11: DC Networking and Power
• Within DC racks, network equipment is often the "hottest" component in the hot spot
• Network opportunities for power reduction
– Transition to higher-speed interconnects (10 Gb/s) at DC scales and densities
– High-function/high-power assists embedded in network elements (e.g., TCAMs)
Slide 12: Thermal Image of Typical Cluster Rack
[Thermal image showing the rack and rack switch]
M. K. Patterson, A. Pratt, P. Kumar, "From UPS to Silicon: An End-to-End Evaluation of Datacenter Efficiency", Intel Corporation
Slide 13: DC Networking and Power
• Selectively power down ports/portions of network elements
• Enhanced power-awareness in the network stack
– Power-aware routing and support for system virtualization
• Support for datacenter "slice" power-down and restart
– Application- and power-aware media access/control
• Dynamic selection of full/half duplex
• Directional asymmetry to save power, e.g., 10 Gb/s send, 100 Mb/s receive
– Power-awareness in applications and protocols
• Hard state (proxying), soft state (caching), protocol/data "streamlining" for power as well as bandwidth reduction
• Power implications for topology design
– Tradeoffs in redundancy/high availability vs. power consumption
– VLAN support for power-aware system virtualization
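As a thought experiment, the "selectively power down ports" idea above might look like the following sketch. The function, the utilization threshold, and the redundancy floor are all invented for illustration; a real controller would also weigh the redundancy/availability tradeoff the last bullet mentions.

```ruby
# Hypothetical sketch: choose which switch ports to power down, given
# per-port utilization, while keeping a redundancy floor of live ports.
def ports_to_power_down(utilization, threshold: 0.05, min_live: 2)
  # Least-used ports are candidates to shut off first.
  candidates = utilization.sort_by { |_, u| u }
                          .select  { |_, u| u < threshold }
                          .map(&:first)
  # Redundancy floor: never leave fewer than min_live ports powered.
  max_off = [utilization.size - min_live, 0].max
  candidates.first(max_off)
end

sample = { "eth0" => 0.80, "eth1" => 0.01, "eth2" => 0.00, "eth3" => 0.02 }
off = ports_to_power_down(sample)
```

With the sample utilizations, the two idlest ports are selected and the redundancy floor keeps the other two alive even though three ports are below the threshold.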
Slide 14: Why University Research?
• Imperative that future technical leaders learn to deal with scale in modern computing systems
• Draw on talented but inexperienced people
– Pick from worldwide talent pool for students & faculty
– Don’t know what they can’t do
• Inexpensive, which allows focus on speculative ideas
– Mostly grad student salaries
– Faculty part time
• Tech Transfer engine
– Success = Train students to go forth and replicate
– Promiscuous publication, including source code
– Ideal launching point for startups
Slide 15: Why a New Funding Model?
• DARPA has exited long-term research in experimental computing systems
• NSF is swamped with proposals, yielding ever more conservative decisions
• Community emphasis on theoretical over experimentally-oriented, systems-building research
• Alternative: turn to industry for funding
– Opportunity to shape the research agenda
Slide 16: New Funding Model
• 30 grad students + 5 undergrads + 6 faculty + 4 staff
• Foundation companies: $500K/yr for 5 years
– Google, Microsoft, Sun Microsystems
– Prefer founding-partner technology in prototypes
– Many people from each company attend retreats, advise on directions, and get a head start on research results
– IP is put in the public domain so partners can use it without being sued
• Large affiliates, $100K/yr: Fujitsu, HP, IBM, Siemens
• Small affiliates, $50K/yr: Nortel, Oracle
• State matching programs add $1M/year: MICRO, Discovery
Slide 17: "DC is the Computer"
– OS: ML+VM; Net: Identity-based Routing; FS: Web Storage
– Prog Sys: RoR; Libraries: Web Services
– Development environment: RAMP (simulator), AWE (tester), Web 2.0 apps (benchmarks)
– Debugging environment: *Trace + X-Trace
• Milestones
– DC Energy Conservation + Reliability Enhancement
– Web 2.0 apps in RoR
Slide 18: Conclusions
• Develop, analyze, deploy, and operate modern systems at Internet scale
– Ruby on Rails for rapid application development
– Declarative datacenter for correct-by-construction system configuration and operation
– Resource management by System Statistical Machine Learning
– Virtual machines and network storage for flexible resource allocation
– Power reduction and reliability enhancement via fast power-down/restart for processing nodes
– Pervasive monitoring, tracing, simulation, and workload generation for runtime analysis/operation
Slide 19: Discussion Points
• Jointly designed datacenter testbed
– Mini-DC consisting of clusters, middleboxes, and network equipment
– Representative network topology
• Power-aware networking
– Evaluation of existing network elements
– Platform for investigating power reduction schemes in network elements
• Mutual information exchange
– Network storage architecture
– System Statistical Machine Learning
Slide 20: Ruby on Rails = DC PL
• Reasons to love Ruby on Rails:
1. Convention over Configuration
• A Rails framework feature enabled by a Ruby language feature (meta-object programming)
2. Scaffolding: an automatic, Web-based, (pedestrian) user interface to stored data
3. Program the client: as of v1.1, write browser-side code in Ruby, then compile to JavaScript
4. "Duck Typing"/"Mix-Ins"
• Looks like a string, responds like a string: it's a string!
• Mix-ins are an improvement over multiple inheritance
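Both ideas in item 4 fit in a few lines of plain Ruby. In this sketch (`RequestID` is an invented class), duck typing lets string interpolation call `to_s` on any object that responds to it, and mixing in the standard `Comparable` module supplies the ordering operators once `<=>` is defined, with no multiple inheritance needed.

```ruby
class RequestID
  include Comparable  # mix-in: supplies <, ==, >, between?, sorting, etc.
  attr_reader :value

  def initialize(value)
    @value = value
  end

  # Comparable needs only this one method; the rest comes from the mix-in.
  def <=>(other)
    value <=> other.value
  end

  # Duck typing: anything with to_s can be interpolated like a string.
  def to_s
    format("req-%06d", value)
  end
end

label = "tracing #{RequestID.new(42)}"          # interpolation calls to_s
ids   = [RequestID.new(9), RequestID.new(3)].sort  # sort uses <=> via Comparable
```

No class check ever happens: the interpolation cares only that the object "responds like a string," which is exactly the slide's point.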
Slide 21: DC Monitoring
• Imagine a world where path information is always passed along, so that user requests can always be tracked throughout the system
• Across apps, the OS, network components and layers, different computers on the LAN, …
Slide 22: *Trace: The 1% Solution
• *Trace goal: make path-based analysis low-overhead, so it can be always on inside the datacenter
– "Baseline" path-info collection with ≤ 1% overhead
– Selectively add more local detail for specific requests
• *Trace: an end-to-end path-recording framework
– Capture & timestamp a unique request ID across all system components
– A "top-level" log contains the path traces
– Local logs contain additional detail, correlated to the path ID
– Built on X-Trace
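The request-ID mechanism above can be sketched in a few lines of Ruby. This is not the *Trace implementation: the component names and the single in-memory log are invented, but they show the core idea of stamping a unique ID at the front door and correlating per-component events by that ID.

```ruby
require "securerandom"

PATH_LOG = []  # the "top-level" path log, shared by all components

# Each component appends a timestamped entry keyed by the request ID.
def record(request_id, component, event)
  PATH_LOG << { id: request_id, component: component, event: event, at: Time.now }
end

def handle_request
  request_id = SecureRandom.uuid  # unique ID stamped once, at the front door
  record(request_id, :load_balancer, :received)
  record(request_id, :app_server,    :rendered)
  record(request_id, :database,      :queried)
  request_id
end

id = handle_request
# Path-based analysis: reconstruct one request's path from the shared log.
path = PATH_LOG.select { |e| e[:id] == id }.map { |e| e[:component] }
```

Because every entry carries the ID, the path survives interleaving with other requests' entries, which is what makes an always-on, low-detail baseline log useful.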
Slide 23: X-Trace: Comprehensive Tracing through Layers, Networks, Apps
• Trace connectivity of distributed components
– Capture causal connections between requests and responses
• Cross-layer
– Includes network and middleware services such as IP and LDAP
• Cross-domain
– Multiple datacenters, composed services, overlays, mash-ups
– Control to individual administrative domains
• A "network path" sensor
– Puts individual requests/responses, at different network layers, in the context of an end-to-end request
Slide 24: Actuator: Policy-Based Routing Layer
• Assign an ID to incoming packets (hash + table lookup)
• Route based on IDs, not locations (i.e., not IP addresses)
– Sets up logical paths without changing the network topology
• A set of common middleboxes gets a single ID
– No single weakest link: robust, scalable throughput
[Diagram: an Identity-based Routing Layer directing packets tagged (IDF, IDLB) or (IDID, IDS) to the Firewall (IDF), Load Balancer (IDLB), Intrusion Detection (IDID), and Service (IDS)]
• So simple it could be done in an FPGA?
• More general than MPLS
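The "hash + table lookup" classification and ID-based forwarding can be sketched as follows. The IDs match the slide's diagram, but the policy table is invented, and a simple destination-port test stands in for the real classifier (which would hash packet headers into a table).

```ruby
# Routing table keyed by middlebox ID, not by network location (IP).
ROUTES = { "IDF"  => :firewall,
           "IDLB" => :load_balancer,
           "IDID" => :intrusion_detection,
           "IDS"  => :service }

# Policy table: which sequence of IDs each packet class receives.
POLICY = { web:   ["IDF", "IDLB"],   # web traffic: firewall, then load balancer
           other: ["IDID", "IDS"] }  # everything else: intrusion detection first

# Stand-in classifier; a real one would hash header fields into a table.
def classify(packet)
  packet[:dst_port] == 80 ? POLICY[:web] : POLICY[:other]
end

# Forwarding: resolve each ID in the packet's policy to a middlebox.
def route(packet)
  classify(packet).map { |id| ROUTES[id] }
end

web_hops   = route({ dst_port: 80 })
other_hops = route({ dst_port: 22 })
```

Changing policy here means editing a table entry, not re-plumbing the physical path, which is the slide's "policy, not plumbing" point.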
Slide 25: Other RAD Lab Projects
• Research Accelerator for MP (RAMP)
Slide 26: 1st Milestone: DC Energy Conservation
• A good match for machine learning
– An optimization, so imperfection is not catastrophic
– Lots of data to measure, a dynamically changing workload, a complex cost function
• Not steady state, so not queuing theory
• PG&E is trying to change the behavior of datacenters
• Properly stated, the problem is:
1. Preserve 100% of Service Level Agreements
2. Don't hurt hardware reliability
3. Then conserve energy
• Radical idea: can conserving energy improve hardware reliability?
Slide 27: 1st Milestone: Conserve Energy & Improve Reliability
• Improve component reliability?
• Disks: lifetimes are measured in powered-on hours, but limited to 50,000 start/stop cycles
• Idea: if disks are turned off 50% of the time, the annual failure rate is ≈ 50% of what it would otherwise be, as long as we don't exceed 50,000 start/stop cycles (≈ once per hour)
• Integrated circuits: lifetimes are affected by thermal cycling (fast change is bad), electromigration (turning off helps), and dielectric breakdown (turning off helps)
• Idea: if the number of thermal cycles is limited, could the IC failure rate due to EM and DB be cut by ≈ 30%?
See "A Case for Adaptive Datacenters to Conserve Energy and Improve Reliability"
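The start/stop budget above implies a concrete lifetime, worked out here: cycling once per hour consumes the 50,000-cycle rating in roughly 5.7 years.

```ruby
CYCLE_BUDGET    = 50_000    # rated start/stop cycles per disk (from the slide)
CYCLES_PER_YEAR = 24 * 365  # cycling once per hour, every hour

# Years until the start/stop budget is exhausted at one cycle per hour.
years_of_budget = CYCLE_BUDGET.fdiv(CYCLES_PER_YEAR)
```

Since 8,760 cycles per year is well under the budget over a typical disk's service life, hourly power cycling stays within the rating while halving powered-on hours.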
Slide 28: RAD Lab 2.0, 2nd Milestone: Killer Web 2.0 Apps
• Demonstrate the RAD Lab vision of one person creating the next great service and scaling it up
• Where do we get example great apps, given that grad students are busy creating the technology?
• Use "undergraduate computing clubs" to create exciting apps in RoR using RAD Lab equipment and technology
– Armando Fox is the RoR club leader
– Recruited a real-world RoR programmer to develop code and advise the RoR computing club
– ≈30 students joined the club in Jan 2007
– Hire the best undergrads to build RoR apps in the RAD Lab
Slide 29: Miracle of University Research
• Talented (inexperienced) people
– Pick from worldwide talent pool for students & faculty
– Don’t know what they can’t do
• Inexpensive
– Mostly grad student salaries ($50k-$75k/yr overhead)
– Faculty part time ($75k-$100k/yr including overhead)
• Berkeley & Stanford Swing for Fences (R, not r or D)
• Even if hit a single, train next generation of leaders
• Technology Transfer engine
– Success = Train students to go forth & multiply
– Publish everything, including source code
– Ideal launching point for startups
Slide 30: Chance to Partner with a Great University
• A chance to work on the "Next Great Thing"
• US News & World Report ranking of CS systems programs: 1. Berkeley, 2. CMU, 2. MIT, 4. Stanford
• Berkeley & Stanford are among the top suppliers of systems students to industry (and academia)
• A National Academy study mentions Berkeley in 7 of the 19 $1B+ industries that grew out of IT research, Stanford 4 times
– Timesharing (SDS 940), client-server computing (BSD Unix), VLSI design (SPICE), …
Slide 31: Years to a > $1B IT Industry from Research Start
[Chart]
Slide 32: The Physical RAD Lab
• Communication is inversely proportional to distance
– It almost never happens at > 100 feet or on a different floor
• Everyone (including faculty) sits in open offices
• Great meeting rooms, ubiquitous whiteboards
• Technology to concentrate: cell phone, iPod, laptop
• Google "Physical RAD Lab" to learn more
Slide 33: Example of the Next Great Thing
• Berkeley Reliable Adaptive Distributed Systems Laboratory ("RAD Lab")
– Founded 12/2005, with Google, Microsoft, and Sun as founding partners
– Armando Fox, Randy Katz, Mike Jordan, Anthony Joseph, Dave Patterson, Scott Shenker, Ion Stoica
– Google "RAD Lab" to learn more
Slide 34: RAD Lab Goal: Enable the Next "eBay"
• Create technology that enables the next great Internet service to grow rapidly without growing the organization rapidly
– Machine learning + systems is the secret sauce
• Position: "The datacenter is the computer"
– The leverage point is simplifying datacenter management
• What is the programming language of the datacenter?
• What is CAD for the datacenter?
• What is the OS for the datacenter?