High Performance Cluster Computing (Architecture, Systems, and Applications) Rajkumar Buyya, Monash University, Melbourne. Email: rajkumar@csse.monash.edu.au / rajkumar@buyya.com Web: http://www.ccse.monash.edu.au/~rajkumar / www.buyya.com ISCA 2000
Objectives Learn  and  Share  Recent advances in cluster computing (both in research and commercial settings): Architecture,  System Software Programming Environments and Tools Applications Cluster Computing Infoware: (tutorial online) http://www.buyya.com/cluster/
Agenda Overview of Computing Motivations & Enabling Technologies Cluster Architecture & its Components Clusters Classifications Cluster Middleware Single System Image Representative Cluster Systems  Resources and Conclusions
Computing Elements: Hardware, Operating System, Applications, Programming Paradigms. (Diagram: a multi-processor computing system with a microkernel; the threads interface maps processes and threads onto processors.)
Two Eras of Computing (timeline, 1940-2030): the Sequential Era and the Parallel Era, each moving through Architectures, System Software, Applications, and P.S.Es, and each progressing from R & D to commercialization to commodity.
Computing Power and Computer Architectures
Computing Power (HPC) Drivers Solving grand challenge applications using computer modeling, simulation and analysis: Life Sciences, CAD/CAM, Aerospace, Military Applications, Digital Biology, E-commerce/anything
How to Run App. Faster ? There are 3 ways to improve performance: 1. Work Harder 2. Work Smarter 3. Get Help Computer Analogy 1. Use faster hardware: e.g. reduce the time per instruction (clock cycle). 2. Optimized algorithms and techniques 3. Multiple computers to solve problem: That is, increase no. of instructions executed per clock cycle.
 
Application Case Study: “Web Serving and E-Commerce”
E-Commerce and PDC ? What are/will be the major problems/issues in eCommerce? How will or can PDC be applied to solve some of them? Other than “Compute Power”, what else can PDC contribute to e-commerce? How would/could the different forms of PDC (clusters, hypercluster, GRID,…) be applied to e-commerce? Could you describe one hot research topic for PDC applying to e-commerce? A killer e-commerce application for PDC ? … ...
Killer Applications of Clusters Numerous Scientific & Engineering Apps. Parametric Simulations Business Applications E-commerce Applications (Amazon.com, eBay.com ….) Database Applications (Oracle on cluster) Decision Support Systems Internet Applications Web serving / searching Infowares (yahoo.com, AOL.com) ASPs (application service providers) eMail, eChat, ePhone, eBook, eCommerce, eBank, eSociety, eAnything! Computing Portals Mission Critical Applications command control systems, banks, nuclear reactor control, star-war, and handling life threatening situations.
Major problems/issues in E-commerce Social Issues Capacity Planning Multilevel Business Support (e.g., B2P2C) Information Storage, Retrieval, and Update Performance Heterogeneity System Scalability System Reliability Identification and Authentication System Expandability Security Cyber Attacks Detection and Control (cyberguard) Data Replication, Consistency, and Caching Manageability (administration and control)
Amazon.com: Online sales/trading killer E-commerce Portal Several Thousands of Items books, publishers, suppliers Millions of Customers  Customers details, transactions details, support for transactions update  (Millions) of Partners Keep track of partners details, tracking referral link to partner and sales and payment  Sales based on advertised price Sales through auction/bids A mechanism for participating in the bid (buyers/sellers define rules of the game)
Can these drive E-Commerce ? Clusters are already in use for web serving, web-hosting, and a number of other Internet applications including E-commerce: scalability, availability, performance, reliable high-performance massive storage, and database support. Attempts to support online detection of cyber attacks (through data mining) and control. Hyperclusters and the GRID: Support for transparency in (secure) Site/Data Replication for high availability and quick response time (taking the site close to the user). Compute power from hyperclusters/Grid can be used for data mining for cyber attack and fraud detection and control. Helps to build Compute Power Market, ASPs, and Computing Portals.
Science Portals - e.g., PAPIA system PAPIA PC Cluster Pentiums Myrinet NetBSD/Linux PM Score-D MPC++ RWCP Japan: http://www.rwcp.or.jp/papia/
PDC hot topics for E-commerce Cluster based web-servers, search engines, portals… Scheduling and Single System Image. Heterogeneous Computing Reliability and High Availability and Data Recovery Parallel Databases and high-performance, reliable mass storage systems. CyberGuard! Data mining for detection and online control of cyber attacks, frauds, etc. Data Mining for identifying sales patterns and automatically tuning the portal for special sessions/festival sales eCash, eCheque, eBank, eSociety, eGovernment, eEntertainment, eTravel, eGoods, and so on. Data/Site Replication and Caching techniques Compute Power Market Infowares (yahoo.com, AOL.com) ASPs (application service providers) . . .
Sequential Architecture Limitations Sequential architectures are reaching physical limitations (speed of light, thermodynamics) Hardware improvements like pipelining, superscalar execution, etc., are non-scalable and require sophisticated compiler technology. Vector processing works well for certain kinds of problems.
Computational Power Improvement (chart: C.P.I. vs. no. of processors - a multiprocessor keeps improving as processors are added, while a uniprocessor stays flat)
Human Physical Growth Analogy: Computational Power Improvement (chart: growth vs. age 5-45 - rapid vertical growth early on, then horizontal leveling off)
The Tech. of PP is mature and can be exploited commercially;  significant  R & D   work on development of tools & environment. Significant development in  Networking  technology is paving a way for  heterogeneous  computing. Why Parallel Processing NOW?
History of Parallel Processing PP can be traced to a tablet dated around 100 BC. Tablet has 3 calculating positions. Infer that multiple positions: Reliability/ Speed
The aggregate speed with which complex calculations are carried out by millions of neurons in the human brain is amazing, although an individual neuron's response is slow (milliseconds) - this demonstrates the feasibility of PP. Motivating Factors
Simple classification by Flynn:  (No. of instruction and data streams) SISD   - conventional SIMD   - data parallel, vector computing MISD   - systolic arrays MIMD  - very general, multiple approaches. Current focus is on MIMD model, using general purpose processors or multicomputers.  Taxonomy of Architectures
Main HPC Architectures..1a  SISD  - mainframes, workstations, PCs.  SIMD Shared Memory - Vector machines, Cray...  MIMD   Shared Memory  - Sequent, KSR, Tera, SGI, SUN. SIMD Distributed Memory  - DAP, TMC CM-2...  MIMD Distributed Memory  - Cray T3D, Intel, Transputers, TMC CM-5, plus recent workstation  clusters (IBM SP2, DEC, Sun, HP).
Motivation for using Clusters The communications bandwidth  between workstations is increasing as new networking technologies and protocols are implemented in LANs and WANs. Workstation clusters are easier to integrate  into existing networks than special parallel computers.
Main HPC Architectures..1b. NOTE:   Modern sequential machines are not purely SISD - advanced RISC processors use  many concepts from  vector and parallel architectures  (pipelining, parallel execution of instructions, prefetching  of data, etc) in order to achieve one or more arithmetic  operations per clock cycle.
Parallel Processing Paradox Time required to develop a parallel application for solving GCA is equal to:  Half Life of Parallel Supercomputers.
The Need for Alternative Supercomputing Resources Vast numbers of under-utilised workstations available to use. Huge numbers of unused processor cycles and resources that could be put to good use in a wide variety of application areas. Reluctance to buy supercomputers due to their cost and short life span. Distributed compute resources “fit” better into today's funding model.
Technology Trend
Scalable Parallel Computers
Design Space of Competing Computer Architecture
Towards Inexpensive Supercomputing It is: Cluster Computing.. The Commodity Supercomputing!
Cluster Computing - Research Projects Beowulf (CalTech and NASA) - USA CCS (Computing Centre Software) - Paderborn, Germany Condor - University of Wisconsin-Madison, USA DQS (Distributed Queuing System) - Florida State University, USA EASY - Argonne National Lab, USA HPVM (High Performance Virtual Machine) - UIUC & now UCSB, USA far - University of Liverpool, UK Gardens - Queensland University of Technology, Australia MOSIX - Hebrew University of Jerusalem, Israel MPI (MPI Forum; MPICH is one of the popular implementations) NOW (Network of Workstations) - Berkeley, USA NIMROD - Monash University, Australia NetSolve - University of Tennessee, USA PBS (Portable Batch System) - NASA Ames and LLNL, USA PVM - Oak Ridge National Lab./UTK/Emory, USA
Cluster Computing - Commercial Software Codine (Computing in Distributed Network Environment) - GENIAS GmbH, Germany LoadLeveler - IBM Corp., USA LSF (Load Sharing Facility) - Platform Computing, Canada NQE (Network Queuing Environment) - Craysoft Corp., USA OpenFrame - Centre for Development of Advanced Computing, India RWPC (Real World Computing Partnership), Japan Unixware (SCO - Santa Cruz Operation), USA Solaris-MC (Sun Microsystems), USA ClusterTools (a number of free HPC cluster tools from Sun) A number of commercial vendors worldwide are offering clustering solutions, including IBM, Compaq, Microsoft, and a number of startups like TurboLinux, HPTI, Scali, BlackStone…..
Motivation for using Clusters Surveys show  utilisation of CPU cycles  of  desktop workstations is typically  <10%. Performance of workstations  and PCs is rapidly improving  As performance grows,  percent utilisation will decrease even further !   Organisations are reluctant to buy  large supercomputers, due to the large expense and short useful life span.
Motivation for using Clusters The development tools  for workstations are more mature than the contrasting proprietary solutions for parallel computers - mainly due to the non-standard nature of many parallel systems. Workstation clusters are a cheap  and readily available alternative to specialised High Performance Computing (HPC) platforms. Use of clusters of workstations as a distributed compute resource is very cost effective - incremental growth of system!!!
Cycle Stealing Usually a workstation will be owned by an individual, group, department, or organisation - it is dedicated to the exclusive use of the owners. This brings problems when attempting to form a cluster of workstations for running distributed applications.
Cycle Stealing Typically, there are three types of owners, who use their workstations mostly for:   1 . Sending and receiving email  and preparing documents.  2.  Software development  - edit, compile, debug and test cycle.  3.  Running compute-intensive  applications.
Cycle Stealing Cluster computing  aims to steal spare cycles  from (1) and (2) to provide resources for (3).  However, this requires  overcoming the  ownership hurdle  - people are very protective of  their   workstations.  Usually requires  organisational mandate  that computers are to be used in this way.   Stealing cycles outside standard work hours  (e.g. overnight) is easy, stealing idle cycles during work hours without impacting interactive use (both CPU and memory) is much harder .
Rise & Fall of Computing Technologies (chart): Mainframes displaced by Minis (1970), Minis by PCs (1980), PCs by Network Computing (1995)
Original Food Chain Picture
1984 Computer Food Chain Mainframe Vector Supercomputer Mini Computer Workstation PC
1994 Computer Food Chain: Mainframe, Vector Supercomputer, MPP, Workstation, PC, Mini Computer (hitting wall soon) (future is bleak)
Computer Food Chain (Now and Future)
What is a cluster? A cluster is a type of parallel or distributed processing system, which consists of a collection of interconnected  stand-alone/complete  computers  cooperatively working together as a  single , integrated computing resource. A typical cluster: Network: Faster, closer connection than a typical network (LAN) Low latency communication protocols Looser connection than SMP
Why Clusters now? (Beyond Technology and Cost) Building block is big enough complete computers (HW & SW) shipped in millions: killer micro, killer RAM, killer disks, killer OS, killer networks, killer apps. Workstation performance is doubling every 18 months. Networks are faster Higher link bandwidth (vs. 10Mbit Ethernet) Switch based networks coming (ATM) Interfaces simple & fast (Active Msgs) Striped files preferred (RAID) Demise of Mainframes, Supercomputers, & MPPs
Architectural Drivers…(cont) Node architecture dominates performance processor, cache, bus, and memory design and engineering  $  => performance Greatest demand for performance is on large systems must track the leading edge of technology without lag MPP network technology => mainstream system area networks System on every node is a powerful enabler very high speed I/O, virtual memory, scheduling, …
...Architectural Drivers Clusters can be grown: Incremental scalability (up, down, and across) Individual nodes performance can be improved by adding additional resource (new memory blocks/disks) New nodes can be added or nodes can be removed Clusters of Clusters and Metacomputing Complete software tools  Threads, PVM, MPI, DSM, C, C++, Java, Parallel C++, Compilers, Debuggers, OS, etc. Wide class of applications Sequential and grand challenging parallel applications
Clustering of Computers    for Collective Computing: Trends 1960 1990 1995+ 2000 ?
Example Clusters: Berkeley NOW 100 Sun UltraSparcs 200 disks Myrinet SAN 160 MB/s Fast comm. AM, MPI, ... Ether/ATM switched external net Global OS Self Config
Basic Components (node diagram): Sun Ultra 170 workstation - processor, cache, and memory on the node bus; a Myricom NIC (with its own processor and memory) on the I/O bus, linking to MyriNet at 160 MB/s.
Massive Cheap Storage Cluster Basic unit:  2 PCs double-ending four SCSI chains of 8 disks each Currently serving Fine Art at http://www.thinker.org/imagebase/
Cluster of SMPs (CLUMPS) Four Sun E5000s 8 processors 4 Myricom NICs each Multiprocessor, Multi-NIC, Multi-Protocol NPACI => Sun 450s
Millennium PC Clumps Inexpensive, easy to manage Cluster Replicated in many departments Prototype for very large PC cluster
Adoption of the Approach
So What’s So Different? Commodity parts? Communications Packaging? Incremental Scalability? Independent Failure? Intelligent Network Interfaces? Complete System on every node virtual memory scheduler files ...
OPPORTUNITIES  &  CHALLENGES
Opportunity of Large-scale Computing on NOW Shared Pool of Computing Resources: Processors, Memory, Disks Interconnect Guarantee at least one workstation to many individuals (when active) Deliver large % of collective resources to few individuals at any one time
Windows of Opportunities MPP/DSM: Compute across multiple systems: parallel. Network RAM: Idle memory in other nodes. Page across other nodes' idle memory Software RAID: file system supporting parallel I/O and reliability, mass-storage. Multi-path Communication: Communicate across multiple networks: Ethernet, ATM, Myrinet
Parallel Processing Scalable Parallel Applications require good floating-point performance low overhead communication scalable network bandwidth parallel file system
Network RAM Performance gap between processor and disk has widened. Thrashing to disk degrades performance significantly Paging across networks can be effective with high  performance networks and OS that recognizes idle machines Typically thrashing to network RAM can be 5 to 10 times faster than thrashing to disk
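A rough back-of-envelope check (figures assumed here for illustration, not taken from the tutorial): a disk page fault costs on the order of 10 ms (seek + rotation + transfer), while fetching a 4 KB page from a remote node's idle memory over a 10 MB/s LAN costs roughly 0.4 ms of wire time plus about 1 ms of protocol/software overhead, i.e. ~1.4 ms in total - about 7 times faster, in line with the 5 to 10 times figure above.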
Software RAID: Redundant Array of Workstation Disks I/O Bottleneck: Microprocessor performance is improving more than 50% per year. Disk access improvement is < 10% Applications often perform I/O RAID cost per byte is high compared to single disks RAIDs are connected to host computers, which are often a performance and availability bottleneck RAID in software - writing data across an array of workstation disks - provides performance, and some degree of redundancy provides availability.
Software RAID, Parallel File Systems, and Parallel I/O
Cluster Computer and its Components
Clustering Today Clustering gained momentum when 3 technologies converged: 1. Very HP Microprocessors workstation performance = yesterday's supercomputers 2. High speed communication Comm. between cluster nodes >= between processors in an SMP. 3. Standard tools for parallel/distributed computing & their growing popularity.
Cluster Computer Architecture
Cluster Components...1a Nodes Multiple High Performance Components: PCs Workstations SMPs (CLUMPS) Distributed HPC Systems leading to Metacomputing They can be based on different architectures and run different OSs
Cluster Components...1b Processors There are many (CISC/RISC/VLIW/Vector..) Intel: Pentiums, Xeon, Merced…. Sun: SPARC, ULTRASPARC HP PA IBM RS6000/PowerPC SGI MIPS Digital Alphas Integrate memory, processing and networking into a single chip IRAM (CPU & Mem): (http://iram.cs.berkeley.edu) Alpha 21364 (CPU, Memory Controller, NI)
Cluster Components…2 OS State of the art OS: Linux (Beowulf) Microsoft NT (Illinois HPVM) SUN Solaris (Berkeley NOW) IBM AIX (IBM SP2) HP UX (Illinois - PANDA) Mach (Microkernel based OS) (CMU) Cluster Operating Systems (Solaris MC, SCO Unixware, MOSIX (academic project)) OS gluing layers: (Berkeley Glunix)
Cluster Components…3 High Performance Networks Ethernet (10Mbps), Fast Ethernet (100Mbps), Gigabit Ethernet (1Gbps) SCI (Dolphin - MPI - 12 microsecond latency) ATM Myrinet (1.2Gbps) Digital Memory Channel FDDI
Cluster Components…4 Network Interfaces Network Interface Card Myrinet has NIC User-level access support Alpha 21364 processor integrates processing, memory controller, network interface into a single chip..
Cluster Components…5 Communication Software Traditional OS supported facilities (heavy weight due to protocol processing).. Sockets (TCP/IP), Pipes, etc. Light weight protocols (User Level) Active Messages (Berkeley) Fast Messages (Illinois) U-net (Cornell) XTP (Virginia) Higher-level systems can be built on top of the above protocols
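For contrast, here is a minimal sketch in C of the "heavyweight" sockets path named above, where every message crosses the user/kernel boundary and runs full TCP/IP protocol processing; the peer address and port are illustrative assumptions, not values from the tutorial.

/* sender.c - one message over TCP sockets (kernel-mediated path) */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <sys/socket.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);       /* kernel-managed endpoint */
    struct sockaddr_in peer;
    memset(&peer, 0, sizeof(peer));
    peer.sin_family = AF_INET;
    peer.sin_port   = htons(5000);                  /* assumed port */
    inet_pton(AF_INET, "10.0.0.2", &peer.sin_addr); /* assumed node address */

    if (connect(fd, (struct sockaddr *)&peer, sizeof(peer)) < 0) {
        perror("connect");
        return 1;
    }
    const char *msg = "hello, cluster node";
    /* Each send() traps into the kernel and runs the TCP/IP stack --
     * exactly the per-message overhead that user-level protocols such
     * as Active Messages or Fast Messages are designed to avoid. */
    send(fd, msg, strlen(msg), 0);
    close(fd);
    return 0;
}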
Cluster Components…6a Cluster Middleware Resides between OS and applications and offers an infrastructure for supporting: Single System Image (SSI) System Availability (SA) SSI makes a collection appear as a single machine (globalised view of system resources). Telnet cluster.myinstitute.edu SA - Checkpointing and process migration..
Cluster Components…6b Middleware Components Hardware DEC Memory Channel, DSM (Alewife, DASH), SMP Techniques OS / Gluing Layers (Solaris MC, Unixware, Glunix) Applications and Subsystems System management and electronic forms Runtime systems (software DSM, PFS etc.) Resource management and scheduling (RMS): CODINE, LSF, PBS, NQS, etc.
Cluster Components…7a Programming environments Threads (PCs, SMPs, NOW..)  POSIX Threads Java Threads MPI Linux, NT, on many Supercomputers PVM Software DSMs (Shmem)
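To make the threads option above concrete, a minimal POSIX Threads sketch in C (the worker count is an arbitrary assumption):

/* workers.c - fork/join parallelism with POSIX Threads */
#include <pthread.h>
#include <stdio.h>

#define NWORKERS 4               /* assumed number of workers */

void *worker(void *arg)
{
    long id = (long)arg;
    printf("worker %ld running\n", id);
    return NULL;
}

int main(void)
{
    pthread_t t[NWORKERS];
    for (long i = 0; i < NWORKERS; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (int i = 0; i < NWORKERS; i++)
        pthread_join(t[i], NULL); /* wait for all workers to finish */
    return 0;
}

Compile and run in the same spirit as the MPI example later in the tutorial: % cc -o workers workers.c -lpthread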
Cluster Components…7b Development Tools ? Compilers C/C++/Java; Parallel programming with C++ (MIT Press book) RAD (rapid application development) tools.. GUI based tools for PP modeling Debuggers Performance Analysis Tools Visualization Tools
Cluster Components…8 Applications Sequential Parallel / Distributed (Cluster-aware app.) Grand Challenging applications Weather Forecasting Quantum Chemistry Molecular Biology Modeling Engineering Analysis (CAD/CAM) … PDBs, web servers, data-mining
Key Operational Benefits of Clustering System availability (HA): offers inherent high system availability due to the redundancy of hardware, operating systems, and applications. Hardware Fault Tolerance: redundancy for most system components (e.g. disk-RAID), including both hardware and software. OS and application reliability: run multiple copies of the OS and applications; this redundancy tolerates failures. Scalability: add servers to the cluster, more clusters to the network as the need arises, or CPUs to an SMP. High Performance: (running cluster enabled programs)
Classification of Cluster Computers
Clusters Classification..1 Based on Focus (in Market) High Performance (HP) Clusters Grand Challenging Applications High Availability (HA) Clusters Mission Critical applications
HA Cluster: Server Cluster with "Heartbeat" Connection
Clusters Classification..2 Based on Workstation/PC Ownership Dedicated Clusters Non-dedicated clusters Adaptive parallel computing Also called Communal multiprocessing
Clusters Classification..3 Based on Node Architecture.. Clusters of PCs (CoPs) Clusters of Workstations (COWs) Clusters of SMPs (CLUMPs)
Building Scalable Systems:  Cluster of SMPs (Clumps) Performance of SMP Systems Vs. Four-Processor Servers in a Cluster
Clusters Classification..4 Based on Node OS Type.. Linux Clusters (Beowulf) Solaris Clusters (Berkeley NOW) NT Clusters (HPVM) AIX Clusters (IBM SP2) SCO/Compaq Clusters (Unixware) Digital VMS Clusters, HP Clusters, ...
Clusters Classification..5 Based on node components architecture &  configuration (Processor Arch, Node Type: PC/Workstation.. & OS: Linux/NT..): Homogeneous Clusters All nodes will have similar configuration Heterogeneous Clusters Nodes based on different processors and running different OSes.
Clusters Classification..6a Dimensions of Scalability & Levels of Clustering (diagram): (1) Platform - Uniprocessor, SMP, Cluster, MPP, scaling CPU / I/O / Memory / OS; (2) Network Technology; (3) Levels of clustering - Workgroup, Department, Campus, Enterprise, Public Metacomputing (GRID).
Clusters Classification..6b Levels of Clustering Group Clusters (#nodes: 2-99) (a set of dedicated/non-dedicated computers - mainly connected by a SAN like Myrinet) Departmental Clusters (#nodes: 99-999) Organizational Clusters (#nodes: many 100s) (using ATM nets) Internet-wide Clusters = Global Clusters: (#nodes: 1000s to many millions) Metacomputing Web-based Computing Agent Based Computing Java plays a major role in web and agent based computing
Major issues in cluster design Size Scalability (physical & application) Enhanced Availability (failure management) Single System Image (look-and-feel of one system) Fast Communication (networks & protocols) Load Balancing (CPU, Net, Memory, Disk) Security and Encryption (clusters of clusters) Distributed Environment (social issues) Manageability (admin. and control) Programmability (simple API if required) Applicability (cluster-aware and non-aware app.)
Cluster Middleware  and  Single System Image
A typical Cluster Computing Environment PVM / MPI/ RSH Application Hardware/OS ???
CC should support Multi-user, time-sharing environments Nodes with different CPU speeds and memory sizes (heterogeneous configuration) Many processes, with unpredictable requirements Unlike SMP: insufficient “bonds” between nodes Each computer operates independently Inefficient utilization of resources
The missing link is provided by cluster middleware/underware PVM / MPI / RSH Application Hardware/OS Middleware or Underware
SSI Clusters-- SMP services on a CC Adaptive resource usage for better performance Ease of use -  almost like SMP Scalable configurations  -  by decentralized control Result:  HPC/HAC at PC/Workstation prices “ Pool Together”  the  “Cluster-Wide”  resources
What is Cluster Middleware ? An interface between user applications and the cluster hardware and OS platform. Middleware packages support each other at the management, programming, and implementation levels. Middleware Layers: SSI Layer Availability Layer: enables the cluster services of checkpointing, automatic failover, recovery from failure, and fault-tolerant operation among all cluster nodes.
Middleware Design Goals Complete Transparency (Manageability) Lets the user see a single cluster system.. Single entry point, ftp, telnet, software loading... Scalable Performance Easy growth of cluster; no change of API & automatic load distribution. Enhanced Availability Automatic recovery from failures Employ checkpointing & fault tolerant technologies Handle consistency of data when replicated..
What is Single System Image (SSI) ? A single system image is the  illusion , created by software or hardware, that presents a collection of resources as one, more powerful resource. SSI makes the cluster appear like a single machine to the user, to applications, and to the network. A cluster without a SSI is not a cluster
Benefits of Single System Image Usage of system resources transparently Transparent process migration and load balancing across nodes. Improved reliability and higher availability Improved system response time and performance Simplified system management Reduction in the risk of operator errors User need not be aware of the underlying system architecture to use these machines effectively
Desired SSI Services Single Entry Point: telnet cluster.my_institute.edu instead of telnet node1.cluster.my_institute.edu Single File Hierarchy: xFS, AFS, Solaris MC Proxy Single Control Point: Management from single GUI Single virtual networking Single memory space - Network RAM / DSM Single Job Management: Glunix, Codine, LSF Single User Interface: Like workstation/PC windowing environment (CDE in Solaris/NT), maybe it can use Web technology
Availability Support Functions Single I/O Space (SIO): any node can access any peripheral or disk device without knowledge of its physical location. Single Process Space (SPS): any process on any node can create processes with cluster-wide process IDs, and they communicate through signals, pipes, etc., as if they were on a single node. Checkpointing and Process Migration: saves the process state and intermediate results, in memory or on disk, to support rollback recovery when a node fails; process migration supports load balancing...
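To make the checkpoint/rollback idea concrete, here is a toy application-level sketch in C (the file name and checkpoint interval are illustrative assumptions; real SSI systems checkpoint transparently in the middleware or kernel, and can also migrate the saved state to another node):

#include <stdio.h>

#define CKPT_FILE "state.ckpt"      /* assumed checkpoint file name */

/* Returns 1 and fills *i if a previous checkpoint exists. */
static int restore(long *i)
{
    FILE *f = fopen(CKPT_FILE, "rb");
    if (!f)
        return 0;
    int ok = fread(i, sizeof *i, 1, f) == 1;
    fclose(f);
    return ok;
}

/* Saves the loop counter - standing in for real process state. */
static void checkpoint(long i)
{
    FILE *f = fopen(CKPT_FILE, "wb");
    if (!f)
        return;
    fwrite(&i, sizeof i, 1, f);
    fclose(f);                      /* state is now safe on disk */
}

int main(void)
{
    long i = 0;
    if (restore(&i))
        printf("rolled back to iteration %ld\n", i);
    for (; i < 1000000; i++) {
        /* ... compute one step ... */
        if (i % 10000 == 0)
            checkpoint(i);          /* save intermediate results */
    }
    remove(CKPT_FILE);              /* done: discard the checkpoint */
    return 0;
}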
Scalability vs. Single System Image (chart: moving from UP/SMP toward clusters and MPPs, scalability grows while single-system-image support weakens)
SSI Levels/How do we implement SSI ? It is a computer science notion of levels of abstractions (house is at a higher level of abstraction than walls, ceilings, and floors). Hardware Level Application and Subsystem Level Operating System Kernel Level
SSI at Application and Subsystem Level (c) In Search of Clusters
Level: application - Examples: cluster batch system, system management - Boundary: an application - Importance: what a user wants
Level: subsystem - Examples: distributed DB, OSF DME, Lotus Notes, MPI, PVM - Boundary: a subsystem - Importance: SSI for all applications of the subsystem
Level: file system - Examples: Sun NFS, OSF DFS, NetWare, and so on - Boundary: shared portion of the file system - Importance: implicitly supports many applications and subsystems
Level: toolkit - Examples: OSF DCE, Sun ONC+, Apollo Domain - Boundary: explicit toolkit facilities: user, service name, time - Importance: best level of support for heterogeneous system
SSI at Operating System Kernel Level (c) In Search of Clusters
Level: Kernel/OS layer - Examples: Solaris MC, Unixware, MOSIX, Sprite, Amoeba/GLUnix - Boundary: each name space: files, processes, pipes, devices, etc. - Importance: kernel support for applications, adm subsystems
Level: kernel interfaces - Examples: UNIX (Sun) vnode, Locus (IBM) vproc - Boundary: type of kernel objects: files, processes, etc. - Importance: modularizes SSI code within kernel
Level: virtual memory - Examples: none supporting operating system kernel - Boundary: each distributed virtual memory space - Importance: may simplify implementation of kernel objects
Level: microkernel - Examples: Mach, PARAS, Chorus, OSF/1 AD, Amoeba - Boundary: each service outside the microkernel - Importance: implicit SSI for all system services
SSI at Hardware Level (c) In Search of Clusters
Level: memory - Examples: SCI, DASH - Boundary: memory space - Importance: better communication and synchronization
Level: memory and I/O - Examples: SCI, SMP techniques - Boundary: memory and I/O device space - Importance: lower overhead cluster I/O
SSI Characteristics 1. Every SSI has a boundary 2. Single system support can exist at different levels within a system, with one level able to be built on another
SSI Boundaries -- an application's SSI boundary (e.g., a batch system) (c) In Search of Clusters
Relationship Among Middleware Modules
SSI via OS path! 1. Build as a layer on top of the existing OS Benefits: makes the system quickly portable, tracks vendor software upgrades, and reduces development time. i.e. new systems can be built quickly by mapping new services onto the functionality provided by the layer beneath. E.g.: Glunix 2. Build SSI at kernel level, a true cluster OS Good, but can't leverage OS improvements by the vendor E.g.: Unixware, Solaris-MC, and MOSIX
SSI Representative Systems OS level SSI SCO NSC UnixWare Solaris-MC MOSIX, …. Middleware level SSI PVM, TreadMarks (DSM), Glunix, Condor, Codine, Nimrod, …. Application level SSI PARMON, Parallel Oracle, ...
SCO NonStop® Cluster for UnixWare (diagram): each UP or SMP node runs standard SCO UnixWare® with clustering hooks; modular kernel extensions sit between standard OS kernel calls and the devices; nodes connect to other nodes via ServerNet™; users, applications, and systems management sit on top. http://www.sco.com/products/clustering/
How does NonStop Clusters Work? Modular Extensions and Hooks to Provide: Single clusterwide filesystem view Transparent clusterwide device access Transparent swap space sharing Transparent clusterwide IPC High performance internode communications Transparent clusterwide processes, migration, etc. Node down cleanup and resource failover Transparent clusterwide parallel TCP/IP networking Application availability Clusterwide membership and cluster timesync Cluster system administration Load leveling
Solaris-MC: Solaris for MultiComputers global file system globalized process management globalized networking and I/O http://www.sun.com/research/solaris-mc/
Solaris MC components Object and communication support High availability support PXFS global distributed file system Process management Networking
Multicomputer OS for UNIX (MOSIX) An OS module (layer) that provides the applications with the illusion of working on a single system Remote operations are performed like local operations Transparent to the application - user interface unchanged (Diagram: the MOSIX layer sits between applications using PVM / MPI / RSH and the node Hardware/OS.) http://www.mosix.cs.huji.ac.il/
Main tool: preemptive process migration that can migrate any process, anywhere, anytime Supervised by distributed algorithms that respond on-line to global resource availability - transparently Load-balancing - migrate processes from over-loaded to under-loaded nodes Memory ushering - migrate processes from a node that has exhausted its memory, to prevent paging/swapping
MOSIX for Linux at HUJI A scalable cluster configuration: 50 Pentium-II 300 MHz 38 Pentium-Pro 200 MHz (some are SMPs) 16 Pentium-II 400 MHz (some are SMPs) Over 12 GB cluster-wide RAM Connected by the Myrinet 2.56 Gb/s LAN Runs Red-Hat 6.0, based on Kernel 2.2.7 Upgrade: HW with Intel, SW with Linux Download MOSIX: http://www.mosix.cs.huji.ac.il/
NOW @ Berkeley Design & Implementation of higher-level system Global OS (Glunix) Parallel File Systems (xFS) Fast Communication (HW for Active Messages) Application Support Overcoming technology shortcomings Fault tolerance System Management NOW Goal: Faster for Parallel AND Sequential http://now.cs.berkeley.edu/
NOW Software Components (diagram): each Unix (Solaris) workstation runs Active Messages over a Myrinet driver (AM, L.C.P., VN segment); on top sits the Global Layer Unix (GLUnix) with a name server and scheduler; large sequential and parallel apps use Sockets, Split-C, MPI, HPF, and vSM over Active Messages; workstations are joined by the Myrinet scalable interconnect.
3 Paths for Applications on NOW? Revolutionary (MPP style): write new programs from scratch using MPP languages, compilers, libraries,… Porting: port programs from mainframes, supercomputers, MPPs, … Evolutionary: take a sequential program & use 1) Network RAM: first use the memory of many computers to reduce disk accesses; if not fast enough, then: 2) Parallel I/O: use many disks in parallel for accesses not in the file cache; if not fast enough, then: 3) Parallel program: change the program until it sees enough processors that it is fast => large speedup without a fine grain parallel program
Comparison of 4 Cluster Systems
Cluster Programming Environments Shared Memory Based DSM Threads/OpenMP (enabled for clusters) Java threads (HKU JESSICA, IBM cJVM) Message Passing Based PVM (PVM) MPI (MPI) Parametric Computations Nimrod/Clustor Automatic Parallelising Compilers Parallel Libraries & Computational Kernels (NetSolve)
Levels of Parallelism by code granularity (diagram):
Large grain (task level): code item - program; exploited via PVM/MPI
Medium grain (control level): code item - function (thread); exploited via threads
Fine grain (data level): code item - loop; exploited via compilers
Very fine grain (multiple issue): code item - instruction; exploited via CPU hardware
MPI (Message Passing Interface) A standard message passing interface. MPI 1.0 - May 1994 (started in 1992) C and Fortran bindings (now Java too) Portable (once coded, it can run on virtually all HPC platforms, including clusters!) Performance (by exploiting native hardware features) Functionality (over 115 functions in MPI 1.0) environment management, point-to-point & collective communications, process group, communication world, derived data types, and virtual topology routines. Availability - a variety of implementations available, both vendor and public domain. http://www.mpi-forum.org/
A Sample MPI Program...

#include <stdio.h>
#include <string.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
    int my_rank;         /* process rank */
    int p;               /* no. of processes */
    int source;          /* rank of sender */
    int dest;            /* rank of receiver */
    int tag = 0;         /* message tag, like "email subject" */
    char message[100];   /* buffer */
    MPI_Status status;   /* function return status */

    /* Start up MPI */
    MPI_Init(&argc, &argv);
    /* Find our process rank/id */
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    /* Find out how many processes/tasks are part of this run */
    MPI_Comm_size(MPI_COMM_WORLD, &p);
    …
A Sample MPI Program

    if (my_rank == 0) {   /* Master Process */
        for (source = 1; source < p; source++) {
            MPI_Recv(message, 100, MPI_CHAR, source, tag, MPI_COMM_WORLD, &status);
            printf("%s\n", message);
        }
    } else {              /* Worker Process */
        sprintf(message, "Hello, I am your worker process %d!", my_rank);
        dest = 0;
        MPI_Send(message, strlen(message) + 1, MPI_CHAR, dest, tag, MPI_COMM_WORLD);
    }
    /* Shut down MPI environment */
    MPI_Finalize();
}
Execution

% cc -o hello hello.c -lmpi
% mpirun -p2 hello
Hello, I am your worker process 1!
% mpirun -p4 hello
Hello, I am your worker process 1!
Hello, I am your worker process 2!
Hello, I am your worker process 3!
% mpirun hello
(no output - with a single process there are no workers, hence no greetings)
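The sample above uses only point-to-point calls (MPI_Send/MPI_Recv). As a short sketch of the collective operations mentioned earlier - not part of the original tutorial program - the master can broadcast a value to all ranks and have a global sum reduced back:

#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
    int rank, p, n = 0, local, sum = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    if (rank == 0)
        n = 100;                                   /* master picks a value */
    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);  /* every rank now has n */

    local = n + rank;                              /* each rank's contribution */
    MPI_Reduce(&local, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum across %d processes = %d\n", p, sum);
    MPI_Finalize();
    return 0;
}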
PARMON: A Cluster Monitoring Tool (diagram): a PARMON server (parmond) runs on each node; the PARMON client (parmon) runs on a JVM; nodes are connected by a high-speed switch. http://www.buyya.com/parmon/
Resource Utilization at a Glance
Single I/O Space and Design Issues Globalised Cluster Storage Reference: “Designing SSI Clusters with Hierarchical Checkpointing and Single I/O Space”, IEEE Concurrency, March 1999, by K. Hwang, H. Jin et al.
Clusters with & without Single I/O Space (diagram): without a single I/O space, users address each node's services separately; with a single I/O space, users see one set of services spanning the whole cluster.
Benefits of Single I/O Space Eliminate the gap between accessing local disk(s) and remote disks Support persistent programming paradigm Allow striping on remote disks, accelerate parallel I/O operations Facilitate the implementation of distributed checkpointing and recovery schemes
Single I/O Space Design Issues Integrated I/O Space Addressing and Mapping Mechanisms Data movement procedures
Integrated I/O Space (diagram): one sequence of addresses spans the local disks LD1..LDn (the RADD space), the shared RAIDs SD1..SDm (the NASD space), and the peripherals P1..Ph (the NAP space), with the data blocks of each device mapped into the common space.
Addressing and Mapping (diagram): user-level middleware plus some modified OS system calls; user applications go through I/O agents for the RADD, NASD, and NAP spaces, with a name agent, a Disk/RAID/NAP mapper, and a block mover underneath.
Data Movement Procedures (diagram): a user application on Node 1 requests data block A; the I/O agents on Node 1 and Node 2 cooperate, and the block mover transfers A between LD1 on Node 2 and LD2 or SDi of the NASD.
What Next ?? Clusters of Clusters (HyperClusters) Global Grid Interplanetary Grid Universal Grid??
Clusters of Clusters (HyperClusters) (diagram): three clusters, each with its own scheduler, master daemon, and execution daemons, fed by submit and graphical control clients, interconnected over a LAN/WAN.
Towards Grid Computing…. For illustration, placed resources arbitrarily on the GUSTO test-bed!!
What is Grid ? An infrastructure that couples Computers (PCs, workstations, clusters, traditional supercomputers, and even laptops, notebooks, mobile computers, PDAs, and so on) Software ? (e.g., renting expensive special purpose applications on demand) Databases (e.g., transparent access to the human genome database) Special Instruments (e.g., radio telescope - SETI@Home searching for life in the galaxy, Astrophysics@Swinburne for pulsars) People (maybe even animals, who knows ?) across the local/wide-area networks (enterprise, organisations, or Internet) and presents them as a unified integrated (single) resource.
Conceptual view of the Grid Leading to Portal (Super)Computing http://www.sun.com/hpc/
Grid Application-Drivers Old and New applications getting enabled due to coupling of computers, databases, instruments, people, etc: (distributed) Supercomputing Collaborative engineering high-throughput computing large scale simulation & parameter studies Remote software access / Renting Software Data-intensive computing On-demand computing
Grid Components (layered diagram):
Grid Apps - Applications and Portals: scientific, engineering, collaboration, problem solving environments, web enabled apps.
Grid Tools - Development Environments and Tools: languages, libraries, debuggers, web tools, resource brokers, monitoring.
Grid Middleware - Distributed Resources Coupling Services: comm., sign on & security, information, QoS, process, data access.
Local Resource Managers: operating systems, queuing systems, TCP/IP & UDP, libraries & app kernels.
Grid Fabric - Networked Resources across Organisations: computers, clusters, data sources, scientific instruments, storage systems.
Many GRID Projects and Initiatives PUBLIC FORUMS Computing Portals Grid Forum European Grid Forum IEEE TFCC! GRID’2000 and more. Australia Nimrod/G EcoGrid and GRACE DISCWorld Europe UNICORE MOL METODIS Globe Poznan Metacomputing CERN Data Grid MetaMPI DAS JaWS and many more... Public Grid Initiatives Distributed.net SETI@home Compute Power Grid USA Globus Legion JAVELIN AppLes NASA IPG Condor Harness NetSolve NCSA Workbench WebFlow EveryWhere and many more... Japan Ninf Bricks and many more... http://www.gridcomputing.com/
 
http://www.epm.ornl.gov/harness/
 
Nimrod - A Job Management System http://www.dgs.monash.edu.au/~davida/nimrod.html
Job processing with Nimrod
Nimrod/G Architecture (diagram): Nimrod/G clients drive the Nimrod engine, which works with a schedule advisor, trading manager, grid explorer, persistent store, and dispatcher; through middleware services and Grid information services, jobs are dispatched to resources on the GUSTO test bed. (RM: Local Resource Manager, TS: Trade Server)
Compute Power Market (diagram): a user application submits work to a resource broker (job control agent, schedule advisor, trade manager, grid explorer, deployment agent); the trade manager negotiates with a trade server in each resource domain, which applies charging algorithms and accounting and performs resource allocation/reservation on resources R1..Rn; a grid information server and other services support trading.
Pointers to Literature on Cluster Computing
Reading Resources..1a Internet & WWW Computer Architecture: http://www.cs.wisc.edu/~arch/www/ PFS & Parallel I/O: http://www.cs.dartmouth.edu/pario/ Linux Parallel Processing: http://yara.ecn.purdue.edu/~pplinux/Sites/ DSMs: http://www.cs.umd.edu/~keleher/dsm.html
Reading Resources..1b Internet & WWW Solaris-MC http://www.sunlabs.com/research/solaris-mc Microprocessors: Recent Advances http://www.microprocessor.sscc.ru Beowulf: http://www.beowulf.org Metacomputing http://www.sis.port.ac.uk/~mab/Metacomputing/
Reading Resources..2 Books In Search of Clusters by G. Pfister, Prentice Hall (2nd ed.), 1998 High Performance Cluster Computing Volume 1: Architectures and Systems Volume 2: Programming and Applications Edited by Rajkumar Buyya, Prentice Hall, NJ, USA. Scalable Parallel Computing by K. Hwang & Z. Xu, McGraw Hill, 1998
Reading Resources..3 Journals “A Case for NOW (Networks of Workstations)”, IEEE Micro, Feb ’95, by Anderson, Culler, Patterson “Fault Tolerant COW with SSI”, IEEE Concurrency (to appear), by Kai Hwang, Chow, Wang, Jin, Xu “Cluster Computing: The Commodity Supercomputing”, Journal of Software Practice and Experience (get from my web), by Mark Baker & Rajkumar Buyya
Cluster Computing Infoware http://www.csse.monash.edu.au/~rajkumar/cluster/
Cluster Computing Forum IEEE Task Force on Cluster Computing  (TFCC) http://www.ieeetfcc.org
TFCC Activities... Network Technologies OS Technologies Parallel I/O Programming Environments Java Technologies Algorithms and Applications Analysis and Profiling Storage Technologies High Throughput Computing
TFCC Activities... High Availability Single System Image Performance Evaluation Software Engineering Education Newsletter Industrial Wing TFCC Regional Activities All the above have their own pages; see pointers from: http://www.ieeetfcc.org
TFCC Activities... Mailing list, Workshops, Conferences, Tutorials, Web-resources etc. Resources for introducing the subject at senior undergraduate and graduate levels. Tutorials/Workshops at IEEE Chapters.. ... and so on. FREE MEMBERSHIP, please join! Visit TFCC Page for more details: http://www.ieeetfcc.org (updated daily!).
Clusters Revisited
Summary We have discussed Clusters Enabling Technologies Architecture & its Components Classifications Middleware Single System Image Representative Systems
Conclusions Clusters are promising.. Solve the parallel processing paradox Offer incremental growth and match funding patterns New trends in hardware and software technologies are likely to make clusters even more promising, so that cluster-based supercomputers can be seen everywhere!
 
Thank You  ... ?
Backup Slides...
SISD: A Conventional Computer Speed is limited by the rate at which the computer can transfer information internally. Ex: PC, Macintosh, workstations (Diagram: instructions drive a single processor from data input to data output.)
The MISD Architecture More of an intellectual exercise than a practical configuration; few were built, and none are commercially available. (Diagram: one data input stream flows through processors A, B, and C, each driven by its own instruction stream, to the data output stream.)
SIMD Architecture Ex: CRAY vector processing machines, Thinking Machines CM* Ci <= Ai * Bi (Diagram: a single instruction stream drives processors A, B, and C, each with its own data input and output streams.)
MIMD Architecture Unlike SISD and MISD, a MIMD computer works asynchronously. Shared memory (tightly coupled) MIMD Distributed memory (loosely coupled) MIMD (Diagram: processors A, B, and C, each with its own instruction stream and its own data input and output streams.)
Shared Memory MIMD machine Comm: a source PE writes data to global memory & the destination retrieves it Easy to build; conventional OSes for SISD can easily be ported Limitation: reliability & expandability - a memory component or any processor failure affects the whole system, and adding processors leads to memory contention Ex: Silicon Graphics supercomputers... (Diagram: processors A, B, and C share a global memory system over memory buses.)
Distributed Memory MIMD Communication: IPC over a high-speed network The network can be configured as a tree, mesh, cube, etc. Unlike shared-memory MIMD, easily/readily expandable Highly reliable (any CPU failure does not affect the whole system) (Diagram: processors A, B, and C, each with its own memory system, connected by IPC channels.)

Cluster Tutorial

  • 1.
    High Performance ClusterComputing ( Architecture, Systems, and Applications) Rajkumar Buyya, Monash University, Melbourne. Email: rajkumar@csse.monash.edu.au / rajkumar@buyya.com Web: http://www.ccse.monash.edu.au/~rajkumar / www.buyya.com ISCA 2000
  • 2.
    Objectives Learn and Share Recent advances in cluster computing (both in research and commercial settings): Architecture, System Software Programming Environments and Tools Applications Cluster Computing Infoware: (tutorial online) http://www.buyya.com/cluster/
  • 3.
    Agenda Overview ofComputing Motivations & Enabling Technologies Cluster Architecture & its Components Clusters Classifications Cluster Middleware Single System Image Representative Cluster Systems Resources and Conclusions
  • 4.
    Computing Elements HardwareOperating System Applications Programming Paradigms P P P P P P   Microkernel Multi-Processor Computing System Threads Interface Process Processor Thread P
  • 5.
    Architectures System SoftwareApplications P.S.Es Architectures System Software Applications P.S.Es Two Eras of Computing Sequential Era Parallel Era 1940 50 60 70 80 90 2000 2030 Commercialization R & D Commodity
  • 6.
    Computing Power andComputer Architectures
  • 7.
    Computing Power (HPC)Drivers Solving grand challenge applications using computer modeling , simulation and analysis Life Sciences CAD/CAM Aerospace Military Applications Digital Biology Military Applications Military Applications E-commerce/anything
  • 8.
    How to RunApp. Faster ? There are 3 ways to improve performance: 1. Work Harder 2. Work Smarter 3. Get Help Computer Analogy 1. Use faster hardware: e.g. reduce the time per instruction (clock cycle). 2. Optimized algorithms and techniques 3. Multiple computers to solve problem: That is, increase no. of instructions executed per clock cycle.
  • 9.
  • 10.
    Application Case Study Web Serving and E-Commerce”
  • 11.
    E-Commerce and PDC? What are/will be the major problems/issues in eCommerce? How will or can PDC be applied to solve some of them? Other than “Compute Power”, what else can PDC contribute to e-commerce? How would/could the different forms of PDC (clusters, hypercluster, GRID,…) be applied to e-commerce? Could you describe one hot research topic for PDC applying to e-commerce? A killer e-commerce application for PDC ? … ...
  • 12.
    Killer Applications ofClusters Numerous Scientific & Engineering Apps. Parametric Simulations Business Applications E-commerce Applications (Amazon.com, eBay.com ….) Database Applications (Oracle on cluster) Decision Support Systems Internet Applications Web serving / searching Infowares (yahoo.com, AOL.com) ASPs (application service providers) eMail, eChat, ePhone, eBook, eCommerce, eBank, eSociety, eAnything! Computing Portals Mission Critical Applications command control systems, banks, nuclear reactor control, star-war, and handling life threatening situations.
  • 13.
    Major problems/issues inE-commerce Social Issues Capacity Planning Multilevel Business Support (e.g., B2P2C) Information Storage, Retrieval, and Update Performance Heterogeneity System Scalability System Reliability Identification and Authentication System Expandability Security Cyber Attacks Detection and Control (cyberguard) Data Replication, Consistency, and Caching Manageability (administration and control)
  • 14.
    Amazon.com: Online sales/tradingkiller E-commerce Portal Several Thousands of Items books, publishers, suppliers Millions of Customers Customers details, transactions details, support for transactions update (Millions) of Partners Keep track of partners details, tracking referral link to partner and sales and payment Sales based on advertised price Sales through auction/bids A mechanism for participating in the bid (buyers/sellers define rules of the game)
  • 15.
    Can these driveE-Commerce ? Clusters are already in use for web serving, web-hosting, and number of other Internet applications including E-commerce scalability, availability, performance, reliable-high performance-massive storage and database support. Attempts to support online detection of cyber attacks (through data mining) and control Hyperclusters and the GRID: Support for transparency in (secure) Site/Data Replication for high availability and quick response time (taking site close to the user). Compute power from hyperclusters/Grid can be used for data mining for cyber attacks and fraud detection and control. Helps to build Compute Power Market, ASPs, and Computing Portals. 2100 2100 2100 2100 2100 2100 2100 2100
  • 16.
    Science Portals -e.g., PAPIA system PAPIA PC Cluster Pentiums Myrinet NetBSD/Linuux PM Score-D MPC++ RWCP Japan: http://www.rwcp.or.jp/papia/
  • 17.
    PDC hot topicsfor E-commerce Cluster based web-servers, search engineers, portals… Scheduling and Single System Image. Heterogeneous Computing Reliability and High Availability and Data Recovery Parallel Databases and high performance-reliable-mass storage systems. CyberGuard! Data mining for detection of cyber attacks, frauds, etc. detection and online control. Data Mining for identifying sales pattern and automatically tuning portal to special sessions/festival sales eCash, eCheque, eBank, eSociety, eGovernment, eEntertainment, eTravel, eGoods, and so on. Data/Site Replications and Caching techniques Compute Power Market Infowares (yahoo.com, AOL.com) ASPs (application service providers) . . .
  • 18.
    Sequential Architecture LimitationsSequential architectures reaching physical limitation (speed of light, thermodynamics) Hardware improvements like pipelining, Superscalar, etc., are non-scalable and requires sophisticated Compiler Technology. Vector Processing works well for certain kind of problems .
  • 19.
    Computational Power ImprovementNo. of Processors C.P.I. 1 2 . . . . Multiprocessor Uniprocessor
  • 20.
    Human Physical GrowthAnalogy : Computational Power Improvement Age Growth 5 10 15 20 25 30 35 40 45 . . . . Vertical Horizontal
  • 21.
    The Tech. ofPP is mature and can be exploited commercially; significant R & D work on development of tools & environment. Significant development in Networking technology is paving a way for heterogeneous computing. Why Parallel Processing NOW?
  • 22.
    History of ParallelProcessing PP can be traced to a tablet dated around 100 BC. Tablet has 3 calculating positions. Infer that multiple positions: Reliability/ Speed
  • 23.
    Aggregated speed with which complex calculations carried out by millions of neurons in human brain is amazing! although individual neurons response is slow (milli sec.) - demonstrate the feasibility of PP Motivating Factors
  • 24.
    Simple classification byFlynn: (No. of instruction and data streams) SISD - conventional SIMD - data parallel, vector computing MISD - systolic arrays MIMD - very general, multiple approaches. Current focus is on MIMD model, using general purpose processors or multicomputers. Taxonomy of Architectures
  • 25.
    Main HPC Architectures..1a SISD - mainframes, workstations, PCs. SIMD Shared Memory - Vector machines, Cray... MIMD Shared Memory - Sequent, KSR, Tera, SGI, SUN. SIMD Distributed Memory - DAP, TMC CM-2... MIMD Distributed Memory - Cray T3D, Intel, Transputers, TMC CM-5, plus recent workstation clusters (IBM SP2, DEC, Sun, HP).
  • 26.
    Motivation for usingClusters The communications bandwidth between workstations is increasing as new networking technologies and protocols are implemented in LANs and WANs. Workstation clusters are easier to integrate into existing networks than special parallel computers.
  • 27.
    Main HPC Architectures..1b.NOTE: Modern sequential machines are not purely SISD - advanced RISC processors use many concepts from vector and parallel architectures (pipelining, parallel execution of instructions, prefetching of data, etc) in order to achieve one or more arithmetic operations per clock cycle.
  • 28.
    Parallel Processing ParadoxTime required to develop a parallel application for solving GCA is equal to: Half Life of Parallel Supercomputers.
  • 29.
    The Need forAlternative Supercomputing Resources Vast numbers of under utilised workstations available to use. Huge numbers of unused processor cycles and resources that could be put to good use in a wide variety of applications areas. Reluctance to buy Supercomputer due to their cost and short life span. Distributed compute resources “fit” better into today's funding model.
  • 30.
  • 31.
  • 32.
    Design Space ofCompeting Computer Architecture
  • 33.
    Towards Inexpensive SupercomputingIt is: Cluster Computing.. The Commodity Supercomputing!
  • 34.
    Cluster Computing -Research Projects Beowulf (CalTech and NASA) - USA CCS (Computing Centre Software) - Paderborn, Germany Condor - Wisconsin State University, USA DQS (Distributed Queuing System) - Florida State University, US. EASY - Argonne National Lab, USA HPVM -(High Performance Virtual Machine),UIUC&now UCSB,US far - University of Liverpool, UK Gardens - Queensland University of Technology, Australia MOSIX - Hebrew University of Jerusalem, Israel MPI (MPI Forum, MPICH is one of the popular implementations) NOW (Network of Workstations) - Berkeley, USA NIMROD - Monash University, Australia NetSolve - University of Tennessee, USA PBS (Portable Batch System) - NASA Ames and LLNL, USA PVM - Oak Ridge National Lab./UTK/Emory, USA
  • 35.
    Cluster Computing -Commercial Software Codine (Computing in Distributed Network Environment) - GENIAS GmbH, Germany LoadLeveler - IBM Corp., USA LSF (Load Sharing Facility) - Platform Computing, Canada NQE (Network Queuing Environment) - Craysoft Corp., USA OpenFrame - Centre for Development of Advanced Computing, India RWPC (Real World Computing Partnership), Japan Unixware (SCO-Santa Cruz Operations,), USA Solaris-MC (Sun Microsystems), USA ClusterTools (A number for free HPC clusters tools from Sun) A number of commercial vendors worldwide are offering clustering solutions including IBM, Compaq, Microsoft, a number of startups like TurboLinux, HPTI, Scali, BlackStone…..)
  • 36.
    Motivation for usingClusters Surveys show utilisation of CPU cycles of desktop workstations is typically <10%. Performance of workstations and PCs is rapidly improving As performance grows, percent utilisation will decrease even further ! Organisations are reluctant to buy large supercomputers, due to the large expense and short useful life span.
  • 37.
    Motivation for usingClusters The development tools for workstations are more mature than the contrasting proprietary solutions for parallel computers - mainly due to the non-standard nature of many parallel systems. Workstation clusters are a cheap and readily available alternative to specialised High Performance Computing (HPC) platforms. Use of clusters of workstations as a distributed compute resource is very cost effective - incremental growth of system!!!
  • 38.
    Cycle Stealing Usually a workstation will be owned by an individual , group, department, or organisation - they are dedicated to the exclusive use by the owners . This brings problems when attempting to form a cluster of workstations for running distributed applications.
  • 39.
    Cycle Stealing Typically,there are three types of owners, who use their workstations mostly for: 1 . Sending and receiving email and preparing documents. 2. Software development - edit, compile, debug and test cycle. 3. Running compute-intensive applications.
  • 40.
Cycle Stealing: Cluster computing aims to steal spare cycles from (1) and (2) to provide resources for (3). However, this requires overcoming the ownership hurdle: people are very protective of their workstations, so it usually requires an organisational mandate that computers are to be used in this way. Stealing cycles outside standard work hours (e.g., overnight) is easy; stealing idle cycles during work hours without impacting interactive use (both CPU and memory) is much harder.
Rise & Fall of Computing Technologies: mainframes gave way to minis (circa 1970), minis to PCs (circa 1980), and PCs to network computing (circa 1995).
1984 Computer Food Chain: Mainframe, Vector Supercomputer, Mini Computer, Workstation, PC.
1994 Computer Food Chain: Mainframe, Vector Supercomputer, MPP, Workstation, PC - with the Mini Computer hitting the wall soon, and a bleak future ahead for it.
Computer Food Chain (Now and Future)
What is a cluster? A cluster is a type of parallel or distributed processing system which consists of a collection of interconnected stand-alone/complete computers cooperatively working together as a single, integrated computing resource. A typical cluster has: a network with a faster, closer connection than a typical LAN; low-latency communication protocols; and a looser coupling than an SMP.
Why Clusters Now? (Beyond Technology and Cost) The building block is big enough: complete computers (HW & SW) shipped in millions - killer micros, killer RAM, killer disks, killer OSes, killer networks, killer apps. Workstation performance is doubling every 18 months. Networks are faster: higher link bandwidth (vs. 10 Mbit Ethernet), switch-based networks coming (ATM), simple & fast interfaces (Active Messages), striped files preferred (RAID). Demise of mainframes, supercomputers, & MPPs.
Architectural Drivers… (cont.) Node architecture dominates performance: processor, cache, bus, and memory design and engineering turn dollars into performance. The greatest demand for performance is on large systems, which must track the leading edge of technology without lag. MPP network technology is becoming mainstream in system area networks. A complete system on every node is a powerful enabler: very high-speed I/O, virtual memory, scheduling, …
...Architectural Drivers: Clusters can be grown - incremental scalability (up, down, and across). Individual node performance can be improved by adding resources (new memory blocks/disks); nodes can be added or removed; clusters of clusters and metacomputing are possible. Complete software tools: threads, PVM, MPI, DSM, C, C++, Java, parallel C++, compilers, debuggers, OSes, etc. Wide class of applications: sequential and grand challenge parallel applications.
Clustering of Computers for Collective Computing: Trends (1960, 1990, 1995+, 2000?)
Example Clusters: Berkeley NOW. 100 Sun UltraSparcs and 200 disks; Myrinet SAN (160 MB/s); fast communication (AM, MPI, ...); Ether/ATM switched external net; global OS; self-configuration.
Basic Components: a Sun Ultra 170 node (processor and memory) with a Myricom NIC on the I/O bus, connected to Myrinet at 160 MB/s.
Massive Cheap Storage Cluster. Basic unit: 2 PCs double-ending four SCSI chains of 8 disks each. Currently serving Fine Art at http://www.thinker.org/imagebase/
Cluster of SMPs (CLUMPS): four Sun E5000s, each with 8 processors and 4 Myricom NICs. Multiprocessor, multi-NIC, multi-protocol. NPACI => Sun 450s.
Millennium PC Clumps: inexpensive, easy-to-manage clusters replicated in many departments; a prototype for a very large PC cluster.
So What's So Different? Commodity parts? Communications packaging? Incremental scalability? Independent failure? Intelligent network interfaces? A complete system on every node: virtual memory, scheduler, files, ...
    OPPORTUNITIES & CHALLENGES
Opportunity of Large-scale Computing on NOW: a shared pool of computing resources - processors, memory, disks - over an interconnect. Guarantee at least one workstation to many individuals (when active); deliver a large percentage of the collective resources to a few individuals at any one time.
Windows of Opportunity: MPP/DSM - compute across multiple systems in parallel. Network RAM - page across the idle memory of other nodes. Software RAID - a file system supporting parallel I/O, reliability, and mass storage. Multi-path communication - communicate across multiple networks: Ethernet, ATM, Myrinet.
Parallel Processing: Scalable parallel applications require good floating-point performance, low-overhead communication, scalable network bandwidth, and a parallel file system.
Network RAM: The performance gap between processor and disk has widened, and thrashing to disk degrades performance significantly. Paging across the network can be effective with high-performance networks and an OS that recognises idle machines; typically, thrashing to network RAM is 5 to 10 times faster than thrashing to disk.
Software RAID: Redundant Array of Workstation Disks. The I/O bottleneck: microprocessor performance is improving by more than 50% per year, while disk access improvement is under 10%, and applications often perform I/O. RAID cost per byte is high compared to single disks, and RAIDs are connected to host computers, which are often a performance and availability bottleneck. RAID in software - writing data across an array of workstation disks - provides performance, and some degree of redundancy provides availability. (A small sketch of the striping idea follows.)
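To make the striping idea concrete, here is a minimal sketch - my own illustration, not code from the tutorial - of how a RAID-0-style software layer might map a logical block onto an array of workstation disks; NUM_DISKS and the round-robin policy are assumptions:

#include <stdio.h>

#define NUM_DISKS 8   /* assumed size of the workstation-disk array */

/* Round-robin (RAID-0 style) striping: logical block b lives on
 * disk (b mod NUM_DISKS) at local offset (b / NUM_DISKS). A real
 * software RAID layer would send the read/write to the workstation
 * owning that disk, and add parity or mirroring for redundancy. */
typedef struct {
    int  disk;    /* which workstation disk holds the block */
    long offset;  /* block offset within that disk */
} stripe_loc;

static stripe_loc map_block(long logical_block)
{
    stripe_loc loc;
    loc.disk   = (int)(logical_block % NUM_DISKS);
    loc.offset = logical_block / NUM_DISKS;
    return loc;
}

int main(void)
{
    long b;
    for (b = 0; b < 10; b++) {
        stripe_loc loc = map_block(b);
        printf("logical block %ld -> disk %d, offset %ld\n",
               b, loc.disk, loc.offset);
    }
    return 0;
}

Consecutive logical blocks land on different disks, so large sequential transfers are served by many workstation disks in parallel - the source of the performance benefit claimed above.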
Software RAID, Parallel File Systems, and Parallel I/O
Cluster Computer and its Components
Clustering Today: Clustering gained momentum when three technologies converged: 1. Very high-performance microprocessors - workstation performance now equals yesterday's supercomputers. 2. High-speed communication - communication between cluster nodes now matches or exceeds that between processors in an SMP. 3. Standard tools for parallel/distributed computing and their growing popularity.
Cluster Components... 1a Nodes: multiple high-performance components - PCs, workstations, SMPs (CLUMPS), and distributed HPC systems leading to metacomputing. They can be based on different architectures and run different OSes.
Cluster Components... 1b Processors: there are many (CISC/RISC/VLIW/Vector...) - Intel Pentiums, Xeon, Merced…; Sun SPARC, UltraSPARC; HP PA; IBM RS6000/PowerPC; SGI MIPS; Digital Alphas. A trend is to integrate memory, processing, and networking into a single chip: IRAM (CPU & memory, http://iram.cs.berkeley.edu), Alpha 21364 (CPU, memory controller, NI).
Cluster Components… 2 OS: state-of-the-art OSes - Linux (Beowulf), Microsoft NT (Illinois HPVM), Sun Solaris (Berkeley NOW), IBM AIX (IBM SP2), HP UX (Illinois PANDA), Mach, a microkernel-based OS (CMU); cluster operating systems (Solaris MC, SCO Unixware, MOSIX - an academic project); OS gluing layers (Berkeley Glunix).
Cluster Components… 3 High-Performance Networks: Ethernet (10 Mbps), Fast Ethernet (100 Mbps), Gigabit Ethernet (1 Gbps), SCI (Dolphin - MPI - 12 microsecond latency), ATM, Myrinet (1.2 Gbps), Digital Memory Channel, FDDI.
Cluster Components… 4 Network Interfaces: the network interface card, e.g., Myrinet's NIC with user-level access support. The Alpha 21364 processor integrates processing, a memory controller, and a network interface into a single chip.
Cluster Components… 5 Communication Software: traditional OS-supported facilities (heavyweight due to protocol processing) - sockets (TCP/IP), pipes, etc.; lightweight user-level protocols - Active Messages (Berkeley), Fast Messages (Illinois), U-Net (Cornell), XTP (Virginia). Complete systems can be built on top of these protocols.
Cluster Components… 6a Cluster Middleware: resides between the OS and applications and offers an infrastructure for supporting Single System Image (SSI) and System Availability (SA). SSI makes the collection appear as a single machine (a globalised view of system resources), e.g., telnet cluster.myinstitute.edu; SA covers checkpointing and process migration.
Cluster Components… 6b Middleware Components: hardware - DEC Memory Channel, DSM (Alewife, DASH), SMP techniques; OS / gluing layers - Solaris MC, Unixware, Glunix; applications and subsystems - system management and electronic forms, runtime systems (software DSM, PFS, etc.), and resource management and scheduling (RMS): CODINE, LSF, PBS, NQS, etc.
Cluster Components… 7a Programming Environments: threads (on PCs, SMPs, NOWs...) - POSIX Threads, Java Threads; MPI - on Linux, NT, and many supercomputers; PVM; software DSMs (Shmem). (A minimal threads example follows.)
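As a concrete taste of the POSIX Threads environment listed above, the following minimal sketch - my illustration, not code from the tutorial - spawns and joins a few worker threads on an SMP or cluster node:

#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 4

/* Each worker just reports its id; a real application would hand
 * each thread a partition of the data to process. */
void *worker(void *arg)
{
    long id = (long)arg;
    printf("worker %ld running\n", id);
    return NULL;
}

int main(void)
{
    pthread_t threads[NUM_THREADS];
    long t;

    for (t = 0; t < NUM_THREADS; t++)
        pthread_create(&threads[t], NULL, worker, (void *)t);
    for (t = 0; t < NUM_THREADS; t++)
        pthread_join(threads[t], NULL);   /* wait for all workers */
    return 0;
}

Compile with something like cc prog.c -lpthread; on an SMP node the threads run in parallel on different processors.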
Cluster Components… 7b Development Tools: compilers - C/C++/Java (parallel programming with C++: MIT Press book); RAD (rapid application development) tools - GUI-based tools for parallel-programming modeling; debuggers; performance analysis tools; visualization tools.
Cluster Components… 8 Applications: sequential; parallel/distributed (cluster-aware) applications; Grand Challenge applications - weather forecasting, quantum chemistry, molecular biology modeling, engineering analysis (CAD/CAM), ...; PDBs, web servers, data mining.
Key Operational Benefits of Clustering: System availability (HA) - clusters offer inherent high availability through the redundancy of hardware, operating systems, and applications. Hardware fault tolerance - redundancy for most system components (e.g., disk RAID), in both hardware and software. OS and application reliability - running multiple copies of the OS and applications means this redundancy lets the system survive failures. Scalability - add servers to the cluster, add more clusters to the network as the need arises, or add CPUs to an SMP. High performance - running cluster-enabled programs.
Clusters Classification.. 1 Based on Focus (in the Market): High Performance (HP) clusters - Grand Challenge applications; High Availability (HA) clusters - mission-critical applications.
HA Cluster: Server Cluster with "Heartbeat" Connection
Clusters Classification.. 2 Based on Workstation/PC Ownership: dedicated clusters; non-dedicated clusters - adaptive parallel computing, also called communal multiprocessing.
Clusters Classification.. 3 Based on Node Architecture: Clusters of PCs (CoPs), Clusters of Workstations (COWs), Clusters of SMPs (CLUMPs).
Building Scalable Systems: Cluster of SMPs (CLUMPS). Performance of SMP systems vs. four-processor servers in a cluster.
Clusters Classification.. 4 Based on Node OS Type: Linux clusters (Beowulf), Solaris clusters (Berkeley NOW), NT clusters (HPVM), AIX clusters (IBM SP2), SCO/Compaq clusters (Unixware), Digital VMS clusters, HP clusters, ...
Clusters Classification.. 5 Based on node component architecture & configuration (processor architecture; node type: PC/workstation; OS: Linux/NT): homogeneous clusters - all nodes have a similar configuration; heterogeneous clusters - nodes based on different processors and running different OSes.
Clusters Classification.. 6a Dimensions of Scalability & Levels of Clustering: (1) platform - uniprocessor, SMP, cluster, MPP; (2) resources - CPU / I/O / memory / OS; (3) level - workgroup, department, campus, enterprise, public metacomputing (GRID) - over the available network technology.
Clusters Classification.. 6b Levels of Clustering: group clusters (#nodes: 2-99) - a set of dedicated/non-dedicated computers, mainly connected by a SAN like Myrinet; departmental clusters (#nodes: 99-999); organizational clusters (#nodes: many 100s), e.g., using ATM networks; Internet-wide (global) clusters (#nodes: 1000s to many millions) - metacomputing, web-based computing, agent-based computing. Java plays a major role in web- and agent-based computing.
Major Issues in Cluster Design: size scalability (physical & application); enhanced availability (failure management); single system image (the look-and-feel of one system); fast communication (networks & protocols); load balancing (CPU, network, memory, disk); security and encryption (clusters of clusters); distributed environment (social issues); manageability (administration and control); programmability (a simple API if required); applicability (cluster-aware and non-aware applications).
    Cluster Middleware and Single System Image
A typical Cluster Computing Environment: Application on top of PVM / MPI / RSH on top of Hardware/OS - with a missing layer (???) in between.
CC should support: multi-user, time-sharing environments; nodes with different CPU speeds and memory sizes (heterogeneous configurations); many processes with unpredictable requirements. Unlike an SMP, there are insufficient "bonds" between nodes: each computer operates independently, leading to inefficient utilization of resources.
The missing link is provided by cluster middleware/underware: Application on top of PVM / MPI / RSH on top of the middleware (or underware) layer on top of Hardware/OS.
SSI Clusters -- SMP services on a CC: adaptive resource usage for better performance; ease of use - almost like an SMP; scalable configurations through decentralized control. The result: HPC/HAC at PC/workstation prices - "pool together" the cluster-wide resources.
What is Cluster Middleware? An interface between user applications and the cluster hardware and OS platform. Middleware packages support each other at the management, programming, and implementation levels. Middleware layers: the SSI layer, and the availability layer, which enables cluster services such as checkpointing, automatic failover, recovery from failure, and fault-tolerant operation among all cluster nodes.
Middleware Design Goals: complete transparency (manageability) - let the user see a single cluster system (single entry point; ftp, telnet, software loading...); scalable performance - easy growth of the cluster with no change of API and automatic load distribution; enhanced availability - automatic recovery from failures, employing checkpointing and fault-tolerance technologies, and handling consistency of data when replicated.
What is Single System Image (SSI)? A single system image is the illusion, created by software or hardware, that presents a collection of resources as one, more powerful resource. SSI makes the cluster appear like a single machine to the user, to applications, and to the network. A cluster without SSI is not a cluster.
Benefits of Single System Image: transparent usage of system resources; transparent process migration and load balancing across nodes; improved reliability and higher availability; improved system response time and performance; simplified system management; reduced risk of operator errors; users need not be aware of the underlying system architecture to use the machines effectively.
Desired SSI Services: single entry point (telnet cluster.my_institute.edu rather than telnet node1.cluster.institute.edu); single file hierarchy (xFS, AFS, Solaris MC Proxy); single control point (management from a single GUI); single virtual networking; single memory space (network RAM / DSM); single job management (Glunix, Codine, LSF); single user interface (like a workstation/PC windowing environment - CDE in Solaris/NT - and perhaps using Web technology).
Availability Support Functions: Single I/O Space (SIOS) - any node can access any peripheral or disk device without knowledge of its physical location. Single Process Space (SPS) - any process on any node can create processes with cluster-wide process ids, and they communicate through signals, pipes, etc., as if they were on a single node. Checkpointing and process migration - save process state and intermediate results in memory or to disk to support rollback recovery when a node fails; process migration also supports load balancing.
Scalability vs. Single System Image
SSI Levels / How do we implement SSI? It is the computer-science notion of levels of abstraction (a house is at a higher level of abstraction than walls, ceilings, and floors): hardware level; operating system kernel level; application and subsystem level.
SSI at Application and Subsystem Level ((c) In Search of Clusters):
Level: application - Examples: cluster batch system, system management - Boundary: an application - Importance: what a user wants
Level: subsystem - Examples: distributed DB, OSF DME, Lotus Notes, MPI, PVM - Boundary: a subsystem - Importance: SSI for all applications of the subsystem
Level: file system - Examples: Sun NFS, OSF DFS, NetWare, and so on - Boundary: shared portion of the file system - Importance: implicitly supports many applications and subsystems
Level: toolkit - Examples: OSF DCE, Sun ONC+, Apollo Domain - Boundary: explicit toolkit facilities: user, service name, time - Importance: best level of support for heterogeneous systems
SSI at Operating System Kernel Level ((c) In Search of Clusters):
Level: kernel/OS layer - Examples: Solaris MC, Unixware, MOSIX, Sprite, Amoeba / GLUnix - Boundary: each name space: files, processes, pipes, devices, etc. - Importance: kernel support for applications and administrative subsystems
Level: kernel interfaces - Examples: UNIX (Sun) vnode, Locus (IBM) vproc - Boundary: type of kernel objects: files, processes, etc. - Importance: modularizes SSI code within the kernel
Level: virtual memory - Examples: none supporting operating system kernel - Boundary: each distributed virtual memory space - Importance: may simplify implementation of kernel objects
Level: microkernel - Examples: Mach, PARAS, Chorus, OSF/1 AD, Amoeba - Boundary: each service outside the microkernel - Importance: implicit SSI for all system services
SSI at Hardware Level ((c) In Search of Clusters):
Level: memory - Examples: SCI, DASH - Boundary: memory space - Importance: better communication and synchronization
Level: memory and I/O - Examples: SCI, SMP techniques - Boundary: memory and I/O device space - Importance: lower-overhead cluster I/O
SSI Characteristics: 1. Every SSI has a boundary. 2. Single-system support can exist at different levels within a system, one able to be built on another.
SSI Boundaries -- an application's SSI boundary, e.g., a batch system ((c) In Search of Clusters).
SSI via the OS path! 1. Build as a layer on top of the existing OS. Benefits: makes the system quickly portable, tracks vendor software upgrades, and reduces development time - new systems can be built quickly by mapping new services onto the functionality provided by the layer beneath (e.g., Glunix). 2. Build SSI at the kernel level - a true cluster OS. Good, but it can't leverage OS improvements by the vendor (e.g., Unixware, Solaris-MC, and MOSIX).
SSI Representative Systems: OS-level SSI - SCO NSC UnixWare, Solaris-MC, MOSIX, ...; middleware-level SSI - PVM, TreadMarks (DSM), Glunix, Condor, Codine, Nimrod, ...; application-level SSI - PARMON, Parallel Oracle, ...
SCO NonStop® Cluster for UnixWare: each UP or SMP node runs standard SCO UnixWare® with clustering hooks plus modular kernel extensions beneath the standard OS kernel calls; users, applications, and systems management sit on top, devices below, and nodes are connected by ServerNet™. http://www.sco.com/products/clustering/
How does NonStop Clusters Work? Modular extensions and hooks provide: a single cluster-wide filesystem view; transparent cluster-wide device access; transparent swap-space sharing; transparent cluster-wide IPC; high-performance internode communications; transparent cluster-wide processes, migration, etc.; node-down cleanup and resource failover; transparent cluster-wide parallel TCP/IP networking; application availability; cluster-wide membership and cluster time sync; cluster system administration; load leveling.
Solaris-MC: Solaris for MultiComputers - a global file system, globalized process management, and globalized networking and I/O. http://www.sun.com/research/solaris-mc/
Solaris MC components: object and communication support; high-availability support; the PXFS global distributed file system; process management; networking.
Multicomputer OS for UNIX (MOSIX): an OS module (layer) that provides applications with the illusion of working on a single system. Remote operations are performed like local operations; it is transparent to the application - the user interface is unchanged. (The stack: Application on top of PVM / MPI / RSH on top of MOSIX on top of Hardware/OS.) http://www.mosix.cs.huji.ac.il/
Main tool: preemptive process migration that can migrate any process, anywhere, anytime - supervised by distributed algorithms that respond online to global resource availability, transparently. Load balancing: migrate processes from over-loaded to under-loaded nodes. Memory ushering: migrate processes from a node that has exhausted its memory, to prevent paging/swapping.
MOSIX for Linux at HUJI: a scalable cluster configuration - 50 Pentium-II 300 MHz, 38 Pentium-Pro 200 MHz (some are SMPs), and 16 Pentium-II 400 MHz (some are SMPs) nodes, with over 12 GB of cluster-wide RAM, connected by a 2.56 Gb/s Myrinet LAN. Runs Red Hat 6.0, based on kernel 2.2.7. Upgrades: hardware with Intel, software with Linux. Download MOSIX: http://www.mosix.cs.huji.ac.il/
NOW @ Berkeley: design & implementation of a higher-level system - global OS (Glunix), parallel file systems (xFS), fast communication (hardware for Active Messages), and application support - while overcoming technology shortcomings such as fault tolerance and system management. The NOW goal: faster for parallel AND sequential workloads. http://now.cs.berkeley.edu/
NOW Software Components: each Unix (Solaris) workstation runs AM, the L.C.P., and a VN segment driver, tied together by Global Layer Unix over a Myrinet scalable interconnect; above this run large sequential apps and parallel apps using Sockets, Split-C, MPI, HPF, and vSM over Active Messages, with a name server and scheduler.
3 Paths for Applications on NOW: Revolutionary (MPP style) - write new programs from scratch using MPP languages, compilers, libraries, …; Porting - port programs from mainframes, supercomputers, MPPs, …; Evolutionary - take a sequential program and use 1) network RAM: use the memory of many computers to reduce disk accesses; if not fast enough, then 2) parallel I/O: use many disks in parallel for accesses not in the file cache; if not fast enough, then 3) parallel programming: change the program until it sees enough processors to be fast => a large speedup without a fine-grain parallel program.
Comparison of 4 Cluster Systems
Cluster Programming Environments: shared-memory based - DSM, Threads/OpenMP (enabled for clusters), Java threads (HKU JESSICA, IBM cJVM); message-passing based - PVM, MPI; parametric computations - Nimrod/Clustor; automatic parallelising compilers; parallel libraries & computational kernels (NetSolve).
Levels of Parallelism (code granularity / code item / parallelised by): very fine grain (multiple issue) - instruction level - hardware (CPU); fine grain (data level) - loop, e.g., a(i) = ..., b(i) = ... - compiler; medium grain (control level) - function/thread, e.g., func1(), func2(), func3() - threads; large grain (task level) - program: task i-1, task i, task i+1 - PVM/MPI. (A short illustration of the loop level follows.)
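As an illustration of the fine-grain (data/loop) level in the list above - using the OpenMP directives mentioned among the programming environments - a compiler/runtime can split independent loop iterations across threads. This sketch is mine, not from the tutorial:

#include <stdio.h>

#define N 1000000

/* Fine-grain (loop/data-level) parallelism: with an OpenMP-aware
 * compiler (e.g., cc -fopenmp), the pragma splits the independent
 * iterations across threads; without OpenMP the pragma is simply
 * ignored and the loop runs sequentially. */
int main(void)
{
    static double a[N], b[N];
    int i;

    #pragma omp parallel for
    for (i = 0; i < N; i++)
        a[i] = 2.0 * i;            /* every iteration is independent */

    #pragma omp parallel for
    for (i = 0; i < N; i++)
        b[i] = a[i] + 1.0;

    printf("a[10] = %f, b[10] = %f\n", a[10], b[10]);
    return 0;
}

The coarser levels work the same way in spirit: at the task level, whole programs (rather than loop iterations) are distributed across nodes with PVM or MPI, as in the MPI sample further below.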
MPI (Message Passing Interface): a standard message-passing interface. MPI 1.0 - May 1994 (started in 1992). C and Fortran bindings (now Java too). Portability: once coded, it can run on virtually all HPC platforms, including clusters! Performance: exploits native hardware features. Functionality: over 115 functions in MPI 1.0 - environment management, point-to-point & collective communications, process groups, communicators, derived data types, and virtual topology routines. Availability: a variety of implementations, both vendor and public domain. http://www.mpi-forum.org/
A Sample MPI Program...

#include <stdio.h>
#include <string.h>
#include "mpi.h"

main(int argc, char *argv[])
{
    int my_rank;        /* process rank */
    int p;              /* no. of processes */
    int source;         /* rank of sender */
    int dest;           /* rank of receiver */
    int tag = 0;        /* message tag, like an "email subject" */
    char message[100];  /* buffer */
    MPI_Status status;  /* function return status */

    /* Start up MPI */
    MPI_Init(&argc, &argv);

    /* Find our process rank/id */
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

    /* Find out how many processes/tasks are part of this run */
    MPI_Comm_size(MPI_COMM_WORLD, &p);
    …
A Sample MPI Program (continued)

    if (my_rank == 0) {   /* Master process */
        for (source = 1; source < p; source++) {
            MPI_Recv(message, 100, MPI_CHAR, source, tag,
                     MPI_COMM_WORLD, &status);
            printf("%s\n", message);
        }
    } else {              /* Worker process */
        sprintf(message, "Hello, I am your worker process %d!", my_rank);
        dest = 0;
        MPI_Send(message, strlen(message) + 1, MPI_CHAR, dest, tag,
                 MPI_COMM_WORLD);
    }

    /* Shut down the MPI environment */
    MPI_Finalize();
}
Execution
% cc -o hello hello.c -lmpi
% mpirun -p2 hello
Hello, I am your worker process 1!
% mpirun -p4 hello
Hello, I am your worker process 1!
Hello, I am your worker process 2!
Hello, I am your worker process 3!
% mpirun hello
(no output: there are no workers, so no greetings)
PARMON: A Cluster Monitoring Tool. A PARMON server (parmond) runs on each node; the PARMON client (parmon) runs on a JVM and talks to the servers over a high-speed switch. http://www.buyya.com/parmon/
Single I/O Space and Design Issues: globalised cluster storage. Reference: "Designing SSI Clusters with Hierarchical Checkpointing and Single I/O Space", IEEE Concurrency, March 1999, by K. Hwang, H. Jin et al.
Clusters with & without Single I/O Space: without a single I/O space, users must address each node's services individually; with a single I/O space, users see one I/O space spanning all services.
Benefits of Single I/O Space: eliminates the gap between accessing local disk(s) and remote disks; supports a persistent programming paradigm; allows striping on remote disks to accelerate parallel I/O operations; facilitates the implementation of distributed checkpointing and recovery schemes.
Single I/O Space Design Issues: an integrated I/O space; addressing and mapping mechanisms; data movement procedures.
Integrated I/O Space: one range of sequential addresses spanning the local disks LD1..LDn (the RADD space, with distributed blocks D11..Dnt), the shared RAIDs SD1..SDm (the NASD space, with blocks B11..Bmk), and the peripherals P1..Ph (the NAP space).
Addressing and Mapping: user-level middleware plus some modified OS system calls - user applications call a name agent, which uses a disk/RAID/NAP mapper; I/O agents for the RADD, NASD, and NAP spaces, together with a block mover, carry out the accesses. (A small mapping sketch follows.)
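As a hedged sketch of the addressing-and-mapping step - the region sizes and names below are invented for illustration, not the paper's actual scheme - a mapper can translate one sequential block address into a device space plus a local block number:

#include <stdio.h>

/* Illustrative single-I/O-space mapper: one sequential block-address
 * range is partitioned into the RADD (local disks), NASD (shared
 * RAIDs), and NAP (peripherals) regions. Region sizes are made-up
 * constants. */
enum space { RADD, NASD, NAP };

#define RADD_BLOCKS 100000L
#define NASD_BLOCKS 500000L

static void map_address(long addr, enum space *sp, long *local)
{
    if (addr < RADD_BLOCKS) {
        *sp = RADD;  *local = addr;
    } else if (addr < RADD_BLOCKS + NASD_BLOCKS) {
        *sp = NASD;  *local = addr - RADD_BLOCKS;
    } else {
        *sp = NAP;   *local = addr - RADD_BLOCKS - NASD_BLOCKS;
    }
}

int main(void)
{
    const char *names[] = { "RADD", "NASD", "NAP" };
    long addrs[] = { 42L, 250000L, 700000L };
    int i;

    for (i = 0; i < 3; i++) {
        enum space sp;
        long local;
        map_address(addrs[i], &sp, &local);
        printf("addr %ld -> %s space, local block %ld\n",
               addrs[i], names[sp], local);
    }
    return 0;
}

Whatever the concrete layout, the point is that applications see one flat address space while the middleware resolves each address to the right device space and node.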
Data Movement Procedures: to access block A, the user application on Node 1 asks its I/O agent, which contacts the I/O agent on Node 2; the block mover then transfers A between LD1 on Node 1 and LD2 (or SDi of the NASD) on Node 2 - the same procedure handles requests in either direction.
What Next?? Clusters of Clusters (HyperClusters), the Global Grid, an Interplanetary Grid, a Universal Grid??
Clusters of Clusters (HyperClusters): each cluster (Cluster 1, 2, 3) has its own scheduler, master daemon, and execution daemons, driven by submit and graphical-control clients; the clusters are interconnected over a LAN/WAN.
Towards Grid Computing…. (For illustration, resources are placed arbitrarily on the GUSTO test-bed!)
What is the Grid? An infrastructure that couples computers (PCs, workstations, clusters, traditional supercomputers, and even laptops, notebooks, mobile computers, PDAs, and so on), software (e.g., renting expensive special-purpose applications on demand), databases (e.g., transparent access to the human genome database), special instruments (e.g., a radio telescope - SETI@Home searching for life in the galaxy, Astrophysics@Swinburne for pulsars), and people (maybe even animals, who knows?) across local/wide-area networks (enterprises, organisations, or the Internet) and presents them as a unified, integrated (single) resource.
Conceptual view of the Grid: leading to portal (super)computing. http://www.sun.com/hpc/
Grid Application Drivers: old and new applications are getting enabled by the coupling of computers, databases, instruments, people, etc.: (distributed) supercomputing; collaborative engineering; high-throughput computing; large-scale simulation & parameter studies; remote software access / renting software; data-intensive computing; on-demand computing.
Grid Components: Grid Fabric - networked resources across organisations (computers, clusters, data sources, scientific instruments, storage systems) and local resource managers (operating systems, queuing systems, TCP/IP & UDP, libraries & app kernels); Grid Middleware - distributed resource-coupling services (communication, sign-on & security, information, QoS, process, data access); Grid Tools - development environments and tools (languages, libraries, debuggers, web tools, resource brokers, monitoring); Grid Applications and Portals - problem-solving environments, scientific and engineering collaboration, web-enabled apps.
Many GRID Projects and Initiatives. Public forums: Computing Portals, Grid Forum, European Grid Forum, IEEE TFCC!, GRID'2000, and more. Australia: Nimrod/G, EcoGrid and GRACE, DISCWorld. Europe: UNICORE, MOL, METODIS, Globe, Poznan Metacomputing, CERN Data Grid, MetaMPI, DAS, JaWS, and many more. Public grid initiatives: Distributed.net, SETI@home, Compute Power Grid. USA: Globus, Legion, JAVELIN, AppLeS, NASA IPG, Condor, Harness, NetSolve, NCSA Workbench, WebFlow, EveryWhere, and many more. Japan: Ninf, Bricks, and many more. http://www.gridcomputing.com/
Nimrod - A Job Management System. http://www.dgs.monash.edu.au/~davida/nimrod.html
Nimrod/G Architecture: Nimrod/G clients talk to the Nimrod engine, which works with a schedule advisor, trading manager, grid explorer, dispatcher, and persistent store over middleware services and Grid Information Services, targeting the GUSTO test bed. (RM: local resource manager; TS: trade server.)
Compute Power Market: a user application employs a resource broker (grid explorer, schedule advisor, trade manager, job control agent, deployment agent) that negotiates with trade servers in each resource domain (R1 … Rn) - covering trading, resource allocation and reservation, charging algorithms, and accounting - via a Grid Information Server and other services.
Pointers to Literature on Cluster Computing
Reading Resources.. 1a Internet & WWW: computer architecture - http://www.cs.wisc.edu/~arch/www/; PFS & parallel I/O - http://www.cs.dartmouth.edu/pario/; Linux parallel processing - http://yara.ecn.purdue.edu/~pplinux/Sites/; DSMs - http://www.cs.umd.edu/~keleher/dsm.html
Reading Resources.. 1b Internet & WWW: Solaris-MC - http://www.sunlabs.com/research/solaris-mc; microprocessors: recent advances - http://www.microprocessor.sscc.ru; Beowulf - http://www.beowulf.org; metacomputing - http://www.sis.port.ac.uk/~mab/Metacomputing/
Reading Resources.. 2 Books: In Search of Clusters by G. Pfister, Prentice Hall (2nd ed.), 1998; High Performance Cluster Computing (Volume 1: Architectures and Systems; Volume 2: Programming and Applications), edited by Rajkumar Buyya, Prentice Hall, NJ, USA; Scalable Parallel Computing by K. Hwang & Z. Xu, McGraw Hill, 1998.
Reading Resources.. 3 Journals: "A Case for NOW", IEEE Micro, Feb. '95, by Anderson, Culler, and Patterson; "Fault-Tolerant COW with SSI", IEEE Concurrency (to appear), by Kai Hwang, Chow, Wang, Jin, and Xu; "Cluster Computing: The Commodity Supercomputing", Journal of Software Practice and Experience (available from my web page), by Mark Baker & Rajkumar Buyya.
Cluster Computing Infoware: http://www.csse.monash.edu.au/~rajkumar/cluster/
Cluster Computing Forum: IEEE Task Force on Cluster Computing (TFCC) - http://www.ieeetfcc.org
TFCC Activities...: network technologies; OS technologies; parallel I/O; programming environments; Java technologies; algorithms and applications; analysis and profiling; storage technologies; high-throughput computing.
TFCC Activities...: high availability; single system image; performance evaluation; software engineering; education; newsletter; industrial wing; TFCC regional activities. All of the above have their own pages; see pointers from http://www.ieeetfcc.org
TFCC Activities...: mailing list, workshops, conferences, tutorials, web resources, etc.; resources for introducing the subject at senior undergraduate and graduate levels; tutorials/workshops at IEEE chapters; and so on. FREE MEMBERSHIP - please join! Visit the TFCC page for more details: http://www.ieeetfcc.org (updated daily!).
Summary: We have discussed clusters - enabling technologies, architecture & its components, classifications, middleware, single system image, and representative systems.
Conclusions: Clusters are promising. They solve the parallel processing paradox and offer incremental growth that matches today's funding patterns. New trends in hardware and software technologies are likely to make clusters even more promising, so that cluster-based supercomputers can be seen everywhere!
SISD: A Conventional Computer. Speed is limited by the rate at which the computer can transfer information internally. Examples: PC, Macintosh, workstations. (One processor, one instruction stream, one data input and one data output stream.)
The MISD Architecture: more of an intellectual exercise than a practical configuration - a few were built, but none are commercially available. (Processors A, B, C, each with its own instruction stream, operate on a single data input stream to produce the data output stream.)
SIMD Architecture: e.g., Cray vector processing machines, Thinking Machines CM*. One instruction stream drives processors A, B, C, each with its own data input and output streams, computing Ci <= Ai * Bi. (A scalar rendering of this operation follows.)
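The Ci <= Ai * Bi operation is exactly a data-parallel loop. Written out in scalar C - purely my illustration to make the SIMD semantics concrete - it is:

#include <stdio.h>

#define N 8

int main(void)
{
    double a[N], b[N], c[N];
    int i;

    for (i = 0; i < N; i++) { a[i] = i; b[i] = 2.0; }

    /* On a SIMD/vector machine a single multiply instruction is
     * applied to all elements in lockstep; this loop expresses the
     * same Ci <= Ai * Bi semantics one element at a time. */
    for (i = 0; i < N; i++)
        c[i] = a[i] * b[i];

    printf("c[3] = %f\n", c[3]);
    return 0;
}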
MIMD Architecture: unlike SISD and MISD, an MIMD computer works asynchronously - processors A, B, C each have their own instruction and data streams. Two variants: shared-memory (tightly coupled) MIMD and distributed-memory (loosely coupled) MIMD.
Shared Memory MIMD machines: communication - the source PE writes data to global memory and the destination retrieves it. Easy to build, and conventional SISD OSes can easily be ported. Limitations: reliability & expandability - a memory component or any processor failure affects the whole system, and increasing the number of processors leads to memory contention. Example: Silicon Graphics supercomputers. (Processors A, B, C share a global memory system over memory buses.)
Distributed Memory MIMD: communication via IPC on a high-speed network, which can be configured as a tree, mesh, cube, etc. Unlike shared-memory MIMD, it is easily/readily expandable and highly reliable (a CPU failure does not affect the whole system). (Each processor A, B, C has its own memory system, connected to the others by IPC channels.)