Module 1: Scalable Computing Over the Internet
1.1 Scalable Computing Over the Internet
Computing technology has evolved over 60 years, with changes in machine
architecture, operating systems, network connectivity, and application
workloads.
Parallel and distributed computing systems use multiple computers over the
Internet to solve large-scale, data-intensive, and network-centric problems,
enhancing quality of life and information services.
1.1.1 The Age of Internet Computing
Supercomputer sites and large data centers need to provide high-performance
computing services to many Internet users concurrently.
With the emergence of computing clouds, the Linpack Benchmark is no longer an
optimal measure of system performance; demand has shifted toward
high-throughput computing (HTC) systems built with parallel and distributed
computing technologies.
Data centers need upgrades with fast servers, storage, and high-bandwidth
networks to advance network-based computing and web services.
1.1.1.1 The Platform Evolution
Computer technology has progressed through five generations, each lasting 10
to 20 years, with overlaps.
This evolution includes mainframes (1950-1970), minicomputers (1960-1980),
personal computers with VLSI microprocessors (1970-1990), portable and
pervasive devices (1980-2000), and since 1990, the proliferation of HPC and
HTC systems in clusters, grids, and Internet clouds.
The general trend is to leverage shared web resources and massive data over the
Internet.
Supercomputers (MPPs) are being replaced by clusters for resource sharing in
HPC.
Peer-to-peer (P2P) networks are used for distributed file sharing and content
delivery in HTC.
P2P, cloud computing, and web service platforms are more focused on HTC
applications.
Clustering and P2P technologies have led to computational and data grids.
1.1.1.2 High-Performance Computing
HPC systems emphasize raw speed performance, increasing from Gflops in the
early 1990s to Pflops in 2010, driven by scientific, engineering, and
manufacturing demands.
The Top 500 computer systems are ranked by floating-point speed measured with
the Linpack Benchmark, yet supercomputer users represent less than 10% of all
computer users.
1.1.1.3 High-Throughput Computing
High-end computing is shifting from HPC to HTC, focusing on high-flux
computing for Internet searches and web services used by millions
simultaneously.
Performance in HTC is measured by throughput (tasks completed per unit time).
HTC addresses batch processing speed, cost, energy savings, security, and
reliability in data and enterprise computing centers.
1.1.1.4 Three New Computing Paradigms
New paradigms include Web 2.0 services (with SOA), Internet clouds (with
virtualization), and the Internet of Things (IoT) driven by RFID, GPS, and
sensor technologies.
The concept of "computer" has evolved from centralized systems to "the
network is the computer" (John Gage, 1984), "the data center is the computer"
(David Patterson, 2008), and "the cloud is the computer" (Rajkumar Buyya).
Differences among clusters, grids, P2P, and clouds may blur, with clouds
potentially processing huge datasets from traditional Internet, social networks,
and future IoT.
1.1.1.5 Computing Paradigm Distinctions
Centralized computing: All computer resources are in one physical system,
fully shared and tightly coupled within a single OS.
Parallel computing: Processors are either tightly coupled with shared memory
or loosely coupled with distributed memory, communicating via shared memory
or message passing.
Distributed computing: Multiple autonomous computers, each with private
memory, communicate through a network via message passing.
Cloud computing: An Internet cloud can be centralized or distributed, applying
parallel or distributed computing, built with physical or virtualized resources
over large data centers. It's also seen as utility or service computing.
Concurrent computing/programming: Union of parallel and distributed
computing.
Ubiquitous computing: Computing with pervasive devices anywhere, anytime
using wired/wireless communication.
Internet of Things (IoT): Networked connection of everyday objects
(computers, sensors, humans, etc.), supported by Internet clouds for ubiquitous
computing.
Internet computing: Broader term covering all computing paradigms over the
Internet.
1.1.1.6 Distributed System Families
P2P networks and networks of clusters have consolidated into computational or
data grids.
Internet clouds result from moving desktop computing to service-oriented
computing using server clusters and large databases at data centers.
Grids and clouds are disparate systems that emphasize resource sharing.
Massively distributed systems exploit high degrees of parallelism or
concurrency.
Future HPC and HTC systems will require multicore/many-core processors and
emphasize parallelism and distributed computing, aiming for high throughput,
efficiency, scalability, and reliability.
Efficiency: Measures resource utilization in HPC (massive parallelism) and
HTC (job throughput, data access, storage, power efficiency).
Dependability: Reliability and self-management from chip to application
levels, ensuring high-throughput service with QoS even under failure.
Adaptation: Ability to support billions of job requests over massive datasets
and virtualized cloud resources.
Flexibility: Ability for distributed systems to perform well in both HPC
(science/engineering) and HTC (business) applications.
1.1.2 Scalable Computing Trends and New Paradigms
Technological trends like Moore's Law (processor speed doubling every 18
months) and Gilder's Law (network bandwidth doubling annually) drive
computing applications.
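As a rough illustration of these two growth rates, the sketch below compounds
the doubling periods quoted above over a decade (the periods are
approximations, and the script is illustrative only):

```python
# Rough illustration of Moore's law vs. Gilder's law growth rates.
# Assumes the doubling periods quoted above: 18 months for processor
# speed, 12 months for network bandwidth (approximations, not exact).

def growth_factor(years: float, doubling_period_years: float) -> float:
    """Multiplicative growth after `years`, doubling every `doubling_period_years`."""
    return 2 ** (years / doubling_period_years)

decade = 10
cpu_growth = growth_factor(decade, 1.5)   # ~2^6.67, about 100x
net_growth = growth_factor(decade, 1.0)   # 2^10 = 1024x

print(f"Processor speed grows ~{cpu_growth:.0f}x in {decade} years")
print(f"Network bandwidth grows ~{net_growth:.0f}x in {decade} years")
```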
The price/performance ratio of commodity hardware in consumer markets
drives its adoption in large-scale computing.
Distributed systems emphasize resource distribution and a high degree of
parallelism (DoP).
1.1.2.1 Degrees of Parallelism
Bit-Level Parallelism (BLP): Conversion from bit-serial to word-level
processing (e.g., 4-bit to 64-bit CPUs).
Instruction-Level Parallelism (ILP): Processor executes multiple instructions
simultaneously (pipelining, superscalar, VLIW, multithreading). Requires
branch prediction, dynamic scheduling, speculation, and compiler support.
Data-Level Parallelism (DLP): Popularized by SIMD and vector machines
using vector/array instructions. Requires hardware and compiler assistance.
Task-Level Parallelism (TLP): Explored with multicore processors and chip
multiprocessors (CMPs), though programming and compilation for efficient
execution remain challenging.
Job-Level Parallelism (JLP): Increase in computing granularity as systems
move from parallel to distributed processing.
Coarse-grain parallelism is built on top of fine-grain parallelism.
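A minimal sketch of coarse-grain (task/job-level) parallelism, in which
independent tasks are distributed across processor cores; the square worker
and the task list are hypothetical placeholders for real jobs:

```python
# Minimal sketch of task-level parallelism: independent tasks run on
# separate worker processes. The `square` worker and the task list are
# hypothetical placeholders for real coarse-grain jobs.
from multiprocessing import Pool

def square(x: int) -> int:
    return x * x

if __name__ == "__main__":
    tasks = list(range(8))
    with Pool(processes=4) as pool:          # four worker processes
        results = pool.map(square, tasks)    # tasks execute concurrently
    print(results)                           # [0, 1, 4, 9, 16, 25, 36, 49]
```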
1.1.2.2 Innovative Applications
Both HPC and HTC systems aim for transparency in data access, resource
allocation, process location, concurrency, job replication, and failure recovery.
Applications span science, engineering, business, education, healthcare, Internet
services, military, and government, demanding computing economics, web-
scale data collection, system reliability, and scalable performance.
Distributed transaction processing in banking requires maintaining consistency
of replicated records, facing challenges like software support, network
saturation, and security threats.
1.1.2.3 The Trend toward Utility Computing
Major computing paradigms (Web services, data centers, utility computing,
service computing, grid computing, P2P computing, cloud computing) are
ubiquitous, aim for autonomic operations, and are composable with QoS and
SLAs.
Utility computing is a business model where customers receive computing
resources from a paid service provider (e.g., grid/cloud platforms). Cloud
computing is a broader concept than utility computing.
Technological challenges include network-efficient processors, scalable
memory/storage, distributed OSes, virtualization middleware, new
programming models, and effective resource management.
1.1.2.4 The Hype Cycle of New Technologies
New technologies go through a "hype cycle" with five stages: innovation
trigger, peak of inflated expectations, trough of disillusionment, slope of
enlightenment, and plateau of productivity.
Mainstream adoption timelines are categorized: less than 2 years (hollow
circles), 2 to 5 years (gray circles), 5 to 10 years (solid circles), and more than
10 years (triangles). Crossed circles denote technologies that become obsolete.
As of August 2010, cloud technology had just passed the peak of inflated
expectations and was predicted to reach the plateau of productivity within 2
to 5 years.
1.1.3 The Internet of Things and Cyber-Physical Systems
1.1.3.1 The Internet of Things
The IoT, introduced in 1999, refers to the networked interconnection of
everyday objects and can be viewed as a wireless network of sensors.
With IPv6, there are enough addresses to distinguish all objects on Earth. The
IoT needs to track trillions of static or moving objects simultaneously, requiring
universal addressability.
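To give a sense of the address space behind this claim: IPv6 addresses are 128
bits wide, so the space holds 2^128 addresses (the object count used for
comparison below is purely illustrative):

```python
# IPv6 uses 128-bit addresses, so the address space contains 2**128 values.
ipv6_addresses = 2 ** 128
print(f"{ipv6_addresses:.3e} addresses")   # ≈ 3.403e+38

# Even a hypothetical trillion tracked objects per person on a
# 10-billion-person planet uses only a vanishing fraction of this space.
objects = 10 * 10**9 * 10**12
print(f"fraction used: {objects / ipv6_addresses:.2e}")
```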
Communication patterns include Human-to-Human (H2H), Human-to-Thing
(H2T), and Thing-to-Thing (T2T).
The goal is to connect things (humans and machines) intelligently at any time
and any place with low cost.
Cloud computing is expected to support fast, efficient, and intelligent
interactions among humans, machines, and objects for a "smart Earth."
1.1.3.2 Cyber-Physical Systems
A Cyber-Physical System (CPS) integrates computational processes with the
physical world, merging computation, communication, and control (3C
technologies) into an intelligent closed-feedback system.
While IoT emphasizes networking connections, CPS emphasizes exploration of
virtual reality (VR) applications in the physical world.
1.2 Technologies for Network-Based Systems
Hardware, software, and network technologies are crucial for distributed
computing system design, especially for building distributed operating systems
to handle massive parallelism.
1.2.1 Multicore CPUs and Multithreading Technologies
Processor speed, measured in MIPS, and network bandwidth, measured in Mbps or
Gbps, are key to HPC and HTC.
1.2.3 Memory, Storage, and Wide-Area Networking
Memory Technology: DRAM chip capacity has quadrupled every three years
(e.g., 16 KB in 1976 to 64 GB in 2011).
Wide-Area Networking: Ethernet bandwidth has grown rapidly, from 10 Mbps in
1979 to 1 Gbps in 1999 and 40-100 Gigabit Ethernet (GE) in 2011, with 1 Tbps
links speculated by 2013. Network performance doubles roughly every year,
faster than Moore's law for CPU speed.
High-bandwidth networking enables massively distributed systems.
InfiniBand and Ethernet are predicted to be major interconnects in HPC.
1.3 System Models for Distributed and Cloud Computing
Distributed and cloud computing systems are built from autonomous computer
nodes interconnected hierarchically by SANs, LANs, or WANs.
Massive systems with millions of computers can achieve web-scale
connectivity.
1.3.1.2 Single-System Image
An ideal cluster merges multiple system images into a Single-System Image
(SSI), making the cluster appear as one integrated resource to the user.
Cluster operating systems or middleware aim to support SSI by sharing CPUs,
memory, and I/O across nodes.
1.3.2.1 Computational Grids
A computing grid offers an infrastructure coupling computers, software,
instruments, people, and sensors, often constructed over LAN, WAN, or
Internet backbone networks regionally, nationally, or globally.
Grids are presented as integrated computing resources or virtual platforms for
virtual organizations, using workstations, servers, clusters, and supercomputers.
They integrate computing, communication, content, and transactions as rented
services.
1.3.2.2 Grid Families
Grid systems are classified into computational/data grids and P2P grids.
Computational and data grids are primarily built at the national level.
1.3.3 Peer-to-Peer Network Families
P2P architecture is a distributed model that is client-oriented, not server-
oriented.
In a P2P system, every node acts as both a client and a server, contributing
resources.
Peer machines are autonomous, can join/leave freely, and there's no master-
slave relationship or central coordination/database. The system is self-
organizing with distributed control.
1.3.4.2 The Cloud Landscape
Cloud computing is an on-demand paradigm addressing bottlenecks of
traditional systems like constant maintenance, poor utilization, and rising
upgrade costs.
Major cloud service models include:
o Infrastructure as a Service (IaaS): Provides infrastructure (servers, storage,
networks, data center fabric) where users deploy and run VMs with guest OSes
and applications, without managing the underlying cloud infrastructure.
o Platform as a Service (PaaS): Enables users to deploy self-built applications
onto a virtualized cloud platform, including middleware, databases,
development tools, and runtime support (e.g., Web 2.0, Java). The provider
supplies APIs and software tools.
o Software as a Service (SaaS): Browser-initiated application software for
thousands of paid cloud customers, applicable to business processes, CRM, etc.
1.4 Software Environments for Distributed Systems and Clouds
1.4.1.2 Web Services and Tools
Web services and REST systems are two approaches to service architecture,
offering loose coupling and support for heterogeneous implementations, making
them more attractive than distributed objects.
1.4.3.1 Message-Passing Interface (MPI)
MPI is the primary programming standard for developing parallel and
concurrent programs on distributed systems, essentially a library of
subprograms callable from C or FORTRAN.
The broader goal is to equip clusters, grid systems, and P2P systems with
upgraded web services and utility computing applications.
Low-level primitives like Parallel Virtual Machine (PVM) also support
distributed programming.
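A minimal message-passing sketch, written with the mpi4py Python bindings
rather than the C/FORTRAN interface described above (a simplification for
brevity; the calls mirror the standard MPI send/receive primitives):

```python
# Minimal MPI sketch using the mpi4py bindings (the text describes the
# C/FORTRAN library; Python is used here only for brevity).
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()      # this process's ID within the communicator
size = comm.Get_size()      # total number of processes

if rank == 0:
    # Rank 0 sends a greeting to every other rank.
    for dest in range(1, size):
        comm.send(f"hello, rank {dest}", dest=dest, tag=0)
else:
    msg = comm.recv(source=0, tag=0)
    print(f"rank {rank}/{size} received: {msg}")
```

Run under an MPI launcher, e.g. mpiexec -n 4 python hello_mpi.py (the file
name is hypothetical).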
1.4.3.4 Open Grid Services Architecture (OGSA)
OGSA is a common standard for public use of grid services, driven by large-
scale distributed computing applications requiring high resource and data
sharing.
Key features include a distributed execution environment, PKI services, trust
management, and security policies.
1.4.3.5 Globus Toolkits and Extensions
Globus is a middleware library implementing OGSA standards for resource
discovery, allocation, and security enforcement in a grid environment.
1.5 Performance and Dependability
Performance metrics are crucial for measuring distributed systems.
1.5.1.1 Performance Metrics
System throughput is measured in MIPS, Tflops (tera floating-point operations
per second), or TPS (transactions per second). Other measures include job
response time and network latency.
System overhead is attributed to OS boot time, compile time, I/O data rate, and
runtime support.
Other metrics include QoS for web services, system availability, dependability,
and security resilience.
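A toy calculation relating these metrics (all numbers are hypothetical):

```python
# Toy throughput/response-time calculation (numbers are hypothetical).
jobs_completed = 12_000
wall_clock_seconds = 3_600          # one hour of operation

throughput = jobs_completed / wall_clock_seconds      # jobs per second
print(f"throughput: {throughput:.2f} jobs/s")

# Response time of one job = queueing delay + execution time + network latency
queue_s, exec_s, latency_s = 0.4, 1.1, 0.05
print(f"response time: {queue_s + exec_s + latency_s:.2f} s")
```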
1.5.1.2 Dimensions of Scalability
Users desire scalable performance, with resource upgrades being backward
compatible.
Size scalability: Achieving higher performance or functionality by increasing
machine size.
1.5.1.4 Amdahl’s Law
Amdahl's Law describes the theoretical speedup of a program executed in
parallel when a sequential portion (the bottleneck) cannot be parallelized. If
a fraction α of the code is sequential and the remaining (1-α) is parallelized
across n processors, the total execution time becomes αT + (1-α)T/n, giving a
speedup S = 1 / [α + (1-α)/n], which is bounded above by 1/α.
Amdahl's law applies to fixed workloads.
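A worked sketch of the formula above (the 25% sequential fraction is an
illustrative value, not taken from the text):

```python
# Amdahl's law for a fixed workload: speedup = 1 / (alpha + (1 - alpha)/n),
# where alpha is the sequential fraction and n the number of processors.
def amdahl_speedup(alpha: float, n: int) -> float:
    return 1.0 / (alpha + (1.0 - alpha) / n)

# Example: with a 25% sequential portion, even 1,024 processors give
# a speedup of only about 4x, bounded above by 1/alpha = 4.
for n in (4, 16, 256, 1024):
    print(n, round(amdahl_speedup(0.25, n), 2))
```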
1.5.1.5 Gustafson's Law
Gustafson's Law calculates the scaled-workload speedup S' = α + (1-α)n by
fixing the parallel execution time and letting the workload scale with the
number of processors n.
Gustafson's law applies when solving scaled problems.
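The contrast with Amdahl's law can be seen with the same illustrative 25%
sequential fraction:

```python
# Gustafson's law for a scaled workload: S' = alpha + (1 - alpha) * n,
# where alpha is the sequential fraction and n the number of processors.
def gustafson_speedup(alpha: float, n: int) -> float:
    return alpha + (1.0 - alpha) * n

# Example: with the same 25% sequential fraction, scaling the problem
# with the machine keeps the speedup growing almost linearly in n.
for n in (4, 16, 256, 1024):
    print(n, round(gustafson_speedup(0.25, n), 2))
```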
1.5.2 Fault Tolerance and System Availability
System availability and application flexibility are important design goals in
distributed computing systems, alongside performance.
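Availability is conventionally quantified as MTTF / (MTTF + MTTR), where MTTF
is the mean time to failure and MTTR the mean time to repair; the sketch below
uses hypothetical numbers:

```python
# Conventional availability formula: availability = MTTF / (MTTF + MTTR),
# where MTTF is mean time to failure and MTTR is mean time to repair.
# The numbers below are hypothetical.
mttf_hours = 10_000
mttr_hours = 2

availability = mttf_hours / (mttf_hours + mttr_hours)
print(f"availability: {availability:.4%}")   # ≈ 99.98%
```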