How to Build a Cluster
A High-Level Overview of the Key Issues
Ramsay Key – May 2017
Why should you care?
• Although a lot of computing has been commoditized and delivered via IT
and “the cloud”, it’s still useful to understand infrastructure:
• Building your own “bare-metal” servers for yourself or a customer
• Deploying your hardware into your customer’s data-center
• Deploying your software into your customer’s data-center
• Understanding considerations for non-traditional systems (e.g. vehicles, devices)
• Some components need to be built regardless of “public cloud” or private infrastructure
Photo: Google – http://datacenterfrontier.com/google-building-four-story-data-centers/
The ideas and practices for building this apply even if you have just a single rack or half-rack
Outline
• Cluster Design
• Racks and Servers
• Networking
• Day-to-Day Operations
Cluster Design
Gathering Requirements
• Operations
• What is the intended operation of the system? Prototype? Production? 24x7? Call-ins?
• Type of system? Query? Analytics? Batch? Streaming?
• Could the system grow?
• Dataflow
• How much data? Ingest volume (bytes/records)? Query load? Timeframe?
• Compute
• Type of computing: CPU heavy? RAM heavy? File I/O heavy? Network I/O heavy?
• If these answers are not clear – try to derive upper and lower bounds on the needs
RAM goes a long way these days!
• All 4B IPv4 addresses fit in 18GB of RAM
• 256GB of RAM can hold 16B MD5 hashes (quick arithmetic check below)
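As a back-of-the-envelope sanity check of those two figures (Python, illustrative only):

    ipv4_addresses = 2**32              # ~4.3B addresses, 4 bytes each
    print(ipv4_addresses * 4 / 1e9)     # ~17.2 GB -> fits in 18GB with a little slack
    md5_hashes = 16 * 10**9             # 16B hashes, 16 bytes each
    print(md5_hashes * 16 / 1e9)        # 256 GB -> raw capacity of 256GB, no overhead counted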
Design Considerations
• Reliability
• How reliable does your system need to be? How many “nines” of availability do you need?
• Failure
• How much redundancy is required? How much do you have? What about the facilities plant?
• Scalability
• Horizontal vs. Vertical. How much load can the system process? Scalability also includes people processes
• Backup
• Do you have a backup plan?
• Application Deployments
• How to deploy, manage, troubleshoot applications?
Helpful Philosophies
• Keep It Simple Stupid
• Very easy to create complex server infrastructure. Strive for simplicity
• Be Homogeneous
• Heterogeneity complicates scaling, debugging, and logistics
• Expect Failure
• Components will fail. Disks fail all the time. More computers ➔ More failure probability
• Automate Everything
• If you can reproduce your infrastructure quickly and easily, it is a good sign it is healthy
Pets vs. Cattle
Financial Considerations
• Electricity availability/cost fundamentally dictates scale
• Appropriate accounting/purchasing allows hardware to be depreciated
• Can you “buy” your way out of scalability problems? (i.e. horizontal scaling)
• Capital Expenditure (CapEx) vs. Operational Expenditures (OpEx)
• CapEx = hardware, facilities
• OpEx = labor, support, maintenance
• Trade-off between CapEx and OpEx
• Clusters generally try to minimize OpEx via automation, homogeneity
• However, clusters don’t run and fix themselves – still need labor to support them
Racks and Servers
Rackspace & Power, Space, Cooling
• Datacenters have real physical constraints generally characterized as Power,
Space, and Cooling (PSC)
• Datacenters are laid out in “racks” (a.k.a. cabinets)
• Datacenters have different “Tiers” (1-4) for handling different failure levels
• Rack height is standardized in “rack units” (a.k.a. “U”); 42U racks are typical.
• “Rack servers” commonly come in 1U and 2U dimensions. Width standardized.
• 3U+ generally implies more specialized hardware
• Prefer rack-servers over “blade centers”
Rack Considerations
• 1U servers are considered “dense”
• Need to pay attention to cooling and cabling
• Can be hard to fit more elaborate components in 1U (GPUs or large hard-drives)
• 2U servers are good all-around chassis
• May lose some density per U
• A good reference for an “all-up” rack is 40 servers and 2 “top-of-rack”
(TOR) switches
Annotated rack photo (from: https://techbloc.net/archives/970):
• Power Distribution Units (PDU) – can run 2 for redundancy
• 1U servers
• 2 top-of-rack (TOR) switches – the red and blue cables are “bonded” and provide redundancy and performance
• 1 “management” switch for administration (yellow cables) – a separate network for when the main network is down
• Fiber uplinks to the datacenter spine
• HID – badge-swipe access / alarm
Server Selection
• For purchasing, best to work through a value-added reseller (VAR)
• Can assist with questions, delivery, coordination
• Consider redundant power supplies depending on production level
• Consider redundant network ports depending on production level
• Hot-swappable hard-drives make life easy (almost standard now)
• Make sure NICs will “PXEBoot” (i.e. network boot)
• Consider the RAID (Redundant Array of Inexpensive Disks) level. Popular options (see the sketch below):
• RAID10 is a good mix of redundancy and performance
• JBOD = just a bunch of disks – let applications manage redundancy
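For illustration, if you use Linux software RAID (a hardware RAID controller is also common in these servers), a 4-disk RAID10 array might be built roughly like this; device names are illustrative:

    mdadm --create /dev/md0 --level=10 --raid-devices=4 /dev/sdb /dev/sdc /dev/sdd /dev/sde
    cat /proc/mdstat          # watch the array assemble/resync
    mkfs.ext4 /dev/md0        # then add a filesystem and mount as usual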
Typical commodity server – circa 2017 (generally don’t need tools to replace parts!):
• 10 2TB hot-swappable hard-drives
• Dual hot-swappable power supplies
• Dual CPUs, 10 cores
• 256GB RAM
• Fan bank for cooling
• (Not viewable) 4 10G NIC ports, 2 1G NIC ports, 1 IPMI port
Operating Systems
• Linux is the OS of choice when building clusters
• Lots of tools for managing and tuning Linux clusters at scale
• Licensed software complicates scaling a cluster
• These days many excellent open-source alternatives exist
• CentOS (derivative of Red Hat) and Ubuntu are both popular options
• CentOS generally about stability and security
• Popular with enterprises, IT, and operations people
• Ubuntu generally newer and modern (closer to latest constituent software releases)
• Popular with innovators, researchers, etc.
Provisioning
• Provisioning = building or rebuilding a node
• Typical flow is for the node to “PXEBoot” into an installer driven by a kickstart (Red Hat/CentOS) or preseed (Ubuntu) file
• Node first boots via DHCP then downloads a kickstart/preseed file
• The kickstart/preseed file points to an installer and associated packages
• Foreman is a system that facilitates the “PXEBoot”, kickstart/preseed process
• Typical model is to put the bare minimum into kickstart/preseed and then let
a configuration management system take over
Example Kickstart File
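A minimal kickstart file might look roughly like this; the mirror URL, hostname, password hash, and the Puppet handoff in %post are illustrative assumptions:

    # CentOS 7 kickstart sketch – drives an unattended install
    install
    url --url="http://mirror.example.com/centos/7/os/x86_64/"
    lang en_US.UTF-8
    keyboard us
    timezone UTC
    rootpw --iscrypted $6$examplesalt$examplehash
    network --bootproto=dhcp --hostname=node01.example.com
    bootloader --location=mbr
    clearpart --all --initlabel
    autopart
    reboot

    %packages
    @core
    openssh-server
    %end

    %post
    # keep this minimal: install the config-management agent and hand off
    yum -y install puppet
    puppet agent --server puppet.example.com --onetime --no-daemonize
    %end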
Networking
Network Considerations
• Typical server network configuration would have:
• 1Gb IPMI port
• Allows interaction with basic server functions (power-on/power-off, etc.)
• 1Gb management port
• For server administration tasks, separate from data network
• 2 10Gb data ports (in a bonded configuration – see the sketch after this list)
• For passing data between nodes
• Keep in mind that disk I/O speed may be slower than the network
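As a sketch of the bonded data ports using CentOS-style network-scripts; the interface names, addresses, and bonding mode are illustrative and the mode must match the switch configuration:

    # /etc/sysconfig/network-scripts/ifcfg-bond0
    DEVICE=bond0
    TYPE=Bond
    BONDING_MASTER=yes
    BONDING_OPTS="mode=802.3ad miimon=100"   # LACP bond across the two 10Gb ports
    BOOTPROTO=none
    IPADDR=10.0.0.11
    PREFIX=24
    ONBOOT=yes

    # /etc/sysconfig/network-scripts/ifcfg-eth2   (repeat for the second 10Gb port)
    DEVICE=eth2
    MASTER=bond0
    SLAVE=yes
    BOOTPROTO=none
    ONBOOT=yes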
Network Fabric
• Typical rack configuration has two “top-of-rack” (TOR) switches that connect all
the internal rack servers together, plus an “uplink” to the datacenter spine of “core”
switches so it can talk to the other racks
• Use two switches per rack for redundancy and throughput
• Spine typically runs at 25Gb, 40Gb, or 100Gb
Image (racks connected to the datacenter spine/core): http://www.cisco.com/c/dam/en/us/products/collateral/switches/nexus-7000-series-switches/white-paper-c11-737022.docx/_jcr_content/renditions/white-paper-c11-737022_3.jpg
Day-to-Day Operations
Configuration Management
• Configuration Management = tools that define and enforce server configurations
• Popular cluster configuration management tools:
• Puppet: Most common, well-known, agent architecture
• Chef: An early alternative to Puppet, also agent architecture
• Ansible: Popular, agent-less architecture, integrates with networking gear well
• SaltStack: More recent alternative, both agent and agent-less architecture
• All tools have pros & cons - doesn’t matter so much what you use, just use one!
• Possible to entirely define your infrastructure within the tools
• Version control (git) the tool configurations and you get “infrastructure as code”
Example Puppet Manifest
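A minimal manifest might look roughly like this; the class name, file source, and service name are illustrative and assume a CentOS node:

    # Keep NTP installed, configured, and running on every node
    class cluster::base {
      package { 'ntp':
        ensure => installed,
      }

      file { '/etc/ntp.conf':
        ensure  => file,
        owner   => 'root',
        group   => 'root',
        mode    => '0644',
        source  => 'puppet:///modules/cluster/ntp.conf',
        require => Package['ntp'],
      }

      service { 'ntpd':
        ensure    => running,
        enable    => true,
        subscribe => File['/etc/ntp.conf'],   # restart ntpd if the config changes
      }
    }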
Monitoring
• Bad things will happen in your cluster!
• The larger and more complex the cluster, the more fantastic ways it can fail
• Absolutely need monitoring of your cluster
• Nagios is the most common tool for monitoring
• Many alternatives: ElastAlert, Zenoss, Prometheus, xymon, Zabbix, Sensu, …
• Tools send an alert (email, syslog, IM) when bad things happen
• Tools usually come with some defaults – and have pluggable architectures
nagios.com
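For concreteness, a minimal Nagios host/service definition might look roughly like this; the host name, address, and templates assume a stock Nagios Core sample configuration:

    define host {
        use        linux-server      ; template from the sample config
        host_name  node01
        address    10.0.0.11
    }

    define service {
        use                  generic-service   ; template from the sample config
        host_name            node01
        service_description  SSH
        check_command        check_ssh         ; alerts when sshd stops answering
    }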
Health & Status (Metrics)
• “Measure Anything. Measure Everything” – Etsy
• Instrument everything you can:
• Useful for performance tuning
• When bad things happen, these will be handy for identifying the root-cause
• Applications, operating systems, processes, disks, network, memory, etc.
• Ganglia is the most common tool for metrics
• Many excellent alternatives: Grafana, Prometheus, Logstash, Graphite, statsd, collectd, OpenTSDB, Timely
• Pick one and use it! (a minimal statsd-style sketch follows)
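The statsd line protocol is just “metric:value|type” sent over UDP; a minimal counter in Python, with an illustrative metrics host, port, and metric name:

    import socket

    STATSD = ("metrics.example.com", 8125)   # illustrative host/port
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    def incr(metric, value=1):
        # fire-and-forget UDP counter in the statsd line protocol
        sock.sendto(f"{metric}:{value}|c".encode("ascii"), STATSD)

    incr("cluster.node01.disk_errors")   # e.g. count a disk error seen on this node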
Coordination Services
• Often useful to run a “coordination service” within a cluster
• Coordination service provides distributed reliable services for applications:
• Configuration (identifying masters)
• Naming (finding other services)
• Synchronization (tracking state)
• Typically present a “key-value” interface to clients
• Popular implementations: ZooKeeper, etcd, Consul, etc. (see the sketch below)
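A sketch of the naming/synchronization pattern using ZooKeeper via the kazoo Python client; the connection string, paths, and registered address are illustrative:

    from kazoo.client import KazooClient

    zk = KazooClient(hosts="zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181")
    zk.start()

    # each query server registers an ephemeral node that vanishes if the server dies
    zk.ensure_path("/services/query")
    zk.create("/services/query/node-", b"10.0.0.11:8080", ephemeral=True, sequence=True)

    # clients discover live servers by listing the children of the path
    print(zk.get_children("/services/query"))
    zk.stop()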
Other Useful Tools
• Public-Key Infrastructure (PKI, ssh) should be used for authentication
• Avoid passwords – only root should have a password, if at all
• Use LDAP to manage user accounts
• Have some “database” that records which servers provide which functions
• genders is a simple, popular way to do this
• pdsh is a parallel shell tool useful for running commands across a cluster
• dshbak cleans up and consolidates the output (example below)
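For example (host names and the genders group are illustrative; -g requires pdsh built with genders support):

    # run a command on every node and coalesce identical output
    pdsh -w node[01-40] 'uptime' | dshbak -c

    # or target nodes by function using genders
    pdsh -g querynodes 'df -h /data' | dshbak -c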
Software Deployment Considerations
• Always a good idea to partition your cluster into:
• Production servers - stuff you really care about…doesn’t need to be 24x7 to be production
• Integration servers - last stop before being added to production
• Test servers - general developer playground similar to production systems
• Package, version, and install your software like a product
• Helps for automation and traceability
• Scripting languages (python, perl, etc.) can be risky to deploy because they can easily be
changed once installed
Future Considerations
• Virtualization
• OpenStack, AWS, GCE, Azure, Rackspace
• Containers
• Docker, rkt, CoreOS, Swarm, Kubernetes
• “Serverless” computing
• Open Compute Project
• Software-Defined-Networking (SDN)
