What HPC can learn from DevOps?
DevOps and HPC:
Saudi Aramco HPC use case
Walid A. Shaari, Ahmed Bu-khamsin
20th April 2016
2
Reference in this document to any specific commercial product, process, or
service by trade name, trademark, manufacturer, or otherwise, does not
necessarily constitute or imply its endorsement, recommendation, or favoring
by Saudi Aramco or Saudi Aramco HPC group. The ideas and findings of authors
expressed in any slides or other material should not be construed as an official
Saudi Aramco or HPC team position and shall not be used for advertising or
product endorsement purposes. Information contained in this document is
published in the interest of scientific and technical information exchange.
DISCLAIMER OF ENDORSEMENT
27/10/2014
3
DevOps
Cultural movement or practice that
emphasizes the collaboration and
communication of both Application
Developers and Operations
professionals.
Development
Business
Operations
adaptive
automated
agile
4
Business Drivers
o Optimization
Effective utilization of data center resources:
• Utilization of systems, storage, network, or services.
• Better use of employees' time and skills.
o Growth ( N x R x P )
Increasing Infrastructure scale
• N: number of managed nodes/clusters/environments
• R: number of applications (business roles)
• P: number of technical services (technology profiles)
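One way to read the N x R x P growth term is as an upper bound on the number of distinct (environment, role, profile) combinations that operations may have to manage. The short Python sketch below illustrates that reading. The numbers used (40 environments, 5 roles, 12 profiles) are illustrative assumptions, not figures from the slides.

```python
def config_combinations(n_envs: int, n_roles: int, n_profiles: int) -> int:
    """Upper bound on distinct (environment, role, profile) combinations."""
    return n_envs * n_roles * n_profiles


# Illustrative values only (assumptions, not taken from the slides).
baseline = config_combinations(n_envs=40, n_roles=5, n_profiles=12)
doubled_envs = config_combinations(n_envs=80, n_roles=5, n_profiles=12)

# Doubling any one factor doubles the whole product.
print(baseline, doubled_envs)
```

Because the factors multiply, growth in any one of them scales the whole management burden, which is the argument for automating configuration rather than handling each combination by hand.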
5
Popular DevOps Tools
Docker
Mesos
Git
Puppet
6
Data Center blueprints
7
Script
Packages
Files
Services Mounts Security
Cluster Deployment
8
[Figure: nine copies of the cluster deployment block (Script, Packages, Files, Services, Mounts, Security)]
• Different Hardware
• Different Sizes
• Different Users
• Different Operating Systems
9
[Figure: nine copies of the cluster deployment block (Script, Packages, Files, Services, Mounts, Security)]
Common Tasks:
Apply security patches
Add new storage
Upgrade the OS
Install new packages
Common Issues:
Scalability issue
Lack of history
No team collaboration
No drift control
Long development and test cycle
10
• Do it the DevOps way
- Infrastructure as code
• Definition of Infrastructure as code:
"Enable the reconstruction of the business from nothing but a source code
repository, an application data backup, and bare metal resources"
Solution
11
• Domain Specific Language:
- To describe the infrastructure desired state
• Data Store:
- To store the configuration specifications and other data
• Control System:
- To deploy the code and apply the required configuration changes
• Version Control System:
- To keep history
- To enforce workflow and peer review
- To support team collaboration
Configuration Management Tools
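The desired-state idea behind these components can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not how Puppet or any particular tool is implemented: the resource names (`package:ntp`, `service:ntpd`, `mount:/scratch`) are hypothetical, a plain dictionary stands in for both the domain-specific language and the data store, and `apply` plays the role of the control system.

```python
from typing import Dict, List, Tuple

State = Dict[str, str]  # resource name -> state value


def plan(desired: State, current: State) -> List[Tuple[str, str]]:
    """Return the (resource, value) changes needed to reach the desired state."""
    return [
        (name, value)
        for name, value in sorted(desired.items())
        if current.get(name) != value
    ]


def apply(desired: State, current: State) -> State:
    """Apply the plan to a copy of the current state; safe to run repeatedly."""
    converged = dict(current)
    for name, value in plan(desired, current):
        converged[name] = value
    return converged


# Hypothetical resources, for illustration only.
desired = {
    "package:ntp": "installed",
    "service:ntpd": "running",
    "mount:/scratch": "mounted",
}
current = {"package:ntp": "installed", "service:ntpd": "stopped"}

first = apply(desired, current)
second = apply(desired, first)  # a second run finds nothing left to change

print(plan(desired, current))
print(plan(desired, first))
```

Keeping the `desired` description in a version control system is what provides the history, peer review, and collaboration listed above, since every change to the infrastructure becomes a reviewable commit.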
12
Puppet
• Open-source IT automation framework
• Framework to simplify and automate system configuration and provisioning
• Replaces SSH for-loops and scripts
• Hundreds of configuration modules available for download
• Supports many Linux distributions, Windows, storage and network devices
13
• Hardware Delivery
• Power Up and Network Connectivity
• OS Installation
• Aramco Customization
• Benchmarking
• Application Testing
• Production
HP CMU, IBM xCAT, Dell Bright
Where Puppet Fits
Cluster Deployment Project Plan
14
Benefits
• Speeds up cluster deployment from days to hours
- Shorter development cycle
- Same code is used for deployment and compliance
- Code Reuse
15
Benefits
Contribution During Puppet Deployment Project
Contribution During First Deployment Project
Contribution During Second Deployment Project
November 13 2014 - April 22 2015
Commit statistics for production
697 commits during 160 days
Average 4.4 commits per day
Contributed by 9 authors
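The figures on this slide can be checked with a short calculation. The sketch below uses only the dates and the commit count quoted above; the variable names are illustrative.

```python
from datetime import date

# Reporting period and commit count as quoted on the slide.
start = date(2014, 11, 13)
end = date(2015, 4, 22)
commits = 697

days = (end - start).days
commits_per_day = round(commits / days, 1)

print(days, commits_per_day)
```

The period works out to 160 days and the average rounds to 4.4 commits per day, matching the numbers reported.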
16
Benefits
• Automatic and continuous deployment
- Classify the cluster as the right type and Puppet does the rest
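The classification idea can be pictured as a lookup from cluster type to the set of configuration classes that type should receive. The Python sketch below is only an illustration of that idea, not Puppet's own classification mechanism. The cluster types and class names are hypothetical.

```python
from typing import Dict, List

# Hypothetical cluster types and the configuration classes each one receives.
CLASSES_BY_TYPE: Dict[str, List[str]] = {
    "compute": ["base", "security", "scheduler_client", "parallel_fs_mounts"],
    "login": ["base", "security", "user_tools"],
}


def classes_for(cluster_type: str) -> List[str]:
    """Return the configuration classes for a cluster type; reject unknown types."""
    try:
        return CLASSES_BY_TYPE[cluster_type]
    except KeyError:
        raise ValueError(f"unknown cluster type: {cluster_type!r}") from None


print(classes_for("compute"))
```

Under this model, bringing up a new cluster means choosing its type once; everything else follows from the versioned class definitions.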
17
Benefits
• Advanced reporting capabilities
• Self-healing and drift control
• Baseline configuration compliance
18
Benefits
• Version control and development workflow
• Team Collaboration
Production
Bug-fix
New feature
Merge Request
Merge Request
19
git Branches and Commits
20
How Pervasive is Configuration Management?
ASM
21
Traditional HPC Cluster Management tools
https://www.flickr.com/photos/vrogy/514733529
22
Provisioning
Workload
Scheduler
& Metrics
System
(user land, kernel modules, devices)
Bare metal Bootstrapping
Configuration
Orchestration
Consistency
Provisioning activity
Puppet, Ansible, Chef
Grid Engine, SLURM, TORQUE/Moab, Mesos/Swarm/Nomad
Puppet, Chef, Ansible
Foreman, Razor, Digital Rebar, Ironic
Virtual image
Container
HPC Ops
Web/Cloud Ops
HPC workload runs on the cloud
25%
24
Which workloads and frameworks are running on OpenStack?
Source : https://www.openstack.org/assets/survey/Public-User-Survey-Report.pdf
25
HPC on non-bare-metal: experimental, or is it mature?
Vendor trends
26
Next Generation Provisioning
Puppet
Razor Ironic
• No vendor lock-in: open-source availability
• Environment agnostic
 - bare-metal, virtual image, and containers
• Uses open standards
 - IPMI 2.0, iPXE, DHCP, REST, HTTPS
• Handles end-to-end application provisioning
• Better integration with other tools
 - configuration management, CMDB, monitoring
• Programmable
• On-demand provisioning
• Policy/model based
27
Data Center current state
[Figure: three separate Scheduler and Jobs stacks on Cluster Management A, B, and C, with a 0% to 100% scale]
28
Data Center
Breaking the Silos
[Figure: a MetaScheduler above three Schedulers and their Jobs]
29
Data Center
Efficient Secure Allocation of Resources
[Figure: a DataCenterScheduler above three Schedulers and their Jobs, labelled VC1 Infra, VC2 HPC, and VC3 BigData]
2nd Generation Cluster Management
30
Containers
A container encapsulates an application, together with all of its
software dependencies, into a standardized unit of software that is
portable across different platforms*
https://www.docker.com/what-docker
31
Containers Potential Benefits to HPC
o High performing
o Lightweight
o Portable; could solve software packaging, configuration, and delivery
o Visibility of the host kernel and system drivers
o Composable
o Targets better, more scalable monitoring, logging, and security
o Private in-house repositories
o Workforce separation of concerns (e.g. Operations, Development, Security, Users)
o Builds on mature agile application lifecycle management
o Empowers application support staff and developers
o Holistic yet modular ecosystem
o Schedulers and cluster managers
(Traditional, e.g. LSF, UGE, Moab, and Slurm)
(Modern, e.g. Mesos, Kubernetes, Nextflow)
32
Docker Performance
http://www.theregister.co.uk/2014/08/18/docker_kicks_kvms_butt_in_ibm_tests
33
NVIDIA Example use case
https://github.com/NVIDIA/nvidia-docker
34
Host possible workload
Tiny Core Linux (VM)
Docker Engine
Bin/libs
Enterprise Linux Distribution
Service
RHEL7
HPCtask
HPCtask
HPCtask
HPCtask
Alpine
MicroService
MicroService
MicroService
MicroService
Ubuntu
Bigdata
Alpine
Redis
Kibana
Logstash
Elasticsearch
35
HPC Host Reality
RHEL7
HPCTask
HPCTask
HPCTask
HPCTask
Bin/Libs
HPC service
Docker Engine
Docker capable OS
Bin/Libs
HPC service
Bin/Libs
HPC service
Docker Engine
Docker capable OS
Docker Engine
Docker capable OS
Bin/Libs
HPC Job 3
Docker Engine
Docker capable OS
Docker Engine
Docker capable OS
Bin/Libs
HPC Job 3
Bin/Libs
HPC Job 3
Container Cluster Management/orchestration
36
Possible HPC Challenges
o Change of processes and mindset
o Linux kernel requirements
o Maturity of the cluster management and scheduling solution
o Keeping up with the container ecosystem
o Extremely fast-moving target
o Several architectural and fundamental decisions to make
o Memory deduplication
o Necessity of automated tool-chains
 “development, integration, and delivery workflows”
o Security
 Trusted container libraries
37
Thank you
38
Extra Slides
39
• http://www.meetup.com/Docker-Riyadh/
• http://www.meetup.com/Docker-Dhahran/
Saudi Docker meetups
40
Mesos
• Mature, open-source Apache project
• Cluster resource manager
• Scalable to 10,000s of nodes
• Fault tolerant, no single point of failure
• Multi-tenancy with strong resource isolation
• Improved resource utilization
41
Mesos workload schedulers “Frameworks”
42
43
File system Layers
44
Devil in the details
ssh
mpi
Scheduler
Init
musl glibc
Docker Engine
Docker capable OS
Bin/Libs
HPC service
