Docker, Monitoring and SLURM Specific Visualisations
QNIBTerminal @ work
Agenda
• Docker in a Nutshell
• QNIBx (QNIBTerminal, QNIBMonitoring, QNIBInventory)
• SLURM Autogenerated Dashboards
About Me
• Christian Kniep
@CQnib, christian@qnib.org
• >10y Iteration
SysAdmin, SysOps, SysEngineer, R&D Engineer
DevOps @Locafox (hyper-scale web-service)
• Founder of QNIB Solutions
Holistic System Management
Containerization of SysOps and Workload
Consultancy / Software Design & Development
Docker in a Nutshell
Multiple Guests
[Figure: Traditional Virtualisation vs. Containerisation (Docker), built up layer by layer]
Traditional Virtualisation: SERVER, HOST KERNEL, Userland, HYPERVISOR (Type II), three guests each with KERNEL, Userland and SERVICE
Containerisation (Docker): SERVER, HOST KERNEL, Userland, Userland (#1) and Userland (#2), each with a SERVICE
Docker Internal View
• Containers are ‘grouped processes’
isolated by Kernel Namespaces (PID, network, mount, …)
resource restrictions applicable through CGroups
[Figure: HOST with container1 (bash, ls -l), container2 (apache), container3 (mysqld), container4 (slurmd, ssh)]
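The ‘grouped processes’ view can be inspected on a Linux host: every process exposes its namespaces as symlinks under /proc/<pid>/ns, and two processes share a namespace when the link targets carry the same inode number. The following is a minimal sketch under that assumption; the helper names (parse_ns_link, read_namespaces, shared_namespaces) are hypothetical and not part of Docker or QNIBTerminal.

```python
import os
import re

# Link targets under /proc/<pid>/ns look like "pid:[4026531836]".
NS_LINK = re.compile(r"^(\w+):\[(\d+)\]$")


def parse_ns_link(target):
    """Split a namespace link target into (namespace type, inode number)."""
    match = NS_LINK.match(target)
    if match is None:
        raise ValueError(f"unexpected namespace link: {target!r}")
    return match.group(1), int(match.group(2))


def read_namespaces(pid="self"):
    """Return {entry name: inode} for a process; empty if /proc is not readable."""
    ns_dir = f"/proc/{pid}/ns"
    result = {}
    try:
        names = sorted(os.listdir(ns_dir))
    except OSError:
        return result
    for name in names:
        try:
            _, inode = parse_ns_link(os.readlink(os.path.join(ns_dir, name)))
        except (OSError, ValueError):
            continue  # skip entries we cannot read or parse
        result[name] = inode
    return result


def shared_namespaces(a, b):
    """Names of the namespaces that two processes have in common."""
    return sorted(k for k in a.keys() & b.keys() if a[k] == b[k])


if __name__ == "__main__":
    for name, inode in read_namespaces().items():
        print(f"{name:20s} {inode}")
```

Processes started in the same container would typically report identical inodes for pid, net and mnt, while a process in a different container reports different ones.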
Docker Workshop
• 1/2 Day, July 16th @ISC High Performance
Deep dive into the talking points
How Docker might impact System Operations & HPC Applications
Further discussion beyond what I am talking about today
Docker Workshop #2
• Full Day, September 28th @ISC Cloud&BigData
QNIBTerminal
• Framework of system containers to spin up stacks
SLURM
QNIBMonitoring
• Current monitoring systems do not connect
overlaying metrics with log events
use/build inventory system to provide connections that are usually hidden
user's perspective and scope/context/background
• QNIBMonitoring provides
open metrics system (system / application metrics, log aggregates)
log event framework to consume, process and visualise events
auto discovery / configuration through Consul
• Logstash (Log/Event Monitoring)
• Grafana (Performance Monitoring)
• Overlay Metrics w/ Events
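One way to picture overlaying metrics with events is to attach to each log event the most recent metric sample at or before the event's timestamp. The sketch below assumes metrics arrive as (timestamp, value) pairs sorted by time; the function name overlay and the sample data are made up for illustration and do not reflect how Grafana or QNIBMonitoring implement annotations.

```python
from bisect import bisect_right


def overlay(metric_points, events):
    """Annotate each event with the latest metric sample at or before it.

    metric_points: list of (timestamp, value), sorted by timestamp
    events:        list of (timestamp, message)
    """
    timestamps = [ts for ts, _ in metric_points]
    annotated = []
    for ts, message in events:
        idx = bisect_right(timestamps, ts) - 1  # last sample with time <= ts
        value = metric_points[idx][1] if idx >= 0 else None
        annotated.append({"time": ts, "event": message, "metric": value})
    return annotated


# Made-up samples: CPU load over time and two job-related log events.
cpu_load = [(0, 0.2), (10, 0.3), (20, 0.9), (30, 0.4)]
log_events = [(5, "job 42 started"), (21, "job 42: solver iteration begins")]

for row in overlay(cpu_load, log_events):
    print(row)
```

An event that precedes the first sample gets None, which makes gaps in the metric stream visible instead of hiding them.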
QNIBInventory
• Network Topology
• Installed Software
• SLURM Cluster
• Enrich Log/Events
• Help visualise connections
• Build up history
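Enriching a log event with inventory data can be sketched as a lookup keyed by the host that emitted the event. Everything here is an assumption for illustration: the inventory is a plain dictionary, the field names (partition, switch) and values are invented, and enrich is a hypothetical helper rather than QNIBInventory's interface.

```python
# Hypothetical inventory records keyed by host name (values are made up).
INVENTORY = {
    "compute0": {"partition": "batch", "switch": "sw1"},
    "compute1": {"partition": "batch", "switch": "sw2"},
}


def enrich(event, inventory):
    """Return a copy of the event with inventory fields added under an inv_ prefix."""
    extra = inventory.get(event.get("host"), {})
    enriched = dict(event)
    for key, value in extra.items():
        enriched[f"inv_{key}"] = value
    return enriched


event = {"host": "compute0", "message": "slurmd: job 42 launched"}
print(enrich(event, INVENTORY))
```

Events from hosts that are not in the inventory pass through unchanged, so the enrichment step never drops data.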
Cluster Use-Case
Context Sensitive Dashboards
• Multiple backgrounds have to be considered
Enduser (Engineer, Software Developer, Scientist)
Operations Personnel
Management
• Psychology plays an important role
Local rationality / context
10,000 ft overview vs. verifying hypotheses vs. reporting
Empower users to extend their domain knowledge by providing a toolset
Cluster Use-Case
• Small SLURM cluster
couple of nodes, two user groups, couple of users
script & MPI workload
[Figure: container stack, labels as shown on the slide]
srv backend consul
Compute: slurmctld (slurmctld), compute0 (slurmd), compute<N> (slurmd)
Performance: carbon (carbon), graphite-api (graphite-api), grafana (grafana)
Log/Events: elasticsearch (elasticsearch), logger (logstash), kibana (kibana), kopf (es-kopf)
Inventory: neo4j (neo4j), inventory (QNIBInv)
Galaxy: postgres (postgres), galaxy (galaxy)
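Metrics reach the Performance part of such a stack through carbon, which accepts a plaintext protocol of one "path value timestamp" line per sample, by default on TCP port 2003. The sketch below formats and sends such lines; the host name "carbon" is taken from the diagram and assumed to resolve inside the stack, and the metric path is an invented example.

```python
import socket
import time


def carbon_line(path, value, timestamp=None):
    """Format one sample in carbon's plaintext protocol."""
    ts = int(time.time()) if timestamp is None else int(timestamp)
    return f"{path} {value} {ts}\n"


def send_metrics(lines, host="carbon", port=2003, timeout=2.0):
    """Send pre-formatted lines to a carbon plaintext listener.

    host and port are assumptions about the deployment, not values from the talk.
    """
    payload = "".join(lines).encode("ascii")
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(payload)


# Hypothetical metric path; printed rather than sent so the example runs offline.
print(carbon_line("slurm.compute0.cpu_load", 0.75, 1436000000), end="")
```

Calling send_metrics([carbon_line(...)]) from a compute container would then make the sample available to graphite-api and Grafana, provided the listener is reachable.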
Management Context
• Live cluster Status
Utilisation per cluster / user / user-group
SLA met by SysOps
Most common jobs, misbehaving enduser
• Reports
per day / user / job-type / …
• Capacity Planning
utilisation over time, comparison of HW generations, global FS capacity
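Utilisation per cluster, user or user-group boils down to summing core-hours over job records and dividing by what the cluster could have delivered. A minimal sketch, assuming job records with user, group, cores and hours fields (for instance exported from SLURM accounting); the records and function names are made up for illustration.

```python
from collections import defaultdict


def core_hours(jobs, key):
    """Sum core-hours per value of `key` (e.g. "user" or "group")."""
    totals = defaultdict(float)
    for job in jobs:
        totals[job[key]] += job["cores"] * job["hours"]
    return dict(totals)


def utilisation(jobs, total_cores, window_hours):
    """Fraction of the available core-hours that the jobs consumed."""
    used = sum(job["cores"] * job["hours"] for job in jobs)
    return used / (total_cores * window_hours)


# Made-up job records for a 32-core cluster observed over 24 hours.
jobs = [
    {"user": "alice", "group": "eng", "cores": 16, "hours": 2.0},
    {"user": "bob", "group": "sci", "cores": 8, "hours": 3.0},
    {"user": "alice", "group": "eng", "cores": 4, "hours": 1.0},
]

print(core_hours(jobs, "user"))
print(core_hours(jobs, "group"))
print(utilisation(jobs, total_cores=32, window_hours=24))
```

The same aggregation keyed by day or job type would feed the reports, and tracking the utilisation value over time is a starting point for capacity planning.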
SLURM Dashboard
SLURM Inventory
• Nodes are connected to Partitions
• Jobs are connected to both
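The relationships between nodes, partitions and jobs form a small graph. The sketch below models it with a plain adjacency map rather than a graph database, so it runs without any service; the class name Inventory, the (kind, name) tuples and the example entities are assumptions for illustration.

```python
from collections import defaultdict


class Inventory:
    """Undirected graph of (kind, name) items, e.g. ("node", "compute0")."""

    def __init__(self):
        self.edges = defaultdict(set)

    def connect(self, a, b):
        """Record that two items are related, in both directions."""
        self.edges[a].add(b)
        self.edges[b].add(a)

    def neighbours(self, item, kind):
        """All items of the given kind that are directly connected to `item`."""
        return sorted(n for n in self.edges.get(item, ()) if n[0] == kind)


inv = Inventory()
# Nodes are connected to partitions (partition name is made up).
inv.connect(("node", "compute0"), ("partition", "batch"))
inv.connect(("node", "compute1"), ("partition", "batch"))
# A job is connected to both its partition and the node it runs on.
inv.connect(("job", "42"), ("partition", "batch"))
inv.connect(("job", "42"), ("node", "compute0"))

print(inv.neighbours(("partition", "batch"), "node"))
print(inv.neighbours(("node", "compute0"), "job"))
```

Queries such as "which jobs ran on this node" then become a single neighbour lookup, which is the kind of connection a dashboard can use for context.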
Enduser Context
• Live progress of SLURM job
Monitor iteration speed to estimate workload behaviour
Get to know job while it’s running (instead of post mortem)
Introduce application profiling / log events (enhance feedback)
• Post Mortem
Get detailed report after job has finished
• MDO jobs
depending on outcome and progression submit next iteration(s)
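Monitoring iteration speed can be as simple as having the application report (time, iteration) pairs and deriving a rate and a remaining-time estimate from them. A minimal sketch under that assumption; the function names and the sample values are invented, and a real job would need to emit these samples as metrics or log events.

```python
def iteration_rate(samples):
    """Average iterations per second between the first and last sample.

    samples: list of (timestamp_seconds, iteration_counter), sorted by time
    """
    (t0, i0), (t1, i1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("need at least two samples with increasing timestamps")
    return (i1 - i0) / (t1 - t0)


def eta_seconds(samples, total_iterations):
    """Estimated seconds until total_iterations is reached, or None if stalled."""
    rate = iteration_rate(samples)
    if rate <= 0:
        return None
    return (total_iterations - samples[-1][1]) / rate


# Made-up progress reports from a running job: 120 iterations per minute.
progress = [(0, 0), (60, 120), (120, 240)]
print(iteration_rate(progress))
print(eta_seconds(progress, total_iterations=1000))
```

A rate that drops during the run is a hint worth investigating while the job is still alive, and the same estimate could inform whether to submit a next iteration.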
SysOps Context
• Live cluster Status
USE method overviews (Utilisation/Saturation/Errors)
Anomaly detection (w/ and w/o humans)
Spotting abnormal behaviour
• Drill into monitoring
verify hypotheses about incidents/problems
correlate events, metrics and inventory
• Guide through ‘known problems’
close feedback loops to provide confidence
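Spotting abnormal behaviour without a human in the loop can start with something as plain as comparing each node against its peers. The sketch below flags nodes whose value deviates from the cluster mean by more than a chosen number of standard deviations; this z-score rule is a stand-in chosen for illustration, not the method used in the talk, and the load values are made up.

```python
from statistics import mean, pstdev


def outliers(values_by_node, threshold=2.0):
    """Nodes whose value lies more than `threshold` standard deviations from the mean."""
    values = list(values_by_node.values())
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []  # all nodes behave identically
    return sorted(
        node for node, value in values_by_node.items()
        if abs(value - mu) / sigma > threshold
    )


# Made-up load averages: seven similar nodes and one that stands out.
load = {
    "compute0": 1.0, "compute1": 1.1, "compute2": 0.9, "compute3": 1.0,
    "compute4": 1.0, "compute5": 1.0, "compute6": 1.0, "compute7": 8.0,
}
print(outliers(load))
```

With few nodes a single outlier inflates the standard deviation, so the threshold needs tuning; the flagged node is then a starting point for drilling into events, metrics and inventory.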
Central Logging
Galaxy
Galaxy Use-Cases
[Figure labels: SLURM, Log Events, Metrics, Inventory, WORKFLOW]
• Model Assess Workflow in Galaxy
Easy to grasp (in contrast to Hadoop, Spark, …)
Event triggered, Cronjob?
Using idle compute resources
Thank you!
• Contact
christian@qnib.org
@CQnib, @_qnib
• Web
www.qnib.org (blog)
doc.qnib.org (Paper)
• Feel free…
…ask questions (now / later)
…ask for a Demo
