Monitoring Kubernetes with Prometheus
Tom Wilkie, July 2017
Tom Wilkie, VP Product, Grafana Labs
Previously: Kausal, Weaveworks, Google, Acunu, Xensource
Twitter: @tom_wilkie Email: tom@grafana.com
Prometheus
Kubernetes
Monitoring & Alerting
Getting Started
Prometheus
● A monitoring & alerting system.

● Inspired by Google’s BorgMon

● Originally built by SoundCloud in 2012

● Open Source, now part of the CNCF

● Simple text-based metrics format

● Multidimensional data model

● Rich, concise query language
Prometheus’ data model is very simple:

<identifier> → [ (t0, v0), (t1, v1), ... ]
Timestamps are millisecond int64, values are float64
https://www.slideshare.net/Docker/monitoring-the-prometheus-way-julius-voltz-prometheus
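As a rough picture of this model (a sketch only, not how Prometheus stores data internally), a series can be thought of as a mapping from an identifier to a list of (timestamp, value) pairs. The names `Sample`, `series` and `latest`, and the sample numbers, are made up for illustration.

```python
from typing import Dict, List, Tuple

# One sample: (timestamp in milliseconds since the epoch, float value).
Sample = Tuple[int, float]

# Hypothetical in-memory picture of the data model: identifier -> samples.
series: Dict[str, List[Sample]] = {
    'http_requests_total{job="nginx",path="/home",status="200"}': [
        (1_500_000_000_000, 30.0),
        (1_500_000_015_000, 31.0),
        (1_500_000_030_000, 32.0),
    ],
}


def latest(identifier: str) -> Sample:
    """Return the most recent sample of a series (largest timestamp)."""
    return max(series[identifier], key=lambda sample: sample[0])
```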
Prometheus identifiers

http_requests_total{job="nginx", instance="1.2.3.4:80", path="/home", status="200"}
http_requests_total{job="nginx", instance="1.2.3.4:80", path="/home", status="500"}
http_requests_total{job="nginx", instance="1.2.3.4:80", path="/settings", status="200"}
http_requests_total{job="nginx", instance="1.2.3.4:80", path="/settings", status="502"}
Prometheus series selector

http_requests_total{job="nginx", status=~"5.."}
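A minimal Python sketch of how such a selector picks series: every `=` matcher must be equal and every `=~` matcher must match as a regular expression. Prometheus regex matchers are fully anchored, which `re.fullmatch` approximates here (Prometheus uses RE2 syntax, so this is only an approximation). The helper `matches` and the sample label sets are hypothetical.

```python
import re
from typing import Dict


def matches(labels: Dict[str, str],
            equal: Dict[str, str],
            regex: Dict[str, str]) -> bool:
    """True if the label set satisfies every = and every =~ matcher."""
    for name, want in equal.items():
        if labels.get(name, "") != want:
            return False
    for name, pattern in regex.items():
        # Anchored match: "5.." must cover the whole label value.
        if re.fullmatch(pattern, labels.get(name, "")) is None:
            return False
    return True


# Hypothetical label sets, loosely based on the identifiers above.
all_series = [
    {"job": "nginx", "path": "/home", "status": "200"},
    {"job": "nginx", "path": "/home", "status": "500"},
    {"job": "nginx", "path": "/settings", "status": "502"},
]
selected = [s for s in all_series
            if matches(s, {"job": "nginx"}, {"status": "5.."})]
```

Here `selected` keeps only the series whose status is a three-character value starting with 5.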
Building queries usually starts with a selector

PromQL: http_requests_total{job="nginx", status=~"5.."}
{job="nginx", instance="1.2.3.4:80", path="/home", status="500"} 34
{job="nginx", instance="1.2.3.4:80", path="/settings", status="502"} 56
{job="nginx", instance="2.3.4.5:80", path="/home", status="500"} 76
{job="nginx", instance="2.3.4.5:80", path="/settings", status="502"} 96
...
Can select vectors of values…

PromQL: http_requests_total{job="nginx", status=~"5.."}[1m]
{job="nginx", instance="1.2.3.4:80", path="/home", status="500"} [30, 31, 32, 34]
{job="nginx", instance="1.2.3.4:80", path="/settings", status="502"} [4, 24, 56, 56]
{job="nginx", instance="2.3.4.5:80", path="/home", status="500"} [76, 76, 76, 76]
{job="nginx", instance="2.3.4.5:80", path="/settings", status="502"} [56, 106, 5, 96]
...
And apply functions…

PromQL: rate(http_requests_total{job="nginx", status=~"5.."}[1m])
{job="nginx", instance="1.2.3.4:80", path="/home", status="500"} 0.0666
{job="nginx", instance="1.2.3.4:80", path="/settings", status="502"} 0.866
{job="nginx", instance="2.3.4.5:80", path="/home", status="500"} 0.0
{job="nginx", instance="2.3.4.5:80", path="/settings", status="502"} 2.43
...
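To see where these numbers come from, here is a simplified sketch of the calculation. It assumes the four samples span the full 60-second window and ignores the extrapolation that the real rate() function applies, so it is an illustration rather than Prometheus' actual algorithm. `simple_rate` is a made-up helper. A drop in the counter (106 to 5 in the last series) is treated as a counter reset.

```python
from typing import List


def simple_rate(samples: List[float], window_seconds: float) -> float:
    """Per-second increase of a counter over the window."""
    increase = 0.0
    for prev, cur in zip(samples, samples[1:]):
        if cur >= prev:
            increase += cur - prev
        else:
            # Counter reset: it restarted from zero, so the new value
            # is the whole increase since the reset.
            increase += cur
    return increase / window_seconds


rates = [
    simple_rate([30, 31, 32, 34], 60),   # (34 - 30) / 60
    simple_rate([4, 24, 56, 56], 60),    # (56 - 4) / 60
    simple_rate([76, 76, 76, 76], 60),   # no increase
    simple_rate([56, 106, 5, 96], 60),   # (50 + 5 + 91) / 60, one reset
]
```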
And aggregate by a dimension…

PromQL: sum by (path) (rate(http_requests_total{job="nginx", status=~"5.."}[1m]))
{path="/home"} 0.0666
{path="/settings"} 3.3
...
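Aggregating by a dimension amounts to grouping the series by one label and summing each group. A small sketch, using the per-series rates from the previous step; `sum_by` is a hypothetical helper, not part of any Prometheus client library.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

LabelledValue = Tuple[Dict[str, str], float]


def sum_by(label: str, vector: List[LabelledValue]) -> Dict[str, float]:
    """Group values by one label and sum each group."""
    totals: Dict[str, float] = defaultdict(float)
    for labels, value in vector:
        totals[labels.get(label, "")] += value
    return dict(totals)


# Rates taken from the previous step; other labels trimmed for brevity.
rates: List[LabelledValue] = [
    ({"instance": "1.2.3.4:80", "path": "/home"}, 0.0666),
    ({"instance": "1.2.3.4:80", "path": "/settings"}, 0.866),
    ({"instance": "2.3.4.5:80", "path": "/home"}, 0.0),
    ({"instance": "2.3.4.5:80", "path": "/settings"}, 2.43),
]
by_path = sum_by("path", rates)
```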
Do binary operations…

PromQL: sum by (path) (rate(http_requests_total{job="nginx", status=~"5.."}[1m]))
/
sum by (path) (rate(http_requests_total{job="nginx"}[1m]))
{path="/home"} 0.001
{path="/settings"} 1.0
...
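A binary operation between two vectors pairs up elements with identical labels (here, the same path) and drops elements that have no partner on the other side. The sketch below illustrates that matching. The total request rates are assumed values chosen to reproduce the ratios above, and `divide_on_label` is a hypothetical helper.

```python
from typing import Dict


def divide_on_label(numerator: Dict[str, float],
                    denominator: Dict[str, float]) -> Dict[str, float]:
    """Divide values that share a label value; drop unmatched ones.

    This sketch assumes every matched denominator is non-zero.
    """
    return {
        key: numerator[key] / denominator[key]
        for key in numerator
        if key in denominator
    }


error_rates = {"/home": 0.0666, "/settings": 3.3}
# Assumed totals; "/login" has no error series, so it is dropped.
total_rates = {"/home": 66.6, "/settings": 3.3, "/login": 12.0}
error_ratio = divide_on_label(error_rates, total_rates)
```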
Kubernetes
● Platform for managing containerized workloads and services

● “operating system for your datacenter”

● Inspired by Google’s Borg

● Also part of the CNCF

● Distributed, fault tolerant architecture

● Rich object model for your applications
https://thenewstack.io/myth-cloud-native-portability/
kube-state-metrics
cAdvisor
Monitoring & Alerting
What should I monitor?
USE Method
● Utilisation, Saturation, Errors…

RED Method
● Requests, Errors, Duration…

??? Method
● Expected system state…
USE Method
● Cluster- and node-level metrics

● node_exporter run as a DaemonSet
USE Method
CPU Utilisation:
1 - avg(rate(node_cpu{mode="idle"}[1m]))
CPU Saturation:
sum(node_load1) / sum(node:node_num_cpu:sum)
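The arithmetic behind these two expressions can be sketched as follows. The rate of the idle counter is the fraction of each second a CPU spent idle, so utilisation is one minus the average idle fraction. Saturation compares the total 1-minute load average with the total number of CPUs. The function names and input numbers are made up for illustration.

```python
from typing import List


def cpu_utilisation(idle_fractions: List[float]) -> float:
    """1 minus the average fraction of time the CPUs spent idle."""
    return 1.0 - sum(idle_fractions) / len(idle_fractions)


def cpu_saturation(load1_per_node: List[float],
                   cpus_per_node: List[int]) -> float:
    """Total 1-minute load average divided by total CPU count."""
    return sum(load1_per_node) / sum(cpus_per_node)


# Hypothetical cluster: four CPUs' idle fractions, two nodes of 4 CPUs.
utilisation = cpu_utilisation([0.75, 0.25, 0.5, 0.5])
saturation = cpu_saturation([2.0, 6.0], [4, 4])
```

A saturation at or above 1 suggests there are at least as many runnable tasks as CPUs.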
USE Method
● Can also look at container-level metrics from cAdvisor…

● …and combine them with metadata from kube-state-metrics.
USE Method
Container CPU usage by “app” label
sum by (namespace, label_name) (
  sum(rate(container_cpu_usage_seconds_total[5m])) by (pod
  * on (pod_name) group_left(label_name)
  label_join(kube_pod_labels, "pod_name", ",", "pod")
)
RED Method
● Metrics exposed by components for RED-style monitoring
RED Method
Most useful alert I’ve found:

100 * sum by(instance, job) (
rate(rest_client_requests_total{code!~"2.."}[5m])
)
/
sum by(instance, job) (
rate(rest_client_requests_total[5m])
)
??? Method
Alert expressions are invariants that describe a healthy system
kube_deployment_spec_replicas !=
kube_deployment_status_replicas_available
rate(kube_pod_container_status_restarts_total[15m]) > 0
??? Method
Alert expressions are invariants that describe a healthy system
(kube_pod_status_phase{phase!~"Running|Succeeded"}) > 0
sum(kube_pod_container_resource_requests_cpu_cores)
/ sum(node:node_num_cpu:sum)
>
(count(node:node_num_cpu:sum) - 1)
/ count(node:node_num_cpu:sum)
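One way to read the last expression: it fires when the CPU requested by pods exceeds the share of capacity that would remain if one node were lost, which is (n - 1) / n for n equally sized nodes. The sketch below works through that arithmetic. The equal-size assumption, the function name and the numbers are mine.

```python
from typing import List


def cpu_overcommitted(requested_cores: float,
                      cores_per_node: List[int]) -> bool:
    """True if CPU requests would not fit after losing one node.

    Mirrors: sum(requests) / sum(cores) > (count(nodes) - 1) / count(nodes)
    """
    nodes = len(cores_per_node)
    requested_fraction = requested_cores / sum(cores_per_node)
    tolerable_fraction = (nodes - 1) / nodes
    return requested_fraction > tolerable_fraction


# Hypothetical cluster of three 4-core nodes: the threshold is 2/3.
fits = cpu_overcommitted(7, [4, 4, 4])      # 7/12 is below 2/3
too_much = cpu_overcommitted(9, [4, 4, 4])  # 9/12 is above 2/3
```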
Getting Started
Getting set up
● github.com/coreos/prometheus-operator - Job to look after running Prometheus on Kubernetes

● github.com/coreos/kube-prometheus - Set of configs for running all the other things you need.

● github.com/kausalco/public/tree/master/prometheus-ksonnet - My configs for running Prometheus, Alertmanager, Grafana, etc.

● github.com/kubernetes-monitoring/kubernetes-mixin - Joint project to unify and improve common alerts for Kubernetes.
More reading…
https://landing.google.com/sre/book.html
https://www.youtube.com/watch?v=1oJXMdVi0mM
http://www.brendangregg.com/usemethod.html
Questions?
