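I recently migrated our Prometheus monitoring into a Kubernetes cluster, following the deployment guide "Auto-discovering and Monitoring SpringBoot with Prometheus Operator in Kubernetes". Data collection for all metrics and the Grafana dashboards tested fine, so the next step was migrating and testing alerting. Alertmanager does not support DingTalk natively, so alerts have to go through a webhook. Fortunately there is already an open-source webhook for Prometheus DingTalk alerts (project: https://github.com/timonwong/prometheus-webhook-dingtalk), so we only need to configure and use it.
Creating a DingTalk robot is straightforward and is not covered here. Once the robot exists, the next step is to deploy the webhook, which receives alerts from Alertmanager, formats them, and forwards them to the DingTalk robot. Outside a Kubernetes cluster the deployment is also simple: write a docker-compose file and run it.
1. In a Kubernetes cluster, pods talk to each other through a Service, so first write a Kubernetes manifest, dingtalk-webhook.yaml, that contains both a Deployment and a Service.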
apiVersion: apps/v1
kind: Deployment
metadata:
  name: webhook-dingtalk
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: dingtalk
  replicas: 1
  template:
    metadata:
      labels:
        app: dingtalk
    spec:
      restartPolicy: Always
      containers:
      - name: dingtalk
        image: timonwong/prometheus-webhook-dingtalk:v1.4.0
        imagePullPolicy: IfNotPresent
        args:
        - '--web.enable-ui'
        - '--web.enable-lifecycle'
        - '--config.file=/config/config.yaml'
        ports:
        - containerPort: 8060
          protocol: TCP
        volumeMounts:
        - mountPath: "/config"
          name: dingtalk-volume
        resources:
          limits:
            cpu: 100m
            memory: 100Mi
          requests:
            cpu: 100m
            memory: 100Mi
      volumes:
      - name: dingtalk-volume
        persistentVolumeClaim:
          claimName: dingding-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: webhook-dingtalk
  namespace: monitoring
spec:
  ports:
  - port: 80
    protocol: TCP
    targetPort: 8060
  selector:
    app: dingtalk
  sessionAffinity: None
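1.1 The first approach uses persistent storage: the configuration file config.yaml and the alert template live on shared storage, so the webhook can read them no matter which node it is scheduled on. For how to persist data over NFS, see "Dynamically Provisioning NFS PVs with a StorageClass in Kubernetes".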
dingding-pvc.yaml
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: dingding-pvc
  annotations:
    volume.beta.kubernetes.io/storage-class: "atang-nfs"
  namespace: monitoring
spec:
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 50Mi
Configuration file config.yaml:
templates:
  - /config/template.tmpl
targets:
  webhook1:
    # Replace the placeholder with your own DingTalk robot token
    url: https://oapi.dingtalk.com/robot/send?access_token=your_dingding_token
Alert template template.tmpl:
{{ define "ding.link.title" }}[Monitoring Alert]{{ end }}

{{ define "ding.link.content" -}}
{{- if gt (len .Alerts.Firing) 0 -}}
  {{ range $i, $alert := .Alerts.Firing }}
    [Alert]: {{ index $alert.Labels "alertname" }}
    [Instance]: {{ index $alert.Labels "instance" }}
    [Severity]: {{ index $alert.Labels "severity" }}
    [Threshold]: {{ index $alert.Annotations "value" }}
    [Details]: {{ index $alert.Annotations "description" }}
    [Triggered at]: {{ (.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
  {{ end }}{{- end }}
{{- if gt (len .Alerts.Resolved) 0 -}}
  {{ range $i, $alert := .Alerts.Resolved }}
    [Alert]: {{ index $alert.Labels "alertname" }}
    [Instance]: {{ index $alert.Labels "instance" }}
    [Status]: resolved
    [Started at]: {{ (.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
    [Resolved at]: {{ (.EndsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
  {{ end }}{{- end }}
{{- end }}
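Adjust the template to your own taste. ".EndsAt.Add 28800e9" adds 8 hours to the UTC time (28800e9 nanoseconds is 28,800 seconds), because Prometheus and Alertmanager both use UTC by default. In addition, set the owner and group of these two files to 65534, otherwise the webhook container has no permission to read them.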
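To make the offset arithmetic concrete, here is a minimal Python sketch of what the template expression does to a timestamp. The helper name to_utc8 and the sample timestamp are illustrative assumptions, not part of the webhook project; only the constant 28800e9 comes from the template above.

```python
from datetime import datetime, timedelta, timezone

# Constant used in the template; Go durations are counted in nanoseconds.
OFFSET_NS = 28800e9


def to_utc8(ts: str) -> str:
    """Shift an RFC 3339 UTC timestamp by OFFSET_NS and format it like the template."""
    utc = datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
    shifted = utc + timedelta(seconds=OFFSET_NS / 1e9)
    return shifted.strftime("%Y-%m-%d %H:%M:%S")


offset_hours = OFFSET_NS / 1e9 / 3600
print(offset_hours)                     # 8.0
print(to_utc8("2020-06-01T18:30:00Z"))  # 2020-06-02 02:30:00
```

For a different time zone, the same reasoning applies: multiply the offset in hours by 3600e9 and use that value in the template.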
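1.2 The second approach (recommended) mounts the configuration file and template from a ConfigMap. Modify the original dingtalk-webhook.yaml so that the volume is backed by a ConfigMap instead of a PVC.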
apiVersion: v1
kind: ConfigMap
metadata:
  name: dingtalk-config
  namespace: monitoring
data:
  config.yaml: |
    templates:
      - /config/template.tmpl
    targets:
      webhook1:
        url: https://oapi.dingtalk.com/robot/send?access_token=your_dingding_token
  template.tmpl: |
    {{ define "ding.link.title" }}[Monitoring Alert]{{ end }}

    {{ define "ding.link.content" -}}
    {{- if gt (len .Alerts.Firing) 0 -}}
      {{ range $i, $alert := .Alerts.Firing }}
        [Alert]: {{ index $alert.Labels "alertname" }}
        [Instance]: {{ index $alert.Labels "instance" }}
        [Severity]: {{ index $alert.Labels "severity" }}
        [Threshold]: {{ index $alert.Annotations "value" }}
        [Details]: {{ index $alert.Annotations "description" }}
        [Triggered at]: {{ (.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
      {{ end }}{{- end }}
    {{- if gt (len .Alerts.Resolved) 0 -}}
      {{ range $i, $alert := .Alerts.Resolved }}
        [Alert]: {{ index $alert.Labels "alertname" }}
        [Instance]: {{ index $alert.Labels "instance" }}
        [Status]: resolved
        [Started at]: {{ (.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
        [Resolved at]: {{ (.EndsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
      {{ end }}{{- end }}
    {{- end }}
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: dingding-webhook
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: dingtalk
  replicas: 1
  template:
    metadata:
      labels:
        app: dingtalk
    spec:
      restartPolicy: Always
      containers:
      - name: dingtalk
        image: timonwong/prometheus-webhook-dingtalk:v1.4.0
        imagePullPolicy: IfNotPresent
        args:
        - '--web.enable-ui'
        - '--web.enable-lifecycle'
        - '--config.file=/config/config.yaml'
        ports:
        - containerPort: 8060
          protocol: TCP
        volumeMounts:
        - name: dingtalk-config
          mountPath: "/config"
        resources:
          limits:
            cpu: 100m
            memory: 100Mi
          requests:
            cpu: 100m
            memory: 100Mi
      volumes:
      - name: dingtalk-config
        configMap:
          name: dingtalk-config
---
apiVersion: v1
kind: Service
metadata:
  # The host in the Alertmanager webhook URL must match this Service name.
  # The Alertmanager config in step 2 uses "webhook-dingtalk", so either
  # rename this Service or change that URL to http://dingding-webhook/...
  name: dingding-webhook
  namespace: monitoring
spec:
  ports:
  - port: 80
    protocol: TCP
    targetPort: 8060
  selector:
    app: dingtalk
  sessionAffinity: None
2. Modify the default Alertmanager configuration to add webhook_configs. Edit kube-prometheus-master/manifests/alertmanager-secret.yaml so that it reads as follows:
apiVersion: v1
data: {}
kind: Secret
metadata:
  name: alertmanager-main
  namespace: monitoring
stringData:
  alertmanager.yaml: |-
    "global":
      "resolve_timeout": "5m"
    "inhibit_rules":
    - "equal":
      - "namespace"
      - "alertname"
      "source_match":
        "severity": "critical"
      "target_match_re":
        "severity": "warning|info"
    - "equal":
      - "namespace"
      - "alertname"
      "source_match":
        "severity": "warning"
      "target_match_re":
        "severity": "info"
    "receivers":
    - "name": "www.zhongjima.net"
    #- "name": "Watchdog"
    #- "name": "Critical"
    #- "name": "webhook"
      "webhook_configs":
      - "url": "http://webhook-dingtalk/dingtalk/webhook1/send"
        "send_resolved": true
    "route":
      "group_by":
      - "namespace"
      "group_interval": "5m"
      "group_wait": "30s"
      "receiver": "www.zhongjima.net"
      "repeat_interval": "12h"
      #"routes":
      #- "match":
      #    "alertname": "Watchdog"
      #  "receiver": "Watchdog"
      #- "match":
      #    "severity": "critical"
      #  "receiver": "Critical"
Once all the YAML files are ready, apply them (dingding-pvc.yaml is only needed for the first, PVC-based approach):
kubectl apply -f dingding-pvc.yaml
kubectl apply -f dingtalk-webhook.yaml
kubectl apply -f alertmanager-secret.yaml
Check the output of the commands.
Then open the Alertmanager status page (replace alertmanager.amd5.cn with your own address) to confirm that the webhook_configs setting has taken effect: http://alertmanager.amd5.cn/#/status.
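The webhook path can also be exercised by hand, without waiting for a real alert, by posting a payload shaped like the one Alertmanager sends to webhook receivers. The following is a minimal sketch under stated assumptions: the helper names (webhook_url, build_test_payload, send) and all label and annotation values are made up for illustration, and the Service name only resolves from inside the cluster (from another namespace, webhook-dingtalk.monitoring would be needed).

```python
import json
import urllib.request
from datetime import datetime, timezone


def webhook_url(service: str = "webhook-dingtalk", target: str = "webhook1") -> str:
    """Build the URL used in webhook_configs: Service name plus target name from config.yaml."""
    return f"http://{service}/dingtalk/{target}/send"


def build_test_payload() -> dict:
    """Return a hand-made payload in the Alertmanager webhook format (version 4)."""
    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    labels = {
        "alertname": "ManualTest",   # hypothetical alert name
        "instance": "10.0.0.1:8080",  # hypothetical instance
        "severity": "warning",
    }
    annotations = {"description": "manual webhook test", "value": "141"}
    return {
        "version": "4",
        "groupKey": "manual-test",
        "status": "firing",
        "receiver": "www.zhongjima.net",
        "groupLabels": {},
        "commonLabels": labels,
        "commonAnnotations": annotations,
        "externalURL": "",
        "alerts": [
            {
                "status": "firing",
                "labels": labels,
                "annotations": annotations,
                "startsAt": now,
                # A firing alert has no end time yet; this is the zero timestamp.
                "endsAt": "0001-01-01T00:00:00Z",
                "generatorURL": "",
            }
        ],
    }


def send(payload: dict, url: str) -> int:
    """POST the payload as JSON and return the HTTP status code."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status


# Run from a pod in the monitoring namespace:
# print(send(build_test_payload(), webhook_url()))
```

If the call succeeds and the template renders, the robot should post a message containing the fields from the template; if the Service was named dingding-webhook instead, pass that name to webhook_url.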
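3. Once the configuration is active, add an alert rule and wait for its threshold to be crossed to test alerting.
Edit kube-prometheus-master/manifests/prometheus-rules.yaml and append the following at the end. Pay attention to the indentation.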
  - name: prometheus-operator
    rules:
    - alert: PrometheusOperatorReconcileErrors
      annotations:
        message: Errors while reconciling {{ $labels.controller }} in {{ $labels.namespace }} Namespace.
      expr: |
        rate(prometheus_operator_reconcile_errors_total{job="prometheus-operator",namespace="monitoring"}[5m]) > 0.1
      for: 10m
      labels:
        severity: warning
    - alert: PrometheusOperatorNodeLookupErrors
      annotations:
        message: Errors while reconciling Prometheus in {{ $labels.namespace }} Namespace.
      expr: |
        rate(prometheus_operator_node_address_lookup_errors_total{job="prometheus-operator",namespace="monitoring"}[5m]) > 0.1
      for: 10m
      labels:
        severity: warning
  # The test alert rule added for this article starts here
  - name: www.zhongjima.net
    rules:
    - alert: 'DingTalk alert test'
      expr: |
        jvm_threads_live > 140
      for: 1m
      labels:
        severity: 'warning'
      annotations:
        summary: "{{ $labels.instance }}: DingTalk alert test"
        description: "{{ $labels.instance }}: DingTalk alert test"
        custom: "DingTalk alert test"
        value: "{{$value}}"
Then apply the updated rules:
kubectl apply -f prometheus-rules.yaml
Then open the Prometheus alerts page at http://prometheus.amd5.cn/alerts to check that the rule has been loaded.
Once the condition has held for the duration set in the rule, DingTalk receives the alert.
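The waiting period comes from the "for: 1m" clause: the alert first becomes pending when the expression is true, and only fires after it has stayed true for the whole hold duration. The sketch below is a simplified model of that behaviour, not Prometheus code; the function name alert_states and the sample values are assumptions made for illustration, with the threshold 140 and the 60-second hold taken from the test rule.

```python
def alert_states(samples, threshold=140, for_seconds=60):
    """Return the alert state at each evaluation.

    samples is a list of (time_in_seconds, value) pairs, one per rule evaluation.
    """
    states = []
    active_since = None  # time at which the condition first became true
    for t, value in samples:
        if value > threshold:
            if active_since is None:
                active_since = t
            # Fire only once the condition has held for the full duration.
            state = "firing" if t - active_since >= for_seconds else "pending"
        else:
            # Any evaluation below the threshold resets the timer.
            active_since = None
            state = "inactive"
        states.append(state)
    return states


samples = [(0, 120), (30, 150), (60, 155), (90, 160), (120, 130)]
print(alert_states(samples))
# ['inactive', 'pending', 'pending', 'firing', 'inactive']
```

In this model a short spike above 140 threads that ends before a minute has passed never reaches DingTalk, which is the reason for setting a hold duration on noisy metrics.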