Configuring DingTalk Alerts for Prometheus Operator on a Kubernetes Cluster

September 29, 2020, 17:24:02

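I recently migrated our Prometheus monitoring into a Kubernetes cluster, following the deployment guide "Using Prometheus Operator Auto-Discovery to Monitor SpringBoot in Kubernetes" (《Kubernetes環境使用Prometheus Operator自發現監控SpringBoot》). Metric collection and the Grafana dashboards all tested fine, so the next step was to migrate and test alerting. Alertmanager has no native DingTalk support, so alerts have to go through a webhook. Fortunately someone has already open-sourced a DingTalk webhook for Prometheus (project: https://github.com/timonwong/prometheus-webhook-dingtalk), so we only need to configure and use it.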

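Creating a DingTalk robot is simple, so I will not cover it here. Once the robot exists, the next step is to deploy the webhook, which receives alerts from Alertmanager, formats them, and forwards them to the DingTalk robot. Deploying it outside Kubernetes is also simple: write a docker-compose file and run it.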

1. In a Kubernetes cluster, pods reach each other through a Service, so first write a Kubernetes manifest, dingtalk-webhook.yaml, that contains both a Deployment and a Service.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: webhook-dingtalk
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: dingtalk
  replicas: 1
  template:
    metadata:
      labels:
        app: dingtalk
    spec:
      restartPolicy: Always
      containers:
      - name: dingtalk
        image: timonwong/prometheus-webhook-dingtalk:v1.4.0
        imagePullPolicy: IfNotPresent
        args:
          - '--web.enable-ui'
          - '--web.enable-lifecycle'
          - '--config.file=/config/config.yaml'
        ports:
        - containerPort: 8060
          protocol: TCP
        volumeMounts:
        - mountPath: "/config"
          name: dingtalk-volume
        resources:
          limits:
            cpu: 100m
            memory: 100Mi
          requests:
            cpu: 100m
            memory: 100Mi
      volumes:
      - name: dingtalk-volume
        persistentVolumeClaim:
          claimName: dingding-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: webhook-dingtalk
  namespace: monitoring
spec:
  ports:
  - port: 80
    protocol: TCP
    targetPort: 8060
  selector:
    app: dingtalk
  sessionAffinity: None

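1.1 The first approach uses persistent storage: config.yaml and the alert template are placed on shared storage, so the webhook can read them no matter which node it is scheduled on. For making data persistent with NFS, see "Dynamically Provisioning NFS PVs with a StorageClass in Kubernetes" (《Kubernetes使用StorageClass動態生成NFS類型的PV》).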

dingding-pvc.yaml:

kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: dingding-pvc
  annotations:
    volume.beta.kubernetes.io/storage-class: "atang-nfs"
  namespace: monitoring
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 50Mi

The configuration file, config.yaml:

templates:
  - /config/template.tmpl
targets:
  webhook1:
    url: https://oapi.dingtalk.com/robot/send?access_token=your_dingding_token

The alert template, template.tmpl:

{{ define "ding.link.title" }}[Monitoring alert]{{ end }}
{{ define "ding.link.content" -}}
{{- if gt (len .Alerts.Firing) 0 -}}
  {{ range $i, $alert := .Alerts.Firing }}
    [Alert]: {{ index $alert.Labels "alertname" }}
    [Instance]: {{ index $alert.Labels "instance" }}
    [Severity]: {{ index $alert.Labels "severity" }}
    [Value]: {{ index $alert.Annotations "value" }}
    [Description]: {{ index $alert.Annotations "description" }}
    [Fired at]: {{ (.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
  {{ end }}{{- end }}
{{- if gt (len .Alerts.Resolved) 0 -}}
  {{ range $i, $alert := .Alerts.Resolved }}
    [Alert]: {{ index $alert.Labels "alertname" }}
    [Instance]: {{ index $alert.Labels "instance" }}
    [Status]: resolved
    [Started at]: {{ (.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
    [Resolved at]: {{ (.EndsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
  {{ end }}{{- end }}
{{- end }}

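Adjust the template to your own taste. “.EndsAt.Add 28800e9” adds 8 hours to the UTC timestamp, because Prometheus and Alertmanager both use UTC by default. The owner and group of both files must also be set to 65534; otherwise the webhook container does not have permission to read them.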
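
To make the offset concrete: 28800e9 is a count of nanoseconds, because the argument of Add is a Go time.Duration, which is measured in nanoseconds. The sketch below checks that arithmetic and shows one possible way to set the ownership mentioned above. The helper name fix_config_owner and its directory argument are illustrative assumptions, and the chown call normally needs root on the machine that holds the shared directory.

```shell
# 28800e9 nanoseconds, written out in full.
offset_ns=28800000000000
# nanoseconds -> seconds -> hours
offset_hours=$(( offset_ns / 1000000000 / 3600 ))
echo "offset: ${offset_hours}h"   # prints "offset: 8h"

# Hypothetical helper: hand both files in directory $1 to UID/GID 65534.
fix_config_owner() {
  chown 65534:65534 "$1/config.yaml" "$1/template.tmpl"
}
```

For a different time zone, only the constant in the template changes (for example 3600e9 for UTC+1).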

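1.2 The second approach (recommended) mounts the configuration file and the template from a ConfigMap. This means changing the original dingtalk-webhook.yaml so that the volume comes from a ConfigMap instead of the PVC.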

apiVersion: v1
kind: ConfigMap
metadata:
  name: dingtalk-config
  namespace: monitoring
data:
  config.yaml: |
    templates:
      - /config/template.tmpl
    targets:
      webhook1:
        url: https://oapi.dingtalk.com/robot/send?access_token=your_dingding_token
  template.tmpl: |
    {{ define "ding.link.title" }}[Monitoring alert]{{ end }}
    {{ define "ding.link.content" -}}
    {{- if gt (len .Alerts.Firing) 0 -}}
      {{ range $i, $alert := .Alerts.Firing }}
        [Alert]: {{ index $alert.Labels "alertname" }}
        [Instance]: {{ index $alert.Labels "instance" }}
        [Severity]: {{ index $alert.Labels "severity" }}
        [Value]: {{ index $alert.Annotations "value" }}
        [Description]: {{ index $alert.Annotations "description" }}
        [Fired at]: {{ (.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
      {{ end }}{{- end }}
    {{- if gt (len .Alerts.Resolved) 0 -}}
      {{ range $i, $alert := .Alerts.Resolved }}
        [Alert]: {{ index $alert.Labels "alertname" }}
        [Instance]: {{ index $alert.Labels "instance" }}
        [Status]: resolved
        [Started at]: {{ (.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
        [Resolved at]: {{ (.EndsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
      {{ end }}{{- end }}
    {{- end }}
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: webhook-dingtalk
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: dingtalk
  replicas: 1
  template:
    metadata:
      labels:
        app: dingtalk
    spec:
      restartPolicy: Always
      containers:
      - name: dingtalk
        image: timonwong/prometheus-webhook-dingtalk:v1.4.0
        imagePullPolicy: IfNotPresent
        args:
          - '--web.enable-ui'
          - '--web.enable-lifecycle'
          - '--config.file=/config/config.yaml'
        ports:
        - containerPort: 8060
          protocol: TCP
        volumeMounts:
        - name: dingtalk-config
          mountPath: "/config"
        resources:
          limits:
            cpu: 100m
            memory: 100Mi
          requests:
            cpu: 100m
            memory: 100Mi
      volumes:
      - name: dingtalk-config
        configMap:
          name: dingtalk-config
---
apiVersion: v1
kind: Service
metadata:
  name: webhook-dingtalk  # must match the host name in Alertmanager's webhook url
  namespace: monitoring
spec:
  ports:
  - port: 80
    protocol: TCP
    targetPort: 8060
  selector:
    app: dingtalk
  sessionAffinity: None
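
If config.yaml and template.tmpl already exist as files (for example from the first approach), the ConfigMap part of this manifest does not have to be typed by hand. The following is a hedged sketch using kubectl's client-side dry run. The function name render_dingtalk_configmap, and the assumption that both files sit in the directory passed as its first argument, are illustrative and not part of the original setup.

```shell
# Print a ConfigMap manifest built from the two files in directory $1.
# --dry-run=client means nothing is created in the cluster.
render_dingtalk_configmap() {
  kubectl create configmap dingtalk-config \
    --namespace monitoring \
    --from-file=config.yaml="$1/config.yaml" \
    --from-file=template.tmpl="$1/template.tmpl" \
    --dry-run=client -o yaml
}
```

Running render_dingtalk_configmap . > dingtalk-configmap.yaml (the output file name is arbitrary) writes a manifest that can be reviewed and then applied with kubectl apply -f.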

2. Modify Alertmanager's default configuration to add webhook_configs. Replace the contents of kube-prometheus-master/manifests/alertmanager-secret.yaml with the following:

apiVersion: v1
data: {}
kind: Secret
metadata:
  name: alertmanager-main
  namespace: monitoring
stringData:
  alertmanager.yaml: |-
    "global":
      "resolve_timeout": "5m"
    "inhibit_rules":
    - "equal":
      - "namespace"
      - "alertname"
      "source_match":
        "severity": "critical"
      "target_match_re":
        "severity": "warning|info"
    - "equal":
      - "namespace"
      - "alertname"
      "source_match":
        "severity": "warning"
      "target_match_re":
        "severity": "info"
    "receivers":
    - "name": "www.zhongjima.net"
    #- "name": "Watchdog"
    #- "name": "Critical"
    #- "name": "webhook"
      "webhook_configs":
      - "url": "http://webhook-dingtalk/dingtalk/webhook1/send"
        "send_resolved": true 
    "route":
      "group_by":
      - "namespace"
      "group_interval": "5m"
      "group_wait": "30s"
      "receiver": "www.zhongjima.net"
      "repeat_interval": "12h"
      #"routes":
      #- "match":
      #    "alertname": "Watchdog"
      #  "receiver": "Watchdog"
      #- "match":
      #    "severity": "critical"
      #  "receiver": "Critical"
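
The url under webhook_configs is assembled from two names defined earlier: the Service name (webhook-dingtalk, which resolves by its short name because Alertmanager runs in the same monitoring namespace, and which listens on port 80, so no port is needed) and the target key from config.yaml (webhook1). A small sketch of that mapping, where the function name webhook_url is an illustrative assumption:

```shell
# Build the Alertmanager webhook URL from the Service name ($1) and the
# key under "targets:" in config.yaml ($2).
webhook_url() {
  printf 'http://%s/dingtalk/%s/send' "$1" "$2"
}

url=$(webhook_url webhook-dingtalk webhook1)
echo "$url"   # prints "http://webhook-dingtalk/dingtalk/webhook1/send"
```

If either the Service or the target is renamed, the url in alertmanager-secret.yaml has to change with it.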

Once all the YAML files are ready, run:

kubectl apply -f dingding-pvc.yaml   # only needed for the PVC approach (1.1)
kubectl apply -f dingtalk-webhook.yaml
kubectl apply -f alertmanager-secret.yaml

Then check the result.


Next, open the Alertmanager status page (replace alertmanager.amd5.cn with your own address) and check that webhook_configs has taken effect: http://alertmanager.amd5.cn/#/status.

3. Once the configuration is live, add an alert rule and wait for its threshold to be crossed in order to test alerting.

Edit kube-prometheus-master/manifests/prometheus-rules.yaml directly and append the following at the end, paying attention to indentation.

  - name: prometheus-operator
    rules:
    - alert: PrometheusOperatorReconcileErrors
      annotations:
        message: Errors while reconciling {{ $labels.controller }} in {{ $labels.namespace
          }} Namespace.
      expr: |
        rate(prometheus_operator_reconcile_errors_total{job="prometheus-operator",namespace="monitoring"}[5m]) > 0.1
      for: 10m
      labels:
        severity: warning
    - alert: PrometheusOperatorNodeLookupErrors
      annotations:
        message: Errors while reconciling Prometheus in {{ $labels.namespace }} Namespace.
      expr: |
        rate(prometheus_operator_node_address_lookup_errors_total{job="prometheus-operator",namespace="monitoring"}[5m]) > 0.1
      for: 10m
      labels:
        severity: warning
        
  # The test alert rule added for this article
  - name: www.zhongjima.net
    rules:
    - alert: 'DingTalkAlertTest'
      expr: |
        jvm_threads_live > 140
      for: 1m
      labels:
        severity: 'warning'
      annotations:
        summary: "{{ $labels.instance }}: DingTalk alert test"
        description: "{{ $labels.instance }}: DingTalk alert test"
        custom: "DingTalk alert test"
        value: "{{$value}}"

Then apply the updated rules:

kubectl apply -f prometheus-rules.yaml

Then open the Prometheus alerts page at http://prometheus.amd5.cn/alerts (again, use your own address) to check that the rule has taken effect.


Once the condition has held for the duration set in the rule, DingTalk receives the alert.
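
To exercise the webhook without waiting for a rule to fire, a hand-written notification can be posted to it. The sketch below assumes that a port-forward such as kubectl -n monitoring port-forward svc/webhook-dingtalk 8060:80 is running in another terminal and that curl is installed. The payload is a minimal hand-made approximation of the JSON Alertmanager sends to webhooks; the function name send_test_alert, the alert name, the instance value and the timestamps are all made up for illustration.

```shell
# Minimal Alertmanager-style webhook payload with a single firing alert.
payload='{
  "version": "4",
  "status": "firing",
  "receiver": "www.zhongjima.net",
  "groupLabels": {"namespace": "monitoring"},
  "commonLabels": {},
  "commonAnnotations": {},
  "externalURL": "",
  "alerts": [{
    "status": "firing",
    "labels": {"alertname": "ManualTest", "instance": "test:8080", "severity": "warning"},
    "annotations": {"value": "141", "description": "manual webhook test"},
    "startsAt": "2020-09-29T09:00:00Z",
    "endsAt": "0001-01-01T00:00:00Z"
  }]
}'

# POST the payload to the webhook1 target; $1 is the base URL and
# defaults to the port-forward address assumed above.
send_test_alert() {
  curl -sS -X POST -H 'Content-Type: application/json' \
    --data "$payload" "${1:-http://localhost:8060}/dingtalk/webhook1/send"
}
```

Calling send_test_alert with no argument posts to the forwarded port. Whether the message actually shows up in the chat also depends on the robot's own security settings in DingTalk.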

