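I deployed vmagent mainly so that our clusters could integrate with the monitoring team's platform.

vmagent official documentation: https://docs.victoriametrics.com/vmagent.html

Prometheus official scrape_config configuration reference

Using vmagent instead of Prometheus to collect monitoring metrics: https://mp.weixin.qq.com/s/jGf1L-8c8id8umB72b3AsQ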

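I. What is vmagent

The following is quoted from the official documentation:

vmagent is a tiny but mighty agent that helps you collect metrics from various sources and store them in VictoriaMetrics or any other Prometheus-compatible storage system that supports the remote_write protocol.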

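Features of vmagent

  • Can be used as a drop-in replacement for Prometheus for scraping targets such as node_exporter.
  • Can read data from Kafka. See these docs.
  • Can write data to Kafka. See these docs.
  • Can add, remove and modify labels via Prometheus relabeling, and can filter data before sending it to remote storage. See these docs for details.
  • Accepts data via all the ingestion protocols supported by VictoriaMetrics. See these docs.
  • Can replicate collected metrics to multiple remote storage systems simultaneously.
  • Works smoothly in environments with an unstable connection to remote storage. If the remote storage is unavailable, the collected metrics are buffered at -remoteWrite.tmpDataPath. The buffered metrics are sent to remote storage as soon as the connection is repaired. The maximum disk usage for the buffer can be limited with -remoteWrite.maxDiskUsagePerURL.
  • Uses less RAM, CPU, disk IO and network bandwidth than Prometheus.
  • Scrape targets can be spread across multiple vmagent instances when a large number of targets must be scraped. See these docs.
  • Can efficiently scrape targets that expose millions of time series, such as the /federate endpoint in Prometheus. See these docs.
  • Can deal with high cardinality and high churn rate by limiting the number of unique time series at scrape time and before sending them to remote storage systems. See these docs.
  • Can load scrape configs from multiple files. See these docs.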
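
The sharding and cardinality-limiting features above can be sketched as container args. This is a minimal sketch rather than a tested setup: the flag names are taken from the vmagent documentation, while the member count and the series limit are illustrative assumptions.

```yaml
# Hypothetical args for the first of two vmagent instances (memberNum is 0-based).
args:
  - -promscrape.config=/config/scrape.yml
  # Spread scrape targets across 2 instances; this instance is member 0.
  - -promscrape.cluster.membersCount=2
  - -promscrape.cluster.memberNum=0
  # Assumed limit: at most 1,000,000 unique series sent to remote storage per hour.
  - -remoteWrite.maxHourlySeries=1000000
```

The second instance would use -promscrape.cluster.memberNum=1 with otherwise identical args.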

II. Architecture diagram

  • vmagent collects metrics from node-exporter, kubernetes-cadvisor and kube-state-metrics.
  • After collection, vmagent sends the metrics to the monitoring team's VictoriaMetrics.

III. Deployment manifests

namespace.yml

apiVersion: v1
kind: Namespace
metadata:
  name: sbux-monitoring

serviceaccount.yml

apiVersion: v1
kind: ServiceAccount
metadata:
  name: vmagent
  namespace: sbux-monitoring

clusterrole.yml

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: vmagent
rules:
  # endpointslices belong to the discovery.k8s.io API group
  - apiGroups: ["", "networking.k8s.io", "extensions", "discovery.k8s.io"]
    resources:
      - nodes
      - nodes/metrics
      - services
      - endpoints
      - endpointslices
      - pods
    verbs: ["get", "list", "watch"]
  - apiGroups: [""]
    resources:
      - namespaces
      - configmaps
    verbs: ["get"]
  - nonResourceURLs: ["/metrics", "/metrics/resources"]
    verbs: ["get"]

clusterrolebinding.yml

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: vmagent
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: vmagent
subjects:
  - kind: ServiceAccount
    name: vmagent
    namespace: sbux-monitoring

configmap-vmagent.yml

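This is the vmagent configuration file. For our requirements it needs to scrape node-exporter, cadvisor and kube-state-metrics.

global.external_labels: set this to the name of each cluster.

scrape_timeout: I set this to 60 seconds. In practice, scraping kube-state-metrics is sometimes slow. Besides raising this timeout, you should also raise the CPU and memory allocated to the kube-state-metrics pod.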

apiVersion: v1
kind: ConfigMap
metadata:
  name: vmagent-config
  namespace: sbux-monitoring
data:
  scrape.yml: |
    global:
      scrape_interval: 30s
      scrape_timeout: 60s
      external_labels:
        cluster: gds-poc
    scrape_configs:
    - job_name: 'vmagent'
      static_configs:
        - targets: ['vmagent:8429']
    - job_name: 'node-exporter'
      kubernetes_sd_configs:
      - role: endpoints
      relabel_configs:
      - source_labels: [__meta_kubernetes_endpoints_name]
        regex: node-exporter
        action: keep
      - source_labels: [__meta_kubernetes_namespace]
        regex: kube-system
        action: keep
      - source_labels: [__meta_kubernetes_pod_node_name]
        target_label: cps_node_name

    - job_name: 'kubernetes-cadvisor'

      scheme: https
      tls_config:
        ca_file: /secrets/kubelet/ca
        key_file: /secrets/kubelet/key
        cert_file: /secrets/kubelet/cert
      metrics_path: /metrics/cadvisor

      kubernetes_sd_configs:
      - role: node

      relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      - source_labels: [__meta_kubernetes_node_name]
        regex: (.+)
        target_label: cps_node_name
        replacement: $1

    - job_name: 'kube-state-metrics'

      kubernetes_sd_configs:
      - role: endpoints

      relabel_configs:
      - source_labels: [__meta_kubernetes_endpoints_name]
        action: keep
        regex: kube-state-metrics
      - source_labels: [__meta_kubernetes_namespace]
        action: keep
        regex: kube-system
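
Because only kube-state-metrics is slow to scrape, the longer timeout could be set on that job alone instead of globally. The sketch below assumes the per-job scrape_interval and scrape_timeout fields of the Prometheus scrape_config format; the 60s values are illustrative. Prometheus rejects a scrape_timeout greater than scrape_interval, and vmagent may cap the timeout at the interval, so both are raised together here.

```yaml
scrape_configs:
- job_name: 'kube-state-metrics'
  # Per-job overrides; all other jobs keep the global values.
  scrape_interval: 60s
  scrape_timeout: 60s
  kubernetes_sd_configs:
  - role: endpoints
  relabel_configs:
  - source_labels: [__meta_kubernetes_endpoints_name]
    action: keep
    regex: kube-state-metrics
  - source_labels: [__meta_kubernetes_namespace]
    action: keep
    regex: kube-system
```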

secret_kubelet.yml

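The cadvisor job uses TLS, so the kubelet certificates must be mounted into the vmagent pod.

# Export kubelet-client-tls-secret from the kube-system namespace.
# Afterwards, change the namespace field and delete fields such as creationTimestamp, resourceVersion, selfLink and uid.
~]$ kubectl -n kube-system get secret kubelet-client-tls-secret -oyaml > secret_kubelet.yml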
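
After those edits, the manifest should look roughly like the sketch below. The data keys ca, cert and key are inferred from the file paths used in tls_config (/secrets/kubelet/ca and so on); the values are placeholders, and the type line should be kept exactly as exported.

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: kubelet-client-tls-secret
  # Changed from kube-system so the vmagent pod can mount it.
  namespace: sbux-monitoring
# Keep the "type" line exactly as it appears in the exported secret.
data:
  ca: <base64 value from the export>
  cert: <base64 value from the export>
  key: <base64 value from the export>
```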

deployment.yml

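Runtime flags:

  • -promscrape.config=/config/scrape.yml: path to the vmagent configuration file, matching the volumeMounts field.
  • -remoteWrite.tmpDataPath=/tmpData, -remoteWrite.maxDiskUsagePerURL=10GB: buffers metrics temporarily in /tmpData and caps the buffer at 10GB per remote-write URL. In production this directory should be persisted.
  • -remoteWrite.url=http://victoriametrics.victoriametrics:8428/api/v1/write, -remoteWrite.url=https://prometheus-vminsert.xxxxxx.net/insert/0/prometheus: two remote-write targets. The first is my own test VictoriaMetrics; the second is the monitoring team's VictoriaMetrics.
  • -remoteWrite.tlsInsecureSkipVerify=true: required because the HTTPS certificate of the remote-write endpoint is self-signed. For production, adding basic auth is recommended for better security.
  • -promscrape.maxScrapeSize=50MB: the maximum size of a scrape response. In real tests, when the cluster runs many applications, the response can exceed 20MB.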
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vmagent
  namespace: sbux-monitoring
  labels:
    app: vmagent
spec:
  selector:
    matchLabels:
      app: vmagent
  template:
    metadata:
      labels:
        app: vmagent
    spec:
      serviceAccountName: vmagent
      containers:
        - name: vmagent
          image: "registry.xxxxxx.net/library/vmagent:v1.77.1"
          imagePullPolicy: IfNotPresent
          args:
            - -promscrape.config=/config/scrape.yml
            - -remoteWrite.tmpDataPath=/tmpData
            - -promscrape.maxScrapeSize=50MB
            - -remoteWrite.maxDiskUsagePerURL=10GB
            - -remoteWrite.url=http://victoriametrics.victoriametrics:8428/api/v1/write
            - -remoteWrite.url=https://prometheus-vminsert.xxxxxxcf.net/insert/0/prometheus
            - -remoteWrite.tlsInsecureSkipVerify=true
            - -envflag.enable=true
            - -envflag.prefix=VM_
            - -loggerFormat=json
          ports:
            - name: http
              containerPort: 8429
          resources:
            limits:
              cpu: "4"
              memory: 8Gi
            requests:
              cpu: "1"
              memory: 1Gi
          volumeMounts:
            - name: config
              mountPath: /config
            - name: kubelet-client-tls-secret
              mountPath: /secrets/kubelet
      volumes:
        - name: config
          configMap:
            name: vmagent-config
        - name: kubelet-client-tls-secret
          secret:
            defaultMode: 420
            optional: true
            secretName: kubelet-client-tls-secret
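
For the two production recommendations in the flag list (persist /tmpData and add basic auth), the Deployment could be extended as sketched below. The PVC name vmagent-tmpdata, the Secret name remote-write-auth and the username are hypothetical. The -remoteWrite.basicAuth.* flags apply per -remoteWrite.url, so with several URLs check the vmagent docs for how values are matched to URLs. The passwordFile flag assumes a vmagent version that supports it; -remoteWrite.basicAuth.password is the inline alternative.

```yaml
# Fragment of spec.template.spec; merge into the existing Deployment.
containers:
  - name: vmagent
    args:
      - -remoteWrite.tmpDataPath=/tmpData
      - -remoteWrite.basicAuth.username=vmagent   # hypothetical user
      - -remoteWrite.basicAuth.passwordFile=/secrets/remote-write/password
    volumeMounts:
      - name: tmpdata
        mountPath: /tmpData
      - name: remote-write-auth
        mountPath: /secrets/remote-write
volumes:
  - name: tmpdata
    persistentVolumeClaim:
      claimName: vmagent-tmpdata      # hypothetical PVC, created separately
  - name: remote-write-auth
    secret:
      secretName: remote-write-auth   # hypothetical Secret with a "password" key
```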

service.yml

---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: vmagent
  name: vmagent
  namespace: sbux-monitoring
spec:
  ports:
  - name: http-8429
    port: 8429
    protocol: TCP
    targetPort: 8429
  selector:
    app: vmagent
  type: ClusterIP

IV. Problems encountered

1. kube-state-metrics metrics could not be collected, and the vmagent log showed this error:

{"ts":"2022-05-30T13:20:16.594Z","level":"error","caller":"VictoriaMetrics/lib/promscrape/scrapework.go:355","msg":"error when scraping \"http://192.168.154.27:8080/metrics\" from job \"kube-state-metrics\" with labels {cluster=\"prod-azure\",instance=\"192.168.154.27:8080\",job=\"kube-state-metrics\"}: cannot read Prometheus exposition data: cannot read a block of data in 0.000s: the response from \"http://192.168.154.27:8080/metrics\" exceeds -promscrape.maxScrapeSize=16777216; either reduce the response size for the target or increase -promscrape.maxScrapeSize"}

Solution: increase -promscrape.maxScrapeSize.
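
Since the Deployment enables -envflag.enable=true with -envflag.prefix=VM_, the limit could also be set through an environment variable rather than an arg. This sketch assumes the VictoriaMetrics convention that dots in a flag name become underscores in the variable name, and that the -promscrape.maxScrapeSize arg is removed from args so the two settings do not compete.

```yaml
# Fragment of the vmagent container spec.
env:
  - name: VM_promscrape_maxScrapeSize   # -promscrape.maxScrapeSize with the VM_ prefix
    value: "50MB"
```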

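2. Scraping kube-state-metrics timed out

Solution: kube-state-metrics was under-provisioned, so raise its CPU and memory. Also set the scrape timeout in the vmagent configuration file to 60s.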
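
Raising the kube-state-metrics resources might look like the fragment below. The numbers are illustrative assumptions, not measured values from this cluster; size them from the pod's actual usage.

```yaml
# Fragment of the kube-state-metrics container spec (hypothetical values).
resources:
  requests:
    cpu: "500m"
    memory: 512Mi
  limits:
    cpu: "2"
    memory: 2Gi
```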

3. Dashboard

  • vmagent: dashboard ID 12683

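V. Resource usage

1 CPU core and 1 GB of memory is basically enough for a cluster of this size:

  • Nodes: 400
  • Pods: 4500 (790 of them are business pods; the rest are system components and the like)
  • Deployments: 191 (131 of them are business applications; the rest are system components)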