Reference: https://mp.weixin.qq.com/s/gXffcNzixAiTKSBZcf2sBA

Final result:

Everything below is deployed with Docker.

1. Deploy Prometheus

This is the default Prometheus configuration file:

[root@localhost prometheus]# cat prometheus.yml 
# my global config
global:
  scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
    - targets: ['localhost:9090']
[root@localhost prometheus]# docker run -d --name prometheus -p 9090:9090 -v ${PWD}:/etc/prometheus  prom/prometheus:v2.25.0

Open port 9090 in a browser to verify that Prometheus is running.

2. Deploy Grafana

[root@localhost ~]# docker run -d --name=grafana -p 3000:3000    grafana/grafana:7.2.2

Open port 3000 in a browser and add Prometheus as a data source.
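
The data source can also be provisioned from a file instead of through the web UI. The following is a minimal sketch, assuming Grafana's file-based provisioning directory /etc/grafana/provisioning/datasources. The PROM_URL variable, its placeholder value, and the local output path are hypothetical and need to be adapted to wherever Prometheus is actually reachable from the Grafana container:

```shell
# Hypothetical placeholder: URL at which the Grafana container can reach Prometheus.
PROM_URL="${PROM_URL:-http://prometheus.example:9090}"

# Write a data source provisioning file under the current directory.
mkdir -p provisioning/datasources
cat > provisioning/datasources/prometheus.yml <<EOF
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: ${PROM_URL}
    isDefault: true
EOF

# To use it, mount the directory when starting Grafana, for example:
#   docker run -d --name=grafana -p 3000:3000 \
#     -v "$(pwd)/provisioning/datasources:/etc/grafana/provisioning/datasources" \
#     grafana/grafana:7.2.2
```

With this file in place, the Prometheus data source should already exist on first login, so the dashboard import later on can select it directly.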

3. Deploy blackbox-exporter

Blackbox_exporter is an official Prometheus component. GitHub: https://github.com/prometheus/blackbox_exporter

The configuration file uses the official defaults; for more options, see the official example.yml:

[root@localhost blackbox-exporter]# cat blackbox.yml 
modules:
  http_2xx:  # HTTP probe module. Every probe in Blackbox Exporter is configured as a module.
    prober: http
    timeout: 10s
    http:
      valid_http_versions: ["HTTP/1.1", "HTTP/2"]   
      valid_status_codes: [200]  # Best to set an explicit status code here; it shows up clearly in Grafana graphs (note by Chen Gang).
      method: GET
      preferred_ip_protocol: "ip4"
  http_post_2xx: # HTTP POST probe module
    prober: http
    timeout: 10s
    http:
      valid_http_versions: ["HTTP/1.1", "HTTP/2"]
      method: POST
      preferred_ip_protocol: "ip4"
  tcp_connect:  # TCP probe module
    prober: tcp
    timeout: 10s
  dns:  # DNS probe module
    prober: dns
    dns:
      transport_protocol: "tcp"  # default is udp
      preferred_ip_protocol: "ip4"  # default is ip6
      query_name: "kubernetes.default.svc.cluster.local"
[root@localhost blackbox-exporter]# docker run  -d -p 9115:9115 --name blackbox_exporter -v `pwd`:/config prom/blackbox-exporter:master --config.file=/config/blackbox.yml

Open port 9115 in a browser to verify that the exporter is running.
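
Besides opening the page in a browser, a probe can be triggered by hand through the exporter's /probe endpoint. A minimal sketch, assuming the exporter is reachable on localhost:9115 as published above; the probe_url helper is hypothetical and only builds the URL:

```shell
# Hypothetical helper: build the /probe URL for an exporter, a module and a target.
probe_url() {
  # $1 = exporter host:port, $2 = module name from blackbox.yml, $3 = target to probe
  printf 'http://%s/probe?module=%s&target=%s\n' "$1" "$2" "$3"
}

# Print the URL for the http_2xx module against one example target.
probe_url localhost:9115 http_2xx https://prometheus.io

# To actually run the probe (needs the container started above):
#   curl -s "$(probe_url localhost:9115 http_2xx https://prometheus.io)" | grep probe_success
```

A probe_success value of 1 in the response means the module's checks passed for that target.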

4. Add a job to the Prometheus configuration to collect blackbox data

This block is copied from the official documentation:

[root@localhost prometheus]# tail -17 prometheus.yml 
  - job_name: 'blackbox'
    metrics_path: /probe
    params:
      module: [http_2xx]  # Look for a HTTP 200 response.
    static_configs:
      - targets:
        - http://prometheus.io    # Target to probe with http.
        - https://prometheus.io   # Target to probe with https.
        - https://jd.com # Target to probe with https.
        - https://www.bejson.com # Target to probe with https.
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 172.17.0.3:9115  # The blackbox exporter's real hostname:port.
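
The three relabel steps above are easier to follow when traced for a single target. The sketch below is plain shell that only mimics the label rewriting; it does not talk to Prometheus, and the variable names are illustrative stand-ins for the labels:

```shell
# Trace the relabel_configs for one target from static_configs.
address='https://jd.com'        # __address__ as listed under targets
param_target=$address           # step 1: copy __address__ into __param_target
instance=$param_target          # step 2: copy __param_target into instance
address='172.17.0.3:9115'       # step 3: overwrite __address__ with the exporter address

# The request Prometheus ends up making, and the instance label on the result.
echo "scrape: http://$address/probe?module=http_2xx&target=$param_target"
echo "instance label: $instance"
```

The net effect is that Prometheus scrapes the exporter, while the instance label still shows the site being probed.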

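The lifecycle API is not enabled, so curl -X POST http://127.0.0.1:9090/-/reload does not work. The configuration has to be reloaded manually by sending SIGHUP to the Prometheus process: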

[root@localhost prometheus]# docker exec -it prometheus kill -1 1
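
If the HTTP reload endpoint is preferred, Prometheus has to be started with --web.enable-lifecycle. A sketch under the assumption that the same image and mount as in section 1 are used. Passing any argument replaces the image's default arguments, so --config.file and --storage.tsdb.path are given explicitly. The two function names are hypothetical, and the functions are only defined here, not run:

```shell
# Start Prometheus with the lifecycle API enabled (remove the old container first).
run_prometheus_with_reload() {
  docker run -d --name prometheus -p 9090:9090 \
    -v "${PWD}:/etc/prometheus" \
    prom/prometheus:v2.25.0 \
    --config.file=/etc/prometheus/prometheus.yml \
    --storage.tsdb.path=/prometheus \
    --web.enable-lifecycle
}

# Reload the configuration over HTTP instead of sending a signal.
reload_prometheus() {
  curl -X POST http://127.0.0.1:9090/-/reload
}
```

Calling run_prometheus_with_reload once and reload_prometheus after each configuration change would replace the kill -1 1 step.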

Check the targets on the Prometheus web page.

5. Import the dashboard into Grafana

The dashboard used is: https://grafana.com/grafana/dashboards/13230

6. View the result

7. Set up Prometheus alerting

First, point rule_files in prometheus.yml at the alert rule files:

/etc/prometheus/rules $ cat /etc/prometheus/prometheus.yml
rule_files:
  - "/etc/prometheus/rules/*.rules"

Then write the SSL alert rule file:

/etc/prometheus $ mkdir /etc/prometheus/rules
/etc/prometheus/rules $ cat /etc/prometheus/rules/ssl-expire-alert.rules 
groups:
- name: ssl_expiry
  rules:
  - alert: Ssl Cert Will Expire in 30 days
    expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 300  # 86400 * 300 is 300 days; use 86400 * 30 to match the 30-day alert name
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "SSL certificate will expire soon on (instance {{ $labels.instance }})"
      description: "SSL certificate expires in 30 days\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"
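
The threshold in expr is a number of seconds: probe_ssl_earliest_cert_expiry is a Unix timestamp, so subtracting time() gives the seconds left before the certificate expires. The arithmetic can be checked quickly in the shell (the variable names are illustrative):

```shell
# Seconds in one day, and the two thresholds discussed here.
day=86400
threshold_30d=$((day * 30))     # 2592000 seconds
threshold_300d=$((day * 300))   # 25920000 seconds, the value used in the rule above
echo "30 days  = ${threshold_30d}s"
echo "300 days = ${threshold_300d}s"

# Optional syntax check inside the container (promtool ships in the prom/prometheus image):
#   docker exec prometheus promtool check rules /etc/prometheus/rules/ssl-expire-alert.rules
```

A 300-day threshold makes the alert fire for almost any certificate, which is convenient for testing the pipeline end to end.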

Reload the Prometheus configuration:

/etc/prometheus/rules $ kill -1 1

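Check the Alerts page in the Prometheus UI; the alert is already there.

8. Configure Alertmanager email alerts

Deploy Alertmanager. The configuration file is the unmodified default: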

/alertmanager $ cat /etc/alertmanager/alertmanager.yml 
global:
  resolve_timeout: 5m

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'web.hook'
receivers:
- name: 'web.hook'
  webhook_configs:
  - url: 'http://127.0.0.1:5001/'
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']
➜  alertmanager docker run --name alertmanager -d -v $(pwd)/alertmanager.yml:/etc/alertmanager/alertmanager.yml -p 9093:9093 prom/alertmanager:v0.21.0

Open port 9093 in a browser to verify that Alertmanager is running.

To connect Prometheus to Alertmanager, edit prometheus.yml and add the Alertmanager address:

/prometheus $ cat /etc/prometheus/prometheus.yml
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - 172.17.0.7:9093

Reload the Prometheus configuration:

/prometheus $ kill -1 1

Refresh the Alertmanager page; the alert has arrived.

Edit the Alertmanager configuration file to set up email alerts.
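
A minimal sketch of what such an email configuration could look like, assuming an SMTP server that accepts STARTTLS on port 587. Every host name, address and password below is a placeholder, and the output file name is hypothetical; this is not the configuration used in the original setup:

```shell
# Write an example config next to the real one; every value below is a placeholder.
cat > alertmanager-email.example.yml <<'EOF'
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.example.com:587'   # placeholder SMTP server
  smtp_from: 'alertmanager@example.com'    # placeholder sender
  smtp_auth_username: 'alertmanager@example.com'
  smtp_auth_password: 'changeme'           # placeholder password
  smtp_require_tls: true

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'email'

receivers:
- name: 'email'
  email_configs:
  - to: 'ops@example.com'                  # placeholder recipient
    send_resolved: true
EOF

# Optional check before reloading (amtool ships in the prom/alertmanager image):
#   docker exec alertmanager amtool check-config /etc/alertmanager/alertmanager.yml
```

After filling in real values, the content would go into the mounted alertmanager.yml, with the route's receiver pointing at the email receiver instead of web.hook.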

Reload the Alertmanager configuration:

/alertmanager $ kill -1 1

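Check the mailbox for the alert email. If nothing arrives, look at the Alertmanager logs for errors, such as an unreachable SMTP server or a malformed line in the configuration file.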