1. Overview of the components being installed:
- prometheus:
  - The server-side daemon. It pulls the metrics collected by the various exporters and stores them in its built-in TSDB (time-series database). Data is kept for 15 days by default; the retention period can be changed with a startup flag.
  - Prometheus officially provides many kinds of exporters.
  - Listens on port 9090 by default, serving a web query UI as well as an HTTP query API.
  - Alerting rules (rules) must be configured by hand; alerts that fire are sent to alertmanager, which dispatches them through the notification channels configured there.
- grafana:
  - The web UI bundled with prometheus is rather bare-bones, so grafana is used for dashboards.
  - grafana is dedicated visualization software that supports many data sources; prometheus is just one of them.
  - It has built-in alerting, and alert rules can be configured directly on a panel. However, this style of alerting does not support template variables (the special dashboard variables configured for convenient display), so every metric on every host must be configured separately, which makes it of limited practical use.
  - Default port: 3000
- node_exporter:
  - The agent side; one of the many official prometheus exporters. Installed on each monitored host.
  - Collects host and system metrics such as cpu, mem, disk, network, filesystem, and so on; very comprehensive. The collected metrics are published over HTTP for the prometheus server to scrape.
  - Default port: 9100
- cadvisor:
  - Agent side; installed on docker hosts to collect runtime metrics for the host and its docker containers.
  - Runs as a container itself and listens on port 8080 (the published port can be changed, and mapping it to a different port is recommended).
  - Provides a basic graph UI as well as a metrics endpoint for scraping.
- alertmanager:
  - Receives alerts from prometheus, groups them by configurable rules, and controls delivery (alert frequency, inhibition rules, routing to different notification backends, silences, and so on).
  - Supports multiple notification backends: email, webhook, wechat (WeCom), and various commercial alerting platforms.
  - Default port: 9093
- blackbox_exporter:
  - One of the official Prometheus exporters; probes targets over http, dns, tcp, and icmp.
  - Can run directly on the prometheus server node or on a separate node.
  - Default port: 9115
- nginx:
  - prometheus and alertmanager have no built-in authentication, so nginx fronts all external access, providing basic auth and HTTPS.
  - All of the components above expose their own ports, so in the docker-compose deployment the containers are placed on one network and every externally mapped port goes through nginx, which simplifies management.
2. prometheus-server
2.1 Official links:
- Documentation: https://prometheus.io/docs/introduction/overview/
- GitHub project: https://github.com/prometheus/prometheus
2.2 Installing prometheus server
2.2.1 Manual install on linux (centos7)
- Create a system user to run the prometheus server process, with home directory /var/lib/prometheus as the data directory:
~]# useradd -r -m -d /var/lib/prometheus prometheus
- Download and install prometheus server, using 2.14.0 as an example:
wget https://github.com/prometheus/prometheus/releases/download/v2.14.0/prometheus-2.14.0.linux-amd64.tar.gz
tar -xf prometheus-2.14.0.linux-amd64.tar.gz -C /usr/local/
cd /usr/local
ln -sv prometheus-2.14.0.linux-amd64 prometheus
- Create a unit file so systemd can manage prometheus:
vim /usr/lib/systemd/system/prometheus.service
[Unit]
Description=The Prometheus 2 monitoring system and time series database.
Documentation=https://prometheus.io
After=network.target
[Service]
EnvironmentFile=-/etc/sysconfig/prometheus
User=prometheus
ExecStart=/usr/local/prometheus/prometheus \
--storage.tsdb.path=/var/lib/prometheus \
--config.file=/usr/local/prometheus/prometheus.yml \
--web.listen-address=0.0.0.0:9090 \
--web.external-url= $PROM_EXTRA_ARGS
Restart=on-failure
StartLimitInterval=1
RestartSec=3
[Install]
WantedBy=multi-user.target
- Other runtime flags: ./prometheus --help
- Start the service:
systemctl daemon-reload
systemctl start prometheus.service
- Remember to open the firewall port:
iptables -I INPUT -p tcp --dport 9090 -s NETWORK/MASK -j ACCEPT
- Browse to:
http://IP:PORT
2.2.2 Installing with docker:
- image: prom/prometheus
- Start command (bind-mount sources must be absolute paths, hence $(pwd)):
$ docker run --name prometheus -d -v $(pwd)/prometheus:/etc/prometheus/ -v $(pwd)/db/:/prometheus -p 9090:9090 prom/prometheus --config.file=/etc/prometheus/prometheus.yml --web.listen-address="0.0.0.0:9090" --storage.tsdb.path=/prometheus --web.console.libraries=/usr/share/prometheus/console_libraries --web.console.templates=/usr/share/prometheus/consoles --storage.tsdb.retention=30d
2.3 Configuring prometheus:
2.3.1 Startup flags
- Commonly used flags:
--config.file=/etc/prometheus/prometheus.yml   # main configuration file
--web.listen-address="0.0.0.0:9090"            # listen address and port
--storage.tsdb.path=/prometheus                # database directory
--web.console.libraries=/usr/share/prometheus/console_libraries
--web.console.templates=/usr/share/prometheus/consoles   # console libs and templates
--storage.tsdb.retention=60d                   # data retention, default 15d
2.3.2 Configuration file:
- The main prometheus configuration file is prometheus.yml.
It consists of the sections global, rule_files, scrape_configs, alerting, remote_write and remote_read:
- global: global settings;
- rule_files: paths of the alerting-rule files
- scrape_configs:
the collection of scrape configs, defining the sets of targets to monitor and the parameters describing how to scrape their metrics;
typically each scrape config corresponds to a single job,
and its targets can either be listed statically (static_configs) or configured automatically through one of the service-discovery mechanisms Prometheus supports;
- job_name: 'nodes'
  static_configs:   # static targets; each host:port listed here is scraped at /metrics
    - targets: ['localhost:9100']
    - targets: ['172.20.94.1:9100']
- job_name: 'docker_host'
  file_sd_configs:  # file-based service discovery; host:port entries defined in the files (yml or json) become scrape targets
    - files:
        - ./sd_files/docker_host.yml
      refresh_interval: 30s
- alerting / alertmanagers (note: the section key is `alerting:`, not `alertmanager_configs:`):
the set of Alertmanager instances Prometheus may use, plus the parameters describing how to talk to them;
each Alertmanager can be given statically (static_configs) or configured automatically through one of the supported service-discovery mechanisms;
- remote_write:
configures "remote write": define this section when Prometheus should persist data to an external storage system (e.g. InfluxDB);
Prometheus then sends samples over HTTP to the adaptor identified by the URL;
- remote_read:
configures "remote read": Prometheus hands incoming queries to the adapter identified by the URL;
the adapter translates them into queries against the remote storage service and converts the responses into a format Prometheus can use;
- Monitoring/alerting rule files: *.yml
  - define the alerting rules
  - they only take effect when listed under rule_files: in the main config
```
rule_files:
  - "test_rules.yml"   # path of the alerting-rules file
```
- Service-discovery files: yaml and json are both supported
  - also referenced from the main config:
```
file_sd_configs:
  - files:
      - ./sd_files/http.yml
    refresh_interval: 30s
```
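As a minimal sketch of the file-based discovery workflow (the directory and target address here are illustrative), a script can rewrite the target file atomically; Prometheus re-reads it within refresh_interval, so no reload is needed:

```shell
# Generate a file_sd target list atomically (illustrative paths and targets).
sd_dir=./sd_files
mkdir -p "$sd_dir"

cat > "$sd_dir/docker_host.yml.tmp" <<'EOF'
- targets: ['10.10.11.40:9100']
  labels:
    env: dev
EOF

# Atomic rename, so Prometheus never sees a half-written file.
mv "$sd_dir/docker_host.yml.tmp" "$sd_dir/docker_host.yml"
cat "$sd_dir/docker_host.yml"
```

The write-then-rename pattern matters because Prometheus may read the file at any moment between refreshes.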
2.3.3 A minimal configuration example:
- prometheus.yml example
```
global:
  scrape_interval: 15s      # scrape metrics every 15 seconds
  evaluation_interval: 15s  # evaluate alerting rules every 15 seconds
alerting:
  alertmanagers:
    - static_configs:
        - targets: ["localhost:9093"]  # where alerts are pushed; normally the alertmanager address
rule_files:
  - "test_rules.yml"        # path of the alerting-rules file
scrape_configs:
  - job_name: 'node'        # user-defined job name
    static_configs:         # static targets: scrape the listed ip:port directly
      - targets: ['localhost:9100']
  - job_name: 'CDG-MS'
    honor_labels: true
    metrics_path: '/prometheus'
    static_configs:
      - targets: ['localhost:8089']
    relabel_configs:
      - target_label: env
        replacement: dev
  - job_name: 'eureka'
    file_sd_configs:        # file-based service discovery
      - files:
          - "/app/enmonster/basic/prometheus/prometheus-2.2.1.linux-amd64/eureka.json"  # json and yml are both supported
        refresh_interval: 30s  # files are re-read every 30s; edits take effect without a manual reload
    relabel_configs:
      - source_labels: [__job_name__]
        regex: (.*)
        target_label: job
        replacement: ${1}
      - target_label: env
        replacement: dev
```
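Relabeling applies an anchored regex to the joined source-label values and expands `${1}`-style capture references into target_label. As a rough shell analogy (illustrative only, not how Prometheus is implemented internally), stripping a port from an `__address__`-like value looks like this:

```shell
# sed analogy for a relabel rule: regex with a capture group, ${1}-style
# replacement. Here the "source label" value is a host:port string and the
# "target label" receives just the host part.
addr="10.10.11.179:9100"
instance=$(printf '%s' "$addr" | sed -E 's/^(.*):[0-9]+$/\1/')
echo "$instance"   # → 10.10.11.179
```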
- Example alerting-rules file:
```
[root@host40 monitor-bak]# cat prometheus/rules/docker_monitor.yml
groups:
- name: "container monitor"
  rules:
  - alert: "Container down: env1"
    expr: time() - container_last_seen{name="env1"} > 60
    for: 30s
    labels:
      severity: critical
    annotations:
      summary: "Container down: {{$labels.instance}} name={{$labels.name}}"
```
- File-based service-discovery files: *.yml
```
[root@host40 monitor]# cat prometheus/sd_files/virtual_lan.yml
- targets: ['10.10.11.179:9100']
- targets: ['10.10.11.178:9100']

[root@host40 monitor]# cat prometheus/sd_files/tcp.yml
- targets: ['10.10.11.178:8001']
  labels:
    server_name: http_download
- targets: ['10.10.11.178:3307']
  labels:
    server_name: xiaojing_db
- targets: ['10.10.11.178:3001']
  labels:
    server_name: test_web
```
2.3.4 Other configuration
- Much of the prometheus configuration is coupled to other components, so it is covered together with the relevant component below.
2.4 prometheus web-gui
- Web UI address: http://ip:port, e.g. http://10.10.11.40:9090/
- alerts: view the alerting rules
- graph: query the collected metrics, with simple plotting
- status: runtime configuration and information about the scraped hosts
- Explore the web-gui for the details.
3. node_exporter
3.1 Introduction
- node_exporter is installed on each monitored node; it collects host metrics and serves them over HTTP for prometheus to scrape.
- Project and documentation: https://github.com/prometheus/node_exporter
- Prometheus officially provides many other exporters; list: https://prometheus.io/docs/instrumenting/exporters/
3.2 Installing node_exporter
3.2.1 Manual install on linux (centos7):
- Download and unpack:
```
wget https://github.com/prometheus/node_exporter/releases/download/v0.18.1/node_exporter-0.18.1.linux-amd64.tar.gz
tar xf node_exporter-0.18.1.linux-amd64.tar.gz -C /usr/local/
cd /usr/local
ln -sv node_exporter-0.18.1.linux-amd64/ node_exporter
```
- Create the user:
```
useradd -r -m -d /var/lib/prometheus prometheus
```
- Create the unit file:
```
vim /usr/lib/systemd/system/node_exporter.service
[Unit]
Description=Prometheus exporter for machine metrics, written in Go with pluggable metric collectors.
Documentation=https://github.com/prometheus/node_exporter
After=network.target

[Service]
EnvironmentFile=-/etc/sysconfig/node_exporter
User=prometheus
ExecStart=/usr/local/node_exporter/node_exporter \
          $NODE_EXPORTER_OPTS
Restart=on-failure
StartLimitInterval=1
RestartSec=3

[Install]
WantedBy=multi-user.target
```
- Start the service:
```
systemctl daemon-reload
systemctl start node_exporter.service
```
- Verify by hand that metrics can be fetched:
```
curl http://localhost:9100/metrics
```
- Open the firewall:
```
iptables -I INPUT -p tcp --dport 9100 -s NET/MASK -j ACCEPT
```
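The exposition format returned by /metrics is plain text: `metric_name{labels} value`, one sample per line. A quick sanity check can be scripted without Prometheus at all; the payload below is a hypothetical sample of what `curl :9100/metrics` returns:

```shell
# Extract one metric's value from node_exporter's text exposition format.
# The heredoc-style variable stands in for live curl output (illustrative).
metrics='# HELP node_boot_time_seconds Node boot time, in unixtime.
# TYPE node_boot_time_seconds gauge
node_boot_time_seconds 1574040030
node_load1 0.21'

printf '%s\n' "$metrics" | awk '$1 == "node_boot_time_seconds" {print $2}'
```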
3.2.2 Installing with docker
- image: quay.io/prometheus/node-exporter, prom/node-exporter
- Start command:
docker run -d --net="host" --pid="host" -v "/:/host:ro,rslave" --name monitor-node-exporter --restart always quay.io/prometheus/node-exporter --path.rootfs=/host --web.listen-address=:9100
- Some older docker versions fail with: Error response from daemon: linux mounts: Could not find source mount of /
  Workaround: change -v "/:/host:ro,rslave" to -v "/:/host:ro"
3.3 Configuring node_exporter
- Enabling and disabling collectors:
```
./node_exporter --help   # list all supported collectors; enable or disable them as needed
```
  Collectors are toggled with `--collector.<name>` / `--no-collector.<name>`; for example `--no-collector.cpu` stops collecting cpu metrics.
- Textfile Collector:
  The startup flag --collector.textfile.directory="DIR" enables the textfile collector. It reads metrics from every *.prom file in the directory; the contents must be in the prom exposition format.
  Example:
```
echo my_batch_job_completion_time $(date +%s) > /path/to/directory/my_batch_job.prom.$$
mv /path/to/directory/my_batch_job.prom.$$ /path/to/directory/my_batch_job.prom
echo 'role{role="application_server"} 1' > /path/to/directory/role.prom.$$
mv /path/to/directory/role.prom.$$ /path/to/directory/role.prom
rpc_duration_seconds{quantile="0.5"} 4773
http_request_duration_seconds_bucket{le="0.5"} 129389
```
  In other words, when node_exporter's built-in collectors are not enough, a script can gather extra metrics and write them to a file, and node_exporter will expose them to prometheus.
  This can remove the need for a pushgateway. The prom format and the query language are introduced later.
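For instance (the directory, metric name, and data source here are made up), a cron job exporting the number of logged-in users through the textfile collector might look like this:

```shell
# Hypothetical textfile-collector script, meant to run from cron.
# Writes to a temp file first and then renames, so node_exporter never
# reads a partially written file.
dir=./textfiles            # stands in for the --collector.textfile.directory
mkdir -p "$dir"
users=$(who | wc -l | tr -d ' ')

cat > "$dir/logged_in.prom.$$" <<EOF
# HELP node_logged_in_users Number of users with an active session.
# TYPE node_logged_in_users gauge
node_logged_in_users $users
EOF
mv "$dir/logged_in.prom.$$" "$dir/logged_in.prom"
```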
3.4 Configuring prometheus to scrape node_exporter metrics
- Example: prometheus.yml
```
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'nodes'
    static_configs:
      - targets: ['localhost:9100']
      - targets: ['172.20.94.1:9100']

  - job_name: 'node_real_lan'
    file_sd_configs:
      - files:
          - ./sd_files/real_lan.yml
        refresh_interval: 30s
    params:          # optional
      collect[]:
        - cpu
        - meminfo
        - diskstats
        - netdev
        - netstat
        - filefd
        - filesystem
        - xfs
```
4. cadvisor
4.1 Official links:
- https://github.com/google/cadvisor
- image: gcr.io/google_containers/cadvisor[:v0.36.0]  # requires access to google
- image: google/cadvisor:v0.33.0  # docker hub image; older than the google one
4.2 docker run
sudo docker run \
--volume=/:/rootfs:ro \
--volume=/var/run:/var/run:ro \
--volume=/sys:/sys:ro \
--volume=/var/lib/docker/:/var/lib/docker:ro \
--volume=/dev/disk/:/dev/disk:ro \
--publish=9080:8080 \
--detach=true \
--name=cadvisor \
--privileged \
--device=/dev/kmsg \
google/cadvisor:v0.33.0
4.3 The web UI shows simple per-host graphs
- http://ip:port
4.4 Configuring prometheus to scrape cadvisor
- Example configuration:
```
- job_name: 'docker'
  static_configs:
    - targets: ['localhost:9080']
```
5. grafana
5.1 Official links
- grafana downloads: https://grafana.com/grafana/download
- grafana dashboards: https://grafana.com/grafana/dashboards
5.2 Installing grafana
5.2.1 linux (centos7)
- Download and install:
```
wget https://dl.grafana.com/oss/release/grafana-7.2.2-1.x86_64.rpm
sudo yum install grafana-7.2.2-1.x86_64.rpm
```
- Service file:
```
[Unit]
Description=Grafana instance
Documentation=http://docs.grafana.org
Wants=network-online.target
After=network-online.target
After=postgresql.service mariadb.service mysqld.service

[Service]
EnvironmentFile=/etc/sysconfig/grafana-server
User=grafana
Group=grafana
Type=notify
Restart=on-failure
WorkingDirectory=/usr/share/grafana
RuntimeDirectory=grafana
RuntimeDirectoryMode=0750
ExecStart=/usr/sbin/grafana-server \
  --config=${CONF_FILE} \
  --pidfile=${PID_FILE_DIR}/grafana-server.pid \
  --packaging=rpm \
  cfg:default.paths.logs=${LOG_DIR} \
  cfg:default.paths.data=${DATA_DIR} \
  cfg:default.paths.plugins=${PLUGINS_DIR} \
  cfg:default.paths.provisioning=${PROVISIONING_CFG_DIR}
LimitNOFILE=10000
TimeoutStopSec=20

[Install]
WantedBy=multi-user.target
```
- Start grafana:
```
systemctl enable grafana-server.service
systemctl restart grafana-server.service
```
  It listens on port 3000 by default.
- Open the firewall:
```
iptables -I INPUT -p tcp --dport 3000 -s NET/MASK -j ACCEPT
```
5.2.2 Installing with docker
- image: grafana/grafana
```
docker run -d --name=grafana -p 3000:3000 grafana/grafana:7.2.2
```
5.3 Basic grafana workflow
- Open the web UI:
  http://ip:port
  On first login (version 7.2) the initial credentials are admin/admin and you are asked to set a new password.
- Workflow:
  - add a data source
  - add a dashboard and configure its panels; ready-made dashboard templates for common services can also be downloaded from https://grafana.com/grafana/dashboards
  - import a template: json, a link, or a template ID
  - view the dashboard
- Frequently used template IDs:
  - node-exporter: cn/8919, en/11074
  - k8s: 13105
  - docker: 12831
  - alertmanager: 9578
  - blackbox_exporter: 9965
- Resetting the admin password:
  Locate grafana.db via the config file /etc/grafana/grafana.ini:
```
[paths]
;data = /var/lib/grafana
[database]
# For "sqlite3" only, path relative to data_path setting
;path = grafana.db
```
  So the full path is /var/lib/grafana/grafana.db. Reset the admin password with sqlite3:
```
sqlite3 /var/lib/grafana/grafana.db
sqlite> update user set password = '59acf18b94d7eb0694c61e60ce44c110c7a683ac6a8f09580d626f90f4a242000746579358d77dd9e570e83fa24faa88a8a6', salt = 'F3FAxVm33R' where login = 'admin';
.exit
```
  Then log in with admin admin.
5.4 Configuring grafana alerting:
- Configure an SMTP server and sender mailbox in grafana-server:
```
vim /etc/grafana/grafana.ini
[smtp]
enabled = true
host = smtp.126.com:465
user = USER@126.com
password = PASS
skip_verify = false
from_address = USER@126.com
from_name = Grafana Alert
```
- Add a Notification Channel in the UI:
  Alerting -> Notification Channel; you can "send test" before saving.
- Open a dashboard and add alert rules.
- As of grafana 7.2.2, template variables cannot be used in alert queries, so the built-in alerting is of limited practical use. alertmanager is recommended in production.
6. prometheus and PromQL:
6.1 PromQL overview
- PromQL is the language prometheus uses to query its database; it turns the metrics collected by the exporters into visualizable chart data and into alerting rules.
- Prometheus offers a multi-dimensional data model in which time series are identified by metric name and key/value label pairs
- a flexible query language that can exploit those dimensions
- no reliance on distributed storage; single server nodes are autonomous
- multiple modes of graphing and dashboarding support
6.2 Components that use PromQL:
- prometheus server
- client libraries for instrumenting application code
- push gateway
- exporters
- alertmanager
6.3 Metrics
6.3.1 Metric types
- gauges: a single numerical value, e.g.:
  - node_boot_time_seconds
    node_boot_time_seconds{instance="10.10.11.40:9100",job="node_real_lan"} 1574040030
- counters: cumulative counts
- histograms: the distribution of observations, e.g. max, min, median, percentiles.
- summaries: quantiles computed client-side from sampled observations.
6.3.2 Labels
- node_boot_time_seconds{instance="10.10.11.40:9100",job="node_real_lan"}
  In the example above, instance and job are labels.
- job: the job_name defined in prometheus.yml
- instance: host:port
- Labels can also be defined by hand in the configuration, e.g.:
```
- targets: ['10.10.11.178:3001']
  labels:
    server_name: test_web
```
  A label added this way can then be used when querying:
```
metric{server_name="..."}
```
6.4 PromQL expressions
- PromQL expressions are both what grafana uses to draw charts and what prometheus uses for alerting rules, so being able to read and write PromQL matters a great deal.
6.4.1 An example first:
- CPU usage percentage:
```
(1-((sum(increase(node_cpu_seconds_total{mode="idle"}[1m])) by (instance))/(sum(increase(node_cpu_seconds_total[1m])) by (instance)))) * 100
```
The metrics involved:
```
node_cpu_seconds_total                 # total cpu time used
node_cpu_seconds_total{mode="idle"}    # idle cpu time; other mode labels: user, system, steal, softirq, irq, nice, iowait, idle
```
The functions involved:
```
increase(<series>[1m])  # increase over the last 1 minute
sum()
sum() by (TAG)          # TAG is a label; here instance identifies the machine. Sum per instance, otherwise multiple hosts collapse into one line.
```
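A quick arithmetic check of the CPU-usage formula with invented numbers: if over one minute a host's counters grow by 240 cpu-seconds in total, 180 of them idle, usage is (1 - 180/240) * 100 = 25%:

```shell
# Worked example of the CPU-usage formula using made-up counter deltas.
idle_delta=180    # sum(increase(node_cpu_seconds_total{mode="idle"}[1m])) by (instance)
total_delta=240   # sum(increase(node_cpu_seconds_total[1m])) by (instance)
awk -v i="$idle_delta" -v t="$total_delta" 'BEGIN { printf "%.0f\n", (1 - i/t) * 100 }'
# → 25
```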
6.4.2 Label selection
- Match operators:
```
=   # equal: select labels that are exactly equal to the provided string.
!=  # not equal: select labels that are not equal to the provided string.
=~  # regex match: select labels that regex-match the provided string.
!~  # regex mismatch: select labels that do not regex-match the provided string.
```
- Examples:
```
node_cpu_seconds_total{mode="idle"}   # mode: a label built into the metric.
api_http_requests_total{method="POST", handler="/messages"}
http_requests_total{environment=~"staging|testing|development",method!="GET"}
```
- Note: a selector must specify a metric name or at least one label matcher that does not match the empty string:
```
{job=~".*"}              # Bad!
{job=~".+"}              # Good!
{job=~".*",method="get"} # Good!
```
6.4.3 Operations
- Time range units:
```
s - seconds
m - minutes
h - hours
d - days
w - weeks
y - years
```
- Operators:
```
+  (addition)
-  (subtraction)
*  (multiplication)
/  (division)
%  (modulo)
^  (power/exponentiation)
== (equal)
!= (not-equal)
>  (greater-than)
>= (greater-or-equal)
<  (less-than)
<= (less-or-equal)
```
- Alerting-rule example:
```
- alert: "CPU 使用率超过40%"
  expr: 100-(avg(rate(node_cpu_seconds_total{mode="idle"}[1m])) by(instance)* 100) > 40
  for: 1m
  labels:
    severity: warning
  annotations:
    summary: "{{$labels.instance}}:CPU 使用过高"
    description: "{{$labels.instance}}:CPU 使用率超过 40%"
    value: "{{$value}}"
- alert: "CPU 使用率超过90%"
  expr: 100-(avg(rate(node_cpu_seconds_total{mode="idle"}[1m])) by(instance)* 100) > 90
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "{{$labels.instance}}:CPU 使用率90%"
    description: "{{$labels.instance}}:CPU 使用率超过90%,持续时间超过5mins"
    value: "{{$value}}"
```
- If you need Chinese text in configuration files, make sure the encoding is utf8, otherwise errors will occur.
7.6 Configuring alertmanager
- Full documentation: https://prometheus.io/docs/alerting/latest/configuration/
- Main config file: alertmanager.yml
- Template files: *.tmpl
- Only the parts needed here are covered; see the official docs for the complete configuration reference.
7.6.1 alertmanager.yml
- The main config file covers:
  - global: sender-mailbox settings
  - templates: the notification template files (alertmanager's built-in templates are used if unset)
  - routes: alert routing, i.e. which label matches are sent to which backend
  - receivers: the notification backends: email, wechat, webhook, and so on
- An example first:
```
vim alertmanager.yml
global:
  smtp_smarthost: 'xxx'
  smtp_from: 'xxx'
  smtp_auth_username: 'xxx'
  smtp_auth_password: 'xxx'
  smtp_require_tls: false
templates:
  - '/alertmanager/template/*.tmpl'
route:
  receiver: 'default-receiver'
  group_wait: 1s        # wait before the first notification of a group
  group_interval: 1s    # interval between notifications for a group
  repeat_interval: 1s   # interval before re-sending a still-firing alert
  group_by: [cluster, alertname]
  routes:
    - receiver: test
      group_wait: 1s
      match_re:
        severity: test
receivers:
  - name: 'default-receiver'
    email_configs:
      - to: 'xx@xx.xx'
        html: '{{ template "xx.html" . }}'
        headers: { Subject: " {{ .CommonAnnotations.summary }}" }
  - name: 'test'
    email_configs:
      - to: 'xxx@xx.xx'
        html: '{{ template "xx.html" . }}'
        headers: { Subject: " {{ 第二路由匹配测试}}" }
```
```
vim test.tmpl
{{ define "xx.html" }}{{ range $i, $alert := .Alerts }}
报警项 磁盘 报警阀值 开始时间
{{ index $alert.Labels "alertname" }} {{ index $alert.Labels "instance" }} {{ index $alert.Annotations "value" }} {{ $alert.StartsAt }}
{{ end }}{{ end }}
```
- Details:
```
global:
  resolve_timeout:      # time after which an alert that stops firing is declared resolved
  # plus the mail settings shown in the example above
route:                  # root route for all incoming alerts; defines the dispatch policy
  group_by: ['LABEL_NAME','alertname','cluster','job','instance',...]
    # labels used to regroup incoming alerts; e.g. many alerts carrying
    # cluster=A and alertname=LatencyHigh will be aggregated into one group
  group_wait: 30s
    # after a new group is created, wait at least group_wait before the first
    # notification, so several alerts of the same group can fire together
  group_interval: 5m
    # after the first notification, wait group_interval before notifying
    # about new alerts added to the group
  repeat_interval: 5m
    # once a notification has been sent successfully, wait repeat_interval
    # before sending it again
  match:
    label_name: NAME    # exact match; matching alerts are sent to receiver
  match_re:
    label_name: REGEX   # regex match; matching alerts are sent to receiver
  receiver: receiver_name
    # alerts satisfying match/match_re go to this backend (email, webhook,
    # pagerduty, wechat, ...). A default receiver is mandatory, otherwise:
    # err="root route must specify a default receiver"
  routes:
    - ...               # additional nested routes
templates:
  [ - <filepath> ... ]  # notification templates, e.g. email body templates
receivers:              # a list of <receiver>
  - name: receiver_name # the name referenced by route.receiver
    email_configs:      # email notifications
      - to: <tmpl_string>
        send_resolved: <boolean> | default = false   # also notify on recovery
        # receiving mailbox; a per-receiver sender mailbox can also be set, see
        # https://prometheus.io/docs/alerting/latest/configuration/#email_config
  - name: ...
    wechat_configs:     # WeCom (企业微信) notifications
      - send_resolved: <boolean> | default = false
        api_secret: <secret> | default = global.wechat_api_secret
        api_url: <string> | default = global.wechat_api_url
        corp_id: <string> | default = global.wechat_api_corp_id
        message: <tmpl_string> | default = '{{ template "wechat.default.message" . }}'
        agent_id: <string> | default = '{{ template "wechat.default.agent_id" . }}'
        to_user: <string> | default = '{{ template "wechat.default.to_user" . }}'
        to_party: <string> | default = '{{ template "wechat.default.to_party" . }}'
        to_tag: <string> | default = '{{ template "wechat.default.to_tag" . }}'
        # to_user: WeCom user ID; to_party: ID of the group to notify;
        # corp_id: unique company account ID (see "My Company");
        # agent_id: the app ID (App management -> open the custom app);
        # api_secret: the app secret.
        # WeCom signup: https://work.weixin.qq.com
        # WeChat API docs: https://work.weixin.qq.com/api/doc#90002/90151/90854
inhibit_rules:          # inhibition rules
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']
```
7.6.2 WeCom (企业微信) alert notifications
- Register a company at https://work.weixin.qq.com. An unverified company (up to 200 members) is enough; binding a personal WeChat account gives access to the web console.
  WeChat API docs: https://work.weixin.qq.com/api/doc#90002/90151/90854
- After registering, bind your personal WeChat and scan the QR code to enter the console.
- Create a new app for sending alerts; the procedure is straightforward.
- Parameters to note:
  - corp_id: unique company account ID, shown under "My Company"
  - agent_id: the app ID; App management -> open the custom app
  - api_secret: the app secret
  - to_user: WeCom user ID
  - to_party: ID of the group to notify; in Contacts, click the dots next to the group name
- Example configuration:
```
receivers:
  - name: 'default'
    email_configs:
      - to: 'XXX'
        send_resolved: true
    wechat_configs:
      - send_resolved: true
        corp_id: 'XXX'
        api_secret: 'XXX'
        agent_id: 1000002
        to_user: XXX
        to_party: 2
        message: '{{ template "wechat.html" . }}'
```
- template:
  - alertmanager's default wechat template is ugly and verbose, so a custom template is used; the default email template is acceptable as-is.
  - Example 1:
```
cat wechat.tmpl
{{ define "wechat.html" }}
{{- if gt (len .Alerts.Firing) 0 -}}{{ range .Alerts }}
[@警报~]
实例: {{ .Labels.instance }}
信息: {{ .Annotations.summary }}
详情: {{ .Annotations.description }}
值: {{ .Annotations.value }}
时间: {{ (.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
{{ end }}{{ end -}}
{{- if gt (len .Alerts.Resolved) 0 -}}{{ range .Alerts }}
[@恢复~]
实例: {{ .Labels.instance }}
信息: {{ .Annotations.summary }}
时间: {{ (.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
恢复: {{ (.EndsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
{{ end }}{{ end -}}
{{- end }}
```
7.6.3 Template timestamps:
- Reference: https://blog.csdn.net/knight_zhou/article/details/106323719
- Alert templates render times in UTC by default. Note that Go's time layout must use the reference date "2006-01-02 15:04:05"; adding 28800e9 nanoseconds (8 hours) shifts the display to UTC+8:
```
触发时间: {{ .StartsAt.Format "2006-01-02 15:04:05" }}
修改之后: {{ (.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
```
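The 28800e9 in the template is 8 hours expressed in nanoseconds (8 * 3600 seconds). The same shift sketched in shell (plain arithmetic on a Unix epoch; uses GNU date):

```shell
# 28800 seconds = 8 hours: render a UTC instant as UTC+8 wall time by
# shifting the epoch before formatting.
utc_epoch=0                       # 1970-01-01 00:00:00 UTC
shifted=$((utc_epoch + 28800))
date -u -d "@$shifted" '+%Y-%m-%d %H:%M:%S'   # → 1970-01-01 08:00:00
```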
7.7 Commonly used alerting rules:
- A very useful page with many ready-made rules: https://awesome-prometheus-alerts.grep.to/rules
7.7.1 Container metrics: alert when a container goes down
```
vim rules/docker_monitor.yml
groups:
- name: "container monitor"
  rules:
  - alert: "Container down: env1"
    expr: time() - container_last_seen{name="env1"} > 60
    for: 30s
    labels:
      severity: critical
    annotations:
      summary: "Container down: {{$labels.instance}} name={{$labels.name}}"
```
Note:
This metric only detects a container going down; it cannot reliably detect recovery. Even if the container fails to start, a resolve notification will arrive after a while.
7.7.2 Alerting rules for CPU, IO, disk usage, memory, TCP sessions and network traffic:
```
groups:
- name: 主机状态-监控告警
  rules:
  - alert: 主机状态
    expr: up == 0
    for: 1m
    labels:
      status: 非常严重
    annotations:
      summary: "{{$labels.instance}}:服务器宕机"
      description: "{{$labels.instance}}:服务器延时超过5分钟"
  - alert: CPU使用情况
    expr: 100-(avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by(instance)* 100) > 60
    for: 1m
    labels:
      status: 一般告警
    annotations:
      summary: "{{$labels.mountpoint}} CPU使用率过高!"
      description: "{{$labels.mountpoint}} CPU使用大于60%(目前使用:{{$value}}%)"
  - alert: cpu使用率过高告警   # the join adds the nodename label from node_uname_info
    expr: (100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by(instance)* 100)) * on(instance) group_left(nodename) (node_uname_info) > 85
    for: 5m
    labels:
      region: 成都
    annotations:
      summary: "{{$labels.instance}}({{$labels.nodename}})CPU使用率过高!"
      description: '服务器{{$labels.instance}}({{$labels.nodename}})CPU使用率超过85%(目前使用:{{$value}}%)'
  - alert: 系统负载过高
    expr: (node_load1/count without (cpu, mode) (node_cpu_seconds_total{mode="system"})) * on(instance) group_left(nodename) (node_uname_info) > 1.1
    for: 3m
    labels:
      region: 成都
    annotations:
      summary: "{{$labels.instance}}({{$labels.nodename}})系统负载过高!"
      description: '{{$labels.instance}}({{$labels.nodename}})当前负载超标率 {{printf "%.2f" $value}}'
  - alert: 内存不足告警
    expr: (100 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100) * on(instance) group_left(nodename) (node_uname_info) > 80
    for: 3m
    labels:
      region: 成都
    annotations:
      summary: "{{$labels.instance}}({{$labels.nodename}})内存使用率过高!"
      description: '服务器{{$labels.instance}}({{$labels.nodename}})内存使用率超过80%(目前使用:{{$value}}%)'
  - alert: IO操作耗时
    expr: 100-(avg(irate(node_disk_io_time_seconds_total[1m])) by(instance)* 100) < 60
    for: 1m
    labels:
      status: 严重告警
    annotations:
      summary: "{{$labels.mountpoint}} 磁盘IO使用率过高!"
      description: "{{$labels.mountpoint}} 磁盘IO大于60%(目前使用:{{$value}})"
  - alert: 网络流入
    expr: ((sum(rate (node_network_receive_bytes_total{device!~'tap.*|veth.*|br.*|docker.*|virbr*|lo*'}[5m])) by (instance)) / 100) > 102400
    for: 1m
    labels:
      status: 严重告警
    annotations:
      summary: "{{$labels.mountpoint}} 流入网络带宽过高!"
      description: "{{$labels.mountpoint}} 流入网络带宽持续2分钟高于100M. RX带宽使用率{{$value}}"
  - alert: 网络流出
    expr: ((sum(rate (node_network_transmit_bytes_total{device!~'tap.*|veth.*|br.*|docker.*|virbr*|lo*'}[5m])) by (instance)) / 100) > 102400
    for: 1m
    labels:
      status: 严重告警
    annotations:
      summary: "{{$labels.mountpoint}} 流出网络带宽过高!"
      description: "{{$labels.mountpoint}} 流出网络带宽持续2分钟高于100M. TX带宽使用率{{$value}}"
  - alert: network in
    expr: sum by (instance) (irate(node_network_receive_bytes_total[2m])) / 1024 / 1024 > 100
    for: 1m
    labels:
      name: network
      severity: Critical
    annotations:
      summary: "{{$labels.mountpoint}} 流入网络带宽过高"
      description: "{{$labels.mountpoint}} 流入网络异常,高于100M"
      value: "{{ $value }}"
  - alert: network out
    expr: sum by (instance) (irate(node_network_transmit_bytes_total[2m])) / 1024 / 1024 > 100
    for: 1m
    labels:
      name: network
      severity: Critical
    annotations:
      summary: "{{$labels.mountpoint}} 发送网络带宽过高"
      description: "{{$labels.mountpoint}} 发送网络异常,高于100M"
      value: "{{ $value }}"
  - alert: TCP会话
    expr: node_netstat_Tcp_CurrEstab > 1000
    for: 1m
    labels:
      status: 严重告警
    annotations:
      summary: "{{$labels.mountpoint}} TCP_ESTABLISHED过高!"
      description: "{{$labels.mountpoint}} TCP_ESTABLISHED大于1000(目前使用:{{$value}})"
  - alert: 磁盘容量
    expr: 100-(node_filesystem_free_bytes{fstype=~"ext4|xfs"}/node_filesystem_size_bytes{fstype=~"ext4|xfs"}*100) > 80
    for: 1m
    labels:
      status: 严重告警
    annotations:
      summary: "{{$labels.mountpoint}} 磁盘分区使用率过高!"
      description: "{{$labels.mountpoint}} 磁盘分区使用大于80%(目前使用:{{$value}}%)"
  - alert: 硬盘空间不足告警   # the join adds hostname labels to the query result
    expr: (100-(node_filesystem_free_bytes{fstype=~"ext4|xfs"}/node_filesystem_size_bytes{fstype=~"ext4|xfs"}*100)) * on(instance) group_left(nodename) (node_uname_info) > 80
    for: 3m
    labels:
      region: 成都
    annotations:
      summary: "{{$labels.instance}}({{$labels.nodename}})硬盘使用率过高!"
      description: '服务器{{$labels.instance}}({{$labels.nodename}})硬盘使用率超过80%(目前使用:{{$value}}%)'
  - alert: volume full in four days   # disk predicted to fill within 4 days
    expr: predict_linear(node_filesystem_free_bytes[2h], 4 * 24 * 3600) < 0
    for: 5m
    labels:
      name: disk
      severity: Critical
    annotations:
      summary: "{{$labels.mountpoint}} 预计主机可用磁盘空间4天后将写满"
      description: "{{$labels.mountpoint}}"
      value: "{{ $value }}%"
  - alert: disk write rate
    expr: sum by (instance) (irate(node_disk_written_bytes_total[2m])) / 1024 / 1024 > 50
    for: 1m
    labels:
      name: disk
      severity: Critical
    annotations:
      summary: "disk write rate (instance {{ $labels.instance }})"
      description: "磁盘写入速率大于50MB/s"
      value: "{{ $value }}%"
  - alert: disk read latency
    expr: rate(node_disk_read_time_seconds_total[1m]) / rate(node_disk_reads_completed_total[1m]) > 0.1
    for: 1m
    labels:
      name: disk
      severity: Critical
    annotations:
      summary: "unusual disk read latency (instance {{ $labels.instance }})"
      description: "磁盘读取延迟大于100毫秒"
      value: "{{ $value }}%"
  - alert: disk write latency
    expr: rate(node_disk_write_time_seconds_total[1m]) / rate(node_disk_writes_completed_total[1m]) > 0.1
    for: 1m
    labels:
      name: disk
      severity: Critical
    annotations:
      summary: "unusual disk write latency (instance {{ $labels.instance }})"
      description: "磁盘写入延迟大于100毫秒"
      value: "{{ $value }}%"
```
7.8 alertmanager management API
```
GET  /-/healthy
GET  /-/ready
POST /-/reload
```
- Examples:
```
curl -u monitor:fosafer.com 127.0.0.1:9093/-/healthy
OK
curl -XPOST -u monitor:fosafer.com 127.0.0.1:9093/-/reload
[root@host40 monitor]# curl -XPOST -u monitor:fosafer.com 127.0.0.1:9093/-/reload
failed to reload config: yaml: unmarshal errors: line 26: field receiver already set in type config.plain
```
Equivalent to `docker exec -it monitor-alertmanager kill -1 1`, except that the HTTP endpoint reports an error when the reload fails.
8. blackbox_exporter
8.1 Introduction
- blackbox_exporter is one of the official Prometheus exporters; it probes targets over http, dns, tcp and icmp.
- Official repo: https://github.com/prometheus/blackbox_exporter
- Use cases:
  - HTTP probes: set request headers; check HTTP status / response headers / body content
  - TCP probes: check component port status; speak application-layer protocols
  - ICMP probes: host liveness checks
  - POST probes: API reachability
  - SSL certificate expiry time
8.2 Installing blackbox_exporter
8.2.1 Manual install on linux (centos7)
- Download and unpack:
```
wget https://github.com/prometheus/blackbox_exporter/releases/download/v0.18.0/blackbox_exporter-0.18.0.linux-amd64.tar.gz
tar -xf blackbox_exporter-0.18.0.linux-amd64.tar.gz -C /usr/local/
cd /usr/local
ln -sv blackbox_exporter-0.18.0.linux-amd64 blackbox_exporter
cd blackbox_exporter
./blackbox_exporter --version
```
- Add a systemd unit:
```
vim /lib/systemd/system/blackbox_exporter.service
[Unit]
Description=blackbox_exporter
After=network.target

[Service]
User=root
Type=simple
ExecStart=/usr/local/blackbox_exporter/blackbox_exporter --config.file=/usr/local/blackbox_exporter/blackbox.yml
Restart=on-failure

[Install]
WantedBy=multi-user.target
```
```
systemctl daemon-reload
systemctl enable blackbox_exporter
systemctl start blackbox_exporter
```
- Default port: 9115
8.2.2 Installing blackbox_exporter with docker
- image: prom/blackbox-exporter:master
- docker run:
docker run --rm -d -p 9115:9115 --name blackbox_exporter -v `pwd`:/config prom/blackbox-exporter:master --config.file=/config/blackbox.yml
8.3 Configuring blackbox_exporter
- Default configuration:
  - The default blackbox_exporter configuration already covers most needs; for custom modules see the official docs and the example config in the repo:
  - https://github.com/prometheus/blackbox_exporter/blob/master/example.yml
```
cat blackbox.yml
modules:
  http_2xx:
    prober: http
  http_post_2xx:
    prober: http
    http:
      method: POST
  tcp_connect:
    prober: tcp
  pop3s_banner:
    prober: tcp
    tcp:
      query_response:
      - expect: "^+OK"
      tls: true
      tls_config:
        insecure_skip_verify: false
  ssh_banner:
    prober: tcp
    tcp:
      query_response:
      - expect: "^SSH-2.0-"
  irc_banner:
    prober: tcp
    tcp:
      query_response:
      - send: "NICK prober"
      - send: "USER prober prober prober :prober"
      - expect: "PING :([^ ]+)"
        send: "PONG ${1}"
      - expect: "^:[^ ]+ 001"
  icmp:
    prober: icmp
```
8.4 Configuring prometheus:
- Official reference: https://prometheus.io/docs/prometheus/latest/configuration/configuration/#relabel_config
- Another reference: https://blog.csdn.net/qq_25934401/article/details/84325356
- Labels involved:
```
job:              # the job_name
__address__:      # host:port of the scrape target
instance:         # defaults to __address__ unless relabeled
__scheme__:       # scheme
__metrics_path__: # path
__param_<name>:   # the first occurrence of URL parameter <name>
```
8.4.1 HTTP/HTTPS probe example:
```
scrape_configs:
  - job_name: 'blackbox'
    metrics_path: /probe
    params:
      module: [http_2xx]   # Look for a HTTP 200 response.
    static_configs:
      - targets:
        - http://prometheus.io     # Target to probe with http.
        - https://prometheus.io    # Target to probe with https.
        - http://example.com:8080  # Target to probe with http on port 8080.
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 127.0.0.1:9115  # The blackbox exporter's real hostname:port.
```
8.4.2 TCP probe example:
```
- job_name: "blackbox_telnet_port"
  scrape_interval: 5s
  metrics_path: /probe
  params:
    module: [tcp_connect]
  static_configs:
    - targets: ['1x3.x1.xx.xx4:443']
      labels:
        group: 'xxxidc机房ip监控'
    - targets: ['10.xx.xx.xxx:443']
      labels:
        group: 'Process status of nginx(main) server'
  relabel_configs:
    - source_labels: [__address__]
      target_label: __param_target
    - source_labels: [__param_target]
      target_label: instance
    - target_label: __address__
      replacement: 10.xxx.xx.xx:9115
```
8.4.3 ICMP probe example:
```
- job_name: 'blackbox00_ping_idc_ip'
  scrape_interval: 10s
  metrics_path: /probe
  params:
    module: [icmp]   # ping
  static_configs:
    - targets: ['1x.xx.xx.xx']
      labels:
        group: 'xxnginx 虚拟IP'
  relabel_configs:
    - source_labels: [__address__]
      regex: (.*)(:80)?
      target_label: __param_target
      replacement: ${1}
    - source_labels: [__param_target]
      regex: (.*)
      target_label: ping
      replacement: ${1}
    - source_labels: []
      regex: .*
      target_label: __address__
      replacement: 1x.xxx.xx.xx:9115
```
8.4.4 POST probe example:
```
- job_name: 'blackbox_http_2xx_post'
  scrape_interval: 10s
  metrics_path: /probe
  params:
    module: [http_post_2xx_query]
  static_configs:
    - targets:
      - https://xx.xxx.com/api/xx/xx/fund/query.action
      labels:
        group: 'Interface monitoring'
  relabel_configs:
    - source_labels: [__address__]
      target_label: __param_target
    - source_labels: [__param_target]
      target_label: instance
    - target_label: __address__
      replacement: 1x.xx.xx.xx:9115  # The blackbox exporter's real hostname:port.
```
8.4.5 SSL certificate expiry monitoring:
```
cat << 'EOF' > prometheus.yml
rule_files:
  - ssl_expiry.rules
scrape_configs:
  - job_name: 'blackbox'
    metrics_path: /probe
    params:
      module: [http_2xx]  # Look for a HTTP 200 response.
    static_configs:
      - targets:
        - example.com     # Target to probe
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 127.0.0.1:9115  # Blackbox exporter.
EOF

cat << 'EOF' > ssl_expiry.rules
groups:
- name: ssl_expiry.rules
  rules:
  - alert: SSLCertExpiringSoon
    expr: probe_ssl_earliest_cert_expiry{job="blackbox"} - time() < 86400 * 30
    for: 10m
EOF
```
8.5 Inspecting a probe:
- Something like (quote the URL, otherwise the shell interprets `&` as "run in background"):
```
curl 'http://172.16.10.65:9115/probe?target=prometheus.io&module=http_2xx&debug=true'
```
8.6 Alerting:
- For icmp, tcp, http and post probes, connectivity is reflected by the probe_success metric:
```
probe_success == 0   ## connectivity broken
probe_success == 1   ## connectivity ok
```
- Alerting simply checks whether this metric equals 0; if so, the alert fires:
```
[sss@prometheus01 prometheus]$ cat rules/blackbox-alert.rules
groups:
- name: blackbox_network_stats
  rules:
  - alert: blackbox_network_stats
    expr: probe_success == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Instance {{ $labels.instance }} is down"
      description: "This requires immediate action!"
```
9. Deploying the full prometheus monitoring stack with docker-compose
- Deployment host: 10.10.11.40
9.1 Components deployed:
prometheus alertmanager grafana nginx node_exporter cadvisor blackbox_exporter
- images:

```
prom/prometheus
prom/alertmanager
quay.io/prometheus/node-exporter , prom/node-exporter
gcr.io/google_containers/cadvisor[:v0.36.0]  # requires access to Google registries
google/cadvisor:v0.33.0                      # docker hub image; older than the google one
grafana/grafana
nginx
```
- After pulling the images, re-tag them and push them to the local harbor registry:
```
image: 10.10.11.40:80/base/nginx:1.19.3
image: 10.10.11.40:80/base/prometheus:2.22.0
image: 10.10.11.40:80/base/grafana:7.2.2
image: 10.10.11.40:80/base/alertmanager:0.21.0
image: 10.10.11.40:80/base/node_exporter:1.0.1
image: 10.10.11.40:80/base/cadvisor:v0.33.0
image: 10.10.11.40:80/base/blackbox-exporter:0.18.0
```
9.2 Deployment layout
- Directory structure:
```
mkdir /home/deploy/monitor
cd /home/deploy/monitor
```
```
[root@host40 monitor]# tree
.
├── alertmanager
│   ├── alertmanager.yml
│   ├── db
│   │   ├── nflog
│   │   └── silences
│   └── templates
│       └── wechat.tmpl
├── blackbox_exporter
│   └── blackbox.yml
├── docker-compose.yml
├── grafana
│   └── db
│       ├── grafana.db
│       ├── plugins
...
├── nginx
│   ├── auth
│   └── nginx.conf
├── node-exporter
│   └── textfiles
├── node_exporter_install_docker.sh
├── prometheus
│   ├── db
│   ├── prometheus.yml
│   ├── rules
│   │   ├── docker_monitor.yml
│   │   ├── system_monitor.yml
│   │   └── tcp_monitor.yml
│   └── sd_files
│       ├── docker_host.yml
│       ├── http.yml
│       ├── icmp.yml
│       ├── real_lan.yml
│       ├── real_wan.yml
│       ├── sedFDm5Rw
│       ├── tcp.yml
│       ├── virtual_lan.yml
│       └── virtual_wan.yml
└── sd_controler.sh
```
- File required for nginx basic auth:

```
[root@host40 monitor-bak]# ls nginx/auth/ -a
.  ..  .htpasswd
```
- Permissions on some of the mounted paths:

The db directories of prometheus, grafana and alertmanager need mode 777. The individually mounted config files alertmanager.yml, prometheus.yml and nginx.conf need mode 666. For better security, put the config files into dedicated directories, mount those directories instead, and point the startup parameters in `command` at the config files.
9.3 docker-compose.yml
[root@host40 monitor-bak]# cat docker-compose.yml version: "3" services: nginx: image: 10.10.11.40:80/base/nginx:1.19.3 hostname: nginx container_name: monitor-nginx restart: always privileged: false ports: - 3001:3000 - 9090:9090 - 9093:9093 volumes: - ./nginx/nginx.conf:/etc/nginx/nginx.conf - ./nginx/auth:/etc/nginx/basic_auth networks: monitor: aliases: - nginx logging: driver: json-file options: max-file: '5' max-size: 50m prometheus: image: 10.10.11.40:80/base/prometheus:2.22.0 container_name: monitor-prometheus hostname: prometheus restart: always privileged: true volumes: - ./prometheus/db/:/prometheus/ - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml - ./prometheus/rules/:/etc/prometheus/rules/ - ./prometheus/sd_files/:/etc/prometheus/sd_files/ command: - '--config.file=/etc/prometheus/prometheus.yml' - '--storage.tsdb.path=/prometheus' - '--web.console.libraries=/usr/share/prometheus/console_libraries' - '--web.console.templates=/usr/share/prometheus/consoles' - '--storage.tsdb.retention=60d' networks: monitor: aliases: - prometheus logging: driver: json-file options: max-file: '5' max-size: 50m grafana: image: 10.10.11.40:80/base/grafana:7.2.2 container_name: monitor-grafana hostname: grafana restart: always privileged: true volumes: - ./grafana/db/:/var/lib/grafana networks: monitor: aliases: - grafana logging: driver: json-file options: max-file: '5' max-size: 50m alertmanger: image: 10.10.11.40:80/base/alertmanager:0.21.0 container_name: monitor-alertmanager hostname: alertmanager restart: always privileged: true volumes: - ./alertmanager/db/:/alertmanager - ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml - ./alertmanager/templates/:/etc/alertmanager/templates networks: monitor: aliases: - alertmanager logging: driver: json-file options: max-file: '5' max-size: 50m node-exporter: image: 10.10.11.40:80/base/node_exporter:1.0.1 container_name: monitor-node-exporter hostname: host40 restart: always privileged: true volumes: 
- /:/host:ro,rslave - ./node-exporter/textfiles/:/textfiles network_mode: "host" command: - '--path.rootfs=/host' - '--web.listen-address=:9100' - '--collector.textfile.directory=/textfiles' logging: driver: json-file options: max-file: '5' max-size: 50m cadvisor: image: 10.10.11.40:80/base/cadvisor:v0.33.0 container_name: monitor-cadvisor hostname: cadvisor restart: always privileged: true volumes: - /:/rootfs:ro - /var/run:/var/run:ro - /sys:/sys:ro - /var/lib/docker/:/var/lib/docker:ro - /dev/disk/:/dev/disk:ro ports: - 9080:8080 networks: monitor: logging: driver: json-file options: max-file: '5' max-size: 50m blackbox_exporter: image: 10.10.11.40:80/base/blackbox-exporter:0.18.0 container_name: monitor-blackbox hostname: blackbox-exporter restart: always privileged: true volumes: - ./blackbox_exporter/:/etc/blackbox_exporter networks: monitor: aliases: - blackbox command: - '--config.file=/etc/blackbox_exporter/blackbox.yml' logging: driver: json-file options: max-file: '5' max-size: 50m networks: monitor: ipam: config: - subnet: 192.168.17.0/24
9.4 nginx
- Since prometheus and alertmanager have no built-in authentication, nginx sits in front to handle routing and basic auth, proxying all backend ports in one place for easier management.
- Default ports of each program:

```
prometheus: 9090
grafana: 3000
alertmanager: 9093
node_exporter: 9100
cadvisor: 8080 (agent side)
```

- Create the basic-auth file used by the stock nginx image:

```
echo monitor:`openssl passwd -crypt 123456` > .htpasswd
```

- A config file mounted as a single file does not pick up edits inside the container (you can mount its directory instead of the file itself):

```
chmod 666 nginx.conf
```

- Reload the config inside the nginx container:

```
docker exec -it web-director nginx -s reload
```
- nginx.conf:
```
[root@host40 monitor-bak]# cat nginx/nginx.conf
user nginx;
worker_processes auto;
error_log /var/log/nginx/error.log;
pid /run/nginx.pid;
include /usr/share/nginx/modules/*.conf;

events {
    worker_connections 10240;
}

http {
    log_format main '$remote_addr - $remote_user [$time_local] "$request" '
                    '$status $body_bytes_sent "$http_referer" '
                    '"$http_user_agent" "$http_x_forwarded_for"';
    access_log /var/log/nginx/access.log main;

    sendfile on;
    tcp_nopush on;
    tcp_nodelay on;
    keepalive_timeout 65;
    types_hash_max_size 2048;
    include /etc/nginx/mime.types;
    default_type application/octet-stream;

    proxy_connect_timeout 500ms;
    proxy_send_timeout 1000ms;
    proxy_read_timeout 3000ms;
    proxy_buffers 64 8k;
    proxy_busy_buffers_size 128k;
    proxy_temp_file_write_size 64k;
    proxy_redirect off;
    proxy_next_upstream error invalid_header timeout http_502 http_504;
    proxy_http_version 1.1;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_set_header X-Real-Port $remote_port;
    proxy_set_header Host $http_host;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    client_max_body_size 10m;
    client_body_buffer_size 512k;
    client_body_timeout 180;
    client_header_timeout 10;
    send_timeout 240;

    gzip on;
    gzip_min_length 1k;
    gzip_buffers 4 16k;
    gzip_comp_level 2;
    gzip_types application/javascript application/x-javascript text/css text/javascript image/jpeg image/gif image/png;
    gzip_vary off;
    gzip_disable "MSIE [1-6].";

    server {
        listen 3000;
        server_name _;
        location / {
            proxy_pass http://grafana:3000;
        }
    }

    server {
        listen 9090;
        server_name _;
        location / {
            auth_basic "auth for monitor";
            auth_basic_user_file /etc/nginx/basic_auth/.htpasswd;
            proxy_pass http://prometheus:9090;
        }
    }

    server {
        listen 9093;
        server_name _;
        location / {
            auth_basic "auth for monitor";
            auth_basic_user_file /etc/nginx/basic_auth/.htpasswd;
            proxy_pass http://alertmanager:9093;
        }
    }
}
```
9.5 prometheus
- Note that the db directory must be writable; give it mode 777.
9.5.1 Main config file: prometheus.yml
```
[root@host40 monitor-bak]# cat prometheus/prometheus.yml
# my global config
global:
  scrape_interval: 15s     # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets: ["alertmanager:9093"]

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
- "rules/*.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
# The job name is added as a label `job=` to any timeseries scraped from this config.
- job_name: 'prometheus'
  static_configs:
  - targets: ['localhost:9090']
- job_name: 'alertmanager'
  static_configs:
  - targets: ['alertmanager:9093']
- job_name: 'node_real_lan'
  file_sd_configs:
  - files:
    - ./sd_files/real_lan.yml
    refresh_interval: 30s
- job_name: 'node_virtual_lan'
  file_sd_configs:
  - files:
    - ./sd_files/virtual_lan.yml
    refresh_interval: 30s
- job_name: 'node_real_wan'
  file_sd_configs:
  - files:
    - ./sd_files/real_wan.yml
    refresh_interval: 30s
- job_name: 'node_virtual_wan'
  file_sd_configs:
  - files:
    - ./sd_files/virtual_wan.yml
    refresh_interval: 30s
- job_name: 'docker_host'
  file_sd_configs:
  - files:
    - ./sd_files/docker_host.yml
    refresh_interval: 30s
- job_name: 'tcp'
  metrics_path: /probe
  params:
    module: [tcp_connect]
  file_sd_configs:
  - files:
    - ./sd_files/tcp.yml
    refresh_interval: 30s
  relabel_configs:
  - source_labels: [__address__]
    target_label: __param_target
  - source_labels: [__param_target]
    target_label: instance
  - target_label: __address__
    replacement: blackbox:9115
- job_name: 'http'
  metrics_path: /probe
  params:
    module: [http_2xx]
  file_sd_configs:
  - files:
    - ./sd_files/http.yml
    refresh_interval: 30s
  relabel_configs:
  - source_labels: [__address__]
    target_label: __param_target
  - source_labels: [__param_target]
    target_label: instance
  - target_label: __address__
    replacement: blackbox:9115
- job_name: 'icmp'
  metrics_path: /probe
  params:
    module: [icmp]
  file_sd_configs:
  - files:
    - ./sd_files/icmp.yml
    refresh_interval: 30s
  relabel_configs:
  - source_labels: [__address__]
    target_label: __param_target
  - source_labels: [__param_target]
    target_label: instance
  - target_label: __address__
    replacement: blackbox:9115
```
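The three relabel steps used by the tcp/http/icmp jobs all follow the same pattern: copy the scraped address into the `target` URL parameter, keep it as the `instance` label, then point the actual scrape at the blackbox exporter. A much-simplified Python sketch of what Prometheus does with these rules (the `relabel` helper is hypothetical and only models `action: replace` with the default regex):

```python
def relabel(labels, configs):
    """Apply a simplified subset of relabel_configs (action=replace,
    default regex (.*), so the replacement is the joined source value)."""
    out = dict(labels)
    for c in configs:
        # join source label values; no source_labels means an empty source
        src = ";".join(out.get(l, "") for l in c.get("source_labels", []))
        out[c["target_label"]] = c.get("replacement", src)
    return out

# the blackbox pattern from the 'tcp' job above
configs = [
    {"source_labels": ["__address__"], "target_label": "__param_target"},
    {"source_labels": ["__param_target"], "target_label": "instance"},
    {"target_label": "__address__", "replacement": "blackbox:9115"},
]
labels = relabel({"__address__": "10.10.11.178:3307"}, configs)
# the scrape now goes to blackbox:9115 with ?target=10.10.11.178:3307,
# while the stored series keeps instance="10.10.11.178:3307"
```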
9.5.2 All jobs use file-based service discovery:
- To monitor a host, simply write its target entry into the sd file of the corresponding job. Examples:
```
ls prometheus/sd_files/
docker_host.yml  http.yml  icmp.yml  real_lan.yml  real_wan.yml  sedFDm5Rw  tcp.yml  virtual_lan.yml  virtual_wan.yml
```
```
cat prometheus/sd_files/docker_host.yml
- targets: ['10.10.11.178:9080']
- targets: ['10.10.11.99:9080']
- targets: ['10.10.11.40:9080']
- targets: ['10.10.11.35:9080']
- targets: ['10.10.11.45:9080']
- targets: ['10.10.11.46:9080']
- targets: ['10.10.11.48:9080']
- targets: ['10.10.11.47:9080']
- targets: ['10.10.11.65:9081']
- targets: ['10.10.11.61:9080']
- targets: ['10.10.11.66:9080']
- targets: ['10.10.11.68:9080']
- targets: ['10.10.11.98:9080']
- targets: ['10.10.11.75:9080']
- targets: ['10.10.11.97:9080']
- targets: ['10.10.11.179:9080']
```
```
cat prometheus/sd_files/tcp.yml
- targets: ['10.10.11.178:8001']
  labels:
    server_name: http_download
- targets: ['10.10.11.178:3307']
  labels:
    server_name: xiaojing_db
- targets: ['10.10.11.178:3001']
  labels:
    server_name: test_web
```
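Besides YAML, `file_sd_configs` also accepts JSON target files, which are easy to generate programmatically. A sketch (the `write_sd_file` helper and the `tcp.json` filename are illustrative, not part of the setup above):

```python
import json
import os

def write_sd_file(path, targets_with_labels):
    """Write a Prometheus file_sd target list as JSON.
    targets_with_labels: list of (target, labels-dict) pairs."""
    doc = [{"targets": [t], "labels": labels} for t, labels in targets_with_labels]
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(doc, f, indent=2)
    # atomic rename, so Prometheus never reads a half-written file
    os.replace(tmp, path)

write_sd_file("tcp.json", [("10.10.11.178:3307", {"server_name": "xiaojing_db"})])
```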
9.5.3 rules files:
- docker rules:
```
cat prometheus/rules/docker_monitor.yml
groups:
- name: "container monitor"
  rules:
  - alert: "Container down: env1"
    expr: time() - container_last_seen{name="env1"} > 60
    for: 30s
    labels:
      severity: critical
    annotations:
      summary: "Container down: {{$labels.instance}} name={{$labels.name}}"
```

- tcp rules:

```
cat prometheus/rules/tcp_monitor.yml
groups:
- name: blackbox_network_stats
  rules:
  - alert: blackbox_network_stats
    expr: probe_success == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Instance {{ $labels.instance }} ,server-name: {{ $labels.server_name }} is down"
      description: "Connection is down..."
```

- system rules: # cpu, mem, disk, network, filesystem...
```
cat prometheus/rules/system_monitor.yml
groups:
- name: "system info"
  rules:
  - alert: "server down"
    expr: up == 0
    for: 3m
    labels:
      severity: critical
    annotations:
      summary: "{{$labels.instance}}: server down"
      description: "{{$labels.instance}}: server unreachable for more than 3 minutes"
  - alert: "high system load"
    expr: (node_load1/count without (cpu, mode) (node_cpu_seconds_total{mode="system"}))* on(instance) group_left(nodename) (node_uname_info) > 1.1
    for: 3m
    labels:
      severity: warning
    annotations:
      summary: "{{$labels.instance}}: high system load"
      description: "{{$labels.instance}}: system load is too high."
      value: "{{$value}}"
  - alert: "CPU usage above 90%"
    expr: 100-(avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by(instance)* 100) > 90
    for: 3m
    labels:
      severity: critical
    annotations:
      summary: "{{$labels.instance}}: CPU usage 90%"
      description: "{{$labels.instance}}: CPU usage is above 90%."
      value: "{{$value}}"
  - alert: "memory usage above 80%"
    expr: (100 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100)* on(instance) group_left(nodename) (node_uname_info) > 80
    for: 3m
    labels:
      severity: critical
    annotations:
      summary: "{{$labels.instance}}: memory usage 80%"
      description: "{{$labels.instance}}: memory usage is above 80%"
      value: "{{$value}}"
  - alert: "IO time ratio too high"
    expr: 100-(avg(irate(node_disk_io_time_seconds_total[1m])) by(instance)* 100) > 85
    for: 3m
    labels:
      severity: longtime
    annotations:
      summary: "{{$labels.instance}}: disk partition above 85%"
      description: "{{$labels.instance}}: disk partition above 85%"
      value: "{{$value}}"
  - alert: "disk full in 4 days"
    expr: predict_linear(node_filesystem_free_bytes[2h], 4 * 24 * 3600) < 0
    for: 3m
    labels:
      severity: longtime
    annotations:
      summary: "{{$labels.instance}}: a disk partition is predicted to fill up within 4 days"
      description: "{{$labels.instance}}: a disk partition is predicted to fill up within 4 days"
      value: "{{$value}}"
```
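The "disk full in 4 days" rule relies on `predict_linear`, which fits a least-squares line over the range vector and extrapolates it forward. A rough Python sketch of the idea (the `predict_linear` helper here is illustrative; the real PromQL function extrapolates from the evaluation timestamp):

```python
def predict_linear(samples, seconds_ahead):
    """samples: list of (timestamp, value). Least-squares fit, then
    extrapolate to the last timestamp + seconds_ahead."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = cov / var
    intercept = mean_v - slope * mean_t
    t_pred = samples[-1][0] + seconds_ahead
    return slope * t_pred + intercept

# free space shrinking by 1 GB per hour from 10 GB: gone well before 4 days,
# so the extrapolated value is negative and the alert would fire
samples = [(t * 3600, 10e9 - t * 1e9) for t in range(3)]
assert predict_linear(samples, 4 * 24 * 3600) < 0
```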
9.6 alertmanager
- Note that the db directory must be writable.
- Main config file:
```
cat alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtphz.qiye.163.com:25'
  smtp_from: 'XXX@fosafer.com'
  smtp_auth_username: 'XXX@fosafer.com'
  smtp_auth_password: 'XXX'
  smtp_hello: 'qiye.163.com'
  smtp_require_tls: true
route:
  group_by: ['instance']
  group_wait: 30s
  receiver: default
  routes:
  - group_interval: 3m
    repeat_interval: 10m
    match:
      severity: warning
    receiver: 'default'
  - group_interval: 3m
    repeat_interval: 30m
    match:
      severity: critical
    receiver: 'default'
  - group_interval: 5m
    repeat_interval: 24h
    match:
      severity: longtime
    receiver: 'default'
templates:
- ./templates/*.tmpl
receivers:
- name: 'default'
  email_configs:
  - to: 'xiangkaihua@fosafer.com'
    send_resolved: true
  wechat_configs:
  - send_resolved: true
    corp_id: 'XXX'
    api_secret: 'XXX'
    agent_id: 1000002
    to_user: XXX
    to_party: 2
    message: '{{ template "wechat.html" . }}'
- name: 'critical'
  email_configs:
  - to: '342382676@qq.com'
    send_resolved: true
  - to: 'xiangkaihua@fosafer.com'
    send_resolved: true
```
- Alert template file:

```
cat alertmanager/templates/wechat.tmpl
{{ define "wechat.html" }}
{{- if gt (len .Alerts.Firing) 0 -}}{{ range .Alerts }}
[@Alert~]
Instance: {{ .Labels.instance }}
Info: {{ .Annotations.summary }}
Detail: {{ .Annotations.description }}
Value: {{ .Annotations.value }}
Time: {{ (.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
{{ end }}{{ end -}}
{{- if gt (len .Alerts.Resolved) 0 -}}{{ range .Alerts }}
[@Resolved~]
Instance: {{ .Labels.instance }}
Info: {{ .Annotations.summary }}
Time: {{ (.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
Resolved: {{ (.EndsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
{{ end }}{{ end -}}
{{- end }}
```
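The `.Add 28800e9` in the template shifts the UTC timestamps by 28800e9 nanoseconds, i.e. 8 hours, converting them to China Standard Time. The same conversion in a couple of lines (the `to_cst` helper is just for illustration):

```python
from datetime import datetime, timedelta, timezone

def to_cst(utc_dt):
    """Add 8 hours, the same shift as .Add 28800e9 (28800 s, in nanoseconds)."""
    return utc_dt + timedelta(seconds=28800)

t = datetime(2023, 1, 1, 16, 30, tzinfo=timezone.utc)
assert to_cst(t).strftime("%Y-%m-%d %H:%M:%S") == "2023-01-02 00:30:00"
```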
9.7 grafana
- Only the volume mount is needed; the config file needs no changes. The db directory stays small and preserves settings and dashboards.
10. Client deployment
10.1 Hosts without docker: install node_exporter directly
- Install script:
http://10.10.11.178:8001/node_exporter_install.sh
10.2 Hosts running docker: install node_exporter and cadvisor as containers
- Install script:
http://10.10.11.178:8001/node_exporter_install_docker.sh
- Required images: docker hosts that have not added the 10.10.11.40:80 registry can download the saved images and `docker load` them before installing:
http://10.10.11.178:8001/monitor-client.tgz
11. Using and maintaining prometheus
11.1 Adding and removing monitored nodes with a script
- Every job uses file-based service discovery, so adding a target only means writing it into the corresponding sd_file; no config reload is needed.
- Based on this, a small text-processing script acts as a front end to the sd_files, adding and deleting targets from the command line instead of editing the files by hand.
- Script name: sd_controler.sh
- Usage: run ./sd_controler.sh with no arguments to print the usage text.
- Full script:

```
[root@host40 monitor]# cat sd_controler.sh
#!/bin/bash
# version: 1.0
# Description: add | del | show instances from/to prometheus file_sd files.
# rl | vl | dk | rw | vw | tcp | http | icmp : short job names; each one maps to
# one sd_file, i.e. the file that job uses for file-based service discovery.
# tcp | http | icmp targets are added with a label (server_name by default),
# because a bare service port rarely tells you which service just went down;
# the label lets whoever receives the alert email know immediately.
# Only one instance can be added or deleted per invocation; for batch changes,
# edit the sd_file directly or loop over this script.

### vars
SD_DIR=./prometheus/sd_files
DOCKER_SD=$SD_DIR/docker_host.yml
RL_HOST_SD=$SD_DIR/real_lan.yml
VL_HOST_SD=$SD_DIR/virtual_lan.yml
RW_HOST_SD=$SD_DIR/real_wan.yml
VW_HOST_SD=$SD_DIR/virtual_wan.yml
TCP_SD=$SD_DIR/tcp.yml
HTTP_SD=$SD_DIR/http.yml
ICMP_SD=$SD_DIR/icmp.yml
SDFILE=

### funcs
usage(){
  echo -e "Usage: $0 [ IP:PORT | FQDN ] [ server-name ]"
  echo -e " example: \n\t node add:\t $0 rl add | del 10.10.10.10:9100\n\t tcp,http,icmp add:\t $0 tcp add 10.10.10.10:3306 web-mysql\n\t del:\t $0 http del www.baidu.com\n\t show:\t $0 rl | vl | dk | rw | vw | tcp | http | icmp show."
  exit
}

add(){
  # $1: SDFILE, $2: IP:PORT
  grep -q $2 $1 || echo -e "- targets: ['$2']" >> $1
}

del(){
  # $1: SDFILE, $2: IP:PORT
  sed -i '/'$2'/d' $1
}

add_with_label(){
  # $1: SDFILE, $2: [IP[:PORT]|FQDN], $3: SERVER-NAME
  LABEL_01="server_name"
  if ! grep -q "'$2'" $1;then
    echo -e "- targets: ['$2']" >> $1
    echo -e "  labels:" >> $1
    echo -e "    ${LABEL_01}: $3" >> $1
  fi
}

del_with_label(){
  # $1: SDFILE, $2: [IP[:PORT]|FQDN]
  NUM=$(cat -n $1 | grep "'$2'" | awk '{print $1}')
  let ENDNUM=NUM+2
  sed -i "${NUM},${ENDNUM}d" $1
}

action(){
  if [ "$1" == "add" ];then
    add $SDFILE $2
  elif [ "$1" == "del" ];then
    del $SDFILE $2
  elif [ "$1" == "show" ];then
    cat $SDFILE
  fi
}

action_with_label(){
  if [ "$1" == "add" ];then
    add_with_label $SDFILE $2 $3
  elif [ "$1" == "del" ];then
    del_with_label $SDFILE $2
  elif [ "$1" == "show" ];then
    cat $SDFILE
  fi
}

### main code
[ "$2" == "" ] || [[ ! "$2" =~ ^(add|del|show)$ ]] && usage

curl --version &>/dev/null || { echo -e "no curl found. " && exit 15; }

if [[ $1 =~ ^(rl|vl|rw|vw|dk)$ ]] && [ "$2" == "add" ];then
  [ "$3" == "" ] && usage
  if [ "$4" != "-f" ];then
    COOD=$(curl -IL -o /dev/null --retry 3 --connect-timeout 3 -s -w "%{http_code}" http://$3/metrics)
    [ "$COOD" != "200" ] && echo -e "http://$3/metrics is not reachable. check it again, or use -f to ignore this check." && exit 11
  fi
fi

if [[ $1 =~ ^(tcp|http|icmp)$ ]] && [ "$2" == "add" ];then
  [ "$4" == "" ] && echo -e "server-name is required when adding tcp, http or icmp targets." && usage
fi

case $1 in
rl)
  SDFILE=$RL_HOST_SD
  action $2 $3 && echo $2 OK
  ;;
vl)
  SDFILE=$VL_HOST_SD
  action $2 $3 && echo $2 OK
  ;;
dk)
  SDFILE=$DOCKER_SD
  action $2 $3 && echo $2 OK
  ;;
rw)
  SDFILE=$RW_HOST_SD
  action $2 $3 && echo $2 OK
  ;;
vw)
  SDFILE=$VW_HOST_SD
  action $2 $3 && echo $2 OK
  ;;
tcp)
  SDFILE=$TCP_SD
  action_with_label $2 $3 $4 && echo $2 OK
  ;;
http)
  SDFILE=$HTTP_SD
  action_with_label $2 $3 $4 && echo $2 OK
  ;;
icmp)
  SDFILE=$ICMP_SD
  action_with_label $2 $3 $4 && echo $2 OK
  ;;
*)
  usage
  ;;
esac
```
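For labeled targets, add_with_label appends a fixed three-line YAML fragment to the sd_file. A sketch of that fragment as a function, handy for sanity-checking the format (the `target_entry` helper is hypothetical and mirrors the script, not Prometheus itself):

```python
def target_entry(target, server_name, label="server_name"):
    """Render the YAML fragment sd_controler.sh appends for tcp/http/icmp targets."""
    return (f"- targets: ['{target}']\n"
            f"  labels:\n"
            f"    {label}: {server_name}\n")

print(target_entry("10.10.10.10:3306", "web-mysql"))
```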
- Register a company at https://work.weixin.qq.com. An unverified company (up to 200 members) is enough; bind a personal wechat account to use the web console.