1. Overview of the components being installed:
- prometheus:
  - The server-side daemon. It pulls the metrics collected by the various exporters and stores them in its built-in TSDB (time-series database). Data is kept for 15 days by default; the retention period can be changed with a startup flag.
  - Prometheus officially provides many kinds of exporters.
  - Listens on port 9090 by default, serving a web query UI as well as an HTTP query API.
  - Alerting rules (rules) must be configured by hand; alerts that fire are sent to alertmanager, which dispatches them through the notification channels configured there.
- grafana:
  - The web UI bundled with prometheus is rather bare-bones, so grafana is used for dashboards.
  - grafana is dedicated visualization software that supports many data sources; prometheus is just one of them.
  - It has built-in alerting, and alert rules can be configured directly on a panel. However, this style of alerting does not support template variables (the special dashboard variables configured for convenient display), so every metric on every host must be configured separately, which makes it of limited practical use.
  - Default port: 3000
- node_exporter:
  - The agent side; one of the many official prometheus exporters. Installed on each monitored host.
  - Collects host and system metrics such as cpu, mem, disk, network, filesystem, and so on; very comprehensive. The collected metrics are published over HTTP for the prometheus server to scrape.
  - Default port: 9100
- cadvisor:
  - Agent side; installed on docker hosts to collect runtime metrics for the host and its docker containers.
  - Runs as a container itself and listens on port 8080 (the published port can be changed, and mapping it to a different port is recommended).
  - Provides a basic graph UI as well as a metrics endpoint for scraping.
- alertmanager:
  - Receives alerts from prometheus, groups them by configurable rules, and controls delivery (alert frequency, inhibition rules, routing to different notification backends, silences, and so on).
  - Supports multiple notification backends: email, webhook, wechat (WeCom), and various commercial alerting platforms.
  - Default port: 9093
- blackbox_exporter:
  - One of the official Prometheus exporters; probes targets over http, dns, tcp, and icmp.
  - Can run directly on the prometheus server node or on a separate node.
  - Default port: 9115
- nginx:
  - prometheus and alertmanager have no built-in authentication, so nginx fronts all external access, providing basic auth and HTTPS.
  - All of the components above expose their own ports, so in the docker-compose deployment the containers are placed on one network and every externally mapped port goes through nginx, which simplifies management.
2. prometheus-server
2.1 Official links:
- Documentation: https://prometheus.io/docs/introduction/overview/
- GitHub project: https://github.com/prometheus/prometheus
2.2 Installing prometheus server
2.2.1 Manual install on linux (centos7)
- Create a system user to run the prometheus server process, with home directory /var/lib/prometheus as the data directory:
~]# useradd -r -m -d /var/lib/prometheus prometheus
- Download and install prometheus server, using 2.14.0 as an example:
wget https://github.com/prometheus/prometheus/releases/download/v2.14.0/prometheus-2.14.0.linux-amd64.tar.gz
tar -xf prometheus-2.14.0.linux-amd64.tar.gz -C /usr/local/
cd /usr/local
ln -sv prometheus-2.14.0.linux-amd64 prometheus
- Create a unit file so systemd can manage prometheus:
vim /usr/lib/systemd/system/prometheus.service
[Unit]
Description=The Prometheus 2 monitoring system and time series database.
Documentation=https://prometheus.io
After=network.target
[Service]
EnvironmentFile=-/etc/sysconfig/prometheus
User=prometheus
ExecStart=/usr/local/prometheus/prometheus \
--storage.tsdb.path=/var/lib/prometheus \
--config.file=/usr/local/prometheus/prometheus.yml \
--web.listen-address=0.0.0.0:9090 \
--web.external-url= $PROM_EXTRA_ARGS
Restart=on-failure
StartLimitInterval=1
RestartSec=3
[Install]
WantedBy=multi-user.target
- Other runtime flags: ./prometheus --help
- Start the service:
systemctl daemon-reload
systemctl start prometheus.service
- Remember to open the firewall port:
iptables -I INPUT -p tcp --dport 9090 -s NETWORK/MASK -j ACCEPT
- Browse to:
http://IP:PORT
2.2.2 Installing with docker:
- image: prom/prometheus
- Start command (bind-mount sources must be absolute paths, hence $(pwd)):
$ docker run --name prometheus -d -v $(pwd)/prometheus:/etc/prometheus/ -v $(pwd)/db/:/prometheus -p 9090:9090 prom/prometheus --config.file=/etc/prometheus/prometheus.yml --web.listen-address="0.0.0.0:9090" --storage.tsdb.path=/prometheus --web.console.libraries=/usr/share/prometheus/console_libraries --web.console.templates=/usr/share/prometheus/consoles --storage.tsdb.retention=30d
2.3 Configuring prometheus:
2.3.1 Startup flags
- Commonly used flags:
--config.file=/etc/prometheus/prometheus.yml   # main configuration file
--web.listen-address="0.0.0.0:9090"            # listen address and port
--storage.tsdb.path=/prometheus                # database directory
--web.console.libraries=/usr/share/prometheus/console_libraries
--web.console.templates=/usr/share/prometheus/consoles   # console libs and templates
--storage.tsdb.retention=60d                   # data retention, default 15d
2.3.2 Configuration file:
- The main prometheus configuration file is prometheus.yml.
It consists of the sections global, rule_files, scrape_configs, alerting, remote_write and remote_read:
- global: global settings;
- rule_files: paths of the alerting-rule files
- scrape_configs:
the collection of scrape configs, defining the sets of targets to monitor and the parameters describing how to scrape their metrics;
typically each scrape config corresponds to a single job,
and its targets can either be listed statically (static_configs) or configured automatically through one of the service-discovery mechanisms Prometheus supports;
- job_name: 'nodes'
  static_configs:   # static targets; each host:port listed here is scraped at /metrics
    - targets: ['localhost:9100']
    - targets: ['172.20.94.1:9100']
- job_name: 'docker_host'
  file_sd_configs:  # file-based service discovery; host:port entries defined in the files (yml or json) become scrape targets
    - files:
        - ./sd_files/docker_host.yml
      refresh_interval: 30s
- alerting / alertmanagers (note: the section key is `alerting:`, not `alertmanager_configs:`):
the set of Alertmanager instances Prometheus may use, plus the parameters describing how to talk to them;
each Alertmanager can be given statically (static_configs) or configured automatically through one of the supported service-discovery mechanisms;
- remote_write:
configures "remote write": define this section when Prometheus should persist data to an external storage system (e.g. InfluxDB);
Prometheus then sends samples over HTTP to the adaptor identified by the URL;
- remote_read:
configures "remote read": Prometheus hands incoming queries to the adapter identified by the URL;
the adapter translates them into queries against the remote storage service and converts the responses into a format Prometheus can use;
- Monitoring/alerting rule files: *.yml
  - define the alerting rules
  - they only take effect when listed under rule_files: in the main config
```
rule_files:
  - "test_rules.yml"   # path of the alerting-rules file
```
- Service-discovery files: yaml and json are both supported
  - also referenced from the main config:
```
file_sd_configs:
  - files:
      - ./sd_files/http.yml
    refresh_interval: 30s
```
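As a minimal sketch of the file-based discovery workflow (the directory and target address here are illustrative), a script can rewrite the target file atomically; Prometheus re-reads it within refresh_interval, so no reload is needed:

```shell
# Generate a file_sd target list atomically (illustrative paths and targets).
sd_dir=./sd_files
mkdir -p "$sd_dir"

cat > "$sd_dir/docker_host.yml.tmp" <<'EOF'
- targets: ['10.10.11.40:9100']
  labels:
    env: dev
EOF

# Atomic rename, so Prometheus never sees a half-written file.
mv "$sd_dir/docker_host.yml.tmp" "$sd_dir/docker_host.yml"
cat "$sd_dir/docker_host.yml"
```

The write-then-rename pattern matters because Prometheus may read the file at any moment between refreshes.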
2.3.3 A minimal configuration example:
- prometheus.yml example
```
global:
  scrape_interval: 15s      # scrape metrics every 15 seconds
  evaluation_interval: 15s  # evaluate alerting rules every 15 seconds
alerting:
  alertmanagers:
    - static_configs:
        - targets: ["localhost:9093"]  # where alerts are pushed; normally the alertmanager address
rule_files:
  - "test_rules.yml"        # path of the alerting-rules file
scrape_configs:
  - job_name: 'node'        # user-defined job name
    static_configs:         # static targets: scrape the listed ip:port directly
      - targets: ['localhost:9100']
  - job_name: 'CDG-MS'
    honor_labels: true
    metrics_path: '/prometheus'
    static_configs:
      - targets: ['localhost:8089']
    relabel_configs:
      - target_label: env
        replacement: dev
  - job_name: 'eureka'
    file_sd_configs:        # file-based service discovery
      - files:
          - "/app/enmonster/basic/prometheus/prometheus-2.2.1.linux-amd64/eureka.json"  # json and yml are both supported
        refresh_interval: 30s  # files are re-read every 30s; edits take effect without a manual reload
    relabel_configs:
      - source_labels: [__job_name__]
        regex: (.*)
        target_label: job
        replacement: ${1}
      - target_label: env
        replacement: dev
```
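Relabeling applies an anchored regex to the joined source-label values and expands `${1}`-style capture references into target_label. As a rough shell analogy (illustrative only, not how Prometheus is implemented internally), stripping a port from an `__address__`-like value looks like this:

```shell
# sed analogy for a relabel rule: regex with a capture group, ${1}-style
# replacement. Here the "source label" value is a host:port string and the
# "target label" receives just the host part.
addr="10.10.11.179:9100"
instance=$(printf '%s' "$addr" | sed -E 's/^(.*):[0-9]+$/\1/')
echo "$instance"   # → 10.10.11.179
```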
- Example alerting-rules file:
```
[root@host40 monitor-bak]# cat prometheus/rules/docker_monitor.yml
groups:
- name: "container monitor"
  rules:
  - alert: "Container down: env1"
    expr: time() - container_last_seen{name="env1"} > 60
    for: 30s
    labels:
      severity: critical
    annotations:
      summary: "Container down: {{$labels.instance}} name={{$labels.name}}"
```
- File-based service-discovery files: *.yml
```
[root@host40 monitor]# cat prometheus/sd_files/virtual_lan.yml
- targets: ['10.10.11.179:9100']
- targets: ['10.10.11.178:9100']

[root@host40 monitor]# cat prometheus/sd_files/tcp.yml
- targets: ['10.10.11.178:8001']
  labels:
    server_name: http_download
- targets: ['10.10.11.178:3307']
  labels:
    server_name: xiaojing_db
- targets: ['10.10.11.178:3001']
  labels:
    server_name: test_web
```
2.3.4 Other configuration
- Much of the prometheus configuration is coupled to other components, so it is covered together with the relevant component below.
2.4 prometheus web-gui
- Web UI address: http://ip:port, e.g. http://10.10.11.40:9090/
- alerts: view the alerting rules
- graph: query the collected metrics, with simple plotting
- status: runtime configuration and information about the scraped hosts
- Explore the web-gui for the details.
3. node_exporter
3.1 Introduction
- node_exporter is installed on each monitored node; it collects host metrics and serves them over HTTP for prometheus to scrape.
- Project and documentation: https://github.com/prometheus/node_exporter
- Prometheus officially provides many other exporters; list: https://prometheus.io/docs/instrumenting/exporters/
3.2 Installing node_exporter
3.2.1 Manual install on linux (centos7):
- Download and unpack:
```
wget https://github.com/prometheus/node_exporter/releases/download/v0.18.1/node_exporter-0.18.1.linux-amd64.tar.gz
tar xf node_exporter-0.18.1.linux-amd64.tar.gz -C /usr/local/
cd /usr/local
ln -sv node_exporter-0.18.1.linux-amd64/ node_exporter
```
- Create the user:
```
useradd -r -m -d /var/lib/prometheus prometheus
```
- Create the unit file:
```
vim /usr/lib/systemd/system/node_exporter.service
[Unit]
Description=Prometheus exporter for machine metrics, written in Go with pluggable metric collectors.
Documentation=https://github.com/prometheus/node_exporter
After=network.target

[Service]
EnvironmentFile=-/etc/sysconfig/node_exporter
User=prometheus
ExecStart=/usr/local/node_exporter/node_exporter \
          $NODE_EXPORTER_OPTS
Restart=on-failure
StartLimitInterval=1
RestartSec=3

[Install]
WantedBy=multi-user.target
```
- Start the service:
```
systemctl daemon-reload
systemctl start node_exporter.service
```
- Verify by hand that metrics can be fetched:
```
curl http://localhost:9100/metrics
```
- Open the firewall:
```
iptables -I INPUT -p tcp --dport 9100 -s NET/MASK -j ACCEPT
```
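The exposition format returned by /metrics is plain text: `metric_name{labels} value`, one sample per line. A quick sanity check can be scripted without Prometheus at all; the payload below is a hypothetical sample of what `curl :9100/metrics` returns:

```shell
# Extract one metric's value from node_exporter's text exposition format.
# The heredoc-style variable stands in for live curl output (illustrative).
metrics='# HELP node_boot_time_seconds Node boot time, in unixtime.
# TYPE node_boot_time_seconds gauge
node_boot_time_seconds 1574040030
node_load1 0.21'

printf '%s\n' "$metrics" | awk '$1 == "node_boot_time_seconds" {print $2}'
```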
3.2.2 Installing with docker
- image: quay.io/prometheus/node-exporter, prom/node-exporter
- Start command:
docker run -d --net="host" --pid="host" -v "/:/host:ro,rslave" --name monitor-node-exporter --restart always quay.io/prometheus/node-exporter --path.rootfs=/host --web.listen-address=:9100
- Some older docker versions fail with: Error response from daemon: linux mounts: Could not find source mount of /
  Workaround: change -v "/:/host:ro,rslave" to -v "/:/host:ro"
3.3 Configuring node_exporter
- Enabling and disabling collectors:
```
./node_exporter --help   # list all supported collectors; enable or disable them as needed
```
  Collectors are toggled with `--collector.<name>` / `--no-collector.<name>`; for example `--no-collector.cpu` stops collecting cpu metrics.
- Textfile Collector:
  The startup flag --collector.textfile.directory="DIR" enables the textfile collector. It reads metrics from every *.prom file in the directory; the contents must be in the prom exposition format.
  Example:
```
echo my_batch_job_completion_time $(date +%s) > /path/to/directory/my_batch_job.prom.$$
mv /path/to/directory/my_batch_job.prom.$$ /path/to/directory/my_batch_job.prom
echo 'role{role="application_server"} 1' > /path/to/directory/role.prom.$$
mv /path/to/directory/role.prom.$$ /path/to/directory/role.prom
rpc_duration_seconds{quantile="0.5"} 4773
http_request_duration_seconds_bucket{le="0.5"} 129389
```
  In other words, when node_exporter's built-in collectors are not enough, a script can gather extra metrics and write them to a file, and node_exporter will expose them to prometheus.
  This can remove the need for a pushgateway. The prom format and the query language are introduced later.
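For instance (the directory, metric name, and data source here are made up), a cron job exporting the number of logged-in users through the textfile collector might look like this:

```shell
# Hypothetical textfile-collector script, meant to run from cron.
# Writes to a temp file first and then renames, so node_exporter never
# reads a partially written file.
dir=./textfiles            # stands in for the --collector.textfile.directory
mkdir -p "$dir"
users=$(who | wc -l | tr -d ' ')

cat > "$dir/logged_in.prom.$$" <<EOF
# HELP node_logged_in_users Number of users with an active session.
# TYPE node_logged_in_users gauge
node_logged_in_users $users
EOF
mv "$dir/logged_in.prom.$$" "$dir/logged_in.prom"
```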
3.4 Configuring prometheus to scrape node_exporter metrics
- Example: prometheus.yml
```
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'nodes'
    static_configs:
      - targets: ['localhost:9100']
      - targets: ['172.20.94.1:9100']

  - job_name: 'node_real_lan'
    file_sd_configs:
      - files:
          - ./sd_files/real_lan.yml
        refresh_interval: 30s
    params:          # optional
      collect[]:
        - cpu
        - meminfo
        - diskstats
        - netdev
        - netstat
        - filefd
        - filesystem
        - xfs
```
4. cadvisor
4.1 Official links:
- https://github.com/google/cadvisor
- image: gcr.io/google_containers/cadvisor[:v0.36.0]  # requires access to google
- image: google/cadvisor:v0.33.0  # docker hub image; older than the google one
4.2 docker run
sudo docker run \
--volume=/:/rootfs:ro \
--volume=/var/run:/var/run:ro \
--volume=/sys:/sys:ro \
--volume=/var/lib/docker/:/var/lib/docker:ro \
--volume=/dev/disk/:/dev/disk:ro \
--publish=9080:8080 \
--detach=true \
--name=cadvisor \
--privileged \
--device=/dev/kmsg \
google/cadvisor:v0.33.0
4.3 The web UI shows simple per-host graphs
- http://ip:port
4.4 Configuring prometheus to scrape cadvisor
- Example configuration:
```
- job_name: 'docker'
  static_configs:
    - targets: ['localhost:9080']
```
5. grafana
5.1 Official links
- grafana downloads: https://grafana.com/grafana/download
- grafana dashboards: https://grafana.com/grafana/dashboards
5.2 Installing grafana
5.2.1 linux (centos7)
- Download and install:
```
wget https://dl.grafana.com/oss/release/grafana-7.2.2-1.x86_64.rpm
sudo yum install grafana-7.2.2-1.x86_64.rpm
```
- Service file:
```
[Unit]
Description=Grafana instance
Documentation=http://docs.grafana.org
Wants=network-online.target
After=network-online.target
After=postgresql.service mariadb.service mysqld.service

[Service]
EnvironmentFile=/etc/sysconfig/grafana-server
User=grafana
Group=grafana
Type=notify
Restart=on-failure
WorkingDirectory=/usr/share/grafana
RuntimeDirectory=grafana
RuntimeDirectoryMode=0750
ExecStart=/usr/sbin/grafana-server \
  --config=${CONF_FILE} \
  --pidfile=${PID_FILE_DIR}/grafana-server.pid \
  --packaging=rpm \
  cfg:default.paths.logs=${LOG_DIR} \
  cfg:default.paths.data=${DATA_DIR} \
  cfg:default.paths.plugins=${PLUGINS_DIR} \
  cfg:default.paths.provisioning=${PROVISIONING_CFG_DIR}
LimitNOFILE=10000
TimeoutStopSec=20

[Install]
WantedBy=multi-user.target
```
- Start grafana:
```
systemctl enable grafana-server.service
systemctl restart grafana-server.service
```
  It listens on port 3000 by default.
- Open the firewall:
```
iptables -I INPUT -p tcp --dport 3000 -s NET/MASK -j ACCEPT
```
5.2.2 Installing with docker
- image: grafana/grafana
```
docker run -d --name=grafana -p 3000:3000 grafana/grafana:7.2.2
```
5.3 Basic grafana workflow
- Open the web UI:
  http://ip:port
  On first login (version 7.2) the initial credentials are admin/admin and you are asked to set a new password.
- Workflow:
  - add a data source
  - add a dashboard and configure its panels; ready-made dashboard templates for common services can also be downloaded from https://grafana.com/grafana/dashboards
  - import a template: json, a link, or a template ID
  - view the dashboard
- Frequently used template IDs:
  - node-exporter: cn/8919, en/11074
  - k8s: 13105
  - docker: 12831
  - alertmanager: 9578
  - blackbox_exporter: 9965
- Resetting the admin password:
  Locate grafana.db via the config file /etc/grafana/grafana.ini:
```
[paths]
;data = /var/lib/grafana
[database]
# For "sqlite3" only, path relative to data_path setting
;path = grafana.db
```
  So the full path is /var/lib/grafana/grafana.db. Reset the admin password with sqlite3:
```
sqlite3 /var/lib/grafana/grafana.db
sqlite> update user set password = '59acf18b94d7eb0694c61e60ce44c110c7a683ac6a8f09580d626f90f4a242000746579358d77dd9e570e83fa24faa88a8a6', salt = 'F3FAxVm33R' where login = 'admin';
.exit
```
  Then log in with admin admin.
5.4 Configuring grafana alerting:
- Configure an SMTP server and sender mailbox in grafana-server:
```
vim /etc/grafana/grafana.ini
[smtp]
enabled = true
host = smtp.126.com:465
user = USER@126.com
password = PASS
skip_verify = false
from_address = USER@126.com
from_name = Grafana Alert
```
- Add a Notification Channel in the UI:
  Alerting -> Notification Channel; you can "send test" before saving.
- Open a dashboard and add alert rules.
- As of grafana 7.2.2, template variables cannot be used in alert queries, so the built-in alerting is of limited practical use. alertmanager is recommended in production.
6. prometheus and PromQL:
6.1 PromQL overview
- PromQL is the language prometheus uses to query its database; it turns the metrics collected by the exporters into visualizable chart data and into alerting rules.
- Prometheus offers a multi-dimensional data model in which time series are identified by metric name and key/value label pairs
- a flexible query language that can exploit those dimensions
- no reliance on distributed storage; single server nodes are autonomous
- multiple modes of graphing and dashboarding support
6.2 Components that use PromQL:
- prometheus server
- client libraries for instrumenting application code
- push gateway
- exporters
- alertmanager
6.3 Metrics
6.3.1 Metric types
- gauges: a single numerical value, e.g.:
  - node_boot_time_seconds
    node_boot_time_seconds{instance="10.10.11.40:9100",job="node_real_lan"} 1574040030
- counters: cumulative counts
- histograms: the distribution of observations, e.g. max, min, median, percentiles.
- summaries: quantiles computed client-side from sampled observations.
6.3.2 Labels
- node_boot_time_seconds{instance="10.10.11.40:9100",job="node_real_lan"}
  In the example above, instance and job are labels.
- job: the job_name defined in prometheus.yml
- instance: host:port
- Labels can also be defined by hand in the configuration, e.g.:
```
- targets: ['10.10.11.178:3001']
  labels:
    server_name: test_web
```
  A label added this way can then be used when querying:
```
metric{server_name="..."}
```
6.4 PromQL expressions
- PromQL expressions are both what grafana uses to draw charts and what prometheus uses for alerting rules, so being able to read and write PromQL matters a great deal.
6.4.1 An example first:
- CPU usage percentage:
```
(1-((sum(increase(node_cpu_seconds_total{mode="idle"}[1m])) by (instance))/(sum(increase(node_cpu_seconds_total[1m])) by (instance)))) * 100
```
The metrics involved:
```
node_cpu_seconds_total                 # total cpu time used
node_cpu_seconds_total{mode="idle"}    # idle cpu time; other mode labels: user, system, steal, softirq, irq, nice, iowait, idle
```
The functions involved:
```
increase(<series>[1m])  # increase over the last 1 minute
sum()
sum() by (TAG)          # TAG is a label; here instance identifies the machine. Sum per instance, otherwise multiple hosts collapse into one line.
```
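A quick arithmetic check of the CPU-usage formula with invented numbers: if over one minute a host's counters grow by 240 cpu-seconds in total, 180 of them idle, usage is (1 - 180/240) * 100 = 25%:

```shell
# Worked example of the CPU-usage formula using made-up counter deltas.
idle_delta=180    # sum(increase(node_cpu_seconds_total{mode="idle"}[1m])) by (instance)
total_delta=240   # sum(increase(node_cpu_seconds_total[1m])) by (instance)
awk -v i="$idle_delta" -v t="$total_delta" 'BEGIN { printf "%.0f\n", (1 - i/t) * 100 }'
# → 25
```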
6.4.2 Label selection
- Match operators:
```
=   # equal: select labels that are exactly equal to the provided string.
!=  # not equal: select labels that are not equal to the provided string.
=~  # regex match: select labels that regex-match the provided string.
!~  # regex mismatch: select labels that do not regex-match the provided string.
```
- Examples:
```
node_cpu_seconds_total{mode="idle"}   # mode: a label built into the metric.
api_http_requests_total{method="POST", handler="/messages"}
http_requests_total{environment=~"staging|testing|development",method!="GET"}
```
- Note: a selector must specify a metric name or at least one label matcher that does not match the empty string:
```
{job=~".*"}              # Bad!
{job=~".+"}              # Good!
{job=~".*",method="get"} # Good!
```
6.4.3 Operations
- Time range units:
```
s - seconds
m - minutes
h - hours
d - days
w - weeks
y - years
```
- Operators:
```
+  (addition)
-  (subtraction)
*  (multiplication)
/  (division)
%  (modulo)
^  (power/exponentiation)
== (equal)
!= (not-equal)
>  (greater-than)
>= (greater-or-equal)
<  (less-than)
<= (less-or-equal)
```
- Alerting-rule example:
```
- alert: "CPU 使用率超过40%"
  expr: 100-(avg(rate(node_cpu_seconds_total{mode="idle"}[1m])) by(instance)* 100) > 40
  for: 1m
  labels:
    severity: warning
  annotations:
    summary: "{{$labels.instance}}:CPU 使用过高"
    description: "{{$labels.instance}}:CPU 使用率超过 40%"
    value: "{{$value}}"
- alert: "CPU 使用率超过90%"
  expr: 100-(avg(rate(node_cpu_seconds_total{mode="idle"}[1m])) by(instance)* 100) > 90
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "{{$labels.instance}}:CPU 使用率90%"
    description: "{{$labels.instance}}:CPU 使用率超过90%,持续时间超过5mins"
    value: "{{$value}}"
```
- If you need Chinese text in configuration files, make sure the encoding is utf8, otherwise errors will occur.
7.6 Configuring alertmanager
- Full documentation: https://prometheus.io/docs/alerting/latest/configuration/
- Main config file: alertmanager.yml
- Template files: *.tmpl
- Only the parts needed here are covered; see the official docs for the complete configuration reference.
7.6.1 alertmanager.yml
- The main config file covers:
  - global: sender-mailbox settings
  - templates: the notification template files (alertmanager's built-in templates are used if unset)
  - routes: alert routing, i.e. which label matches are sent to which backend
  - receivers: the notification backends: email, wechat, webhook, and so on
- An example first:
```
vim alertmanager.yml
global:
  smtp_smarthost: 'xxx'
  smtp_from: 'xxx'
  smtp_auth_username: 'xxx'
  smtp_auth_password: 'xxx'
  smtp_require_tls: false
templates:
  - '/alertmanager/template/*.tmpl'
route:
  receiver: 'default-receiver'
  group_wait: 1s        # wait before the first notification of a group
  group_interval: 1s    # interval between notifications for a group
  repeat_interval: 1s   # interval before re-sending a still-firing alert
  group_by: [cluster, alertname]
  routes:
    - receiver: test
      group_wait: 1s
      match_re:
        severity: test
receivers:
  - name: 'default-receiver'
    email_configs:
      - to: 'xx@xx.xx'
        html: '{{ template "xx.html" . }}'
        headers: { Subject: " {{ .CommonAnnotations.summary }}" }
  - name: 'test'
    email_configs:
      - to: 'xxx@xx.xx'
        html: '{{ template "xx.html" . }}'
        headers: { Subject: " {{ 第二路由匹配测试}}" }
```
```
vim test.tmpl
{{ define "xx.html" }}{{ range $i, $alert := .Alerts }}
报警项 磁盘 报警阀值 开始时间
{{ index $alert.Labels "alertname" }} {{ index $alert.Labels "instance" }} {{ index $alert.Annotations "value" }} {{ $alert.StartsAt }}
{{ end }}{{ end }}
```
- Details:
```
global:
  resolve_timeout:      # time after which an alert that stops firing is declared resolved
  # plus the mail settings shown in the example above
route:                  # root route for all incoming alerts; defines the dispatch policy
  group_by: ['LABEL_NAME','alertname','cluster','job','instance',...]
    # labels used to regroup incoming alerts; e.g. many alerts carrying
    # cluster=A and alertname=LatencyHigh will be aggregated into one group
  group_wait: 30s
    # after a new group is created, wait at least group_wait before the first
    # notification, so several alerts of the same group can fire together
  group_interval: 5m
    # after the first notification, wait group_interval before notifying
    # about new alerts added to the group
  repeat_interval: 5m
    # once a notification has been sent successfully, wait repeat_interval
    # before sending it again
  match:
    label_name: NAME    # exact match; matching alerts are sent to receiver
  match_re:
    label_name: REGEX   # regex match; matching alerts are sent to receiver
  receiver: receiver_name
    # alerts satisfying match/match_re go to this backend (email, webhook,
    # pagerduty, wechat, ...). A default receiver is mandatory, otherwise:
    # err="root route must specify a default receiver"
  routes:
    - ...               # additional nested routes
templates:
  [ - <filepath> ... ]  # notification templates, e.g. email body templates
receivers:              # a list of <receiver>
  - name: receiver_name # the name referenced by route.receiver
    email_configs:      # email notifications
      - to: <tmpl_string>
        send_resolved: <boolean> | default = false   # also notify on recovery
        # receiving mailbox; a per-receiver sender mailbox can also be set, see
        # https://prometheus.io/docs/alerting/latest/configuration/#email_config
  - name: ...
    wechat_configs:     # WeCom (企业微信) notifications
      - send_resolved: <boolean> | default = false
        api_secret: <secret> | default = global.wechat_api_secret
        api_url: <string> | default = global.wechat_api_url
        corp_id: <string> | default = global.wechat_api_corp_id
        message: <tmpl_string> | default = '{{ template "wechat.default.message" . }}'
        agent_id: <string> | default = '{{ template "wechat.default.agent_id" . }}'
        to_user: <string> | default = '{{ template "wechat.default.to_user" . }}'
        to_party: <string> | default = '{{ template "wechat.default.to_party" . }}'
        to_tag: <string> | default = '{{ template "wechat.default.to_tag" . }}'
        # to_user: WeCom user ID; to_party: ID of the group to notify;
        # corp_id: unique company account ID (see "My Company");
        # agent_id: the app ID (App management -> open the custom app);
        # api_secret: the app secret.
        # WeCom signup: https://work.weixin.qq.com
        # WeChat API docs: https://work.weixin.qq.com/api/doc#90002/90151/90854
inhibit_rules:          # inhibition rules
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']
```
7.6.2 WeCom (企业微信) alert notifications
- Register a company at https://work.weixin.qq.com. An unverified company (up to 200 members) is enough; binding a personal WeChat account gives access to the web console.
  WeChat API docs: https://work.weixin.qq.com/api/doc#90002/90151/90854
- After registering, bind your personal WeChat and scan the QR code to enter the console.
- Create a new app for sending alerts; the procedure is straightforward.
- Parameters to note:
  - corp_id: unique company account ID, shown under "My Company"
  - agent_id: the app ID; App management -> open the custom app
  - api_secret: the app secret
  - to_user: WeCom user ID
  - to_party: ID of the group to notify; in Contacts, click the dots next to the group name
- Example configuration:
```
receivers:
  - name: 'default'
    email_configs:
      - to: 'XXX'
        send_resolved: true
    wechat_configs:
      - send_resolved: true
        corp_id: 'XXX'
        api_secret: 'XXX'
        agent_id: 1000002
        to_user: XXX
        to_party: 2
        message: '{{ template "wechat.html" . }}'
```
- template:
  - alertmanager's default wechat template is ugly and verbose, so a custom template is used; the default email template is acceptable as-is.
  - Example 1:
```
cat wechat.tmpl
{{ define "wechat.html" }}
{{- if gt (len .Alerts.Firing) 0 -}}{{ range .Alerts }}
[@警报~]
实例: {{ .Labels.instance }}
信息: {{ .Annotations.summary }}
详情: {{ .Annotations.description }}
值: {{ .Annotations.value }}
时间: {{ (.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
{{ end }}{{ end -}}
{{- if gt (len .Alerts.Resolved) 0 -}}{{ range .Alerts }}
[@恢复~]
实例: {{ .Labels.instance }}
信息: {{ .Annotations.summary }}
时间: {{ (.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
恢复: {{ (.EndsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
{{ end }}{{ end -}}
{{- end }}
```
7.6.3 Template timestamps:
- Reference: https://blog.csdn.net/knight_zhou/article/details/106323719
- Alert templates render times in UTC by default. Note that Go's time layout must use the reference date "2006-01-02 15:04:05"; adding 28800e9 nanoseconds (8 hours) shifts the display to UTC+8:
```
触发时间: {{ .StartsAt.Format "2006-01-02 15:04:05" }}
修改之后: {{ (.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
```
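The 28800e9 in the template is 8 hours expressed in nanoseconds (8 * 3600 seconds). The same shift sketched in shell (plain arithmetic on a Unix epoch; uses GNU date):

```shell
# 28800 seconds = 8 hours: render a UTC instant as UTC+8 wall time by
# shifting the epoch before formatting.
utc_epoch=0                       # 1970-01-01 00:00:00 UTC
shifted=$((utc_epoch + 28800))
date -u -d "@$shifted" '+%Y-%m-%d %H:%M:%S'   # → 1970-01-01 08:00:00
```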
7.7 Commonly used alerting rules:
- A very useful page with many ready-made rules: https://awesome-prometheus-alerts.grep.to/rules
7.7.1 Container metrics: alert when a container goes down
```
vim rules/docker_monitor.yml
groups:
- name: "container monitor"
  rules:
  - alert: "Container down: env1"
    expr: time() - container_last_seen{name="env1"} > 60
    for: 30s
    labels:
      severity: critical
    annotations:
      summary: "Container down: {{$labels.instance}} name={{$labels.name}}"
```
Note:
This metric only detects a container going down; it cannot reliably detect recovery. Even if the container fails to start, a resolve notification will arrive after a while.
7.7.2 Alerting rules for CPU, IO, disk usage, memory, TCP sessions and network traffic:
```
groups:
- name: 主机状态-监控告警
  rules:
  - alert: 主机状态
    expr: up == 0
    for: 1m
    labels:
      status: 非常严重
    annotations:
      summary: "{{$labels.instance}}:服务器宕机"
      description: "{{$labels.instance}}:服务器延时超过5分钟"
  - alert: CPU使用情况
    expr: 100-(avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by(instance)* 100) > 60
    for: 1m
    labels:
      status: 一般告警
    annotations:
      summary: "{{$labels.mountpoint}} CPU使用率过高!"
      description: "{{$labels.mountpoint}} CPU使用大于60%(目前使用:{{$value}}%)"
  - alert: cpu使用率过高告警   # the join adds the nodename label from node_uname_info
    expr: (100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by(instance)* 100)) * on(instance) group_left(nodename) (node_uname_info) > 85
    for: 5m
    labels:
      region: 成都
    annotations:
      summary: "{{$labels.instance}}({{$labels.nodename}})CPU使用率过高!"
      description: '服务器{{$labels.instance}}({{$labels.nodename}})CPU使用率超过85%(目前使用:{{$value}}%)'
  - alert: 系统负载过高
    expr: (node_load1/count without (cpu, mode) (node_cpu_seconds_total{mode="system"})) * on(instance) group_left(nodename) (node_uname_info) > 1.1
    for: 3m
    labels:
      region: 成都
    annotations:
      summary: "{{$labels.instance}}({{$labels.nodename}})系统负载过高!"
      description: '{{$labels.instance}}({{$labels.nodename}})当前负载超标率 {{printf "%.2f" $value}}'
  - alert: 内存不足告警
    expr: (100 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100) * on(instance) group_left(nodename) (node_uname_info) > 80
    for: 3m
    labels:
      region: 成都
    annotations:
      summary: "{{$labels.instance}}({{$labels.nodename}})内存使用率过高!"
      description: '服务器{{$labels.instance}}({{$labels.nodename}})内存使用率超过80%(目前使用:{{$value}}%)'
  - alert: IO操作耗时
    expr: 100-(avg(irate(node_disk_io_time_seconds_total[1m])) by(instance)* 100) < 60
    for: 1m
    labels:
      status: 严重告警
    annotations:
      summary: "{{$labels.mountpoint}} 磁盘IO使用率过高!"
      description: "{{$labels.mountpoint}} 磁盘IO大于60%(目前使用:{{$value}})"
  - alert: 网络流入
    expr: ((sum(rate (node_network_receive_bytes_total{device!~'tap.*|veth.*|br.*|docker.*|virbr*|lo*'}[5m])) by (instance)) / 100) > 102400
    for: 1m
    labels:
      status: 严重告警
    annotations:
      summary: "{{$labels.mountpoint}} 流入网络带宽过高!"
      description: "{{$labels.mountpoint}} 流入网络带宽持续2分钟高于100M. RX带宽使用率{{$value}}"
  - alert: 网络流出
    expr: ((sum(rate (node_network_transmit_bytes_total{device!~'tap.*|veth.*|br.*|docker.*|virbr*|lo*'}[5m])) by (instance)) / 100) > 102400
    for: 1m
    labels:
      status: 严重告警
    annotations:
      summary: "{{$labels.mountpoint}} 流出网络带宽过高!"
      description: "{{$labels.mountpoint}} 流出网络带宽持续2分钟高于100M. TX带宽使用率{{$value}}"
  - alert: network in
    expr: sum by (instance) (irate(node_network_receive_bytes_total[2m])) / 1024 / 1024 > 100
    for: 1m
    labels:
      name: network
      severity: Critical
    annotations:
      summary: "{{$labels.mountpoint}} 流入网络带宽过高"
      description: "{{$labels.mountpoint}} 流入网络异常,高于100M"
      value: "{{ $value }}"
  - alert: network out
    expr: sum by (instance) (irate(node_network_transmit_bytes_total[2m])) / 1024 / 1024 > 100
    for: 1m
    labels:
      name: network
      severity: Critical
    annotations:
      summary: "{{$labels.mountpoint}} 发送网络带宽过高"
      description: "{{$labels.mountpoint}} 发送网络异常,高于100M"
      value: "{{ $value }}"
  - alert: TCP会话
    expr: node_netstat_Tcp_CurrEstab > 1000
    for: 1m
    labels:
      status: 严重告警
    annotations:
      summary: "{{$labels.mountpoint}} TCP_ESTABLISHED过高!"
      description: "{{$labels.mountpoint}} TCP_ESTABLISHED大于1000(目前使用:{{$value}})"
  - alert: 磁盘容量
    expr: 100-(node_filesystem_free_bytes{fstype=~"ext4|xfs"}/node_filesystem_size_bytes{fstype=~"ext4|xfs"}*100) > 80
    for: 1m
    labels:
      status: 严重告警
    annotations:
      summary: "{{$labels.mountpoint}} 磁盘分区使用率过高!"
      description: "{{$labels.mountpoint}} 磁盘分区使用大于80%(目前使用:{{$value}}%)"
  - alert: 硬盘空间不足告警   # the join adds hostname labels to the query result
    expr: (100-(node_filesystem_free_bytes{fstype=~"ext4|xfs"}/node_filesystem_size_bytes{fstype=~"ext4|xfs"}*100)) * on(instance) group_left(nodename) (node_uname_info) > 80
    for: 3m
    labels:
      region: 成都
    annotations:
      summary: "{{$labels.instance}}({{$labels.nodename}})硬盘使用率过高!"
      description: '服务器{{$labels.instance}}({{$labels.nodename}})硬盘使用率超过80%(目前使用:{{$value}}%)'
  - alert: volume full in four days   # disk predicted to fill within 4 days
    expr: predict_linear(node_filesystem_free_bytes[2h], 4 * 24 * 3600) < 0
    for: 5m
    labels:
      name: disk
      severity: Critical
    annotations:
      summary: "{{$labels.mountpoint}} 预计主机可用磁盘空间4天后将写满"
      description: "{{$labels.mountpoint}}"
      value: "{{ $value }}%"
  - alert: disk write rate
    expr: sum by (instance) (irate(node_disk_written_bytes_total[2m])) / 1024 / 1024 > 50
    for: 1m
    labels:
      name: disk
      severity: Critical
    annotations:
      summary: "disk write rate (instance {{ $labels.instance }})"
      description: "磁盘写入速率大于50MB/s"
      value: "{{ $value }}%"
  - alert: disk read latency
    expr: rate(node_disk_read_time_seconds_total[1m]) / rate(node_disk_reads_completed_total[1m]) > 0.1
    for: 1m
    labels:
      name: disk
      severity: Critical
    annotations:
      summary: "unusual disk read latency (instance {{ $labels.instance }})"
      description: "磁盘读取延迟大于100毫秒"
      value: "{{ $value }}%"
  - alert: disk write latency
    expr: rate(node_disk_write_time_seconds_total[1m]) / rate(node_disk_writes_completed_total[1m]) > 0.1
    for: 1m
    labels:
      name: disk
      severity: Critical
    annotations:
      summary: "unusual disk write latency (instance {{ $labels.instance }})"
      description: "磁盘写入延迟大于100毫秒"
      value: "{{ $value }}%"
```
7.8 alertmanager management API
```
GET  /-/healthy
GET  /-/ready
POST /-/reload
```
- Examples:
```
curl -u monitor:fosafer.com 127.0.0.1:9093/-/healthy
OK
curl -XPOST -u monitor:fosafer.com 127.0.0.1:9093/-/reload
[root@host40 monitor]# curl -XPOST -u monitor:fosafer.com 127.0.0.1:9093/-/reload
failed to reload config: yaml: unmarshal errors: line 26: field receiver already set in type config.plain
```
Equivalent to `docker exec -it monitor-alertmanager kill -1 1`, except that the HTTP endpoint reports an error when the reload fails.
8. blackbox_exporter
8.1 Introduction
- blackbox_exporter is one of the official Prometheus exporters; it probes targets over http, dns, tcp and icmp.
- Official repo: https://github.com/prometheus/blackbox_exporter
- Use cases:
  - HTTP probes: set request headers; check HTTP status / response headers / body content
  - TCP probes: check component port status; speak application-layer protocols
  - ICMP probes: host liveness checks
  - POST probes: API reachability
  - SSL certificate expiry time
8.2 Installing blackbox_exporter
8.2.1 Manual install on linux (centos7)
- Download and unpack:
```
wget https://github.com/prometheus/blackbox_exporter/releases/download/v0.18.0/blackbox_exporter-0.18.0.linux-amd64.tar.gz
tar -xf blackbox_exporter-0.18.0.linux-amd64.tar.gz -C /usr/local/
cd /usr/local
ln -sv blackbox_exporter-0.18.0.linux-amd64 blackbox_exporter
cd blackbox_exporter
./blackbox_exporter --version
```
- Add a systemd unit:
```
vim /lib/systemd/system/blackbox_exporter.service
[Unit]
Description=blackbox_exporter
After=network.target

[Service]
User=root
Type=simple
ExecStart=/usr/local/blackbox_exporter/blackbox_exporter --config.file=/usr/local/blackbox_exporter/blackbox.yml
Restart=on-failure

[Install]
WantedBy=multi-user.target
```
```
systemctl daemon-reload
systemctl enable blackbox_exporter
systemctl start blackbox_exporter
```
- Default port: 9115
8.2.2 Installing blackbox_exporter with docker
- image: prom/blackbox-exporter:master
- docker run:
docker run --rm -d -p 9115:9115 --name blackbox_exporter -v `pwd`:/config prom/blackbox-exporter:master --config.file=/config/blackbox.yml
8.3 Configuring blackbox_exporter
- Default configuration:
  - The default blackbox_exporter configuration already covers most needs; for custom modules see the official docs and the example config in the repo:
  - https://github.com/prometheus/blackbox_exporter/blob/master/example.yml
```
cat blackbox.yml
modules:
  http_2xx:
    prober: http
  http_post_2xx:
    prober: http
    http:
      method: POST
  tcp_connect:
    prober: tcp
  pop3s_banner:
    prober: tcp
    tcp:
      query_response:
      - expect: "^+OK"
      tls: true
      tls_config:
        insecure_skip_verify: false
  ssh_banner:
    prober: tcp
    tcp:
      query_response:
      - expect: "^SSH-2.0-"
  irc_banner:
    prober: tcp
    tcp:
      query_response:
      - send: "NICK prober"
      - send: "USER prober prober prober :prober"
      - expect: "PING :([^ ]+)"
        send: "PONG ${1}"
      - expect: "^:[^ ]+ 001"
  icmp:
    prober: icmp
```
8.4 Configuring prometheus:
- Official reference: https://prometheus.io/docs/prometheus/latest/configuration/configuration/#relabel_config
- Another reference: https://blog.csdn.net/qq_25934401/article/details/84325356
- Labels involved:
```
job:              # the job_name
__address__:      # host:port of the scrape target
instance:         # defaults to __address__ unless relabeled
__scheme__:       # scheme
__metrics_path__: # path
__param_<name>:   # the first occurrence of URL parameter <name>
```
8.4.1 HTTP/HTTPS probe example:
```
scrape_configs:
  - job_name: 'blackbox'
    metrics_path: /probe
    params:
      module: [http_2xx]   # Look for a HTTP 200 response.
    static_configs:
      - targets:
        - http://prometheus.io     # Target to probe with http.
        - https://prometheus.io    # Target to probe with https.
        - http://example.com:8080  # Target to probe with http on port 8080.
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 127.0.0.1:9115  # The blackbox exporter's real hostname:port.
```
8.4.2 TCP probe example:
```
- job_name: "blackbox_telnet_port"
  scrape_interval: 5s
  metrics_path: /probe
  params:
    module: [tcp_connect]
  static_configs:
    - targets: ['1x3.x1.xx.xx4:443']
      labels:
        group: 'xxxidc机房ip监控'
    - targets: ['10.xx.xx.xxx:443']
      labels:
        group: 'Process status of nginx(main) server'
  relabel_configs:
    - source_labels: [__address__]
      target_label: __param_target
    - source_labels: [__param_target]
      target_label: instance
    - target_label: __address__
      replacement: 10.xxx.xx.xx:9115
```
8.4.3 ICMP probe example:
```
- job_name: 'blackbox00_ping_idc_ip'
  scrape_interval: 10s
  metrics_path: /probe
  params:
    module: [icmp]   # ping
  static_configs:
    - targets: ['1x.xx.xx.xx']
      labels:
        group: 'xxnginx 虚拟IP'
  relabel_configs:
    - source_labels: [__address__]
      regex: (.*)(:80)?
      target_label: __param_target
      replacement: ${1}
    - source_labels: [__param_target]
      regex: (.*)
      target_label: ping
      replacement: ${1}
    - source_labels: []
      regex: .*
      target_label: __address__
      replacement: 1x.xxx.xx.xx:9115
```
8.4.4 POST probe example:
```
- job_name: 'blackbox_http_2xx_post'
  scrape_interval: 10s
  metrics_path: /probe
  params:
    module: [http_post_2xx_query]
  static_configs:
    - targets:
      - https://xx.xxx.com/api/xx/xx/fund/query.action
      labels:
        group: 'Interface monitoring'
  relabel_configs:
    - source_labels: [__address__]
      target_label: __param_target
    - source_labels: [__param_target]
      target_label: instance
    - target_label: __address__
      replacement: 1x.xx.xx.xx:9115  # The blackbox exporter's real hostname:port.
```
8.4.5 SSL certificate expiry monitoring:
```
cat << 'EOF' > prometheus.yml
rule_files:
  - ssl_expiry.rules
scrape_configs:
  - job_name: 'blackbox'
    metrics_path: /probe
    params:
      module: [http_2xx]  # Look for a HTTP 200 response.
    static_configs:
      - targets:
        - example.com     # Target to probe
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 127.0.0.1:9115  # Blackbox exporter.
EOF

cat << 'EOF' > ssl_expiry.rules
groups:
- name: ssl_expiry.rules
  rules:
  - alert: SSLCertExpiringSoon
    expr: probe_ssl_earliest_cert_expiry{job="blackbox"} - time() < 86400 * 30
    for: 10m
EOF
```
8.5 Inspecting a probe:
- Something like (quote the URL, otherwise the shell interprets `&` as "run in background"):
```
curl 'http://172.16.10.65:9115/probe?target=prometheus.io&module=http_2xx&debug=true'
```
8.6 Alerting:
- For icmp, tcp, http and post probes, connectivity is reflected by the probe_success metric:
```
probe_success == 0   ## connectivity broken
probe_success == 1   ## connectivity ok
```
- Alerting simply checks whether this metric equals 0; if so, the alert fires:
```
[sss@prometheus01 prometheus]$ cat rules/blackbox-alert.rules
groups:
- name: blackbox_network_stats
  rules:
  - alert: blackbox_network_stats
    expr: probe_success == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Instance {{ $labels.instance }} is down"
      description: "This requires immediate action!"
```
9. Deploying the full prometheus monitoring stack with docker-compose
- Deployment host: 10.10.11.40
9.1 Components deployed:
prometheus alertmanager grafana nginx node_exporter cadvisor blackbox_exporter
- images:

```
prom/prometheus
prom/alertmanager
quay.io/prometheus/node-exporter , prom/node-exporter
gcr.io/google_containers/cadvisor[:v0.36.0]  # requires access to Google registries
google/cadvisor:v0.33.0                      # docker hub image; older than the google one
grafana/grafana
nginx
```
- After pulling the images, re-tag them and push them to the local harbor registry:
```
image: 10.10.11.40:80/base/nginx:1.19.3
image: 10.10.11.40:80/base/prometheus:2.22.0
image: 10.10.11.40:80/base/grafana:7.2.2
image: 10.10.11.40:80/base/alertmanager:0.21.0
image: 10.10.11.40:80/base/node_exporter:1.0.1
image: 10.10.11.40:80/base/cadvisor:v0.33.0
image: 10.10.11.40:80/base/blackbox-exporter:0.18.0
```
9.2 Deployment layout
- Directory structure:
```
mkdir /home/deploy/monitor
cd /home/deploy/monitor
```
```
[root@host40 monitor]# tree
.
├── alertmanager
│   ├── alertmanager.yml
│   ├── db
│   │   ├── nflog
│   │   └── silences
│   └── templates
│       └── wechat.tmpl
├── blackbox_exporter
│   └── blackbox.yml
├── docker-compose.yml
├── grafana
│   └── db
│       ├── grafana.db
│       ├── plugins
...
├── nginx
│   ├── auth
│   └── nginx.conf
├── node-exporter
│   └── textfiles
├── node_exporter_install_docker.sh
├── prometheus
│   ├── db
│   ├── prometheus.yml
│   ├── rules
│   │   ├── docker_monitor.yml
│   │   ├── system_monitor.yml
│   │   └── tcp_monitor.yml
│   └── sd_files
│       ├── docker_host.yml
│       ├── http.yml
│       ├── icmp.yml
│       ├── real_lan.yml
│       ├── real_wan.yml
│       ├── sedFDm5Rw
│       ├── tcp.yml
│       ├── virtual_lan.yml
│       └── virtual_wan.yml
└── sd_controler.sh
```
- File required for nginx basic auth:

```
[root@host40 monitor-bak]# ls nginx/auth/ -a
.  ..  .htpasswd
```
- Permissions on some of the mounted paths:

The db directories of prometheus, grafana and alertmanager need mode 777. The individually mounted config files alertmanager.yml, prometheus.yml and nginx.conf need mode 666. For better security, put the config files into dedicated directories, mount those directories instead, and point the startup parameters in `command` at the config files.
9.3 docker-compose.yml
[root@host40 monitor-bak]# cat docker-compose.yml version: "3" services: nginx: image: 10.10.11.40:80/base/nginx:1.19.3 hostname: nginx container_name: monitor-nginx restart: always privileged: false ports: - 3001:3000 - 9090:9090 - 9093:9093 volumes: - ./nginx/nginx.conf:/etc/nginx/nginx.conf - ./nginx/auth:/etc/nginx/basic_auth networks: monitor: aliases: - nginx logging: driver: json-file options: max-file: '5' max-size: 50m prometheus: image: 10.10.11.40:80/base/prometheus:2.22.0 container_name: monitor-prometheus hostname: prometheus restart: always privileged: true volumes: - ./prometheus/db/:/prometheus/ - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml - ./prometheus/rules/:/etc/prometheus/rules/ - ./prometheus/sd_files/:/etc/prometheus/sd_files/ command: - '--config.file=/etc/prometheus/prometheus.yml' - '--storage.tsdb.path=/prometheus' - '--web.console.libraries=/usr/share/prometheus/console_libraries' - '--web.console.templates=/usr/share/prometheus/consoles' - '--storage.tsdb.retention=60d' networks: monitor: aliases: - prometheus logging: driver: json-file options: max-file: '5' max-size: 50m grafana: image: 10.10.11.40:80/base/grafana:7.2.2 container_name: monitor-grafana hostname: grafana restart: always privileged: true volumes: - ./grafana/db/:/var/lib/grafana networks: monitor: aliases: - grafana logging: driver: json-file options: max-file: '5' max-size: 50m alertmanger: image: 10.10.11.40:80/base/alertmanager:0.21.0 container_name: monitor-alertmanager hostname: alertmanager restart: always privileged: true volumes: - ./alertmanager/db/:/alertmanager - ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml - ./alertmanager/templates/:/etc/alertmanager/templates networks: monitor: aliases: - alertmanager logging: driver: json-file options: max-file: '5' max-size: 50m node-exporter: image: 10.10.11.40:80/base/node_exporter:1.0.1 container_name: monitor-node-exporter hostname: host40 restart: always privileged: true volumes: 
- /:/host:ro,rslave - ./node-exporter/textfiles/:/textfiles network_mode: "host" command: - '--path.rootfs=/host' - '--web.listen-address=:9100' - '--collector.textfile.directory=/textfiles' logging: driver: json-file options: max-file: '5' max-size: 50m cadvisor: image: 10.10.11.40:80/base/cadvisor:v0.33.0 container_name: monitor-cadvisor hostname: cadvisor restart: always privileged: true volumes: - /:/rootfs:ro - /var/run:/var/run:ro - /sys:/sys:ro - /var/lib/docker/:/var/lib/docker:ro - /dev/disk/:/dev/disk:ro ports: - 9080:8080 networks: monitor: logging: driver: json-file options: max-file: '5' max-size: 50m blackbox_exporter: image: 10.10.11.40:80/base/blackbox-exporter:0.18.0 container_name: monitor-blackbox hostname: blackbox-exporter restart: always privileged: true volumes: - ./blackbox_exporter/:/etc/blackbox_exporter networks: monitor: aliases: - blackbox command: - '--config.file=/etc/blackbox_exporter/blackbox.yml' logging: driver: json-file options: max-file: '5' max-size: 50m networks: monitor: ipam: config: - subnet: 192.168.17.0/24
9.4 nginx
- Since prometheus and alertmanager have no built-in authentication, nginx sits in front to handle routing and basic auth, proxying all backend ports in one place for easier management.
- Default ports of each program:

```
prometheus: 9090
grafana: 3000
alertmanager: 9093
node_exporter: 9100
cadvisor: 8080 (agent side)
```

- Create the basic-auth file used by the stock nginx image:

```
echo monitor:`openssl passwd -crypt 123456` > .htpasswd
```

- A config file mounted as a single file does not pick up edits inside the container (you can mount its directory instead of the file itself):

```
chmod 666 nginx.conf
```

- Reload the config inside the nginx container:

```
docker exec -it web-director nginx -s reload
```
- nginx.conf:
```
[root@host40 monitor-bak]# cat nginx/nginx.conf
user nginx;
worker_processes auto;
error_log /var/log/nginx/error.log;
pid /run/nginx.pid;
include /usr/share/nginx/modules/*.conf;

events {
    worker_connections 10240;
}

http {
    log_format main '$remote_addr - $remote_user [$time_local] "$request" '
                    '$status $body_bytes_sent "$http_referer" '
                    '"$http_user_agent" "$http_x_forwarded_for"';
    access_log /var/log/nginx/access.log main;

    sendfile on;
    tcp_nopush on;
    tcp_nodelay on;
    keepalive_timeout 65;
    types_hash_max_size 2048;
    include /etc/nginx/mime.types;
    default_type application/octet-stream;

    proxy_connect_timeout 500ms;
    proxy_send_timeout 1000ms;
    proxy_read_timeout 3000ms;
    proxy_buffers 64 8k;
    proxy_busy_buffers_size 128k;
    proxy_temp_file_write_size 64k;
    proxy_redirect off;
    proxy_next_upstream error invalid_header timeout http_502 http_504;
    proxy_http_version 1.1;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_set_header X-Real-Port $remote_port;
    proxy_set_header Host $http_host;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    client_max_body_size 10m;
    client_body_buffer_size 512k;
    client_body_timeout 180;
    client_header_timeout 10;
    send_timeout 240;

    gzip on;
    gzip_min_length 1k;
    gzip_buffers 4 16k;
    gzip_comp_level 2;
    gzip_types application/javascript application/x-javascript text/css text/javascript image/jpeg image/gif image/png;
    gzip_vary off;
    gzip_disable "MSIE [1-6].";

    server {
        listen 3000;
        server_name _;
        location / {
            proxy_pass http://grafana:3000;
        }
    }

    server {
        listen 9090;
        server_name _;
        location / {
            auth_basic "auth for monitor";
            auth_basic_user_file /etc/nginx/basic_auth/.htpasswd;
            proxy_pass http://prometheus:9090;
        }
    }

    server {
        listen 9093;
        server_name _;
        location / {
            auth_basic "auth for monitor";
            auth_basic_user_file /etc/nginx/basic_auth/.htpasswd;
            proxy_pass http://alertmanager:9093;
        }
    }
}
```
9.5 prometheus
- Note that the db directory must be writable; give it mode 777.
9.5.1 Main config file: prometheus.yml
```
[root@host40 monitor-bak]# cat prometheus/prometheus.yml
# my global config
global:
  scrape_interval: 15s     # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets: ["alertmanager:9093"]

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
- "rules/*.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
# The job name is added as a label `job=` to any timeseries scraped from this config.
- job_name: 'prometheus'
  static_configs:
  - targets: ['localhost:9090']
- job_name: 'alertmanager'
  static_configs:
  - targets: ['alertmanager:9093']
- job_name: 'node_real_lan'
  file_sd_configs:
  - files:
    - ./sd_files/real_lan.yml
    refresh_interval: 30s
- job_name: 'node_virtual_lan'
  file_sd_configs:
  - files:
    - ./sd_files/virtual_lan.yml
    refresh_interval: 30s
- job_name: 'node_real_wan'
  file_sd_configs:
  - files:
    - ./sd_files/real_wan.yml
    refresh_interval: 30s
- job_name: 'node_virtual_wan'
  file_sd_configs:
  - files:
    - ./sd_files/virtual_wan.yml
    refresh_interval: 30s
- job_name: 'docker_host'
  file_sd_configs:
  - files:
    - ./sd_files/docker_host.yml
    refresh_interval: 30s
- job_name: 'tcp'
  metrics_path: /probe
  params:
    module: [tcp_connect]
  file_sd_configs:
  - files:
    - ./sd_files/tcp.yml
    refresh_interval: 30s
  relabel_configs:
  - source_labels: [__address__]
    target_label: __param_target
  - source_labels: [__param_target]
    target_label: instance
  - target_label: __address__
    replacement: blackbox:9115
- job_name: 'http'
  metrics_path: /probe
  params:
    module: [http_2xx]
  file_sd_configs:
  - files:
    - ./sd_files/http.yml
    refresh_interval: 30s
  relabel_configs:
  - source_labels: [__address__]
    target_label: __param_target
  - source_labels: [__param_target]
    target_label: instance
  - target_label: __address__
    replacement: blackbox:9115
- job_name: 'icmp'
  metrics_path: /probe
  params:
    module: [icmp]
  file_sd_configs:
  - files:
    - ./sd_files/icmp.yml
    refresh_interval: 30s
  relabel_configs:
  - source_labels: [__address__]
    target_label: __param_target
  - source_labels: [__param_target]
    target_label: instance
  - target_label: __address__
    replacement: blackbox:9115
```
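The three relabel steps used by the tcp/http/icmp jobs all follow the same pattern: copy the scraped address into the `target` URL parameter, keep it as the `instance` label, then point the actual scrape at the blackbox exporter. A much-simplified Python sketch of what Prometheus does with these rules (the `relabel` helper is hypothetical and only models `action: replace` with the default regex):

```python
def relabel(labels, configs):
    """Apply a simplified subset of relabel_configs (action=replace,
    default regex (.*), so the replacement is the joined source value)."""
    out = dict(labels)
    for c in configs:
        # join source label values; no source_labels means an empty source
        src = ";".join(out.get(l, "") for l in c.get("source_labels", []))
        out[c["target_label"]] = c.get("replacement", src)
    return out

# the blackbox pattern from the 'tcp' job above
configs = [
    {"source_labels": ["__address__"], "target_label": "__param_target"},
    {"source_labels": ["__param_target"], "target_label": "instance"},
    {"target_label": "__address__", "replacement": "blackbox:9115"},
]
labels = relabel({"__address__": "10.10.11.178:3307"}, configs)
# the scrape now goes to blackbox:9115 with ?target=10.10.11.178:3307,
# while the stored series keeps instance="10.10.11.178:3307"
```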
9.5.2 All jobs use file-based service discovery:
- To monitor a host, simply write its target entry into the sd file of the corresponding job. Examples:
```
ls prometheus/sd_files/
docker_host.yml  http.yml  icmp.yml  real_lan.yml  real_wan.yml  sedFDm5Rw  tcp.yml  virtual_lan.yml  virtual_wan.yml
```
```
cat prometheus/sd_files/docker_host.yml
- targets: ['10.10.11.178:9080']
- targets: ['10.10.11.99:9080']
- targets: ['10.10.11.40:9080']
- targets: ['10.10.11.35:9080']
- targets: ['10.10.11.45:9080']
- targets: ['10.10.11.46:9080']
- targets: ['10.10.11.48:9080']
- targets: ['10.10.11.47:9080']
- targets: ['10.10.11.65:9081']
- targets: ['10.10.11.61:9080']
- targets: ['10.10.11.66:9080']
- targets: ['10.10.11.68:9080']
- targets: ['10.10.11.98:9080']
- targets: ['10.10.11.75:9080']
- targets: ['10.10.11.97:9080']
- targets: ['10.10.11.179:9080']
```
```
cat prometheus/sd_files/tcp.yml
- targets: ['10.10.11.178:8001']
  labels:
    server_name: http_download
- targets: ['10.10.11.178:3307']
  labels:
    server_name: xiaojing_db
- targets: ['10.10.11.178:3001']
  labels:
    server_name: test_web
```
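Besides YAML, `file_sd_configs` also accepts JSON target files, which are easy to generate programmatically. A sketch (the `write_sd_file` helper and the `tcp.json` filename are illustrative, not part of the setup above):

```python
import json
import os

def write_sd_file(path, targets_with_labels):
    """Write a Prometheus file_sd target list as JSON.
    targets_with_labels: list of (target, labels-dict) pairs."""
    doc = [{"targets": [t], "labels": labels} for t, labels in targets_with_labels]
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(doc, f, indent=2)
    # atomic rename, so Prometheus never reads a half-written file
    os.replace(tmp, path)

write_sd_file("tcp.json", [("10.10.11.178:3307", {"server_name": "xiaojing_db"})])
```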
9.5.3 rules files:
- docker rules:
```
cat prometheus/rules/docker_monitor.yml
groups:
- name: "container monitor"
  rules:
  - alert: "Container down: env1"
    expr: time() - container_last_seen{name="env1"} > 60
    for: 30s
    labels:
      severity: critical
    annotations:
      summary: "Container down: {{$labels.instance}} name={{$labels.name}}"
```

- tcp rules:

```
cat prometheus/rules/tcp_monitor.yml
groups:
- name: blackbox_network_stats
  rules:
  - alert: blackbox_network_stats
    expr: probe_success == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Instance {{ $labels.instance }} ,server-name: {{ $labels.server_name }} is down"
      description: "Connection is down..."
```

- system rules: # cpu, mem, disk, network, filesystem...
```
cat prometheus/rules/system_monitor.yml
groups:
- name: "system info"
  rules:
  - alert: "server down"
    expr: up == 0
    for: 3m
    labels:
      severity: critical
    annotations:
      summary: "{{$labels.instance}}: server down"
      description: "{{$labels.instance}}: server unreachable for more than 3 minutes"
  - alert: "high system load"
    expr: (node_load1/count without (cpu, mode) (node_cpu_seconds_total{mode="system"}))* on(instance) group_left(nodename) (node_uname_info) > 1.1
    for: 3m
    labels:
      severity: warning
    annotations:
      summary: "{{$labels.instance}}: high system load"
      description: "{{$labels.instance}}: system load is too high."
      value: "{{$value}}"
  - alert: "CPU usage above 90%"
    expr: 100-(avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by(instance)* 100) > 90
    for: 3m
    labels:
      severity: critical
    annotations:
      summary: "{{$labels.instance}}: CPU usage 90%"
      description: "{{$labels.instance}}: CPU usage is above 90%."
      value: "{{$value}}"
  - alert: "memory usage above 80%"
    expr: (100 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100)* on(instance) group_left(nodename) (node_uname_info) > 80
    for: 3m
    labels:
      severity: critical
    annotations:
      summary: "{{$labels.instance}}: memory usage 80%"
      description: "{{$labels.instance}}: memory usage is above 80%"
      value: "{{$value}}"
  - alert: "IO time ratio too high"
    expr: 100-(avg(irate(node_disk_io_time_seconds_total[1m])) by(instance)* 100) > 85
    for: 3m
    labels:
      severity: longtime
    annotations:
      summary: "{{$labels.instance}}: disk partition above 85%"
      description: "{{$labels.instance}}: disk partition above 85%"
      value: "{{$value}}"
  - alert: "disk full in 4 days"
    expr: predict_linear(node_filesystem_free_bytes[2h], 4 * 24 * 3600) < 0
    for: 3m
    labels:
      severity: longtime
    annotations:
      summary: "{{$labels.instance}}: a disk partition is predicted to fill up within 4 days"
      description: "{{$labels.instance}}: a disk partition is predicted to fill up within 4 days"
      value: "{{$value}}"
```
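The "disk full in 4 days" rule relies on `predict_linear`, which fits a least-squares line over the range vector and extrapolates it forward. A rough Python sketch of the idea (the `predict_linear` helper here is illustrative; the real PromQL function extrapolates from the evaluation timestamp):

```python
def predict_linear(samples, seconds_ahead):
    """samples: list of (timestamp, value). Least-squares fit, then
    extrapolate to the last timestamp + seconds_ahead."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = cov / var
    intercept = mean_v - slope * mean_t
    t_pred = samples[-1][0] + seconds_ahead
    return slope * t_pred + intercept

# free space shrinking by 1 GB per hour from 10 GB: gone well before 4 days,
# so the extrapolated value is negative and the alert would fire
samples = [(t * 3600, 10e9 - t * 1e9) for t in range(3)]
assert predict_linear(samples, 4 * 24 * 3600) < 0
```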
9.6 alertmanager
- Note that the db directory must be writable.
- Main config file:
```
cat alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtphz.qiye.163.com:25'
  smtp_from: 'XXX@fosafer.com'
  smtp_auth_username: 'XXX@fosafer.com'
  smtp_auth_password: 'XXX'
  smtp_hello: 'qiye.163.com'
  smtp_require_tls: true
route:
  group_by: ['instance']
  group_wait: 30s
  receiver: default
  routes:
  - group_interval: 3m
    repeat_interval: 10m
    match:
      severity: warning
    receiver: 'default'
  - group_interval: 3m
    repeat_interval: 30m
    match:
      severity: critical
    receiver: 'default'
  - group_interval: 5m
    repeat_interval: 24h
    match:
      severity: longtime
    receiver: 'default'
templates:
- ./templates/*.tmpl
receivers:
- name: 'default'
  email_configs:
  - to: 'xiangkaihua@fosafer.com'
    send_resolved: true
  wechat_configs:
  - send_resolved: true
    corp_id: 'XXX'
    api_secret: 'XXX'
    agent_id: 1000002
    to_user: XXX
    to_party: 2
    message: '{{ template "wechat.html" . }}'
- name: 'critical'
  email_configs:
  - to: '342382676@qq.com'
    send_resolved: true
  - to: 'xiangkaihua@fosafer.com'
    send_resolved: true
```
- Alert template file:

```
cat alertmanager/templates/wechat.tmpl
{{ define "wechat.html" }}
{{- if gt (len .Alerts.Firing) 0 -}}{{ range .Alerts }}
[@Alert~]
Instance: {{ .Labels.instance }}
Info: {{ .Annotations.summary }}
Detail: {{ .Annotations.description }}
Value: {{ .Annotations.value }}
Time: {{ (.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
{{ end }}{{ end -}}
{{- if gt (len .Alerts.Resolved) 0 -}}{{ range .Alerts }}
[@Resolved~]
Instance: {{ .Labels.instance }}
Info: {{ .Annotations.summary }}
Time: {{ (.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
Resolved: {{ (.EndsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
{{ end }}{{ end -}}
{{- end }}
```
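The `.Add 28800e9` in the template shifts the UTC timestamps by 28800e9 nanoseconds, i.e. 8 hours, converting them to China Standard Time. The same conversion in a couple of lines (the `to_cst` helper is just for illustration):

```python
from datetime import datetime, timedelta, timezone

def to_cst(utc_dt):
    """Add 8 hours, the same shift as .Add 28800e9 (28800 s, in nanoseconds)."""
    return utc_dt + timedelta(seconds=28800)

t = datetime(2023, 1, 1, 16, 30, tzinfo=timezone.utc)
assert to_cst(t).strftime("%Y-%m-%d %H:%M:%S") == "2023-01-02 00:30:00"
```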
9.7 grafana
- Only the volume mount is needed; the config file needs no changes. The db directory stays small and preserves settings and dashboards.
10. Client deployment
10.1 Hosts without docker: install node_exporter directly
- Install script:
http://10.10.11.178:8001/node_exporter_install.sh
10.2 Hosts running docker: install node_exporter and cadvisor as containers
- Install script:
http://10.10.11.178:8001/node_exporter_install_docker.sh
- Required images: docker hosts that have not added the 10.10.11.40:80 registry can download the saved images and `docker load` them before installing:
http://10.10.11.178:8001/monitor-client.tgz
11. Using and maintaining prometheus
11.1 Adding and removing monitored nodes with a script
- Every job uses file-based service discovery, so adding a target only means writing it into the corresponding sd_file; no config reload is needed.
- Based on this, a small text-processing script acts as a front end to the sd_files, adding and deleting targets from the command line instead of editing the files by hand.
- Script name: sd_controler.sh
- Usage: run ./sd_controler.sh with no arguments to print the usage text.
- Full script:

```
[root@host40 monitor]# cat sd_controler.sh
#!/bin/bash
# version: 1.0
# Description: add | del | show instances from/to prometheus file_sd files.
# rl | vl | dk | rw | vw | tcp | http | icmp : short job names; each one maps to
# one sd_file, i.e. the file that job uses for file-based service discovery.
# tcp | http | icmp targets are added with a label (server_name by default),
# because a bare service port rarely tells you which service just went down;
# the label lets whoever receives the alert email know immediately.
# Only one instance can be added or deleted per invocation; for batch changes,
# edit the sd_file directly or loop over this script.

### vars
SD_DIR=./prometheus/sd_files
DOCKER_SD=$SD_DIR/docker_host.yml
RL_HOST_SD=$SD_DIR/real_lan.yml
VL_HOST_SD=$SD_DIR/virtual_lan.yml
RW_HOST_SD=$SD_DIR/real_wan.yml
VW_HOST_SD=$SD_DIR/virtual_wan.yml
TCP_SD=$SD_DIR/tcp.yml
HTTP_SD=$SD_DIR/http.yml
ICMP_SD=$SD_DIR/icmp.yml
SDFILE=

### funcs
usage(){
  echo -e "Usage: $0 [ IP:PORT | FQDN ] [ server-name ]"
  echo -e " example: \n\t node add:\t $0 rl add | del 10.10.10.10:9100\n\t tcp,http,icmp add:\t $0 tcp add 10.10.10.10:3306 web-mysql\n\t del:\t $0 http del www.baidu.com\n\t show:\t $0 rl | vl | dk | rw | vw | tcp | http | icmp show."
  exit
}

add(){
  # $1: SDFILE, $2: IP:PORT
  grep -q $2 $1 || echo -e "- targets: ['$2']" >> $1
}

del(){
  # $1: SDFILE, $2: IP:PORT
  sed -i '/'$2'/d' $1
}

add_with_label(){
  # $1: SDFILE, $2: [IP[:PORT]|FQDN], $3: SERVER-NAME
  LABEL_01="server_name"
  if ! grep -q "'$2'" $1;then
    echo -e "- targets: ['$2']" >> $1
    echo -e "  labels:" >> $1
    echo -e "    ${LABEL_01}: $3" >> $1
  fi
}

del_with_label(){
  # $1: SDFILE, $2: [IP[:PORT]|FQDN]
  NUM=$(cat -n $1 | grep "'$2'" | awk '{print $1}')
  let ENDNUM=NUM+2
  sed -i "${NUM},${ENDNUM}d" $1
}

action(){
  if [ "$1" == "add" ];then
    add $SDFILE $2
  elif [ "$1" == "del" ];then
    del $SDFILE $2
  elif [ "$1" == "show" ];then
    cat $SDFILE
  fi
}

action_with_label(){
  if [ "$1" == "add" ];then
    add_with_label $SDFILE $2 $3
  elif [ "$1" == "del" ];then
    del_with_label $SDFILE $2
  elif [ "$1" == "show" ];then
    cat $SDFILE
  fi
}

### main code
[ "$2" == "" ] || [[ ! "$2" =~ ^(add|del|show)$ ]] && usage

curl --version &>/dev/null || { echo -e "no curl found. " && exit 15; }

if [[ $1 =~ ^(rl|vl|rw|vw|dk)$ ]] && [ "$2" == "add" ];then
  [ "$3" == "" ] && usage
  if [ "$4" != "-f" ];then
    COOD=$(curl -IL -o /dev/null --retry 3 --connect-timeout 3 -s -w "%{http_code}" http://$3/metrics)
    [ "$COOD" != "200" ] && echo -e "http://$3/metrics is not reachable. check it again, or use -f to ignore this check." && exit 11
  fi
fi

if [[ $1 =~ ^(tcp|http|icmp)$ ]] && [ "$2" == "add" ];then
  [ "$4" == "" ] && echo -e "server-name is required when adding tcp, http or icmp targets." && usage
fi

case $1 in
rl)
  SDFILE=$RL_HOST_SD
  action $2 $3 && echo $2 OK
  ;;
vl)
  SDFILE=$VL_HOST_SD
  action $2 $3 && echo $2 OK
  ;;
dk)
  SDFILE=$DOCKER_SD
  action $2 $3 && echo $2 OK
  ;;
rw)
  SDFILE=$RW_HOST_SD
  action $2 $3 && echo $2 OK
  ;;
vw)
  SDFILE=$VW_HOST_SD
  action $2 $3 && echo $2 OK
  ;;
tcp)
  SDFILE=$TCP_SD
  action_with_label $2 $3 $4 && echo $2 OK
  ;;
http)
  SDFILE=$HTTP_SD
  action_with_label $2 $3 $4 && echo $2 OK
  ;;
icmp)
  SDFILE=$ICMP_SD
  action_with_label $2 $3 $4 && echo $2 OK
  ;;
*)
  usage
  ;;
esac
```
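For labeled targets, add_with_label appends a fixed three-line YAML fragment to the sd_file. A sketch of that fragment as a function, handy for sanity-checking the format (the `target_entry` helper is hypothetical and mirrors the script, not Prometheus itself):

```python
def target_entry(target, server_name, label="server_name"):
    """Render the YAML fragment sd_controler.sh appends for tcp/http/icmp targets."""
    return (f"- targets: ['{target}']\n"
            f"  labels:\n"
            f"    {label}: {server_name}\n")

print(target_entry("10.10.10.10:3306", "web-mysql"))
```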
- Register a company at https://work.weixin.qq.com. An unverified company (up to 200 members) is enough; bind a personal wechat account to use the web console.