GitHub Repository: tensorflow/docs-l10n
Path: blob/master/site/zh-cn/io/tutorials/prometheus.ipynb
Kernel: Python 3
#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

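Load metrics from a Prometheus server

Caution: In addition to Python packages, this notebook uses sudo apt-get install to install third-party packages.

Overview

This tutorial loads CoreDNS metrics from a Prometheus server into a tf.data.Dataset, then uses tf.keras for training and inference.

CoreDNS is a DNS server focused on service discovery and is widely deployed as part of Kubernetes clusters. For that reason, it is often closely monitored by DevOps teams.

The example in this tutorial may be useful to DevOps engineers who want to automate operations through machine learning.

Setup and usage

Install the required tensorflow-io package, then restart the runtime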

import os

try:
  %tensorflow_version 2.x
except Exception:
  pass
TensorFlow 2.x selected.
!pip install tensorflow-io
Requirement already satisfied: tensorflow-io in /usr/local/lib/python3.6/dist-packages (0.12.0) Requirement already satisfied: tensorflow<2.2.0,>=2.1.0 in /tensorflow-2.1.0/python3.6 (from tensorflow-io) (2.1.0) Requirement already satisfied: opt-einsum>=2.3.2 in /tensorflow-2.1.0/python3.6 (from tensorflow<2.2.0,>=2.1.0->tensorflow-io) (3.2.0) Requirement already satisfied: google-pasta>=0.1.6 in /tensorflow-2.1.0/python3.6 (from tensorflow<2.2.0,>=2.1.0->tensorflow-io) (0.1.8) Requirement already satisfied: tensorflow-estimator<2.2.0,>=2.1.0rc0 in /tensorflow-2.1.0/python3.6 (from tensorflow<2.2.0,>=2.1.0->tensorflow-io) (2.1.0) Requirement already satisfied: tensorboard<2.2.0,>=2.1.0 in /tensorflow-2.1.0/python3.6 (from tensorflow<2.2.0,>=2.1.0->tensorflow-io) (2.1.0) Requirement already satisfied: wheel>=0.26; python_version >= "3" in /tensorflow-2.1.0/python3.6 (from tensorflow<2.2.0,>=2.1.0->tensorflow-io) (0.34.2) Requirement already satisfied: grpcio>=1.8.6 in /tensorflow-2.1.0/python3.6 (from tensorflow<2.2.0,>=2.1.0->tensorflow-io) (1.27.2) Requirement already satisfied: astor>=0.6.0 in /tensorflow-2.1.0/python3.6 (from tensorflow<2.2.0,>=2.1.0->tensorflow-io) (0.8.1) Requirement already satisfied: absl-py>=0.7.0 in /tensorflow-2.1.0/python3.6 (from tensorflow<2.2.0,>=2.1.0->tensorflow-io) (0.9.0) Requirement already satisfied: termcolor>=1.1.0 in /tensorflow-2.1.0/python3.6 (from tensorflow<2.2.0,>=2.1.0->tensorflow-io) (1.1.0) Requirement already satisfied: numpy<2.0,>=1.16.0 in /tensorflow-2.1.0/python3.6 (from tensorflow<2.2.0,>=2.1.0->tensorflow-io) (1.18.1) Requirement already satisfied: keras-applications>=1.0.8 in /tensorflow-2.1.0/python3.6 (from tensorflow<2.2.0,>=2.1.0->tensorflow-io) (1.0.8) Requirement already satisfied: protobuf>=3.8.0 in /tensorflow-2.1.0/python3.6 (from tensorflow<2.2.0,>=2.1.0->tensorflow-io) (3.11.3) Requirement already satisfied: keras-preprocessing>=1.1.0 in /tensorflow-2.1.0/python3.6 (from 
tensorflow<2.2.0,>=2.1.0->tensorflow-io) (1.1.0) Requirement already satisfied: wrapt>=1.11.1 in /tensorflow-2.1.0/python3.6 (from tensorflow<2.2.0,>=2.1.0->tensorflow-io) (1.12.0) Requirement already satisfied: gast==0.2.2 in /tensorflow-2.1.0/python3.6 (from tensorflow<2.2.0,>=2.1.0->tensorflow-io) (0.2.2) Requirement already satisfied: scipy==1.4.1; python_version >= "3" in /tensorflow-2.1.0/python3.6 (from tensorflow<2.2.0,>=2.1.0->tensorflow-io) (1.4.1) Requirement already satisfied: six>=1.12.0 in /tensorflow-2.1.0/python3.6 (from tensorflow<2.2.0,>=2.1.0->tensorflow-io) (1.14.0) Requirement already satisfied: markdown>=2.6.8 in /tensorflow-2.1.0/python3.6 (from tensorboard<2.2.0,>=2.1.0->tensorflow<2.2.0,>=2.1.0->tensorflow-io) (3.2.1) Requirement already satisfied: setuptools>=41.0.0 in /tensorflow-2.1.0/python3.6 (from tensorboard<2.2.0,>=2.1.0->tensorflow<2.2.0,>=2.1.0->tensorflow-io) (45.2.0) Requirement already satisfied: werkzeug>=0.11.15 in /tensorflow-2.1.0/python3.6 (from tensorboard<2.2.0,>=2.1.0->tensorflow<2.2.0,>=2.1.0->tensorflow-io) (1.0.0) Requirement already satisfied: google-auth-oauthlib<0.5,>=0.4.1 in /tensorflow-2.1.0/python3.6 (from tensorboard<2.2.0,>=2.1.0->tensorflow<2.2.0,>=2.1.0->tensorflow-io) (0.4.1) Requirement already satisfied: google-auth<2,>=1.6.3 in /tensorflow-2.1.0/python3.6 (from tensorboard<2.2.0,>=2.1.0->tensorflow<2.2.0,>=2.1.0->tensorflow-io) (1.11.2) Requirement already satisfied: requests<3,>=2.21.0 in /tensorflow-2.1.0/python3.6 (from tensorboard<2.2.0,>=2.1.0->tensorflow<2.2.0,>=2.1.0->tensorflow-io) (2.23.0) Requirement already satisfied: h5py in /tensorflow-2.1.0/python3.6 (from keras-applications>=1.0.8->tensorflow<2.2.0,>=2.1.0->tensorflow-io) (2.10.0) Requirement already satisfied: requests-oauthlib>=0.7.0 in /tensorflow-2.1.0/python3.6 (from google-auth-oauthlib<0.5,>=0.4.1->tensorboard<2.2.0,>=2.1.0->tensorflow<2.2.0,>=2.1.0->tensorflow-io) (1.3.0) Requirement already satisfied: pyasn1-modules>=0.2.1 in 
/tensorflow-2.1.0/python3.6 (from google-auth<2,>=1.6.3->tensorboard<2.2.0,>=2.1.0->tensorflow<2.2.0,>=2.1.0->tensorflow-io) (0.2.8) Requirement already satisfied: cachetools<5.0,>=2.0.0 in /tensorflow-2.1.0/python3.6 (from google-auth<2,>=1.6.3->tensorboard<2.2.0,>=2.1.0->tensorflow<2.2.0,>=2.1.0->tensorflow-io) (4.0.0) Requirement already satisfied: rsa<4.1,>=3.1.4 in /tensorflow-2.1.0/python3.6 (from google-auth<2,>=1.6.3->tensorboard<2.2.0,>=2.1.0->tensorflow<2.2.0,>=2.1.0->tensorflow-io) (4.0) Requirement already satisfied: certifi>=2017.4.17 in /tensorflow-2.1.0/python3.6 (from requests<3,>=2.21.0->tensorboard<2.2.0,>=2.1.0->tensorflow<2.2.0,>=2.1.0->tensorflow-io) (2019.11.28) Requirement already satisfied: idna<3,>=2.5 in /tensorflow-2.1.0/python3.6 (from requests<3,>=2.21.0->tensorboard<2.2.0,>=2.1.0->tensorflow<2.2.0,>=2.1.0->tensorflow-io) (2.9) Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /tensorflow-2.1.0/python3.6 (from requests<3,>=2.21.0->tensorboard<2.2.0,>=2.1.0->tensorflow<2.2.0,>=2.1.0->tensorflow-io) (1.25.8) Requirement already satisfied: chardet<4,>=3.0.2 in /tensorflow-2.1.0/python3.6 (from requests<3,>=2.21.0->tensorboard<2.2.0,>=2.1.0->tensorflow<2.2.0,>=2.1.0->tensorflow-io) (3.0.4) Requirement already satisfied: oauthlib>=3.0.0 in /tensorflow-2.1.0/python3.6 (from requests-oauthlib>=0.7.0->google-auth-oauthlib<0.5,>=0.4.1->tensorboard<2.2.0,>=2.1.0->tensorflow<2.2.0,>=2.1.0->tensorflow-io) (3.1.0) Requirement already satisfied: pyasn1<0.5.0,>=0.4.6 in /tensorflow-2.1.0/python3.6 (from pyasn1-modules>=0.2.1->google-auth<2,>=1.6.3->tensorboard<2.2.0,>=2.1.0->tensorflow<2.2.0,>=2.1.0->tensorflow-io) (0.4.8)
from datetime import datetime

import tensorflow as tf
import tensorflow_io as tfio

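Install and set up CoreDNS and Prometheus

For demo purposes, a CoreDNS server runs locally with port 9053 open to receive DNS queries and port 9153 (the default) open to expose metrics for scraping. The following is a basic Corefile configuration for CoreDNS, which is also available for download:

.:9053 {
  prometheus
  whoami
}

More details about installation can be found in the CoreDNS documentation.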

!curl -s -OL https://github.com/coredns/coredns/releases/download/v1.6.7/coredns_1.6.7_linux_amd64.tgz
!tar -xzf coredns_1.6.7_linux_amd64.tgz
!curl -s -OL https://raw.githubusercontent.com/tensorflow/io/master/docs/tutorials/prometheus/Corefile
!cat Corefile
.:9053 {
  prometheus
  whoami
}
# Run `./coredns` as a background process.
# IPython doesn't recognize `&` in inline bash cells.
get_ipython().system_raw('./coredns &')

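The next step is to set up a Prometheus server and use it to scrape the CoreDNS metrics exposed on port 9153 as described above. The prometheus.yml configuration file is also available for download: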

!curl -s -OL https://github.com/prometheus/prometheus/releases/download/v2.15.2/prometheus-2.15.2.linux-amd64.tar.gz
!tar -xzf prometheus-2.15.2.linux-amd64.tar.gz --strip-components=1
!curl -s -OL https://raw.githubusercontent.com/tensorflow/io/master/docs/tutorials/prometheus/prometheus.yml
!cat prometheus.yml
global:
  scrape_interval: 1s
  evaluation_interval: 1s
alerting:
  alertmanagers:
  - static_configs:
    - targets:
rule_files:
scrape_configs:
- job_name: 'prometheus'
  static_configs:
  - targets: ['localhost:9090']
- job_name: "coredns"
  static_configs:
  - targets: ['localhost:9153']
# Run `./prometheus` as a background process.
# IPython doesn't recognize `&` in inline bash cells.
get_ipython().system_raw('./prometheus &')
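
Because both servers are started in the background, later cells can run before the ports are actually listening. The following is a minimal sketch of a readiness check using only the Python standard library. The helper name wait_for_port and its timeout values are assumptions made for illustration and are not part of the original notebook.

```python
import socket
import time


def wait_for_port(host, port, timeout=10.0, interval=0.2):
    """Return True once a TCP connection to (host, port) succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)


# Self-contained demonstration against a throwaway local listener,
# so the sketch runs even without CoreDNS or Prometheus.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
listener.listen(1)
demo_port = listener.getsockname()[1]

ready = wait_for_port("127.0.0.1", demo_port, timeout=2.0)
listener.close()
print("listener ready:", ready)
```

In the notebook, the same helper could be called with ("localhost", 9153) for the CoreDNS metrics endpoint and ("localhost", 9090) for Prometheus before moving on.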

To show some activity, the dig command can be used to generate a few DNS queries against the CoreDNS server that has been set up:

!sudo apt-get install -y -qq dnsutils
!dig @127.0.0.1 -p 9053 demo1.example.org
; <<>> DiG 9.11.3-1ubuntu1.11-Ubuntu <<>> @127.0.0.1 -p 9053 demo1.example.org
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 53868
;; flags: qr aa rd; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 3
;; WARNING: recursion requested but not available
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
; COOKIE: 855234f1adcb7a28 (echoed)
;; QUESTION SECTION:
;demo1.example.org. IN A
;; ADDITIONAL SECTION:
demo1.example.org. 0 IN A 127.0.0.1
_udp.demo1.example.org. 0 IN SRV 0 0 45361 .
;; Query time: 0 msec
;; SERVER: 127.0.0.1#9053(127.0.0.1)
;; WHEN: Tue Mar 03 22:35:20 UTC 2020
;; MSG SIZE rcvd: 132
!dig @127.0.0.1 -p 9053 demo2.example.org
; <<>> DiG 9.11.3-1ubuntu1.11-Ubuntu <<>> @127.0.0.1 -p 9053 demo2.example.org
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 53163
;; flags: qr aa rd; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 3
;; WARNING: recursion requested but not available
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
; COOKIE: f18b2ba23e13446d (echoed)
;; QUESTION SECTION:
;demo2.example.org. IN A
;; ADDITIONAL SECTION:
demo2.example.org. 0 IN A 127.0.0.1
_udp.demo2.example.org. 0 IN SRV 0 0 42194 .
;; Query time: 0 msec
;; SERVER: 127.0.0.1#9053(127.0.0.1)
;; WHEN: Tue Mar 03 22:35:21 UTC 2020
;; MSG SIZE rcvd: 132

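A CoreDNS server is now running, and a Prometheus server is scraping its metrics so that they are ready to be consumed by TensorFlow.

Create a Dataset for CoreDNS metrics and use it in TensorFlow

A Dataset for the CoreDNS metrics available from the Prometheus server can be created with tfio.experimental.IODataset.from_prometheus. At least two arguments are needed: query is passed to the Prometheus server to select the metrics, and length is the period (in seconds) to load into the Dataset.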
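
To make the roles of query and length concrete, the sketch below builds a comparable range query against the Prometheus HTTP API (/api/v1/query_range) by hand. This is an illustration only: it assumes the last length seconds are requested at a one-second step, which may not match how tensorflow-io issues its request internally, and the helper name build_range_query_url is hypothetical.

```python
from urllib.parse import urlencode


def build_range_query_url(query, length, end_time, endpoint="http://localhost:9090"):
    """Build a Prometheus range-query URL covering `length` seconds up to `end_time`.

    `end_time` is a Unix timestamp in seconds. The one-second step is an
    assumption chosen to match the 1s scrape interval used in this tutorial.
    """
    params = {
        "query": query,
        "start": end_time - length,
        "end": end_time,
        "step": "1s",
    }
    return "{}/api/v1/query_range?{}".format(endpoint.rstrip("/"), urlencode(params))


# The end time here is an arbitrary example value, not one taken from the server.
url = build_range_query_url("coredns_dns_request_count_total", 5, end_time=1583274921)
print(url)
```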

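You can start with "coredns_dns_request_count_total" and "5" (seconds) to create the Dataset below. Since two DNS queries were sent earlier in the tutorial, the "coredns_dns_request_count_total" metric is expected to be "2.0" at the end of the time series: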

dataset = tfio.experimental.IODataset.from_prometheus(
    "coredns_dns_request_count_total", 5, endpoint="http://localhost:9090")

print("Dataset Spec:\n{}\n".format(dataset.element_spec))

print("CoreDNS Time Series:")
for (time, value) in dataset:
  # time is in milliseconds; convert it to a datetime
  time = datetime.fromtimestamp(time // 1000)
  print("{}: {}".format(time, value['coredns']['localhost:9153']['coredns_dns_request_count_total']))
Dataset Spec:
(TensorSpec(shape=(), dtype=tf.int64, name=None), {'coredns': {'localhost:9153': {'coredns_dns_request_count_total': TensorSpec(shape=(), dtype=tf.float64, name=None)}}})

CoreDNS Time Series:
2020-03-03 22:35:17: 2.0
2020-03-03 22:35:18: 2.0
2020-03-03 22:35:19: 2.0
2020-03-03 22:35:20: 2.0
2020-03-03 22:35:21: 2.0
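
The timestamps in the dataset are integer milliseconds since the Unix epoch, and datetime.fromtimestamp without a timezone converts them to the local time of the machine running the notebook. A small sketch of a timezone-explicit conversion follows. The helper name millis_to_utc is hypothetical, and the sample value is chosen for illustration rather than read from the dataset.

```python
from datetime import datetime, timezone


def millis_to_utc(ms):
    """Convert integer milliseconds since the Unix epoch to an aware UTC datetime."""
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)


sample_ms = 1583274921000  # illustrative timestamp in milliseconds
print(millis_to_utc(sample_ms).isoformat())
```

When the value is a TensorFlow tensor rather than a plain integer, it would first need to be converted, for example with int(time.numpy()).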

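Looking further into the spec of the Dataset:

(
  TensorSpec(shape=(), dtype=tf.int64, name=None),
  {
    'coredns': {
      'localhost:9153': {
        'coredns_dns_request_count_total': TensorSpec(shape=(), dtype=tf.float64, name=None)
      }
    }
  }
)

The dataset consists of (time, values) tuples, where the values field is a Python dict of the form:

"job_name": {
  "instance_name": {
    "metric_name": value,
  },
}

In the example above, 'coredns' is the job name, 'localhost:9153' is the instance name, and 'coredns_dns_request_count_total' is the metric name. Note that, depending on the Prometheus query used, multiple jobs, instances, or metrics may be returned. This is why a Python dict is used in the structure of the Dataset.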
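
Since a query can return several jobs, instances, and metrics at once, it can be convenient to walk the nested dict generically instead of hard-coding keys. The sketch below uses plain Python values in place of tensors. The helper name flatten_values is hypothetical, and the second instance in the sample data is invented for illustration.

```python
def flatten_values(values):
    """Yield (job, instance, metric, value) tuples from a job/instance/metric dict."""
    for job, instances in values.items():
        for instance, metrics in instances.items():
            for metric, value in metrics.items():
                yield job, instance, metric, value


# Illustrative element shaped like the `values` half of a dataset element.
sample = {
    "coredns": {
        "localhost:9153": {"coredns_dns_request_count_total": 2.0},
        "localhost:9154": {"coredns_dns_request_count_total": 0.0},  # hypothetical instance
    },
}

rows = list(flatten_values(sample))
for row in rows:
    print(row)
```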

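Take another query, "go_memstats_gc_sys_bytes", as an example. Since both CoreDNS and Prometheus are written in Go, the "go_memstats_gc_sys_bytes" metric is available for both the "coredns" job and the "prometheus" job:

Note: This cell may error out the first time you run it. Running it again should succeed.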

dataset = tfio.experimental.IODataset.from_prometheus(
    "go_memstats_gc_sys_bytes", 5, endpoint="http://localhost:9090")

print("Time Series CoreDNS/Prometheus Comparision:")
for (time, value) in dataset:
  # time is in milliseconds; convert it to a datetime
  time = datetime.fromtimestamp(time // 1000)
  print("{}: {}/{}".format(
      time,
      value['coredns']['localhost:9153']['go_memstats_gc_sys_bytes'],
      value['prometheus']['localhost:9090']['go_memstats_gc_sys_bytes']))
Time Series CoreDNS/Prometheus Comparision:
2020-03-03 22:35:17: 2385920.0/2775040.0
2020-03-03 22:35:18: 2385920.0/2775040.0
2020-03-03 22:35:19: 2385920.0/2775040.0
2020-03-03 22:35:20: 2385920.0/2775040.0
2020-03-03 22:35:21: 2385920.0/2775040.0
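
The cell above can fail on its first run. One plausible cause (an assumption, not something the tutorial states) is that Prometheus has not yet collected samples for both jobs. A generic retry wrapper is sketched below. Both retry_call and the flaky stand-in function are hypothetical; in the notebook, the wrapped function would be the code that builds and iterates the dataset.

```python
import time


def retry_call(fn, attempts=3, delay=1.0):
    """Call fn() until it succeeds or `attempts` is exhausted, sleeping between tries."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            time.sleep(delay)


# Stand-in for a cell that fails once and then succeeds.
calls = {"count": 0}


def flaky():
    calls["count"] += 1
    if calls["count"] == 1:
        raise RuntimeError("metrics not ready yet")
    return "ok"


result = retry_call(flaky, attempts=3, delay=0.01)
print(result, "after", calls["count"], "calls")
```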

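The created Dataset is now ready to be passed directly to tf.keras for training or inference.

Use the Dataset for model training

Once the metrics Dataset has been created, it can be passed directly to tf.keras for model training or inference.

For demo purposes, this tutorial uses only a very simple LSTM model that takes 1 feature and 2 steps as input: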

n_steps, n_features = 2, 1
simple_lstm_model = tf.keras.models.Sequential([
    tf.keras.layers.LSTM(8, input_shape=(n_steps, n_features)),
    tf.keras.layers.Dense(1)
])

simple_lstm_model.compile(optimizer='adam', loss='mae')

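The dataset to be used is the value of 'go_memstats_sys_bytes' for CoreDNS with 10 samples. However, since a sliding window with window=n_steps and shift=1 is formed, additional samples are needed: n_steps - 1 extra readings to complete the windows, plus one more because, for any two consecutive elements, the first is taken as x and the second as y for training. The total is 10 + n_steps - 1 + 1 = 12 seconds.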
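
The sample-count arithmetic can be checked without TensorFlow. The sketch below mimics window(n_steps, shift=1, drop_remainder=True) followed by the take/skip pairing, using plain Python lists. The integer values stand in for metric readings and are purely illustrative.

```python
n_steps, n_samples = 2, 10
total = n_samples + n_steps - 1 + 1  # 12 raw readings, one per second

values = list(range(total))  # stand-ins for 12 metric readings

# Sliding windows of length n_steps with shift 1, dropping incomplete windows.
windows = [values[i:i + n_steps] for i in range(total - n_steps + 1)]

# x is the first n_samples windows; y is the same sequence shifted by one window.
x = windows[:n_samples]
y = windows[1:1 + n_samples]
pairs = list(zip(x, y))

print(len(windows), "windows,", len(pairs), "training pairs")
print("first pair:", pairs[0])
print("last pair:", pairs[-1])
```

With 12 readings there are 11 windows, which is exactly enough for 10 x windows and 10 y windows offset by one.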

The data values are also scaled to [0, 1].
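
Note that the scaling used in this tutorial divides by the maximum rather than applying min-max normalization, so for positive readings the results fall in (0, 1] and the largest reading maps to exactly 1. A plain-Python sketch with made-up readings:

```python
readings = [2385920.0, 2400000.0, 2775040.0]  # made-up byte counts for illustration

# Plays the role of dataset.reduce(tf.constant(0.0, tf.float64), tf.math.maximum).
v_max = max(readings)
scaled = [v / v_max for v in readings]

print(scaled)
```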

n_samples = 10

dataset = tfio.experimental.IODataset.from_prometheus(
    "go_memstats_sys_bytes", n_samples + n_steps - 1 + 1, endpoint="http://localhost:9090")

# take go_memstats_sys_bytes from the coredns job
dataset = dataset.map(lambda _, v: v['coredns']['localhost:9153']['go_memstats_sys_bytes'])

# find the max value and scale the values to [0, 1]
v_max = dataset.reduce(tf.constant(0.0, tf.float64), tf.math.maximum)
dataset = dataset.map(lambda v: (v / v_max))

# expand the dimension by 1 to fit n_features=1
dataset = dataset.map(lambda v: tf.expand_dims(v, -1))

# take a sliding window
dataset = dataset.window(n_steps, shift=1, drop_remainder=True)
dataset = dataset.flat_map(lambda d: d.batch(n_steps))

# the first value is x and the next value is y; only take 10 samples
x = dataset.take(n_samples)
y = dataset.skip(1).take(n_samples)

dataset = tf.data.Dataset.zip((x, y))

# pass the final dataset to model.fit for training
simple_lstm_model.fit(dataset.batch(1).repeat(10), epochs=5, steps_per_epoch=10)
Train for 10 steps
Epoch 1/5
10/10 [==============================] - 2s 150ms/step - loss: 0.8484
Epoch 2/5
10/10 [==============================] - 0s 10ms/step - loss: 0.7808
Epoch 3/5
10/10 [==============================] - 0s 10ms/step - loss: 0.7102
Epoch 4/5
10/10 [==============================] - 0s 11ms/step - loss: 0.6359
Epoch 5/5
10/10 [==============================] - 0s 11ms/step - loss: 0.5572
<tensorflow.python.keras.callbacks.History at 0x7f1758f3da90>

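The model trained above is not very useful in practice, because the CoreDNS server set up in this tutorial carries no workload. It is, however, a working pipeline that could be used to load metrics from real production servers. Developers can then improve the model to solve real-world problems in DevOps automation.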