Copyright 2021 The TensorFlow IO Authors.
In [1]:
TensorFlow datasets from MongoDB collections
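Overview
This tutorial focuses on preparing tf.data.Datasets by reading data from mongoDB collections and using them to train a tf.keras model.
**Note:** A basic understanding of mongodb storage will help you follow this tutorial more easily.
Install packages
This tutorial uses pymongo as a helper package to create a new mongodb database and collection to store the data.
Install the required tensorflow-io and mongodb (helper) packages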
In [2]:
Out[2]:
WARNING: Ignoring invalid distribution -eras (/usr/local/lib/python3.7/dist-packages)
Import packages
In [3]:
Validate tf and tfio imports
In [4]:
Out[4]:
tensorflow-io version: 0.20.0
tensorflow version: 2.6.0
Download and set up the MongoDB instance
For demo purposes, the open-source version of mongodb is used.
In [5]:
Out[5]:
* Starting database mongodb
...done.
WARNING: apt does not have a stable CLI interface. Use with caution in scripts.
debconf: unable to initialize frontend: Dialog
debconf: (No usable dialog-like program is installed, so the dialog based frontend cannot be used. at /usr/share/perl5/Debconf/FrontEnd/Dialog.pm line 76, <> line 8.)
debconf: falling back to frontend: Readline
debconf: unable to initialize frontend: Readline
debconf: (This frontend requires a controlling tty.)
debconf: falling back to frontend: Teletype
dpkg-preconfigure: unable to re-open stdin:
In [6]:
Once the instance has been started, grep for mongo in the process list to confirm that it is available.
In [7]:
Out[7]:
mongodb 580 1 13 17:38 ? 00:00:00 /usr/bin/mongod --config /etc/mongodb.conf
root 612 610 0 17:38 ? 00:00:00 grep mongo
Query the base endpoint to retrieve information about the cluster.
In [8]:
Out[8]:
['admin', 'local']
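Explore the dataset
For the purpose of this tutorial, let's download the PetFinder dataset and feed the data into mongodb manually. The goal of this classification problem is to predict whether a pet will be adopted.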
In [9]:
Out[9]:
Downloading data from http://storage.googleapis.com/download.tensorflow.org/data/petfinder-mini.zip
1671168/1668792 [==============================] - 0s 0us/step
1679360/1668792 [==============================] - 0s 0us/step
In [10]:
Out[10]:
For the purpose of this tutorial, the label column is modified: 0 indicates that the pet was not adopted, and 1 indicates that it was.
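The relabeling cell is not reproduced in this text, so the following is only a minimal plain-Python sketch of the idea. It assumes the raw label lives in a column named `AdoptionSpeed` and that the value `4` means "not adopted"; both are assumptions about the PetFinder data rather than something stated above. The helper name `add_binary_target` and the sample rows are hypothetical.

```python
def add_binary_target(rows, speed_column="AdoptionSpeed", not_adopted=4):
    """Return copies of `rows` with a 0/1 `target` replacing the speed column.

    ASSUMPTION: `speed_column` and `not_adopted` describe how the raw
    dataset encodes "not adopted"; adjust them to match the real data.
    """
    labelled = []
    for row in rows:
        # Copy every field except the raw label column.
        new_row = {key: value for key, value in row.items() if key != speed_column}
        # 0 = not adopted, 1 = adopted.
        new_row["target"] = 0 if row[speed_column] == not_adopted else 1
        labelled.append(new_row)
    return labelled


# Made-up rows, for illustration only.
pets = [
    {"Fee": 100, "PhotoAmt": 1, "AdoptionSpeed": 2},
    {"Fee": 0, "PhotoAmt": 2, "AdoptionSpeed": 4},
]
labelled = add_binary_target(pets)
print(labelled)
```

The function returns new dictionaries instead of editing the input, so the raw rows remain available for inspection.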
In [11]:
In [12]:
Out[12]:
(11537, 14)
Split the dataset
In [13]:
Out[13]:
Number of training samples: 8075
Number of testing sample: 3462
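The splitting call itself is not shown in this text. As a rough sketch, a shuffled 70/30 split that rounds the test share up reproduces the counts printed above for 11537 rows. The 0.3 fraction, the rounding rule, the seed, and the helper name `split_rows` are all assumptions chosen for illustration.

```python
import math
import random


def split_rows(rows, test_fraction=0.3, seed=0):
    """Shuffle `rows` and split them into (train, test) lists.

    ASSUMPTION: the test size is rounded up, which is what yields
    3462 test rows out of 11537.
    """
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # seeded for reproducibility
    n_test = math.ceil(len(rows) * test_fraction)
    return rows[n_test:], rows[:n_test]


# Row indices stand in for the 11537 dataset rows.
train_rows, test_rows = split_rows(range(11537))
print(len(train_rows), len(test_rows))  # 8075 3462
```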
Store the train and test data in mongo collections
In [14]:
In [15]:
In [16]:
In [17]:
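The storage cells above are empty in this text. One way to write rows into a collection with pymongo is sketched below. The database name `tfiodb`, the helper names, and the sample row are assumptions; `MongoClient` and `insert_many` are standard pymongo calls. pymongo is imported inside the function so the rest of the snippet can be loaded without it.

```python
def to_documents(rows):
    """Copy each row so that inserting cannot modify the caller's data.

    pymongo's insert_many adds an `_id` field to documents that lack one.
    """
    return [dict(row) for row in rows]


def store_rows(rows, collection, database="tfiodb",
               uri="mongodb://localhost:27017"):
    """Insert `rows` into `database.collection`; return the number inserted.

    ASSUMPTION: the database name and URI are placeholders for a local
    demo instance.
    """
    from pymongo import MongoClient  # lazy import: needs pymongo installed

    client = MongoClient(uri)
    try:
        result = client[database][collection].insert_many(to_documents(rows))
        return len(result.inserted_ids)
    finally:
        client.close()


# Made-up row, for illustration only.
sample_rows = [{"Fee": 100, "PhotoAmt": 1, "target": 1}]
documents = to_documents(sample_rows)
print(documents)
```

With a running instance, `store_rows(train_rows, "train")` and `store_rows(test_rows, "test")` would populate one collection per split.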
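Prepare tfio datasets
Once the data is available in the cluster, the mongodb.MongoDBIODataset class is used to read it. The class inherits from tf.data.Dataset and therefore exposes all of tf.data.Dataset's useful functionality out of the box.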
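A minimal sketch of constructing such a dataset follows. It assumes a local instance on the default port, a database named `tfiodb`, and a collection named `train`. Those names and the helpers `mongo_uri` and `make_dataset` are hypothetical; `tfio.experimental.mongodb.MongoDBIODataset` is the tensorflow-io class referred to above. tensorflow-io is imported lazily so the snippet loads without it.

```python
def mongo_uri(host="localhost", port=27017):
    """Build a connection URI for an unauthenticated local instance."""
    return f"mongodb://{host}:{port}"


def make_dataset(collection, database="tfiodb", uri=None):
    """Return a MongoDBIODataset over `database.collection`.

    ASSUMPTION: `database` and `collection` must match wherever the
    data was actually stored.
    """
    import tensorflow_io as tfio  # lazy import: needs tensorflow-io installed

    return tfio.experimental.mongodb.MongoDBIODataset(
        uri=uri or mongo_uri(), database=database, collection=collection
    )


print(mongo_uri())  # mongodb://localhost:27017
```

With the server running, `train_ds = make_dataset("train")` would give a dataset whose elements are scalar tf.string records.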
Training dataset
In [18]:
Out[18]:
Connection successful: mongodb://localhost:27017
WARNING:tensorflow:From /usr/local/lib/python3.7/dist-packages/tensorflow/python/data/experimental/ops/counter.py:66: scan (from tensorflow.python.data.experimental.ops.scan_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.Dataset.scan(...) instead
WARNING:tensorflow:From /usr/local/lib/python3.7/dist-packages/tensorflow_io/python/experimental/mongodb_dataset_ops.py:114: take_while (from tensorflow.python.data.experimental.ops.take_while_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.Dataset.take_while(...)
<MongoDBIODataset shapes: (), types: tf.string>
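Each item in train_ds is a string that needs to be decoded into JSON. When decoding, you can select only a subset of the columns by specifying a TensorSpec for each one.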
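To make the decoding step concrete, here is a plain-Python analogue that uses the standard `json` module instead of the tfio decoding call. A spec maps each wanted column to a type, every other field in the record is ignored, and the label is split off from the features. The record contents and the helper name `decode_record` are made up for illustration.

```python
import json

# Columns to keep and the Python type each one is cast to.
# This plays the role of the TensorSpec dictionary.
SPECS = {"Fee": int, "PhotoAmt": int, "target": int}


def decode_record(raw, specs=SPECS, label="target"):
    """Decode one JSON string into a (features, label) pair."""
    document = json.loads(raw)
    # Keep only the columns named in the spec; drop everything else.
    selected = {name: cast(document[name]) for name, cast in specs.items()}
    label_value = selected.pop(label)
    return selected, label_value


# Made-up record with extra fields that the spec leaves out.
raw = '{"_id": {"$oid": "0"}, "Type": "Cat", "Fee": 100, "PhotoAmt": 1, "target": 1}'
features, label = decode_record(raw)
print(features, label)  # {'Fee': 100, 'PhotoAmt': 1} 1
```

In the TensorFlow pipeline, the same per-record function would be applied with the dataset's `map` method and the results batched afterwards.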
In [19]:
Out[19]:
{'Fee': TensorSpec(shape=(), dtype=tf.int32, name='Fee'),
'PhotoAmt': TensorSpec(shape=(), dtype=tf.int32, name='PhotoAmt'),
'target': TensorSpec(shape=(), dtype=tf.int64, name='target')}
In [20]:
Out[20]:
<BatchDataset shapes: ({PhotoAmt: (None,), Fee: (None,)}, (None,)), types: ({PhotoAmt: tf.int32, Fee: tf.int32}, tf.int64)>
Testing dataset
In [21]:
Out[21]:
Connection successful: mongodb://localhost:27017
<BatchDataset shapes: ({PhotoAmt: (None,), Fee: (None,)}, (None,)), types: ({PhotoAmt: tf.int32, Fee: tf.int32}, tf.int64)>
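Define the keras preprocessing layers
Following the structured data tutorial, Keras preprocessing layers are recommended because they are more intuitive and integrate easily with models. The standard feature_columns can be used as well.
For a better understanding of preprocessing_layers in structured data classification, refer to the structured data tutorial.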
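The preprocessing cells are not reproduced in this text, so the exact layers used are unknown here. As an illustration of what a numeric preprocessing layer typically does, the sketch below standardizes a column in plain Python: it learns the mean and standard deviation from sample values, then rescales inputs. Treating Fee this way is an assumption, and the helper name `fit_standardizer` and the sample values are hypothetical.

```python
import math


def fit_standardizer(values):
    """Learn mean and standard deviation; return a rescaling function."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    # Fall back to 1.0 so a constant column does not divide by zero.
    std = math.sqrt(variance) or 1.0

    def standardize(value):
        return (value - mean) / std

    return standardize


# Made-up fee values, for illustration only.
fees = [0, 100, 200]
scale_fee = fit_standardizer(fees)
scaled = [scale_fee(fee) for fee in fees]
print(scaled)
```

A Keras preprocessing layer bundles the same two phases (learn statistics from data, then transform inputs) inside the model, so the scaling travels with the saved model.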
In [22]:
In [23]:
Build, compile and train the model
In [24]:
In [25]:
In [26]:
In [27]:
Out[27]:
Epoch 1/10
109/109 [==============================] - 1s 2ms/step - loss: 0.6261 - accuracy: 0.4711
Epoch 2/10
109/109 [==============================] - 0s 3ms/step - loss: 0.5939 - accuracy: 0.6967
Epoch 3/10
109/109 [==============================] - 0s 3ms/step - loss: 0.5900 - accuracy: 0.6993
Epoch 4/10
109/109 [==============================] - 0s 3ms/step - loss: 0.5846 - accuracy: 0.7146
Epoch 5/10
109/109 [==============================] - 0s 3ms/step - loss: 0.5824 - accuracy: 0.7178
Epoch 6/10
109/109 [==============================] - 0s 2ms/step - loss: 0.5778 - accuracy: 0.7233
Epoch 7/10
109/109 [==============================] - 0s 3ms/step - loss: 0.5810 - accuracy: 0.7083
Epoch 8/10
109/109 [==============================] - 0s 3ms/step - loss: 0.5791 - accuracy: 0.7149
Epoch 9/10
109/109 [==============================] - 0s 3ms/step - loss: 0.5742 - accuracy: 0.7207
Epoch 10/10
109/109 [==============================] - 0s 2ms/step - loss: 0.5797 - accuracy: 0.7083
<keras.callbacks.History at 0x7f743229fe90>
Infer on the test data
In [28]:
Out[28]:
109/109 [==============================] - 0s 2ms/step - loss: 0.5696 - accuracy: 0.7383
test loss, test acc: [0.569588840007782, 0.7383015751838684]
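Note: The goal of this tutorial is to demonstrate Tensorflow-IO's ability to prepare tf.data.Datasets from mongodb and train tf.keras models on them directly, so improving model accuracy is out of scope. Readers can nonetheless explore the dataset and experiment with the feature columns and model architecture to get better classification performance.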