GitHub Repository: tensorflow/docs-l10n
Path: blob/master/site/ja/io/tutorials/mongodb.ipynb

Kernel: Python 3

Copyright 2021 The TensorFlow IO Authors.

In [1]:

#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

TensorFlow datasets from MongoDB collections

Overview

This tutorial focuses on preparing a tf.data.Dataset by reading data from MongoDB collections and using it to train a tf.keras model.

**Note:** A basic understanding of MongoDB storage will help you follow this tutorial with ease.

Setup packages

This tutorial uses pymongo as a helper package to create a new MongoDB database and collection to store the data.

Install the required tensorflow-io and mongodb (helper) packages

In [2]:

!pip install -q tensorflow-io
!pip install -q pymongo

Out[2]:

WARNING: Ignoring invalid distribution -eras (/usr/local/lib/python3.7/dist-packages)

Import packages

In [3]:

import os
import time
from pprint import pprint
from sklearn.model_selection import train_test_split
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow.keras import layers
from tensorflow.keras.layers.experimental import preprocessing
import tensorflow_io as tfio
from pymongo import MongoClient

Validate the TensorFlow and TensorFlow-IO imports

In [4]:

print("tensorflow-io version: {}".format(tfio.__version__))
print("tensorflow version: {}".format(tf.__version__))

Out[4]:

tensorflow-io version: 0.20.0
tensorflow version: 2.6.0

Download and set up the MongoDB instance

For demo purposes, the open-source version of MongoDB is used.

In [5]:

%%bash

sudo apt install -y mongodb >log
service mongodb start

Out[5]:

 * Starting database mongodb
   ...done.

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

debconf: unable to initialize frontend: Dialog
debconf: (No usable dialog-like program is installed, so the dialog based frontend cannot be used. at /usr/share/perl5/Debconf/FrontEnd/Dialog.pm line 76, <> line 8.)
debconf: falling back to frontend: Readline
debconf: unable to initialize frontend: Readline
debconf: (This frontend requires a controlling tty.)
debconf: falling back to frontend: Teletype
dpkg-preconfigure: unable to re-open stdin: 

In [6]:

# Sleep for a few seconds to let the instance start.
time.sleep(5)
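
A fixed sleep can be too short on a slow machine and needlessly long on a fast one. As an alternative, one could poll the server port until it accepts connections. The sketch below uses only the standard library. The helper name `wait_for_port` and its parameters are hypothetical, and it assumes the instance listens on localhost port 27017, the address this tutorial uses later.

```python
import socket
import time


def wait_for_port(host, port, timeout=10.0, interval=0.2):
    """Return True once a TCP connection to (host, port) succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # create_connection raises OSError while nothing is listening.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    return False
```

Calling `wait_for_port("localhost", 27017)` before creating the client would then return as soon as the port is reachable. An open port only shows that something is listening, so the process check in the next step is still worthwhile.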

Once the instance has started, grep for mongo in the process list to confirm that it is available.

In [7]:

%%bash

ps -ef | grep mongo

Out[7]:

mongodb      580       1 13 17:38 ?        00:00:00 /usr/bin/mongod --config /etc/mongodb.conf
root         612     610  0 17:38 ?        00:00:00 grep mongo

Query the base endpoint to retrieve information about the cluster.

In [8]:

client = MongoClient()
client.list_database_names() # ['admin', 'local']

Out[8]:

['admin', 'local']

Explore the dataset

In this tutorial, you download the PetFinder dataset and feed the data into MongoDB manually. The goal of this classification problem is to predict whether a pet will be adopted.

In [9]:

dataset_url = 'http://storage.googleapis.com/download.tensorflow.org/data/petfinder-mini.zip'
csv_file = 'datasets/petfinder-mini/petfinder-mini.csv'
tf.keras.utils.get_file('petfinder_mini.zip', dataset_url,
                        extract=True, cache_dir='.')
pf_df = pd.read_csv(csv_file)

Out[9]:

Downloading data from http://storage.googleapis.com/download.tensorflow.org/data/petfinder-mini.zip
1671168/1668792 [==============================] - 0s 0us/step
1679360/1668792 [==============================] - 0s 0us/step

In [10]:

pf_df.head()

Out[10]:

For the purposes of this tutorial, the label column is modified: 0 indicates that the pet was not adopted, and 1 indicates that it was.

In [11]:

# In the original dataset "4" indicates the pet was not adopted.
pf_df['target'] = np.where(pf_df['AdoptionSpeed']==4, 0, 1)

# Drop un-used columns.
pf_df = pf_df.drop(columns=['AdoptionSpeed', 'Description'])
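
The relabeling rule applied by `np.where` above can be pictured in plain Python. This is only an illustrative sketch: the function name `to_binary_target` and the sample values are hypothetical, not part of the dataset code.

```python
def to_binary_target(adoption_speed):
    """Map AdoptionSpeed to a binary label: 4 (not adopted) -> 0, anything else -> 1."""
    return 0 if adoption_speed == 4 else 1


# Hypothetical AdoptionSpeed values, one per pet.
speeds = [0, 1, 2, 3, 4]
targets = [to_binary_target(s) for s in speeds]
print(targets)  # [1, 1, 1, 1, 0]
```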

In [12]:

# Number of datapoints and columns
len(pf_df), len(pf_df.columns)

Out[12]:

(11537, 14)

Split the dataset

In [13]:

train_df, test_df = train_test_split(pf_df, test_size=0.3, shuffle=True)
print("Number of training samples: ",len(train_df))
print("Number of testing sample: ",len(test_df))

Out[13]:

Number of training samples:  8075
Number of testing sample:  3462
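
The printed sizes can be checked by hand. The sketch below assumes that, when only a fractional `test_size` is given, the test share is rounded up and the training set receives the remainder. That assumption reproduces the counts above. The helper name `split_sizes` is hypothetical.

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size, rounding the test share up."""
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test


print(split_sizes(11537, 0.3))  # (8075, 3462)
```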

Store the training and test data in mongo collections

In [14]:

URI = "mongodb://localhost:27017"
DATABASE = "tfiodb"
TRAIN_COLLECTION = "train"
TEST_COLLECTION = "test"

In [15]:

db = client[DATABASE]
# Create the collections only if they do not already exist.
if TRAIN_COLLECTION not in db.list_collection_names():
  db.create_collection(TRAIN_COLLECTION)
if TEST_COLLECTION not in db.list_collection_names():
  db.create_collection(TEST_COLLECTION)

In [16]:

def store_records(collection, records):
  writer = tfio.experimental.mongodb.MongoDBWriter(
      uri=URI, database=DATABASE, collection=collection
  )
  for record in records:
    writer.write(record)

In [17]:

store_records(collection="train", records=train_df.to_dict("records"))
time.sleep(2)
store_records(collection="test", records=test_df.to_dict("records"))
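
Each call above passes a list of dictionaries, one per row, and each dictionary is written as one document. The sketch below shows that shape using only the standard library. The column names come from this tutorial, but the values are hypothetical, and the list comprehension only imitates the structure that `to_dict("records")` returns.

```python
import json

# Hypothetical column-oriented data using column names from this tutorial.
columns = {"PhotoAmt": [1, 5], "Fee": [100, 0], "target": [1, 0]}

# One dict per row, similar in shape to what DataFrame.to_dict("records") returns.
records = [dict(zip(columns, row)) for row in zip(*columns.values())]

for record in records:
    # Each record corresponds to one document in the collection.
    print(json.dumps(record))
```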

Prepare tfio datasets

Once the data is available in the cluster, the mongodb.MongoDBIODataset class is used to read it. The class inherits from tf.data.Dataset, so all the useful functionality of tf.data.Dataset is available out of the box.

Training dataset

In [18]:

train_ds = tfio.experimental.mongodb.MongoDBIODataset(
        uri=URI, database=DATABASE, collection=TRAIN_COLLECTION
    )

train_ds

Out[18]:

Connection successful: mongodb://localhost:27017
WARNING:tensorflow:From /usr/local/lib/python3.7/dist-packages/tensorflow/python/data/experimental/ops/counter.py:66: scan (from tensorflow.python.data.experimental.ops.scan_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.Dataset.scan(...) instead
WARNING:tensorflow:From /usr/local/lib/python3.7/dist-packages/tensorflow_io/python/experimental/mongodb_dataset_ops.py:114: take_while (from tensorflow.python.data.experimental.ops.take_while_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.Dataset.take_while(...)

<MongoDBIODataset shapes: (), types: tf.string>

Each item in train_ds is a string that needs to be decoded from JSON. When decoding, you can select only a subset of the columns by specifying a TensorSpec for each one.

In [19]:

# Numeric features.
numerical_cols = ['PhotoAmt', 'Fee'] 

SPECS = {
    "target": tf.TensorSpec(tf.TensorShape([]), tf.int64, name="target"),
}
for col in numerical_cols:
  SPECS[col] = tf.TensorSpec(tf.TensorShape([]), tf.int32, name=col)
pprint(SPECS)

Out[19]:

{'Fee': TensorSpec(shape=(), dtype=tf.int32, name='Fee'),
 'PhotoAmt': TensorSpec(shape=(), dtype=tf.int32, name='PhotoAmt'),
 'target': TensorSpec(shape=(), dtype=tf.int64, name='target')}
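
Decoding with a subset of fields, and the later step of separating the label from the features, can be pictured in plain Python. This is a rough analogy rather than how `decode_json` is implemented. `FIELD_TYPES`, `decode_record`, `split_label` and the sample document are hypothetical stand-ins.

```python
import json

# Hypothetical stand-in for SPECS: field name -> Python type to cast to.
FIELD_TYPES = {"PhotoAmt": int, "Fee": int, "target": int}


def decode_record(raw, field_types=FIELD_TYPES):
    """Parse one JSON document and keep only the fields named in field_types."""
    doc = json.loads(raw)
    return {name: cast(doc[name]) for name, cast in field_types.items()}


def split_label(features, label_key="target"):
    """Return (features, label), removing the label from the feature dict."""
    features = dict(features)  # copy so the caller's dict is untouched
    label = features.pop(label_key)
    return features, label


raw = '{"Type": "Cat", "Age": 3, "PhotoAmt": 1, "Fee": 100, "target": 1}'
features, label = split_label(decode_record(raw))
print(features, label)  # {'PhotoAmt': 1, 'Fee': 100} 1
```

Fields that are not listed, such as `Type` and `Age` here, are simply ignored, which is why only `PhotoAmt` and `Fee` appear as model inputs later.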

In [20]:

BATCH_SIZE=32
train_ds = train_ds.map(
        lambda x: tfio.experimental.serialization.decode_json(x, specs=SPECS)
    )

# Prepare a tuple of (features, label)
train_ds = train_ds.map(lambda v: (v, v.pop("target")))
train_ds = train_ds.batch(BATCH_SIZE)

train_ds

Out[20]:

<BatchDataset shapes: ({PhotoAmt: (None,), Fee: (None,)}, (None,)), types: ({PhotoAmt: tf.int32, Fee: tf.int32}, tf.int64)>

Testing dataset

In [21]:

test_ds = tfio.experimental.mongodb.MongoDBIODataset(
        uri=URI, database=DATABASE, collection=TEST_COLLECTION
    )
test_ds = test_ds.map(
        lambda x: tfio.experimental.serialization.decode_json(x, specs=SPECS)
    )
# Prepare a tuple of (features, label)
test_ds = test_ds.map(lambda v: (v, v.pop("target")))
test_ds = test_ds.batch(BATCH_SIZE)

test_ds

Out[21]:

Connection successful: mongodb://localhost:27017

<BatchDataset shapes: ({PhotoAmt: (None,), Fee: (None,)}, (None,)), types: ({PhotoAmt: tf.int32, Fee: tf.int32}, tf.int64)>

Define the Keras preprocessing layers

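As in the structured data tutorial, using Keras preprocessing layers is recommended because they are more intuitive and integrate easily with the model. However, the standard feature_columns can also be used.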

For a better understanding of preprocessing layers in structured data classification, refer to the structured data tutorial.

In [22]:

def get_normalization_layer(name, dataset):
  # Create a Normalization layer for our feature.
  normalizer = preprocessing.Normalization(axis=None)

  # Prepare a Dataset that only yields our feature.
  feature_ds = dataset.map(lambda x, y: x[name])

  # Learn the statistics of the data.
  normalizer.adapt(feature_ds)

  return normalizer
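
With `axis=None`, the layer learns a single scalar mean and variance for the feature and uses them to standardize inputs. The sketch below illustrates that arithmetic with the standard library only. The helper name `fit_normalizer`, the `epsilon` guard and the sample values are assumptions of this sketch, not a description of the Keras layer's internals.

```python
import math
import statistics


def fit_normalizer(values, epsilon=1e-7):
    """Learn a scalar mean and variance, then return a function that standardizes inputs."""
    mean = statistics.fmean(values)
    variance = statistics.pvariance(values, mu=mean)  # population variance
    scale = max(math.sqrt(variance), epsilon)  # guard against division by zero
    return lambda x: (x - mean) / scale


# Hypothetical Fee values.
normalize = fit_normalizer([0, 0, 100, 300])
print(normalize(100))  # 0.0
```

A value equal to the learned mean maps to 0, and other values are expressed in units of the standard deviation.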

In [23]:

all_inputs = []
encoded_features = []

for header in numerical_cols:
  numeric_col = tf.keras.Input(shape=(1,), name=header)
  normalization_layer = get_normalization_layer(header, train_ds)
  encoded_numeric_col = normalization_layer(numeric_col)
  all_inputs.append(numeric_col)
  encoded_features.append(encoded_numeric_col)

Build, compile and train the model

In [24]:

# Set the parameters

OPTIMIZER="adam"
LOSS=tf.keras.losses.BinaryCrossentropy(from_logits=True)
METRICS=['accuracy']
EPOCHS=10

In [25]:

# Convert the feature columns into a tf.keras layer
all_features = tf.keras.layers.concatenate(encoded_features)

# design/build the model
x = tf.keras.layers.Dense(32, activation="relu")(all_features)
x = tf.keras.layers.Dropout(0.5)(x)
x = tf.keras.layers.Dense(64, activation="relu")(x)
x = tf.keras.layers.Dropout(0.5)(x)
output = tf.keras.layers.Dense(1)(x)
model = tf.keras.Model(all_inputs, output)

In [26]:

# compile the model
model.compile(optimizer=OPTIMIZER, loss=LOSS, metrics=METRICS)
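
Because the final `Dense(1)` layer has no activation, the model outputs raw logits, and `from_logits=True` tells the loss to apply the sigmoid itself. The sketch below shows that relationship for a single example in plain Python. The function names are hypothetical, and the formula is the standard numerically stable form of binary cross-entropy, not a copy of the Keras implementation.

```python
import math


def sigmoid(logit):
    """Convert a logit to a probability in (0, 1) without overflowing."""
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    e = math.exp(logit)
    return e / (1.0 + e)


def bce_from_logit(label, logit):
    """Numerically stable binary cross-entropy for one example, given a raw logit."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


print(sigmoid(0.0))            # 0.5
print(bce_from_logit(1, 0.0))  # log(2), about 0.693
```

To read a prediction from this model as a probability, apply a sigmoid to its output. A logit of 0 corresponds to a probability of 0.5.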

In [27]:

# fit the model
model.fit(train_ds, epochs=EPOCHS)

Out[27]:

Epoch 1/10
109/109 [==============================] - 1s 2ms/step - loss: 0.6261 - accuracy: 0.4711
Epoch 2/10
109/109 [==============================] - 0s 3ms/step - loss: 0.5939 - accuracy: 0.6967
Epoch 3/10
109/109 [==============================] - 0s 3ms/step - loss: 0.5900 - accuracy: 0.6993
Epoch 4/10
109/109 [==============================] - 0s 3ms/step - loss: 0.5846 - accuracy: 0.7146
Epoch 5/10
109/109 [==============================] - 0s 3ms/step - loss: 0.5824 - accuracy: 0.7178
Epoch 6/10
109/109 [==============================] - 0s 2ms/step - loss: 0.5778 - accuracy: 0.7233
Epoch 7/10
109/109 [==============================] - 0s 3ms/step - loss: 0.5810 - accuracy: 0.7083
Epoch 8/10
109/109 [==============================] - 0s 3ms/step - loss: 0.5791 - accuracy: 0.7149
Epoch 9/10
109/109 [==============================] - 0s 3ms/step - loss: 0.5742 - accuracy: 0.7207
Epoch 10/10
109/109 [==============================] - 0s 2ms/step - loss: 0.5797 - accuracy: 0.7083

<keras.callbacks.History at 0x7f743229fe90>

Infer on the test data

In [28]:

res = model.evaluate(test_ds)
print("test loss, test acc:", res)

Out[28]:

109/109 [==============================] - 0s 2ms/step - loss: 0.5696 - accuracy: 0.7383
test loss, test acc: [0.569588840007782, 0.7383015751838684]

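Note: The goal of this tutorial is to demonstrate TensorFlow-IO's ability to prepare tf.data.Datasets from MongoDB and train tf.keras models on them directly, so improving the accuracy of the model is out of scope. However, you can explore the dataset and experiment with the feature columns and model architecture to improve classification performance.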
