#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

TensorFlow datasets from MongoDB collections

Overview

This tutorial focuses on preparing tf.data.Datasets by reading data from MongoDB collections and using the data to train a tf.keras model.

NOTE: A basic understanding of MongoDB storage will help you follow the tutorial with ease.

Setup packages

This tutorial uses pymongo as a helper package to create a new MongoDB database and collections to store the data.

Install the required tensorflow-io and pymongo (helper) packages.

!pip install -q tensorflow-io
!pip install -q pymongo

Import packages

import os
import time
from pprint import pprint

from sklearn.model_selection import train_test_split
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow.keras import layers
from tensorflow.keras.layers.experimental import preprocessing
import tensorflow_io as tfio
from pymongo import MongoClient

Validate tf and tfio imports

print("tensorflow-io version: {}".format(tfio.__version__)) print("tensorflow version: {}".format(tf.__version__))
tensorflow-io version: 0.20.0
tensorflow version: 2.6.0

Download and set up the MongoDB instance

For demo purposes, the open-source edition of MongoDB is used.

%%bash
sudo apt install -y mongodb >log
service mongodb start
* Starting database mongodb ...done.
# Sleep for a few seconds to let the instance start.
time.sleep(5)

Once the instance has been started, grep for mongo in the process list to confirm its availability.

%%bash
ps -ef | grep mongo
mongodb    580      1 13 17:38 ?  00:00:00 /usr/bin/mongod --config /etc/mongodb.conf
root       612    610  0 17:38 ?  00:00:00 grep mongo

Query the base endpoint to retrieve information about the cluster.

client = MongoClient()
client.list_database_names()  # ['admin', 'local']
['admin', 'local']

Explore the dataset

For the purpose of this tutorial, let's download the PetFinder dataset and feed the data into MongoDB manually. The goal of this classification problem is to predict whether the pet will be adopted or not.

dataset_url = 'http://storage.googleapis.com/download.tensorflow.org/data/petfinder-mini.zip'
csv_file = 'datasets/petfinder-mini/petfinder-mini.csv'
tf.keras.utils.get_file('petfinder_mini.zip', dataset_url,
                        extract=True, cache_dir='.')
pf_df = pd.read_csv(csv_file)
Downloading data from http://storage.googleapis.com/download.tensorflow.org/data/petfinder-mini.zip
1671168/1668792 [==============================] - 0s 0us/step
1679360/1668792 [==============================] - 0s 0us/step
pf_df.head()

For the purpose of the tutorial, modifications are made to the label column: 0 indicates the pet was not adopted, and 1 indicates that it was.

# In the original dataset "4" indicates the pet was not adopted.
pf_df['target'] = np.where(pf_df['AdoptionSpeed']==4, 0, 1)

# Drop unused columns.
pf_df = pf_df.drop(columns=['AdoptionSpeed', 'Description'])
# Number of datapoints and columns
len(pf_df), len(pf_df.columns)
(11537, 14)

Split the dataset

train_df, test_df = train_test_split(pf_df, test_size=0.3, shuffle=True)
print("Number of training samples: ", len(train_df))
print("Number of testing samples: ", len(test_df))
Number of training samples:  8075
Number of testing samples:  3462

Store the train and test data in MongoDB collections

URI = "mongodb://localhost:27017"
DATABASE = "tfiodb"
TRAIN_COLLECTION = "train"
TEST_COLLECTION = "test"
db = client[DATABASE]

if TRAIN_COLLECTION not in db.list_collection_names():
    db.create_collection(TRAIN_COLLECTION)
if TEST_COLLECTION not in db.list_collection_names():
    db.create_collection(TEST_COLLECTION)
def store_records(collection, records):
    writer = tfio.experimental.mongodb.MongoDBWriter(
        uri=URI, database=DATABASE, collection=collection
    )
    for record in records:
        writer.write(record)
store_records(collection=TRAIN_COLLECTION, records=train_df.to_dict("records"))
time.sleep(2)
store_records(collection=TEST_COLLECTION, records=test_df.to_dict("records"))
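
As a quick sanity check (a minimal sketch, not part of the original flow), pymongo can confirm that the documents were written; the counts should match len(train_df) and len(test_df).

# Count the documents stored in each collection.
print(db[TRAIN_COLLECTION].count_documents({}))  # expected: 8075
print(db[TEST_COLLECTION].count_documents({}))   # expected: 3462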

Prepare tfio datasets

Once the data is available in the cluster, the mongodb.MongoDBIODataset class can be used to prepare a tf.data.Dataset from it. The class inherits from tf.data.Dataset and thus exposes all of its useful functionality out of the box.

Training dataset

train_ds = tfio.experimental.mongodb.MongoDBIODataset(
    uri=URI, database=DATABASE, collection=TRAIN_COLLECTION
)

train_ds
Connection successful: mongodb://localhost:27017
WARNING:tensorflow:From /usr/local/lib/python3.7/dist-packages/tensorflow/python/data/experimental/ops/counter.py:66: scan (from tensorflow.python.data.experimental.ops.scan_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.Dataset.scan(...) instead
WARNING:tensorflow:From /usr/local/lib/python3.7/dist-packages/tensorflow_io/python/experimental/mongodb_dataset_ops.py:114: take_while (from tensorflow.python.data.experimental.ops.take_while_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.Dataset.take_while(...)
<MongoDBIODataset shapes: (), types: tf.string>
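
Since MongoDBIODataset is a regular tf.data.Dataset, the standard iteration APIs apply directly. As a minimal sketch (not part of the original tutorial), you can peek at one raw, undecoded record:

# Inspect a single raw record: each element is a serialized JSON string (tf.string).
for doc in train_ds.take(1):
    print(doc)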

Each item in train_ds is a string which needs to be decoded into JSON. To do so, you can select only a subset of the columns by specifying the TensorSpec of each field.

# Numeric features.
numerical_cols = ['PhotoAmt', 'Fee']

SPECS = {
    "target": tf.TensorSpec(tf.TensorShape([]), tf.int64, name="target"),
}
for col in numerical_cols:
    SPECS[col] = tf.TensorSpec(tf.TensorShape([]), tf.int32, name=col)
pprint(SPECS)
{'Fee': TensorSpec(shape=(), dtype=tf.int32, name='Fee'),
 'PhotoAmt': TensorSpec(shape=(), dtype=tf.int32, name='PhotoAmt'),
 'target': TensorSpec(shape=(), dtype=tf.int64, name='target')}
BATCH_SIZE = 32
train_ds = train_ds.map(
    lambda x: tfio.experimental.serialization.decode_json(x, specs=SPECS)
)

# Prepare a tuple of (features, label)
train_ds = train_ds.map(lambda v: (v, v.pop("target")))
train_ds = train_ds.batch(BATCH_SIZE)

train_ds
<BatchDataset shapes: ({PhotoAmt: (None,), Fee: (None,)}, (None,)), types: ({PhotoAmt: tf.int32, Fee: tf.int32}, tf.int64)>

Testing dataset

test_ds = tfio.experimental.mongodb.MongoDBIODataset(
    uri=URI, database=DATABASE, collection=TEST_COLLECTION
)
test_ds = test_ds.map(
    lambda x: tfio.experimental.serialization.decode_json(x, specs=SPECS)
)

# Prepare a tuple of (features, label)
test_ds = test_ds.map(lambda v: (v, v.pop("target")))
test_ds = test_ds.batch(BATCH_SIZE)

test_ds
Connection successful: mongodb://localhost:27017
<BatchDataset shapes: ({PhotoAmt: (None,), Fee: (None,)}, (None,)), types: ({PhotoAmt: tf.int32, Fee: tf.int32}, tf.int64)>

Define the Keras preprocessing layers

As per the structured data tutorial, it is recommended to use the Keras preprocessing layers, as they are more intuitive and can be easily integrated with the models. However, the standard feature_columns can also be used (a sketch follows the preprocessing code below).

For a better understanding of preprocessing layers in classifying structured data, please refer to the structured data tutorial.

def get_normalization_layer(name, dataset):
    # Create a Normalization layer for our feature.
    normalizer = preprocessing.Normalization(axis=None)

    # Prepare a Dataset that only yields our feature.
    feature_ds = dataset.map(lambda x, y: x[name])

    # Learn the statistics of the data.
    normalizer.adapt(feature_ds)

    return normalizer
all_inputs = []
encoded_features = []

for header in numerical_cols:
    numeric_col = tf.keras.Input(shape=(1,), name=header)
    normalization_layer = get_normalization_layer(header, train_ds)
    encoded_numeric_col = normalization_layer(numeric_col)
    all_inputs.append(numeric_col)
    encoded_features.append(encoded_numeric_col)
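
For completeness, here is a minimal sketch of the feature_columns alternative mentioned above. It reuses numerical_cols from this tutorial; DenseFeatures is the standard Keras layer for consuming feature columns. This is not the tutorial's primary approach.

# A minimal sketch of the legacy feature_columns alternative.
# The preprocessing layers above remain the recommended approach.
feature_columns = [tf.feature_column.numeric_column(col) for col in numerical_cols]
feature_layer = tf.keras.layers.DenseFeatures(feature_columns)
# feature_layer can then be applied to a dict of feature tensors inside a model.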

Build, compile and train the model

# Set the parameters
OPTIMIZER = "adam"
LOSS = tf.keras.losses.BinaryCrossentropy(from_logits=True)
METRICS = ['accuracy']
EPOCHS = 10
# Convert the feature columns into a tf.keras layer
all_features = tf.keras.layers.concatenate(encoded_features)

# Design/build the model
x = tf.keras.layers.Dense(32, activation="relu")(all_features)
x = tf.keras.layers.Dropout(0.5)(x)
x = tf.keras.layers.Dense(64, activation="relu")(x)
x = tf.keras.layers.Dropout(0.5)(x)
output = tf.keras.layers.Dense(1)(x)
model = tf.keras.Model(all_inputs, output)
# Compile the model
model.compile(optimizer=OPTIMIZER, loss=LOSS, metrics=METRICS)
# Fit the model
model.fit(train_ds, epochs=EPOCHS)
Epoch 1/10
109/109 [==============================] - 1s 2ms/step - loss: 0.6261 - accuracy: 0.4711
Epoch 2/10
109/109 [==============================] - 0s 3ms/step - loss: 0.5939 - accuracy: 0.6967
Epoch 3/10
109/109 [==============================] - 0s 3ms/step - loss: 0.5900 - accuracy: 0.6993
Epoch 4/10
109/109 [==============================] - 0s 3ms/step - loss: 0.5846 - accuracy: 0.7146
Epoch 5/10
109/109 [==============================] - 0s 3ms/step - loss: 0.5824 - accuracy: 0.7178
Epoch 6/10
109/109 [==============================] - 0s 2ms/step - loss: 0.5778 - accuracy: 0.7233
Epoch 7/10
109/109 [==============================] - 0s 3ms/step - loss: 0.5810 - accuracy: 0.7083
Epoch 8/10
109/109 [==============================] - 0s 3ms/step - loss: 0.5791 - accuracy: 0.7149
Epoch 9/10
109/109 [==============================] - 0s 3ms/step - loss: 0.5742 - accuracy: 0.7207
Epoch 10/10
109/109 [==============================] - 0s 2ms/step - loss: 0.5797 - accuracy: 0.7083
<keras.callbacks.History at 0x7f743229fe90>

Infer on the test data

res = model.evaluate(test_ds)
print("test loss, test acc:", res)
109/109 [==============================] - 0s 2ms/step - loss: 0.5696 - accuracy: 0.7383
test loss, test acc: [0.569588840007782, 0.7383015751838684]
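
Beyond evaluation, individual predictions can be obtained as well. A minimal sketch (not part of the original tutorial): since the loss was configured with from_logits=True, the model outputs logits, and a sigmoid converts them to adoption probabilities.

# The model emits logits; apply a sigmoid to get adoption probabilities.
logits = model.predict(test_ds.take(1))  # predict on a single batch
probabilities = tf.nn.sigmoid(logits)
print(probabilities[:5])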

Note: Since the goal of this tutorial is to demonstrate TensorFlow-IO's capability to prepare tf.data.Datasets from MongoDB and train tf.keras models directly, improving the accuracy of the models is out of scope. However, you can explore the dataset and experiment with the feature columns and model architectures to get better classification performance.