Copyright 2021 The TensorFlow IO Authors.
TensorFlow datasets from MongoDB collections
Overview
This tutorial focuses on preparing tf.data.Datasets by reading data from MongoDB collections and using them to train a tf.keras model.
NOTE: A basic understanding of MongoDB storage will make this tutorial easier to follow.
Setup packages
This tutorial uses pymongo as a helper package to create a new MongoDB database and collection to store the data.
Install the required tensorflow-io and pymongo (helper) packages
Import packages
Validate tf and tfio imports
Download and setup the MongoDB instance
For demo purposes, the open-source version of MongoDB is used.
Once the instance has been started, grep for mongo in the process list to confirm that it is running.
Query the base endpoint to retrieve information about the cluster.
Explore the dataset
For this tutorial, let's download the PetFinder dataset and feed the data into MongoDB manually. The goal of this classification problem is to predict whether a pet will be adopted.
To keep the problem binary, the label column is modified: 0 indicates that the pet was not adopted, and 1 indicates that it was.
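The relabelling described above can be sketched in plain Python. The column names `AdoptionSpeed` and `target`, and the convention that an `AdoptionSpeed` of 4 means the pet was not adopted, are assumptions about the PetFinder data that this tutorial does not state; adjust them to match the file you download.

```python
# Assumption: the raw label column is called "AdoptionSpeed" and the value 4
# means "not adopted". Neither detail is stated in this tutorial.
NOT_ADOPTED = 4


def add_binary_target(rows, source_col="AdoptionSpeed", target_col="target"):
    """Return new rows whose label is 0 (not adopted) or 1 (adopted)."""
    labelled = []
    for row in rows:
        # Copy every column except the raw label, then add the binary one.
        new_row = {key: value for key, value in row.items() if key != source_col}
        new_row[target_col] = 0 if row[source_col] == NOT_ADOPTED else 1
        labelled.append(new_row)
    return labelled


# Hypothetical sample rows, for illustration only.
sample_rows = [
    {"Type": "Cat", "Age": 3, "AdoptionSpeed": 2},
    {"Type": "Dog", "Age": 1, "AdoptionSpeed": 4},
]
labelled_rows = add_binary_target(sample_rows)
print(labelled_rows)
```

The raw label column is dropped after conversion so that it cannot leak into the features.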
Split the dataset
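As a minimal sketch of this step, the helper below shuffles the rows with a fixed seed and holds out a fraction for testing. The function name `split_rows`, the 30% test fraction, and the seed are illustrative choices, not values taken from this tutorial; a library routine for splitting would serve equally well.

```python
import random


def split_rows(rows, test_fraction=0.3, seed=42):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    shuffled = list(rows)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical rows, for illustration only.
rows = [{"id": i} for i in range(10)]
train_rows, test_rows = split_rows(rows)
print(len(train_rows), len(test_rows))
```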
Store the train and test data in mongo collections
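A rough sketch of the storing step follows. The helper `store_records` is hypothetical; it relies only on the `insert_many` method that pymongo collections provide. So that the snippet runs without a MongoDB server, it is exercised here against a small in-memory stand-in (`InMemoryCollection`, also hypothetical); the commented lines at the end show how a real collection might be passed instead, with placeholder URI, database, and collection names.

```python
def store_records(collection, records):
    """Insert a list of dict records and return how many were stored.

    `collection` is expected to behave like a pymongo Collection, that is,
    to expose insert_many(documents) returning an object with `inserted_ids`.
    """
    if not records:
        return 0  # pymongo's insert_many rejects an empty list
    result = collection.insert_many(records)
    return len(result.inserted_ids)


class InMemoryCollection:
    """Hypothetical stand-in so the sketch runs without a MongoDB server."""

    class _Result:
        def __init__(self, inserted_ids):
            self.inserted_ids = inserted_ids

    def __init__(self):
        self.documents = []

    def insert_many(self, documents):
        start = len(self.documents)
        self.documents.extend(dict(doc) for doc in documents)
        return self._Result(list(range(start, len(self.documents))))


train_collection = InMemoryCollection()
stored = store_records(
    train_collection,
    [{"Age": 3, "target": 1}, {"Age": 1, "target": 0}],
)
print(stored)

# With a running server, the same helper could be called on a real collection
# (the URI, database name, and collection name are placeholders):
#   from pymongo import MongoClient
#   client = MongoClient("mongodb://localhost:27017")
#   store_records(client["tfiodb"]["train"], train_records)
```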
Prepare tfio datasets
Once the data is available in the cluster, the mongodb.MongoDBIODataset class is used to read it into a dataset. The class inherits from tf.data.Dataset and thus exposes all the useful functionality of tf.data.Dataset out of the box.
Training dataset
Each item in train_ds is a string that needs to be decoded from JSON. When decoding, you can select only a subset of the columns by specifying a TensorSpec for each column you want to keep.
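What decoding with a column subset amounts to can be illustrated without TensorFlow. The sketch below is a plain-Python analogy, not the tensorflow-io API: a dictionary of column names and Python types (the hypothetical `COLUMN_SPECS`) plays the role of the TensorSpec mapping, and only the listed columns survive decoding. The document contents are made up for illustration.

```python
import json

# Hypothetical column subset; it stands in for the TensorSpec mapping.
COLUMN_SPECS = {"Age": int, "Fee": int, "Type": str, "target": int}


def decode_record(raw, specs=COLUMN_SPECS):
    """Parse one JSON string and keep only the columns named in specs."""
    document = json.loads(raw)
    missing = [name for name in specs if name not in document]
    if missing:
        raise KeyError(f"missing columns: {missing}")
    # Cast each kept value to the type declared for its column.
    return {name: cast(document[name]) for name, cast in specs.items()}


# A made-up document serialized as a string, as a dataset item would be.
raw_item = json.dumps(
    {"_id": {"$oid": "abc"}, "Age": 3, "Fee": 100, "Type": "Cat",
     "Breed1": "Tabby", "target": 1}
)
record = decode_record(raw_item)
print(record)
```

Columns that are not listed, such as `_id` and `Breed1` here, are simply dropped, which keeps the decoded records limited to the features the model will use.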
Testing dataset
Define the keras preprocessing layers
As in the structured data tutorial, the Keras preprocessing layers are recommended because they are more intuitive and integrate easily with models. However, the standard feature_columns can also be used.
For a better understanding of the preprocessing_layers used in classifying structured data, please refer to the structured data tutorial.
Build, compile and train the model
Infer on the test data
Note: Since the goal of this tutorial is to demonstrate TensorFlow-IO's capability to prepare tf.data.Datasets from MongoDB and train tf.keras models directly, improving the accuracy of the models is out of scope. However, you can explore the dataset and experiment with the feature columns and model architectures to improve classification performance.