Copyright 2021 The TensorFlow IO Authors.
TensorFlow datasets from MongoDB collections
Overview
This tutorial focuses on preparing tf.data.Datasets by reading data from MongoDB collections and using them to train a tf.keras model.
NOTE: A basic understanding of MongoDB storage will make the tutorial easier to follow.
Setup packages
This tutorial uses pymongo as a helper package to create a new MongoDB database and collection to store the data.
Install the required tensorflow-io and pymongo (helper) packages
Import packages
Validate tf and tfio imports
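A minimal sketch of one way to confirm that the packages are installed before importing them. The helper name installed_version is hypothetical, and the distribution names (tensorflow, tensorflow-io, pymongo) are assumed to match the PyPI names used for installation.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Distribution names are assumptions based on the packages this tutorial mentions.
for name in ("tensorflow", "tensorflow-io", "pymongo"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```

Checking versions this way avoids a hard failure at import time and makes a missing package obvious before any dataset code runs.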
Download and setup the MongoDB instance
For demo purposes, the open-source version of MongoDB is used.
Once the instance has been started, grep for mongo in the process list to confirm that it is running.
Query the base endpoint to retrieve information about the cluster.
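As a complement to checking the process list, the following sketch tests whether anything accepts TCP connections on the port the instance should be listening on. It assumes the instance runs on localhost with MongoDB's default port 27017; the helper name is_port_open is hypothetical, and an open port only shows that something is listening, not that it is MongoDB.

```python
import socket


def is_port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to (host, port) succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolved hosts.
        return False


# Host and port are assumptions: a local instance on the default MongoDB port.
print("mongod reachable:", is_port_open("localhost", 27017))
```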
Explore the dataset
For the purpose of this tutorial, let's download the PetFinder dataset and feed the data into MongoDB manually. The goal of this classification problem is to predict whether a pet will be adopted.
The label column is modified for the tutorial: 0 indicates that the pet was not adopted, and 1 indicates that it was.
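The label modification can be sketched in plain Python as follows. The text does not say how the binary label is derived, so this sketch assumes the raw column is named AdoptionSpeed and that the value 4 means "not adopted"; both the helper name add_binary_target and the sample rows are hypothetical.

```python
def add_binary_target(records, speed_key="AdoptionSpeed", not_adopted=4):
    """Replace the raw label column with a binary `target` column."""
    labeled = []
    for record in records:
        row = dict(record)  # copy so the input rows are left untouched
        speed = row.pop(speed_key)
        # Assumption: the `not_adopted` value marks pets that were never adopted.
        row["target"] = 0 if speed == not_adopted else 1
        labeled.append(row)
    return labeled


# Illustrative rows only; they are not taken from the real dataset.
sample = [
    {"Type": "Cat", "Age": 3, "AdoptionSpeed": 2},
    {"Type": "Dog", "Age": 12, "AdoptionSpeed": 4},
]
print(add_binary_target(sample))
```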
Split the dataset
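A minimal sketch of a reproducible train/test split over a list of records, using only the standard library. The 30% test fraction and the fixed seed are assumptions, and split_records is a hypothetical helper rather than the function the notebook uses.

```python
import random


def split_records(records, test_fraction=0.3, seed=0):
    """Shuffle a copy of the records and split it into (train, test) lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


rows = [{"id": i} for i in range(10)]
train_rows, test_rows = split_records(rows)
print(len(train_rows), len(test_rows))
```

Shuffling before splitting matters when the source file is ordered, for example by label or by date.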
Store the train and test data in MongoDB collections
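The storage step can be sketched as a batched insert. The helper store_records is hypothetical; it relies only on an insert_many method whose result exposes inserted_ids, which is the shape of pymongo's Collection API. InMemoryCollection is a stand-in so that the sketch runs without a server, and the URI, database, and collection names in the closing comment are assumptions.

```python
def store_records(collection, records, batch_size=1000):
    """Insert records in batches and return the number of inserted documents."""
    inserted = 0
    for start in range(0, len(records), batch_size):
        # Copy each record: pymongo adds an `_id` key to the dicts it inserts.
        batch = [dict(r) for r in records[start:start + batch_size]]
        inserted += len(collection.insert_many(batch).inserted_ids)
    return inserted


class InMemoryCollection:
    """Hypothetical stand-in that mimics the insert_many part of a collection."""

    class _Result:
        def __init__(self, ids):
            self.inserted_ids = ids

    def __init__(self):
        self.documents = []

    def insert_many(self, documents):
        first = len(self.documents)
        self.documents.extend(documents)
        return self._Result(list(range(first, len(self.documents))))


train_collection = InMemoryCollection()
print(store_records(train_collection, [{"Age": 3, "target": 1}, {"Age": 12, "target": 0}]))

# With a running server, the stand-in would be replaced by something like:
#   from pymongo import MongoClient
#   train_collection = MongoClient("mongodb://localhost:27017")["tfiodb"]["train"]
```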
Prepare tfio datasets
Once the data is available in the cluster, the mongodb.MongoDBIODataset class is used to read it from the collections. The class inherits from tf.data.Dataset and thus exposes all the useful functionality of tf.data.Dataset out of the box.
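A hedged sketch of how such a dataset might be constructed. It assumes the class is exposed as tfio.experimental.mongodb.MongoDBIODataset and accepts uri, database, and collection keyword arguments; the helper names and the database and collection names are hypothetical. The tensorflow-io import is deferred so that the URI helper can be used on its own.

```python
def mongo_uri(host="localhost", port=27017):
    """Build a connection URI; the defaults assume a local instance on the default port."""
    return f"mongodb://{host}:{port}"


def mongo_dataset(database, collection, uri=None):
    """Create a dataset over one collection (requires tensorflow-io and a live server)."""
    import tensorflow_io as tfio  # imported lazily; third-party dependency

    # Assumption: the class lives under tfio.experimental.mongodb.
    return tfio.experimental.mongodb.MongoDBIODataset(
        uri=uri or mongo_uri(), database=database, collection=collection
    )


print(mongo_uri())
# Hypothetical usage, which needs a running MongoDB instance:
#   train_ds = mongo_dataset("tfiodb", "train")
#   train_ds = train_ds.batch(32)  # ordinary tf.data.Dataset methods apply
```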
Training dataset
Each item in train_ds is a string that needs to be decoded from JSON. While decoding, you can select only a subset of the columns by specifying the corresponding TensorSpec.
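The idea of decoding only a subset of columns can be illustrated in plain Python. This is a stand-in, not the TensorFlow code path: in the actual pipeline the specs would be tf.TensorSpec objects and the decoding would run inside Dataset.map through tensorflow-io's JSON decoding op (assumed here to be tfio.experimental.serialization.decode_json). The helper decode_record and the sample document are hypothetical.

```python
import json


def decode_record(raw, specs):
    """Parse a JSON string and keep only the fields named in specs, cast to the given type."""
    document = json.loads(raw)
    return {name: cast(document[name]) for name, cast in specs.items()}


# Only these three columns are kept; `_id` and `Type` are dropped.
specs = {"Age": int, "Fee": float, "target": int}
raw = '{"_id": {"$oid": "0"}, "Type": "Cat", "Age": 3, "Fee": 100, "target": 1}'
print(decode_record(raw, specs))
```

Leaving a column out of the specs is how fields such as the MongoDB-generated _id are excluded from the features.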
Testing dataset
Define the keras preprocessing layers
As per the structured data tutorial, it is recommended to use the Keras Preprocessing Layers as they are more intuitive, and can be easily integrated with the models. However, the standard feature_columns can also be used.
For a better understanding of how preprocessing layers are used to classify structured data, refer to the structured data tutorial.
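To make the role of a numeric preprocessing layer concrete, the sketch below reproduces in plain Python the standardization that a normalization layer such as tf.keras.layers.Normalization learns from the training data. It is an illustration under assumptions (population variance, a fallback of 1.0 for constant columns), not the Keras implementation, and fit_normalizer is a hypothetical helper.

```python
import math
import statistics


def fit_normalizer(values):
    """Learn mean and standard deviation from values; return a function that standardizes."""
    mean = statistics.fmean(values)
    std = math.sqrt(statistics.pvariance(values, mu=mean))
    std = std or 1.0  # avoid dividing by zero for a constant column
    return lambda x: (x - mean) / std


# Illustrative ages only; in the tutorial the statistics would come from the training set.
normalize_age = fit_normalizer([1, 2, 3, 6])
print([round(normalize_age(a), 3) for a in [1, 2, 3, 6]])
```

The statistics are fitted on the training data only and then reused unchanged for the test data, which is the same contract a preprocessing layer follows once it has been adapted.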
Build, compile and train the model
Infer on the test data
Note: The goal of this tutorial is to demonstrate TensorFlow-IO's capability to prepare tf.data.Datasets from MongoDB and train tf.keras models directly, so improving the accuracy of the model is out of scope. However, you can explore the dataset and experiment with the feature columns and model architectures to get better classification performance.