Copyright 2020 The TensorFlow IO Authors.
Streaming structured data from Elasticsearch using TensorFlow IO
Overview
This tutorial focuses on streaming data from an Elasticsearch cluster into a tf.data.Dataset, which is then used in conjunction with tf.keras for training and inference.
Elasticsearch is primarily a distributed search engine that supports storing structured, unstructured, geospatial, and numeric data. For the purpose of this tutorial, a dataset with structured records is utilized.
NOTE: A basic understanding of Elasticsearch storage will help you follow the tutorial with ease.
Set up packages
The elasticsearch package is utilized for preparing and storing the data within Elasticsearch indices for demonstration purposes only. In real-world production clusters with numerous nodes, the cluster might receive data from connectors such as Logstash. Once the data is available in the Elasticsearch cluster, only tensorflow-io is required to stream the data into the models.
Install the required tensorflow-io and elasticsearch packages
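A minimal install cell might look like the following; the unpinned versions are an assumption, and in practice the elasticsearch client is often pinned to match the server version:

```python
!pip install tensorflow-io
!pip install elasticsearch
```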
Import packages
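A sketch of the imports used throughout the tutorial (sklearn is assumed to be available for the train/test split later on):

```python
import time

import numpy as np
import pandas as pd
import tensorflow as tf
import tensorflow_io as tfio
from sklearn.model_selection import train_test_split
```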
Validate tf and tfio imports
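For example:

```python
print("tensorflow version:", tf.__version__)
print("tensorflow-io version:", tfio.__version__)
```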
Download and set up the Elasticsearch instance
For demo purposes, the open-source (OSS) distribution of Elasticsearch is used.
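A sketch of the download-and-extract step, assuming a Linux environment; the 7.9.2 OSS build is an illustrative choice:

```python
# Download and extract the OSS distribution (notebook shell escapes).
!wget -q https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-oss-7.9.2-linux-x86_64.tar.gz
!tar -xzf elasticsearch-oss-7.9.2-linux-x86_64.tar.gz
# Elasticsearch refuses to run as root, so hand ownership to the daemon user.
!sudo chown -R daemon:daemon elasticsearch-7.9.2/
```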
Run the instance as a daemon process
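One way to background the process from a notebook, assuming the extraction step above:

```
%%bash --bg
# Start the instance as the non-root daemon user.
sudo -H -u daemon elasticsearch-7.9.2/bin/elasticsearch
```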
Once the instance has been started, grep for elasticsearch in the processes list to confirm its availability.
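For instance:

```python
time.sleep(30)  # allow the instance to finish initializing
!ps -ef | grep elasticsearch
```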
Query the base endpoint to retrieve information about the cluster.
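The REST API is served on localhost:9200 by default:

```python
!curl -sX GET "localhost:9200/"
```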
Explore the dataset
For the purpose of this tutorial, let's download the PetFinder dataset and feed the data into Elasticsearch manually. The goal of this classification problem is to predict whether or not a pet will be adopted.
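A sketch of the download step, using the petfinder-mini archive from the TensorFlow structured data tutorial:

```python
dataset_url = 'http://storage.googleapis.com/download.tensorflow.org/data/petfinder-mini.zip'
csv_file = 'datasets/petfinder-mini/petfinder-mini.csv'

tf.keras.utils.get_file('petfinder_mini.zip', dataset_url,
                        extract=True, cache_dir='.')
pf_df = pd.read_csv(csv_file)
pf_df.head()
```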
For the purpose of the tutorial, modifications are made to the label column: 0 indicates the pet was not adopted, and 1 indicates that it was.
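In the raw data, an AdoptionSpeed of 4 means the pet was not adopted, so a binary target can be derived as follows (dropping the free-text Description column is an additional simplification):

```python
# 0 = not adopted, 1 = adopted.
pf_df['target'] = np.where(pf_df['AdoptionSpeed'] == 4, 0, 1)

# Drop columns that are not used during training.
pf_df = pf_df.drop(columns=['AdoptionSpeed', 'Description'])
```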
Split the dataset
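An illustrative 70/30 split:

```python
train_df, test_df = train_test_split(pf_df, test_size=0.3, shuffle=True)
print("Number of training samples:", len(train_df))
print("Number of testing samples:", len(test_df))
```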
Store the train and test data in elasticsearch indices
Storing the data in a local Elasticsearch cluster simulates an environment for continuous remote data retrieval for training and inference.
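A minimal indexing sketch using the official elasticsearch Python client; the index names train and test are choices made here and are reused when preparing the datasets below:

```python
from elasticsearch import Elasticsearch, helpers

es_client = Elasticsearch(hosts=["http://localhost:9200"])

def index_records(df, index_name):
    """Bulk-index the dataframe rows as documents into the given index."""
    actions = [
        {"_index": index_name, "_source": record}
        for record in df.to_dict(orient="records")
    ]
    helpers.bulk(es_client, actions)

index_records(train_df, "train")
index_records(test_df, "test")
```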
Prepare tfio datasets
Once the data is available in the cluster, only tensorflow-io is required to stream the data from the indices. The elasticsearch.ElasticsearchIODataset class is utilized for this purpose. It inherits from tf.data.Dataset and thus exposes all of its useful functionality out of the box.
Training dataset
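A sketch of preparing the training dataset, assuming the node address and index name from the indexing step; the label column target is split out and the records are batched:

```python
BATCH_SIZE = 32
HEADERS = {"Content-Type": "application/json"}

train_ds = tfio.experimental.elasticsearch.ElasticsearchIODataset(
    nodes=["localhost:9200"],
    index="train",
    headers=HEADERS,
)

# Separate the label from the features and batch the records.
train_ds = train_ds.map(lambda v: (v, v.pop("target")))
train_ds = train_ds.batch(BATCH_SIZE)
```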
Testing dataset
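The testing dataset is prepared the same way, pointing at the test index:

```python
test_ds = tfio.experimental.elasticsearch.ElasticsearchIODataset(
    nodes=["localhost:9200"],
    index="test",
    headers=HEADERS,
)

test_ds = test_ds.map(lambda v: (v, v.pop("target")))
test_ds = test_ds.batch(BATCH_SIZE)
```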
Define the Keras preprocessing layers
As per the structured data tutorial, using the Keras preprocessing layers is recommended, as they are more intuitive and integrate easily with the models. However, the standard feature_columns can also be used. For a better understanding of the preprocessing_layers in classifying structured data, please refer to the structured data tutorial.
Fetch a batch and observe the features of a sample record. This will help in defining the Keras preprocessing layers for training the tf.keras model.
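For example:

```python
features, label = next(iter(train_ds))
for key, value in features.items():
    print(key, ":", value.numpy()[0])
```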
Choose a subset of features.
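The helper functions below mirror those in the structured data tutorial, and the particular feature subset is an illustrative choice (TF 2.6+ is assumed for tf.keras.layers.Normalization, StringLookup and CategoryEncoding):

```python
def get_normalization_layer(name, dataset):
    # Normalization layer adapted to the statistics of the feature column.
    normalizer = tf.keras.layers.Normalization(axis=None)
    normalizer.adapt(dataset.map(lambda x, y: x[name]))
    return normalizer

def get_category_encoding_layer(name, dataset, max_tokens=None):
    # Map strings to integer indices, then multi-hot encode them.
    index = tf.keras.layers.StringLookup(max_tokens=max_tokens)
    index.adapt(dataset.map(lambda x, y: x[name]))
    encoder = tf.keras.layers.CategoryEncoding(num_tokens=index.vocabulary_size())
    return lambda feature: encoder(index(feature))

all_inputs = []
encoded_features = []

# Numeric features.
for header in ['PhotoAmt', 'Fee', 'Age']:
    col = tf.keras.Input(shape=(1,), name=header)
    encoded_features.append(get_normalization_layer(header, train_ds)(col))
    all_inputs.append(col)

# Categorical (string) features.
for header in ['Type', 'Breed1', 'Gender', 'Color1']:
    col = tf.keras.Input(shape=(1,), name=header, dtype='string')
    encoded_features.append(get_category_encoding_layer(header, train_ds)(col))
    all_inputs.append(col)
```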
Build, compile and train the model
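A minimal functional-style model over the encoded features; the layer sizes and epoch count are arbitrary choices:

```python
all_features = tf.keras.layers.concatenate(encoded_features)
x = tf.keras.layers.Dense(32, activation="relu")(all_features)
x = tf.keras.layers.Dropout(0.5)(x)
output = tf.keras.layers.Dense(1)(x)

model = tf.keras.Model(all_inputs, output)
model.compile(
    optimizer="adam",
    loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
    metrics=["accuracy"],
)

model.fit(train_ds, epochs=10)
```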
Infer on the test data
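For example:

```python
res = model.evaluate(test_ds)
print("test loss, test accuracy:", res)
```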
Note: Since the goal of this tutorial is to demonstrate TensorFlow IO's capability to stream data from Elasticsearch and train tf.keras models directly, improving the accuracy of the models is out of scope. However, you can explore the dataset and experiment with the feature columns and model architectures to achieve better classification performance.