Copyright 2021 The TensorFlow Authors.
Apache ORC Reader
Overview
Apache ORC is a popular columnar storage format. The tensorflow-io package provides a default implementation for reading Apache ORC files.
Setup
Install the required packages, then restart the runtime
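After the restart, it can help to confirm that the packages are actually importable before going further. The snippet below is a minimal sketch using only the standard library; the import names `tensorflow` and `tensorflow_io` and the helper `missing_modules` are assumptions made for illustration, not part of the original notebook.

```python
import importlib.util

# Assumed import names: the tensorflow-io package is imported as `tensorflow_io`.
REQUIRED_MODULES = ["tensorflow", "tensorflow_io"]


def missing_modules(names):
    """Return the top-level module names that cannot be found (hypothetical helper)."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
if missing:
    print("Not importable yet:", ", ".join(missing))
    print("Install the missing packages with pip, then restart the runtime.")
else:
    print("All required modules are importable.")
```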
Download a sample dataset file in ORC format
The dataset you will use here is the Iris Data Set from UCI. The data set contains 3 classes of 50 instances each, where each class refers to a type of iris plant. It has 4 attributes: (1) sepal length, (2) sepal width, (3) petal length, (4) petal width, and the last column contains the class label.
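One way to fetch the file is sketched below with the standard library. The URL is an assumption about where the sample file lives in the tensorflow/io repository, and `download_if_missing` is a hypothetical helper; substitute your own location if the file has moved.

```python
import os
import urllib.request

# Assumed location of the sample file; not confirmed by this tutorial's text.
IRIS_ORC_URL = "https://github.com/tensorflow/io/raw/master/tests/test_orc/iris.orc"


def download_if_missing(url, dest):
    """Download `url` to `dest` unless `dest` already exists; return `dest`."""
    if not os.path.exists(dest):
        urllib.request.urlretrieve(url, dest)
    return dest
```

Calling `download_if_missing(IRIS_ORC_URL, "iris.orc")` would then leave a local `iris.orc` in the working directory, and repeated runs of the notebook would skip the download.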
Create a dataset from the file
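A minimal sketch of this step, assuming the file was saved locally as `iris.orc` and that tensorflow-io is installed. `tfio.IODataset.from_orc` is the reader entry point; the wrapper name `load_orc_dataset` and the `capacity` and `batch_size` values are illustrative choices.

```python
def load_orc_dataset(path, capacity=15, batch_size=1):
    """Read all columns of an ORC file into a batched dataset (hypothetical helper)."""
    # Imported lazily so the helper can be defined before tensorflow-io is installed.
    import tensorflow_io as tfio

    # `capacity` is passed straight through to the reader; 15 is an illustrative value.
    dataset = tfio.IODataset.from_orc(path, capacity=capacity)
    return dataset.batch(batch_size)
```

With the file in place, `dataset = load_orc_dataset("iris.orc")` would give a dataset whose elements hold one batch of values per column.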
Examine the dataset:
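A quick way to look at the first few elements of any `tf.data.Dataset` is sketched below. The `peek` helper is hypothetical, and the `demo` dataset is made-up stand-in data so the snippet runs without the ORC file; with the real data you would pass the dataset created from `iris.orc` instead.

```python
import tensorflow as tf


def peek(dataset, n=1):
    """Return the first `n` elements of a dataset as NumPy values."""
    return list(dataset.take(n).as_numpy_iterator())


# Made-up stand-in rows (one numeric column, one string column), batched by 1.
demo = tf.data.Dataset.from_tensor_slices(
    ([1.0, 2.0, 3.0], [b"setosa", b"setosa", b"setosa"])
).batch(1)

for element in peek(demo):
    print(element)
```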
Let's walk through an end-to-end example of training a tf.keras model on an ORC dataset built from the Iris data.
Data preprocessing
Configure which columns are features, and which column is label:
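The sketch below shows one way to do this with the `columns` argument of `tfio.IODataset.from_orc`. The column names are assumptions about the schema of the ORC file (derived from the attribute list above), and `load_feature_and_label_datasets` is a hypothetical helper.

```python
# Assumed column names; check them against the schema of your ORC file.
FEATURE_COLS = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
LABEL_COLS = ["species"]


def load_feature_and_label_datasets(path, feature_cols=FEATURE_COLS, label_cols=LABEL_COLS):
    """Read the feature columns and the label column as two separate datasets."""
    # Imported lazily so the helper can be defined before tensorflow-io is installed.
    import tensorflow_io as tfio

    feature_dataset = tfio.IODataset.from_orc(path, columns=feature_cols)
    label_dataset = tfio.IODataset.from_orc(path, columns=label_cols)
    return feature_dataset, label_dataset
```

Reading the file twice keeps features and labels as separate datasets, so the label column can be transformed on its own before the two are zipped back together.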
Define a utility function that maps each species name to a float label for model training:
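One possible implementation uses a static lookup table, which works both eagerly and inside `Dataset.map`. This is a sketch rather than the notebook's original function: the species strings are assumed values for the label column, and the class indices start at 0 so that they line up with a 3-unit output layer.

```python
import tensorflow as tf

# Assumed label strings; verify them against the label column of your file.
SPECIES = ["setosa", "versicolor", "virginica"]

_species_table = tf.lookup.StaticHashTable(
    tf.lookup.KeyValueTensorInitializer(
        keys=tf.constant(SPECIES),
        values=tf.constant([0, 1, 2], dtype=tf.int64),
    ),
    default_value=-1,  # marks species names that are not in the table
)


def species_to_float(species):
    """Map species strings to 0.0, 1.0, 2.0; unknown names become -1.0."""
    return tf.cast(_species_table.lookup(species), tf.float32)


# Small in-memory check with made-up labels.
labels = tf.data.Dataset.from_tensor_slices(["setosa", "virginica", "versicolor"])
print(list(labels.map(species_to_float).as_numpy_iterator()))
```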
Build, compile and train the model
Finally, you are ready to build the model and train it! You will build a 3-layer Keras model to predict the class of the iris plant from the dataset you just processed.
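A minimal sketch of such a model follows. The layer widths, optimizer, and loss are assumptions chosen for illustration, and `pack_features` and `build_model` are hypothetical helpers. The packing step is needed because a column-wise reader yields one tensor per feature column, while the model expects a single `[batch, 4]` input.

```python
import tensorflow as tf


def pack_features(features, label):
    """Stack a tuple of per-column batches into one [batch, n_features] tensor."""
    columns = [tf.cast(column, tf.float32) for column in features]
    return tf.stack(columns, axis=1), label


def build_model(n_features=4, n_classes=3):
    """Three dense layers; the last one outputs one logit per class."""
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(n_features,)),
        tf.keras.layers.Dense(10, activation="relu"),  # width 10 is illustrative
        tf.keras.layers.Dense(10, activation="relu"),
        tf.keras.layers.Dense(n_classes),
    ])
    model.compile(
        optimizer="adam",
        # Expects integer-valued class indices in the range [0, n_classes).
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=["accuracy"],
    )
    return model
```

Given a batched dataset of `(feature_tuple, label)` pairs, training would look like `build_model().fit(dataset.map(pack_features), epochs=5)`, where the epoch count is again an arbitrary choice.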