#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

Streaming structured data from Elasticsearch using Tensorflow-IO

Overview

This tutorial focuses on streaming data from an Elasticsearch cluster into a tf.data.Dataset, which is then used in conjunction with tf.keras for training and inference.

Elasticsearch is primarily a distributed search engine which supports storing structured, unstructured, geospatial, numeric data etc. For the purpose of this tutorial, a dataset with structured records is utilized.

Note: A basic understanding of elasticsearch storage will help you follow the tutorial with ease.

Setup packages

The elasticsearch package is utilized for preparing and storing the data within elasticsearch indices for demonstration purposes only. In real-world production clusters with numerous nodes, the cluster might receive data from connectors like logstash etc.

Once the data is available in the elasticsearch cluster, only tensorflow-io is required to stream the data into the models.
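If you have not worked with elasticsearch before, the following minimal sketch illustrates its storage model: every record is a JSON document that lives in an index and is addressable by id. The snippet is illustrative only; it assumes the local node set up later in this tutorial is already running, and uses a hypothetical "demo" index.

# A minimal sketch (illustrative only) of elasticsearch's document storage model.
# Assumes a local node on http://localhost:9200 and a hypothetical "demo" index.
from elasticsearch import Elasticsearch

demo_client = Elasticsearch(hosts=["http://localhost:9200"])
demo_client.index(index="demo", id=0, body={"Age": 2, "Fee": 0})  # store a JSON document
print(demo_client.get(index="demo", id=0)["_source"])             # fetch it back by id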

Install the required tensorflow-io and elasticsearch packages

!pip install tensorflow-io
!pip install elasticsearch
Requirement already satisfied: tensorflow-io in /usr/local/lib/python3.6/dist-packages (0.16.0)
Requirement already satisfied: tensorflow<2.4.0,>=2.3.0 in /usr/local/lib/python3.6/dist-packages (from tensorflow-io) (2.3.0)
Requirement already satisfied: keras-preprocessing<1.2,>=1.1.1 in /usr/local/lib/python3.6/dist-packages (from tensorflow<2.4.0,>=2.3.0->tensorflow-io) (1.1.2)
Requirement already satisfied: numpy<1.19.0,>=1.16.0 in /usr/local/lib/python3.6/dist-packages (from tensorflow<2.4.0,>=2.3.0->tensorflow-io) (1.18.5)
Requirement already satisfied: scipy==1.4.1 in /usr/local/lib/python3.6/dist-packages (from tensorflow<2.4.0,>=2.3.0->tensorflow-io) (1.4.1)
Requirement already satisfied: absl-py>=0.7.0 in /usr/local/lib/python3.6/dist-packages (from tensorflow<2.4.0,>=2.3.0->tensorflow-io) (0.10.0)
Requirement already satisfied: opt-einsum>=2.3.2 in /usr/local/lib/python3.6/dist-packages (from tensorflow<2.4.0,>=2.3.0->tensorflow-io) (3.3.0)
Requirement already satisfied: h5py<2.11.0,>=2.10.0 in /usr/local/lib/python3.6/dist-packages (from tensorflow<2.4.0,>=2.3.0->tensorflow-io) (2.10.0)
Requirement already satisfied: grpcio>=1.8.6 in /usr/local/lib/python3.6/dist-packages (from tensorflow<2.4.0,>=2.3.0->tensorflow-io) (1.33.2)
Requirement already satisfied: wheel>=0.26 in /usr/local/lib/python3.6/dist-packages (from tensorflow<2.4.0,>=2.3.0->tensorflow-io) (0.35.1)
Requirement already satisfied: astunparse==1.6.3 in /usr/local/lib/python3.6/dist-packages (from tensorflow<2.4.0,>=2.3.0->tensorflow-io) (1.6.3)
Requirement already satisfied: six>=1.12.0 in /usr/local/lib/python3.6/dist-packages (from tensorflow<2.4.0,>=2.3.0->tensorflow-io) (1.15.0)
Requirement already satisfied: google-pasta>=0.1.8 in /usr/local/lib/python3.6/dist-packages (from tensorflow<2.4.0,>=2.3.0->tensorflow-io) (0.2.0)
Requirement already satisfied: termcolor>=1.1.0 in /usr/local/lib/python3.6/dist-packages (from tensorflow<2.4.0,>=2.3.0->tensorflow-io) (1.1.0)
Requirement already satisfied: tensorboard<3,>=2.3.0 in /usr/local/lib/python3.6/dist-packages (from tensorflow<2.4.0,>=2.3.0->tensorflow-io) (2.3.0)
Requirement already satisfied: tensorflow-estimator<2.4.0,>=2.3.0 in /usr/local/lib/python3.6/dist-packages (from tensorflow<2.4.0,>=2.3.0->tensorflow-io) (2.3.0)
Requirement already satisfied: wrapt>=1.11.1 in /usr/local/lib/python3.6/dist-packages (from tensorflow<2.4.0,>=2.3.0->tensorflow-io) (1.12.1)
Requirement already satisfied: protobuf>=3.9.2 in /usr/local/lib/python3.6/dist-packages (from tensorflow<2.4.0,>=2.3.0->tensorflow-io) (3.12.4)
Requirement already satisfied: gast==0.3.3 in /usr/local/lib/python3.6/dist-packages (from tensorflow<2.4.0,>=2.3.0->tensorflow-io) (0.3.3)
Requirement already satisfied: requests<3,>=2.21.0 in /usr/local/lib/python3.6/dist-packages (from tensorboard<3,>=2.3.0->tensorflow<2.4.0,>=2.3.0->tensorflow-io) (2.23.0)
Requirement already satisfied: tensorboard-plugin-wit>=1.6.0 in /usr/local/lib/python3.6/dist-packages (from tensorboard<3,>=2.3.0->tensorflow<2.4.0,>=2.3.0->tensorflow-io) (1.7.0)
Requirement already satisfied: werkzeug>=0.11.15 in /usr/local/lib/python3.6/dist-packages (from tensorboard<3,>=2.3.0->tensorflow<2.4.0,>=2.3.0->tensorflow-io) (1.0.1)
Requirement already satisfied: setuptools>=41.0.0 in /usr/local/lib/python3.6/dist-packages (from tensorboard<3,>=2.3.0->tensorflow<2.4.0,>=2.3.0->tensorflow-io) (50.3.2)
Requirement already satisfied: google-auth<2,>=1.6.3 in /usr/local/lib/python3.6/dist-packages (from tensorboard<3,>=2.3.0->tensorflow<2.4.0,>=2.3.0->tensorflow-io) (1.17.2)
Requirement already satisfied: markdown>=2.6.8 in /usr/local/lib/python3.6/dist-packages (from tensorboard<3,>=2.3.0->tensorflow<2.4.0,>=2.3.0->tensorflow-io) (3.3.3)
Requirement already satisfied: google-auth-oauthlib<0.5,>=0.4.1 in /usr/local/lib/python3.6/dist-packages (from tensorboard<3,>=2.3.0->tensorflow<2.4.0,>=2.3.0->tensorflow-io) (0.4.2)
Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /usr/local/lib/python3.6/dist-packages (from requests<3,>=2.21.0->tensorboard<3,>=2.3.0->tensorflow<2.4.0,>=2.3.0->tensorflow-io) (1.24.3)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.6/dist-packages (from requests<3,>=2.21.0->tensorboard<3,>=2.3.0->tensorflow<2.4.0,>=2.3.0->tensorflow-io) (2020.6.20)
Requirement already satisfied: chardet<4,>=3.0.2 in /usr/local/lib/python3.6/dist-packages (from requests<3,>=2.21.0->tensorboard<3,>=2.3.0->tensorflow<2.4.0,>=2.3.0->tensorflow-io) (3.0.4)
Requirement already satisfied: idna<3,>=2.5 in /usr/local/lib/python3.6/dist-packages (from requests<3,>=2.21.0->tensorboard<3,>=2.3.0->tensorflow<2.4.0,>=2.3.0->tensorflow-io) (2.10)
Requirement already satisfied: pyasn1-modules>=0.2.1 in /usr/local/lib/python3.6/dist-packages (from google-auth<2,>=1.6.3->tensorboard<3,>=2.3.0->tensorflow<2.4.0,>=2.3.0->tensorflow-io) (0.2.8)
Requirement already satisfied: cachetools<5.0,>=2.0.0 in /usr/local/lib/python3.6/dist-packages (from google-auth<2,>=1.6.3->tensorboard<3,>=2.3.0->tensorflow<2.4.0,>=2.3.0->tensorflow-io) (4.1.1)
Requirement already satisfied: rsa<5,>=3.1.4; python_version >= "3" in /usr/local/lib/python3.6/dist-packages (from google-auth<2,>=1.6.3->tensorboard<3,>=2.3.0->tensorflow<2.4.0,>=2.3.0->tensorflow-io) (4.6)
Requirement already satisfied: importlib-metadata; python_version < "3.8" in /usr/local/lib/python3.6/dist-packages (from markdown>=2.6.8->tensorboard<3,>=2.3.0->tensorflow<2.4.0,>=2.3.0->tensorflow-io) (2.0.0)
Requirement already satisfied: requests-oauthlib>=0.7.0 in /usr/local/lib/python3.6/dist-packages (from google-auth-oauthlib<0.5,>=0.4.1->tensorboard<3,>=2.3.0->tensorflow<2.4.0,>=2.3.0->tensorflow-io) (1.3.0)
Requirement already satisfied: pyasn1<0.5.0,>=0.4.6 in /usr/local/lib/python3.6/dist-packages (from pyasn1-modules>=0.2.1->google-auth<2,>=1.6.3->tensorboard<3,>=2.3.0->tensorflow<2.4.0,>=2.3.0->tensorflow-io) (0.4.8)
Requirement already satisfied: zipp>=0.5 in /usr/local/lib/python3.6/dist-packages (from importlib-metadata; python_version < "3.8"->markdown>=2.6.8->tensorboard<3,>=2.3.0->tensorflow<2.4.0,>=2.3.0->tensorflow-io) (3.4.0)
Requirement already satisfied: oauthlib>=3.0.0 in /usr/local/lib/python3.6/dist-packages (from requests-oauthlib>=0.7.0->google-auth-oauthlib<0.5,>=0.4.1->tensorboard<3,>=2.3.0->tensorflow<2.4.0,>=2.3.0->tensorflow-io) (3.1.0)
Requirement already satisfied: elasticsearch in /usr/local/lib/python3.6/dist-packages (7.9.1)
Requirement already satisfied: urllib3>=1.21.1 in /usr/local/lib/python3.6/dist-packages (from elasticsearch) (1.24.3)
Requirement already satisfied: certifi in /usr/local/lib/python3.6/dist-packages (from elasticsearch) (2020.6.20)

Import packages

import os
import time
from sklearn.model_selection import train_test_split
from elasticsearch import Elasticsearch
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow.keras import layers
from tensorflow.keras.layers.experimental import preprocessing
import tensorflow_io as tfio

Validate tf and tfio imports

print("tensorflow-io version: {}".format(tfio.__version__))
print("tensorflow version: {}".format(tf.__version__))
tensorflow-io version: 0.16.0
tensorflow version: 2.3.0

Download and setup the Elasticsearch instance

For demo purposes, the open-source version of the elasticsearch package is used.

%%bash
wget -q https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-oss-7.9.2-linux-x86_64.tar.gz
wget -q https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-oss-7.9.2-linux-x86_64.tar.gz.sha512
tar -xzf elasticsearch-oss-7.9.2-linux-x86_64.tar.gz
sudo chown -R daemon:daemon elasticsearch-7.9.2/
shasum -a 512 -c elasticsearch-oss-7.9.2-linux-x86_64.tar.gz.sha512
elasticsearch-oss-7.9.2-linux-x86_64.tar.gz: OK

Run the instance as a daemon process

%%bash --bg
sudo -H -u daemon elasticsearch-7.9.2/bin/elasticsearch
Starting job # 0 in a separate thread.
# Sleep for few seconds to let the instance start.
time.sleep(20)

Once the instance has been started, grep for elasticsearch in the process list to confirm its availability.

%%bash
ps -ef | grep elasticsearch
root 144 142 0 21:24 ? 00:00:00 sudo -H -u daemon elasticsearch-7.9.2/bin/elasticsearch
daemon 145 144 86 21:24 ? 00:00:17 /content/elasticsearch-7.9.2/jdk/bin/java -Xshare:auto -Des.networkaddress.cache.ttl=60 -Des.networkaddress.cache.negative.ttl=10 -XX:+AlwaysPreTouch -Xss1m -Djava.awt.headless=true -Dfile.encoding=UTF-8 -Djna.nosys=true -XX:-OmitStackTraceInFastThrow -XX:+ShowCodeDetailsInExceptionMessages -Dio.netty.noUnsafe=true -Dio.netty.noKeySetOptimization=true -Dio.netty.recycler.maxCapacityPerThread=0 -Dio.netty.allocator.numDirectArenas=0 -Dlog4j.shutdownHookEnabled=false -Dlog4j2.disable.jmx=true -Djava.locale.providers=SPI,COMPAT -Xms1g -Xmx1g -XX:+UseG1GC -XX:G1ReservePercent=25 -XX:InitiatingHeapOccupancyPercent=30 -Djava.io.tmpdir=/tmp/elasticsearch-16913031424109346409 -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=data -XX:ErrorFile=logs/hs_err_pid%p.log -Xlog:gc*,gc+age=trace,safepoint:file=logs/gc.log:utctime,pid,tags:filecount=32,filesize=64m -XX:MaxDirectMemorySize=536870912 -Des.path.home=/content/elasticsearch-7.9.2 -Des.path.conf=/content/elasticsearch-7.9.2/config -Des.distribution.flavor=oss -Des.distribution.type=tar -Des.bundled_jdk=true -cp /content/elasticsearch-7.9.2/lib/* org.elasticsearch.bootstrap.Elasticsearch
root 382 380 0 21:24 ? 00:00:00 grep elasticsearch

Query the base endpoint to retrieve information about the cluster.

%%bash
curl -sX GET "localhost:9200/"
{ "name" : "d1bc7d054c69", "cluster_name" : "elasticsearch", "cluster_uuid" : "P8YXfKqYS-OS3k9CdMmlsw", "version" : { "number" : "7.9.2", "build_flavor" : "oss", "build_type" : "tar", "build_hash" : "d34da0ea4a966c4e49417f2da2f244e3e97b4e6e", "build_date" : "2020-09-23T00:45:33.626720Z", "build_snapshot" : false, "lucene_version" : "8.6.2", "minimum_wire_compatibility_version" : "6.8.0", "minimum_index_compatibility_version" : "6.0.0-beta1" }, "tagline" : "You Know, for Search" }

Explore the dataset

For the purpose of this tutorial, let's download the PetFinder dataset and feed the data into elasticsearch manually. The goal of this classification problem is to predict if a pet will be adopted or not.

dataset_url = 'http://storage.googleapis.com/download.tensorflow.org/data/petfinder-mini.zip'
csv_file = 'datasets/petfinder-mini/petfinder-mini.csv'

tf.keras.utils.get_file('petfinder_mini.zip', dataset_url,
                        extract=True, cache_dir='.')
pf_df = pd.read_csv(csv_file)
Downloading data from http://storage.googleapis.com/download.tensorflow.org/data/petfinder-mini.zip
1671168/1668792 [==============================] - 0s 0us/step
pf_df.head()

For the purpose of this tutorial, modifications are made to the label column: 0 indicates the pet was not adopted, and 1 indicates that it was.

# In the original dataset "4" indicates the pet was not adopted.
pf_df['target'] = np.where(pf_df['AdoptionSpeed']==4, 0, 1)

# Drop un-used columns.
pf_df = pf_df.drop(columns=['AdoptionSpeed', 'Description'])
# Number of datapoints and columns
len(pf_df), len(pf_df.columns)
(11537, 14)

Split the dataset

train_df, test_df = train_test_split(pf_df, test_size=0.3, shuffle=True)
print("Number of training samples: ",len(train_df))
print("Number of testing sample: ",len(test_df))
Number of training samples:  8075
Number of testing sample:  3462

Store the train and test data in elasticsearch indices

Storing the data in elasticsearch simulates an environment for continuous remote data retrieval for training and inference purposes.

ES_NODES = "http://localhost:9200"

def prepare_es_data(index, doc_type, df):
  records = df.to_dict(orient="records")
  es_data = []
  for idx, record in enumerate(records):
    meta_dict = {
        "index": {
            "_index": index,
            "_type": doc_type,
            "_id": idx
        }
    }
    es_data.append(meta_dict)
    es_data.append(record)

  return es_data

def index_es_data(index, es_data):
  es_client = Elasticsearch(hosts = [ES_NODES])
  if es_client.indices.exists(index):
    print("deleting the '{}' index.".format(index))
    res = es_client.indices.delete(index=index)
    print("Response from server: {}".format(res))

  print("creating the '{}' index.".format(index))
  res = es_client.indices.create(index=index)
  print("Response from server: {}".format(res))

  print("bulk index the data")
  res = es_client.bulk(index=index, body=es_data, refresh = True)
  print("Errors: {}, Num of records indexed: {}".format(
      res["errors"], len(res["items"])))
train_es_data = prepare_es_data(index="train", doc_type="pet", df=train_df)
test_es_data = prepare_es_data(index="test", doc_type="pet", df=test_df)

index_es_data(index="train", es_data=train_es_data)
time.sleep(3)
index_es_data(index="test", es_data=test_es_data)
creating the 'train' index.
Response from server: {'acknowledged': True, 'shards_acknowledged': True, 'index': 'train'}
bulk index the data
/usr/local/lib/python3.6/dist-packages/elasticsearch/connection/base.py:190: ElasticsearchDeprecationWarning: [types removal] Specifying types in bulk requests is deprecated.
  warnings.warn(message, category=ElasticsearchDeprecationWarning)
Errors: False, Num of records indexed: 8075
creating the 'test' index.
Response from server: {'acknowledged': True, 'shards_acknowledged': True, 'index': 'test'}
bulk index the data
Errors: False, Num of records indexed: 3462
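As an optional sanity check (not part of the original flow), the document counts in both indices can be compared against the dataframe splits created earlier:

# Optional sanity check (illustrative): the document counts should match the
# dataframe splits created above (8075 train, 3462 test).
es_client = Elasticsearch(hosts=[ES_NODES])
print("train docs:", es_client.count(index="train")["count"])
print("test docs:", es_client.count(index="test")["count"])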

Prepare tfio datasets

Once the data is available in the cluster, only tensorflow-io is required to stream the data from the indices into the models. The elasticsearch.ElasticsearchIODataset class is utilized for this purpose. The class inherits from tf.data.Dataset, and thus exposes all the useful functionalities of tf.data.Dataset out of the box.
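For instance, standard transformations such as map, filter, batch, and prefetch can be chained directly onto the stream. The following is only an illustrative sketch (the argument values mirror the training cell below, and the filter condition is hypothetical):

# A sketch of chaining regular tf.data transformations onto the stream.
sketch_ds = tfio.experimental.elasticsearch.ElasticsearchIODataset(
        nodes=["http://localhost:9200"], index="train", doc_type="pet")
sketch_ds = sketch_ds.filter(lambda v: v["Age"] < 12)  # e.g. keep young pets only
sketch_ds = sketch_ds.batch(16).prefetch(tf.data.experimental.AUTOTUNE)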

Training dataset

BATCH_SIZE=32
HEADERS = {"Content-Type": "application/json"}

train_ds = tfio.experimental.elasticsearch.ElasticsearchIODataset(
        nodes=[ES_NODES],
        index="train",
        doc_type="pet",
        headers=HEADERS
    )

# Prepare a tuple of (features, label)
train_ds = train_ds.map(lambda v: (v, v.pop("target")))
train_ds = train_ds.batch(BATCH_SIZE)
Connection successful: http://localhost:9200/_cluster/health

Testing dataset

test_ds = tfio.experimental.elasticsearch.ElasticsearchIODataset(
        nodes=[ES_NODES],
        index="test",
        doc_type="pet",
        headers=HEADERS
    )

# Prepare a tuple of (features, label)
test_ds = test_ds.map(lambda v: (v, v.pop("target")))
test_ds = test_ds.batch(BATCH_SIZE)
Connection successful: http://localhost:9200/_cluster/health

Define the keras preprocessing layers

As per the structured data tutorial, it is recommended to use the Keras preprocessing layers, as they are more intuitive and can be easily integrated with the models. However, the standard feature_columns can also be used.

For a better understanding of the preprocessing_layers in classifying structured data, please refer to the structured data tutorial.
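For reference, the feature_columns alternative mentioned above would express a numeric feature roughly as in the following sketch (illustrative only; the rest of this tutorial sticks with the preprocessing layers):

# Illustrative feature_column equivalent of a numeric input; not used below.
photo_amt_col = tf.feature_column.numeric_column('PhotoAmt')
feature_layer = tf.keras.layers.DenseFeatures([photo_amt_col])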

def get_normalization_layer(name, dataset):
  # Create a Normalization layer for our feature.
  normalizer = preprocessing.Normalization()

  # Prepare a Dataset that only yields our feature.
  feature_ds = dataset.map(lambda x, y: x[name])

  # Learn the statistics of the data.
  normalizer.adapt(feature_ds)

  return normalizer

def get_category_encoding_layer(name, dataset, dtype, max_tokens=None):
  # Create a StringLookup layer which will turn strings into integer indices
  if dtype == 'string':
    index = preprocessing.StringLookup(max_tokens=max_tokens)
  else:
    index = preprocessing.IntegerLookup(max_values=max_tokens)

  # Prepare a Dataset that only yields our feature
  feature_ds = dataset.map(lambda x, y: x[name])

  # Learn the set of possible values and assign them a fixed integer index.
  index.adapt(feature_ds)

  # Create a Discretization for our integer indices.
  encoder = preprocessing.CategoryEncoding(max_tokens=index.vocab_size())

  # Prepare a Dataset that only yields our feature.
  feature_ds = feature_ds.map(index)

  # Learn the space of possible indices.
  encoder.adapt(feature_ds)

  # Apply one-hot encoding to our indices. The lambda function captures the
  # layer so you can use them, or include them in the functional model later.
  return lambda feature: encoder(index(feature))

Fetch a batch and observe the features of a sample record. This will help in defining the keras preprocessing layers for training the tf.keras model.

ds_iter = iter(train_ds)
features, label = next(ds_iter)

{key: value.numpy()[0] for key,value in features.items()}
{'Age': 2,
 'Breed1': b'Tabby',
 'Color1': b'Black',
 'Color2': b'Cream',
 'Fee': 0,
 'FurLength': b'Short',
 'Gender': b'Male',
 'Health': b'Healthy',
 'MaturitySize': b'Small',
 'PhotoAmt': 4,
 'Sterilized': b'No',
 'Type': b'Cat',
 'Vaccinated': b'No'}
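Before wiring the layers into the full model, one of them can be tried eagerly on the batch fetched above. A hedged sketch, reshaping the feature to the (batch, 1) shape the model inputs use below:

# Illustrative: normalize the 'PhotoAmt' feature of the sample batch above.
photo_amt_batch = tf.expand_dims(tf.cast(features['PhotoAmt'], tf.float32), -1)
photo_norm_layer = get_normalization_layer('PhotoAmt', train_ds)
print(photo_norm_layer(photo_amt_batch)[:3])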

Choose a subset of features.

all_inputs = []
encoded_features = []

# Numeric features.
for header in ['PhotoAmt', 'Fee']:
  numeric_col = tf.keras.Input(shape=(1,), name=header)
  normalization_layer = get_normalization_layer(header, train_ds)
  encoded_numeric_col = normalization_layer(numeric_col)
  all_inputs.append(numeric_col)
  encoded_features.append(encoded_numeric_col)

# Categorical features encoded as string.
categorical_cols = ['Type', 'Color1', 'Color2', 'Gender', 'MaturitySize',
                    'FurLength', 'Vaccinated', 'Sterilized', 'Health', 'Breed1']
for header in categorical_cols:
  categorical_col = tf.keras.Input(shape=(1,), name=header, dtype='string')
  encoding_layer = get_category_encoding_layer(header, train_ds, dtype='string',
                                               max_tokens=5)
  encoded_categorical_col = encoding_layer(categorical_col)
  all_inputs.append(categorical_col)
  encoded_features.append(encoded_categorical_col)

Build, compile and train the model

# Set the parameters
OPTIMIZER="adam"
LOSS=tf.keras.losses.BinaryCrossentropy(from_logits=True)
METRICS=['accuracy']
EPOCHS=10
# Convert the feature columns into a tf.keras layer
all_features = tf.keras.layers.concatenate(encoded_features)

# design/build the model
x = tf.keras.layers.Dense(32, activation="relu")(all_features)
x = tf.keras.layers.Dropout(0.5)(x)
x = tf.keras.layers.Dense(64, activation="relu")(x)
x = tf.keras.layers.Dropout(0.5)(x)
output = tf.keras.layers.Dense(1)(x)
model = tf.keras.Model(all_inputs, output)

tf.keras.utils.plot_model(model, rankdir='LR', show_shapes=True)
(Output: model architecture diagram rendered by tf.keras.utils.plot_model.)
# compile the model
model.compile(optimizer=OPTIMIZER, loss=LOSS, metrics=METRICS)
# fit the model
model.fit(train_ds, epochs=EPOCHS)
Epoch 1/10
/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/functional.py:543: UserWarning: Input dict contained keys ['Age'] which did not match any model input. They will be ignored by the model.
  [n for n in tensors.keys() if n not in ref_input_names])
253/253 [==============================] - 4s 14ms/step - loss: 0.6169 - accuracy: 0.6042
Epoch 2/10
253/253 [==============================] - 4s 14ms/step - loss: 0.5634 - accuracy: 0.6937
Epoch 3/10
253/253 [==============================] - 4s 15ms/step - loss: 0.5573 - accuracy: 0.6981
Epoch 4/10
253/253 [==============================] - 4s 15ms/step - loss: 0.5528 - accuracy: 0.7087
Epoch 5/10
253/253 [==============================] - 4s 14ms/step - loss: 0.5512 - accuracy: 0.7173
Epoch 6/10
253/253 [==============================] - 4s 15ms/step - loss: 0.5456 - accuracy: 0.7219
Epoch 7/10
253/253 [==============================] - 4s 15ms/step - loss: 0.5397 - accuracy: 0.7283
Epoch 8/10
253/253 [==============================] - 4s 14ms/step - loss: 0.5385 - accuracy: 0.7331
Epoch 9/10
253/253 [==============================] - 4s 15ms/step - loss: 0.5355 - accuracy: 0.7326
Epoch 10/10
253/253 [==============================] - 4s 15ms/step - loss: 0.5412 - accuracy: 0.7321
<tensorflow.python.keras.callbacks.History at 0x7f5c235112e8>

Infer on the test data

res = model.evaluate(test_ds)
print("test loss, test acc:", res)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/functional.py:543: UserWarning: Input dict contained keys ['Age'] which did not match any model input. They will be ignored by the model.
  [n for n in tensors.keys() if n not in ref_input_names])
109/109 [==============================] - 2s 15ms/step - loss: 0.5344 - accuracy: 0.7421
test loss, test acc: [0.534355640411377, 0.7420566082000732]
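Beyond the aggregate metrics, per-sample predictions can also be pulled from the streamed test data. Since the model outputs logits (the loss was built with from_logits=True), a sigmoid converts them into adoption probabilities. An illustrative sketch:

# Illustrative: predict adoption probabilities for one streamed test batch.
sample_features, sample_labels = next(iter(test_ds))
probs = tf.nn.sigmoid(model.predict(sample_features))
print("Predicted adoption probability:", probs[0].numpy())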

Note: Since the goal of this tutorial is to demonstrate Tensorflow-IO's capability of streaming data from elasticsearch and training tf.keras models directly, improving the accuracy of the models is out of the current scope. However, the user can explore the dataset and play around with the feature columns and model architectures to get better classification performance.