GitHub Repository: tensorflow/docs-l10n
Path: blob/master/site/ko/io/tutorials/mongodb.ipynb
#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

Tensorflow datasets from MongoDB collections

Overview

This tutorial focuses on preparing tf.data.Datasets by reading data from mongoDB collections and using them to train a tf.keras model.

Note: A basic understanding of mongodb storage will help you follow the tutorial with ease.

Setup packages

This tutorial uses pymongo as a helper package to create a new mongodb database and collection for storing the data.

Install the required tensorflow-io and mongodb (helper) packages

!pip install -q tensorflow-io
!pip install -q pymongo
WARNING: Ignoring invalid distribution -eras (/usr/local/lib/python3.7/dist-packages)

Import packages

import os
import time
from pprint import pprint
from sklearn.model_selection import train_test_split
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow.keras import layers
from tensorflow.keras.layers.experimental import preprocessing
import tensorflow_io as tfio
from pymongo import MongoClient

Validate tf and tfio imports

print("tensorflow-io version: {}".format(tfio.__version__)) print("tensorflow version: {}".format(tf.__version__))
tensorflow-io version: 0.20.0
tensorflow version: 2.6.0

Download and setup the MongoDB instance

For demo purposes, the open-source edition of mongodb is used.

%%bash
sudo apt install -y mongodb >log
service mongodb start
* Starting database mongodb ...done.
WARNING: apt does not have a stable CLI interface. Use with caution in scripts.
debconf: unable to initialize frontend: Dialog
debconf: (No usable dialog-like program is installed, so the dialog based frontend cannot be used. at /usr/share/perl5/Debconf/FrontEnd/Dialog.pm line 76, <> line 8.)
debconf: falling back to frontend: Readline
debconf: unable to initialize frontend: Readline
debconf: (This frontend requires a controlling tty.)
debconf: falling back to frontend: Teletype
dpkg-preconfigure: unable to re-open stdin:
# Sleep for a few seconds to let the instance start.
time.sleep(5)

Once the instance has started, grep for mongo in the process list to confirm its availability.

%%bash
ps -ef | grep mongo
mongodb 580 1 13 17:38 ? 00:00:00 /usr/bin/mongod --config /etc/mongodb.conf
root 612 610 0 17:38 ? 00:00:00 grep mongo

Query the base endpoint to retrieve information about the cluster.

client = MongoClient()
client.list_database_names()  # ['admin', 'local']
['admin', 'local']

Explore the dataset

For the purpose of this tutorial, the PetFinder dataset is downloaded and the data is fed into mongodb manually. The goal of this classification problem is to predict whether the pet was adopted or not.

dataset_url = 'http://storage.googleapis.com/download.tensorflow.org/data/petfinder-mini.zip'
csv_file = 'datasets/petfinder-mini/petfinder-mini.csv'

tf.keras.utils.get_file('petfinder_mini.zip', dataset_url,
                        extract=True, cache_dir='.')
pf_df = pd.read_csv(csv_file)
Downloading data from http://storage.googleapis.com/download.tensorflow.org/data/petfinder-mini.zip
1671168/1668792 [==============================] - 0s 0us/step
1679360/1668792 [==============================] - 0s 0us/step
pf_df.head()

For the purpose of this tutorial, the label column is modified: 0 indicates the pet was not adopted and 1 indicates that it was.

# In the original dataset "4" indicates the pet was not adopted.
pf_df['target'] = np.where(pf_df['AdoptionSpeed']==4, 0, 1)

# Drop unused columns.
pf_df = pf_df.drop(columns=['AdoptionSpeed', 'Description'])
# Number of datapoints and columns
len(pf_df), len(pf_df.columns)
(11537, 14)

Split the dataset

train_df, test_df = train_test_split(pf_df, test_size=0.3, shuffle=True)
print("Number of training samples: ", len(train_df))
print("Number of testing samples: ", len(test_df))
Number of training samples:  8075
Number of testing samples:  3462

Store the train and test data in mongo collections

URI = "mongodb://localhost:27017" DATABASE = "tfiodb" TRAIN_COLLECTION = "train" TEST_COLLECTION = "test"
db = client[DATABASE]
if "train" not in db.list_collection_names():
  db.create_collection(TRAIN_COLLECTION)
if "test" not in db.list_collection_names():
  db.create_collection(TEST_COLLECTION)
def store_records(collection, records):
  writer = tfio.experimental.mongodb.MongoDBWriter(
      uri=URI, database=DATABASE, collection=collection
  )
  for record in records:
    writer.write(record)
store_records(collection="train", records=train_df.to_dict("records")) time.sleep(2) store_records(collection="test", records=test_df.to_dict("records"))

Prepare tfio datasets

Once the data is available in the cluster, the mongodb.MongoDBIODataset class is utilized for this purpose. Since the class inherits from tf.data.Dataset, all of the useful functionality of tf.data.Dataset is available out of the box.

Training dataset

train_ds = tfio.experimental.mongodb.MongoDBIODataset(
    uri=URI, database=DATABASE, collection=TRAIN_COLLECTION
)

train_ds
Connection successful: mongodb://localhost:27017
WARNING:tensorflow:From /usr/local/lib/python3.7/dist-packages/tensorflow/python/data/experimental/ops/counter.py:66: scan (from tensorflow.python.data.experimental.ops.scan_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.Dataset.scan(...) instead
WARNING:tensorflow:From /usr/local/lib/python3.7/dist-packages/tensorflow_io/python/experimental/mongodb_dataset_ops.py:114: take_while (from tensorflow.python.data.experimental.ops.take_while_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.Dataset.take_while(...)
<MongoDBIODataset shapes: (), types: tf.string>

Each item in train_ds is a string that needs to be decoded into a json. To do so, you can select only a subset of the columns by specifying the TensorSpec.
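For instance, to see what a raw, undecoded record looks like (an illustrative peek, not part of the original notebook), you could take a single element from the dataset:

# Illustrative only: each element is a JSON string produced by MongoDBWriter.
for raw_record in train_ds.take(1):
  print(raw_record.numpy())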

# Numeric features.
numerical_cols = ['PhotoAmt', 'Fee']

SPECS = {
    "target": tf.TensorSpec(tf.TensorShape([]), tf.int64, name="target"),
}
for col in numerical_cols:
  SPECS[col] = tf.TensorSpec(tf.TensorShape([]), tf.int32, name=col)
pprint(SPECS)
{'Fee': TensorSpec(shape=(), dtype=tf.int32, name='Fee'),
 'PhotoAmt': TensorSpec(shape=(), dtype=tf.int32, name='PhotoAmt'),
 'target': TensorSpec(shape=(), dtype=tf.int64, name='target')}
BATCH_SIZE=32
train_ds = train_ds.map(
    lambda x: tfio.experimental.serialization.decode_json(x, specs=SPECS)
)

# Prepare a tuple of (features, label)
train_ds = train_ds.map(lambda v: (v, v.pop("target")))
train_ds = train_ds.batch(BATCH_SIZE)

train_ds
<BatchDataset shapes: ({PhotoAmt: (None,), Fee: (None,)}, (None,)), types: ({PhotoAmt: tf.int32, Fee: tf.int32}, tf.int64)>

Testing dataset

test_ds = tfio.experimental.mongodb.MongoDBIODataset(
    uri=URI, database=DATABASE, collection=TEST_COLLECTION
)
test_ds = test_ds.map(
    lambda x: tfio.experimental.serialization.decode_json(x, specs=SPECS)
)

# Prepare a tuple of (features, label)
test_ds = test_ds.map(lambda v: (v, v.pop("target")))
test_ds = test_ds.batch(BATCH_SIZE)

test_ds
Connection successful: mongodb://localhost:27017
<BatchDataset shapes: ({PhotoAmt: (None,), Fee: (None,)}, (None,)), types: ({PhotoAmt: tf.int32, Fee: tf.int32}, tf.int64)>

Define the keras preprocessing layers

As per the structured data tutorial, it is recommended to use the Keras preprocessing layers, as they are more intuitive and can be easily integrated with the models. However, the standard feature_columns can also be used.

For a better understanding of preprocessing_layers in classifying structured data, please refer to the structured data tutorial.

def get_normalization_layer(name, dataset):
  # Create a Normalization layer for our feature.
  normalizer = preprocessing.Normalization(axis=None)

  # Prepare a Dataset that only yields our feature.
  feature_ds = dataset.map(lambda x, y: x[name])

  # Learn the statistics of the data.
  normalizer.adapt(feature_ds)

  return normalizer
all_inputs = []
encoded_features = []

for header in numerical_cols:
  numeric_col = tf.keras.Input(shape=(1,), name=header)
  normalization_layer = get_normalization_layer(header, train_ds)
  encoded_numeric_col = normalization_layer(numeric_col)
  all_inputs.append(numeric_col)
  encoded_features.append(encoded_numeric_col)
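The pipeline above only normalizes the two numeric columns. If you also wanted to feed a categorical PetFinder column (for example 'Type'), a hedged sketch following the structured data tutorial could look like the helper below; note that such a column would also have to be added to SPECS as a tf.string feature, which is not done in this tutorial.

# Hypothetical helper (not used above): encode a string column with a
# StringLookup index followed by a CategoryEncoding layer.
def get_category_encoding_layer(name, dataset, max_tokens=None):
  index = preprocessing.StringLookup(max_tokens=max_tokens)
  # Learn the vocabulary from the feature values.
  feature_ds = dataset.map(lambda x, y: x[name])
  index.adapt(feature_ds)
  # Multi-hot encode the integer indices.
  encoder = preprocessing.CategoryEncoding(num_tokens=index.vocabulary_size())
  return lambda feature: encoder(index(feature))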

Build, compile and train the model

# Set the parameters
OPTIMIZER="adam"
LOSS=tf.keras.losses.BinaryCrossentropy(from_logits=True)
METRICS=['accuracy']
EPOCHS=10
# Convert the feature columns into a tf.keras layer
all_features = tf.keras.layers.concatenate(encoded_features)

# design/build the model
x = tf.keras.layers.Dense(32, activation="relu")(all_features)
x = tf.keras.layers.Dropout(0.5)(x)
x = tf.keras.layers.Dense(64, activation="relu")(x)
x = tf.keras.layers.Dropout(0.5)(x)
output = tf.keras.layers.Dense(1)(x)
model = tf.keras.Model(all_inputs, output)
# compile the model
model.compile(optimizer=OPTIMIZER, loss=LOSS, metrics=METRICS)
# fit the model
model.fit(train_ds, epochs=EPOCHS)
Epoch 1/10
109/109 [==============================] - 1s 2ms/step - loss: 0.6261 - accuracy: 0.4711
Epoch 2/10
109/109 [==============================] - 0s 3ms/step - loss: 0.5939 - accuracy: 0.6967
Epoch 3/10
109/109 [==============================] - 0s 3ms/step - loss: 0.5900 - accuracy: 0.6993
Epoch 4/10
109/109 [==============================] - 0s 3ms/step - loss: 0.5846 - accuracy: 0.7146
Epoch 5/10
109/109 [==============================] - 0s 3ms/step - loss: 0.5824 - accuracy: 0.7178
Epoch 6/10
109/109 [==============================] - 0s 2ms/step - loss: 0.5778 - accuracy: 0.7233
Epoch 7/10
109/109 [==============================] - 0s 3ms/step - loss: 0.5810 - accuracy: 0.7083
Epoch 8/10
109/109 [==============================] - 0s 3ms/step - loss: 0.5791 - accuracy: 0.7149
Epoch 9/10
109/109 [==============================] - 0s 3ms/step - loss: 0.5742 - accuracy: 0.7207
Epoch 10/10
109/109 [==============================] - 0s 2ms/step - loss: 0.5797 - accuracy: 0.7083
<keras.callbacks.History at 0x7f743229fe90>

Infer on the test data

res = model.evaluate(test_ds)
print("test loss, test acc:", res)
109/109 [==============================] - 0s 2ms/step - loss: 0.5696 - accuracy: 0.7383
test loss, test acc: [0.569588840007782, 0.7383015751838684]
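Beyond the aggregate loss and accuracy, you could also look at individual predictions. A minimal sketch (not part of the original notebook): since the final Dense layer returns logits (the loss was built with from_logits=True), pass them through a sigmoid to get adoption probabilities for one test batch.

# Illustrative only: predict on a single test batch and convert logits
# to probabilities.
logits = model.predict(test_ds.take(1))
probs = tf.sigmoid(logits)
print(probs[:5])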

Note: Since the goal of this tutorial is to demonstrate Tensorflow-IO's capability to prepare tf.data.Datasets from mongodb and train tf.keras models directly, improving the accuracy of the model is out of the current scope. However, the user can explore the dataset and experiment with the feature columns and model architectures to get better classification performance.