GitHub Repository: tensorflow/docs-l10n
Path: blob/master/site/ko/tutorials/load_data/text.ipynb
#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

ν…μŠ€νŠΈ λ‘œλ“œν•˜κΈ°

이 νŠœν† λ¦¬μ–Όμ€ ν…μŠ€νŠΈλ₯Ό λ‘œλ“œν•˜κ³  μ „μ²˜λ¦¬ν•˜λŠ” 두 κ°€μ§€ 방법을 보여 μ€λ‹ˆλ‹€.

  • λ¨Όμ € Keras μœ ν‹Έλ¦¬ν‹°μ™€ μ „μ²˜λ¦¬ λ ˆμ΄μ–΄λ₯Ό μ‚¬μš©ν•©λ‹ˆλ‹€. μ—¬κΈ°μ—λŠ” 데이터 ν‘œμ€€ν™”, 토큰화 및 벑터화λ₯Ό μœ„ν•΄ 데이터λ₯Ό tf.data.Dataset와 tf.keras.layers.TextVectorization으둜 λ³€ν™˜ν•˜λŠ” tf.keras.utils.text_dataset_from_directoryκ°€ ν¬ν•¨λ˜μ–΄ μžˆμŠ΅λ‹ˆλ‹€. TensorFlowκ°€ 처음이라면 μ—¬κΈ°μ„œλΆ€ν„° μ‹œμž‘ν•΄μ•Ό ν•©λ‹ˆλ‹€.

  • 그런 λ‹€μŒ tf.data.TextLineDataset와 같은 ν•˜μœ„ μˆ˜μ€€ μœ ν‹Έλ¦¬ν‹°λ₯Ό μ‚¬μš©ν•˜μ—¬ ν…μŠ€νŠΈ νŒŒμΌμ„ λ‘œλ“œν•˜κ³ , 보닀 μ„Έλ°€ν•œ μ œμ–΄λ₯Ό μœ„ν•΄ text.UnicodeScriptTokenizer 및 text.case_fold_utf8κ³Ό 같은 TensorFlow Text APIλ₯Ό μ‚¬μš©ν•˜μ—¬ 데이터λ₯Ό μ²˜λ¦¬ν•©λ‹ˆλ‹€.

!pip install "tensorflow-text==2.11.*"
import collections
import pathlib

import tensorflow as tf

from tensorflow.keras import layers
from tensorflow.keras import losses
from tensorflow.keras import utils
from tensorflow.keras.layers import TextVectorization

import tensorflow_datasets as tfds
import tensorflow_text as tf_text

Example 1: Predict the tag for a Stack Overflow question

As a first example, you will download a dataset of programming questions from Stack Overflow. Each question ("How do I sort a dictionary by value?") is labeled with exactly one tag (Python, CSharp, JavaScript, or Java). Your task is to develop a model that predicts the tag for a question. This is an example of multi-class classification, an important and widely applicable kind of machine learning problem.

λ°μ΄ν„°μ„ΈνŠΈ λ‹€μš΄λ‘œλ“œ 및 νƒμƒ‰ν•˜κΈ°

Begin by downloading the Stack Overflow dataset using tf.keras.utils.get_file, and exploring the directory structure:

data_url = 'https://storage.googleapis.com/download.tensorflow.org/data/stack_overflow_16k.tar.gz'

dataset_dir = utils.get_file(
    origin=data_url,
    untar=True,
    cache_dir='stack_overflow',
    cache_subdir='')

dataset_dir = pathlib.Path(dataset_dir).parent
list(dataset_dir.iterdir())
train_dir = dataset_dir/'train'
list(train_dir.iterdir())

The train/csharp, train/java, train/python and train/javascript directories contain many text files, each of which is a Stack Overflow question.
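
If you want to sanity-check the layout first, a small sketch like the following (not part of the original notebook) counts the question files under each per-language directory:

# Count the `.txt` question files in each class subdirectory of `train/`.
for class_dir in sorted(train_dir.iterdir()):
  if class_dir.is_dir():
    print(class_dir.name, "->", len(list(class_dir.glob('*.txt'))))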

λ‹€μŒκ³Ό 같이 예제 νŒŒμΌμ„ 좜λ ₯ν•˜κ³  데이터λ₯Ό κ²€μ‚¬ν•©λ‹ˆλ‹€.

sample_file = train_dir/'python/1755.txt'

with open(sample_file) as f:
  print(f.read())

λ°μ΄ν„°μ„ΈνŠΈ λ‘œλ“œν•˜κΈ°

λ‹€μŒμœΌλ‘œ λ””μŠ€ν¬λ‘œλΆ€ν„° 데이터λ₯Ό λ‘œλ“œν•˜κ³  데이터λ₯Ό ν›ˆλ ¨μ— μ ν•©ν•œ ν˜•μ‹μœΌλ‘œ μ€€λΉ„ν•©λ‹ˆλ‹€. μ΄λ ‡κ²Œ ν•˜κΈ° μœ„ν•΄ μ—¬λŸ¬λΆ„μ€ tf.keras.utils.text_dataset_from_directory μœ ν‹Έλ¦¬ν‹°λ₯Ό μ‚¬μš©ν•˜μ—¬ λ ˆμ΄λΈ”μ΄ μ§€μ •λœ tf.data.Datasetλ₯Ό μƒμ„±ν•©λ‹ˆλ‹€. tf.dataλŠ” μž…λ ₯ νŒŒμ΄ν”„λΌμΈμ„ κ΅¬μΆ•ν•˜λŠ” κ°•λ ₯ν•œ 도ꡬ λͺ¨μŒμž…λ‹ˆλ‹€(tf.data: TensorFlow μž…λ ₯ νŒŒμ΄ν”„λΌμΈ λΉŒλ“œ κ°€μ΄λ“œμ—μ„œ μžμ„Ένžˆ μ•Œμ•„λ³΄μ„Έμš”).

The tf.keras.utils.text_dataset_from_directory API expects a directory structure as follows:

train/
...csharp/
......1.txt
......2.txt
...java/
......1.txt
......2.txt
...javascript/
......1.txt
......2.txt
...python/
......1.txt
......2.txt

When running a machine learning experiment, it is a best practice to divide your dataset into three splits: training, validation, and test.

The Stack Overflow dataset has already been divided into training and test sets, but it lacks a validation set.

Create a validation set using an 80:20 split of the training data by using tf.keras.utils.text_dataset_from_directory with validation_split set to 0.2 (i.e. 20%):

batch_size = 32
seed = 42

raw_train_ds = utils.text_dataset_from_directory(
    train_dir,
    batch_size=batch_size,
    validation_split=0.2,
    subset='training',
    seed=seed)

As the previous cell output suggests, there are 8,000 examples in the training folder, of which you will use 80% (or 6,400) for training. You will learn shortly that you can train a model by passing a tf.data.Dataset directly to Model.fit.

First, iterate over the dataset and print out a few examples to get a feel for the data.

Note: To increase the difficulty of the classification problem, the dataset author replaced occurrences of the words Python, CSharp, JavaScript, or Java in the programming questions with the word blank.

for text_batch, label_batch in raw_train_ds.take(1):
  for i in range(10):
    print("Question: ", text_batch.numpy()[i])
    print("Label:", label_batch.numpy()[i])

λ ˆμ΄λΈ”μ€ 0, 1, 2 λ˜λŠ” 3μž…λ‹ˆλ‹€. 이듀이 μ–΄λ– ν•œ λ¬Έμžμ—΄ λ ˆμ΄λΈ”μ— ν•΄λ‹Ήν•˜λŠ”μ§€ ν™•μΈν•˜λ €λ©΄ λ°μ΄ν„°μ„ΈνŠΈμ˜ class_names 속성을 κ²€μ‚¬ν•˜λ©΄ λ©λ‹ˆλ‹€.

for i, label in enumerate(raw_train_ds.class_names):
  print("Label", i, "corresponds to", label)

λ‹€μŒμœΌλ‘œ tf.keras.utils.text_dataset_from_directoryλ₯Ό μ‚¬μš©ν•˜μ—¬ 검증 및 ν…ŒμŠ€νŠΈ μ„ΈνŠΈλ₯Ό λ§Œλ“­λ‹ˆλ‹€. 검증을 μœ„ν•΄ ν›ˆλ ¨ μ„ΈνŠΈμ˜ λ‚˜λ¨Έμ§€ 1,600개 리뷰λ₯Ό μ‚¬μš©ν•©λ‹ˆλ‹€.

Note: When using the validation_split and subset arguments of tf.keras.utils.text_dataset_from_directory, make sure to either specify a random seed or pass shuffle=False, so that the validation and training splits have no overlap.

# Create a validation set.
raw_val_ds = utils.text_dataset_from_directory(
    train_dir,
    batch_size=batch_size,
    validation_split=0.2,
    subset='validation',
    seed=seed)

test_dir = dataset_dir/'test'

# Create a test set.
raw_test_ds = utils.text_dataset_from_directory(
    test_dir,
    batch_size=batch_size)

ν›ˆλ ¨μ„ μœ„ν•œ λ°μ΄ν„°μ„ΈνŠΈ μ€€λΉ„ν•˜κΈ°

λ‹€μŒμœΌλ‘œ tf.keras.layers.TextVectorization λ ˆμ΄μ–΄λ₯Ό μ‚¬μš©ν•˜μ—¬ 데이터λ₯Ό ν‘œμ€€ν™”, 토큰화 및 λ²‘ν„°ν™”ν•©λ‹ˆλ‹€.

  • ν‘œμ€€ν™”λŠ” 일반적으둜 λ°μ΄ν„°μ„ΈνŠΈλ₯Ό λ‹¨μˆœν™”ν•˜κΈ° μœ„ν•΄ κ΅¬λ‘μ μ΄λ‚˜ HTML μš”μ†Œλ₯Ό μ œκ±°ν•˜λ„λ‘ ν…μŠ€νŠΈλ₯Ό μ „μ²˜λ¦¬ν•˜λŠ” 것을 μΌμ»«μŠ΅λ‹ˆλ‹€.

  • ν† ν°ν™”λŠ” λ¬Έμžμ—΄μ„ ν† ν°μœΌλ‘œ λΆ„ν• ν•˜λŠ” 것을 μΌμ»«μŠ΅λ‹ˆλ‹€(예: λ¬Έμž₯을 곡백을 μ‚¬μš©ν•˜μ—¬ κ°œλ³„ λ‹¨μ–΄λ‘œ λΆ„ν• ).

  • λ²‘ν„°ν™”λŠ” 신경망에 μ œκ³΅ν•  수 μžˆλ„λ‘ 토큰을 숫자둜 λ³€ν™˜ν•˜λŠ” 것을 μΌμ»«μŠ΅λ‹ˆλ‹€.

μœ„μ˜ λͺ¨λ“  μž‘μ—…μ€ 이 λ ˆμ΄μ–΄λ‘œ μˆ˜ν–‰ν•  수 μžˆμŠ΅λ‹ˆλ‹€(tf.keras.layers.TextVectorization API λ¬Έμ„œμ—μ„œ 각 μž‘μ—…μ— λŒ€ν•΄ μžμ„Ένžˆ μ•Œμ•„λ³Ό 수 μžˆμŠ΅λ‹ˆλ‹€).

Note that:

  • The default standardization converts text to lowercase and removes punctuation (standardize='lower_and_strip_punctuation').

  • The default tokenizer splits on whitespace (split='whitespace').

  • The default vectorization mode is 'int' (output_mode='int'). This outputs integer indices (one per token), which can be used to build models that take word order into account. Other modes, like 'binary', can be used to build bag-of-words models.
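
To see these defaults in action before building the real pipeline, here is a minimal sketch on toy strings (the demo_layer name is illustrative, not part of this tutorial):

demo_layer = TextVectorization(output_mode='int')
demo_layer.adapt(tf.constant(["Hello, World!", "hello TensorFlow"]))

# Index 0 is reserved for padding and 1 for OOV tokens; "Hello" and "hello"
# collapse into one token because of lowercasing and punctuation stripping.
print(demo_layer.get_vocabulary())
print(demo_layer(tf.constant(["Hello, World!"])))  # Each token becomes its vocabulary index.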

You will build two models to learn more about standardization, tokenization, and vectorization with TextVectorization:

  • First, you will use the 'binary' vectorization mode to build a bag-of-words model.

  • Then, you will use the 'int' mode with a 1D ConvNet.

VOCAB_SIZE = 10000

binary_vectorize_layer = TextVectorization(
    max_tokens=VOCAB_SIZE,
    output_mode='binary')

'int' λͺ¨λ“œμ˜ 경우 μ΅œλŒ€ μ–΄νœ˜ 크기 외에 λͺ…μ‹œμ μΈ μ΅œλŒ€ μ‹œν€€μŠ€ 길이(MAX_SEQUENCE_LENGTH)λ₯Ό μ„€μ •ν•΄μ•Ό λ ˆμ΄μ–΄κ°€ νŒ¨λ”©λ˜κ±°λ‚˜ μ‹œν€€μŠ€λ₯Ό μ •ν™•νžˆ output_sequence_length κ°’μœΌλ‘œ μžλ¦…λ‹ˆλ‹€.

MAX_SEQUENCE_LENGTH = 250

int_vectorize_layer = TextVectorization(
    max_tokens=VOCAB_SIZE,
    output_mode='int',
    output_sequence_length=MAX_SEQUENCE_LENGTH)

λ‹€μŒμœΌλ‘œ μ „μ²˜λ¦¬ λ ˆμ΄μ–΄μ˜ μƒνƒœλ₯Ό λ°μ΄ν„°μ„ΈνŠΈμ— λ§žμΆ”κΈ° μœ„ν•΄ TextVectorization.adaptλ₯Ό ν˜ΈμΆœν•©λ‹ˆλ‹€. 그러면 λͺ¨λΈμ΄ λ¬Έμžμ—΄ 인덱슀λ₯Ό μ •μˆ˜λ‘œ λΉŒλ“œν•©λ‹ˆλ‹€.

Note: It's important to only use your training data when calling TextVectorization.adapt, as using the test set would leak information.

# Make a text-only dataset (without labels), then call `TextVectorization.adapt`.
train_text = raw_train_ds.map(lambda text, labels: text)
binary_vectorize_layer.adapt(train_text)
int_vectorize_layer.adapt(train_text)

μ΄λŸ¬ν•œ λ ˆμ΄μ–΄λ₯Ό μ‚¬μš©ν•˜μ—¬ 데이터λ₯Ό μ „μ²˜λ¦¬ν•œ κ²°κ³Όλ₯Ό μΈμ‡„ν•©λ‹ˆλ‹€.

def binary_vectorize_text(text, label):
  text = tf.expand_dims(text, -1)
  return binary_vectorize_layer(text), label

def int_vectorize_text(text, label):
  text = tf.expand_dims(text, -1)
  return int_vectorize_layer(text), label

# Retrieve a batch (of 32 questions and labels) from the dataset.
text_batch, label_batch = next(iter(raw_train_ds))
first_question, first_label = text_batch[0], label_batch[0]
print("Question", first_question)
print("Label", first_label)
print("'binary' vectorized question:", binary_vectorize_text(first_question, first_label)[0])
print("'int' vectorized question:", int_vectorize_text(first_question, first_label)[0])

μœ„μ— ν‘œμ‹œλœ κ²ƒμ²˜λŸΌ TextVectorization의 'binary' λͺ¨λ“œλŠ” μž…λ ₯에 ν•œ 번 이상 μ‘΄μž¬ν•˜λŠ” 토큰을 λ‚˜νƒ€λ‚΄λŠ” 배열을 λ°˜ν™˜ν•˜λŠ” 반면 'int' λͺ¨λ“œλŠ” 각 토큰을 μ •μˆ˜λ‘œ λŒ€μ²΄ν•˜κΈ°μ— μ›λž˜ μˆœμ„œλ₯Ό μœ μ§€ν•©λ‹ˆλ‹€.

λ ˆμ΄μ–΄μ—μ„œ TextVectorization.get_vocabularyλ₯Ό ν˜ΈμΆœν•˜μ—¬ 각 μ •μˆ˜κ°€ ν•΄λ‹Ήν•˜λŠ” 토큰(λ¬Έμžμ—΄)을 μ‘°νšŒν•  수 μžˆμŠ΅λ‹ˆλ‹€.

print("1289 ---> ", int_vectorize_layer.get_vocabulary()[1289]) print("313 ---> ", int_vectorize_layer.get_vocabulary()[313]) print("Vocabulary size: {}".format(len(int_vectorize_layer.get_vocabulary())))

λͺ¨λΈμ„ ν›ˆλ ¨ν•  μ€€λΉ„κ°€ 거의 λ˜μ—ˆμŠ΅λ‹ˆλ‹€.

μ΅œμ’… μ „μ²˜λ¦¬ λ‹¨κ³„λ‘œ 이전에 μƒμ„±ν•œ TextVectorization λ ˆμ΄μ–΄λ₯Ό ν›ˆλ ¨, 검증 및 ν…ŒμŠ€νŠΈ μ„ΈνŠΈμ— μ μš©ν•©λ‹ˆλ‹€.

binary_train_ds = raw_train_ds.map(binary_vectorize_text)
binary_val_ds = raw_val_ds.map(binary_vectorize_text)
binary_test_ds = raw_test_ds.map(binary_vectorize_text)

int_train_ds = raw_train_ds.map(int_vectorize_text)
int_val_ds = raw_val_ds.map(int_vectorize_text)
int_test_ds = raw_test_ds.map(int_vectorize_text)

Configure the dataset for performance

λ‹€μŒμ€ I/Oκ°€ μ°¨λ‹¨λ˜μ§€ μ•Šλ„λ‘ 데이터λ₯Ό λ‘œλ“œν•  λ•Œ μ‚¬μš©ν•΄μ•Ό ν•˜λŠ” 두 κ°€μ§€ μ€‘μš”ν•œ λ©”μ„œλ“œμž…λ‹ˆλ‹€.

  • Dataset.cacheλŠ” 데이터가 λ””μŠ€ν¬μ—μ„œ λ‘œλ“œλœ ν›„ λ©”λͺ¨λ¦¬μ— 데이터λ₯Ό λ³΄κ΄€ν•©λ‹ˆλ‹€. μ΄λ ‡κ²Œ ν•˜λ©΄ λͺ¨λΈμ„ ν›ˆλ ¨ν•˜λŠ” λ™μ•ˆ λ°μ΄ν„°μ„ΈνŠΈλ‘œ μΈν•œ 병λͺ© ν˜„μƒμ΄ λ°œμƒν•˜μ§€ μ•ŠμŠ΅λ‹ˆλ‹€. λ°μ΄ν„°μ„ΈνŠΈκ°€ λ„ˆλ¬΄ μ»€μ„œ λ©”λͺ¨λ¦¬μ— λ§žμ§€ μ•ŠλŠ” 경우 이 λ©”μ„œλ“œλ₯Ό μ‚¬μš©ν•˜μ—¬ μ„±λŠ₯이 λ›°μ–΄λ‚œ 온 λ””μŠ€ν¬ μΊμ‹œλ₯Ό 생성할 μˆ˜λ„ μžˆμŠ΅λ‹ˆλ‹€. λ‹€μˆ˜μ˜ μž‘μ€ νŒŒμΌλ³΄λ‹€ 읽기가 더 νš¨μœ¨μ μž…λ‹ˆλ‹€.

  • Dataset.prefetchλŠ” ν›ˆλ ¨ν•˜λŠ” λ™μ•ˆ 데이터 μ „μ²˜λ¦¬ 및 λͺ¨λΈ 싀행을 μ€‘μ²©μ‹œν‚΅λ‹ˆλ‹€.

You can learn more about both methods, as well as how to cache data to disk, in the Prefetching section of the Better performance with the tf.data API guide.

AUTOTUNE = tf.data.AUTOTUNE

def configure_dataset(dataset):
  return dataset.cache().prefetch(buffer_size=AUTOTUNE)

binary_train_ds = configure_dataset(binary_train_ds)
binary_val_ds = configure_dataset(binary_val_ds)
binary_test_ds = configure_dataset(binary_test_ds)

int_train_ds = configure_dataset(int_train_ds)
int_val_ds = configure_dataset(int_val_ds)
int_test_ds = configure_dataset(int_test_ds)

λͺ¨λΈ ν›ˆλ ¨ν•˜κΈ°

이제 신경망을 λ§Œλ“€ μ°¨λ‘€μž…λ‹ˆλ‹€.

'binary' λ²‘ν„°ν™”λœ λ°μ΄ν„°μ˜ 경우 κ°„λ‹¨ν•œ bag-of-words μ„ ν˜• λͺ¨λΈμ„ μ •μ˜ν•œ λ‹€μŒ 데이터λ₯Ό κ΅¬μ„±ν•˜κ³  ν›ˆλ ¨ν•©λ‹ˆλ‹€.

binary_model = tf.keras.Sequential([layers.Dense(4)])

binary_model.compile(
    loss=losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer='adam',
    metrics=['accuracy'])

history = binary_model.fit(
    binary_train_ds, validation_data=binary_val_ds, epochs=10)

그런 λ‹€μŒ 'int' λ²‘ν„°ν™”λœ λ ˆμ΄μ–΄λ₯Ό μ‚¬μš©ν•˜μ—¬ 1D ConvNet을 λΉŒλ“œν•©λ‹ˆλ‹€.

def create_model(vocab_size, num_labels):
  model = tf.keras.Sequential([
      layers.Embedding(vocab_size, 64, mask_zero=True),
      layers.Conv1D(64, 5, padding="valid", activation="relu", strides=2),
      layers.GlobalMaxPooling1D(),
      layers.Dense(num_labels)
  ])
  return model

# `vocab_size` is `VOCAB_SIZE + 1` since `0` is used additionally for padding.
int_model = create_model(vocab_size=VOCAB_SIZE + 1, num_labels=4)
int_model.compile(
    loss=losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer='adam',
    metrics=['accuracy'])
history = int_model.fit(int_train_ds, validation_data=int_val_ds, epochs=5)

두 λͺ¨λΈμ„ λΉ„κ΅ν•©λ‹ˆλ‹€.

print("Linear model on binary vectorized data:") print(binary_model.summary())
print("ConvNet model on int vectorized data:") print(int_model.summary())

Evaluate both models on the test data:

binary_loss, binary_accuracy = binary_model.evaluate(binary_test_ds)
int_loss, int_accuracy = int_model.evaluate(int_test_ds)

print("Binary model accuracy: {:2.2%}".format(binary_accuracy))
print("Int model accuracy: {:2.2%}".format(int_accuracy))

Note: This example dataset represents a rather simple classification problem. More complex datasets and problems bring out subtle but significant differences in preprocessing strategies and model architectures. Be sure to try out different hyperparameters and numbers of epochs to compare various approaches.

λͺ¨λΈ 내보내기

μœ„μ˜ μ½”λ“œμ—μ„œλŠ” λͺ¨λΈμ— ν…μŠ€νŠΈλ₯Ό μ œκ³΅ν•˜κΈ° 전에 tf.keras.layers.TextVectorization을 λ°μ΄ν„°μ„ΈνŠΈμ— μ μš©ν–ˆμŠ΅λ‹ˆλ‹€. λͺ¨λΈμ΄ μ›μ‹œ λ¬Έμžμ—΄μ„ μ²˜λ¦¬ν•  수 μžˆλ„λ‘ ν•˜λ €λ©΄(예: 배포λ₯Ό λ‹¨μˆœν™”ν•˜κΈ° μœ„ν•΄) λͺ¨λΈ 내뢀에 TextVectorization λ ˆμ΄μ–΄λ₯Ό 포함할 수 μžˆμŠ΅λ‹ˆλ‹€.

이λ₯Ό μœ„ν•΄ 방금 ν›ˆλ ¨ν•œ κ°€μ€‘μΉ˜λ₯Ό μ‚¬μš©ν•˜μ—¬ μƒˆ λͺ¨λΈμ„ λ§Œλ“€ 수 μžˆμŠ΅λ‹ˆλ‹€.

export_model = tf.keras.Sequential(
    [binary_vectorize_layer, binary_model,
     layers.Activation('sigmoid')])

export_model.compile(
    loss=losses.SparseCategoricalCrossentropy(from_logits=False),
    optimizer='adam',
    metrics=['accuracy'])

# Test it with `raw_test_ds`, which yields raw strings.
loss, accuracy = export_model.evaluate(raw_test_ds)
print("Accuracy: {:2.2%}".format(accuracy))

이제 μ—¬λŸ¬λΆ„μ˜ λͺ¨λΈμ€ μ›μ‹œ λ¬Έμžμ—΄μ„ μž…λ ₯으둜 μ‚¬μš©ν•˜κ³  Model.predictλ₯Ό μ‚¬μš©ν•˜μ—¬ 각 λ ˆμ΄λΈ”μ˜ 점수λ₯Ό μ˜ˆμΈ‘ν•  수 μžˆμŠ΅λ‹ˆλ‹€. λ‹€μŒκ³Ό 같이 μ΅œλŒ€ 점수λ₯Ό κ°€μ§„ λ ˆμ΄λΈ”μ„ μ°ΎλŠ” ν•¨μˆ˜λ₯Ό μ •μ˜ν•©λ‹ˆλ‹€.

def get_string_labels(predicted_scores_batch):
  predicted_int_labels = tf.math.argmax(predicted_scores_batch, axis=1)
  predicted_labels = tf.gather(raw_train_ds.class_names, predicted_int_labels)
  return predicted_labels

Run inference on new data

inputs = [
    "how do I extract keys from a dict into a list?",  # 'python'
    "debug public static void main(string[] args) {...}",  # 'java'
]

predicted_scores = export_model.predict(inputs)
predicted_labels = get_string_labels(predicted_scores)

for input, label in zip(inputs, predicted_labels):
  print("Question: ", input)
  print("Predicted label: ", label.numpy())

λͺ¨λΈ 내뢀에 ν…μŠ€νŠΈ μ „μ²˜λ¦¬ 논리λ₯Ό ν¬ν•¨ν•˜λ©΄ 배포λ₯Ό λ‹¨μˆœν™”ν•˜κ³  ν›ˆλ ¨/ν…ŒμŠ€νŠΈ μ™œκ³‘ κ°€λŠ₯성을 μ€„μ΄λŠ” ν”„λ‘œλ•μ…˜μš© λͺ¨λΈμ„ 내보낼 수 μžˆμŠ΅λ‹ˆλ‹€.

tf.keras.layers.TextVectorizationλ₯Ό μ μš©ν•  μœ„μΉ˜λ₯Ό 선택할 λ•Œ 염두에 두어야 ν•  μ„±λŠ₯ 차이가 μžˆμŠ΅λ‹ˆλ‹€. λ ˆμ΄μ–΄λ₯Ό λͺ¨λΈ μ™ΈλΆ€μ—μ„œ μ‚¬μš©ν•˜λ©΄ GPUμ—μ„œ ν›ˆλ ¨ν•  λ•Œ 비동기 CPU 처리 및 데이터 버퍼링을 μˆ˜ν–‰ν•  수 μžˆμŠ΅λ‹ˆλ‹€. λ”°λΌμ„œ GPUμ—μ„œ λͺ¨λΈμ„ ν›ˆλ ¨ν•˜λŠ” 경우 λͺ¨λΈμ„ κ°œλ°œν•˜λŠ” λ™μ•ˆ μ΅œμƒμ˜ μ„±λŠ₯을 μ–»κΈ° μœ„ν•΄ 이 μ˜΅μ…˜μ„ μ‚¬μš©ν•˜κ³  배포 μ€€λΉ„κ°€ μ™„λ£Œλ˜λ©΄ λͺ¨λΈ 내뢀에 TextVectorization λ ˆμ΄μ–΄λ₯Ό ν¬ν•¨ν•˜λ„λ‘ μ „ν™˜ν•  수 μžˆμŠ΅λ‹ˆλ‹€.
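
A minimal sketch of the two placements, reusing the objects defined earlier (the dev_train_ds and deploy_model names are illustrative, not from the original notebook):

# During development on a GPU: vectorize inside the tf.data pipeline so that
# CPU preprocessing runs asynchronously and batches are buffered ahead of training.
dev_train_ds = raw_train_ds.map(binary_vectorize_text).cache().prefetch(tf.data.AUTOTUNE)

# When preparing for deployment: fold the layer into the model itself so the
# exported model accepts raw strings directly.
deploy_model = tf.keras.Sequential(
    [binary_vectorize_layer, binary_model, layers.Activation('sigmoid')])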

λͺ¨λΈ μ €μž₯에 λŒ€ν•΄ μžμ„Ένžˆ μ•Œμ•„λ³΄λ €λ©΄ λͺ¨λΈ μ €μž₯κ³Ό 볡원 νŠœν† λ¦¬μ–Όμ„ λ°©λ¬Έν•˜μ„Έμš”.

Example 2: Predict the author of Iliad translations

The following provides an example of using tf.data.TextLineDataset to load examples from text files, and TensorFlow Text to preprocess the data. You will use three different English translations of the same work, Homer's Iliad, and train a model to identify the translator given a single line of text.

λ°μ΄ν„°μ„ΈνŠΈ λ‹€μš΄λ‘œλ“œ 및 νƒμƒ‰ν•˜κΈ°

The texts of the three translations are by:

  • William Cowper

  • Edward, Earl of Derby

  • Samuel Butler

이 νŠœν† λ¦¬μ–Όμ—μ„œ μ‚¬μš©λœ ν…μŠ€νŠΈ νŒŒμΌμ€ λ¬Έμ„œ 헀더와 λ°”λ‹₯ κΈ€, 쀄 번호 및 챕터 제λͺ© 등을 μ œκ±°ν•˜λŠ” 일반적인 μ „μ²˜λ¦¬ μž‘μ—…μ„ κ±°μ³€μŠ΅λ‹ˆλ‹€.

이 κ°€λ³κ²Œ μ†μ§ˆν•œ νŒŒμΌμ„ 둜컬둜 λ‹€μš΄λ‘œλ“œν•©λ‹ˆλ‹€.

DIRECTORY_URL = 'https://storage.googleapis.com/download.tensorflow.org/data/illiad/'
FILE_NAMES = ['cowper.txt', 'derby.txt', 'butler.txt']

for name in FILE_NAMES:
  text_dir = utils.get_file(name, origin=DIRECTORY_URL + name)

parent_dir = pathlib.Path(text_dir).parent
list(parent_dir.iterdir())

λ°μ΄ν„°μ„ΈνŠΈ λ‘œλ“œν•˜κΈ°

μ΄μ „μ—λŠ” tf.keras.utils.text_dataset_from_directoryλ₯Ό μ‚¬μš©ν•  경우 파일의 λͺ¨λ“  μ½˜ν…μΈ λ₯Ό 단일 예제둜 μ·¨κΈ‰ν–ˆμŠ΅λ‹ˆλ‹€. μ—¬κΈ°μ—μ„œλŠ” ν…μŠ€νŠΈ νŒŒμΌλ‘œλΆ€ν„° tf.data.Datasetλ₯Ό μƒμ„±ν•˜λ„λ‘ μ„€κ³„λœ tf.data.TextLineDataset을 μ‚¬μš©ν•©λ‹ˆλ‹€. μ΄λ•Œ 각 μ˜ˆμ œλŠ” 원본 파일의 ν…μŠ€νŠΈ μ€„μž…λ‹ˆλ‹€. TextLineDataset은 주둜 쀄 기반의 ν…μŠ€νŠΈ 데이터(예: μ‹œ λ˜λŠ” 였λ₯˜ 둜그)에 μœ μš©ν•©λ‹ˆλ‹€.

μ΄λŸ¬ν•œ νŒŒμΌμ„ λ°˜λ³΅ν•˜μ—¬ 각 νŒŒμΌμ„ 자체 λ°μ΄ν„°μ„ΈνŠΈμ— λ‘œλ“œν•©λ‹ˆλ‹€. 각 μ˜ˆμ œλŠ” κ°œλ³„μ μœΌλ‘œ λ ˆμ΄λΈ”μ„ μ§€μ •ν•΄μ•Ό ν•˜λ―€λ‘œ Dataset.map을 μ‚¬μš©ν•˜μ—¬ 각 μ˜ˆμ œμ— labeler ν•¨μˆ˜λ₯Ό μ μš©ν•©λ‹ˆλ‹€. μ΄λ ‡κ²Œ ν•˜λ©΄ λ°μ΄ν„°μ„ΈνŠΈμ˜ λͺ¨λ“  예제λ₯Ό λ°˜λ³΅ν•˜μ—¬ (example, label) μŒμ„ λ°˜ν™˜ν•©λ‹ˆλ‹€.

def labeler(example, index):
  return example, tf.cast(index, tf.int64)

labeled_data_sets = []

for i, file_name in enumerate(FILE_NAMES):
  lines_dataset = tf.data.TextLineDataset(str(parent_dir/file_name))
  labeled_dataset = lines_dataset.map(lambda ex: labeler(ex, i))
  labeled_data_sets.append(labeled_dataset)

λ‹€μŒμœΌλ‘œ Dataset.concatenateλ₯Ό μ‚¬μš©ν•˜μ—¬ λ ˆμ΄λΈ”μ΄ μ§€μ •λœ λ°μ΄ν„°μ„ΈνŠΈλ₯Ό 단일 λ°μ΄ν„°μ„ΈνŠΈλ‘œ κ²°ν•©ν•œ ν›„ Dataset.shuffle을 μ‚¬μš©ν•˜μ—¬ μ…”ν”Œν•©λ‹ˆλ‹€.

BUFFER_SIZE = 50000
BATCH_SIZE = 64
VALIDATION_SIZE = 5000

all_labeled_data = labeled_data_sets[0]
for labeled_dataset in labeled_data_sets[1:]:
  all_labeled_data = all_labeled_data.concatenate(labeled_dataset)

all_labeled_data = all_labeled_data.shuffle(
    BUFFER_SIZE, reshuffle_each_iteration=False)

Print out a few examples as before. The dataset hasn't been batched yet, hence each entry in all_labeled_data corresponds to one data point:

for text, label in all_labeled_data.take(10):
  print("Sentence: ", text.numpy())
  print("Label:", label.numpy())

ν›ˆλ ¨μ„ μœ„ν•œ λ°μ΄ν„°μ„ΈνŠΈ μ€€λΉ„ν•˜κΈ°

tf.keras.layers.TextVectorization을 μ‚¬μš©ν•˜μ—¬ ν…μŠ€νŠΈ λ°μ΄ν„°μ„ΈνŠΈλ₯Ό μ „μ²˜λ¦¬ν•˜λŠ” λŒ€μ‹ μ— μ΄μ œλŠ” TensorFlow Text APIλ₯Ό μ‚¬μš©ν•˜μ—¬ 데이터λ₯Ό ν‘œμ€€ν™” 및 ν† ν°ν™”ν•˜κ³  μ–΄νœ˜λ₯Ό λΉŒλ“œν•˜κ³ , tf.lookup.StaticVocabularyTable을 μ‚¬μš©ν•˜μ—¬ 토큰을 λͺ¨λΈμ— μ •μˆ˜μ— λ§€ν•‘ν•œ ν›„ λͺ¨λΈμ— κ³΅κΈ‰ν•©λ‹ˆλ‹€(TensorFlow ν…μŠ€νŠΈμ— λŒ€ν•΄ μžμ„Ένžˆ μ•Œμ•„λ³΄κΈ°).

λ‹€μŒκ³Ό 같이 ν…μŠ€νŠΈλ₯Ό μ†Œλ¬Έμžλ‘œ λ³€ν™˜ν•˜κ³  ν† ν°ν™”ν•˜λŠ” ν•¨μˆ˜λ₯Ό μ •μ˜ν•©λ‹ˆλ‹€.

  • TensorFlow TextλŠ” λ‹€μ–‘ν•œ ν† ν¬λ‚˜μ΄μ €λ₯Ό μ œκ³΅ν•©λ‹ˆλ‹€. 이 μ˜ˆμ œμ—μ„œλŠ” text.UnicodeScriptTokenizerλ₯Ό μ‚¬μš©ν•˜μ—¬ λ°μ΄ν„°μ„ΈνŠΈλ₯Ό ν† ν°ν™”ν•©λ‹ˆλ‹€.

  • μ—¬λŸ¬λΆ„μ€ Dataset.map을 μ‚¬μš©ν•˜μ—¬ λ°μ΄ν„°μ„ΈνŠΈμ— 토큰화λ₯Ό μ μš©ν•©λ‹ˆλ‹€.

tokenizer = tf_text.UnicodeScriptTokenizer()
def tokenize(text, unused_label):
  lower_case = tf_text.case_fold_utf8(text)
  return tokenizer.tokenize(lower_case)
tokenized_ds = all_labeled_data.map(tokenize)

λ°μ΄ν„°μ„ΈνŠΈλ₯Ό λ°˜λ³΅ν•˜κ³  λͺ‡ κ°€μ§€ ν† ν°ν™”λœ 예제λ₯Ό 좜λ ₯ν•  수 μžˆμŠ΅λ‹ˆλ‹€.

for text_batch in tokenized_ds.take(5):
  print("Tokens: ", text_batch.numpy())

λ‹€μŒμœΌλ‘œ λΉˆλ„λ³„λ‘œ 토큰을 μ •λ ¬ν•˜κ³  μƒμœ„ VOCAB_SIZE 토큰을 μœ μ§€ν•˜μ—¬ μ–΄νœ˜λ₯Ό λΉŒλ“œν•©λ‹ˆλ‹€.

tokenized_ds = configure_dataset(tokenized_ds)

vocab_dict = collections.defaultdict(lambda: 0)
for toks in tokenized_ds.as_numpy_iterator():
  for tok in toks:
    vocab_dict[tok] += 1

vocab = sorted(vocab_dict.items(), key=lambda x: x[1], reverse=True)
vocab = [token for token, count in vocab]
vocab = vocab[:VOCAB_SIZE]
vocab_size = len(vocab)
print("Vocab size: ", vocab_size)
print("First five vocab entries:", vocab[:5])

To convert the tokens into integers, use the vocab set to create a tf.lookup.StaticVocabularyTable. You will map tokens to integers in the range [2, vocab_size + 2]. As with the TextVectorization layer, 0 is reserved to denote padding and 1 is reserved to denote an out-of-vocabulary (OOV) token:

keys = vocab
values = range(2, len(vocab) + 2)  # Reserve `0` for padding, `1` for OOV tokens.

init = tf.lookup.KeyValueTensorInitializer(
    keys, values, key_dtype=tf.string, value_dtype=tf.int64)

num_oov_buckets = 1
vocab_table = tf.lookup.StaticVocabularyTable(init, num_oov_buckets)

Finally, define a function to standardize, tokenize, and vectorize the dataset using the tokenizer and lookup table:

def preprocess_text(text, label):
  standardized = tf_text.case_fold_utf8(text)
  tokenized = tokenizer.tokenize(standardized)
  vectorized = vocab_table.lookup(tokenized)
  return vectorized, label

You can try this on a single example to print the output:

example_text, example_label = next(iter(all_labeled_data))
print("Sentence: ", example_text.numpy())

vectorized_text, example_label = preprocess_text(example_text, example_label)
print("Vectorized sentence: ", vectorized_text.numpy())

Now run the preprocess function on the dataset using Dataset.map:

all_encoded_data = all_labeled_data.map(preprocess_text)

λ°μ΄ν„°μ„ΈνŠΈλ₯Ό ν›ˆλ ¨ 및 검증 μ„ΈνŠΈλ‘œ λΆ„ν• ν•˜κΈ°

Keras TextVectorization λ ˆμ΄μ–΄λŠ” λ²‘ν„°ν™”λœ 데이터도 일괄 μ²˜λ¦¬ν•˜κ³  νŒ¨λ”©ν•©λ‹ˆλ‹€. 배치 λ‚΄λΆ€μ˜ μ˜ˆμ œλŠ” 크기와 λͺ¨μ–‘이 κ°™μ•„μ•Ό ν•˜κΈ° λ•Œλ¬Έμ— νŒ¨λ”©μ΄ ν•„μš”ν•˜μ§€λ§Œ μ΄λŸ¬ν•œ λ°μ΄ν„°μ„ΈνŠΈμ˜ μ˜ˆμ œλŠ” λͺ¨λ‘ 같은 크기가 μ•„λ‹ˆλ©° 각 ν…μŠ€νŠΈ μ€„μ˜ 단어 μˆ˜λ„ λ‹€λ¦…λ‹ˆλ‹€.

tf.data.Dataset은 λ°μ΄ν„°μ„ΈνŠΈ λΆ„ν•  및 νŒ¨λ”© 일괄 처리λ₯Ό μ§€μ›ν•©λ‹ˆλ‹€.

train_data = all_encoded_data.skip(VALIDATION_SIZE).shuffle(BUFFER_SIZE)
validation_data = all_encoded_data.take(VALIDATION_SIZE)

train_data = train_data.padded_batch(BATCH_SIZE)
validation_data = validation_data.padded_batch(BATCH_SIZE)

Now, validation_data and train_data are not collections of (example, label) pairs, but collections of batches. Each batch is a pair of (many examples, many labels) represented as arrays.

To illustrate this:

sample_text, sample_labels = next(iter(validation_data))
print("Text batch shape: ", sample_text.shape)
print("Label batch shape: ", sample_labels.shape)
print("First text example: ", sample_text[0])
print("First label example: ", sample_labels[0])

νŒ¨λ”©μ—λŠ” 0을 μ‚¬μš©ν•˜κ³  OOV(out-of-vocabulary) ν† ν°μ—λŠ” 1을 μ‚¬μš©ν•˜μ˜€κΈ°μ— μ–΄νœ˜ 크기가 2λ°° μ¦κ°€ν–ˆμŠ΅λ‹ˆλ‹€.

vocab_size += 2

Configure the datasets for better performance as before:

train_data = configure_dataset(train_data)
validation_data = configure_dataset(validation_data)

λͺ¨λΈ ν›ˆλ ¨ν•˜κΈ°

이전과 같이 이 λ°μ΄ν„°μ„ΈνŠΈμ—μ„œ λͺ¨λΈμ„ ν›ˆλ ¨ν•  수 μžˆμŠ΅λ‹ˆλ‹€.

model = create_model(vocab_size=vocab_size, num_labels=3)

model.compile(
    optimizer='adam',
    loss=losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=['accuracy'])

history = model.fit(train_data, validation_data=validation_data, epochs=3)

loss, accuracy = model.evaluate(validation_data)

print("Loss: ", loss)
print("Accuracy: {:2.2%}".format(accuracy))

λͺ¨λΈ 내보내기

μ›μ‹œ λ¬Έμžμ—΄μ„ μž…λ ₯으둜 μ‚¬μš©ν•  수 μžˆλŠ” λͺ¨λΈμ„ λ§Œλ“€κΈ° μœ„ν•΄ μ‚¬μš©μž μ •μ˜ μ „μ²˜λ¦¬ ν•¨μˆ˜μ™€ λ™μΌν•œ 단계λ₯Ό μˆ˜ν–‰ν•˜λŠ” Keras TextVectorization λ ˆμ΄μ–΄λ₯Ό μƒμ„±ν•˜κ²Œ λ©λ‹ˆλ‹€. 이미 μ–΄νœ˜λ₯Ό ν›ˆλ ¨ν–ˆμœΌλ―€λ‘œ TextVectorization.adapt λŒ€μ‹  TextVectorization.set_vocabularyλ₯Ό μ‚¬μš©ν•˜μ—¬ μƒˆ μ–΄νœ˜λ₯Ό ν›ˆλ ¨ν•  수 μžˆμŠ΅λ‹ˆλ‹€.

preprocess_layer = TextVectorization(
    max_tokens=vocab_size,
    standardize=tf_text.case_fold_utf8,
    split=tokenizer.tokenize,
    output_mode='int',
    output_sequence_length=MAX_SEQUENCE_LENGTH)

preprocess_layer.set_vocabulary(vocab)

export_model = tf.keras.Sequential(
    [preprocess_layer, model,
     layers.Activation('sigmoid')])

export_model.compile(
    loss=losses.SparseCategoricalCrossentropy(from_logits=False),
    optimizer='adam',
    metrics=['accuracy'])

# Create a test dataset of raw strings.
test_ds = all_labeled_data.take(VALIDATION_SIZE).batch(BATCH_SIZE)
test_ds = configure_dataset(test_ds)

loss, accuracy = export_model.evaluate(test_ds)

print("Loss: ", loss)
print("Accuracy: {:2.2%}".format(accuracy))

μΈμ½”λ”©λœ 검증 μ„ΈνŠΈμ˜ λͺ¨λΈκ³Ό μ›μ‹œ 검증 μ„ΈνŠΈμ— λŒ€ν•΄ 내보내기λ₯Ό μˆ˜ν–‰ν•œ λͺ¨λΈμ˜ 손싀 및 정확성은 μ˜ˆμƒλŒ€λ‘œ λ™μΌν•©λ‹ˆλ‹€.

Run inference on new data

inputs = [
    "Join'd to th' Ionians with their flowing robes,",  # Label: 1
    "the allies, and his armour flashed about him so that he seemed to all",  # Label: 2
    "And with loud clangor of his arms he fell.",  # Label: 0
]

predicted_scores = export_model.predict(inputs)
predicted_labels = tf.math.argmax(predicted_scores, axis=1)

for input, label in zip(inputs, predicted_labels):
  print("Sentence: ", input)
  print("Predicted label: ", label.numpy())

TensorFlow λ°μ΄ν„°μ„ΈνŠΈ(TFDS)λ₯Ό μ‚¬μš©ν•˜μ—¬ 더 λ§Žμ€ λ°μ΄ν„°μ„ΈνŠΈ λ‹€μš΄λ‘œλ“œν•˜κΈ°

TensorFlow λ°μ΄ν„°μ„ΈνŠΈλ‘œλΆ€ν„° 더 λ§Žμ€ λ°μ΄ν„°μ„ΈνŠΈλ₯Ό λ‹€μš΄λ‘œλ“œν•  수 μžˆμŠ΅λ‹ˆλ‹€.

이 μ˜ˆμ œμ—μ„œλŠ” IMDB λŒ€ν˜• μ˜ν™” 리뷰 λ°μ΄ν„°μ„ΈνŠΈλ₯Ό μ‚¬μš©ν•˜μ—¬ 감정 λΆ„λ₯˜μš© λͺ¨λΈμ„ ν›ˆλ ¨ν•©λ‹ˆλ‹€.

# Training set.
train_ds = tfds.load(
    'imdb_reviews',
    split='train[:80%]',
    batch_size=BATCH_SIZE,
    shuffle_files=True,
    as_supervised=True)

# Validation set.
val_ds = tfds.load(
    'imdb_reviews',
    split='train[80%:]',
    batch_size=BATCH_SIZE,
    shuffle_files=True,
    as_supervised=True)

Print a few examples:

for review_batch, label_batch in val_ds.take(1):
  for i in range(5):
    print("Review: ", review_batch[i].numpy())
    print("Label: ", label_batch[i].numpy())

You can now preprocess the data and train a model as before.

Note: You will use tf.keras.losses.BinaryCrossentropy instead of tf.keras.losses.SparseCategoricalCrossentropy for your model, since this is a binary classification problem.

ν›ˆλ ¨μ„ μœ„ν•œ λ°μ΄ν„°μ„ΈνŠΈ μ€€λΉ„ν•˜κΈ°

vectorize_layer = TextVectorization(
    max_tokens=VOCAB_SIZE,
    output_mode='int',
    output_sequence_length=MAX_SEQUENCE_LENGTH)

# Make a text-only dataset (without labels), then call `TextVectorization.adapt`.
train_text = train_ds.map(lambda text, labels: text)
vectorize_layer.adapt(train_text)

def vectorize_text(text, label):
  text = tf.expand_dims(text, -1)
  return vectorize_layer(text), label

train_ds = train_ds.map(vectorize_text)
val_ds = val_ds.map(vectorize_text)

# Configure datasets for performance as before.
train_ds = configure_dataset(train_ds)
val_ds = configure_dataset(val_ds)

λͺ¨λΈ 생성, ꡬ성 및 ν›ˆλ ¨ν•˜κΈ°

model = create_model(vocab_size=VOCAB_SIZE + 1, num_labels=1)
model.summary()

model.compile(
    loss=losses.BinaryCrossentropy(from_logits=True),
    optimizer='adam',
    metrics=['accuracy'])
history = model.fit(train_ds, validation_data=val_ds, epochs=3)
loss, accuracy = model.evaluate(val_ds)

print("Loss: ", loss)
print("Accuracy: {:2.2%}".format(accuracy))

λͺ¨λΈ 내보내기

export_model = tf.keras.Sequential(
    [vectorize_layer, model,
     layers.Activation('sigmoid')])

export_model.compile(
    # Binary classification, so use `BinaryCrossentropy`, matching the note above.
    loss=losses.BinaryCrossentropy(from_logits=False),
    optimizer='adam',
    metrics=['accuracy'])
# 0 --> negative review
# 1 --> positive review
inputs = [
    "This is a fantastic movie.",
    "This is a bad movie.",
    "This movie was so bad that it was good.",
    "I will never say yes to watching this movie.",
]

predicted_scores = export_model.predict(inputs)
predicted_labels = [int(round(x[0])) for x in predicted_scores]

for input, label in zip(inputs, predicted_labels):
  print("Review: ", input)
  print("Predicted label: ", label)

Conclusion

This tutorial demonstrated several ways to load and preprocess text. As a next step, you can explore the additional text preprocessing tutorials in the TensorFlow Text documentation.

You can also find new datasets on TensorFlow Datasets. And, to learn more about tf.data, check out the tf.data: Build TensorFlow input pipelines guide.