GitHub Repository: tensorflow/docs-l10n
Path: blob/master/site/ko/tutorials/text/warmstart_embedding_matrix.ipynb

Kernel: Python 3

In [ ]:

##### Copyright 2022 The TensorFlow Authors.


# @title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

Warm-start embedding layer matrix

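This tutorial shows how to "warm-start" training with the tf.keras.utils.warmstart_embedding_matrix API for text sentiment classification when the vocabulary changes.

You will train a simple Keras model with a base vocabulary, then update the vocabulary and continue training the model. This is called "warm-start" training, and it requires remapping the text embedding matrix to the new vocabulary.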

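Embedding matrix

Embeddings provide an efficient, dense representation in which similar vocabulary tokens have similar encodings. They are trainable parameters: weights the model learns during training, in the same way it learns the weights of a dense layer. Embeddings with 8 dimensions are common for small datasets, and embeddings with up to 1024 dimensions are common for large datasets. A higher-dimensional embedding can capture fine-grained relationships between words, but it may need more training data.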
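To make the lookup concrete, an embedding matrix can be pictured as a 2-D array with one row per token ID. The NumPy sketch below uses a made-up 5-token vocabulary and 4 dimensions (both are assumptions for illustration, not values from this tutorial). It only shows what a lookup computes and is not meant to describe how the Keras Embedding layer is implemented.

```python
import numpy as np

# Hypothetical toy sizes, chosen only for illustration.
toy_vocab_size = 5
toy_embedding_dim = 4

rng = np.random.default_rng(0)
# One row of (normally trainable) weights per token ID.
toy_embedding_matrix = rng.uniform(
    -0.05, 0.05, size=(toy_vocab_size, toy_embedding_dim)
)

# A "sentence" encoded as token IDs.
token_ids = np.array([2, 0, 3])

# Looking up embeddings is row indexing: one vector per token.
vectors = toy_embedding_matrix[token_ids]
print(vectors.shape)  # (3, 4)
```

Because each row belongs to one ID, the matrix is only meaningful together with the vocabulary that assigns IDs to words.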

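Vocabulary

The set of unique words is called the vocabulary. To build a text model, you need to choose a fixed vocabulary. Typically, you build the vocabulary from the most common words in a dataset. The vocabulary lets you represent each piece of text by the specific words that appear in it, as a sequence of IDs that you can look up in the embedding matrix.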
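The idea can be sketched in plain Python. The toy corpus, the `max_tokens` value, and the helper `encode` below are hypothetical. Reserving ID 0 for padding and ID 1 for out-of-vocabulary words mirrors what TextVectorization does in "int" output mode, but this is only an illustration, not that layer's implementation.

```python
from collections import Counter

# Hypothetical toy corpus; the tutorial itself uses movie reviews.
corpus = ["the movie was great", "the plot was thin", "the cast"]

# Count words and keep the most frequent ones as the vocabulary.
counts = Counter(word for text in corpus for word in text.split())
max_tokens = 4  # total vocabulary size, including the two reserved entries

# ID 0 is reserved for padding and ID 1 for out-of-vocabulary words.
vocab = ["", "[UNK]"] + [
    word for word, _ in counts.most_common(max_tokens - 2)
]
word_to_id = {word: i for i, word in enumerate(vocab)}


def encode(text):
    """Map each word to its ID, falling back to the [UNK] ID (1)."""
    return [word_to_id.get(word, 1) for word in text.split()]


print(vocab)                          # ['', '[UNK]', 'the', 'was']
print(encode("the movie was great"))  # [2, 1, 3, 1]
```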

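Why warm-start the embedding matrix?

A model is trained with a set of embeddings that represent a given vocabulary. If the model needs to be updated or improved, you can reuse the weights from a previous run so that training converges significantly faster. Reusing the embedding matrix from a previous run is harder, because any change to the vocabulary invalidates the word-to-ID mapping.

tf.keras.utils.warmstart_embedding_matrix solves this problem by creating an embedding matrix for the new vocabulary from the embedding matrix of the base vocabulary. When a word exists in both vocabularies, its base embedding vector is copied into the correct position in the new embedding matrix. This lets you warm-start training after any change in the size or order of the vocabulary.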
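The remapping described above can be sketched in NumPy. The function `remap_embeddings`, the toy vocabularies, and the uniform initialization range are all assumptions made for illustration. This is not the library's implementation, only the copy-shared-rows idea.

```python
import numpy as np


def remap_embeddings(base_vocab, new_vocab, base_embeddings, rng):
    """Illustrative remap: copy rows for shared tokens, random-init the rest."""
    dim = base_embeddings.shape[1]
    # Start from a freshly initialized matrix sized for the new vocabulary.
    new_embeddings = rng.uniform(-0.05, 0.05, size=(len(new_vocab), dim))
    base_index = {token: i for i, token in enumerate(base_vocab)}
    for new_i, token in enumerate(new_vocab):
        if token in base_index:
            # Shared token: reuse the trained vector at its new position.
            new_embeddings[new_i] = base_embeddings[base_index[token]]
    return new_embeddings


# Toy data: "the" and "movie" change position, "great" is new.
base_vocab = ["", "[UNK]", "the", "movie"]
new_vocab = ["", "[UNK]", "movie", "great", "the"]
base_embeddings = np.arange(8, dtype=float).reshape(4, 2)

remapped = remap_embeddings(
    base_vocab, new_vocab, base_embeddings, np.random.default_rng(0)
)
print(remapped.shape)  # (5, 2)
```

In this toy case the row for "the" moves from index 2 to index 4 unchanged, while the row for "great" keeps its random initial values and would be learned during continued training.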

Setup

In [ ]:

!pip install --pre -U "tensorflow>2.10"  # Requires 2.11

In [ ]:

import io
import numpy as np
import os
import re
import shutil
import string
import tensorflow as tf

from tensorflow.keras import Model
from tensorflow.keras.layers import Dense, Embedding, GlobalAveragePooling1D
from tensorflow.keras.layers import TextVectorization

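Load the dataset

This tutorial uses the Large Movie Review Dataset. You will train a sentiment classification model on this dataset and, in the process, learn embeddings from scratch. Refer to the Load text tutorial to learn more.

Download the dataset using the Keras file utility and review the directories.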

In [ ]:

url = "https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"

dataset = tf.keras.utils.get_file(
    "aclImdb_v1.tar.gz", url, untar=True, cache_dir=".", cache_subdir=""
)

dataset_dir = os.path.join(os.path.dirname(dataset), "aclImdb")
os.listdir(dataset_dir)

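The train/ directory has pos and neg folders containing movie reviews labeled as positive and negative, respectively. You will use the reviews from the pos and neg folders to train a binary classification model.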

In [ ]:

train_dir = os.path.join(dataset_dir, "train")
os.listdir(train_dir)

The train directory also contains an additional folder that must be removed before you create the training set.

In [ ]:

remove_dir = os.path.join(train_dir, "unsup")
shutil.rmtree(remove_dir)

Next, create a tf.data.Dataset using tf.keras.utils.text_dataset_from_directory. You can read more about this utility in the text classification tutorial.

Use the train directory to create both the training and validation sets, holding out 20% of the data for validation.

In [ ]:

batch_size = 1024
seed = 123
train_ds = tf.keras.utils.text_dataset_from_directory(
    "aclImdb/train",
    batch_size=batch_size,
    validation_split=0.2,
    subset="training",
    seed=seed,
)
val_ds = tf.keras.utils.text_dataset_from_directory(
    "aclImdb/train",
    batch_size=batch_size,
    validation_split=0.2,
    subset="validation",
    seed=seed,
)

Configure the dataset for performance

You can learn more about Dataset.cache and Dataset.prefetch, as well as how to cache data to disk, in the data performance guide.

In [ ]:

AUTOTUNE = tf.data.AUTOTUNE

train_ds = train_ds.cache().prefetch(buffer_size=AUTOTUNE)
val_ds = val_ds.cache().prefetch(buffer_size=AUTOTUNE)

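Text preprocessing

Next, define the dataset preprocessing steps required for the sentiment classification model. Initialize a layers.TextVectorization layer with the desired parameters to vectorize the movie reviews. You can learn more about using this layer in the text classification tutorial.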

In [ ]:

# Create a custom standardization function to strip HTML break tags '<br />'.
def custom_standardization(input_data):
    lowercase = tf.strings.lower(input_data)
    stripped_html = tf.strings.regex_replace(lowercase, "<br />", " ")
    return tf.strings.regex_replace(
        stripped_html, "[%s]" % re.escape(string.punctuation), ""
    )


# Vocabulary size and number of words in a sequence.
vocab_size = 10000
sequence_length = 100

# Use the text vectorization layer to normalize, split, and map strings to
# integers. Note that the layer uses the custom standardization defined above.
# Set `output_sequence_length` because samples are not all the same length.
vectorize_layer = TextVectorization(
    standardize=custom_standardization,
    max_tokens=vocab_size,
    output_mode="int",
    output_sequence_length=sequence_length,
)

# Make a text-only dataset (no labels) and call `Dataset.adapt` to build the
# vocabulary.
text_ds = train_ds.map(lambda x, y: x)
vectorize_layer.adapt(text_ds)

Create a classification model

Use the Keras Sequential API to define the sentiment classification model.

In [ ]:

embedding_dim = 16
text_embedding = Embedding(vocab_size, embedding_dim, name="embedding")

In [ ]:

text_input = tf.keras.Sequential(
    [vectorize_layer, text_embedding], name="text_input"
)
classifier_head = tf.keras.Sequential(
    [GlobalAveragePooling1D(), Dense(16, activation="relu"), Dense(1)],
    name="classifier_head",
)

model = tf.keras.Sequential([text_input, classifier_head])
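As a back-of-the-envelope check on this architecture, the sketch below counts trainable parameters from the sizes used in the tutorial and shows what averaging over the sequence axis computes. The helper `average_pool` is hypothetical and works on plain lists. It illustrates the arithmetic only.

```python
# Sizes taken from the model definition in this tutorial.
vocab_size = 10000
embedding_dim = 16
hidden_units = 16

embedding_params = vocab_size * embedding_dim                # one vector per ID
hidden_params = embedding_dim * hidden_units + hidden_units  # weights + biases
output_params = hidden_units * 1 + 1                         # single logit
total_params = embedding_params + hidden_params + output_params
print(embedding_params, hidden_params, output_params, total_params)


def average_pool(sequence):
    """Average a list of equal-length vectors over the sequence axis."""
    steps = len(sequence)
    return [sum(column) / steps for column in zip(*sequence)]


# Two time steps of 2-D embeddings collapse into one 2-D vector.
print(average_pool([[1.0, 2.0], [3.0, 4.0]]))  # [2.0, 3.0]
```

Almost all of the parameters sit in the embedding matrix, which is why reusing it across vocabulary changes matters.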

Compile and train the model

You will use TensorBoard to visualize metrics, including loss and accuracy. Create a tf.keras.callbacks.TensorBoard callback.

In [ ]:

tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir="logs")

Compile and train the model using the Adam optimizer and the BinaryCrossentropy loss.

In [ ]:

model.compile(
    optimizer="adam",
    loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
    metrics=["accuracy"],
)
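The final Dense(1) layer has no activation, so the model outputs raw logits, and from_logits=True tells the loss to account for that. The sketch below shows, for a single example, that a commonly used numerically stable logit formula agrees with applying a sigmoid first and then computing cross-entropy. The function names are hypothetical, and this is not claimed to be the exact code path Keras uses.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def bce_from_logit(y_true, logit):
    """Stable form: max(z, 0) - z * y + log(1 + exp(-|z|))."""
    return (
        max(logit, 0.0)
        - logit * y_true
        + math.log1p(math.exp(-abs(logit)))
    )


def bce_from_probability(y_true, p):
    """Textbook binary cross-entropy on a probability."""
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))


logit = 2.0
for label in (0, 1):
    a = bce_from_logit(label, logit)
    b = bce_from_probability(label, sigmoid(logit))
    print(label, round(a, 6), round(b, 6))
```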

In [ ]:

model.fit(
    train_ds,
    validation_data=val_ds,
    epochs=15,
    callbacks=[tensorboard_callback],
)

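With this approach, the model reaches a validation accuracy of around 85%.

Note: Your results may differ slightly, depending on how the weights were randomly initialized before the embedding layer was trained.

You can look at the model summary to learn more about each layer of the model.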

In [ ]:

model.summary()

Visualize the model metrics in TensorBoard.

In [ ]:

# docs_infra: no_execute
%load_ext tensorboard
%tensorboard --logdir logs

Vocabulary remapping

Now you will update the vocabulary and continue with warm-started training.

First, get the base vocabulary and the embedding matrix.

In [ ]:

embedding_weights_base = (
    model.get_layer("text_input").get_layer("embedding").get_weights()[0]
)
vocab_base = vectorize_layer.get_vocabulary()

Define a new vectorization layer to generate a new, larger vocabulary.

In [ ]:

# Vocabulary size and number of words in a sequence.
vocab_size_new = 10200
sequence_length = 100

vectorize_layer_new = TextVectorization(
    standardize=custom_standardization,
    max_tokens=vocab_size_new,
    output_mode="int",
    output_sequence_length=sequence_length,
)

# Make a text-only dataset (no labels) and call adapt to build the vocabulary.
text_ds = train_ds.map(lambda x, y: x)
vectorize_layer_new.adapt(text_ds)

# Get the new vocabulary
vocab_new = vectorize_layer_new.get_vocabulary()

In [ ]:

# View the new vocabulary tokens that weren't in `vocab_base`.
set(vocab_new) - set(vocab_base)
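When comparing two vocabularies, it helps to separate tokens that were added from tokens that dropped out. The toy sets below are hypothetical. They only illustrate how Python's set difference and symmetric difference behave.

```python
# Hypothetical toy vocabularies: "great" is added, "thin" is dropped.
toy_base = {"", "[UNK]", "the", "movie", "thin"}
toy_new = {"", "[UNK]", "the", "movie", "great"}

added = toy_new - toy_base    # only in the new vocabulary
removed = toy_base - toy_new  # only in the base vocabulary
changed = toy_base ^ toy_new  # in exactly one of the two

print(sorted(added))    # ['great']
print(sorted(removed))  # ['thin']
print(sorted(changed))  # ['great', 'thin']
```

If the new vocabulary is a strict superset of the base vocabulary, the symmetric difference and `new - base` give the same result.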

Generate the updated embedding using the tf.keras.utils.warmstart_embedding_matrix utility.

In [ ]:

# Generate the updated embedding matrix
updated_embedding = tf.keras.utils.warmstart_embedding_matrix(
    base_vocabulary=vocab_base,
    new_vocabulary=vocab_new,
    base_embeddings=embedding_weights_base,
    new_embeddings_initializer="uniform",
)
# Update the model variable
updated_embedding_variable = tf.Variable(updated_embedding)

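Or

If you already have an embedding matrix that you would like to use to initialize the new embedding matrix, pass tf.keras.initializers.Constant as the new_embeddings_initializer argument. Copy the following block into a code cell to try it. This is useful when you have a better initialization for the new words in the vocabulary.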

# Generate the updated embedding matrix from a custom initial matrix.
# Random values stand in here for a better initialization you may have.
new_embedding = np.random.rand(len(vocab_new), embedding_dim)
updated_embedding = tf.keras.utils.warmstart_embedding_matrix(
    base_vocabulary=vocab_base,
    new_vocabulary=vocab_new,
    base_embeddings=embedding_weights_base,
    new_embeddings_initializer=tf.keras.initializers.Constant(new_embedding),
)
# Update the model variable
updated_embedding_variable = tf.Variable(updated_embedding)

Verify that the shape of the embedding matrix has changed to reflect the new vocabulary.

In [ ]:

updated_embedding_variable.shape

Now that you have the updated embedding matrix, the next step is to update the layer weights.

In [ ]:

text_embedding_layer_new = Embedding(
    vectorize_layer_new.vocabulary_size(), embedding_dim, name="embedding"
)
text_embedding_layer_new.build(input_shape=[None])
text_embedding_layer_new.embeddings.assign(updated_embedding)
text_input_new = tf.keras.Sequential(
    [vectorize_layer_new, text_embedding_layer_new], name="text_input_new"
)
text_input_new.summary()

# Verify the shape of updated weights
# The new weights shape should reflect the new vocabulary size
text_input_new.get_layer("embedding").get_weights()[0].shape

Modify the model architecture to use the new text vectorization layer.

You can also load the model from a checkpoint and update the model architecture in the same way, as shown below.

In [ ]:

warm_started_model = tf.keras.Sequential([text_input_new, classifier_head])
warm_started_model.summary()

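You have successfully updated the model to accept the new vocabulary. The embedding layer now maps words from the old vocabulary to their old embeddings and initializes embeddings for the new vocabulary words, which still have to be learned. The learned weights of the rest of the model remain the same. The model is warm-started to continue training from where it left off.

You can now verify that the remapping worked. Get the index of the vocabulary word "the", which is present in both the base and the new vocabulary, and compare the embedding values. They should be equal.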

In [ ]:

# Compare the embedding of a word that is present in both vocabularies.
base_vocab_index = vectorize_layer("the")[0]
new_vocab_index = vectorize_layer_new("the")[0]
print(
    warm_started_model.get_layer("text_input_new").get_layer("embedding")(
        new_vocab_index
    )
    == embedding_weights_base[base_vocab_index]
)
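The same check can be extended from one word to every token shared by both vocabularies. The helper `shared_rows_match` and the toy arrays below are hypothetical. In the notebook, the analogous inputs would be the base and new vocabulary lists and the two embedding matrices.

```python
import numpy as np


def shared_rows_match(base_vocab, new_vocab, base_embeddings, new_embeddings):
    """Return True if every token in both vocabularies kept its vector."""
    new_index = {token: i for i, token in enumerate(new_vocab)}
    return all(
        np.array_equal(base_embeddings[i], new_embeddings[new_index[token]])
        for i, token in enumerate(base_vocab)
        if token in new_index
    )


# Toy data: "the" moves from row 2 to row 3, "great" is new.
toy_base_vocab = ["", "[UNK]", "the"]
toy_new_vocab = ["", "[UNK]", "great", "the"]
toy_base = np.array([[0.0, 0.0], [0.1, 0.2], [0.3, 0.4]])
toy_new = np.array([[0.0, 0.0], [0.1, 0.2], [0.9, 0.9], [0.3, 0.4]])

print(shared_rows_match(toy_base_vocab, toy_new_vocab, toy_base, toy_new))

# A matrix that keeps the old row order fails the check.
toy_bad = np.array([[0.0, 0.0], [0.1, 0.2], [0.3, 0.4], [0.9, 0.9]])
print(shared_rows_match(toy_base_vocab, toy_new_vocab, toy_base, toy_bad))
```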

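Continue with warm-started training

Notice how the training is warm-started. The accuracy in the first epoch is around 85%, which is close to the accuracy at which the previous training run ended.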

In [ ]:

# Compile and train the warm-started model, not the original `model`.
warm_started_model.compile(
    optimizer="adam",
    loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
    metrics=["accuracy"],
)
warm_started_model.fit(
    train_ds,
    validation_data=val_ds,
    epochs=15,
    callbacks=[tensorboard_callback],
)

Visualize warm-started training

In [ ]:

# docs_infra: no_execute
%reload_ext tensorboard
%tensorboard --logdir logs

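Next steps

In this tutorial you learned how to:

Train a sentiment classification model from scratch on a small-vocabulary dataset.
Update the model architecture and warm-start the embedding matrix when the vocabulary size changes.
Continuously improve model accuracy as the dataset expands.

To learn more about embeddings, check out the Word2Vec and Transformer model for language understanding tutorials.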