GitHub Repository: tensorflow/docs-l10n
Path: blob/master/site/ko/datasets/tfless_tfds.ipynb

Kernel: Python 3

Copyright 2023 The TensorFlow Datasets Authors.

In [ ]:

#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

TFDS for Jax and PyTorch

TFDS has always been framework-agnostic. For instance, you can easily load datasets in NumPy format for use in Jax and PyTorch.

TensorFlow and its data loading solution (tf.data) are first-class citizens in our API by design.

We extended TFDS to support TensorFlow-less, NumPy-only data loading. This can be convenient for use in ML frameworks such as Jax and PyTorch. For users of those frameworks, depending on TensorFlow can:

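reserve GPU/TPU memory;
increase build time in CI/CD;
take time to import at runtime.

TensorFlow is no longer a dependency for reading datasets.

ML pipelines need a data loader to load examples, decode them, and present them to the model. Data loaders use the "source/sampler/loader" paradigm: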

 TFDS dataset       ┌────────────────┐
   on disk          │                │
        ┌──────────►│      Data      │
|..|... │     |     │     source     ├─┐
├──┼────┴─────┤     │                │ │
│12│image12   │     └────────────────┘ │    ┌────────────────┐
├──┼──────────┤                        │    │                │
│13│image13   │                        ├───►│      Data      ├───► ML pipeline
├──┼──────────┤                        │    │     loader     │
│14│image14   │     ┌────────────────┐ │    │                │
├──┼──────────┤     │                │ │    └────────────────┘
|..|...       |     │     Index      ├─┘
                    │    sampler     │
                    │                │
                    └────────────────┘

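The data source is responsible for accessing and decoding examples from a TFDS dataset on the fly.
The index sampler is responsible for determining the order in which records are processed. This is important for implementing global transformations (e.g., global shuffling, sharding, repeating for multiple epochs) before reading any records.
The data loader orchestrates the loading by leveraging the data source and the index sampler. It allows performance optimization (e.g., prefetching, multiprocessing, or multithreading).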
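
The three roles above can be sketched in a few lines of pure Python. This is only an illustration of the paradigm, not how TFDS or any particular loader implements it: the names `index_sampler`, `data_loader`, and `toy_source` are hypothetical, and a plain list stands in for the data source.

```python
import random
from typing import Any, Iterator, List, Sequence


def index_sampler(num_records: int, num_epochs: int, seed: int) -> Iterator[int]:
  """Yields record indices: one global shuffle per epoch, reproducible from seed."""
  rng = random.Random(seed)
  for _ in range(num_epochs):
    indices = list(range(num_records))
    rng.shuffle(indices)
    yield from indices


def data_loader(source: Sequence[Any], sampler: Iterator[int],
                batch_size: int) -> Iterator[List[Any]]:
  """Reads records in the sampler's order and groups them into batches."""
  batch = []
  for index in sampler:
    batch.append(source[index])  # Random access into the data source.
    if len(batch) == batch_size:
      yield batch
      batch = []
  if batch:  # Emit the final, possibly smaller, batch.
    yield batch


# A plain list stands in for a data source: it has __len__ and __getitem__.
toy_source = [f'image{i}' for i in range(10)]
sampler = index_sampler(len(toy_source), num_epochs=2, seed=0)
batches = list(data_loader(toy_source, sampler, batch_size=4))
```

Because the sampler only produces integers, shuffling and epoch repetition happen before any record is read, which is the point of keeping the sampler separate from the source.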

TL;DR

tfds.data_source is an API to create data sources:

for fast prototyping in pure-Python pipelines,
to manage data-intensive ML pipelines at scale.

Setup

Let's install and import the needed dependencies:

In [ ]:

!pip install array_record
!pip install tfds-nightly

import os
os.environ.pop('TFDS_DATA_DIR', None)

import tensorflow_datasets as tfds

Data sources

Data sources are basically Python sequences, so they need to implement the following protocol:

from typing import Any, Protocol, Sequence


class RandomAccessDataSource(Protocol):
  """Interface for datasources where storage supports efficient random access."""

  def __len__(self) -> int:
    """Number of records in the dataset."""

  def __getitem__(self, record_key: int) -> Sequence[Any]:
    """Retrieves records for the given record_keys."""

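Warning: the API is still under active development. In particular, at this point, __getitem__ must support both int and list[int] as input. In the future, it will probably only support int, as per the standard.

The underlying file format needs to support efficient random access. At the moment, TFDS relies on array_record.

array_record is a new file format derived from Riegeli, achieving a new frontier of IO efficiency. In particular, ArrayRecord supports parallel read, write, and random access by record index. ArrayRecord builds on top of Riegeli and supports the same compression algorithms.

fashion_mnist is a common dataset for computer vision. To retrieve an ArrayRecord-based data source with TFDS, simply use: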
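
To make the warning concrete, here is a minimal sketch of a class that satisfies the protocol while accepting both an int and a list of ints. The class `InMemoryDataSource` is hypothetical and keeps its records in a Python list; it is not part of TFDS and only illustrates the expected behavior.

```python
from typing import Any, Sequence, Union


class InMemoryDataSource:
  """Hypothetical toy data source backed by a Python list (not a TFDS class)."""

  def __init__(self, records: Sequence[Any]):
    self._records = list(records)

  def __len__(self) -> int:
    return len(self._records)

  def __getitem__(self, record_key: Union[int, Sequence[int]]) -> Any:
    # A single int returns one record.
    if isinstance(record_key, int):
      return self._records[record_key]
    # A list of ints returns the matching records, in the requested order.
    return [self._records[key] for key in record_key]


source = InMemoryDataSource(['a', 'b', 'c', 'd'])
single = source[2]
several = source[[0, 3]]
```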

In [ ]:

ds = tfds.data_source('fashion_mnist')

tfds.data_source is a convenient wrapper. It is equivalent to:

In [ ]:

builder = tfds.builder('fashion_mnist', file_format='array_record')
builder.download_and_prepare()
ds = builder.as_data_source()

This outputs a dictionary of data sources:

{
  'train': DataSource(name=fashion_mnist, split='train', decoders=None),
  'test': DataSource(name=fashion_mnist, split='test', decoders=None),
}

Once download_and_prepare has run and the record files have been generated, TensorFlow is no longer needed. Everything happens in Python/NumPy.

Let's check this by uninstalling TensorFlow and reloading the data source in another subprocess:

In [ ]:

!pip uninstall -y tensorflow

In [ ]:

%%writefile no_tensorflow.py
import os
os.environ.pop('TFDS_DATA_DIR', None)

import tensorflow_datasets as tfds

try:
  import tensorflow as tf
except ImportError:
  print('No TensorFlow found...')

ds = tfds.data_source('fashion_mnist')
print('...but the data source could still be loaded...')
ds['train'][0]
print('...and the records can be decoded.')

In [ ]:

!python no_tensorflow.py
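
If you prefer to check for a package without triggering an import error, the standard library offers a lightweight alternative to the try/except in the script above. The helper name `is_installed` is hypothetical; `importlib.util.find_spec` is the standard-library call doing the work.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
  """Returns True if a top-level module can be found, without importing it."""
  return importlib.util.find_spec(module_name) is not None


# The printed value depends on your environment.
print('tensorflow installed:', is_installed('tensorflow'))
```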

In future versions, we are also going to make the dataset preparation TensorFlow-free.

A data source has a length:

In [ ]:

len(ds['train'])

Accessing the first element of the dataset:

In [ ]:

%%timeit
ds['train'][0]

This is just as cheap as accessing any other element. That is the definition of random access:

In [ ]:

%%timeit
ds['train'][1000]

Features now use NumPy DTypes rather than TensorFlow DTypes. You can inspect the features with:

In [ ]:

features = tfds.builder('fashion_mnist').info.features

You'll find more information about the features in our documentation. Here we can notably retrieve the shape of the images and the number of classes:

In [ ]:

shape = features['image'].shape
num_classes = features['label'].num_classes

Use in pure Python

You can consume data sources in Python by iterating over them:

In [ ]:

for example in ds['train']:
  print(example)
  break

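If you inspect elements, you will also notice that all features are already decoded using NumPy. Behind the scenes, we use OpenCV by default because it is fast. If you don't have OpenCV installed, we fall back to Pillow to provide lightweight and fast image decoding.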

{
  'image': array([[[0], [0], ..., [0]],
                  [[0], [0], ..., [0]]], dtype=uint8),
  'label': 2,
}

Note: Currently, this functionality is only available for Tensor, Image, and Scalar features. Audio and Video features will come soon. Stay tuned!

Use with PyTorch

PyTorch uses the source/sampler/loader paradigm. In Torch, "data sources" are called "datasets". torch.utils.data contains all the details you need to know to build efficient input pipelines in Torch.

TFDS data sources can be used as regular map-style datasets.

First we install and import Torch:
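
Since each example is a plain dictionary, ordinary Python tools work on it directly. As a small sketch, the snippet below counts labels with `collections.Counter`. The list `toy_source` is a hypothetical stand-in with the same keys as the example above; a real data source yields NumPy values instead of nested lists.

```python
from collections import Counter

# Toy records shaped like decoded examples (hypothetical stand-in data).
toy_source = [
    {'image': [[0]], 'label': 2},
    {'image': [[1]], 'label': 9},
    {'image': [[2]], 'label': 2},
]

# Iterate over the source exactly as in the loop above and tally the labels.
label_counts = Counter(example['label'] for example in toy_source)
most_common_label, count = label_counts.most_common(1)[0]
```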

In [ ]:

!pip install torch

from tqdm import tqdm
import torch

We already defined data sources for training and testing (respectively, ds['train'] and ds['test']). We can now define the sampler and the loaders:

In [ ]:

batch_size = 128
train_sampler = torch.utils.data.RandomSampler(ds['train'], num_samples=5_000)
train_loader = torch.utils.data.DataLoader(
    ds['train'],
    sampler=train_sampler,
    batch_size=batch_size,
)
test_loader = torch.utils.data.DataLoader(
    ds['test'],
    sampler=None,
    batch_size=batch_size,
)
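
As a quick sanity check on the training loader above, the number of batches per pass follows from num_samples and batch_size. The arithmetic below assumes the DataLoader default of drop_last=False, so the final partial batch is kept.

```python
import math

num_samples = 5_000  # Matches RandomSampler(num_samples=5_000) above.
batch_size = 128

# With drop_last=False, the last (smaller) batch is still yielded.
num_batches = math.ceil(num_samples / batch_size)
last_batch_size = num_samples - (num_batches - 1) * batch_size

print(num_batches, last_batch_size)  # 40 8
```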

Using PyTorch, we train a simple logistic regression on a random sample of 5,000 training examples and evaluate it on the test split:

In [ ]:

class LinearClassifier(torch.nn.Module):
  def __init__(self, shape, num_classes):
    super(LinearClassifier, self).__init__()
    height, width, channels = shape
    self.classifier = torch.nn.Linear(height * width * channels, num_classes)

  def forward(self, image):
    image = image.view(image.size()[0], -1).to(torch.float32)
    return self.classifier(image)


model = LinearClassifier(shape, num_classes)
optimizer = torch.optim.Adam(model.parameters())
loss_function = torch.nn.CrossEntropyLoss()

print('Training...')
model.train()
for example in tqdm(train_loader):
  image, label = example['image'], example['label']
  prediction = model(image)
  loss = loss_function(prediction, label)
  optimizer.zero_grad()
  loss.backward()
  optimizer.step()

print('Testing...')
model.eval()
num_examples = 0
true_positives = 0
# Gradients are not needed for evaluation, so disable autograd tracking.
with torch.no_grad():
  for example in tqdm(test_loader):
    image, label = example['image'], example['label']
    prediction = model(image)
    num_examples += image.shape[0]
    predicted_label = prediction.argmax(dim=1)
    true_positives += (predicted_label == label).sum().item()
print(f'\nAccuracy: {true_positives/num_examples * 100:.2f}%')
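
For a sense of scale, the size of the linear classifier can be worked out by hand. The sketch below assumes the usual fashion_mnist values of shape (28, 28, 1) and 10 classes; in the notebook these come from `features` rather than being hard-coded.

```python
import math

shape = (28, 28, 1)  # Assumed image shape for fashion_mnist.
num_classes = 10     # Assumed number of classes for fashion_mnist.

# The model flattens each image, so the Linear layer sees height * width * channels inputs.
num_inputs = math.prod(shape)

# One weight per (input, class) pair plus one bias per class.
num_parameters = num_inputs * num_classes + num_classes

print(num_inputs, num_parameters)  # 784 7850
```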

Coming soon: use with JAX

We are working closely with Grain. Grain is an open-source, fast, and deterministic data loader for Python. Stay tuned!

Read more

For more information, please refer to the tfds.data_source API doc.