GitHub Repository: tensorflow/docs-l10n
Path: blob/master/site/ko/datasets/determinism.ipynb
²⁵¹¹⁵ views

Kernel: Python 3

Copyright 2020 The TensorFlow Authors.

In [ ]:

#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

TFDS와 결정성

이 문서에서는 다음을 설명합니다.

결정성을 보장하는 TFDS
TFDS가 예시를 읽는 순서
다양한 주의사항과 문제들

설치

데이터 세트

TFDS가 데이터를 읽는 방식을 이해하려면 약간의 맥락이 필요합니다.

TFDS는 생성하는 동안 표준화된 .tfrecord 파일에 원본 데이터를 기록합니다. 큰 데이터 세트에서는 여러 .tfrecord 파일을 생성하며 각각에는 다양한 예시가 포함되어 있습니다. 우리는 각각의 .tfrecord 파일을 샤드라고 부릅니다.

이 가이드에서는 1024개의 샤드가 있는 이미지넷을 사용합니다.

In [ ]:

import re
import tensorflow_datasets as tfds

imagenet = tfds.builder('imagenet2012')

num_shards = imagenet.info.splits['train'].num_shards
num_examples = imagenet.info.splits['train'].num_examples
print(f'imagenet has {num_shards} shards ({num_examples} examples)')

imagenet has 1024 shards (1281167 examples)

데이터 세트 예시 ID 찾기

결정성에 대한 내용만 알고 싶은 경우 다음 섹션으로 건너뛸 수 있습니다.

각 데이터 세트 예시는 id(예: 'imagenet2012-train.tfrecord-01023-of-01024__32')를 통해 고유하게 식별됩니다. tf.data.Dataset로부터 dict의 'tfds_id' 키를 추가하는 read_config.add_tfds_id = True를 전달하여 이 id를 복구할 수 있습니다.

이 튜토리얼에서는 데이터 세트의 예시 ID를 인쇄하는 작은 유틸을 정의합니다(사람이 읽기 쉬도록 정수로 변환).

In [ ]:

#@title

def load_dataset(builder, **as_dataset_kwargs):
  """Load the dataset with the tfds_id."""
  read_config = as_dataset_kwargs.pop('read_config', tfds.ReadConfig())
  read_config.add_tfds_id = True  # Set `True` to return the 'tfds_id' key
  return builder.as_dataset(read_config=read_config, **as_dataset_kwargs)

def print_ex_ids(
    builder,
    *,
    take: int,
    skip: int = None,
    **as_dataset_kwargs,
) -> None:
  """Print the example ids from the given dataset split."""
  ds = load_dataset(builder, **as_dataset_kwargs)
  if skip:
    ds = ds.skip(skip)
  ds = ds.take(take)
  exs = [ex['tfds_id'].numpy().decode('utf-8') for ex in ds]
  exs = [id_to_int(tfds_id, builder=builder) for tfds_id in exs]
  print(exs)

def id_to_int(tfds_id: str, builder) -> str:
  """Format the tfds_id in a more human-readable."""
  match = re.match(r'\w+-(\w+).\w+-(\d+)-of-\d+__(\d+)', tfds_id)
  split_name, shard_id, ex_id = match.groups()
  split_info = builder.info.splits[split_name]
  return sum(split_info.shard_lengths[:int(shard_id)]) + int(ex_id)

읽기 작업 시의 결정성

이 섹션에서는 tfds.load의 결정성 보장에 대해 설명합니다.

`shuffle_files=False`의 사용(기본값)

기본적으로 TFDS는 결정적으로 예시를 산출합니다(shuffle_files=False).

In [ ]:

# Same as: imagenet.as_dataset(split='train').take(20)
print_ex_ids(imagenet, split='train', take=20)
print_ex_ids(imagenet, split='train', take=20)

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1251, 1252, 1253, 1254]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1251, 1252, 1253, 1254]

성능을 위해 TFDS는 tf.data.Dataset.interleave를 사용하여 동시에 여러 샤드를 읽습니다. 이 예시에서는 TFDS가 16개의 예시(..., 14, 15, 1251, 1252, ...)를 읽은 후 샤드 2로 전환하는 것을 보게 됩니다. 이터리브에 대한 자세한 내용은 아래를 참조합니다.

마찬가지로 subsplit API(하위 분할 API)도 다음과 같이 결정적입니다.

In [ ]:

print_ex_ids(imagenet, split='train[67%:84%]', take=20)
print_ex_ids(imagenet, split='train[67%:84%]', take=20)

[858382, 858383, 858384, 858385, 858386, 858387, 858388, 858389, 858390, 858391, 858392, 858393, 858394, 858395, 858396, 858397, 859533, 859534, 859535, 859536]
[858382, 858383, 858384, 858385, 858386, 858387, 858388, 858389, 858390, 858391, 858392, 858393, 858394, 858395, 858396, 858397, 859533, 859534, 859535, 859536]

둘 이상의 epoch에 대한 훈련을 진행하는 경우 모든 epoch가 동일한 순서로 샤드를 읽기에 위의 설정은 권장되지 않습니다(따라서 임의성은 ds = ds.shuffle(buffer) 버퍼 크기로 제한됨).

`shuffle_files=True`의 사용

shuffle_files=True를 사용하는 경우 각 epoch에 대해 샤드가 셔플되므로 읽기 작업이 더 이상 결정적이지 않습니다.

In [ ]:

print_ex_ids(imagenet, split='train', shuffle_files=True, take=20)
print_ex_ids(imagenet, split='train', shuffle_files=True, take=20)

[568017, 329050, 329051, 329052, 329053, 329054, 329056, 329055, 568019, 568020, 568021, 568022, 568023, 568018, 568025, 568024, 568026, 568028, 568030, 568031]
[43790, 43791, 43792, 43793, 43796, 43794, 43797, 43798, 43795, 43799, 43800, 43801, 43802, 43803, 43804, 43805, 43806, 43807, 43809, 43810]

참고: shuffle_files=True를 설정하면 약간의 성능 향상을 위해 tf.data.Options의 deterministic도 비활성화됩니다. 따라서 단일 샤드(예: mnist)만 있는 작은 데이터 세트도 비결정적으로 됩니다.

결정적 파일 셔플링을 가져오려면 아래의 레시피를 참조하세요.

결정성 주의사항: 인터리브 args

read_config.interleave_cycle_length를 변경하면 read_config.interleave_block_length가 예시 순서를 변경합니다.

TFDS는 tf.data.Dataset.interleave에 의존하여 한 번에 몇 개의 샤드만 로드하여 성능을 개선하고 메모리 사용량을 줄입니다.

예시 순서는 인터리브 args의 고정값에 대해서만 동일하도록 보장됩니다. cycle_length와 block_length에 해당하는 대상이 무엇인지 이해하려면 interleave doc을 참조하여 인터리브 문서를 참조하세요.

cycle_length=16, block_length=16 (기본값, 위와 동일):

In [ ]:

print_ex_ids(imagenet, split='train', take=20)

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1251, 1252, 1253, 1254]

cycle_length=3, block_length=2:

In [ ]:

read_config = tfds.ReadConfig(
    interleave_cycle_length=3,
    interleave_block_length=2,
)
print_ex_ids(imagenet, split='train', read_config=read_config, take=20)

[0, 1, 1251, 1252, 2502, 2503, 2, 3, 1253, 1254, 2504, 2505, 4, 5, 1255, 1256, 2506, 2507, 6, 7]

두 번째 예시에서 데이터 세트가 한 샤드에서 2 (block_length=2)의 예시를 읽고 다음 샤드로 전환하는 것을 볼 수 있습니다. 2 * 3(cycle_length=3) 예시마다 첫 번째 샤드(shard0-ex0, shard0-ex1, shard1-ex0, shard1-ex1, shard2-ex0, shard2-ex1, shard0-ex2, shard0-ex3, shard1-ex2, shard1-ex3, shard2-ex2,...)로 돌아갑니다.

하위 분할(Subsplit)과 예시 순서

각 예시에는 ID 0, 1, ..., num_examples-1가 있습니다. 하위 분할 API는 예시 조각을 선택합니다(예: train[:x]는 0, 1, ..., x-1을 선택).

다만 하위 분할 내에서 예시의 ID 순서가 오름차순으로 읽히지 않습니다(샤드 및 인터리브로 인함).

더 구체적으로 ds.take(x)와 split='train[:x]'는 동등하지 않습니다!

이는 예시가 다른 샤드로부터 오는 위의 인터리브 예시에서도 쉽게 확인할 수 있습니다.

In [ ]:

print_ex_ids(imagenet, split='train', take=25)  # tfds.load(..., split='train').take(25)
print_ex_ids(imagenet, split='train[:25]', take=-1)  # tfds.load(..., split='train[:25]')

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1251, 1252, 1253, 1254, 1255, 1256, 1257, 1258, 1259]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24]

16 (block_length) 예시 이후에 .take(25)는 다음 샤드로 전환하고 train[:25]는 첫 번째 샤드로부터 예시를 계속 읽습니다.

레시피

결정적 파일 셔플링 가져오기

결정적 셔플링을 보유하는 방법에는 2가지가 있습니다.

shuffle_seed를 설정합니다. 참고: 이렇게 하려면 각 epoch에서 시드를 변경해야 합니다. 그렇지 않으면 epoch 사이에서 동일한 순서로 샤드를 읽습니다.

In [ ]:

read_config = tfds.ReadConfig(
    shuffle_seed=32,
)

# Deterministic order, different from the default shuffle_files=False above
print_ex_ids(imagenet, split='train', shuffle_files=True, read_config=read_config, take=22)
print_ex_ids(imagenet, split='train', shuffle_files=True, read_config=read_config, take=22)

[176411, 176412, 176413, 176414, 176415, 176416, 176417, 176418, 176419, 176420, 176421, 176422, 176423, 176424, 176425, 176426, 710647, 710648, 710649, 710650, 710651, 710652]
[176411, 176412, 176413, 176414, 176415, 176416, 176417, 176418, 176419, 176420, 176421, 176422, 176423, 176424, 176425, 176426, 710647, 710648, 710649, 710650, 710651, 710652]

experimental_interleave_sort_fn의 사용: 이 경우 ds.shuffle 순서에 의존하지 않고 어떤 샤드를 어떤 순서로 읽을지 완전히 제어할 수 있습니다.

In [ ]:

def _reverse_order(file_instructions):
  return list(reversed(file_instructions))

read_config = tfds.ReadConfig(
    experimental_interleave_sort_fn=_reverse_order,
)

# Last shard (01023-of-01024) is read first
print_ex_ids(imagenet, split='train', read_config=read_config, take=5)

[1279916, 1279917, 1279918, 1279919, 1279920]

결정적 선점 가능 파이프라인 가져오기

이 방법은 더 복잡합니다. 쉽고 만족스러운 솔루션은 없습니다.

ds.shuffle 없이 결정적 셔플링을 사용할 경우 이론적으로 읽은 예시를 셀 수 있고 각 샤드 내에서 읽은 예시를 추론할 수 있어야 합니다(cycle_length, block_length 및 샤드 순서의 함수로). 그런 다음 experimental_interleave_sort_fn를 통해 각 샤드에 대한 skip, take를 삽입할 수 있습니다.
ds.shuffle을 사용하는 경우 전체 훈련 파이프라인을 재생하지 않으면 작업이 불가능할 수 있습니다. 읽은 예시를 추론하려면 ds.shuffle 버퍼 상태를 저장해야 합니다. 예시는 비연속적일 수 있습니다(예: shard5_ex2, shard5_ex4 읽기. 단, shard5_ex3는 해당 없음).
ds.shuffle을 사용하는 경우 한 가지 방법은 읽은 모든 shards_ids/example_ids(tfds_id에서 추론)를 저장한 후 여기로부터 파일 지침을 추론하는 것입니다.

1.에 대한 가장 간단한 사례는 .skip(x).take(y)와 train[x:x+y]를 일치시키는 것입니다. 필요 사항:

cycle_length=1 설정하기(샤드를 순차적으로 읽기 위함)
shuffle_files=False 설정하기
ds.shuffle 사용하지 않기

훈련이 1 epoch에 불과한 대형 데이터 세트에만 사용해야 합니다. 예시는 기본 셔플 순서로 읽습니다.

In [ ]:

read_config = tfds.ReadConfig(
    interleave_cycle_length=1,  # Read shards sequentially
)

print_ex_ids(imagenet, split='train', read_config=read_config, skip=40, take=22)
# If the job get pre-empted, using the subsplit API will skip at most `len(shard0)`
print_ex_ids(imagenet, split='train[40:]', read_config=read_config, take=22)

[40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61]
[40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61]

제공된 하위 분할에서 읽은 샤드/예시 찾기

tfds.core.DatasetInfo를 사용할 경우 읽기 지침에 직접 액세스할 수 있습니다.

In [ ]:

imagenet.info.splits['train[44%:45%]'].file_instructions

[FileInstruction(filename='imagenet2012-train.tfrecord-00450-of-01024', skip=700, take=-1, num_examples=551),
 FileInstruction(filename='imagenet2012-train.tfrecord-00451-of-01024', skip=0, take=-1, num_examples=1251),
 FileInstruction(filename='imagenet2012-train.tfrecord-00452-of-01024', skip=0, take=-1, num_examples=1251),
 FileInstruction(filename='imagenet2012-train.tfrecord-00453-of-01024', skip=0, take=-1, num_examples=1251),
 FileInstruction(filename='imagenet2012-train.tfrecord-00454-of-01024', skip=0, take=-1, num_examples=1252),
 FileInstruction(filename='imagenet2012-train.tfrecord-00455-of-01024', skip=0, take=-1, num_examples=1251),
 FileInstruction(filename='imagenet2012-train.tfrecord-00456-of-01024', skip=0, take=-1, num_examples=1251),
 FileInstruction(filename='imagenet2012-train.tfrecord-00457-of-01024', skip=0, take=-1, num_examples=1251),
 FileInstruction(filename='imagenet2012-train.tfrecord-00458-of-01024', skip=0, take=-1, num_examples=1251),
 FileInstruction(filename='imagenet2012-train.tfrecord-00459-of-01024', skip=0, take=-1, num_examples=1251),
 FileInstruction(filename='imagenet2012-train.tfrecord-00460-of-01024', skip=0, take=1001, num_examples=1001)]

Copyright 2020 The TensorFlow Authors.

TFDS와 결정성

설치

데이터 세트

데이터 세트 예시 ID 찾기

읽기 작업 시의 결정성

`shuffle_files=False`의 사용(기본값)

`shuffle_files=True`의 사용

결정성 주의사항: 인터리브 args

하위 분할(Subsplit)과 예시 순서

레시피

결정적 파일 셔플링 가져오기

결정적 선점 가능 파이프라인 가져오기

제공된 하위 분할에서 읽은 샤드/예시 찾기

Product

Resources

Company

Copyright 2020 The TensorFlow Authors.

TFDS와 결정성

설치

데이터 세트

데이터 세트 예시 ID 찾기

읽기 작업 시의 결정성

shuffle_files=False의 사용(기본값)

shuffle_files=True의 사용

결정성 주의사항: 인터리브 args

하위 분할(Subsplit)과 예시 순서

레시피

결정적 파일 셔플링 가져오기

결정적 선점 가능 파이프라인 가져오기

제공된 하위 분할에서 읽은 샤드/예시 찾기

`shuffle_files=False`의 사용(기본값)

`shuffle_files=True`의 사용