Writing custom datasets

Follow this guide to create a new dataset, either in TFDS or in your own repository.

Check the list of datasets to see whether the dataset you want already exists.

TL;DR

The easiest way to write a new dataset is to use the TFDS CLI:

    cd path/to/my/project/datasets/
    tfds new my_dataset  # Create the `my_dataset/` template files
    # [...] Manually modify `my_dataset/my_dataset_dataset_builder.py` to implement your dataset.
    cd my_dataset/
    tfds build  # Download and prepare the dataset to `~/tensorflow_datasets/`

To use the new dataset with tfds.load('my_dataset'):

  • tfds.load automatically detects and loads a dataset generated in ~/tensorflow_datasets/my_dataset/ (e.g. by tfds build).

  • Alternatively, you can explicitly import my.project.datasets.my_dataset to register your dataset:

    import tensorflow_datasets as tfds

    import my.project.datasets.my_dataset  # Register `my_dataset`

    ds = tfds.load('my_dataset')  # `my_dataset` registered

Overview

Datasets are distributed in all kinds of formats and in all kinds of places, and they are not always stored in a format that is ready to feed into a machine learning pipeline. This is where TFDS comes in.

TFDS processes those datasets into a standard format (external data -> serialized files), which can then be loaded as a machine learning pipeline (serialized files -> tf.data.Dataset). The serialization is done only once. Subsequent accesses read directly from those pre-processed files.

Most of the preprocessing is done automatically. Each dataset implements a subclass of tfds.core.DatasetBuilder, which specifies:

  • Where the data comes from (e.g. its URLs)

  • What the dataset looks like (i.e. its features)

  • How the data should be split (e.g. TRAIN and TEST)

  • The individual examples in the dataset

Write your dataset

Default template: tfds new

Use the TFDS CLI to generate the required template Python files.

    cd path/to/project/datasets/  # Or use `--dir=path/to/project/datasets/` below
    tfds new my_dataset

This command generates a new my_dataset/ folder with the following structure:

    my_dataset/
        __init__.py
        README.md  # Markdown description of the dataset.
        CITATIONS.bib  # BibTeX citation for the dataset.
        TAGS.txt  # List of tags describing the dataset.
        my_dataset_dataset_builder.py  # Dataset definition
        my_dataset_dataset_builder_test.py  # Test
        dummy_data/  # (optional) Fake data (used for testing)
        checksums.tsv  # (optional) URL checksums (see `checksums` section).

Search for TODO(my_dataset) in the generated files and modify them accordingly.

Dataset example

All datasets are implemented as subclasses of tfds.core.DatasetBuilder, which takes care of most of the boilerplate. It supports:

  • Small and medium datasets that can be generated on a single machine (this tutorial).

  • Very large datasets that require distributed generation (using Apache Beam; see the huge dataset guide).

Here is a minimal example of a dataset builder based on tfds.core.GeneratorBasedBuilder:

    from typing import Any, Dict, Iterator, Tuple

    import tensorflow_datasets as tfds


    class Builder(tfds.core.GeneratorBasedBuilder):
        """DatasetBuilder for my_dataset dataset."""

        VERSION = tfds.core.Version('1.0.0')
        RELEASE_NOTES = {
            '1.0.0': 'Initial release.',
        }

        def _info(self) -> tfds.core.DatasetInfo:
            """Dataset metadata (homepage, citation,...)."""
            return self.dataset_info_from_configs(
                features=tfds.features.FeaturesDict({
                    'image': tfds.features.Image(shape=(256, 256, 3)),
                    'label': tfds.features.ClassLabel(
                        names=['no', 'yes'],
                        doc='Whether this is a picture of a cat'),
                }),
            )

        def _split_generators(self, dl_manager: tfds.download.DownloadManager):
            """Download the data and define splits."""
            extracted_path = dl_manager.download_and_extract('http://data.org/data.zip')
            # dl_manager returns pathlib-like objects with `path.read_text()`,
            # `path.iterdir()`,...
            return {
                'train': self._generate_examples(path=extracted_path / 'train_images'),
                'test': self._generate_examples(path=extracted_path / 'test_images'),
            }

        def _generate_examples(self, path) -> Iterator[Tuple[str, Dict[str, Any]]]:
            """Generator of examples for each split."""
            for img_path in path.glob('*.jpeg'):
                # Yields (key, example)
                yield img_path.name, {
                    'image': img_path,
                    'label': 'yes' if img_path.name.startswith('yes_') else 'no',
                }

For some specific data formats, ready-to-use dataset builders are provided that take care of most of the data processing.

Let's look in detail at the 3 abstract methods to override.

_info: dataset metadata

_info returns the tfds.core.DatasetInfo containing the dataset metadata.

    def _info(self):
        # The `dataset_info_from_configs` base method will construct the
        # `tfds.core.DatasetInfo` object using the passed-in parameters and
        # adding: builder (self), description/citations/tags from the config
        # files located in the same package.
        return self.dataset_info_from_configs(
            homepage='https://dataset-homepage.org',
            features=tfds.features.FeaturesDict({
                'image_description': tfds.features.Text(),
                'image': tfds.features.Image(),
                # Here, 'label' can be 0-4.
                'label': tfds.features.ClassLabel(num_classes=5),
            }),
            # If there's a common `(input, target)` tuple from the features,
            # specify them here. They'll be used if as_supervised=True in
            # builder.as_dataset.
            supervised_keys=('image', 'label'),
            # Specify whether to disable shuffling on the examples. Set to False by default.
            disable_shuffling=False,
        )

Most fields should be self-explanatory. Some clarifications:

Writing the BibTeX CITATIONS.bib file:

  • Search the dataset website for citation instructions (use them in BibTeX format).

  • For arXiv papers: find the paper and click the BibTeX link on the right-hand side.

  • Find the paper on Google Scholar, click the double-quotation mark underneath the title, and click BibTeX in the popup.

  • If there is no associated paper (for example, there is only a website), you can use the BibTeX Online Editor to create a custom BibTeX entry (the drop-down menu has an Online entry type).

Updating the TAGS.txt file:

  • All allowed tags are pre-filled in the generated file.

  • Remove all tags that do not apply to the dataset.

  • Valid tags are listed in tensorflow_datasets/core/valid_tags.txt.

  • To add a tag to that list, send a PR.

Maintain dataset order

Records that belong to the same class are often contiguous in the source data, so by default the records of a dataset are shuffled when stored to make the class distribution more uniform across the dataset. To have the dataset sorted by the key yielded from _generate_examples instead, set the disable_shuffling field to True. It defaults to False.

    def _info(self):
        return self.dataset_info_from_configs(
            # [...]
            disable_shuffling=True,
            # [...]
        )

Keep in mind that disabling shuffling has a performance impact, because shards can no longer be read in parallel.
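The two storage orders described in this section can be sketched with the standard library alone. This is not the TFDS implementation: the stored_order helper and the choice of SHA-256 as the hash are assumptions made for illustration.

```python
import hashlib


def stored_order(keys, disable_shuffling=False):
    """Return keys in the order a writer could store them (illustrative only)."""
    if disable_shuffling:
        # Shuffling disabled: records are sorted by their key.
        return sorted(keys)
    # Shuffling enabled: order by a hash of the key. The result looks
    # random but is identical on every run and for any input order.
    return sorted(keys, key=lambda k: hashlib.sha256(k.encode()).hexdigest())


keys = [f"cat_{i:03d}" for i in range(5)] + [f"dog_{i:03d}" for i in range(5)]
shuffled = stored_order(keys)
ordered = stored_order(keys, disable_shuffling=True)
```

Because both orders depend only on the keys, generating the data twice gives the same layout on disk either way.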

_split_generators: downloads and splits data

Downloading and extracting source data

Most datasets need to download data from the web. This is done using the tfds.download.DownloadManager input argument of _split_generators. dl_manager has the following methods:

  • download: supports http(s)://, ftp(s)://

  • extract: currently supports .zip, .gz, and .tar files.

  • download_and_extract: same as dl_manager.extract(dl_manager.download(urls))

All of these methods return tfds.core.Path (an alias for epath.Path), which is a pathlib.Path-like object.

These methods support arbitrary nested structures (list, dict), for example:

    extracted_paths = dl_manager.download_and_extract({
        'foo': 'https://example.com/foo.zip',
        'bar': 'https://example.com/bar.zip',
    })
    # This returns:
    assert extracted_paths == {
        'foo': Path('/path/to/extracted_foo/'),
        'bar': Path('/path/extracted_bar/'),
    }
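The "same structure in, same structure out" behavior can be pictured with a small recursive helper. This is a sketch, not TFDS code: map_nested and fake_download are hypothetical names, and no network access takes place.

```python
from pathlib import Path


def map_nested(fn, structure):
    """Apply fn to every leaf of a nested dict/list, keeping the layout."""
    if isinstance(structure, dict):
        return {k: map_nested(fn, v) for k, v in structure.items()}
    if isinstance(structure, (list, tuple)):
        return type(structure)(map_nested(fn, v) for v in structure)
    return fn(structure)


def fake_download(url):
    # Stand-in for a real download: derive a local path from the URL.
    return Path("/tmp/downloads") / url.rsplit("/", 1)[-1]


urls = {
    "images": ["https://example.com/a.zip", "https://example.com/b.zip"],
    "labels": "https://example.com/labels.csv",
}
paths = map_nested(fake_download, urls)
```

The result has the same keys and list lengths as the input, with each URL replaced by a path.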

Manual download and extraction

Some data cannot be downloaded automatically (e.g. it requires a login). In this case, the user manually downloads the source data and places it in manual_dir/ (defaults to ~/tensorflow_datasets/downloads/manual/).

The files can then be accessed through dl_manager.manual_dir:

    class MyDataset(tfds.core.GeneratorBasedBuilder):

        MANUAL_DOWNLOAD_INSTRUCTIONS = """
        Register into https://example.org/login to get the data. Place the `data.zip`
        file in the `manual_dir/`.
        """

        def _split_generators(self, dl_manager):
            # archive_path is a pathlib-like `Path('<manual_dir>/data.zip')`
            archive_path = dl_manager.manual_dir / 'data.zip'
            # Extract the manually downloaded `data.zip`
            extracted_path = dl_manager.extract(archive_path)
            ...

The manual_dir location can be customized with tfds build --manual_dir= or with tfds.download.DownloadConfig.

Read archive directly

dl_manager.iter_archive reads an archive sequentially without extracting it. This can save storage space and improve performance on some file systems.

    for filename, fobj in dl_manager.iter_archive('path/to/archive.zip'):
        ...

fobj has the same methods as a file opened in binary mode with open(path, 'rb') (e.g. fobj.read()).
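To see what sequential reading without extraction looks like, here is a standard-library analogue built on zipfile. It is not how dl_manager.iter_archive is implemented; iter_zip is a hypothetical helper, and the archive is built in memory so the sketch runs on its own.

```python
import io
import zipfile


def iter_zip(archive):
    """Yield (filename, file object) pairs without extracting to disk."""
    with zipfile.ZipFile(archive) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            with zf.open(info) as fobj:
                yield info.filename, fobj


# Build a small in-memory archive so the sketch is self-contained.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("train/a.txt", "first")
    zf.writestr("train/b.txt", "second")
buffer.seek(0)

# Each file object is only valid while its loop iteration is running.
contents = {name: fobj.read() for name, fobj in iter_zip(buffer)}
```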

Specifying dataset splits

If the dataset comes with pre-defined splits (e.g. MNIST has train and test splits), keep those. Otherwise, specify only a single split. Users can dynamically create their own subsplits with the subsplit API (e.g. split='train[80%:]'). Note that any alphabetical string can be used as a split name, except all, which is reserved.

    def _split_generators(self, dl_manager):
        # Download source data
        extracted_path = dl_manager.download_and_extract(...)

        # Specify the splits
        return {
            'train': self._generate_examples(
                images_path=extracted_path / 'train_imgs',
                label_path=extracted_path / 'train_labels.csv',
            ),
            'test': self._generate_examples(
                images_path=extracted_path / 'test_imgs',
                label_path=extracted_path / 'test_labels.csv',
            ),
        }

_generate_examples: example generator

_generate_examples generates the examples for each split from the source data.

This method typically reads source dataset artifacts (e.g. a CSV file) and yields (key, feature_dict) tuples:

  • key: the example identifier. It is used to deterministically shuffle the examples using hash(key), or to sort by key when shuffling is disabled (see the section Maintain dataset order). It should be:

    • unique: if two examples use the same key, an exception is raised.

    • deterministic: it should not depend on download_dir, os.listdir order, and so on. Generating the data twice should yield the same keys.

    • comparable: if shuffling is disabled, the key is used to sort the dataset.

  • feature_dict: a dict containing the example values.

    • The structure should match the features= structure defined in tfds.core.DatasetInfo.

    • Complex data types (image, video, audio,...) are encoded automatically.

    • Each feature often accepts multiple input types (e.g. video accepts /path/to/vid.mp4, np.array(shape=(l, h, w, c)), List[paths], List[np.array(shape=(h, w, c))], List[img_bytes],...).

    • See the feature connector guide for more information.

    import csv

    def _generate_examples(self, images_path, label_path):
        # Read the input data out of the source files
        with label_path.open() as f:
            for row in csv.DictReader(f):
                image_id = row['image_id']
                # And yield (key, feature_dict)
                yield image_id, {
                    'image_description': row['description'],
                    'image': images_path / f'{image_id}.jpeg',
                    'label': row['label'],
                }
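The key requirements listed above can be illustrated with a standalone generator. This is a sketch under assumptions: the function is a plain Python function rather than a builder method, the file names are made up, and the duplicate check only mirrors the kind of error a duplicate key would cause.

```python
import tempfile
from pathlib import Path


def generate_examples(images_dir):
    """Yield (key, feature_dict) pairs with unique, deterministic keys."""
    seen = set()
    # sorted() removes any dependence on the file system listing order.
    for img_path in sorted(images_dir.glob("*.jpeg")):
        key = img_path.name  # File name only, independent of the download dir.
        if key in seen:
            raise ValueError(f"Duplicate key: {key}")
        seen.add(key)
        label = "yes" if key.startswith("yes_") else "no"
        yield key, {"image": img_path, "label": label}


with tempfile.TemporaryDirectory() as tmp:
    for name in ("yes_2.jpeg", "no_1.jpeg", "yes_1.jpeg"):
        (Path(tmp) / name).touch()
    keys = [key for key, _ in generate_examples(Path(tmp))]
```

Using the file name as the key, rather than the full path, keeps the key stable when the data is downloaded to a different directory.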

Warning: When parsing boolean values from strings or integers, use the utility function tfds.core.utils.bool_utils.parse_bool to avoid parsing errors (e.g. bool("False") == True).
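The pitfall in this warning is easy to reproduce. The helper below is a hypothetical stand-in written for this sketch; the spellings it accepts are assumptions and may differ from what tfds.core.utils.bool_utils.parse_bool accepts.

```python
def parse_bool_sketch(value):
    """Parse a boolean from a str/int/bool; raise on anything ambiguous."""
    if isinstance(value, bool):
        return value
    if isinstance(value, int):
        if value in (0, 1):
            return bool(value)
        raise ValueError(f"Cannot parse {value!r} as a boolean")
    text = str(value).strip().lower()
    if text in ("true", "1"):
        return True
    if text in ("false", "0"):
        return False
    raise ValueError(f"Cannot parse {value!r} as a boolean")


naive = bool("False")                # Any non-empty string is truthy.
parsed = parse_bool_sketch("False")  # Explicit parsing gives the intended value.
```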

File access and tf.io.gfile

To support cloud storage systems, avoid using the Python built-in I/O operations.

Instead, dl_manager returns pathlib-like objects that are directly compatible with Google Cloud Storage:

    import json

    path = dl_manager.download_and_extract('http://some-website/my_data.zip')

    json_path = path / 'data/file.json'

    json.loads(json_path.read_text())

Alternatively, use the tf.io.gfile API instead of the built-in functions for file operations:

  • open -> tf.io.gfile.GFile

  • os.rename -> tf.io.gfile.rename

  • ...

Pathlib should be preferred to tf.io.gfile (see rationale).
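The pathlib style can be tried out locally as follows. In this sketch the standard-library pathlib.Path stands in for the path objects returned by dl_manager, on the assumption (stated earlier in this guide) that they expose the same pathlib-like methods. The directory layout is made up.

```python
import json
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)  # Stand-in for an extracted download path.
    json_path = root / "data" / "file.json"
    json_path.parent.mkdir(parents=True)
    json_path.write_text(json.dumps({"label": "yes"}))

    # Path methods take the place of open() and os.path calls.
    record = json.loads(json_path.read_text())
    listed = sorted(p.name for p in json_path.parent.iterdir())
```

Code written against these methods does not need to change when the path object points at remote storage instead of a local directory.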

Extra dependencies

Some datasets require additional Python dependencies only during generation. For example, the SVHN dataset uses scipy to load some data.

If you are adding a dataset to the TFDS repository, use tfds.core.lazy_imports to keep the tensorflow-datasets package small. Users install additional dependencies only as needed.

To use lazy_imports:

  • Add an entry for your dataset to DATASET_EXTRAS in setup.py. This lets users run, for example, pip install 'tensorflow-datasets[svhn]' to install the extra dependencies.

  • Add an entry for your import to LazyImporter and to LazyImportsTest.

  • Use tfds.core.lazy_imports to access the dependency (for example, tfds.core.lazy_imports.scipy) in your DatasetBuilder.
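The idea behind a lazy import is that the dependency is imported on first use rather than when the package is loaded. The following is a generic sketch of that pattern, not the code behind tfds.core.lazy_imports; LazyModule is a hypothetical class, and json stands in for an optional dependency such as scipy.

```python
import importlib


class LazyModule:
    """Import a module the first time one of its attributes is accessed."""

    def __init__(self, name):
        self._name = name
        self._module = None

    def __getattr__(self, attr):
        # Only called for attributes not found normally, i.e. module members.
        if self._module is None:
            try:
                self._module = importlib.import_module(self._name)
            except ImportError as e:
                raise ImportError(
                    f"{self._name} is required for this dataset; install it first."
                ) from e
        return getattr(self._module, attr)


lazy_json = LazyModule("json")       # Nothing is imported yet.
not_loaded = lazy_json._module is None
value = lazy_json.loads("[1, 2]")    # First access triggers the import.
```

With this pattern, a missing dependency only produces an error for users who actually generate the dataset that needs it.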

Corrupted data

Some datasets are not perfectly clean and contain some corrupt data (for example, the images are in JPEG files but some of them are invalid JPEGs). These examples should be skipped, but leave a note in the dataset description stating how many examples were dropped and why.
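One way to skip bad records while keeping count of them is sketched below. The check is a deliberately cheap heuristic (JPEG start and end markers only) chosen for illustration; a real builder might decode the image instead. All names are hypothetical.

```python
JPEG_START = b"\xff\xd8"  # Start-of-image marker
JPEG_END = b"\xff\xd9"    # End-of-image marker


def looks_like_jpeg(data):
    """Cheap sanity check; a real decoder would be stricter."""
    return data.startswith(JPEG_START) and data.endswith(JPEG_END)


def keep_valid(examples):
    """Split (key, image_bytes) pairs into kept examples and skipped keys."""
    kept, skipped = [], []
    for key, data in examples:
        if looks_like_jpeg(data):
            kept.append((key, data))
        else:
            skipped.append(key)
    return kept, skipped


examples = [
    ("a.jpeg", JPEG_START + b"payload" + JPEG_END),
    ("b.jpeg", b"not an image"),
]
kept, skipped = keep_valid(examples)
```

The length of skipped is the number to report in the dataset description.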

Dataset configuration/variants (tfds.core.BuilderConfig)

Some datasets may have multiple variants, or options for how the data is preprocessed and written to disk. For example, cycle_gan has one config per object pair (cycle_gan/horse2zebra, cycle_gan/monet2photo,...).

This is done through tfds.core.BuilderConfig:

  1. Define your configuration object as a subclass of tfds.core.BuilderConfig, for example MyDatasetConfig.

        import dataclasses
        from typing import Tuple

        @dataclasses.dataclass
        class MyDatasetConfig(tfds.core.BuilderConfig):
            img_size: Tuple[int, int] = (0, 0)

    Note: Default values are required because of https://bugs.python.org/issue33129.

  2. Define the BUILDER_CONFIGS = [] class member in MyDataset, listing the MyDatasetConfig instances that the dataset exposes.

        class MyDataset(tfds.core.GeneratorBasedBuilder):
            VERSION = tfds.core.Version('1.0.0')
            # pytype: disable=wrong-keyword-args
            BUILDER_CONFIGS = [
                # `name` (and optionally `description`) are required for each config
                MyDatasetConfig(name='small', description='Small ...', img_size=(8, 8)),
                MyDatasetConfig(name='big', description='Big ...', img_size=(32, 32)),
            ]
            # pytype: enable=wrong-keyword-args

    Note: # pytype: disable=wrong-keyword-args is required because of a Pytype bug with dataclass inheritance.

  3. Use self.builder_config in MyDataset to configure data generation (e.g. shape=self.builder_config.img_size). This may include setting different values in _info() or changing how the downloaded data is accessed.

Notes:

  • Each config has a unique name. The fully qualified name of a config is dataset_name/config_name (e.g. coco/2017).

  • If no config is specified, the first one in BUILDER_CONFIGS is used (e.g. tfds.load('c4') defaults to c4/en).

See anli for an example of a dataset that uses BuilderConfig.
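The reason the config fields need default values can be reproduced with plain dataclasses. In this sketch BaseConfig is a hypothetical stand-in for a base config whose fields all have defaults; the fields of the real tfds.core.BuilderConfig may differ.

```python
import dataclasses
from typing import Optional, Tuple


@dataclasses.dataclass
class BaseConfig:
    """Stand-in for a base config whose fields all have defaults."""
    name: str = ""
    description: Optional[str] = None


@dataclasses.dataclass
class MyDatasetConfig(BaseConfig):
    img_size: Tuple[int, int] = (0, 0)  # Default keeps the field order valid.


try:
    @dataclasses.dataclass
    class BrokenConfig(BaseConfig):
        img_size: Tuple[int, int]  # No default after defaulted fields.
    error = None
except TypeError as e:
    # dataclasses rejects a non-default field after fields with defaults.
    error = str(e)

small = MyDatasetConfig(name="small", img_size=(8, 8))
```

The subclass without a default fails at class-definition time, which is why the example in step 1 assigns (0, 0) even though every real config overrides it.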

Version

Version can refer to two different things:

  • The "external" original data version: e.g. COCO v2019, v2017,...

  • The "internal" TFDS code version: e.g. renaming a feature in tfds.features.FeaturesDict, or fixing a bug in _generate_examples

To update a dataset:

  • For an "external" data update: multiple users may want to access a specific year/version simultaneously. This is done by using one tfds.core.BuilderConfig per version (e.g. coco/2017, coco/2019) or one class per version (e.g. Voc2007, Voc2012).

  • For an "internal" code update: users only download the most recent version. Any code update should increase the VERSION class attribute (e.g. from 1.0.0 to VERSION = tfds.core.Version('2.0.0')) following semantic versioning.

Add an import for registration

Don't forget to import the dataset module in your project's __init__ so that it is automatically registered in tfds.load and tfds.builder.

    import my_project.datasets.my_dataset  # Register MyDataset

    ds = tfds.load('my_dataset')  # MyDataset available

For example, if you are contributing to tensorflow/datasets, add the module import to the __init__.py of its subdirectory (e.g. image/__init__.py).
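Why an import is enough to register a dataset can be pictured with a small registry. This is a generic sketch of the pattern, not the TFDS registration code; REGISTRY, BuilderBase, and load are hypothetical names.

```python
REGISTRY = {}


class BuilderBase:
    """Base class that records every subclass under a snake_case name."""

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        name = "".join(
            "_" + c.lower() if c.isupper() else c for c in cls.__name__
        ).lstrip("_")
        REGISTRY[name] = cls


def load(name):
    if name not in REGISTRY:
        raise KeyError(f"Dataset {name!r} not found; was its module imported?")
    return REGISTRY[name]()


# In a real project this class would live in its own module, and
# importing that module is what runs the class statement below.
class MyDataset(BuilderBase):
    pass


builder = load("my_dataset")
```

Until the module that defines the class has been imported, the class statement never runs and the name cannot be looked up.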

Check for common implementation gotchas

Check the list of common implementation gotchas.

Test your dataset

Download and prepare: tfds build

To generate the dataset, run tfds build from the my_dataset/ directory:

    cd path/to/datasets/my_dataset/
    tfds build --register_checksums

Some useful flags for development:

  • --pdb: enter debugging mode if an exception is raised.

  • --overwrite: delete existing files if the dataset was already generated.

  • --max_examples_per_split: generate only the first X examples (default 1) rather than the full dataset.

  • --register_checksums: record the checksums of downloaded URLs. Should only be used during development.

See the CLI documentation for the full list of flags.

Checksums

Recording the checksums of your datasets is recommended to guarantee determinism and to help with documentation. This is done by generating the dataset with --register_checksums (see the previous section).

If you are releasing your datasets through PyPI, don't forget to export the checksums.tsv files (e.g. in the package_data of your setup.py).
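What a recorded checksum guards against can be shown with a short helper that fingerprints a file by size and hash. This sketch assumes SHA-256 and makes no claim about the layout of checksums.tsv; file_checksum and the sample file are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def file_checksum(path, chunk_size=1 << 20):
    """Return (size in bytes, SHA-256 hex digest) of a file."""
    digest = hashlib.sha256()
    size = 0
    with open(path, "rb") as f:
        # Read in chunks so large archives do not have to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
            size += len(chunk)
    return size, digest.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    archive = Path(tmp) / "data.zip"
    archive.write_bytes(b"hello")
    size, sha256 = file_checksum(archive)
    # A later download with the same bytes must give the same result.
    matches = file_checksum(archive) == (size, sha256)
```

If the file behind a URL changes, the fingerprint no longer matches the recorded one, which signals that the generated dataset would differ.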

Unit-test your dataset

tfds.testing.DatasetBuilderTestCase is a base TestCase that fully exercises a dataset. It uses "dummy data" as test data that mimics the structure of the source dataset.

  • The test data should be placed in the my_dataset/dummy_data/ directory and should mimic the source dataset artifacts as downloaded and extracted. It can be created manually or automatically with a script (example script).

  • Make sure to use different data in each of your test data splits, because the test fails if the dataset splits overlap.

  • The test data should not contain any copyrighted material. If in doubt, do not create the data using material from the original dataset.

    import tensorflow_datasets as tfds
    from . import my_dataset_dataset_builder


    class MyDatasetTest(tfds.testing.DatasetBuilderTestCase):
        """Tests for my_dataset dataset."""
        DATASET_CLASS = my_dataset_dataset_builder.Builder
        SPLITS = {
            'train': 3,  # Number of fake train examples
            'test': 1,  # Number of fake test examples
        }

        # If you are calling `download/download_and_extract` with a dict, like:
        #   dl_manager.download({'some_key': 'http://a.org/out.txt', ...})
        # then the test needs to provide the fake output paths relative to the
        # fake data directory
        DL_EXTRACT_RESULT = {
            'name1': 'path/to/file1',  # Relative to my_dataset/dummy_data dir.
            'name2': 'file2',
        }


    if __name__ == '__main__':
        tfds.testing.test_main()

Run the following command to test the dataset.

    python my_dataset_dataset_builder_test.py

Send us feedback

We are continuously trying to improve the dataset creation workflow, but we can only do so if we are aware of the issues. Which issues or errors did you encounter while creating the dataset? Was any part confusing, or did anything not work the first time?

Please share your feedback on GitHub.