GitHub Repository: tensorflow/docs-l10n
Path: blob/master/site/ko/federated/tutorials/private_heavy_hitters.ipynb

Kernel: Python 3

Copyright 2022 The TensorFlow Authors.

In [ ]:

#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

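Private Heavy Hitters

Note: This colab has been verified to work with the latest released version of the tensorflow_federated pip package. It may not be updated to work against main.

This tutorial shows how to use the tff.analytics.heavy_hitters.iblt.build_iblt_computation API to build a federated analytics computation that discovers the most frequent strings (private heavy hitters) in a population.

Environment setup

Run the following to make sure your environment is set up correctly. If you don't see a greeting, refer to the installation guide for instructions.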

In [ ]:

#@test {"skip": true}

# tensorflow_federated_nightly also brings in tf_nightly, which
# can cause a duplicate tensorboard install, leading to errors.
!pip install --quiet tensorflow-text-nightly
!pip install --quiet --upgrade tensorflow-federated

In [ ]:

import collections

import numpy as np
import tensorflow as tf
import tensorflow_federated as tff
import tensorflow_text as tf_text

np.random.seed(0)
tff.backends.test.set_sync_test_cpp_execution_context()

tff.federated_computation(lambda: 'Hello, World!')()

b'Hello, World!'

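Background: Private Heavy Hitters in Federated Analytics

Consider the following setting: each client has a list of strings, and each string comes from an open set, which means it can be arbitrary. The goal is to privately discover the most popular strings (heavy hitters) and their counts in a federated setting. This colab demonstrates a solution to this problem with the following privacy properties:

Secure aggregation: Computes aggregated string counts such that the server cannot learn any client's individual values. See tff.federated_secure_sum for more information.
Differential privacy (DP): A widely used method for bounding and quantifying the privacy leakage of sensitive data in analytics. User-level central DP can be applied to the heavy hitter results.

The secure aggregation API tff.federated_secure_sum supports linear sums of integer vectors. If the strings come from a closed set of size n, encoding each client's strings as a vector of size n is easy: let the value at index i of the vector be the count of the i-th string in the closed set. Securely summing the vectors of all clients then yields the string counts for the whole population. If the strings come from an open set, however, it is not obvious how to encode them so that they can be securely summed. In that case, the strings can be encoded into Invertible Bloom Lookup Tables (IBLT), a probabilistic data structure that can efficiently encode items from a large (or open) domain. IBLT sketches can be summed linearly, so they are compatible with secure sum.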
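The closed-set encoding described above can be illustrated with a minimal sketch in plain Python. The vocabulary, the client data, and the helper name encode_counts are made up for illustration, and an ordinary sum stands in for tff.federated_secure_sum.

```python
# Hypothetical closed vocabulary of size n = 4 (not part of the tutorial data).
VOCAB = ['and', 'the', 'to', 'you']
INDEX = {word: i for i, word in enumerate(VOCAB)}


def encode_counts(words):
  """Encodes one client's strings as a length-n vector of counts."""
  vec = [0] * len(VOCAB)
  for word in words:
    vec[INDEX[word]] += 1
  return vec


# Three hypothetical clients, each holding a short list of strings.
client_words = [
    ['the', 'to', 'the'],
    ['you', 'and'],
    ['to', 'the'],
]
client_vectors = [encode_counts(words) for words in client_words]

# An element-wise sum stands in for the secure sum: the server would only
# ever see this total, never an individual client's vector.
total = [sum(column) for column in zip(*client_vectors)]
population_counts = dict(zip(VOCAB, total))
print(population_counts)
```

The scheme depends on every client agreeing on the same index for every string, which is exactly what is missing when the strings come from an open set.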

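tff.analytics.heavy_hitters.iblt.build_iblt_computation creates a TFF computation that encodes each client's local strings into an IBLT structure. These structures are securely summed, via a cryptographic secure multi-party computation protocol, into an aggregated IBLT structure that the server can decode. The server can then return the top heavy hitters. The following sections show how to use this API to create a TFF computation and run simulations with the Shakespeare dataset.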
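To make the "sketches can be summed linearly" property concrete, here is a toy IBLT in plain Python. It is a simplified illustration, not TFF's implementation: the table layout, the SHA-256 based hashing, the 16-byte word limit, and all function names are assumptions chosen for brevity. Each cell keeps a count, a sum of keys, and a sum of checksums, so adding two tables cell by cell gives the table of the combined data.

```python
import hashlib

NUM_HASHES = 3       # each key is stored in 3 cells, one per sub-table
CELLS_PER_HASH = 32  # number of cells in each sub-table
KEY_BYTES = 16       # toy limit on the UTF-8 length of a word
CHECK_SALT = 255     # salt reserved for the checksum hash


def _hash(key: int, salt: int) -> int:
  data = bytes([salt]) + key.to_bytes(KEY_BYTES, 'big')
  return int.from_bytes(hashlib.sha256(data).digest()[:8], 'big')


def _update(table, key: int, count: int):
  """Adds `count` copies of `key` to its cells (negative count removes)."""
  check = _hash(key, CHECK_SALT)
  for i in range(NUM_HASHES):
    cell = table[i * CELLS_PER_HASH + _hash(key, i) % CELLS_PER_HASH]
    cell[0] += count
    cell[1] += count * key
    cell[2] += count * check


def encode(words):
  """Builds one client's sketch: a list of [count, key_sum, check_sum]."""
  table = [[0, 0, 0] for _ in range(NUM_HASHES * CELLS_PER_HASH)]
  for word in words:
    _update(table, int.from_bytes(word.encode('utf-8'), 'big'), 1)
  return table


def add_sketches(a, b):
  """Cell-wise sum; this is the only operation the aggregation needs."""
  return [[x + y for x, y in zip(ca, cb)] for ca, cb in zip(a, b)]


def decode(table):
  """Peels cells that hold a single distinct key until none are left."""
  table = [cell[:] for cell in table]
  counts = {}
  progress = True
  while progress:
    progress = False
    for cell in table:
      count, key_sum, check_sum = cell
      if count <= 0 or key_sum % count:
        continue
      key = key_sum // count
      if check_sum != count * _hash(key, CHECK_SALT):
        continue  # the cell mixes several different keys
      word = key.to_bytes(KEY_BYTES, 'big').lstrip(b'\x00').decode('utf-8')
      counts[word] = count
      _update(table, key, -count)
      progress = True
  return counts


clients = [['the', 'to', 'the'], ['you', 'and'], ['to', 'the']]
merged = encode(clients[0])
for words in clients[1:]:
  merged = add_sketches(merged, encode(words))
decoded = decode(merged)
print(decoded)
```

Decoding is probabilistic: if too many distinct keys collide, no cell is left holding a single key and some strings stay undecoded, which is why the table has to be sized to the expected number of distinct strings.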

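Load and preprocess the federated Shakespeare data

The Shakespeare dataset contains the lines spoken by characters in Shakespeare's plays. In this example, a subset of the characters (that is, clients) is selected. A preprocessor converts each character's lines into a list of strings, and any string that consists only of punctuation or symbols is dropped.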

In [ ]:

# Load the simulation data.
source, _ = tff.simulation.datasets.shakespeare.load_data()

In [ ]:

# Preprocessing function to tokenize a line into words.
def tokenize(ds):
  """Tokenizes a line into words with alphanum characters."""
  def extract_strings(example):
    return tf.expand_dims(example['snippets'], 0)

  def tokenize_line(line):
    return tf.data.Dataset.from_tensor_slices(tokenizer.tokenize(line)[0])

  def mask_all_symbolic_words(word):
    return tf.math.logical_not(
        tf_text.wordshape(word, tf_text.WordShape.IS_PUNCT_OR_SYMBOL))

  tokenizer = tf_text.WhitespaceTokenizer()
  ds = ds.map(extract_strings)
  ds = ds.flat_map(tokenize_line)
  ds = ds.map(tf_text.case_fold_utf8)
  ds = ds.filter(mask_all_symbolic_words)
  return ds

batch_size = 5

def client_data(n: int) -> tf.data.Dataset:
  return tokenize(source.create_tf_dataset_for_client(
      source.client_ids[n])).batch(batch_size)

# Pick a subset of client devices to participate in the computation.
dataset = [client_data(n) for n in range(10)]
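For readers without tensorflow_text at hand, the three preprocessing steps (whitespace tokenization, case folding, dropping all-symbol tokens) can be approximated in plain Python. This is a rough analog under stated assumptions, not the tutorial's pipeline: the helper names are hypothetical, and the Unicode category check only approximates tf_text.WordShape.IS_PUNCT_OR_SYMBOL.

```python
import unicodedata


def is_punct_or_symbol(word: str) -> bool:
  """True if every character is Unicode punctuation (P*) or a symbol (S*)."""
  return all(unicodedata.category(ch)[0] in 'PS' for ch in word)


def tokenize_line(line: str) -> list:
  words = line.split()                    # split on whitespace only
  words = [w.casefold() for w in words]   # case folding
  return [w for w in words if not is_punct_or_symbol(w)]


tokens = tokenize_line('To be, or not to be -- that is the question!')
print(tokens)
```

Because tokenization splits on whitespace only, punctuation attached to a word (as in 'be,') stays part of the token; only tokens made up entirely of punctuation or symbols, such as '--', are removed.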

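Simulations

To run simulations that discover the most popular words (heavy hitters) in the Shakespeare dataset, first create a TFF computation using the tff.analytics.heavy_hitters.iblt.build_iblt_computation API with the following parameters:

capacity: The capacity of the IBLT sketch. This number should be roughly the total number of unique strings that could appear in one round of computation. Defaults to 1000. If this number is too small, decoding may fail because of hash collisions. If it is too large, the sketch consumes more memory than necessary.
string_max_bytes: The maximum length, in bytes, of a string in the IBLT. Defaults to 10. Must be positive. Strings longer than string_max_bytes are truncated.
max_words_per_user: The maximum number of strings each client is allowed to contribute. If not None, must be a positive integer. Defaults to None, which means every client contributes all of its strings.
max_heavy_hitters: The maximum number of items to return. If the decoded results contain more items than this, they are sorted in decreasing order of estimated count and the top max_heavy_hitters items are returned. Defaults to None, which means all heavy hitters are returned.
secure_sum_bitwidth: The bit width used for secure sum. Defaults to None, which disables secure sum. If not None, must be in the range [1, 62]. See tff.federated_secure_sum.
multi_contribution: Whether each client may contribute multiple counts or only a single count for each unique word. Defaults to True. This argument can improve utility when differential privacy is required.
batch_size: The number of elements in each batch of the dataset. Defaults to 1, which means the input dataset is processed by tf.data.Dataset.batch(1). Must be a positive integer.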
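The client-side effect of string_max_bytes, max_words_per_user, and multi_contribution can be sketched in plain Python. This is one possible reading of the parameter descriptions, not the library's code: the helper clip_client_words is hypothetical, and keeping words in first-come order until the cap is reached is an assumption (the library may select which words to keep differently).

```python
def clip_client_words(words, string_max_bytes=10, max_words_per_user=None,
                      multi_contribution=True):
  """Hypothetical helper: the {word: count} map one client would contribute."""
  counts = {}
  total = 0
  for word in words:
    # Truncate to at most string_max_bytes bytes of UTF-8, dropping any
    # character that the cut leaves incomplete.
    word = word.encode('utf-8')[:string_max_bytes].decode('utf-8', 'ignore')
    if not multi_contribution and word in counts:
      continue  # only one count per distinct word
    if max_words_per_user is not None and total >= max_words_per_user:
      break  # assumption: stop once the cap on the total count is reached
    counts[word] = counts.get(word, 0) + 1
    total += 1
  return counts


words = ['the', 'the', 'to', 'extraordinarily', 'and', 'you']
single = clip_client_words(words, string_max_bytes=5, max_words_per_user=3,
                           multi_contribution=False)
multi = clip_client_words(words, string_max_bytes=5, max_words_per_user=3,
                          multi_contribution=True)
print(single)
print(multi)
```

Under this reading, a client's total contribution never exceeds max_words_per_user, which is the bound the noise scale in the differential privacy section relies on.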

In [ ]:

max_words_per_user = 8
iblt_computation = tff.analytics.heavy_hitters.iblt.build_iblt_computation(
    capacity=100,
    string_max_bytes=20,
    max_words_per_user=max_words_per_user,
    max_heavy_hitters=10,
    secure_sum_bitwidth=32,
    multi_contribution=False,
    batch_size=batch_size)

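You are now ready to run simulations with the TFF computation iblt_computation and the preprocessed input dataset. The output of iblt_computation has four attributes:

clients: A scalar number of clients that participated in the computation.
heavy_hitters: A list of aggregated heavy hitters.
heavy_hitters_counts: A list of the counts of the aggregated heavy hitters.
num_not_decoded: A scalar number of strings that were not successfully decoded.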

In [ ]:

def run_simulation(one_round_computation: tff.Computation, dataset):
  output = one_round_computation(dataset)
  # Heavy hitters are returned as bytes; decode them for display.
  heavy_hitters = [
      word.decode('utf-8', 'ignore') for word in output.heavy_hitters]
  # Pair each decoded heavy hitter with its count.
  results = dict(zip(heavy_hitters, output.heavy_hitters_counts))
  return output.clients, results

In [ ]:

clients, result = run_simulation(iblt_computation, dataset)
print(f'Number of clients participated: {clients}')
print('Discovered heavy hitters and counts:')
print(result)

Number of clients participated: 10
Discovered heavy hitters and counts:
{'to': 8, 'the': 8, 'and': 7, 'you': 4, 'i': 4, 'a': 3, 'he': 3, 'your': 3, 'is': 3, 'of': 2}

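Private Heavy Hitters with Differential Privacy

To obtain private heavy hitters with central DP, a DP mechanism for open-set histograms is applied. The idea is to add noise to the string counts in the aggregated histogram and then keep only the strings whose noisy counts exceed a certain threshold. The noise and the threshold depend on the (epsilon, delta)-DP budget; see this doc for the detailed algorithm and proof. The noisy counts are rounded to integers as a post-processing step, which does not weaken the DP guarantee. Note that you will discover fewer heavy hitters when DP is required, because the thresholding step filters out strings with low counts.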

In [ ]:

iblt_computation = tff.analytics.heavy_hitters.iblt.build_iblt_computation(
    capacity=100,
    string_max_bytes=20,
    max_words_per_user=max_words_per_user,
    secure_sum_bitwidth=32,
    multi_contribution=False,
    batch_size=batch_size)

clients, result = run_simulation(iblt_computation, dataset)

In [ ]:

# DP parameters
eps = 20
delta = 0.01

# Calculating scale for Laplace noise
scale = max_words_per_user / eps

# Calculating the threshold
tau = 1 + (max_words_per_user / eps) * np.log(max_words_per_user / (2 * delta))

result_with_dp = {}
for word in result:
  noised_count = result[word] + np.random.laplace(scale=scale)
  if noised_count >= tau:
    result_with_dp[word] = int(noised_count)
print('Discovered heavy hitters and counts with central DP:')
print(result_with_dp)

Discovered heavy hitters and counts with central DP:
{'the': 8, 'you': 4, 'to': 7, 'tear': 3, 'and': 7, 'i': 3}
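As a worked check of the numbers used above, the snippet below recomputes the Laplace scale and the threshold with the tutorial's settings (max_words_per_user = 8, eps = 20, delta = 0.01) using only the standard library. It repeats the formulas from the DP cell and introduces nothing new.

```python
import math

max_words_per_user = 8
eps = 20
delta = 0.01

# Laplace noise scale: 8 / 20 = 0.4.
scale = max_words_per_user / eps

# Threshold: 1 + 0.4 * ln(8 / 0.02) = 1 + 0.4 * ln(400), roughly 3.40.
tau = 1 + scale * math.log(max_words_per_user / (2 * delta))

print(f'scale={scale:.2f}, tau={tau:.4f}')
```

With a threshold of about 3.4, a word whose true count is 2 survives only if it draws positive noise of at least about 1.4, which is why low-count words tend to drop out of the DP result.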