
Copyright 2022 The TensorFlow Authors.

In [ ]:

#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

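Private Heavy Hitters

Note: This Colab has been verified to work with the latest released version of the tensorflow_federated pip package, but it may not work against main.

This tutorial demonstrates how to use the tff.analytics.heavy_hitters.iblt.build_iblt_computation API to build a federated analytics computation that discovers the most frequent strings (private heavy hitters) in a population.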

Environment setup

Run the following to make sure that your environment is set up correctly. If it does not work, see the Installation guide for instructions.

In [ ]:

#@test {"skip": true}

# tensorflow_federated_nightly also brings in tf_nightly, which
# can cause a duplicate tensorboard install, leading to errors.
!pip install --quiet tensorflow-text-nightly
!pip install --quiet --upgrade tensorflow-federated

In [ ]:

import collections

import numpy as np
import tensorflow as tf
import tensorflow_federated as tff
import tensorflow_text as tf_text

np.random.seed(0)
tff.backends.test.set_sync_test_cpp_execution_context()

tff.federated_computation(lambda: 'Hello, World!')()

b'Hello, World!'

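Background: Private Heavy Hitters in Federated Analytics

Consider the following setting: each client has a list of strings, and each string comes from an open set, which means it could be arbitrary. The goal is to discover the most frequent strings (heavy hitters) and their counts privately in a federated setting. This colab demonstrates a solution to this problem with the following privacy properties:

Secure aggregation: Computes the aggregated string counts such that the server cannot learn any client's individual value. See tff.federated_secure_sum for more information.
Differential privacy (DP): A widely used method for bounding and quantifying the privacy leakage of sensitive data in analytics. You can apply user-level central DP to the heavy hitter result.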

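The secure aggregation API tff.federated_secure_sum supports linear sums of integer vectors. If the strings come from a closed set of size n, it is easy to encode each client's strings as a vector of size n: let the value at index i of the vector be the count of the i-th string in the closed set. You can then securely sum the vectors of all clients to get the string counts for the whole population. However, if the strings come from an open set, it is not obvious how to encode them so that they can be securely summed. In this case, you can encode the strings into Invertible Bloom Lookup Tables (IBLT), a probabilistic data structure that can efficiently encode items from a large (or open) domain. IBLT sketches can be linearly summed, so they are compatible with secure sum.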
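The closed-set encoding described above can be sketched in plain Python. This is only an illustration: the vocabulary and client data are made up, and an ordinary element-wise sum stands in for the result that a secure sum would reveal to the server.

```python
from collections import Counter

# Hypothetical closed vocabulary of size n = 4.
# Index i of each vector holds the count of the i-th string.
VOCAB = ["apple", "banana", "cherry", "date"]
INDEX = {word: i for i, word in enumerate(VOCAB)}


def encode_counts(strings):
    """Encodes one client's strings as a length-n integer count vector."""
    vector = [0] * len(VOCAB)
    for word, count in Counter(strings).items():
        vector[INDEX[word]] = count
    return vector


# Made-up data for three clients.
client_strings = [
    ["apple", "banana", "apple"],
    ["banana", "date"],
    ["apple", "cherry", "banana"],
]
client_vectors = [encode_counts(s) for s in client_strings]

# Plain element-wise sum; a secure sum would produce the same total
# without exposing any individual client vector to the server.
aggregate = [sum(column) for column in zip(*client_vectors)]
totals = dict(zip(VOCAB, aggregate))
print(totals)
```

The encoding only works because every client agrees on the same index for every string, which is exactly what an open set does not provide.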

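You can use tff.analytics.heavy_hitters.iblt.build_iblt_computation to create a TFF computation that encodes each client's local strings into an IBLT structure. These structures are securely summed, via a cryptographic secure multi-party computation protocol, into an aggregated IBLT structure that the server can decode. The server then returns the top heavy hitters. The following sections show how to use this API to create a TFF computation and run simulations with the Shakespeare dataset.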
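To make the encode, sum, and decode pipeline more concrete, here is a toy IBLT in pure Python. It is not TFF's implementation: the table layout, the hash construction, the sizes, and all function names are assumptions chosen for readability. It does show the two properties the text relies on: tables from different clients combine by element-wise addition, and the combined table can be decoded by repeatedly "peeling" cells that hold a single key.

```python
import hashlib

NUM_HASHES = 3        # each key is written to 3 cells
CELLS_PER_HASH = 50   # hypothetical sub-table size; 150 cells in total


def _hash(key_int, salt):
    """Deterministic 64-bit hash of an integer key under a given salt."""
    data = f"{salt}:{key_int}".encode("utf-8")
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")


def _cells(key_int):
    """One cell per sub-table, so the cells of a key are always distinct."""
    return [h * CELLS_PER_HASH + _hash(key_int, h) % CELLS_PER_HASH
            for h in range(NUM_HASHES)]


def _update(table, key_int, count):
    """Adds `count` copies of a key (negative counts remove copies)."""
    for c in _cells(key_int):
        table[c][0] += count                           # count
        table[c][1] += count * key_int                 # key sum
        table[c][2] += count * _hash(key_int, "check")  # checksum sum


def encode(strings):
    """Client side: encodes a list of strings into a fresh table."""
    table = [[0, 0, 0] for _ in range(NUM_HASHES * CELLS_PER_HASH)]
    for s in strings:
        _update(table, int.from_bytes(s.encode("utf-8"), "big"), 1)
    return table


def merge(tables):
    """Element-wise sum, the only operation a secure sum has to provide."""
    return [[sum(vals) for vals in zip(*cells)] for cells in zip(*tables)]


def decode(table):
    """Server side: peels cells that contain exactly one distinct key."""
    table = [cell[:] for cell in table]
    result = {}
    progress = True
    while progress:
        progress = False
        for count, key_sum, check_sum in table:
            if count <= 0 or key_sum % count:
                continue
            key_int = key_sum // count
            if check_sum != count * _hash(key_int, "check"):
                continue  # the cell mixes several keys
            size = (key_int.bit_length() + 7) // 8
            result[key_int.to_bytes(size, "big").decode("utf-8")] = count
            _update(table, key_int, -count)
            progress = True
    return result


# Made-up data for three clients.
clients = [["to", "be", "or", "not", "to", "be"],
           ["to", "the", "king"],
           ["the", "king", "is", "not"]]
all_words = [w for c in clients for w in c]
merged = merge([encode(c) for c in clients])
decoded = decode(merged)
print(sorted(decoded.items(), key=lambda kv: -kv[1]))
```

When too many distinct keys share cells, peeling stalls and some items stay undecoded, which is the failure mode that an undersized table produces.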

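Load and preprocess the federated Shakespeare data

The Shakespeare dataset contains the lines spoken by characters in Shakespeare's plays. In this example, a subset of characters (that is, clients) is selected. A preprocessor converts each character's lines into a list of strings and drops any string that consists only of punctuation or symbols.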
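As a rough illustration of that preprocessing, the sketch below applies the same three steps (whitespace tokenization, case folding, dropping tokens made up entirely of punctuation or symbols) using only the standard library. It approximates the behaviour with Unicode general categories and is an assumption about the semantics, not a drop-in replacement for the tensorflow_text operations; the function names are hypothetical.

```python
import unicodedata


def is_punct_or_symbol(token):
    """True when every character is Unicode punctuation (P*) or symbol (S*)."""
    return all(unicodedata.category(ch)[0] in ("P", "S") for ch in token)


def preprocess(line):
    """Splits on whitespace, case-folds, and drops purely symbolic tokens."""
    tokens = [t.casefold() for t in line.split()]
    return [t for t in tokens if not is_punct_or_symbol(t)]


words = preprocess("To be , or not to be -- that is the Question !")
print(words)
```

Note that punctuation attached to a word survives: under whitespace tokenization "Hello," becomes the token "hello,", because only tokens that are entirely symbolic are removed.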

In [ ]:

# Load the simulation data.
source, _ = tff.simulation.datasets.shakespeare.load_data()

In [ ]:

# Preprocessing function to tokenize a line into words.
def tokenize(ds):
  """Tokenizes a line into words with alphanum characters."""
  def extract_strings(example):
    return tf.expand_dims(example['snippets'], 0)

  def tokenize_line(line):
    return tf.data.Dataset.from_tensor_slices(tokenizer.tokenize(line)[0])

  def mask_all_symbolic_words(word):
    return tf.math.logical_not(
        tf_text.wordshape(word, tf_text.WordShape.IS_PUNCT_OR_SYMBOL))

  tokenizer = tf_text.WhitespaceTokenizer()
  ds = ds.map(extract_strings)
  ds = ds.flat_map(tokenize_line)
  ds = ds.map(tf_text.case_fold_utf8)
  ds = ds.filter(mask_all_symbolic_words)
  return ds

batch_size = 5

def client_data(n: int) -> tf.data.Dataset:
  return tokenize(source.create_tf_dataset_for_client(
      source.client_ids[n])).batch(batch_size)

# Pick a subset of client devices to participate in the computation.
dataset = [client_data(n) for n in range(10)]

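Simulations

To run a simulation that discovers the most frequent words (heavy hitters) in the Shakespeare dataset, first create a TFF computation using the tff.analytics.heavy_hitters.iblt.build_iblt_computation API with the following parameters:

capacity: The capacity of the IBLT sketch. This number should be roughly the total number of unique strings that could appear in one round of computation. Defaults to 1000. If this number is too small, decoding could fail due to hash collisions. If it is too large, the sketch consumes more memory than necessary.
string_max_bytes: The maximum length of a string in the IBLT, in bytes. Defaults to 10. Must be positive. Strings longer than string_max_bytes are truncated.
max_words_per_user: The maximum number of strings each client is allowed to contribute. If not None, must be a positive integer. Defaults to None, which means all clients contribute all of their strings.
max_heavy_hitters: The maximum number of items to return. If the decoded result has more than this number of items, they are sorted in descending order of estimated count and the top max_heavy_hitters items are returned. Defaults to None, which means all heavy hitters in the result are returned.
secure_sum_bitwidth: The bitwidth used for secure sum. Defaults to None, which disables secure sum. If not None, must be in the range [1,62]. See tff.federated_secure_sum.
multi_contribution: Whether each client is allowed to contribute multiple counts, or only one count, for each unique word. Defaults to True. Setting this argument may improve utility when differential privacy is required.
batch_size: The number of elements in each batch of the dataset. Defaults to 1, which means the input dataset is processed by tf.data.Dataset.batch(1). Must be a positive integer.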
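The effect of string_max_bytes, max_words_per_user, and multi_contribution on a single client's contribution can be sketched in plain Python. The helper below is hypothetical and is not how TFF implements these limits. In particular, keeping words in first-seen order until the cap is reached is an assumption made for the example; the library may select words differently.

```python
from collections import Counter


def limit_client_contribution(words, string_max_bytes=10,
                              max_words_per_user=None,
                              multi_contribution=True):
    """Illustrative per-client limits; not the TFF implementation."""
    contribution = Counter()
    for word in words:
        # Assumption: stop once the client's total count reaches the cap.
        if (max_words_per_user is not None
                and sum(contribution.values()) >= max_words_per_user):
            break
        # Truncate to at most string_max_bytes UTF-8 bytes.
        key = word.encode("utf-8")[:string_max_bytes].decode("utf-8", "ignore")
        # With multi_contribution=False each unique word counts only once.
        if not multi_contribution and key in contribution:
            continue
        contribution[key] += 1
    return dict(contribution)


words = ["the", "the", "tragedy", "of", "the", "prince", "of", "denmark"]
unlimited = limit_client_contribution(words)
limited = limit_client_contribution(
    words, string_max_bytes=5, max_words_per_user=4, multi_contribution=False)
print(unlimited)
print(limited)
```

Bounding what any one client can add to the aggregate is what later allows the noise scale for differential privacy to be calibrated to max_words_per_user.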

In [ ]:

max_words_per_user = 8
iblt_computation = tff.analytics.heavy_hitters.iblt.build_iblt_computation(
    capacity=100,
    string_max_bytes=20,
    max_words_per_user=max_words_per_user,
    max_heavy_hitters=10,
    secure_sum_bitwidth=32,
    multi_contribution=False,
    batch_size=batch_size)

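You are now ready to run simulations with the TFF computation iblt_computation and the preprocessed input dataset. The output of iblt_computation has four attributes:

clients: A scalar number of clients that participated in the computation.
heavy_hitters: A list of the aggregated heavy hitters.
heavy_hitters_counts: A list of the counts of the aggregated heavy hitters.
num_not_decoded: A scalar number of strings that were not successfully decoded.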

In [ ]:

def run_simulation(one_round_computation: tff.Computation, dataset):
  """Runs one round and returns (number of clients, {word: count})."""
  output = one_round_computation(dataset)
  # Heavy hitters come back as bytes; decode them into Python strings.
  heavy_hitters = [
      word.decode('utf-8', 'ignore') for word in output.heavy_hitters]
  results = dict(zip(heavy_hitters, output.heavy_hitters_counts))
  return output.clients, results

In [ ]:

clients, result = run_simulation(iblt_computation, dataset)
print(f'Number of clients participated: {clients}')
print('Discovered heavy hitters and counts:')
print(result)

Number of clients participated: 10
Discovered heavy hitters and counts:
{'to': 8, 'the': 8, 'and': 7, 'you': 4, 'i': 4, 'a': 3, 'he': 3, 'your': 3, 'is': 3, 'of': 2}

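Private Heavy Hitters with Differential Privacy

To obtain private heavy hitters with central DP, a DP mechanism is applied to the open-set histogram. The idea is to add noise to the string counts in the aggregated histogram and then keep only the strings whose noisy counts are above a certain threshold. The noise and the threshold depend on the (epsilon, delta)-DP budget; see this document for the detailed algorithm and proof. The noisy counts are rounded to integers as a post-processing step, which does not weaken the DP guarantee. Note that you will discover fewer heavy hitters when DP is required, because the thresholding step filters out strings with low counts.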
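As a worked example of how the budget translates into noise and a threshold, the sketch below evaluates the same formulas that this tutorial's DP cell uses, with the same parameter values (epsilon = 20, delta = 0.01, max_words_per_user = 8). It relies on the standard library only; the helper names are hypothetical, and the Laplace sample is drawn as the difference of two exponential samples rather than with np.random.laplace.

```python
import math
import random


def dp_noise_and_threshold(max_words_per_user, eps, delta):
    """Returns (Laplace scale, threshold tau) for the given budget."""
    scale = max_words_per_user / eps
    tau = 1 + scale * math.log(max_words_per_user / (2 * delta))
    return scale, tau


def laplace(scale, rng):
    """Laplace(0, scale) as the difference of two exponential samples."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def privatize(counts, max_words_per_user, eps, delta, rng):
    """Adds noise to each count and keeps only counts at or above tau."""
    scale, tau = dp_noise_and_threshold(max_words_per_user, eps, delta)
    kept = {}
    for word, count in counts.items():
        noised = count + laplace(scale, rng)
        if noised >= tau:
            kept[word] = int(noised)  # rounding is post-processing
    return kept


scale, tau = dp_noise_and_threshold(max_words_per_user=8, eps=20, delta=0.01)
print(f"scale = {scale:.2f}, tau = {tau:.2f}")

rng = random.Random(0)
private_counts = privatize({"to": 1000, "rare": 0}, 8, 20, 0.01, rng)
print(private_counts)
```

With these values the scale is 0.4 and the threshold is about 3.40, so a word with a true count of 2 or less needs an unusually large positive noise draw to survive the filter.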

In [ ]:

iblt_computation = tff.analytics.heavy_hitters.iblt.build_iblt_computation(
    capacity=100,
    string_max_bytes=20,
    max_words_per_user=max_words_per_user,
    secure_sum_bitwidth=32,
    multi_contribution=False,
    batch_size=batch_size)

clients, result = run_simulation(iblt_computation, dataset)

In [ ]:

# DP parameters
eps = 20
delta = 0.01

# Calculating scale for Laplace noise
scale = max_words_per_user / eps

# Calculating the threshold
tau = 1 + (max_words_per_user / eps) * np.log(max_words_per_user / (2 * delta))

result_with_dp = {}
for word in result:
  noised_count = result[word] + np.random.laplace(scale=scale)
  if noised_count >= tau:
    result_with_dp[word] = int(noised_count)
print(f'Discovered heavy hitters and counts with central DP:')
print(result_with_dp)

Discovered heavy hitters and counts with central DP:
{'the': 8, 'you': 4, 'to': 7, 'tear': 3, 'and': 7, 'i': 3}