GitHub Repository: tensorflow/docs-l10n
Path: blob/master/site/ja/agents/tutorials/ranking_tutorial.ipynb

Kernel: Python 3

Copyright 2023 The TF-Agents Authors.

In [ ]:

#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

Tutorial on Ranking in TF-Agents


Setup

In [ ]:

!pip install tf-agents[reverb]

In [ ]:

#@title Imports
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf

from tf_agents.bandits.agents import ranking_agent
from tf_agents.bandits.agents.examples.v2 import trainer
from tf_agents.bandits.environments import ranking_environment
from tf_agents.bandits.networks import global_and_arm_feature_network
from tf_agents.environments import tf_py_environment
from tf_agents.bandits.policies import ranking_policy
from tf_agents.bandits.replay_buffers import bandit_replay_buffer
from tf_agents.drivers import dynamic_step_driver
from tf_agents.metrics import tf_metrics
from tf_agents.specs import bandit_spec_utils
from tf_agents.specs import tensor_spec
from tf_agents.trajectories import trajectory

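Introduction

This tutorial walks through the ranking algorithms implemented in the TF-Agents Bandits library. In a ranking problem, the agent is presented with a set of items in every iteration and is tasked with ranking some or all of them into a list. The ranking decision then receives some form of feedback (for example, a user clicks or does not click on one or more of the selected items). The agent's goal is to optimize some metric or reward, making better decisions over time.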

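Prerequisites

The ranking algorithms of TF-Agents belong to a special type of bandit agents that operate on "per-arm" bandit problems. To get the most out of this tutorial, readers should therefore be familiar with the bandit and per-arm bandit tutorials.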

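Diversity and Exploration

To maximize the chance of the user clicking on an item, it is not enough to choose the highest-scoring items and put them at the top of the ranking. Consider a user with varied interests: they might be most interested in sports, but they also like arts and traveling. Giving all the sports items the highest estimated scores and filling the top slots with sports items may not be optimal, because the user might be in the mood for arts or traveling. It is therefore a good idea to display a mix of the high-scoring interests. It is important not only to maximize the scores of the displayed items but also to make sure they form a diverse set.

As with other limited-information learning problems (such as bandits), it is also important to keep in mind that our decisions affect not only the immediate reward but also the training data and future reward. If we always show items based only on their current estimated score, we may miss high-scoring items that we have not explored enough yet, and so never learn how good they are. In other words, we need to incorporate exploration into our decision-making process.

All of the above concepts and considerations are addressed in our library. This tutorial walks through the details.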

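Simulating Users: Our Test Environment

Let's dive into the codebase!

First we define the environment, a class responsible for randomly generating user and item features and for giving feedback after decisions.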

In [ ]:

feedback_model = ranking_environment.FeedbackModel.CASCADING #@param["ranking_environment.FeedbackModel.SCORE_VECTOR", "ranking_environment.FeedbackModel.CASCADING"] {type:"raw"}

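We also need a model for the environment to decide when the user does not click. The library offers two ways to do this: distance based and ghost actions.

In the distance-based model, if the user features are not close enough to any of the item features, the user does not click.
In the ghost actions model, we set up extra imaginary actions in the form of unit-vector item features. If the user chooses one of the ghost actions, the result is a no-click.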
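
As a rough illustration of the two no-click rules above, the following NumPy sketch decides whether a simulated user clicks. It is not the library's implementation: the function names, the Euclidean distance, and the "pick the highest inner product" choice rule are all assumptions made for illustration.

```python
import numpy as np

def distance_based_click(user, items, threshold):
    """Hypothetical rule: click the nearest item, or None if every item is too far."""
    distances = np.linalg.norm(items - user, axis=1)
    nearest = int(np.argmin(distances))
    return nearest if distances[nearest] <= threshold else None

def ghost_action_click(user, items):
    """Hypothetical rule: append unit-vector ghost items; choosing one means no click."""
    ghosts = np.eye(items.shape[1], dtype=items.dtype)
    candidates = np.concatenate([items, ghosts], axis=0)
    chosen = int(np.argmax(candidates @ user))
    # Indices past the real items belong to ghost actions.
    return chosen if chosen < len(items) else None

user = np.array([1.0, 0.0, 0.0], dtype=np.float32)
items = np.array([[0.9, 0.1, 0.0],
                  [0.0, 5.0, 0.0]], dtype=np.float32)

near_click = distance_based_click(user, items, threshold=0.5)        # item 0 is close enough
far_click = distance_based_click(user, items + 10.0, threshold=0.5)  # nothing is close
ghost_click = ghost_action_click(user, items)                        # a ghost outscores item 0
```

In this toy setup the ghost action aligned with the user's first feature scores 1.0, which beats the best real item (0.9), so the ghost model reports a no-click even though a reasonably relevant item was shown.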

In [ ]:

click_type = "ghost_actions"  #@param["distance_based", "ghost_actions"]
click_model = (ranking_environment.ClickModel.DISTANCE_BASED
               if click_type == "distance_based" else
               ranking_environment.ClickModel.GHOST_ACTIONS)

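We are almost ready to define the ranking environment; only a couple of preparations remain. We define the sampling functions for the global (user) features and the item features. The environment uses these features to simulate user behavior: it calculates a weighted inner product of the global and item features, and the probability of the user clicking is proportional to the inner product values. The weighting of the inner product is defined by the scores_weight_matrix below.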

In [ ]:

global_dim = 9  #@param{ type: "integer"}
item_dim   = 11  #@param{ type: "integer"}
num_items  = 50 #@param{ type: "integer"}
num_slots  = 3  #@param{ type: "integer"}
distance_threshold = 5.0  #@param{ type: "number" }
batch_size = 128   #@param{ type: "integer"}

def global_sampling_fn():
  return np.random.randint(-1, 1, [global_dim]).astype(np.float32)

def item_sampling_fn():
  return np.random.randint(-2, 3, [item_dim]).astype(np.float32)

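# Inner product with excess dimensions ignored. Using the dimension parameters
# instead of hard-coded sizes keeps the matrix valid if the form values change.
scores_weight_matrix = np.eye(item_dim, global_dim, dtype=np.float32)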

env = ranking_environment.RankingPyEnvironment(
    global_sampling_fn,
    item_sampling_fn,
    num_items=num_items,
    num_slots=num_slots,
    scores_weight_matrix=scores_weight_matrix,
    feedback_model=feedback_model,
    click_model=click_model,
    distance_threshold=distance_threshold,
    batch_size=batch_size)

# Convert the python environment to tf environment.
environment = tf_py_environment.TFPyEnvironment(env)
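
The weighted inner product used by the environment can be sketched in plain NumPy. This is an illustration only, not the environment's internal code; the variable names are assumptions. It shows what the code comment "excess dimensions ignored" means: with a rectangular identity as the weight matrix, the score equals the ordinary dot product over the first global_dim item dimensions.

```python
import numpy as np

global_dim, item_dim = 9, 11
rng = np.random.default_rng(0)

# Same value ranges as the sampling functions in the cell above.
global_features = rng.integers(-1, 1, size=global_dim).astype(np.float32)
item_features = rng.integers(-2, 3, size=item_dim).astype(np.float32)

# Rectangular identity: pairs global dimension i with item dimension i.
weights = np.eye(item_dim, global_dim, dtype=np.float32)

# Weighted inner product: item^T * W * global.
weighted_score = item_features @ weights @ global_features

# The last (item_dim - global_dim) item dimensions do not contribute.
truncated_dot = item_features[:global_dim] @ global_features
```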

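Now let's define a few different agents that will tackle the above environment! All of the agents train a network that estimates scores of item/user pairs. The difference lies in the policy, that is, the way the trained network is used to make a ranking decision. The implemented policies range from plain stack ranking based on scores to policies that take diversity and exploration into account, with the ability to tune the mixture of these aspects.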

In [ ]:

#@title Defining the Network and Training Params
scoring_network = (
      global_and_arm_feature_network.create_feed_forward_common_tower_network(
          environment.observation_spec(), (20, 10), (20, 10), (20, 10)))
learning_rate = 0.005  #@param{ type: "number"}

feedback_dict = {ranking_environment.FeedbackModel.CASCADING: ranking_agent.FeedbackModel.CASCADING,
                 ranking_environment.FeedbackModel.SCORE_VECTOR: ranking_agent.FeedbackModel.SCORE_VECTOR}
agent_feedback_model = feedback_dict[feedback_model]

In [ ]:

#@title Stack Ranking Deterministically by Scores

policy_type = ranking_agent.RankingPolicyType.DESCENDING_SCORES
descending_scores_agent = ranking_agent.RankingAgent(
    time_step_spec=environment.time_step_spec(),
    action_spec=environment.action_spec(),
    scoring_network=scoring_network,
    optimizer=tf.compat.v1.train.AdamOptimizer(learning_rate=learning_rate),
    feedback_model=agent_feedback_model,
    policy_type=policy_type,
    summarize_grads_and_vars=True)
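
Conceptually, the deterministic stack ranking named in the cell title above amounts to sorting items by estimated score and keeping the top slots. The sketch below shows that idea in NumPy; the function name is hypothetical and this is not the policy's actual code.

```python
import numpy as np

def rank_by_descending_scores(scores, num_slots):
    """Returns the indices of the num_slots highest-scoring items per batch row."""
    order = np.argsort(-scores, axis=-1, kind="stable")
    return order[..., :num_slots]

# Two users (batch rows), four candidate items each.
scores = np.array([[0.1, 0.9, 0.4, 0.7],
                   [0.5, 0.2, 0.8, 0.3]])
ranking = rank_by_descending_scores(scores, num_slots=3)
```

Because the result depends only on the current score estimates, this rule never explores and never accounts for diversity.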

In [ ]:

#@title Sampling Sequentially Based on Scores

policy_type = ranking_agent.RankingPolicyType.NO_PENALTY
logits_temperature = 1.0  #@param{ type: "number" }

no_penalty_agent = ranking_agent.RankingAgent(
    time_step_spec=environment.time_step_spec(),
    action_spec=environment.action_spec(),
    scoring_network=scoring_network,
    optimizer=tf.compat.v1.train.AdamOptimizer(learning_rate=learning_rate),
    feedback_model=agent_feedback_model,
    policy_type=policy_type,
    logits_temperature=logits_temperature,
    summarize_grads_and_vars=True)
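
Sampling sequentially based on scores can be pictured as repeatedly drawing an item from a softmax over the remaining items' scores. The sketch below is a minimal illustration under that assumption; the function name is hypothetical, and dividing the scores by the temperature before the softmax is an assumption about how a temperature parameter is typically applied, not a statement about the library's internals.

```python
import numpy as np

def sample_ranking(scores, num_slots, temperature=1.0, rng=None):
    """Draws num_slots distinct items, one slot at a time, from softmax(scores / T)."""
    rng = np.random.default_rng() if rng is None else rng
    logits = np.asarray(scores, dtype=np.float64) / temperature
    available = np.ones(len(logits), dtype=bool)
    ranking = []
    for _ in range(num_slots):
        # Already-placed items get probability zero.
        masked = np.where(available, logits, -np.inf)
        probs = np.exp(masked - masked.max())
        probs /= probs.sum()
        choice = int(rng.choice(len(logits), p=probs))
        ranking.append(choice)
        available[choice] = False
    return ranking

rng = np.random.default_rng(0)
scores = [0.1, 0.9, 0.4, 0.7]
warm_ranking = sample_ranking(scores, num_slots=3, temperature=1.0, rng=rng)
cold_ranking = sample_ranking(scores, num_slots=3, temperature=1e-3, rng=rng)
```

At a temperature of 1.0 lower-scoring items still have a real chance of being shown, which provides exploration. As the temperature approaches zero the sampling collapses to the descending-score order.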

In [ ]:

#@title Sampling Sequentially and Taking Diversity into Account
#@markdown The balance between ranking based on scores and taking diversity into account is governed by the following "penalty mixture" parameter. A low positive value results in rankings that hardly mix in diversity, a higher value will enforce more diversity.

policy_type = ranking_agent.RankingPolicyType.COSINE_DISTANCE
penalty_mixture = 1.0 #@param{ type: "number"}

cosine_distance_agent = ranking_agent.RankingAgent(
    time_step_spec=environment.time_step_spec(),
    action_spec=environment.action_spec(),
    scoring_network=scoring_network,
    optimizer=tf.compat.v1.train.AdamOptimizer(learning_rate=learning_rate),
    feedback_model=agent_feedback_model,
    policy_type=policy_type,
    logits_temperature=logits_temperature,
    penalty_mixture_coefficient=penalty_mixture,
    summarize_grads_and_vars=True)
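
To see how a penalty mixture can trade score against diversity, consider the sketch below. It is a simplified, deterministic illustration: it picks greedily instead of sampling, and the penalty (mixture coefficient times cosine similarity to each item already chosen) is an assumed formula that may differ from the library's. The function name and the toy features are hypothetical.

```python
import numpy as np

def diverse_greedy_ranking(scores, item_features, num_slots, penalty_mixture):
    """Greedy ranking that penalizes items similar to those already chosen."""
    features = np.asarray(item_features, dtype=np.float64)
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.maximum(norms, 1e-12)
    adjusted = np.asarray(scores, dtype=np.float64).copy()
    ranking = []
    for _ in range(num_slots):
        choice = int(np.argmax(adjusted))
        ranking.append(choice)
        # Lower the score of items pointing in a similar direction to the pick.
        adjusted -= penalty_mixture * (unit @ unit[choice])
        adjusted[choice] = -np.inf  # never pick the same item twice
    return ranking

scores = np.array([1.0, 0.95, 0.6])
item_features = np.array([[1.0, 0.0],   # a sports item
                          [1.0, 0.1],   # a near-duplicate sports item
                          [0.0, 1.0]])  # a travel item

score_only = diverse_greedy_ranking(scores, item_features, 2, penalty_mixture=0.0)
diverse = diverse_greedy_ranking(scores, item_features, 2, penalty_mixture=1.0)
```

With no penalty the two sports items fill both slots. With a penalty of 1.0 the second slot goes to the travel item, even though its raw score is lower.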

In [ ]:

#@title Choosing the desired agent.
agent_type = "cosine_distance_agent" #@param["cosine_distance_agent", "no_penalty_agent", "descending_scores_agent"]
if agent_type == "descending_scores_agent":
  agent = descending_scores_agent
elif agent_type == "no_penalty_agent":
  agent = no_penalty_agent
else:
  agent = cosine_distance_agent

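Before we can start the training loop, there is one more thing to take care of, concerning the training data.

The arm features presented to the policy at decision time contain all items that the policy can choose from. At training time, however, we need the features of the items that were selected and, for convenience, in the order of the decision output. The following function is used for this purpose (it is copied here for clarity).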

In [ ]:

def order_items_from_action_fn(orig_trajectory):
  """Puts the features of the selected items in the recommendation order.

  This function is used to make sure that at training the item observation is
  filled with features of items selected by the policy, in the order of the
  selection. Features of unselected items are discarded.

  Args:
    orig_trajectory: The trajectory as output by the policy

  Returns:
    The modified trajectory that contains slotted item features.
  """
  item_obs = orig_trajectory.observation[
      bandit_spec_utils.PER_ARM_FEATURE_KEY]
  action = orig_trajectory.action
  if isinstance(
      orig_trajectory.observation[bandit_spec_utils.PER_ARM_FEATURE_KEY],
      tensor_spec.TensorSpec):
    dtype = orig_trajectory.observation[
        bandit_spec_utils.PER_ARM_FEATURE_KEY].dtype
    shape = [
        num_slots, orig_trajectory.observation[
            bandit_spec_utils.PER_ARM_FEATURE_KEY].shape[-1]
    ]
    new_observation = {
        bandit_spec_utils.GLOBAL_FEATURE_KEY:
            orig_trajectory.observation[bandit_spec_utils.GLOBAL_FEATURE_KEY],
        bandit_spec_utils.PER_ARM_FEATURE_KEY:
            tensor_spec.TensorSpec(dtype=dtype, shape=shape)
    }
  else:
    slotted_items = tf.gather(item_obs, action, batch_dims=1)
    new_observation = {
        bandit_spec_utils.GLOBAL_FEATURE_KEY:
            orig_trajectory.observation[bandit_spec_utils.GLOBAL_FEATURE_KEY],
        bandit_spec_utils.PER_ARM_FEATURE_KEY:
            slotted_items
    }
  return trajectory.Trajectory(
      step_type=orig_trajectory.step_type,
      observation=new_observation,
      action=(),
      policy_info=(),
      next_step_type=orig_trajectory.next_step_type,
      reward=orig_trajectory.reward,
      discount=orig_trajectory.discount)
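
The key step in the function above is tf.gather(item_obs, action, batch_dims=1), which picks, for every batch element, the feature rows of the selected items in ranking order. The following NumPy sketch reproduces that indexing on a tiny example; the array sizes and values are made up for illustration.

```python
import numpy as np

batch_size, num_items, item_dim, num_slots = 2, 4, 3, 2

# item_obs[b, i] holds the feature vector of item i for batch element b.
item_obs = np.arange(batch_size * num_items * item_dim, dtype=np.float32).reshape(
    batch_size, num_items, item_dim)

# action[b, s] is the index of the item placed in slot s for batch element b.
action = np.array([[3, 0],
                   [1, 2]])

# NumPy analogue of tf.gather(item_obs, action, batch_dims=1).
slotted_items = np.take_along_axis(item_obs, action[:, :, None], axis=1)
```

The result has shape [batch_size, num_slots, item_dim]: features of unselected items are dropped, and the remaining rows follow the order of the recommendation.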

In [ ]:

#@title Defining Parameters to Run the Agent on the Defined Environment
num_iterations = 400 #@param{ type: "integer" }
steps_per_loop = 2   #@param{ type: "integer" }

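As in the bandit tutorials, we define a replay buffer that feeds the agent the samples to train on. Then we use a driver to put everything together: the environment provides features, the policy chooses rankings, and the samples are collected for training.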

In [ ]:

replay_buffer = bandit_replay_buffer.BanditReplayBuffer(
      data_spec=order_items_from_action_fn(agent.policy.trajectory_spec),
      batch_size=batch_size,
      max_length=steps_per_loop)

if feedback_model == ranking_environment.FeedbackModel.SCORE_VECTOR:
  reward_metric = tf_metrics.AverageReturnMetric(
      batch_size=environment.batch_size,
      buffer_size=200)
else:
  reward_metric = tf_metrics.AverageReturnMultiMetric(
        reward_spec=environment.reward_spec(),
        batch_size=environment.batch_size,
        buffer_size=200)

add_batch_fn = lambda data: replay_buffer.add_batch(
        order_items_from_action_fn(data))

observers = [add_batch_fn, reward_metric]

driver = dynamic_step_driver.DynamicStepDriver(
    env=environment,
    policy=agent.collect_policy,
    num_steps=steps_per_loop * batch_size,
    observers=observers)

reward_values = []

for _ in range(num_iterations):
  driver.run()
  loss_info = agent.train(replay_buffer.gather_all())
  replay_buffer.clear()
  # Both metric types expose result(), so no branching on the feedback model is needed.
  reward_values.append(reward_metric.result())

Let's plot the reward!

In [ ]:

if feedback_model == ranking_environment.FeedbackModel.SCORE_VECTOR:
  reward = reward_values
else:
  reward = [r["chosen_value"] for r in reward_values]
plt.plot(reward)
plt.ylabel('Average Return')
plt.xlabel('Number of Iterations')

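What's Next?

This tutorial has many tunable parameters, including the policy/agent to use, some properties of the environment, and even the feedback model. Feel free to experiment with them!

There is also a ready-to-run ranking example in tf_agents/bandits/agents/examples/v2/train_eval_ranking.py.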
