GitHub Repository: tensorflow/docs-l10n
Path: blob/master/site/ko/agents/tutorials/ranking_tutorial.ipynb

Kernel: Python 3

Copyright 2023 The TF-Agents Authors.

In [ ]:

#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

A Tutorial on Ranking in TF-Agents

Get Started

Setup

In [ ]:

!pip install tf-agents[reverb]

In [ ]:

#@title Imports
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf

from tf_agents.bandits.agents import ranking_agent
from tf_agents.bandits.agents.examples.v2 import trainer
from tf_agents.bandits.environments import ranking_environment
from tf_agents.bandits.networks import global_and_arm_feature_network
from tf_agents.environments import tf_py_environment
from tf_agents.bandits.policies import ranking_policy
from tf_agents.bandits.replay_buffers import bandit_replay_buffer
from tf_agents.drivers import dynamic_step_driver
from tf_agents.metrics import tf_metrics
from tf_agents.specs import bandit_spec_utils
from tf_agents.specs import tensor_spec
from tf_agents.trajectories import trajectory

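Introduction

This tutorial walks through the ranking algorithms implemented as part of the TF-Agents Bandits library. In every iteration of a ranking problem, the agent is presented with a set of items and is tasked with ranking some or all of them into a list. The ranking decision then receives some form of feedback (for example, a user may or may not click on one or more of the selected items). The agent's goal is to optimize some metric/reward so that it makes better decisions over time.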

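Prerequisites

The ranking algorithms in TF-Agents belong to a special type of bandit agents that operate on "per-arm" bandit problems. To get the most out of this tutorial, readers should therefore first familiarize themselves with the bandit and per-arm bandit tutorials.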

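The Ranking Problem and Its Variants

This tutorial uses the example of presenting items for sale to users. In every iteration, we receive a set of items and a number describing how many of them we should display. We assume that the number of items at hand is always greater than or equal to the number of slots to place them in. We need to fill the slots of the display so as to maximize the probability that the user interacts with one or more of the displayed items. Both the user and the items are described by features.

If we manage to display items that the user likes, the probability of a user/item interaction increases. It is therefore a good idea to learn how well user-item pairs match. But how do we know whether the user likes an item? To answer this, we introduce feedback types.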

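Feedback Types

Unlike bandit problems, where the feedback signal (the reward) is directly associated with a single chosen item, in ranking we need to consider how the feedback translates into the "goodness" of the displayed items. In other words, we need to assign scores to all or some of the displayed items. Our library offers two feedback types: vector feedback and cascading feedback.

Vector Feedback

In the vector feedback type, we assume that the agent receives a scalar score for every item in the output ranking. These scalars are put together in a vector, in the same order as the output ranking. The feedback is thus a vector whose size equals the number of elements in the ranking.

This feedback type is quite straightforward in the sense that we do not need to worry about converting feedback signals into scores. On the other hand, the responsibility of scoring items falls on the designer (that is, you): it is up to the system designer to decide what score to give based on the item, its position, and whether the user interacted with it.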
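
As a minimal sketch of what such a designer-defined scoring rule might look like, the hypothetical helper below emits one scalar per displayed slot, in ranking order: the click value for clicked slots and a fixed low score for the rest. The function name, its parameters, and the rule itself are illustrative assumptions and are not part of the library.

```python
def score_vector_feedback(clicked_slots, num_slots, click_value=1.0,
                          no_click_score=0.0):
    """Builds one score per displayed slot, in ranking order.

    Hypothetical rule: clicked slots get `click_value`, all other
    slots get `no_click_score`.
    """
    clicked = set(clicked_slots)
    return [click_value if slot in clicked else no_click_score
            for slot in range(num_slots)]


# The user clicked the item shown in the second of three slots.
feedback = score_vector_feedback(clicked_slots=[1], num_slots=3)
print(feedback)  # [0.0, 1.0, 0.0]
```

A real system could make the score depend on the slot position or on the kind of interaction; the only requirement stated above is that the vector has one entry per ranked item.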

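Cascading Feedback

In the cascading feedback type (a term coined by Craswell et al., 2008), we assume that the user looks at the displayed items sequentially, starting at the top slot. As soon as the user finds an item worth clicking, they click it and never return to the current ranked list. They do not even look at the items below the clicked one. Not clicking on any item is also possible: this happens when none of the displayed items is worth clicking. In that case, the user does look at all the items.

The feedback signal consists of two elements: the index of the chosen element and the value of the click. It is then the agent's task to translate this information into scores. In our implementation in the bandit library, we follow the convention that items that were seen but not clicked receive a low score (typically 0 or -1), the clicked item receives the click value, and the items beyond the clicked one are ignored by the agent.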
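
The convention described above can be sketched in a few lines. This is an illustration only, not the library's code: the function name is hypothetical, and the sketch assumes that a `chosen_index` equal to `num_slots` encodes "no click". Ignored items are represented by a training weight of 0.

```python
def cascading_to_scores(chosen_index, chosen_value, num_slots,
                        non_click_score=0.0):
    """Turns (chosen_index, chosen_value) into per-slot scores and weights.

    Assumption for this sketch: chosen_index == num_slots means "no click",
    in which case every slot was seen and gets the low score.
    """
    scores, weights = [], []
    for slot in range(num_slots):
        if slot < chosen_index:
            # Seen but not clicked: low score, counted in training.
            scores.append(non_click_score)
            weights.append(1.0)
        elif slot == chosen_index:
            # The clicked item receives the click value.
            scores.append(chosen_value)
            weights.append(1.0)
        else:
            # Below the click: never seen, so ignored (weight 0).
            scores.append(0.0)
            weights.append(0.0)
    return scores, weights


print(cascading_to_scores(chosen_index=1, chosen_value=2.0, num_slots=3))
# ([0.0, 2.0, 0.0], [1.0, 1.0, 0.0])
```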

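Diversity and Exploration

To maximize the chance that the user clicks on an item, it is not enough to pick the highest-scoring items and place them high in the ranking. Consider a user with many different interests: they may be most interested in sports, but they also like arts and traveling. Giving all sports-related items the highest estimated scores and displaying only sports-related items in the top slots may not be optimal, because the user might be in the mood for arts or traveling. It is therefore a good idea to display a mix of the high-scoring interests. It is important not only to maximize the scores of the displayed items but also to make sure they form a diverse set.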
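
To make the sports/arts example concrete, here is a minimal sketch of one way to trade score against diversity: slots are filled greedily, and each candidate's score is reduced by its highest cosine similarity to the items already chosen. The function names, the greedy rule, and the `penalty` parameter are assumptions made for illustration; this is not the library's policy implementation.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def diverse_ranking(scores, features, num_slots, penalty=1.0):
    """Greedily fills slots, penalizing similarity to already chosen items."""
    chosen = []
    remaining = list(range(len(scores)))

    def adjusted_score(i):
        if not chosen:
            return scores[i]
        max_similarity = max(
            cosine_similarity(features[i], features[j]) for j in chosen)
        return scores[i] - penalty * max_similarity

    while remaining and len(chosen) < num_slots:
        best = max(remaining, key=adjusted_score)
        chosen.append(best)
        remaining.remove(best)
    return chosen


# Items 0 and 1 are both "sports"; item 2 is "arts" with a slightly lower score.
scores = [0.9, 0.8, 0.7]
features = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
print(diverse_ranking(scores, features, num_slots=2, penalty=0.0))  # [0, 1]
print(diverse_ranking(scores, features, num_slots=2, penalty=1.0))  # [0, 2]
```

With no penalty the two sports items fill both slots; with the penalty switched on, the arts item takes the second slot even though its raw score is lower.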

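As with other limited-information learning problems (such as bandits), it is important to keep in mind that our decisions affect not only the immediate reward but also the training data and future rewards. If we always display items based only on their current estimated scores, we may miss high-scoring items that we have not explored enough yet, and so never learn how good they are. That is, we need to incorporate exploration into the decision-making process.

All of the above concepts and considerations are addressed in the library. This tutorial walks you through the details.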

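Simulating Users: Our Test Environment

Let's dive into the codebase!

First we define the environment, the class responsible for randomly generating user and item features and for giving feedback after decisions.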

In [ ]:

feedback_model = ranking_environment.FeedbackModel.CASCADING #@param["ranking_environment.FeedbackModel.SCORE_VECTOR", "ranking_environment.FeedbackModel.CASCADING"] {type:"raw"}

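We also need a model that the environment uses to decide when the user does not click. The library offers two: distance based and ghost actions.

In the distance-based model, if the user features are not close enough to any of the item features, the user does not click.
In the ghost actions model, we set up extra imaginary actions in the form of unit-vector item features. If the user chooses one of the ghost actions, the result is a no-click.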
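
The two no-click models can be sketched as follows. This is a simplified illustration under stated assumptions, not the environment's code: user and item vectors are assumed to share one dimension, scores are plain inner products, and the user deterministically picks the highest-scoring candidate. Both function names are hypothetical.

```python
import math


def distance_based_click(user, item, distance_threshold):
    """Clicks only if the item lies within `distance_threshold` of the user."""
    distance = math.sqrt(sum((u - x) ** 2 for u, x in zip(user, item)))
    return distance <= distance_threshold


def ghost_action_click(user, items):
    """Returns the index of the clicked item, or None if a ghost action wins.

    Ghost actions are extra candidates whose features are unit vectors.
    """
    dim = len(user)
    ghosts = [[1.0 if i == d else 0.0 for i in range(dim)] for d in range(dim)]
    candidates = items + ghosts
    scores = [sum(u * x for u, x in zip(user, c)) for c in candidates]
    best = max(range(len(candidates)), key=scores.__getitem__)
    return best if best < len(items) else None


print(distance_based_click([0.0, 0.0], [3.0, 4.0], distance_threshold=5.0))  # True
print(ghost_action_click([1.0, 0.0], [[2.0, 0.0]]))  # 0: the real item wins
print(ghost_action_click([1.0, 0.0], [[0.5, 0.0]]))  # None: a ghost wins
```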

In [ ]:

click_type = "ghost_actions"  #@param["distance_based", "ghost_actions"]
click_model = (ranking_environment.ClickModel.DISTANCE_BASED
               if click_type == "distance_based" else
               ranking_environment.ClickModel.GHOST_ACTIONS)

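We are almost ready to define the ranking environment; only a couple of preparations remain. We define the sampling functions for the global (user) features and the item features. The environment uses these features to simulate user behavior: it computes a weighted inner product of the global and item features, and the probability that the user clicks is proportional to the inner product value. The weighting of the inner product is defined by the scores_weight_matrix below.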
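
The weighted inner product can be written out by hand on a tiny example. This sketch assumes a weight matrix of shape [item_dim, global_dim], which is inferred from the dimensions used in this notebook; the helper names are hypothetical. With an identity-like matrix, the item dimensions beyond global_dim simply drop out.

```python
def identity_like(rows, cols):
    """Pure-Python stand-in for np.eye(rows, cols)."""
    return [[1.0 if r == c else 0.0 for c in range(cols)] for r in range(rows)]


def weighted_score(item, weights, user):
    """Computes item^T * W * user for W of shape [len(item), len(user)]."""
    return sum(item[i] * weights[i][j] * user[j]
               for i in range(len(item))
               for j in range(len(user)))


user = [1.0, -1.0]      # global_dim = 2
item = [2.0, 3.0, 5.0]  # item_dim = 3; the last entry has no matching user dim
W = identity_like(3, 2)
print(weighted_score(item, W, user))  # 2*1 + 3*(-1) = -1.0
```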

In [ ]:

global_dim = 9  #@param{ type: "integer"}
item_dim   = 11  #@param{ type: "integer"}
num_items  = 50 #@param{ type: "integer"}
num_slots  = 3  #@param{ type: "integer"}
distance_threshold = 5.0  #@param{ type: "number" }
batch_size = 128   #@param{ type: "integer"}

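def global_sampling_fn():
  # np.random.randint excludes the upper bound: entries are drawn from {-1, 0}.
  return np.random.randint(-1, 1, [global_dim]).astype(np.float32)

def item_sampling_fn():
  # Entries are drawn from {-2, -1, 0, 1, 2}.
  return np.random.randint(-2, 3, [item_dim]).astype(np.float32)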

# Inner product with excess dimensions ignored; shape is [item_dim, global_dim].
scores_weight_matrix = np.eye(item_dim, global_dim, dtype=np.float32)

env = ranking_environment.RankingPyEnvironment(
    global_sampling_fn,
    item_sampling_fn,
    num_items=num_items,
    num_slots=num_slots,
    scores_weight_matrix=scores_weight_matrix,
    feedback_model=feedback_model,
    click_model=click_model,
    distance_threshold=distance_threshold,
    batch_size=batch_size)

# Convert the python environment to tf environment.
environment = tf_py_environment.TFPyEnvironment(env)

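Now let's define a few different agents that will tackle the above environment! All of the agents train a network that estimates the scores of item/user pairs. The difference lies in the policy, that is, the way the trained network is used to make a ranking decision. The implemented policies range from simply stack ranking by score to taking diversity and exploration into account, with the ability to tune the mixture of these aspects.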
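
To illustrate how deterministic stack ranking differs from sampling sequentially based on scores, here is a small stdlib-only sketch. It is not the library's policy code: the function names are hypothetical, and the sampling rule (a softmax over the remaining items' scores, drawn without replacement) is an assumption chosen to show the role of a temperature parameter.

```python
import math
import random


def descending_ranking(scores, num_slots):
    """Deterministic stack ranking: the top `num_slots` items by score."""
    return sorted(range(len(scores)), key=lambda i: -scores[i])[:num_slots]


def sample_ranking(scores, num_slots, temperature=1.0, rng=random):
    """Samples a ranking slot by slot, without replacement.

    Each slot is drawn from a softmax over the remaining items' scores.
    A lower temperature is greedier; a higher one explores more.
    """
    remaining = list(range(len(scores)))
    ranking = []
    for _ in range(min(num_slots, len(remaining))):
        logits = [scores[i] / temperature for i in remaining]
        max_logit = max(logits)  # subtract the max for numerical stability
        weights = [math.exp(logit - max_logit) for logit in logits]
        pick = rng.choices(remaining, weights=weights, k=1)[0]
        ranking.append(pick)
        remaining.remove(pick)
    return ranking


scores = [3.0, 1.0, 2.0, 0.5]
print(descending_ranking(scores, num_slots=2))  # [0, 2]
print(sample_ranking(scores, num_slots=3, rng=random.Random(0)))
```

The sampled ranking varies with the random seed, which is what gives lower-scoring items an occasional chance to be shown and explored.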

In [ ]:

#@title Defining the Network and Training Params
scoring_network = (
      global_and_arm_feature_network.create_feed_forward_common_tower_network(
          environment.observation_spec(), (20, 10), (20, 10), (20, 10)))
learning_rate = 0.005  #@param{ type: "number"}

feedback_dict = {ranking_environment.FeedbackModel.CASCADING: ranking_agent.FeedbackModel.CASCADING,
                 ranking_environment.FeedbackModel.SCORE_VECTOR: ranking_agent.FeedbackModel.SCORE_VECTOR}
agent_feedback_model = feedback_dict[feedback_model]

In [ ]:

#@title Stack Ranking Deterministically by Scores

policy_type = ranking_agent.RankingPolicyType.DESCENDING_SCORES
descending_scores_agent = ranking_agent.RankingAgent(
    time_step_spec=environment.time_step_spec(),
    action_spec=environment.action_spec(),
    scoring_network=scoring_network,
    optimizer=tf.compat.v1.train.AdamOptimizer(learning_rate=learning_rate),
    feedback_model=agent_feedback_model,
    policy_type=policy_type,
    summarize_grads_and_vars=True)

In [ ]:

#@title Sampling Sequentially Based on Scores

policy_type = ranking_agent.RankingPolicyType.NO_PENALTY
logits_temperature = 1.0  #@param{ type: "number" }

no_penalty_agent = ranking_agent.RankingAgent(
    time_step_spec=environment.time_step_spec(),
    action_spec=environment.action_spec(),
    scoring_network=scoring_network,
    optimizer=tf.compat.v1.train.AdamOptimizer(learning_rate=learning_rate),
    feedback_model=agent_feedback_model,
    policy_type=policy_type,
    logits_temperature=logits_temperature,
    summarize_grads_and_vars=True)

In [ ]:

#@title Sampling Sequentially and Taking Diversity into Account
#@markdown The balance between ranking based on scores and taking diversity into account is governed by the following "penalty mixture" parameter. A low positive value results in rankings that hardly mix in diversity, a higher value will enforce more diversity.

policy_type = ranking_agent.RankingPolicyType.COSINE_DISTANCE
penalty_mixture = 1.0 #@param{ type: "number"}

cosine_distance_agent = ranking_agent.RankingAgent(
    time_step_spec=environment.time_step_spec(),
    action_spec=environment.action_spec(),
    scoring_network=scoring_network,
    optimizer=tf.compat.v1.train.AdamOptimizer(learning_rate=learning_rate),
    feedback_model=agent_feedback_model,
    policy_type=policy_type,
    logits_temperature=logits_temperature,
    penalty_mixture_coefficient=penalty_mixture,
    summarize_grads_and_vars=True)

In [ ]:

#@title Choosing the desired agent.
agent_type = "cosine_distance_agent" #@param["cosine_distance_agent", "no_penalty_agent", "descending_scores_agent"]
if agent_type == "descending_scores_agent":
  agent = descending_scores_agent
elif agent_type == "no_penalty_agent":
  agent = no_penalty_agent
else:
  agent = cosine_distance_agent

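Before we can start the training loop, there is one more thing to take care of, concerning the training data.

The arm features presented to the policy at decision time contain all the items the policy can choose from. At training time, however, we need only the features of the items that were selected and, for convenience, in the order of the decision output. To this end, the following function is used (copied here for clarity).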

In [ ]:

def order_items_from_action_fn(orig_trajectory):
  """Puts the features of the selected items in the recommendation order.

  This function is used to make sure that at training the item observation is
  filled with features of items selected by the policy, in the order of the
  selection. Features of unselected items are discarded.

  Args:
    orig_trajectory: The trajectory as output by the policy

  Returns:
    The modified trajectory that contains slotted item features.
  """
  item_obs = orig_trajectory.observation[
      bandit_spec_utils.PER_ARM_FEATURE_KEY]
  action = orig_trajectory.action
  if isinstance(
      orig_trajectory.observation[bandit_spec_utils.PER_ARM_FEATURE_KEY],
      tensor_spec.TensorSpec):
    dtype = orig_trajectory.observation[
        bandit_spec_utils.PER_ARM_FEATURE_KEY].dtype
    shape = [
        num_slots, orig_trajectory.observation[
            bandit_spec_utils.PER_ARM_FEATURE_KEY].shape[-1]
    ]
    new_observation = {
        bandit_spec_utils.GLOBAL_FEATURE_KEY:
            orig_trajectory.observation[bandit_spec_utils.GLOBAL_FEATURE_KEY],
        bandit_spec_utils.PER_ARM_FEATURE_KEY:
            tensor_spec.TensorSpec(dtype=dtype, shape=shape)
    }
  else:
    slotted_items = tf.gather(item_obs, action, batch_dims=1)
    new_observation = {
        bandit_spec_utils.GLOBAL_FEATURE_KEY:
            orig_trajectory.observation[bandit_spec_utils.GLOBAL_FEATURE_KEY],
        bandit_spec_utils.PER_ARM_FEATURE_KEY:
            slotted_items
    }
  return trajectory.Trajectory(
      step_type=orig_trajectory.step_type,
      observation=new_observation,
      action=(),
      policy_info=(),
      next_step_type=orig_trajectory.next_step_type,
      reward=orig_trajectory.reward,
      discount=orig_trajectory.discount)
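
The key step in the function above is the `tf.gather(item_obs, action, batch_dims=1)` call. As a rough illustration of what it does, the following pure-Python analogue (a hypothetical helper, written with nested lists instead of tensors) picks, for each batch entry, the features of the selected items in ranking order and drops the rest.

```python
def gather_slotted_items(item_features, actions):
    """For each batch entry, returns the chosen items' features in slot order.

    item_features: nested list of shape [batch, num_items, item_dim].
    actions: nested list of shape [batch, num_slots] holding item indices.
    """
    return [[items[i] for i in action]
            for items, action in zip(item_features, actions)]


# Batch of 2, four items each, two-dimensional item features.
item_features = [
    [[0.0, 0.1], [1.0, 1.1], [2.0, 2.1], [3.0, 3.1]],
    [[4.0, 4.1], [5.0, 5.1], [6.0, 6.1], [7.0, 7.1]],
]
actions = [[2, 0], [3, 1]]  # ranked item indices per batch entry
slotted = gather_slotted_items(item_features, actions)
print(slotted)
# [[[2.0, 2.1], [0.0, 0.1]], [[7.0, 7.1], [5.0, 5.1]]]
```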

In [ ]:

#@title Defining Parameters to Run the Agent on the Defined Environment
num_iterations = 400 #@param{ type: "integer" }
steps_per_loop = 2   #@param{ type: "integer" }

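As in the bandit tutorials, we define the replay buffer that feeds the agent the samples to train on. We then use the driver to put everything together: the environment provides features, the policy chooses rankings, and samples are collected for training.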

In [ ]:

replay_buffer = bandit_replay_buffer.BanditReplayBuffer(
      data_spec=order_items_from_action_fn(agent.policy.trajectory_spec),
      batch_size=batch_size,
      max_length=steps_per_loop)

if feedback_model == ranking_environment.FeedbackModel.SCORE_VECTOR:
  reward_metric = tf_metrics.AverageReturnMetric(
      batch_size=environment.batch_size,
      buffer_size=200)
else:
  reward_metric = tf_metrics.AverageReturnMultiMetric(
        reward_spec=environment.reward_spec(),
        batch_size=environment.batch_size,
        buffer_size=200)

add_batch_fn = lambda data: replay_buffer.add_batch(
        order_items_from_action_fn(data))

observers = [add_batch_fn, reward_metric]

driver = dynamic_step_driver.DynamicStepDriver(
    env=environment,
    policy=agent.collect_policy,
    num_steps=steps_per_loop * batch_size,
    observers=observers)

reward_values = []

for _ in range(num_iterations):
  driver.run()
  loss_info = agent.train(replay_buffer.gather_all())
  replay_buffer.clear()
  # The metric result is appended as-is for both feedback models; for
  # cascading feedback it is unpacked by key in the plotting cell.
  reward_values.append(reward_metric.result())

Let's plot the reward!

In [ ]:

if feedback_model == ranking_environment.FeedbackModel.SCORE_VECTOR:
  reward = reward_values
else:
  reward = [r["chosen_value"] for r in reward_values]
plt.plot(reward)
plt.ylabel('Average Return')
plt.xlabel('Number of Iterations')

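What's Next?

This tutorial has many tunable parameters, including the policy/agent to use, some properties of the environment, and the feedback model. Feel free to experiment with them!

There is also a ready-to-run ranking example in tf_agents/bandits/agents/examples/v2/train_eval_ranking.py.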