GitHub Repository: tensorflow/docs-l10n
Path: blob/master/site/pt-br/hub/tutorials/movinet.ipynb

Kernel: Python 3

Copyright 2021 The TensorFlow Hub Authors.

Licensed under the Apache License, Version 2.0 (the "License");

In [ ]:

# Copyright 2021 The TensorFlow Hub Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================

MoViNet for streaming action recognition


This tutorial demonstrates how to use a pretrained video classification model to classify an activity (such as dancing, swimming, or biking) in a given video.

The model architecture used in this tutorial is called MoViNet (Mobile Video Networks). MoViNets are a family of efficient video classification models trained on a large dataset (Kinetics 600).

In contrast to the I3D models available on TF Hub, MoViNet models also support frame-by-frame inference on streaming video.

The pretrained models are available from TF Hub. The TF Hub collection also includes quantized models optimized for TF Lite.

The source code for these models is available in the TensorFlow Model Garden, which includes a longer version of this tutorial that also covers building and fine-tuning a MoViNet model.

This MoViNet tutorial is part of a series of TensorFlow video tutorials. Here are the other three tutorials:

Load video data: this tutorial explains how to load and preprocess video data into a TensorFlow dataset pipeline from scratch.
Build a 3D CNN model for video classification: note that this tutorial uses a (2+1)D CNN that decomposes the spatial and temporal aspects of 3D data. If you are using volumetric data, such as an MRI scan, consider using a 3D CNN instead of a (2+1)D CNN.
Transfer learning for video classification with MoViNet: this tutorial explains how to use a video classification model pretrained on a different dataset with the UCF-101 dataset.

jumping jacks plot

Setup

For inference on the smaller models (A0-A2), a CPU is sufficient for this Colab.

In [ ]:

!sudo apt install -y ffmpeg
!pip install -q mediapy

In [ ]:

!pip uninstall -q -y opencv-python-headless
!pip install -q "opencv-python-headless<4.3"

In [ ]:

# Import libraries
import pathlib

import matplotlib as mpl
import matplotlib.pyplot as plt
import mediapy as media
import numpy as np
import PIL

import tensorflow as tf
import tensorflow_hub as hub
import tqdm

mpl.rcParams.update({
    'font.size': 10,
})

Download the Kinetics 600 label list and print the first few labels:

In [ ]:

labels_path = tf.keras.utils.get_file(
    fname='labels.txt',
    origin='https://raw.githubusercontent.com/tensorflow/models/f8af2291cced43fc9f1d9b41ddbf772ae7b0d7d2/official/projects/movinet/files/kinetics_600_labels.txt'
)
labels_path = pathlib.Path(labels_path)

lines = labels_path.read_text().splitlines()
KINETICS_600_LABELS = np.array([line.strip() for line in lines])
KINETICS_600_LABELS[:20]
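
As a small illustration of how a label array like this is used later, the sketch below builds one from a few lines of text and looks up labels by class index. The three labels are a hypothetical stand-in for the downloaded file, not a claim about its actual contents.

```python
import numpy as np

# Hypothetical stand-in for the contents of labels.txt (one label per line).
text = "abseiling\nacting in play \n adjusting glasses\n"

# Same parsing as the cell above: split into lines, strip stray whitespace.
labels = np.array([line.strip() for line in text.splitlines()])

# A class index produced by a model maps directly to a label string.
class_ids = np.array([2, 0])
print(labels[class_ids])
```

Because the lookup is plain array indexing, the order of lines in the label file has to match the order of the model's output classes.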

To provide a simple example video for classification, we can load a short GIF of jumping jacks.

jumping jacks

Attribution: footage shared by Coach Bobby Bluford on YouTube under the CC-BY license.

Download the GIF.

In [ ]:

jumpingjack_url = 'https://github.com/tensorflow/models/raw/f8af2291cced43fc9f1d9b41ddbf772ae7b0d7d2/official/projects/movinet/files/jumpingjack.gif'
jumpingjack_path = tf.keras.utils.get_file(
    fname='jumpingjack.gif',
    origin=jumpingjack_url,
    cache_dir='.', cache_subdir='.',
)

Define a function to read a GIF file into a tf.Tensor:

In [ ]:

#@title
# Read and process a video
def load_gif(file_path, image_size=(224, 224)):
  """Loads a gif file into a TF tensor.

  Use images resized to match what's expected by your model.
  The model pages say the "A2" models expect 224 x 224 images at 5 fps

  Args:
    file_path: path to the location of a gif file.
    image_size: a tuple of target size.

  Returns:
    a video of the gif file
  """
  # Load a gif file, convert it to a TF tensor
  raw = tf.io.read_file(file_path)
  video = tf.io.decode_gif(raw)
  # Resize the video
  video = tf.image.resize(video, image_size)
  # change dtype to a float32
  # Hub models always want images normalized to [0,1]
  # ref: https://www.tensorflow.org/hub/common_signatures/images#input
  video = tf.cast(video, tf.float32) / 255.
  return video

The video's shape is (frames, height, width, colors).

In [ ]:

jumpingjack=load_gif(jumpingjack_path)
jumpingjack.shape
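
To make the layout concrete without downloading anything, here is a minimal NumPy sketch of the same (frames, height, width, colors) convention and the [0, 1] scaling step. The tiny random clip is a made-up stand-in for a decoded GIF.

```python
import numpy as np

# Hypothetical 3-frame, 4x4 RGB clip with uint8 pixel values in [0, 255].
rng = np.random.default_rng(0)
raw = rng.integers(0, 256, size=(3, 4, 4, 3), dtype=np.uint8)

# Cast to float32 and scale to [0, 1], mirroring the scaling step in load_gif.
video = raw.astype(np.float32) / 255.0

frames, height, width, colors = video.shape
print(frames, height, width, colors)
print(video.dtype, video.min() >= 0.0, video.max() <= 1.0)
```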

How to use the model

This section walks through how to use the models from TensorFlow Hub. If you just want to see the models in action, skip ahead to the next section.

There are two versions of each model: base and streaming.

The base version takes a video as input and returns the probabilities averaged over the frames.
The streaming version takes a video frame and an RNN state as input, and returns the predictions for that frame along with the new RNN state.
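
The difference between the two calling conventions can be sketched with a toy example. This is only an analogy for the interfaces: the functions `base_predict`, `init_state`, and `stream_predict` are hypothetical, and the exponential-memory update is an assumption chosen for simplicity, not how MoViNet computes its state.

```python
import numpy as np

def base_predict(video):
    """Toy 'base' call: a whole clip in, one score vector out."""
    return video.mean(axis=(0, 1, 2))  # average over frames, height, width

def init_state(num_channels):
    """Toy initial state: an all-zero memory vector."""
    return {"memory": np.zeros(num_channels)}

def stream_predict(frame, state, decay=0.5):
    """Toy 'streaming' call: one frame plus state in, score plus new state out."""
    memory = decay * state["memory"] + (1.0 - decay) * frame.mean(axis=(0, 1))
    return memory, {"memory": memory}

video = np.ones((5, 4, 4, 3))  # hypothetical clip: 5 frames of 4x4 RGB

print("base:", base_predict(video))

state = init_state(num_channels=3)
scores = []
for frame in video:
    # The state returned by one call is the state passed to the next call.
    score, state = stream_predict(frame, state)
    scores.append(float(score[0]))

print("streaming, per frame:", scores)
```

The base call yields a single result for the clip, while the streaming call yields one result per frame and relies on the caller to thread the state through the loop.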

Base model

Download the pretrained model from TensorFlow Hub.

In [ ]:

%%time
id = 'a2'
mode = 'base'
version = '3'
hub_url = f'https://tfhub.dev/tensorflow/movinet/{id}/{mode}/kinetics-600/classification/{version}'
model = hub.load(hub_url)

This version of the model has one signature. It takes an image argument, which is a tf.float32 tensor with shape (batch, frames, height, width, colors). It returns a dictionary containing one output: a tf.float32 tensor of logits with shape (batch, classes).

In [ ]:

sig = model.signatures['serving_default']
print(sig.pretty_printed_signature())

To run this signature on the video, you first need to add the outer batch dimension to the video.

In [ ]:

#warmup
sig(image = jumpingjack[tf.newaxis, :1]);
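
The indexing used here is easy to get subtly wrong, so the NumPy sketch below shows the shapes involved (tf.newaxis behaves like np.newaxis for this purpose). The clip dimensions are hypothetical and kept small.

```python
import numpy as np

video = np.zeros((13, 8, 8, 3), dtype=np.float32)  # hypothetical clip

batched = video[np.newaxis, ...]      # whole clip as a batch of one
first_frame = video[np.newaxis, :1]   # a slice keeps the frames axis
dropped_axis = video[np.newaxis, 0]   # an integer index drops the frames axis

print(batched.shape)       # (1, 13, 8, 8, 3)
print(first_frame.shape)   # (1, 1, 8, 8, 3)
print(dropped_axis.shape)  # (1, 8, 8, 3)
```

Only the first two forms match the five-dimensional (batch, frames, height, width, colors) layout that the signature expects.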

In [ ]:

%%time
logits = sig(image = jumpingjack[tf.newaxis, ...])
logits = logits['classifier_head'][0]

print(logits.shape)
print()

Define a get_top_k function that packages the above output processing for later use.

In [ ]:

#@title
# Get top_k labels and probabilities
def get_top_k(probs, k=5, label_map=KINETICS_600_LABELS):
  """Outputs the top k model labels and probabilities on the given video.

  Args:
    probs: probability tensor of shape (num_classes,) that represents
      the probability of each class for a single video or frame.
    k: the number of top predictions to select.
    label_map: a list of labels to map logit indices to label strings.

  Returns:
    a tuple of the top-k labels and probabilities.
  """
  # Sort predictions to find top_k
  top_predictions = tf.argsort(probs, axis=-1, direction='DESCENDING')[:k]
  # collect the labels of top_k predictions
  top_labels = tf.gather(label_map, top_predictions, axis=-1)
  # decode labels
  top_labels = [label.decode('utf8') for label in top_labels.numpy()]
  # top_k probabilities of the predictions
  top_probs = tf.gather(probs, top_predictions, axis=-1).numpy()
  return tuple(zip(top_labels, top_probs))

Convert the logits to probabilities and look up the top 5 classes for the video. The model confirms that the video is probably of jumping jacks.

In [ ]:

probs = tf.nn.softmax(logits, axis=-1)
for label, p in get_top_k(probs):
  print(f'{label:20s}: {p:.3f}')
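
For readers who want to see the two steps in isolation, here is a minimal NumPy sketch of softmax followed by a top-k lookup. The four labels and logit values are invented for illustration and do not come from the model.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

# Hypothetical logits for a 4-class problem, with made-up label names.
labels = np.array(["jumping jacks", "zumba", "lunge", "dancing"])
logits = np.array([4.0, 1.0, 2.0, 0.0])

probs = softmax(logits)
top = np.argsort(probs)[::-1][:2]  # indices of the 2 largest probabilities

for i in top:
    print(f"{labels[i]:20s}: {probs[i]:.3f}")
```

Softmax preserves the ordering of the logits, so the top-k classes are the same before and after the conversion; only the scale changes.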

Streaming model

The previous section used a model that runs over a whole video. Often, when processing a video, you don't want a single prediction at the end; you want to update the predictions frame by frame. The stream versions of the model allow you to do this.

Load the stream version of the model.

In [ ]:

%%time
id = 'a2'
mode = 'stream'
version = '3'
hub_url = f'https://tfhub.dev/tensorflow/movinet/{id}/{mode}/kinetics-600/classification/{version}'
model = hub.load(hub_url)

Using this model is slightly more complex than using the base model. You have to keep track of the internal state of the model's RNNs.

In [ ]:

list(model.signatures.keys())

The init_states signature takes the video's shape (batch, frames, height, width, colors) as input and returns a large dictionary of tensors containing the initial RNN states:

In [ ]:

lines = model.signatures['init_states'].pretty_printed_signature().splitlines()
lines = lines[:10]
lines.append('      ...')
print('\n'.join(lines))

In [ ]:

initial_state = model.init_states(jumpingjack[tf.newaxis, ...].shape)

In [ ]:

type(initial_state)

In [ ]:

list(sorted(initial_state.keys()))[:5]

Once you have the initial state for the RNNs, you can pass the state and a video frame as input (keeping the (batch, frames, height, width, colors) shape for the video frame). The model returns a (logits, state) pair.

After seeing only the first frame, the model is not convinced that the video is of jumping jacks:

In [ ]:

inputs = initial_state.copy()

# Add the batch axis, take the first frame, but keep the frame-axis.
inputs['image'] = jumpingjack[tf.newaxis, 0:1, ...]

In [ ]:

# warmup
model(inputs);

In [ ]:

logits, new_state = model(inputs)
logits = logits[0]
probs = tf.nn.softmax(logits, axis=-1)

for label, p in get_top_k(probs):
  print(f'{label:20s}: {p:.3f}')

print()

If you run the model in a loop, passing the updated state with each frame, the model quickly converges to the correct result:

In [ ]:

%%time
state = initial_state.copy()
all_logits = []

for n in range(len(jumpingjack)):
  inputs = state
  inputs['image'] = jumpingjack[tf.newaxis, n:n+1, ...]
  result, state = model(inputs)
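  # Store this frame's logits (drop the batch axis), not the stale `logits`
  # variable left over from the previous cell.
  all_logits.append(result[0])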

probabilities = tf.nn.softmax(all_logits, axis=-1)
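
In the loop above, `inputs = state` binds a second name to the same dictionary, so adding the 'image' key also modifies `state`. That appears harmless here, because `state` is replaced by the model's output on every iteration, but the distinction is worth knowing. The plain-Python sketch below, using a hypothetical one-entry state, contrasts aliasing with building a new dictionary via `{**state, ...}`, the form used in a later cell of this tutorial.

```python
# Aliasing: `inputs` and `state` are the same dictionary object.
state = {"buffer": 0}  # hypothetical one-entry state
inputs = state
inputs["image"] = "frame 0"
aliased_state_has_image = "image" in state   # the original dict was modified

# Unpacking into a new dictionary leaves `state` untouched.
state = {"buffer": 0}
inputs = {**state, "image": "frame 0"}
copied_state_has_image = "image" in state    # the original dict is unchanged

print(aliased_state_has_image, copied_state_has_image)  # True False
```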

In [ ]:

for label, p in get_top_k(probabilities[-1]):
  print(f'{label:20s}: {p:.3f}')

In [ ]:

id = tf.argmax(probabilities[-1])
plt.plot(probabilities[:, id])
plt.xlabel('Frame #')
plt.ylabel(f"p('{KINETICS_600_LABELS[id]}')");

You may notice that the final probability is much more certain than in the previous section, where you ran the base model. The base model returns an average of the predictions over the frames.

In [ ]:

for label, p in get_top_k(tf.reduce_mean(probabilities, axis=0)):
  print(f'{label:20s}: {p:.3f}')
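
A quick arithmetic sketch shows why averaging lowers the reported confidence. The per-frame values below are invented to mimic a prediction that grows more certain over time; they are not outputs of the model.

```python
import numpy as np

# Hypothetical per-frame probabilities for the correct class, rising as a
# streaming model sees more of the clip.
per_frame = np.array([0.10, 0.40, 0.70, 0.90, 0.95])

final = per_frame[-1]       # what the last streaming step reports
average = per_frame.mean()  # what an average over all frames reports

print(f"final frame: {final:.2f}")
print(f"average:     {average:.2f}")
```

The early, uncertain frames pull the average down, so an averaged prediction is less confident than the last streaming step whenever confidence increases over the clip.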

Animate the predictions over time

The previous section went into some detail about how to use these models. This section builds on that to produce some nice inference animations.

The hidden cell below defines the helper functions used in this section.

In [ ]:

#@title
# Get top_k labels and probabilities predicted using MoViNets streaming model
def get_top_k_streaming_labels(probs, k=5, label_map=KINETICS_600_LABELS):
  """Returns the top-k labels over an entire video sequence.

  Args:
    probs: probability tensor of shape (num_frames, num_classes) that represents
      the probability of each class on each frame.
    k: the number of top predictions to select.
    label_map: a list of labels to map logit indices to label strings.

  Returns:
    a tuple of the top-k probabilities, labels, and logit indices
  """
  top_categories_last = tf.argsort(probs, -1, 'DESCENDING')[-1, :1]
  # Sort predictions to find top_k
  categories = tf.argsort(probs, -1, 'DESCENDING')[:, :k]
  categories = tf.reshape(categories, [-1])

  counts = sorted([
      (i.numpy(), tf.reduce_sum(tf.cast(categories == i, tf.int32)).numpy())
      for i in tf.unique(categories)[0]
  ], key=lambda x: x[1], reverse=True)

  top_probs_idx = tf.constant([i for i, _ in counts[:k]])
  top_probs_idx = tf.concat([top_categories_last, top_probs_idx], 0)
  # find unique indices of categories
  top_probs_idx = tf.unique(top_probs_idx)[0][:k+1]
  # top_k probabilities of the predictions
  top_probs = tf.gather(probs, top_probs_idx, axis=-1)
  top_probs = tf.transpose(top_probs, perm=(1, 0))
  # collect the labels of top_k predictions
  top_labels = tf.gather(label_map, top_probs_idx, axis=0)
  # decode the top_k labels
  top_labels = [label.decode('utf8') for label in top_labels.numpy()]

  return top_probs, top_labels, top_probs_idx

# Plot top_k predictions at a given time step
def plot_streaming_top_preds_at_step(
    top_probs,
    top_labels,
    step=None,
    image=None,
    legend_loc='lower left',
    duration_seconds=10,
    figure_height=500,
    playhead_scale=0.8,
    grid_alpha=0.3):
  """Generates a plot of the top video model predictions at a given time step.

  Args:
    top_probs: a tensor of shape (k, num_frames) representing the top-k
      probabilities over all frames.
    top_labels: a list of length k that represents the top-k label strings.
    step: the current time step in the range [0, num_frames].
    image: the image frame to display at the current time step.
    legend_loc: the placement location of the legend.
    duration_seconds: the total duration of the video.
    figure_height: the output figure height.
    playhead_scale: scale value for the playhead.
    grid_alpha: alpha value for the gridlines.

  Returns:
    The rendered plot as a numpy image array of shape (height, width, 3).
  """
  # find number of top_k labels and frames in the video
  num_labels, num_frames = top_probs.shape
  if step is None:
    step = num_frames
  # Visualize frames and top_k probabilities of streaming video
  fig = plt.figure(figsize=(6.5, 7), dpi=300)
  gs = mpl.gridspec.GridSpec(8, 1)
  ax2 = plt.subplot(gs[:-3, :])
  ax = plt.subplot(gs[-3:, :])
  # display the frame
  if image is not None:
    ax2.imshow(image, interpolation='nearest')
    ax2.axis('off')
  # x-axis (frame number)
  preview_line_x = tf.linspace(0., duration_seconds, num_frames)
  # y-axis (top_k probabilities)
  preview_line_y = top_probs

  line_x = preview_line_x[:step+1]
  line_y = preview_line_y[:, :step+1]

  for i in range(num_labels):
    ax.plot(preview_line_x, preview_line_y[i], label=None, linewidth='1.5',
            linestyle=':', color='gray')
    ax.plot(line_x, line_y[i], label=top_labels[i], linewidth='2.0')


  ax.grid(which='major', linestyle=':', linewidth='1.0', alpha=grid_alpha)
  ax.grid(which='minor', linestyle=':', linewidth='0.5', alpha=grid_alpha)

  min_height = tf.reduce_min(top_probs) * playhead_scale
  max_height = tf.reduce_max(top_probs)
  ax.vlines(preview_line_x[step], min_height, max_height, colors='red')
  ax.scatter(preview_line_x[step], max_height, color='red')

  ax.legend(loc=legend_loc)

  plt.xlim(0, duration_seconds)
  plt.ylabel('Probability')
  plt.xlabel('Time (s)')
  plt.yscale('log')

  fig.tight_layout()
  fig.canvas.draw()

  data = np.frombuffer(fig.canvas.tostring_rgb(), dtype=np.uint8)
  data = data.reshape(fig.canvas.get_width_height()[::-1] + (3,))
  plt.close()

  figure_width = int(figure_height * data.shape[1] / data.shape[0])
  image = PIL.Image.fromarray(data).resize([figure_width, figure_height])
  image = np.array(image)

  return image

# Plotting top_k predictions from MoViNets streaming model
def plot_streaming_top_preds(
    probs,
    video,
    top_k=5,
    video_fps=25.,
    figure_height=500,
    use_progbar=True):
  """Generates a video plot of the top video model predictions.

  Args:
    probs: probability tensor of shape (num_frames, num_classes) that represents
      the probability of each class on each frame.
    video: the video to display in the plot.
    top_k: the number of top predictions to select.
    video_fps: the input video fps.
    figure_height: the height of the output video.
    use_progbar: display a progress bar.

  Returns:
    A numpy array representing the output video.
  """
  # number of time steps of the given video
  steps = video.shape[0]
  # estimate duration of the video (in seconds)
  duration = steps / video_fps
  # estimate top_k probabilities and corresponding labels
  top_probs, top_labels, _ = get_top_k_streaming_labels(probs, k=top_k)

  images = []
  step_generator = tqdm.trange(steps) if use_progbar else range(steps)
  for i in step_generator:
    image = plot_streaming_top_preds_at_step(
        top_probs=top_probs,
        top_labels=top_labels,
        step=i,
        image=video[i],
        duration_seconds=duration,
        figure_height=figure_height,
    )
    images.append(image)

  return np.array(images)
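
The label selection in get_top_k_streaming_labels can be hard to follow in tensor form. The sketch below restates the idea in NumPy under simplifying assumptions: count how often each class appears in the per-frame top-k, keep the most frequent ones, and make sure the final frame's top class is included. The probability table is invented, and this is a paraphrase of the approach rather than a drop-in replacement.

```python
from collections import Counter

import numpy as np

# Hypothetical per-frame probabilities: 4 frames, 5 classes.
probs = np.array([
    [0.50, 0.30, 0.10, 0.05, 0.05],
    [0.40, 0.35, 0.15, 0.05, 0.05],
    [0.25, 0.60, 0.05, 0.05, 0.05],
    [0.05, 0.70, 0.15, 0.05, 0.05],
])
k = 2

# Top-k class indices for every frame, shape (num_frames, k).
top_per_frame = np.argsort(-probs, axis=-1)[:, :k]

# Count how many times each class appears in a per-frame top-k.
votes = Counter(top_per_frame.ravel().tolist())
most_common = [idx for idx, _ in votes.most_common(k)]

# Put the final frame's top class first, then drop duplicates in order.
final_top = int(np.argmax(probs[-1]))
selected = list(dict.fromkeys([final_top] + most_common))

print(most_common, selected)
```

Selecting labels by frequency keeps the plot legend stable across the clip instead of changing with every frame's ranking.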

Start by running the streaming model across the frames of the video and collecting the logits:

In [ ]:

init_states = model.init_states(jumpingjack[tf.newaxis].shape)

In [ ]:

# Insert your video clip here
video = jumpingjack
images = tf.split(video[tf.newaxis], video.shape[0], axis=1)

all_logits = []

# To run on a video, pass in one frame at a time
states = init_states
for image in tqdm.tqdm(images):
  # predictions for each frame
  logits, states = model({**states, 'image': image})
  all_logits.append(logits)

# concatenate all the logits
logits = tf.concat(all_logits, 0)
# estimating probabilities
probs = tf.nn.softmax(logits, axis=-1)
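
The shape bookkeeping in this cell can be checked with a NumPy analogue, where np.split and np.concatenate stand in for tf.split and tf.concat. The clip size, class count, and per-frame outputs are hypothetical placeholders for real model results.

```python
import numpy as np

video = np.zeros((6, 8, 8, 3), dtype=np.float32)  # hypothetical 6-frame clip

# Split the batched clip into one array per frame along the frames axis.
frames = np.split(video[np.newaxis], video.shape[0], axis=1)
print(len(frames), frames[0].shape)  # 6 (1, 1, 8, 8, 3)

# Stand-in for per-frame model outputs of shape (batch=1, classes=4).
per_frame_logits = [np.full((1, 4), float(i)) for i in range(len(frames))]

# Concatenating along axis 0 stacks them into (frames, classes).
logits = np.concatenate(per_frame_logits, axis=0)
print(logits.shape)  # (6, 4)
```

With a batch size of one, concatenating along the batch axis turns it into a frames axis, which is the layout the plotting helpers expect.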

In [ ]:

final_probs = probs[-1]
print('Top_k predictions and their probabilities\n')
for label, p in get_top_k(final_probs):
  print(f'{label:20s}: {p:.3f}')

Convert the sequence of probabilities into a video:

In [ ]:

# Generate a plot and output to a video tensor
plot_video = plot_streaming_top_preds(probs, video, video_fps=8.)
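
The video_fps argument determines the time axis of the plot. A minimal sketch of that arithmetic follows, assuming a clip length of 13 frames purely for illustration.

```python
import numpy as np

num_frames = 13  # hypothetical clip length
video_fps = 8.0  # frame rate passed to plot_streaming_top_preds

duration = num_frames / video_fps               # clip length in seconds
times = np.linspace(0.0, duration, num_frames)  # one x-position per frame

print(duration)
print(times[:3])
```

Because np.linspace includes both endpoints, the spacing between points is duration / (num_frames - 1), which is slightly larger than 1 / video_fps.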

In [ ]:

# For gif format, set codec='gif'
media.show_video(plot_video, fps=3)

Resources

The pretrained models are available from TF Hub. The TF Hub collection also includes quantized models optimized for TF Lite.

The source code for these models is available in the TensorFlow Model Garden, which includes a longer version of this tutorial that also covers building and fine-tuning a MoViNet model.

Next steps

To learn more about working with video data in TensorFlow, check out the following tutorials: