GitHub Repository: tensorflow/docs-l10n
Path: blob/master/site/es-419/hub/tutorials/movinet.ipynb

Kernel: Python 3

Copyright 2021 The TensorFlow Hub Authors.

Licensed under the Apache License, Version 2.0 (the "License");

In [ ]:

# Copyright 2021 The TensorFlow Hub Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================

MoViNet for streaming action recognition


This tutorial demonstrates how to use a pretrained video classification model to classify an activity (such as dancing, swimming, or biking) in a given video.

The model architecture used in this tutorial is called MoViNet (Mobile Video Networks). MoViNets are a family of efficient video classification models trained on a very large dataset (Kinetics 600).

In contrast to the i3d models available on TF Hub, MoViNets also support frame-by-frame inference on streaming video.

The pretrained models are available from TF Hub. The TF Hub collection also includes quantized models optimized for TFLite.

The source for these models is available in the TensorFlow Model Garden. It includes a longer version of this tutorial that also covers building and fine-tuning a MoViNet model.

This MoViNet tutorial is part of a series of TensorFlow video tutorials. Here are the other three tutorials:

Load video data: This tutorial explains how to load and preprocess video data into a TensorFlow dataset pipeline from scratch.
Build a 3D CNN model for video classification: Note that this tutorial uses a (2+1)D CNN that decomposes the spatial and temporal aspects of 3D data. If you are using volumetric data such as an MRI scan, consider using a 3D CNN instead of a (2+1)D CNN.
Transfer learning for video classification with MoViNet: This tutorial explains how to use a video classification model that was pretrained on a different dataset with the UCF-101 dataset.

Jumping jacks clip

Setup

For inference on the smaller models (A0-A2), a CPU is sufficient in this case.

In [ ]:

!sudo apt install -y ffmpeg
!pip install -q mediapy

In [ ]:

!pip uninstall -q -y opencv-python-headless
!pip install -q "opencv-python-headless<4.3"

In [ ]:

# Import libraries
import pathlib

import matplotlib as mpl
import matplotlib.pyplot as plt
import mediapy as media
import numpy as np
import PIL

import tensorflow as tf
import tensorflow_hub as hub
import tqdm

mpl.rcParams.update({
    'font.size': 10,
})

Get the Kinetics 600 label list and print the first few labels:

In [ ]:

labels_path = tf.keras.utils.get_file(
    fname='labels.txt',
    origin='https://raw.githubusercontent.com/tensorflow/models/f8af2291cced43fc9f1d9b41ddbf772ae7b0d7d2/official/projects/movinet/files/kinetics_600_labels.txt'
)
labels_path = pathlib.Path(labels_path)

lines = labels_path.read_text().splitlines()
KINETICS_600_LABELS = np.array([line.strip() for line in lines])
KINETICS_600_LABELS[:20]
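
As a minimal sketch of how a label array like this is used, the snippet below parses a three-line stand-in for the label file and maps a class index back to its label string. The sample text and the names `sample_text`, `sample_labels`, and `class_index` are assumptions for illustration, not part of the tutorial.

```python
import numpy as np

# Stand-in for the downloaded label file: one label per line (sample text only).
sample_text = "abseiling\nacting in play\nadjusting glasses\n"

# Same parsing steps as the tutorial cell: split into lines, strip whitespace.
sample_labels = np.array([line.strip() for line in sample_text.splitlines()])

# A class index produced by a model selects the matching label string.
class_index = 1
print(sample_labels[class_index])
print(len(sample_labels))
```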

To provide a simple example video for classification, we can load a short gif of a person doing jumping jacks.

jumping jacks

Attribution: Footage shared by Coach Bobby Bluford on YouTube under the CC-BY license.

Download the gif.

In [ ]:

jumpingjack_url = 'https://github.com/tensorflow/models/raw/f8af2291cced43fc9f1d9b41ddbf772ae7b0d7d2/official/projects/movinet/files/jumpingjack.gif'
jumpingjack_path = tf.keras.utils.get_file(
    fname='jumpingjack.gif',
    origin=jumpingjack_url,
    cache_dir='.', cache_subdir='.',
)

Define a function to read a gif file into a tf.Tensor:

In [ ]:

#@title
# Read and process a video
def load_gif(file_path, image_size=(224, 224)):
  """Loads a gif file into a TF tensor.

  Use images resized to match what's expected by your model.
  The model pages say the "A2" models expect 224 x 224 images at 5 fps

  Args:
    file_path: path to the location of a gif file.
    image_size: a tuple of target size.

  Returns:
    a video of the gif file
  """
  # Load a gif file, convert it to a TF tensor
  raw = tf.io.read_file(file_path)
  video = tf.io.decode_gif(raw)
  # Resize the video
  video = tf.image.resize(video, image_size)
  # change dtype to a float32
  # Hub models always want images normalized to [0,1]
  # ref: https://www.tensorflow.org/hub/common_signatures/images#input
  video = tf.cast(video, tf.float32) / 255.
  return video

The video's shape is (frames, height, width, colors):

In [ ]:

jumpingjack=load_gif(jumpingjack_path)
jumpingjack.shape
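
The normalization step and the resulting layout can be checked without TensorFlow. The following is a small NumPy sketch, assuming a made-up 4-frame clip of 8 x 8 pixels in place of the decoded gif; `raw_clip` and `clip` are hypothetical names.

```python
import numpy as np

# Hypothetical decoded clip: 4 frames of 8x8 RGB pixels stored as uint8.
rng = np.random.default_rng(0)
raw_clip = rng.integers(0, 256, size=(4, 8, 8, 3), dtype=np.uint8)

# Scale pixel values from [0, 255] to [0, 1] as float32, mirroring load_gif.
clip = raw_clip.astype(np.float32) / 255.0

# The axes are (frames, height, width, colors).
frames, height, width, colors = clip.shape
print(clip.shape, clip.dtype)
print(float(clip.min()) >= 0.0 and float(clip.max()) <= 1.0)
```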

How to use the model

This section contains a walkthrough showing how to use the models from TensorFlow Hub. If you just want to see the models in action, skip to the next section.

There are two versions of each model: base and streaming.

The base version takes a video as input and returns the probabilities averaged over the frames.
The streaming version takes a video frame and an RNN state as input, and returns the predictions for that frame and the new RNN state.
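
The difference between the two calling conventions can be sketched with a toy example. This is not how MoViNet works internally: the "model" here is just a running mean over made-up per-frame scores, and `base_predict`, `stream_predict`, and the state keys are hypothetical names chosen for illustration.

```python
import numpy as np

def base_predict(frame_scores):
    """Toy 'base' model: sees the whole clip, returns the mean score per class."""
    return frame_scores.mean(axis=0)

def stream_predict(frame_score, state):
    """Toy 'streaming' model: takes one frame and a state, returns (prediction, new_state)."""
    total = state["total"] + frame_score
    count = state["count"] + 1
    return total / count, {"total": total, "count": count}

# Hypothetical per-frame scores for 3 frames and 2 classes.
frame_scores = np.array([[0.2, 0.8], [0.4, 0.6], [0.9, 0.1]])

# Thread the state through the loop, one frame at a time.
state = {"total": np.zeros(2), "count": 0}
for frame_score in frame_scores:
    prediction, state = stream_predict(frame_score, state)

# For this toy, the streaming result after the last frame matches the base result.
print(np.allclose(prediction, base_predict(frame_scores)))
```

The point of the sketch is the interface: the streaming call returns a new state that must be passed back in with the next frame.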

The base model

Download the pretrained model from TensorFlow Hub.

In [ ]:

%%time
id = 'a2'
mode = 'base'
version = '3'
hub_url = f'https://tfhub.dev/tensorflow/movinet/{id}/{mode}/kinetics-600/classification/{version}'
model = hub.load(hub_url)
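
The same URL pattern is reused later for the streaming variant. A small helper, sketched below, makes the varying parts explicit; `movinet_hub_url` is a hypothetical function that only formats the string used in this tutorial and does not check that a given combination exists on TF Hub.

```python
def movinet_hub_url(model_id, mode, version="3"):
    """Hypothetical helper: fills in the URL pattern used in this tutorial."""
    return (
        f"https://tfhub.dev/tensorflow/movinet/{model_id}/{mode}"
        f"/kinetics-600/classification/{version}"
    )

base_url = movinet_hub_url("a2", "base")
stream_url = movinet_hub_url("a2", "stream")
print(base_url)
print(stream_url)
```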

This version of the model has one signature. It takes an image argument, which is a tf.float32 tensor with shape (batch, frames, height, width, colors). It returns a dictionary containing one output: a tf.float32 tensor of logits with shape (batch, classes).

In [ ]:

sig = model.signatures['serving_default']
print(sig.pretty_printed_signature())

To run this signature on the video, you first need to add the outer batch dimension to the video.

In [ ]:

#warmup
sig(image = jumpingjack[tf.newaxis, :1]);
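
The indexing used here behaves the same way in NumPy, which makes the shapes easy to inspect. The sketch below assumes a made-up clip of 6 zero-filled frames; `np.newaxis` plays the role of `tf.newaxis`.

```python
import numpy as np

# Hypothetical clip: 6 frames of 224x224 RGB, zeros stand in for pixel data.
clip = np.zeros((6, 224, 224, 3), dtype=np.float32)

# Add the outer batch axis: (frames, H, W, C) -> (1, frames, H, W, C).
batched = clip[np.newaxis, ...]
print(batched.shape)

# Keep only the first frame while preserving the frame axis, as in the warmup call.
first_frame = clip[np.newaxis, :1]
print(first_frame.shape)

# Integer indexing would drop the frame axis instead of keeping it.
dropped_axis = clip[np.newaxis, 0]
print(dropped_axis.shape)
```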

In [ ]:

%%time
logits = sig(image = jumpingjack[tf.newaxis, ...])
logits = logits['classifier_head'][0]

print(logits.shape)
print()

Define a get_top_k function that packages the above output processing for later.

In [ ]:

#@title
# Get top_k labels and probabilities
def get_top_k(probs, k=5, label_map=KINETICS_600_LABELS):
  """Outputs the top k model labels and probabilities on the given video.

  Args:
    probs: probability tensor of shape (num_classes,) that holds the
      probability of each class for a single prediction.
    k: the number of top predictions to select.
    label_map: a list of labels to map logit indices to label strings.

  Returns:
    a tuple of the top-k labels and probabilities.
  """
  # Sort predictions to find top_k
  top_predictions = tf.argsort(probs, axis=-1, direction='DESCENDING')[:k]
  # collect the labels of top_k predictions
  top_labels = tf.gather(label_map, top_predictions, axis=-1)
  # decode labels
  top_labels = [label.decode('utf8') for label in top_labels.numpy()]
  # top_k probabilities of the predictions
  top_probs = tf.gather(probs, top_predictions, axis=-1).numpy()
  return tuple(zip(top_labels, top_probs))

Convert the logits to probabilities, and look up the top 5 classes for the video. The model confirms that the video is probably of jumping jacks.

In [ ]:

probs = tf.nn.softmax(logits, axis=-1)
for label, p in get_top_k(probs):
  print(f'{label:20s}: {p:.3f}')
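
For reference, the softmax and top-k steps can be written out in plain NumPy. This is a minimal sketch with four made-up classes and logits, not the tutorial's get_top_k; the `softmax` helper and the label names are assumptions for illustration.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

# Hypothetical labels and logits for four classes.
labels = np.array(["class_a", "class_b", "class_c", "class_d"])
logits = np.array([1.0, 3.0, 0.5, 2.0])

probs = softmax(logits)

# Sort class indices by descending probability and keep the first two.
top_indices = np.argsort(probs)[::-1][:2]

for label, p in zip(labels[top_indices], probs[top_indices]):
    print(f"{label:10s}: {p:.3f}")
```

Subtracting the maximum logit before exponentiating does not change the result and avoids overflow for large logits.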

The streaming model

The previous section used a model that runs over a whole video. When processing a video, you often don't want a single prediction at the end; you want the predictions to update frame by frame. The stream versions of the model allow you to do this.

Load the stream version of the model.

In [ ]:

%%time
id = 'a2'
mode = 'stream'
version = '3'
hub_url = f'https://tfhub.dev/tensorflow/movinet/{id}/{mode}/kinetics-600/classification/{version}'
model = hub.load(hub_url)

Using this model is slightly more complex than using the base model. You have to keep track of the internal state of the model's RNNs.

In [ ]:

list(model.signatures.keys())

The init_states signature takes the video's shape (batch, frames, height, width, colors) as input, and returns a large dictionary of tensors containing the initial RNN states:

In [ ]:

lines = model.signatures['init_states'].pretty_printed_signature().splitlines()
lines = lines[:10]
lines.append('      ...')
print('\n'.join(lines))

In [ ]:

initial_state = model.init_states(jumpingjack[tf.newaxis, ...].shape)

In [ ]:

type(initial_state)

In [ ]:

list(sorted(initial_state.keys()))[:5]

Once you have the initial state for the RNNs, you can pass the state and a video frame as input (the frame keeps the (batch, frames, height, width, colors) shape). The model returns a (logits, state) pair.

After seeing just the first frame, the model is not convinced that the video is of jumping jacks:

In [ ]:

inputs = initial_state.copy()

# Add the batch axis, take the first frame, but keep the frame-axis.
inputs['image'] = jumpingjack[tf.newaxis, 0:1, ...]

In [ ]:

# warmup
model(inputs);

In [ ]:

logits, new_state = model(inputs)
logits = logits[0]
probs = tf.nn.softmax(logits, axis=-1)

for label, p in get_top_k(probs):
  print(f'{label:20s}: {p:.3f}')

print()

If you run the model in a loop, passing the updated state with each frame, the model quickly converges to the correct result:

In [ ]:

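%%time
state = initial_state.copy()
all_logits = []

for n in range(len(jumpingjack)):
  # Pass the current RNN state together with a single frame (frame axis kept).
  inputs = {**state, 'image': jumpingjack[tf.newaxis, n:n+1, ...]}
  frame_logits, state = model(inputs)
  # Drop the batch axis so the stacked result has shape (frames, classes).
  all_logits.append(frame_logits[0])

probabilities = tf.nn.softmax(all_logits, axis=-1)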

In [ ]:

for label, p in get_top_k(probabilities[-1]):
  print(f'{label:20s}: {p:.3f}')

In [ ]:

id = tf.argmax(probabilities[-1])
plt.plot(probabilities[:, id])
plt.xlabel('Frame #')
plt.ylabel(f"p('{KINETICS_600_LABELS[id]}')");

You may notice that the final probability is much more certain than in the previous section, where you ran the base model. The base model returns an average of the predictions over the frames.

In [ ]:

for label, p in get_top_k(tf.reduce_mean(probabilities, axis=0)):
  print(f'{label:20s}: {p:.3f}')
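
The effect of averaging can be seen on a small made-up example. In the sketch below the per-frame probabilities are invented numbers in which the first class becomes more likely as more frames are seen; they are not outputs of the model.

```python
import numpy as np

# Hypothetical per-frame probabilities for 2 classes over 4 frames.
probabilities = np.array([
    [0.40, 0.60],
    [0.70, 0.30],
    [0.90, 0.10],
    [0.98, 0.02],
])

# Prediction after the final frame versus the average over all frames.
last_frame = probabilities[-1]
frame_average = probabilities.mean(axis=0)

print(last_frame[0], frame_average[0])
```

Because the early, uncertain frames pull the mean down, the averaged probability for the first class is lower than the last-frame probability.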

Animate the predictions over time

The previous section went into some detail about how to use these models. This section builds on that to produce some inference animations.

The hidden cell below defines helper functions used in this section.

In [ ]:

#@title
# Get top_k labels and probabilities predicted using MoViNets streaming model
def get_top_k_streaming_labels(probs, k=5, label_map=KINETICS_600_LABELS):
  """Returns the top-k labels over an entire video sequence.

  Args:
    probs: probability tensor of shape (num_frames, num_classes) that represents
      the probability of each class on each frame.
    k: the number of top predictions to select.
    label_map: a list of labels to map logit indices to label strings.

  Returns:
    a tuple of the top-k probabilities, labels, and logit indices
  """
  top_categories_last = tf.argsort(probs, -1, 'DESCENDING')[-1, :1]
  # Sort predictions to find top_k
  categories = tf.argsort(probs, -1, 'DESCENDING')[:, :k]
  categories = tf.reshape(categories, [-1])

  counts = sorted([
      (i.numpy(), tf.reduce_sum(tf.cast(categories == i, tf.int32)).numpy())
      for i in tf.unique(categories)[0]
  ], key=lambda x: x[1], reverse=True)

  top_probs_idx = tf.constant([i for i, _ in counts[:k]])
  top_probs_idx = tf.concat([top_categories_last, top_probs_idx], 0)
  # find unique indices of categories
  top_probs_idx = tf.unique(top_probs_idx)[0][:k+1]
  # top_k probabilities of the predictions
  top_probs = tf.gather(probs, top_probs_idx, axis=-1)
  top_probs = tf.transpose(top_probs, perm=(1, 0))
  # collect the labels of top_k predictions
  top_labels = tf.gather(label_map, top_probs_idx, axis=0)
  # decode the top_k labels
  top_labels = [label.decode('utf8') for label in top_labels.numpy()]

  return top_probs, top_labels, top_probs_idx

# Plot top_k predictions at a given time step
def plot_streaming_top_preds_at_step(
    top_probs,
    top_labels,
    step=None,
    image=None,
    legend_loc='lower left',
    duration_seconds=10,
    figure_height=500,
    playhead_scale=0.8,
    grid_alpha=0.3):
  """Generates a plot of the top video model predictions at a given time step.

  Args:
    top_probs: a tensor of shape (k, num_frames) representing the top-k
      probabilities over all frames.
    top_labels: a list of length k that represents the top-k label strings.
    step: the current time step in the range [0, num_frames].
    image: the image frame to display at the current time step.
    legend_loc: the placement location of the legend.
    duration_seconds: the total duration of the video.
    figure_height: the output figure height.
    playhead_scale: scale value for the playhead.
    grid_alpha: alpha value for the gridlines.

  Returns:
    A numpy array containing the rendered figure image.
  """
  # find number of top_k labels and frames in the video
  num_labels, num_frames = top_probs.shape
  if step is None:
    # Default to the last frame; indexing at num_frames would be out of range.
    step = num_frames - 1
  # Visualize frames and top_k probabilities of streaming video
  fig = plt.figure(figsize=(6.5, 7), dpi=300)
  gs = mpl.gridspec.GridSpec(8, 1)
  ax2 = plt.subplot(gs[:-3, :])
  ax = plt.subplot(gs[-3:, :])
  # display the frame
  if image is not None:
    ax2.imshow(image, interpolation='nearest')
    ax2.axis('off')
  # x-axis (frame number)
  preview_line_x = tf.linspace(0., duration_seconds, num_frames)
  # y-axis (top_k probabilities)
  preview_line_y = top_probs

  line_x = preview_line_x[:step+1]
  line_y = preview_line_y[:, :step+1]

  for i in range(num_labels):
    ax.plot(preview_line_x, preview_line_y[i], label=None, linewidth='1.5',
            linestyle=':', color='gray')
    ax.plot(line_x, line_y[i], label=top_labels[i], linewidth='2.0')


  ax.grid(which='major', linestyle=':', linewidth='1.0', alpha=grid_alpha)
  ax.grid(which='minor', linestyle=':', linewidth='0.5', alpha=grid_alpha)

  min_height = tf.reduce_min(top_probs) * playhead_scale
  max_height = tf.reduce_max(top_probs)
  ax.vlines(preview_line_x[step], min_height, max_height, colors='red')
  ax.scatter(preview_line_x[step], max_height, color='red')

  ax.legend(loc=legend_loc)

  plt.xlim(0, duration_seconds)
  plt.ylabel('Probability')
  plt.xlabel('Time (s)')
  plt.yscale('log')

  fig.tight_layout()
  fig.canvas.draw()

  # Read the rendered RGBA canvas and drop the alpha channel
  # (tostring_rgb() is not available in newer Matplotlib releases).
  data = np.asarray(fig.canvas.buffer_rgba())[..., :3].copy()
  plt.close()

  figure_width = int(figure_height * data.shape[1] / data.shape[0])
  image = PIL.Image.fromarray(data).resize([figure_width, figure_height])
  image = np.array(image)

  return image

# Plotting top_k predictions from MoViNets streaming model
def plot_streaming_top_preds(
    probs,
    video,
    top_k=5,
    video_fps=25.,
    figure_height=500,
    use_progbar=True):
  """Generates a video plot of the top video model predictions.

  Args:
    probs: probability tensor of shape (num_frames, num_classes) that represents
      the probability of each class on each frame.
    video: the video to display in the plot.
    top_k: the number of top predictions to select.
    video_fps: the input video fps.
    figure_height: the height of the output video.
    use_progbar: display a progress bar.

  Returns:
    A numpy array representing the output video.
  """
  # number of time steps of the given video
  steps = video.shape[0]
  # estimate duration of the video (in seconds)
  duration = steps / video_fps
  # estimate top_k probabilities and corresponding labels
  top_probs, top_labels, _ = get_top_k_streaming_labels(probs, k=top_k)

  images = []
  step_generator = tqdm.trange(steps) if use_progbar else range(steps)
  for i in step_generator:
    image = plot_streaming_top_preds_at_step(
        top_probs=top_probs,
        top_labels=top_labels,
        step=i,
        image=video[i],
        duration_seconds=duration,
        figure_height=figure_height,
    )
    images.append(image)

  return np.array(images)
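
The label-selection idea in get_top_k_streaming_labels (pick the classes that appear most often in the per-frame top-k) can be sketched in NumPy. This is a simplified illustration on made-up probabilities; it omits the step that also forces the last frame's top class into the result, and the variable names are assumptions.

```python
from collections import Counter

import numpy as np

# Hypothetical probabilities: 3 frames, 4 classes.
probs = np.array([
    [0.10, 0.50, 0.30, 0.10],
    [0.05, 0.60, 0.05, 0.30],
    [0.05, 0.70, 0.05, 0.20],
])
k = 2

# Per-frame top-k class indices, highest probability first.
per_frame_top_k = np.argsort(-probs, axis=-1)[:, :k]

# Count in how many frames each class appears among the top-k.
counts = Counter(per_frame_top_k.ravel().tolist())
most_frequent = [index for index, _ in counts.most_common(k)]

print(per_frame_top_k.tolist())
print(most_frequent)
```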

Start by running the streaming model across the frames of the video, and collecting the logits:

In [ ]:

init_states = model.init_states(jumpingjack[tf.newaxis].shape)

In [ ]:

# Insert your video clip here
video = jumpingjack
images = tf.split(video[tf.newaxis], video.shape[0], axis=1)

all_logits = []

# To run on a video, pass in one frame at a time
states = init_states
for image in tqdm.tqdm(images):
  # predictions for each frame
  logits, states = model({**states, 'image': image})
  all_logits.append(logits)

# concatenating all the logits
logits = tf.concat(all_logits, 0)
# estimating probabilities
probs = tf.nn.softmax(logits, axis=-1)
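
The split and concatenate steps in this cell have direct NumPy counterparts, which makes the intermediate shapes easy to verify. The sketch below assumes a made-up 5-frame clip and uses zero arrays of shape (1, 600) as stand-ins for the per-frame model outputs.

```python
import numpy as np

# Hypothetical clip: 5 frames of 4x4 RGB.
video = np.zeros((5, 4, 4, 3), dtype=np.float32)

# Add a batch axis, then split along the frame axis into single-frame clips.
images = np.split(video[np.newaxis], video.shape[0], axis=1)
print(len(images), images[0].shape)

# Stand-ins for per-frame outputs of shape (batch, classes) = (1, 600).
per_frame_logits = [np.zeros((1, 600), dtype=np.float32) for _ in images]

# Concatenating along the first axis gives one row per frame: (frames, classes).
logits = np.concatenate(per_frame_logits, axis=0)
print(logits.shape)
```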

In [ ]:

final_probs = probs[-1]
print('Top_k predictions and their probabilities\n')
for label, p in get_top_k(final_probs):
  print(f'{label:20s}: {p:.3f}')

Convert the sequence of probabilities into a video:

In [ ]:

# Generate a plot and output to a video tensor
plot_video = plot_streaming_top_preds(probs, video, video_fps=8.)

In [ ]:

# For gif format, set codec='gif'
media.show_video(plot_video, fps=3)

Resources

The pretrained models are available from TF Hub. The TF Hub collection also includes quantized models optimized for TFLite.

The source for these models is available in the TensorFlow Model Garden. It includes a longer version of this tutorial that also covers building and fine-tuning a MoViNet model.

Next steps

To learn more about working with video data in TensorFlow, check out the following tutorials: