
Licensed under the Apache License, Version 2.0 (the "License");

# Copyright 2021 The TensorFlow Hub Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================

MoViNet for streaming action recognition

This tutorial demonstrates how to use a pretrained video classification model to classify the activity (such as dancing, swimming, or biking) in a given video.

The model architecture used in this tutorial is called MoViNet (Mobile Video Networks). MoViNets are a family of efficient video classification models trained on a huge dataset (Kinetics 600).

In contrast to the i3d models available on TF Hub, MoViNets also support frame-by-frame inference on streaming video.

The pretrained models are available from TF Hub. The TF Hub collection also includes quantized models optimized for TFLite.

The source for these models is available in the TensorFlow Model Garden. This includes a longer version of this tutorial that also covers building and fine-tuning a MoViNet model.

This MoViNet tutorial is part of a series of TensorFlow video tutorials, alongside three other tutorials in the series.

[jumping jacks plot]

Setup

For inference on the smaller models (A0-A2), CPU is sufficient for this Colab.

!sudo apt install -y ffmpeg
!pip install -q mediapy

!pip uninstall -q -y opencv-python-headless
!pip install -q "opencv-python-headless<4.3"
# Import libraries
import pathlib

import matplotlib as mpl
import matplotlib.pyplot as plt
import mediapy as media
import numpy as np
import PIL
import tensorflow as tf
import tensorflow_hub as hub
import tqdm

mpl.rcParams.update({
    'font.size': 10,
})

Get the Kinetics 600 label list, and print the first few labels:

labels_path = tf.keras.utils.get_file(
    fname='labels.txt',
    origin='https://raw.githubusercontent.com/tensorflow/models/f8af2291cced43fc9f1d9b41ddbf772ae7b0d7d2/official/projects/movinet/files/kinetics_600_labels.txt'
)
labels_path = pathlib.Path(labels_path)

lines = labels_path.read_text().splitlines()
KINETICS_600_LABELS = np.array([line.strip() for line in lines])
KINETICS_600_LABELS[:20]

To provide a simple example video to classify, you can load a short gif of jumping jacks being performed.

[jumping jacks]

Credit: video shared on YouTube by Coach Bobby Bluford under the CC-BY license.

Download the gif.

jumpingjack_url = 'https://github.com/tensorflow/models/raw/f8af2291cced43fc9f1d9b41ddbf772ae7b0d7d2/official/projects/movinet/files/jumpingjack.gif'
jumpingjack_path = tf.keras.utils.get_file(
    fname='jumpingjack.gif',
    origin=jumpingjack_url,
    cache_dir='.',
    cache_subdir='.',
)

Define a function that reads a gif file into a tf.Tensor:

#@title
# Read and process a video
def load_gif(file_path, image_size=(224, 224)):
  """Loads a gif file into a TF tensor.

  Use images resized to match what's expected by your model.
  The model pages say the "A2" models expect 224 x 224 images at 5 fps.

  Args:
    file_path: path to the location of a gif file.
    image_size: a tuple of target size.

  Returns:
    a video of the gif file
  """
  # Load a gif file, convert it to a TF tensor
  raw = tf.io.read_file(file_path)
  video = tf.io.decode_gif(raw)
  # Resize the video
  video = tf.image.resize(video, image_size)
  # Change dtype to float32.
  # Hub models always want images normalized to [0, 1].
  # ref: https://tensorflow.google.cn/hub/common_signatures/images#input
  video = tf.cast(video, tf.float32) / 255.
  return video

The video has the shape (frames, height, width, colors).

jumpingjack = load_gif(jumpingjack_path)
jumpingjack.shape
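Optionally, you can preview the clip to confirm it decoded correctly. This is a small sketch (not part of the original notebook) using the mediapy package imported above; the fps value here is arbitrary:

# Preview the clip; values are already normalized to [0, 1], which mediapy accepts.
media.show_video(jumpingjack.numpy(), fps=8)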

How to use the model

This section contains a walkthrough demonstrating how to use the models from TensorFlow Hub. If you just want to see the models in action, skip ahead to the next section.

Each model comes in two versions, base and streaming:

  • The base version takes a whole video as input and returns the probabilities averaged over the frames.

  • The streaming version takes a video frame and an RNN state as input, and returns the predictions for that frame along with the new RNN state. (A rough sketch of the two call patterns follows below.)
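The following is illustrative pseudocode only, sketching how the two call patterns differ; the real signatures are demonstrated in the sections that follow:

# base: one call over the whole clip; the logits already average across frames.
#   logits = base_model(video)                      # video: (batch, frames, height, width, colors)
#
# streaming: one call per frame, threading the RNN state through the loop.
#   state = stream_model.init_states(video.shape)
#   for frame in frames:
#     logits, state = stream_model({**state, 'image': frame})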

The base model

%%time
id = 'a2'
mode = 'base'
version = '3'
hub_url = f'https://tfhub.dev/tensorflow/movinet/{id}/{mode}/kinetics-600/classification/{version}'
model = hub.load(hub_url)

This version of the model has one signature. It takes an image argument, a tf.float32 tensor with shape (batch, frames, height, width, colors). It returns a dictionary containing one output: a tf.float32 tensor of logits with shape (batch, classes).

sig = model.signatures['serving_default']
print(sig.pretty_printed_signature())
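As a quick sanity check (a sketch added here, not from the original notebook), you can pass a random tensor of the documented shape through the signature and confirm that the output has one logit per Kinetics 600 class; the frame count of 8 is arbitrary:

# A random "video": 1 batch, 8 frames, 224x224 pixels, 3 color channels.
dummy_video = tf.random.uniform((1, 8, 224, 224, 3), dtype=tf.float32)
dummy_out = sig(image=dummy_video)
print(dummy_out['classifier_head'].shape)  # expected: (1, 600)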

To run this signature on the video, you first need to add the outer batch dimension to the video.

#warmup
sig(image = jumpingjack[tf.newaxis, :1]);

%%time
logits = sig(image = jumpingjack[tf.newaxis, ...])
logits = logits['classifier_head'][0]

print(logits.shape)
print()

Define a get_top_k function that packages the above output processing for later reuse.

#@title
# Get top_k labels and probabilities
def get_top_k(probs, k=5, label_map=KINETICS_600_LABELS):
  """Outputs the top k model labels and probabilities on the given video.

  Args:
    probs: probability tensor of shape (num_frames, num_classes) that
      represents the probability of each class on each frame.
    k: the number of top predictions to select.
    label_map: a list of labels to map logit indices to label strings.

  Returns:
    a tuple of the top-k labels and probabilities.
  """
  # Sort predictions to find top_k
  top_predictions = tf.argsort(probs, axis=-1, direction='DESCENDING')[:k]
  # Collect the labels of top_k predictions
  top_labels = tf.gather(label_map, top_predictions, axis=-1)
  # Decode labels
  top_labels = [label.decode('utf8') for label in top_labels.numpy()]
  # top_k probabilities of the predictions
  top_probs = tf.gather(probs, top_predictions, axis=-1).numpy()
  return tuple(zip(top_labels, top_probs))

logits 转换为概率,并查找视频的前 5 个类。模型确认该视频可能是 jumping jacks

probs = tf.nn.softmax(logits, axis=-1)
for label, p in get_top_k(probs):
  print(f'{label:20s}: {p:.3f}')

The streaming model

The previous section used a model that runs over the whole video. Often when processing a video you don't want a single prediction at the end; you want to update the prediction frame by frame. The stream versions of the model let you do this.

Load the stream version of the model.

%%time
id = 'a2'
mode = 'stream'
version = '3'
hub_url = f'https://tfhub.dev/tensorflow/movinet/{id}/{mode}/kinetics-600/classification/{version}'
model = hub.load(hub_url)

Using this model is slightly more involved than using the base model: you have to keep track of the internal state of the model's RNNs.

list(model.signatures.keys())

The init_states signature takes the video's shape (batch, frames, height, width, colors) as input, and returns a large dictionary of tensors holding the initial RNN state:

lines = model.signatures['init_states'].pretty_printed_signature().splitlines()
lines = lines[:10]
lines.append(' ...')
print('.\n'.join(lines))
initial_state = model.init_states(jumpingjack[tf.newaxis, ...].shape)
type(initial_state)
list(sorted(initial_state.keys()))[:5]
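To get a feel for the state dictionary, you can inspect the shape of one entry (a sketch; the key names and shapes depend on the model variant):

# Look at the first state tensor; keys vary by model variant.
first_key = sorted(initial_state.keys())[0]
print(first_key, initial_state[first_key].shape)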

Once you have the initial state for the RNNs, you can pass the state and a video frame as input (keeping the (batch, frames, height, width, colors) shape for the video frame). The model returns a (logits, state) pair.

Having seen only the first frame, the model is not yet convinced that the video is of jumping jacks:

inputs = initial_state.copy()

# Add the batch axis, take the first frame, but keep the frame-axis.
inputs['image'] = jumpingjack[tf.newaxis, 0:1, ...]

# warmup
model(inputs);
logits, new_state = model(inputs)
logits = logits[0]
probs = tf.nn.softmax(logits, axis=-1)

for label, p in get_top_k(probs):
  print(f'{label:20s}: {p:.3f}')
print()

If you run the model in a loop, passing the updated state back in with each frame, the model quickly converges to the correct result:

%%time
state = initial_state.copy()
all_logits = []

for n in range(len(jumpingjack)):
  inputs = state
  inputs['image'] = jumpingjack[tf.newaxis, n:n+1, ...]
  result, state = model(inputs)
  # Drop the batch axis and collect this frame's logits.
  all_logits.append(result[0])

probabilities = tf.nn.softmax(all_logits, axis=-1)
for label, p in get_top_k(probabilities[-1]):
  print(f'{label:20s}: {p:.3f}')
id = tf.argmax(probabilities[-1])
plt.plot(probabilities[:, id])
plt.xlabel('Frame #')
plt.ylabel(f"p('{KINETICS_600_LABELS[id]}')");

You may notice that the final probability is much more confident than the base model's output in the previous section. That is because the base model returns the average of the predictions over all frames.

for label, p in get_top_k(tf.reduce_mean(probabilities, axis=0)):
  print(f'{label:20s}: {p:.3f}')

Animate the predictions over time

The previous section walked through how to use these models in detail. This section builds on that to produce some nice inference animations.

The hidden cell below defines the helper functions used in this section.

#@title
# Get top_k labels and probabilities predicted using MoViNets streaming model
def get_top_k_streaming_labels(probs, k=5, label_map=KINETICS_600_LABELS):
  """Returns the top-k labels over an entire video sequence.

  Args:
    probs: probability tensor of shape (num_frames, num_classes) that
      represents the probability of each class on each frame.
    k: the number of top predictions to select.
    label_map: a list of labels to map logit indices to label strings.

  Returns:
    a tuple of the top-k probabilities, labels, and logit indices
  """
  top_categories_last = tf.argsort(probs, -1, 'DESCENDING')[-1, :1]
  # Sort predictions to find top_k
  categories = tf.argsort(probs, -1, 'DESCENDING')[:, :k]
  categories = tf.reshape(categories, [-1])

  counts = sorted([
      (i.numpy(), tf.reduce_sum(tf.cast(categories == i, tf.int32)).numpy())
      for i in tf.unique(categories)[0]
  ], key=lambda x: x[1], reverse=True)

  top_probs_idx = tf.constant([i for i, _ in counts[:k]])
  top_probs_idx = tf.concat([top_categories_last, top_probs_idx], 0)
  # Find unique indices of categories
  top_probs_idx = tf.unique(top_probs_idx)[0][:k+1]
  # top_k probabilities of the predictions
  top_probs = tf.gather(probs, top_probs_idx, axis=-1)
  top_probs = tf.transpose(top_probs, perm=(1, 0))
  # Collect the labels of top_k predictions
  top_labels = tf.gather(label_map, top_probs_idx, axis=0)
  # Decode the top_k labels
  top_labels = [label.decode('utf8') for label in top_labels.numpy()]

  return top_probs, top_labels, top_probs_idx

# Plot top_k predictions at a given time step
def plot_streaming_top_preds_at_step(
    top_probs,
    top_labels,
    step=None,
    image=None,
    legend_loc='lower left',
    duration_seconds=10,
    figure_height=500,
    playhead_scale=0.8,
    grid_alpha=0.3):
  """Generates a plot of the top video model predictions at a given time step.

  Args:
    top_probs: a tensor of shape (k, num_frames) representing the top-k
      probabilities over all frames.
    top_labels: a list of length k that represents the top-k label strings.
    step: the current time step in the range [0, num_frames].
    image: the image frame to display at the current time step.
    legend_loc: the placement location of the legend.
    duration_seconds: the total duration of the video.
    figure_height: the output figure height.
    playhead_scale: scale value for the playhead.
    grid_alpha: alpha value for the gridlines.

  Returns:
    The output numpy image.
  """
  # Find the number of top_k labels and frames in the video
  num_labels, num_frames = top_probs.shape
  if step is None:
    step = num_frames
  # Visualize frames and top_k probabilities of the streaming video
  fig = plt.figure(figsize=(6.5, 7), dpi=300)
  gs = mpl.gridspec.GridSpec(8, 1)
  ax2 = plt.subplot(gs[:-3, :])
  ax = plt.subplot(gs[-3:, :])

  # Display the frame
  if image is not None:
    ax2.imshow(image, interpolation='nearest')
    ax2.axis('off')

  # x-axis (frame number)
  preview_line_x = tf.linspace(0., duration_seconds, num_frames)
  # y-axis (top_k probabilities)
  preview_line_y = top_probs

  line_x = preview_line_x[:step+1]
  line_y = preview_line_y[:, :step+1]

  for i in range(num_labels):
    ax.plot(preview_line_x, preview_line_y[i], label=None, linewidth='1.5',
            linestyle=':', color='gray')
    ax.plot(line_x, line_y[i], label=top_labels[i], linewidth='2.0')

  ax.grid(which='major', linestyle=':', linewidth='1.0', alpha=grid_alpha)
  ax.grid(which='minor', linestyle=':', linewidth='0.5', alpha=grid_alpha)

  min_height = tf.reduce_min(top_probs) * playhead_scale
  max_height = tf.reduce_max(top_probs)
  ax.vlines(preview_line_x[step], min_height, max_height, colors='red')
  ax.scatter(preview_line_x[step], max_height, color='red')

  ax.legend(loc=legend_loc)

  plt.xlim(0, duration_seconds)
  plt.ylabel('Probability')
  plt.xlabel('Time (s)')
  plt.yscale('log')

  fig.tight_layout()
  fig.canvas.draw()

  data = np.frombuffer(fig.canvas.tostring_rgb(), dtype=np.uint8)
  data = data.reshape(fig.canvas.get_width_height()[::-1] + (3,))
  plt.close()

  figure_width = int(figure_height * data.shape[1] / data.shape[0])
  image = PIL.Image.fromarray(data).resize([figure_width, figure_height])
  image = np.array(image)

  return image

# Plot top_k predictions from the MoViNets streaming model
def plot_streaming_top_preds(
    probs,
    video,
    top_k=5,
    video_fps=25.,
    figure_height=500,
    use_progbar=True):
  """Generates a video plot of the top video model predictions.

  Args:
    probs: probability tensor of shape (num_frames, num_classes) that
      represents the probability of each class on each frame.
    video: the video to display in the plot.
    top_k: the number of top predictions to select.
    video_fps: the input video fps.
    figure_height: the height of the output video.
    use_progbar: display a progress bar.

  Returns:
    A numpy array representing the output video.
  """
  # Select the number of frames per second
  video_fps = 8.
  # Select the height of the image
  figure_height = 500
  # Number of time steps in the given video
  steps = video.shape[0]
  # Estimate the duration of the video (in seconds)
  duration = steps / video_fps
  # Estimate the top_k probabilities and corresponding labels
  top_probs, top_labels, _ = get_top_k_streaming_labels(probs, k=top_k)

  images = []
  step_generator = tqdm.trange(steps) if use_progbar else range(steps)
  for i in step_generator:
    image = plot_streaming_top_preds_at_step(
        top_probs=top_probs,
        top_labels=top_labels,
        step=i,
        image=video[i],
        duration_seconds=duration,
        figure_height=figure_height,
    )
    images.append(image)

  return np.array(images)

First, run the streaming model over the frames of the video and collect the logits:

init_states = model.init_states(jumpingjack[tf.newaxis].shape)
# Insert your video clip here
video = jumpingjack
images = tf.split(video[tf.newaxis], video.shape[0], axis=1)

all_logits = []

# To run on a video, pass in one frame at a time
states = init_states
for image in tqdm.tqdm(images):
  # Predictions for each frame
  logits, states = model({**states, 'image': image})
  all_logits.append(logits)

# Concatenate all the logits
logits = tf.concat(all_logits, 0)
# Estimate probabilities
probs = tf.nn.softmax(logits, axis=-1)
final_probs = probs[-1]
print('Top_k predictions and their probabilities\n')
for label, p in get_top_k(final_probs):
  print(f'{label:20s}: {p:.3f}')

Convert the sequence of probabilities into a video:

# Generate a plot and output to a video tensor
plot_video = plot_streaming_top_preds(probs, video, video_fps=8.)

# For gif format, set codec='gif'
media.show_video(plot_video, fps=3)
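If you want to keep the animation, you can also write it to disk. The snippet below is a sketch, assuming mediapy's write_video function; the output path and fps are illustrative:

# Save the animation as a gif (path and fps are illustrative).
media.write_video('./top_preds.gif', plot_video, fps=3, codec='gif')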

Resources

The pretrained models are available from TF Hub. The TF Hub collection also includes quantized models optimized for TFLite.

The source for these models is available in the TensorFlow Model Garden. This includes a longer version of this tutorial that also covers building and fine-tuning a MoViNet model.

Next steps

To learn more about working with video data in TensorFlow, check out the other tutorials in this series.