
Licensed under the Apache License, Version 2.0 (the "License");

# Copyright 2021 The TensorFlow Hub Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================

MoViNet for streaming action recognition

This tutorial demonstrates how to use a pretrained video classification model to classify the activity (such as dancing, swimming, or biking) in a given video.

The model architecture used in this tutorial is called MoViNet (Mobile Video Networks). MoViNets are a family of efficient video classification models trained on a huge dataset (Kinetics 600).

In contrast to the i3d models available on TF Hub, MoViNets also support frame-by-frame inference on streaming video.

The pretrained models are available from TF Hub. The TF Hub collection also includes quantized models optimized for TFLite.

The source for these models is available in the TensorFlow Model Garden. This includes a longer version of this tutorial that also covers building and fine-tuning a MoViNet model.

This MoViNet tutorial is part of a series of TensorFlow video tutorials, alongside three other tutorials in the series.

[jumping jacks plot]

Setup

For inference on the smaller models (A0-A2), CPU is sufficient for this Colab.

!sudo apt install -y ffmpeg
!pip install -q mediapy

!pip uninstall -q -y opencv-python-headless
!pip install -q "opencv-python-headless<4.3"
# Import libraries
import pathlib

import matplotlib as mpl
import matplotlib.pyplot as plt
import mediapy as media
import numpy as np
import PIL
import tensorflow as tf
import tensorflow_hub as hub
import tqdm

mpl.rcParams.update({
    'font.size': 10,
})

Get the Kinetics 600 label list, and print the first few labels:

labels_path = tf.keras.utils.get_file(
    fname='labels.txt',
    origin='https://raw.githubusercontent.com/tensorflow/models/f8af2291cced43fc9f1d9b41ddbf772ae7b0d7d2/official/projects/movinet/files/kinetics_600_labels.txt'
)
labels_path = pathlib.Path(labels_path)

lines = labels_path.read_text().splitlines()
KINETICS_600_LABELS = np.array([line.strip() for line in lines])
KINETICS_600_LABELS[:20]

To provide a simple example video to classify, you can load a short gif of jumping jacks being performed.

[jumping jacks]

Credit: video shared on YouTube by Coach Bobby Bluford under the CC-BY license.

Download the gif.

jumpingjack_url = 'https://github.com/tensorflow/models/raw/f8af2291cced43fc9f1d9b41ddbf772ae7b0d7d2/official/projects/movinet/files/jumpingjack.gif'
jumpingjack_path = tf.keras.utils.get_file(
    fname='jumpingjack.gif',
    origin=jumpingjack_url,
    cache_dir='.',
    cache_subdir='.',
)

Define a function that reads a gif file into a tf.Tensor:

#@title
# Read and process a video
def load_gif(file_path, image_size=(224, 224)):
  """Loads a gif file into a TF tensor.

  Use images resized to match what's expected by your model.
  The model pages say the "A2" models expect 224 x 224 images at 5 fps.

  Args:
    file_path: path to the location of a gif file.
    image_size: a tuple of target size.

  Returns:
    a video of the gif file
  """
  # Load a gif file, convert it to a TF tensor
  raw = tf.io.read_file(file_path)
  video = tf.io.decode_gif(raw)
  # Resize the video
  video = tf.image.resize(video, image_size)
  # Change dtype to float32.
  # Hub models always want images normalized to [0, 1].
  # ref: https://tensorflow.google.cn/hub/common_signatures/images#input
  video = tf.cast(video, tf.float32) / 255.
  return video

The video has the shape (frames, height, width, colors).

jumpingjack = load_gif(jumpingjack_path)
jumpingjack.shape
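Optionally, you can preview the clip to confirm it decoded correctly. This is a small sketch (not part of the original notebook) using the mediapy package imported above; the fps value here is arbitrary:

# Preview the clip; values are already normalized to [0, 1], which mediapy accepts.
media.show_video(jumpingjack.numpy(), fps=8)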

How to use the model

This section contains a walkthrough demonstrating how to use the models from TensorFlow Hub. If you just want to see the models in action, skip ahead to the next section.

Each model comes in two versions, base and streaming:

  • The base version takes a whole video as input and returns the probabilities averaged over the frames.

  • The streaming version takes a video frame and an RNN state as input, and returns the predictions for that frame along with the new RNN state. (A rough sketch of the two call patterns follows below.)
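The following is illustrative pseudocode only, sketching how the two call patterns differ; the real signatures are demonstrated in the sections that follow:

# base: one call over the whole clip; the logits already average across frames.
#   logits = base_model(video)                      # video: (batch, frames, height, width, colors)
#
# streaming: one call per frame, threading the RNN state through the loop.
#   state = stream_model.init_states(video.shape)
#   for frame in frames:
#     logits, state = stream_model({**state, 'image': frame})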

The base model

%%time
id = 'a2'
mode = 'base'
version = '3'
hub_url = f'https://tfhub.dev/tensorflow/movinet/{id}/{mode}/kinetics-600/classification/{version}'
model = hub.load(hub_url)

This version of the model has one signature. It takes an image argument, a tf.float32 tensor with shape (batch, frames, height, width, colors). It returns a dictionary containing one output: a tf.float32 tensor of logits with shape (batch, classes).

sig = model.signatures['serving_default']
print(sig.pretty_printed_signature())
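As a quick sanity check (a sketch added here, not from the original notebook), you can pass a random tensor of the documented shape through the signature and confirm that the output has one logit per Kinetics 600 class; the frame count of 8 is arbitrary:

# A random "video": 1 batch, 8 frames, 224x224 pixels, 3 color channels.
dummy_video = tf.random.uniform((1, 8, 224, 224, 3), dtype=tf.float32)
dummy_out = sig(image=dummy_video)
print(dummy_out['classifier_head'].shape)  # expected: (1, 600)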

To run this signature on the video, you first need to add the outer batch dimension to the video.

#warmup
sig(image = jumpingjack[tf.newaxis, :1]);

%%time
logits = sig(image = jumpingjack[tf.newaxis, ...])
logits = logits['classifier_head'][0]

print(logits.shape)
print()

Define a get_top_k function that packages the above output processing for later reuse.

#@title
# Get top_k labels and probabilities
def get_top_k(probs, k=5, label_map=KINETICS_600_LABELS):
  """Outputs the top k model labels and probabilities on the given video.

  Args:
    probs: probability tensor of shape (num_frames, num_classes) that
      represents the probability of each class on each frame.
    k: the number of top predictions to select.
    label_map: a list of labels to map logit indices to label strings.

  Returns:
    a tuple of the top-k labels and probabilities.
  """
  # Sort predictions to find top_k
  top_predictions = tf.argsort(probs, axis=-1, direction='DESCENDING')[:k]
  # Collect the labels of top_k predictions
  top_labels = tf.gather(label_map, top_predictions, axis=-1)
  # Decode labels
  top_labels = [label.decode('utf8') for label in top_labels.numpy()]
  # top_k probabilities of the predictions
  top_probs = tf.gather(probs, top_predictions, axis=-1).numpy()
  return tuple(zip(top_labels, top_probs))

logits 转换为概率,并查找视频的前 5 个类。模型确认该视频可能是 jumping jacks

probs = tf.nn.softmax(logits, axis=-1)
for label, p in get_top_k(probs):
  print(f'{label:20s}: {p:.3f}')

The streaming model

The previous section used a model that runs over the whole video. Often when processing a video you don't want a single prediction at the end; you want to update the prediction frame by frame. The stream versions of the model let you do this.

Load the stream version of the model.

%%time
id = 'a2'
mode = 'stream'
version = '3'
hub_url = f'https://tfhub.dev/tensorflow/movinet/{id}/{mode}/kinetics-600/classification/{version}'
model = hub.load(hub_url)

Using this model is slightly more involved than using the base model: you have to keep track of the internal state of the model's RNNs.

list(model.signatures.keys())

The init_states signature takes the video's shape (batch, frames, height, width, colors) as input, and returns a large dictionary of tensors holding the initial RNN state:

lines = model.signatures['init_states'].pretty_printed_signature().splitlines()
lines = lines[:10]
lines.append(' ...')
print('.\n'.join(lines))
initial_state = model.init_states(jumpingjack[tf.newaxis, ...].shape)
type(initial_state)
list(sorted(initial_state.keys()))[:5]
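To get a feel for the state dictionary, you can inspect the shape of one entry (a sketch; the key names and shapes depend on the model variant):

# Look at the first state tensor; keys vary by model variant.
first_key = sorted(initial_state.keys())[0]
print(first_key, initial_state[first_key].shape)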

Once you have the initial state for the RNNs, you can pass the state and a video frame as input (keeping the (batch, frames, height, width, colors) shape for the video frame). The model returns a (logits, state) pair.

Having seen only the first frame, the model is not yet convinced that the video is of jumping jacks:

inputs = initial_state.copy()

# Add the batch axis, take the first frame, but keep the frame-axis.
inputs['image'] = jumpingjack[tf.newaxis, 0:1, ...]

# warmup
model(inputs);
logits, new_state = model(inputs)
logits = logits[0]
probs = tf.nn.softmax(logits, axis=-1)

for label, p in get_top_k(probs):
  print(f'{label:20s}: {p:.3f}')
print()

If you run the model in a loop, passing the updated state back in with each frame, the model quickly converges to the correct result:

%%time
state = initial_state.copy()
all_logits = []

for n in range(len(jumpingjack)):
  inputs = state
  inputs['image'] = jumpingjack[tf.newaxis, n:n+1, ...]
  result, state = model(inputs)
  # Drop the batch axis and collect this frame's logits.
  all_logits.append(result[0])

probabilities = tf.nn.softmax(all_logits, axis=-1)
for label, p in get_top_k(probabilities[-1]):
  print(f'{label:20s}: {p:.3f}')
id = tf.argmax(probabilities[-1])
plt.plot(probabilities[:, id])
plt.xlabel('Frame #')
plt.ylabel(f"p('{KINETICS_600_LABELS[id]}')");

You may notice that the final probability is much more confident than the base model's output in the previous section. That is because the base model returns the average of the predictions over all frames.

for label, p in get_top_k(tf.reduce_mean(probabilities, axis=0)):
  print(f'{label:20s}: {p:.3f}')

Animate the predictions over time

The previous section walked through how to use these models in detail. This section builds on that to produce some nice inference animations.

The hidden cell below defines the helper functions used in this section.

#@title
# Get top_k labels and probabilities predicted using MoViNets streaming model
def get_top_k_streaming_labels(probs, k=5, label_map=KINETICS_600_LABELS):
  """Returns the top-k labels over an entire video sequence.

  Args:
    probs: probability tensor of shape (num_frames, num_classes) that
      represents the probability of each class on each frame.
    k: the number of top predictions to select.
    label_map: a list of labels to map logit indices to label strings.

  Returns:
    a tuple of the top-k probabilities, labels, and logit indices
  """
  top_categories_last = tf.argsort(probs, -1, 'DESCENDING')[-1, :1]
  # Sort predictions to find top_k
  categories = tf.argsort(probs, -1, 'DESCENDING')[:, :k]
  categories = tf.reshape(categories, [-1])

  counts = sorted([
      (i.numpy(), tf.reduce_sum(tf.cast(categories == i, tf.int32)).numpy())
      for i in tf.unique(categories)[0]
  ], key=lambda x: x[1], reverse=True)

  top_probs_idx = tf.constant([i for i, _ in counts[:k]])
  top_probs_idx = tf.concat([top_categories_last, top_probs_idx], 0)
  # Find unique indices of categories
  top_probs_idx = tf.unique(top_probs_idx)[0][:k+1]
  # top_k probabilities of the predictions
  top_probs = tf.gather(probs, top_probs_idx, axis=-1)
  top_probs = tf.transpose(top_probs, perm=(1, 0))
  # Collect the labels of top_k predictions
  top_labels = tf.gather(label_map, top_probs_idx, axis=0)
  # Decode the top_k labels
  top_labels = [label.decode('utf8') for label in top_labels.numpy()]

  return top_probs, top_labels, top_probs_idx

# Plot top_k predictions at a given time step
def plot_streaming_top_preds_at_step(
    top_probs,
    top_labels,
    step=None,
    image=None,
    legend_loc='lower left',
    duration_seconds=10,
    figure_height=500,
    playhead_scale=0.8,
    grid_alpha=0.3):
  """Generates a plot of the top video model predictions at a given time step.

  Args:
    top_probs: a tensor of shape (k, num_frames) representing the top-k
      probabilities over all frames.
    top_labels: a list of length k that represents the top-k label strings.
    step: the current time step in the range [0, num_frames].
    image: the image frame to display at the current time step.
    legend_loc: the placement location of the legend.
    duration_seconds: the total duration of the video.
    figure_height: the output figure height.
    playhead_scale: scale value for the playhead.
    grid_alpha: alpha value for the gridlines.

  Returns:
    The output numpy image.
  """
  # Find the number of top_k labels and frames in the video
  num_labels, num_frames = top_probs.shape
  if step is None:
    step = num_frames
  # Visualize frames and top_k probabilities of the streaming video
  fig = plt.figure(figsize=(6.5, 7), dpi=300)
  gs = mpl.gridspec.GridSpec(8, 1)
  ax2 = plt.subplot(gs[:-3, :])
  ax = plt.subplot(gs[-3:, :])

  # Display the frame
  if image is not None:
    ax2.imshow(image, interpolation='nearest')
    ax2.axis('off')

  # x-axis (frame number)
  preview_line_x = tf.linspace(0., duration_seconds, num_frames)
  # y-axis (top_k probabilities)
  preview_line_y = top_probs

  line_x = preview_line_x[:step+1]
  line_y = preview_line_y[:, :step+1]

  for i in range(num_labels):
    ax.plot(preview_line_x, preview_line_y[i], label=None, linewidth='1.5',
            linestyle=':', color='gray')
    ax.plot(line_x, line_y[i], label=top_labels[i], linewidth='2.0')

  ax.grid(which='major', linestyle=':', linewidth='1.0', alpha=grid_alpha)
  ax.grid(which='minor', linestyle=':', linewidth='0.5', alpha=grid_alpha)

  min_height = tf.reduce_min(top_probs) * playhead_scale
  max_height = tf.reduce_max(top_probs)
  ax.vlines(preview_line_x[step], min_height, max_height, colors='red')
  ax.scatter(preview_line_x[step], max_height, color='red')

  ax.legend(loc=legend_loc)

  plt.xlim(0, duration_seconds)
  plt.ylabel('Probability')
  plt.xlabel('Time (s)')
  plt.yscale('log')

  fig.tight_layout()
  fig.canvas.draw()

  data = np.frombuffer(fig.canvas.tostring_rgb(), dtype=np.uint8)
  data = data.reshape(fig.canvas.get_width_height()[::-1] + (3,))
  plt.close()

  figure_width = int(figure_height * data.shape[1] / data.shape[0])
  image = PIL.Image.fromarray(data).resize([figure_width, figure_height])
  image = np.array(image)

  return image

# Plot top_k predictions from the MoViNets streaming model
def plot_streaming_top_preds(
    probs,
    video,
    top_k=5,
    video_fps=25.,
    figure_height=500,
    use_progbar=True):
  """Generates a video plot of the top video model predictions.

  Args:
    probs: probability tensor of shape (num_frames, num_classes) that
      represents the probability of each class on each frame.
    video: the video to display in the plot.
    top_k: the number of top predictions to select.
    video_fps: the input video fps.
    figure_height: the height of the output video.
    use_progbar: display a progress bar.

  Returns:
    A numpy array representing the output video.
  """
  # Select the number of frames per second
  video_fps = 8.
  # Select the height of the image
  figure_height = 500
  # Number of time steps in the given video
  steps = video.shape[0]
  # Estimate the duration of the video (in seconds)
  duration = steps / video_fps
  # Estimate the top_k probabilities and corresponding labels
  top_probs, top_labels, _ = get_top_k_streaming_labels(probs, k=top_k)

  images = []
  step_generator = tqdm.trange(steps) if use_progbar else range(steps)
  for i in step_generator:
    image = plot_streaming_top_preds_at_step(
        top_probs=top_probs,
        top_labels=top_labels,
        step=i,
        image=video[i],
        duration_seconds=duration,
        figure_height=figure_height,
    )
    images.append(image)

  return np.array(images)

First, run the streaming model over the frames of the video and collect the logits:

init_states = model.init_states(jumpingjack[tf.newaxis].shape)
# Insert your video clip here
video = jumpingjack
images = tf.split(video[tf.newaxis], video.shape[0], axis=1)

all_logits = []

# To run on a video, pass in one frame at a time
states = init_states
for image in tqdm.tqdm(images):
  # Predictions for each frame
  logits, states = model({**states, 'image': image})
  all_logits.append(logits)

# Concatenate all the logits
logits = tf.concat(all_logits, 0)
# Estimate probabilities
probs = tf.nn.softmax(logits, axis=-1)
final_probs = probs[-1]
print('Top_k predictions and their probabilities\n')
for label, p in get_top_k(final_probs):
  print(f'{label:20s}: {p:.3f}')

Convert the sequence of probabilities into a video:

# Generate a plot and output to a video tensor
plot_video = plot_streaming_top_preds(probs, video, video_fps=8.)

# For gif format, set codec='gif'
media.show_video(plot_video, fps=3)
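If you want to keep the animation, you can also write it to disk. The snippet below is a sketch, assuming mediapy's write_video function; the output path and fps are illustrative:

# Save the animation as a gif (path and fps are illustrative).
media.write_video('./top_preds.gif', plot_video, fps=3, codec='gif')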

Resources

The pretrained models are available from TF Hub. The TF Hub collection also includes quantized models optimized for TFLite.

The source for these models is available in the TensorFlow Model Garden. This includes a longer version of this tutorial that also covers building and fine-tuning a MoViNet model.

Next steps

To learn more about working with video data in TensorFlow, check out the other tutorials in this series.