Automatic Speech Recognition with Transformer
Author: Apoorv Nandan
Date created: 2021/01/13
Last modified: 2021/01/13
Description: Training a sequence-to-sequence Transformer for automatic speech recognition.
Introduction
Automatic speech recognition (ASR) consists of transcribing audio speech segments into text. ASR can be treated as a sequence-to-sequence problem, where the audio can be represented as a sequence of feature vectors and the text as a sequence of characters, words, or subword tokens.
For this demonstration, we will use the LJSpeech dataset from the LibriVox project. It consists of short audio clips of a single speaker reading passages from 7 non-fiction books. Our model will be similar to the original Transformer (both encoder and decoder) as proposed in the paper, "Attention is All You Need".
Define the Transformer Input Layer
When processing past target tokens for the decoder, we compute the sum of position embeddings and token embeddings.
When processing audio features, we apply convolutional layers to downsample them (via convolution strides) and process local relationships.
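The two input layers described above could be sketched in Keras as follows. This is a minimal sketch: the layer names, kernel sizes, and default hyperparameters here are illustrative choices, not necessarily the exact values used in the full example.

```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers


class TokenEmbedding(layers.Layer):
    """Decoder input: sum of token embeddings and learned position embeddings."""

    def __init__(self, num_vocab=1000, maxlen=100, num_hid=64):
        super().__init__()
        self.emb = layers.Embedding(num_vocab, num_hid)
        self.pos_emb = layers.Embedding(input_dim=maxlen, output_dim=num_hid)

    def call(self, x):
        maxlen = tf.shape(x)[-1]
        x = self.emb(x)
        positions = tf.range(start=0, limit=maxlen, delta=1)
        # Broadcast position embeddings across the batch dimension.
        return x + self.pos_emb(positions)


class SpeechFeatureEmbedding(layers.Layer):
    """Encoder input: strided 1D convolutions that downsample the audio
    features (each stride-2 convolution halves the time dimension)."""

    def __init__(self, num_hid=64):
        super().__init__()
        self.conv1 = layers.Conv1D(num_hid, 11, strides=2, padding="same", activation="relu")
        self.conv2 = layers.Conv1D(num_hid, 11, strides=2, padding="same", activation="relu")
        self.conv3 = layers.Conv1D(num_hid, 11, strides=2, padding="same", activation="relu")

    def call(self, x):
        x = self.conv1(x)
        x = self.conv2(x)
        return self.conv3(x)
```

With three stride-2 convolutions, an input of 80 spectrogram frames comes out as 10 encoder positions, which keeps the encoder's self-attention affordable on long audio.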
Transformer Encoder Layer
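A standard post-norm encoder block (multi-head self-attention followed by a position-wise feed-forward network, each wrapped in a residual connection with layer normalization) might look like this; the dropout rate and layer sizes are illustrative:

```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers


class TransformerEncoder(layers.Layer):
    """Encoder block: self-attention + feed-forward, with residuals and layer norm."""

    def __init__(self, embed_dim, num_heads, feed_forward_dim, rate=0.1):
        super().__init__()
        self.att = layers.MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)
        self.ffn = keras.Sequential(
            [layers.Dense(feed_forward_dim, activation="relu"), layers.Dense(embed_dim)]
        )
        self.layernorm1 = layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = layers.LayerNormalization(epsilon=1e-6)
        self.dropout1 = layers.Dropout(rate)
        self.dropout2 = layers.Dropout(rate)

    def call(self, inputs, training=False):
        # Self-attention over the (downsampled) audio feature sequence.
        attn_output = self.att(inputs, inputs)
        attn_output = self.dropout1(attn_output, training=training)
        out1 = self.layernorm1(inputs + attn_output)
        # Position-wise feed-forward network.
        ffn_output = self.ffn(out1)
        ffn_output = self.dropout2(ffn_output, training=training)
        return self.layernorm2(out1 + ffn_output)
```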
Transformer Decoder Layer
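The decoder block adds two things on top of the encoder block: a causal mask, so each target position can only attend to earlier positions, and a cross-attention layer over the encoder output. A sketch (dropout rates and the mask construction are illustrative):

```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers


class TransformerDecoder(layers.Layer):
    """Decoder block: causal self-attention, cross-attention over the
    encoder output, then a feed-forward network."""

    def __init__(self, embed_dim, num_heads, feed_forward_dim):
        super().__init__()
        self.layernorm1 = layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = layers.LayerNormalization(epsilon=1e-6)
        self.layernorm3 = layers.LayerNormalization(epsilon=1e-6)
        self.self_att = layers.MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)
        self.enc_att = layers.MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)
        self.self_dropout = layers.Dropout(0.5)
        self.enc_dropout = layers.Dropout(0.1)
        self.ffn_dropout = layers.Dropout(0.1)
        self.ffn = keras.Sequential(
            [layers.Dense(feed_forward_dim, activation="relu"), layers.Dense(embed_dim)]
        )

    def causal_attention_mask(self, batch_size, n_dest, n_src, dtype):
        """Lower-triangular mask: position i may attend only to positions <= i."""
        i = tf.range(n_dest)[:, None]
        j = tf.range(n_src)
        mask = tf.cast(i >= j - n_src + n_dest, dtype)
        mask = tf.reshape(mask, [1, n_dest, n_src])
        mult = tf.concat(
            [tf.expand_dims(batch_size, -1), tf.constant([1, 1], dtype=tf.int32)], 0
        )
        return tf.tile(mask, mult)

    def call(self, enc_out, target):
        input_shape = tf.shape(target)
        causal_mask = self.causal_attention_mask(
            input_shape[0], input_shape[1], input_shape[1], tf.bool
        )
        # Masked self-attention over the target sequence.
        target_att = self.self_att(target, target, attention_mask=causal_mask)
        target_norm = self.layernorm1(target + self.self_dropout(target_att))
        # Cross-attention: queries from the targets, keys/values from the encoder.
        enc_att = self.enc_att(target_norm, enc_out)
        enc_norm = self.layernorm2(target_norm + self.enc_dropout(enc_att))
        ffn_out = self.ffn(enc_norm)
        return self.layernorm3(enc_norm + self.ffn_dropout(ffn_out))
```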
Complete the Transformer model
Our model takes audio spectrograms as inputs and predicts a sequence of characters. During training, we give the decoder the target character sequence (minus its final token) as input, and train it to predict that same sequence shifted one step to the left, i.e. to predict the next character at every position. During inference, the decoder uses its own past predictions to predict the next token.
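The training-time shift can be illustrated on a toy batch (the token ids here are hypothetical; assume 2 is the start token and 3 the end token):

```python
import tensorflow as tf

# One target sequence: <start> c1 c2 c3 <end>
target = tf.constant([[2, 5, 7, 9, 3]])
dec_input = target[:, :-1]  # decoder input: everything except the last token
labels = target[:, 1:]      # labels: everything except the start token
# At position i the decoder sees dec_input[0, :i+1] and is trained to
# predict labels[0, i], i.e. the next character in the transcript.
```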
Download the dataset
Note: this requires ~3.6 GB of free disk space, and extracting the files takes ~5 minutes.
Learning rate schedule
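A common choice for Transformer training, and the shape of schedule used here, is a linear warmup followed by a linear decay. The sketch below is a hypothetical implementation; the class name, default rates, and epoch counts are illustrative assumptions:

```python
import tensorflow as tf
from tensorflow import keras


class WarmupSchedule(keras.optimizers.schedules.LearningRateSchedule):
    """Linear warmup from `init_lr` to `lr_after_warmup`, then linear decay
    towards `final_lr` (never dropping below it)."""

    def __init__(self, init_lr=1e-5, lr_after_warmup=1e-3, final_lr=1e-4,
                 warmup_epochs=15, decay_epochs=85, steps_per_epoch=203):
        super().__init__()
        self.init_lr = init_lr
        self.lr_after_warmup = lr_after_warmup
        self.final_lr = final_lr
        self.warmup_epochs = warmup_epochs
        self.decay_epochs = decay_epochs
        self.steps_per_epoch = steps_per_epoch

    def __call__(self, step):
        # Fractional epoch index for the current optimizer step.
        epoch = tf.cast(step, tf.float32) / self.steps_per_epoch
        warmup_lr = self.init_lr + (
            (self.lr_after_warmup - self.init_lr) / (self.warmup_epochs - 1)
        ) * epoch
        decay_lr = tf.math.maximum(
            self.final_lr,
            self.lr_after_warmup
            - (epoch - self.warmup_epochs)
            * (self.lr_after_warmup - self.final_lr)
            / self.decay_epochs,
        )
        # Before warmup ends, warmup_lr is the smaller of the two; after,
        # decay_lr takes over.
        return tf.math.minimum(warmup_lr, decay_lr)
```

Such a schedule object can be passed directly as the `learning_rate` argument of a Keras optimizer.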
Create & train the end-to-end model
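A compile-and-fit sketch, assuming the Transformer model and `tf.data` pipelines built earlier. To keep this snippet self-contained and runnable it uses a one-layer stand-in model and random data; the loss choice (label-smoothed categorical cross-entropy over one-hot character targets) is the substantive part:

```python
import tensorflow as tf
from tensorflow import keras

# Stand-in for the seq2seq Transformer; 34 plays the role of the vocabulary size.
model = keras.Sequential([keras.layers.Dense(34)])

# Label smoothing softens the one-hot targets, which tends to help
# autoregressive character models generalize.
loss_fn = keras.losses.CategoricalCrossentropy(from_logits=True, label_smoothing=0.1)
model.compile(optimizer=keras.optimizers.Adam(1e-4), loss=loss_fn)

x = tf.random.normal((8, 16))  # stand-in features
y = tf.one_hot(tf.random.uniform((8,), maxval=34, dtype=tf.int32), 34)
history = model.fit(x, y, epochs=1, verbose=0)
```

In the real example, a callback also greedily decodes one validation batch after each epoch and prints target/prediction pairs, which is where the logs below come from.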
203/203 ━━━━━━━━━━━━━━━━━━━━ 0s 947ms/step - loss: 1.8285
target: <the relations between lee and marina oswald are of great importance in any attempt to understand oswald's possible motivation.>
prediction: <the the he at the t the an of t te the ale t he t te ar the in the the s the s tan as t the t as re the te the ast he and t the s s the thee thed the the thes the s te te he t the of in anae o the or
target: <he was in consequence put out of the protection of their internal law, end quote. their code was a subject of some curiosity.>
prediction: <the the he at the t the an of t te the ale t he t te ar the in the the s the s tan as t the t as re the te the ast he and t the s s the thee thed the the thes the s te te he t the of in anae o the or
target:
prediction: <the the he at the t the an of t te the ale t he t te ar the in the the s the s tan ase athe t as re the te the ast he and t the s s the thee thed the the thes the s te te he t the of in anse o the or
target: <it probably contributed greatly to the general dissatisfaction which he exhibited with his environment,>
prediction: <the the he at the t the an of t te the ale t he t te ar the in the the s the s tan as t the t as re the te the ast he and t the s s the thee thed the the thes the s te te he t the of in anae o the or
203/203 ━━━━━━━━━━━━━━━━━━━━ 428s 1s/step - loss: 1.8276 - val_loss: 1.5233
target: <as they sat in the car, frazier asked oswald where his lunch was>
prediction:
target: <under the entry for may one, nineteen sixty,>
prediction: <under the introus for may monee, nin the sixty,>