"""
Title: English speaker accent recognition using Transfer Learning
Author: [Fadi Badine](https://twitter.com/fadibadine)
Date created: 2022/04/16
Last modified: 2022/04/16
Description: Training a model to classify UK & Ireland accents using feature extraction from Yamnet.
Accelerator: GPU
"""

"""
## Introduction

The following example shows how to use feature extraction in order to
train a model to classify the English accent spoken in an audio wave.

Instead of training a model from scratch, transfer learning enables us to
take advantage of existing state-of-the-art deep learning models and use them as feature extractors.

Our process:

* Use a TF Hub pre-trained model (Yamnet) and apply it as part of the tf.data pipeline which transforms
the audio files into feature vectors.
* Train a dense model on the feature vectors.
* Use the trained model for inference on a new audio file.

Note:

* We need to install TensorFlow IO in order to resample the audio files to 16 kHz, as required by the Yamnet model.
* In the test section, ffmpeg is used to convert the mp3 file to wav.

You can install TensorFlow IO with the following command:
"""

"""shell
pip install -U -q tensorflow_io
"""

"""
## Configuration
"""

SEED = 1337
EPOCHS = 100
BATCH_SIZE = 64
VALIDATION_RATIO = 0.1
MODEL_NAME = "uk_irish_accent_recognition"

# Location where the dataset will be downloaded.
# By default (None), keras.utils.get_file will use ~/.keras/ as the CACHE_DIR
CACHE_DIR = None

# The location of the dataset
URL_PATH = "https://www.openslr.org/resources/83/"

# List of compressed dataset files that contain the audio files
zip_files = {
    0: "irish_english_male.zip",
    1: "midlands_english_female.zip",
    2: "midlands_english_male.zip",
    3: "northern_english_female.zip",
    4: "northern_english_male.zip",
    5: "scottish_english_female.zip",
    6: "scottish_english_male.zip",
    7: "southern_english_female.zip",
    8: "southern_english_male.zip",
    9: "welsh_english_female.zip",
    10: "welsh_english_male.zip",
}

# We see that there are 2 compressed files for each accent (except Irish):
# - One for male speakers
# - One for female speakers
# However, we will be using a gender-agnostic dataset.

# List of gender-agnostic categories
gender_agnostic_categories = [
    "ir",  # Irish
    "mi",  # Midlands
    "no",  # Northern
    "sc",  # Scottish
    "so",  # Southern
    "we",  # Welsh
]

class_names = [
    "Irish",
    "Midlands",
    "Northern",
    "Scottish",
    "Southern",
    "Welsh",
    "Not a speech",
]

"""
## Imports
"""

import os
import io
import csv
import numpy as np
import pandas as pd
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_io as tfio
from tensorflow import keras
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from IPython.display import Audio

# Set all random seeds in order to get reproducible results
keras.utils.set_random_seed(SEED)

# Where to download the dataset
DATASET_DESTINATION = os.path.join(CACHE_DIR if CACHE_DIR else "~/.keras/", "datasets")

"""
## Yamnet Model

Yamnet is an audio event classifier trained on the AudioSet dataset to predict audio
events from the AudioSet ontology. It is available on TensorFlow Hub.

Yamnet accepts a 1-D tensor of audio samples with a sample rate of 16 kHz.
As output, the model returns a 3-tuple:

* Scores of shape `(N, 521)` representing the scores of the 521 classes.
* Embeddings of shape `(N, 1024)`.
* The log-mel spectrogram of the entire audio signal.

We will use the embeddings, which are the features extracted from the audio samples, as the input to our dense model.

For more detailed information about Yamnet, please refer to its [TensorFlow Hub](https://tfhub.dev/google/yamnet/1) page.
"""

yamnet_model = hub.load("https://tfhub.dev/google/yamnet/1")

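"""
As a quick, optional sanity check, we can run Yamnet on one second of silence and
inspect the shapes of its outputs (a minimal sketch, not required for training):
"""

# One second of silence sampled at 16 kHz.
dummy_wav = tf.zeros(16000, dtype=tf.float32)
dummy_scores, dummy_embeddings, _ = yamnet_model(dummy_wav)
print("Scores shape:", dummy_scores.shape)  # (N, 521)
print("Embeddings shape:", dummy_embeddings.shape)  # (N, 1024)
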
"""
## Dataset

The dataset used is the
[Crowdsourced high-quality UK and Ireland English Dialect speech data set](https://openslr.org/83/)
which consists of a total of 17,877 high-quality audio wav files.

This dataset includes over 31 hours of recordings from 120 volunteers who self-identify as
native speakers of Southern England, Midlands, Northern England, Wales, Scotland and Ireland.

For more info, please refer to the above link or to the following paper:
[Open-source Multi-speaker Corpora of the English Accents in the British Isles](https://aclanthology.org/2020.lrec-1.804.pdf)
"""

"""
## Download the data
"""

# CSV file that contains information about the dataset. For each entry, we have:
# - ID
# - wav file name
# - transcript
line_index_file = keras.utils.get_file(
    fname="line_index_file", origin=URL_PATH + "line_index_all.csv"
)

# Download the compressed files containing the audio wav files
for i in zip_files:
    fname = zip_files[i].split(".")[0]
    url = URL_PATH + zip_files[i]

    zip_file = keras.utils.get_file(fname=fname, origin=url, extract=True)
    os.remove(zip_file)

"""
## Load the data in a Dataframe

Of the 3 columns (ID, filename and transcript), we are only interested in the filename column in order to read the audio file.
We will ignore the other two.
"""

dataframe = pd.read_csv(
    line_index_file, names=["id", "filename", "transcript"], usecols=["filename"]
)
dataframe.head()

"""
Let's now preprocess the dataset by:

* Adjusting the filename (removing the leading space & adding the ".wav" extension to the
filename).
* Creating a label using the first 2 characters of the filename which indicate the
accent.
* Shuffling the samples.
"""


# The purpose of this function is to preprocess the dataframe by applying the following:
# - Removing the leading space from the filename
# - Generating a gender-agnostic label column, i.e.
#   welsh english male and welsh english female, for example, are both labeled as
#   welsh english
# - Adding the .wav extension to the filename
# - Shuffling the samples
def preprocess_dataframe(dataframe):
    # Remove the leading space in the filename column
    dataframe["filename"] = dataframe.apply(lambda row: row["filename"].strip(), axis=1)

    # Create gender-agnostic labels based on the filename's first 2 letters
    dataframe["label"] = dataframe.apply(
        lambda row: gender_agnostic_categories.index(row["filename"][:2]), axis=1
    )

    # Add the file path to the name
    dataframe["filename"] = dataframe.apply(
        lambda row: os.path.join(DATASET_DESTINATION, row["filename"] + ".wav"), axis=1
    )

    # Shuffle the samples
    dataframe = dataframe.sample(frac=1, random_state=SEED).reset_index(drop=True)

    return dataframe


dataframe = preprocess_dataframe(dataframe)
dataframe.head()

"""
## Prepare training & validation sets

Let's split the samples into training and validation sets.
"""

split = int(len(dataframe) * (1 - VALIDATION_RATIO))
train_df = dataframe[:split]
valid_df = dataframe[split:]

print(
    f"We have {train_df.shape[0]} training samples & {valid_df.shape[0]} validation ones"
)

"""
## Prepare a TensorFlow Dataset

Next, we need to create a `tf.data.Dataset`.
This is done by creating a `dataframe_to_dataset` function that does the following:

* Create a dataset using filenames and labels.
* Get the Yamnet embeddings by calling another function, `filepath_to_embeddings`.
* Apply caching, batching and prefetching.

The `filepath_to_embeddings` function does the following:

* Load the audio file.
* Resample the audio to 16 kHz.
* Generate scores and embeddings from the Yamnet model.
* Since Yamnet generates multiple frames for each audio file,
this function duplicates the label for every generated frame whose top-scoring
class is speech (`argmax(scores) == 0`), while the remaining frames are labeled
as 'Not a speech', indicating that the segment is not speech and won't be labeled as one of the accents.

The `load_16k_audio_wav` function below is copied from the following tutorial:
[Transfer learning with YAMNet for environmental sound classification](https://www.tensorflow.org/tutorials/audio/transfer_learning_audio)
"""


@tf.function
def load_16k_audio_wav(filename):
    # Read the file content
    file_content = tf.io.read_file(filename)

    # Decode the audio wave
    audio_wav, sample_rate = tf.audio.decode_wav(file_content, desired_channels=1)
    audio_wav = tf.squeeze(audio_wav, axis=-1)
    sample_rate = tf.cast(sample_rate, dtype=tf.int64)

    # Resample to 16k
    audio_wav = tfio.audio.resample(audio_wav, rate_in=sample_rate, rate_out=16000)

    return audio_wav


def filepath_to_embeddings(filename, label):
    # Load the 16k audio wave
    audio_wav = load_16k_audio_wav(filename)

    # Get audio embeddings & scores.
    # The embeddings are the audio features extracted using transfer learning,
    # while the scores will be used to identify time slots that are not speech,
    # which will then be gathered into a specific category, 'Not a speech'
    scores, embeddings, _ = yamnet_model(audio_wav)

    # Number of embeddings, in order to know how many times to repeat the label
    embeddings_num = tf.shape(embeddings)[0]
    labels = tf.repeat(label, embeddings_num)

    # Change labels for time slots that are not speech into the 'Not a speech' category
    labels = tf.where(tf.argmax(scores, axis=1) == 0, labels, len(class_names) - 1)

    # Using one-hot in order to use AUC
    return (embeddings, tf.one_hot(labels, len(class_names)))


def dataframe_to_dataset(dataframe, batch_size=BATCH_SIZE):
    dataset = tf.data.Dataset.from_tensor_slices(
        (dataframe["filename"], dataframe["label"])
    )

    dataset = dataset.map(
        lambda x, y: filepath_to_embeddings(x, y),
        num_parallel_calls=tf.data.AUTOTUNE,
    ).unbatch()

    return dataset.cache().batch(batch_size).prefetch(tf.data.AUTOTUNE)


train_ds = dataframe_to_dataset(train_df)
valid_ds = dataframe_to_dataset(valid_df)

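"""
Before building the model, we can optionally peek at a single batch to verify the
pipeline's output shapes, for example:

```python
for batch_embeddings, batch_labels in train_ds.take(1):
    print(batch_embeddings.shape)  # (BATCH_SIZE, 1024)
    print(batch_labels.shape)  # (BATCH_SIZE, len(class_names))
```

(This is illustrative only; note that `take(1)` will only partially fill the dataset cache.)
"""
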
"""
## Build the model

The model that we use consists of:

* An input layer which is the embedding output of the Yamnet classifier.
* 4 dense hidden layers and 4 dropout layers.
* An output dense layer.

The model's hyperparameters were selected using
[KerasTuner](https://keras.io/keras_tuner/).
"""

keras.backend.clear_session()


def build_and_compile_model():
    inputs = keras.layers.Input(shape=(1024,), name="embedding")

    x = keras.layers.Dense(256, activation="relu", name="dense_1")(inputs)
    x = keras.layers.Dropout(0.15, name="dropout_1")(x)

    x = keras.layers.Dense(384, activation="relu", name="dense_2")(x)
    x = keras.layers.Dropout(0.2, name="dropout_2")(x)

    x = keras.layers.Dense(192, activation="relu", name="dense_3")(x)
    x = keras.layers.Dropout(0.25, name="dropout_3")(x)

    x = keras.layers.Dense(384, activation="relu", name="dense_4")(x)
    x = keras.layers.Dropout(0.2, name="dropout_4")(x)

    outputs = keras.layers.Dense(len(class_names), activation="softmax", name="output")(
        x
    )

    model = keras.Model(inputs=inputs, outputs=outputs, name="accent_recognition")

    model.compile(
        optimizer=keras.optimizers.Adam(learning_rate=1.9644e-5),
        loss=keras.losses.CategoricalCrossentropy(),
        metrics=["accuracy", keras.metrics.AUC(name="auc")],
    )

    return model


model = build_and_compile_model()
model.summary()

"""
## Class weights calculation

Since the dataset is quite imbalanced, we will use the `class_weight` argument during training.

Getting the class weights is a little tricky because even though we know the number of
audio files for each class, it does not represent the number of samples for that class,
since Yamnet transforms each audio file into multiple audio samples of 0.96 seconds each.
So every audio file will be split into a number of samples that is proportional to its length.

Therefore, to get those weights, we have to calculate the number of samples for each class
after preprocessing through Yamnet.
"""

class_counts = tf.zeros(shape=(len(class_names),), dtype=tf.int32)

for x, y in iter(train_ds):
    class_counts = class_counts + tf.math.bincount(
        tf.cast(tf.math.argmax(y, axis=1), tf.int32), minlength=len(class_names)
    )

class_weight = {
    i: tf.math.reduce_sum(class_counts).numpy() / class_counts[i].numpy()
    for i in range(len(class_counts))
}

print(class_weight)

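"""
The weights above are computed as `total_count / class_count`. A common alternative
(shown here only as an illustrative variant, equivalent up to a constant factor in the loss)
also normalizes by the number of classes, similar to scikit-learn's "balanced" heuristic:
"""

total_count = tf.math.reduce_sum(class_counts).numpy()
balanced_class_weight = {
    i: total_count / (len(class_counts) * class_counts[i].numpy())
    for i in range(len(class_counts))
}
print(balanced_class_weight)
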
"""
## Callbacks

We use Keras callbacks in order to:

* Stop whenever the validation AUC stops improving.
* Save the best model.
* Call TensorBoard in order to later view the training and validation logs.
"""

# `mode="max"` is set explicitly because the monitored metric (validation AUC)
# must be maximized, not minimized.
early_stopping_cb = keras.callbacks.EarlyStopping(
    monitor="val_auc", mode="max", patience=10, restore_best_weights=True
)

model_checkpoint_cb = keras.callbacks.ModelCheckpoint(
    MODEL_NAME + ".h5", monitor="val_auc", mode="max", save_best_only=True
)

tensorboard_cb = keras.callbacks.TensorBoard(
    os.path.join(os.curdir, "logs", model.name)
)

callbacks = [early_stopping_cb, model_checkpoint_cb, tensorboard_cb]

"""
## Training
"""

history = model.fit(
    train_ds,
    epochs=EPOCHS,
    validation_data=valid_ds,
    class_weight=class_weight,
    callbacks=callbacks,
    verbose=2,
)

"""
## Results

Let's plot the training and validation AUC and accuracy.
"""

fig, axs = plt.subplots(nrows=1, ncols=2, figsize=(14, 5))

# Use `history.epoch` rather than `range(EPOCHS)` since early stopping may end
# training before EPOCHS epochs have run.
axs[0].plot(history.epoch, history.history["accuracy"], label="Training")
axs[0].plot(history.epoch, history.history["val_accuracy"], label="Validation")
axs[0].set_xlabel("Epochs")
axs[0].set_title("Training & Validation Accuracy")
axs[0].legend()
axs[0].grid(True)

axs[1].plot(history.epoch, history.history["auc"], label="Training")
axs[1].plot(history.epoch, history.history["val_auc"], label="Validation")
axs[1].set_xlabel("Epochs")
axs[1].set_title("Training & Validation AUC")
axs[1].legend()
axs[1].grid(True)

plt.show()

"""
## Evaluation
"""

train_loss, train_acc, train_auc = model.evaluate(train_ds)
valid_loss, valid_acc, valid_auc = model.evaluate(valid_ds)

"""
Let's try to compare our model's performance to Yamnet's using one of Yamnet's metrics (d-prime).
Yamnet achieved a d-prime value of 2.318.
Let's check our model's performance.
"""


# The following function calculates the d-prime score from the AUC
def d_prime(auc):
    standard_normal = stats.norm()
    d_prime = standard_normal.ppf(auc) * np.sqrt(2.0)
    return d_prime


print(
    "train d-prime: {0:.3f}, validation d-prime: {1:.3f}".format(
        d_prime(train_auc), d_prime(valid_auc)
    )
)

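"""
As a quick check of the formula (`d_prime = sqrt(2) * ppf(AUC)`): a random classifier
with AUC = 0.5 gives a d-prime of 0, while AUC = 0.9 gives roughly 1.81.
"""

print("d-prime for AUC=0.5: {0:.3f}".format(d_prime(0.5)))
print("d-prime for AUC=0.9: {0:.3f}".format(d_prime(0.9)))
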
"""
We can see that the model achieves the following results:

Results    | Training  | Validation
-----------|-----------|------------
Accuracy   | 54%       | 51%
AUC        | 0.91      | 0.89
d-prime    | 1.882     | 1.740

"""

"""
## Confusion Matrix

Let's now plot the confusion matrix for the validation dataset.

The confusion matrix lets us see, for every class, not only how many samples were correctly classified,
but also which other classes the samples were confused with.

It allows us to calculate the precision and recall for every class.
"""

# Create x and y tensors
x_valid = None
y_valid = None

for x, y in iter(valid_ds):
    if x_valid is None:
        x_valid = x.numpy()
        y_valid = y.numpy()
    else:
        x_valid = np.concatenate((x_valid, x.numpy()), axis=0)
        y_valid = np.concatenate((y_valid, y.numpy()), axis=0)

# Generate predictions
y_pred = model.predict(x_valid)

# Calculate the confusion matrix
confusion_mtx = tf.math.confusion_matrix(
    np.argmax(y_valid, axis=1), np.argmax(y_pred, axis=1)
)

# Plot the confusion matrix
plt.figure(figsize=(10, 8))
sns.heatmap(
    confusion_mtx, xticklabels=class_names, yticklabels=class_names, annot=True, fmt="g"
)
plt.xlabel("Prediction")
plt.ylabel("Label")
plt.title("Validation Confusion Matrix")
plt.show()

"""
## Precision & recall

For every class:

* Recall is the fraction of samples of this class that the model is able to detect.
It is the ratio of the diagonal element to the sum of all elements in its row.
* Precision is the fraction of samples predicted as this class that actually belong to it,
i.e. it measures how trustworthy the classifier's predictions for this class are.
It is the ratio of the diagonal element to the sum of all elements in its column.
"""

for i, label in enumerate(class_names):
    precision = confusion_mtx[i, i] / np.sum(confusion_mtx[:, i])
    recall = confusion_mtx[i, i] / np.sum(confusion_mtx[i, :])
    print(
        "{0:15} Precision:{1:.2f}%; Recall:{2:.2f}%".format(
            label, precision * 100, recall * 100
        )
    )

"""
## Run inference on test data

Let's now run a test on a single audio file.
Let's check this example from [The Scottish Voice](https://www.thescottishvoice.org.uk/home/)

We will:

* Download the mp3 file.
* Convert it to a 16k wav file.
* Run the model on the wav file.
* Plot the results.
"""

filename = "audio-sample-Stuart"
url = "https://www.thescottishvoice.org.uk/files/cm/files/"

if not os.path.exists(filename + ".wav"):
    print(f"Downloading {filename}.mp3 from {url}")
    command = f"wget {url}{filename}.mp3"
    os.system(command)

    print("Converting mp3 to wav and resampling to 16 kHz")
    command = (
        f"ffmpeg -hide_banner -loglevel panic -y -i {filename}.mp3 -acodec "
        f"pcm_s16le -ac 1 -ar 16000 {filename}.wav"
    )
    os.system(command)

filename = filename + ".wav"

"""
590
The below function `yamnet_class_names_from_csv` was copied and very slightly changed
591
from this [Yamnet Notebook](https://colab.research.google.com/github/tensorflow/hub/blob/master/examples/colab/yamnet.ipynb).
592
"""
593
594
595
def yamnet_class_names_from_csv(yamnet_class_map_csv_text):
596
"""Returns list of class names corresponding to score vector."""
597
yamnet_class_map_csv = io.StringIO(yamnet_class_map_csv_text)
598
yamnet_class_names = [
599
name for (class_index, mid, name) in csv.reader(yamnet_class_map_csv)
600
]
601
yamnet_class_names = yamnet_class_names[1:] # Skip CSV header
602
return yamnet_class_names
603
604
605
yamnet_class_map_path = yamnet_model.class_map_path().numpy()
606
yamnet_class_names = yamnet_class_names_from_csv(
607
tf.io.read_file(yamnet_class_map_path).numpy().decode("utf-8")
608
)
609
610
611
def calculate_number_of_non_speech(scores):
612
number_of_non_speech = tf.math.reduce_sum(
613
tf.where(tf.math.argmax(scores, axis=1, output_type=tf.int32) != 0, 1, 0)
614
)
615
616
return number_of_non_speech
617
618
619
def filename_to_predictions(filename):
620
# Load 16k audio wave
621
audio_wav = load_16k_audio_wav(filename)
622
623
# Get audio embeddings & scores.
624
scores, embeddings, mel_spectrogram = yamnet_model(audio_wav)
625
626
print(
627
"Out of {} samples, {} are not speech".format(
628
scores.shape[0], calculate_number_of_non_speech(scores)
629
)
630
)
631
632
# Predict the output of the accent recognition model with embeddings as input
633
predictions = model.predict(embeddings)
634
635
return audio_wav, predictions, mel_spectrogram
636
637
638
"""
639
Let's run the model on the audio file:
640
"""
641
642
audio_wav, predictions, mel_spectrogram = filename_to_predictions(filename)
643
644
infered_class = class_names[predictions.mean(axis=0).argmax()]
645
print(f"The main accent is: {infered_class} English")
646
647
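"""
Since Yamnet flags some frames as non-speech, we could also, as a minimal optional
variant, average only over the accent classes (excluding the "Not a speech" column)
before picking the top prediction:
"""

accent_only_predictions = predictions[:, : len(class_names) - 1]
print(
    "Top accent (ignoring the 'Not a speech' column): "
    f"{class_names[accent_only_predictions.mean(axis=0).argmax()]} English"
)
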
"""
648
Listen to the audio
649
"""
650
651
Audio(audio_wav, rate=16000)
652
653
"""
654
The below function was copied from this [Yamnet notebook](tinyurl.com/4a8xn7at) and adjusted to our need.
655
656
This function plots the following:
657
658
* Audio waveform
659
* Mel spectrogram
660
* Predictions for every time step
661
"""
662
663
plt.figure(figsize=(10, 6))
664
665
# Plot the waveform.
666
plt.subplot(3, 1, 1)
667
plt.plot(audio_wav)
668
plt.xlim([0, len(audio_wav)])
669
670
# Plot the log-mel spectrogram (returned by the model).
671
plt.subplot(3, 1, 2)
672
plt.imshow(
673
mel_spectrogram.numpy().T, aspect="auto", interpolation="nearest", origin="lower"
674
)
675
676
# Plot and label the model output scores for the top-scoring classes.
677
mean_predictions = np.mean(predictions, axis=0)
678
679
top_class_indices = np.argsort(mean_predictions)[::-1]
680
plt.subplot(3, 1, 3)
681
plt.imshow(
682
predictions[:, top_class_indices].T,
683
aspect="auto",
684
interpolation="nearest",
685
cmap="gray_r",
686
)
687
688
# patch_padding = (PATCH_WINDOW_SECONDS / 2) / PATCH_HOP_SECONDS
689
# values from the model documentation
690
patch_padding = (0.025 / 2) / 0.01
691
plt.xlim([-patch_padding - 0.5, predictions.shape[0] + patch_padding - 0.5])
692
# Label the top_N classes.
693
yticks = range(0, len(class_names), 1)
694
plt.yticks(yticks, [class_names[top_class_indices[x]] for x in yticks])
695
_ = plt.ylim(-0.5 + np.array([len(class_names), 0]))
696
697