CoCalc -- uk_ireland_accent

GitHub Repository: keras-team/keras-io
Path: blob/master/examples/audio/uk_ireland_accent_recognition.py
⁸³³⁸ views
1
"""
2
Title: English speaker accent recognition using Transfer Learning
3
Author: [Fadi Badine](https://twitter.com/fadibadine)
4
Date created: 2022/04/16
5
Last modified: 2022/04/16
6
Description: Training a model to classify UK & Ireland accents using feature extraction from Yamnet.
7
Accelerator: GPU
8
"""
9

10
"""
11
## Introduction
12

13
The following example shows how to use feature extraction in order to
14
train a model to classify the English accent spoken in an audio wave.
15

16
Instead of training a model from scratch, transfer learning enables us to
17
take advantage of existing state-of-the-art deep learning models and use them as feature extractors.
18

19
Our process:
20

21
* Use a TF Hub pre-trained model (Yamnet) and apply it as part of the tf.data pipeline which transforms
22
the audio files into feature vectors.
23
* Train a dense model on the feature vectors.
24
* Use the trained model for inference on a new audio file.
25

26
Note:
27

28
* We need to install TensorFlow IO in order to resample audio files to 16 kHz as required by Yamnet model.
29
* In the test section, ffmpeg is used to convert the mp3 file to wav.
30

31
You can install TensorFlow IO with the following command:
32
"""
33

34
"""shell
35
pip install -U -q tensorflow_io
36
"""
37

38
"""
39
## Configuration
40
"""
41

42
SEED = 1337
43
EPOCHS = 100
44
BATCH_SIZE = 64
45
VALIDATION_RATIO = 0.1
46
MODEL_NAME = "uk_irish_accent_recognition"
47

48
# Location where the dataset will be downloaded.
49
# By default (None), keras.utils.get_file will use ~/.keras/ as the CACHE_DIR
50
CACHE_DIR = None
51

52
# The location of the dataset
53
URL_PATH = "https://www.openslr.org/resources/83/"
54

55
# List of datasets compressed files that contain the audio files
56
zip_files = {
57
    0: "irish_english_male.zip",
58
    1: "midlands_english_female.zip",
59
    2: "midlands_english_male.zip",
60
    3: "northern_english_female.zip",
61
    4: "northern_english_male.zip",
62
    5: "scottish_english_female.zip",
63
    6: "scottish_english_male.zip",
64
    7: "southern_english_female.zip",
65
    8: "southern_english_male.zip",
66
    9: "welsh_english_female.zip",
67
    10: "welsh_english_male.zip",
68
}
69

70
# We see that there are 2 compressed files for each accent (except Irish):
71
# - One for male speakers
72
# - One for female speakers
73
# However, we will be using a gender agnostic dataset.
74

75
# List of gender agnostic categories
76
gender_agnostic_categories = [
77
    "ir",  # Irish
78
    "mi",  # Midlands
79
    "no",  # Northern
80
    "sc",  # Scottish
81
    "so",  # Southern
82
    "we",  # Welsh
83
]
84

85
class_names = [
86
    "Irish",
87
    "Midlands",
88
    "Northern",
89
    "Scottish",
90
    "Southern",
91
    "Welsh",
92
    "Not a speech",
93
]
94

95
"""
96
## Imports
97
"""
98

99
import os
100
import io
101
import csv
102
import numpy as np
103
import pandas as pd
104
import tensorflow as tf
105
import tensorflow_hub as hub
106
import tensorflow_io as tfio
107
from tensorflow import keras
108
import matplotlib.pyplot as plt
109
import seaborn as sns
110
from scipy import stats
111
from IPython.display import Audio
112

113
# Set all random seeds in order to get reproducible results
114
keras.utils.set_random_seed(SEED)
115

116
# Where to download the dataset
117
DATASET_DESTINATION = os.path.join(CACHE_DIR if CACHE_DIR else "~/.keras/", "datasets")
118

119
"""
120
## Yamnet Model
121

122
Yamnet is an audio event classifier trained on the AudioSet dataset to predict audio
123
events from the AudioSet ontology. It is available on TensorFlow Hub.
124

125
Yamnet accepts a 1-D tensor of audio samples with a sample rate of 16 kHz.
126
As output, the model returns a 3-tuple:
127

128
* Scores of shape `(N, 521)` representing the scores of the 521 classes.
129
* Embeddings of shape `(N, 1024)`.
130
* The log-mel spectrogram of the entire audio frame.
131

132
We will use the embeddings, which are the features extracted from the audio samples, as the input to our dense model.
133

134
For more detailed information about Yamnet, please refer to its [TensorFlow Hub](https://tfhub.dev/google/yamnet/1) page.
135
"""
136

137
yamnet_model = hub.load("https://tfhub.dev/google/yamnet/1")
138

139
"""
140
## Dataset
141

142
The dataset used is the
143
[Crowdsourced high-quality UK and Ireland English Dialect speech data set](https://openslr.org/83/)
144
which consists of a total of 17,877 high-quality audio wav files.
145

146
This dataset includes over 31 hours of recording from 120 volunteers who self-identify as
147
native speakers of Southern England, Midlands, Northern England, Wales, Scotland and Ireland.
148

149
For more info, please refer to the above link or to the following paper:
150
[Open-source Multi-speaker Corpora of the English Accents in the British Isles](https://aclanthology.org/2020.lrec-1.804.pdf)
151
"""
152

153
"""
154
## Download the data
155
"""
156

157
# CSV file that contains information about the dataset. For each entry, we have:
158
# - ID
159
# - wav file name
160
# - transcript
161
line_index_file = keras.utils.get_file(
162
    fname="line_index_file", origin=URL_PATH + "line_index_all.csv"
163
)
164

165
# Download the list of compressed files that contain the audio wav files
166
for i in zip_files:
167
    fname = zip_files[i].split(".")[0]
168
    url = URL_PATH + zip_files[i]
169

170
    zip_file = keras.utils.get_file(fname=fname, origin=url, extract=True)
171
    os.remove(zip_file)
172

173
"""
174
## Load the data in a Dataframe
175

176
Of the 3 columns (ID, filename and transcript), we are only interested in the filename column in order to read the audio file.
177
We will ignore the other two.
178
"""
179

180
dataframe = pd.read_csv(
181
    line_index_file, names=["id", "filename", "transcript"], usecols=["filename"]
182
)
183
dataframe.head()
184

185
"""
186
Let's now preprocess the dataset by:
187

188
* Adjusting the filename (removing a leading space & adding ".wav" extension to the
189
filename).
190
* Creating a label using the first 2 characters of the filename which indicate the
191
accent.
192
* Shuffling the samples.
193
"""
194

195

196
# The purpose of this function is to preprocess the dataframe by applying the following:
197
# - Cleaning the filename from a leading space
198
# - Generating a label column that is gender agnostic i.e.
199
#   welsh english male and welsh english female for example are both labeled as
200
#   welsh english
201
# - Add extension .wav to the filename
202
# - Shuffle samples
203
def preprocess_dataframe(dataframe):
204
    # Remove leading space in filename column
205
    dataframe["filename"] = dataframe.apply(lambda row: row["filename"].strip(), axis=1)
206

207
    # Create gender agnostic labels based on the filename first 2 letters
208
    dataframe["label"] = dataframe.apply(
209
        lambda row: gender_agnostic_categories.index(row["filename"][:2]), axis=1
210
    )
211

212
    # Add the file path to the name
213
    dataframe["filename"] = dataframe.apply(
214
        lambda row: os.path.join(DATASET_DESTINATION, row["filename"] + ".wav"), axis=1
215
    )
216

217
    # Shuffle the samples
218
    dataframe = dataframe.sample(frac=1, random_state=SEED).reset_index(drop=True)
219

220
    return dataframe
221

222

223
dataframe = preprocess_dataframe(dataframe)
224
dataframe.head()
225

226
"""
227
## Prepare training & validation sets
228

229
Let's split the samples creating training and validation sets.
230
"""
231

232
split = int(len(dataframe) * (1 - VALIDATION_RATIO))
233
train_df = dataframe[:split]
234
valid_df = dataframe[split:]
235

236
print(
237
    f"We have {train_df.shape[0]} training samples & {valid_df.shape[0]} validation ones"
238
)
239

240
"""
241
## Prepare a TensorFlow Dataset
242

243
Next, we need to create a `tf.data.Dataset`.
244
This is done by creating a `dataframe_to_dataset` function that does the following:
245

246
* Create a dataset using filenames and labels.
247
* Get the Yamnet embeddings by calling another function `filepath_to_embeddings`.
248
* Apply caching, reshuffling and setting batch size.
249

250
The `filepath_to_embeddings` does the following:
251

252
* Load audio file.
253
* Resample audio to 16 kHz.
254
* Generate scores and embeddings from Yamnet model.
255
* Since Yamnet generates multiple samples for each audio file,
256
this function also duplicates the label for all the generated samples
257
that have `score=0` (speech) whereas sets the label for the others as
258
'other' indicating that this audio segment is not a speech and we won't label it as one of the accents.
259

260
The below `load_16k_audio_file` is copied from the following tutorial
261
[Transfer learning with YAMNet for environmental sound classification](https://www.tensorflow.org/tutorials/audio/transfer_learning_audio)
262
"""
263

264

265
@tf.function
266
def load_16k_audio_wav(filename):
267
    # Read file content
268
    file_content = tf.io.read_file(filename)
269

270
    # Decode audio wave
271
    audio_wav, sample_rate = tf.audio.decode_wav(file_content, desired_channels=1)
272
    audio_wav = tf.squeeze(audio_wav, axis=-1)
273
    sample_rate = tf.cast(sample_rate, dtype=tf.int64)
274

275
    # Resample to 16k
276
    audio_wav = tfio.audio.resample(audio_wav, rate_in=sample_rate, rate_out=16000)
277

278
    return audio_wav
279

280

281
def filepath_to_embeddings(filename, label):
282
    # Load 16k audio wave
283
    audio_wav = load_16k_audio_wav(filename)
284

285
    # Get audio embeddings & scores.
286
    # The embeddings are the audio features extracted using transfer learning
287
    # while scores will be used to identify time slots that are not speech
288
    # which will then be gathered into a specific new category 'other'
289
    scores, embeddings, _ = yamnet_model(audio_wav)
290

291
    # Number of embeddings in order to know how many times to repeat the label
292
    embeddings_num = tf.shape(embeddings)[0]
293
    labels = tf.repeat(label, embeddings_num)
294

295
    # Change labels for time-slots that are not speech into a new category 'other'
296
    labels = tf.where(tf.argmax(scores, axis=1) == 0, label, len(class_names) - 1)
297

298
    # Using one-hot in order to use AUC
299
    return (embeddings, tf.one_hot(labels, len(class_names)))
300

301

302
def dataframe_to_dataset(dataframe, batch_size=64):
303
    dataset = tf.data.Dataset.from_tensor_slices(
304
        (dataframe["filename"], dataframe["label"])
305
    )
306

307
    dataset = dataset.map(
308
        lambda x, y: filepath_to_embeddings(x, y),
309
        num_parallel_calls=tf.data.experimental.AUTOTUNE,
310
    ).unbatch()
311

312
    return dataset.cache().batch(batch_size).prefetch(tf.data.AUTOTUNE)
313

314

315
train_ds = dataframe_to_dataset(train_df)
316
valid_ds = dataframe_to_dataset(valid_df)
317

318
"""
319
## Build the model
320

321
The model that we use consists of:
322

323
* An input layer which is the embedding output of the Yamnet classifier.
324
* 4 dense hidden layers and 4 dropout layers.
325
* An output dense layer.
326

327
The model's hyperparameters were selected using
328
[KerasTuner](https://keras.io/keras_tuner/).
329
"""
330

331
keras.backend.clear_session()
332

333

334
def build_and_compile_model():
335
    inputs = keras.layers.Input(shape=(1024), name="embedding")
336

337
    x = keras.layers.Dense(256, activation="relu", name="dense_1")(inputs)
338
    x = keras.layers.Dropout(0.15, name="dropout_1")(x)
339

340
    x = keras.layers.Dense(384, activation="relu", name="dense_2")(x)
341
    x = keras.layers.Dropout(0.2, name="dropout_2")(x)
342

343
    x = keras.layers.Dense(192, activation="relu", name="dense_3")(x)
344
    x = keras.layers.Dropout(0.25, name="dropout_3")(x)
345

346
    x = keras.layers.Dense(384, activation="relu", name="dense_4")(x)
347
    x = keras.layers.Dropout(0.2, name="dropout_4")(x)
348

349
    outputs = keras.layers.Dense(len(class_names), activation="softmax", name="ouput")(
350
        x
351
    )
352

353
    model = keras.Model(inputs=inputs, outputs=outputs, name="accent_recognition")
354

355
    model.compile(
356
        optimizer=keras.optimizers.Adam(learning_rate=1.9644e-5),
357
        loss=keras.losses.CategoricalCrossentropy(),
358
        metrics=["accuracy", keras.metrics.AUC(name="auc")],
359
    )
360

361
    return model
362

363

364
model = build_and_compile_model()
365
model.summary()
366

367
"""
368
## Class weights calculation
369

370
Since the dataset is quite unbalanced, we will use `class_weight` argument during training.
371

372
Getting the class weights is a little tricky because even though we know the number of
373
audio files for each class, it does not represent the number of samples for that class
374
since Yamnet transforms each audio file into multiple audio samples of 0.96 seconds each.
375
So every audio file will be split into a number of samples that is proportional to its length.
376

377
Therefore, to get those weights, we have to calculate the number of samples for each class
378
after preprocessing through Yamnet.
379
"""
380

381
class_counts = tf.zeros(shape=(len(class_names),), dtype=tf.int32)
382

383
for x, y in iter(train_ds):
384
    class_counts = class_counts + tf.math.bincount(
385
        tf.cast(tf.math.argmax(y, axis=1), tf.int32), minlength=len(class_names)
386
    )
387

388
class_weight = {
389
    i: tf.math.reduce_sum(class_counts).numpy() / class_counts[i].numpy()
390
    for i in range(len(class_counts))
391
}
392

393
print(class_weight)
394

395
"""
396
## Callbacks
397

398
We use Keras callbacks in order to:
399

400
* Stop whenever the validation AUC stops improving.
401
* Save the best model.
402
* Call TensorBoard in order to later view the training and validation logs.
403
"""
404

405
early_stopping_cb = keras.callbacks.EarlyStopping(
406
    monitor="val_auc", patience=10, restore_best_weights=True
407
)
408

409
model_checkpoint_cb = keras.callbacks.ModelCheckpoint(
410
    MODEL_NAME + ".h5", monitor="val_auc", save_best_only=True
411
)
412

413
tensorboard_cb = keras.callbacks.TensorBoard(
414
    os.path.join(os.curdir, "logs", model.name)
415
)
416

417
callbacks = [early_stopping_cb, model_checkpoint_cb, tensorboard_cb]
418

419
"""
420
## Training
421
"""
422

423
history = model.fit(
424
    train_ds,
425
    epochs=EPOCHS,
426
    validation_data=valid_ds,
427
    class_weight=class_weight,
428
    callbacks=callbacks,
429
    verbose=2,
430
)
431

432
"""
433
## Results
434

435
Let's plot the training and validation AUC and accuracy.
436
"""
437

438
fig, axs = plt.subplots(nrows=1, ncols=2, figsize=(14, 5))
439

440
axs[0].plot(range(EPOCHS), history.history["accuracy"], label="Training")
441
axs[0].plot(range(EPOCHS), history.history["val_accuracy"], label="Validation")
442
axs[0].set_xlabel("Epochs")
443
axs[0].set_title("Training & Validation Accuracy")
444
axs[0].legend()
445
axs[0].grid(True)
446

447
axs[1].plot(range(EPOCHS), history.history["auc"], label="Training")
448
axs[1].plot(range(EPOCHS), history.history["val_auc"], label="Validation")
449
axs[1].set_xlabel("Epochs")
450
axs[1].set_title("Training & Validation AUC")
451
axs[1].legend()
452
axs[1].grid(True)
453

454
plt.show()
455

456
"""
457
## Evaluation
458
"""
459

460
train_loss, train_acc, train_auc = model.evaluate(train_ds)
461
valid_loss, valid_acc, valid_auc = model.evaluate(valid_ds)
462

463
"""
464
Let's try to compare our model's performance to Yamnet's using one of Yamnet metrics (d-prime)
465
Yamnet achieved a d-prime value of 2.318.
466
Let's check our model's performance.
467
"""
468

469

470
# The following function calculates the d-prime score from the AUC
471
def d_prime(auc):
472
    standard_normal = stats.norm()
473
    d_prime = standard_normal.ppf(auc) * np.sqrt(2.0)
474
    return d_prime
475

476

477
print(
478
    "train d-prime: {0:.3f}, validation d-prime: {1:.3f}".format(
479
        d_prime(train_auc), d_prime(valid_auc)
480
    )
481
)
482

483
"""
484
We can see that the model achieves the following results:
485

486
Results    | Training  | Validation
487
-----------|-----------|------------
488
Accuracy   | 54%       | 51%
489
AUC        | 0.91      | 0.89
490
d-prime    | 1.882     | 1.740
491

492
"""
493

494
"""
495
## Confusion Matrix
496

497
Let's now plot the confusion matrix for the validation dataset.
498

499
The confusion matrix lets us see, for every class, not only how many samples were correctly classified,
500
but also which other classes were the samples confused with.
501

502
It allows us to calculate the precision and recall for every class.
503
"""
504

505
# Create x and y tensors
506
x_valid = None
507
y_valid = None
508

509
for x, y in iter(valid_ds):
510
    if x_valid is None:
511
        x_valid = x.numpy()
512
        y_valid = y.numpy()
513
    else:
514
        x_valid = np.concatenate((x_valid, x.numpy()), axis=0)
515
        y_valid = np.concatenate((y_valid, y.numpy()), axis=0)
516

517
# Generate predictions
518
y_pred = model.predict(x_valid)
519

520
# Calculate confusion matrix
521
confusion_mtx = tf.math.confusion_matrix(
522
    np.argmax(y_valid, axis=1), np.argmax(y_pred, axis=1)
523
)
524

525
# Plot the confusion matrix
526
plt.figure(figsize=(10, 8))
527
sns.heatmap(
528
    confusion_mtx, xticklabels=class_names, yticklabels=class_names, annot=True, fmt="g"
529
)
530
plt.xlabel("Prediction")
531
plt.ylabel("Label")
532
plt.title("Validation Confusion Matrix")
533
plt.show()
534

535
"""
536
## Precision & recall
537

538
For every class:
539

540
* Recall is the ratio of correctly classified samples i.e. it shows how many samples
541
of this specific class, the model is able to detect.
542
It is the ratio of diagonal elements to the sum of all elements in the row.
543
* Precision shows the accuracy of the classifier. It is the ratio of correctly predicted
544
samples among the ones classified as belonging to this class.
545
It is the ratio of diagonal elements to the sum of all elements in the column.
546
"""
547

548
for i, label in enumerate(class_names):
549
    precision = confusion_mtx[i, i] / np.sum(confusion_mtx[:, i])
550
    recall = confusion_mtx[i, i] / np.sum(confusion_mtx[i, :])
551
    print(
552
        "{0:15} Precision:{1:.2f}%; Recall:{2:.2f}%".format(
553
            label, precision * 100, recall * 100
554
        )
555
    )
556

557
"""
558
## Run inference on test data
559

560
Let's now run a test on a single audio file.
561
Let's check this example from [The Scottish Voice](https://www.thescottishvoice.org.uk/home/)
562

563
We will:
564

565
* Download the mp3 file.
566
* Convert it to a 16k wav file.
567
* Run the model on the wav file.
568
* Plot the results.
569
"""
570

571
filename = "audio-sample-Stuart"
572
url = "https://www.thescottishvoice.org.uk/files/cm/files/"
573

574
if os.path.exists(filename + ".wav") == False:
575
    print(f"Downloading {filename}.mp3 from {url}")
576
    command = f"wget {url}{filename}.mp3"
577
    os.system(command)
578

579
    print(f"Converting mp3 to wav and resampling to 16 kHZ")
580
    command = (
581
        f"ffmpeg -hide_banner -loglevel panic -y -i {filename}.mp3 -acodec "
582
        f"pcm_s16le -ac 1 -ar 16000 {filename}.wav"
583
    )
584
    os.system(command)
585

586
filename = filename + ".wav"
587

588

589
"""
590
The below function `yamnet_class_names_from_csv` was copied and very slightly changed
591
from this [Yamnet Notebook](https://colab.research.google.com/github/tensorflow/hub/blob/master/examples/colab/yamnet.ipynb).
592
"""
593

594

595
def yamnet_class_names_from_csv(yamnet_class_map_csv_text):
596
    """Returns list of class names corresponding to score vector."""
597
    yamnet_class_map_csv = io.StringIO(yamnet_class_map_csv_text)
598
    yamnet_class_names = [
599
        name for (class_index, mid, name) in csv.reader(yamnet_class_map_csv)
600
    ]
601
    yamnet_class_names = yamnet_class_names[1:]  # Skip CSV header
602
    return yamnet_class_names
603

604

605
yamnet_class_map_path = yamnet_model.class_map_path().numpy()
606
yamnet_class_names = yamnet_class_names_from_csv(
607
    tf.io.read_file(yamnet_class_map_path).numpy().decode("utf-8")
608
)
609

610

611
def calculate_number_of_non_speech(scores):
612
    number_of_non_speech = tf.math.reduce_sum(
613
        tf.where(tf.math.argmax(scores, axis=1, output_type=tf.int32) != 0, 1, 0)
614
    )
615

616
    return number_of_non_speech
617

618

619
def filename_to_predictions(filename):
620
    # Load 16k audio wave
621
    audio_wav = load_16k_audio_wav(filename)
622

623
    # Get audio embeddings & scores.
624
    scores, embeddings, mel_spectrogram = yamnet_model(audio_wav)
625

626
    print(
627
        "Out of {} samples, {} are not speech".format(
628
            scores.shape[0], calculate_number_of_non_speech(scores)
629
        )
630
    )
631

632
    # Predict the output of the accent recognition model with embeddings as input
633
    predictions = model.predict(embeddings)
634

635
    return audio_wav, predictions, mel_spectrogram
636

637

638
"""
639
Let's run the model on the audio file:
640
"""
641

642
audio_wav, predictions, mel_spectrogram = filename_to_predictions(filename)
643

644
infered_class = class_names[predictions.mean(axis=0).argmax()]
645
print(f"The main accent is: {infered_class} English")
646

647
"""
648
Listen to the audio
649
"""
650

651
Audio(audio_wav, rate=16000)
652

653
"""
654
The below function was copied from this [Yamnet notebook](tinyurl.com/4a8xn7at) and adjusted to our need.
655

656
This function plots the following:
657

658
* Audio waveform
659
* Mel spectrogram
660
* Predictions for every time step
661
"""
662

663
plt.figure(figsize=(10, 6))
664

665
# Plot the waveform.
666
plt.subplot(3, 1, 1)
667
plt.plot(audio_wav)
668
plt.xlim([0, len(audio_wav)])
669

670
# Plot the log-mel spectrogram (returned by the model).
671
plt.subplot(3, 1, 2)
672
plt.imshow(
673
    mel_spectrogram.numpy().T, aspect="auto", interpolation="nearest", origin="lower"
674
)
675

676
# Plot and label the model output scores for the top-scoring classes.
677
mean_predictions = np.mean(predictions, axis=0)
678

679
top_class_indices = np.argsort(mean_predictions)[::-1]
680
plt.subplot(3, 1, 3)
681
plt.imshow(
682
    predictions[:, top_class_indices].T,
683
    aspect="auto",
684
    interpolation="nearest",
685
    cmap="gray_r",
686
)
687

688
# patch_padding = (PATCH_WINDOW_SECONDS / 2) / PATCH_HOP_SECONDS
689
# values from the model documentation
690
patch_padding = (0.025 / 2) / 0.01
691
plt.xlim([-patch_padding - 0.5, predictions.shape[0] + patch_padding - 0.5])
692
# Label the top_N classes.
693
yticks = range(0, len(class_names), 1)
694
plt.yticks(yticks, [class_names[top_class_indices[x]] for x in yticks])
695
_ = plt.ylim(-0.5 + np.array([len(class_names), 0]))
696

697
Product

Resources

Company