TensorSpeech
GitHub Repository: TensorSpeech/TensorFlowTTS
Path: blob/master/examples/mfa_extraction/README.md

MFA based extraction for FastSpeech

Prepare

Run every command from the root of the repository (TensorFlowTTS/).

  1. (Optional) Modify the MFA scripts to work with your language; the available pretrained models are listed at https://montreal-forced-aligner.readthedocs.io/en/latest/pretrained_models.html

  2. Download the pretrained MFA model and lexicon, then extract the TextGrids:

  • bash examples/mfa_extraction/scripts/prepare_mfa.sh
  • python examples/mfa_extraction/run_mfa.py \
      --corpus_directory ./libritts \
      --output_directory ./mfa/parsed \
      --jobs 8

    After this step, the TextGrids are located in ./mfa/parsed.

  3. Extract durations from the TextGrid files:

  • python examples/mfa_extraction/txt_grid_parser.py \
      --yaml_path examples/fastspeech2_libritts/conf/fastspeech2libritts.yaml \
      --dataset_path ./libritts \
      --text_grid_path ./mfa/parsed \
      --output_durations_path ./libritts/durations \
      --sample_rate 24000
  • Dataset structure after finishing this step:

    |- TensorFlowTTS/
    |   |- LibriTTS/
    |   |-  |- train-clean-100/
    |   |-  |- SPEAKERS.txt
    |   |-  |- ...
    |   |- dataset/
    |   |-  |- 200/
    |   |-  |-  |- 200_124139_000001_000000.txt
    |   |-  |-  |- 200_124139_000001_000000.wav
    |   |-  |-  |- ...
    |   |-  |- 250/
    |   |-  |- ...
    |   |-  |- durations/
    |   |-  |- train.txt
    |   |- tensorflow_tts/
    |   |-  |- models/
    |   |-  |- ...
  4. (Optional) Add your own dataset parser based on tensorflow_tts/processor/experiment/example_dataset.py (if the base dataset processor does not match your dataset).

  5. Run preprocessing and normalization (steps 4 and 5 in examples/fastspeech2_libritts/README.MD).

  6. Run fix_mismatch.py to correct differences of a few frames between the audio and duration files:

  • python examples/mfa_extraction/fix_mismatch.py \
      --base_path ./dump \
      --trimmed_dur_path ./dataset/trimmed-durations \
      --dur_path ./dataset/durations
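To make the duration extraction and mismatch-fix steps above more concrete, here is a minimal sketch of the idea behind them. It is not the code in txt_grid_parser.py or fix_mismatch.py. The function names, the `(label, start, end)` tuple layout, and the hop size of 300 samples are assumptions chosen for illustration; check the hop size against your own YAML config.

```python
def intervals_to_frames(intervals, sample_rate=24000, hop_size=300):
    """Convert (label, start_sec, end_sec) intervals to frame counts.

    hop_size=300 is an assumed value, not read from the repo config.
    """
    frames_per_second = sample_rate / hop_size
    durations = []
    for _label, start, end in intervals:
        # Round the boundaries rather than the lengths, so that
        # rounding errors do not accumulate over the utterance.
        start_frame = round(start * frames_per_second)
        end_frame = round(end * frames_per_second)
        durations.append(end_frame - start_frame)
    return durations


def match_frame_count(durations, n_frames):
    """Adjust durations so that they sum to n_frames (hypothetical helper)."""
    if not durations:
        raise ValueError("durations must not be empty")
    fixed = list(durations)
    diff = n_frames - sum(fixed)
    if diff >= 0:
        # Too few duration frames: extend the last phoneme.
        fixed[-1] += diff
        return fixed
    # Too many duration frames: remove them from the end,
    # never letting a single duration drop below zero.
    surplus = -diff
    for i in reversed(range(len(fixed))):
        removed = min(fixed[i], surplus)
        fixed[i] -= removed
        surplus -= removed
        if surplus == 0:
            break
    return fixed


# Hypothetical alignment for one short utterance.
intervals = [("sil", 0.0, 0.1), ("HH", 0.1, 0.25), ("AH", 0.25, 0.5)]
durations = intervals_to_frames(intervals)
print(durations)
print(match_frame_count(durations, 42))
```

Absorbing the difference in the final phoneme is only one possible policy. It is reasonable when the mismatch is a few frames, as described above, but a large mismatch more likely points to a bad alignment that should be inspected instead.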

Problems with MFA extraction

MFA appears to have problems with trimmed files; in my experiments it works better when each file has about 100 ms of silence at the start and end.
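As a rough illustration of the silence padding mentioned above, the sketch below adds zero-valued samples to both ends of a waveform. The function name and the plain-list representation are assumptions for the example; with real audio you would most likely apply the same idea to an array loaded by your audio library.

```python
def pad_with_silence(samples, sample_rate, pad_ms=100):
    """Return samples with pad_ms of digital silence at both ends."""
    # Number of zero samples that corresponds to pad_ms milliseconds.
    n_pad = int(round(sample_rate * pad_ms / 1000))
    silence = [0] * n_pad
    return silence + list(samples) + silence


# At 24 kHz, 100 ms corresponds to 2400 samples on each side.
padded = pad_with_silence([3, -2, 5], sample_rate=24000)
print(len(padded))
```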

Short files can produce many false positives, such as alignments that contain only silence (seen with LibriTTS), so I would keep only samples longer than 2 s.
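One way to apply the length filter suggested above is sketched below, using only the standard-library `wave` module. The function names are hypothetical, and the sketch assumes the files are uncompressed PCM WAV, which `wave` can read.

```python
import wave


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(str(path), "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def keep_long_samples(paths, min_seconds=2.0):
    """Keep only the files that are longer than min_seconds."""
    return [p for p in paths if wav_duration_seconds(p) > min_seconds]
```

Running the filter before the MFA step means the short files never reach the aligner, so their silence-only alignments never end up in the duration files.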