diff --git a/README.md b/README.md
index 3cbf02b..85a0498 100644
--- a/README.md
+++ b/README.md
@@ -1,7 +1,8 @@
-# Kaldi: VoxForge tutorial
-This repositoty is modified from [yesno_tutorial](https://github.com/ekapolc/ASR_classproject/tree/master/yesnotutorial)
+# Kaldi: automatic speech recognition tutorial
+This repository is mainly modified from the [yesno_tutorial](https://github.com/ekapolc/ASR_classproject/tree/master/yesnotutorial); all references are listed at the bottom of the tutorial.
-This tutorial will guide you through some basic functionalities and operations of [Kaldi](http://kaldi-asr.org/) ASR toolkit using [VoxForge](http://www.voxforge.org/home/downloads) dataset which is one of the most popular datasets for auto speech recognition.
+This tutorial will guide you through some basic functionalities and operations of the [Kaldi](http://kaldi-asr.org/) ASR toolkit, which can be applied to general automatic speech recognition tasks.
+In this tutorial, we will use the [VoxForge](http://www.voxforge.org/home/downloads) dataset, one of the most popular datasets for automatic speech recognition.

## Step 0 - Installing Kaldi

@@ -15,7 +16,7 @@ The Kaldi will run on POSIX systems, with these software/libraries pre-installed
* [`git`](https://git-scm.com/)
* (optional) [`sox`](http://sox.sourceforge.net/)

-Recommendation: For Windows users, although Kaldi is supported in Windows, I highly recommend you to install it in a container of the UNIX operating system such as Linux.
+Recommendation: for Windows users, although Kaldi is supported on Windows, I highly recommend you install Kaldi in a container running a UNIX-like operating system such as Linux.

The entire compilation can take a couple of hours and up to 8 GB of storage depending on your system specification and configuration. Make sure you have enough resource before start compiling.

@@ -88,8 +89,7 @@ Now, for each dataset (train, test), we need to generate these files representin
    * Since we have only one speaker in this example, let's use "global" as speaker_id
* `spk2utt`
    * Simply inverse indexed `utt2spk` (`<speaker_id> <utterance_id1> <utterance_id2> ...`)
-    * Can use a Kaldi utility to generate
-    * `utils/utt2spk_to_spk2utt.pl data/train_yesno/utt2spk > data/train_yesno/spk2utt`
+* `full_vocab`: list of all the vocabulary in the text of the training data (this file will be used for making the dictionary)
* (optional) `segments`: *not used for this data.*
    * Contains utterance segmentation/alignment information for each recording.
    * Only required when a file contains multiple utterances, which is not this case.
@@ -99,30 +99,227 @@ Now, for each dataset (train, test), we need to generate these files representin
    * Map from speakers to their gender information.
    * Used in vocal tract length normalization.

-our task is to generate these files. although you can use this [preparation_data.ipynb](https://github.com/nessessence/Kaldi_VoxForge/blob/master/data_preparation.ipynb) python notebook which is very easy to use, I encourage you to write your own script because it'll improve your understanding of Kaldi format.
-Kaldi has a scrip that clean up any possible errors in the data. Run
+Our task is to generate these files. You can use the Python notebook [preparation_data.ipynb](https://github.com/nessessence/Kaldi_VoxForge/blob/master/data_preparation.ipynb), but if this is your first time with Kaldi, I encourage you to write your own script because it will improve your understanding of the Kaldi format.
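+If you do write your own script, here is a minimal shell sketch of the idea. It assumes a hypothetical layout with the training audio in `wav/train/<utterance_id>.wav` and the transcripts in `transcripts/train.txt` (one `<utterance_id> <transcript>` per line); these paths are placeholders only, so adjust them to wherever your VoxForge files actually live.
+
+```bash
+# Hypothetical layout: wav/train/<utterance_id>.wav and transcripts/train.txt
+# with one "<utterance_id> <transcript>" line per utterance.
+mkdir -p data/train
+
+# wav.scp: <utterance_id> <absolute-path-to-wav>
+for f in wav/train/*.wav; do
+  utt=$(basename "$f" .wav)
+  echo "$utt $PWD/$f"
+done | sort > data/train/wav.scp
+
+# text: <utterance_id> <transcript>, sorted by utterance id
+sort transcripts/train.txt > data/train/text
+
+# utt2spk: we use a single global speaker, so every utterance maps to "global"
+awk '{print $1, "global"}' data/train/text > data/train/utt2spk
+
+# full_vocab: unique words appearing in the training transcripts
+cut -d' ' -f2- data/train/text | tr -s ' ' '\n' | sort -u > data/train/full_vocab
+```
+
+Repeat the same steps for the test set (minus `full_vocab`).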
+Note: you can generate the `spk2utt` file using a Kaldi utility:
+```bash
+utils/utt2spk_to_spk2utt.pl data/train/utt2spk > data/train/spk2utt
+```
+Kaldi has a script that cleans up any possible errors in the data. Run
```bash
utils/fix_data_dir.sh data/train/
utils/fix_data_dir.sh data/test/
```
-
If you're done with the code, your data directory should look like this, at this point.
```
data
-├───train_yesno
+├───train
│   ├───text
│   ├───utt2spk
│   ├───spk2utt
-│   └───wav.scp
-└───test_yesno
+│   ├───wav.scp
+│   └───full_vocab   *only for the train directory
+└───test
    ├───text
    ├───utt2spk
    ├───spk2utt
    └───wav.scp
```
-
+## Step 2 - Dictionary preparation
+
+This section will cover how to build the language knowledge - the lexicon and phone dictionaries - for the Kaldi recognizer.
+
+### Before moving on
+
+From here, we will use several Kaldi utilities (included in the `steps` and `utils` directories) to process the data further. To do that, the Kaldi binaries should be in your `$PATH`.
+However, Kaldi is a huge framework, and there are many binaries distributed over many different directories, depending on their purpose.
+So we will use a script, `path.sh`, to add all of them to the `$PATH` of the subshell every time a script runs (we will see this later).
+All you need to do right now is open the `path.sh` file and edit the `$KALDI_ROOT` variable to point to your Kaldi installation location.
+
+### Defining blocks of the language: Lexicon
+
+Next we will build the dictionaries. Let's start by creating an intermediate `dict` directory at the root.
+
+```bash
+vf# mkdir -p data/local/dict
+```
+Your `dict` directory should contain at least these 5 files:
+
+* `lexicon.txt`: list of word-phone pairs
+* (optional) `lexiconp.txt`: list of word-prob-phone pairs
+* `silence_phones.txt`: list of silent phones
+* `nonsilence_phones.txt`: list of non-silent phones (including various kinds of noise, laugh, cough, filled pauses etc.)
+* `optional_silence.txt`: contains just a single phone (typically SIL)
+
+We can use `utils/prepare_dict.sh` to generate all of the files above except `lexiconp.txt`.
+A brief explanation of what `utils/prepare_dict.sh` does:
+1. Downloads a general open-source word-phone dictionary (this tutorial uses "cmudict").
+2. Writes the entries for words contained in both the general dictionary and `full_vocab` to `lexicon-iv.txt`.
+3. Writes the words contained in `full_vocab` (which we generated during data preparation) but not in the general dictionary to `vocab-oov.txt` (oov stands for "out-of-vocabulary").
+4. Generates pronunciations for those OOV words using a pre-trained Sequitur G2P model in `conf/g2p_model` and stores the pairs in `lexicon-oov.txt`.
+5. Merges `lexicon-iv.txt` and `lexicon-oov.txt`, then adds the silence entry (typically `<SIL> SIL`) to generate `lexicon.txt`.
+6. Generates the other files.
+
+Note: all of the files are in alphabetical order, and you can change the parameter `ss` at the top of `utils/prepare_dict.sh` to set the silence symbol as you want (this tutorial uses `<SIL>` as the silence symbol).
+
+Let's look at each file's format and contents.
+
+`lexicon.txt`: `<word> <phone 1> <phone 2> ...`, in alphabetical order of word.
+
+```bash
+vf# head -5 data/local/dict/lexicon.txt
+A AH
+A EY
+ABANDONMENT AH B AE N D AH N M AH N T
+ABLE EY B AH L
+ABNORMAL AE B N AO R M AH L
+```
+Note: as you can see, `lexicon.txt` will contain repeated entries for the same word on separate lines if we have multiple pronunciations for it.
+
+`lexiconp.txt`: `<word> <probability> <phone 1> <phone 2> ...`, similar to `lexicon.txt` but with a pronunciation-probability term added after the word.
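+
+For illustration only, a few `lexiconp.txt` entries could look like this (the probabilities here are made up, not taken from a real dictionary):
+```bash
+vf# head -3 data/local/dict/lexiconp.txt
+A 1.0 AH
+A 0.5 EY
+ABANDONMENT 1.0 AH B AE N D AH N M AH N T
+```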
+
+`silence_phones.txt`:
+```bash
+vf# more data/local/dict/silence_phones.txt
+SIL
+```
+`nonsilence_phones.txt`:
+```bash
+vf# head -10 data/local/dict/nonsilence_phones.txt
+AA
+AE
+AH
+AO
+AW
+AY
+B
+CH
+D
+DH
+```
+`optional_silence.txt`:
+```bash
+vf# more data/local/dict/optional_silence.txt
+SIL
+```
+**Note** that `<SIL>` will also be used as our OOV token later.
+
+Finally, we need to convert our dictionaries into a data structure that Kaldi accepts - a finite state transducer (FST). Among the many scripts Kaldi provides, we will use `utils/prepare_lang.sh` to generate FST-ready data formats to represent our language definition.
+
+```bash
+utils/prepare_lang.sh --position-dependent-phones false <dict-dir> <oov-term> <tmp-dir> <lang-dir>
+```
+We set the `--position-dependent-phones` flag to false to keep the phone inventory simple for this tutorial. For the required parameters we will use:
+
+* `<dict-dir>`: `data/local/dict`
+* `<oov-term>`: `"<SIL>"`
+* `<tmp-dir>`: could be anywhere. I'll just put a new directory `tmp` inside `dict`.
+* `<lang-dir>`: this output will be used in further training. Set it to `data/lang`.
+
+```bash
+vf# ls data/lang
+L.fst L_disambig.fst oov.int oov.txt phones phones.txt topo words.txt
+```
+
+## Step 3 - Feature extraction and training
+
+This section will cover how to perform MFCC feature extraction and GMM modeling.
+
+### Feature extraction
+
+Once we have all the data ready, it's time to extract features for GMM training.
+
+First, extract the mel-frequency cepstral coefficients.
+
+```bash
+vf# steps/make_mfcc.sh --nj <N> <data-dir> <log-dir> <mfcc-dir>
+```
+
+* `--nj <N>`: number of parallel jobs (defaults to 4). Kaldi splits the work by speaker, so `nj` must be less than or equal to the number of speakers in `<data-dir>`. For this simple tutorial, which has 1 speaker, `nj` must be 1.
+* `<data-dir>`: where we put the 'data' of the training set.
+* `<log-dir>`: directory to dump log files. Let's put the output in `exp/make_mfcc/train`, following the Kaldi recipe convention.
+* `<mfcc-dir>`: directory to put the features. The convention uses `mfcc/train`.
+
+```bash
+vf# ls mfcc/train
+raw_mfcc_train.1.ark raw_mfcc_train.2.scp raw_mfcc_train.4.ark
+raw_mfcc_train.1.scp raw_mfcc_train.3.ark raw_mfcc_train.4.scp
+raw_mfcc_train.2.ark raw_mfcc_train.3.scp
+```
+Now normalize the cepstral features using Cepstral Mean Normalization. This step also performs an extra variance normalization, so the whole process is called Cepstral Mean and Variance Normalization (CMVN).
+
+```bash
+vf# steps/compute_cmvn_stats.sh <data-dir> <log-dir> <mfcc-dir>
+```
+`<data-dir>`, `<log-dir>`, and `<mfcc-dir>` are the same as above.
+
+The two scripts will create `feats.scp` and `cmvn.scp`, which specify where the computed MFCC features and CMVN statistics are. They are just text files with an `<id> <path-to-archive-entry>` pair on each line. With this setup, by passing the `data/train` directory to a Kaldi script, you are passing various pieces of information, such as the transcriptions, the locations of the wav files, or the MFCC features.
+
+**Note** that these shell scripts (`.sh`) are all pipelines through Kaldi binaries with trivial text processing on the fly. To see which commands were actually executed, see the log files in `<log-dir>`. Or even better, look inside the scripts. For details on specific Kaldi commands, refer to [the official documentation](http://kaldi-asr.org/doc/tools.html).
+
+### Training Acoustic Models
+In this step, we'll train an acoustic model using Kaldi utilities. The usage is as follows:
+```bash
+vf# steps/train_mono.sh --nj <num-jobs> --cmd <main-cmd> --totgauss 400 <data-dir> <lang-dir> <exp-dir>
+```
+* `--cmd <main-cmd>`: to use local machine resources, use the `"utils/run.pl"` pipeline.
+* `--totgauss <num-gauss>`: limits the total number of Gaussians (here 400).
+* `--nj <num-jobs>`: utterances from one speaker cannot be processed in parallel, and since we have only one speaker, we must use 1 job only.
+* `<data-dir>`: path to our training 'data' (`data/train`).
+* `<lang-dir>`: path to the language definition (the output of the `prepare_lang` script, `data/lang`).
+* `<exp-dir>`: like the previous experiment directories, use `exp/mono`.
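+
+Putting these together, one possible concrete invocation (just an illustrative sketch following the directory conventions used above) would be:
+```bash
+vf# steps/train_mono.sh --nj 1 --cmd "utils/run.pl" --totgauss 400 data/train data/lang exp/mono
+```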
+
+When you run the command, you will notice it performing EM training: each iteration runs an alignment stage and an update stage.
+
+This will generate FST-based training graphs for the acoustic model. Kaldi provides a tool to look inside the model (which may not make much sense right now).
+
+```bash
+/path/to/kaldi/src/fstbin/fstcopy 'ark:gunzip -c exp/mono/fsts.1.gz|' ark,t:- | head -n 20
+```
+This will print out the first 20 lines of the graph in a human-readable(!) format (each column indicates: Q-from, Q-to, S-in, S-out, Cost).
+
+## Step 4 - Decoding and testing
+
+This section will cover decoding with the model we trained.
+
+### Graph decoding
+
+Now we're done with acoustic model training.
+For decoding, we need new input that is run through our AM & LM networks.
+In Step 1, we prepared a separate test set in `data/test` for this purpose.
+Now it's time to project it into the feature space as well.
+Use `steps/make_mfcc.sh` and `steps/compute_cmvn_stats.sh`.
+
+Then, we need to build a fully connected FST (HCLG) network.
+
+```bash
+utils/mkgraph.sh --mono data/lang_test_tg exp/mono exp/mono/graph_tgpr
+```
+This will build a connected HCLG in the `exp/mono/graph_tgpr` directory.
+
+Finally, we need to find the best paths for the utterances in the test set, using the decoding script. Look inside the decode script, figure out what to give as its parameters, and run it. Write the decoding results to `exp/mono/decode_test`.
+
+```bash
+steps/decode.sh
+```
+
+This will end up with `lat.N.gz` files in the output directory, where N goes from 1 up to the number of jobs you used (which must be 1 for this task). These files contain lattices from the utterances that were processed by the Nth thread of your decoding operation.
+
+### Looking at results
+
+If you look inside the decoding script, it ends by calling the scoring script (`local/score.sh`), which generates hypotheses and computes the word error rate (WER) of the test set.
+See the `exp/mono/decode_test/wer_X` files for the WERs, and the `exp/mono/decode_test/scoring/X.tra` files for the transcripts.
+`X` here indicates the language model weight, *LMWT*, that the scoring script used at each iteration to interpret the best paths for utterances in the `lat.N.gz` files into word sequences. (Remember that `N` is the thread number from the decoding operation.)
+You can specify the range of weights using the `--min_lmwt` and `--max_lmwt` options when `score.sh` is called, if you want.
+(See the lecture slides on decoding to refresh what LMWT is, if you are not sure.)
+
+Or, if you are interested in getting word-level alignment information for each recording, take a look at the `steps/get_ctm.sh` script.
+
+### References and useful resources
+* [Official Kaldi documentation](https://kaldi-asr.org/doc)
+* [kaldi-yesno-tutorial README](https://github.com/keighrim/kaldi-yesno-tutorial/blob/master/README.md)