Uses pyannote and Whisper to get real-time transcriptions along with speaker labels for two users speaking through a single streaming microphone. Credit goes to this YouTube video by Tech Giant. The code is largely based on his, but tweaked to work with two people at once.
I would recommend running on Python 3.12.0; other versions may produce dependency conflicts.
- Will hold the 15s audio clips of each speaker that pyannote uses to create your voiceprints
- Create your account and go here to get access to pyannote models
- Then click on your profile on the top-right and make yourself an access token with write permissions
- Add this token to a `.env` file, and label it `HF_API_KEY=[your HF access token]`
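As a rough sketch of how that token can be read at startup, here is a minimal stand-in for a dotenv loader using only the standard library (the function name `load_hf_token` is hypothetical, not part of this project; the real code may use `python-dotenv` instead):

```python
from pathlib import Path

def load_hf_token(env_path: str = ".env") -> str:
    """Read HF_API_KEY from a .env file (minimal stand-in for python-dotenv)."""
    for line in Path(env_path).read_text().splitlines():
        line = line.strip()
        if line.startswith("HF_API_KEY="):
            # Take everything after the first '=', stripping optional quotes
            return line.split("=", 1)[1].strip().strip("\"'")
    raise KeyError(f"HF_API_KEY not found in {env_path}")
```

The token is then typically passed to pyannote when loading its pretrained pipelines (e.g. via the `use_auth_token` argument of `Pipeline.from_pretrained`).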
- Within this folder, run `git clone https://github.com/ggerganov/whisper.cpp.git`
- Go into the whisper.cpp folder and run `bash ./models/download-ggml-model.sh base.en`
  - This will download the base English model for Whisper
5. Install CMake at this link
- Download the Windows 64-bit installer and run it
- Then build whisper.cpp:

```
cd whisper.cpp
mkdir build
cd build
cmake ..
cmake --build . --config Release
```
- Each audio file should be around 15s, and should feature only your voice in a clear, non-noisy environment
- Label them as `[name].wav` so that your speaker labels have names
- This works for two speakers ONLY, so there must be exactly two audio files here
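The naming convention above can be sketched as a small helper that maps filenames to speaker names and enforces the two-speaker limit (the function name `discover_speakers` and the folder name `speaker_audio` are illustrative assumptions, not the project's actual identifiers):

```python
from pathlib import Path

def discover_speakers(audio_dir: str = "speaker_audio") -> dict:
    """Map speaker names to enrollment clips based on [name].wav filenames."""
    clips = sorted(Path(audio_dir).glob("*.wav"))
    if len(clips) != 2:
        # The project supports exactly two speakers
        raise ValueError(f"Expected exactly 2 enrollment clips, found {len(clips)}")
    # The filename stem (e.g. alice.wav -> "alice") becomes the speaker label
    return {clip.stem: clip for clip in clips}
```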
- Transcription along with speaker labels will be printed out into the terminal
- Print statements with cosine distance will also be printed, which tell you how likely it was that a given speaker was talking (>0.675 means unlikely, <0.675 means likely)
- For best performance, try not to speak at the same time