Live speech recognition

I have a Python script that uses the speech_recognition package to recognize speech and return the text of what was spoken. The transcription has a delay of a few seconds, however. Is there another way to write this script so that it returns each word as it is spoken? I have another script that does this, using the pocketsphinx package, but the results are wildly inaccurate.

Install dependencies:

pip install SpeechRecognition
pip install pocketsphinx

Script 1 - Delayed speech-to-text:

import speech_recognition as sr

# obtain audio from the microphone
r = sr.Recognizer()
with sr.Microphone() as source:
    print("Please wait. Calibrating microphone...")
    # listen for 5 seconds to calibrate the energy threshold for ambient noise
    r.adjust_for_ambient_noise(source, duration=5)
    print("Say something!")
    audio = r.listen(source)

    # recognize speech using Sphinx
    try:
        print("Sphinx thinks you said '" + r.recognize_sphinx(audio) + "'")
    except sr.UnknownValueError:
        print("Sphinx could not understand audio")
    except sr.RequestError as e:
        print("Sphinx error; {0}".format(e))

Script 2 - Immediate albeit inaccurate speech-to-text:

import os
from pocketsphinx import LiveSpeech, get_model_path

model_path = get_model_path()
speech = LiveSpeech(
    verbose=False,
    sampling_rate=16000,
    buffer_size=2048,
    no_search=False,
    full_utt=False,
    hmm=os.path.join(model_path, 'en-us'),
    lm=os.path.join(model_path, 'en-us.lm.bin'),
    dic=os.path.join(model_path, 'cmudict-en-us.dict')
)
for phrase in speech:
    print(phrase)
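
If per-word output is the goal, the pocketsphinx wrapper used above also exposes word-level segments on each decoded phrase. Replacing the print loop in Script 2 with something like the following should emit one entry per word (a sketch; segments(detailed=True) is shown in the pocketsphinx-python README, and the tuple layout given in the comment is my reading of it):

for phrase in speech:
    # each segment tuple is roughly (word, log_probability, start_frame, end_frame)
    for seg in phrase.segments(detailed=True):
        print(seg)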
Matazzoni answered 29/10, 2017 at 20:32 Comment(3)
Most likely you are running this on something like a Raspberry Pi, which is not powerful enough for large-vocabulary continuous speech recognition with a large dictionary. – Sanctus
What if you listen for 1 s and then print the word? There might be some loss, but it would return output per word. Would that work? – Excurvature
Are you sure that both systems are using the same language model? – Paresis

If you have a CUDA-enabled GPU, you can try Mozilla's DeepSpeech GPU package; there is also a CPU version in case you don't. On a CPU, DeepSpeech transcribes at about 1.3x real time, whereas on a GPU it runs at about 0.3x real time, i.e. it transcribes 1 second of audio in roughly 0.33 seconds. Quickstart:

# Create and activate a virtualenv
virtualenv -p python3 $HOME/tmp/deepspeech-gpu-venv/
source $HOME/tmp/deepspeech-gpu-venv/bin/activate

# Install DeepSpeech CUDA enabled package
pip3 install deepspeech-gpu

# Transcribe an audio file.
deepspeech --model deepspeech-0.6.1-models/output_graph.pbmm \
    --lm deepspeech-0.6.1-models/lm.binary \
    --trie deepspeech-0.6.1-models/trie \
    --audio audio/2830-3980-0043.wav

Some important notes: deepspeech-gpu has dependencies such as TensorFlow, CUDA, and cuDNN, so check out the GitHub repo for more details: https://github.com/mozilla/DeepSpeech
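
The quickstart above transcribes a finished file; the "live" part comes from DeepSpeech's streaming API, which returns intermediate transcripts while audio is still arriving. A minimal sketch, assuming the deepspeech 0.7+ Python bindings (where createStream() returns a stream object; 0.6.x passes a stream context into model methods instead), 16 kHz 16-bit mono input, and PyAudio for microphone capture; the model path is an example:

import numpy as np
import pyaudio
from deepspeech import Model

model = Model('deepspeech-0.7.4-models.pbmm')  # example model path
stream = model.createStream()

pa = pyaudio.PyAudio()
mic = pa.open(rate=16000, channels=1, format=pyaudio.paInt16,
              input=True, frames_per_buffer=1024)

try:
    while True:
        chunk = np.frombuffer(mic.read(1024), dtype=np.int16)
        stream.feedAudioContent(chunk)
        # best transcript so far, refined as more audio arrives
        print(stream.intermediateDecode(), end='\r', flush=True)
except KeyboardInterrupt:
    print('\nFinal:', stream.finishStream())
    mic.stop_stream()
    mic.close()
    pa.terminate()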

Cinchonism answered 24/1, 2020 at 19:49 Comment(5)
What about something that is not hardware related? – Guidotti
@Damian-TeodorBeleș Can you please elaborate? I am not sure what you are asking. – Cinchonism
What if this doesn't hold: "If you happen to have a CUDA enabled GPU then you can try Mozilla's DeepSpeech"? – Guidotti
DeepSpeech can run on a CPU as well; it's just that inference is faster on the GPU. Other than that, it's all the same. – Cinchonism
OK, I see, thanks, but which is the "live" part in an "audio file"? – Guidotti
