How can I get results for each utterance from the Google Speech API and save each audio utterance chunk separately as a WAV file?

I'm using the Python script below to get predictions from the Google Speech API from a live streaming audio input.

The issue is that I need a prediction from the Google Speech API for each utterance, and I also need to save the audio of each spoken utterance to disk.

I'm not sure how to modify the script to save the live audio for each utterance and to print results per utterance rather than as one continuous prediction.

#!/usr/bin/env python

import os
import re
import sys
import time

from google.cloud import speech
import pyaudio
from six.moves import queue

# Audio recording parameters
STREAMING_LIMIT = 240000  # 4 minutes
SAMPLE_RATE = 16000
CHUNK_SIZE = int(SAMPLE_RATE / 10)  # 100ms

api_key = r'path_to_json_file\google.json'
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = api_key

RED = '\033[0;31m'
GREEN = '\033[0;32m'
YELLOW = '\033[0;33m'


def get_current_time():
    """Return Current Time in MS."""

    return int(round(time.time() * 1000))


class ResumableMicrophoneStream:
    """Opens a recording stream as a generator yielding the audio chunks."""

    def __init__(self, rate, chunk_size):
        self._rate = rate
        self.chunk_size = chunk_size
        self._num_channels = 1
        self._buff = queue.Queue()
        self.closed = True
        self.start_time = get_current_time()
        self.restart_counter = 0
        self.audio_input = []
        self.last_audio_input = []
        self.result_end_time = 0
        self.is_final_end_time = 0
        self.final_request_end_time = 0
        self.bridging_offset = 0
        self.last_transcript_was_final = False
        self.new_stream = True
        self._audio_interface = pyaudio.PyAudio()
        self._audio_stream = self._audio_interface.open(
            format=pyaudio.paInt16,
            channels=self._num_channels,
            rate=self._rate,
            input=True,
            frames_per_buffer=self.chunk_size,
            # Run the audio stream asynchronously to fill the buffer object.
            # This is necessary so that the input device's buffer doesn't
            # overflow while the calling thread makes network requests, etc.
            stream_callback=self._fill_buffer,
        )

    def __enter__(self):

        self.closed = False
        return self

    def __exit__(self, type, value, traceback):

        self._audio_stream.stop_stream()
        self._audio_stream.close()
        self.closed = True
        # Signal the generator to terminate so that the client's
        # streaming_recognize method will not block the process termination.
        self._buff.put(None)
        self._audio_interface.terminate()

    def _fill_buffer(self, in_data, *args, **kwargs):
        """Continuously collect data from the audio stream, into the buffer."""

        self._buff.put(in_data)
        return None, pyaudio.paContinue

    def generator(self):
        """Stream Audio from microphone to API and to local buffer"""

        while not self.closed:
            data = []

            if self.new_stream and self.last_audio_input:

                chunk_time = STREAMING_LIMIT / len(self.last_audio_input)

                if chunk_time != 0:

                    if self.bridging_offset < 0:
                        self.bridging_offset = 0

                    if self.bridging_offset > self.final_request_end_time:
                        self.bridging_offset = self.final_request_end_time

                    chunks_from_ms = round((self.final_request_end_time -
                                            self.bridging_offset) / chunk_time)

                    self.bridging_offset = (round((
                        len(self.last_audio_input) - chunks_from_ms)
                                                  * chunk_time))

                    for i in range(chunks_from_ms, len(self.last_audio_input)):
                        data.append(self.last_audio_input[i])

                self.new_stream = False

            # Use a blocking get() to ensure there's at least one chunk of
            # data, and stop iteration if the chunk is None, indicating the
            # end of the audio stream.
            chunk = self._buff.get()
            self.audio_input.append(chunk)

            if chunk is None:
                return
            data.append(chunk)
            # Now consume whatever other data's still buffered.
            while True:
                try:
                    chunk = self._buff.get(block=False)

                    if chunk is None:
                        return
                    data.append(chunk)
                    self.audio_input.append(chunk)

                except queue.Empty:
                    break

            yield b''.join(data)


def listen_print_loop(responses, stream):
    """Iterates through server responses and prints them.
    The responses passed is a generator that will block until a response
    is provided by the server.
    Each response may contain multiple results, and each result may contain
    multiple alternatives;  Here we
    print only the transcription for the top alternative of the top result.
    In this case, responses are provided for interim results as well. If the
    response is an interim one, print a line feed at the end of it, to allow
    the next result to overwrite it, until the response is a final one. For the
    final one, print a newline to preserve the finalized transcription.
    """

    for response in responses:

        if get_current_time() - stream.start_time > STREAMING_LIMIT:
            stream.start_time = get_current_time()
            break

        if not response.results:
            continue

        result = response.results[0]

        if not result.alternatives:
            continue

        transcript = result.alternatives[0].transcript

        result_seconds = 0
        result_nanos = 0

        if result.result_end_time.seconds:
            result_seconds = result.result_end_time.seconds

        if result.result_end_time.nanos:
            result_nanos = result.result_end_time.nanos

        stream.result_end_time = int((result_seconds * 1000)
                                     + (result_nanos / 1000000))

        corrected_time = (stream.result_end_time - stream.bridging_offset
                          + (STREAMING_LIMIT * stream.restart_counter))
        # Display interim results, but with a carriage return at the end of the
        # line, so subsequent lines will overwrite them.

        if result.is_final:

            sys.stdout.write(GREEN)
            sys.stdout.write('\033[K')
            sys.stdout.write(str(corrected_time) + ': ' + transcript + '\n')

            stream.is_final_end_time = stream.result_end_time
            stream.last_transcript_was_final = True

            # Exit recognition if any of the transcribed phrases could be
            # one of our keywords.
            if re.search(r'\b(exit|quit)\b', transcript, re.I):
                sys.stdout.write(YELLOW)
                sys.stdout.write('Exiting...\n')
                stream.closed = True
                break

        else:
            sys.stdout.write(RED)
            sys.stdout.write('\033[K')
            sys.stdout.write(str(corrected_time) + ': ' + transcript + '\r')

            stream.last_transcript_was_final = False


def main():
    """start bidirectional streaming from microphone input to speech API"""

    client = speech.SpeechClient()
    config = speech.types.RecognitionConfig(
        encoding=speech.enums.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=SAMPLE_RATE,
        language_code='en-US',
        max_alternatives=1)
    streaming_config = speech.types.StreamingRecognitionConfig(
        config=config,
        interim_results=True)

    mic_manager = ResumableMicrophoneStream(SAMPLE_RATE, CHUNK_SIZE)
    print(mic_manager.chunk_size)
    sys.stdout.write(YELLOW)
    sys.stdout.write('\nListening, say "Quit" or "Exit" to stop.\n\n')
    sys.stdout.write('End (ms)       Transcript Results/Status\n')
    sys.stdout.write('=====================================================\n')

    with mic_manager as stream:

        while not stream.closed:
            sys.stdout.write(YELLOW)
            sys.stdout.write('\n' + str(
                STREAMING_LIMIT * stream.restart_counter) + ': NEW REQUEST\n')

            stream.audio_input = []
            audio_generator = stream.generator()

            requests = (speech.types.StreamingRecognizeRequest(
                audio_content=content) for content in audio_generator)

            responses = client.streaming_recognize(streaming_config,
                                                   requests)

            # Now, put the transcription responses to use.
            listen_print_loop(responses, stream)

            if stream.result_end_time > 0:
                stream.final_request_end_time = stream.is_final_end_time
            stream.result_end_time = 0
            stream.last_audio_input = []
            stream.last_audio_input = stream.audio_input
            stream.audio_input = []
            stream.restart_counter = stream.restart_counter + 1

            if not stream.last_transcript_was_final:
                sys.stdout.write('\n')
            stream.new_stream = True


if __name__ == '__main__':
    main()
Bertle answered 25/7, 2020 at 23:36 Comment(3)
What do you mean by utterance? A single word, or a sentence? – Nag
@MatthewSalvatoreViglione I mean a sentence. People usually pause before moving on to the next sentence, so I want to treat each sentence as an utterance, get the text for that utterance, and also save the audio for each utterance separately to disk in WAV format. Thank you so much for responding to my question. – Bertle
NOTE: If anyone needs a google-speech-api JSON file to test this out, please let me know and I'll provide the JSON API key. I hope someone can help me out here. – Bertle

It's hard for me to understand everything that is happening in this code, and I don't want to pay for a licence to try it out, but here are some ideas. Maybe someone else will find them useful and can help you further.

Detecting ends of sentences

First, a big problem with separating sentences from speech is that not everyone pauses between sentences in the same way. Some people will wait longer, while others will plow right on to the next one. Some people also pause during sentences. This makes detecting the end of a sentence from audio data hard if you are doing it in a relatively simple way, like trying to detect pauses.

The best way I can imagine would be to use the interpretation you get back from the Google Speech API and split on ending punctuation (., !, ?). Your problem would then be reduced to correlating the returned responses to specific chunks of audio data.
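
For example, a minimal sketch of that split (assuming automatic punctuation is enabled in the request, otherwise the transcript has no sentence-ending punctuation to split on; the function name is just illustrative):

import re

def split_into_sentences(transcript):
    """Split a transcript on sentence-ending punctuation (., !, ?)."""
    parts = re.split(r'(?<=[.!?])\s+', transcript.strip())
    return [p for p in parts if p]

print(split_into_sentences('Hello there. How are you? Fine!'))
# ['Hello there.', 'How are you?', 'Fine!']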

It looks like you could just push None into the stream's buffer queue and the generator will already end gracefully (it returns when it sees a None chunk), so that shouldn't be too bad. You would want to save whatever chunks of audio data generated the transcript around the point where you decide a sentence is over.

This may be hard because when more audio is received, the Google Speech API may decide retroactively that a completed sentence wasn't actually complete but rather part of a longer sentence, so you'll want to watch out for that too.
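
Here is a rough sketch of how that could look: buffer the incoming chunks in a separate, hypothetical utterance_chunks list (so you don't disturb the audio_input list the script already uses to bridge stream restarts) and flush it whenever a result comes back with is_final set. The names utterance_chunks, on_final_result and save_utterance_wav are illustrative, not part of the original script; save_utterance_wav is sketched in the next section.

# In ResumableMicrophoneStream.__init__, add a hypothetical attribute:
#     self.utterance_chunks = []
# and in _fill_buffer, alongside self._buff.put(in_data):
#     self.utterance_chunks.append(in_data)

def on_final_result(stream, transcript, utterance_index):
    """Called from listen_print_loop whenever result.is_final is True."""
    chunks = stream.utterance_chunks
    stream.utterance_chunks = []                  # start collecting the next utterance
    filename = 'utterance_%03d.wav' % utterance_index
    save_utterance_wav(filename, chunks)          # sketched in "Saving audio data" below
    print(filename, '->', transcript)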

Saving audio data

As for saving your raw audio data, once you know which chunks apply to what transcription, just append them all in order to a list (e.g. list_of_chunks) and use wave:

import wave

# `audio` here is a pyaudio.PyAudio() instance; num_channels and rate are
# the values the stream was opened with (1 channel, 16000 Hz in the question).
with wave.open("foo.wav", 'wb') as f:
    f.setnchannels(self._num_channels)
    f.setsampwidth(audio.get_sample_size(pyaudio.paInt16))
    f.setframerate(self._rate)
    f.writeframes(b''.join(list_of_chunks))

You will of course have to make num_channels and rate accessible if you do this outside of your ResumableMicrophoneStream class.
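
For reference, here is a self-contained version of that, wrapped as the hypothetical save_utterance_wav helper used in the earlier sketch (the defaults mirror the mono, 16 kHz, paInt16 setup from the question):

import wave

import pyaudio

def save_utterance_wav(filename, chunks, rate=16000, num_channels=1):
    """Write a list of raw paInt16 audio chunks to a WAV file."""
    pa = pyaudio.PyAudio()
    sample_width = pa.get_sample_size(pyaudio.paInt16)   # 2 bytes per sample
    pa.terminate()
    with wave.open(filename, 'wb') as f:
        f.setnchannels(num_channels)
        f.setsampwidth(sample_width)
        f.setframerate(rate)
        f.writeframes(b''.join(chunks))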

Nag answered 31/7, 2020 at 19:33 Comment(1)
If you can provide your email, I can send you the Google API JSON file that I have so you can help me out. You can then write your own function for getting outputs. Please let me know; it would help me a lot. – Bertle

You can set `single_utterance` in `StreamingRecognitionConfig` to detect a single utterance: the API stops and returns a result as soon as it detects the first pause/silence. That's useful for short commands. Beyond that single utterance, I haven't seen any similar option for detecting multiple utterances.

https://cloud.google.com/speech-to-text/docs/basics
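
A minimal sketch of that option, using the same speech.types objects as the question (only the single_utterance flag is new compared to the question's config):

from google.cloud import speech

config = speech.types.RecognitionConfig(
    encoding=speech.enums.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code='en-US')

streaming_config = speech.types.StreamingRecognitionConfig(
    config=config,
    single_utterance=True,    # stop and return after the first detected pause
    interim_results=True)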

The following settings will give you punctuation and time info for the identified words. Maybe you can use them to accomplish what @matthew-salvatore-viglione has suggested (i.e. separating sentences via punctuation and then using the word time list to identify the corresponding parts of the audio file). If you are not using streaming recognition, then you shouldn't have to worry about the retroactive speech recognition issues either.

{ "enableWordTimeOffsets": boolean, "enableAutomaticPunctuation": boolean, ..... }

https://cloud.google.com/speech-to-text/docs/reference/rest/v1/RecognitionConfig
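
As a rough sketch of those two settings with the same client library as the question (the file name utterances.wav is just a placeholder for audio you have saved):

from google.cloud import speech

client = speech.SpeechClient()
config = speech.types.RecognitionConfig(
    encoding=speech.enums.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code='en-US',
    enable_word_time_offsets=True,        # per-word start/end times
    enable_automatic_punctuation=True)    # punctuation to split sentences on

with open('utterances.wav', 'rb') as f:
    audio = speech.types.RecognitionAudio(content=f.read())

response = client.recognize(config, audio)
for result in response.results:
    for word_info in result.alternatives[0].words:
        start = word_info.start_time.seconds + word_info.start_time.nanos * 1e-9
        end = word_info.end_time.seconds + word_info.end_time.nanos * 1e-9
        print('%.2fs - %.2fs: %s' % (start, end, word_info.word))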

Before going too deep into this with the Google Speech Recognition API, I suggest you also look at other speech recognition services and see whether they provide a sentence detection feature to your liking (an utterance is not the same as a sentence).

Tuppence answered 2/8, 2020 at 6:6 Comment(1)
I've tried DeepSpeech but it doesn't work as well as the Google Speech API, which is why I switched to google-speech-api. So, can you tell me what changes I have to make to get single-utterance behaviour in my code above? – Bertle
