Synchronize video subtitle with text-to-speech voice

I am trying to create a video of a text in which the text is narrated by text-to-speech.

To create the video file, I use the VideoFileWriter of AForge.NET as follows:

VideoWriter = new VideoFileWriter();

VideoWriter.Open(CurVideoFile, (int)(Properties.Settings.Default.VideoWidth),
    (int)(Properties.Settings.Default.VideoHeight), 25, VideoCodec.MPEG4, 800000);

To read the text aloud, I use the SpeechSynthesizer class and write the output to a wave stream:

AudioStream = new FileStream(CurAudioFile, FileMode.Create);
synth.SetOutputToWaveStream(AudioStream);

I want to highlight the word being spoken in the video, so I synchronize the two in the SpeakProgress event:

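void synth_SpeakProgress(object sender, SpeakProgressEventArgs e)
{
    // Position in the audio stream reported for the word being spoken.
    curAudioPosition = e.AudioPosition;
    using (Graphics g = Graphics.FromImage(Screen))
    {
        g.DrawString(e.Text,....);
    }
    // Write the frame with the audio position as its timestamp.
    VideoWriter.WriteVideoFrame(Screen, curAudioPosition);
}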
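
To make the timing relationship explicit: at the constant 25 fps passed to VideoWriter.Open, the timestamp given to WriteVideoFrame corresponds to a frame index, so any error in e.AudioPosition shifts the highlighted word by that many frames. The following is only a minimal sketch of that arithmetic; the FrameTiming class and the FrameIndexFor helper are hypothetical names and are not part of AForge.NET or System.Speech.

```csharp
using System;

static class FrameTiming
{
    // Hypothetical helper: maps an audio timestamp to the index of the video
    // frame that is on screen at that moment, assuming a constant frame rate.
    public static long FrameIndexFor(TimeSpan audioPosition, int framesPerSecond)
    {
        if (framesPerSecond <= 0)
            throw new ArgumentOutOfRangeException(nameof(framesPerSecond));
        if (audioPosition < TimeSpan.Zero)
            throw new ArgumentOutOfRangeException(nameof(audioPosition));

        // Ticks are 100 ns units; integer math avoids floating-point rounding.
        return audioPosition.Ticks * framesPerSecond / TimeSpan.TicksPerSecond;
    }
}
```

For example, FrameTiming.FrameIndexFor(TimeSpan.FromSeconds(2), 25) gives frame 50, so a reported position that is half a second off moves the highlight by 12 frames.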

Finally, I merge the video and audio using ffmpeg:

using (Process process = new Process())
{
    process.StartInfo.FileName = exe_path;
    process.StartInfo.Arguments =
        string.Format(@"-i ""{0}"" -i ""{1}"" -y -acodec copy -vcodec copy ""{2}""", avi_path, mp3_path, output_file);

    // ...
}

The problem is that for some voices, such as Microsoft Hazel, Zira, and David on Windows 8.1, the video is not synchronized with the audio: the audio runs much faster than the displayed subtitle. With the voices on Windows 7, however, it works.

How can I synchronize them so that it works for any text-to-speech voice on any operating system?

It seems that e.AudioPosition is inaccurate, as mentioned in "Are the SpeakProgressEventArgs of the SpeechSynthesizer inaccurate?". I ran the same experiment and got the same result.

I have noticed that if I adjust the audio format, I can get close to the actual time; however, it doesn't work for every voice.

var formats = CurVoice.VoiceInfo.SupportedAudioFormats;
if (formats.Count > 0)
{
    // Use the first audio format the voice reports as supported.
    var format = formats[0];
    reader.SetOutputToWaveFile(CurAudioFile, format);
}
else
{
    AudioStream = new FileStream(CurAudioFile, FileMode.Create);
    reader.SelectVoice(CurVoice.VoiceInfo.Name);
    var fmt = new SpeechAudioFormatInfo(16000, AudioBitsPerSample.Sixteen, AudioChannel.Mono);
    // This gets closer, but it is still not precise.
    MemStream = new MemoryStream();
    // SetOutputStream is a non-public method, so it is invoked through reflection.
    var mi = reader.GetType().GetMethod("SetOutputStream", BindingFlags.Instance | BindingFlags.NonPublic);
    mi.Invoke(reader, new object[] { MemStream, fmt, true, true });
}
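
One way to measure how far a reported position is from the real one is to compute the playback time directly from the number of PCM bytes written so far, since that depends only on the sample rate, sample size, and channel count. The sketch below shows that arithmetic. The PcmTiming class and DurationOf helper are hypothetical names. The 16000 Hz, 16-bit, mono values come from the snippet above, and the 22050 Hz figure in the note is only an illustrative comparison. Whether the drift really comes from a format mismatch is an assumption, not something confirmed here.

```csharp
using System;

static class PcmTiming
{
    // Hypothetical helper: playback time of raw PCM data (without the WAV
    // header), derived from the byte count and the audio format.
    public static TimeSpan DurationOf(long pcmByteCount, int samplesPerSecond,
                                      int bitsPerSample, int channels)
    {
        if (pcmByteCount < 0)
            throw new ArgumentOutOfRangeException(nameof(pcmByteCount));

        // Bytes consumed per second of audio, e.g. 16000 * 1 * 2 = 32000.
        long bytesPerSecond = (long)samplesPerSecond * channels * (bitsPerSample / 8);
        if (bytesPerSecond <= 0)
            throw new ArgumentException("The audio format must describe at least one byte per second.");

        // Ticks are 100 ns units; integer math keeps the result exact.
        return TimeSpan.FromTicks(pcmByteCount * TimeSpan.TicksPerSecond / bytesPerSecond);
    }
}
```

With 64000 bytes of 16000 Hz, 16-bit, mono audio, DurationOf returns exactly 2 seconds. The same byte count interpreted as 22050 Hz audio would come out at roughly 1.45 seconds, which illustrates how timestamps computed against the wrong format would drift proportionally.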
Breslau answered 26/11, 2015 at 7:1
