I am trying to create a video of a text in which the text is narrated by text-to-speech.
To create the video file, I use the VideoFileWriter class of AForge.NET as follows:
VideoWriter = new VideoFileWriter();
VideoWriter.Open(CurVideoFile, (int)(Properties.Settings.Default.VideoWidth),
(int)(Properties.Settings.Default.VideoHeight), 25, VideoCodec.MPEG4, 800000);
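The Screen bitmap drawn on later is simply the frame image; a minimal sketch, assuming it is allocated to match the video dimensions (this line is not part of my original code):
Screen = new Bitmap((int)Properties.Settings.Default.VideoWidth,
    (int)Properties.Settings.Default.VideoHeight);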
To read the text aloud, I use the SpeechSynthesizer class and write the output to a wave stream:
AudioStream = new FileStream(CurAudioFile, FileMode.Create);
synth.SetOutputToWaveStream(AudioStream);
I want to highlight the word that is being spoken in the video, so I synchronize them using the SpeakProgress event:
void synth_SpeakProgress(object sender, SpeakProgressEventArgs e)
{
    curAudioPosition = e.AudioPosition;
    using (Graphics g = Graphics.FromImage(Screen))
    {
        g.DrawString(e.Text,....);
    }
    VideoWriter.WriteVideoFrame(Screen, curAudioPosition);
}
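The handler is wired up in the usual way; a rough sketch, assuming the text is spoken asynchronously (the text variable here is just a placeholder):
synth.SpeakProgress += synth_SpeakProgress;
synth.SpeakAsync(text); // text: the string being narrated (placeholder name)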
And finally, I merge the video and audio using ffmpeg:
using (Process process = new Process())
{
process.StartInfo.FileName = exe_path;
process.StartInfo.Arguments =
string.Format(@"-i ""{0}"" -i ""{1}"" -y -acodec copy -vcodec copy ""{2}""", avi_path, mp3_path, output_file);
// ...
}
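The elided part just runs ffmpeg and waits for it to finish; a minimal sketch of what it is assumed to look like inside the using block:
process.StartInfo.UseShellExecute = false;
process.StartInfo.CreateNoWindow = true;
process.Start();
process.WaitForExit();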
The problem is that for some voices, such as Microsoft Hazel, Zira, and David on Windows 8.1, the video is not synchronized with the audio: the audio runs much faster than the displayed subtitle. However, it works with the Windows 7 voices.
How can I synchronize them so that it works with any text-to-speech voice on any operating system?
It seems that e.AudioPosition is inaccurate, as mentioned in Are the SpeakProgressEventArgs of the SpeechSynthesizer inaccurate?. I ran the same experiment and got the same result.
I have noticed that if I adjust the audio format, I can get close to the actual timing, but it still doesn't work for every voice.
var formats = CurVoice.VoiceInfo.SupportedAudioFormats;
if (formats.Count > 0)
{
    // Use the voice's native format so the written audio matches the reported positions.
    var format = formats[0];
    reader.SetOutputToWaveFile(CurAudioFile, format);
}
else
{
    AudioStream = new FileStream(CurAudioFile, FileMode.Create);
    reader.SelectVoice(CurVoice.VoiceInfo.Name);
    var fmt = new SpeechAudioFormatInfo(16000, AudioBitsPerSample.Sixteen, AudioChannel.Mono);
    // This gets closer to the actual timing, but it is still not precise.
    MemStream = new MemoryStream();
    var mi = reader.GetType().GetMethod("SetOutputStream", BindingFlags.Instance | BindingFlags.NonPublic);
    mi.Invoke(reader, new object[] { MemStream, fmt, true, true });
}
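My understanding of why the format matters: for uncompressed PCM, the elapsed playback time is just the number of bytes written divided by the format's AverageBytesPerSecond, so the position could in principle be derived from the bytes actually written instead of from e.AudioPosition. A rough sketch of that idea (my assumption, ignoring any RIFF/WAVE header and not verified for every voice):
TimeSpan GetAudioPosition(Stream audioStream, SpeechAudioFormatInfo fmt)
{
    long bytesWritten = audioStream.Position; // raw PCM bytes written so far
    return TimeSpan.FromSeconds((double)bytesWritten / fmt.AverageBytesPerSecond);
}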