I am trying to implement Nexmo's Voice API, with WebSockets, in a .NET Core 2 Web API.
This API needs to:
- receive audio from a phone call, through Nexmo
- use the Microsoft Cognitive Services speech-to-text API
- send the text to a bot
- use Microsoft Cognitive Services text-to-speech on the bot's reply
- send the speech back to Nexmo, through their Voice API WebSocket
For now, I'm bypassing the bot steps, as I am first trying to get the WebSocket connection working. When I try an echo method (sending the received audio straight back to the WebSocket), it works without any issue. But when I try to send the speech produced by Microsoft text-to-speech, the phone call ends.
I can't find any documentation that implements anything other than a simple echo.
The TextToSpeech and SpeechToText methods work as expected when used outside of the WebSocket.
Here's the WebSocket handler with the speech-to-text:
    public static async Task Echo(HttpContext context, WebSocket webSocket)
    {
        var buffer = new byte[1024 * 4];
        WebSocketReceiveResult result = await webSocket.ReceiveAsync(new ArraySegment<byte>(buffer), CancellationToken.None);
        while (!result.CloseStatus.HasValue)
        {
            while (!result.EndOfMessage)
            {
                result = await webSocket.ReceiveAsync(new ArraySegment<byte>(buffer), CancellationToken.None);
            }
            var text = SpeechToText.RecognizeSpeechFromBytesAsync(buffer).Result;
            Console.WriteLine(text);
        }
        await webSocket.CloseAsync(result.CloseStatus.Value, result.CloseStatusDescription, CancellationToken.None);
    }
And here's the WebSocket handler with the text-to-speech:
    public static async Task Echo(HttpContext context, WebSocket webSocket)
    {
        var buffer = new byte[1024 * 4];
        WebSocketReceiveResult result = await webSocket.ReceiveAsync(new ArraySegment<byte>(buffer), CancellationToken.None);
        while (!result.CloseStatus.HasValue)
        {
            var ttsAudio = await TextToSpeech.TransformTextToSpeechAsync("Hello, this is a test", "en-US");
            await webSocket.SendAsync(new ArraySegment<byte>(ttsAudio, 0, ttsAudio.Length), WebSocketMessageType.Binary, true, CancellationToken.None);
            result = await webSocket.ReceiveAsync(new ArraySegment<byte>(buffer), CancellationToken.None);
        }
        await webSocket.CloseAsync(result.CloseStatus.Value, result.CloseStatusDescription, CancellationToken.None);
    }
Update March 1st 2019
In reply to Sam Machin's comment, I tried splitting the array into chunks of 640 bytes each (I'm using a 16 kHz sample rate), but Nexmo still hangs up the call, and I still don't hear anything.
    public static async Task NexmoTextToSpeech(HttpContext context, WebSocket webSocket)
    {
        var ttsAudio = await TextToSpeech.TransformTextToSpeechAsync("This is a test", "en-US");
        var buffer = new byte[1024 * 4];
        WebSocketReceiveResult result = await webSocket.ReceiveAsync(new ArraySegment<byte>(buffer), CancellationToken.None);
        while (!result.CloseStatus.HasValue)
        {
            await SendSpeech(context, webSocket, ttsAudio);
            result = await webSocket.ReceiveAsync(new ArraySegment<byte>(buffer), CancellationToken.None);
        }
        await webSocket.CloseAsync(WebSocketCloseStatus.NormalClosure, "Closing Socket", CancellationToken.None);
    }

    private static async Task SendSpeech(HttpContext context, WebSocket webSocket, byte[] ttsAudio)
    {
        const int chunkSize = 640;
        var chunkCount = 1;
        var offset = 0;
        var lastFullChunck = ttsAudio.Length < (offset + chunkSize);
        try
        {
            while (!lastFullChunck)
            {
                await webSocket.SendAsync(new ArraySegment<byte>(ttsAudio, offset, chunkSize), WebSocketMessageType.Binary, false, CancellationToken.None);
                offset = chunkSize * chunkCount;
                lastFullChunck = ttsAudio.Length < (offset + chunkSize);
                chunkCount++;
            }
            var lastMessageSize = ttsAudio.Length - offset;
            await webSocket.SendAsync(new ArraySegment<byte>(ttsAudio, offset, lastMessageSize), WebSocketMessageType.Binary, true, CancellationToken.None);
        }
        catch (Exception ex)
        {
        }
    }
Here's the exception that sometimes appears in the logs:
System.Net.WebSockets.WebSocketException (0x80004005): The remote party closed the WebSocket connection without completing the close handshake.
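For reference, the 640-byte figure in the update comes from simple arithmetic: 16,000 samples per second × 2 bytes per sample × 0.020 s = 640 bytes, assuming 16-bit mono linear PCM and 20 ms frames (which is my understanding of what the Nexmo WebSocket expects, not something confirmed here). Below is a minimal sketch of that calculation and of splitting a byte array into fixed-size frames. `PcmFraming`, `FrameSizeBytes` and `SplitIntoFrames` are hypothetical names made up for illustration, and padding the last frame with zeros (silence) is an assumption rather than a documented requirement.

```csharp
using System;
using System.Collections.Generic;

public static class PcmFraming
{
    // Bytes in one frame of mono linear PCM.
    // Defaults (16-bit samples, 20 ms frames) are assumptions about the expected format.
    public static int FrameSizeBytes(int sampleRateHz, int bitsPerSample = 16, int frameMs = 20)
    {
        return sampleRateHz * (bitsPerSample / 8) * frameMs / 1000;
    }

    // Splits raw PCM into frames of exactly frameSize bytes.
    // A short final frame is zero-padded, which is silence in signed 16-bit PCM.
    public static List<byte[]> SplitIntoFrames(byte[] pcm, int frameSize)
    {
        var frames = new List<byte[]>();
        for (var offset = 0; offset < pcm.Length; offset += frameSize)
        {
            var frame = new byte[frameSize]; // new arrays are zero-initialized
            var count = Math.Min(frameSize, pcm.Length - offset);
            Buffer.BlockCopy(pcm, offset, frame, 0, count);
            frames.Add(frame);
        }
        return frames;
    }
}
```

With frames built this way, one thing that could be tested is sending each frame as its own complete binary message (`endOfMessage: true`) rather than as fragments of one large message, as `SendSpeech` currently does. It may also be worth checking whether the text-to-speech output is raw PCM at the negotiated sample rate or a container format such as WAV with a header. Both points are guesses to verify, not confirmed causes of the hang-up.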