I am working on a web application that will serve as the help system for one of my company's existing products. One of the features I have implemented is a chatbot powered by an Azure OpenAI instance (using GPT-4). When a user types a prompt in the chat window, the prompt is sent to a Cognitive Search service, and the content returned by that service is bundled with the prompt so the LLM can use that context when responding.
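For reference, the retrieval step works roughly like the sketch below; the search endpoint, index name, field name, and key are simplified placeholders rather than the app's real values:
// Simplified illustration of the retrieval step; endpoint, index, field, and key are placeholders.
using Azure;
using Azure.AI.OpenAI;
using Azure.Search.Documents;
using Azure.Search.Documents.Models;
using System.Text;

var userPrompt = "How do I reset my password?";

var searchClient = new SearchClient(
    new Uri("https://<search-service>.search.windows.net"),
    "<help-content-index>",
    new AzureKeyCredential("<search-api-key>"));

// Retrieve help content relevant to the user's prompt.
Response<SearchResults<SearchDocument>> searchResults =
    await searchClient.SearchAsync<SearchDocument>(userPrompt);

var context = new StringBuilder();
await foreach (SearchResult<SearchDocument> result in searchResults.Value.GetResultsAsync())
{
    context.AppendLine(result.Document["content"]?.ToString());
}

// Bundle the retrieved content with the prompt so the LLM can use it as context.
var messages = new List<ChatMessage>
{
    new ChatMessage(ChatRole.System, "You are a help assistant for our product."),
    new ChatMessage(ChatRole.System, $"Answer using this documentation:\n{context}"),
    new ChatMessage(ChatRole.User, userPrompt),
};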
Overall this works quite well, but there is a performance issue: responses can take upwards of 20 to 30 seconds. I know that OpenAI supports a streaming endpoint, so my plan was to use it so that the chat at least feels more responsive while the LLM is generating the response. For context, the application I am working on is a React web application with an ASP.NET Core backend, and I am using the pre-release Azure.AI.OpenAI C# library. Based on the references below, I decided to try the GetChatCompletionsStreamingAsync method on the OpenAI client. However, when using that method I am not observing any difference in response times compared to the non-streaming GetChatCompletionsAsync method. I would expect the streaming version of the API to return faster than the non-streaming one, because it should be returning an object that streams subsequent results. Am I misunderstanding the purpose of the streaming API, and/or am I using it incorrectly?
(I have seen this issue on multiple versions; the example code I provided most recently was running against 1.0.0-beta.5.)
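Once this works in isolation, the plan is to forward the stream from the ASP.NET Core backend to the React client roughly like the sketch below; the route, DI wiring, endpoint, key, and deployment name are simplified placeholders, not the real application code:
// Rough sketch of the ASP.NET Core endpoint that would relay the stream to the browser.
// The route, wiring, endpoint, key, and deployment name are placeholders.
using Azure;
using Azure.AI.OpenAI;

var builder = WebApplication.CreateBuilder(args);
var app = builder.Build();

app.MapPost("/api/chat", async (HttpContext http, ChatRequest request) =>
{
    var client = new OpenAIClient(
        new Uri("https://<resource>.openai.azure.com/"),
        new AzureKeyCredential("<api-key>"));

    var options = new ChatCompletionsOptions
    {
        Messages =
        {
            new ChatMessage(ChatRole.System, "You are a help assistant."),
            new ChatMessage(ChatRole.User, request.Prompt),
        },
    };

    http.Response.ContentType = "text/plain; charset=utf-8";

    // Write each chunk to the response as soon as it arrives so the client can
    // render partial output instead of waiting for the full completion.
    using StreamingChatCompletions streaming =
        (await client.GetChatCompletionsStreamingAsync("<deployment-name>", options)).Value;

    await foreach (var choice in streaming.GetChoicesStreaming())
    {
        await foreach (var message in choice.GetMessageStreaming())
        {
            if (string.IsNullOrEmpty(message.Content)) continue;
            await http.Response.WriteAsync(message.Content);
            await http.Response.Body.FlushAsync();
        }
    }
});

app.Run();

record ChatRequest(string Prompt);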
To help illustrate this problem I have created a .NET Console Application. Here is the Program.cs file:
// Program.cs
// See https://aka.ms/new-console-template for more information
using Azure.AI.OpenAI;
using OpenAiTest;

var _openAiPersonaPrompt = "You are Rick from Rick and Morty.";
var _openAiConsumer = new OpenAIConsumer();
var question = "Let's go on a five minute adventure";

await PerformSynchronousQuestion();
await PerformAsynchronousQuestion();

async Task PerformSynchronousQuestion()
{
    var messages = new List<ChatMessage>()
    {
        new ChatMessage(ChatRole.System, _openAiPersonaPrompt),
        new ChatMessage(ChatRole.User, question),
    };

    var startTime = DateTime.Now;
    Console.WriteLine($"#### Starting at: {startTime}####");

    var response = await _openAiConsumer.GenerateText(messages, false);

    var endTime = DateTime.Now;
    Console.WriteLine($"#### Ending at: {endTime}####");
    Console.WriteLine($"#### Duration: {endTime.Subtract(startTime)}");

    var completions = response.Value.Choices[0].Message.Content;
    Console.WriteLine(completions);
}

async Task PerformAsynchronousQuestion()
{
    var messages = new List<ChatMessage>()
    {
        new ChatMessage(ChatRole.System, _openAiPersonaPrompt),
        new ChatMessage(ChatRole.User, question),
    };

    var startTime = DateTime.Now;
    Console.WriteLine($"#### Starting at: {startTime}####");

    var response = await _openAiConsumer.GenerateTextStreaming(messages, false);

    var endTime = DateTime.Now;
    Console.WriteLine($"#### Ending at: {endTime}####");
    Console.WriteLine($"#### Duration: {endTime.Subtract(startTime)}");

    using var streamingChatCompletions = response.Value;
    await foreach (var choice in streamingChatCompletions.GetChoicesStreaming())
    {
        await foreach (var message in choice.GetMessageStreaming())
        {
            if (message.Content == null)
            {
                continue;
            }

            Console.Write(message.Content);
            // Small delay purely so the streamed output is easier to watch in the console.
            await Task.Delay(TimeSpan.FromMilliseconds(200));
        }
    }
}
Here is the OpenAIConsumer wrapper I created. It was pulled out of the larger repo for the app I am working on, so it is unnecessary for this proof of concept, but I wanted to keep the separation in case that was the problem.
using Azure;
using Azure.AI.OpenAI;

namespace OpenAiTest
{
    public class OpenAIConsumer
    {
        // Add your own values here to test
        private readonly OpenAIClient _client;
        private readonly string baseOpenAiUrl = "";
        private readonly string openAiApiKey = "";
        private readonly string _model = "";

        public ChatCompletionsOptions Options { get; }

        public OpenAIConsumer()
        {
            var uri = new Uri(baseOpenAiUrl);
            var apiKey = new AzureKeyCredential(openAiApiKey);
            _client = new OpenAIClient(uri, apiKey);

            // Default set of options. We can add more configuration in the future if needed.
            Options = new ChatCompletionsOptions()
            {
                MaxTokens = 1500,
                FrequencyPenalty = 0,
                PresencePenalty = 0,
            };
        }

        /// <summary>
        /// Helper function that initializes the messages for the chat completion options.
        /// Note that this will clear any existing messages.
        /// </summary>
        /// <param name="messages"></param>
        private void InitializeMessages(List<ChatMessage> messages)
        {
            Options.Messages.Clear();
            foreach (var chatMessage in messages)
            {
                Options.Messages.Add(chatMessage);
            }
        }

        /// <summary>
        /// Wrapper around the GetChatCompletionsAsync API from the OpenAI service.
        /// </summary>
        /// <param name="messages">List of messages including the user's prompt</param>
        /// <returns>See GetChatCompletionsAsync on the OpenAIClient object</returns>
        public async Task<Response<ChatCompletions>> GenerateText(List<ChatMessage> messages, bool useAzureSearchAsDataSource)
        {
            InitializeMessages(messages);
            var result = await _client.GetChatCompletionsAsync(_model, Options);
            return result;
        }

        /// <summary>
        /// Wrapper around the GetChatCompletionsStreamingAsync API from the OpenAI service.
        /// </summary>
        public async Task<Response<StreamingChatCompletions>> GenerateTextStreaming(List<ChatMessage> messages, bool useAzureSearchAsDataSource)
        {
            InitializeMessages(messages);
            var result = await _client.GetChatCompletionsStreamingAsync(_model, Options);
            return result;
        }
    }
}
From the code above, my expectation was that the call to _openAiConsumer.GenerateText would take longer to return than _openAiConsumer.GenerateTextStreaming. However, what I am noticing is that they effectively return at the same time; the second one just loops over the stream of responses, but the stream is already complete by the time it is received.
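To show what I mean by "already complete": if the stream were actually arriving incrementally, I would expect the first and last chunks to be separated by several seconds. This is the kind of measurement I have in mind, sketched against my existing wrapper and without the artificial Task.Delay:
// Sketch: timestamp the streaming call, the first chunk, and the last chunk.
var messages = new List<ChatMessage>
{
    new ChatMessage(ChatRole.System, _openAiPersonaPrompt),
    new ChatMessage(ChatRole.User, question),
};

var start = DateTime.Now;
var response = await _openAiConsumer.GenerateTextStreaming(messages, false);
Console.WriteLine($"Call returned after: {DateTime.Now - start}");

using var streaming = response.Value;
DateTime? firstChunk = null;
DateTime lastChunk = DateTime.Now;

await foreach (var choice in streaming.GetChoicesStreaming())
{
    await foreach (var message in choice.GetMessageStreaming())
    {
        if (string.IsNullOrEmpty(message.Content)) continue;
        firstChunk ??= DateTime.Now;
        lastChunk = DateTime.Now;
    }
}

Console.WriteLine($"First chunk after: {firstChunk - start}");
Console.WriteLine($"Last chunk after:  {lastChunk - start}");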
Resources I have already used while investigating this problem:
- Microsoft documentation on how to use response streaming. Note: this article is about using your own Azure Cognitive Search index. My example code isn't doing that, but when I was, I noticed the same behavior, so I assume what I am experiencing is a general issue with how I'm using the streaming code.
Edit 10/10/23
I'm adding an excerpt here detailing the behavior that is causing confusion. My assumption is that GetChatCompletionsStreamingAsync should return faster than GetChatCompletionsAsync: the former returns an object (StreamingChatCompletions) that can be used to "stream" the response as OpenAI completes it, whereas the latter should take longer because it returns the actual full response from OpenAI. I wrote the following method to show what I'm observing:
public async Task CompareMethods(List<ChatMessage> messages)
{
    InitializeMessages(messages);

    var startTime = DateTime.Now;
    Console.WriteLine("### Starting Sync ###");
    await _client.GetChatCompletionsAsync(_model, Options);
    Console.WriteLine("### Ending Sync ###");
    var endTime = DateTime.Now;
    Console.WriteLine($"#### Duration: {endTime.Subtract(startTime)}");

    startTime = DateTime.Now;
    Console.WriteLine("### Starting Async ###");
    await _client.GetChatCompletionsStreamingAsync(_model, Options, CancellationToken.None);
    Console.WriteLine("### Ending Async ###");
    endTime = DateTime.Now;
    Console.WriteLine($"#### Duration: {endTime.Subtract(startTime)}");
}
So in the above function I am simply calling the two methods, assuming that the call to GetChatCompletionsAsync will take longer than the call to GetChatCompletionsStreamingAsync. However, it is not taking longer. Here's the output (obviously the times and relative differences vary between runs, but I would expect the call to the streaming method to take very little time compared to the non-streaming one):
### Starting Sync ###
### Ending Sync ###
#### Duration: 00:00:16.6944412
### Starting Async ###
### Ending Async ###
#### Duration: 00:00:14.6443387