How to appropriately use the Azure.AI.OpenAI.OpenAIClient.GetChatCompletionsStreamingAsync Method

I am working on a web application that will serve as the help system for one of my company's existing products. One of the features I have implemented is a chatbot powered by an Azure OpenAI instance (using GPT-4). When a user types a prompt in the chat window, the prompt is sent to a cognitive search service, and the content returned by that service is bundled with the prompt so the LLM can use that context when responding.

Overall this works quite well, but there is a performance issue: a response can take upwards of 20 to 30 seconds. I know that OpenAI supports a streaming endpoint, so my plan was to use that so the chat would at least feel more responsive while the LLM is generating the response. For context, the application I am working on is a React web application with an ASP.NET Core backend, and I am using the pre-release Azure.AI.OpenAI C# library. Based on the references below, I decided to try the GetChatCompletionsStreamingAsync method on the OpenAI client. However, when using that method I am not observing any difference in response times compared to the non-streaming GetChatCompletionsAsync method. I would expect the streaming version of the API to return faster than the non-streaming one, because it should return an object that streams subsequent results. Am I misunderstanding the purpose of the streaming API, and/or am I using it incorrectly?

(I have seen this issue on multiple versions; the example code I provided most recently was running on 1.0.0-beta.5.)

To help illustrate the problem, I have created a .NET console application. Here is the Program.cs file:

// Program.cs
// See https://aka.ms/new-console-template for more information
using Azure.AI.OpenAI;
using OpenAiTest;

var _openAiPersonaPrompt = "You are Rick from Rick and Morty.";
var _openAiConsumer = new OpenAIConsumer();
var question = "Let's go on a five minute adventure";
await PerformSynchronousQuestion();
await PerformAsynchronousQuestion();


async Task PerformSynchronousQuestion()
{
    var messages = new List<ChatMessage>()
            {
                new ChatMessage(ChatRole.System, _openAiPersonaPrompt),
                new ChatMessage(ChatRole.User, question),
            };
    var startTime = DateTime.Now;
    Console.WriteLine($"#### Starting at: {startTime}####");

    var response = await _openAiConsumer.GenerateText(messages, false);
    var endTime = DateTime.Now;
    Console.WriteLine($"#### Ending at: {endTime}####");
    Console.WriteLine($"#### Duration: {endTime.Subtract(startTime)}");
    var completions = response.Value.Choices[0].Message.Content;
    Console.WriteLine(completions);
}

async Task PerformAsynchronousQuestion()
{
    var messages = new List<ChatMessage>()
            {
                new ChatMessage(ChatRole.System, _openAiPersonaPrompt),
                new ChatMessage(ChatRole.User, question),
            };
    var startTime = DateTime.Now;
    Console.WriteLine($"#### Starting at: {startTime}####");
    var response = await _openAiConsumer.GenerateTextStreaming(messages, false);

    var endTime = DateTime.Now;
    Console.WriteLine($"#### Ending at: {endTime}####");
    Console.WriteLine($"#### Duration: {endTime.Subtract(startTime)}");
    using var streamingChatCompletions = response.Value;
    await foreach (var choice in streamingChatCompletions.GetChoicesStreaming())
    {
        await foreach (var message in choice.GetMessageStreaming())
        {
            if (message.Content == null)
            {
                continue;
            }
             Console.Write(message.Content);
            await Task.Delay(TimeSpan.FromMilliseconds(200));
        }
    }
}


Here is the OpenAIConsumer wrapper I created. This was pulled out of the larger repo for the app I am working on, so it's unnecessary for this proof of concept, but I wanted to keep the separation in case that was the problem.

using Azure.AI.OpenAI;
using Azure;


namespace OpenAiTest
{
    public class OpenAIConsumer
    {
        // Add your own values here to test
        private readonly OpenAIClient _client;
        private readonly string baseOpenAiUrl = "";
        private readonly string openAiApiKey = "";
        private readonly string _model = "";
        public ChatCompletionsOptions Options { get; }

        public OpenAIConsumer()
        {
            var uri = new Uri(baseOpenAiUrl);
            var apiKey = new AzureKeyCredential(openAiApiKey);
            _client = new OpenAIClient(uri, apiKey);

            // Default set of options. We can add more configuration in the future if needed
            Options = new ChatCompletionsOptions()
            {
                MaxTokens = 1500,
                FrequencyPenalty = 0,
                PresencePenalty = 0,
            };


        }

        /// <summary>
        /// Helper function that initializes the messages for the chat completion options
        /// Note that this will clear any existing messages
        /// </summary>
        /// <param name="messages"></param>
        private void InitializeMessages(List<ChatMessage> messages)
        {
            Options.Messages.Clear();
            foreach (var chatMessage in messages)
            {
                Options.Messages.Add(chatMessage);
            }
        }

        /// <summary>
        /// Wrapper around the GetCompletions API from the OpenAI service
        /// </summary>
        /// <param name="messages">List of messages including the user's prompt</param>
        /// <returns>See GetChatCompletionsAsync on the OpenAIClient object</returns>
        public async Task<Response<ChatCompletions>> GenerateText(List<ChatMessage> messages, bool useAzureSearchAsDataSource)
        {
            InitializeMessages(messages);
            var result = await _client.GetChatCompletionsAsync(_model, Options);
            return result;
        }

        public async Task<Response<StreamingChatCompletions>> GenerateTextStreaming(List<ChatMessage> messages, bool useAzureSearchAsDataSource)
        {
            InitializeMessages(messages);
            var result = await _client.GetChatCompletionsStreamingAsync(_model, Options);
            return result;
        }
    }
}

From the code above, my expectation was that the call to _openAiConsumer.GenerateText would take longer to return than _openAiConsumer.GenerateTextStreaming. However, what I am noticing is that they effectively return at the same time; all the second one does is loop over the stream of responses, which is already complete by the time it is received.

Resources I have already used while investigating this problem:

Edit 10/10/23

I'm adding an excerpt here detailing what I'm observing that is causing confusion. My assumption is that GetChatCompletionsStreamingAsync should return faster than GetChatCompletionsAsync: the former returns an object (StreamingChatCompletions) which can be used to "stream" the response as it is completed by OpenAI, while the latter returns the actual full response from OpenAI and so should take longer. I wrote the following method to show what I'm observing instead:

public async Task CompareMethods(List<ChatMessage> messages)
{
    InitializeMessages(messages);
    var startTime = DateTime.Now;
    Console.WriteLine("### Starting Sync ###");
    await _client.GetChatCompletionsAsync(_model, Options);
    Console.WriteLine("### Ending Sync ###");
    var endTime = DateTime.Now;
    Console.WriteLine($"#### Duration: {endTime.Subtract(startTime)}");
    startTime = DateTime.Now;
    Console.WriteLine("### Starting Async ###");
    await _client.GetChatCompletionsStreamingAsync(_model, Options, CancellationToken.None);
    Console.WriteLine("### Ending Async ###");
    endTime = DateTime.Now;
    Console.WriteLine($"#### Duration: {endTime.Subtract(startTime)}");
}

So in the above function I am simply calling the two methods, assuming that the call to GetChatCompletionsAsync will take longer than the call to GetChatCompletionsStreamingAsync. However, it is not taking longer. Here's the output (obviously the times and relative differences change between runs, but I would expect the call to the streaming method to take very little time compared to the non-streaming one):

### Starting Sync ###
### Ending Sync ###
#### Duration: 00:00:16.6944412
### Starting Async ###
### Ending Async ###
#### Duration: 00:00:14.6443387
Beverly answered 9/10, 2023 at 20:16 Comment(0)

Both operations will ultimately take the same amount of time because they are doing the same work on the OpenAI side. The difference with the streaming method is that you receive response chunks as they become available. You are not misunderstanding the purpose of the streaming method, but you are not utilizing it correctly.
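To see the difference, measure the time until the first chunk arrives rather than the time of the overall call. Here is a rough sketch, assuming the same beta-era StreamingChatCompletions API your code uses ("gpt-4" stands in for your deployment name):

async Task MeasureStreamingAsync(OpenAIClient client, ChatCompletionsOptions options)
{
    var stopwatch = System.Diagnostics.Stopwatch.StartNew();
    TimeSpan? firstChunk = null;

    // "gpt-4" is a placeholder for your deployment name.
    Response<StreamingChatCompletions> response =
        await client.GetChatCompletionsStreamingAsync("gpt-4", options);
    using StreamingChatCompletions completions = response.Value;

    await foreach (var choice in completions.GetChoicesStreaming())
    {
        await foreach (var message in choice.GetMessageStreaming())
        {
            if (firstChunk is null && !string.IsNullOrEmpty(message.Content))
            {
                firstChunk = stopwatch.Elapsed; // latency until the first token arrives
            }
        }
    }

    Console.WriteLine($"First chunk: {firstChunk}, total: {stopwatch.Elapsed}");
}

With the non-streaming call the "first chunk" and the total are effectively the same; with the streaming call the first chunk should arrive much earlier than the total.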

If you want your React application to receive message parts as they become available, you will have to stream them to it just as the OpenAI API streams them to you. For this purpose you could either directly provide an HTTP GET endpoint returning content-type: text/event-stream or use a SignalR streaming hub method (see the sketch after the React example below).

This is a simple example of consuming the IAsyncEnumerable returned by completions.GetChoicesStreaming().

[HttpGet]
public async Task StreamTestAsync([FromQuery] string prompt)
{
    // Serve the response as Server-Sent Events.
    Response.Headers.Add("Content-Type", "text/event-stream");
    var writer = new StreamWriter(Response.Body);

    var messages = new List<ChatMessage>()
    {
        new ChatMessage(ChatRole.System, "You are a helpful assistant."),
        new ChatMessage(ChatRole.User, prompt),
    };

    var options = new ChatCompletionsOptions(messages)
    {
        MaxTokens = 1500,
        FrequencyPenalty = 0,
        PresencePenalty = 0,
    };

    try
    {
        var startTime = DateTime.Now;
        Console.WriteLine("### Starting Async ###");

        StreamingChatCompletions completions = await openAIClient.GetChatCompletionsStreamingAsync("gpt-4", options);

        Console.WriteLine("### Ending Async ###");
        Console.WriteLine($"#### Duration: {DateTime.Now.Subtract(startTime)}");

        var choice = await completions.GetChoicesStreaming().FirstAsync();

        // Forward each chunk to the browser as soon as it arrives.
        await foreach (var message in choice.GetMessageStreaming())
        {
            await writer.WriteAsync($"data: {message.Content}\n\n");
            await writer.FlushAsync();
        }
    }
    catch (Exception exception)
    {
        logger.LogError(exception, "Error while generating response.");
        await writer.WriteAsync("event: error\ndata: error\n\n");
    }
    finally
    {
        await writer.FlushAsync();
    }
}

The line 'await openAIClient.GetChatCompletionsStreamingAsync(...)' completes in less than a second (~500 ms) for me, while most of the work happens in the await foreach loop, which can take 10-20 seconds if you prompt a long reply.

On the React side of things, a function such as this one should get you started:

// assuming the event-source-polyfill package
import { EventSourcePolyfill } from "event-source-polyfill";

function createEventSourceConnection(prompt: string) {
    const eventSource = new EventSourcePolyfill(`yourAPI/promptstream?prompt=${encodeURIComponent(prompt)}`);

    eventSource.onopen = _ => console.log("EventSource opened.");

    eventSource.onmessage = event => {
        // do something with the event data
    };

    eventSource.onerror = error => {
        console.error("EventSource closed with error.", error);
    };
}
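
If you would rather use SignalR than SSE, the equivalent is a streaming hub method that returns an IAsyncEnumerable<string>. This is only a rough sketch assuming the same beta-era API as above; ChatHub, StreamPrompt, and the "gpt-4" deployment name are placeholders for your own names:

using Azure.AI.OpenAI;
using Microsoft.AspNetCore.SignalR;
using System.Runtime.CompilerServices;

public class ChatHub : Hub
{
    private readonly OpenAIClient _openAIClient;

    public ChatHub(OpenAIClient openAIClient) => _openAIClient = openAIClient;

    // Streaming hub method: the connected client receives each chunk as soon as it is yielded.
    public async IAsyncEnumerable<string> StreamPrompt(
        string prompt,
        [EnumeratorCancellation] CancellationToken cancellationToken)
    {
        var options = new ChatCompletionsOptions(new List<ChatMessage>
        {
            new ChatMessage(ChatRole.System, "You are a helpful assistant."),
            new ChatMessage(ChatRole.User, prompt),
        })
        {
            MaxTokens = 1500,
        };

        StreamingChatCompletions completions =
            await _openAIClient.GetChatCompletionsStreamingAsync("gpt-4", options, cancellationToken);

        await foreach (var choice in completions.GetChoicesStreaming())
        {
            await foreach (var message in choice.GetMessageStreaming())
            {
                if (!string.IsNullOrEmpty(message.Content))
                {
                    yield return message.Content;
                }
            }
        }
    }
}

On the client, the @microsoft/signalr package's connection.stream("StreamPrompt", prompt).subscribe({ next, complete, error }) then hands you the chunks one by one.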
Risinger answered 10/10, 2023 at 15:49 Comment(8)
Thanks for your response! I should have clarified, I am not expecting the 'GetChatCompletionsStreamingAsync' to return the overall message faster. However, I would expect it to return the "Streamable" object faster and I could use that object to stream the content to my frontend. I actually already have that part implemented and it seems to be working (I'm using websockets). The part that is confusing me is that GetChatCompletionsStreamingAsync takes about the same time to return as GetChatCompletionsAsync. I tried using your code and it works, but I'm noticing the same behavior :(Beverly
Mhh, something's still not quite right then; you should be able to start getting response parts quite quickly after you start iterating over the IAsyncEnumerable<> returned by _client.GetChatCompletionsStreamingAsync()Risinger
That is what I had assumed, I am probably just doing something wrong. Specifically I would assume that GetChatCompletionsStreamingAsync returns* faster than GetChatCompletionsAsync. * When I say returns faster I mean that the StreamingChatCompletions object should return faster than the ChatCompletions response. I'm going to add a code excerpt to my initial post to call out what I'm seeing. Is there any chance this could be due to configuration for the service itself in Azure / Open AI studio?Beverly
I added the update to my original post that hopefully details what I'm observing in a more isolated fashion.Beverly
That's strange. I have also updated and simplified my reply. If I consume this controller endpoint with Postman, for example, I can watch the message parts arriving from second 1 until the end of the connection. I would also suggest you update the NuGet package to the latest prerelease version available before testing further.Risinger
Thanks, I updated to the latest pre-release (1.0.0-beta.8) and it still is not working. Out of curiosity, is the OpenAI service you're using hosted on Azure? I am hoping not, because it seems like this might be an issue with how Azure handles content filtering (1, 2)Beverly
I am using an Azure-hosted instance of the OpenAI API, yes. But we were in contact with Microsoft to somehow disable/adjust the content filter because it was causing other issues. AFAIK streaming worked before and after disabling the filter, but it has been months already.Risinger
For some reason they made Messages in the ChatCompletionsOptions class readonly?Titoism

Microsoft's own example for GetChatCompletionsStreaming:

using Azure;
using Azure.AI.OpenAI;
using static System.Environment;

string endpoint = GetEnvironmentVariable("AZURE_OPENAI_ENDPOINT");
string key = GetEnvironmentVariable("AZURE_OPENAI_API_KEY");

OpenAIClient client = new(new Uri(endpoint), new AzureKeyCredential(key));

var chatCompletionsOptions = new ChatCompletionsOptions()
{
    DeploymentName = "gpt-35-turbo", // This must match the custom deployment name you chose for your model
    Messages =
    {
        new ChatRequestSystemMessage("You are a helpful assistant."),
        new ChatRequestUserMessage("Does Azure OpenAI support customer managed keys?"),
        new ChatRequestAssistantMessage("Yes, customer managed keys are supported by Azure OpenAI."),
        new ChatRequestUserMessage("Do other Azure AI services support this too?"),
    },
    MaxTokens = 100
};

await foreach (StreamingChatCompletionsUpdate chatUpdate in client.GetChatCompletionsStreaming(chatCompletionsOptions))
{
    if (chatUpdate.Role.HasValue)
    {
        Console.Write($"{chatUpdate.Role.Value.ToString().ToUpperInvariant()}: ");
    }
    if (!string.IsNullOrEmpty(chatUpdate.ContentUpdate))
    {
        Console.Write(chatUpdate.ContentUpdate);
    }
}

You can then use GetChatCompletionsStreamingAsync like this:

await foreach (StreamingChatCompletionsUpdate chatUpdate in await client.GetChatCompletionsStreamingAsync(chatCompletionsOptions))
{
    if (chatUpdate.Role.HasValue)
    {
        Console.Write($"{chatUpdate.Role.Value.ToString().ToUpperInvariant()}: ");
    }
    if (!string.IsNullOrEmpty(chatUpdate.ContentUpdate))
    {
        Console.Write(chatUpdate.ContentUpdate);
    }
}
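
To check that the streaming call itself returns quickly with this newer API, you can time the call separately from the enumeration. A rough sketch reusing the client and chatCompletionsOptions from above:

var stopwatch = System.Diagnostics.Stopwatch.StartNew();

// The call should return almost immediately; the content arrives while enumerating.
var streamingResponse = await client.GetChatCompletionsStreamingAsync(chatCompletionsOptions);
Console.WriteLine($"Call returned after: {stopwatch.Elapsed}");

await foreach (StreamingChatCompletionsUpdate chatUpdate in streamingResponse)
{
    if (!string.IsNullOrEmpty(chatUpdate.ContentUpdate))
    {
        Console.Write(chatUpdate.ContentUpdate);
    }
}

Console.WriteLine();
Console.WriteLine($"Full response after: {stopwatch.Elapsed}");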

https://learn.microsoft.com/en-us/azure/ai-services/openai/chatgpt-quickstart?pivots=programming-language-csharp&tabs=command-line%2Cpython-new#async-with-streaming

https://github.com/Azure/azure-sdk-for-net/blob/main/sdk/openai/Azure.AI.OpenAI/tests/Samples/StreamingChat.cs

Westphal answered 23/4 at 12:12 Comment(0)
