How to get token usage for each OpenAI ChatCompletion API call in streaming mode?

According to OpenAI's documentation, https://platform.openai.com/docs/guides/chat/chat-vs-completions, you should get token usage from the response. However, I am currently making the API call with stream set to True, and the response doesn't seem to contain a usage property.

So how can I get the token usage in this case?

Oldenburg answered 23/3, 2023 at 15:14 Comment(3)
can you add the code you use to call the api?Annisannissa
Heya, were you finally able to get it? I am struggling to get it in streaming mode as well (but using Node, not Python).Hallett
I'm also facing the same issue; one difference is that I'm using Azure OpenAI, and I'm not an end consumer using the SDK - I'm more like a platform team enabling this for a large team, who are the actual consumers.Selemas

OpenAI finally added this feature to streaming. Add the stream_options: {"include_usage": true} parameter to the chat completions request.

See: https://community.openai.com/t/usage-stats-now-available-when-using-streaming-with-the-chat-completions-api-or-completions-api/738156/3

To use this, update the openai package to openai >= 1.26.0.
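
A minimal sketch with the current Python SDK (the model name here is just a placeholder): with include_usage enabled, the server sends one extra final chunk whose choices list is empty and whose usage field is populated.

from openai import OpenAI  # requires openai >= 1.26.0

client = OpenAI()

stream = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model
    messages=[{"role": "user", "content": "Say hello"}],
    stream=True,
    stream_options={"include_usage": True},
)

for chunk in stream:
    if chunk.choices:
        print(chunk.choices[0].delta.content or "", end="")
    if chunk.usage is not None:  # only set on the final usage chunk
        print("\nprompt:", chunk.usage.prompt_tokens,
              "completion:", chunk.usage.completion_tokens,
              "total:", chunk.usage.total_tokens)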

Unrobe answered 16/5 at 1:14 Comment(3)
Do you have any idea how to use this option with AzureOpenAI, which is also from the openai package?Tiemroth
client = AzureOpenAI(azure_endpoint="AZURE_OPENAI_ENDPOINT", api_key="AZURE_OPENAI_API_KEY") response = client.chat.completions.create(model="gpt-3.5-turbo", messages=[], stream=True, stream_options={"include_usage": True})Unrobe
Unfortunately, I tried this but the usage is always None, even in the last message.Tiemroth

You can use tiktoken to count the tokens yourself:

pip install tiktoken

import tiktoken

def num_tokens_from_messages(messages, model="gpt-3.5-turbo-0301"):
    """Returns the number of tokens used by a list of messages."""
    try:
        encoding = tiktoken.encoding_for_model(model)
    except KeyError:
        print("Warning: model not found. Using cl100k_base encoding.")
        encoding = tiktoken.get_encoding("cl100k_base")
    if model == "gpt-3.5-turbo":
        print("Warning: gpt-3.5-turbo may change over time. Returning num tokens assuming gpt-3.5-turbo-0301.")
        return num_tokens_from_messages(messages, model="gpt-3.5-turbo-0301")
    elif model == "gpt-4":
        print("Warning: gpt-4 may change over time. Returning num tokens assuming gpt-4-0314.")
        return num_tokens_from_messages(messages, model="gpt-4-0314")
    elif model == "gpt-3.5-turbo-0301":
        tokens_per_message = 4  # every message follows <|start|>{role/name}\n{content}<|end|>\n
        tokens_per_name = -1  # if there's a name, the role is omitted
    elif model == "gpt-4-0314":
        tokens_per_message = 3
        tokens_per_name = 1
    else:
        raise NotImplementedError(f"""num_tokens_from_messages() is not implemented for model {model}. See https://github.com/openai/openai-python/blob/main/chatml.md for information on how messages are converted to tokens.""")
    num_tokens = 0

    if isinstance(messages, list):
        for message in messages:
            num_tokens += tokens_per_message
            for key, value in message.items():
                num_tokens += len(encoding.encode(value))
                if key == "name":
                    num_tokens += tokens_per_name
        num_tokens += 3  # every reply is primed with <|start|>assistant<|message|>
    elif isinstance(messages, str):
        num_tokens += len(encoding.encode(messages))
    return num_tokens
import openai

result = []

for chunk in openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who won the world series in 2020?"},
        {"role": "assistant", "content": "The Los Angeles Dodgers won the World Series in 2020."},
        {"role": "user", "content": "Where was it played?"}
    ],  # these are the prompt messages, e.g. prompt_tokens = num_tokens_from_messages(messages)
    stream=True
):
    content = chunk["choices"][0].get("delta", {}).get("content")
    if content:
        result.append(content)


# Usage of completion_tokens
completion_tokens = num_tokens_from_messages("".join(result))
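
The prompt side can be counted with the same helper, as the inline comment above suggests; a short sketch (assumes the request messages are also bound to a messages variable):

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Who won the world series in 2020?"},
    {"role": "assistant", "content": "The Los Angeles Dodgers won the World Series in 2020."},
    {"role": "user", "content": "Where was it played?"},
]
prompt_tokens = num_tokens_from_messages(messages)
total_tokens = prompt_tokens + completion_tokens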
Weekday answered 18/4, 2023 at 11:33 Comment(0)

It is possible to count the prompt_tokens and completion_tokens manually and add them up to get the total usage count.

Measuring prompt_tokens:

Using any tokenizer, it is possible to count the prompt_tokens from the request body.

Measuring the completion_tokens:

You need an intermediate service (a proxy) that can pass the SSE (server-sent events) on to the client applications after counting the tokens in each response.

A sample architecture is present here: https://medium.com/microsoftazure/when-invoking-apis-hosted-by-azure-api-management-configured-azure-openai-service-as-a-backend-bd8f2648cfa5
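
The counting step itself can stay simple; here is a minimal Python sketch (the function name and model default are illustrative, not part of the linked architecture) of what such a proxy could do with a buffered SSE response body:

import json
import tiktoken

def count_completion_tokens(sse_body: str, model: str = "gpt-3.5-turbo") -> int:
    """Count completion tokens in a captured streaming (SSE) response body."""
    try:
        encoding = tiktoken.encoding_for_model(model)
    except KeyError:
        encoding = tiktoken.get_encoding("cl100k_base")
    pieces = []
    # Each SSE event looks like: data: {...chunk JSON...}, terminated by data: [DONE]
    for event in sse_body.split("data:"):
        event = event.strip()
        if not event or event == "[DONE]":
            continue
        chunk = json.loads(event)
        for choice in chunk.get("choices", []):
            content = choice.get("delta", {}).get("content")
            if content:
                pieces.append(content)
    return len(encoding.encode("".join(pieces)))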

Reuven answered 9/1 at 9:30 Comment(0)

All the provided answers share the same core idea: we have to use some kind of proxy to handle the token calculation ourselves (e.g. with tiktoken). Here, with my answer, I would like to share an implementation for Azure OpenAI.

The logic here is to build a reverse proxy (using YARP, Microsoft's reverse proxy). You can find the full project at Enterprise-azureai-proxy.

Below is the main part of the solution, which handles the stream:

using AsyncAwaitBestPractices;
using Azure.Core;
using AzureAI.Proxy.Models;
using AzureAI.Proxy.OpenAIHandlers;
using AzureAI.Proxy.Services;
using System.Text;
using System.Text.Json;
using System.Text.Json.Nodes;
using Yarp.ReverseProxy.Transforms;
using Yarp.ReverseProxy.Transforms.Builder;

namespace AzureAI.Proxy.ReverseProxy;

internal class OpenAIChargebackTransformProvider : ITransformProvider
{
   
    private readonly IConfiguration _config;
    private readonly IManagedIdentityService _managedIdentityService;
    private readonly ILogIngestionService _logIngestionService;
   
    private string accessToken = "";

    private TokenCredential _managedIdentityCredential;

    public OpenAIChargebackTransformProvider(
        IConfiguration config, 
        IManagedIdentityService managedIdentityService,
        ILogIngestionService logIngestionService)
    {
        _config = config;
        _managedIdentityService = managedIdentityService;
        _logIngestionService = logIngestionService;
               
        _managedIdentityCredential = _managedIdentityService.GetTokenCredential();

    }

    public void ValidateRoute(TransformRouteValidationContext context) { return; }

    public void ValidateCluster(TransformClusterValidationContext context) { return; }
    
    public void Apply(TransformBuilderContext context)
    {
        context.AddRequestTransform(async requestContext => {
            //enabling buffering allows us to read the request body twice (once for forwarding, once for analysis)
            requestContext.HttpContext.Request.EnableBuffering();

            //check accessToken before replacing the Auth Header
            if (String.IsNullOrEmpty(accessToken) || OpenAIAccessToken.IsTokenExpired(accessToken, _config["EntraId:TenantId"]))
            {
                accessToken = await OpenAIAccessToken.GetAccessTokenAsync(_managedIdentityCredential, CancellationToken.None);
            }

            //replace the auth header with the access token of the proxy's managed identity
            requestContext.ProxyRequest.Headers.Remove("api-key");
            requestContext.ProxyRequest.Headers.Remove("Authorization");
            requestContext.ProxyRequest.Headers.Add("Authorization", $"Bearer {accessToken}");

        });
        context.AddResponseTransform(async responseContext =>
        {
            var originalStream = await responseContext.ProxyResponse.Content.ReadAsStreamAsync();
            string capturedBody = "";

            // Buffer for reading chunks
            byte[] buffer = new byte[8192];
            int bytesRead;

            // Read, inspect, and write the data in chunks - this is especially needed for streaming content
            while ((bytesRead = await originalStream.ReadAsync(buffer, 0, buffer.Length)) > 0)
            {
                // Convert the chunk to a string for inspection
                var chunk = Encoding.UTF8.GetString(buffer, 0, bytesRead);

                capturedBody += chunk;

                // Write the unmodified chunk back to the response
                await responseContext.HttpContext.Response.Body.WriteAsync(buffer, 0, bytesRead);
            }

            //flush any remaining content to the client
            await responseContext.HttpContext.Response.CompleteAsync();

            //now perform the analysis and create a log record
            var record = new LogAnalyticsRecord();
            record.TimeGenerated = DateTime.UtcNow;
            
            if (responseContext.HttpContext.Request.Headers["X-Consumer"].ToString() != "")
            {
                record.Consumer = responseContext.HttpContext.Request.Headers["X-Consumer"].ToString();
            }
            else
            {
                record.Consumer = "Unknown Consumer";
            }
           
            bool firstChunck = true;
            var chunks = capturedBody.Split("data:");
            foreach (var chunk in chunks)
            {
                var trimmedChunck = chunk.Trim();
                if (trimmedChunck != "" && trimmedChunck != "[DONE]")
                {

                    JsonNode jsonNode = JsonSerializer.Deserialize<JsonNode>(trimmedChunck);
                    if (jsonNode["error"] is not null)
                    {
                        Error.Handle(jsonNode);
                    }
                    else
                    {
                        string objectValue = jsonNode["object"].ToString();

                        switch (objectValue)
                        {
                            case "chat.completion":
                                Usage.Handle(jsonNode, ref record);
                                record.ObjectType = objectValue;
                                break;
                            case "chat.completion.chunk":
                                if (firstChunck)
                                {
                                    record = Tokens.CalculateChatInputTokens(responseContext.HttpContext.Request, record);
                                    record.ObjectType = objectValue;
                                    firstChunck = false;
                                }
                                ChatCompletionChunck.Handle(jsonNode, ref record);
                                break;
                            case "list":
                                if (jsonNode["data"][0]["object"].ToString() == "embedding")
                                {
                                    record.ObjectType = jsonNode["data"][0]["object"].ToString();
                                    //it's an embedding
                                    Usage.Handle(jsonNode, ref record);
                                }
                                break;
                            default:
                                break;
                        }
                    }
                }

            }

            record.TotalTokens = record.InputTokens + record.OutputTokens;
            _logIngestionService.LogAsync(record).SafeFireAndForget();
        });
    }
}
Selemas answered 15/1 at 5:8 Comment(0)

You can retrieve the total number of tokens from the response by checking response.usage.total_tokens (shown here for an embeddings call).

Example:

response = openai_client.embeddings.create(model= "text-embedding-3-large", input="test text", encoding_format="float")
if response.data:
    embedding = response.data[0].embedding
    
    total_tokens = response.usage.total_tokens
    print ("Total tokens: ", total_tokens)

To get the token count before embedding, use tiktoken:

import tiktoken

def get_number_of_tokens(string: str) -> int:
    encoding = tiktoken.encoding_for_model("text-embedding-3-large")
    num_tokens = len(encoding.encode(string))
    return num_tokens

total_token = get_number_of_tokens('test text')
print(total_token)
Sort answered 1/4 at 10:27 Comment(0)

I finally found a solution I'm happy with after hours of scouring documentation, so hopefully this helps someone out. If you find a mismatch between stream/normal usage counts, please let me know.

Unfortunately, they do not give an option to query for usage information by ID, or even to just return usage somehow; that would have been the easier solution. Instead, here's my implementation. It involves:

  • Counting tokens for images with the new gpt-4-turbo/vision models
  • Handling the scuffed and varied additional tokens that OpenAI's API adds in
  • Wrapping the returned Stream generator, appending any tokens to a list before yielding, and finally processing the list as the output message

Below is my implementation of the CountStreamTokens class (the types are slightly scuffed, and I didn't include them in the SO code, but they are in my actual project if you need all the types).

Implementation in my project for reference; check the chain.py functions: https://github.com/flatypus/flowchat/blob/main/flowchat/private/_private_helpers.py

Code:

from io import BytesIO
from math import ceil
from PIL import Image
from requests import get
from typing import Any, Callable, Dict, List
import base64
import tiktoken

# Message and StreamChatCompletion are type aliases defined in my full project (see the flowchat link above)

class CalculateImageTokens:
    def __init__(self, image: str):
        self.image = image

    def _get_image_dimensions(self):
        if self.image.startswith("data:image"):
            image = self.image.split(",")[1]
            image = base64.b64decode(image)
            image = Image.open(BytesIO(image))
            return image.size
        else:
            response = get(self.image)
            image = Image.open(BytesIO(response.content))
            return image.size

    def _openai_resize(self, width: int, height: int):
        if width > 1024 or height > 1024:
            if width > height:
                height = int(height * 1024 / width)
                width = 1024
            else:
                width = int(width * 1024 / height)
                height = 1024
        return width, height

    def count_image_tokens(self):
        width, height = self._get_image_dimensions()
        width, height = self._openai_resize(width, height)
        h = ceil(height / 512)
        w = ceil(width / 512)
        total = 85 + 170 * h * w
        return total


class CountStreamTokens:
    def __init__(self, model: str, messages: List[Message]):
        self.collect_tokens: List[str] = []
        self.messages = messages
        self.model = model
        self._get_model(model)
        self.tokens_per_message = 3
        self.tokens_per_name = 1

    def _get_model(self, model: str):
        """Picks the right model and sets the additional tokens. See https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb"""
        try:
            self.encoding = tiktoken.encoding_for_model(model)
        except KeyError:
            self.encoding = tiktoken.get_encoding("cl100k_base")

        if model in {
            "gpt-3.5-turbo-0613",
            "gpt-3.5-turbo-16k-0613",
            "gpt-4-0314",
            "gpt-4-32k-0314",
            "gpt-4-0613",
            "gpt-4-32k-0613",
        }:
            self.tokens_per_message = 3
            self.tokens_per_name = 1

        elif model == "gpt-3.5-turbo-0301":
            # every message follows <|start|>{role/name}\n{content}<|end|>\n
            self.tokens_per_message = 4
            self.tokens_per_name = -1  # if there's a name, the role is omitted
        elif "gpt-3.5-turbo" in model:
            self._get_model("gpt-3.5-turbo-0613")
        elif "gpt-4" in model:
            self._get_model("gpt-4-0613")

    def _count_text_tokens(self, message: Message) -> int:
        """Return the number of tokens used by a list of messages. See above link for context"""
        num_tokens = self.tokens_per_message
        for key, value in message.items():
            num_tokens += len(self.encoding.encode(str(value)))
            if key == "name":
                num_tokens += self.tokens_per_name

        return num_tokens

    def _count_input_tokens(self):
        tokens = 0
        text_messages: List[Message] = []
        image_messages: List[Dict[str, Any]] = []

        for message in self.messages:
            content = message["content"]
            role = message["role"]
            if isinstance(content, str):
                text_messages.append({"role": role, "content": content})
            else:
                for item in content:
                    if item["type"] == "text":
                        text_messages.append(
                            {"role": role, "content": item["text"]})
                    else:
                        image_messages.append(item)

        for message in text_messages:
            tokens += self._count_text_tokens(message)

        for message in image_messages:
            image = message["image_url"]
            detail = image.get("detail", "high")
            if detail == "low":
                tokens += 85
            else:
                tokens += (
                    CalculateImageTokens(message["image_url"]["url"])
                    .count_image_tokens()
                )

        tokens += 3  # every reply is primed with <|start|>assistant<|message|>

        return tokens

    def _count_output_tokens(self, message: str):
        return len(self.encoding.encode(message))

    def wrap_stream_and_count(self, generator: StreamChatCompletion, callback: Callable[[int, int, str], None]):
        for response in generator:
            content = response.choices[0].delta.content
            yield response

            if content is None:
                output_message = "".join(self.collect_tokens)
                prompt_tokens = self._count_input_tokens()
                completion_tokens = self._count_output_tokens(output_message)
                callback(prompt_tokens, completion_tokens, self.model)
                continue

            self.collect_tokens.append(content)

# ============= YOUR CODE =============

def add_token_count(self, prompt_tokens: int, completion_tokens: int, model: str) -> None:
    # I append the tokens to a running total here. This will be called after the calculation is finished, as a callback.
    # You can choose to do anything here with the numbers.
    self.detailed_usage.append({
        "model": model,
        "usage": {"prompt_tokens": prompt_tokens, "completion_tokens": completion_tokens},
        "time": datetime.now()  # requires: from datetime import datetime
    })

completion = openai.chat.completions.create(messages=messages, stream=True, **params)

# completion is now a generator, or a 'stream' object. 
# CountStreamTokens is a custom class that is initialized with the model you use, and the messages you want to query with. 
# These are saved as class attributes for use in the .wrap_stream_and_count() function.
# The .wrap_stream_and_count() returns another generator, yielding all the same tokens as OpenAI provides, 
# but simultaneously collecting the output tokens.
# When the generator detects a None (ending) token in the stream, 
# it yields the final token and begins counting tokens (so as to keep the stream running)

return CountStreamTokens(model, messages).wrap_stream_and_count(completion, add_token_count)
Unrobe answered 16/4 at 2:54 Comment(0)

You can also use get_openai_callback() if you use LangChain:

from langchain.callbacks import get_openai_callback

with get_openai_callback() as cb:
    response = qa({"question": prompt, "chat_history": chat_history})

    print(f"Prompt Tokens: {cb.prompt_tokens}")
    print(f"Completion Tokens: {cb.completion_tokens}")
    print(f"Total Cost (USD): ${cb.total_cost}")
Domineca answered 6/7, 2023 at 16:56 Comment(1)
That doesn't work with streaming, which is the question here.Pegues
