If you send `{"role": "user", "content": "What is the most beautiful country?"}` as the `messages` parameter, it's not only `What is the most beautiful country?` that is sent to the OpenAI API endpoint, but apparently the whole `"role": "user", "content": "What is the most beautiful country?"`.
I was able to confirm this using tiktoken.
If you run `get_tokens_long_example.py`, you'll get the following output:
14
get_tokens_long_example.py:

```python
import tiktoken


def num_tokens_from_string(string: str, encoding_name: str) -> int:
    """Returns the number of tokens in a text string."""
    encoding = tiktoken.get_encoding(encoding_name)
    num_tokens = len(encoding.encode(string))
    return num_tokens


print(num_tokens_from_string("'role':'user','content':'What is the most beautiful country?'", "cl100k_base"))
```
If you run `get_tokens_short_example.py`, you'll get the following output:
8
get_tokens_short_example.py:

```python
import tiktoken


def num_tokens_from_string(string: str, encoding_name: str) -> int:
    """Returns the number of tokens in a text string."""
    encoding = tiktoken.get_encoding(encoding_name)
    num_tokens = len(encoding.encode(string))
    return num_tokens


print(num_tokens_from_string("'role':'user','content':'.'", "cl100k_base"))
```
You said that the OpenAI API reports 15 tokens used in the first example and 9 tokens used in the second. You probably noticed that I got 14 and 8 tokens using tiktoken (i.e., 1 token less in both examples). This seems to be a known tiktoken issue that was supposed to have been fixed. Anyway, I didn't dig deep enough to figure out why I still get 1 token less, but I was able to show that it's not only `What is the most beautiful country?` that is sent to the OpenAI API endpoint.
For more information about tiktoken, see this answer.