tokenize Questions
6
import tiktoken
tokenizer = tiktoken.get_encoding("cl100k_base") tokenizer = tiktoken.encoding_for_model("gpt-3.5-turbo")
text = "Hello, nice to meet you"
tokenizer...
4
Solved
I've been looking to use Hugging Face's Pipelines for NER (named entity recognition). However, it is returning the entity labels in inside-outside-beginning (IOB) format but without the IOB labels....
Dahl asked 30/3, 2020 at 18:58
4
Is there a method for converting Hugging Face Transformer embeddings back to text?
Suppose that I have text embeddings created using Hugging Face's ClipTextModel using the following method:
import ...
Examination asked 6/11, 2022 at 11:45
0
It seems that the Python tokenizer isn't responsible for the explicit line joining. I mean if we write the following code in file script.py:
"one \
two"
and then type python -m tokenize ...
Elwood asked 20/5 at 12:12
7
I'm building a node/express backend. I want to create an API that only work with my reactjs frontend (private API).
Imagine if this is an e-commerce web site, my users will browse products and will...
Anagnos asked 29/12, 2016 at 19:8
4
In trying to evaluate several transformers models sequentially with the same dataset to check which one performs better.
The list of models is this one:
MODELS = [
('xlm-mlm-enfr-1024' ,"XLMM...
Ep asked 31/12, 2021 at 16:39
3
I have been trying to fine-tune GPT2 on the wikitext-2 dataset (just to help myself learn the process) and I am running into a warning message that I have not seen before:
"The attention mask ...
Yim asked 5/12, 2022 at 1:57
4
I'm trying to read a csv file with pandas.
This file actually has only one row but it causes an error whenever I try to read it.
Something wrong seems happening in line 8 but I could hardly find th...
2
My business handles a variety of entities (job, invoice, quote, resource, vehicle, contact, person, message, alert, etc.).
My goal is to use OpenAI function calling to allow my users to ask "a...
Counterpoise asked 14/9, 2023 at 8:5
3
Solved
How can I make TextField in SwiftUI have tokens like an UISearchBar?
I've tried to insert an UISearchBar so I could use them, but I lost the behavior from the interaction between the TextField and ...
6
Solved
I have following string:
char str[] = "A/USING=B)";
I want to split to get separate A and B values with /USING= as a delimiter
How can I do it? I known strtok() but it just split by one charac...
2
Solved
I'm trying to create a tiny interpreter for TI-BASIC syntax.
This is a snippet of TI-BASIC I'm trying to interpret
A->(2+(3*3))
I've tokenized the code above into this sequence of tokens:
T...
Predicant asked 9/7, 2014 at 19:6
4
Solved
13
I'm just starting to use NLTK and I don't quite understand how to get a list of words from text. If I use nltk.word_tokenize(), I get a list of words and punctuation. I need only the words instead....
2
Solved
I am unable to understand the difference between the two. Though, I come to know that word_tokenize uses Penn-Treebank for tokenization purposes. But nothing on TweetTokenizer is available. For whi...
Saphena asked 20/5, 2020 at 17:53
2
Solved
Should be an easy one for you guys.....
I'm playing around with tokenizers using Boost and I want create a token that is comma separated. here is my code:
string s = "this is, , , a test";
boost...
Bield asked 29/10, 2011 at 21:8
17
Solved
Some ways to iterate through the characters of a string in Java are:
Using StringTokenizer?
Converting the String to a char[] and iterating over that.
What is the easiest/best/most correct way to...
35
Solved
I am parsing a string in C++ using the following:
using namespace std;
string parsed,input="text to be parsed";
stringstream input_stringstream(input);
if (getline(input_stringstream,parsed,' ')...
2
Solved
in the Tokenizer documentation from huggingface, the call fuction accepts List[List[str]] and says:
text (str, List[str], List[List[str]], optional) — The sequence or batch of sequences to be enco...
Billon asked 7/6, 2023 at 10:15
3
I have a UITextView and am using its tokenizer to check which words the user taps on.
My goal is to change what the tokenizer thinks of as a word. Currently it seems to define words as consecutive...
Radiobiology asked 3/4, 2015 at 15:19
3
Solved
I am working on a CNN Sentiment analysis machine learning model which uses the IMDb dataset provided by the Torchtext library.
On one of my lines of code
vocab = Vocab(counter, min_freq = 1, specia...
Hemia asked 28/3, 2022 at 19:41
5
def split_data(path):
df = pd.read_csv(path)
return train_test_split(df , test_size=0.1, random_state=100)
train, test = split_data(DATA_DIR)
train_texts, train_labels = train['text'].to_list(),...
Basketwork asked 21/8, 2020 at 5:59
3
Solved
When using Transformers from HuggingFace I am facing a problem with the encoding and decoding method.
I have a the following string:
test_string = 'text with percentage%'
Then I am running the ...
Reinsure asked 21/11, 2019 at 16:43
4
Solved
I'm using NLTK word_tokenizer to split a sentence into words.
I want to tokenize this sentence:
في_بيتنا كل شي لما تحتاجه يضيع ...ادور على شاحن فجأة يختفي ..لدرجة اني اسوي نفسي ادور شيء
The co...
17
Solved
Suppose I have the string 1:2:3:4:5 and I want to get its last field (5 in this case). How do I do that using Bash? I tried cut, but I don't know how to specify the last field with -f.
1 Next >
© 2022 - 2024 — McMap. All rights reserved.