tokenize - McMap

6

How to use tiktoken on an offline computer

import tiktoken tokenizer = tiktoken.get_encoding("cl100k_base") tokenizer = tiktoken.encoding_for_model("gpt-3.5-turbo") text = "Hello, nice to meet you" tokenizer...

python tokenize gpt-3

Homosporous asked 26/4, 2023 at 0:36

4

Solved

How to reconstruct text entities with Hugging Face's transformers pipelines without IOB tags?

I've been looking to use Hugging Face's Pipelines for NER (named entity recognition). However, it is returning the entity labels in inside-outside-beginning (IOB) format but without the IOB labels....

nlp tokenize transformer-model named-entity-recognition huggingface-transformers

Dahl asked 30/3, 2020 at 18:58
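
One possible approach to the question above, as a minimal sketch: merge adjacent tokens that share a label into whole entities. The input shape (dicts with "word", "entity" and "index" keys, with "##" marking WordPiece continuations) and the function name `group_entities` are assumptions for illustration, not the pipeline's confirmed output.

```python
def group_entities(tokens):
    """Merge adjacent tokens that share a label into whole entities.

    `tokens` is a hypothetical list of dicts with "word", "entity" and
    "index" keys; WordPiece continuations carry a "##" prefix.
    """
    groups = []
    for tok in tokens:
        word, label, idx = tok["word"], tok["entity"], tok["index"]
        last = groups[-1] if groups else None
        adjacent = last is not None and idx == last["end"] + 1
        if adjacent and word.startswith("##"):
            # Sub-word piece: glue it onto the previous word without a space.
            last["word"] += word[2:]
            last["end"] = idx
        elif adjacent and label == last["entity"]:
            # Next word of the same entity: join with a space.
            last["word"] += " " + word
            last["end"] = idx
        else:
            # Gap in the indices or a different label: start a new entity.
            groups.append({"word": word, "entity": label, "end": idx})
    return [(g["word"], g["entity"]) for g in groups]


# Hypothetical sample data for illustration only.
sample = [
    {"word": "Hu", "entity": "ORG", "index": 1},
    {"word": "##gging", "entity": "ORG", "index": 2},
    {"word": "Face", "entity": "ORG", "index": 3},
    {"word": "Paris", "entity": "LOC", "index": 7},
]
entities = group_entities(sample)
print(entities)
```

Without B-/I- prefixes, two distinct entities of the same type that sit directly next to each other cannot be told apart, so this sketch merges them into one.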

4

Converting Hugging Face Transformer Text Embeddings Back to Text

Is there a method for converting Hugging Face Transformer embeddings back to text? Suppose that I have text embeddings created using Hugging Face's CLIPTextModel using the following method: import ...

python pipeline tokenize huggingface-transformers

Examination asked 6/11, 2022 at 11:45

0

What thing is responsible for the explicit line joining?

It seems that the Python tokenizer isn't responsible for the explicit line joining. I mean if we write the following code in file script.py: "one \ two" and then type python -m tokenize ...

python language-lawyer tokenize

Elwood asked 20/5 at 12:12
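
A small sketch for inspecting what the standard-library `tokenize` module emits for a backslash-continued statement. The source string here is illustrative and is not the asker's truncated example. On CPython the backslash and the line break after it should produce no token of their own, so the two physical lines show up as a single logical line.

```python
import io
import tokenize

# Illustrative source: one logical line split across two physical
# lines with a trailing backslash.
source = "x = 1 + \\\n    2\n"

tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))
names = [tokenize.tok_name[tok.type] for tok in tokens]

# Print each token's type, text and (row, column) start position.
for tok in tokens:
    print(tokenize.tok_name[tok.type], repr(tok.string), tok.start)
```

The start positions show `2` on row 2, yet no NL or NEWLINE token separates it from the `+` on row 1.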

7

Securing my API to only work with my front-end

I'm building a node/express backend. I want to create an API that only works with my reactjs frontend (private API). Imagine if this is an e-commerce web site, my users will browse products and will...

reactjs node.js authentication tokenize

Anagnos asked 29/12, 2016 at 19:8

4

TRANSFORMERS: Asking to pad but the tokenizer does not have a padding token

I'm trying to evaluate several transformers models sequentially with the same dataset to check which one performs better. The list of models is this one: MODELS = [ ('xlm-mlm-enfr-1024' ,"XLMM...

python tensorflow pytorch tokenize huggingface-transformers

Ep asked 31/12, 2021 at 16:39

3

Fine-Tuning GPT2 - attention mask and pad token id errors

I have been trying to fine-tune GPT2 on the wikitext-2 dataset (just to help myself learn the process) and I am running into a warning message that I have not seen before: "The attention mask ...

machine-learning tokenize training-data gpt-2 fine-tuning

Yim asked 5/12, 2022 at 1:57

4

How can I fix "Error tokenizing data" on pandas csv reader?

I'm trying to read a csv file with pandas. This file actually has only one row but it causes an error whenever I try to read it. Something seems to be going wrong on line 8, but I could hardly find th...

python pandas csv tokenize

Jamieson asked 12/11, 2018 at 4:45
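
pandas often raises "Error tokenizing data" when a line contains a different number of fields than the parser expected from earlier lines. A minimal diagnostic sketch using only the standard-library `csv` module follows; the helper name `rows_with_unexpected_width` and the sample data are hypothetical.

```python
import csv
import io


def rows_with_unexpected_width(text, delimiter=","):
    """Return (line_number, field_count) for rows whose width differs from row 1."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    expected = None
    mismatches = []
    for row in reader:
        if expected is None:
            # The first row fixes the expected number of fields.
            expected = len(row)
        elif len(row) != expected:
            mismatches.append((reader.line_num, len(row)))
    return mismatches


# Hypothetical data: the third line carries one extra field.
sample = "a,b,c\n1,2,3\n4,5,6,7\n"
print(rows_with_unexpected_width(sample))
```

Running a check like this on the raw file can point at the offending line before deciding whether the delimiter, quoting, or the data itself needs fixing.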

2

OpenAI API: What would be a good strategy to handle 80+ function calling?

My business handles a variety of entities (job, invoice, quote, resource, vehicle, contact, person, message, alert, etc.). My goal is to use OpenAI function calling to allow my users to ask "a...

nlp tokenize openai-api chatgpt-api gpt-4

Counterpoise asked 14/9, 2023 at 8:5

3

Solved

Implement tokens in a SwiftUI TextField

How can I make a TextField in SwiftUI have tokens like a UISearchBar? I've tried to insert a UISearchBar so I could use them, but I lost the behavior from the interaction between the TextField and ...

ios swiftui tokenize

Ajay asked 19/3, 2020 at 18:2

6

Solved

Split string by a substring

I have the following string: char str[] = "A/USING=B)"; I want to split it to get separate A and B values with /USING= as a delimiter. How can I do it? I know strtok() but it just splits by one charac...

c string tokenize strtok

Afghani asked 22/1, 2016 at 8:53

2

Solved

Creating a syntax tree from tokens

I'm trying to create a tiny interpreter for TI-BASIC syntax. This is a snippet of TI-BASIC I'm trying to interpret A->(2+(3*3)) I've tokenized the code above into this sequence of tokens: T...

java tokenize abstract-syntax-tree

Predicant asked 9/7, 2014 at 19:6

4

Solved

C++ Tokenize String

I'm looking for a simple way to tokenize std::string input without using non default libraries such as Boost, etc. For example, if the user enters forty_five, I would like to separate 'forty' and '...

c++ string split std tokenize

Revolute asked 7/4, 2012 at 4:14

13

How to get rid of punctuation using NLTK tokenizer?

I'm just starting to use NLTK and I don't quite understand how to get a list of words from text. If I use nltk.word_tokenize(), I get a list of words and punctuation. I need only the words instead....

python nlp tokenize nltk

Impaste asked 21/3, 2013 at 12:22
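
One way to get words without punctuation is to tokenize on runs of word characters instead of filtering afterwards. The sketch below uses the standard-library `re` module as a stand-in; NLTK's `RegexpTokenizer` accepts a similar pattern. The helper name `words_only` is hypothetical.

```python
import re


def words_only(text):
    """Return runs of word characters, dropping punctuation tokens."""
    return re.findall(r"\w+", text)


print(words_only("Hello, world! It's 9 a.m."))
```

Note the trade-off: the `\w+` pattern also splits contractions and abbreviations, so "It's" becomes "It" and "s".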

2

Solved

How is nltk.TweetTokenizer different from nltk.word_tokenize?

I am unable to understand the difference between the two. I have learned that word_tokenize uses the Penn Treebank for tokenization purposes, but nothing on TweetTokenizer is available. For whi...

python nlp artificial-intelligence nltk tokenize

Saphena asked 20/5, 2020 at 17:53

2

Solved

Boost::tokenizer comma separated (c++)

Should be an easy one for you guys..... I'm playing around with tokenizers using Boost and I want to create a token that is comma separated. Here is my code: string s = "this is, , , a test"; boost...

c++ boost tokenize boost-tokenizer

Bield asked 29/10, 2011 at 21:8

17

Solved

What is the easiest/best/most correct way to iterate through the characters of a string in Java?

Some ways to iterate through the characters of a string in Java are: Using StringTokenizer? Converting the String to a char[] and iterating over that. What is the easiest/best/most correct way to...

java string iteration character tokenize

Depreciable asked 13/10, 2008 at 6:10

35

Solved

Parse (split) a string in C++ using string delimiter (standard C++)

I am parsing a string in C++ using the following: using namespace std; string parsed,input="text to be parsed"; stringstream input_stringstream(input); if (getline(input_stringstream,parsed,' ')...

c++ parsing split token tokenize

Pizarro asked 10/1, 2013 at 19:16

2

Solved

How to do Tokenizer Batch processing? - HuggingFace

In the Tokenizer documentation from Hugging Face, the call function accepts List[List[str]] and says: text (str, List[str], List[List[str]], optional) — The sequence or batch of sequences to be enco...

pytorch batch-processing tokenize huggingface-transformers huggingface-tokenizers

Billon asked 7/6, 2023 at 10:15

3

How do I implement a custom UITextInputTokenizer?

I have a UITextView and am using its tokenizer to check which words the user taps on. My goal is to change what the tokenizer thinks of as a word. Currently it seems to define words as consecutive...

ios swift uitextview tokenize

Radiobiology asked 3/4, 2015 at 15:19

3

Solved

TorchText Vocab TypeError: Vocab.__init__() got an unexpected keyword argument 'min_freq'

I am working on a CNN Sentiment analysis machine learning model which uses the IMDb dataset provided by the Torchtext library. On one of my lines of code vocab = Vocab(counter, min_freq = 1, specia...

python conv-neural-network tokenize imdb torchtext

Hemia asked 28/3, 2022 at 19:41

5

ValueError: TextEncodeInput must be Union[TextInputSequence, Tuple[InputSequence, InputSequence]] - Tokenizing BERT / Distilbert Error

def split_data(path): df = pd.read_csv(path) return train_test_split(df , test_size=0.1, random_state=100) train, test = split_data(DATA_DIR) train_texts, train_labels = train['text'].to_list(),...

tokenize bert-language-model huggingface-transformers huggingface-tokenizers distilbert

Basketwork asked 21/8, 2020 at 5:59

3

Solved

BertTokenizer - when encoding and decoding sequences extra spaces appear

When using Transformers from Hugging Face I am facing a problem with the encoding and decoding method. I have the following string: test_string = 'text with percentage%' Then I am running the ...

python pytorch tokenize torch bert-language-model

Reinsure asked 21/11, 2019 at 16:43

4

Solved

Tokenization of Arabic words using NLTK

I'm using NLTK's word_tokenize to split a sentence into words. I want to tokenize this sentence: في_بيتنا كل شي لما تحتاجه يضيع ...ادور على شاحن فجأة يختفي ..لدرجة اني اسوي نفسي ادور شيء The co...

python tokenize nltk

Lincolnlincolnshire asked 23/10, 2012 at 16:59

17

Solved

How to split a string in shell and get the last field

Suppose I have the string 1:2:3:4:5 and I want to get its last field (5 in this case). How do I do that using Bash? I tried cut, but I don't know how to specify the last field with -f.

bash split tokenize cut

Perennial asked 1/7, 2010 at 23:29
