How to truncate input in the Huggingface pipeline?

I currently use a Hugging Face pipeline for sentiment-analysis like so:

from transformers import pipeline
classifier = pipeline('sentiment-analysis', device=0)

The problem is that when I pass texts longer than 512 tokens, it crashes with an error saying the input is too long. Is there any way to pass the max_length and truncation parameters from the tokenizer directly to the pipeline?

My workaround is to do:

from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
classifier = pipeline('sentiment-analysis', model=model, tokenizer=tokenizer, device=0)

And then when I call the tokenizer:

pt_batch = tokenizer(text, padding=True, truncation=True, max_length=512, return_tensors="pt")
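
For completeness, the truncated batch can then be run through the model directly; a minimal sketch, assuming PyTorch:

import torch

# move the batch to the model's device (the pipeline may have moved the model to GPU)
pt_batch = {k: v.to(model.device) for k, v in pt_batch.items()}

# forward the truncated batch and read off the predicted class per text
with torch.no_grad():
    outputs = model(**pt_batch)
predictions = outputs.logits.argmax(dim=-1)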

But it would be much nicer to simply be able to call the pipeline directly like so:

classifier(text, padding=True, truncation=True, max_length=512)
Crichton answered 5/6, 2021 at 12:56 Comment(0)

This should work:

classifier(text, padding=True, truncation=True)

If it doesn't, try loading the tokenizer as:

tokenizer = AutoTokenizer.from_pretrained(model_name, model_max_length=512)
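
Putting the two together, a minimal sketch using the model from the question:

from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
tokenizer = AutoTokenizer.from_pretrained(model_name, model_max_length=512)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
classifier = pipeline('sentiment-analysis', model=model, tokenizer=tokenizer, device=0)

# inputs longer than 512 tokens are now truncated instead of crashing
result = classifier("some very long text " * 500, padding=True, truncation=True)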
Nocuous answered 1/8, 2021 at 11:0 Comment(3)
The parameter must be model_max_length, not model_max_len; the latter didn't work for me. – Battista
I had to use max_len=512 to make it work. – Williams
Perhaps this only works for the task text-classification? When trying to set padding="longest" for text-generation I get an error: ValueError: The following `model_kwargs` are not used by the model: ['padding'] (note: typos in the generate arguments will also show up in this list) – Plaudit

You can pass tokenizer_kwargs at inference time:

from transformers import pipeline

# model and tokenizer loaded as in the question
model_pipeline = pipeline("text-classification", model=model, tokenizer=tokenizer, device=0, return_all_scores=True)

# the pipeline converts inputs to tensors itself, so return_tensors is not needed here
tokenizer_kwargs = {'padding': True, 'truncation': True, 'max_length': 512}

prediction = model_pipeline('sample text to predict', **tokenizer_kwargs)
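
As a quick sanity check, an over-long input should now be truncated instead of raising a length error; a hypothetical example:

# deliberately over-long input, far more than 512 tokens
long_text = "word " * 2000
print(model_pipeline(long_text, **tokenizer_kwargs))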

For more details, see the Transformers pipelines documentation.

Redan answered 16/1, 2022 at 12:3 Comment(2)
Thanks. This works with regular Python. I am trying it in PySpark. Where would you place the tokenizer_kwargs: when creating the UDF or when calling the UDF? If you can give me an example for PySpark, I would appreciate it. Thanks. schema = ArrayType(StructType([ StructField("score", FloatType(), True), StructField("label", StringType(), True) ])) ... ... tokenizer_kwargs = {'padding': True, 'truncation': True, 'max_length': 512} sentiment_udf = F.udf(model_pipeline, schema) df = df.withColumn('pred_label', sentiment_udf(F.col("text"))) – Ethyne
Perhaps this only works for the task text-classification? When trying to set padding="longest" for text-generation I get an error: ValueError: The following `model_kwargs` are not used by the model: ['padding'] (note: typos in the generate arguments will also show up in this list) – Plaudit

You can also set truncation when constructing the pipeline:

from transformers import pipeline

# model and tokenizer loaded beforehand
generator = pipeline(task='text2text-generation', truncation=True, model=model, tokenizer=tokenizer)

# inspect the preprocess parameters the pipeline stored
generator._preprocess_params
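
With truncation set at construction, an over-long prompt should be cut to the model's maximum input length before generation; a small sketch (the summarize: prefix assumes a T5-style model):

# deliberately over-long prompt
long_prompt = "summarize: " + "word " * 5000
print(generator(long_prompt))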
Targum answered 11/8, 2023 at 1:57 Comment(1)
generator._preprocess_params somehow returned {} for me right after initializing the pipeline. – Plaudit

$ export TRANSFORMERS_CACHE=/projectnb/pnn/.cache
$ export HF_HOME=/projectnb/pnn/.cache
$ export HF_DATASETS_CACHE=/projectnb/pnn/.cache

This one works for me. Please change /projectnb/pnn/.cache to your own path.
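
The same can be done from Python, provided the variables are set before transformers is imported; a minimal sketch:

import os

# point all Hugging Face caches at one directory (use your own path)
os.environ["HF_HOME"] = "/projectnb/pnn/.cache"
os.environ["TRANSFORMERS_CACHE"] = "/projectnb/pnn/.cache"
os.environ["HF_DATASETS_CACHE"] = "/projectnb/pnn/.cache"

import transformers  # now picks up the cache locations above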

Psychographer answered 27/11, 2023 at 16:33 Comment(0)
