TL;DR
Take some time to go through https://huggingface.co/course/ or read https://www.oreilly.com/library/view/natural-language-processing/9781098136789/
After that, most of the questions you're having will have been answered.
Show me the code: Scroll down to the bottom of the answer =)
What is a datasets.Dataset and a datasets.DatasetDict?
TL;DR: basically, we want to iterate through the dataset and get a dictionary whose keys are the names of the tensors that the model will consume, and whose values are the actual tensors, so that the model can use them in its .forward() function.
In code, you want the processed dataset to be able to do this:
from datasets import load_dataset

ds = load_dataset(...)
ds = ds.map(func_to_preprocess)  # .map() returns a new dataset; reassign it.

for data in ds['train']:  # Iterate over one split of the DatasetDict.
    model(data)  # Does a forward propagation pass.
Why can't I just feed the Dataset into the model directly?
It's because the individual datasets' creators/maintainers are not necessarily the ones that create the models.
And keeping them independent makes sense, since a dataset can be used by different models, and each model requires the dataset to be preprocessed/"munged"/"manipulated" into the format that it expects (kind of like an Extract, Transform, Load (ETL) process for transformers-based models).
Unless explicitly preprocessed, most datasets come as raw text (str) plus annotations/labels, which usually fall into one of these types (a minimal sketch of the corresponding Auto* classes follows the list):
- single-token decoder output (a single label)
  - e.g. Language ID task, [in]: Hallo Welt and [out]: de
  - normally uses AutoModelForSequenceClassification
- regression float output
  - e.g. Textual Similarity, [in]: Hello world <sep> Foo bar and [out]: 32.12
  - normally uses AutoModelForSequenceClassification
- free-form autoregressive decoder output (a natural-text sentence, i.e. a list of tokens)
  - e.g. Machine Translation, [in]: Hallo Welt and [out]: Hello World
  - normally uses AutoModelForSeq2SeqLM
- fixed-tokens decoder output (a list of labels)
  - e.g. BIO annotations, [in]: Obama is the president and [out]: ['B-PER', 'O', 'O', 'O']
  - normally uses AutoModelForTokenClassification
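To make that mapping concrete, here is a minimal sketch, where the bert-base-cased checkpoint and num_labels=2 are purely illustrative assumptions, of how raw text feeds one of these Auto* classes:

from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Illustrative checkpoint; any encoder checkpoint would do here.
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=2)

inputs = tokenizer("Hallo Welt", return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits.shape)  # torch.Size([1, 2]): one prediction per sentence.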
For the dataset you're interested in:
from datasets import load_dataset
raw_dataset = load_dataset("dxiao/requirements-ner-id")
raw_dataset['train'][0]
[out]:
{'id': 0,
 'tokens': ['The', 'operating', 'humidity', 'shall', 'be',
            'between', '0.4', 'and', '0.6'],
 'tags': ['O', 'B-ATTR', 'I-ATTR', 'O', 'B-ACT',
          'B-RELOP', 'B-QUANT', 'O', 'B-QUANT'],
 'ner_tags': [0, 3, 4, 0, 1, 5, 7, 0, 7]}
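If you want to see how those columns are typed, the split's features attribute prints the schema:

raw_dataset['train'].features  # Shows the column names and value types.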
But the model doesn't understand raw inputs and outputs; it only understands torch.tensor objects, hence you need to do some processing. Also, tokenizers usually expect raw strings, not lists of tokens.
Normally, a model's tokenizer converts raw strings into a list of token ids:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, AutoModel
model_name = "flax-community/t5-large-wikisplit"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
tokenizer(["hello world", "foo bar is a sentence", "fizz buzz"])
[out]:
{'input_ids': [[21820, 296, 1], [5575, 32, 1207, 19, 3, 9, 7142, 1], [361, 5271, 15886, 1]], 'attention_mask': [[1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1]]}
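Note that the output above contains plain Python lists; to get actual torch.tensor objects (which is what the model consumes), pass return_tensors="pt" and pad the batch:

# Sentences of unequal length must be padded before they can become one tensor.
batch = tokenizer(["hello world", "foo bar is a sentence"],
                  padding=True, return_tensors="pt")
type(batch["input_ids"])  # <class 'torch.Tensor'>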
But my dataset comes pre-tokenized? So what do I do?
sentences = [
    ['The', 'operating', 'humidity', 'shall', 'be', 'between', '0.4', 'and', '0.6'],
    ['The', 'CIS', 'CNET', 'shall', 'accommodate', 'a', 'bandwidth', 'of', 'at',
     'least', '24.0575', 'Gbps', 'to', 'the', 'Computer', 'Room', '.']
]
[tokenizer.convert_tokens_to_ids(sent) for sent in sentences]
[out]:
[[634, 2, 2, 2, 346, 24829, 22776, 232, 22787],
[634, 21134, 2, 2, 2, 9, 2, 858, 144, 2, 2, 2, 235, 532, 2, 2, 5]]
Why are there so many tokens with index 2?
Because they are unknowns. If we take a look at the vocab,
>>> tokenizer.convert_tokens_to_ids(tokenizer.unk_token)
2
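You can check token by token which words fall outside the vocabulary (an illustrative loop; T5 uses SentencePiece subwords, so many whole words have no single-token entry):

for tok in ['The', 'operating', 'humidity', 'shall']:
    print(tok, tokenizer.convert_tokens_to_ids(tok))  # 2 means unknown.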
Then how do I encode the tags or new tokens?
Here's an example:
from itertools import chain
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, AutoModel
from datasets import load_dataset
model_name = "flax-community/t5-large-wikisplit"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
raw_dataset = load_dataset("dxiao/requirements-ner-id")
# Get the NER tags.
tag_set = list(map(str, set(chain(*raw_dataset['train']['tags']))))
# Put them into the tokenizer.
tokenizer.add_special_tokens({'additional_special_tokens': tag_set})
# The embedding matrix must grow to cover the newly added tokens,
# otherwise label ids would fall outside the model's vocabulary.
model.resize_token_embeddings(len(tokenizer))

train_dataset = raw_dataset['train'].map(lambda x:
    {'input_ids': tokenizer.convert_tokens_to_ids(x['tokens']),
     'labels': tokenizer.convert_tokens_to_ids(x['tags'])}
)
valid_dataset = raw_dataset['validation'].map(lambda x:
    {'input_ids': tokenizer.convert_tokens_to_ids(x['tokens']),
     'labels': tokenizer.convert_tokens_to_ids(x['tags'])}
)
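A quick sanity check, just to see what .map produced (the columns stay plain Python lists until .with_format("torch") is applied later):

train_dataset[0]['input_ids'][:5]  # First few token ids of the first example.
train_dataset[0]['labels'][:5]     # First few label ids of the first example.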
How to train a Seq2Seq using the text inputs and the NER labels as the outputs?
TL;DR:
from itertools import chain
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, AutoModel
from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer
from datasets import load_dataset
import evaluate
model_name = "flax-community/t5-large-wikisplit"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
raw_dataset = load_dataset("dxiao/requirements-ner-id")
# Get the NER tags.
tag_set = list(map(str, set(chain(*raw_dataset['train']['tags']))))
# Put them into the tokenizer.
tokenizer.add_special_tokens({'additional_special_tokens': tag_set})
# The embedding matrix must grow to cover the newly added tokens.
model.resize_token_embeddings(len(tokenizer))

train_data = raw_dataset['train'].map(lambda x:
    {'input_ids': tokenizer.convert_tokens_to_ids(x['tokens']),
     'labels': tokenizer.convert_tokens_to_ids(x['tags'])}
)
valid_data = raw_dataset['validation'].map(lambda x:
    {'input_ids': tokenizer.convert_tokens_to_ids(x['tokens']),
     'labels': tokenizer.convert_tokens_to_ids(x['tags'])}
)
# set special tokens, not sure if it's needed but adding them for sanity...
model.config.eos_token_id = tokenizer.eos_token_id
model.config.pad_token_id = tokenizer.pad_token_id
mt_metrics = evaluate.combine(
    ["bleu", "chrf"], force_prefix=True
)

def compute_metrics(pred):
    labels_ids = pred.label_ids
    pred_ids = pred.predictions
    predictions = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    labels_ids[labels_ids == -100] = tokenizer.pad_token_id
    references = tokenizer.batch_decode(labels_ids, skip_special_tokens=True)
    outputs = mt_metrics.compute(predictions=predictions,
                                 references=references)
    return outputs
training_args = Seq2SeqTrainingArguments(
    output_dir='./',
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    logging_steps=1,
    save_steps=5,
    eval_steps=1,
    max_steps=10,
    evaluation_strategy="steps",
    predict_with_generate=True,
    report_to="none",  # The string "none" disables reporting; None falls back to the default integrations.
    metric_for_best_model="chr_f_score",
    load_best_model_at_end=True
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_data.with_format("torch"),
    eval_dataset=valid_data.with_format("torch"),
    compute_metrics=compute_metrics
)

trainer.train()
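One caveat: the mapped examples above have variable lengths, so the default collation can fail when batching. A sketch of a fix, using the library's DataCollatorForSeq2Seq, which pads input_ids and labels to the longest example in each batch; pass it as data_collator=collator to the Seq2SeqTrainer above:

from transformers import DataCollatorForSeq2Seq

collator = DataCollatorForSeq2Seq(tokenizer, model=model, padding=True)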
Hey, something seems fishy: when we train an NER model, shouldn't we be using AutoModelForTokenClassification, not AutoModelForSeq2SeqLM?
Yeah, but like many things in life, there are many means to the same end. So in this case, you can take the liberty to be creative, e.g. by treating the tag sequence as the target "sentence" of a sequence-to-sequence model, as the code above does.
Περίμενε ένα λεπτό! (Wait a minute!) That's not what I want to do!
I guess you don't really want to do NER, but the lessons learnt from munging the corpus with additional tokens and the .map function should help with what you need.
Why don't you just tell me how to manipulate the DatasetDict so that it fits what I need?!
Alright, alright. Here goes...
First, I guess you would need to clarify in your question what task you are tackling, on top of what model and dataset you're using.
From your code, I am guessing you are trying to build a model for
- Task: Text simplification
  - [in]: This is super long sentence that has lots of no meaning words.
  - [out]: This is a long-winded sentence.
- Model: Seq2Seq, using AutoModelForSeq2SeqLM.from_pretrained("flax-community/t5-large-wikisplit")
- Dataset: Texts from dxiao/requirements-ner-id
  - [in]: ['The', 'operating','humidity','shall','be',...,]
  - [out]: 'The humidity is high'
  - Only the input tokens from dxiao/requirements-ner-id are used as input texts; everything else in the dataset is not needed.
- Preprocessing: Convert the input into a simplified version
  - [in]: ['The', 'operating','humidity','shall','be',...,]
  - [out]: ['The', '<xxx>', 'humidity', ...]
  - Convert the simplified output and the original inputs to the labels and input_ids that the model expects.
  - Let's create a random_xxx function for this purpose.
def random_xxx(tokens):
    # Pick out (up to) 3 token positions to mask with '<xxx>'.
    to_xxx = set(random.sample(range(len(tokens)), min(3, len(tokens))))
    output = []  # Use a new list; overwriting `tokens` would leave nothing to iterate.
    for i, tok in enumerate(tokens):
        if i in to_xxx:
            output.append('<xxx>')
        else:
            output.append(tok)
    return output
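Example usage (the exact output varies because the positions are sampled at random):

import random

random.seed(0)  # Only to make this illustration reproducible.
random_xxx(['The', 'operating', 'humidity', 'shall', 'be'])
# e.g. ['The', '<xxx>', 'humidity', '<xxx>', '<xxx>']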
So how do I do what you listed above? Stop stalling, just give me the code...
from itertools import chain
import random
import os
os.environ["WANDB_DISABLED"] = "true"
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, AutoModel
from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer
from datasets import load_dataset
import evaluate
model_name = "flax-community/t5-large-wikisplit"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
def random_xxx(tokens):
    # Pick out (up to) 3 token positions to mask with '<xxx>'.
    to_xxx = set(random.sample(range(len(tokens)), min(3, len(tokens))))
    output = []  # Use a new list; overwriting `tokens` would leave nothing to iterate.
    for i, tok in enumerate(tokens):
        if i in to_xxx:
            output.append('<xxx>')
        else:
            output.append(tok)
    return output
raw_dataset = load_dataset("dxiao/requirements-ner-id")
# Put '<xxx>' into the tokenizer.
tokenizer.add_special_tokens({'additional_special_tokens': ['<xxx>']})
# The embedding matrix must grow to cover the newly added token.
model.resize_token_embeddings(len(tokenizer))

# Assuming `input_ids` is the "complex" original sentence
# and `labels` is the "simplified" sentence with <xxx>.
train_data = raw_dataset['train'].map(lambda x:
    {'input_ids': tokenizer(" ".join(x['tokens']),
         max_length=40, truncation=True, padding="max_length")["input_ids"],
     'labels': tokenizer(" ".join(random_xxx(x['tokens'])),
         max_length=40, truncation=True, padding="max_length")["input_ids"]}
)
valid_data = raw_dataset['validation'].map(lambda x:
    {'input_ids': tokenizer(" ".join(x['tokens']),
         max_length=40, truncation=True, padding="max_length")["input_ids"],
     'labels': tokenizer(" ".join(random_xxx(x['tokens'])),
         max_length=40, truncation=True, padding="max_length")["input_ids"]}
)
# set special tokens, not sure if it's needed but adding them for sanity...
model.config.eos_token_id = tokenizer.eos_token_id
model.config.pad_token_id = tokenizer.pad_token_id
mt_metrics = evaluate.combine(
    ["bleu", "chrf"], force_prefix=True
)

def compute_metrics(pred):
    labels_ids = pred.label_ids
    pred_ids = pred.predictions
    predictions = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    labels_ids[labels_ids == -100] = tokenizer.pad_token_id
    references = tokenizer.batch_decode(labels_ids, skip_special_tokens=True)
    outputs = mt_metrics.compute(predictions=predictions,
                                 references=references)
    return outputs
training_args = Seq2SeqTrainingArguments(
    output_dir='./',
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    logging_steps=1,
    save_steps=5,
    eval_steps=1,
    max_steps=10,
    evaluation_strategy="steps",
    predict_with_generate=True,
    report_to="none",  # The string "none" disables reporting; None falls back to the default integrations.
    metric_for_best_model="chr_f_score",
    load_best_model_at_end=True
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_data.with_format("torch"),
    eval_dataset=valid_data.with_format("torch"),
    compute_metrics=compute_metrics
)

trainer.train()
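Once training finishes, here is a minimal sketch of running the fine-tuned model on one sentence (the example comes from the dataset above):

inputs = tokenizer("The operating humidity shall be between 0.4 and 0.6",
                   return_tensors="pt")
generated = model.generate(**inputs, max_length=40)
print(tokenizer.decode(generated[0], skip_special_tokens=True))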