Problem with inputs when building a model with TFBertModel and AutoTokenizer from HuggingFace's transformers

Asked 15/9, 2021 at 15:28 Answered 8/9, 2022 at 11:11

Solved tensorflow keras huggingface-transformers bert-language-model huggingface-tokenizers

I'm trying to build the model illustrated in this picture:

I obtained a pre-trained BERT and respective tokenizer from HuggingFace's transformers in the following way:

from transformers import AutoTokenizer, TFBertModel
model_name = "dbmdz/bert-base-italian-xxl-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
bert = TFBertModel.from_pretrained(model_name)

The model will be fed a sequence of Italian tweets and will need to determine whether they are ironic or not.

I'm having problems building the initial part of the model, which takes the inputs and feeds them to the tokenizer in order to get a representation I can feed to BERT.

I can do it outside of the model-building context:

my_phrase = "Ciao, come va?"
# an equivalent version is tokenizer(my_phrase, other parameters)
bert_input = tokenizer.encode(my_phrase, add_special_tokens=True, return_tensors='tf', max_length=110, padding='max_length', truncation=True) 
attention_mask = bert_input > 0
outputs = bert(bert_input, attention_mask)['pooler_output']

but I'm having trouble building a model that does this. Here is the code for building such a model (the problem is in the first 4 lines):

def build_classifier_model():
  text_input = tf.keras.layers.Input(shape=(), dtype=tf.string, name='text')
  encoder_inputs = tokenizer(text_input, return_tensors='tf', add_special_tokens=True, max_length=110, padding='max_length', truncation=True)
  outputs = bert(encoder_inputs)
  net = outputs['pooler_output']
  
  X = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64, return_sequences=True, dropout=0.1, recurrent_dropout=0.1))(net)
  X = tf.keras.layers.Concatenate(axis=-1)([X, input_layer])
  X = tf.keras.layers.MaxPooling1D(20)(X)
  X = tf.keras.layers.SpatialDropout1D(0.4)(X)
  X = tf.keras.layers.Flatten()(X)
  X = tf.keras.layers.Dense(128, activation="relu")(X)
  X = tf.keras.layers.Dropout(0.25)(X)
  X = tf.keras.layers.Dense(2, activation='softmax')(X)

  model = tf.keras.Model(inputs=text_input, outputs = X) 
  
  return model

And when I call the function for creating this model I get this error:

text input must of type str (single example), List[str] (batch or single pretokenized example) or List[List[str]] (batch of pretokenized examples).

One thing I thought was that maybe I had to use the tokenizer.batch_encode_plus function which works with lists of strings:

class BertPreprocessingLayer(tf.keras.layers.Layer):
  def __init__(self, tokenizer, maxlength):
    super().__init__()
    self._tokenizer = tokenizer
    self._maxlength = maxlength
  
  def call(self, inputs):
    print(type(inputs))
    print(inputs)
    tokenized = tokenizer.batch_encode_plus(inputs, add_special_tokens=True, return_tensors='tf', max_length=self._maxlength, padding='max_length', truncation=True)
    return tokenized

def build_classifier_model():
  text_input = tf.keras.layers.Input(shape=(), dtype=tf.string, name='text')
  encoder_inputs = BertPreprocessingLayer(tokenizer, 100)(text_input)
  outputs = bert(encoder_inputs)
  net = outputs['pooler_output']
  # ... same as above

but I get this error:

batch_text_or_text_pairs has to be a list (got <class 'keras.engine.keras_tensor.KerasTensor'>)

and besides the fact that I haven't found a way to convert that tensor to a list with a quick Google search, it seems weird that I have to go in and out of TensorFlow in this way.

I've also looked through HuggingFace's documentation, but there is only a single usage example, with a single phrase, and what it does is analogous to my "out of model-building context" example.

EDIT:

I also tried with Lambdas in this way:

tf.executing_eagerly()

def tokenize_tensor(tensor):
  t = tensor.numpy()
  t = np.array([str(s, 'utf-8') for s in t])
  return tokenizer(t.tolist(), return_tensors='tf', add_special_tokens=True, max_length=110, padding='max_length', truncation=True)

def build_classifier_model():
  text_input = tf.keras.layers.Input(shape=(1,), dtype=tf.string, name='text')
  
  encoder_inputs = tf.keras.layers.Lambda(tokenize_tensor, name='tokenize')(text_input)
  ...
  
  outputs = bert(encoder_inputs)

but I get the following error:

'Tensor' object has no attribute 'numpy'
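
Independently of that error, one detail of tokenize_tensor can be checked without TensorFlow: calling .numpy() on a string tensor yields bytes objects rather than str, which is why each element is decoded before it reaches the tokenizer. A minimal sketch of that step, where the list of bytes only stands in for the array a string tensor would produce and decode_batch is a hypothetical helper:

```python
def decode_batch(raw_items, encoding="utf-8"):
    """Turn a sequence of bytes objects into a plain list of str."""
    return [item.decode(encoding) for item in raw_items]


# Stand-in for a batch of two string elements as they look in bytes form.
raw = ["Ciao, come va?".encode("utf-8"), "Perché no?".encode("utf-8")]
phrases = decode_batch(raw)
```

The resulting list of str is the kind of input the tokenizer accepts as a batch.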

EDIT 2:

I also tried the approach suggested by @mdaoust of wrapping everything in a tf.py_function and got this error.

def py_func_tokenize_tensor(tensor):
  return tf.py_function(tokenize_tensor, [tensor], Tout=[tf.int32, tf.int32, tf.int32])

eager_py_func() missing 1 required positional argument: 'Tout'

Then I defined Tout as the type of the value returned by the tokenizer:

transformers.tokenization_utils_base.BatchEncoding

and got the following error:

Expected DataType for argument 'Tout' not <class 'transformers.tokenization_utils_base.BatchEncoding'>

Finally I unpacked the value in the BatchEncoding in the following way:

def tokenize_tensor(tensor):
  t = tensor.numpy()
  t = np.array([str(s, 'utf-8') for s in t])
  dictionary = tokenizer(t.tolist(), return_tensors='tf', add_special_tokens=True, max_length=110, padding='max_length', truncation=True)
  #unpacking
  input_ids = dictionary['input_ids']
  tok_type = dictionary['token_type_ids']
  attention_mask = dictionary['attention_mask']
  return input_ids, tok_type, attention_mask

and got an error on the line below:

...
outputs = bert(encoder_inputs)

ValueError: Cannot take the length of shape with unknown rank.

Choroid answered 15/9, 2021 at 15:28 Comment(0)

For now I solved it by taking the tokenization step out of the model:

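import numpy as np

def tokenize(sentences, tokenizer):
    input_ids, input_masks, input_segments = [],[],[]
    for sentence in sentences:
        # padding='max_length' replaces the deprecated pad_to_max_length=True; truncation=True keeps every row at 128 ids
        inputs = tokenizer.encode_plus(sentence, add_special_tokens=True, max_length=128, padding='max_length', truncation=True, return_attention_mask=True, return_token_type_ids=True)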
        input_ids.append(inputs['input_ids'])
        input_masks.append(inputs['attention_mask'])
        input_segments.append(inputs['token_type_ids'])        
        
    return np.asarray(input_ids, dtype='int32'), np.asarray(input_masks, dtype='int32'), np.asarray(input_segments, dtype='int32')
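
As a rough illustration of what each row of input_ids and input_masks looks like after padding to a fixed length, here is a minimal pure-Python sketch. It assumes the padding id is 0 and that no real token has id 0; the token ids and the pad_and_mask helper are made up for illustration, and the real tokenizer also keeps the closing special token when truncating, which this sketch does not.

```python
def pad_and_mask(token_ids, max_length, pad_id=0):
    """Truncate or right-pad token_ids to max_length and build the attention mask."""
    truncated = token_ids[:max_length]
    n_pad = max_length - len(truncated)
    padded = truncated + [pad_id] * n_pad
    # 1 marks a real token, 0 marks padding.
    mask = [1] * len(truncated) + [0] * n_pad
    return padded, mask


# Hypothetical ids for a short phrase, padded to length 8.
ids, mask = pad_and_mask([102, 2062, 1307, 103], max_length=8)
# With pad_id == 0, comparing the ids against 0 recovers the same mask.
mask_from_ids = [int(tok > 0) for tok in ids]
```

Under those assumptions this is also why a comparison such as bert_input > 0 can serve as an attention mask.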

The model takes two inputs, which are the first two values returned by the tokenize function.

def build_classifier_model():
   input_ids_in = tf.keras.layers.Input(shape=(128,), name='input_token', dtype='int32')
   input_masks_in = tf.keras.layers.Input(shape=(128,), name='masked_token', dtype='int32') 

   embedding_layer = bert(input_ids_in, attention_mask=input_masks_in)[0]
...
   model = tf.keras.Model(inputs=[input_ids_in, input_masks_in], outputs = X)

   for layer in model.layers[:3]:
     layer.trainable = False
   return model

I'd still like to know if someone has a solution that integrates the tokenization step inside the model-building context, so that a user of the model can simply feed phrases to it to get a prediction or to train the model.
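
Short of putting the tokenizer inside the Keras graph, one way to get that kind of interface is a thin wrapper that tokenizes and then calls the model. The sketch below only illustrates the idea: TextClassifierPipeline, toy_tokenize and toy_model are hypothetical stand-ins, and in practice the two callables would be something like the tokenize function above (adapted to return only ids and masks) and the model's predict method.

```python
class TextClassifierPipeline:
    """Bundle a tokenizer function and a model function behind one call."""

    def __init__(self, tokenize_fn, model_fn):
        self.tokenize_fn = tokenize_fn
        self.model_fn = model_fn

    def __call__(self, sentences):
        input_ids, input_masks = self.tokenize_fn(sentences)
        return self.model_fn(input_ids, input_masks)


def toy_tokenize(sentences, max_length=6):
    """Stand-in tokenizer: one fake id per word, right-padded with 0."""
    ids, masks = [], []
    for sentence in sentences:
        words = sentence.split()[:max_length]
        n_pad = max_length - len(words)
        ids.append([len(w) for w in words] + [0] * n_pad)
        masks.append([1] * len(words) + [0] * n_pad)
    return ids, masks


def toy_model(input_ids, input_masks):
    """Stand-in model: returns the number of unmasked tokens per sentence."""
    return [sum(mask) for mask in input_masks]


pipeline = TextClassifierPipeline(toy_tokenize, toy_model)
predictions = pipeline(["Ciao, come va?", "Tutto bene"])
```

This keeps tokenization in plain Python, so it does not make the tokenizer part of the saved model; it only hides the two-step call from the user.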

Choroid answered 26/9, 2021 at 14:51 Comment(3)

did you ever figure out how to achieve this? – Aggregate 9/3, 2022 at 13:43

It's been a while but if I remember well, I never quite managed to achieve this. You can check my code here, the models are in bertgru.py and bertlstm.py – Choroid 9/3, 2022 at 16:33

#71411565 have a look at this too! – Aggregate 9/3, 2022 at 16:41

text input must of type str (single example), List[str] (batch or single pretokenized example) or List[List[str]] (batch of pretokenized examples).

Solution to the above error:

Just use text_input = 'text'

instead of

text_input = tf.keras.layers.Input(shape=(), dtype=tf.string, name='text')

Temikatemp answered 16/9, 2021 at 19:53 Comment(2)

That could probably be a separate question with some more description :) – Temikatemp 17/9, 2021 at 15:51

I just found out that this wasn't a solution to the error, in this way you just pass the string "text" to the tokenizer, which will tokenize it – Choroid 18/9, 2021 at 9:16

It looks like this is not TensorFlow compatible.

https://huggingface.co/dbmdz/bert-base-italian-xxl-cased#model-weights

Currently only PyTorch-Transformers compatible weights are available. If you need access to TensorFlow checkpoints, please raise an issue!

But remember that some things are easier if you don't use Keras's functional model API. That's what the "got <class 'keras.engine.keras_tensor.KerasTensor'>" message is complaining about.

Try passing a tf.Tensor to see if that works. What happens when you try:

text_input = tf.constant('text')

Try writing your model as a subclass of tf.keras.Model.

Emiliaemiliaromagna answered 18/9, 2021 at 12:42 Comment(2)

That's not a problem at all, I already have the tensorflow version working as it can be seen in the first two snippets of code. I got the tensorflow model as they answered me here. My problem is that the tokenizer only accepts strings or list of strings and not tensors. I tried extracting the strings from the tensors, unsuccessfully. – Choroid 18/9, 2021 at 13:43

I see your point. Let me try again. – Emiliaemiliaromagna 23/9, 2021 at 12:54

Yeah, my first answer was wrong.

The problem is that TensorFlow has two types of tensors: eager tensors, which have a value, and "symbolic tensors" or "graph tensors", which don't have a value and are just used to build up a calculation.

Your tokenize_tensor function expects an eager tensor. Only eager tensors have a .numpy() method.

def tokenize_tensor(tensor):
  t = tensor.numpy()
  t = np.array([str(s, 'utf-8') for s in t])
  return tokenizer(t.tolist(), return_tensors='tf', add_special_tokens=True, max_length=110, padding='max_length', truncation=True)

But keras Input is a symbolic tensor.

text_input = tf.keras.layers.Input(shape=(1,), dtype=tf.string, name='text')  
encoder_inputs = tf.keras.layers.Lambda(tokenize_tensor, name='tokenize')(text_input)

To fix this, you can use tf.py_function. It works in graph mode, and will call the wrapped function with eager tensors when the graph is executed, instead of passing it the graph-tensors while the graph is being constructed.

def py_func_tokenize_tensor(tensor):
  return tf.py_function(tokenize_tensor, [tensor])

...

encoder_inputs = tf.keras.layers.Lambda(py_func_tokenize_tensor, name='tokenize')(text_input)

Emiliaemiliaromagna answered 23/9, 2021 at 13:6 Comment(1)

With this approach I had problems with the definition of Tout – Choroid 27/9, 2021 at 15:2

I found this question, "Use `sentence-transformers` inside of a keras model", and this amazing article, https://www.philschmid.de/tensorflow-sentence-transformers, which explain how to do what you're trying to achieve.

The first one uses the py_function approach; the second uses tf.keras.Model to wrap everything into model classes.

Hope this helps anyone arriving here in the future.

Agata answered 8/9, 2022 at 7:35 Comment(0)

This is how to use tf.py_function correctly to create a model that takes strings as input:

import tensorflow as tf
from transformers import AutoTokenizer, TFBertModel

model_name = "dbmdz/bert-base-italian-xxl-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
bert = TFBertModel.from_pretrained(model_name)

def build_model():
    
    text_input = tf.keras.layers.Input(shape=(), dtype=tf.string, name='text')

    def encode_text(text):
        inputs = [tf.compat.as_str(x) for x in text.numpy().tolist()]
        tokenized = tokenizer(
            inputs,
            return_tensors='tf',
            add_special_tokens=True,
            max_length=110,
            padding='max_length',
            truncation=True)
        return tokenized['input_ids'], tokenized['attention_mask']
        
    input_ids, attention_mask = tf.py_function(encode_text, inp=[text_input], Tout=[tf.int32, tf.int32])
    
    input_ids = tf.ensure_shape(input_ids, [None, 110])
    attention_mask = tf.ensure_shape(attention_mask, [None, 110])
    
    outputs = bert(input_ids, attention_mask)
    
    net = outputs['last_hidden_state']

    # Some other layers, this part is not important
    x = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64, return_sequences=True))(net)
    x = tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(1, name='classifier'))(x)

    return tf.keras.Model(inputs=text_input, outputs=x)

I use last_hidden_state instead of pooler_output because that's where the outputs for each token in the sequence are located. (See the discussion here on the difference between last_hidden_state and pooler_output.) We usually use last_hidden_state when doing token-level classification (e.g. named entity recognition).

To use pooler_output would be even simpler, e.g:

net = outputs['pooler_output']
x = tf.keras.layers.Dense(1, name='classifier')(net)
return tf.keras.Model(inputs=text_input, outputs=x)

pooler_output can be used in simpler classification problems (like irony detection), but of course it's still possible to use last_hidden_state to create more powerful models. (When you use bert(input_ids_in, attention_mask=input_masks_in)[0] in your solution, it actually returns last_hidden_state.)
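
To make the shape difference concrete: last_hidden_state holds one vector per token position, i.e. shape (batch, sequence length, hidden size), while pooler_output holds one vector per sequence. If a single vector is wanted from last_hidden_state, one common reduction is a masked mean over the token positions. This is a swapped-in illustration, not what the code above does (it feeds the full sequence to an LSTM), and masked_mean_pool and the toy numbers are hypothetical.

```python
def masked_mean_pool(token_vectors, attention_mask):
    """Average the vectors of unmasked tokens for one sequence."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    hidden_size = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(hidden_size)]


# Toy "last_hidden_state" for one sequence: 4 positions, hidden size 2.
last_hidden_state = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0], [9.0, 9.0]]
attention_mask = [1, 1, 0, 0]
sentence_vector = masked_mean_pool(last_hidden_state, attention_mask)
```

The padded positions (the rows of 9.0 here) are ignored, so only real tokens contribute to the sentence vector.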

Making sure the model works:

model = build_model()
my_phrase = "Ciao, come va?"
model(tf.constant([my_phrase]))

>>> <tf.Tensor: shape=(1, 110, 1), dtype=float32, numpy=...>,

Making sure the HuggingFace part of the model is trainable:

model.summary(show_trainable=True)

Civilized answered 8/9, 2022 at 11:11 Comment(0)
