I'm trying to build the model illustrated in this picture:
I obtained a pre-trained BERT and respective tokenizer from HuggingFace's transformers
in the following way:
from transformers import AutoTokenizer, TFBertModel
model_name = "dbmdz/bert-base-italian-xxl-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
bert = TFBertModel.from_pretrained(model_name)
The model will be fed a sequence of italian tweets and will need to determine if they are ironic or not.
I'm having problems building the initial part of the model, which takes the inputs and feeds them to the tokenizer in order to get a representation I can feed to BERT.
I can do it outside of the model-building context:
my_phrase = "Ciao, come va?"
# an equivalent version is tokenizer(my_phrase, other parameters)
bert_input = tokenizer.encode(my_phrase, add_special_tokens=True, return_tensors='tf', max_length=110, padding='max_length', truncation=True)
attention_mask = bert_input > 0
outputs = bert(bert_input, attention_mask)['pooler_output']
but I'm having troubles building a model that does this. Here is the code for building such a model (the problem is in the first 4 lines ):
def build_classifier_model():
text_input = tf.keras.layers.Input(shape=(), dtype=tf.string, name='text')
encoder_inputs = tokenizer(text_input, return_tensors='tf', add_special_tokens=True, max_length=110, padding='max_length', truncation=True)
outputs = bert(encoder_inputs)
net = outputs['pooler_output']
X = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64, return_sequences=True, dropout=0.1, recurrent_dropout=0.1))(net)
X = tf.keras.layers.Concatenate(axis=-1)([X, input_layer])
X = tf.keras.layers.MaxPooling1D(20)(X)
X = tf.keras.layers.SpatialDropout1D(0.4)(X)
X = tf.keras.layers.Flatten()(X)
X = tf.keras.layers.Dense(128, activation="relu")(X)
X = tf.keras.layers.Dropout(0.25)(X)
X = tf.keras.layers.Dense(2, activation='softmax')(X)
model = tf.keras.Model(inputs=text_input, outputs = X)
return model
And when I call the function for creating this model I get this error:
text input must of type
str
(single example),List[str]
(batch or single pretokenized example) orList[List[str]]
(batch of pretokenized examples).
One thing I thought was that maybe I had to use the tokenizer.batch_encode_plus
function which works with lists of strings:
class BertPreprocessingLayer(tf.keras.layers.Layer):
def __init__(self, tokenizer, maxlength):
super().__init__()
self._tokenizer = tokenizer
self._maxlength = maxlength
def call(self, inputs):
print(type(inputs))
print(inputs)
tokenized = tokenizer.batch_encode_plus(inputs, add_special_tokens=True, return_tensors='tf', max_length=self._maxlength, padding='max_length', truncation=True)
return tokenized
def build_classifier_model():
text_input = tf.keras.layers.Input(shape=(), dtype=tf.string, name='text')
encoder_inputs = BertPreprocessingLayer(tokenizer, 100)(text_input)
outputs = bert(encoder_inputs)
net = outputs['pooler_output']
# ... same as above
but I get this error:
batch_text_or_text_pairs has to be a list (got <class 'keras.engine.keras_tensor.KerasTensor'>)
and beside the fact I haven't found a way to convert that tensor to a list with a quick google search, it seems weird that I have to go in and out of tensorflow in this way.
I've also looked up on the huggingface's documentation but there is only a single usage example, with a single phrase, and what they do is analogous at my "out of model-building context" example.
EDIT:
I also tried with Lambda
s in this way:
tf.executing_eagerly()
def tokenize_tensor(tensor):
t = tensor.numpy()
t = np.array([str(s, 'utf-8') for s in t])
return tokenizer(t.tolist(), return_tensors='tf', add_special_tokens=True, max_length=110, padding='max_length', truncation=True)
def build_classifier_model():
text_input = tf.keras.layers.Input(shape=(1,), dtype=tf.string, name='text')
encoder_inputs = tf.keras.layers.Lambda(tokenize_tensor, name='tokenize')(text_input)
...
outputs = bert(encoder_inputs)
but I get the following error:
'Tensor' object has no attribute 'numpy'
EDIT 2:
I also tried the approach suggested by @mdaoust of wrapping everything in a tf.py_function
and got this error.
def py_func_tokenize_tensor(tensor):
return tf.py_function(tokenize_tensor, [tensor], Tout=[tf.int32, tf.int32, tf.int32])
eager_py_func() missing 1 required positional argument: 'Tout'
Then I defined Tout as the type of the value returned by the tokenizer:
transformers.tokenization_utils_base.BatchEncoding
and got the following error:
Expected DataType for argument 'Tout' not <class 'transformers.tokenization_utils_base.BatchEncoding'>
Finally I unpacked the value in the BatchEncoding in the following way:
def tokenize_tensor(tensor):
t = tensor.numpy()
t = np.array([str(s, 'utf-8') for s in t])
dictionary = tokenizer(t.tolist(), return_tensors='tf', add_special_tokens=True, max_length=110, padding='max_length', truncation=True)
#unpacking
input_ids = dictionary['input_ids']
tok_type = dictionary['token_type_ids']
attention_mask = dictionary['attention_mask']
return input_ids, tok_type, attention_mask
And get an error in the line below:
...
outputs = bert(encoder_inputs)
ValueError: Cannot take the length of shape with unknown rank.