Using Arabert model with SpaCy

spaCy doesn't seem to support the Arabic language, but can I use spaCy with the pretrained AraBERT model?

Is it possible to modify this code so it can accept bert-large-arabertv02 instead of en_core_web_lg?

!python -m spacy download en_core_web_lg
import spacy
nlp = spacy.load("en_core_web_lg")

Here is how AraBERT v0.2 can be loaded:

from arabert.preprocess import ArabertPreprocessor
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_name="aubmindlab/bert-large-arabertv02"
arabert_prep = ArabertPreprocessor(model_name=model_name)  
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)
Unlive answered 13/10, 2022 at 22:0 Comment(0)

spaCy actually does support Arabic, though only at an alpha level, which basically just means tokenization support (see here). That's enough for loading external models or training your own, though, so in this case you should be able to load this like any HuggingFace model - see this FAQ.

For AraBERT, that would look like:

import spacy
import spacy_transformers  # makes sure the "transformer" factory is registered
nlp = spacy.blank("ar") # empty Arabic pipeline
# create the config with the name of your model
# values omitted will get default values
config = {
    "model": {
        "@architectures": "spacy-transformers.TransformerModel.v3",
        "name": "aubmindlab/bert-large-arabertv02"
    }
}
nlp.add_pipe("transformer", config=config)
nlp.initialize() # XXX don't forget this step!
doc = nlp("فريك الذرة لذيذة")
print(doc._.trf_data) # all the Transformer output is stored here

I don't speak Arabic, so I can't check the output thoroughly, but that code ran and produced an embedding for me.
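
If you want to swap model names easily, the same steps can be wrapped in a couple of functions. This is only a sketch: `transformer_config` and `build_pipeline` are names I made up for illustration, not part of spaCy, and it assumes `spacy` and `spacy-transformers` are installed. The imports are placed inside `build_pipeline` so the config helper can be used on its own.

```python
def transformer_config(model_name):
    """Build the config dict for spaCy's "transformer" component.

    Hypothetical helper: values that are omitted fall back to the defaults.
    """
    return {
        "model": {
            "@architectures": "spacy-transformers.TransformerModel.v3",
            "name": model_name,
        }
    }


def build_pipeline(model_name, lang="ar"):
    """Create a blank pipeline whose only component is the transformer."""
    # Imported here so transformer_config() works without spaCy installed.
    import spacy
    import spacy_transformers  # noqa: F401  (needed for the "transformer" factory)

    nlp = spacy.blank(lang)
    nlp.add_pipe("transformer", config=transformer_config(model_name))
    nlp.initialize()  # loads the pretrained weights; don't skip this
    return nlp
```

Calling `build_pipeline("aubmindlab/bert-large-arabertv02")` should give you an `nlp` object equivalent to the one above, with the transformer output available on `doc._.trf_data`.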

Maloy answered 14/10, 2022 at 3:40 Comment(3)
It gives me an error: ValueError: [E002] Can't find factory for 'transformer' for language Arabic (ar). This usually happens when spaCy calls nlp.create_pipe with a custom component name that's not registered on the current language class. If you're using a Transformer, make sure to install 'spacy-transformers'. If you're using a custom component, make sure you've added the decorator @Language.component (for function components) or @Language.factory (for class components). - Unlive
Available factories: attribute_ruler, tok2vec, merge_noun_chunks, merge_entities, merge_subtokens, token_splitter, doc_cleaner, parser, beam_parser, lemmatizer, trainable_lemmatizer, entity_linker, ner, beam_ner, entity_ruler, tagger, morphologizer, senter, sentencizer, textcat, spancat, future_entity_ruler, span_ruler, textcat_multilabel - Unlive
I solved it by adding import spacy_transformers - Unlive
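
For anyone hitting the same E002 error, a quick way to rule out a missing install before building the pipeline is to check that both packages can be found. This is a rough sketch using only the standard library; `missing_packages` is a hypothetical helper name, and it only checks that the modules are importable, not that the factory is actually registered.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level module names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Module names use underscores even though the pip package is "spacy-transformers".
missing = missing_packages(["spacy", "spacy_transformers"])
if missing:
    print("Install before building the pipeline:", ", ".join(missing))
```

If nothing is reported missing and the error persists, adding the explicit `import spacy_transformers` mentioned above is the next thing to try.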
