HuggingFace AutoTokenizer | ValueError: Couldn't instantiate the backend tokenizer

Goal: Amend this Notebook to work with albert-base-v2 model

Error occurs in Section 1.3.

Kernel: conda_pytorch_p36. I did Restart & Run All and refreshed the file view in the working directory.


The error message lists three possible causes; I'm not sure which one applies to my case.

Section 1.3:

# define the tokenizer
tokenizer = AutoTokenizer.from_pretrained(
        configs.output_dir, do_lower_case=configs.do_lower_case)

Traceback:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-25-1f864e3046eb> in <module>
    140 # define the tokenizer
    141 tokenizer = AutoTokenizer.from_pretrained(
--> 142         configs.output_dir, do_lower_case=configs.do_lower_case)
    143 
    144 # Evaluate the original FP32 BERT model

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/transformers/models/auto/tokenization_auto.py in from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs)
    548             tokenizer_class_py, tokenizer_class_fast = TOKENIZER_MAPPING[type(config)]
    549             if tokenizer_class_fast and (use_fast or tokenizer_class_py is None):
--> 550                 return tokenizer_class_fast.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
    551             else:
    552                 if tokenizer_class_py is not None:

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/transformers/tokenization_utils_base.py in from_pretrained(cls, pretrained_model_name_or_path, *init_inputs, **kwargs)
   1752             use_auth_token=use_auth_token,
   1753             cache_dir=cache_dir,
-> 1754             **kwargs,
   1755         )
   1756 

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/transformers/tokenization_utils_base.py in _from_pretrained(cls, resolved_vocab_files, pretrained_model_name_or_path, init_configuration, use_auth_token, cache_dir, *init_inputs, **kwargs)
   1880         # Instantiate tokenizer.
   1881         try:
-> 1882             tokenizer = cls(*init_inputs, **init_kwargs)
   1883         except OSError:
   1884             raise OSError(

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/transformers/models/albert/tokenization_albert_fast.py in __init__(self, vocab_file, tokenizer_file, do_lower_case, remove_space, keep_accents, bos_token, eos_token, unk_token, sep_token, pad_token, cls_token, mask_token, **kwargs)
    159             cls_token=cls_token,
    160             mask_token=mask_token,
--> 161             **kwargs,
    162         )
    163 

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/transformers/tokenization_utils_fast.py in __init__(self, *args, **kwargs)
    116         else:
    117             raise ValueError(
--> 118                 "Couldn't instantiate the backend tokenizer from one of: \n"
    119                 "(1) a `tokenizers` library serialization file, \n"
    120                 "(2) a slow tokenizer instance to convert or \n"

ValueError: Couldn't instantiate the backend tokenizer from one of: 
(1) a `tokenizers` library serialization file, 
(2) a slow tokenizer instance to convert or 
(3) an equivalent slow tokenizer class to instantiate and convert. 
You need to have sentencepiece installed to convert a slow tokenizer to a fast one.

Please let me know if there's anything else I can add to this post.

Poesy answered 13/1, 2022 at 14:37

Comment: Git Issue - Poesy

First, I had to pip install sentencepiece.
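(For reference, whether the sentencepiece runtime is importable can be checked from Python before re-running the cell; the helper name below is made up for illustration and is not part of transformers.)

```python
import importlib.util

def sentencepiece_available():
    # transformers can only convert a slow, SentencePiece-based tokenizer
    # (such as ALBERT's) to a fast one when this package is importable.
    return importlib.util.find_spec("sentencepiece") is not None

print(sentencepiece_available())
```

Note that after installing the package you still need to restart the kernel for the new install to become importable.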

However, the same line of code then raised an error from sentencepiece.

Wrapping str() around both parameters yielded the same traceback:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-12-1f864e3046eb> in <module>
    140 # define the tokenizer
    141 tokenizer = AutoTokenizer.from_pretrained(
--> 142         configs.output_dir, do_lower_case=configs.do_lower_case)
    143 
    144 # Evaluate the original FP32 BERT model

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/transformers/models/auto/tokenization_auto.py in from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs)
    548             tokenizer_class_py, tokenizer_class_fast = TOKENIZER_MAPPING[type(config)]
    549             if tokenizer_class_fast and (use_fast or tokenizer_class_py is None):
--> 550                 return tokenizer_class_fast.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
    551             else:
    552                 if tokenizer_class_py is not None:

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/transformers/tokenization_utils_base.py in from_pretrained(cls, pretrained_model_name_or_path, *init_inputs, **kwargs)
   1752             use_auth_token=use_auth_token,
   1753             cache_dir=cache_dir,
-> 1754             **kwargs,
   1755         )
   1756 

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/transformers/tokenization_utils_base.py in _from_pretrained(cls, resolved_vocab_files, pretrained_model_name_or_path, init_configuration, use_auth_token, cache_dir, *init_inputs, **kwargs)
   1776                 copy.deepcopy(init_configuration),
   1777                 *init_inputs,
-> 1778                 **(copy.deepcopy(kwargs)),
   1779             )
   1780         else:

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/transformers/tokenization_utils_base.py in _from_pretrained(cls, resolved_vocab_files, pretrained_model_name_or_path, init_configuration, use_auth_token, cache_dir, *init_inputs, **kwargs)
   1880         # Instantiate tokenizer.
   1881         try:
-> 1882             tokenizer = cls(*init_inputs, **init_kwargs)
   1883         except OSError:
   1884             raise OSError(

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/transformers/models/albert/tokenization_albert.py in __init__(self, vocab_file, do_lower_case, remove_space, keep_accents, bos_token, eos_token, unk_token, sep_token, pad_token, cls_token, mask_token, sp_model_kwargs, **kwargs)
    179 
    180         self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
--> 181         self.sp_model.Load(vocab_file)
    182 
    183     @property

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/sentencepiece/__init__.py in Load(self, model_file, model_proto)
    365       if model_proto:
    366         return self.LoadFromSerializedProto(model_proto)
--> 367       return self.LoadFromFile(model_file)
    368 
    369 

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/sentencepiece/__init__.py in LoadFromFile(self, arg)
    169 
    170     def LoadFromFile(self, arg):
--> 171         return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
    172 
    173     def DecodeIdsWithCheck(self, ids):

TypeError: not a string
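The trailing "TypeError: not a string" suggests the slow ALBERT tokenizer received vocab_file=None: it loads its SentencePiece vocabulary from a spiece.model file inside the directory passed to from_pretrained, and a fine-tuned BERT output directory contains no such file. A minimal stdlib check along those lines (the helper name has_spiece_model is hypothetical):

```python
import os

def has_spiece_model(model_dir):
    # The slow ALBERT tokenizer resolves its vocab file to
    # "<model_dir>/spiece.model"; if that file is absent, vocab_file
    # ends up None and sentencepiece raises "not a string" on Load().
    return os.path.isfile(os.path.join(model_dir, "spiece.model"))
```

Running this against configs.output_dir before calling AutoTokenizer.from_pretrained would show whether the vocabulary file is simply missing from the output directory.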

I then had to replace the parameters with just the model name:

tokenizer = AlbertTokenizer.from_pretrained('albert-base-v2')

This second part is detailed in this SO post.

Poesy answered 14/1, 2022 at 14:09

Comment: https://mcmap.net/q/414106/-transformers-v4-x-convert-slow-tokenizer-to-fast-tokenizer pip install sentencepiece followed by a kernel/runtime restart solves the issue. - Inequitable

Installing sentencepiece will fix it; see https://github.com/huggingface/transformers/issues/8963:

!pip install sentencepiece
Prerequisite answered 28/9, 2023 at 10:35

Comment: While this link may answer the question, it is better to include the essential parts of the answer here and provide the link for reference. Link-only answers can become invalid if the linked page changes. - From Review - Wares
