Question 1

I am new to PyTorch and have recently been trying to work with Transformers. I am using the pretrained tokenizers provided by HuggingFace.
I can download and run them successfully, but when I save one and try to load it again, an error occurs.
If I use AutoTokenizer.from_pretrained to download a tokenizer, it works.

[1]:    tokenizer = AutoTokenizer.from_pretrained('distilroberta-base')
        text = "Hello there"
        enc = tokenizer.encode_plus(text)
        enc.keys()
Out[1]: dict_keys(['input_ids', 'attention_mask'])

But if I save it using tokenizer.save_pretrained("distilroberta-tokenizer") and try to load it locally, then it fails.

[2]:    tmp = AutoTokenizer.from_pretrained('distilroberta-tokenizer')
---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
/opt/conda/lib/python3.7/site-packages/transformers/configuration_utils.py in get_config_dict(cls, pretrained_model_name_or_path, **kwargs)
    238                 resume_download=resume_download,
--> 239                 local_files_only=local_files_only,
    240             )

/opt/conda/lib/python3.7/site-packages/transformers/file_utils.py in cached_path(url_or_filename, cache_dir, force_download, proxies, resume_download, user_agent, extract_compressed_file, force_extract, local_files_only)
    266         # File, but it doesn't exist.
--> 267         raise EnvironmentError("file {} not found".format(url_or_filename))
    268     else:

OSError: file distilroberta-tokenizer/config.json not found

During handling of the above exception, another exception occurred:

OSError                                   Traceback (most recent call last)
<ipython-input-25-3bd2f7a79271> in <module>
----> 1 tmp = AutoTokenizer.from_pretrained("distilroberta-tokenizer")

/opt/conda/lib/python3.7/site-packages/transformers/tokenization_auto.py in from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs)
    193         config = kwargs.pop("config", None)
    194         if not isinstance(config, PretrainedConfig):
--> 195             config = AutoConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)
    196 
    197         if "bert-base-japanese" in pretrained_model_name_or_path:

/opt/conda/lib/python3.7/site-packages/transformers/configuration_auto.py in from_pretrained(cls, pretrained_model_name_or_path, **kwargs)
    194 
    195         """
--> 196         config_dict, _ = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
    197 
    198         if "model_type" in config_dict:

/opt/conda/lib/python3.7/site-packages/transformers/configuration_utils.py in get_config_dict(cls, pretrained_model_name_or_path, **kwargs)
    250                 f"- or '{pretrained_model_name_or_path}' is the correct path to a directory containing a {CONFIG_NAME} file\n\n"
    251             )
--> 252             raise EnvironmentError(msg)
    253 
    254         except json.JSONDecodeError:

OSError: Can't load config for 'distilroberta-tokenizer'. Make sure that:

- 'distilroberta-tokenizer' is a correct model identifier listed on 'https://huggingface.co/models'

- or 'distilroberta-tokenizer' is the correct path to a directory containing a config.json file

It says that 'config.json' is missing from the directory. When I check the directory, I see these files:

[3]:    !ls distilroberta-tokenizer
Out[3]: merges.txt  special_tokens_map.json  tokenizer_config.json  vocab.json

I know this problem has been posted before, but none of the suggested solutions seem to work. I have also tried following the docs but still can't make it work.
Any help would be appreciated.

Answer

There is currently an issue under investigation that affects only the AutoTokenizer, not the underlying tokenizers such as RobertaTokenizer. For example, the following should work:

from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained('YOURPATH')

To use the AutoTokenizer, you also need to save the config so that it can be loaded offline:

from transformers import AutoTokenizer, AutoConfig

tokenizer = AutoTokenizer.from_pretrained('distilroberta-base')
config = AutoConfig.from_pretrained('distilroberta-base')

tokenizer.save_pretrained('YOURPATH')
config.save_pretrained('YOURPATH')

tokenizer = AutoTokenizer.from_pretrained('YOURPATH')
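
Before calling AutoTokenizer.from_pretrained on a local directory, it can help to confirm that config.json is actually there. The following is a minimal sketch using only the standard library. The helper name missing_for_auto_tokenizer and the list of expected files are illustrative assumptions taken from the error message and directory listing in the question; they are not part of the transformers API, and other tokenizers may save different files.

```python
from pathlib import Path

# Assumed list of files, based on the question: the four files that
# tokenizer.save_pretrained() wrote, plus the config.json that
# AutoTokenizer complained about.
EXPECTED_FILES = (
    "config.json",
    "tokenizer_config.json",
    "vocab.json",
    "merges.txt",
)


def missing_for_auto_tokenizer(directory, expected=EXPECTED_FILES):
    """Return the expected file names that are absent from `directory`."""
    root = Path(directory)
    return [name for name in expected if not (root / name).is_file()]
```

For the directory shown in the question, this would be expected to report only config.json as missing, which matches the OSError above.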

I recommend either using different paths for the tokenizer and the model, or keeping the config.json of your model. Some modifications you apply to your model are stored in the config.json created by model.save_pretrained(), and that file will be overwritten if you save the tokenizer as described above after saving your model (i.e. you won't be able to load your modified model with the tokenizer's config.json).
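
One way to follow the "different paths" recommendation is sketched below. It is only an illustration: the function name save_in_separate_dirs and the "model" / "tokenizer" subdirectory names are hypothetical choices, and the function assumes nothing about its arguments except that each has a save_pretrained(directory) method, as transformers models, tokenizers, and configs do.

```python
from pathlib import Path


def save_in_separate_dirs(model, tokenizer, config, root):
    """Save the model and the tokenizer under different subdirectories of `root`.

    Any objects exposing save_pretrained(directory) can be passed in.
    Returns the (model_dir, tokenizer_dir) paths.
    """
    model_dir = Path(root) / "model"          # hypothetical layout
    tokenizer_dir = Path(root) / "tokenizer"  # hypothetical layout

    # Create the directories up front in case save_pretrained expects them to exist.
    model_dir.mkdir(parents=True, exist_ok=True)
    tokenizer_dir.mkdir(parents=True, exist_ok=True)

    # The model writes its own config.json, including any modifications.
    model.save_pretrained(str(model_dir))

    # The tokenizer directory gets its own config.json for AutoTokenizer,
    # so the model's config.json is never overwritten.
    tokenizer.save_pretrained(str(tokenizer_dir))
    config.save_pretrained(str(tokenizer_dir))

    return model_dir, tokenizer_dir
```

With this layout, AutoTokenizer.from_pretrained would be pointed at the tokenizer directory and the model class at the model directory, so each one reads its own config.json.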

AutoTokenizer.from_pretrained fails to load locally saved pretrained tokenizer (PyTorch)
