AutoTokenizer.from_pretrained fails to load locally saved pretrained tokenizer (PyTorch)

2024/10/1 5:40:55

I am new to PyTorch and have recently been trying to work with Transformers. I am using the pretrained tokenizers provided by HuggingFace.
Downloading and running them works, but if I save one and try to load it again, an error occurs.
If I use AutoTokenizer.from_pretrained to download a tokenizer, it works.

[1]:    from transformers import AutoTokenizer
        tokenizer = AutoTokenizer.from_pretrained('distilroberta-base')
        text = "Hello there"
        enc = tokenizer.encode_plus(text)
        enc.keys()
Out[1]: dict_keys(['input_ids', 'attention_mask'])

But if I save it using tokenizer.save_pretrained("distilroberta-tokenizer") and try to load it locally, then it fails.

[2]:    tmp = AutoTokenizer.from_pretrained('distilroberta-tokenizer')

---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
/opt/conda/lib/python3.7/site-packages/transformers/configuration_utils.py in get_config_dict(cls, pretrained_model_name_or_path, **kwargs)
    238                 resume_download=resume_download,
--> 239                 local_files_only=local_files_only,
    240             )

/opt/conda/lib/python3.7/site-packages/transformers/file_utils.py in cached_path(url_or_filename, cache_dir, force_download, proxies, resume_download, user_agent, extract_compressed_file, force_extract, local_files_only)
    266         # File, but it doesn't exist.
--> 267         raise EnvironmentError("file {} not found".format(url_or_filename))
    268     else:

OSError: file distilroberta-tokenizer/config.json not found

During handling of the above exception, another exception occurred:

OSError                                   Traceback (most recent call last)
<ipython-input-25-3bd2f7a79271> in <module>
----> 1 tmp = AutoTokenizer.from_pretrained("distilroberta-tokenizer")

/opt/conda/lib/python3.7/site-packages/transformers/tokenization_auto.py in from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs)
    193         config = kwargs.pop("config", None)
    194         if not isinstance(config, PretrainedConfig):
--> 195             config = AutoConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)
    196
    197         if "bert-base-japanese" in pretrained_model_name_or_path:

/opt/conda/lib/python3.7/site-packages/transformers/configuration_auto.py in from_pretrained(cls, pretrained_model_name_or_path, **kwargs)
    194
    195         """
--> 196         config_dict, _ = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
    197
    198         if "model_type" in config_dict:

/opt/conda/lib/python3.7/site-packages/transformers/configuration_utils.py in get_config_dict(cls, pretrained_model_name_or_path, **kwargs)
    250                 f"- or '{pretrained_model_name_or_path}' is the correct path to a directory containing a {CONFIG_NAME} file\n\n"
    251             )
--> 252             raise EnvironmentError(msg)
    253
    254         except json.JSONDecodeError:

OSError: Can't load config for 'distilroberta-tokenizer'. Make sure that:

- 'distilroberta-tokenizer' is a correct model identifier listed on 'https://huggingface.co/models'

- or 'distilroberta-tokenizer' is the correct path to a directory containing a config.json file

It's saying 'config.json' is missing from the directory. On checking the directory, I get this list of files:

[3]:    !ls distilroberta-tokenizer
Out[3]: merges.txt  special_tokens_map.json  tokenizer_config.json  vocab.json

I know this problem has been posted before, but none of the suggested solutions seem to work. I have also tried to follow the docs but still can't make it work.
Any help would be appreciated.

Answer

There is currently an issue under investigation that affects only AutoTokenizer, not the underlying tokenizer classes such as RobertaTokenizer. For example, the following should work:

    from transformers import RobertaTokenizer

    tokenizer = RobertaTokenizer.from_pretrained('YOURPATH')

To work with the AutoTokenizer you also need to save the config to load it offline:

    from transformers import AutoTokenizer, AutoConfig

    tokenizer = AutoTokenizer.from_pretrained('distilroberta-base')
    config = AutoConfig.from_pretrained('distilroberta-base')

    tokenizer.save_pretrained('YOURPATH')
    config.save_pretrained('YOURPATH')

    tokenizer = AutoTokenizer.from_pretrained('YOURPATH')
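
Before loading, it can help to confirm that the directory really contains the file the traceback complains about. The following is a minimal sketch using only the standard library. The helper name missing_files and the REQUIRED_FILES tuple are hypothetical, and the assumption that config.json is the only file to check comes from the error message in the question, not from the library's documentation.

```python
from pathlib import Path

# Assumption: config.json is the file named in the question's traceback.
# Extend this tuple if your setup needs other files to be present.
REQUIRED_FILES = ("config.json",)


def missing_files(directory, required=REQUIRED_FILES):
    """Return the names in `required` that are not regular files in `directory`."""
    root = Path(directory)
    return [name for name in required if not (root / name).is_file()]
```

Calling missing_files('YOURPATH') returns an empty list once config.save_pretrained('YOURPATH') has written the config next to the tokenizer files. A non-empty list names what is still absent.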

I recommend either using different paths for the tokenizer and the model, or keeping the config.json of your model. Some modifications you apply to your model are stored in the config.json created by model.save_pretrained(), and that file will be overwritten if you save the tokenizer as described above after saving your model (i.e. you won't be able to load your modified model with the tokenizer's config.json).
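
One way to follow the "different paths" recommendation is sketched below. The function name save_separately and the "tokenizer" / "model" subdirectory names are hypothetical. The sketch relies only on save_pretrained() being available on the tokenizer, the model, and model.config, and it assumes, as described above, that AutoTokenizer needs a config.json next to the tokenizer files.

```python
from pathlib import Path


def save_separately(tokenizer, model, root):
    """Save a tokenizer and a model under separate subdirectories of `root`.

    `tokenizer` and `model` are expected to expose save_pretrained(), and
    `model.config` is expected to be the model's configuration object.
    Returns the pair (tokenizer_dir, model_dir).
    """
    root = Path(root)
    tokenizer_dir = root / "tokenizer"  # hypothetical layout
    model_dir = root / "model"          # hypothetical layout
    tokenizer_dir.mkdir(parents=True, exist_ok=True)
    model_dir.mkdir(parents=True, exist_ok=True)

    # The model directory holds the model's own config.json; nothing
    # written later in this function touches that directory again.
    model.save_pretrained(str(model_dir))

    # The tokenizer directory gets the tokenizer files plus a copy of the
    # config, so a loader looking for config.json there can find one.
    tokenizer.save_pretrained(str(tokenizer_dir))
    model.config.save_pretrained(str(tokenizer_dir))

    return tokenizer_dir, model_dir
```

With this layout, the tokenizer would be loaded from 'YOURPATH/tokenizer' and the model from 'YOURPATH/model', so saving one never overwrites the other's config.json.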

