
Downloading transformers and BERT files to your local machine


I am trying to replicate the code from this page.

At my workplace we have access to the transformers and pytorch libraries, but we cannot connect to the internet from our Python environment. Could anyone help with how to get the script working after manually downloading the files to my machine?

My specific questions are:

  1. Should I go to the bert-base-uncased repository (at main) and download all the files? Do I have to put them in a folder with a specific name?

How should I change the code below?

# Load pre-trained model tokenizer (vocabulary)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# Tokenize our sentence with the BERT tokenizer.
tokenized_text = tokenizer.tokenize(marked_text)

How should I change the code below?

# Load pre-trained model (weights)
model = BertModel.from_pretrained('bert-base-uncased',
                                  output_hidden_states = True, # Whether the model returns all hidden-states.
                                  )

Please let me know if anyone has done this. Thanks.

### Update 1

I went to the link, manually downloaded all the files to a folder, and specified the path of that folder in my code. The tokenizer works, but the line model = BertModel.from_pretrained('bert-base-uncased', output_hidden_states = True) fails. Any idea what I should do? I noticed that the 4 big files have very strange names when downloaded. Should I rename them to the names shown on the above page? Do I need to download any other files?

The error message is: OSError: Unable to load weights from pytorch checkpoint file for bert-base-uncased2/ at bert-base-uncased/pytorch_model.bin. If you tried to load a PyTorch model from a TF 2 checkpoint, please set from_tf=True


Solution

  • Clone the model repo to download all the files:

    git lfs install
    git clone https://huggingface.co/bert-base-uncased
    
    # if you want to clone without large files – just their pointers
    # prepend your git clone with the following env var:
    GIT_LFS_SKIP_SMUDGE=1
    

    git usage:

    1. Download git from https://git-scm.com/downloads

    2. Paste these into your CLI (terminal):
      a. git lfs install
      b. git clone https://huggingface.co/bert-base-uncased

    3. Wait for the download to finish; it will take some time. If you want, monitor your network activity to follow progress.

    4. Find the current directory by running cd in the Windows Command Prompt (pwd on Linux or macOS) and note the path of the cloned folder (e.g. "C:/Users/........./bert-base-uncased").

    5. Use it as:

       from transformers import BertModel, BertTokenizer
       model = BertModel.from_pretrained("C:/Users/........./bert-base-uncased")
       tokenizer = BertTokenizer.from_pretrained("C:/Users/........./bert-base-uncased")
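
If loading the weights still fails after cloning, one possible cause is that pytorch_model.bin is only a Git LFS pointer rather than the real file, for example when the clone was made with GIT_LFS_SKIP_SMUDGE=1. Below is a minimal sketch of a check, assuming the standard pointer format whose first line starts with "version https://git-lfs.github.com/spec/v1". The helper name is_lfs_pointer is hypothetical and not part of transformers.

```python
# Prefix that opens every Git LFS pointer file (assumed from the LFS pointer format).
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path):
    """Return True if the file looks like a Git LFS pointer instead of real data."""
    with open(path, "rb") as f:
        head = f.read(len(LFS_POINTER_PREFIX))
    return head == LFS_POINTER_PREFIX


# Usage (path elided as in the steps above):
# is_lfs_pointer("C:/Users/........./bert-base-uncased/pytorch_model.bin")
```

If this returns True, the large files were not fetched; cloning again without GIT_LFS_SKIP_SMUDGE (after git lfs install) should bring down the actual weights.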
      

    Manual download, without git:

    1. Download all the files from here https://huggingface.co/bert-base-uncased/tree/main

    2. Put them in a folder named "yourfoldername"

    3. use it as:

       model = BertModel.from_pretrained("C:/Users/........./yourfoldername")
       tokenizer = BertTokenizer.from_pretrained("C:/Users/........./yourfoldername")
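
Since browsers sometimes save files under different names, it may help to confirm that the folder holds the files the two calls above look for before loading. The sketch below assumes that config.json, vocab.txt and pytorch_model.bin (as listed in the bert-base-uncased repository) are the minimum needed for BertTokenizer and BertModel with PyTorch weights. The helper missing_files is hypothetical.

```python
from pathlib import Path

# Assumed minimal file set for BertTokenizer + BertModel (PyTorch weights).
EXPECTED_FILES = ("config.json", "vocab.txt", "pytorch_model.bin")


def missing_files(model_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are not present in model_dir."""
    folder = Path(model_dir)
    return [name for name in expected if not (folder / name).is_file()]


# Usage (path elided as in the steps above):
# missing_files("C:/Users/........./yourfoldername")
```

An empty list means every expected name is present; any name that comes back is a file to download again or rename to match the repository.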
      

    For only the model (manual download, without git):

    1. Just click the download button here and download only the PyTorch pretrained model. It's about 420 MB: https://huggingface.co/bert-base-uncased/blob/main/pytorch_model.bin

    2. Download the config.json file from here: https://huggingface.co/bert-base-uncased/tree/main

    3. Put both of them in a folder named "yourfoldername"

    4. Use it as:

       model = BertModel.from_pretrained("C:/Users/........./yourfoldername")
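
To keep the output_hidden_states = True setting from the question while loading from disk, something like the following may work. It is a sketch rather than a definitive recipe: the function name load_local_bert_model is hypothetical, and it assumes an installed transformers version whose from_pretrained accepts local_files_only.

```python
from pathlib import Path


def load_local_bert_model(model_dir):
    """Load BertModel from a local folder and have it return all hidden states."""
    folder = Path(model_dir)
    # Fail early with a clear message if the folder is not a model directory.
    if not (folder / "config.json").is_file():
        raise FileNotFoundError(f"config.json not found in {folder}")

    # Imported here so the path check above runs before the library is loaded.
    from transformers import BertModel

    model = BertModel.from_pretrained(
        str(folder),
        output_hidden_states=True,  # return hidden states from all layers
        local_files_only=True,      # do not try to reach the Hugging Face Hub
    )
    model.eval()  # inference mode: disables dropout
    return model
```

Call it with the same folder path used in the step above. The tokenizer can be loaded from the same folder with BertTokenizer.from_pretrained, provided vocab.txt was downloaded as well.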