python gensim word2vec tar.gz

Load word2vec model that is in .tar format


I want to load a previously trained word2vec model into gensim. The trouble is the file format: it is not a .bin file but a .tar archive, namely deu-ch_web-public_2019_1M.tar.gz from the University of Leipzig. The file is also listed on HuggingFace, where different word2vec models for English and German are listed.

First I tried:

from gensim.models import KeyedVectors
model = KeyedVectors.load_word2vec_format('deu-ch_web-public_2019_1M.tar.gz')

--> ValueError: invalid literal for int() with base 10: 'deu-ch_web-public_2019_1M

Then I unzipped the file with 7-Zip and tried the following:

from gensim.models import KeyedVectors
model = KeyedVectors.load_word2vec_format('deu-ch_web-public_2019_1M.tar')

--> ValueError: invalid literal for int() with base 10: 'deu-ch_web-public_2019_1M

from gensim.models import word2vec
model = word2vec.Word2Vec.load('deu-ch_web-public_2019_1M.tar')

--> UnpicklingError: could not find MARK

Then I got a bit desperate...

import gensim.downloader
model = gensim.downloader.load('deu-ch_web-public_2019_1M.tar')

--> ValueError: Incorrect model/corpus name

Googling around, I found useful information on how to load a .bin model with gensim (see here and here). Following this thread, it seems tricky to load a .tar file with gensim, especially if there is not one .txt file but five, as in this case. I found one answer on how to read a .tar file, but with tensorflow. Since I am not familiar with tensorflow, I would prefer to use gensim. Any thoughts on how to solve the issue are appreciated.


Solution

  • A .tar file is a bundle of one or more directories and files – see https://en.wikipedia.org/wiki/Tar_(computing) – and thus not the sort of single-model file that you should expect Gensim to open directly.

    Rather, as with a .zip file, you'd use some purpose-specific software to extract the content inside the .tar into individual files, then point Gensim at those individually, if they're in formats Gensim understands.

    A typical command-line operation to extract the individual file(s) from a .tar.gz file (which is both tarred & gzipped) would be:

    tar -xvzf deu-ch_web-public_2019_1M.tar.gz

    The flags tell tar to extract (x) with verbose reporting (v), un-gzipping as it goes (z), from the file (f) deu-ch_web-public_2019_1M.tar.gz. Then you'll have one or more new local files, which are the actual (not-packaged-up) files of interest.

    In some graphical file explorers, like the macOS Finder, simply double-clicking deu-ch_web-public_2019_1M.tar.gz to perform the default 'open' action will do this expansion (no tar command line needed).
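
If you'd rather stay in Python than use the tar command or a GUI tool, the standard-library tarfile module can do the same extraction. A minimal sketch, where the function name extract_archive and the destination directory are just illustrative choices:

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path, dest_dir="."):
    """Extract a .tar or .tar.gz archive into dest_dir.

    Returns the list of member names found in the archive.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # mode "r:*" lets tarfile detect the compression (gzip, bz2, none, ...)
    with tarfile.open(archive_path, mode="r:*") as archive:
        names = archive.getnames()
        archive.extractall(path=dest)
    return names
```

Calling extract_archive('deu-ch_web-public_2019_1M.tar.gz', 'corpus') should leave the unpacked files under corpus/. As with any archive tool, only extract files from sources you trust.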

    But note: the University of Leipzig page you've linked describes these files as 'corpora' (training texts), not trained sets of word-vectors or word2vec models.

    And I looked at the "2019 - switzerland - public web file" you're referring to, and inside is a directory (folder) deu-ch_web-public_2019_1M, containing 7 .txt files in various formats and 1 .sql file. But none of those are any sort of trained word-vectors - just text & text-statistics.
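
To check for yourself what such an archive contains, without extracting it, you can list its members with the standard-library tarfile module. A small sketch (list_archive is a hypothetical helper name, not part of any library):

```python
import tarfile


def list_archive(archive_path):
    """Return (name, size_in_bytes) for every regular file in the archive."""
    with tarfile.open(archive_path, mode="r:*") as archive:
        return [
            (member.name, member.size)
            for member in archive.getmembers()
            if member.isfile()  # skip directory entries, links, etc.
        ]
```

Looping over list_archive('deu-ch_web-public_2019_1M.tar.gz') and printing each pair gives a quick view of the file names and sizes, which is usually enough to tell text files from a saved model.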

    You could use those to train a model yourself. The deu-ch_web-public_2019_1M-sentences.txt is closest to what you need, as 1 million plain-text sentences.

    But it's not yet in a form fully ready for word2vec training. Each line has a redundant line-number at the front, and the text hasn't yet been tokenized into word-tokens (which would potentially remove punctuation, or sometimes keep punctuation as distinct tokens). And, at a mere 15 million words total, it's still fairly small as a corpus for creating a powerful word2vec model.
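
As a rough sketch of that preparation and training: the code below assumes each line of the sentences file is a line number, then whitespace, then the sentence, and it assumes Gensim 4.x (where the dimensionality parameter is vector_size; it was size in 3.x). The names parse_line, SentenceCorpus and train_word2vec are illustrative, the regex tokenizer is a deliberately crude stand-in for a proper German tokenizer, and the hyperparameters are placeholders rather than recommendations.

```python
import re
from pathlib import Path

# Crude tokenizer: runs of letters/digits/underscores; punctuation is dropped.
TOKEN_RE = re.compile(r"\w+")


def parse_line(line):
    """Drop the leading line number and return lowercase word tokens."""
    parts = line.strip().split(None, 1)  # [line_number, sentence]
    if len(parts) < 2:
        return []  # blank line, or a number with no sentence
    return TOKEN_RE.findall(parts[1].lower())


class SentenceCorpus:
    """Restartable iterable of token lists, one per non-empty sentence.

    Word2Vec makes several passes over its input, so a one-shot
    generator would not be enough; this re-opens the file each time.
    """

    def __init__(self, path):
        self.path = Path(path)

    def __iter__(self):
        with self.path.open(encoding="utf-8") as handle:
            for line in handle:
                tokens = parse_line(line)
                if tokens:
                    yield tokens


def train_word2vec(path):
    """Train a Word2Vec model on the sentences file (requires gensim 4.x)."""
    from gensim.models import Word2Vec  # imported lazily; third-party

    return Word2Vec(
        sentences=SentenceCorpus(path),
        vector_size=100,  # placeholder values, not tuned
        window=5,
        min_count=5,
        workers=4,
    )
```

After extraction, something like model = train_word2vec('deu-ch_web-public_2019_1M/deu-ch_web-public_2019_1M-sentences.txt') followed by model.wv.most_similar('schweiz') would let you inspect the result, though with a corpus this small the neighbours may be noisy.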