Is it possible to use a txt or jsonl file in an s3 bucket as the corpus_file
input for a gensim Doc2Vec model? I am looking for something of the form:
Doc2Vec(corpus_file="s3://bucket_name/subdir/sample.jsonl")
When I run the above line, I get the following error:
TypeError: Parameter corpus_file must be a valid path to a file, got 's3://bucket_name/subdir/sample.jsonl' instead.
I have also tried creating an iterator object that iterates through the file and yields its lines, and passing it as the corpus_file argument, but I get the same TypeError.
Please note that I am specifically looking to use the corpus_file argument rather than documents.
The corpus_file mode requires random-seek access to the file, because its technique has every worker thread open its own view onto a distinct range of the file. That kind of access is not well supported over S3 (HTTP GET).
To use corpus_file mode, download the file to a local volume whose filesystem offers efficient seek access.
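For example, a minimal sketch of that download step, assuming the boto3 package and the bucket/key names from the question (note also that corpus_file expects gensim's LineSentence format, one whitespace-tokenized document per line, so a raw .jsonl would need converting first):

    import boto3
    from gensim.models.doc2vec import Doc2Vec

    # Hypothetical names, mirroring the question's path.
    bucket, key = "bucket_name", "subdir/sample.jsonl"
    local_path = "/tmp/sample.corpus"

    # Copy the S3 object onto local disk, where seeks are cheap.
    boto3.client("s3").download_file(bucket, key, local_path)

    # corpus_file now points at a real local file, so each worker
    # thread can open its own view onto a distinct range of it.
    model = Doc2Vec(corpus_file=local_path, vector_size=100, workers=8)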
Or, supply the data as a corpus iterable, which can re-iterate over a remotely streamed file multiple times, but won't achieve the same high thread utilization. (From an iterable, even if you have 16+ cores, you'll usually get optimal throughput with no more than 6-12 worker threads, even if you've eliminated IO and expensive in-iterable preprocessing from the setup. The exact optimal number of workers depends on other model parameters; it's especially sensitive to vector_size, negative, & window.)
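A rough sketch of that iterable alternative, assuming the smart_open package (which gensim itself depends on) and a hypothetical "text" field in each JSON line:

    import json
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument
    from smart_open import open  # handles s3:// URIs transparently

    class S3TaggedCorpus:
        """Restartable iterable: every iteration re-opens the S3 stream,
        so the model can make its multiple training passes over it."""
        def __init__(self, uri):
            self.uri = uri

        def __iter__(self):
            with open(self.uri, "r") as fin:
                for i, line in enumerate(fin):
                    record = json.loads(line)
                    # "text" is a hypothetical field; adapt to your schema,
                    # and substitute real tokenization for .split() as needed.
                    yield TaggedDocument(words=record["text"].split(), tags=[i])

    corpus = S3TaggedCorpus("s3://bucket_name/subdir/sample.jsonl")
    # Passed as documents=, not corpus_file=; per the note above, 6-12
    # workers are usually enough to saturate throughput in this mode.
    model = Doc2Vec(documents=corpus, vector_size=100, workers=8)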