python scikit-learn

Download sklearn datasets behind a proxy


I installed scikit-learn in my environment and am now running it in a Jupyter notebook on Windows.

How can I avoid the error:

URLError: urlopen error [Errno 11004] getaddrinfo failed

I am running the following code:

import sklearn
import sklearn.ensemble
import sklearn.metrics
from sklearn.datasets import fetch_20newsgroups
categories = ['alt.atheism', 'soc.religion.christian']
newsgroups_train = fetch_20newsgroups(subset='train', categories=categories)

which fails at the fetch_20newsgroups call:

----> 3 newsgroups_train = fetch_20newsgroups(subset='train', categories=categories)

I am behind a proxy on my work computer. Is there any way to avoid this error and still use the sample datasets?


Solution

  • According to the source code, scikit-learn downloads the file from:

    https://ndownloader.figshare.com/files/5975967
    

    I am assuming that you cannot reach this location from behind the proxy.

    Can you access the dataset by some other means? If so, download it manually, save it as 20newsbydate.tar.gz (the name the script further down expects), and keep it at the location:

    ~/scikit_learn_data/
    

    Here ~ refers to the user's home folder. You can use the following code to find the default location of that folder on your system.

    from sklearn.datasets import get_data_home
    print(get_data_home())
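
As a rough illustration of how that default is chosen, the sketch below mirrors the documented behaviour of get_data_home using only the standard library: an explicit path wins, then the SCIKIT_LEARN_DATA environment variable, then ~/scikit_learn_data, with ~ expanded. The helper name resolve_data_home is hypothetical and is not part of scikit-learn; the real function also creates the folder if it is missing.

```python
import os

def resolve_data_home(data_home=None):
    """Hypothetical stand-in that mimics how get_data_home picks a folder."""
    if data_home is None:
        # Fall back to the environment variable, then to the default folder
        data_home = os.environ.get("SCIKIT_LEARN_DATA",
                                   os.path.join("~", "scikit_learn_data"))
    # '~' is only a shorthand; expand it to the real home directory
    return os.path.expanduser(data_home)

print(resolve_data_home())
```

Setting SCIKIT_LEARN_DATA before starting Jupyter is one way to keep the datasets on a drive other than the home folder.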
    

    Update: Once the archive is in place, use the following script to convert it into the form in which scikit-learn keeps its cache:

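    import codecs, os, pickle, shutil, tarfile
    from sklearn.datasets import load_files
    
    # open() and tarfile do not expand '~', so resolve the real path first
    data_folder = os.path.expanduser('~/scikit_learn_data')
    target_folder = os.path.join(data_folder, '20news_home')
    
    # Unpack the manually downloaded archive
    archive_path = os.path.join(data_folder, '20newsbydate.tar.gz')
    with tarfile.open(archive_path, "r:gz") as archive:
        archive.extractall(path=target_folder)
    
    # Load the train and test splits from the extracted folders
    cache = dict(train=load_files(os.path.join(target_folder, '20news-bydate-train'), encoding='latin1'),
                 test=load_files(os.path.join(target_folder, '20news-bydate-test'), encoding='latin1'))
    
    # The cache file is a zlib-compressed pickle
    compressed_content = codecs.encode(pickle.dumps(cache), 'zlib_codec')
    
    with open(os.path.join(data_folder, '20news-bydate_py3.pkz'), 'wb') as f:
        f.write(compressed_content)
    
    # Remove the extracted files; only the cache is needed
    shutil.rmtree(target_folder)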
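
Before relying on the new cache, it can be worth confirming that it decodes. The script writes a zlib-compressed pickle, so reading it back is the reverse of those two steps. A minimal sketch, where the helper names write_cache and read_cache and the demo file name are hypothetical, and a plain dict in a temporary folder stands in for the real dataset:

```python
import codecs
import os
import pickle
import tempfile

def write_cache(obj, path):
    """Pickle obj, zlib-compress it, and write it to path."""
    with open(path, "wb") as f:
        f.write(codecs.encode(pickle.dumps(obj), "zlib_codec"))

def read_cache(path):
    """Reverse of write_cache: decompress, then unpickle."""
    with open(path, "rb") as f:
        return pickle.loads(codecs.decode(f.read(), "zlib_codec"))

# Demo with stand-in data; the real cache holds the train and test splits
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo_cache.pkz")
    write_cache({"train": ["doc a"], "test": ["doc b"]}, demo_path)
    restored = read_cache(demo_path)

print(sorted(restored))  # ['test', 'train']
```

Pointing read_cache at the real 20news-bydate_py3.pkz in your data folder should return a dict with 'train' and 'test' keys, provided scikit-learn is importable in that session, since the pickled objects are scikit-learn containers.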
    

    Scikit-learn always checks whether the dataset exists locally before attempting to download it from the internet, and it looks in the location above.

    After that, you can run your code as usual.
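
If the proxy does allow outbound HTTPS once it is configured, a manual download may not be needed. The URLError in the question comes from Python's urllib, which scikit-learn uses for the download, so installing a proxy-aware opener before calling fetch_20newsgroups is one thing to try. A minimal sketch, assuming a hypothetical proxy at proxy.example.com:8080 (replace it with your organisation's address, including credentials if required):

```python
import urllib.request

# Hypothetical address: replace with your own proxy
PROXY_URL = "http://proxy.example.com:8080"

def install_proxy(proxy_url):
    """Route all later urllib requests in this process through proxy_url."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    opener = urllib.request.build_opener(handler)
    urllib.request.install_opener(opener)
    return opener

opener = install_proxy(PROXY_URL)
# After this, call fetch_20newsgroups(...) as in the question
```

Setting the HTTP_PROXY and HTTPS_PROXY environment variables before starting Jupyter is an alternative that urllib also honours. Whether either approach works depends on what the proxy permits.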