python pandas dataframe import uci

Dataframes from .data, .names and .test files using pandas


I am trying to work on the adult dataset, available at this link.

At the moment I'm stuck because the downloaded files come in formats I'm not familiar with, so I can't correctly load them into a pandas dataframe.

I am able to download 3 files from UCI using the following links:

data = 'https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data'  
names = 'https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.names'
test = 'https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test'

Their extensions are .data, .names and .test, respectively. I have only ever worked with .csv files, so I am a little confused by these.

How can I get a pandas dataframe with the train data (= data + names) and a pandas dataframe with the test data (= test + names)?

This code only partly works:

import pandas as pd

train_df = pd.read_csv(r'./adult.data', header=None)
train_df.head()  # WORKING (without column names)

df_names = pd.read_csv(r'./adult.names')
df_names.head()  # ERROR

test_df = pd.read_csv(r'./adult.test')
test_df.head()  # ERROR

Solution

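  • Read the column names from adult.names with a regular expression, then pass them to read_csv for both adult.data and adult.test: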

    import pandas as pd
    import re
    
    # adult.names
    with open('adult.names') as fp:
        cols = []
        for line in fp:
            sre = re.match(r'(?P<colname>[a-z\-]+):.*\.', line)
            if sre:
                cols.append(sre.group('colname'))
        cols.append('label')
    
    # Python >= 3.8, walrus operator
    # with open('adult.names') as fp:
    #     cols = [sre.group('colname') for line in fp
    #                 if (sre := re.match(r'(?P<colname>[a-z\-]+):.*\.', line))]
    #     cols.append('label')
    
    # fields are separated by ', ', so strip the space after each comma
    options = {'header': None, 'names': cols, 'skipinitialspace': True}
    
    # adult.data
    train_df = pd.read_csv('adult.data', **options)
    
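    # adult.test: the first line is a comment, not data, so skip it
    test_df = pd.read_csv('adult.test', skiprows=1, **options)
    # test labels end with a trailing '.', unlike the training labels
    test_df['label'] = test_df['label'].str.rstrip('.')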
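To try the approach without downloading anything, here is a minimal sketch that runs the same steps on small in-memory stand-ins. The contents of NAMES_TEXT and TEST_TEXT are invented, shortened samples that only mimic the layout of adult.names and adult.test (three attributes instead of the full set). The helper parse_column_names is a hypothetical name introduced for this example.

```python
import io
import re

import pandas as pd

# Invented miniature stand-in for adult.names: comment lines start with '|',
# the class line starts with '>', attribute lines look like 'name: values.'
NAMES_TEXT = """\
| comment line
>50K, <=50K.

age: continuous.
workclass: Private, Self-emp-not-inc.
hours-per-week: continuous.
"""

# Invented miniature stand-in for adult.test: one leading comment line,
# then comma-plus-space separated rows whose label ends with '.'
TEST_TEXT = """\
|1x3 Cross validator
25, Private, 40, <=50K.
38, Self-emp-not-inc, 50, >50K.
"""


def parse_column_names(fp):
    """Collect attribute names from lines shaped like 'name: values.'."""
    pattern = re.compile(r'(?P<colname>[a-z\-]+):.*\.')
    names = []
    for line in fp:
        match = pattern.match(line)
        if match:
            names.append(match.group('colname'))
    return names


# The target column is not listed as an attribute, so append it by hand.
cols = parse_column_names(io.StringIO(NAMES_TEXT)) + ['label']

test_df = pd.read_csv(
    io.StringIO(TEST_TEXT),
    skiprows=1,             # skip the leading comment line
    header=None,            # the file has no header row
    names=cols,
    skipinitialspace=True,  # drop the space after each comma
)
test_df['label'] = test_df['label'].str.rstrip('.')

print(cols)
print(test_df)
```

With the real files, replacing the io.StringIO objects with open file handles or file paths should give the same behaviour. It is still worth checking that len(cols) matches the number of fields per row.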