pythonpandascsvpython-regex

How to parse CSV with incorrectly separated rows to pandas dataframe


How to store properly in a pandas Datframe?

link to dataset:"https://lib.stat.cmu.edu/datasets/boston"

Snippet from data

[ 21 lines of header text]
 0.00632  18.00   2.310  0  0.5380  6.5750  65.20  4.0900   1  296.0  15.30
  396.90   4.98  24.00
 0.02731   0.00   7.070  0  0.4690  6.4210  78.90  4.9671   2  242.0  17.80
  396.90   9.14  21.60
 0.02729   0.00   7.070  0  0.4690  7.1850  61.10  4.9671   2  242.0  17.80
  392.83   4.03  34.70
 0.03237   0.00   2.180  0  0.4580  6.9980  45.80  6.0622   3  222.0  18.70
  394.63   2.94  33.40
# Importing
data_url = "http://lib.stat.cmu.edu/datasets/boston"
# Define your column names
column_names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
boston = pd.read_csv(data_url, skiprows=21, sep =r'\s+',  names = column_names, header= None, engine='python')

boston.head(2)

Result with current approach:

>>> boston.head()
        CRIM     ZN  INDUS  CHAS    NOX     RM   AGE     DIS  RAD    TAX  PTRATIO   B  LSTAT  MEDV
0    0.00632  18.00   2.31   0.0  0.538  6.575  65.2  4.0900  1.0  296.0     15.3 NaN    NaN   NaN
1  396.90000   4.98  24.00   NaN    NaN    NaN   NaN     NaN  NaN    NaN      NaN NaN    NaN   NaN
...

Solution

  • It's probably better to not try to process your input data as a CSV - mainly because it's not really a CSV file!

    You could do this:

    import requests
    import pandas as pd
    from itertools import cycle
    from collections import defaultdict
    
    with requests.get("http://lib.stat.cmu.edu/datasets/boston") as response:
        response.raise_for_status()
        lines = response.text.splitlines()
        data = defaultdict(list)
        columns = [line.split()[0] for line in lines[7:21]]
        keys = cycle(columns)
        for line in lines[22:]:
            for token in line.split():
                data[next(keys)].append(token)
        df = pd.DataFrame(data)
        print(df.head())
    

    Output:

          CRIM     ZN  INDUS CHAS     NOX      RM    AGE     DIS RAD    TAX PTRATIO       B LSTAT   MEDV
    0  0.00632  18.00  2.310    0  0.5380  6.5750  65.20  4.0900   1  296.0   15.30  396.90  4.98  24.00
    1  0.02731   0.00  7.070    0  0.4690  6.4210  78.90  4.9671   2  242.0   17.80  396.90  9.14  21.60
    2  0.02729   0.00  7.070    0  0.4690  7.1850  61.10  4.9671   2  242.0   17.80  392.83  4.03  34.70
    3  0.03237   0.00  2.180    0  0.4580  6.9980  45.80  6.0622   3  222.0   18.70  394.63  2.94  33.40
    4  0.06905   0.00  2.180    0  0.4580  7.1470  54.20  6.0622   3  222.0   18.70  396.90  5.33  36.20