How to parse CSV with incorrectly separated rows to pandas dataframe

How to store properly in a pandas Datframe?

link to dataset:"https://lib.stat.cmu.edu/datasets/boston"

Snippet from data

[ 21 lines of header text]
 0.00632  18.00   2.310  0  0.5380  6.5750  65.20  4.0900   1  296.0  15.30
  396.90   4.98  24.00
 0.02731   0.00   7.070  0  0.4690  6.4210  78.90  4.9671   2  242.0  17.80
  396.90   9.14  21.60
 0.02729   0.00   7.070  0  0.4690  7.1850  61.10  4.9671   2  242.0  17.80
  392.83   4.03  34.70
 0.03237   0.00   2.180  0  0.4580  6.9980  45.80  6.0622   3  222.0  18.70
  394.63   2.94  33.40

# Importing
data_url = "http://lib.stat.cmu.edu/datasets/boston"
# Define your column names
column_names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
boston = pd.read_csv(data_url, skiprows=21, sep =r'\s+',  names = column_names, header= None, engine='python')

boston.head(2)

Result with current approach:

>>> boston.head()
        CRIM     ZN  INDUS  CHAS    NOX     RM   AGE     DIS  RAD    TAX  PTRATIO   B  LSTAT  MEDV
0    0.00632  18.00   2.31   0.0  0.538  6.575  65.2  4.0900  1.0  296.0     15.3 NaN    NaN   NaN
1  396.90000   4.98  24.00   NaN    NaN    NaN   NaN     NaN  NaN    NaN      NaN NaN    NaN   NaN
...

Solution

It's probably better to not try to process your input data as a CSV - mainly because it's not really a CSV file!

You could do this:

import requests
import pandas as pd
from itertools import cycle
from collections import defaultdict

with requests.get("http://lib.stat.cmu.edu/datasets/boston") as response:
    response.raise_for_status()
    lines = response.text.splitlines()
    data = defaultdict(list)
    columns = [line.split()[0] for line in lines[7:21]]
    keys = cycle(columns)
    for line in lines[22:]:
        for token in line.split():
            data[next(keys)].append(token)
    df = pd.DataFrame(data)
    print(df.head())

Output:

      CRIM     ZN  INDUS CHAS     NOX      RM    AGE     DIS RAD    TAX PTRATIO       B LSTAT   MEDV
0  0.00632  18.00  2.310    0  0.5380  6.5750  65.20  4.0900   1  296.0   15.30  396.90  4.98  24.00
1  0.02731   0.00  7.070    0  0.4690  6.4210  78.90  4.9671   2  242.0   17.80  396.90  9.14  21.60
2  0.02729   0.00  7.070    0  0.4690  7.1850  61.10  4.9671   2  242.0   17.80  392.83  4.03  34.70
3  0.03237   0.00  2.180    0  0.4580  6.9980  45.80  6.0622   3  222.0   18.70  394.63  2.94  33.40
4  0.06905   0.00  2.180    0  0.4580  7.1470  54.20  6.0622   3  222.0   18.70  396.90  5.33  36.20