I'm looking for a fast way to read a csv file into dataframe. the csv file is a space separated, however, columns names are placed between double quotation if it contains more than one word & contains multiple spaces. pd.read_csv with sep=" " does not work, because column names spaces.
I currently solved this by applying shlex.split
on every line of the file, changing it to comma separated. However, it is taking too long: ~6 seconds for a file with 15K lines. Below is an example of my file ('template.csv'
) and code snippet on how it is solved through shlex.split
.
Appreciate the help in advance!
a b c "d e " "f g " "h k "
1 2 3 4 5 6
2 2 3 4 5 6
3 2 3 4 5 6
4 2 3 4 5 6
5 2 3 4 5 6
6 2 3 4 5 6
This is the code and desired dataframe output:
import pandas as pd
import shlex
data = []
df = pd.DataFrame()
for line in open(r'template.csv'):
line = shlex.split(line)
data.append(line)
df = pd.DataFrame(data[1:], columns=[colName for colName in data[0]])
a b c d e f g h k
0 1 2 3 4 5 6
1 2 2 3 4 5 6
2 3 2 3 4 5 6
3 4 2 3 4 5 6
4 5 2 3 4 5 6
5 6 2 3 4 5 6
You can set the separator:
# or suggested by @AndrejKesely, sep=r'\s+'
df = pd.read_csv('template.csv', sep=' ')
print(df)
# Output
a b c d e f g h k
0 1 2 3 4 5 6
1 2 2 3 4 5 6
2 3 2 3 4 5 6
3 4 2 3 4 5 6
4 5 2 3 4 5 6
5 6 2 3 4 5 6
Alternative with shlex
but use it only on the first row:
with open('template.csv') as fp:
headers = shlex.split(fp.readline())
df = pd.read_csv(fp, header=None, names=headers, sep=' ')