pythonnumpycsvdata-analysisgenfromtxt

How to extract data from csv by column header


I have csv files (tab separated) that I want to analyse and graph. I can extract the data from the files but I would prefer to do it by using a column header name rather than normal indexing.

i.e instead of:

freq_data = my_data[:,0]

i would use something like:

freq2_data=dataA['Freq']

which would give me just that column of data without a 'nan' for the top field. I want to do it this way in case the data is ordered differently by some people.

What I currently have is:

import os
import csv
import numpy as np
from numpy import genfromtxt

def mylistdir(directory):
    """A specialized version of os.listdir() that ignores files that
    start with a leading period."""
    filelist = os.listdir(directory)
    return [x for x in filelist
            if not (x.startswith('.'))]
path = ("C:\\Users\\priper\\Desktop\\rough_data\\")
results_files = mylistdir(path)
print(results_files)


vel_data = []

for f in results_files:
    f = path + f
    my_data = np.genfromtxt(f, dtype = float, delimiter='\t') #, names = True, max_rows=1
    print(my_data)
    freq_data = my_data[:,0]
    height_data = my_data[:,1]
    width_data = my_data[:,2]
    time_data = my_data[:,3]
    freq2_data=dataA['Freq']
    print(width_data)
    print(freq2_data)

Any ideas as to what I can do?

the csv file:

Freqheight_cmsWidth_cmsTime_secs
"998.2121573301549  44.08897100772889   6.445672191528545   90.0"
"998.2121573301549  46.34952337794475   6.49171270718232    90.0"
"998.2121573301549  39.7907973252776    6.49171270718232    90.0"
"1999.404052443385  42.986804623146725  6.445672191528545   90.0"
"1999.404052443385  38.76177273904744   6.49171270718232    90.0"
"1999.404052443385  46.34952337794475   6.491875969369261   89.59365376669096"
"2997.61620977354   44.08897100772889   6.491875969369261   89.59365376669096"
"2997.61620977354   42.986804623146725  6.537915335317934   89.59651526494126"
"2997.61620977354   44.08897100772889   6.49171270718232    90.0"
"3998.80810488677   47.50820176059876   6.307550644567219   90.0"
"3998.80810488677   46.34952337794475   6.3535911602209945  90.0"
"3998.80810488677   41.903151251584184  6.3997972870975675  89.58780725859766"
"5000.0 38.76177273904744   6.21564013134898    89.57559458063852"
"5000.0 44.08897100772889   6.261510128913444   90.0"
"5000.0 41.903151251584184  6.2616793932272925  89.57871509583141"
"5998.212157330155  33.881963382336906  6.077522459688805   89.5659493678606"
"5998.212157330155  47.50820176059876   5.985444111277719   89.55927192723898"
"5998.212157330155  53.59203690324092   6.123388581952118   90.0"

This is what worked after perusing the answers and tips given from users below.

for f in results_files:
    f = path + f
    data = pd.read_csv(f, sep = '\t')
    length_of_data = len(data)
    print(data.head(length_of_data))
    freqy = data[['Freq']]
    print(freqy)

Solution

  • Use the library pandas: https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.read_csv.html

    import pandas as pd
    
    my_csv = pd.read_csv(filepath, header, names)
    

    "header : int or list of ints, default ‘infer’

    Row number(s) to use as the column names, and the start of the data. Default behavior is to infer the column names: if no names are passed the behavior is identical to header=0 and column names are inferred from the first line of the file, if column names are passed explicitly then the behavior is identical to header=None. Explicitly pass header=0 to be able to replace existing names. The header can be a list of integers that specify row locations for a multi-index on the columns e.g. [0,1,3]. Intervening rows that are not specified will be skipped (e.g. 2 in this example is skipped). Note that this parameter ignores commented lines and empty lines if skip_blank_lines=True, so header=0 denotes the first line of data rather than the first line of the file.

    names : array-like, default None

    List of column names to use. If file contains no header row, then you should explicitly pass header=None. Duplicates in this list will cause a UserWarning to be issued."