I have csv files (tab separated) that I want to analyse and graph. I can extract the data from the files but I would prefer to do it by using a column header name rather than normal indexing.
i.e instead of:
freq_data = my_data[:,0]
i would use something like:
freq2_data=dataA['Freq']
which would give me just that column of data without a 'nan' for the top field. I want to do it this way in case the data is ordered differently by some people.
What I currently have is:
import os
import csv
import numpy as np
from numpy import genfromtxt
def mylistdir(directory):
"""A specialized version of os.listdir() that ignores files that
start with a leading period."""
filelist = os.listdir(directory)
return [x for x in filelist
if not (x.startswith('.'))]
path = ("C:\\Users\\priper\\Desktop\\rough_data\\")
results_files = mylistdir(path)
print(results_files)
vel_data = []
for f in results_files:
f = path + f
my_data = np.genfromtxt(f, dtype = float, delimiter='\t') #, names = True, max_rows=1
print(my_data)
freq_data = my_data[:,0]
height_data = my_data[:,1]
width_data = my_data[:,2]
time_data = my_data[:,3]
freq2_data=dataA['Freq']
print(width_data)
print(freq2_data)
Any ideas as to what I can do?
the csv file:
Freqheight_cmsWidth_cmsTime_secs
"998.2121573301549 44.08897100772889 6.445672191528545 90.0"
"998.2121573301549 46.34952337794475 6.49171270718232 90.0"
"998.2121573301549 39.7907973252776 6.49171270718232 90.0"
"1999.404052443385 42.986804623146725 6.445672191528545 90.0"
"1999.404052443385 38.76177273904744 6.49171270718232 90.0"
"1999.404052443385 46.34952337794475 6.491875969369261 89.59365376669096"
"2997.61620977354 44.08897100772889 6.491875969369261 89.59365376669096"
"2997.61620977354 42.986804623146725 6.537915335317934 89.59651526494126"
"2997.61620977354 44.08897100772889 6.49171270718232 90.0"
"3998.80810488677 47.50820176059876 6.307550644567219 90.0"
"3998.80810488677 46.34952337794475 6.3535911602209945 90.0"
"3998.80810488677 41.903151251584184 6.3997972870975675 89.58780725859766"
"5000.0 38.76177273904744 6.21564013134898 89.57559458063852"
"5000.0 44.08897100772889 6.261510128913444 90.0"
"5000.0 41.903151251584184 6.2616793932272925 89.57871509583141"
"5998.212157330155 33.881963382336906 6.077522459688805 89.5659493678606"
"5998.212157330155 47.50820176059876 5.985444111277719 89.55927192723898"
"5998.212157330155 53.59203690324092 6.123388581952118 90.0"
This is what worked after perusing the answers and tips given from users below.
for f in results_files:
f = path + f
data = pd.read_csv(f, sep = '\t')
length_of_data = len(data)
print(data.head(length_of_data))
freqy = data[['Freq']]
print(freqy)
Use the library pandas: https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.read_csv.html
import pandas as pd
my_csv = pd.read_csv(filepath, header, names)
"header : int or list of ints, default ‘infer’
Row number(s) to use as the column names, and the start of the data. Default behavior is to infer the column names: if no names are passed the behavior is identical to header=0 and column names are inferred from the first line of the file, if column names are passed explicitly then the behavior is identical to header=None. Explicitly pass header=0 to be able to replace existing names. The header can be a list of integers that specify row locations for a multi-index on the columns e.g. [0,1,3]. Intervening rows that are not specified will be skipped (e.g. 2 in this example is skipped). Note that this parameter ignores commented lines and empty lines if skip_blank_lines=True, so header=0 denotes the first line of data rather than the first line of the file.
names : array-like, default None
List of column names to use. If file contains no header row, then you should explicitly pass header=None. Duplicates in this list will cause a UserWarning to be issued."