I am iterating through monthly weather station data. I can concatenate the files as follows:
path = r"D:\NOAA\output\TEST"
all_files = glob.glob(path + "/*.csv")
for filename in all_files:
print filename # prints D:\NOAA\output\TEST\189501.tave.conus.csv
df = (pd.read_csv(f) for f in all_files)
concatenated_df = pd.concat(df, axis=1, join='inner')
Which results in the following dataframe:
lat lon temp lat lon temp lat lon temp
0 24.5625 -81.8125 21.06 24.5625 -81.8125 17.08 24.5625 -81.8125 22.42
1 24.5625 -81.7708 21.06 24.5625 -81.7708 17.08 24.5625 -81.7708 22.47
2 24.5625 -81.7292 21.06 24.5625 -81.7292 17.08 24.5625 -81.7292 22.47
3 24.5625 -81.6875 21.05 24.5625 -81.6875 17.04 24.5625 -81.6875 22.47
4 24.6042 -81.6458 21.06 24.6042 -81.6458 17.08 24.6042 -81.6458 22.45
The lat
and lon
columns are identical, so I would like to drop those duplicates columns. The temp
columns are unique to each monthly CSV file. I want to keep all of them, but also give them meaningful column names, taken from the filename, ie:
lat lon temp185901 temp185902 temp185903
0 24.5625 -81.8125 21.06 17.08 22.42
1 24.5625 -81.7708 21.06 17.08 22.47
2 24.5625 -81.7292 21.06 17.08 22.47
3 24.5625 -81.6875 21.05 17.04 22.47
4 24.6042 -81.6458 21.06 17.08 22.45
I am brand new to Pandas (it seems awesome, but it's a lot to absorb), I would appreciate any help. I think the solutions are in the parameters I use for either the .concat()
, .duplicate()
, and .loc()
.
Example data: ftp://ftp.commissions.leg.state.mn.us/pub/gis/Temp/NOAA/
You can merge two columns and set suffixes for others:
temp = df1.merge(df2, on=['lat','lon'], suffixes=('185901','185902'))
lat lon temp185901 temp185902
0 24.5625 -81.8125 21.06 17.08
1 24.5625 -81.7708 21.06 17.08
2 24.5625 -81.7292 21.06 17.08
3 24.5625 -81.6875 21.05 17.04
4 24.6042 -81.6458 21.06 17.08
Or in a loop
temp.merge(df3, on=['lat','lon']).rename(columns={'temp':'temp185903'})
lat lon temp185901 temp185902 temp185903
0 24.5625 -81.8125 21.06 17.08 22.42
1 24.5625 -81.7708 21.06 17.08 22.47
2 24.5625 -81.7292 21.06 17.08 22.47
3 24.5625 -81.6875 21.05 17.04 22.47
4 24.6042 -81.6458 21.06 17.08 22.45
df = []
for filename in all_files:
df1 = pd.read_csv(filename)
# if the first loop
if not list(df):
df = df1
else:
df = df.merge(df1, on=['lat','lon'])
df.rename(columns={'temp':'temp'+put_numer_from_filename}, inplace=True)