I'm trying to gather multiple csv files from one folder into a dataframe. With this prior question we realized the real issue is that some csv files (summary files) contain more than one table. As a result, the current solution's product (code below) skips a significant portion of the data.
Is there any reasonable way to gather multiple files, each possibly containing multiple tables?
Alternatively, if this makes it easier, I have, and could use, separate text files for each of the tables contained in the larger summary files.
Anyhow, what I seek, is that a single row of the generated dataframe should contain the data from the three separate text files / three tables inside the summary file.
Here is my code for just adding the text files from their folder.
import pandas as pd
import os
import glob
#define path to dir containing the summary text files
files_folder = "/data/TB/WA_dirty_prep_reports/"
#create a df list using list comprehension
files = [pd.read_csv(file, sep='\t', on_bad_lines='skip') for file in glob.glob(os.path.join(files_folder,"*txt"))]
#concatanate the list of df's into one df
files_df = pd.concat(files)
print(files_df)
Here is an example file
Sample Summary: | ||||||
---|---|---|---|---|---|---|
Sample ID: | 18RF0375-MI-TBWGS-NextSeq-2019-63_S54 | |||||
Sample Name: | 18RF0375 | |||||
Coverage Drop: | 0 | |||||
Pipeline Version: | Var | |||||
Date: | 1/12/2023 9:55 | |||||
Target Coverage Summary: | ||||||
Gene Name | Start | End | Flag | |||
gyrB | 6571 | 6762 | No deletion | |||
gyrA | 7360 | 7583 | No deletion | |||
rpoB | 760307 | 761286 | No deletion | |||
mmpR | 778989 | 779487 | No deletion | |||
rplC1 | 800808 | 801462 | No deletion | |||
atpE | 1461044 | 1461290 | No deletion | |||
rrs | 1473245 | 1473331 | No deletion | |||
rrl | 1473657 | 1476796 | No deletion | |||
fabG1 | 1673409 | 1674052 | No deletion | |||
inhA | 1674201 | 1675012 | No deletion | |||
tlyA | 1917939 | 1918747 | No deletion | |||
katG | 2153888 | 2156112 | No deletion | |||
pncA | 2288676 | 2289272 | No deletion | |||
eis | 2714123 | 2715372 | No deletion | |||
ahpC | 2726093 | 2726194 | No deletion | |||
pepQ | 2859299 | 2860418 | No deletion | |||
embB | 4246586 | 4249653 | No deletion | |||
ethA | 4326003 | 4327474 | No deletion | |||
Variant Summary: | ||||||
POS | Gene Name | Nucleotide Change | Amino acid Change | Read Depth | Percent Alt Allele | Annotation |
1474639 | rrl | c.982G>A | NA | 109 | 98.9 | rRNA |
1476311 | rrl | c.2654_2655delGTGinsCCA | NA | 100 | 11.7 | rRNA |
1476369 | rrl | c.2712C>T | NA | 110 | 10 | rRNA |
1476463 | rrl | c.2806C>T | NA | 84 | 10.7 | rRNA |
1476481 | rrl | c.2824T>C | NA | 100 | 13 | rRNA |
1476506 | rrl | c.2849T>C | NA | 92 | 12.7 | rRNA |
1673425 | fabG1 upstream | c.-15C>T | NA | 90 | 98.6 | Non-Coding |
2154724 | katG | c.1388G>T | p.Arg463Leu | 70 | 98.2 | Non-synonymous |
4247646 | embB | c.1133A>C | p.Glu378Ala | 110 | 98.8 | Non-synonymous |
Interpretations Summary: | ||||||
Drug | Variant | Interpretation | ||||
INH | fabG1 upstream_c.-15C>T,katG_p.Arg463Leu | INH-R | ||||
RIF | No reportable variant detected | RIF-S | ||||
PZA | No reportable variant detected | PZA-S | ||||
FQ | No reportable variant detected | FQ-S | ||||
EMB | embB_p.Glu378Ala | EMB-S |
New answer using the posted tables. I have pasted your input in a notepad, saved it as csv and tuned my previous answer with it.
There are 2 strategies, right?
pd.read_csv
as needed, into a list, using a comprehension,In both cases, the empty rows looks like usable table delimiters.
Here is the way of pandas:
Step 1. Read your csv file with an excess of columns. You know your largest table's size, right?
df_read = pd.read_csv(file, # your example, pasted back into a csv
sep='\t', # tabs for you
names=range(20), # columns in excess.
header=None, # skip no line on top
skip_blank_lines=False # skip no empty line -we'll use those
).dropna(how='all', axis=1) # Drop excess columns
Used these:
Step 2. Use the empty rows as table delimiters
# Flag empty rows
nullsearch = df_read.isnull().T.all()
# Assign group index, and add it as an extra column
df_read['group'] = nullsearch.diff().ne(0).where(~nullsearch).astype(float).cumsum().fillna(0)
Used these:
Step 3. Split the dataframe into a list containing all detected tables
df_list = [df_read[df_read.group==i]
.dropna(how='all', axis=1)
.drop(columns='group')
for i in df_read.group.unique() if i]
There are your four tables, three plus the file header, accessible as
df_list[0] # your file header
df_list[1]
df_list[2]
df_list[3]
File header:
0 1
0 Sample ID: 18RF0375-MI-TBWGS-NextSeq-2019-63_S54
1 Sample Name: 18RF0375
2 Coverage Drop: 0
3 Pipeline Version: Var
4 Date: 01/12/2023 09:55
Here is for example table 2:
0 1 2 3 4 5 6
29 Variant Summary: NaN NaN NaN NaN NaN NaN
30 POS Gene Name Nucleotide Change Amino acid Change Read Depth Percent Alt Allele Annotation
31 1474639 rrl c.982G>A NaN 109 98.9 rRNA
32 1476311 rrl c.2654_2655delGTGinsCCA NaN 100 11.7 rRNA
33 1476369 rrl c.2712C>T NaN 110 10 rRNA
34 1476463 rrl c.2806C>T NaN 84 10.7 rRNA
35 1476481 rrl c.2824T>C NaN 100 13 rRNA
36 1476506 rrl c.2849T>C NaN 92 12.7 rRNA
37 1673425 fabG1 upstream c.-15C>T NaN 90 98.6 Non-Coding
38 2154724 katG c.1388G>T p.Arg463Leu 70 98.2 Non-synonymous
39 4247646 embB c.1133A>C p.Glu378Ala 110 98.8 Non-synonymous
Based on what looks similar in your tables except file header, some formatting could look like this:
def postprocess(df):
'''
Format listed dataframes
- set row 1 as column names
- drop row 0 and 1
- reset the index
'''
return df.set_axis(labels=df.iloc[1,:].values, axis=1).iloc[2:,:].reset_index(drop=True)
Variant_Summary = postprocess(df_list[2])
POS Gene Name Nucleotide Change Amino acid Change Read Depth Percent Alt Allele Annotation
0 1474639 rrl c.982G>A NaN 109 98.9 rRNA
1 1476311 rrl c.2654_2655delGTGinsCCA NaN 100 11.7 rRNA
2 1476369 rrl c.2712C>T NaN 110 10 rRNA
3 1476463 rrl c.2806C>T NaN 84 10.7 rRNA
4 1476481 rrl c.2824T>C NaN 100 13 rRNA
5 1476506 rrl c.2849T>C NaN 92 12.7 rRNA
6 1673425 fabG1 upstream c.-15C>T NaN 90 98.6 Non-Coding
7 2154724 katG c.1388G>T p.Arg463Leu 70 98.2 Non-synonymous
8 4247646 embB c.1133A>C p.Glu378Ala 110 98.8 Non-synonymous