I'm working on a Snakemake workflow where I need to combine multiple CSV files into a single Pandas DataFrame. The list of CSV files is dynamic—it depends on upstream rules and wildcard patterns.
Here's a simplified version of what I have in my Snakefile:
rule combine_tables:
    input:
        expand("results/{sample}/data.csv", sample=SAMPLES)
    output:
        "results/combined/all_data.csv"
    run:
        import pandas as pd

        dfs = [pd.read_csv(f) for f in input]
        combined = pd.concat(dfs)
        combined.to_csv(output[0], index=False)
This works when the files exist, but I’d like to know:
1. What's the best practice for handling missing or corrupt files in this context?
2. Is there a more "Snakemake-idiomatic" way to dynamically list and read input files for Pandas operations?
3. How do I ensure proper file ordering, or handle metadata like sample names if not all CSVs are structured identically?
rule combine_tables:
    input:
        # Static sample list (use checkpoints if the list is generated dynamically)
        expand("results/{sample}/data.csv", sample=SAMPLES)
    output:
        "results/combined/all_data.csv"
    run:
        import os
        import pandas as pd

        dfs = []
        missing_files = []
        corrupt_files = []

        # Process files in a consistent order (sorted by sample name)
        for file_path in sorted(input, key=lambda x: x.split("/")[1]):
            # Missing files shouldn't occur if the Snakemake workflow is
            # correct, but guard against stale or manually deleted outputs
            if not os.path.exists(file_path):
                missing_files.append(file_path)
                continue
            # Handle corrupt/unreadable files
            try:
                df = pd.read_csv(file_path)
                # Extract the sample name from "results/{sample}/data.csv"
                # and record it as a metadata column at the front
                sample_id = file_path.split("/")[1]
                df.insert(0, "sample", sample_id)
                dfs.append(df)
            except Exception as e:
                corrupt_files.append((file_path, str(e)))

        # Validation reporting
        if missing_files:
            raise FileNotFoundError(f"Missing {len(missing_files)} files: {missing_files}")
        if corrupt_files:
            raise ValueError("Corrupt files detected:\n" + "\n".join(
                f"{path}: {err}" for path, err in corrupt_files))
        if not dfs:
            raise ValueError("No valid dataframes to concatenate")

        # Concatenate and save
        combined = pd.concat(dfs, ignore_index=True)
        combined.to_csv(output[0], index=False)
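If the sample list itself comes from an upstream rule, the Snakemake-idiomatic route hinted at in the comment above is a checkpoint plus an input function. A minimal sketch, assuming a hypothetical checkpoint `discover_samples` that writes one `{sample}/data.csv` per discovered sample under `results/samples/` (the checkpoint name, directory layout, and shell command are all illustrative, not from the original workflow):

```
import os

checkpoint discover_samples:
    output:
        directory("results/samples")
    shell:
        # Placeholder for whatever upstream step produces the per-sample CSVs
        "./generate_samples.sh {output}"

def combined_inputs(wildcards):
    # Evaluated only after the checkpoint finishes, so the sample
    # list reflects whatever was actually produced on disk
    ckpt_dir = checkpoints.discover_samples.get(**wildcards).output[0]
    samples = glob_wildcards(os.path.join(ckpt_dir, "{sample}/data.csv")).sample
    return expand(os.path.join(ckpt_dir, "{sample}/data.csv"), sample=samples)

rule combine_tables:
    input:
        combined_inputs
    output:
        "results/combined/all_data.csv"
    run:
        ...  # same combine logic as above
```

With this pattern the DAG is re-evaluated after `discover_samples` runs, so `combine_tables` automatically picks up however many samples actually exist, with no hard-coded `SAMPLES` list.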
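On the question of CSVs that aren't structured identically: `pd.concat` aligns columns by name, so files with overlapping but non-identical schemas can still be combined, with missing entries filled as NaN. A quick standalone illustration with toy data (not from the workflow):

```python
import pandas as pd

# Two inputs with overlapping but non-identical columns
a = pd.DataFrame({"gene": ["g1"], "count": [10]})
b = pd.DataFrame({"gene": ["g2"], "tpm": [3.5]})

# Columns are aligned by name; entries absent from a file become NaN
combined = pd.concat([a, b], ignore_index=True)
print(combined.columns.tolist())        # ['gene', 'count', 'tpm']
print(combined["count"].isna().tolist())  # [False, True]
```

If strict schema agreement matters, it is safer to validate each frame's columns against an expected set before concatenating, rather than silently accepting NaN-padded rows.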