I have single-cell RNA-Seq experiment with >1000 samples from the same individual. To get an idea of their genotype, I want to do a pseudo-bulk variant call where I merge the 1000+ BAM files into one and call variants from that. The problem I have is that Freebayes will give each sample from the merged VCF file a separate column rather than one "merged" column. I couldn't find any options on Freebayes that ignore read groups.
Normal output:
<header info>
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE 1 SAMPLE 2 ...
<data>
Desired output:
<header info>
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT COMBINED_SAMPLE
<data>
While not elegant, a solution I have found is to modify the header of the BAM file so that every read group maps to the same sample. A quick way of doing this with sed and SAMtools:
samtools reheader -c "sed 's/\bSM:[^ ]*/SM:combined/'" input.bam > output.bam
In this example, samtools reheader modifies the BAM file's header using the command passed with -c. The sed command matches any space-separated word that starts with "SM:" and replaces it with SM:combined:
Before
@RG ID:read_group_1 LB:RNA PL:Illumina SM:sample_A
@RG ID:read_group_2 LB:RNA PL:Illumina SM:sample_B
After
@RG ID:read_group_1 LB:RNA PL:Illumina SM:combined
@RG ID:read_group_2 LB:RNA PL:Illumina SM:combined
When running Freebayes on the new BAM file, all metadata is contained within a single sample column named "combined" (copied here):
<header info>
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT combined
<data>