regexsedbioinformaticssamtools

Remove character from the middle of a string


I have a SAM file with an RX: field containing 12 bases separated in the middle by a - i.e. RX:Z:CTGTGC-TCGTAA

I want to remove the hyphen from this field, but I can't simply remove all hyphens from the whole file as the read names contain them, like 1713704_EP0004-T

Have mostly been trying tr, but this is just removing all hyphens from the file.:

tr -d '"-' < sample.fq.unaln.umi.sam > sample.fq.unaln.umi.re.sam

input is a large SAM file of >10,000,000 lines like this:

1902336-103-016_C1D1_1E-T:34    99  chr1    131341  36  146M    =   131376  182 GGACAGGGAGTGTTGACCCTGGGCGGCCCCCTGGAGCCACCTGCCCTGAAAGCCCAGGGCCCGCAACCCCACACACTTTGGGGCTGGTGGAACCTGGTAAAAGCTCACCTCCCACCATGGAGGAGGAGCCCTGGGCCCCTCAGGGG  NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN  MC:Z:147M   MD:Z:83T62cD:i:4    cE:f:0  PG:Z:bwa    RG:Z:A  MI:Z:34 NM:i:1  cM:i:3  MQ:i:36 UQ:i:45 AS:i:141    XS:i:136    RX:Z:CTGTGC-TCGTAA

Desired output (i.e. last field)

1902336-103-016_C1D1_1E-T:34    99  chr1    131341  36  146M    =   131376  182 GGACAGGGAGTGTTGACCCTGGGCGGCCCCCTGGAGCCACCTGCCCTGAAAGCCCAGGGCCCGCAACCCCACACACTTTGGGGCTGGTGGAACCTGGTAAAAGCTCACCTCCCACCATGGAGGAGGAGCCCTGGGCCCCTCAGGGG  NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN  MC:Z:147M   MD:Z:83T62cD:i:4    cE:f:0  PG:Z:bwa    RG:Z:A  MI:Z:34 NM:i:1  cM:i:3  MQ:i:36 UQ:i:45 AS:i:141    XS:i:136    RX:Z:CTGTGCTCGTAA

How do I solve this problem?


Solution

  • I've solved this problem using pysam which is faster, safer and requires less disk space as a sam file is not required. It's not perfect, I'm still learning python and have used pysam for half a day.

    import pysam
    import sys
    from re import sub
    
    # Provide a bam file
    if len(sys.argv) == 2:
        assert sys.argv[1].endswith('.bam')
    
    # Makes output filehandle
    inbamfn = sys.argv[1]
    outbamfn = sub('.bam$', '.fixRX.bam', inbamfn)
    
    inbam = pysam.Samfile(inbamfn, 'rb')
    outbam = pysam.Samfile(outbamfn, 'wb', template=inbam)
    
    # Counters for reads processed and written
    n = 0
    w = 0
    
    # .get_tag() retrieves RX tag from each read
    for read in inbam.fetch(until_eof=True):
        n += 1
        umi = read.get_tag('RX')
        assert umi is not None
        umifix = umi[:6] + umi[7:]
        read.set_tag('RX', umifix, value_type='Z')
        if '-' in umifix:
            print('Hyphen found in UMI:', umifix, read)
            break
        else:
            w += 1
            outbam.write(read)
    
    inbam.close()
    outbam.close()
    
    print ('Processed', n, 'reads:\n',
           w, 'UMIs written.\n',
           str(int((w / n) * 100)) + '% of UMIs fixed')