regex, grep, vcf-variant-call-format

Extract a string from a VCF file


I need to extract the RS=368138379 string from lines like the following in a VCF file with a few billion lines. How can I use grep -o with a regular expression to extract it quickly?

AF_ESP=0.0001;ALLELEID=359042;CLNDISDB=MedGen:C0678202,OMIM:266600;CLNDN=Inflammatory_bowel_disease_1;CLNHGVS=NC_000006.11:g.31779521C>T;CLNREVSTAT=no_assertion_criteria_provided;CLNSIG=association;CLNVC=single_nucleotide_variant;CLNVCSO=SO:0001483;GENEINFO=HSPA1L:3305;MC=SO:0001583|missense_variant;ORIGIN=4;RS=368138379

Thanks very much indeed.


Solution

  • Assuming test.log contains your data, you can use:

    grep -oE "RS=[0-9]+" test.log
    

    If you also want to print the line numbers:

    grep -noE "RS=[0-9]+" test.log
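
For a file of that size, two small tweaks may help; this is only a sketch, not a benchmarked recommendation. Setting LC_ALL=C makes grep treat the input as plain bytes, which is often faster on ASCII data, and -w (supported by GNU and BSD grep) keeps the pattern from matching inside a longer key that merely ends in "RS". The sample file and its contents below are made up for illustration, with a shortened INFO field based on the line in the question:

```shell
# Hypothetical two-line sample standing in for the real VCF file.
sample=$(mktemp)
printf '##fileformat=VCFv4.1\n' > "$sample"
printf '6\t31779521\t359042\tC\tT\t.\t.\tAF_ESP=0.0001;ALLELEID=359042;ORIGIN=4;RS=368138379\n' >> "$sample"

# -o prints only the matched text, -w requires the match to sit on word
# boundaries, -E enables extended regex so that + means "one or more".
result=$(LC_ALL=C grep -owE 'RS=[0-9]+' "$sample")
echo "$result"

rm -f "$sample"
```

If the file is compressed (for example a .vcf.gz), the same pattern can be used by piping the output of zcat into grep instead of passing a file name.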