
How can I select the longest entry for each locus?


I have a CSV file that I downloaded from NCBI. In R, I want to keep, for each repeated value in the "Locus" column, the row with the largest value in the "Length" column.

For example, AGER appears several times in the Locus column. When I checked manually, the longest AGER entry was in row 16, so that is the row I need to get.

My file


Solution

  • You can do this:

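    library(data.table)
    # Sort rows by Length (descending), then keep the first row of each Locus group,
    # which is the longest entry for that locus.
    fread("proteins_51_1820449.csv")[order(-Length), first(.SD), by = Locus]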
    

    Output:

                  Locus         #Name      Accession     Start      Stop Strand    GeneID Locus tag Protein product Length                           Protein Name
        1:          TTN  chromosome 2   NC_000002.12 178527012 178804642      -      7273         -  NP_001254479.2  35991                       titin isoform IC
        2:        MUC16            Un NW_025791807.1     83875    260124      -     94025         -  NP_001388430.1  15349                     mucin-16 precursor
        3:        OBSCN  chromosome 1   NC_000001.11 228211784 228378795      +     84033         -  NP_001373054.1   8925                     obscurin isoform c
        4:        SYNE1  chromosome 6   NC_000006.12 152122436 152628331      -     23345         -  XP_016866097.1   8846                   nesprin-1 isoform X1
        5:          NEB  chromosome 2   NC_000002.12 151485760 151733156      -      4703         -  NP_001258137.2   8560                      nebulin isoform 4
       ---                                                                                                                                                       
    19995:        HRURF  chromosome 8   NC_000008.11  22130604  22130708      - 120766137         -  NP_001381061.1     34                          protein HRURF
    19996:      BLACAT1  chromosome 1   NC_000001.11 205440925 205441026      - 101669762         -  NP_001384355.1     33 bladder cancer associated transcript 1
    19997:          SLN chromosome 11   NC_000011.10 107707835 107707930      -      6588         -     NP_003054.1     31                             sarcolipin
    19998: LOC105372440 chromosome 19   NC_000019.10  50785812  50786104      - 105372440         -  NP_001371526.1     28   uncharacterized protein LOC105372440
    19999:        RPL41 chromosome 12   NC_000012.12  56116788  56117524      +      6171         -     NP_066927.1     25              60S ribosomal protein L41
    

    If speed is important, this data.table approach should help considerably.
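
To see what the one-liner does without the NCBI file, here is a minimal sketch on a small made-up table. The values and the `Accession` labels are hypothetical and only stand in for the real export; the column names `Locus` and `Length` are assumed to match the CSV. It also shows an alternative that skips the sort by using `which.max` within each group.

```r
library(data.table)

# Hypothetical toy data standing in for the NCBI export (values are made up)
dt <- data.table(
  Locus     = c("AGER", "AGER", "AGER", "TTN", "TTN", "SLN"),
  Accession = c("acc1", "acc2", "acc3", "acc4", "acc5", "acc6"),
  Length    = c(404L, 420L, 390L, 35991L, 34350L, 31L)
)

# Same idea as the answer: sort by Length (descending),
# then take the first row of each Locus group
longest <- dt[order(-Length), first(.SD), by = Locus]

# Alternative without sorting: pick the row with the maximum Length per group
longest_alt <- dt[, .SD[which.max(Length)], by = Locus]

print(longest)
print(longest_alt)
```

Both versions keep one row per locus. The sorted version returns the groups ordered by decreasing length, while the `which.max` version keeps the loci in their original order of appearance. If two rows of a locus tie on `Length`, each version keeps only one of them.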