I have a CSV file that I downloaded from NCBI. In R, for each "Locus" that appears more than once, I want to select the row with the largest value in the "Length" column.
For example, AGER is repeated in the Locus column. When I checked manually, the longest AGER entry is in the 16th row, so that is the row I need to keep.
You can do this:
library(data.table)
# Sort by Length (descending), then keep the first row of each Locus group
fread("proteins_51_1820449.csv")[order(-Length), first(.SD), by = Locus]
Output:
       Locus        #Name         Accession      Start     Stop      Strand GeneID    Locus tag Protein product Length Protein Name
    1: TTN          chromosome 2  NC_000002.12   178527012 178804642 -      7273      -         NP_001254479.2  35991  titin isoform IC
    2: MUC16        Un            NW_025791807.1 83875     260124    -      94025     -         NP_001388430.1  15349  mucin-16 precursor
    3: OBSCN        chromosome 1  NC_000001.11   228211784 228378795 +      84033     -         NP_001373054.1  8925   obscurin isoform c
    4: SYNE1        chromosome 6  NC_000006.12   152122436 152628331 -      23345     -         XP_016866097.1  8846   nesprin-1 isoform X1
    5: NEB          chromosome 2  NC_000002.12   151485760 151733156 -      4703      -         NP_001258137.2  8560   nebulin isoform 4
   ---
19995: HRURF        chromosome 8  NC_000008.11   22130604  22130708  -      120766137 -         NP_001381061.1  34     protein HRURF
19996: BLACAT1      chromosome 1  NC_000001.11   205440925 205441026 -      101669762 -         NP_001384355.1  33     bladder cancer associated transcript 1
19997: SLN          chromosome 11 NC_000011.10   107707835 107707930 -      6588      -         NP_003054.1     31     sarcolipin
19998: LOC105372440 chromosome 19 NC_000019.10   50785812  50786104  -      105372440 -         NP_001371526.1  28     uncharacterized protein LOC105372440
19999: RPL41        chromosome 12 NC_000012.12   56116788  56117524  +      6171      -         NP_066927.1     25     60S ribosomal protein L41
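If speed is important, this approach will help greatly: data.table is built to be fast both at reading the file (fread) and at grouped operations like this one.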
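To see what the one-liner does without the real file, here is a minimal sketch on a small made-up table. The values and the `Accession` labels are hypothetical and only stand in for the NCBI columns. It also shows an alternative idiom, `.SD[which.max(Length)]`, which should select the same rows.

```r
library(data.table)

# Hypothetical toy data standing in for the NCBI CSV (values are made up)
dt <- data.table(
  Locus     = c("AGER", "AGER", "AGER", "TTN", "TTN"),
  Length    = c(347L, 404L, 390L, 35991L, 34350L),
  Accession = c("acc1", "acc2", "acc3", "acc4", "acc5")
)

# Idiom from the answer: sort by Length descending,
# then keep the first row of each Locus group
by_order <- dt[order(-Length), first(.SD), by = Locus]

# Alternative: within each Locus group, take the row with the largest Length
by_max <- dt[, .SD[which.max(Length)], by = Locus]

print(by_order)
print(by_max)
```

Both results hold one row per Locus. The first is ordered by descending Length (because of the sort), while the second keeps the loci in the order they first appear in the file. If several rows in a group tie for the largest Length, either idiom keeps only one of them.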