r, dataframe, bed

Data frame to bed file conversion


I have fairly large data frames in R that I need to convert to BED files. I use the code below for the data frame to BED conversion, but it is extremely slow. How can I do this conversion faster and more cleanly, either in R or in bash?

Here are the first few lines of an example data frame and of the corresponding BED file:

Dataframe:

7:115121211     7:115717553     7:115728606     7:115728881     7:115732922     7:115736195     7:115742884     7:115745446     7:115747757     7:115752949     7:115754451     7:115758839     7:115760815     7:115764258     7:115766049     7:115767796     7:115770659   7:115778018      7:115778916     7:115783939     7:115786469     7:115786614     7:115787054     7:115795892     7:115796254     7:115796568     7:115796577     7:115798414     7:115799403
15:101802122    15:101796748    15:101797565    15:101798070    15:101800680    15:101800810    15:101800817    15:101801307    15:101801525    15:101801924    15:101802122    15:101802957    15:101803999    15:101804286    15:101806680    15:101807291    15:101807374  15:101809243     15:101809473    15:101809583    15:101809747    15:101809846    15:101811404    15:101812357    15:101816568    NA:NA   NA:NA   NA:NA   NA:NA
14:48092448     14:48076797     14:48077220     14:48078107     14:48088532     14:48092327     14:48092448     14:48096413     14:48096883     14:48099107     14:48104473     14:48104777     14:48107294     14:48108274     14:48111243     14:48115370     14:48122276   14:48134996      14:48135150     14:48142024     14:48143526     14:48144608     NA:NA   NA:NA   NA:NA   NA:NA   NA:NA   NA:NA   NA:NA
12:131528491    12:131516574    12:131516713    12:131516733    12:131516770    12:131516883    12:131517005    12:131517020    12:131517066    12:131517150    12:131517651    12:131517793    12:131519612    12:131520249    12:131520675    12:131521681    12:131521694  12:131522373     12:131522451    12:131523741    12:131524764    12:131526844    12:131526894    12:131528491    12:131528903    NA:NA   NA:NA   NA:NA   NA:NA
2:36665932      2:36656809      2:36656951      2:36657905      2:36659235      2:36660367      2:36660476      2:36660581      2:36660989      2:36662473      2:36663238      2:36664571      2:36664898      2:36665052      2:36665273      2:36665548      2:36665932    2:36667413       2:36667876      2:36668395      2:36668846      2:36669071      2:36669645      2:36669670      NA:NA   NA:NA   NA:NA   NA:NA   NA:NA
9:22877714      9:22839400      9:22841425      9:22841518      9:22848811      9:22849299      9:22850177      9:22852729      9:22854439      9:22855915      9:22861588      9:22862018      9:22862481      9:22867193      9:22873872      9:22875745      9:22876877    9:22877714       9:22878225      9:22878914      9:22889291      9:22889400      9:22889518      9:22889619      9:22890108      9:22898970      9:22900997      NA:NA   NA:NA
1:207123117     1:207117558     1:207118228     1:207123117     1:207141973     1:207141987     1:207142251     1:207142507     1:207143053     1:207143296     1:207143550     NA:NA   NA:NA   NA:NA   NA:NA   NA:NA   NA:NA   NA:NA   NA:NA   NA:NA   NA:NA   NA:NA   NA:NA NA:NA    NA:NA   NA:NA   NA:NA   NA:NA   NA:NA
12:43892862     12:43843894     12:43855134     12:43863058     12:43869655     12:43871540     12:43874891     12:43881326     12:43886205     12:43892862     12:43893000     12:43893367     12:43897876     12:43898117     12:43900108     12:43900561     12:43904333   NA:NA    NA:NA   NA:NA   NA:NA   NA:NA   NA:NA   NA:NA   NA:NA   NA:NA   NA:NA   NA:NA   NA:NA
20:55462744     20:55453181     20:55461735     20:55461900     20:55462009     20:55462033     20:55462059     20:55462092     20:55462201     20:55462241     20:55462356     20:55462451     20:55462457     20:55462468     20:55462495     20:55462612     20:55462729   20:55462744      20:55462789     20:55462796     20:55462807     20:55462898     20:55462921     20:55462971     20:55464575     NA:NA   NA:NA   NA:NA   NA:NA
13:111858911    13:111835700    13:111837099    13:111837719    13:111837911    13:111840850    13:111842053    13:111845195    13:111845231    13:111852468    13:111852692    13:111853267    13:111856600    13:111856756    13:111858582    13:111858911    13:111869432  13:111869734     13:111871992    13:111876200    13:111878282    13:111883434    NA:NA   NA:NA   NA:NA   NA:NA   NA:NA   NA:NA   NA:NA

Bed file:

chr7    115121210       115121211       1       1
chr7    115717552       115717553       2       1
chr7    115728605       115728606       3       1
chr7    115728880       115728881       4       1
chr7    115732921       115732922       5       1
chr7    115736194       115736195       6       1
chr7    115742883       115742884       7       1
chr7    115745445       115745446       8       1
chr7    115747756       115747757       9       1
chr7    115752948       115752949       10      1

R code:

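df2bed = function(CSAregions, outDir) {
  # converts data.frames to bed files
  # CSAregions: directory holding the data frame files; outDir: output directory
  
  for (i in dir(CSAregions, full.names = TRUE)) {
    fileName = basename(i)
    # keep cells as character strings so strsplit() works on them
    tmp_df = read.table(i, stringsAsFactors = FALSE)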
    tmp_bed = data.frame(chr = character(), 
                         str = character(),
                         end = character(),
                         id = character(),
                         set = character(),
                         stringsAsFactors = F)
    m = 1
    for (j in 1:nrow(tmp_df)) {
      for (k in 1:ncol(tmp_df)) {
        tmp_bed[m,]$chr = paste0("chr", strsplit(tmp_df[j,k], split = ":")[[1]][1])
        tmp_bed[m,]$str = as.numeric(strsplit(tmp_df[j,k], split = ":")[[1]][2])-1
        tmp_bed[m,]$end = as.numeric(strsplit(tmp_df[j,k], split = ":")[[1]][2])
        tmp_bed[m,]$id = m
        tmp_bed[m,]$set = j
        m = m + 1
      }
    }
    
    # Clear NAs
    tmp_bed = na.omit(tmp_bed)
    
    write.table(tmp_bed, file = paste0(outDir, "/variant_beds/", fileName), 
                quote = F, row.names = F, col.names = F, sep = "\t")
  }
  
}

Thanks!


Solution

  • I wrote a bash script for this; hopefully it will be much faster for you.

    # get lines
    IFS=$'\n' read -d '' -r -a lines < input.txt
    
    id=0 # to keep rowid
    
    # loop through the lines
    for i in "${!lines[@]}"
    do
        # loop through the columns
        for col in ${lines[i]}
        do
            # separate by colon (parameter expansion avoids spawning echo/cut per cell)
            CHR=${col%%:*}
            pos=${col##*:}
            
            # for NA cells pos is "NA", which evaluates to 0 here; those rows are deleted below
            posi=$((pos-1))
            id=$((id+1))
            rownumber=$((i+1))
            # print one bed line
            printf 'chr%s\t%s\t%s\t%s\t%s\n' "$CHR" "$posi" "$pos" "$id" "$rownumber"
        done
    done > output.txt # write once, overwriting any output.txt left from an earlier run
    
    # delete NAs
    awk '!/NA/' output.txt > temp && mv temp output.txt
    

    What this does: it reads the file containing your data frame (input.txt), loops through the lines, and takes each column (col) in turn. Each entry is split at ":" into $CHR and $pos. Each line written to the output file (output.txt) contains the chromosome, position-1, position, a running row id ($id), and the row of the original data frame the entry came from ($rownumber). Once the output file is complete, all rows containing NA are deleted. Because NA entries are counted before they are removed, the ids skip over them, just as in your R code.
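
    If the bash loop is still too slow on large files, the same steps can probably be done in a single awk pass. The following is only a sketch under a few assumptions: the input is whitespace-delimited with no header line, missing cells are spelled exactly NA:NA, and df2bed_awk is a hypothetical function name chosen for illustration. It skips NA cells while printing, so no second clean-up pass is needed, but it still counts them so that the ids match the loop version.

```shell
#!/usr/bin/env bash
# Sketch: convert a table of chr:pos cells into BED-like rows in one awk pass.
# df2bed_awk is a hypothetical helper name, not part of the original answer.
# Usage: df2bed_awk INPUT_FILE   (rows are written to standard output)
df2bed_awk() {
    awk '{
        for (k = 1; k <= NF; k++) {
            id++                          # running id counts every cell, NA included
            if ($k == "NA:NA") continue   # assumed spelling of a missing cell
            split($k, part, ":")          # part[1] = chromosome, part[2] = position
            # columns: chr, position-1, position, id, source row number
            printf "chr%s\t%d\t%s\t%d\t%d\n", part[1], part[2] - 1, part[2], id, NR
        }
    }' "$1"
}
```

    It would be called as df2bed_awk input.txt > output.txt. To process a whole directory, as the R function in the question does, the call could be wrapped in a for loop over the files.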