r date dataset corpus

How to extract the date from PDF file names to a data set?


I am trying to extract the dates from the file names of multiple PDFs to create a date column in a dataset.

I have a folder holding all the PDFs and want to do topic modelling over a time period, so I need the date of each document.

Below is the dataset I have, which contains only the file names.

# A tibble: 260 x 1
   filename        
   <chr>           
 1 ./2012.01.18.pdf
 2 ./2012.02.07.pdf
 3 ./2012.03.12.pdf
 4 ./2012.03.26.pdf
 5 ./2012.04.02.pdf
 6 ./2012.04.04.pdf
 7 ./2012.04.19.pdf
 8 ./2012.05.01.pdf
 9 ./2012.05.07.pdf
10 ./2012.06.14.pdf

I tried "as.Date" with no luck; I could not get it to extract the dates from the file names of the PDFs in the folder.


Solution

  • In the format argument of as.Date, we can include the literal extra characters (the leading "./" and the trailing ".pdf") alongside the conversion specifications for year (%Y), month (%m) and day (%d)

    df$V2 <- as.Date(df$V2, format = "./%Y.%m.%d.pdf")
    

    -output

    > df
       V1         V2
    1   1 2012-01-18
    2   2 2012-02-07
    3   3 2012-03-12
    4   4 2012-03-26
    5   5 2012-04-02
    6   6 2012-04-04
    7   7 2012-04-19
    8   8 2012-05-01
    9   9 2012-05-07
    10 10 2012-06-14
    

    data

    df <- structure(list(V1 = 1:10, V2 = c("./2012.01.18.pdf", "./2012.02.07.pdf", 
    "./2012.03.12.pdf", "./2012.03.26.pdf", "./2012.04.02.pdf", "./2012.04.04.pdf", 
    "./2012.04.19.pdf", "./2012.05.01.pdf", "./2012.05.07.pdf", "./2012.06.14.pdf"
    )), class = "data.frame", row.names = c(NA, -10L))
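
If the column is called filename, as in the question's tibble, and the directory prefix is not always "./", a variation is to strip the path and extension first and then parse what is left. This is only a sketch: the files data frame and the "./readme.pdf" entry are made up for illustration, and it assumes every dated file is named YYYY.MM.DD.pdf.

```r
# Hypothetical stand-in for the question's `filename` column;
# "./readme.pdf" is an invented name that does not follow the date pattern
files <- data.frame(
  filename = c("./2012.01.18.pdf", "./2012.02.07.pdf", "./readme.pdf"),
  stringsAsFactors = FALSE
)

# Drop the directory part and the ".pdf" extension, leaving e.g. "2012.01.18"
stem <- tools::file_path_sans_ext(basename(files$filename))

# Parse the remaining text; names that do not match the format become NA
files$date <- as.Date(stem, format = "%Y.%m.%d")

# List the files whose names could not be parsed, so they can be checked by hand
unparsed <- files$filename[is.na(files$date)]
```

Checking for NA afterwards is worthwhile because as.Date returns NA for a name that does not match the format rather than raising an error. The filename column itself could presumably be built with something like list.files(pattern = "\\.pdf$", full.names = TRUE) run from the folder holding the PDFs.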