r text-files text-processing logfile-analysis

How do I read information from text files?

I have hundreds of text files with the following information in each file:

*****Auto-Corelation Results******
1     .09    -.19     .18     non-Significant

*****STATISTICS FOR MANN-KENDELL TEST******
S=  609
VAR(S)=      162409.70
Z=           1.51
Random : No trend at 95%

*****SENs STATISTICS ******
SEN SLOPE =  .24

Now, I want to read all these files, and "collect" Sen's Statistics from each file (eg. .24) and compile into one file along with the corresponding file names. I have to do it in R.

I have worked with CSV files but not sure how to use text files.

This is the code I am using now:

require(gtools)
GG <- grep("*.txt", list.files(), value = TRUE)
GG<-mixedsort(GG)
S <- sapply(seq(GG), function(i){
X <- readLines(GG[i])
grep("SEN SLOPE", X, value = TRUE)
})
spl <- unlist(strsplit(S, ".*[^.0-9]"))
SenStat <- as.numeric(spl[nzchar(spl)])
SenStat<-data.frame( SenStat,file = GG)
write.table(SenStat, "sen.csv",sep = ", ",row.names = FALSE)

The current code is not able to read all values correctly and giving this error:

Warning message:
NAs introduced by coercion

Also I am not getting the file names the other column of Output. Please help!

Diagnosis 1

The code is reading the = sign as well. This is the output of print(spl)

 [1] ""       "5.55"   ""       "-.18"   ""       "3.08"   ""       "3.05"   ""       "1.19"   ""       "-.32"  
[13] ""       ".22"    ""       "-.22"   ""       ".65"    ""       "1.64"   ""       "2.68"   ""       ".10"   
[25] ""       ".42"    ""       "-.44"   ""       ".49"    ""       "1.44"   ""       "=-1.07" ""       ".38"   
[37] ""       ".14"    ""       "=-2.33" ""       "4.76"   ""       ".45"    ""       ".02"    ""       "-.11"  
[49] ""       "=-2.64" ""       "-.63"   ""       "=-3.44" ""       "2.77"   ""       "2.35"   ""       "6.29"  
[61] ""       "1.20"   ""       "=-1.80" ""       "-.63"   ""       "5.83"   ""       "6.33"   ""       "5.42"  
[73] ""       ".72"    ""       "-.57"   ""       "3.52"   ""       "=-2.44" ""       "3.92"   ""       "1.99"  
[85] ""       ".77"    ""       "3.01"

Diagnosis 2

Found the problem I think. The negative sign is a bit tricky. In some files it is

SEN SLOPE =-1.07
SEN SLOPE = -.11

Because of the gap after =, I am getting NAs for the first one, but the code is reading the second one. How can I modify the regex to fix this? Thanks!

Solution

Assume "text.txt" is one of your text files. Read into R with readLines, you can use grep to find the line containing SEN SLOPE. With no further arguments, grep returns the index number(s) for the element where the regular expression was found. Here we find that it's the 11th line. Add the value = TRUE argument to get the line as it reads.

x <- readLines("text.txt")
grep("SEN SLOPE", x)
## [1] 11
( gg <- grep("SEN SLOPE", x, value = TRUE) )
## [1] "SEN SLOPE =  .24"

To find all the .txt files in the working directory we can use list.files with a regular expression.

list.files(pattern = "*.txt")
## [1] "text.txt"

LOOPING OVER MULTIPLE FILES

I created a second text file, text2.txt with a different SEN SLOPE value to illustrate how I might apply this method over multiple files. We can use sapply, followed by strsplit, to get the spl values that are desired.

GG <- list.files(pattern = "*.txt")
S <- sapply(seq_along(GG), function(i){
    X <- readLines(GG[i])
    ifelse(length(X) > 0, grep("SEN SLOPE", X, value = TRUE), NA)
    ## added 04/23/14 to account for empty files (as per comment)
})
spl <- unlist(strsplit(S, split = ".*((=|(\\s=))|(=\\s|\\s=\\s))"))
## above regex changed to capture up to and including "=" and 
## surrounding space, if any - 04/23/14 (as per comment)
SenStat <- as.numeric(spl[nzchar(spl)])

Then we can put the results into a data frame and send it to a file with write.table

( SenStatDf <- data.frame(SenStat, file = GG) )
##   SenStat      file
## 1    0.46 text2.txt
## 2    0.24  text.txt

We can write it to a file with

write.table(SenStatDf, "myFile.csv", sep = ", ", row.names = FALSE)

UPDATED 07/21/2014:

Since the result is being written to a file, this can be made much more simple (and faster) with

( SenStatDf <- cbind(
      SenSlope = c(lapply(GG, function(x){
          y <- readLines(x)
          z <- y[grepl("SEN SLOPE", y)]
          unlist(strsplit(z, split = ".*=\\s+"))[-1]
          }), recursive = TRUE),
      file = GG
 ) )
#      SenSlope file       
# [1,] ".46"   "test2.txt"
# [2,] ".24"   "test.txt"

And then written and read into R with

write.table(SenStatDf, "myFile.txt", row.names = FALSE)
read.table("myFile.txt", header = TRUE)
#   SenSlope      file
# 1     1.24 test2.txt
# 2     0.24  test.txt