
Compiling and analysing a Corpus with R and koRpus


I'm a student of literature lost in data science. I'm trying to analyse a corpus of 70 .txt files, which are all in one directory.

My final goal is to get a table containing the filename (or something similar), the sentence and word counts, a Flesch-Kincaid readability score and an MTLD lexical diversity score.

I've found the packages koRpus and tm (and tm.plugin.koRpus) and have tried to understand their documentation, but haven't got far. With the help of the RKWard IDE and the koRpus plugin I can get all of these measures for one file at a time and copy that data into a table manually, but that is very cumbersome and still a lot of work.

What I've tried so far is this command to create a corpus of my files:

simpleCorpus(dir = "/home/user/files/", lang = "en", tagger = "tokenize",
             encoding = "UTF-8", pattern = NULL, recursive = FALSE,
             ignore.case = FALSE, mode = "text", source = "Wikipedia",
             format = "file", mc.cores = getOption("mc.cores", 1L))

But I always get the error:

Error in data.table(token = tokens, tag = unk.kRp) : column or argument 1 is NULL

If someone could help an absolute newbie to R I'd be incredibly grateful!


Solution

  • I have found the solution with the help of unDocUMeantIt, the author of the package (thank you!). An empty file in the directory caused the error; after removing it, I managed to get everything running.
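
For anyone who hits the same error, here is a minimal sketch of how one might check a directory for empty files before building the corpus. It uses only base R. The helper name find_empty_files is made up for illustration and is not part of koRpus or tm, and the path is the one from the question. It only catches zero-byte files, so a file containing nothing but whitespace would not be reported.

```r
# Hypothetical helper (not part of koRpus or tm): return the paths of
# all zero-byte files in `dir` whose names match `pattern`.
find_empty_files <- function(dir, pattern = "\\.txt$") {
  paths <- list.files(dir, pattern = pattern, full.names = TRUE)
  sizes <- file.size(paths)            # size in bytes, NA if unreadable
  paths[!is.na(sizes) & sizes == 0]
}

# Directory taken from the question; adjust to your own setup.
empty <- find_empty_files("/home/user/files/")
print(empty)
```

If this prints any paths, removing or moving those files before calling simpleCorpus() should avoid the situation described in the solution above.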