I have loaded and cleaned a corpus in R with:
myTxt <- Corpus(DirSource("."), readerControl = list(language="lat"))
corp <- tm_map(myTxt, removeWords, c(stopwords("french")))
corp <- tm_map(corp, content_transformer(tolower))
corp <- tm_map(corp, content_transformer(removeNumbers))
corp <- tm_map(corp, removePunctuation)
corp <- tm_map(corp, removeWords, stopwords("french"))
corp <- tm_map(corp, stripWhitespace); #inspect(docs[1])
tdm <- TermDocumentMatrix(corp)
And with TreeTagger I have written a function like this:
require(koRpus)
lemmatisation <- function(my.df){
## my.df is a Corpus object obtained by loading the corpus with tm
print(my.df)
dictionnaire <- data.frame()
for(i in 1 : length(my.df)){
lemma <- treetag(my.df[[i]][[1]], treetagger = "manual", format = "obj", TT.tknz = FALSE,
lang = "fr", TT.options = list(path = "treetagger", preset = "fr-utf8"))
dictionnaire <- rbind(dictionnaire, lemma@TT.res )
}
return(unique(dictionnaire))
}
At this point I have tdm, which looks something like this:
Docs
Terms Urbain.txt Versele.txt
sudest 0 1
suit 0 0
suivi 0 0
sujets 0 0
supplémentaire 0 0
suzanne 0 0
symbols 0 0
tant 0 0
tdm 0 0
télévisés 0 0
tempérament 0 0
temps 1 0
termdocumentmatrixcorp 0 0
terms 0 0
terre 0 0
tête 0 0
text 0 0
textcat 0 0
the 0 1
théâtre 0 0
thème 0 0
themebw 0 0
thérapeute 0 0
thérapie 0 0
thèse 0 0
tissent 0 0
tmmapcorp 0 0
tmmapmytxt 0 0
tokyo 0 0
tôt 0 0
touchent 0 0
toujours 0 0
tournant 0 0
tous 0 0
tout 0 0
toute 0 0
toutes 0 0
traditionnelle 0 1
transformé 0 0
travail 0 0
travaillant 0 1
travaille 0 0
travaillé 0 0
travaillent 0 0
Now I would like to aggregate the word counts using my lemma dictionary, for example to group travaillé, travaille, travaillant, travaillent, ... under a single entry.
In the result of my lemmatisation function I have:
my.lemma[my.lemma$lemma == "travailler",]
           token      tag      lemma lttr wclass                    desc stop stem
665    travaillé VER:pper travailler    9   verb    verb past participle   NA   NA
835    travaille VER:pres travailler    9   verb            verb present   NA   NA
1369 travaillent VER:pres travailler   11   verb            verb present   NA   NA
1713 travaillant VER:ppre travailler   11   verb verb present participle   NA   NA
I don't know how to perform this aggregation.
You could try
aggregate(. ~ lemma, merge(as.matrix(tdm), my.lemma[, c("token", "lemma")], by.x = "row.names", by.y = "token")[-1], sum)
which should give you something like
# lemma Urbain.txt Versele.txt
# 1 travailler 0 1
# ...
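To see what the merge and aggregate steps do, here is a minimal sketch on toy data that runs without tm or TreeTagger. The objects `counts` and `lemmas` are hypothetical stand-ins for `as.matrix(tdm)` and for the `token`/`lemma` columns of the lemmatisation result, and the values are only a small excerpt chosen for illustration.

```r
# Hypothetical stand-in for as.matrix(tdm): rows are terms, columns are documents.
counts <- matrix(
  c(0, 0, 0, 0, 1,   # Urbain.txt
    0, 1, 0, 0, 0),  # Versele.txt
  ncol = 2,
  dimnames = list(
    Terms = c("travaillé", "travaillant", "travaille", "travaillent", "temps"),
    Docs  = c("Urbain.txt", "Versele.txt")
  )
)

# Hypothetical stand-in for the token/lemma columns of the lemma dictionary.
lemmas <- data.frame(
  token = c("travaillé", "travaille", "travaillent", "travaillant", "temps"),
  lemma = c("travailler", "travailler", "travailler", "travailler", "temps"),
  stringsAsFactors = FALSE
)

# Join each term (the row names of the matrix) to its lemma.
merged <- merge(counts, lemmas, by.x = "row.names", by.y = "token")

# Drop the Row.names column, then sum the per-document counts within each lemma.
by_lemma <- aggregate(. ~ lemma, data = merged[-1], FUN = sum)
print(by_lemma)
```

Two things to keep in mind with this approach: `merge` does an inner join by default, so terms with no entry in the lemma table are dropped from the result, and a token that appears more than once in the lemma table (for instance with two different tags) would have its counts duplicated. Passing `unique(lemmas[, c("token", "lemma")])` to `merge` is one way to guard against the second case.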