
R: aggregate tokens by lemma in the tm package


I have loaded and cleaned a corpus in R with:

library(tm)

myTxt <- Corpus(DirSource("."), readerControl = list(language="lat"))
corp <- tm_map(myTxt, removeWords, stopwords("french"))
corp <- tm_map(corp, content_transformer(tolower))
corp <- tm_map(corp, content_transformer(removeNumbers))
corp <- tm_map(corp, removePunctuation)
# second stopword pass, now that everything is lower-case
corp <- tm_map(corp, removeWords, stopwords("french"))
corp <- tm_map(corp, stripWhitespace)  # inspect(corp[1])
tdm <- TermDocumentMatrix(corp)

And with TreeTagger I have written a function like this:

library(koRpus)

lemmatisation <- function(my.df){
  ## my.df is a Corpus object obtained by loading the corpus with tm
  print(my.df)
  dictionnaire <- data.frame()
  for(i in 1 : length(my.df)){
    ## tag document i of the argument my.df (not the global corp)
    lemma <- treetag(my.df[[i]][[1]], treetagger = "manual", format = "obj", TT.tknz = FALSE, 
                     lang = "fr", TT.options = list(path = "treetagger", preset = "fr-utf8"))
    dictionnaire <- rbind(dictionnaire, lemma@TT.res)
  }
  ## one row per distinct token/tag/lemma combination
  return(unique(dictionnaire))
}

At this point I have a tdm that looks something like this:

                                                 Docs
Terms                                              Urbain.txt Versele.txt
  sudest                                                    0           1
  suit                                                      0           0
  suivi                                                     0           0
  sujets                                                    0           0
  supplémentaire                                            0           0
  suzanne                                                   0           0
  symbols                                                   0           0
  tant                                                      0           0
  tdm                                                       0           0
  télévisés                                                 0           0
  tempérament                                               0           0
  temps                                                     1           0
  termdocumentmatrixcorp                                    0           0
  terms                                                     0           0
  terre                                                     0           0
  tête                                                      0           0
  text                                                      0           0
  textcat                                                   0           0
  the                                                       0           1
  théâtre                                                   0           0
  thème                                                     0           0
  themebw                                                   0           0
  thérapeute                                                0           0
  thérapie                                                  0           0
  thèse                                                     0           0
  tissent                                                   0           0
  tmmapcorp                                                 0           0
  tmmapmytxt                                                0           0
  tokyo                                                     0           0
  tôt                                                       0           0
  touchent                                                  0           0
  toujours                                                  0           0
  tournant                                                  0           0
  tous                                                      0           0
  tout                                                      0           0
  toute                                                     0           0
  toutes                                                    0           0
  traditionnelle                                            0           1
  transformé                                                0           0
  travail                                                   0           0
  travaillant                                               0           1
  travaille                                                 0           0
  travaillé                                                 0           0
  travaillent                                               0           0

Now I would like to aggregate the word counts using my lemma dictionary, for example grouping travaillé, travaille, travaillant, travaillent... under a single lemma.

In the result of my lemmatisation function (stored in my.lemma) I have:

my.lemma[my.lemma$lemma == "travailler",]
           token      tag      lemma lttr wclass                    desc stop stem
665    travaillé VER:pper travailler    9   verb    verb past participle   NA   NA
835    travaille VER:pres travailler    9   verb            verb present   NA   NA
1369 travaillent VER:pres travailler   11   verb            verb present   NA   NA
1713 travaillant VER:ppre travailler   11   verb verb present participle   NA   NA

I don't know how to perform this aggregation.


Solution

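  • You could try merging the term counts with the token/lemma columns of my.lemma and summing by lemma (as.matrix() turns the sparse term-document matrix into an ordinary matrix that merge() can handle):

    aggregate(. ~ lemma, merge(as.matrix(tdm), my.lemma[, c("token", "lemma")], by.x = "row.names", by.y = "token")[-1], sum)
    

    which should give you something like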

    #        lemma Urbain.txt Versele.txt
    # 1 travailler          0           1
    # ...
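
To make the steps easier to follow, here is a minimal sketch on toy data in base R only, so it runs without tm or TreeTagger. The objects counts and lemmas are hypothetical stand-ins for as.matrix(tdm) and my.lemma[, c("token", "lemma")]; their values are made up to mirror the excerpt in the question. The rowsum() variant at the end is an alternative not mentioned in the answer above, shown only for comparison.

```r
# Toy stand-in for as.matrix(tdm): rows are terms, columns are documents.
# The matrix is filled column by column (first Urbain.txt, then Versele.txt).
counts <- matrix(
  c(0, 0, 0, 0, 1,
    0, 1, 0, 0, 0),
  ncol = 2,
  dimnames = list(
    Terms = c("travaillé", "travaillant", "travaille", "travaillent", "temps"),
    Docs  = c("Urbain.txt", "Versele.txt")
  )
)

# Toy stand-in for my.lemma[, c("token", "lemma")].
lemmas <- data.frame(
  token = c("travaillé", "travaillant", "travaille", "travaillent", "temps"),
  lemma = c("travailler", "travailler", "travailler", "travailler", "temps"),
  stringsAsFactors = FALSE
)

# Step 1: attach the lemma to every term. merge() does an inner join by
# default, so terms without a lemma entry are dropped.
merged <- merge(counts, lemmas, by.x = "row.names", by.y = "token")

# Step 2: drop the Row.names column and sum the document columns per lemma.
by_lemma <- aggregate(. ~ lemma, data = merged[-1], FUN = sum)
print(by_lemma)

# Alternative (assumption, not from the answer): rowsum() groups the rows of
# the matrix directly and avoids the formula interface.
by_lemma_rowsum <- rowsum(counts[lemmas$token, , drop = FALSE], group = lemmas$lemma)
print(by_lemma_rowsum)
```

Two things may need attention on real data. If the same token appears in my.lemma with several tags or lemmas, the merge duplicates its counts, so reducing the lookup with unique() on the token and lemma columns first is probably safer. Document names that are not syntactic R names (spaces, hyphens) could also trip the formula in aggregate(), in which case the rowsum() variant may be easier.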