Tags: r, machine-learning, quanteda, machine-learning-model

Text Analysis in R: How to add variables to my machine learning classifier in addition to the tokens?


how to consider additional variables

I am working on a classification task using quanteda in R, and I want my models to consider some variables in addition to the bag of words. For instance, I computed dictionary-based sentiment indexes and I'd like to include them as predictors.

These are the indexes I created for each document:

dfneg <- cbind(negDfm1@docvars$label, negDfm1@x, posDfm@x, angDfm@x, disgDfm1@x)
colnames(dfneg) <- c("label","neg" , "pos" , "ang" , "disg")
dfneg <- as.data.frame(dfneg)

This is the document-feature matrix I will work with:

DFM

newsdfm <- dfm(newscorp, tolower = TRUE, stem = FALSE, remove_punct = TRUE,
               remove = stopwords("english"), verbose = TRUE)
newst <- dfm_trim(newsdfm, min_docfreq = 2, verbose = TRUE)

id_train <- sample(1:6335, 5384, replace = FALSE)
# create docvar with ID
docvars(newst, "id_numeric") <- 1:ndoc(newst)

# get training set
train <- dfm_subset(newst, id_numeric %in% id_train)

# get test set (documents not in id_train)
test <- dfm_subset(newst, !id_numeric %in% id_train) 

Finally, I run a classifier, for instance Naive Bayes or lasso:

Naive Bayes classifier or lasso

NBmodel <- textmodel_nb(train , train@docvars$label)


lasso <- cv.glmnet(train, train@docvars$label,
                   family = "binomial", alpha = 1, nfolds = 10,
                   type.measure = "class")

This is what I tried after creating the dfm, but it didn't work:

 newsdfm@Dimnames$features$negz <- dfneg$neg
 newsdfm@Dimnames$features$posz <- dfneg$pos
 newsdfm@Dimnames$features$angz <- dfneg$ang
 newsdfm@Dimnames$features$disgz <- dfneg$disg

Then I thought of creating document variables before creating newsdfm:

   docvars(newscorp , "negz") <- dfneg$neg
   docvars(newscorp , "posz") <- dfneg$pos
   docvars(newscorp , "angz") <- dfneg$ang
   docvars(newscorp , "disgz") <- dfneg$disg

But at that point, I don't know how to tell the classifier to consider these document variables in addition to the bag of words.

In summary, I want the model to use both the matrix of word counts per document and the indexes I created for each document.

Any suggestion is highly appreciated.

Thank you in advance,

Carlo


Solution

  • Internally, a dfm is a sparse matrix, but it is better to avoid manipulating its slots directly if possible.

    To add new features for textmodel_nb(), you need to add them to the dfm as extra columns. As you might expect, the easiest way to do so is to call cbind() on the dfm.

    In your example, you can run something like this:

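    # as.matrix() replaces the %>% pipe, so no extra package is needed;
    # the selected columns must be numeric for this to yield a numeric matrix
    additional_features <- as.matrix(dfneg[, c("neg", "pos", "ang", "disg")])
    newsdfm_added <- cbind(newsdfm, additional_features)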
    

    As you can see, I first created a matrix of the additional features and then ran cbind(). When you execute cbind(), you will get the following warnings:

    Warning messages:
    1: cbinding dfms with different docnames 
    2: cbinding dfms with overlapping features will result in duplicated features 
    

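    As the second warning indicates, you have to make sure that the column names of the additional features do not already occur as features in the original dfm. The first warning says that the row names of the added matrix do not match the document names of the dfm, so also check that its rows are in the same document order.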
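The two checks above can be sketched on toy data. This is a minimal sketch, not the asker's actual pipeline: the texts, the `scores` data frame, and the names `toy_dfm`, `extra`, and `toy_added` are hypothetical stand-ins, and the "z" suffix simply mirrors the column names used in the question. It assumes a recent quanteda in which a dfm is built from `tokens()`, and that giving the matrix the same row names as the dfm's document names is what keeps the docnames warning from firing.

```r
library(quanteda)

# Hypothetical toy documents standing in for newscorp
txt <- c(d1 = "good great news today",
         d2 = "bad awful news today",
         d3 = "great happy story",
         d4 = "awful sad story")
toy_dfm <- dfm(tokens(txt))

# Hypothetical per-document scores, in the same document order as the dfm
scores <- data.frame(neg = c(0, 2, 0, 2), pos = c(2, 0, 2, 0))

extra <- as.matrix(scores)
rownames(extra) <- docnames(toy_dfm)             # align document names
colnames(extra) <- paste0(colnames(extra), "z")  # "negz", "posz"

# Guard against the two problems the warnings describe
stopifnot(is.numeric(extra),
          !any(colnames(extra) %in% featnames(toy_dfm)))

# Append the scores as two extra feature columns
toy_added <- cbind(toy_dfm, extra)
```

The combined dfm should then be usable in the same dfm_trim(), dfm_subset(), and textmodel_nb() steps shown in the question, with the scores treated like any other feature column.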