Consider this modified classic example:
library(dplyr)
library(tibble)

# tibble() replaces the deprecated data_frame()
dtrain <- tibble(text = c("Chinese Beijing Chinese",
                          "Chinese Chinese Shanghai",
                          "France",
                          "Tokyo Japan Chinese"),
                 add_numeric = c(1, 1, 0, 1),
                 doc_id = 1:4,
                 class = c(1, 1, 1, 0))
> dtrain
# A tibble: 4 x 4
  text                     add_numeric doc_id class
  <chr>                          <dbl>  <int> <dbl>
1 Chinese Beijing Chinese            1      1     1
2 Chinese Chinese Shanghai           1      2     1
3 France                             0      3     1
4 Tokyo Japan Chinese                1      4     0
Here, I would like to use lasso to predict `class`. The variables of interest are `text` and `add_numeric`.
I know how to use `text2vec` or `tm` to predict `class` using `text` only: the packages transform `text` into a sparse document-term matrix and feed it to the model.
However, here I want to use both the textual variable `text` and the numeric variable `add_numeric`. I do not know how to mix the two approaches. Any ideas?
Thanks!
I haven't checked how to do this with text2vec, but with quanteda this is quite easy: just use `cbind`. The advantage is that the result stays a sparse matrix. I haven't changed the dimnames, so the added column is shown as `feat1`.
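library(quanteda)

# Tokenize first: recent quanteda versions no longer build a dfm directly from character input
dtm <- dfm(tokens(dtrain$text))            # create the document-term matrix
dtm_num <- cbind(dtm, dtrain$add_numeric)  # append the numeric column; the result stays sparse
dtm_num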
Document-feature matrix of: 4 documents, 7 features (60.7% sparse).
4 x 7 sparse Matrix of class "dfm"
       features
docs    chinese beijing shanghai france tokyo japan feat1
  text1       2       1        0      0     0     0     1
  text2       2       0        1      0     0     0     1
  text3       0       0        0      1     0     0     0
  text4       1       0        0      0     1     1     1
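
To connect this back to the lasso part of the question, here is a minimal sketch of how a combined sparse matrix could be passed to `glmnet`. It is not from the original answer and rests on several assumptions: the `glmnet` and `Matrix` packages are installed, and the eight-document matrix, `add_numeric` values, and labels `y` are hypothetical stand-ins, because the four-row example has only one observation in class 0, which is too few to fit a classifier. Wrapping the numeric variable in a one-column sparse matrix also gives it a proper column name instead of `feat1`.

```r
library(Matrix)
library(glmnet)

# Hypothetical stand-in for a document-term matrix: 8 documents, 3 terms.
dtm <- sparseMatrix(
  i = c(1, 1, 2, 3, 4, 5, 6, 7, 8),
  j = c(1, 2, 1, 3, 2, 1, 3, 2, 3),
  x = c(2, 1, 2, 1, 1, 1, 1, 1, 2),
  dims = c(8, 3),
  dimnames = list(paste0("text", 1:8), c("chinese", "beijing", "tokyo"))
)

# Hypothetical numeric predictor and binary outcome (illustration only).
add_numeric <- c(1, 1, 0, 1, 0, 1, 0, 0)
y           <- c(1, 1, 0, 1, 0, 0, 1, 0)

# A named one-column sparse matrix keeps the combined matrix sparse
# and labels the extra predictor.
num <- Matrix(add_numeric, ncol = 1, sparse = TRUE,
              dimnames = list(NULL, "add_numeric"))
x <- cbind(dtm, num)

# alpha = 1 selects the lasso penalty; glmnet accepts sparse input.
# On data this small glmnet will warn about the class sizes.
fit <- glmnet(x, y, family = "binomial", alpha = 1)

# Coefficients at one (arbitrary) penalty value.
coef(fit, s = 0.05)
```

On a real corpus the penalty would normally be chosen by cross-validation with `cv.glmnet` rather than fixed by hand. Note also that `glmnet` standardizes predictors by default, so term counts and `add_numeric` are penalized on a comparable scale.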