Tags: r, svm, data-mining, text2vec, vector-space

How to convert text fields into numeric/vector space for an SVM in RStudio?


I am attempting to train a Support Vector Machine to aid in the detection of similarity between strings. My training data consists of two text fields and a third field that contains 0 or 1 to indicate similarity; this last field was calculated with the help of an edit-distance operation. I know that I need to convert the two text fields to numeric values before continuing. What is the best method to achieve this?

The training data looks like:

ID          MAKTX_Keyword       PH_Level_04_Keyword   Result
266325638   AMLODIPINE          AMLODIPINE              0
724712821   IRBESARTANHCTZ      IRBESARTANHCTZ          0
567428641   RABEPRAZOLE         RABEPRAZOLE             0
137472217   MIRTAZAPINE         MIRTAZAPINE             0
175827784   FONDAPARINUX        ARIXTRA                 1
456372747   VANCOMYCIN          VANCOMYCIN              0
653832438   BRUFEN              IBUPROFEN               1
917575539   POTASSIUM           POTASSIUM               0
222949123   DIOSMINHESPERIDIN   DIOSMINHESPERIDIN       0
892725684   IBUPROFEN           IBUPROFEN               0
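
For a reproducible example, a subset of this data can be rebuilt roughly like this (treating the keyword columns as plain character vectors and Result as an integer):

Train_PRDHA_String.df = data.frame(
  ID = c(266325638, 724712821, 175827784, 653832438),
  MAKTX_Keyword = c("AMLODIPINE", "IRBESARTANHCTZ", "FONDAPARINUX", "BRUFEN"),
  PH_Level_04_Keyword = c("AMLODIPINE", "IRBESARTANHCTZ", "ARIXTRA", "IBUPROFEN"),
  Result = c(0L, 0L, 1L, 1L),
  stringsAsFactors = FALSE  # keep the keyword columns as character vectors
)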

I have been experimenting with the text2vec library, using this useful vignette as a guide. In doing so, I can presumably represent one of the fields in vector space.

The code I am currently using to vectorize one of the fields:

library(text2vec)
library(data.table)

# preprocessing and tokenization functions
preproc_func = tolower
token_func = word_tokenizer

# iterator over the first keyword field
it_train = itoken(Train_PRDHA_String.df$MAKTX_Keyword, 
                  preprocessor = preproc_func, 
                  tokenizer = token_func, 
                  ids = Train_PRDHA_String.df$ID, 
                  progressbar = TRUE)
vocab = create_vocabulary(it_train)

# build the document-term matrix from the learned vocabulary
vectorizer = vocab_vectorizer(vocab)
t1 = Sys.time()
dtm_train = create_dtm(it_train, vectorizer)
print(difftime(Sys.time(), t1, units = 'sec'))

dim(dtm_train)
# sanity check: row names of the DTM should match the (character-coerced) IDs
identical(rownames(dtm_train), as.character(Train_PRDHA_String.df$ID))

Solution

  • One way to embed both text columns in the same vector space is to learn the vocabulary from both of them:

    preproc_func = tolower
    token_func = word_tokenizer

    # learn the vocabulary from BOTH keyword columns so that the two
    # fields share a single vector space (ids are not needed here, since
    # this iterator is only used to build the vocabulary)
    union_txt = c(Train_PRDHA_String.df$MAKTX_Keyword, Train_PRDHA_String.df$PH_Level_04_Keyword)
    it_train = itoken(union_txt, 
                      preprocessor = preproc_func, 
                      tokenizer = token_func, 
                      progressbar = TRUE)
    vocab = create_vocabulary(it_train)
    vectorizer = vocab_vectorizer(vocab)
    
    # DTM for the first field, built with the shared vectorizer
    it1 = itoken(Train_PRDHA_String.df$MAKTX_Keyword, preproc_func, 
                 token_func, ids = Train_PRDHA_String.df$ID)
    dtm_train_1 = create_dtm(it1, vectorizer)
    
    # DTM for the second field, built with the same vectorizer
    it2 = itoken(Train_PRDHA_String.df$PH_Level_04_Keyword, preproc_func, 
                 token_func, ids = Train_PRDHA_String.df$ID)
    dtm_train_2 = create_dtm(it2, vectorizer)
    

    And after that you can combine them into a single feature matrix, which is what you would feed to the SVM (see the training sketch at the end of this answer):

    dtm_train = cbind(dtm_train_1, dtm_train_2)
    

    However, if you want to solve the problem of duplicate detection, I suggest using char_tokenizer with ngram > 1 (say ngram = c(3, 3)), and also having a look at the great stringdist package; a sketch of both follows below. I also suppose that Result was produced with some manual human work, because if it is just edit distance, the algorithm will learn at most how edit distance works.
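
    A rough sketch of both suggestions, reusing union_txt from above (method = "lv" picks Levenshtein distance, just one of the measures stringdist offers):

    # character 3-gram vocabulary learned over both columns
    it_char = itoken(union_txt,
                     preprocessor = tolower,
                     tokenizer = char_tokenizer,
                     progressbar = FALSE)
    vocab_char = create_vocabulary(it_char, ngram = c(3L, 3L))
    vectorizer_char = vocab_vectorizer(vocab_char)

    # pairwise string distance between the two keyword columns, usable as an extra feature
    library(stringdist)
    lv_dist = stringdist(tolower(Train_PRDHA_String.df$MAKTX_Keyword),
                         tolower(Train_PRDHA_String.df$PH_Level_04_Keyword),
                         method = "lv")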
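
    Once you have the combined document-term matrix, training the SVM itself could look like the sketch below. This assumes the e1071 package (its svm() accepts sparse matrices from the Matrix package, which is what create_dtm returns); the linear kernel is only a starting point:

    library(e1071)
    # Result (0/1) as a factor so that svm() does classification, not regression
    svm_fit = svm(x = dtm_train,
                  y = as.factor(Train_PRDHA_String.df$Result),
                  kernel = "linear",
                  scale = FALSE)  # keep the sparse term counts unscaled
    # quick in-sample check that the pipeline runs end to end
    table(predicted = predict(svm_fit, dtm_train),
          actual = Train_PRDHA_String.df$Result)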