rdplyrlevenshtein-distancetmagrep

which R function to use for Text Auto-Correction?


I have a csv Document with 2 columns which contains Commodity Category and Commodity Name.

Ex:

Sl.No. Commodity Category Commodity Name
1      Stationary         Pencil
2      Stationary         Pen
3      Stationary         Marker
4      Office Utensils    Chair
5      Office Utensils    Drawer
6      Hardware           Monitor
7      Hardware           CPU

and I have another csv file which contains various Commodity names.

Ex:

Sl.No. Commodity Name
1      Pancil
2      Pencil-HB 02
3      Pencil-Apsara
4      Pancil-Nataraj
5      Pen-Parker
6      Pen-Reynolds
7      Monitor-X001RL

The output I would like is to standardise and categorise the commodity names and classify them into respective Commodity Categories like shown below :

Sl.No. Commodity Name   Commodity Category
1      Pencil           Stationary
2      Pencil           Stationary
3      Pencil           Stationary
4      Pancil           Stationary
5      Pen              Stationary
6      Pen              Stationary
7      Monitor          Hardware

Step 1) I first have to use NLTK (Text mining methods) and clean the data so as to seperate "Pencil" from "Pencil-HB 02" .

Step 2) After cleaning I have to use Approximate String match technique i.e agrep() to match the patterns "Pencil *" or correcting "Pancil" to "Pencil".

Step 3)Once correcting the pattern I have to categorise. No idea how.

This is what I have thought about. I started with step 2 and I'm stuck in step 2 only. I'm not finding an exact method to code this. Is there any way to get the output as required? If yes please suggest me the method I can proceed with.


Solution

  • You could use the stringdist package. The correct function below will correct the Commodity.Name in file2 based on distances of the item to different CName.

    Then a left_join is used to join the two tables.

    I also notice that there are some classifications if I use the default options for stringdistmatrix. You can try changing the weight argument of stringdistmatrix for better correction result.

    > library(dplyr)
    > library(stringdist)
    > 
    > file1 <- read.csv("/Users/Randy/Desktop/file1.csv")
    > file2 <- read.csv("/Users/Randy/Desktop/file2.csv")
    > 
    > head(file1)
      Sl.No. Commodity.Category Commodity.Name
    1      1         Stationary         Pencil
    2      2         Stationary            Pen
    3      3         Stationary         Marker
    4      4    Office Utensils          Chair
    5      5    Office Utensils         Drawer
    6      6           Hardware        Monitor
    > head(file2)
      Sl.No. Commodity.Name
    1      1         Pancil
    2      2   Pencil-HB 02
    3      3  Pencil-Apsara
    4      4 Pancil-Nataraj
    5      5     Pen-Parker
    6      6   Pen-Reynolds
    > 
    > CName <- levels(file1$Commodity.Name)
    > correct <- function(x){
    +     factor(sapply(x, function(z) CName[which.min(stringdistmatrix(z, CName, weight=c(1,0.1,1,1)))]), CName)
    + }
    > 
    > correctedfile2 <- file2 %>%
    + transmute(Commodity.Name.Old = Commodity.Name, Commodity.Name = correct(Commodity.Name))
    > 
    > correctedfile2 %>%
    + inner_join(file1[,-1], by="Commodity.Name")
      Commodity.Name.Old Commodity.Name Commodity.Category
    1             Pancil         Pencil         Stationary
    2       Pencil-HB 02         Pencil         Stationary
    3      Pencil-Apsara         Pencil         Stationary
    4     Pancil-Nataraj         Pencil         Stationary
    5         Pen-Parker            Pen         Stationary
    6       Pen-Reynolds            Pen         Stationary
    7     Monitor-X001RL        Monitor           Hardware
    

    If you need the "Others" category, you just need to play with the weights. I added a row "Diesel" in file2. Then compute the score using stringdist with customized weights (you should try varying the values). If the score is large than 2 (this value is related to how the weights are assigned), it doesn't correct anything.

    PS: as we don't know all the possible labels, we have to do as.character to convect factor to character.

    PS2: I am also using tolower for case insensitive scoring.

    > head(file2)
      Sl.No. Commodity.Name
    1      1         Diesel
    2      2         Pancil
    3      3   Pencil-HB 02
    4      4  Pencil-Apsara
    5      5 Pancil-Nataraj
    6      6     Pen-Parker
    > 
    > CName <- levels(file1$Commodity.Name)
    > CName.lower <- tolower(CName)
    > correct_1 <- function(x){
    +     scores = stringdistmatrix(tolower(x), CName.lower, weight=c(1,0.001,1,0.5))
    +     if (min(scores)>2) {
    +         return(x)
    +     } else {
    +         return(as.character(CName[which.min(scores)]))
    +     }
    + }
    > correct <- function(x) {
    +     sapply(as.character(x), correct_1)
    + }
    > 
    > correctedfile2 <- file2 %>%
    + transmute(Commodity.Name.Old = Commodity.Name, Commodity.Name = correct(Commodity.Name))
    > 
    > file1$Commodity.Name = as.character(file1$Commodity.Name)
    > correctedfile2 %>%
    + left_join(file1[,-1], by="Commodity.Name")
      Commodity.Name.Old Commodity.Name Commodity.Category
    1             Diesel         Diesel               <NA>
    2             Pancil         Pencil         Stationary
    3       Pencil-HB 02         Pencil         Stationary
    4      Pencil-Apsara         Pencil         Stationary
    5     Pancil-Nataraj         Pencil         Stationary
    6         Pen-Parker            Pen         Stationary
    7       Pen-Reynolds            Pen         Stationary
    8     Monitor-X001RL        Monitor           Hardware