java regex topic-modeling mallet

Create a customized Pattern for my data set in Mallet


I'm using Mallet 2.0.7 in Java to mine tweets. According to the documentation, for topic modeling I have to read the data set using CsvIterator.

Reader fileReader = new InputStreamReader(new FileInputStream(new File(args[0])), "UTF-8");
instances.addThruPipe(new CsvIterator(fileReader,
        Pattern.compile("^(\\S*)[\\s,]*(\\S*)[\\s,]*(.*)$"),
        3, 2, 1)); // data, label, name fields

My data set has these columns: row,x,location,username,hashtaghs,text,retweets,date,favorites,numberOfComment

I added column x to serve as the label. For now I want to run the algorithm on the text column (column 6) only, and add other columns later. I wrote the pattern below, but it doesn't work as expected: the data field contains everything from column 6 to the end of the line. How should I change the regular expression?

Reader fileReader = new InputStreamReader(new FileInputStream(new File(filePath)), "UTF-8");
instances.addThruPipe(new CsvIterator(fileReader,
        Pattern.compile("^(\\S*)[\\s,]*(\\S*)[\\s,]*(\\S*)[\\s,]*(\\S*)[\\s,]*(\\S*)[\\s,]*(.*)$"),
        6, 2, 1)); // data, label, name fields

Solution

  • Read the regular expression documentation to understand what each element of the pattern means. The original pattern divides the whole line into three groups: a run of non-whitespace characters at the start of the line, a second run of non-whitespace characters after any separating whitespace or commas, and then everything else up to the end of the line. Note that \S also matches commas, so a comma by itself does not end a group.

    Your new pattern does the same thing with six groups. The sixth group, (.*), matches whatever is left, which is why you're getting everything from the text column to the end of the line.

    I would recommend a few fixes. Try something like this (not tested), which assumes the fields are separated by tabs:

    // row,x,location,username,hashtaghs,text,retweets,date,favorites,numberOfComment
    // Assumes ten tab-separated fields; group 1 is the row number, group 2 is the text column.
    Reader fileReader = new InputStreamReader(new FileInputStream(new File(filePath)), "UTF-8");
    instances.addThruPipe(new CsvIterator(fileReader,
            Pattern.compile("^(\\d+)\t[^\t]*\t[^\t]*\t[^\t]*\t[^\t]*\t([^\t]*)\t[^\t]*\t[^\t]*\t[^\t]*\t[^\t]*$"),
            2, 0, 1)); // data = group 2 (text), label = 0 (no label field), name = group 1 (row)
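
To compare the two patterns without running Mallet, here is a minimal sketch that uses only java.util.regex. The class name CsvPatternDemo, the helper method group, and the sample row are made up for illustration. The row is assumed to be tab-separated with exactly ten fields, in the column order given in the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CsvPatternDemo {
    // The six-group pattern from the question: its last group, (.*), takes the rest of the line.
    static final Pattern GREEDY = Pattern.compile(
            "^(\\S*)[\\s,]*(\\S*)[\\s,]*(\\S*)[\\s,]*(\\S*)[\\s,]*(\\S*)[\\s,]*(.*)$");

    // Tab-separated variant: ten fields, capturing only the row number (1) and the text (2).
    static final Pattern TEXT_ONLY = Pattern.compile(
            "^(\\d+)\t[^\t]*\t[^\t]*\t[^\t]*\t[^\t]*\t([^\t]*)\t[^\t]*\t[^\t]*\t[^\t]*\t[^\t]*$");

    // Returns the requested group, or null when the line does not match the pattern.
    static String group(Pattern pattern, String line, int index) {
        Matcher m = pattern.matcher(line);
        return m.matches() ? m.group(index) : null;
    }

    public static void main(String[] args) {
        // Made-up sample row in the question's column order, tab-separated.
        String line = String.join("\t",
                "1", "x", "Paris", "alice", "#java", "hello world", "3", "2015-01-01", "5", "0");
        // Group 6 of the original pattern holds the text plus every column after it.
        System.out.println("greedy group 6: " + group(GREEDY, line, 6));
        // Group 2 of the tab-separated pattern holds the text column only.
        System.out.println("text-only group 2: " + group(TEXT_ONLY, line, 2));
    }
}
```

With this pattern, a line that does not have exactly ten tab-separated fields (for example, a tweet whose text itself contains a tab) does not match at all, so such rows would need to be cleaned before import.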