I'm using Mallet 2.0.7 in Java for mining tweets. According to the documentation, for topic modeling I have to read the data set using a CsvIterator:
Reader fileReader = new InputStreamReader(new FileInputStream(new File(args[0])), "UTF-8");
instances.addThruPipe(new CsvIterator (fileReader, Pattern.compile("^(\\S*)[\\s,]*(\\S*)[\\s,]*(.*)$"),
3, 2, 1)); // data, label, name fields
My data set has these columns: row,x,location,username,hashtaghs,text,retweets,date,favorites,numberOfComment
For the label I added the column x. For now I want to run the algorithm on the text column (column 6) and add other columns later. I wrote the pattern below, but it doesn't work as expected: it captures everything from column 6 to the end of the line as data. How do I change the regular expression in the pattern?
Reader fileReader = new InputStreamReader(new FileInputStream(new File(filePath)), "UTF-8");
instances.addThruPipe(new CsvIterator(fileReader,
Pattern.compile("^(\\S*)[\\s,]*(\\S*)[\\s,]*(\\S*)[\\s,]*(\\S*)[\\s,]*(\\S*)[\\s,]*(.*)$"),
6, 2, 1)); // data, label, name fields
Look at the regular expression documentation to understand the meaning of each element of the pattern. The original pattern divides the whole line into three groups: all characters from the beginning to the first comma or whitespace, all characters until the second comma or whitespace, and then everything else.
The new pattern does the same, but captures six groups. That's why you're getting everything from the text to the end of the line.
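To see concretely what those three groups capture, you can print them for a sample line; the class name and the sample line below are made up for illustration:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupDemo {
    public static void main(String[] args) {
        // Made-up line in the name/label/data layout the original pattern expects.
        String line = "doc1 X this is the tweet text, with commas and spaces";
        Matcher m = Pattern.compile("^(\\S*)[\\s,]*(\\S*)[\\s,]*(.*)$").matcher(line);
        if (m.matches()) {
            System.out.println("group 1 (name):  " + m.group(1)); // doc1
            System.out.println("group 2 (label): " + m.group(2)); // X
            System.out.println("group 3 (data):  " + m.group(3)); // the rest of the line
        }
    }
}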
I would recommend a few fixes:
- If a field isn't relevant, like label, you can just use 0 to specify that it doesn't exist. You don't need to add a dummy field.
- Anything in () is a capturing group. If you don't want to include a field, don't capture it: just delete the parentheses but leave the rest of the pattern.
- The original pattern works because we can make assumptions about the name and label fields: they don't contain commas or spaces, and everything afterwards is text. To grab a field in the middle of a line you have to be more careful and find where the text field ends. I would strongly suggest using tab-delimited fields, assuming no field contains tab characters (a quick conversion sketch follows).
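If your file is currently comma-separated, a rough conversion sketch is below; the file names are placeholders, and it only works under the assumption that no field contains embedded commas, quotes, or tabs (tweet text often contains commas, so you may need a real CSV parser or to re-export the data as TSV instead):
import java.io.*;
import java.nio.charset.StandardCharsets;

public class CsvToTsv {
    public static void main(String[] args) throws IOException {
        // Placeholder file names; assumes no field contains embedded commas, quotes, or tabs.
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                 new FileInputStream("tweets.csv"), StandardCharsets.UTF_8));
             PrintWriter out = new PrintWriter(new OutputStreamWriter(
                 new FileOutputStream("tweets.tsv"), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                // Naive conversion: replace every comma with a tab.
                out.println(line.replace(',', '\t'));
            }
        }
    }
}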
Try something like this (not tested):
// row,x,location,username,hashtaghs,text,retweets,date,favorites,numberOfComment
Reader fileReader = new InputStreamReader(new FileInputStream(new File(filePath)), "UTF-8");
instances.addThruPipe(new CsvIterator(fileReader,
Pattern.compile("^(\d+)\t[^\t]*\t[^\t]*\t[^\t]*\t([^\t]*)\t[^\t]*\t[^\t]*\t[^\t]*\t[^\t]*$"),
2, 0, 1)); // data, label, name fields
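For context, here is a minimal sketch of how that CsvIterator call fits into a topic-modeling run, based on the standard Mallet developer example; the file name tweets.tsv, the number of topics, and the iteration count are placeholders:
import cc.mallet.pipe.*;
import cc.mallet.pipe.iterator.CsvIterator;
import cc.mallet.topics.ParallelTopicModel;
import cc.mallet.types.InstanceList;

import java.io.*;
import java.util.ArrayList;
import java.util.regex.Pattern;

public class TweetTopics {
    public static void main(String[] args) throws Exception {
        // Standard Mallet import pipeline: lowercase, tokenize, map tokens to feature indices.
        ArrayList<Pipe> pipeList = new ArrayList<Pipe>();
        pipeList.add(new CharSequenceLowercase());
        pipeList.add(new CharSequence2TokenSequence(Pattern.compile("\\p{L}[\\p{L}\\p{P}]+\\p{L}")));
        pipeList.add(new TokenSequence2FeatureSequence());
        InstanceList instances = new InstanceList(new SerialPipes(pipeList));

        // "tweets.tsv" is a placeholder for a tab-delimited export of the data set above.
        Reader fileReader = new InputStreamReader(new FileInputStream(new File("tweets.tsv")), "UTF-8");
        instances.addThruPipe(new CsvIterator(fileReader,
            Pattern.compile("^(\\d+)\\t[^\\t]*\\t[^\\t]*\\t[^\\t]*\\t[^\\t]*\\t([^\\t]*)\\t[^\\t]*\\t[^\\t]*\\t[^\\t]*\\t[^\\t]*$"),
            2, 0, 1)); // data, label, name fields

        // Train a topic model on the imported instances; 10 topics and 100 iterations are arbitrary.
        ParallelTopicModel model = new ParallelTopicModel(10, 1.0, 0.01);
        model.addInstances(instances);
        model.setNumThreads(2);
        model.setNumIterations(100);
        model.estimate();
    }
}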