I have a data set that looks like this:
drawdate lotterynumbers meganumber multiplier
2005-01-04 03 06 07 12 32 30 NULL
2005-01-07 02 08 14 15 51 38 NULL
etc.
and the following code:
public class LotteryCount {
/**
* Mapper which extracts the lottery number and passes it to the Reducer with a single occurrence
*/
public static class LotteryMapper extends Mapper<Object, Text, IntWritable, IntWritable> {
private final static IntWritable one = new IntWritable(1);
private final IntWritable lotteryKey = new IntWritable(); // must be initialized before set() is called
public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
StringTokenizer itr = new StringTokenizer(value.toString(), ",");
while (itr.hasMoreTokens()) {
lotteryKey.set(Integer.valueOf(itr.nextToken()));
context.write(lotteryKey, one);
}
}
}
/**
* Reducer to sum up the occurrence
*/
public static class LotteryReducer
extends Reducer<IntWritable, IntWritable, IntWritable, IntWritable> {
IntWritable result = new IntWritable();
public void reduce(IntWritable key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
result.set(sum);
context.write(key, result);
}
}
}
It is essentially the word count example from the official Apache Hadoop documentation, lightly customized for my data set.
I get the following error:
Caused by: java.lang.NumberFormatException: For input string: "2005-01-04"
I am only interested in counting the occurrences of each individual drawn lottery number. How can I do this using the StringTokenizer from my code? I know that I have to split the whole row, because the tokenizer is "fed" the whole row. How can I take the lotterynumbers field, split it, and then count?
Thank you in advance
The data sample you posted is tab-delimited:
drawdate lotterynumbers meganumber multiplier
2005-01-04 03 06 07 12 32 30 NULL
2005-01-07 02 08 14 15 51 38 NULL
Here's a simple example, and a few notes:
- It defines a single input, line, including the tab characters separating the data fields, just like you posted.
- It creates a StringTokenizer with the token separator defined as a single tab character (\t).
- It calls hasMoreTokens() until all tokens are seen, printing each one along the way.
- Each token is printed between [] characters so that whitespace is visible: note the trailing space after the first three tokens, and the same with the leading whitespace in front of "NULL".
String line = "2005-01-04 \t03 06 07 12 32 \t30 \t NULL";
StringTokenizer tokenizer = new StringTokenizer(line, "\t");
while (tokenizer.hasMoreTokens()) {
String token = tokenizer.nextToken();
System.out.println("token: [" + token + "]");
}
Here's the output:
token: [2005-01-04 ]
token: [03 06 07 12 32 ]
token: [30 ]
token: [ NULL]
You could take this approach: process each line, tokenize on the tab character, and use the second token as your "lotterynumbers" data. That token can then be split again on spaces to get the individual numbers to count.
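To make the second step concrete, here is a minimal sketch in plain Java (no Hadoop dependencies) that takes the second tab-delimited field, splits it on whitespace, and tallies each number. The class name LotteryLineParser and both method names are hypothetical, chosen only for illustration. It assumes every data row has the layout shown above, and it treats any row whose second field is not numeric (such as the header row) as something to skip.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical helper class, not part of Hadoop or of the original question.
public class LotteryLineParser {

    /**
     * Returns the drawn numbers from one tab-delimited row.
     * Returns an empty list if the row has fewer than two fields or if the
     * second field contains anything non-numeric (e.g. the header row).
     */
    static List<Integer> parseLotteryNumbers(String line) {
        List<Integer> numbers = new ArrayList<>();
        StringTokenizer fields = new StringTokenizer(line, "\t");
        if (fields.countTokens() < 2) {
            return numbers;
        }
        fields.nextToken();                       // skip drawdate
        String lotteryField = fields.nextToken(); // e.g. "03 06 07 12 32 "
        // The default StringTokenizer delimiters are whitespace characters,
        // so leading and trailing spaces in the field are ignored.
        StringTokenizer draws = new StringTokenizer(lotteryField);
        while (draws.hasMoreTokens()) {
            try {
                numbers.add(Integer.parseInt(draws.nextToken()));
            } catch (NumberFormatException e) {
                return new ArrayList<>();         // not a data row, skip it
            }
        }
        return numbers;
    }

    /** Counts how often each number appears across all rows. */
    static Map<Integer, Integer> countOccurrences(List<String> lines) {
        Map<Integer, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            for (int number : parseLotteryNumbers(line)) {
                counts.merge(number, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "drawdate\tlotterynumbers\tmeganumber\tmultiplier",
                "2005-01-04 \t03 06 07 12 32 \t30 \t NULL",
                "2005-01-07 \t02 08 14 15 51 \t38 \t NULL");
        System.out.println(countOccurrences(lines));
    }
}
```

In the mapper from the question, the same idea would mean replacing the "," tokenizer with the two-level split in parseLotteryNumbers and calling context.write(lotteryKey, one) once per parsed number; the reducer can stay as it is. One caveat: StringTokenizer collapses consecutive delimiters, so a row with an empty field between two tabs would shift the field positions.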