Splitting the tokens with Java StringTokenizer

I have a data set that looks like this:

drawdate    lotterynumbers  meganumber  multiplier
2005-01-04  03 06 07 12 32  30            NULL
2005-01-07  02 08 14 15 51  38            NULL
etc.

and the following code:

public class LotteryCount {

    /**
     * Mapper which extracts the lottery number and passes it to the Reducer with a single occurrence
     */
    public static class LotteryMapper extends Mapper<Object, Text, IntWritable, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private IntWritable lotteryKey;

        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {

            StringTokenizer itr = new StringTokenizer(value.toString(), ",");
            while (itr.hasMoreTokens()) {
                lotteryKey.set(Integer.valueOf(itr.nextToken()));
                context.write(lotteryKey, one);
            }
        }
    }

    /**
     * Reducer to sum up the occurrence
     */
    public static class LotteryReducer
            extends Reducer<IntWritable, IntWritable, IntWritable, IntWritable> {
        IntWritable result = new IntWritable();

        public void reduce(IntWritable key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            int sum = 0;

            for (IntWritable val : values) {
                sum += val.get();
            }

            result.set(sum);
            context.write(key, result);
        }
    }
}

It is essentially the word count example from the official Apache Hadoop documentation, lightly adapted to my data set.

I get the following error:

Caused by: java.lang.NumberFormatException: For input string: "2005-01-04"

I am only interested in counting the occurrences of each individual drawn lottery number. How can I do this using the StringTokenizer from my code? I know that I have to split the whole row, because the tokenizer is "fed" the whole row. How can I take the lotterynumbers field, split it, and then count?

Thank you in advance

Solution

The data sample you posted is tab-delimited:

drawdate    lotterynumbers  meganumber  multiplier
2005-01-04  03 06 07 12 32  30            NULL
2005-01-07  02 08 14 15 51  38            NULL

Here's a simple example, and a few notes:

This uses the first line of your sample data as the string line, including the tab characters separating the data fields, just like you posted.
It uses a StringTokenizer with the token separator defined as a single tab character (\t).
The program calls hasMoreTokens() until all tokens are seen, printing each one along the way.
The output wraps each token in square brackets to show its boundaries. For example, the "30" has a trailing space character that wouldn't be noticeable without the [] characters, and the same goes for the leading whitespace in front of "NULL".

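import java.util.StringTokenizer;

// Tabs are written as \t escapes so the field boundaries survive copy and paste.
String line = "2005-01-04 \t03 06 07 12 32 \t30 \t          NULL";
StringTokenizer tokenizer = new StringTokenizer(line, "\t");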

while (tokenizer.hasMoreTokens()) {
    String token = tokenizer.nextToken();
    System.out.println("token: [" + token + "]");
}

Here's the output:

token: [2005-01-04 ]
token: [03 06 07 12 32 ]
token: [30 ]
token: [          NULL]

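You could take this approach for every line: tokenize on the tab character, then use the 2nd token as your "lotterynumbers" data. That token still holds all five numbers separated by spaces, so it needs to be split once more before each number is converted to an integer.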
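As a rough sketch of that second step, the following plain-Java example splits a row on tabs, takes the second field, and splits it again on whitespace. It leaves Hadoop out so it can be run on its own. The class name LotteryNumberExtractor and its two methods are hypothetical names made up for illustration, and the code assumes the rows really are tab-delimited with the drawn numbers in the second field.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical helper, not part of the posted code or of Hadoop.
class LotteryNumberExtractor {

    // Returns the drawn numbers from one row, or an empty list when the row
    // does not fit the assumed layout (for example, the header row).
    static List<Integer> extractNumbers(String line) {
        List<Integer> numbers = new ArrayList<>();
        StringTokenizer fields = new StringTokenizer(line, "\t");
        if (fields.countTokens() < 2) {
            return numbers;
        }
        fields.nextToken(); // skip the drawdate field

        // With no delimiter argument, StringTokenizer splits on whitespace,
        // which also discards the trailing space seen in the sample data.
        StringTokenizer drawn = new StringTokenizer(fields.nextToken());
        while (drawn.hasMoreTokens()) {
            try {
                numbers.add(Integer.parseInt(drawn.nextToken()));
            } catch (NumberFormatException e) {
                // Non-numeric second field, such as "lotterynumbers" in the header.
                return new ArrayList<>();
            }
        }
        return numbers;
    }

    // Counts how often each number was drawn across all rows.
    static Map<Integer, Integer> countOccurrences(List<String> lines) {
        Map<Integer, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            for (int number : extractNumbers(line)) {
                counts.merge(number, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "drawdate\tlotterynumbers\tmeganumber\tmultiplier",
                "2005-01-04\t03 06 07 12 32\t30\tNULL",
                "2005-01-07\t02 08 14 15 51\t38\tNULL");
        System.out.println(countOccurrences(lines));
    }
}
```

In the mapper, the body of extractNumbers would sit inside map(), with lotteryKey.set(number) and context.write(lotteryKey, one) called for each parsed number, and the reducer would do the counting instead of countOccurrences. Note that the posted lotteryKey field is declared but never assigned, so it would need to be initialized with new IntWritable() before set() is called. Also keep in mind that StringTokenizer never returns empty tokens, so a row with an empty field between two tabs would shift the field positions.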