java hadoop mapreduce stringtokenizer

Splitting the tokens with Java StringTokenizer


I have a data set that looks like this:

drawdate    lotterynumbers  meganumber  multiplier
2005-01-04  03 06 07 12 32  30            NULL
2005-01-07  02 08 14 15 51  38            NULL
etc.

and the following code:

public class LotteryCount {

    /**
     * Mapper which extracts the lottery number and passes it to the Reducer with a single occurrence
     */
    public static class LotteryMapper extends Mapper<Object, Text, IntWritable, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private IntWritable lotteryKey;

        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {

            StringTokenizer itr = new StringTokenizer(value.toString(), ",");
            while (itr.hasMoreTokens()) {
                lotteryKey.set(Integer.valueOf(itr.nextToken()));
                context.write(lotteryKey, one);
            }
        }
    }

    /**
     * Reducer to sum up the occurrence
     */
    public static class LotteryReducer
            extends Reducer<IntWritable, IntWritable, IntWritable, IntWritable> {
        IntWritable result = new IntWritable();

        public void reduce(IntWritable key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            int sum = 0;

            for (IntWritable val : values) {
                sum += val.get();
            }

            result.set(sum);
            context.write(key, result);
        }
    }
}

It is essentially the WordCount example from the official Apache Hadoop documentation, slightly adapted to my data set.

I get the following error:

Caused by: java.lang.NumberFormatException: For input string: "2005-01-04"

I am just interested in counting the occurrences of each individual drawn lottery number. How can I do this using the StringTokenizer from my code? I know that I have to split the whole row, because the tokenizer is "fed" the whole row. How can I take the lotterynumbers, split them and then count?

Thank you in advance


Solution

  • I am just interested in counting the occurrences of each individual drawn lottery number. How can I do this using the StringTokenizer from my code? I know that I have to split the whole row, because the tokenizer is "fed" the whole row. How can I take the lotterynumbers, split them and then count?

    The data sample you posted is tab-delimited, not comma-delimited, so the delimiter passed to StringTokenizer should be "\t" rather than ",":

    drawdate    lotterynumbers  meganumber  multiplier
    2005-01-04  03 06 07 12 32  30            NULL
    2005-01-07  02 08 14 15 51  38            NULL
    

    Here's a simple example, and a few notes:

    // Tabs are written as \t escapes so the delimiters survive copy and paste
    String line = "2005-01-04 \t03 06 07 12 32 \t30 \t          NULL";
    StringTokenizer tokenizer = new StringTokenizer(line, "\t");
    
    while (tokenizer.hasMoreTokens()) {
        String token = tokenizer.nextToken();
        System.out.println("token: [" + token + "]");
    }
    

    Here's the output:

    token: [2005-01-04 ]
    token: [03 06 07 12 32 ]
    token: [30 ]
    token: [          NULL]
    

    You could take this approach for every line: tokenize on the tab character, take the 2nd token as your "lotterynumbers" field, trim the surrounding whitespace, and then split that field on spaces to get the individual numbers to count.
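
The approach above could be sketched roughly as follows, in plain Java so it can be tried without a cluster. The class name LotteryNumbers and its extract and count methods are hypothetical helpers made up for this illustration; they are not part of Hadoop or of the question's code. The sketch assumes that every data row is tab-delimited with the drawn numbers in the second field, and that any row whose second field is not numeric (such as the header) should simply be skipped.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical helper for illustration only (not a Hadoop class).
class LotteryNumbers {

    // Returns the drawn numbers from one tab-delimited row.
    // Rows with fewer than two fields, or with a non-numeric second
    // field (for example the header row), yield an empty list.
    static List<Integer> extract(String line) {
        List<Integer> numbers = new ArrayList<>();
        StringTokenizer fields = new StringTokenizer(line, "\t");
        if (fields.countTokens() < 2) {
            return numbers;
        }
        fields.nextToken();                        // skip the drawdate field
        String drawn = fields.nextToken().trim();  // e.g. "03 06 07 12 32"

        // The default StringTokenizer delimiters are whitespace characters.
        StringTokenizer parts = new StringTokenizer(drawn);
        while (parts.hasMoreTokens()) {
            try {
                numbers.add(Integer.parseInt(parts.nextToken()));
            } catch (NumberFormatException e) {
                return new ArrayList<>();          // not a data row, skip it
            }
        }
        return numbers;
    }

    // Local stand-in for the map and reduce steps: counts how often
    // each number was drawn across all rows.
    static Map<Integer, Integer> count(Iterable<String> lines) {
        Map<Integer, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            for (int n : extract(line)) {
                counts.merge(n, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList(
                "drawdate\tlotterynumbers\tmeganumber\tmultiplier",
                "2005-01-04\t03 06 07 12 32\t30\tNULL",
                "2005-01-07\t02 08 14 15 51\t38\tNULL");
        System.out.println(count(rows));
    }
}
```

Inside the mapper from the question, the loop in map() would iterate over the numbers returned for value.toString() and call lotteryKey.set(n) followed by context.write(lotteryKey, one) for each. Note that lotteryKey is declared there but never instantiated, so it would likely need to be initialized with new IntWritable() before set is called.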