
Splitting the tokens with Java StringTokenizer

I have a data set that looks like this:

drawdate    lotterynumbers  meganumber  multiplier
2005-01-04  03 06 07 12 32  30            NULL
2005-01-07  02 08 14 15 51  38            NULL

and the following code:

public class LotteryCount {

    /**
     * Mapper which extracts the lottery number and passes it to the Reducer with a single occurrence
     */
    public static class LotteryMapper extends Mapper<Object, Text, IntWritable, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private IntWritable lotteryKey;

        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {

            StringTokenizer itr = new StringTokenizer(value.toString(), ",");
            while (itr.hasMoreTokens()) {
                lotteryKey = new IntWritable(Integer.parseInt(itr.nextToken()));
                context.write(lotteryKey, one);
            }
        }
    }

    /**
     * Reducer to sum up the occurrence
     */
    public static class LotteryReducer
            extends Reducer<IntWritable, IntWritable, IntWritable, IntWritable> {
        IntWritable result = new IntWritable();

        public void reduce(IntWritable key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            int sum = 0;

            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }
}

It is essentially the WordCount example from the official Apache Hadoop documentation, slightly customized for my data set.

I get the following error:

Caused by: java.lang.NumberFormatException: For input string: "2005-01-04"

I am only interested in counting the occurrences of each individual drawn lottery number. How can I do this using the StringTokenizer from my code? I know that I have to split the whole row, because the tokenizer is "fed" the whole row. How can I take the lotterynumbers column, split it, and then count the numbers?

Thank you in advance


    The data sample you posted is tab-delimited:

    Here's a simple example, and a few notes:

    // Tabs are written as \t so the delimiters are visible; the padding spaces are part of the data.
    String line = "2005-01-04 \t03 06 07 12 32 \t30 \t          NULL";
    StringTokenizer tokenizer = new StringTokenizer(line, "\t");
    while (tokenizer.hasMoreTokens()) {
        String token = tokenizer.nextToken();
        System.out.println("token: [" + token + "]");
    }

    Here's the output:

    token: [2005-01-04 ]
    token: [03 06 07 12 32 ]
    token: [30 ]
    token: [          NULL]

    You could take this approach: process each line, tokenize on the tab character, and use the 2nd token as your "lotterynumbers" data. That token can then be split on spaces to get the individual numbers to count.
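
To make the "2nd token" idea concrete, here is a minimal sketch in plain Java, without the Hadoop classes, so it can be run on its own. The class name `LotteryNumbers` and the methods `extractNumbers` and `count` are hypothetical names chosen for illustration, and the sketch assumes the rows really are tab-delimited with space-separated numbers in the second column. Rows whose second column is not numeric (such as the header row) are skipped.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical helper, not part of Hadoop: pulls the drawn numbers out of one row.
final class LotteryNumbers {

    // Returns the numbers in the 2nd tab-delimited column, or an empty list
    // when the row has fewer than two columns or the column is not numeric.
    static List<Integer> extractNumbers(String line) {
        List<Integer> numbers = new ArrayList<>();
        StringTokenizer columns = new StringTokenizer(line, "\t");
        if (columns.countTokens() < 2) {
            return numbers;
        }
        columns.nextToken(); // skip the drawdate column
        StringTokenizer drawn = new StringTokenizer(columns.nextToken(), " ");
        while (drawn.hasMoreTokens()) {
            try {
                numbers.add(Integer.parseInt(drawn.nextToken()));
            } catch (NumberFormatException e) {
                return new ArrayList<>(); // e.g. the header row: skip it entirely
            }
        }
        return numbers;
    }

    // Counts how often each number occurs across all rows (the reducer's job in MapReduce).
    static Map<Integer, Integer> count(Iterable<String> lines) {
        Map<Integer, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            for (int number : extractNumbers(line)) {
                counts.merge(number, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList(
                "drawdate\tlotterynumbers\tmeganumber\tmultiplier",
                "2005-01-04\t03 06 07 12 32\t30\tNULL",
                "2005-01-07\t02 08 14 15 51\t38\tNULL");
        System.out.println(count(rows));
    }
}
```

In the mapper from the question, the same extraction would presumably replace the single comma tokenizer: for each number returned, write `new IntWritable(number)` with `one` to the context, and leave the reducer as it is.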