Tags: java, hadoop, hdfs, sequencefile

SequenceFile compactor: merging several small files into a single .seq file


Newbie in HDFS and Hadoop: I am developing a program that should take all the files of a specific directory, which can contain several small files of any type.

It should take every file and append it to a compressed SequenceFile, where the key must be the path of the file and the value must be the file's content. For now my code is:

    import java.net.*;

    import org.apache.hadoop.fs.*;
    import org.apache.hadoop.conf.*;
    import org.apache.hadoop.io.*;
    import org.apache.hadoop.io.compress.BZip2Codec;

    public class Compact {
        public static void main(String[] args) throws Exception {
            try {
                Configuration conf = new Configuration();
                FileSystem fs =
                        FileSystem.get(new URI("hdfs://quickstart.cloudera:8020"), conf);
                Path destino = new Path("/user/cloudera/data/testPractice.seq"); // test args[1]

                if (fs.exists(destino)) {
                    System.out.println("exist : " + destino);
                    return;
                }
                BZip2Codec codec = new BZip2Codec();

                SequenceFile.Writer outSeq = SequenceFile.createWriter(conf,
                        SequenceFile.Writer.file(fs.makeQualified(destino)),
                        SequenceFile.Writer.compression(SequenceFile.CompressionType.BLOCK, codec),
                        SequenceFile.Writer.keyClass(Text.class),
                        SequenceFile.Writer.valueClass(FSDataInputStream.class));

                FileStatus[] status = fs.globStatus(new Path("/user/cloudera/data/*.txt")); // args[0]
                for (int i = 0; i < status.length; i++) {
                    FSDataInputStream in = fs.open(status[i].getPath());

                    outSeq.append(new Text(status[i].getPath().toString()), new FSDataInputStream(in));
                    fs.close();
                }
                outSeq.close();
                System.out.println("End Program");
            } catch (Exception e) {
                System.out.println(e.toString());
                System.out.println("File not found");
            }
        }
    }

But after every execution I receive this exception:

    java.io.IOException: Could not find a serializer for the Value class: 'org.apache.hadoop.fs.FSDataInputStream'. Please ensure that the configuration 'io.serializations' is properly configured, if you're using custom serialization.
    File not found

I understand the error must be in the type of value I define for adding to the SequenceFile, but I don't know which type I should use instead. Can anyone help me?


Solution

  • FSDataInputStream, like any other InputStream, is not intended to be serialized. What would serializing an "iterator" over a stream of bytes even mean?

    What you most likely want to do is to store the content of the file as the value. For example, you can switch the value type from FSDataInputStream to BytesWritable and just read all the bytes out of the FSDataInputStream, as shown in the sketch below. One drawback of using a key/value SequenceFile for such a purpose is that the content of each file has to fit in memory. That is fine for small files, but you have to be aware of this limitation.

    I am not sure what you are really trying to achieve, but perhaps you could avoid reinventing the wheel by using something like Hadoop Archives?
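
    For illustration, here is a minimal, untested sketch of that change, reusing the paths and cluster URI from the question and assuming the Hadoop 2.x SequenceFile.Writer option API: each file is read fully into a byte[] and appended as a BytesWritable.

        import java.net.URI;

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FSDataInputStream;
        import org.apache.hadoop.fs.FileStatus;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.BytesWritable;
        import org.apache.hadoop.io.IOUtils;
        import org.apache.hadoop.io.SequenceFile;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.io.compress.BZip2Codec;

        public class Compact {
            public static void main(String[] args) throws Exception {
                Configuration conf = new Configuration();
                FileSystem fs = FileSystem.get(new URI("hdfs://quickstart.cloudera:8020"), conf);
                Path destino = new Path("/user/cloudera/data/testPractice.seq");

                if (fs.exists(destino)) {
                    System.out.println("exist : " + destino);
                    return;
                }

                SequenceFile.Writer outSeq = SequenceFile.createWriter(conf,
                        SequenceFile.Writer.file(fs.makeQualified(destino)),
                        SequenceFile.Writer.compression(SequenceFile.CompressionType.BLOCK, new BZip2Codec()),
                        SequenceFile.Writer.keyClass(Text.class),
                        // BytesWritable ships with a serializer, unlike FSDataInputStream
                        SequenceFile.Writer.valueClass(BytesWritable.class));
                try {
                    FileStatus[] statuses = fs.globStatus(new Path("/user/cloudera/data/*.txt"));
                    if (statuses == null) {
                        statuses = new FileStatus[0]; // globStatus may return null when nothing matches
                    }
                    for (FileStatus status : statuses) {
                        FSDataInputStream in = fs.open(status.getPath());
                        try {
                            // Read the whole file into memory -- acceptable for small files only
                            byte[] content = new byte[(int) status.getLen()];
                            IOUtils.readFully(in, content, 0, content.length);
                            outSeq.append(new Text(status.getPath().toString()),
                                          new BytesWritable(content));
                        } finally {
                            IOUtils.closeStream(in); // close the per-file stream, not the FileSystem
                        }
                    }
                } finally {
                    outSeq.close();
                }
                System.out.println("End Program");
            }
        }

    Note that only the per-file stream is closed inside the loop. The original code called fs.close() there, which shuts down the shared FileSystem after the first file and would make every subsequent fs.open() fail.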