java, maven, apache-spark, intellij-idea, lzo

Importing an LZO file into Java Spark as a Dataset


I have some data in TSV format, compressed using LZO. Now I would like to use these data in a Java Spark program.

At the moment, I am able to decompress the files and then import them into Java as text files using

    SparkSession spark = SparkSession.builder()
            .master("local[2]")
            .appName("MyName")
            .getOrCreate();

    Dataset<Row> input = spark.read()
            .option("sep", "\t")
            .csv(args[0]);

    input.show(5);   // visually check if data were imported correctly

where I have passed the path of the decompressed file as the first argument. If I pass the LZO file instead, the output of show is illegible garbage.

Is there a way to make it work? I use IntelliJ IDEA as my IDE and the project is set up with Maven.


Solution

  • I found a solution. It consists of two parts: installing the hadoop-lzo package and configuring it. After doing this, the code remains the same as in the question, provided one is OK with the LZO file being imported as a single partition (plain LZO files are not splittable, so the whole file goes to one partition).

    In the following, I will explain how to do it for a Maven project set up in IntelliJ.

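    The original post does not show the pom.xml changes it refers to, so here is a minimal sketch of what they might look like, assuming the commonly used Twitter-hosted build of hadoop-lzo (the repository URL, coordinates, and version number are my assumptions, not taken from the post):

    ```xml
    <!-- Sketch only: repository and dependency for hadoop-lzo.
         Coordinates and version are assumptions; check the hadoop-lzo
         project for the current ones. -->
    <repositories>
        <repository>
            <id>twitter</id>
            <url>https://maven.twttr.com</url>
        </repository>
    </repositories>

    <dependencies>
        <dependency>
            <groupId>com.hadoop.gplcompression</groupId>
            <artifactId>hadoop-lzo</artifactId>
            <version>0.4.20</version>
        </dependency>
    </dependencies>
    ```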
    Adding the Twitter Maven repository to the pom.xml activates the repository that contains the hadoop-lzo package, and adding the dependency makes hadoop-lzo available to the project.
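    The "configuring it" part presumably means registering the LZO codec with Hadoop so that Spark decompresses .lzo inputs transparently on read. A hedged sketch of one way to do that: the codec class com.hadoop.compression.lzo.LzopCodec comes from the hadoop-lzo project, but setting it through the session's Hadoop configuration (rather than in core-site.xml) is my assumption, not something stated in the post:

    ```java
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class LzoImport {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .master("local[2]")
                    .appName("MyName")
                    .getOrCreate();

            // Sketch: register the LZOP codec so files ending in .lzo are
            // decompressed when read. Property name and codec class are
            // assumptions based on the hadoop-lzo project.
            spark.sparkContext().hadoopConfiguration()
                    .set("io.compression.codecs",
                         "com.hadoop.compression.lzo.LzopCodec");

            // The read code from the question is unchanged;
            // args[0] now points at the .lzo file.
            Dataset<Row> input = spark.read()
                    .option("sep", "\t")
                    .csv(args[0]);

            input.show(5);
        }
    }
    ```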

    And that's it! Run your program again and it should work!