Tags: hadoop, apache-spark, hdfs, hadoop-streaming, apache-crunch

What does reading data in a "streaming fashion" mean?


I was reading the Apache Crunch documentation and I found the following sentence:

Data is read in from the filesystem in a streaming fashion, so there is no requirement for the contents of the PCollection to fit in memory for it to be read into the client using materialization.

I would like to know what read in from the filesystem in a streaming fashion means, and I would much appreciate it if someone could explain how it differs from other ways of reading data.

I would say this concept also applies to other tools, such as Spark.


Solution

  • Let's say you have a file in English on your filesystem that you need to translate to German. You basically have two choices. You can load the whole English file into memory as one big batch, translate the whole batch at once, and then write the new German batch back out to the filesystem.

    Or you could do it line-by-line. Read the first line in English; translate to German and write out to the new file; read the second line in English and translate to German and append to the new file; and so on.

    The latter approach is analogous to the streaming approach described in the Apache Crunch documentation.
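
    To make the contrast concrete, here is a minimal Java sketch of both approaches. The file names and the translate method are hypothetical placeholders for whatever per-line work you actually need to do:

        import java.io.BufferedReader;
        import java.io.BufferedWriter;
        import java.io.IOException;
        import java.nio.file.Files;
        import java.nio.file.Paths;
        import java.util.List;

        public class TranslateFile {

            // Hypothetical stand-in for the real per-line transformation.
            static String translate(String english) {
                return english;
            }

            public static void main(String[] args) throws IOException {
                // Batch approach: the entire file must fit in memory at once.
                List<String> all = Files.readAllLines(Paths.get("english.txt"));
                try (BufferedWriter out = Files.newBufferedWriter(Paths.get("german-batch.txt"))) {
                    for (String line : all) {
                        out.write(translate(line));
                        out.newLine();
                    }
                }

                // Streaming approach: only one line is in memory at a time,
                // so the input can be arbitrarily large.
                try (BufferedReader in = Files.newBufferedReader(Paths.get("english.txt"));
                     BufferedWriter out = Files.newBufferedWriter(Paths.get("german-stream.txt"))) {
                    String line;
                    while ((line = in.readLine()) != null) {
                        out.write(translate(line));
                        out.newLine();
                    }
                }
            }
        }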

    The PCollection is to Crunch what the RDD is to Spark: the framework's fundamental distributed data abstraction. Crunch, however, operates at a higher level of abstraction; it seeks to provide a clean API for data pipelines that span technologies.
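
    As a rough illustration of the sentence you quoted, here is a minimal Crunch sketch (the input path is hypothetical). Iterating over the result of materialize() streams records from the filesystem to the client one at a time, so the PCollection never has to fit in client memory:

        import org.apache.crunch.PCollection;
        import org.apache.crunch.Pipeline;
        import org.apache.crunch.impl.mr.MRPipeline;

        public class MaterializeExample {
            public static void main(String[] args) {
                Pipeline pipeline = new MRPipeline(MaterializeExample.class);

                // A lazy, distributed dataset, analogous to Spark's RDD.
                PCollection<String> lines = pipeline.readTextFile("/data/huge-input.txt");

                // materialize() yields an Iterable backed by the filesystem;
                // records are read in a streaming fashion as you iterate.
                for (String line : lines.materialize()) {
                    System.out.println(line);
                }

                pipeline.done();
            }
        }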

    For example, you may have data in Hive that you already have reliable queries for; the output of those queries serves as the input to a legacy MapReduce job that stores data in HBase; those data are analyzed by Spark's MLlib machine learning library, and the results ultimately go to Cassandra. Crunch seeks to pipe all of that together through the PCollection abstraction, and its streaming approach means that you don't have to wait for one job to finish before the next one starts. Just as with the line-by-line file translation, you process a bit at a time and move each bit through each phase of the pipeline, as opposed to doing it all in batches.
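
    Here is a minimal sketch of chaining phases through the PCollection abstraction. The two DoFn phases are hypothetical placeholders; a real pipeline would substitute the Hive, HBase, Spark, and Cassandra sources and targets mentioned above:

        import org.apache.crunch.DoFn;
        import org.apache.crunch.Emitter;
        import org.apache.crunch.PCollection;
        import org.apache.crunch.Pipeline;
        import org.apache.crunch.impl.mr.MRPipeline;
        import org.apache.crunch.types.writable.Writables;

        public class PipelineSketch {
            public static void main(String[] args) {
                Pipeline pipeline = new MRPipeline(PipelineSketch.class);

                PCollection<String> raw = pipeline.readTextFile("/data/input");

                // Phase 1: cleanse each record (placeholder logic).
                PCollection<String> cleansed = raw.parallelDo(
                    new DoFn<String, String>() {
                        @Override
                        public void process(String input, Emitter<String> emitter) {
                            emitter.emit(input.trim());
                        }
                    }, Writables.strings());

                // Phase 2: enrich each record (placeholder logic). Records
                // flow through both phases a bit at a time; Crunch can even
                // fuse the phases into a single pass over the data.
                PCollection<String> enriched = cleansed.parallelDo(
                    new DoFn<String, String>() {
                        @Override
                        public void process(String input, Emitter<String> emitter) {
                            emitter.emit(input.toUpperCase());
                        }
                    }, Writables.strings());

                pipeline.writeTextFile(enriched, "/data/output");
                pipeline.done();
            }
        }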

    You are right that the concept of streaming applies to tools like Spark (most obviously with Spark Streaming), but, as I mentioned, Spark works at a lower level of abstraction than Crunch. A Spark job might be just one part of a Crunch pipeline. Streaming is a powerful paradigm indeed: it is the basis of the Kappa Architecture, devised by Jay Kreps (formerly of LinkedIn, now of Confluent, the company commercializing Apache Kafka), as a simpler yet more powerful alternative to the batch-based Lambda Architecture devised by Nathan Marz (formerly of Twitter).

    In the end, the choice is twofold: the level of abstraction (Crunch sits higher than Spark) and whether to operate one batch at a time or bit by bit.