hadoop, filesystems, maildir

Hadoop: a good example of a production implementation?


I hear a lot about Hadoop, but when it comes to defining what it is, I get confused, because the definition differs from source to source.

Is Hadoop something that serves files from a server to a client?

For example: if we implement Hadoop for a Maildir where emails are stored, can Hadoop help in accessing the emails and serving them to clients at super-fast speed? Is this how it can be used?

Can you tell me in simple words what Hadoop is and what its uses are?


Solution

  • Dude, you are mixing things up.

    Hadoop is an open-source software framework for storage and large-scale processing of data-sets on clusters of commodity hardware. Hadoop is an Apache project being built and used by a global community of contributors and users.

    The Apache Hadoop framework is composed of the following modules:

    1. Hadoop Common – contains libraries and utilities needed by other Hadoop modules.

    2. Hadoop Distributed File System (HDFS) – a distributed file-system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster.

    3. Hadoop YARN – a resource-management platform responsible for managing compute resources in clusters and using them for scheduling of users' applications.
    4. Hadoop MapReduce – a programming model for large scale data processing.

    For end users, though MapReduce Java code is common, any programming language can be used with "Hadoop Streaming" to implement the "map" and "reduce" parts of the user's program. Related projects such as Apache Pig, Apache Hive, and Apache Spark expose higher-level user interfaces, for example Pig Latin and HiveQL (a SQL variant). The Hadoop framework itself is mostly written in the Java programming language, with some native code in C and command-line utilities written as shell scripts.
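    To make the "map" and "reduce" parts concrete, here is a sketch of the classic word-count job in the style of Hadoop Streaming scripts, written in Python. In a real Streaming job the mapper and reducer would be separate scripts reading stdin; the `run_job` driver below is only a local stand-in for Hadoop's map, sort/shuffle, and reduce phases, invented here for illustration.

    ```python
    # Word count in the Hadoop Streaming style: the mapper emits
    # (word, 1) pairs and the reducer sums the counts per word.
    # Streaming sorts mapper output by key before the reducer sees
    # it; sorted() below simulates that shuffle phase locally.

    from itertools import groupby

    def mapper(lines):
        """Emit (word, 1) for every word, like a Streaming mapper."""
        for line in lines:
            for word in line.split():
                yield word.lower(), 1

    def reducer(pairs):
        """Sum counts per word; input must be sorted by key."""
        for word, group in groupby(pairs, key=lambda kv: kv[0]):
            yield word, sum(count for _, count in group)

    def run_job(lines):
        """Local stand-in for the map -> shuffle/sort -> reduce pipeline."""
        shuffled = sorted(mapper(lines))  # simulates Hadoop's sort phase
        return dict(reducer(shuffled))

    print(run_job(["the quick brown fox", "the lazy dog"]))
    # {'brown': 1, 'dog': 1, 'fox': 1, 'lazy': 1, 'quick': 1, 'the': 2}
    ```

    The point of the model is that the mapper and reducer never see the whole data set: each works on one record (or one key's group) at a time, which is what lets Hadoop run thousands of them in parallel across a cluster.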

    The Hadoop Distributed File System (HDFS) is a distributed, scalable, and portable file system written in Java for the Hadoop framework. A Hadoop instance typically has a single namenode, and a cluster of datanodes forms the HDFS cluster; not every node in the cluster needs to run a datanode. Each datanode serves up blocks of data over the network using a block protocol specific to HDFS. The file system uses TCP/IP sockets for communication, and clients use remote procedure calls (RPC) to talk to the namenode and datanodes.

    HDFS stores large files (typically in the range of gigabytes to terabytes) across multiple machines. It achieves reliability by replicating the data across multiple hosts, and hence theoretically does not require RAID storage on hosts (though some RAID configurations are still useful for I/O performance). With the default replication value of 3, data is stored on three nodes: two on the same rack, and one on a different rack. Datanodes can talk to each other to rebalance data, to move copies around, and to keep the replication of data high.
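    The rack-aware placement just described can be sketched in a few lines of Python. The rack layout and datanode names below are invented for illustration, and real HDFS placement is decided by the namenode with considerably more policy; this sketch only mimics the "two replicas on one rack, one on another" rule stated above.

    ```python
    import random

    # Hypothetical cluster layout: rack name -> datanodes (names made up).
    RACKS = {
        "rack-A": ["dn1", "dn2", "dn3"],
        "rack-B": ["dn4", "dn5", "dn6"],
    }

    def place_block(racks, replication=3):
        """Mimic the default HDFS policy for replication factor 3:
        two replicas on one rack, the third on a different rack."""
        local_rack, remote_rack = random.sample(list(racks), 2)
        replicas = random.sample(racks[local_rack], 2)   # two on one rack
        replicas.append(random.choice(racks[remote_rack]))  # one elsewhere
        return replicas

    print(place_block(RACKS))  # e.g. ['dn4', 'dn6', 'dn2']
    ```

    Spreading the third replica to a second rack is what protects the data against the loss of an entire rack (a failed switch or power feed), while keeping two replicas rack-local for cheap reads and re-replication.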

    The HDFS file system is not restricted to MapReduce jobs. It can be used by other applications, including the HBase database, the Apache Mahout machine learning system, and the Apache Hive data warehouse. Hadoop can in theory be used for any sort of work that is batch-oriented rather than real-time, is very data-intensive, and can work on pieces of the data in parallel.

    Commercial applications of Hadoop include:

    1. Log and/or clickstream analysis of various kinds
    2. Marketing analytics
    3. Machine learning and/or sophisticated data mining
    4. Image processing
    5. Processing of XML messages
    6. Web crawling and/or text processing
    7. General archiving, including of relational/tabular data, e.g. for compliance

    You can refer to YDN (the Yahoo! Developer Network) for a good start in understanding the Hadoop framework.