
Apache Flume 1.5 not giving expected results in Hadoop 2/Automatic fail-over cluster configuration


I have configured an Apache Hadoop 2 cluster in HA/automatic fail-over configuration on CentOS 6.5 (64-bit). I have installed Flume 1.5 (apache-flume-1.5.0-bin.tar.gz). I want to analyse Twitter data using Flume and Hive with keyword filtering (see image below). Here are the contents of the Hadoop 2 configuration files (important properties only).

core-site.xml

<property>
<name>fs.defaultFS</name>
<value>hdfs://mycluster</value>
</property>

hdfs-site.xml

<property><name>dfs.ha.namenodes.mycluster</name><value>nn1,nn2</value><final>true</final></property>
<property><name>dfs.namenode.rpc-address.mycluster.nn1</name><value>nn1.mycluster1.com:9000</value></property>
<property><name>dfs.namenode.rpc-address.mycluster.nn2</name><value>nn2.mycluster1.com:9000</value></property>
<property><name>dfs.namenode.http-address.mycluster.nn1</name><value>nn1.mycluster1.com:50070</value></property>
<property><name>dfs.namenode.http-address.mycluster.nn2</name><value>nn2.mycluster1.com:50070</value></property>

Here are the contents of the Flume configuration files:

flume-env.sh

JAVA_HOME=/usr/java/jdk1.7.0_60
JAVA_OPTS="-Xms100m -Xmx200m -Dcom.sun.management.jmxremote"

twitter.conf

# Name the components on this agent
TwitterAgent.sources = Twitter
TwitterAgent.sinks = HDFS
TwitterAgent.channels = MemChannel

# Describe/configure the source
TwitterAgent.sources.Twitter.type = org.apache.flume.source.twitter.TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sources.Twitter.consumerKey = **************
TwitterAgent.sources.Twitter.consumerSecret = **********
TwitterAgent.sources.Twitter.accessToken = **************
TwitterAgent.sources.Twitter.accessTokenSecret = **************

TwitterAgent.sources.Twitter.maxBatchSize = 1000
TwitterAgent.sources.Twitter.maxBatchDurationMillis = 1000

TwitterAgent.sources.Twitter.keywords=hadoop, big data, analytics, bigdata, cloudera, data science, mapreduce, mahout, nosql

TwitterAgent.sources.Twitter.bind = localhost
TwitterAgent.sources.Twitter.port = 44444

# Describe the sink
TwitterAgent.sinks.HDFS.type = logger
TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.hdfs.path=/user/flume/tweets/20140814/1_55
TwitterAgent.sinks.HDFS.fileType = DataStream
TwitterAgent.sinks.HDFS.writeFormat = Text
TwitterAgent.sinks.HDFS.batchSize = 100
TwitterAgent.sinks.HDFS.rollSize = 0
TwitterAgent.sinks.HDFS.rollCount = 100
TwitterAgent.sinks.HDFS.rollInterval = 100

# Use a channel which buffers events in memory
TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 1000
TwitterAgent.channels.MemChannel.transactionCapacity = 100

I am executing the following command.

flume-ng agent --conf conf --conf-file conf/twitter.conf --name TwitterAgent -Dflume.root.logger=INFO,console

I have the following questions/problems.

The agent keeps showing the following output, and nothing is written to HDFS.

14/08/14 03:58:14 INFO twitter.TwitterSource: Processed 45,000 docs
14/08/14 03:58:14 INFO twitter.TwitterSource: Total docs indexed: 45,000, total skipped docs: 0
14/08/14 03:58:14 INFO twitter.TwitterSource:     53 docs/second
14/08/14 03:58:14 INFO twitter.TwitterSource: Run took 846 seconds and processed:
14/08/14 03:58:14 INFO twitter.TwitterSource:     0.013 MB/sec sent to index
14/08/14 03:58:14 INFO twitter.TwitterSource:     11.111 MB text sent to index
14/08/14 03:58:14 INFO twitter.TwitterSource: There were 0 exceptions ignored:
14/08/14 03:58:14 INFO sink.LoggerSink: Event: { headers:{} body: 4F 62 6A 01 02 16 61 76 72 6F 2E 73 63 68 65 6D Obj...avro.schem }
14/08/14 03:58:15 INFO sink.LoggerSink: Event: { headers:{} body: 4F 62 6A 01 02 16 61 76 72 6F 2E 73 63 68 65 6D Obj...avro.schem }
14/08/14 03:58:16 INFO twitter.TwitterSource: Processed 45,100 docs
14/08/14 03:58:16 INFO sink.LoggerSink: Event: { headers:{} body: 4F 62 6A 01 02 16 61 76 72 6F 2E 73 63 68 65 6D Obj...avro.schem }
14/08/14 03:58:17 INFO sink.LoggerSink: Event: { headers:{} body: 4F 62 6A 01 02 16 61 76 72 6F 2E 73 63 68 65 6D Obj...avro.schem }
14/08/14 03:58:18 INFO sink.LoggerSink: Event: { headers:{} body: 4F 62 6A 01 02 16 61 76 72 6F 2E 73 63 68 65 6D Obj...avro.schem }
14/08/14 03:58:18 INFO twitter.TwitterSource: Processed 45,200 docs
14/08/14 03:58:19 INFO sink.LoggerSink: Event: { headers:{} body: 4F 62 6A 01 02 16 61 76 72 6F 2E 73 63 68 65 6D Obj...avro.schem }
14/08/14 03:58:20 INFO twitter.TwitterSource: Processed 45,300 docs
14/08/14 03:58:20 INFO sink.LoggerSink: Event: { headers:{} body: 4F 62 6A 01 02 16 61 76 72 6F 2E 73 63 68 65 6D Obj...avro.schem }
14/08/14 03:58:21 INFO sink.LoggerSink: Event: { headers:{} body: 4F 62 6A 01 02 16 61 76 72 6F 2E 73 63 68 65 6D Obj...avro.schem }
14/08/14 03:58:22 INFO sink.LoggerSink: Event: { headers:{} body: 4F 62 6A 01 02 16 61 76 72 6F 2E 73 63 68 65 6D Obj...avro.schem }
14/08/14 03:58:22 INFO twitter.TwitterSource: Processed 45,400 docs
14/08/14 03:58:23 INFO sink.LoggerSink: Event: { headers:{} body: 4F 62 6A 01 02 16 61 76 72 6F 2E 73 63 68 65 6D Obj...avro.schem }
14/08/14 03:58:24 INFO sink.LoggerSink: Event: { headers:{} body: 4F 62 6A 01 02 16 61 76 72 6F 2E 73 63 68 65 6D Obj...avro.schem }
14/08/14 03:58:24 INFO twitter.TwitterSource: Processed 45,500 docs
14/08/14 03:58:25 INFO sink.LoggerSink: Event: { headers:{} body: 4F 62 6A 01 02 16 61 76 72 6F 2E 73 63 68 65 6D Obj...avro.schem }
14/08/14 03:58:26 INFO sink.LoggerSink: Event: { headers:{} body: 4F 62 6A 01 02 16 61 76 72 6F 2E 73 63 68 65 6D Obj...avro.schem }
14/08/14 03:58:26 INFO twitter.TwitterSource: Processed 45,600 docs
14/08/14 03:58:27 INFO sink.LoggerSink: Event: { headers:{} body: 4F 62 6A 01 02 16 61 76 72 6F 2E 73 63 68 65 6D Obj...avro.schem }
14/08/14 03:58:28 INFO sink.LoggerSink: Event: { headers:{} body: 4F 62 6A 01 02 16 61 76 72 6F 2E 73 63 68 65 6D Obj...avro.schem }
14/08/14 03:58:28 INFO twitter.TwitterSource: Processed 45,700 docs
14/08/14 03:58:29 INFO sink.LoggerSink: Event: { headers:{} body: 4F 62 6A 01 02 16 61 76 72 6F 2E 73 63 68 65 6D Obj...avro.schem }
14/08/14 03:58:30 INFO twitter.TwitterSource: Processed 45,800 docs
14/08/14 03:58:30 INFO sink.LoggerSink: Event: { headers:{} body: 4F 62 6A 01 02 16 61 76 72 6F 2E 73 63 68 65 6D Obj...avro.schem }
14/08/14 03:58:31 INFO sink.LoggerSink: Event: { headers:{} body: 4F 62 6A 01 02 16 61 76 72 6F 2E 73 63 68 65 6D Obj...avro.schem }
14/08/14 03:58:32 INFO sink.LoggerSink: Event: { headers:{} body: 4F 62 6A 01 02 16 61 76 72 6F 2E 73 63 68 65 6D Obj...avro.schem }
14/08/14 03:58:32 INFO twitter.TwitterSource: Processed 45,900 docs
14/08/14 03:58:33 INFO sink.LoggerSink: Event: { headers:{} body: 4F 62 6A 01 02 16 61 76 72 6F 2E 73 63 68 65 6D Obj...avro.schem }
14/08/14 03:58:34 INFO sink.LoggerSink: Event: { headers:{} body: 4F 62 6A 01 02 16 61 76 72 6F 2E 73 63 68 65 6D Obj...avro.schem }
14/08/14 03:58:34 INFO twitter.TwitterSource: Processed 46,000 docs
14/08/14 03:58:34 INFO twitter.TwitterSource: Total docs indexed: 46,000, total skipped docs: 0
14/08/14 03:58:34 INFO twitter.TwitterSource:     53 docs/second
14/08/14 03:58:34 INFO twitter.TwitterSource: Run took 867 seconds and processed:
14/08/14 03:58:34 INFO twitter.TwitterSource:     0.013 MB/sec sent to index
14/08/14 03:58:34 INFO twitter.TwitterSource:     11.36 MB text sent to index
14/08/14 03:58:34 INFO twitter.TwitterSource: There were 0 exceptions ignored:

Can anybody please help me find what I am missing?

Should I rebuild Flume with Maven before using it for this task?


Solution

  • There is no need to give read-write access to the Twitter API access token. The way you have used hdfs.path is also correct.

    To fix the main issue (the tweets are not being written to HDFS), make the following changes. The logger sink only prints events to the console, which is the output you are seeing.

    Changes in conf/twitter.conf

    Replace the following line: TwitterAgent.sinks.HDFS.type = logger

    with this line: TwitterAgent.sinks.HDFS.type = hdfs

    Comment out the following line:

    #TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
    

    Use the following (Apache class) instead:

    TwitterAgent.sources.Twitter.type = org.apache.flume.source.twitter.TwitterSource
    

    Changes in conf/flume-env.sh

    Comment out the following (there is no need to set this value):

    #FLUME_CLASSPATH=""
    

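    Set appropriate values for the following HDFS sink attributes. In twitter.conf each one is written with the hdfs. prefix after the sink name, for example TwitterAgent.sinks.HDFS.hdfs.fileType: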

    hdfs.filePrefix         
    hdfs.fileSuffix         
    hdfs.inUsePrefix        
    hdfs.inUseSuffix        
    hdfs.rollInterval       
    hdfs.rollSize           
    hdfs.rollCount          
    hdfs.idleTimeout        
    hdfs.batchSize          
    hdfs.fileType   
    hdfs.maxOpenFiles   
    hdfs.minBlockReplicas   
    hdfs.writeFormat    
    hdfs.callTimeout    
    hdfs.threadsPoolSize    
    hdfs.rollTimerPoolSize  
    hdfs.kerberosPrincipal  
    hdfs.kerberosKeytab 
    hdfs.proxyUser  
    hdfs.round  
    hdfs.roundValue 
    hdfs.roundUnit  
    hdfs.timeZone   
    hdfs.useLocalTimeStamp  
    hdfs.closeTries 
    hdfs.retryInterval  
    

    For more detail, see the Flume User Guide:

    https://flume.apache.org/FlumeUserGuide.html
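
Putting the changes above together, the sink section of conf/twitter.conf might look like the sketch below. The values are carried over from the question rather than recommended settings, and the comment about path resolution is an assumption about how the agent is launched, not something confirmed in the original post.

```properties
# Sketch of the sink section of conf/twitter.conf after the changes above.
# Values are copied from the question; tune them for your own workload.
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.channel = MemChannel

# Assumption: the Hadoop client configuration (core-site.xml, hdfs-site.xml)
# is visible to the Flume agent, so this path resolves against
# fs.defaultFS (hdfs://mycluster) and follows the active NameNode.
TwitterAgent.sinks.HDFS.hdfs.path = /user/flume/tweets/20140814/1_55

# Every HDFS sink attribute carries the hdfs. prefix.
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
# batchSize should not exceed the channel's transactionCapacity (100 here).
TwitterAgent.sinks.HDFS.hdfs.batchSize = 100
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
TwitterAgent.sinks.HDFS.hdfs.rollCount = 100
TwitterAgent.sinks.HDFS.hdfs.rollInterval = 100
```

After restarting the agent with the same flume-ng command, running hdfs dfs -ls /user/flume/tweets/20140814/1_55 should list the output files. Files that are still being written normally carry the in-use suffix (.tmp unless hdfs.inUseSuffix is changed).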