I am trying to load Twitter data into Hadoop using Flume. The agent reports that it has processed nearly 25,000 docs, but when I check HDFS I always find the target folder empty. This is the command I am using:
flume-ng agent -n TwitterAgent -f flume.conf
Here is a snippet of the log output:
21/07/18 19:40:03 INFO twitter.TwitterSource: Processed 25,000 docs
21/07/18 19:40:03 INFO twitter.TwitterSource: Total docs indexed: 25,000, total skipped docs: 0
21/07/18 19:40:03 INFO twitter.TwitterSource: 45 docs/second
21/07/18 19:40:03 INFO twitter.TwitterSource: Run took 545 seconds and processed:
21/07/18 19:40:03 INFO twitter.TwitterSource: 0.012 MB/sec sent to index
21/07/18 19:40:03 INFO twitter.TwitterSource: 6.708 MB text sent to index
21/07/18 19:40:03 INFO twitter.TwitterSource: There were 0 exceptions ignored:
21/07/18 19:40:05 INFO twitter.TwitterSource: Processed 25,100 docs
21/07/18 19:40:06 INFO hdfs.BucketWriter: Creating /home/hadoopusr/flumetweets/FlumeData.1626629459197.tmp
21/07/18 19:40:06 WARN hdfs.HDFSEventSink: HDFS IO error
org.apache.hadoop.fs.ParentNotDirectoryException: /home (is not a directory)
    at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkIsDirectory(FSPermissionChecker.java:538)
    at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkTraverse(FSPermissionChecker.java:278)
    at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:206)
    at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:189)
    at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkTraverse(FSPermissionChecker.java:507)
    at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkTraverse(FSDirectory.java:1612)
    at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkTraverse(FSDirectory.java:1630)
    at org.apache.hadoop.hdfs.server.namenode.FSDirectory.resolvePath(FSDirectory.java:551)
    at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.resolvePathForStartFile(FSDirWriteFileOp.java:291)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2282)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2225)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:728)
This is my flume.conf file:
#Naming the components on the current agent.
TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS
#Describing/Configuring the source
TwitterAgent.sources.Twitter.type = org.apache.flume.source.twitter.TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sources.Twitter.consumerKey = ************
TwitterAgent.sources.Twitter.consumerSecret = ************
TwitterAgent.sources.Twitter.accessToken = ************
TwitterAgent.sources.Twitter.accessTokenSecret = ************
TwitterAgent.sources.Twitter.keywords = covid,covid-19,coronavirus
#Describing/Configuring the sink
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = /home/hadoopusr/flumetweets
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
TwitterAgent.sinks.HDFS.hdfs.batchSize = 10
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
TwitterAgent.sinks.HDFS.hdfs.rollInterval = 600
TwitterAgent.sinks.HDFS.hdfs.rollCount = 100
#Describing/Configuring the channel
TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 1000
TwitterAgent.channels.MemChannel.transactionCapacity = 1000
#Binding the source and sink to the channel
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sinks.HDFS.channel = MemChannel
As noted in the comments, you fixed your first error; now you get an error when writing to the HDFS path as the user amel.

In your config you have:
TwitterAgent.sinks.HDFS.hdfs.path = /home/hadoopusr/flumetweets
But I'm guessing that either /home exists in HDFS as a plain file rather than a directory (which is exactly what ParentNotDirectoryException: /home (is not a directory) means), or /home/hadoopusr does not exist, so Flume is trying to create that directory for you. However, your user is not hadoopusr (your HDFS superuser), so you do not have the permissions to do so.
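A quick way to confirm which case you are in is to list the parent paths. These are standard hdfs dfs commands; that hadoopusr is the superuser account is an assumption carried over from above:

# List the HDFS root; a 'd' in the first column means /home is a directory, a '-' means it is a file
sudo su hadoopusr -c "hdfs dfs -ls /"
# If /home is a directory, check whether /home/hadoopusr exists and who owns it
sudo su hadoopusr -c "hdfs dfs -ls /home"

If /home shows up as a file, it will have to be removed or renamed before any directory can be created under it.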
Your options therefore are either:

1. Run flume-ng agent as the hadoopusr user, e.g. sudo su hadoopusr -c "flume-ng agent ..."
2. Change the sink's hdfs.path to a directory your own user can write to, such as /home/amel, after you create that path and give yourself permissions on it:

sudo su hadoopusr
hadoop fs -mkdir -p /home/amel
hadoop fs -chown -R amel /home/amel
hadoop fs -chmod -R 760 /home/amel
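If you go with the second option, remember to point the sink at the new location in flume.conf; a minimal sketch, where the flumetweets subdirectory name is just carried over from your current config:

TwitterAgent.sinks.HDFS.hdfs.path = /home/amel/flumetweets

After restarting the agent, hdfs dfs -ls /home/amel/flumetweets should start showing FlumeData files once the first roll happens. hdfs.path also accepts a full URI such as hdfs://namenode:8020/home/amel/flumetweets if you need to address a specific namenode (the hostname and port here are placeholders). As an aside, per-user directories in HDFS conventionally live under /user rather than /home, but the path above matches your existing layout.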