I am running HBase on an EMR cluster (emr-5.7.0) with HBase storage on S3. We are using the 'ImportTsv' and 'CompleteBulkLoad' utilities to import data into HBase. During this process we have intermittently observed failures reporting HFile corruption for some of the imported files. It happens sporadically and there is no pattern we could deduce for the errors.
After a lot of research and going through many suggestions in blogs, I have tried the fixes below, but to no avail, and we are still facing the issue.
Tech Stack :
AWS EMR Cluster (emr-5.7.0 | r3.8xlarge | 15 nodes)
AWS S3
HBase 1.3.1
Data Volume:
- ~960,000 lines (to be upserted) | ~7 GB TSV file
Commands used in sequence:
1) hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.separator="|" -Dimporttsv.columns="<Column Names (472 Columns)>" -Dimporttsv.bulk.output="<HFiles Path on HDFS>" <Table Name> <TSV file path on HDFS>
2) hadoop fs -chmod 777 <HFiles Path on HDFS>
3) hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles <HFiles Path on HDFS> <Table Name>
Fixes Tried:
Increasing S3 Max Connections:
- We increased the property below, but it did not seem to resolve the issue.
fs.s3.maxConnections : values tried -- 10000, 20000, 50000, 100000
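For reference, this is roughly what the setting looks like as a Hadoop-style configuration property (on EMR it is typically supplied via the emrfs-site configuration; the value shown is just one of the values we tried, not a recommendation):

<property>
  <name>fs.s3.maxConnections</name>
  <!-- illustrative value; 20000, 50000 and 100000 were also tried with no improvement -->
  <value>10000</value>
</property>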
HBase Repair:
- Another approach was to execute the HBase repair command, but it didn't seem to help either.
Command : hbase hbck -repair
Error Trace is as below:
[LoadIncrementalHFiles-17] mapreduce.LoadIncrementalHFiles: Received a CorruptHFileException from region server: row '00218333246' on table 'WB_MASTER' at region=WB_MASTER,00218333246,1506304894610.f108f470c00356217d63396aa11cf0bc., hostname=ip-10-244-8-74.ec2.internal,16020,1507907710216, seqNum=198
org.apache.hadoop.hbase.io.hfile.CorruptHFileException: org.apache.hadoop.hbase.io.hfile.CorruptHFileException: Problem reading HFile Trailer from file s3://wbpoc-landingzone/emrfs_test/wb_hbase_compressed/data/default/WB_MASTER/f108f470c00356217d63396aa11cf0bc/cf/2a9ecdc5c3aa4ad8aca535f56c35a32d_SeqId_200_
    at org.apache.hadoop.hbase.io.hfile.HFile.pickReaderVersion(HFile.java:497)
    at org.apache.hadoop.hbase.io.hfile.HFile.createReader(HFile.java:525)
    at org.apache.hadoop.hbase.regionserver.StoreFile$Reader.<init>(StoreFile.java:1170)
    at org.apache.hadoop.hbase.regionserver.StoreFileInfo.open(StoreFileInfo.java:259)
    at org.apache.hadoop.hbase.regionserver.StoreFile.open(StoreFile.java:427)
    at org.apache.hadoop.hbase.regionserver.StoreFile.createReader(StoreFile.java:528)
    at org.apache.hadoop.hbase.regionserver.StoreFile.createReader(StoreFile.java:518)
    at org.apache.hadoop.hbase.regionserver.HStore.createStoreFileAndReader(HStore.java:667)
    at org.apache.hadoop.hbase.regionserver.HStore.createStoreFileAndReader(HStore.java:659)
    at org.apache.hadoop.hbase.regionserver.HStore.bulkLoadHFile(HStore.java:799)
    at org.apache.hadoop.hbase.regionserver.HRegion.bulkLoadHFiles(HRegion.java:5574)
    at org.apache.hadoop.hbase.regionserver.RSRpcServices.bulkLoadHFile(RSRpcServices.java:2034)
    at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:34952)
    at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2339)
    at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:123)
    at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:188)
    at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:168)
Caused by: java.io.FileNotFoundException: File not present on S3
    at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem$NativeS3FsInputStream.read(S3NativeFileSystem.java:203)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
    at java.io.DataInputStream.readFully(DataInputStream.java:195)
    at org.apache.hadoop.hbase.io.hfile.FixedFileTrailer.readFromStream(FixedFileTrailer.java:391)
    at org.apache.hadoop.hbase.io.hfile.HFile.pickReaderVersion(HFile.java:482)
Any suggestions on figuring out the root cause of this issue would be really helpful.
Appreciate your help! Thank you!
After much research and trial and error, I was finally able to find a resolution for this issue, thanks to the AWS support folks. It seems the issue is a result of S3's eventual consistency. The AWS team suggested using the property below and it worked like a charm: so far we haven't hit the HFile corruption issue again. Hope this helps if someone is facing the same issue!
Property (hbase-site.xml): hbase.bulkload.retries.retryOnIOException : true
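For completeness, a minimal sketch of the corresponding hbase-site.xml entry (standard Hadoop configuration XML; the property name and value are exactly the ones suggested by AWS support):

<property>
  <name>hbase.bulkload.retries.retryOnIOException</name>
  <!-- retry the bulk load on IOException (such as the FileNotFoundException in the trace above) instead of failing immediately -->
  <value>true</value>
</property>

The effect is that the bulk load retries an HFile load when it hits an IOException rather than surfacing it as a corruption failure, which is presumably why it rides out S3's eventual-consistency window.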