I have a Hadoop+Hive+Tez setup built from scratch (meaning I deployed it component by component). Hive is configured to use Tez as its execution engine.
In its current state, Hive can access tables on HDFS, but it cannot access tables stored on MinIO (using the s3a filesystem implementation).
As the screenshot shows, when executing SELECT COUNT(*) FROM s3_table, Map 1 is always in the INITIALIZING state. Map 1 also always has a total count of -1 and a pending count of -1. (Why -1?)

Things already checked:
hdfs dfs -ls s3a://bucketname works well.

What could be the possible causes for this problem?

Version information:
It turned out that Tez's S3 support must be enabled explicitly at compile time. For Hadoop 2.8+, enabling S3 support means compiling Tez from source with the following command:
mvn clean package -DskipTests=true -Dmaven.javadoc.skip=true -Paws -Phadoop28 -P\!hadoop27
After that, upload the generated tez-x.y.z.tar.gz to HDFS and extract tez-x.y.z-minimal.tar.gz to $TEZ_LIB_DIR. With that in place it worked for me: Hive queries against MinIO/S3 run smoothly.
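Before uploading, it can be worth confirming that the rebuilt tarball actually picked up the S3 connector. The following is a minimal sketch, assuming the aws profile bundles a hadoop-aws jar into the tarball; the function name tez_tarball_has_aws is made up for illustration.

```shell
# Hypothetical helper: succeeds if the given Tez tarball lists a hadoop-aws jar.
# Assumption: building with -Paws places hadoop-aws-<version>.jar in the archive.
tez_tarball_has_aws() {
  # $1: path to a Tez tarball, e.g. tez-dist/target/tez-x.y.z.tar.gz
  tar -tzf "$1" | grep -Eq '(^|/)hadoop-aws-[^/]*\.jar$'
}
```

Usage would be something like: tez_tarball_has_aws tez-dist/target/tez-x.y.z.tar.gz && echo "S3 support bundled". A stock binary release would be expected to fail this check, which matches the symptom described in the question.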
However, the Tez installation guide does not mention anything about enabling S3 support, and the default Tez binary releases are not built with S3 or Azure support either.
The (hopefully) complete build options and pitfalls are actually documented in BUILDING.txt, where it says:
However, to build against hadoop versions higher than 2.7.0, you will need to do the following:
For Hadoop version X where X >= 2.8.0
$ mvn package -Dhadoop.version=${X} -Phadoop28 -P\!hadoop27
For recent versions of Hadoop (which do not bundle aws and azure by default), you can bundle AWS-S3 (2.7.0+) or Azure (2.7.0+) support:
$ mvn package -Dhadoop.version=${X} -Paws -Pazure
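Putting the quoted options together with the command I used, the Maven arguments for a given Hadoop version can be sketched as below. This is only a composition of the flags already shown; the helper name tez_mvn_args and the version 2.8.5 in the usage line are illustrative assumptions, not something taken from BUILDING.txt.

```shell
# Hypothetical helper: prints the Maven arguments for a Tez build with S3 support.
# The flags are the ones quoted above; only the Hadoop version varies.
tez_mvn_args() {
  # $1: Hadoop version to build against (must be >= 2.8.0 for -Phadoop28)
  hadoop_version="$1"
  # Single quotes keep "!" literal, so no backslash escape is needed here.
  printf '%s -Dhadoop.version=%s %s\n' \
    'clean package -DskipTests=true -Dmaven.javadoc.skip=true' \
    "$hadoop_version" \
    '-Paws -Phadoop28 -P!hadoop27'
}
```

It would be used as mvn $(tez_mvn_args 2.8.5) from the Tez source root. Add -Pazure to the last argument group if Azure support is needed as well.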