apache-spark graph gremlin olap amazon-neptune

Unable to run Gremlin OLAP queries on AWS Neptune

I am using the Gremlin Console on an Amazon EC2 instance to connect to a Neptune DB instance and run Gremlin queries against it.

I have installed and configured Anaconda, Python (3.10.4), Jupyter Notebook (6.4.12), Java (1.8.0_252), Scala (2.12.16), Spark and Hadoop (spark-3.3.0-bin-hadoop3) on the EC2 instance following the instructions here.

I have also installed and activated the spark-gremlin plugin in the Gremlin Console:

gremlin> :install org.apache.tinkerpop spark-gremlin 3.6.0
gremlin> :q 
$ bin/gremlin.sh
          \,,,/
         (o o)
-----oOOo-(3)-oOOo-----
plugin activated: tinkerpop.server
plugin activated: tinkerpop.utilities
plugin activated: tinkerpop.tinkergraph
gremlin> :plugin use tinkerpop.spark
==>tinkerpop.spark activated

I then tried to test SparkGraphComputer by running the queries from the TinkerPop documentation, but I get the error java.io.IOException: No input paths specified in job:

$ bin/gremlin.sh

         \,,,/
         (o o)
-----oOOo-(3)-oOOo-----
plugin activated: tinkerpop.server
plugin activated: tinkerpop.utilities
plugin activated: tinkerpop.spark
[INFO] o.a.t.g.h.j.HadoopGremlinPlugin - HADOOP_GREMLIN_LIBS is set to: /home/ec2-user/apache-tinkerpop-gremlin-console-3.6.0/ext/spark-gremlin/lib
[WARN] o.a.h.u.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
log4j:WARN No appenders could be found for logger (org.apache.htrace.core.Tracer).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
plugin activated: tinkerpop.hadoop
plugin activated: tinkerpop.tinkergraph
gremlin> graph = GraphFactory.open('conf/hadoop/hadoop-gryo.properties')
==>hadoopgraph[gryoinputformat->gryooutputformat]
gremlin> g = traversal().withEmbedded(graph).withComputer(SparkGraphComputer)
==>graphtraversalsource[hadoopgraph[gryoinputformat->gryooutputformat], sparkgraphcomputer]
gremlin> g.V().count()
java.io.IOException: No input paths specified in job
Type ':help' or ':h' for help.
Display stack trace? [yN]
gremlin>

Can anyone please help me resolve this error? Thanks!

Solution

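You have to copy the tinkerpop-modern.kryo file to HDFS. The sample configuration reads its input graph from the path set in gremlin.hadoop.inputLocation, and the error means that no input file was found there. After the copy, hdfs.ls() in the Gremlin Console should show the file: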

gremlin> hdfs.ls()
==>rwxr-xr-x smallette supergroup 0 (D) .sparkStaging
==>rwxr-xr-x smallette supergroup 0 (D) output
==>rw-r--r-- smallette supergroup 781 tinkerpop-modern.kryo
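
One way to do the copy from the shell is sketched below. It assumes the Hadoop hdfs command is on the PATH, that it is run from the Gremlin Console directory, and that the sample graph sits under data/ as in the stock console distribution. copy_sample_graph is a hypothetical helper name, not part of TinkerPop or Hadoop.

```shell
# Hypothetical helper: copy the sample graph into the HDFS home directory.
# Assumes `hdfs` is on the PATH; the default source path is an assumption
# based on the stock Gremlin Console layout.
copy_sample_graph() {
  src="${1:-data/tinkerpop-modern.kryo}"
  # -f overwrites an existing copy; a relative destination resolves
  # against the current user's HDFS home directory.
  hdfs dfs -put -f "$src" tinkerpop-modern.kryo || return 1
  # List the home directory to confirm the file arrived.
  hdfs dfs -ls
}
```

Run copy_sample_graph before starting bin/gremlin.sh. The same copy can be done from inside the console with hdfs.copyFromLocal('data/tinkerpop-modern.kryo', 'tinkerpop-modern.kryo'), which is how the TinkerPop reference documentation does it.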

If you run this on a single machine without a Hadoop cluster, the HDFS client falls back to the local file system. In that case, change the input location in conf/hadoop/hadoop-gryo.properties to the local path of the file:

gremlin.hadoop.inputLocation=conf/tinkerpop-modern.kryo
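
A minimal sketch of making that change from the shell, assuming the property already exists in the file. set_input_location is a hypothetical helper name, and the new location must not contain the characters | or &.

```shell
# Hypothetical helper: point gremlin.hadoop.inputLocation at a new path.
# $1 = properties file, $2 = new input location (no '|' or '&' in it).
set_input_location() {
  # Rewrite only the inputLocation line; sed keeps the original as <file>.bak.
  sed -i.bak "s|^gremlin\.hadoop\.inputLocation=.*|gremlin.hadoop.inputLocation=$2|" "$1"
}
```

For the setup in the question this would be called as set_input_location conf/hadoop/hadoop-gryo.properties conf/tinkerpop-modern.kryo, after which the graph has to be reopened with GraphFactory.open so the new value is read.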