I wrote a simple Pig program to analyze a small, modified version of the Google n-grams dataset on AWS. The data looks something like this:
I am 1936 942 90
I am 1945 811 5
I am 1951 47 12
very cool 1923 118 10
very cool 1980 320 100
very cool 2012 994 302
very cool 2017 1820 612
and has the form:
n-gram TAB year TAB occurrences TAB books NEWLINE
I wrote the following program to calculate the occurrences of an n-gram per book:
inp = LOAD <insert input path here> AS (ngram:chararray, year:int, occurences:int, books:int);
filter_input = FILTER inp BY (occurences >= 400) AND (books >= 8);
groupinp = GROUP filter_input BY ngram;
sum_occ = FOREACH groupinp GENERATE FLATTEN(group) AS firstcol, SUM(occurences) AS socc , SUM(books) AS nbooks;
DUMP sum_occ;
However, the DUMP command does not work and gives the following error:
892520 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: GROUP_BY,FILTER
18/03/28 00:56:09 INFO pigstats.ScriptState: Pig features used in the script: GROUP_BY,FILTER
1892554 [main] INFO org.apache.pig.data.SchemaTupleBackend - Key [pig.schematuple] was not set... will not generate code.
18/03/28 00:56:09 INFO data.SchemaTupleBackend: Key [pig.schematuple] was not set... will not generate code.
1892555 [main] INFO org.apache.pig.newplan.logical.optimizer.LogicalPlanOptimizer - {RULES_ENABLED=[ConstantCalculator, LoadTypeCastInserter, PredicatePushdownOptimizer, StreamTypeCastInserter], RULES_DISABLED=[AddForEach, ColumnMapKeyPrune, GroupByConstParallelSetter, LimitOptimizer, MergeFilter, MergeForEach, NestedLimitOptimizer, PartitionFilterOptimizer, PushDownForEachFlatten, PushUpFilter, SplitFilter]}
18/03/28 00:56:09 INFO optimizer.LogicalPlanOptimizer: {RULES_ENABLED=[ConstantCalculator, LoadTypeCastInserter, PredicatePushdownOptimizer, StreamTypeCastInserter], RULES_DISABLED=[AddForEach, ColumnMapKeyPrune, GroupByConstParallelSetter, LimitOptimizer, MergeFilter, MergeForEach, NestedLimitOptimizer, PartitionFilterOptimizer, PushDownForEachFlatten, PushUpFilter, SplitFilter]}
1892591 [main] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezLauncher - Tez staging directory is /tmp/temp383666093 and resources directory is /tmp/temp383666093
18/03/28 00:56:09 INFO tez.TezLauncher: Tez staging directory is /tmp/temp383666093 and resources directory is /tmp/temp383666093
1892592 [main] INFO org.apache.pig.backend.hadoop.executionengine.tez.plan.TezCompiler - File concatenation threshold: 100 optimistic? false
18/03/28 00:56:09 INFO plan.TezCompiler: File concatenation threshold: 100 optimistic? false
1892593 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.AccumulatorOptimizerUtil - Reducer is to run in accumulative mode.
18/03/28 00:56:09 INFO util.AccumulatorOptimizerUtil: Reducer is to run in accumulative mode.
1892606 [main] INFO org.apache.pig.builtin.PigStorage - Using PigTextInputFormat
18/03/28 00:56:09 INFO builtin.PigStorage: Using PigTextInputFormat
18/03/28 00:56:09 INFO input.FileInputFormat: Total input files to process : 1
1892626 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
18/03/28 00:56:09 INFO util.MapRedUtil: Total input paths to process : 1
1892627 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 1
18/03/28 00:56:09 INFO util.MapRedUtil: Total input paths (combined) to process : 1
18/03/28 00:56:09 INFO hadoop.MRInputHelpers: NumSplits: 1, SerializedSize: 408
1892653 [main] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJobCompiler - Local resource: joda-time-2.9.4.jar
18/03/28 00:56:09 INFO tez.TezJobCompiler: Local resource: joda-time-2.9.4.jar
1892653 [main] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJobCompiler - Local resource: pig-0.17.0-core-h2.jar
18/03/28 00:56:09 INFO tez.TezJobCompiler: Local resource: pig-0.17.0-core-h2.jar
1892653 [main] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJobCompiler - Local resource: antlr-runtime-3.4.jar
18/03/28 00:56:09 INFO tez.TezJobCompiler: Local resource: antlr-runtime-3.4.jar
1892653 [main] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJobCompiler - Local resource: automaton-1.11-8.jar
18/03/28 00:56:09 INFO tez.TezJobCompiler: Local resource: automaton-1.11-8.jar
1892709 [main] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezDagBuilder - For vertex - scope-239: parallelism=1, memory=1536, java opts=-Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Xmx1229m -Dlog4j.configuratorClass=org.apache.tez.common.TezLog4jConfigurator -Dlog4j.configuration=tez-container-log4j.properties -Dyarn.app.container.log.dir=<LOG_DIR> -Dtez.root.logger=INFO,CLA
18/03/28 00:56:09 INFO tez.TezDagBuilder: For vertex - scope-239: parallelism=1, memory=1536, java opts=-Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Xmx1229m -Dlog4j.configuratorClass=org.apache.tez.common.TezLog4jConfigurator -Dlog4j.configuration=tez-container-log4j.properties -Dyarn.app.container.log.dir=<LOG_DIR> -Dtez.root.logger=INFO,CLA
1892709 [main] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezDagBuilder - Processing aliases: filter_input,groupinp,inp
18/03/28 00:56:09 INFO tez.TezDagBuilder: Processing aliases: filter_input,groupinp,inp
1892709 [main] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezDagBuilder - Detailed locations: inp[1,6],inp[-1,-1],filter_input[2,15],groupinp[3,11]
18/03/28 00:56:09 INFO tez.TezDagBuilder: Detailed locations: inp[1,6],inp[-1,-1],filter_input[2,15],groupinp[3,11]
1892709 [main] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezDagBuilder - Pig features in the vertex:
18/03/28 00:56:09 INFO tez.TezDagBuilder: Pig features in the vertex:
1892744 [main] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezDagBuilder - Set auto parallelism for vertex scope-240
18/03/28 00:56:09 INFO tez.TezDagBuilder: Set auto parallelism for vertex scope-240
1892744 [main] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezDagBuilder - For vertex - scope-240: parallelism=1, memory=3072, java opts=-Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Xmx2458m -Dlog4j.configuratorClass=org.apache.tez.common.TezLog4jConfigurator -Dlog4j.configuration=tez-container-log4j.properties -Dyarn.app.container.log.dir=<LOG_DIR> -Dtez.root.logger=INFO,CLA
18/03/28 00:56:09 INFO tez.TezDagBuilder: For vertex - scope-240: parallelism=1, memory=3072, java opts=-Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Xmx2458m -Dlog4j.configuratorClass=org.apache.tez.common.TezLog4jConfigurator -Dlog4j.configuration=tez-container-log4j.properties -Dyarn.app.container.log.dir=<LOG_DIR> -Dtez.root.logger=INFO,CLA
1892744 [main] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezDagBuilder - Processing aliases: sum_occ
18/03/28 00:56:09 INFO tez.TezDagBuilder: Processing aliases: sum_occ
1892744 [main] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezDagBuilder - Detailed locations: sum_occ[5,10]
18/03/28 00:56:09 INFO tez.TezDagBuilder: Detailed locations: sum_occ[5,10]
1892745 [main] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezDagBuilder - Pig features in the vertex: GROUP_BY
18/03/28 00:56:09 INFO tez.TezDagBuilder: Pig features in the vertex: GROUP_BY
1892762 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2017: Internal error creating job configuration.
18/03/28 00:56:09 ERROR grunt.Grunt: ERROR 2017: Internal error creating job configuration.
Details at logfile: /mnt/var/log/pig/pig_1522196676602.log
How do I fix this?
If you are using an old version of Pig, try updating it first; that may resolve ERROR 2017.
Pig scripts are evaluated lazily: nothing actually runs until a DUMP or STORE statement, so that is the point at which you find out what is wrong with your code.
Once the job can be created, running your code will throw another error:
ERROR 1025: Invalid field projection. Projected field [occurences] does not exist in schema: group:chararray,filter_input:bag{:tuple(ngram:chararray,year:int,occurences:int,books:int)}.
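The schema quoted in that message can be checked directly with DESCRIBE, which prints the schema of an alias without launching a job. A minimal sketch, assuming the aliases inp, filter_input and groupinp from the question are already defined in the same Grunt session:

```pig
-- Print the schema of the grouped relation; no job is launched for this.
DESCRIBE groupinp;
-- The schema should match the one in the error message:
-- a 'group' field plus a bag named 'filter_input' that holds the original tuples.
-- There is no top-level field called 'occurences' or 'books' at this point.
```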
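To fix this error, change the line
sum_occ = FOREACH groupinp GENERATE FLATTEN(group) AS firstcol, SUM(occurences) AS socc , SUM(books) AS nbooks;
to
sum_occ = FOREACH groupinp GENERATE FLATTEN(group) AS firstcol, SUM(filter_input.occurences) AS socc, SUM(filter_input.books) AS nbooks;
After the GROUP, each record has only two fields: group and a bag named filter_input that holds the original tuples. The columns therefore have to be referenced through the bag, as filter_input.occurences and filter_input.books.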
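Putting the pieces together, the whole script might look like the sketch below. The input path '/path/to/ngrams.txt' is a hypothetical placeholder, and the file is assumed to be tab-separated as described in the question. The field name occurences keeps the question's spelling so it matches the rest of the script.

```pig
-- Hypothetical input path; replace it with the real location of the data.
-- PigStorage splits on tabs by default, which matches the described format.
inp = LOAD '/path/to/ngrams.txt' USING PigStorage('\t')
      AS (ngram:chararray, year:int, occurences:int, books:int);

-- Keep only rows with at least 400 occurrences and at least 8 books.
filter_input = FILTER inp BY (occurences >= 400) AND (books >= 8);

-- One record per n-gram: (group, filter_input: bag of the matching rows).
groupinp = GROUP filter_input BY ngram;

-- Columns are projected out of the bag before being summed.
-- With a single grouping key, 'group' is already a scalar, so FLATTEN is left out here.
sum_occ = FOREACH groupinp GENERATE
              group AS firstcol,
              SUM(filter_input.occurences) AS socc,
              SUM(filter_input.books) AS nbooks;

DUMP sum_occ;
```

On the seven sample rows from the question, only three rows pass the filter, so the output should be (I am,942,90) and (very cool,2814,914), in no guaranteed order.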