google-cloud-platform · apache-pig · gcloud · google-cloud-dataproc

Error submitting a Pig job to Google Dataproc with a properties file


I'm new to Dataproc and am trying to submit a Pig job to Google Dataproc via gcloud:

   gcloud config set project PROJECT

   gcloud dataproc jobs submit pig \
     --cluster=cluster-workaround \
     --region=us-east4 \
     --verbosity=debug \
     --properties-file=gs://bucket/cvr_gcs_one.properties \
     --file=gs://bucket-temp/intellibid-intermediat-cvr.pig

with the below properties file:

    jarLocation=gs://bucket-data-science/emr/jars/pig.jar
    pigScriptLocation=gs://bucket-data-science/emr/pigs
    logLocation=gs://bucket-data-science/prod/logs
    udf_path=gs://bucket-data-science/emr/jars/udfs.jar
    csv_dir=gs://bucket-db-dump/prod
    currdate=2022-12-13
    train_cvr=gs://bucket-temp/{2022-12-09}
    output_dir=gs://analytics-bucket/outoout

and below is a sample of the Pig script that is uploaded to GCS:

 register $udf_path;

 SET default_parallel 300;
 SET pig.exec.mapPartAgg true; -- To reduce load on the combiner

 SET pig.tmpfilecompression TRUE; -- To enable compression between MapReduce jobs, mainly when using joins
 SET pig.tmpfilecompression.codec gz; -- To specify the type of compression between MapReduce jobs
 SET mapreduce.map.output.compress TRUE; -- To enable compression between map and reduce
 SET mapreduce.map.output.compress.codec org.apache.hadoop.io.compress.GzipCodec;
 set mapred.map.tasks.speculative.execution false;
 SET mapreduce.task.timeout 10800000;
 set mapreduce.output.fileoutputformat.compress true;
 set mapreduce.output.fileoutputformat.compress.codec org.apache.hadoop.io.compress.GzipCodec;
 SET mapreduce.map.maxattempts 16;
 SET mapreduce.reduce.maxattempts 16;
 SET mapreduce.job.queuename HIGH_PRIORITY;

 define GSUM com.java.udfs.common.javaSUM();
 define get_cvr_key com.java.udfs.common.ALL_CTR_MODEL('$csv_dir', 'variableList.ini');
 define multiple_file_generator com.java.udfs.common.CVR_KEY_GENERATION('$csv_dir', 'newcampaignToKeyMap');

  train_tmp1 = load '$train_cvr/' using PigStorage('\t','-noschema') as (cookie,AdvID,nviews,ls_dst,ls_src,ls_di,ls_ft,ls_np,tos,nsess,e100_views,e200_views,e300_views,e400_views,e100_tos,e200_tos,e300_tos,e400_tos,uniq_prod,most_seen_prod_freq,uniq_cat,uniq_subcat,search_cnt,click_cnt,cart_cnt,HSDO,os,bwsr,dev,hc_c_v,hc_c_tp,hc_c_up,hc_c_ls,hc_s_v,hc_s_tp,hs_s_up,hc_s_ls,hc_clk_pub,hc_clk_cnt,hc_clk_lm,hp_ls_v,hp_ls_c,hp_ls_s,hp_ms_v,hp_ms_c,hp_ms_s,hu_v,hu_c,hu_s,purchase_flag,hp_ls_cvr,hp_ls_crr,hp_ms_cvr,hp_ms_crr,mpv,gc_c_tp,gc_clk_cnt,gc_c_up,gc_clk_lm,gc_c_v,gc_c_ls,gc_s_v,gc_s_lsts,gc_s_tp,gc_s_up,gc_clk_pub,epoch_ms,gc_ac_s,gc_ac_clk,gc_ac_vclk,udays,hc_vclk_cnt,gc_vclk_cnt,e205_view,e205_tos,AdvID_copy,hc_p_ms_p,hc_c_ms_p,most_seen_cat_freq,hc_p_ls_p,currstage,hc_c_city);

I am getting the below error:

    INFO  org.apache.pig.impl.util.Utils - Default bootup file /root/.pigbootup not found
    ERROR org.apache.pig.impl.PigContext - Undefined parameter : udf_path
    2022-12-13 11:58:51,504 [main] ERROR org.apache.pig.Main - ERROR 2997: Encountered IOException.
    org.apache.pig.tools.parameters.ParameterSubstitutionException: Undefined parameter : udf_path

I tried most of the methods using the console as well, but couldn't find good documentation to go through.

Also, what exactly is the difference between the Query parameters field ("Specify the parameter names and values to insert in place of parameter entries in the query file. The query uses those values when it runs.") and the Properties field ("A list of key-value pairs to configure the job.") in the UI?

Can someone guide me on what I'm doing wrong and how I can run a Pig script in Dataproc?


Solution

  • Pass the Pig parameter to the job with --params, like below:

      gcloud config set project PROJECT

      gcloud dataproc jobs submit pig \
        --cluster=cluster-workaround \
        --region=us-east4 \
        --verbosity=debug \
        --properties-file=gs://bucket/cvr_gcs_one.properties \
        --file=gs://bucket-temp/your_pig.pig \
        --params udf_path=gs://your_udfs.jar
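
  • The entries in --properties-file become job configuration properties (the same kind of key-value pairs as the SET statements in the script, or the Properties field in the UI); they are not substituted into the script. Every parameter the script references with a $ (here $udf_path, $csv_dir and $train_cvr) has to be supplied through --params, which is what the Query parameters field in the UI maps to. Below is a fuller sketch that reuses the paths from the properties file in the question; treat the values as placeholders and pass only the parameters your script actually references. Alternatively, a %default statement in the script (e.g. %default udf_path 'gs://...') gives a fallback value so the job does not fail when a parameter is not supplied.

      # Sketch only: $-parameters go in --params; Hadoop/Pig configuration keys
      # (like the SET statements in the script) would go in --properties or --properties-file.
      gcloud dataproc jobs submit pig \
        --cluster=cluster-workaround \
        --region=us-east4 \
        --file=gs://bucket-temp/intellibid-intermediat-cvr.pig \
        --params="udf_path=gs://bucket-data-science/emr/jars/udfs.jar,csv_dir=gs://bucket-db-dump/prod,train_cvr=gs://bucket-temp/{2022-12-09},currdate=2022-12-13,output_dir=gs://analytics-bucket/outoout" \
        --properties="mapreduce.job.queuename=HIGH_PRIORITY,mapreduce.task.timeout=10800000"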