I'm new to Dataproc and I'm trying to submit a Pig job to Google Dataproc via gcloud:
gcloud config set project PROJECT
gcloud dataproc jobs submit pig --cluster=cluster-workaround --region=us-east4 --verbosity=debug --properties-file=gs://bucket/cvr_gcs_one.properties --file=gs://bucket-temp/intellibid-intermediat-cvr.pig
with the below properties file
jarLocation=gs://bucket-data-science/emr/jars/pig.jar
pigScriptLocation=gs://bucket-data-science/emr/pigs
logLocation=gs://bucket-data-science/prod/logs
udf_path=gs://bucket-data-science/emr/jars/udfs.jar
csv_dir=gs://bucket-db-dump/prod
currdate=2022-12-13
train_cvr=gs://bucket-temp/{2022-12-09}
output_dir=gs://analytics-bucket/outoout
and below is a sample of the Pig script that is uploaded to GCS:
register $udf_path;
SET default_parallel 300;
SET pig.exec.mapPartAgg true; -- To remove load on combiner
SET pig.tmpfilecompression TRUE; -- To make compression true between MapReduce jobs, mainly when using joins
SET pig.tmpfilecompression.codec gz; -- To specify the type of compression between MapReduce jobs
SET mapreduce.map.output.compress TRUE; -- To make compression true between map and reduce
SET mapreduce.map.output.compress.codec org.apache.hadoop.io.compress.GzipCodec;
set mapred.map.tasks.speculative.execution false;
SET mapreduce.task.timeout 10800000;
set mapreduce.output.fileoutputformat.compress true;
set mapreduce.output.fileoutputformat.compress.codec org.apache.hadoop.io.compress.GzipCodec;
SET mapreduce.map.maxattempts 16;
SET mapreduce.reduce.maxattempts 16;
SET mapreduce.job.queuename HIGH_PRIORITY;
define GSUM com.java.udfs.common.javaSUM();
define get_cvr_key com.java.udfs.common.ALL_CTR_MODEL('$csv_dir', 'variableList.ini');
define multiple_file_generator com.java.udfs.common.CVR_KEY_GENERATION('$csv_dir','newcampaignToKeyMap');
train_tmp1 = load '$train_cvr/' using PigStorage('\t','-noschema') as (cookie,AdvID,nviews,ls_dst,ls_src,ls_di,ls_ft,ls_np,tos,nsess,e100_views,e200_views,e300_views,e400_views,e100_tos,e200_tos,e300_tos,e400_tos,uniq_prod,most_seen_prod_freq,uniq_cat,uniq_subcat,search_cnt,click_cnt,cart_cnt,HSDO,os,bwsr,dev,hc_c_v,hc_c_tp,hc_c_up,hc_c_ls,hc_s_v,hc_s_tp,hs_s_up,hc_s_ls,hc_clk_pub,hc_clk_cnt,hc_clk_lm,hp_ls_v,hp_ls_c,hp_ls_s,hp_ms_v,hp_ms_c,hp_ms_s,hu_v,hu_c,hu_s,purchase_flag,hp_ls_cvr,hp_ls_crr,hp_ms_cvr,hp_ms_crr,mpv,gc_c_tp,gc_clk_cnt,gc_c_up,gc_clk_lm,gc_c_v,gc_c_ls,gc_s_v,gc_s_lsts,gc_s_tp,gc_s_up,gc_clk_pub,epoch_ms,gc_ac_s,gc_ac_clk,gc_ac_vclk,udays,hc_vclk_cnt,gc_vclk_cnt,e205_view,e205_tos,AdvID_copy,hc_p_ms_p,hc_c_ms_p,most_seen_cat_freq,hc_p_ls_p,currstage,hc_c_city);
I'm getting the below error:
INFO org.apache.pig.impl.util.Utils - Default bootup file /root/.pigbootup not found
ERROR org.apache.pig.impl.PigContext - Undefined parameter : udf_path
2022-12-13 11:58:51,504 [main] ERROR org.apache.pig.Main - ERROR 2997: Encountered IOException.
org.apache.pig.tools.parameters.ParameterSubstitutionException: Undefined parameter : udf_path
I've tried most of the approaches through the console as well, but couldn't find good documentation to go through.
Also, what exactly is the difference between the Query parameters field ("Specify the parameter names and values to insert in place of parameter entries in the query file. The query uses those values when it runs.") and the Properties field ("A list of key-value pairs to configure the job.") in the UI?
Can someone guide me here on what I'm doing wrong and how I can run a Pig script in Dataproc?
The $udf_path (and the other $-prefixed names) in your script are Pig parameters, not configuration properties. --properties-file / --properties only sets job configuration for Pig and Hadoop; those values are never substituted into the query, which is why Pig reports "Undefined parameter : udf_path". Parameters have to be supplied with --params. Pass it like below:
gcloud config set project PROJECT
gcloud dataproc jobs submit pig --cluster=cluster-workaround --region=us-east4 --verbosity=debug --properties-file=gs://bucket/cvr_gcs_one.properties --file=gs://bucket-temp/your_pig.pig --params udf_path=gs://your_udfs.jar
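For your particular script, every parameter it references ($udf_path, $csv_dir, $train_cvr) needs a value, while genuine configuration (queue name, compression, and so on) can stay in the script's SET statements or go through --properties. A sketch of the full command, reusing the paths from your properties file (adjust them to your actual buckets):
gcloud dataproc jobs submit pig \
  --cluster=cluster-workaround \
  --region=us-east4 \
  --file=gs://bucket-temp/intellibid-intermediat-cvr.pig \
  --params="udf_path=gs://bucket-data-science/emr/jars/udfs.jar,csv_dir=gs://bucket-db-dump/prod,train_cvr=gs://bucket-temp/{2022-12-09}" \
  --properties="mapreduce.job.queuename=HIGH_PRIORITY"
That also answers the UI question: the Query parameters field corresponds to --params (values substituted for the $name placeholders in the query file), while the Properties field corresponds to --properties (Hadoop/Pig configuration key-value pairs for the job). A value passed only as a property is never substituted into the script.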