I am new to Oozie and trying to understand dataset.xml. I have the following dataset and want to understand what exactly Oozie is validating here. What is the meaning of initial-instance, and what is uri-template doing here? (The Oozie documentation is not clear to me on this.)
<dataset name="sample" frequency="${coord:hours(1)}" initial-instance="2022-01-10T00:00Z" timezone="UTC">
    <uri-template>${hdfsdir}/filepath/${YEAR}${MONTH}${DAY}${HOUR}</uri-template>
    <done-flag>_SUCCESS</done-flag>
</dataset>
Similarly, in the coordinator I have the following input and output datasets. What is the significance of current(-5) and the start parameter here?
<coordinator-app name="test" frequency="${freq}" start="2022-01-10T00:00Z" end="2023-04-11T00:00Z" timezone="UTC" xmlns="uri:oozie:coordinator:0.4" xmlns:sla="uri:oozie:sla:0.2">
    <data-in name="raw" dataset="raw_data">
        <instance>${coord:current(-5)}</instance>
    </data-in>
    <data-out name="processed" dataset="raw_out">
        <instance>${coord:current(-5)}</instance>
    </data-out>
Can someone explain what Oozie is expecting from the datasets?
Thanks, bab
Without looking at the documentation, here's what I can guess.
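- initial-instance: When is the dataset first available? Oozie does not consider any instance to exist before this timestamp, so a coordinator that asks for an earlier one will not get usable data.
- frequency: "Counts up" from that timestamp. With ${coord:hours(1)}, a new instance is expected every hour starting at 2022-01-10T00:00Z.
- uri-template: Uses built-in Oozie variables (${YEAR}, ${MONTH}, ${DAY}, ${HOUR}) to describe the pattern under which each instance exists in the filesystem.
- coord:current(-5): Steps back 5 times the dataset frequency and returns the 5th previous instance. For an hourly dataset, that gives you the instance 5 hours before the nominal time of the coordinator action (the scheduled time of that run, not the moment the coordinator was submitted).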
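To make the points above concrete, here is the dataset from the question annotated with what I would expect Oozie to resolve. Treat it as a sketch rather than something verified on a cluster. It assumes the data-in points at this hourly dataset (in the question it points at raw_data instead) and that the coordinator action's nominal time is 2022-01-10T08:00Z, a value picked purely for illustration.

```xml
<!-- Instances exist at initial-instance + n * frequency:
     2022-01-10T00:00Z, 01:00Z, 02:00Z, and so on -->
<dataset name="sample" frequency="${coord:hours(1)}"
         initial-instance="2022-01-10T00:00Z" timezone="UTC">
    <!-- One directory per instance; the 03:00Z instance maps to
         ${hdfsdir}/filepath/2022011003 -->
    <uri-template>${hdfsdir}/filepath/${YEAR}${MONTH}${DAY}${HOUR}</uri-template>
    <!-- The instance counts as available once this file exists in that directory -->
    <done-flag>_SUCCESS</done-flag>
</dataset>

<!-- Assumed nominal time of the action: 2022-01-10T08:00Z
     current(0)  resolves to the 08:00Z instance
     current(-5) resolves to the 03:00Z instance,
                 i.e. ${hdfsdir}/filepath/2022011003 -->
<data-in name="raw" dataset="sample">
    <instance>${coord:current(-5)}</instance>
</data-in>
```

Note that for the very first action (nominal time 2022-01-10T00:00Z), current(-5) would point at 2022-01-09T19:00Z, which is before initial-instance, so there is no valid instance to resolve.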
So, for your example, you have dataset name="sample" defined, but your data-in and data-out tags reference raw_data and raw_out instead, so I don't think anything will run...
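For comparison, here is a minimal sketch of a coordinator in which every dataset="..." reference has a matching definition. The raw_data and raw_out definitions, the outpath directory, the ${workflowAppPath} variable, and the inputDir/outputDir property names are all hypothetical, since the question does not show them.

```xml
<coordinator-app name="test" frequency="${freq}"
                 start="2022-01-10T00:00Z" end="2023-04-11T00:00Z"
                 timezone="UTC" xmlns="uri:oozie:coordinator:0.4">
    <datasets>
        <!-- Hypothetical definitions; the names must match the
             dataset="..." attributes used further down -->
        <dataset name="raw_data" frequency="${coord:hours(1)}"
                 initial-instance="2022-01-10T00:00Z" timezone="UTC">
            <uri-template>${hdfsdir}/filepath/${YEAR}${MONTH}${DAY}${HOUR}</uri-template>
            <done-flag>_SUCCESS</done-flag>
        </dataset>
        <dataset name="raw_out" frequency="${coord:hours(1)}"
                 initial-instance="2022-01-10T00:00Z" timezone="UTC">
            <uri-template>${hdfsdir}/outpath/${YEAR}${MONTH}${DAY}${HOUR}</uri-template>
        </dataset>
    </datasets>
    <input-events>
        <!-- The action waits until this instance of raw_data is available -->
        <data-in name="raw" dataset="raw_data">
            <instance>${coord:current(-5)}</instance>
        </data-in>
    </input-events>
    <output-events>
        <data-out name="processed" dataset="raw_out">
            <instance>${coord:current(-5)}</instance>
        </data-out>
    </output-events>
    <action>
        <workflow>
            <!-- Hypothetical variable pointing at the workflow application -->
            <app-path>${workflowAppPath}</app-path>
            <configuration>
                <!-- Hand the resolved URIs to the workflow -->
                <property>
                    <name>inputDir</name>
                    <value>${coord:dataIn('raw')}</value>
                </property>
                <property>
                    <name>outputDir</name>
                    <value>${coord:dataOut('processed')}</value>
                </property>
            </configuration>
        </workflow>
    </action>
</coordinator-app>
```

If the dataset definitions live in a separate dataset.xml, I believe they can be pulled in with an include element inside datasets instead of being written inline, but check the spec for the exact form.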
Here are the docs for coord:current (they might say something different from my answer): https://oozie.apache.org/docs/5.2.1/CoordinatorFunctionalSpec.html#a6.6.1._coord:currentint_n_EL_Function_for_Synchronous_Datasets
Section 5.1 seems to mostly answer your question.