I am trying to import a certain .CSV file into my database using PDI (Kettle).
Normally this would be rather easy, as you could just link up a CSV file input
step with a Table output
step and be good to go. However, the problem is that I don't know which file I want to import in advance, that is, before executing the job/transformation in PDI.
That is because I have many files in my import folder, which all follow the same filename format: KeyDate_Filename_YYYYMMDD.CSV
The idea is to import, for a given key date, the file with the newest YYYYMMDD.
My theoretical approach to implement this would be:

1. Find the file with the newest YYYYMMDD for the given key date (this is the part I'm missing)
2. CSV file input step (already done)
3. Table output step (already done)

Unfortunately I am fairly new to PDI and don't really have a compelling idea of how to implement the first step, or whether that approach as a whole is even viable.
Can anybody think of a way to get this done? Appreciate any feedback
Edit: Forgot to mention that I am using PDI 3.2.6.
In 4.x.x I could simply use a User Defined Java Class to get this done :/
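For illustration, this is the selection logic I have in mind, written as plain standalone Java (the folder path and key date are just placeholders, not my actual setup; in 4.x.x I would have put something along these lines into a User Defined Java Class):

    import java.io.File;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class NewestFilePicker {
        public static void main(String[] args) {
            String importFolder = "/path/to/import";  // placeholder folder
            String keyDate = "20240101";              // placeholder key date

            // Filenames look like KeyDate_Filename_YYYYMMDD.CSV;
            // the capture group grabs the trailing date so files can be compared.
            Pattern p = Pattern.compile(
                    Pattern.quote(keyDate) + "_[^_]*_([0-9]{8})\\.csv",
                    Pattern.CASE_INSENSITIVE);

            File newest = null;
            String newestDate = "";
            File[] files = new File(importFolder).listFiles();
            if (files != null) {
                for (File f : files) {
                    Matcher m = p.matcher(f.getName());
                    // YYYYMMDD compares correctly as a plain string
                    if (m.matches() && m.group(1).compareTo(newestDate) > 0) {
                        newestDate = m.group(1);
                        newest = f;
                    }
                }
            }
            System.out.println(newest == null ? "no match" : newest.getAbsolutePath());
        }
    }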
There are various ways to achieve that result. Here's one:
1. Get filenames: lists all files within a specific folder that match a given pattern. As ${KeyDate} is already defined as a parameter, the pattern could be ${KeyDate}_[^_]*_[0-9]*\.csv (you can use a simpler regex, but this one will match only filenames in that format);
2. Regex Evaluation: match the filename against ${KeyDate}_[^_]*_([0-9]*)\.csv to pull the trailing date into its own field (see the small check at the end of this answer). Remember to tick the "create fields for capture groups" checkbox;
3. Sort rows / Group by: sort descending on the captured date field and keep the first row. The output of the Group by step is a single row with the most recent filename that matches your pattern.
Now you pass it to the CSV file input, telling it to "accept filenames from previous step", and specifying which field to use (default is filename).
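If it helps to see what the capture group produces, here is a minimal check of that pattern outside of PDI (the key date and the filename are made-up examples; in the transformation the captured value simply becomes a new field you can sort and group on):

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class PatternCheck {
        public static void main(String[] args) {
            String keyDate = "20240101";                      // example value of ${KeyDate}
            String filename = "20240101_Sales_20240215.csv";  // example file in the import folder

            // Same pattern as in step 2: the parentheses capture the trailing YYYYMMDD.
            Pattern p = Pattern.compile(keyDate + "_[^_]*_([0-9]*)\\.csv");
            Matcher m = p.matcher(filename);
            if (m.matches()) {
                System.out.println("captured date: " + m.group(1));  // prints 20240215
            }
        }
    }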