As the title says, is there a workaround with jOOQ that would let me use DuckDB to query parquet files stored in object storage?
Example: read_parquet(s3://path/to/my/parquet_file)
I have a schema of the parquet files that I have defined in a script to generate jOOQ classes from.
I can think of one way that would be to avoid using jooq-codegen and do it all manually by defining constants, record mappers and query builders.
Is there a better way to do it using the auto generated classes?
jOOQ doesn't know anything about these dynamic external tables, out of the box. But you perhaps create a view that reads from the parquet file and query that view, instead, assuming DuckDB can discover the parquet meta data when describing the view to jOOQ using its INFORMATION_SCHEMA
views.
Another approach would be to create a dummy table (or view) that describes the parquet file and then use runtime schema mapping to map from the dummy table to your function call. Make sure you turn off quoting of identifiers when doing this, otherwise, the whole function call is going to be quoted.
You could also use a VisitListener
or model API replacement to replace the dummy table by the function call. That would be a bit more robust, as it wouldn't require turning off quoting of identifiers.
The dummy table doesn't even have to exist in your schema. You could define a second code generation run that uses the DDLDatabase
just for this:
<configuration>
<generator>
<database>
<name>org.jooq.meta.extensions.ddl.DDLDatabase</name>
<properties>
<property>
<key>sql</key>
<value>create table dummy (col1 int, col2 int);</value>
</property>
</properties>
</database>
</generator>
</configuration>
Of course, code generation is really useful with jOOQ, but if all else fails, plain SQL templating is a pragmatic solution here. You can still combine this approach with the above dummy table, if you alias your plain SQL template to your dummy table.