I have some deeply nested data from a json file and I'm trying to load it into DuckDB:
"events": [
{
"id": "401638586",
"uid": "s:40~l:41~e:401638586",
"date": "2024-03-21T18:45Z",
"name": "Wagner Seahawks at North Carolina Tar Heels",
"shortName": "WAG VS UNC",
"season": {
"year": 2024,
"type": 3,
"slug": "post-season"
},
"competitions": [
{
"id": "401638586",
"uid": "s:40~l:41~e:401638586~c:401638586",
"date": "2024-03-21T18:45Z",
"attendance": 18223,
"type": {
"id": "6",
"abbreviation": "TRNMNT"
},
"timeValid": true,
"neutralSite": true,
"conferenceCompetition": false,
"playByPlayAvailable": true,
"recent": false,
.
.
.
The query I'm using looks like:
select
games['id'] as id,
games['date'] as date,
games['season']['year'] as season_year,
games['season']['slug'] as season_slug,
'2024-03-21' as partition_date,
games['name'] as name,
games['shortName'] as short_name,
games['status']['period'] as period,
games['status']['type']['completed'] as completed,
games['competitions'][0]['neutralSite'] as neutral,
games['competitions'][0]['conferenceCompetition'] as in_conference,
games['competitions'][0]['playByPlayAvailable'] as pbp_available
from (
select
unnest(events) as games
from read_json('/path/to/json/data')
)
limit 1;
The output looks like this:
id = 401638586
date = 2024-03-21T18:45Z
season_year = 2024
season_slug = post-season
partition_date = 2024-03-21
name = Wagner Seahawks at North Carolina Tar Heels
short_name = WAG VS UNC
period = 2
completed = true
neutral =
in_conference =
pbp_available =
You can see that once it gets into the nested 'competitions' field it just returns empty/null values. How can I access those fields correctly? I have tried using json_extract
but can't seem to get it to work.
You're using [0]
instead of [1]
See the warning in the docs:
Following PostgreSQL's conventions, DuckDB uses 1-based indexing for arrays and lists but 0-based indexing for the JSON data type.
Using [1]
will produce the expected values.
games['competitions'][1]['neutralSite'] as neutral,
games['competitions'][1]['conferenceCompetition'] as in_conference,
games['competitions'][1]['playByPlayAvailable'] as pbp_available
Rows: 1
Columns: 10
$ id <str> '401638586'
$ date <str> '2024-03-21T18:45Z'
$ season_year <i64> 2024
$ season_slug <str> 'post-season'
$ partition_date <str> '2024-03-21'
$ name <str> 'Wagner Seahawks at North Carolina Tar Heels'
$ short_name <str> 'WAG VS UNC'
$ neutral <bool> True
$ in_conference <bool> False
$ pbp_available <bool> True