sqlsql-serverazure-synapseazure-data-lake

Read multiple JSON files from storage in SQL and parse to table


I am attempting to read multiple JSON files using Synapse from a data lake storage container. I can successfully parse an example file pre-loaded in the SQL script, but I want to iterate through multiple files in the data lake (using OPENROWSET) and present the results in a table.

So far, I have:

SELECT 
    iotdata.did as Device, iotdata.dt as DeviceType, iotdata.c as PayloadComment, iotdata.mid as ManufactureId, iotdata.t as PayloadTimeStamp,
    ds.tp as DataType, ds.t as DataSetTimeStamp,
    de.*    
FROM 
    OPENROWSET(
        BULK 'https://xxxx.dfs.core.windows.net/raw/100/100/xxxx/2023/11/10/*.json',
        FORMAT = 'CSV',
        FIELDQUOTE = '0x0b',
        FIELDTERMINATOR ='0x0b',
        ROWTERMINATOR = '0x0b'
    )
    WITH (
    did nvarchar(100),
    dt nvarchar(100),
    c nvarchar(100),
    mid nvarchar(100),
    t nvarchar(100),
    ds nvarchar(max)
    ) as iotdata
cross apply openjson (iotdata.ds)
with
(
    tp nvarchar(100),
    t nvarchar(100),
    de nvarchar(max) as json
) as ds
cross apply openjson (ds.de)
with
(
    ch int '$.ch',
    m nvarchar(100)
)
as de

This produces error:

Statement ID: {251445DC-2D85-4044-BF2F-34FE862FA8BA} | Query hash: 0x4C7DD120748CDBE7 | Distributed request ID: {F813A165-E978-442F-B61B-704D72E754E1}. Total size of data scanned is 2 megabytes, total size of data moved is 0 megabytes, total size of data written is 0 megabytes. Msg 13812, Level 16, State 3, Line 1 Bulk load data conversion error (type mismatch or invalid character for the specified codepage) for row 1, column 1 (did) in https://xxxx.dfs.core.windows.net/raw/100/100/R00016/2023/11/10/xxx-Received-20231110080054-MID-2xxx.json.

What am I doing wrong?

Working SQL example, with a single JSON in embedded in the SQL

declare @json2 nvarchar(max)
set @json2 = '
  {
  "did": "0001",
  "mid": "100",
  "dt": 100,
  "t": "2023-11-10T00:00:50.046Z",
  "c": "comment",
  "ds": [
    {
      "t": "2023-11-09T20:16:56.143Z",
      "tp": "100",
      "de": [
        {
          "ch": 100,
          "m": "2023-11-09T20:16:36.037Z"
        },
        {
          "ch": 101,
          "m": "2023-11-09T20:16:56.143Z"
        }
      ]
    },
    {
      "t": "2023-11-10T00:00:12.158Z",
      "tp": "101",
      "de": [
        {
          "ch": 120,
          "m": 13.189029
        },
        {
          "ch": 121,
          "m": 77.434937
        }
      ]
    },
    {
      "t": "2023-11-10T00:00:07.054Z",
      "tp": "101",
      "de": [
        {
          "ch": 120,
          "m": 13.190003
        },
        {
          "ch": 121,
          "m": 77.44736
        }
      ]
    },
    {
      "t": "2023-11-09T22:48:42.351Z",
      "tp": "100",
      "de": [
        {
          "ch": 100,
          "m": "2023-11-09T22:47:10.731Z"
        },
        {
          "ch": 101,
          "m": "2023-11-09T22:48:42.351Z"
        }
      ]
    },
    {
      "t": "2023-11-10T00:00:50.014Z",
      "tp": "103",
      "de": [
        {
          "ch": 145,
          "m": 35.935687
        },
        {
          "ch": 146,
          "m": -96.062258
        }
      ]
    }
  ]
}';


select
    iotdata.did as Device, iotdata.dt as DeviceType, iotdata.c as PayloadComment, iotdata.mid as ManufactureId, iotdata.t as PayloadTimeStamp,
    ds.tp as DataType, ds.t as DataSetTimeStamp,
    de.*    
from   openjson (@json2)
with
(
    did nvarchar(100),
    dt nvarchar(100),
    c nvarchar(100),
    mid nvarchar(100),
    t nvarchar(100),
    ds nvarchar(max) as json
)
as iotdata
cross apply openjson (iotdata.ds)
with
(
    tp nvarchar(100),
    t nvarchar(100),
    de nvarchar(max) as json
) as ds
cross apply openjson (ds.de)
with
(
    ch int '$.ch',
    m nvarchar(100)
)
as de

Desired table output

╔════════╦════════════╦════════════════╦═══════════════╦══════════════════════════╦══════════╦══════════════════════════╦═════╦══════════════════════════╗
║ Device ║ DeviceType ║ PayloadComment ║ ManufactureId ║     PayloadTimeStamp     ║ DataType ║     DataSetTimeStamp     ║ ch  ║            m             ║
╠════════╬════════════╬════════════════╬═══════════════╬══════════════════════════╬══════════╬══════════════════════════╬═════╬══════════════════════════╣
║   0001 ║        100 ║ comment        ║           100 ║ 2023-11-10T00:00:50.046Z ║      100 ║ 2023-11-09T20:16:56.143Z ║ 100 ║ 2023-11-09T20:16:36.037Z ║
║   0001 ║        100 ║ comment        ║           100 ║ 2023-11-10T00:00:50.046Z ║      100 ║ 2023-11-09T20:16:56.143Z ║ 101 ║ 2023-11-09T20:16:56.143Z ║
║   0001 ║        100 ║ comment        ║           100 ║ 2023-11-10T00:00:50.046Z ║      101 ║ 2023-11-10T00:00:12.158Z ║ 120 ║ 13.189029                ║
║   0001 ║        100 ║ comment        ║           100 ║ 2023-11-10T00:00:50.046Z ║      101 ║ 2023-11-10T00:00:12.158Z ║ 121 ║ 77.434937                ║
║   0001 ║        100 ║ comment        ║           100 ║ 2023-11-10T00:00:50.046Z ║      101 ║ 2023-11-10T00:00:07.054Z ║ 120 ║ 13.190003                ║
║   0001 ║        100 ║ comment        ║           100 ║ 2023-11-10T00:00:50.046Z ║      101 ║ 2023-11-10T00:00:07.054Z ║ 121 ║ 77.44736                 ║
║   0001 ║        100 ║ comment        ║           100 ║ 2023-11-10T00:00:50.046Z ║      100 ║ 2023-11-09T22:48:42.351Z ║ 100 ║ 2023-11-09T22:47:10.731Z ║
║   0001 ║        100 ║ comment        ║           100 ║ 2023-11-10T00:00:50.046Z ║      100 ║ 2023-11-09T22:48:42.351Z ║ 101 ║ 2023-11-09T22:48:42.351Z ║
║   0001 ║        100 ║ comment        ║           100 ║ 2023-11-10T00:00:50.046Z ║      103 ║ 2023-11-10T00:00:50.014Z ║ 145 ║ 35.935687                ║
║   0001 ║        100 ║ comment        ║           100 ║ 2023-11-10T00:00:50.046Z ║      103 ║ 2023-11-10T00:00:50.014Z ║ 146 ║ -96.062258               ║
╚════════╩════════════╩════════════════╩═══════════════╩══════════════════════════╩══════════╩══════════════════════════╩═════╩══════════════════════════╝

A single example JSON file:

{
  "did": "0000",
  "mid": "100",
  "dt": 100,
  "t": "2023-11-09T12:00:54.055Z",
  "c": "",
  "ds": [
    {
      "t": "2023-11-09T12:00:07.508Z",
      "tp": "101",
      "de": [
        {
          "ch": 120,
          "m": 13.133502
        },
        {
          "ch": 121,
          "m": 77.360396
        },
        {
          "ch": 122,
          "m": "2023-11-09T12:00:07.508Z"
        },
        {
          "ch": 123,
          "m": 26677
        },
        {
          "ch": 124,
          "m": 5099.845
        },
        {
          "ch": 125,
          "m": 1
        },
        {
          "ch": 126,
          "m": "A"
        },
        {
          "ch": 127,
          "m": 0.02119
        },
        {
          "ch": 128,
          "m": 0
        },
        {
          "ch": 129,
          "m": 60273
        }
      ]
    },
    {
      "t": "2023-11-09T12:00:54.018Z",
      "tp": "103",
      "de": [
        {
          "ch": 145,
          "m": 35.935687
        },
        {
          "ch": 146,
          "m": -96.062258
        }
      ]
    },
    {
      "t": "2023-11-09T12:00:12.612Z",
      "tp": "101",
      "de": [
        {
          "ch": 120,
          "m": 13.133502
        },
        {
          "ch": 121,
          "m": 77.347973
        },
        {
          "ch": 122,
          "m": "2023-11-09T12:00:12.612Z"
        },
        {
          "ch": 123,
          "m": 2
        },
        {
          "ch": 124,
          "m": 0.0
        },
        {
          "ch": 125,
          "m": 1
        },
        {
          "ch": 126,
          "m": "A"
        },
        {
          "ch": 127,
          "m": 0.025042
        },
        {
          "ch": 128,
          "m": 1
        },
        {
          "ch": 129,
          "m": 0
        }
      ]
    }
  ]
}

Solution

  • When Json object is stored in the variable json2, you used openjson to parse the Json variable json2. If you want to read and parse the Json file, you need to use JSONVALUE for parsing the Json value that are read with openrowset . Below is the modified code.

    Code:

    SELECT
    JSON_VALUE(ds,'$.did') as deviceid,JSON_VALUE(ds, '$.dt') as DeviceType, JSON_VALUE(ds,'$.c') as PayloadComment,
    JSON_VALUE(ds,'$.mid') as ManufactureId, JSON_VALUE(ds,'$.t') as PayloadTimeStamp,
    ds1.tp as [DataType], ds1.t as DataSetTimeStamp
    ,ch,m
    FROM
    OPENROWSET(
    BULK  'https://xxx.blob.core.windows.net/xxx/xx/xxx.json',
    FORMAT = 'CSV',
    FIELDQUOTE = '0x0b',
    FIELDTERMINATOR ='0x0b',
    ROWTERMINATOR = '0x0b'
    )
    WITH (
    ds nvarchar(max)
    ) as iotdata
    cross  apply openjson (ds,'$.ds')
    with
    (
    tp nvarchar(100) '$.tp',
    t nvarchar(100) '$.t',
    de nvarchar(max) as json
    ) as ds1
    cross  apply openjson (de)
    with
    (
    ch int  '$.ch',
    m nvarchar(100) '$.m'
    ) as de
    

    Output:

    Output data for the above code with sample Json file ,

    deviceid DeviceType PayloadComment ManufactureId PayloadTimeStamp DataType DataSetTimeStamp ch m
    0001 100 comment 100 2023-11-10T00:00:50.046Z 100 2023-11-09T20:16:56.143Z 100 2023-11-09T20:16:36.037Z
    0001 100 comment 100 2023-11-10T00:00:50.046Z 100 2023-11-09T20:16:56.143Z 101 2023-11-09T20:16:56.143Z
    0001 100 comment 100 2023-11-10T00:00:50.046Z 101 2023-11-10T00:00:12.158Z 120 13.189029
    0001 100 comment 100 2023-11-10T00:00:50.046Z 101 2023-11-10T00:00:12.158Z 121 77.434937
    0001 100 comment 100 2023-11-10T00:00:50.046Z 101 2023-11-10T00:00:07.054Z 120 13.190003
    0001 100 comment 100 2023-11-10T00:00:50.046Z 101 2023-11-10T00:00:07.054Z 121 77.44736
    0001 100 comment 100 2023-11-10T00:00:50.046Z 100 2023-11-09T22:48:42.351Z 100 2023-11-09T22:47:10.731Z
    0001 100 comment 100 2023-11-10T00:00:50.046Z 100 2023-11-09T22:48:42.351Z 101 2023-11-09T22:48:42.351Z
    0001 100 comment 100 2023-11-10T00:00:50.046Z 103 2023-11-10T00:00:50.014Z 145 35.935687
    0001 100 comment 100 2023-11-10T00:00:50.046Z 103 2023-11-10T00:00:50.014Z 146 -96.062258

    enter image description here