amazon-redshift aws-glue-data-catalog

AWS Glue Job writes Null to Redshift


I have multiple JSON files in an S3 bucket folder. Each file follows the same pattern as the samples below: an array (list) of JSON objects.

file1

[{"coinRank":1,"coinId":"bitcoin","coinName":"Bitcoin","coinSymbol":"BTC","coinLoc":"bitcoin","coinPrice":53501.08,"coin1hrChange":-0.6,"coin24hrChange":-6.0,"coin7dChange":-9.2,"coin24hrVol":38266934579,"coinMarketCap":1012650219321,"fetchTime":"2021-12-03 23:55:42.654921","rankDate":"2021-12-03","rate":409.98,"coinPriceNaira":21934372.7784000002},{"coinRank":2,"coinId":"ethereum","coinName":"Ethereum","coinSymbol":"ETH","coinLoc":"ethereum","coinPrice":4225.28,"coin1hrChange":-0.3,"coin24hrChange":-7.2,"coin7dChange":-6.4,"coin24hrVol":27395766224,"coinMarketCap":502376237337,"fetchTime":"2021-12-03 23:55:42.655698","rankDate":"2021-12-03","rate":409.98,"coinPriceNaira":1732280.2944},{"coinRank":3,"coinId":"binancecoin","coinName":"Binance Coin","coinSymbol":"BNB","coinLoc":"binance-coin","coinPrice":593.95,"coin1hrChange":-0.7,"coin24hrChange":-4.9,"coin7dChange":-6.9,"coin24hrVol":2379210538,"coinMarketCap":100022794436,"fetchTime":"2021-12-03 23:55:42.656393","rankDate":"2021-12-03","rate":409.98,"coinPriceNaira":243507.621}]

file2

[{"coinRank":1,"coinId":"bitcoin","coinName":"Bitcoin","coinSymbol":"BTC","coinLoc":"bitcoin","coinPrice":52936.1,"coin1hrChange":-1.5,"coin24hrChange":-6.5,"coin7dChange":-1.7,"coin24hrVol":38241025550,"coinMarketCap":998999157967,"fetchTime":"2021-12-04 02:33:23.182164","rankDate":"2021-12-04","rate":409.98,"coinPriceNaira":21702742.2780000009},{"coinRank":2,"coinId":"ethereum","coinName":"Ethereum","coinSymbol":"ETH","coinLoc":"ethereum","coinPrice":4159.85,"coin1hrChange":-1.4,"coin24hrChange":-8.1,"coin7dChange":2.8,"coin24hrVol":28661534477,"coinMarketCap":493429600914,"fetchTime":"2021-12-04 02:33:23.182785","rankDate":"2021-12-04","rate":409.98,"coinPriceNaira":1705455.3030000003},{"coinRank":3,"coinId":"binancecoin","coinName":"Binance Coin","coinSymbol":"BNB","coinLoc":"binance-coin","coinPrice":582.32,"coin1hrChange":-1.9,"coin24hrChange":-5.4,"coin7dChange":-0.6,"coin24hrVol":1059743631,"coinMarketCap":97824378011,"fetchTime":"2021-12-04 02:33:23.183415","rankDate":"2021-12-04","rate":409.98,"coinPriceNaira":238739.5536}]

file3

[{"coinRank":1,"coinId":"bitcoin","coinName":"Bitcoin","coinSymbol":"BTC","coinLoc":"bitcoin","coinPrice":49375.27,"coin1hrChange":-0.7,"coin24hrChange":4.3,"coin7dChange":-9.5,"coin24hrVol":35860857801.0,"coinMarketCap":932932346783,"fetchTime":"2021-12-05 14:34:49.339803","rankDate":"2021-12-05","rate":410.764648,"coinPriceNaira":20281615.4014549591},{"coinRank":2,"coinId":"ethereum","coinName":"Ethereum","coinSymbol":"ETH","coinLoc":"ethereum","coinPrice":4218.99,"coin1hrChange":-0.7,"coin24hrChange":7.1,"coin7dChange":3.3,"coin24hrVol":27778808883.0,"coinMarketCap":500688046117,"fetchTime":"2021-12-05 14:34:49.340495","rankDate":"2021-12-05","rate":410.764648,"coinPriceNaira":1733011.9422655201},{"coinRank":3,"coinId":"binancecoin","coinName":"Binance Coin","coinSymbol":"BNB","coinLoc":"binance-coin","coinPrice":574.23,"coin1hrChange":-0.5,"coin24hrChange":5.2,"coin7dChange":-4.0,"coin24hrVol":2265817636.0,"coinMarketCap":96576091895,"fetchTime":"2021-12-05 14:34:49.341177","rankDate":"2021-12-05","rate":410.764648,"coinPriceNaira":235873.38382104}]

I used an AWS Glue crawler with a custom JSON classifier (JSON path $[*]) to split each array into individual records, and I can confirm that the number of records in the Data Catalog matches the number of records in the files. However, when I push the data to Redshift, some columns show up as null. I can also share my Glue script if necessary.



Solution

  • I figured out what the problem was with the dataset. The DataFrame had inferred different data types (int64 and float64) for the number columns, and when Glue created the table in Redshift, it created those columns as double precision (float64). As a result, the records that held integers were not cast properly in Redshift.

    1. I manually specified the column types in the pandas DataFrame using the .astype() function.
    2. I dropped the table in Redshift and also deleted the table from the Data Catalog database.
    3. I re-crawled the data and re-ran the job.

    Now every data point shows up correctly in Redshift.
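
For anyone hitting the same thing, here is a minimal sketch of step 1, not my exact script. It uses two trimmed records shaped like the sample files to show how pandas infers int64 for coin24hrVol in one file and float64 in another, then pins the types with .astype(). The NUMERIC_COLUMNS mapping and the choice of float64 (to line up with Redshift's double precision) are assumptions made for illustration; adjust them to your own schema.

```python
import json

import pandas as pd

# Trimmed records in the shape of the sample files. coin24hrVol is written
# as an integer in file1 and as a float in file3.
file1 = '[{"coinId":"bitcoin","coinPrice":53501.08,"coin24hrVol":38266934579,"coinMarketCap":1012650219321}]'
file3 = '[{"coinId":"bitcoin","coinPrice":49375.27,"coin24hrVol":35860857801.0,"coinMarketCap":932932346783}]'

df1 = pd.DataFrame(json.loads(file1))
df3 = pd.DataFrame(json.loads(file3))

# pandas infers a different dtype for the same column depending on the file.
print(df1["coin24hrVol"].dtype, df3["coin24hrVol"].dtype)

# Hypothetical mapping (an assumption, not taken from the original job):
# force every numeric column to one explicit type before writing.
NUMERIC_COLUMNS = {
    "coinPrice": "float64",
    "coin24hrVol": "float64",
    "coinMarketCap": "float64",
}


def normalize(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with the numeric columns cast to fixed dtypes."""
    return df.astype(NUMERIC_COLUMNS)


combined = pd.concat([normalize(df1), normalize(df3)], ignore_index=True)
print(combined.dtypes)
```

With the types pinned before the files are written, every file presents the same schema to the crawler, so the table Glue creates in Redshift no longer depends on which file's values happened to look like integers.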
