xmljsongoogle-bigquerystackexchangeopendata

Convert xml to json to process file into Bigquery


I would like process a stackexchange raw data into BigQuery, but first the data use a 7z compress format, so I uncompressed the data to port this to gz format, but the internal file is a xml. So I need convert the file from xml to json. Any ideas? I used p7zip to uncompress and xml2json to try port the xml file but not work.

<?xml version="1.0" encoding="utf-8"?> <comments> <row Id="1" PostId="1" Score="3" Text="We need to all post more questions. Last time, we kinda &quot;rushed&quot; to get a w hole bunch of people to sign up at the last minute (and pulled some funny stuff" CreationDate="2014-02-12T01:01:14.257" UserId="52" />..

I used xml2json xml2json -t json2xml -o xxx.xml yyy.json

Other test using xml-json **David recomendations

Used this file Users.xml(size 895M) from stackoverflow.com-Users.7z with this command:xml-json Users.xml row > Users.json

xml-json Users.xml row > Users.json /usr/local/lib/node_modules/xml-json/node_modules/xml-nodes/index.js:19 this.soFar += String(chunk)
RangeError: Invalid string length
at XmlNodes._transform (/usr/local/lib/node_modules/xml-json/node_modules/xml-nodes/index.js:19:15)
at XmlNodes.Transform._read (_stream_transform.js:183:22)
at XmlNodes.Transform._write (_stream_transform.js:167:12)
at doWrite (_stream_writable.js:265:12)
at writeOrBuffer (_stream_writable.js:252:5)
at XmlNodes.Writable.write (_stream_writable.js:197:11)
at Duplexify._write (/usr/local/lib/node_modules/xml-json/node_modules/pumpify/node_modules/duplexify/index.js:197:22)
at doWrite (/usr/local/lib/node_modules/xml-json/node_modules/pumpify/node_modules/duplexify/node_modules/readable-stream/lib/_stream_writable.js:237:10)
at writeOrBuffer (/usr/local/lib/node_modules/xml-json/node_modules/pumpify/node_modules/duplexify/node_modules/readable-stream/lib/_stream_writable.js:227:5)
at Writable.write (/usr/local/lib/node_modules/xml-json/node_modules/pumpify/node_modules/duplexify/node_modules/readable-stream/lib/_stream_writable.js:194:11)
at ReadStream.ondata (_stream_readable.js:539:20)
at ReadStream.emit (events.js:107:17)
at readableAddChunk (_stream_readable.js:162:16)
at ReadStream.Readable.push (_stream_readable.js:125:10)
at onread (fs.js:1581:12)
at Object.wrapper [as oncomplete] (fs.js:482:17)

Solution

  • David M Smith's answer is correct, converting to CSV would also work.

    After several attempts (and edits to my answer as I did not test thoroughly before answering), I managed to create a correct json file with a small Python script like this :

    #!python
    from __future__ import print_function
    import sys
    import fileinput
    import xml
    from xml.dom import minidom
    import json
    
    for line in fileinput.input():
            try:
                    xmlDoc = minidom.parseString(line)
                    print(json.dumps(dict(xmlDoc.childNodes[0].attributes.items())))
            except xml.parsers.expat.ExpatError:
                    print("Unable to process line : ", line, file=sys.stderr)
            except KeyboardInterrupt:
                    sys.exit(0)
    

    Then you will probably need to restart your shell to update the path (or any other method).

    For the biggest files, I needed to split them before because BigQuery accepts files of maximum 4GB. Here's the complete flow :

    7z x -so ../orig/stackoverflow.com-Posts.7z 2> /dev/null | ./xmltojson.py > PostHistory3.json
    split -e -d -C3G --additional-suffix=.json Posts.json Postssplit
    ls Postssplit*.json | xargs -ifile gzip file 
    gsutil cp Postssplit*.json.gz gs://YOURBUCKET
    bq --project_id=YOURPROJECT load --source_format=NEWLINE_DELIMITED_JSON YOURDATASET.YOURTABLE gs://YOURBUCKET/Postssplit01.json,gs://YOURBUCKET/Postssplit03.json,gs://YOURBUCKET/Postssplit04.json,#ETCETERA 'Id:INTEGER,PostTypeId:INTEGER,AcceptedAnswerId:INTEGER,ParentId:INTEGER,CreationDate:TIMESTAMP,Score:INTEGER,ViewCount:INTEGER,Body:STRING,OwnerUserId:INTEGER,OwnerDisplayName:STRING,LastEditorUserId:INTEGER,LastEditorDisplayName:STRING,LastEditDate:TIMESTAMP,LastActivityDate:TIMESTAMP,Title:STRING,Tags:STRING,AnswerCount:INTEGER,CommentCount:INTEGER,FavoriteCount:INTEGER,ClosedDate:TIMESTAMP,CommunityOwnedDate:TIMESTAMP'
    

    The gsutil part is not mandatory, but I am more comfortable uploading my files in Cloud Storage and then importing. That way if the import fails I can retry.

    If someone from the Google team is reading, it would be great to get this as a public dataset :-)

    Note that this will not work for any XMLs, only for the ones that are formatted like the current Stack Exchange export formats.