Trying to use Open Refine to analyze a data set of messy JSON strings (40k lines), however due to JSONs' nature of being unordered, some of the lines of JSON objects were mixed up when returned and recorded to a file.
Some objects are missing keys, some objects have incorrect order. Example:
1 {"about":"foo", "category":"bar", "id":"123", "cat_list": ["category1":"foo2"]}
2 {"id":"22","about":"barFoo", "category":"NotABar"}
3 {"about":"barbar", "category":"website", "id":"3333", "cat_list": ["category1":"foo22"]}
....
....
....
40,000 {"about":"bar123", "category":"publish", "id":"3323", "cat_list": ""}
ISSUE:
Importing the data into Open Refine, the program asks for a specific schema to compare to when it reads the file. It then reads the supplied file, comparing each JSON object on the line to the schema and imports or discards depending on how well it matches the schema! As a result many entries are left out!
IDEALLY:
Using Python, I would like to reorder the JSON objects to a specific schema which I specify.
Example:
Specified Schema
{"about":"", "category":"", "id":"", "cat_list": ""}
Which would then rearrange each line of JSON and its key-values to be in this specific format:
1 {"about": ....
2 {"about": ....
3 {"about": ....
....
....
....
40,000 {"about": ....
I am not entirely sure how I can do this efficiently?
EDIT:
I decided to just write a script to organize this. I removed some of the complex fields and have a full .JSON file:
{"name":"Carstar Bridgewater",
"category":"Automotive",
"about":"We are Bridgewaters largest professional collision centre and are committed to being there for customer cars and communities when they need us.",
"country":"Canada",
"state":"NS",
"city":"Bridgewater
"},
{"name":"Febreze",
"category":"Product/Service
",
"about":"Freshness that eliminates odorsso you can breathe happy.",
"country":"Added Nothing",
"state":"Added Nothing",
"city":"Added Nothing"},
{"name":"Custom Wood & Acrylic Turnings",
"category":"Professional Services",
"about":"Hand crafted item turned on a wood lath pen pencil bottle stopper cork screw bottle opener perfume applicator or other custom turnings",
"country":"Canada",
"state":"NS
",
"city":"Middle Sackville"},
{"name":"The Hunger Games",
"category":"Movie
",
"about":"THE HUNGER GAMES: MOCKINGJAY - PART 1 - In theatres November 2 2014. www.hungergamesmovie.ca",
"country":"Added Nothing",
"state":"Added Nothing",
"city":"Added Nothing"},
Yet. Google-Refine still refuses to accept my file? What is it that I am doing incorrectly?
Not sure if you resolved this.
The JSON needs to be valid before it can be successfully imported - at the moment the text you've posted in the Q above doesn't validate with a tool such as http://jsonlint.com.
The issue you have in terms of importing this to OpenRefine (aka Google Refine) is that the JSON objects have to be in an array:
[{"name":"Carstar Bridgewater",
"category":"Automotive",
"about":"We are Bridgewaters largest professional collision centre and are committed to being there for customer cars and communities when they need us.",
"country":"Canada",
"state":"NS",
"city":"Bridgewater"},
{"name":"Febreze",
"category":"Product/Service",
"about":"Freshness that eliminates odorsso you can breathe happy.",
"country":"Added Nothing",
"state":"Added Nothing",
"city":"Added Nothing"},
{"name":"Custom Wood & Acrylic Turnings",
"category":"Professional Services",
"about":"Hand crafted item turned on a wood lath pen pencil bottle stopper cork screw bottle opener perfume applicator or other custom turnings",
"country":"Canada",
"state":"NS",
"city":"Middle Sackville"},
{"name":"The Hunger Games",
"category":"Movie",
"about":"THE HUNGER GAMES: MOCKINGJAY - PART 1 - In theatres November 2 2014. www.hungergamesmovie.ca",
"country":"Added Nothing",
"state":"Added Nothing",
"city":"Added Nothing"}]
I can successfully import this JSON as posted here into OpenRefine it works fine - screenshots: