java json json-simple

JSON Parsing Error: Unexpected character (s) at position 226025


I saw similar questions on Stack Overflow, but none of them helped me solve my issue. I have tried to find the reason behind the error I am getting, but failed, so I am asking for help. Please don't mark this as a duplicate question.

I am parsing a JSON file and getting the following error.

Jun 06, 2017 2:06:24 PM edu.virginia.cs.services.FileManager ParseJson
SEVERE: null
Unexpected character (s) at position 226025.
    at org.json.simple.parser.Yylex.yylex(Yylex.java:610)
    at org.json.simple.parser.JSONParser.nextToken(JSONParser.java:269)
    at org.json.simple.parser.JSONParser.parse(JSONParser.java:118)
    at org.json.simple.parser.JSONParser.parse(JSONParser.java:92)
    at edu.virginia.cs.services.FileManager.ParseJson(FileManager.java:68)
    at edu.virginia.cs.main.Processer.main(Processer.java:20)

Exception in thread "main" java.lang.NullPointerException
    at edu.virginia.cs.services.FileManager.ParseJson(FileManager.java:76)
    at edu.virginia.cs.main.Processer.main(Processer.java:20)

Code of interest:

try {
    arr = (JSONArray) parser.parse(new FileReader(sourceFile));
} catch (IOException | ParseException ex) {
    Logger.getLogger(FileManager.class.getName()).log(Level.SEVERE, null, ex);
}

The file content looks as follows:

[
    {
        "url": "http://www.save-on-crafts.com/",
        "title": "Events & Wedding Sale | Save 20-60% | SaveOnCrafts",
        "content": {
            "p": ["Wedding decorations, party supplies, home d cor & craft supplies at 20-70% off. Save On Crafts brings you classic and trending fashions.", "Save On Crafts has continually evolved to meet the needs of our customers   DIY brides, home decorators, party planners, florists, and caterers. Our goal is simple: provide an exciting selection of quality , , and items at the lowest price possible for the customer with discerning taste."],
            "div": ["indicates required", "(831) 768-8428", "Take a Peek at our Specials: Save up to 70%!", "Candle Holders", "Flowers & Branches", "Crystal D cor, Chandeliers", "Set the Mood with Candles", "Champagne & Ice Buckets", "Chalkboards", "Eco Confetti", "Wedding Signs", "Sola Flowers", "Natural Wood Slices", "Classic & trending styles without the traditional retail markup.", "(831) 768-8428"],
            "a": ["X", "What's New", "SPECIALS", "Wedding Decorations", "Lights | Event Lighting", "Wood Slabs & Tree Slices", "Vases", "Apothecary Jars", "Banners", "Baskets", "Bell Jars, Cloches", "Beverage Bar Supplies", "Bird Cages & Birds", "Botanicals, Lavender, Sola Flowers", "Bottles & Jars", "Branches - Natural", "Buckets & Tubs", "Burlap Fabric, Jute, and Linen", "Cake Stands", "Candles", "Candle Holders", "Candy Buffet", "Chair Sashes, Banners, Signs", "Chalkboards", "Chandeliers", "Charger & Base Plates", "Confetti", "Corsage & Bouquet Supplies", "Craft Supplies", "Crates, Boxes, & Trays", "Crystal Decorations", "Easels & Frames", "Event Decor", "Favors", "Feathers", "Floral Supplies", "Flowers", "Greenery", "Home & Garden Decor", "Lanterns", "Mirrors & Mirror Stands", "Moss Natural & Artificial", "Nautical Decor & Decorations", "Packaging, Gift Wrapping", "Paper Lanterns & Parasols", "Paper Party Decorations", "Party Supplies", "Pots & Planters", "Placecard Holders,Table Numbers, Displays", "Preserved Flowers & Leaves", "Props, Pedestals, Risers", "Ribbon", "Silk Flowers", "Signage", "Shells - Sand", "Shepherds Hooks & Stanchions", "Sola Flowers", "Succulents & Cactus", "Table Runners & Toppers", "Terrariums", "Tote Bags, Welcome Bags", "Trees, Potted Plants", "Vases & Vase Fillers", "Wedding Cake Decorations and Toppers", "Wedding Decorations", "Wedding Signs", "Wedding Themes", "Wedding Trees & Wishing Trees", "Wood Crafts", "Wood Slabs & Tree Slices", "Wreath Making Supplies, Frames, Forms", "Gifts - Holiday Decorations", "Gifts Under $25", "Ideas & Inspiration", "Shopping Cart", "About", "Shipping", "Return Policy", "Contact", "FAQ", "Privacy Policy", "Terms and Conditions", "Read More", "Shipping", "Cart"],
            "strong": ["Need Help?", "SUBSCRIBE", "wedding supplies", "party decorations", "home d cor", "Affordable Wedding & Event Decor", "Save 20-70%", "Need Help?"],
            "span": ["*", "*", "Live Chat", "Shop Categories", "Customer Service: 7am - 5pm PST (M-F) | (831)768-8428", "Copyright   2017 Save-On-Crafts. All Rights Reserved. Designated trademarks and brands are the property of their respective owners. Use of this website constitutes acceptance of the Save-On-Craftsand Privacy Policy.", "Live Chat"]
        }
    },
    {
        "url": "http://www.carsurvey.org/",
        "title": "Carsurvey.org - Car Reviews",
        "content": {
            "p": ["I feel as if this vehicle was custom built for me, love it", "Neat cruiser, comfort first, performance second", "Beast maaaaaaate!", "Best value for the money", "There are reviews on the site", "new reviews and new comments are in the Members section, awaiting approval"],
            "td": ["2 days ago", "2 days ago", "3 days ago", "3 days ago", "18 hours ago", "19 hours ago", "19 hours ago", "19 hours ago"],
            "a": ["Write a Review", "About", "Members", "Reviews by Region", "Write a Review", "About", "Members", "Reviews by Region", "BMW", "Buick", "Chevrolet", "Chrysler", "Citroen", "Dodge", "Fiat", "Ford", "Honda", "Hyundai", "Jeep", "Kia", "Mazda", "Mercedes-Benz", "Mercury", "Mitsubishi", "Nissan", "Oldsmobile", "Peugeot", "Pontiac", "Renault", "Saturn", "Subaru", "Toyota", "Vauxhall", "Volkswagen", "Volvo", "AC", "Acura", "Alfa Romeo", "Alvis", "AMC", "ARO", "Asia Motors", "Aston Martin", "Asuna", "Audi", "Austin", "Austin Healey", "Autobianchi", "Autocars", "Avanti", "Bajaj", "Bedford", "Bentley", "Birkin", "BMW", "Bombardier", "Bond", "Brennan-Mays", "Bricklin", "Bugatti", "Buick", "Cadillac", "Caterham", "Checker", "Chery", "Chevrolet", "Chrysler", "Citroen", "Commer", "Cord", "Dacia", "Daewoo", "DAF", "Daihatsu", "Datsun", "DeLorean", "DeSoto", "DeTomaso", "Dodge", "Eagle", "Edsel", "Ferrari", "Fiat", "Ford", "Franklin", "Freightliner", "FSO", "Geely", "Geo", "GMC", "Great Wall", "Grinnall", "Hillman", "Holden", "Honda", "HSV", "Humber", "Hummer", "Hyundai", "IHC", "IKA", "Infiniti", "Innocenti", "Inokom", "Iran Khodro", "Iso Rivolta", "Isuzu", "Iveco", "Jaguar", "Jeep", "Jensen", "JiangNan", "Kaiser", "Kia", "Kish Khodro", "Lada", "Laforza", "Lamborghini", "Lancia", "Land Rover", "Lexus", "Leyland", "Leyland DAF", "Lincoln", "Lotus", "Mahindra", "Maple", "Marcos", "Maruti", "Maserati", "Matra", "Maybach", "Mazda", "McLaren", "Mercedes-Benz", "Mercury", "Merkur", "Meson", "Meyers Manx", "MG", "Microcar", "Mitsubishi", "Morgan", "Morris", "Moskvitch", "Nash", "NAZA", "Nissan", "Noble", "Nova", "NSU", "Oldsmobile", "Oltcit", "Opel", "Packard", "Panther", "Perodua", "Peugeot", "Plymouth", "Pontiac", "Porsche", "Premier", "Proton", "Puma", "Pyonghwa Motors", "Quantum", "Qvale", "Ram Trucks", "Rayton Fissore", "Reliant", "Renault", "Riley", "Robert Jankel Design", "Rolls Royce", "Rover - Austin", "SAAB", "Saleen", "Samsung", "Santana", "Sao", "Saturn", 
"Scion", "Seat", "Sebring", "Sebring Vanguard", "Shelby", "Simca", "Singer", "Skoda", "smart", "Spartan", "SsangYong", "Standard", "Sterling", "Studebaker", "Subaru", "Sunbeam", "Suzuki", "Talbot", "Tata", "Tatra", "Tesla", "Tickford", "Toyota", "Trabant", "Triumph", "Troller", "TVR", "Vanden Plas", "Vauxhall", "Venturi", "Volga", "Volkswagen", "Volvo", "Wartburg", "Westfield", "Willys", "Wolseley", "Yugo", "Zagato", "ZAZ", "Zhengzhou Nissan", "Zhonghua", "ZXAUTO", "1997 Lexus LS", "2012 Audi A7", "1985 Dodge D100", "2007 Citroen C5", "More New Car Reviews", "1987 Chrysler New Yorker", "1995 Chevrolet Monte Carlo", "1995 Chevrolet Monte Carlo", "1995 Chevrolet Monte Carlo", "More New Comments", "Advertise on this site", "Privacy Policy"],
            "strong": ["110091", "0", "3"],
            "h1": ["Car Reviews by Manufacturer"],
            "h2": ["Most Popular", "All Manufacturers"],
            "h3": ["Newest Car Reviews", "Newest Comments", "Current Status"],
            "span": ["Copyright 1997 - 2017 CSDO Media Limited", "|"]
        }
    },
    {
        "url": "http://www.hollywood.com/",
        "title": "Hollywood.com - Best of Movies, TV, and Celebrities",
        "content": {
            "div": ["TRENDING NOW", "Hollywood.com Photo Archive", "Hollywood.com Esports", "Hollywood.com Discovery", "MovieTickets.com Discovery", "Wenn Penelope Cruz will always put her all into every role she wins, even if it means transforming herself physically. The Spanish actress has varied...", "Wenn Sean Penn reportedly resolved a dispute with fellow passengers during a recent flight to New York. The Mystic River actor had just boarded the...", "Wenn Rita Ora has hinted in a new interview that she and Cara Delevingne were more than just good friends. The 26-year-old singer and the...", "Wenn Charlie Sheen has stepped out in public with a new girlfriend. The 51-year-old actor showed off his blonde partner, known only as Jools, as...", "Wenn Tom Cruise's insistence on perfecting a zero-gravity stunt for The Mummy caused members of the film's crew to vomit. Tom stars as military operative...", "Wenn The Big Chill star Meg Tilly has made a return to Hollywood after 18 years to play Brad Pitt's wife. The actress stepped away...", "Wenn Rob Kardashian has slammed rumors he's dating reality TV star Mehgan James. A report published by Us Weekly magazine on Thursday (01Jun17) suggested that...", "Wenn Taylor Swift has been pictured with her actor boyfriend Joe Alwyn for the first time. News of the Bad Blood hitmaker's relationship with 26-year-old...", "NBC Ariana Grande has touched down in the U.K. ahead of her benefit concert for victims of the terrorist attack on her gig in...", "Wenn Alec Baldwin helped raise $5.1 million for New Jersey Democrats at an event in Collingswood, New Jersey, on Thursday night (01Jun17). The 30 Rock...", "Wenn Johnny Depp has claimed he was completely unaware his former managers were using his name to take out $40 million in loans. The fight...", "Wenn Carey Mulligan is reportedly expecting her second child. 
The Great Gatsby actress was pictured outside Sexy Fish restaurant in London with her husband Marcus...", "When it was first announced that Scarlett Johansson would play The Major in the wildly popular 'Ghost in the Shell' fans weren't happy, to...", "Billy Bob Thornton and the cast of Bad Santa 2 looked super naughty at AMC Loews Lincoln Square in New York City. Check out...", "Hulu's much anticipated drama The Handmaid's Tale premiered last night. This 10-part series is an adaptation from Margaret Atwood's 1985 novel of the same name, set...", "Julianne Moore and Michelle Williams premiered their new movie Wonderstruck at the 70th Cannes Film Festival. For a complete gallery of pictures, click here.", "Selena Gomez hosted WE Day celebrations at The Forum in California for her fifth year. WE Day is one of the largest Facebook non-profits in...", "Check out the super whimsical cast of NBC's Hairspray Live! before the musical premieres Wednesday, December 7th!", "Wenn / Paramount Pictures Thandie Newton wore a wig she was given on Mission: Impossible 2 to the BAFTAs on Sunday night (12Feb17). The Westworld...", "The Light Between Oceans premiered at the Venice Film Festival and co-stars and real-life lovers Michael Fassbender and Alicia Vikander were all smiles on...", "With the Margot Robbie stepping into the role of Maid Marian, and the currently-filming of Robin Hood: Origins, there's been a resurgence of interest...", "Tom Hanks is Forrest Gump, just like like Richard Gere is Edward Lewis in Pretty Woman; some actors have had such iconic movie roles,...", "Disney These days, Disney is known for pushing the envelope and hiding adult themes and jokes in their films. However, there was a time...", "ABC Television Network Abby, The Deadliest Catch Darby Stanchfield plays Abby Whelan, and she's come a long way to get to D.C. She actually grew up...", "There are many different kinds of family businesses, but one we hardly think about is acting. 
However, there are families that have actors going...", "It's no secret that Hollywood loves its cliches from action heroes who magically avoid every bullet fired at them to fat sitcom husbands who...", "HBO HBO's Silicon Valley just finished its first season. The show features a great cast of comedians, and it's managed to satirize the nerdy masculinity of...", "32.2x", "|", "19.2x", "|", "6.84x", "|", "6.16x", "|", "4.77x", "|", "4.22x", "|", "Powered by Crowdtangle", "1999-2017 HOLLYWOOD.COM, LLC. ALL RIGHTS RESERVED", "|   |  |   |", "MOVIE, TV, AND CELEBRITY DATA PROVIDED BY AND IS THE COPYRIGHT OF"],
            "a": ["CLOSE", "Click here - to use the wp menu builder", "Click here - to use the wp menu builder", "SIGN UP FOR OUR NEWSLETTER", "Meg Tilly Returns to Movies after Two Decade Hiatus to Play Brad Pitt's Wife", "Kathy Griffin in Tears at Press Conference", "Rob Kardashian Denies Reports He's Dating Reality Star Mehgan James", "Rita Ora talks 'ambiguous' relationship with Cara Delevingne", "Sean Penn Involved in Dispute During Flight to JFK", "Khloe Kardashian won't identify friend she claims is stealing from her", "Underwear On The Outside At The 'Captain Underpants' Premiere", "Penelope Cruz: 'I don't mind getting ugly for movie roles'", "Charlie Sheen goes public with new girlfriend", "Tom Cruise made The Mummy crew vomit with zero-gravity stunt", "Kathy Griffin in Tears at Press Conference", "Underwear On The Outside At The 'Captain Underpants' Premiere", "Khloe Kardashian won't identify friend she claims is stealing from her", "Underwear On The Outside At The 'Captain Underpants' Premiere", "'Baby Driver' Looks Like The Most Fun Movie In 2nd Trailer", "Go Behind the Voices of 'Captain Underpants: The First Movie'", "Something Is Wrong In the 'Murder on the Orient Express' Trailer", "Nicole Kidman lends her Balenciaga wedding dress to exhibition", "Penelope Cruz:  I don t mind getting ugly for movie roles", "Sean Penn Involved in Dispute During Flight to JFK", "Rita Ora talks  ambiguous  relationship with Cara Delevingne", "Charlie Sheen goes public with new girlfriend", "Tom Cruise made The Mummy crew vomit with zero-gravity stunt", "Meg Tilly Returns to Movies after Two Decade Hiatus to Play Brad Pitt s Wife", "Rob Kardashian Denies Reports He s Dating Reality Star Mehgan James", "Taylor Swift Spotted with New Boyfriend Joe Alwyn for First Time", "Ariana Grande returns to U.K. 
as thousands make false ticket claims for Manchester benefit show", "Alec Baldwin Raises $5 million for Democrats", "Johnny Depp was  unaware ex managers were using his name to get loans", "Carey Mulligan is Pregnant", "see more", "RED CARPET", "Travel To Tokyo For The  Ghost in the Shell  World Premiere", "The Cast Of  Bad Santa 2  Spiced Up The Red Carpet At The NYC Premiere", "Hulu s  The Handmaid s Tale  Premieres", "Julianne Moore and Michelle Williams Premiere  Wonderstruck  at Cannes", "Selena Gomez, Demi Lovato, and Alicia Keys Celebrate WE Day", "The Cast Of NBC s  Hairspray Live!  Were Super Whimsical On The Red Carpet", "Thandie Newton wore Mission: Impossible II wig to the BAFTAs", "Michael Fassbender & Alicia Vikander Are Perfection At  The Light Between Oceans  Premiere", "see more", "DID YOU KNOW?", "All the Actresses Who Have Played Maid Marian", "12 Iconic Movie Roles That Famous Actors Turned Down", "The Original Drawing For  Snow White  Was Banned By Disney Because It Was Too Sexy!", "Facts You Never Knew About The Cast of  Scandal", "11 Actors You Didn t Know Have Famous Grandparents", "15 Celebrity Dads You Didn t Know Have Hot Sons", "The 10 Most Overused Sound Effects in Hollywood", "21 Facts You Don t Know About  Silicon Valley", "see more", "Teen Mom: OG Star Ryan Edwards Has Checked into Rehab", "E! 
News", "How To Train Your Dragon 3: Eveything We Know So Far", "moviepilot.com", "Alec Baldwin's Advice to Kathy Griffin on Trump Brouhaha: 'F--- Them All'", "The Wrap", "Alec Baldwin Defends Kathy Griffin in Wake of Trump Decapitated Photo Controversy: 'Ignore Him'", "People", "The Wonder Woman Scene That Pays Tribute To Superman", "CinemaBlend", "14 Of The Most Utterly Bizarre Things On Display At The M tter Museum", "Ranker", "Movies", "TV", "Celebrities", "Best Of/Worst Of", "Where Are They Now?", "Did You Know", "Buzzing", "Quizzes", "Pop Lists", "News", "SSNInsider", "MovieTickets.com", "EsportsHW", "Photo Archive", "About Us", "Contact Us", "Media Kit", "PRIVACY POLICY", "TERMS OF SERVICE", "COPYRIGHT ISSUES", "DISCLOSURE", "REPORT ABUSE", "BASELINE"],
            "em": ["Want More?"],
            "h1": ["WANT MORE?"],
            "i": ["Facebook", "Google+", "Twitter", "YouTube", "Instagram"],
            "h2": ["Sign Up For Our Newsletter!", "Sign Up For Our Newsletter!"],
            "h3": ["FOLLOW US!", "LIKE US!", "TOPIC", "Category", "partners", "COMPANY", "Be friends with us"],
            "time": ["Jun 2, 2017", "Jun 2, 2017", "Jun 2, 2017", "Jun 2, 2017", "Jun 2, 2017", "Jun 2, 2017", "Jun 2, 2017", "Jun 2, 2017", "Jun 2, 2017", "Jun 2, 2017", "Jun 2, 2017", "Jun 2, 2017", "Mar 17, 2017", "Nov 16, 2016", "Apr 26, 2017", "May 18, 2017", "Apr 28, 2017", "Nov 18, 2016", "Feb 14, 2017", "Sep 1, 2016", "Mar 7, 2017", "Aug 15, 2013", "Oct 5, 2016", "Apr 4, 2014", "Apr 22, 2016", "Aug 22, 2014", "Sep 21, 2015", "Jun 11, 2014"],
            "span": ["Celebrities", "Movies", "Television", "Showtimes", "Search", "Esports", "Photo Archive", "The Latest", "Video", "Buzzing", "Pop Lists", "Did You Know?", "Where Are They Now?", "Featured", "Take A Sneak Peak At The Movies Coming Out This Week (8/12)", "Kathy Griffin in Tears at Press Conference", "Underwear On The Outside At The  Captain Underpants  Premiere", "Khloe Kardashian won t identify friend she claims is stealing from her", "Penelope Cruz:  I don t mind getting ugly for movie roles", "Partners", "MovieTickets.com", "SSN Insider", "Privacy Policy", "Copyright Notice", "Terms of Use", "Report Abuse", "Videos", "Buzzing", "Red Carpet", "Esports", "Photo Archive", "Newsletter Signup", "Meg Tilly Returns to Movies after Two Decade Hiatus to Play Brad Pitt's Wife", "WENN", "Kathy Griffin in Tears at Press Conference", "WENN", "Rob Kardashian Denies Reports He's Dating Reality Star Mehgan James", "WENN", "Rita Ora talks 'ambiguous' relationship with Cara Delevingne", "WENN", "Sean Penn Involved in Dispute During Flight to JFK", "WENN", "Khloe Kardashian won't identify friend she claims is stealing from her", "WENN", "Underwear On The Outside At The 'Captain Underpants' Premiere", "Michael Chaney", "Penelope Cruz: 'I don't mind getting ugly for movie roles'", "WENN", "Charlie Sheen goes public with new girlfriend", "WENN", "Sign Up for Our Newsletter!", "Follow @hollywood", "THE LATEST", "Hot on Facebook"]
        }
    }
]

I have crawled 500K webpages and stored them in a JSON file. Now I am trying to read it back. The whole file is 2 GB, so I am unable to share it.

I understand that the JSON parser is encountering an unexpected character (s) in the file, but I am unable to find which line of the JSON file is erroneous. Is there any way to find the faulty line in the JSON file?
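One possible way to locate the offending text is to read the characters surrounding the reported position instead of searching for a line number. The following is only a minimal sketch: it assumes the position in the error message is a 0-based character offset into the stream the parser read, and the class name JsonErrorContext and method contextAround are hypothetical helpers, not part of JSON.simple.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

final class JsonErrorContext {

    /**
     * Returns the characters within radius of the given character offset.
     * Assumption: position is a 0-based character offset, as reported by the parser.
     */
    static String contextAround(Reader reader, long position, int radius) {
        try {
            long start = Math.max(0, position - radius);
            // Reader.skip may skip fewer characters than requested, so loop.
            long remaining = start;
            while (remaining > 0) {
                long skipped = reader.skip(remaining);
                if (skipped <= 0) {
                    return ""; // the position lies beyond the end of the input
                }
                remaining -= skipped;
            }
            int length = (int) (position - start) + radius + 1;
            char[] buffer = new char[length];
            int total = 0;
            while (total < length) {
                int n = reader.read(buffer, total, length - total);
                if (n < 0) {
                    break; // end of input reached before the window was full
                }
                total += n;
            }
            return new String(buffer, 0, total);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Stand-in for the real file; a FileReader on the source file would be used instead.
        String json = "{\"body\": [\"\\\"], \"span\": [\"Area\"]}";
        System.out.println(contextAround(new StringReader(json), 12, 8));
    }
}
```

With JSON.simple, the offset can be taken from ParseException.getPosition() in a catch block that handles ParseException on its own, and the same file can be reopened with a new FileReader to print the surrounding text.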


Edit

The main code for processing webpage contents is as follows.

for (Element element : elements) {
    String tagName = element.tagName();
    if (Util.isValidTag(tagName)) {
        String textValue = Util.removeNonPrintableChars(element.ownText()).trim().replace("\"", "\'");
        if (!textValue.isEmpty()) {
            if (tagTextMap.containsKey(tagName)) {
                tagTextMap.get(tagName).add(textValue);
            } else {
                ArrayList<String> arr = new ArrayList<>();
                arr.add(textValue);
                tagTextMap.put(tagName, arr);
            }
        }
    }
}

I just remove non-printable characters and replace double quotes with single quotes; that's it.


Update

I found the problematic section in the JSON file.

{
    "url": "http://www.kudzu.com/",
    "title": "Atlanta roofers, hvac, plumbers, electricians and other businesses - reviews, coupons and cost estimates from your neighbors.",
    "content": {
        "h2": ["From Our Experts", "Recent Projects", "Recent Articles", "What It Costs", "Review a Business", "What It Costs", "Other Markets"],
        "body": ["\"],
        "span": ["Area", "Area", "Cost"]
    }
}

This part, "body": ["\"], is the source of the problem. I can see now why it causes the error: the backslash escapes the closing double quote, so the string is never terminated.


Solution

  • It seems you are having trouble escaping special characters. These are the escape sequences JSON uses for special characters:

    1. \b Backspace (ASCII code 08)
    2. \f Form feed (ASCII code 0C)
    3. \n New line
    4. \r Carriage return
    5. \t Tab
    6. \" Double quote
    7. \\ Backslash character

    So, when dumping JSON you need to escape these special characters. Fortunately, every JSON library has a way to do this. Since you appear to be using the JSON.simple toolkit, you can use the JSONObject.escape() method to escape them.
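To illustrate what such escaping does, here is a minimal, library-free sketch that handles the characters listed above plus other control characters. The class JsonEscapeSketch and its escape method are hypothetical stand-ins written for this example; in the actual crawler code, calling JSONObject.escape(textValue) from JSON.simple before writing the value would be the more natural choice.

```java
final class JsonEscapeSketch {

    /** Escapes a raw string so it can be placed between double quotes in JSON. */
    static String escape(String s) {
        StringBuilder sb = new StringBuilder(s.length() + 16);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\b': sb.append("\\b");  break;
                case '\f': sb.append("\\f");  break;
                case '\n': sb.append("\\n");  break;
                case '\r': sb.append("\\r");  break;
                case '\t': sb.append("\\t");  break;
                default:
                    if (c < 0x20) {
                        // Remaining control characters use the four-hex-digit form.
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String textValue = "\\"; // a single backslash, as in the problematic "body" entry
        // prints "body": ["\\"]
        System.out.println("\"body\": [\"" + escape(textValue) + "\"]");
    }
}
```

Escaping at write time also means double quotes no longer need to be replaced with single quotes, since an escaped double quote is valid inside a JSON string.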