pythonjsonweb-scrapinghtml-parsing

Parse html with ajax json inside


I have such files to parse (from scrapping) with Python:

some HTML and JS here...
SomeValue = 
{
     'calendar': [
     {       's0Date': new Date(2010, 9, 12),
             'values': [
                     { 's1Date': new Date(2010, 9, 17), 'price': 9900 },
                     { 's1Date': new Date(2010, 9, 18), 'price': 9900 },
                     { 's1Date': new Date(2010, 9, 19), 'price': 9900 },
                     { 's1Date': new Date(2010, 9, 20), 'price': 9900 },
                     { 's1Date': new Date(2010, 9, 21), 'price': 9900 },
                     { 's1Date': new Date(2010, 9, 22), 'price': 9900 },
                     { 's1Date': new Date(2010, 9, 23), 'price': 9900 }]
     },
     'data': [{
     index: 0,
     serviceClass: 'Economy',
     prices: [9900, 320.43, 253.27],
     eTicketing: true,
     segments: [{
             indexSegment: 0,
             stopsCount: 1,
             flights: [{
                     index: 0,

... and a lot of nested data and again HTML and JS...

I need to parse it and extract all json data. Now I use regex with cleaning all '\n' and '\t' and eval() function to convert it to Python dictionary.. I really don't like this solution, eval() especially. But I looked at BeautifulSoup and lxml, and didn't find something that will help to parse it.
Can you suggest something better than regex and eval() for this task?
Page example: http://codepaste.ru/3830/


Solution

  • aarrghhh no regex dont use regex no regex no no nooooooo


    Use the json module to handle JSON data:

    import json
    json.loads( <string> )
    

    Use BeautifulSoup or lxml to handle parsing the html page:

    from BeautifulSoup import BeautifulSoup
    soup = BeautifulSoup( <string> )
    

    If you want specific help, you'll need to provide specific data e.g. the class of the tags in which this data is enclosed. You could soup.findAll the script tags, for instance, then strip some lines to get to the JSON, then feed that into json.loads.