Tags: json, bash, raspberry-pi

Converting the output of MediaWiki to plain text


I am querying the MediaWiki API for the search term Tiger with this request:

https://simple.wikipedia.org/w/api.php?action=query&prop=extracts&titles=Tiger&format=json&exintro=1

Response:

{"batchcomplete":"","query":{"pages":{"9796":{"pageid":9796,"ns":0,"title":"Tiger","extract":"<p>The <b>tiger</b> (<i>Panthera tigris</i>) is a carnivorous mammal. It is the largest living member of the cat family, the Felidae. It lives in Asia, mainly India, Bhutan, China and Siberia.</p>\n<p></p>"}}}}

How do I get plain-text output like this instead?

The tiger (Panthera tigris) is a carnivorous mammal. It is the largest living member of the cat family, the Felidae. It lives in Asia, mainly India, Bhutan, China and Siberia.

Could someone also tell me how to store the result in a text file? I'm a beginner, so please be nice. I need this for a project I'm doing in Bash on a Raspberry Pi 2 running Raspbian.


Solution

  • It's usually recommended to use a JSON parser for handling JSON; one that I like is jq:

    % jq -r '.query.pages[].extract' file
    <p>The <b>tiger</b> (<i>Panthera tigris</i>) is a carnivorous mammal. It is the largest living member of the cat family, the Felidae. It lives in Asia, mainly India, Bhutan, China and Siberia.</p>
    <p></p>
    

    To remove the HTML tags you can do something like:

    ... | sed 's/<[^>]*>//g'
    

    This removes every HTML tag that does not span multiple lines (sed works line by line, so a tag split across two lines would not be matched):

    % jq -r '.query.pages[].extract' file | sed 's/<[^>]*>//g'
    The tiger (Panthera tigris) is a carnivorous mammal. It is the largest living member of the cat family, the Felidae. It lives in Asia, mainly India, Bhutan, China and Siberia.
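
The question also asks how to store the result in a text file, and the pipeline above leaves an empty line behind from the trailing <p></p>. A minimal sketch of one way to handle both, assuming the standard sed on Raspbian; the function name strip_html and the output file tiger.txt are made up for illustration:

```shell
#!/usr/bin/env bash
# strip_html (hypothetical helper): reads HTML on stdin, writes plain text to stdout.
strip_html() {
    # 1st expression: delete anything that looks like <...> within a single line.
    # 2nd expression: drop lines that are empty or whitespace-only afterwards.
    sed -e 's/<[^>]*>//g' -e '/^[[:space:]]*$/d'
}

# Intended use, with the JSON saved in "file" as above:
#   jq -r '.query.pages[].extract' file | strip_html > tiger.txt
```

The > redirection creates tiger.txt or overwrites it if it already exists; >> would append to it instead.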
    

    Here, file is the file the JSON is stored in, e.g.:

    curl -so - 'https://simple.wikipedia.org/w/api.php?action=query&prop=extracts&titles=Tiger&format=json&exintro=1' > file
    jq '...' file
    

    or, using Bash process substitution:

    jq '...' <(curl -so - 'https://simple.wikipedia.org/w/api.php?action=query&prop=extracts&titles=Tiger&format=json&exintro=1')
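
If the search term changes from run to run, the request URL can be built from a variable. A rough sketch, where the function name wiki_url and the example title are hypothetical. Only spaces are handled (MediaWiki accepts underscores in place of spaces in titles); other special characters such as & would still need percent-encoding:

```shell
#!/usr/bin/env bash
# wiki_url (hypothetical helper): print the API URL for the page title given in $1.
wiki_url() {
    # Replace every space in the title with an underscore.
    title=$(printf '%s' "$1" | tr ' ' '_')
    printf 'https://simple.wikipedia.org/w/api.php?action=query&prop=extracts&titles=%s&format=json&exintro=1\n' "$title"
}

# Intended use (needs network access):
#   curl -s "$(wiki_url 'Bengal tiger')" > file
```

The function only prints the URL, so it can be checked by eye before handing it to curl.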
    

    You can install jq with:

    sudo apt-get install jq
    

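    For your example input you can also use grep with -P (PCRE), but using a proper JSON parser as above is recommended. Note that grep returns the raw JSON string, so escapes such as the literal \n are left in the output: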

    grep -oP '(?<=extract":").*?(?=(?<!\\)")' file 
    <p>The <b>tiger</b> (<i>Panthera tigris</i>) is a carnivorous mammal. It is the largest living member of the cat family, the Felidae. It lives in Asia, mainly India, Bhutan, China and Siberia.</p>\n<p></p>
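
The grep approach returns the raw JSON string, so escapes such as \n and \" are still present in its output. A sketch of the extra cleanup this would need, assuming those are the only two escapes in the text; unescape_basic is a made-up name. A JSON parser such as jq decodes all escapes, including \\ and \uXXXX, which is one reason it is the recommended route:

```shell
#!/usr/bin/env bash
# unescape_basic (hypothetical helper): decode only \n and \" in text read from stdin.
unescape_basic() {
    # Turn a literal backslash-n into a real newline (a backslash followed by
    # a line break in the replacement is the portable sed spelling of a
    # newline), then turn \" into a plain double quote.
    sed -e 's/\\n/\
/g' -e 's/\\"/"/g'
}

# Intended use, followed by the same tag stripping as before:
#   grep -oP '(?<=extract":").*?(?=(?<!\\)")' file | unescape_basic | sed 's/<[^>]*>//g'
```

This sketch would mishandle an escaped backslash followed by n (\\n in the JSON), so it is only suitable for simple input like the example.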