beautifulsoup jsoup magmi openrefine

Parse and remove HTML tags using Google Refine/OpenRefine & Jsoup/BeautifulSoup


I use Google Refine for dealing with messy product data sheets in order to format them for upload into Magento stores using Magmi/Dataflow profiles. I am still using Google Refine 2.5 as it is the latest stable release.

The descriptions from supplier datasheets are often filled with binary characters and messy HTML that I need to manipulate and re-format en masse.

I know I can use some combination of GREL / Python / Jsoup to accomplish my task, but I'm having trouble with the syntax when moving in and out of the different languages.

My data looks like the following:

Some product data here. <ul><li>Bullet one <li> Bullet two</ul> <br /> Some other product data here. <span id="product-image><img src="image.png"></span>

Using the snippet value.parseHtml().select("img").toString(), I am able to parse the image tags I want, but I'm unable to remove or replace these tags using the replace() function in GREL. I tried putting the expression into the first argument of replace(), like value.replace(/value.parseHtml().select("img").toString()/, ""), along with other similar variations, to no avail.

For my current project I need to: 1) remove all <img>, <div>, <p> and <span> tags, plus 2) parse and separate YouTube video links into a separate column.

Can someone please help me with the syntax / cook me up a function to accomplish this task (preferably with an explanation of the syntax)?


Solution

  • Remove Tag

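    If you just want to strip the tags, there is no need to use parseHtml(). The attempt in the question fails because everything between the slashes is treated as a literal regular expression, not evaluated as GREL. Note also that value.replace('<img','') only removes the literal text <img and leaves the rest of the tag behind; to remove each whole image tag, pass a regular expression instead: value.replace(/<img[^>]*>/, ''). Likewise, value.replace(/<div[^>]*>/, '').replace('</div>', '') removes every opening <div> (with or without attributes) and every closing </div>. The same pattern works for <p> and <span>.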
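Since the question also mentions Python, here is a minimal sketch of the same tag stripping done with a single regular expression from the standard library. The names TAGS_TO_DROP and strip_tags are made up for illustration, and a regex is only a rough tool for messy HTML; it removes the tags themselves and keeps the text between them.

```python
import re

# Tags to strip, taken from the question; adjust as needed.
TAGS_TO_DROP = ("img", "div", "p", "span")

# Matches an opening or closing tag for any listed name, with or without
# attributes. The \b keeps <p> from also matching <pre> or <param>.
_TAG_RE = re.compile(r"</?(?:%s)\b[^>]*>" % "|".join(TAGS_TO_DROP), re.IGNORECASE)

def strip_tags(html):
    """Return html with the listed tags removed and their inner text kept."""
    return _TAG_RE.sub("", html)

# The sample row from the question, including its unbalanced quote.
sample = ('Some product data here. <ul><li>Bullet one <li> Bullet two</ul> <br /> '
          'Some other product data here. <span id="product-image><img src="image.png"></span>')
cleaned = strip_tags(sample)
```

In an OpenRefine Python/Jython expression, the same idea would presumably be written against the cell's value and finished with a return statement, for example return strip_tags(value).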

  • Extract Images

    value.parseHtml().select("img").toString() selects the matching tags and returns their HTML. Using your example, it returns:

    <img alt=" style=" width:="" 62px="" src="http://sunlightsupply.s3.amazonaws.com/images/icon/product/logo_culus.gif" />

    and

    <img alt=" src=" http:="" sunlightsupply="" s3="" amazonaws="" com="" images="" icon="" product="" watchvideo="" gif="" complete="complete" />
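For comparison, a rough Python sketch of the same selection, assuming a plain regular expression is good enough for the data at hand. The function name find_img_tags is hypothetical. Unlike parseHtml(), it does not repair or normalize the markup; it returns the image tags exactly as they appear in the cell.

```python
import re

# Matches a whole <img ...> tag, attributes included.
_IMG_RE = re.compile(r"<img\b[^>]*>", re.IGNORECASE)

def find_img_tags(html):
    """Return every <img> tag found in html, in document order."""
    return _IMG_RE.findall(html)

# The sample row from the question.
sample = ('Some product data here. <ul><li>Bullet one <li> Bullet two</ul> <br /> '
          'Some other product data here. <span id="product-image><img src="image.png"></span>')
images = find_img_tags(sample)
```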

  • Extract YouTube Link

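    Assuming the href attributes are double-quoted, the GREL expression value.split('href="')[1].split('"')[0] extracts the first link in the cell, that is, the text between href=" and the next double quote.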

    You can store the result in a new column and then remove the links that do not contain youtube.com, using a custom text facet with value.contains('youtube.com')
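The extract-then-filter approach above can also be sketched in Python in one step. This is an illustration only: youtube_links is a made-up name, the sample HTML and its URLs are invented for the example, and the pattern assumes quoted href attributes.

```python
import re

# Captures the value of each quoted href attribute.
_HREF_RE = re.compile(r'href\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE)

def youtube_links(html):
    """Return every href value in html that contains youtube.com."""
    return [url for url in _HREF_RE.findall(html) if "youtube.com" in url]

# Invented sample data, not taken from the question.
sample = ('<a href="http://www.youtube.com/watch?v=abc123">Watch video</a> '
          '<a href="http://example.com/spec.pdf">Spec sheet</a>')
links = youtube_links(sample)
```

To fill a new column from this in OpenRefine, the list would presumably need to be joined into a single string first, for example return ", ".join(youtube_links(value)).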