Parse and remove HTML tags using Google Refine/OpenRefine & Jsoup/BeautifulSoup

I use Google Refine for dealing with messy product data sheets in order to format them for upload into Magento stores using Magmi/Dataflow profiles. I am still using Google Refine 2.5 as it is the latest stable release.

The descriptions from supplier datasheets are often filled with binary characters and messy HTML that I need to manipulate and re-format en masse.

I know I can use some combination of GREL / Python / Jsoup to accomplish my task, but I'm having trouble with the syntax when moving in and out of the different languages.

My data looks like the following:

Some product data here. <ul><li>Bullet one <li> Bullet two</ul> <br /> Some other product data here. <span id="product-image><img src="image.png"></span>

Using the snippet value.parseHtml().select("img").toString() I am able to select the image tags I want, but I'm unable to remove or replace these tags using the replace() function in GREL. I tried passing the expression as the first argument of replace(), like value.replace(/value.parseHtml().select("img").toString()/, ""), and tried other similar expressions, to no avail.

For my current project I need to: 1) remove all <img>, <div>, <p> and <span> tags, plus 2) parse and separate YouTube video links into a separate column.

Can someone please help me with the syntax / cook me up a function to accomplish this task (preferably with an explanation of the syntax)?

Solution

Remove Tag

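If you just want to remove the tags, there is no need to use parseHtml(). A plain value.replace('<img','') only deletes the literal text <img and leaves the attributes behind, so pass replace() a regular expression that matches the whole tag instead: value.replace(/<img[^>]*>/,'') removes every image tag. For the <div> tags, value.replace(/<div[^>]*>/,'').replace('</div>','') removes both the opening tags (with or without attributes) and the closing tags.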
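
Since the question also mentions Python, here is a minimal sketch of the same tag removal as a single regular expression covering all four tags from the question. The names TAGS_TO_STRIP and strip_tags are made up for illustration, and the pattern assumes no attribute value contains a literal > character.

```python
import re

# Tag names taken from the question; adjust the tuple as needed.
TAGS_TO_STRIP = ("img", "div", "p", "span")

# Matches an opening, closing, or self-closing tag for any listed name.
# The \b keeps "p" from also matching tags such as <pre> or <param>.
TAG_PATTERN = re.compile(
    r"</?(?:%s)\b[^>]*>" % "|".join(TAGS_TO_STRIP),
    re.IGNORECASE,
)


def strip_tags(html):
    """Remove the listed tags but keep the text between them."""
    return TAG_PATTERN.sub("", html)


if __name__ == "__main__":
    sample = 'Some data. <p>Text</p> <span id="x"><img src="image.png"></span>'
    print(strip_tags(sample))
```

In an OpenRefine transform with the language set to Python / Jython, the same pattern could be applied to the cell with something like return strip_tags(value), with the definitions above pasted into the expression box.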

Extract Images

value.parseHtml().select("img").toString() selects each matching tag together with its attributes. Using your example, it returns:

<img alt=" style=" width:="" 62px="" src="http://sunlightsupply.s3.amazonaws.com/images/icon/product/logo_culus.gif" />

and

<img alt=" src=" http:="" sunlightsupply="" s3="" amazonaws="" com="" images="" icon="" product="" watchvideo="" gif="" complete="complete" />
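
For comparison, the select("img") step can be sketched in standalone Python 3 with the standard library's html.parser, collecting only the src attribute of each image. ImgCollector and image_sources are hypothetical names, and this is not what OpenRefine runs internally; its bundled Jython is Python 2, where the module is named HTMLParser instead.

```python
from html.parser import HTMLParser


class ImgCollector(HTMLParser):
    """Collects the src attribute of every <img> tag it sees."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <img ... /> are routed here as well.
        if tag == "img":
            src = dict(attrs).get("src")
            if src is not None:
                self.sources.append(src)


def image_sources(html):
    """Return the src values of all <img> tags, in document order."""
    parser = ImgCollector()
    parser.feed(html)
    parser.close()
    return parser.sources


if __name__ == "__main__":
    print(image_sources('<p>x</p><img src="image.png"><img alt="a" src="b.gif" />'))
```

As the output above shows, badly quoted attributes in the source data can confuse any parser, so the results are only as clean as the markup.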

Extract YouTube Link

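The GREL expression value.split('href=')[1].split('"')[0] extracts the first link in the cell: it takes the text that follows the first href= and keeps everything up to the closing quote.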

You can store the links in a new column and then remove every link that doesn't contain youtube.com, using a custom facet with value.contains('youtube.com').
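
If a cell can hold more than one link, a Python sketch along these lines could collect every href and keep only the YouTube ones. HREF_PATTERN and youtube_links are illustrative names, the youtu.be host is an assumption not mentioned in the question, and the sample URLs are invented.

```python
import re

# Captures the quoted value of each href attribute.
HREF_PATTERN = re.compile(r'href\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE)

# Hosts treated as YouTube links; "youtu.be" is an assumed extra.
YOUTUBE_HOSTS = ("youtube.com", "youtu.be")


def youtube_links(html):
    """Return every href value that points at a YouTube host."""
    links = HREF_PATTERN.findall(html)
    return [url for url in links if any(host in url for host in YOUTUBE_HOSTS)]


if __name__ == "__main__":
    sample = (
        '<a href="https://www.youtube.com/watch?v=abc123">Video</a> '
        '<a href="http://example.com/manual.pdf">Manual</a>'
    )
    print(youtube_links(sample))
```

When adding a column based on the description column with the language set to Python / Jython, returning ", ".join(youtube_links(value)) would put the links for each row into the new column as one string.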