http-redirect http-status-code-301 http-status-code-302 openrefine

Fetching a redirect's target URL in OpenRefine


I have a CSV of ~2000 URLs that each respond with a 301 or 302 redirect when queried. I'm trying to figure out whether OpenRefine can write the destination URL (the one it actually retrieves the HTML from when I fetch a URL) to a new column, or whether there is some other way to get it.

e.g.

https://www-istp.gsfc.nasa.gov/stargaze/Ssolsys.htm 

redirects to

https://pwg.gsfc.nasa.gov/stargaze/Ssolsys.htm

I know that from clicking the link in my browser of choice. I've found a few answers suggesting that this can be done in various programming languages, but nothing so far showing how to do it in OpenRefine, even though I'm about 80% sure that it can be.

Does anyone out there know what I might be able to do to make this happen?


Solution

  • In OpenRefine you can write expressions in GREL, Jython (a Java implementation of Python 2), and Clojure. As far as I know, GREL cannot report the target of a redirect, so I would use Python for that.

    In your OpenRefine project, go to the column containing the URLs and use "Edit column" > "Add column based on this column..."

    In the dialog window that opens (see the screenshot below), change the expression language to "Python / Jython" and use the following code snippet to retrieve the "real" URL of the request.

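    import urllib2
    # urlopen follows 301/302 redirects automatically
    response = urllib2.urlopen(value)
    # geturl() returns the URL of the page that was finally retrieved
    return response.geturl()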
    

    Screenshot of OpenRefine dialog for adding a new column with the target URL.
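With ~2000 URLs, some requests will probably fail or hang, and an unhandled exception leaves the cell without a useful value. The following is only a sketch of how the snippet above could be made more forgiving. The function name `resolve_final_url`, the `opener` parameter, the 10-second timeout, and the "ERROR: ..." return format are all my own assumptions, not anything OpenRefine provides.

```python
# Sketch only: redirect lookup with basic error handling.
# resolve_final_url, opener, and the "ERROR: ..." format are hypothetical choices.
try:
    from urllib2 import urlopen  # Python 2 / Jython, as in OpenRefine
except ImportError:
    from urllib.request import urlopen  # Python 3, for testing outside OpenRefine


def resolve_final_url(url, opener=urlopen, timeout=10):
    """Return the URL reached after redirects, or an error string on failure."""
    try:
        # urlopen follows redirects itself; timeout avoids hanging on dead hosts
        response = opener(url, timeout=timeout)
        try:
            return response.geturl()
        finally:
            response.close()
    except Exception as exc:  # e.g. HTTPError, URLError, socket timeout
        return "ERROR: %s" % exc
```

In the OpenRefine expression box, I would expect this to work by pasting the import and the function definition and ending the expression with `return resolve_final_url(value)`, but I have not verified that against every OpenRefine version. Rows that failed can then be found with a text filter on "ERROR:".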