java url sanitization pii

Cleaning up URLs to remove personal information


Are there rules for identifying and removing PII from URLs? I would like the approach to be generic and to handle all sorts of URLs that we might encounter on the internet.

Clarification: I have a list of URLs visited by people browsing the internet and want to remove PII from them.


Solution

  • To answer the question as restated in your reply to snemarch:

    Yes, I understand that. I meant: what considerations do I need to keep in mind to identify PII in URLs? What are the various ways in which PII might occur in URLs?

    HTTP GET parameters can be passed in many different ways. Some, and likely most, will look like this:

    example.com/form.php?key=value

    Other websites, including Stack Overflow, may use a URL rewrite to transform the link "example.com/form/value" into the equivalent "example.com/form.php?key=value". This rewrite depends entirely on the configuration of the server, and there is no simple way to detect and strip off PII presented this way.

    With this in mind, there is really no way to remove 100% of the PII from a list of arbitrary URLs, since a URL that carries such information can be indistinguishable from one that does not. You can, at the very least, strip out the parts most likely to carry it, such as the query string in a URL of the form "example.com/form.php?key=value". I would be willing to bet that any URL containing a "=" has some sort of variable in it and should be filtered. Past that, you are going to have to go through the majority of the list by hand.
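
    As a rough illustration of that first pass, here is a minimal Java sketch. The class name QueryStripper and both method names are made up for this example, and it deliberately uses plain string operations rather than a full URL parser, so treat it as a starting point and not a complete sanitizer:

    ```java
    // Hypothetical helper for illustration; not part of any library.
    final class QueryStripper {

        /** Cuts the URL at the first '?' or '#', dropping the query string and fragment. */
        static String stripQueryAndFragment(String url) {
            int cut = url.length();
            int query = url.indexOf('?');
            int fragment = url.indexOf('#');
            if (query >= 0) {
                cut = Math.min(cut, query);
            }
            if (fragment >= 0) {
                cut = Math.min(cut, fragment);
            }
            return url.substring(0, cut);
        }

        /** Crude flag: true if an '=' appears anywhere, e.g. in a query or a ";name=value" path parameter. */
        static boolean looksParameterized(String url) {
            return url.indexOf('=') >= 0;
        }

        public static void main(String[] args) {
            System.out.println(stripQueryAndFragment("example.com/form.php?key=value"));
            System.out.println(looksParameterized("example.com/form/value"));
        }
    }
    ```

    Note that the rewritten form "example.com/form/value" passes through untouched and is not flagged, which is exactly the limitation described above.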

    Depending on how big the list is and how serious you are about filtering it, you could go further. You could research the common mod_rewrite schemes used by popular products and try to match them against your list, or scrape the URLs to learn more about them. You could also attempt some complicated, and likely ugly, algorithms that guess which parts of a URL are variables, possibly by comparing the tokens of similar URLs a user has visited. Similar URLs with slightly different text in a given token probably contain a variable there, and that token should be filtered.
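
    The token comparison idea could be sketched roughly as follows. Everything here is an assumption made for illustration: the class name PathTokenMasker, the "{var}" placeholder, the minDistinct threshold, and the choice to group URLs by host and number of path tokens. It also assumes the URLs have no scheme and have already had their query strings removed, like the examples above:

    ```java
    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    // Hypothetical heuristic for illustration; expect false positives and negatives.
    final class PathTokenMasker {
        static final String PLACEHOLDER = "{var}";

        /** Masks any path position that takes at least minDistinct different values within its group. */
        static List<String> maskVaryingTokens(List<String> urls, int minDistinct) {
            // First pass: per group, collect the distinct tokens seen at each position.
            Map<String, List<Set<String>>> seen = new HashMap<>();
            for (String url : urls) {
                String[] tokens = url.split("/", -1);
                List<Set<String>> columns = seen.computeIfAbsent(groupKey(tokens), k -> new ArrayList<>());
                for (int i = 0; i < tokens.length; i++) {
                    if (columns.size() <= i) {
                        columns.add(new HashSet<>());
                    }
                    columns.get(i).add(tokens[i]);
                }
            }
            // Second pass: replace positions that vary too much to be fixed path names.
            List<String> masked = new ArrayList<>();
            for (String url : urls) {
                String[] tokens = url.split("/", -1);
                List<Set<String>> columns = seen.get(groupKey(tokens));
                for (int i = 1; i < tokens.length; i++) { // token 0 is the host and is never masked
                    if (columns.get(i).size() >= minDistinct) {
                        tokens[i] = PLACEHOLDER;
                    }
                }
                masked.add(String.join("/", tokens));
            }
            return masked;
        }

        /** URLs are compared only if they share a host and the same number of tokens. */
        private static String groupKey(String[] tokens) {
            return tokens[0] + "#" + tokens.length;
        }
    }
    ```

    For example, given "example.com/user/alice/profile", "example.com/user/bob/profile" and "example.com/user/carol/profile" with a threshold of 3, the third token would be masked in all three. A low threshold will also mask ordinary pages such as "example.com/about" and "example.com/contact" if enough of them share a shape, so the threshold needs tuning against your own data.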

    Good luck!