google-custom-search google-api-python-client google-apis-explorer

How to get only HTML webpages from Google Custom Search API


I'm using the Google CSE JSON API to collect webpages that I'll scrape later. The problem is that the results sometimes include PDFs, DOCX files and other documents published on the web, which I don't want Google to return.

I know the API has a fileType parameter that filters the results, but it doesn't work for me because I want the opposite: to exclude certain file types, not to restrict the results to them.

  1. I tried setting fileType to 'html', but that didn't work either: it narrows the results to URLs ending in .html (from results like example.com/foo to only example.net/bar.html), so any page served by PHP or ASP, for example, no longer matches.
  2. I also tried setting fileType to 'text/html', but it had no effect.

One way to filter would be to check the Content-Type header (text/html) in the response to an HTTP GET request for each result, but of course it would be better if Google did this for me.
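
For reference, the client-side fallback I have in mind would look roughly like the sketch below. It uses only the standard library; the function names is_html and fetch_content_type are mine (not part of any Google API), and it sends a HEAD request instead of a full GET to avoid downloading the body, on the assumption that the server answers HEAD with the same headers.

```python
import urllib.request


def is_html(content_type):
    """Return True if a Content-Type header value denotes an HTML page."""
    if not content_type:
        return False
    # Drop parameters such as "; charset=utf-8" before comparing.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type in ("text/html", "application/xhtml+xml")


def fetch_content_type(url, timeout=10):
    """Ask the server for headers only and return its Content-Type value."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.headers.get("Content-Type")
```

Each result URL would then be kept only if is_html(fetch_content_type(url)) is true, at the cost of one extra request per result.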

Thank you in advance.


Solution

  • Well, I found an easy way to do this. Put the filter in the query itself (the q parameter of the API call) using the filetype:foo operator. Prefixing it with a minus sign, as in -filetype:pdf, excludes that file type from the results:

    service.cse().list(cx=const.SEARCH_ENGINE_KEY, q='"user manual" -filetype:pdf').execute()
    

    You can add as many filetype filters as you need to get better results.

    Now I feel like this was a silly question. Anyway, I hope this helps someone in the future.
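
To exclude several file types at once, the query string can be assembled with a small helper. This is only a sketch: build_query and the list of extensions are made up for illustration, and it assumes that separate -filetype: terms can simply be chained in q, as described above.

```python
def build_query(terms, excluded_filetypes):
    """Append one -filetype:ext operator per extension to the search terms."""
    exclusions = " ".join(
        "-filetype:" + ext.lstrip(".").lower() for ext in excluded_filetypes
    )
    return (terms + " " + exclusions).strip()


# Hypothetical list of extensions to keep out of the results.
query = build_query('"user manual"', ["pdf", "docx", "doc", "xls"])
```

The resulting string is passed as q in the same service.cse().list(...) call shown above.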