I'm using the Google CSE JSON API to obtain some webpages that I'll scrape later. The problem is that I sometimes get PDFs, DOCX files and other documents published on the web, which I don't want Google to return.
I know that this API has a parameter named fileType that filters the results, but it doesn't work for me because I want the opposite (exclude those file types, not everything else).
I tried setting fileType to tell Google that I want 'html', but that didn't work either (it goes from results like example.com/foo to only example.net/bar.html). With this filter, for example, any webpage written in PHP or ASP won't match the criteria. I also tried 'text/html' as the fileType value, but it didn't do anything.
One way of filtering would be the Content-Type header included in the response to any HTTP GET request (text/html), but of course it would be better if Google did this for me.
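For reference, the Content-Type fallback mentioned above could look roughly like the sketch below. It uses only the Python standard library; the function names is_html_content_type and url_is_html are hypothetical, and treating application/xhtml+xml as HTML is an assumption, not something the API requires.

```python
from urllib.request import Request, urlopen


def is_html_content_type(header_value):
    """Return True if a Content-Type header value denotes an HTML document."""
    if not header_value:
        return False
    # Drop parameters such as "; charset=utf-8" and compare the media type only.
    media_type = header_value.split(";", 1)[0].strip().lower()
    # Assumption: XHTML pages are acceptable for scraping as well.
    return media_type in ("text/html", "application/xhtml+xml")


def url_is_html(url, timeout=10):
    """Send a HEAD request and inspect the Content-Type of the response."""
    request = Request(url, method="HEAD")
    with urlopen(request, timeout=timeout) as response:
        return is_html_content_type(response.headers.get("Content-Type"))
```

Some servers reject or mishandle HEAD requests, so a real scraper would probably need to fall back to GET in that case.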
Thank you in advance.
Well, I found out how to do this easily. Just add the filter to the q query parameter of the Google API call using filetype:foo (prefix it with a minus sign, as in -filetype:foo, to exclude that type). This way you can restrict the search to only the results you want:
service.cse().list(cx=const.SEARCH_ENGINE_KEY, q='"user manual" -filetype:pdf').execute()
You can add as many filetype filters as you need to get better results.
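If several types have to be excluded, the query string can be built programmatically. This is only a minimal sketch: exclude_filetypes is a hypothetical helper, and the list of extensions is just an example.

```python
def exclude_filetypes(query, extensions):
    """Append one -filetype:ext operator per extension to a search query."""
    exclusions = " ".join(
        "-filetype:" + ext.lstrip(".").lower() for ext in extensions
    )
    # strip() keeps the query unchanged when no extensions are given.
    return (query + " " + exclusions).strip()


q = exclude_filetypes('"user manual"', ["pdf", "docx", "ppt"])
print(q)  # "user manual" -filetype:pdf -filetype:docx -filetype:ppt
```

The resulting string can then be passed as the q argument of the same list() call shown above.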
Now I feel like this was a silly question. Anyway, I hope this helps someone in the future.