automation meta-tags crawler4j

How to discover all HTML meta tags using edu.uci.ics.crawler4j.crawler.WebCrawler


I am completing a research project to catalogue all HTML meta tags used to describe scientific and academic journals, e.g. Dublin Core, Open Graph, PRISM, citation, biblio, etc.

I am using edu.uci.ics.crawler4j.crawler.WebCrawler and have it working for a small number of seed URLs.
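For context, the per-page cataloguing step might look roughly like the sketch below. It is not the project's actual code. The class name `MetaTagCatalogue` and the method `countMetaKeys` are hypothetical, and the regex-based parsing is a simplification (it assumes quoted attribute values and no `>` inside a tag). In crawler4j, the HTML string would typically come from `HtmlParseData.getHtml()` inside `WebCrawler.visit(Page)`.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: counts the name/property keys of <meta> tags in one page.
public class MetaTagCatalogue {

    // Matches one <meta ...> tag. Assumes no '>' occurs inside attribute values.
    private static final Pattern META_TAG =
            Pattern.compile("<meta\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    // Matches name="..." or property="..." with single or double quotes.
    // The lookbehind avoids matching attributes such as data-name.
    private static final Pattern KEY_ATTR = Pattern.compile(
            "(?<![\\w-])(?:name|property)\\s*=\\s*([\"'])(.*?)\\1",
            Pattern.CASE_INSENSITIVE);

    // Returns each meta key (e.g. "citation_title", "og:title") with its count.
    // Keys keep their original case, so "DC.title" and "dc.title" stay distinct.
    public static Map<String, Integer> countMetaKeys(String html) {
        Map<String, Integer> counts = new TreeMap<>();
        Matcher tag = META_TAG.matcher(html);
        while (tag.find()) {
            Matcher key = KEY_ATTR.matcher(tag.group());
            if (key.find()) {
                counts.merge(key.group(2).trim(), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Illustrative input only; real pages would come from the crawler.
        String html = "<head><meta name=\"DC.title\" content=\"Example\">"
                + "<meta property=\"og:type\" content=\"article\"></head>";
        System.out.println(countMetaKeys(html));
    }
}
```

Merging the per-page maps across all visited pages would give the overall catalogue. A real HTML parser would be more robust than regular expressions for malformed markup.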

My issue is that I need a larger list of seed URLs.

What options do I have?

Do I have to search the web manually for journal websites, or can I use something similar to crawler4j to discover the seed sites?


Solution

  • Generating good seeds is a general problem in web crawling, especially for domain-specific tasks (such as crawling only academic journals). In general, there are several options:

    An option would be: