I am completing a research project to catalogue all HTML meta tags used to describe scientific and academic journals, e.g. Dublin Core, Open Graph, PRISM, citation, biblio, etc.
I am using edu.uci.ics.crawler4j.crawler.WebCrawler
and have it working for a small number of seed URLs.
My issue is that I need a larger list of seed URLs.
What options do I have?
Do I have to manually search the web for journal websites, or can I use something similar to crawler4j
to discover the seed sites?
Generating good seeds is a general problem in the field of web crawling,
especially for domain-specific tasks (such as looking only at academic journals). In general, there are several options:
Use an open web directory (e.g. dmoz, ...) or a journal list (e.g. the Reuters list) to harvest pre-categorized seed points for well-known journals.
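To illustrate the directory-harvesting idea, here is a minimal sketch that pulls seed URLs out of a directory page's HTML with a stdlib regex. The HTML snippet and the `example.org` hosts are made-up stand-ins; in practice you would first download a real category page and feed its markup to `extractLinks()` (or use a proper HTML parser such as jsoup instead of a regex).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SeedHarvester {

    // Collect all absolute http(s) href targets found in the given HTML.
    static List<String> extractLinks(String html) {
        List<String> seeds = new ArrayList<>();
        Matcher m = Pattern.compile("href=\"(https?://[^\"]+)\"").matcher(html);
        while (m.find()) {
            seeds.add(m.group(1));
        }
        return seeds;
    }

    public static void main(String[] args) {
        // Stand-in for a fetched directory listing page.
        String directoryHtml =
            "<li><a href=\"https://journal-one.example.org\">Journal One</a></li>"
          + "<li><a href=\"https://journal-two.example.org\">Journal Two</a></li>";
        for (String seed : extractLinks(directoryHtml)) {
            System.out.println(seed);
        }
    }
}
```

The resulting list can be passed directly to crawler4j's `controller.addSeed(...)` calls.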
In theory, the big search engines have harvested quite a large portion of the WWW. You can try to perform semi-automated searches for pre-defined queries and process the hits. However, this may require more advanced web-crawling techniques (e.g. focused crawling).
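A focused crawler restricts the frontier to topically relevant links. The sketch below is a deliberately simplified version of that idea: a keyword check over the URL and anchor text. In crawler4j, a check like this would live inside your `WebCrawler` subclass's `shouldVisit(...)` override; the keyword list here is an illustrative assumption, not a recommended vocabulary.

```java
import java.util.List;
import java.util.Locale;

public class FocusedFilter {

    // Illustrative topic vocabulary for academic-journal pages.
    static final List<String> TOPIC_KEYWORDS =
        List.of("journal", "article", "volume", "issue", "doi");

    // Return true if the URL or anchor text mentions any topic keyword.
    static boolean looksRelevant(String url, String anchorText) {
        String haystack = (url + " " + anchorText).toLowerCase(Locale.ROOT);
        return TOPIC_KEYWORDS.stream().anyMatch(haystack::contains);
    }

    public static void main(String[] args) {
        System.out.println(looksRelevant("https://example.org/journal/toc", ""));
        System.out.println(looksRelevant("https://example.org/shop", "buy now"));
    }
}
```

Real focused crawlers usually score pages with a trained classifier rather than a fixed keyword list, but the hook point in the crawler is the same.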
A concrete option would be to use crawler4j
to collect the journal names from the Reuters list for the fields you would like to investigate. The names are contained in h4
tags, which can be easily extracted.
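As a sketch of that extraction step, the following pulls the text content of every `h4` tag from an HTML string using a stdlib regex. The sample markup is a made-up stand-in for a fetched listing page; assuming the names really sit in plain `h4` elements, this is all that is needed (for messier markup, an HTML parser such as jsoup would be more robust).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class JournalNameExtractor {

    // Return the trimmed inner text of every <h4> element in the HTML.
    static List<String> extractH4(String html) {
        List<String> names = new ArrayList<>();
        Matcher m = Pattern.compile("<h4[^>]*>(.*?)</h4>", Pattern.DOTALL).matcher(html);
        while (m.find()) {
            names.add(m.group(1).trim());
        }
        return names;
    }

    public static void main(String[] args) {
        // Stand-in for a downloaded journal-list page.
        String html = "<h4>Journal A</h4><div>...</div><h4>Journal B</h4>";
        for (String name : extractH4(html)) {
            System.out.println(name);
        }
    }
}
```

Each extracted name can then be turned into a search query or a seed lookup for the main meta-tag crawl.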