I am completing a research project to catalogue all HTML meta tags used to describe scientific and academic journals, e.g. Dublin Core, Open Graph, PRISM, citation, biblio, etc.
I am using edu.uci.ics.crawler4j.crawler.WebCrawler
and have it working for a small number of seed URLs.
My issue is that I need a larger list of seed URLs.
What options do I have?
Do I have to manually search the web for journal websites, or can I use something similar to crawler4j
to discover the seed sites?
Generating good seeds is a general problem in the field of web crawling,
especially for domain-specific tasks (such as looking only at academic journals). In general, there are several options:
Use an open web directory (e.g. dmoz, ...) or a journal list (e.g. the Reuters list) to harvest pre-categorized seed points for well-known journals.
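To illustrate the directory-harvesting idea, here is a minimal sketch that pulls seed URLs out of a directory page's HTML with a stdlib regex. The HTML snippet and the `example.org` hosts are made-up stand-ins; in practice you would first download a real category page and feed its markup to `extractLinks()` (or use a proper HTML parser such as jsoup instead of a regex).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SeedHarvester {

    // Collect all absolute http(s) href targets found in the given HTML.
    static List<String> extractLinks(String html) {
        List<String> seeds = new ArrayList<>();
        Matcher m = Pattern.compile("href=\"(https?://[^\"]+)\"").matcher(html);
        while (m.find()) {
            seeds.add(m.group(1));
        }
        return seeds;
    }

    public static void main(String[] args) {
        // Stand-in for a fetched directory listing page.
        String directoryHtml =
            "<li><a href=\"https://journal-one.example.org\">Journal One</a></li>"
          + "<li><a href=\"https://journal-two.example.org\">Journal Two</a></li>";
        for (String seed : extractLinks(directoryHtml)) {
            System.out.println(seed);
        }
    }
}
```

The resulting list can be passed directly to crawler4j's `controller.addSeed(...)` calls.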
In theory, the big search engines have harvested quite a large portion of the WWW. You can try to perform semi-automated searches for pre-defined queries and process the hits. However, this may require more advanced web-crawling techniques (e.g. focused crawling).
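A focused crawler restricts the frontier to topically relevant links. The sketch below is a deliberately simplified version of that idea: a keyword check over the URL and anchor text. In crawler4j, a check like this would live inside your `WebCrawler` subclass's `shouldVisit(...)` override; the keyword list here is an illustrative assumption, not a recommended vocabulary.

```java
import java.util.List;
import java.util.Locale;

public class FocusedFilter {

    // Illustrative topic vocabulary for academic-journal pages.
    static final List<String> TOPIC_KEYWORDS =
        List.of("journal", "article", "volume", "issue", "doi");

    // Return true if the URL or anchor text mentions any topic keyword.
    static boolean looksRelevant(String url, String anchorText) {
        String haystack = (url + " " + anchorText).toLowerCase(Locale.ROOT);
        return TOPIC_KEYWORDS.stream().anyMatch(haystack::contains);
    }

    public static void main(String[] args) {
        System.out.println(looksRelevant("https://example.org/journal/toc", ""));
        System.out.println(looksRelevant("https://example.org/shop", "buy now"));
    }
}
```

Real focused crawlers usually score pages with a trained classifier rather than a fixed keyword list, but the hook point in the crawler is the same.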
A concrete option would be to use crawler4j
to collect the journal names from the Reuters list for the fields you would like to investigate. The names are contained in h4
tags, which can be easily extracted.
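As a sketch of that extraction step, the following pulls the text content of every `h4` tag from an HTML string using a stdlib regex. The sample markup is a made-up stand-in for a fetched listing page; assuming the names really sit in plain `h4` elements, this is all that is needed (for messier markup, an HTML parser such as jsoup would be more robust).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class JournalNameExtractor {

    // Return the trimmed inner text of every <h4> element in the HTML.
    static List<String> extractH4(String html) {
        List<String> names = new ArrayList<>();
        Matcher m = Pattern.compile("<h4[^>]*>(.*?)</h4>", Pattern.DOTALL).matcher(html);
        while (m.find()) {
            names.add(m.group(1).trim());
        }
        return names;
    }

    public static void main(String[] args) {
        // Stand-in for a downloaded journal-list page.
        String html = "<h4>Journal A</h4><div>...</div><h4>Journal B</h4>";
        for (String name : extractH4(html)) {
            System.out.println(name);
        }
    }
}
```

Each extracted name can then be turned into a search query or a seed lookup for the main meta-tag crawl.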