html r web-scraping gis rjsonio

Extracting population data from website; wiki town webpages


G'day Everyone,

I am looking for a raster layer of human population/habitation in Australia. I have tried finding free datasets online but couldn't find anything in a useful format. I thought it might be interesting to scrape population data from Wikipedia and make my own raster layer. To this end I have tried getting the info from Wikipedia, but not knowing anything about HTML has not helped.

The idea is to supply a list of all the towns in Australia that have wiki pages and extract the appropriate data into a data.frame.

I can get the webpage source data into R, but am stuck on how to extract the particular data that I want. The code below shows where I am stuck; any help or hints in the right direction would be really appreciated.

I thought I might be able to use readHTMLTable() because, in the normal webpage, the info I want is off to the right in a nice table. But when I use this function I get an error (below). Is there any way I can specify this table when I am getting the source info?

Sorry if this question doesn't make much sense, I don't have any idea what I am doing when it comes to searching HTML files.

Thanks for your help, it is greatly appreciated!

Cheers, Adam

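    require(XML)      # provides htmlParse() and readHTMLTable()
    require(RJSONIO)
    loc.names <- data.frame(town = c('Sale', 'Bendigo'), state = c('Victoria', 'Victoria'))
    # Build one URL per town, e.g. http://en.wikipedia.org/wiki/Sale,_Victoria
    u <- paste('http://en.wikipedia.org/wiki/',
         sep = '', loc.names[,1], ',_', loc.names[,2])
    # Parse each page; res is a list with one parsed document per URL
    res <- lapply(u, function(x) htmlParse(x))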

Error when I use readHTMLTable:

    tabs <- readHTMLTable(res[1])
    Error in (function (classes, fdef, mtable)  : 
    unable to find an inherited method for function ‘readHTMLTable’ for signature ‘"list"’

For instance, some of the data I need looks like this in the HTML source. My question is: how do I target these locations in the parsed HTML I have?

/ <span class="geo">-38.100; 147.067

title="Victoria (Australia)">Victoria</a>. It has a population (2011) of 13,186

Solution

  • res is a list, so you need to use res[[1]] rather than res[1] to access its elements. Calling readHTMLTable on one of these elements will give you all of the tables on that page. The geo info is contained in a table with class = "infobox vcard"; you can extract just these tables separately and then pass them to readHTMLTable:

    require(XML)
    lapply(sapply(res, getNodeSet, path = '//*[@class="infobox vcard"]')
           , readHTMLTable)
    

    If you are not familiar with XPath, the selectr package allows you to use CSS selectors, which may be easier.

    require(selectr)
    > querySelectorAll(res[[1]], "table span .geo")
    [[1]]
    <span class="geo">-38.100; 147.067</span> 
    
    [[2]]
    <span class="geo">-38.100; 147.067</span>
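
To tie the pieces above together, here is a minimal sketch that goes from a parsed page to a one-row data.frame. It is only an illustration under some assumptions: the HTML string is a hypothetical, cut-down stand-in for a real town page (so it runs without a network connection), the "Population" row and its "13,186 (2011)" layout are assumed rather than taken from the live page, and extract_town is a made-up helper name, not something from the XML package.

```r
library(XML)

# Hypothetical, cut-down stand-in for a town page so the sketch runs offline.
page <- '<html><body>
<table class="infobox vcard">
<tr><th>Coordinates</th><td><span class="geo">-38.100; 147.067</span></td></tr>
<tr><th>Population</th><td>13,186 (2011)</td></tr>
</table>
<table class="wikitable"><tr><td>unrelated</td></tr></table>
</body></html>'

# A list of parsed documents, like the res built with lapply() in the question.
res <- list(htmlParse(page, asText = TRUE))

class(res[1])     # still a list, which readHTMLTable() has no method for
doc <- res[[1]]   # the parsed document itself

# Convert only the infobox node to a data.frame of label/value rows.
infobox <- getNodeSet(doc, '//table[@class="infobox vcard"]')[[1]]
tab <- readHTMLTable(infobox, header = FALSE, stringsAsFactors = FALSE)

# Hypothetical helper: pull coordinates and population out of one document.
extract_town <- function(doc, town) {
  box <- '//table[@class="infobox vcard"]'
  # First geo span inside the infobox holds "lat; lon".
  geo <- xpathSApply(doc, paste0(box, '//span[@class="geo"]'), xmlValue)[1]
  latlon <- as.numeric(strsplit(geo, ";")[[1]])
  # Value cell of the row whose header cell reads "Population" (assumed layout).
  pop <- xpathSApply(doc, paste0(box, '//tr[th="Population"]/td'), xmlValue)[1]
  # Drop the "(year)" part, then strip everything that is not a digit.
  pop <- as.numeric(gsub("[^0-9]", "", sub("\\(.*", "", pop)))
  data.frame(town = town, lat = latlon[1], lon = latlon[2], population = pop)
}

# One row per document, stacked into a single data.frame.
towns <- do.call(rbind, Map(extract_town, res, c("Sale")))
```

With the full res list from the question you would pass loc.names$town as the second argument to Map. Real infoboxes differ from town to town, so the XPath for the population row will probably need adjusting once you look at a few actual pages.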