I've been trying for several day now (unsuccessfully) to scrape cities from about 500 Facebook URLs. However, Facebook handles its data in a very strange way and I can't figure out what's going on under the hood to understand what I need to do.
Essentially the problem is that Facebook displays very different amounts of data depending on who is logged in, and what the privacy settings of the account are. For instance, try opening the following three links, both in a browser where you are logged into Facebook, and one where you are not:
[REDACTED LINKS DUE TO PRIVACY CONCERNS]
As you can see, Facebook loads the data in both cases for the first link, but only gets data for the second link if you are logged in (to ANY account). The third link displays city when you are logged in, but only displays other information when you are not.
The reason this is extremely problematic (and related to Python) is that when trying to scrape the page with Beautiful Soup or Mechanize, I cannot figure out how to get the program to "pretend" that I am logged into an account. This means that I can easily grab data off the first type of link (of which there are less than 10), but I cannot get city off the second or third type. So far I've tried a number of solutions with little success.
Here's some sample code that works correctly for the first type, but not for other types:
import mechanize
import re
import csv
user_info = []
fb_url = 'http://www.facebook.com/100004210542493'
br = mechanize.Browser()
br.set_handle_robots(False)
br.open(fb_url)
all_html = br.response().get_data()
print all_html
city = re.search('fsl fwb fcb">(.+?)</a></div><div class="aboutSubtitle fsm fwn fcg', all_html).group(1)
user_info = [fb_url, city]
print user_info
I also have a version that uses Beautiful Soup. If anyone has any ideas on how to get around this, I would be extremely grateful. Thank you!
The right way to do this is to use the facebook API. For various business, security, and privacy reasons they go out of their way to make scraping data tricky.
If you insist on scraping I would try to log in first using mechanize to submit the form. I've never tried to do this with facebook, but alot of websites have easier to parse versions intended for mobile users at m.site.com.