web-scraping rss google-groups

Google Groups RSS feed has truncated descriptions


I'm trying to analyze the sentiment of a Google Groups forum that I run. I've found two ways to get the forum content:

1. Web scraping Google Groups with Selenium. This is unreliable because Google changes the class names often.
2. Using the RSS feed.

The second method seemed like the better option, but the descriptions in the RSS feed are truncated. Is there a way to get the complete descriptions without truncation? Or is there another way to get the content of a public Google Group?


Solution

  • For anyone facing a similar problem scraping Google Groups content: I came across a Python package called gg_scraper (version 0.10.0), written by Matěj Cepl, that downloads the group's content into MBOX files. I then converted these MBOX files into JSON files for my own use.
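
For the MBOX-to-JSON step, something like the following may work. It is only a minimal sketch using Python's standard `mailbox`, `email`, and `json` modules, not the exact conversion I ran and not part of gg_scraper. The function names and the JSON field names (`from`, `date`, `subject`, `message_id`, `body`) are my own choices for illustration, and it assumes each message carries its text either as the whole payload or in `text/plain` parts.

```python
import json
import mailbox
from email.header import decode_header, make_header


def _decode(value):
    """Turn a possibly RFC 2047 encoded header into plain text."""
    if value is None:
        return ""
    try:
        return str(make_header(decode_header(str(value))))
    except Exception:
        # Fall back to the raw header if it cannot be decoded.
        return str(value)


def _text_body(message):
    """Collect the message text: text/plain parts for multipart mail,
    otherwise the single payload as-is."""
    if message.is_multipart():
        parts = [p for p in message.walk()
                 if p.get_content_type() == "text/plain"]
    else:
        parts = [message]

    chunks = []
    for part in parts:
        payload = part.get_payload(decode=True)  # bytes, or None
        if payload is None:
            continue
        charset = part.get_content_charset() or "utf-8"
        try:
            chunks.append(payload.decode(charset, errors="replace"))
        except LookupError:
            # Unknown charset name in the message; assume UTF-8.
            chunks.append(payload.decode("utf-8", errors="replace"))
    return "\n".join(chunks)


def mbox_to_records(mbox_path):
    """Read an MBOX file and return one dict per message."""
    box = mailbox.mbox(mbox_path)
    try:
        return [
            {
                "from": _decode(message["From"]),
                "date": _decode(message["Date"]),
                "subject": _decode(message["Subject"]),
                "message_id": _decode(message["Message-ID"]),
                "body": _text_body(message),
            }
            for message in box
        ]
    finally:
        box.close()


def mbox_to_json(mbox_path, json_path):
    """Write all messages of an MBOX file to a JSON file."""
    records = mbox_to_records(mbox_path)
    with open(json_path, "w", encoding="utf-8") as handle:
        json.dump(records, handle, ensure_ascii=False, indent=2)
    return len(records)
```

Calling `mbox_to_json("group.mbox", "group.json")` (both file names are placeholders) should write a list with one object per post, and the `body` field can then be fed to whatever sentiment tool you use.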