python-3.x scrapy

Scrapy Crawl Table with Rowspan


I am trying to get data from a table on a website using Scrapy. The first column in the table has a few cells with a "ROWSPAN" of various lengths. When I run my crawl code, the table output gets messed up because the code does not recognize the data in column 1 for all the rows covered by the original "ROWSPAN" attribute.

I have created an HTML file that mimics the table I'm trying to crawl and shows what I am trying to accomplish. As you can see in the HTML output, the "Family" cell spans all rows containing members of the same family. But the Scrapy output only shows the correct Family value on the first entry for each family.

html output:

code output:

Scrapy Code:

import scrapy

class FamilyTableSpider(scrapy.Spider):
    name = 'familytable'
    allowed_domains = ['127.0.0.1'] #domain changed to protect the innocent
    start_urls = ['127.0.0.1'] #url changed to protect the innocent

    def start_requests(self):
        urls = [
            '127.0.0.1', #url changed to protect the innocent
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)
 
    def parse(self, response):
        for row in response.xpath('//*[@class="table table-striped table-bordered"]//tbody/tr'):
            yield {
                'Family' : row.xpath('td[1]//text()').extract_first(),
                'Name': row.xpath('td[2]//text()').extract_first(),
                'Relationship' : row.xpath('td[3]//text()').extract(),
                'Age' : row.xpath('td[3]//text()').extract_first(),
            }

"""run code with: scrapy crawl familytable -O familytable.json"""


html code:

<html>

<body>

<table width="50%" border="0" cellspacing="10" cellpadding="0" class="table table-striped table-bordered">
    <thead>
    <tr>
        <th align="center" valign="top" width="25%">Family</th><th align="left" valign="top" width="25%">Name</th>
        <th align="left" valign="top" width="25%">Relationship</th>
    <th align="left" valign="top" width="25%">Age</th>
    </tr>
    </thead>
    <tbody>
    <tr class="linebottom">
    <td style="background-color:#F9F9F9;" align="center" valign="top" rowspan="5">Smith</td>
    <td align="left" valign="top">Thomas</a></td>
    <td align="left" valign="top">Father<br>Husband<br></td><td align="left" valign="top">58</td>
    </tr>
    <tr class="linebottom"><td align="left" valign="top">Mary</a></td>
    <td align="left" valign="top">Mother<br>Wife<br></td><td align="left" valign="top">57</td>
    </tr>
    <tr class="linebottom">
    <td align="left" valign="top">Joe</a></td>
    <td align="left" valign="top">Son<br></td><td align="left" valign="top">18</td>
    </tr>
    <tr class="linebottom">
    <td align="left" valign="top">Sue</a></td>
    <td align="left" valign="top">Daughter<br></td><td align="left" valign="top">16</td>
    </tr>
    <tr class="linebottom">
    <td align="left" valign="top">Tommy</a></td>
    <td align="left" valign="top">Son<br></td><td align="left" valign="top">13</td>
    </tr>
    <tr class="linebottom">
    <td style="background-color:#F9F9F9;" align="center" valign="top" rowspan="4">Jones</td>
    <td align="left" valign="top">James</a></td>
    <td align="left" valign="top">Father<br>Husband<br></td><td align="left" valign="top">42</td>
    </tr>
    <tr class="linebottom"><td align="left" valign="top">Linda</a></td>
    <td align="left" valign="top">Mother<br>Wife<br></td><td align="left" valign="top">42</td>
    </tr>
    <tr class="linebottom"><td align="left" valign="top">Anthony</a></td>
    <td align="left" valign="top">Son</td><td align="left" valign="top">14</td>
    </tr>
    <tr class="linebottom"><td align="left" valign="top">Jeff</a></td>
    <td align="left" valign="top">Son</td><td align="left" valign="top">11</td>
    </tr>
    <tr class="linebottom">
    <td style="background-color:#F9F9F9;" align="center" valign="top" rowspan="2">Johnson</td>
    <td align="left" valign="top">Stephen</a></td>
    <td align="left" valign="top">Husband</td><td align="left" valign="top">29</td>
    </tr>
    <tr class="linebottom">
    <td align="left" valign="top">Samantha</a></td>
    <td align="left" valign="top">Wife</td><td align="left" valign="top">28</td>
    </tr>

</tbody>
</table>

</body>
</html>

Solution

  • One option is to use XPath selectors to collect every tr from the row that contains the <td style="background-color:#F9F9F9;" align="center" valign="top" rowspan="..."> cell up to the next such row, including that first row itself, and scrape whatever you want from each group.

    Another solution is to use the rowspan attribute. It tells you how many rows there are for each family (see the example).

    import scrapy
    
    
    class FamilyTableSpider(scrapy.Spider):
        name = 'familytable'
    
        def start_requests(self):
            for url in self.start_urls:
                yield scrapy.Request(url=url, callback=self.parse)
    
        def parse(self, response):
            # All body rows; only the first row of each family holds the rowspan cell.
            rows = response.xpath('//*[@class="table table-striped table-bordered"]//tbody/tr')

            # One entry per family: the number of rows its "Family" cell spans.
            # A cell without a rowspan attribute covers exactly one row, hence default='1'.
            rowspans = [
                int(cell.xpath('./@rowspan').get(default='1'))
                for cell in rows.xpath('./td[@align="center"]')
            ]

            counter = 0  # index of the first row of the current family
            for rowspan in rowspans:
                family = rows[counter].xpath('./td[1]//text()').get(default='')
                for i in range(rowspan):
                    index = counter + i
                    # Count cells from the right so the missing first cell does not shift the columns.
                    family_member = {
                        'Family': family,
                        'Name': rows[index].xpath('./td[last()-2]//text()').get(),
                        'Relationship': ', '.join(rows[index].xpath('./td[last()-1]//text()').getall()),
                        'Age': rows[index].xpath('./td[last()]//text()').get(),
                    }
                    yield family_member
                counter += rowspan
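
A simpler variation on the same idea is to remember the last family name seen and reuse it for the shorter rows. The sketch below works on plain lists of cell texts so it can be run without a site to crawl. The helper name `fill_family` and the `width` parameter are made up for illustration, and it assumes that only the first column ever uses rowspan, as in the sample table.

```python
def fill_family(rows, width=4):
    """Yield full-width rows, carrying the first cell forward.

    rows  -- iterable of lists of cell texts, one list per <tr>
    width -- number of cells in a complete row (assumption: 4, as in the sample table)
    """
    current = None  # last value seen in the spanning first column
    for cells in rows:
        if len(cells) == width:
            # A complete row starts a new family.
            current = cells[0]
            yield list(cells)
        else:
            # A short row is covered by the previous rowspan cell.
            yield [current, *cells]


# Cell texts as they might be extracted from the first rows of the sample table.
sample = [
    ["Smith", "Thomas", "Father, Husband", "58"],
    ["Mary", "Mother, Wife", "57"],
    ["Joe", "Son", "18"],
    ["Jones", "James", "Father, Husband", "42"],
    ["Linda", "Mother, Wife", "42"],
]

filled = list(fill_family(sample))
for row in filled:
    print(dict(zip(["Family", "Name", "Relationship", "Age"], row)))
```

In a spider, each inner list could probably be built inside parse() by looping over the tr selectors and joining the text nodes of every td, then passing the result to the helper before yielding the items. Unlike the counter-based version, this does not need to read the rowspan values at all, but it would break if another column also spanned rows.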