python, web-scraping, xpath, scrapy, web-crawler

using scrapy to parse an arbitrary number of rows (key:value pairs) in an html table


I recently started working with the Scrapy library. I am trying to scrape a web site that has slightly different tables for each kind of product it sells. Eventually, I will use the data to populate object attributes; for now, I just need to extract it to JSON format.

Here is an example table:

<table id="table_1">
<tr id="row_1">
    <td>cell_1</td>
    <td>cell_2</td>
    <td>cell_3</td>
</tr>
<tr id="row_2">
    <td>cell_4</td>
    <td>cell_5</td>
    <td>cell_6</td>
</tr>
<tr id="row_n">
    <td>cell_x</td>
    <td>cell_y</td>
    <td>cell_z</td>
</tr>
</table>

Each column represents a different item, i.e. small, medium, or large t-shirts. There would be 3 items in the table above, so the Items would look like the layout below (a short plain-Python transposition sketch follows it):

Item 1 {
    row_1:cell_1
    row_2:cell_4
    row_n:cell_x
}
Item 2 {
    row_1:cell_2
    row_2:cell_5
    row_n:cell_y
}
Item 3 {
    row_1:cell_3
    row_2:cell_6
    row_n:cell_z
}
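
In other words, each item is one column of the table read against every row. As a rough plain-Python illustration (the `rows` data below is just the example table written out by hand, nothing Scrapy-specific):

# Example rows as (row_id, [cell values]) pairs, matching the table above
rows = [
    ("row_1", ["cell_1", "cell_2", "cell_3"]),
    ("row_2", ["cell_4", "cell_5", "cell_6"]),
    ("row_n", ["cell_x", "cell_y", "cell_z"]),
]

# Transpose: build one dict per column, keyed by row id
items = [
    {row_id: cells[col] for row_id, cells in rows}
    for col in range(len(rows[0][1]))
]
# items[0] == {"row_1": "cell_1", "row_2": "cell_4", "row_n": "cell_x"}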

They are well-structured tables with no 'missing' or 'extra' cells, although the number of rows and columns is arbitrary.

The difficulty I had was in using the scrapy Item object, as this requires my Item class to define the number of Fields before scraping, instead of on a per-table basis. I have hundreds of tables I want to perform this process on.

Thanks for reading this far, any help is appreciated. :)

RESOLUTION: @warawuk Thanks for your help. I used your suggestion and ended up with a triple-nested list. Perhaps not ideal, but it is trivial to extract the values from it as I continue working with them:

{"tRows": 
    [[["row1"], ["cell1", "cell2"]]
    [["row2"], ["cell3", "cell4"]]
    [["row3"], ["cell5", "cell6"]]
    [["row4"], ["cell7", "cell8"]]] x100s of tables
}

To deal with the arbitrary number of rows, I used a regular expression to extract the ids from each row and count them. A simple loop using range(len(rowNames)), plus some string concatenation, finished the job.
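
Roughly, the approach looked like this (a minimal sketch assuming a recent Scrapy version; the spider name, URL, and exact XPath strings are placeholders rather than my actual code):

import scrapy

class TableSpider(scrapy.Spider):
    name = "tables"
    start_urls = ["http://example.com/products"]  # placeholder URL

    def parse(self, response):
        for table in response.xpath("//table"):
            # Pull the row ids ('row_1', 'row_2', ...) with a regular expression
            row_names = table.xpath(".//tr/@id").re(r"row_\w+")
            t_rows = []
            for i in range(len(row_names)):
                # Build the XPath for this row by string concatenation
                row_xpath = './/tr[@id="' + row_names[i] + '"]/td/text()'
                cells = table.xpath(row_xpath).getall()
                t_rows.append([[row_names[i]], cells])
            yield {"tRows": t_rows}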


Solution

  • You have too many questions here, imo.

    First of all, it looks like your question is not about Scrapy at all. It's about organizing your data, and about XPath.

    I think you should split your task into subtasks. The first subtask is to actually extract the data into a Python data structure, and then try to process it. From your info, I think the data will look like:

    {
        'table_1': {
            'row_1': ['cell_1', 'cell_2'],
            'row_2': ['cell_1', 'cell_2'],
            ...
        },
        'table_2': {
            'row_1': ['cell_1', 'cell_2', 'cell_3'],
            'row_2': ['cell_1', 'cell_2', 'cell_3'],
            ...
        },
    }
    

    Is this correct?
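
    If so, one rough way to build that structure with XPath might look like this (just a sketch; it assumes every table and row carries an id attribute, as in your example):

    def extract_tables(response):
        """Build {table_id: {row_id: [cell, ...], ...}, ...} from one page."""
        data = {}
        for table in response.xpath('//table[@id]'):
            rows = {}
            for row in table.xpath('.//tr[@id]'):
                row_id = row.xpath('./@id').get()
                rows[row_id] = row.xpath('./td/text()').getall()
            data[table.xpath('./@id').get()] = rows
        return data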


    UPDATE:

    The difficulty I had was in using the scrapy Item object, as this requires my Item class to define the number of Fields before scraping, instead of on a per-table basis. I have hundreds of tables I want to perform this process on.

    AFAIK, Item Fields can store any Python object. The Scrapy Item class is just a place where you store Fields; Scrapy does not treat these fields in any special way. It is up to you to take these Fields in a pipeline and interpret the data in them.

    So choose whatever storage format suits you. For example:

    from scrapy import Item, Field

    class Shirt(Item):
        available_sizes = Field() # [(size1, amount1), (size2, amount2), ...] or {size1: amount1, size2: amount2, ...} if `size` is a hashable object
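
    Whatever shape you pick, it is your pipeline's job to read it back out. A tiny sketch building on the Shirt item above (the dict contents and the pipeline name are made up for illustration):

    # A Field can hold any Python object, e.g. a dict built per table:
    shirt = Shirt()
    shirt['available_sizes'] = {'small': 3, 'medium': 0, 'large': 7}

    # ...and a pipeline simply reads the field back out and interprets it:
    class SizePipeline:
        def process_item(self, item, spider):
            for size, amount in item['available_sizes'].items():
                spider.logger.info('size=%s amount=%s', size, amount)
            return item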