
How to retrieve a table while excluding some tags from it


I am trying to scrape data from an HTML table (main_table) using CSS selectors. The problem is that when I select all rows (tr), I also get rows from inner_table, which is nested inside main_table, and I can't figure out how to exclude inner_table.

I tried these CSS selectors:

response.css('.main_table-id:not([class^="inner_table"])').extract()

and

response.css("table[id='main_table_id']:not([class*='inner_table'])").extract()

but neither of them excludes the inner table.

<table id ="main_table_id" class="main_table_class">
<tbody>
<tr block-id="123" class="main_tr_class">
<td class="td_class">
<div class="inner_table_div">
<table class="inner_table">
</table>
</div>  
</td>
</tr>
</tbody>
</table>

I would like to scrape all the data from main_table and exclude the inner table. I was told that I am applying my selector to the parent node, but I don't know how to change my CSS accordingly.


Solution

  • Use the child combinator > to select only immediate child nodes, so rows inside the nested table are not matched:

    response.css('#main_table_id > tr')
    

    or, when the rows are wrapped in a tbody, as in the markup above:

    response.css('#main_table_id > tbody > tr')