python, mysql, scrapy, mariadb, mariasql

Python and Scrapy: attempting to get scraped data into a MariaDB/MySQL database


I know the code below is unfinished, but all I am trying to do is get those yield results into my MariaDB database. I have spent far too many hours combing Stack Overflow looking for the answer. The code works fine, and I can manually insert static values into the database with the final lines, but I have tried everything I can think of inside the for loop.

I just need these final lines of code figured out, and I am confident I can then move on to finally scraping data.

import scrapy
import mysql.connector
from scrapy.selector import Selector

mydb = mysql.connector.connect(
  host="localhost",
  database="database",  
  user="root",
  password="password"
)

mycursor = mydb.cursor()
sql = "INSERT INTO testTable (name) VALUES (%s)"

class scrapysclass(scrapy.Spider):
    name = "scrapy-name"

    start_urls = ['url']

    def parse(self, response):
        quotes = str(response.xpath('//comment()').extract())
        quotes = quotes.replace('<!--','').replace('-->','')
        sel = Selector(text=quotes)
        for row in sel.xpath('//table[@id="tableid"]//tbody/tr'):
            yield {
                'first' : row.xpath('td[1]//text()').extract_first(),
                'last': row.xpath('td[2]//text()').extract_first(),
                'handle' : row.xpath('td[3]//text()').extract_first(),
            }

Solution

  • Since you are establishing the DB connection alongside the spider, you could have the cursor execute the insert query instead of yielding the item.

    sql = "INSERT INTO testTable (name, last_name, handle) VALUES (%s, %s, %s)"
    
    ...
    
    for row in sel.xpath('//table[@id="tableid"]//tbody/tr'):
        first = row.xpath('td[1]//text()').extract_first()
        last = row.xpath('td[2]//text()').extract_first()
        handle = row.xpath('td[3]//text()').extract_first()
        
        data = (first, last, handle)
        mycursor.execute(sql, data)
    
    # Commit once after the loop so the inserts persist
    # (mysql-connector does not autocommit by default).
    mydb.commit()
    

    Notice that I've changed your SQL statement, as it wasn't clear to me which data should go where.

    I must point out, though, that this is not the best solution.

    Ideally, your spider should be in charge of only scraping the data, and you should write an Item Pipeline to insert the data into the database.

    When you write a pipeline, every time the spider yields a scraped item, the pipeline's process_item method is called to handle that item. Here in the docs you will find some examples of pipelines.
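
    As a rough sketch of what such a pipeline could look like, assuming mysql-connector and the table/column names from the query above (the class name, module path, and connection details are placeholders, not part of the original answer):

    import mysql.connector

    class MySQLStorePipeline:
        def open_spider(self, spider):
            # Open the connection once, when the spider starts.
            self.conn = mysql.connector.connect(
                host="localhost",
                database="database",
                user="root",
                password="password"
            )
            self.cursor = self.conn.cursor()

        def process_item(self, item, spider):
            # Called for every item the spider yields.
            sql = "INSERT INTO testTable (name, last_name, handle) VALUES (%s, %s, %s)"
            self.cursor.execute(sql, (item['first'], item['last'], item['handle']))
            self.conn.commit()
            return item  # pass the item on to any later pipelines

        def close_spider(self, spider):
            # Clean up when the spider finishes.
            self.cursor.close()
            self.conn.close()

    To enable it, register the pipeline in your project's settings.py (the 'myproject' module path here is just an example):

    ITEM_PIPELINES = {'myproject.pipelines.MySQLStorePipeline': 300}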