sqlpython-2.7sqlalchemyscraperwiki

sqlalchemy.exc.StatementError: invalid literal for int() with base 10 in scraper


I've written a Python 2.7 scraper, but am getting an error when attempting to save my data. The scraper is written in Scraperwiki, but I think that's largely irrelevant to the error I'm getting - saving in Scraperwiki seems to be handled using Sqlalchemy, and it's this that is giving the error.

I get this error message:

Traceback (most recent call last):
  File "./code/scraper", line 192, in <module>
    saving(spreadsheet_pass)
  File "./code/scraper", line 165, in saving
    scraperwiki.sql.save(["URN"], school, "magic")
  File "/usr/local/lib/python2.7/dist-packages/scraperwiki/sql.py", line 195, in save
    connection.execute(insert.values(row))
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 729, in execute
    return meth(self, multiparams, params)
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/sql/elements.py", line 321, in _execute_on_connection
    return connection._execute_clauseelement(self, multiparams, params)
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 826, in _execute_clauseelement
    compiled_sql, distilled_params
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 893, in _execute_context
    None, None)
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 1160, in _handle_dbapi_exception
    exc_info
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/util/compat.py", line 199, in raise_from_cause
    reraise(type(exception), exception, tb=exc_tb)
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 889, in _execute_context
    context = constructor(dialect, self, conn, *args)
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/engine/default.py", line 573, in _init_compiled
    param.append(processors[key](compiled_params[key]))
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/processors.py", line 56, in boolean_to_int
    return int(value)
sqlalchemy.exc.StatementError: invalid literal for int() with base 10: 'n/a' (original cause: ValueError: invalid literal for int() with base 10: 'n/a') u'INSERT OR REPLACE INTO magic (published_recent, inspection_rating2, schooltype, "LA", "URL", "URN", schoolname, open_closed, opendate_full, inspection_rating, opendate_short, phase, publication_date, include, notes, inspection_date) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)' []

when attempting to save this line of data:

{u'published_recent': 'n/a', u'inspection_rating2': 'n/a', u'schooltype': u'Free school', u'LA': u'Tower Hamlets', u'URL': u'http://www.ofsted.gov.uk/inspection-reports/find-inspection-report/provider/ELS/138262', u'URN': u'138262', u'schoolname': u'City Gateway 14-19 Provision', u'open_closed': u'Open', u'opendate_full': u'2012-09-03', u'inspection_rating': 'No section 5 inspection yet', u'opendate_short': u'September 2012', u'phase': u'Alternative provision', u'publication_date': 'n/a', u'include': False, u'notes': 'test message', u'inspection_date': 'n/a'}

using this line of code:

scraperwiki.sql.save(["URN"], school, "magic")

(Which in Scraperwiki, saves the data in the 'school' dictionary to a database called 'magic' using the key 'URN' as the unique key.)

Weirdly, sometimes the scraper works fine and I don't get the error, but other times, running identical code, I get this error.

Things I've tried:

  1. Clearing the database I'm saving to, or starting a new database with a different name. Neither worked.
  2. Editing the data being saved. The error refers to a problem with the 'n/a' value saved against the key 'published_recent'. Previous lines of data, that save with no problem, contain data of Boolean type, so I figured the string is causing difficulties for some reason. Changing the value to an integer means I don't get this error. Right now I can't replicate this (saving seems to be working when the value is an integer), but I think I've received this error when I've tried changing the 'published_recent' value to an integer for the line of data that seems to be causing me problems: sqlalchemy.exc.IntegrityError: (IntegrityError) constraint failed

Either way, this isn't really a solution, as I need to be able to save a string.

  1. Reading all StackOverflow questions about these two errors, and the sqlalchemy documentation. I couldn't find anything that seemed to address the issue I've got.
  2. Using an Autoincrementing key for the data. I'm saving the data on key 'URN', which is unique, but I figured that the scraper might be using the 'published_recent' key as the unique key when saving for some reason, so I tried using an autoincrementing key, following this answer: ScraperWiki: How to create and add records with autoincrement key. Still get the same error.

Thanks in advance for any answers - this is driving me a bit nuts.


Solution

  • Error says the value it is trying to save as integer is 'n/a'. If you are scraping data, well you don't always get what you want. Seems like 'n/a' is what they put on the site you are scraping when there's no number for that field. You will have to do some validation on your data before saving it.