python web-scraping mechanicalsoup

Why am I getting a 400 response when trying to fill in a form using MechanicalSoup?


I am currently building a basic web scraper that gets train ticket prices from National Rail using Python and MechanicalSoup.

I am trying to fill out a form with basic train data (start and end station, plus a date and time) so that I can access the ticket prices for a specific train journey.

Here is the code I have used to fill out the form:

import mechanicalsoup

# Open the National Rail home page
browser = mechanicalsoup.StatefulBrowser()
browser.open("http://www.nationalrail.co.uk/")

# Find the correct form
trainForm = browser.select_form('form[action="http://ojp.nationalrail.co.uk/service/planjourney/plan"]')

# Basic parameters (start and end, and date and time)
browser["from.searchTerm"] = "Norwich"
browser["to.searchTerm"] = "London Liverpool Street"
browser["timeOfOutwardJourney.monthDay"] = "28/11/2018"
browser["timeOfOutwardJourney.hour"] = 13
browser["timeOfOutwardJourney.minute"] = 15
browser["_checkbox"] = "off"

# Open the filled-in form in a local browser for inspection, then submit it
browser.launch_browser()
response = browser.submit_selected()

# Print the response
print(response)

The problem I am having is that when the form is submitted, it returns <Response [400]>. Research has led me to believe that my form is incorrectly filled out. However, when browser.launch_browser() is executed and my browser opens, all the fields appear to be filled out correctly, and if I press submit myself, the form is submitted correctly and the expected page of ticket prices opens.

Does anyone know what I am doing wrong?


Solution

  • This happens only in Python 3. The problem is that requests does not strip the trailing whitespace from the redirect URL, so it ends up percent-encoded as %09 (tab characters):

    print(response.url)
    # http://www.nationalrail.co.uk/times_fares/109179.aspx%09%09%09%09
    

    You can patch it: go to line 114 (the exact line number may differ between requests versions) of

    python_dir\Lib\site-packages\requests\sessions.py
    

    and replace

    location = location.encode('latin1')
    

    with

    location = location.strip().encode('latin1')
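
If you would rather not edit files under site-packages, the same stripping can probably be done from your own code by subclassing requests.Session. This is only a minimal sketch, assuming your installed requests exposes Session.get_redirect_target (in recent versions that is the method containing the line patched above). The class name WhitespaceStrippingSession is made up for illustration.

```python
import requests


class WhitespaceStrippingSession(requests.Session):
    """Hypothetical Session subclass that trims stray whitespace
    (such as trailing tabs) from the redirect Location header."""

    def get_redirect_target(self, resp):
        # Let requests extract the Location header as usual.
        target = super().get_redirect_target(resp)
        # None means "not a redirect"; otherwise strip the whitespace
        # that would later be percent-encoded as %09.
        return target.strip() if target else target


if __name__ == "__main__":
    # Offline demonstration with a hand-built redirect response.
    fake = requests.Response()
    fake.status_code = 302
    fake.headers["location"] = (
        "http://www.nationalrail.co.uk/times_fares/109179.aspx\t\t\t\t"
    )
    print(repr(WhitespaceStrippingSession().get_redirect_target(fake)))
```

MechanicalSoup's StatefulBrowser accepts a session argument, so the browser in the question could be created with mechanicalsoup.StatefulBrowser(session=WhitespaceStrippingSession()) and the rest of the script left unchanged. I have not tested this against the National Rail site itself.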