Tags: html, python-3.x, web-scraping, beautifulsoup, mechanicalsoup

How to yield children of span tag using BeautifulSoup/MechanicalSoup - selecting drop-down field values


I am trying to complete a form submission on a webpage (http://supermag.jhuapl.edu/mag/?) using MechanicalSoup. Before the form is submitted, a date must be specified within the same form, using drop-down boxes for start day, month, year, time etc. This can be done with the MechanicalSoup set_select() function, but I cannot seem to access the relevant select tag for each field. A small disclaimer: while I have scientific programming experience, I am new to HTML and to the Python libraries mentioned above.

I am unsure which library is best suited to selecting the date. Either way, I cannot access the select tags (with name attributes such as 'start_day' and 'start_month') that are children of the corresponding span tags within the form.

I have both the mechanicalsoup.Form(form) and mechanicalsoup.StatefulBrowser(*args, **kwargs) objects (the latter giving access to the page as a bs4.BeautifulSoup object). My attempts are shown in the code further down.

A snippet of the relevant HTML is shown; note the div tags and subsequent select tags as children.

The form tag:

<form name="theForm" class="form-horizontal" onsubmit="return false;">

The relevant span and select tags within form:

<span name="start_time">
  <div>
    <select name="start_day">
      <option value="1">1</option>
      <option value="2">2</option>
      <option value="3">3</option>...
    </select>
    <select style="width: 4em;" name="start_month">
      <option value="1">January</option>
      <option...
    </select>
  </div>
</span>

Code is found below:

# Import MechanicalSoup under the alias used below
import mechanicalsoup as ms

# Opening browser and URL
url = "http://supermag.jhuapl.edu/mag/?"
browser = ms.StatefulBrowser()
browser.open(url)

# Assigning bs4.BeautifulSoup object
html = browser.get_current_page()

# Assigning relevant form
form = browser.select_form('form[name="theForm"]')

# Assign correct span tag for e.g start_time
start_time_span = html.find_all('span')[2]

# Attempt to set start day value - returns
# 'InvalidFormMethod: No select named start_day'
form.set_select({'start_day': 1})

# Attempt to find select tags with bs4 (matching on the name attribute)
html.find('select', {'name': 'start_day'})
start_time_span.find('select', {'name': 'start_day'})

# Looking at the span's contents, for example, returns an empty list
start_time_span.contents

I expected to have the select tags listed within the bs4 find() attempts, or for the mechanicalsoup set_select() to access and set the given select tag when called on the correct form.

The span tag is found in the BeautifulSoup HTML, but it does not seem to contain the child select tags that are present in the source HTML and that are needed to select the date. Calling set_select() raises an error saying that the select cannot be found.

Thank you in advance; this is my first question on StackOverflow and I hope it meets the guidelines sufficiently well!


Solution

  • To me, your code generally looks fine! When I run your Python snippet on the HTML you quote here, it does not raise an InvalidFormMethod exception. However, when I run it on the URL you provided, I do see that error, because the source HTML served for that page contains no element with the name start_day.

    I suspect this is because a specific JavaScript action is generating the HTML that includes the start_day field. This is hinted at by the form having an onsubmit attribute and no action, and by the page including a lot of JavaScript files (which may or may not be necessary to interact with the form). Depending on what exactly you want to do with this form, you probably need to use a tool that supports JavaScript, like Selenium (MechanicalSoup does not; see the MechanicalSoup FAQ).
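
A quick way to check this kind of thing offline is to search the HTML for the select by its name attribute. The sketch below is only an illustration: QUOTED_HTML is a shortened copy of the snippet from the question, STATIC_HTML is a hypothetical stand-in for what a server might send before any JavaScript has run (it is not the real page source), and find_select is a helper name made up for this example.

```python
from bs4 import BeautifulSoup

# Shortened copy of the snippet quoted in the question: the select is present.
QUOTED_HTML = """
<form name="theForm" class="form-horizontal" onsubmit="return false;">
  <span name="start_time">
    <div>
      <select name="start_day">
        <option value="1">1</option>
        <option value="2">2</option>
      </select>
    </div>
  </span>
</form>
"""

# Hypothetical stand-in for the HTML as served, before JavaScript fills the span.
STATIC_HTML = """
<form name="theForm" class="form-horizontal" onsubmit="return false;">
  <span name="start_time"></span>
</form>
"""


def find_select(html_text, name):
    """Return the <select> tag whose name attribute equals `name`, or None."""
    soup = BeautifulSoup(html_text, "html.parser")
    # Match on the name attribute of the select tag.
    return soup.find("select", {"name": name})


quoted_select = find_select(QUOTED_HTML, "start_day")
static_select = find_select(STATIC_HTML, "start_day")

print("quoted snippet has start_day:", quoted_select is not None)
print("static stand-in has start_day:", static_select is not None)
```

If the same lookup on browser.get_current_page() returns None while the select is visible in your browser's inspector, that would be consistent with the field being created by JavaScript after the page loads, in which case set_select() has nothing to act on.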