
Dungeons and Dragons Character Sheet parser via Python


The point of this project is simple, but pointers from anyone who feels they have something to add would be appreciated.

Purpose: The application should log in to an account on Myth-Weavers (https://www.myth-weavers.com/) and return the names of all Dungeons and Dragons sheets that have been created on the account.

The app should also be able to take a direct link to a sheet (https://www.myth-weavers.com/sheet.html#id=2311944). This should be possible because the link and its associated sheet can be accessed without being logged in to Myth-Weavers.
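For the direct-link case, the sheet id sits in the URL fragment (the part after `#`). Here is a minimal sketch of pulling it out with the standard library; `extract_sheet_id` is a hypothetical helper name, not something the site provides. Note that a URL fragment is never sent to the server in an HTTP request, so fetching `sheet.html` alone will not return a specific sheet. The page presumably loads the data with a script, and which endpoint it calls would need to be checked in the browser's network tab.

```python
from urllib.parse import urlparse, parse_qs


def extract_sheet_id(link):
    """Return the id from a link like .../sheet.html#id=2311944, or None."""
    # The fragment is everything after '#', here "id=2311944".
    fragment = urlparse(link).fragment
    # The fragment uses query-string syntax, so parse_qs can split it.
    values = parse_qs(fragment).get("id")
    return values[0] if values else None


sheet_id = extract_sheet_id("https://www.myth-weavers.com/sheet.html#id=2311944")
print(sheet_id)
```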

PART ONE: I need the application to enter the site and log in to my account with my credentials. When I log in to the site, the following form data is sent over the network:

vb_login_username: Testbug Jones
vb_login_password: 
s: 
securitytoken: guest
do: login
vb_login_md5password: fea5ff2cf4764d2e76ea81e68bb458d1
vb_login_md5password_utf: fea5ff2cf4764d2e76ea81e68bb458d1

I am using the following code to check my progress through the login:

import requests

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36'
}

login_data = {
    's' : '',
    'securitytoken' : 'guest',
    'vb_login_username' : 'Testbug Jones',
    'vb_login_password' : 'TeStBuG',
    'redirect' : 'index.php',
    'login' : 'Login',
    'vb_login_md5password' : 'fea5ff2cf4764d2e76ea81e68bb458d1',
    'vb_login_md5password_utf' : 'fea5ff2cf4764d2e76ea81e68bb458d1'
}


#get page
url = 'https://www.myth-weavers.com/'
source = requests.get(url)

#isolates login form, along with an sid
print('\n\n***CURRENT LOGIN STATUS***')
login_status = source.text
login_status = login_status.split("<!-- login form -->")[1]
login_status = login_status.split("<!-- / login form -->")[0]
print(login_status)

#nab sid and update the login_data dictionary
sid = login_status.split('<input type="hidden" name="s" value="')[1]
sid = sid.split('" /')[0]
login_data['s'] = sid

#create session and attempt to log in
with requests.Session() as s:
  print('\n\n***ATTEMPTING TO LOGIN***')
  r = s.post(url, data = login_data, headers = headers)
  login_status = r.text
  login_status = login_status.split("<!-- login form -->")[1]
  login_status = login_status.split("<!-- / login form -->")[0]
  print(login_status)

As for the login form itself, it normally looks like:

<li class="smallfont" id="login" style="width: auto; float: right; text-align: right; padding-right: 6px;">
        <span id="login_register"><a href="#" onclick="fetch_object('login_register').style.display = 'none'; fetch_object('login_form').style.display = ''; return false;" tabindex="0">Log In</a> / <a href="https://www.myth-weavers.com/register.php?s=f4cde1e552e96a9a2b4c4479559e6510">Register</a> <a href="//www.myth-weavers.com/login.php?do=lostpw" style="font-size:smaller">forgot password?</a></span>
        <form id="login_form" style="display: none;" action="https://www.myth-weavers.com/login.php?do=login" method="post" onsubmit="md5hash(vb_login_password, vb_login_md5password, vb_login_md5password_utf, 0)">
        <script type="text/javascript" src="//static.myth-weavers.com/clientscript/vbulletin_md5.js?v=388"></script>

        <input type="text" class="bginput" style="font-size: 10px" name="vb_login_username" id="navbar_username" size="10" accesskey="u" tabindex="0" value="User Name" onfocus="if (this.value == 'User Name') this.value = '';" onblur="if (this.value == '') this.value = 'User Name';" />

        <input type="password" class="bginput" style="font-size: 10px" name="vb_login_password" id="navbar_password" size="10" tabindex="0" value="Password" onfocus="if (this.value == 'Password') this.value = '';" onblur="if (this.value == '') this.value = 'Password';" />

    <label for="cb_cookieuser_navbar"><input type="checkbox" name="cookieuser" value="1" tabindex="0" id="cb_cookieuser_navbar" accesskey="c" />Remember Me?</label>

        <input type="submit" class="button" value="Log in" tabindex="0" title="Enter your username and password in the boxes provided to login, or click the 'register' button to create a profile for yourself." accesskey="s" />

        <input type="hidden" name="s" value="f4cde1e552e96a9a2b4c4479559e6510" />
        <input type="hidden" name="securitytoken" value="guest" />
        <input type="hidden" name="do" value="login" />
        <input type="hidden" name="vb_login_md5password" />
        <input type="hidden" name="vb_login_md5password_utf" />
        </form>
</li>
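The string splits used earlier to pull out the `s` value break as soon as the markup shifts slightly. As an alternative, here is a minimal sketch that collects every named input of the form with the standard library's `html.parser`. `FormInputCollector` is a hypothetical helper, and `sample_html` is a trimmed copy of the form shown above rather than a live response.

```python
from html.parser import HTMLParser


class FormInputCollector(HTMLParser):
    """Collect name/value pairs of <input> tags inside one form (hypothetical helper)."""

    def __init__(self, form_id):
        super().__init__()
        self.form_id = form_id
        self.in_form = False
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and attrs.get("id") == self.form_id:
            self.in_form = True
        elif tag == "input" and self.in_form and attrs.get("name"):
            # Inputs without a value attribute (the md5 fields) become "".
            self.fields[attrs["name"]] = attrs.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "form":
            self.in_form = False


# Trimmed copy of the login form from the question, for illustration only.
sample_html = """
<form id="login_form" action="https://www.myth-weavers.com/login.php?do=login" method="post">
<input type="text" name="vb_login_username" value="User Name" />
<input type="hidden" name="s" value="f4cde1e552e96a9a2b4c4479559e6510" />
<input type="hidden" name="securitytoken" value="guest" />
<input type="hidden" name="do" value="login" />
<input type="hidden" name="vb_login_md5password" />
</form>
"""

collector = FormInputCollector("login_form")
collector.feed(sample_html)
fields = collector.fields
print(fields["s"])
```

The resulting dictionary already contains `do` and `securitytoken` as the page supplied them, so only the username and password fields would need to be overwritten before posting.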

At this point I think what is stopping me is 1) syntax, as I am obviously new, 2) cookies not being handled correctly, or 3) the securitytoken/sid not being handled correctly. I'm reaching the point where I can see my errors but not the way to overcome them. Any help or insight in getting past this would be very helpful!

PART TWO: Once logged in, the application should access a page on the site, specifically the "Sheets" page, and print out a list of all character sheets found there. It should also be able to retrieve the JSON files stored in the table rows where the character names are found.


Solution

  • Make the first request with requests.Session() so that the cookies it receives are sent back when you POST to /login.php. You can also use BeautifulSoup to collect every input name/value pair in the login form, so that you only add your username and password and don't hardcode anything else.

    The password is MD5-hashed before it is sent, so you can use hashlib to compute the hash.

    The following makes the login call:

    import requests
    from bs4 import BeautifulSoup
    import hashlib
    
    url = "https://www.myth-weavers.com"
    username = "Testbug Jones"
    password = "TeStBuG"
    
    s = requests.Session()
    r = s.get(url)
    
    soup = BeautifulSoup(r.text, "html.parser")
    form = soup.find("form",{"id":"login_form"})
    payload = {
        t.get("name"): t.get("value", "")
        for t in form.find_all("input")
        if t.get("name")
    }
    
    md5 = hashlib.md5(password.encode('utf-8')).hexdigest()
    payload["vb_login_username"] = username
    payload["vb_login_password"] = password
    payload["vb_login_md5password"] = md5
    payload["vb_login_md5password_utf"] = md5
    
    r = s.post(f"{url}/login.php", 
        params= {"do": "login"},
        data = payload
    )
    

    Then you can use s.get(".....") to get the sheets data like this:

    r = s.get(f"{url}/sheets")
    soup = BeautifulSoup(r.text, "html.parser")
    rows = soup.find("table").find_all("tr")[1:]
    sheet_data = []
    for row in rows:
        tds = row.find_all("td")
        download_link = f'{url}{tds[5].find("a")["href"]}'
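        # fetch the linked file through the logged-in session
        # (named sheet_response so it does not shadow the json module name)
        sheet_response = s.get(download_link)
        sheet_data.append({
            "name": tds[1].text.strip(),
            "template": tds[2].text.strip(),
            "game": tds[3].text.strip(),
            "download_link": download_link,
            "json": sheet_response.json()
        })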
    
    print(sheet_data)
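
If the sheets request above finds no table, `soup.find("table")` returns `None` and the loop fails with an `AttributeError`, which can hide the real cause: the login did not take effect. A rough way to check first is to see whether the guest login form is still present in the response. This is a sketch under the assumption that logged-in pages no longer contain `id="login_form"`, which has not been verified against the site. `login_form_present` is a hypothetical helper, and both sample strings are made up for illustration.

```python
def login_form_present(html):
    """Heuristic: True if the guest login form is still in the page."""
    # Assumption: only logged-out pages contain the form with this id.
    return 'id="login_form"' in html


# Made-up snippets standing in for r.text before and after a successful login.
guest_page = '<form id="login_form" style="display: none;" method="post"></form>'
member_page = '<div id="usercp">Welcome back</div>'

print(login_form_present(guest_page))
print(login_form_present(member_page))
```

In the answer's code this could be called on `r.text` right after the `s.post(...)` call, before requesting the sheets page.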
    
