I am working on a personal project that uses the chessdotcom Public API package. I can already store the PGN (Portable Game Notation) of the daily puzzle in a variable; the PGN is the required input for creating a chess GIF (https://www.chess.com/gifs).
I want to use requests and an HTML parser to fill out the form on the GIF site and create a GIF from my Python script. I made a request to the GIF page, and response.text is a huge HTML string (thousands of lines), which I parse with html5lib. I currently get "html5lib.html5parser.ParseError: Unexpected character after attribute value." and I can't figure out where in this giant response the problem is. What are some tips for debugging this? Where do I even begin looking for the unexpected character?
import requests as req
import html5lib
from datetime import datetime
from chessdotcom import Client, get_player_profile, get_player_game_archives, get_player_stats, get_current_daily_puzzle, get_player_games_by_month

Client.request_config['headers']['User-Agent'] = 'PyChess Program for Automated YouTube Creation'

class ChessData:
    def __init__(self, name):
        self.player = get_player_profile(name)
        self.archives = get_player_game_archives(name)
        self.stats = get_player_stats(name)
        self.games = get_player_games_by_month(name, datetime.now().year, datetime.now().month)
        self.puzzle = get_current_daily_puzzle()
        self.html_parser = html5lib.HTMLParser(strict=True, namespaceHTMLElements=True, debug=True)

    def organize_puzzles(self, puzzles):
        #dict_keys(['title', 'url', 'publish_time', 'fen', 'pgn', 'image'])
        portableGameNotation = puzzles['pgn']
        html_data = req.get('https://www.chess.com/gifs')
        print(html_data.text)
        self.html_parser.parse(html_data.text.replace('&', '&amp;'))

    def get_puzzles(self):
        self.organize_puzzles(self.puzzle.json['puzzle'])
I initially ran into a "Name Entity Expected. Got None" error, which I temporarily bypassed by replacing every & with the &amp; entity.
Traceback (most recent call last):
  File "C:/ChessProgram/ChessTop.py", line 17, in <module>
    main()
  File "C:/ChessProgram/ChessTop.py", line 14, in main
    ChessResults.get_puzzles()
  File "C:\ChessProgram\ChessData.py", line 32, in get_puzzles
    self.organize_puzzles(self.puzzle.json['puzzle'])
  File "C:\ChessProgram\ChessData.py", line 29, in organize_puzzles
    self.html_parser.parse(html_data.text.replace('&', '&amp;'))
  File "C:\ChessProgram\lib\site-packages\html5lib\html5parser.py", line 284, in parse
    self._parse(stream, False, None, *args, **kwargs)
  File "C:\ChessProgram\lib\site-packages\html5lib\html5parser.py", line 133, in _parse
    self.mainLoop()
  File "C:\ChessProgram\lib\site-packages\html5lib\html5parser.py", line 216, in mainLoop
    self.parseError(new_token["data"], new_token.get("datavars", {}))
  File "C:\ChessProgram\lib\site-packages\html5lib\html5parser.py", line 321, in parseError
    raise ParseError(E[errorcode] % datavars)
html5lib.html5parser.ParseError: Unexpected character after attribute value.
I tried replacing & with &amp; to fix the entity name issue, and I manually searched the HTML response, checking the different attributes for anything out of place.
Normally, to debug HTML I would split it into smaller pieces and test each one. With html5lib that may be a problem, because it may need the full HTML document to parse it. You may have to write your own functions in the parser to display more information during parsing.
But if you use html5lib.HTMLParser() without parameters (or with strict=False), it runs correctly even without .replace('&', '&amp;').
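To see where html5lib objects, one option is to let the non-strict parser collect its errors instead of raising on the first one. This is a minimal sketch, assuming the parser's errors list (tuples of position, error code and data) and the html5lib.constants.E message table behave as in html5lib 1.1. The sample markup and the helper name list_parse_errors are made up for illustration:

```python
import html5lib
from html5lib.constants import E


def list_parse_errors(html_text):
    """Parse non-strictly and return (line, column, code, message) per error."""
    parser = html5lib.HTMLParser(strict=False)  # record errors, do not raise
    parser.parse(html_text)
    found = []
    for (line, column), code, datavars in parser.errors:
        # E maps an error code to the message that strict mode would raise
        message = E[code] % datavars if code in E else code
        found.append((line, column, code, message))
    return found


# Made-up sample: there is no space between the two attributes of <p>
sample = (
    '<!DOCTYPE html><html><head><title>t</title></head>'
    '<body><p class="a"id="b">x</p></body></html>'
)

for line, column, code, message in list_parse_errors(sample):
    print(f"line {line}, column {column}: {message} [{code}]")
```

On the real page, the reported line numbers can then be checked against response.text.splitlines() to look at the offending markup.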
Still, I wouldn't use html5lib for this, because I don't see any functions in it for searching elements in the HTML; you may have to write your own.
It is much simpler to do this with BeautifulSoup or lxml (or other modules).
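As an illustration of how little code the search takes with BeautifulSoup, here is a minimal offline sketch. The HTML snippet and the token value are invented; only the animated_gif__token id matches the real form. It uses the built-in html.parser backend, so nothing beyond bs4 is needed:

```python
from bs4 import BeautifulSoup

# Invented stand-in for the form on the real page
sample_html = """
<form method="post">
  <input type="text" id="animated_gif_data" name="animated_gif[data]">
  <input type="hidden" id="animated_gif__token" name="animated_gif[_token]" value="abc123">
</form>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# find() returns the first matching tag, or None when nothing matches
item = soup.find("input", {"id": "animated_gif__token"})
token = item["value"] if item is not None else None
print("token:", token)
```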
Another problem: the page uses cookies, and the form has a hidden input with a token, which the server probably compares with the cookies (before generating the image). This needs requests.Session().
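Why a Session matters can be shown without touching the network. In this sketch the cookie name and value are invented (the real page sets its own cookies); it only demonstrates that a cookie stored in the session's jar is attached to the next request prepared through that session:

```python
import requests

session = requests.Session()

# Invented cookie, standing in for whatever the server sets on the first get()
session.cookies.set("example_session_id", "12345", domain="www.chess.com", path="/")

# Build a follow-up request through the session without sending it
request = requests.Request("POST", "https://www.chess.com/gifs", data={"field": "value"})
prepared = session.prepare_request(request)

# The session has added the stored cookie to the outgoing headers
cookie_header = prepared.headers.get("Cookie")
print(cookie_header)
```

With plain requests.get() and requests.post() calls there is no shared jar, so the cookie from the first response would not accompany the form submission.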
So I do the following:
- use requests.Session() to get() the page with the form
- use BeautifulSoup to search for the hidden input with the token
- use post() to send all the data like a real form
- use text.find() to find the URL of the animated GIF (instead of BeautifulSoup)
- use get() to download the animated GIF and write it to a local file (using .content instead of .text to work with bytes instead of a string)
- use webbrowser to display the URL of the animated GIF in the default browser

Full working code:
#import requests
from requests import Session
from bs4 import BeautifulSoup
#headers = {
# 'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:98.0) Gecko/20100101 Firefox/98.0'
#}
s = Session()
#s.headers.update(headers)
url = 'https://www.chess.com/gifs'
# --- get token ---
response = s.get(url)
html = response.text
#soup = BeautifulSoup(html, 'html.parser')
soup = BeautifulSoup(html, 'html5lib')
item = soup.find('input', {'id': 'animated_gif__token'})
#print(item)
token = item['value']
print('token:', token)
# --- send form, get response and search image ---
game = "https://www.chess.com/live/game/3048628857"
payload = {
    "animated_gif[data]": game,
    "animated_gif[board_texture]": "green", # "brown",
    "animated_gif[piece_theme]": "neo",
    "animated_gif[_token]": token
}
response = s.post(url, data=payload)
html = response.text
start = html.find('https://images.chesscomfiles.com/uploads/game-gifs/')
end = html.find('"', start)
image_url = html[start:end]
print(image_url)
# --- download file ---
response = s.get(image_url)
# write using `bytes` instead of `text`
with open('animation.gif', 'wb') as f:
    f.write(response.content)
# --- show image_url in browser ---
import webbrowser
webbrowser.open(image_url)
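One fragile spot in the script above is html.find(): it returns -1 when the prefix is missing, and the slice then yields a meaningless string that is passed on to get(). A small hedged alternative follows, using a hypothetical helper extract_gif_url and a regular expression that assumes, like the original slicing, that the URL ends at the next double quote. The sample fragments are invented:

```python
import re

GIF_PREFIX = "https://images.chesscomfiles.com/uploads/game-gifs/"


def extract_gif_url(html):
    """Return the first GIF URL found in html, or None if there is none."""
    match = re.search(re.escape(GIF_PREFIX) + r'[^"]*', html)
    return match.group(0) if match else None


# Invented response fragments for illustration
sample_ok = '<img src="' + GIF_PREFIX + 'example/board.gif">'
sample_missing = "<p>Invalid game</p>"

print(extract_gif_url(sample_ok))
print(extract_gif_url(sample_missing))
```

In the script, the download step could then be skipped with a clear message whenever the helper returns None.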