I am scraping data from the web. A few of the non-ASCII Latin letters in the URLs come back percent-encoded, that is, as the hex values of their bytes.
For example:
https://www.zomato.com/ncr/café-mrp-connaught-place-new-delhi
This link becomes:
https://www.zomato.com/ncr/caf%C3%A9-mrp-connaught-place-new-delhi
How do I get the Latin letter back from this? I want to generalize this and apply it to every Latin letter that gets changed in my dataframe.
from selenium import webdriver
import pandas as pd

# Assumed starting point: an empty frame (its creation is not shown in the question)
df = pd.DataFrame(columns=['Rest Name', 'URL'])
i = 1
main_page_url = r"https://www.zomato.com/ncr/connaught-place-delhi-restaurants"
chrome_path = r"C:\Users\HPO2KOR\Desktop\chromedriver.exe"
wd = webdriver.Chrome(chrome_path)
wd.get(main_page_url)
# Scrape the first two result pages
while i <= 2:
    rests = wd.find_elements_by_xpath('//a[@class="result-title hover_feedback zred bold ln24 fontsize0 "]')
    for rest in rests:
        df = df.append({'Rest Name' : rest.text,
                        'URL' : rest.get_attribute("href")}, ignore_index=True)
    # Go to the next page of results
    nxt_pg = wd.find_element_by_xpath('//a[@class="paginator_item next item"]')
    nxt_pg.click()
    wd.switch_to_window(wd.window_handles[0])
    i += 1
wd.close()
You can use urllib.parse.unquote(s) to decode the percent-escapes and urllib.parse.quote(s) to encode them again.
Here is a short interpreter session:
>>> import urllib.parse
>>> urllib.parse.unquote("https://www.zomato.com/ncr/caf%C3%A9-mrp-connaught-place-new-delhi")
'https://www.zomato.com/ncr/café-mrp-connaught-place-new-delhi'
>>> urllib.parse.quote('https://www.zomato.com/ncr/café-mrp-connaught-place-new-delhi')
'https%3A//www.zomato.com/ncr/caf%C3%A9-mrp-connaught-place-new-delhi'
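To generalize this to every scraped link, something like the following might work. It is only a sketch: the `decode_urls` helper and the `scraped` sample list are made up for illustration and stand in for the 'URL' column from the question. It also shows that passing `safe=":/"` to `quote` keeps the `:` after `https` from being escaped, which is why the session above produced `https%3A`.

```python
from urllib.parse import quote, unquote


def decode_urls(urls):
    """Return a new list with the percent-escapes in each URL decoded (UTF-8)."""
    return [unquote(u) for u in urls]


# Hypothetical sample standing in for the scraped 'URL' column.
scraped = [
    "https://www.zomato.com/ncr/caf%C3%A9-mrp-connaught-place-new-delhi",
    "https://www.zomato.com/ncr/connaught-place-delhi-restaurants",
]
decoded = decode_urls(scraped)

# With a pandas DataFrame, the same function can be mapped over the column:
#     df['URL'] = df['URL'].map(unquote)

# Re-encoding: listing ':' in `safe` leaves the scheme separator unescaped.
encoded = quote(decoded[0], safe=":/")
```

URLs without any escapes pass through `unquote` unchanged, so it should be safe to apply it to the whole column rather than only to the rows that look encoded.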