
HTML junk returned when JSON is expected


The following code used to work, but it no longer does: the request still succeeds with status code 200, but the body is junk HTML instead of JSON.

import json
from urllib.request import urlopen

response = urlopen('https://www.tipranks.com/api/stocks/stockAnalysisOverview/?tickers='+symbol)
data = json.load(response)

If you open the URL in Chrome, you see the JSON data. But when it is fetched from Python, I now get:

f1xx.v1xx=v1xx;f1xx[374148]=window;f1xx[647467]=e8NN(f1xx[374148]);f1xx[125983]=n3EE(f1xx[374148]);f1xx[210876]=(function(){var P6=2;for(;P6 !== 1;){switch(P6){case 2:return {w3:(function(v3){var v6=2;for(;v6 !== 10;){switch(v6){case 2:var O3=function(W3){var u6=2;for(;u6 !== 13;){switch(u6){case 2:var o3=[];u6=1;break;case 14:return E3;break;case 8:U3=o3.H8NN(function(){var Z6=2;for(;Z6 !== 1;){switch(Z6){case 2:return 0.5 - B8NN.P8NN();break;}}.....

How should I adapt to this backend change so that I can parse the JSON again?


Solution

  • This is bot protection, intended to prevent exactly this kind of scripted access. The API endpoint is meant to be used only by the website itself, not by a Python script.

    If you delete your site data and then access the page afresh in the browser, you'll see that it first loads the same HTML page you are getting. That page loads some JavaScript, which then sends a POST request with some data to another URL. Somewhere in this process a number of cookies get set, and finally the code refreshes the page, which then loads the JSON data. From that point on, visiting the URL directly returns the data because the correct cookies are already set.

    Network tab screenshot

    If you look at those requests, you'll see that the server returns the header server: rhino-core-shield. Googling that shows it is part of the Reblaze DDoS Protection Platform.

    You may have some luck with a headless browser such as ghost.py or pyppeteer, but I'm not sure how effective it will be; you'll have to try. The proper way to do this would be to find an official (probably paid) API for the information you need instead of relying on non-public endpoints.
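
Whatever route you take, it may help to make the failure explicit instead of letting json.load fail on the challenge page. The following is a minimal sketch, assuming Python 3 and only the standard library. The names NotJsonResponse, parse_json_body and fetch_json are hypothetical helpers made up for this illustration. It does not get around the protection; it only reports the Content-Type and Server headers so that a response like the one above is easy to recognise.

```python
import json
from urllib.request import Request, urlopen


class NotJsonResponse(Exception):
    """Raised when the server answered with something other than JSON."""


def parse_json_body(body: bytes, content_type: str = "", server: str = ""):
    """Parse body as JSON, or raise NotJsonResponse with a diagnostic message."""
    try:
        return json.loads(body.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError) as exc:
        # Show a short preview so a JavaScript challenge page is easy to spot.
        preview = body[:60].decode("utf-8", errors="replace")
        raise NotJsonResponse(
            f"expected JSON, got content-type={content_type!r}, "
            f"server={server!r}, body starts with {preview!r}"
        ) from exc


def fetch_json(url: str):
    """Fetch url and parse the body as JSON (hypothetical helper, not a bypass)."""
    with urlopen(Request(url)) as response:
        return parse_json_body(
            response.read(),
            content_type=response.headers.get("Content-Type", ""),
            server=response.headers.get("Server", ""),
        )
```

Calling fetch_json on the URL from the question would then raise NotJsonResponse with the header values in the message whenever the challenge page comes back, rather than an opaque JSON decoding error.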