There is this website I'm trying to scrape.
If you put organization ID in a search bar, and press "Искать", it redirects you to a separate page with base url of https://pb.nalog.ru/search.html and hash "#t=*&mode=search-all&queryAll=ID" where t are current miliseconds (from Date.gettime()
)
If I use the url generated by macro and manually put it into a browser, it returns the correct page, but whenever I try to do it programmatically, it returns 404 Page not found website dummy; moreover, the url it returns is different from mine:
https://pb.nalog.ru/search.html#t=1730356470622&mode=search-all&queryAll=9714055795
/search.html%23t=1730356470622&mode=search-all&queryAll=9714055795
I assume %23 is a conversion of #, but I'm new to this and can't say for sure. I will try to answer all of the follow-up questions.
Here's the code in question:
Option Explicit
Private Type SYSTEMTIME
wYear As Integer
wMonth As Integer
wDayOfWeek As Integer
wDay As Integer
wHour As Integer
wMinute As Integer
wSecond As Integer
wMilliseconds As Integer
End Type
Private Declare PtrSafe Sub GetSystemTime Lib "kernel32" (lpSystemTime As SYSTEMTIME)
Function CurrentTimeMillis() As Double
' Returns the milliseconds from 1970/01/01 00:00:00.0 to system UTC
Dim st As SYSTEMTIME
GetSystemTime st
Dim t_Start, t_Now
t_Start = DateSerial(1970, 1, 1) ' Starting time for Linux
t_Now = DateSerial(st.wYear, st.wMonth, st.wDay) + _
TimeSerial(st.wHour, st.wMinute, st.wSecond)
CurrentTimeMillis = DateDiff("s", t_Start, t_Now) * 1000 + st.wMilliseconds
End Function
Public Sub oopsie_doopsie()
Dim http As New XMLHTTP60
Dim html As New HTMLDocument
Dim curr As Double
curr = CurrentTimeMillis(): Debug.Print curr
With http
.Open "GET", "https://pb.nalog.ru/search.html#t=" & curr & "&mode=search-all&queryAll=" & "9714055795" & "", False
Debug.Print "https://pb.nalog.ru/search.html#t=" & curr & "&mode=search-all&queryAll=" & "9714055795" & ""
DoEvents
.send
DoEvents
html.body.innerHTML = .responseText
End With
html.getElementsByClassName ("pb-subject-status pb-subject-status--active")
End Sub
As far as I see, the page on the URL is not static and the element pb-subject-status pb-subject-status--active
will not be visible even if you succeed in your request with your original code.
On the other hand, I think there are some elements being filled from some JSON
sources.
My Language is not Russian so, I don't understand from response I get in the below trial code. This code, will get some JSON
data and write it in cell A1
.
Just check the data and see if your required data is available somewhere in the text. If you find your data there, then you can parse the string with Tim Hall's JSon Parser
Sub Test()
Dim objHTTP As Object, strURL As String, PayLoad As String, requestID As String, strJSON As String
Set objHTTP = CreateObject("MSXML2.XMLHTTP.6.0")
strURL = "https://pb.nalog.ru/search-proc.json"
PayLoad = "mode=search-all&queryAll=9714055795&queryUl=&okvedUl=®ionUl=&statusUl=&isMspUl=&mspUl1=1&mspUl2=1&mspUl3=1&queryIp=&okvedIp=®ionIp=&statusIp=&isMspIp=&mspIp1=1&mspIp2=1&mspIp3=1&taxIp=&queryUpr=&uprType1=1&uprType0=1&queryRdl=&dateRdl=&queryAddr=®ionAddr=&queryOgr=&ogrFl=1&ogrUl=1&ogrnUlDoc=&ogrnIpDoc=&npTypeDoc=1&nameUlDoc=&nameIpDoc=&formUlDoc=&formIpDoc=&ifnsDoc=&dateFromDoc=&dateToDoc=&page=1&pageSize=10&pbCaptchaToken=&token="
objHTTP.Open "POST", strURL, False
objHTTP.setRequestHeader "Content-Type", "application/x-www-form-urlencoded"
objHTTP.send PayLoad
strJSON = objHTTP.responseText
requestID = Split(Split(strJSON, """id"":""")(1), """")(0)
PayLoad = "id=" & requestID & "&method=get-response"
strURL = "https://pb.nalog.ru/search-proc.json"
objHTTP.Open "POST", strURL, False
objHTTP.setRequestHeader "Content-Type", "application/x-www-form-urlencoded"
objHTTP.send PayLoad
strJSON = objHTTP.responseText
Range("A1") = strJSON
Set objHTTP = Nothing
End Sub
. Screenshot of the page is added below: