html go parsing goquery

Strange len function (or string) behavior


I'm trying to parse timetable content with goquery so I can work with it later, but I've run into a problem.

I have two functions. The first takes an HTML document and searches it for a token (csrfmiddlewaretoken); the second sends a request using this token and extracts information from the response. After extracting everything I need from the page, I search it for the next token and store it for use in a future request.

But for some reason the found token is an empty string by the time execution reaches if len(foundCsrfToken) == 0 {. If I print the length of the token just before that statement, I get this:

...
64
0
...

I've removed all goroutines in case they were the problem.

func findCsrfMiddlewareToken(responseBody io.Reader) (string, error) {
    document, err := goquery.NewDocumentFromReader(responseBody)
    if err != nil {
        return "", err
    }

    var foundCsrfToken string
    document.Find("script").Each(func(_ int, scrpt *goquery.Selection) {
        scriptText := scrpt.Text()
        if funcDefIndex := strings.Index(scriptText, "function Filter"); funcDefIndex != -1 {
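            csrfTokenValueStart := strings.Index(scriptText, "csrfmiddlewaretoken: '")
            if csrfTokenValueStart == -1 {
                return // marker not present in this script; keep looking
            }
            offset := csrfTokenValueStart + len("csrfmiddlewaretoken: '")
            if offset+csrfMiddlewareTokenLength > len(scriptText) {
                return // script is too short to hold a full token
            }
            foundCsrfToken = scriptText[offset : offset+csrfMiddlewareTokenLength]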
        }
    })
    if len(foundCsrfToken) == 0 {
        return "", errNoCsrfMiddlewareToken
    }
    return foundCsrfToken, nil
}

func (parser *TimetableParser) ParseTimetable(timetableFilterInfo internal.TimetableInfo) (internal.Timetable, error) {
    timetable := internal.Timetable{}

    requestBody := makeFormValues(timetableFilterInfo, parser.csrfMiddlewareToken).Encode()
    request, err := http.NewRequest("POST", baseUrl, strings.NewReader(requestBody))
    if err != nil {
        return timetable, err
    }
    request.Header.Add("Content-Type", "application/x-www-form-urlencoded")
    request.Header.Add("Content-Length", strconv.Itoa(len(requestBody)))
    request.Header.Add("Referer", baseUrl)

    response, err := parser.client.Do(request)
    if err != nil {
        return timetable, err
    }
    defer response.Body.Close()

    document, err := goquery.NewDocumentFromReader(response.Body)
    if err != nil {
        return timetable, err
    }

    document.Find("table#schedule").Find("tr").Each(func(rowIndex int, row *goquery.Selection) {
        subjectTimeElement := row.Closest("td")
        subjectTimeElement.NextAll().Each(func(columnIndex int, cell *goquery.Selection) {
            subjectInfo := extractSubjectInfoFromCell(cell)
            subjectInfo.Order = rowIndex
            timetable.Subjects[columnIndex][rowIndex] = subjectInfo
        })
    })

    parser.csrfMiddlewareToken, err = findCsrfMiddlewareToken(response.Body)
    if err != nil {
        log.Println("csrfMiddlewareToken: " + err.Error())
    }
    return timetable, nil
}

Go version: go1.17.1 windows/amd64

goquery version: 1.7.1


Solution

  • I've just realized what is wrong. An io.Reader is a stream, so once I have read it to the end, it has nothing left to return. As the code shows, ParseTimetable first reads response.Body to gather all the necessary information and then passes the same response.Body to findCsrfMiddlewareToken, but by that point it is already drained. The first call to findCsrfMiddlewareToken works as expected and prints the token length (64). The second call receives the drained response body, finds nothing, and prints 0.

    Possible solution: How to read multiple times from same io.Reader