I'm trying to parse data from a site using QNetworkAccessManager. To do this, I write the site data to a QString, but when I do a substring search using indexOf, the result is incorrect: one value is -1. Can you tell me what the problem is here?
void MainWindow::on_pushButton_clicked()
{
    pBar = new QProgressBar;
    pBar->setMaximum(0); // maximum
    pBar->setMinimum(0);
    pBar->show();
    QNetworkRequest request;
    QUrl url(tr("https://auto.ru/tver/cars/all/? utm_source=yandex_direct&utm_medium=direct.brand&utm_campaign=460_hand_desktop_used_brand_search_Tver_none_82222146&utm_content=cid%3A82222146%7Cgid%3A5114476686%7Caid%3A13330832990%7Cph%3A42898716904%7Cpt%3Apremium%7Cpn%3A1%7Csrc%3Anone%7Cst%3Asearch%7Ccgcid%3A0%7Cdt%3Adesktop&utm_term=auto+ru&adjust_t=cl4qttt_nsw4it6&adjust_campaign=82222146&adjust_adgroup=5114476686&tracker_limit=10000&adjust_ya_click_id=1049526807999603789&_openstat=ZGlyZWN0LnlhbmRleC5ydTs4MjIyMjE0NjsxMzMzMDgzMjk5MDt5YW5kZXgucnU6cHJlbWl1bQ&yclid=639777327900000255"));
    request.setUrl(url);
    this->manager->get(request);
    connect(manager, SIGNAL(finished(QNetworkReply *)), this,
            SLOT(replyFinished(QNetworkReply *)));
}
void MainWindow::replyFinished(QNetworkReply *reply)
{
    if (reply->error() == QNetworkReply::NoError)
    {
        QByteArray content = reply->readAll();
        QTextCodec *codec = QTextCodec::codecForName("utf8");
        QString page = codec->toUnicode(content.data());
        int startStrPos = page.indexOf("<div class=\"ListingCars
ListingCars_outputType_list\">");
        int endStrPos = page.lastIndexOf("<div class=\"ListingCarsPagination\">");
        // startStrPos = -1
        qDebug() << startStrPos << endStrPos;
        QString ctn = page.mid(startStrPos, endStrPos - startStrPos);
        ui->textEdit->setPlainText(ctn);
    }
    reply->deleteLater();
    pBar->close();
}
The string literal you search for spans two source lines, which is not correct: an ordinary C++ string literal cannot contain a raw line break. You should either write it on a single line:
int startStrPos = page.indexOf("<div class=\"ListingCars ListingCars_outputType_list\">");
Or you can split the literal into two parts for readability; the compiler concatenates adjacent string literals:
int startStrPos = page.indexOf("<div class=\"ListingCars"
" ListingCars_outputType_list\">");
Using this technique, you can make the URL somewhat more readable:
QUrl url(tr("https://auto.ru/tver/cars/all/?"
"utm_source=yandex_direct&"
"utm_medium=direct.brand&"
"utm_campaign=460_hand_desktop_used_brand_search_Tver_none_82222146&"
"utm_content=cid%3A82222146%7Cgid%3A5114476686%7Caid%3A13330832990%7Cph%3A42898716904%7Cpt%3Apremium%7Cpn%3A1%7Csrc%3Anone%7Cst%3Asearch%7Ccgcid%3A0%7Cdt%3Adesktop&"
"utm_term=auto+ru&"
"adjust_t=cl4qttt_nsw4it6&"
"adjust_campaign=82222146&"
"adjust_adgroup=5114476686&"
"tracker_limit=10000&"
"adjust_ya_click_id=1049526807999603789&"
"_openstat=ZGlyZWN0LnlhbmRleC5ydTs4MjIyMjE0NjsxMzMzMDgzMjk5MDt5YW5kZXgucnU6cHJlbWl1bQ&"
"yclid=639777327900000255"));
Note, however, that the indexOf method scans for an exact match of the substring. The HTML might contain a variation of the substring with different spacing or extra attributes.
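One way to tolerate such variations is to match the tag by pattern rather than by exact text. The sketch below is only an illustration under stated assumptions, not a parser: it uses std::regex and std::string instead of the Qt classes so that it compiles on its own, findDivWithClass is a hypothetical helper name, and it only handles double-quoted class attributes. It returns the offset of the first <div> whose class attribute contains the given class as a whitespace-separated token, or -1 to mirror indexOf.

```cpp
#include <regex>
#include <string>

// Hypothetical helper (not part of Qt or the original code).
// Returns the offset of the first <div ...> tag whose double-quoted class
// attribute contains className as a whole whitespace-separated token,
// or -1 if no such tag exists.
// Assumption: className contains only letters, digits, '_' and '-',
// so it can be pasted into the pattern without escaping.
long findDivWithClass(const std::string &html, const std::string &className)
{
    const std::regex pattern(
        // "<div", optionally other attributes, then whitespace and "class="
        R"re(<div(?:\s[^>]*)?\sclass\s*=\s*"(?:[^"]*\s)?)re"
        + className +
        // optional further class tokens, closing quote, rest of the tag
        R"re((?:\s[^"]*)?"[^>]*>)re");

    std::smatch match;
    if (!std::regex_search(html, match, pattern))
        return -1;
    return static_cast<long>(match.position(0));
}
```

With Qt, QRegularExpression could play the same role. For anything beyond a quick check, a real HTML parser is more robust than a regular expression.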
EDIT:
Requesting the page manually, I got a cookie screen and a robot detection page, so there is a chance you are not parsing the final page. Furthermore, the string <div class="ListingCars ListingCars_outputType_list"> does not appear in the page I finally get. Did the page structure change since you last analyzed it?
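Whatever the cause, the code in the question passes -1 straight to mid when the start marker is missing. A minimal sketch of guarding against that follows. It assumes std::string in place of QString so that it is self-contained, and extractBetween is a hypothetical helper name; with QString the equivalent checks would compare the results of indexOf and lastIndexOf against -1.

```cpp
#include <string>

// Hypothetical helper (not part of the original code).
// Copies the text from the first startMarker up to, but not including,
// the last endMarker into out. Returns false and leaves out untouched
// when either marker is missing or the end marker precedes the start.
bool extractBetween(const std::string &page,
                    const std::string &startMarker,
                    const std::string &endMarker,
                    std::string &out)
{
    const std::size_t start = page.find(startMarker);
    if (start == std::string::npos)
        return false; // e.g. a cookie or robot-check page was served

    const std::size_t end = page.rfind(endMarker);
    if (end == std::string::npos || end < start)
        return false;

    out = page.substr(start, end - start);
    return true;
}
```

When the function returns false, logging the first few hundred characters of the reply is a quick way to see which page was actually served.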