Questions that pose a similar problem:
Issues with LWP when using HTTP/1.1: bad chunk-size, truncated responses.
I am using the Perl module WWW::Mechanize to scrape web sites. As far as I understand, WWW::Mechanize uses the Net::HTTP module to implement the HTTP protocol.
Here is the issue:
my $url = 'https://somewebsite.com/a/b/c?skey=svalue';
my $browser = WWW::Mechanize->new();
$browser->get($url);
When I run the snippet above (with all imports in place), the response content is empty and the response object from WWW::Mechanize carries the following error in its headers:
'x-died' = "Bad chunk-size in HTTP response: { at path/ to/perl/vendor/lib/Net/HTTP/Methods.pm line 542."
Notice the '{' in the exception message. I then tried to debug the Methods.pm module to see what was going on and it looks like the exception happens inside the read_entity_body subroutine.
I also did a curl for the url that I have and got the following response headers:
< HTTP/1.1 200 OK
< Set-Cookie: JSESSIONID=C61B57BA5DD0A05912C98CE1CFBAD435; Path=/; HttpOnly
< X-Frame-Options: DENY
< Transfer-Encoding: chunked
< Strict-Transport-Security: max-age=31536000 ; includeSubDomains
< Server: Apache-Coyote/1.1
< Cache-Control: no-cache, no-store, max-age=0, must-revalidate
< X-Content-Type-Options: nosniff
< Content-Disposition: attachment;filename=f.txt
< Pragma: no-cache
< Expires: 0
< X-XSS-Protection: 1; mode=block
< Date: Thu, 21 Sep 2017 18:31:27 GMT
< Content-Type: application/json;charset=UTF-8
< Transfer-Encoding: chunked
and with the following content:
{
"total" : 1,
"page" : 1,
"records" : 1,
"rows" : [ {
"infoPostRptId" : 2,
"mngPplId" : 1,
"infoPostRptXsdId" : 1,
"rptFmtCode" : "XML",
"createUserId" : 5183202,
"updateUserId" : 1,
"statusId" : 309403,
"seqNbr" : 0,
"urlAnchor" : null,
} ],
"errors" : null
}
* Connection #0 to host xxxxxxx left intact
If I am not mistaken, the content returned by the website is not actually chunk-encoded, even though the headers declare Transfer-Encoding: chunked.
More information regarding the Methods.pm module:
From what I understand, the read_entity_body subroutine decodes the chunks and combines them to form the response content.
So I think the problem is that the response headers say Transfer-Encoding: chunked while the content is in fact not chunk-encoded.
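To illustrate why an unchunked body would produce exactly this error, here is a minimal sketch of a chunked-body decoder. It is not the code from Net::HTTP::Methods; the function name decode_chunked and the error wording are assumptions made for illustration, and the sketch ignores trailers and works on a body already held in memory. A decoder of this kind expects the first line of the body to be a hexadecimal chunk size, so a body that starts with '{' fails immediately.

```perl
use strict;
use warnings;

# Hypothetical helper: decode a chunked HTTP body held entirely in memory.
# Dies when a chunk-size line does not start with hexadecimal digits.
sub decode_chunked {
    my ($raw) = @_;
    my $body = '';
    my $pos  = 0;
    while (1) {
        # Read one line: the chunk size, optionally followed by extensions.
        my $eol = index($raw, "\n", $pos);
        die "Truncated chunked body\n" if $eol < 0;
        (my $line = substr($raw, $pos, $eol - $pos)) =~ s/\r\z//;
        $pos = $eol + 1;

        $line =~ /^([0-9A-Fa-f]+)/
            or die "Bad chunk-size in HTTP response: $line\n";
        my $size = hex($1);
        last if $size == 0;    # last chunk; trailers are ignored here

        die "Truncated chunk\n" if $pos + $size > length($raw);
        $body .= substr($raw, $pos, $size);
        $pos += $size;

        # Each chunk's data is followed by a line ending.
        $pos++ if substr($raw, $pos, 1) eq "\r";
        substr($raw, $pos, 1) eq "\n"
            or die "Missing line ending after chunk data\n";
        $pos++;
    }
    return $body;
}
```

Feeding it a properly chunked body returns the joined data, while feeding it plain JSON dies on the first line, which matches the '{' shown in the X-Died message.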
Any help is highly appreciated. Thanks.
EDIT 1:
Versions:
WWW::Mechanize 1.83, LWP::UserAgent 6.15 and Net::HTTP 6.12
EDIT 2:
Output of curl -s --raw -D - "https://....":
HTTP/1.1 200 OK
Set-Cookie: JSESSIONID=A29B1E0F561F1E4FBAF12583C0C2DE08; Path=/; HttpOnly
X-Frame-Options: DENY
Transfer-Encoding: chunked
Strict-Transport-Security: max-age=31536000 ; includeSubDomains
Server: Apache-Coyote/1.1
Cache-Control: no-cache, no-store, max-age=0, must-revalidate
X-Content-Type-Options: nosniff
Content-Disposition: attachment;filename=f.txt
Pragma: no-cache
Expires: 0
X-XSS-Protection: 1; mode=block
Date: Fri, 22 Sep 2017 02:36:51 GMT
Content-Type: application/json;charset=UTF-8
Transfer-Encoding: chunked
45c
{
"total" : 1,
"page" : 1,
"records" : 1,
"rows" : [ {
"infoPostRptId" : 2,
"mngPplId" : 1,
"infoPostRptXsdId" : 1,
"rptFmtCode" : "XML",
"createUserId" : 5183202,
"updateUserId" : 1,
"statusId" : 309403,
"seqNbr" : 0,
"urlAnchor" : null,
} ],
"errors" : null
}
0
Like the previous JSON content, I have removed/altered some values just to anonymize data.
EDIT 3: This is what I get when I execute the following command:
perl -MLWP::UserAgent -e'print LWP::UserAgent->new->get($ARGV[0])->as_string' 'https://......'
HTTP/1.1 200 OK
Cache-Control: no-cache, no-store, max-age=0, must-revalidate
Connection: close
Date: Fri, 22 Sep 2017 04:15:06 GMT
Pragma: no-cache
Server: Apache-Coyote/1.1
Content-Type: application/json;charset=UTF-8
Expires: 0
Client-Aborted: die
Client-Date: Fri, 22 Sep 2017 04:15:06 GMT
Client-Peer: 67.221.172.5:443
Client-Response-Num: 1
Client-SSL-Cert-Issuer: /C=US/ST=Arizona/L=Scottsdale/O=GoDaddy.com, Inc./OU=http://certs.godaddy.com/repository//CN=Go Daddy Secure Certificate Authority - G2
Client-SSL-Cert-Subject: /OU=Domain Control Validated/CN=*.trellisenergy.com
Client-SSL-Cipher: ECDHE-RSA-AES128-SHA256
Client-SSL-Socket-Class: IO::Socket::SSL
Client-Transfer-Encoding: chunked
Content-Disposition: attachment;filename=f.txt
Set-Cookie: JSESSIONID=5CAC35648DBBE25E3229DE9BF21C3794; Path=/; HttpOnly
Strict-Transport-Security: max-age=31536000 ; includeSubDomains
X-Content-Type-Options: nosniff
X-Died: Bad chunk-size in HTTP response: { at /usr/local/share/perl5/Net/HTTP/Methods.pm line 544.
X-Frame-Options: DENY
X-XSS-Protection: 1; mode=block
EDIT 4: TCP Dump:
I ran the following command in one terminal window:
perl -MLWP::UserAgent -e'print LWP::UserAgent->new->get($ARGV[0])->as_string' 'https://vgs.trellisenergy.com/ptms/public/infopost/getInfoPostRpts.do?tspId=1&proxyTspId=1&rptId=2&downloadInd=0&searchInd=0&showLatestInd=0&cycleId=10303&startDate=09/20/2017&endDate=09/20/2017&_search=false&nd=1505846852955&rows=10&page=1&sidx=&sord=asc&_=1505846826289'
And the following in another:
tcpdump -w tcpdump.pcap -A -s0 -e -n -vvv -i eth0 host vgs.trellisenergy.com
I then pretty-printed the capture using:
tcpick -C -yP -r tcpdump.pcap
TCP Dump:
Starting tcpick 0.2.1 at 2017-09-22 10:24 MDT
Timeout for connections is 600
tcpick: reading from tcpdump.pcap
1 SYN-SENT 10.1.1.10:24876 > 67.221.172.5:https
1 SYN-RECEIVED 10.1.1.10:24876 > 67.221.172.5:https
1 ESTABLISHED 10.1.1.10:24876 > 67.221.172.5:https
...........Y.8..*m.i.'ZZP*....1...d
.._.$.^....0.,.(.$...
.....k.j.9.8.....2...*.&.......=.5.../.+.'.#... .....g.@.3.2.....E.D.1.-.).%.......<./...A.........
..................._.........vgs.trellisenergy.com.........
. .....................................
.....0..1.0.......U....US1.0...U....Arizona1.0...U...............>.s].s.a^.
Scottsdale1.0...U.
..........0..0A1!0...U....Domain Control Validated1.0...U....*.trellisenergy.com0.."0 Secure Certificate Authority - G20..
h@s0.*$.H.4./..E8.m.V......'!..f...!tY'.(..`......... ...E.)Tz..z2.%..KEi....Dd.....s....JW_.Y ..8..6..Y ........i.r............"...a.
LI1V 6t....C.....20uB'..#:...n..(-...(..P..M..O...p.3L.].@A.........0...0...U.......0.0...U.%..0...+.........+.......0...U...........07..U...00.0,.*.(.&http://crl.godaddy.com/gdig2s1-337.crl0]..U. .V0T0H..`.H...m....0907..+........+http://certificates.godaddy.com/repository/0...g.....0v..+........j0h0$..+.....0...http://ocsp.godaddy.com/0@..+.....0..4http://certificates.godaddy.com/repository/gdig2.crt0...U.#..0...@..'..4.0.3..l...,..01..U...*0(..*.trellisenergy.com............z...;^..'.@.l..,Cj...N.LY.S.......~p...k.. ...Y..S}.\}o.......(.
.....H..SG.D.vy}...qM(.0LT.C.....R.......y... Y.....wz.s4..Q.t...u...].8.|..q..+.>5...?..`z.X2. .{.%..[ 7.. r...y.yjY..h]...0I.$..x,O....h......n.b.....c.<.....X.Gi.P.vTM.d.B.
.....0..1.0...a...U....US1.0...U....Arizona1.0...U...
Scottsdale1.0...U.
310503070000Z0..1.0110/...U....US1.0...U....Arizona1.0...U...rity - G20..
Scottsdale1.0...U.
..........0.., Inc.1-0+..U...$http://certs.godaddy.com/repository/1301..U...*Go Daddy Secure Certificate Authority - G20.."0
...........v...b.0d...l...b../.>e...b.<R...EKU.xkc.b...il.....L.E3......+..a.yW....?0<]G.....7.AQ..KT.(.....08...&.fGcm.q&G.8GS.F......E...q..o....0:yO_LG...[...`;..C...3N...'O.%........t.dW..DU.-*:>....2
..d..:P.J..y3.. .....9.i.lcR.w...t.....PT5KiN.;.I.....R..........0...0...U.......0....0...U...........0...U......@..'..4.0.3..l...,..0...U.#..0...:....g(.....An .....04..+........(0&0$..+.....0...http://ocsp.godaddy.com/05..U....0,0*.(.&.......`..r.s$..."....bXD...%......b.Q...Q*...s.v.6....,....*...Mu..?.A.#}[K...X.F..``..}PA......../..T.D..}.C.D..p
...3..-v6&.....a....o.F.(..&}
.....0..1.0.......U....US1.0...U....Arizona1.0...U...
Scottsdale1.0...U.
09GoDaddy.com, Inc.110/..U...(Go Daddy Root Certificate Authority - G20..
371231235959Z0..1.0 ..U....US1.0...U....Arizona1.0...U...
Scottsdale1.0...U.
..........0.., Inc.110/..U...(Go Daddy Root Certificate Authority - G20.."0
..f"..im6.......`.8......F.. C.;....I.'....N...p..2...>.N...O/Y0"...Vk......u.9Q{..5.tN......?........j..............;F|2
>.]|.|..+S..biQ%.a.D..,.C.#..:...)....]....0
............]y...Yg.a.~;.1u-. .Oe......../..Z..t.s.8B..{..u...........S.~.F.....+....'....Z.7....l....=.$Oy.5._.......-.......s@.r%......h..W...: ..D...7...2..8..d.,~........h..".8-z..T.i._3.z={
.8.. 'e...]p-..N.(F...6.....(....k.Q......8k...v...v...(...=!.:...;.L.....K./.....D....xH .Zi.<!.}i. t.c.!yWY..c.I......?.._.e......"...v.'8Qq.d].......O(8._M....%........]:LU....]l. .....
............iA...~....C5...k.43... .F6. .\!....X......bJ.e..@.....[.uO.&..-....7.O. .......g2..R.b....H7.........G.....%u1.....8$.u..O....za..T..........P...V2.;.......j.L.Px;..-....&.......H...yQ,n.s..<KFx#...2..K.G..n4OG{N.5.6../...
......
....PU.T....A.d...*.iw.. c.Wjm.V\. ..vP.Z%......v...k......l...b7.|.u..c.=:....$.3K..
........v.{u...`..+.qU. .'.t.g....V......1..P.g..aO....nY..C..F...4x.d...Y....|3..Pz;.K.~]...H..;...PIR..hRv...)].=?.:..[...h...A.. /4..d.......C`....]LZK.Y..q......Q.L.R..D&...l..t..I.j2....8...y.L..).y.n..).u|..'.....z ..,Yg..md."i.......M.74x...3..N.b.6..tm.).u...|-.xK.9R..M,......!....}..[=B.J...... ...~Gx.8p.5.UQ........sJ
...w..Xf.#^..,..G.w.f4.V..'..Bb_..*e.i......P1.
U6!.l..%...ts. u!c5.0>.!.2J.G)p.W.........dF*5.....5..M. .....G+.....I..vG&..>.}(....E. ...9...N.i..Jm&b...G...3Wo#k.........e:..p........:w....V.L'9.-..)......d.P_....#..iide@.2..E>.?|..:....B.,mr...N.JAS1]:...O.......i..c..T.pZZ)..E."\b.r2HA..r!....L........K....~1.....x!.Gp.K..G..D*s.u....WN.?..(+..rU..g?d.....eG.L.^...*..a...]/...N0.gX..;...T...%...;.P?.O4{.i.....%.T.|..
...U..Ug......d...a3:$...p...v..t."...
.......%..J`E....5....n..M....>...ge.r.,...s..,.. k..R.N._>3}...=.0...........T.d.. ...u 7?T...3b.?.lr...8o.Gk.}xkBY[...l..^.-.Wt}..G/..l.f..z..^F.A.G.i8l4.....#.a.....BS.c.Q7..=y...{ELUP.R..c.{...a9.u3..-@F.H..M..2.o.j@.pI..S....R ..vx.u.<-x..".T.d-...:...>......n..Z|..?Dz@N..?...#.../.....2.Z..y..Ej..........Q.....'8.....nC..7.....)e..7r..[..H...R.....h...x7G.+.......eBErwo.r....,..e*.8O..oQ. `O.@.J#...5).9.....!d.u....,...pV..oS...%.o..F..G.7....I...N...s .G..G@.".w6d......R..j
..........G.D..l....0..EH.Y..4.e.\#~s.i.-WKoyK...w.'.o.X-.,x.......4......T.*.>#..
..G(wP.V.i...F.U...t...-.\.!...Y4,...._............7..|<DM3.&u.%.0..G.......9....
.....Y......55ZW..X......Tz..D...r.6$..B...Wv..R..8.."../dL..-...i^o..>:..O...s.W.).i....gOH...@.....8k.......Q........#.....#.R..^.....f.......x^X....^S.R..u.7.._..T]A'/4>k\..Lg....H...J....o>.2 ......$.......PP..#..=.E..;2..>k...`...9..>*.....N...4........(...a....n....)w.I.@O+.(.cV..g.....%G..^.Z#.'EG...]..$_...!e...%.;VG.7.5.&...C........s4..1....t[
1 FIN-WAIT-1 10.1.1.10:24876 > 67.221.172.5:https
1 TIME-WAIT 10.1.1.10:24876 > 67.221.172.5:https
1 CLOSED 10.1.1.10:24876 > 67.221.172.5:https
tcpick: done reading from tcpdump.pcap
22 packets captured
1 tcp sessions detected
That's a bug in the server or (more likely) in the application running on the server. If one sends the following request:
GET /some-path HTTP/1.1
Host: some-host
the server responds with a correct chunked response. Interestingly, the Transfer-Encoding: chunked header is sent twice, once at the beginning of the HTTP header and once at the end:
HTTP/1.1 200 OK
Set-Cookie: ...
X-Frame-Options: DENY
Transfer-Encoding: chunked
...
Content-Type: application/json;charset=UTF-8
Transfer-Encoding: chunked
45c
{
Now, when sending a slightly changed request with an added Connection: close header, the response looks different:
GET /some-path HTTP/1.1
Host: some-host
Connection: close
----
HTTP/1.1 200 OK
Set-Cookie: ...
X-Frame-Options: DENY
Transfer-Encoding: chunked
...
Content-Type: application/json;charset=UTF-8
{
The leading Transfer-Encoding: chunked is still there, but the trailing one is gone. And the response body is no longer chunked, even though the response header still contains Transfer-Encoding: chunked!
This is what happens with LWP but not with curl: LWP sends a Connection: TE, close header, while curl sends no Connection header at all. This means LWP gets the broken response and complains correctly, while curl does not get the broken response and thus has no reason to complain. But if you explicitly add a Connection: close header to curl, it runs into the same problem:
$ curl -H 'Connection:close' https://...
curl: (56) Illegal or missing hexadecimal sequence in chunked-encoding
Further tests show that the leading Transfer-Encoding: chunked header is also sent if the client makes an HTTP/1.0 request! This should not happen at all, because chunked encoding is only defined for HTTP/1.1.
This suggests that some part of the web application running on the server, and not the web server itself, is issuing the first Transfer-Encoding: chunked header. Thus, if you have access to the application or to its developer, it should be fixed there.
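If the application cannot be changed, a client-side workaround may be to stop LWP from sending "close" in the Connection header. The following is a sketch under that assumption and has not been verified against this server: constructing the agent with keep_alive => 1 gives it a connection cache, and with a cache in place LWP should no longer request that the connection be closed. The helper name make_agent is made up for illustration.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Hypothetical helper: build an agent with a connection cache.
# Assumption: with keep_alive enabled, LWP does not add "close" to the
# Connection request header, so the server takes its working code path.
sub make_agent {
    return LWP::UserAgent->new(keep_alive => 1);
}

my $ua = make_agent();

# Fetch only when a URL is given on the command line.
if (@ARGV) {
    my $res  = $ua->get($ARGV[0]);
    my $died = $res->header('X-Died');
    print $res->status_line, "\n";
    if (defined $died) {
        print "X-Died: $died\n";    # the body could not be decoded
    }
    else {
        print $res->decoded_content;
    }
}
```

Since WWW::Mechanize is a subclass of LWP::UserAgent and forwards constructor options to it, WWW::Mechanize->new(keep_alive => 1) should have the same effect. This only sidesteps the broken response; the server-side bug remains.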