java rest keep-alive heartbeat tcp-keepalive

What's the proper heartbeat/keep-alive technology/layer for Java REST? HTTP? TCP? Transfer-Encoding: chunked?

The setup:

We have a site at https://Main.externaldomain/xmlservlet that authenticates, validates, geo-locates, and proxies (slightly modified) requests to, for example, http://London04.internaldomain/xmlservlet.

End users have no direct access to internaldomain at all. Communication between the sites is occasionally interrupted, and sometimes the internaldomain nodes become unavailable or die.

The Main site uses org.apache.http.impl.client.DefaultHttpClient (I know it's deprecated; we're gradually upgrading this legacy code) with readTimeout set to 10,000 milliseconds. The requests and responses carry XML bodies of variable length, and Transfer-Encoding: chunked is used, along with Keep-Alive: timeout=15.

The problem:

Sometimes London04 actually needs more than 10 seconds (let's say 2 minutes) to execute. Sometimes it crashes non-gracefully. Sometimes other (networking) issues occur. During those 2 minutes, the response XML is sometimes produced gradually enough that no gap between portions reaches 10 seconds, so the readTimeout is never exceeded; at other times there is a gap of more than 10 seconds and HttpClient times out.
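To illustrate why the read timeout only catches gaps and not total duration, here is a minimal sketch using plain sockets on loopback. It is not the Apache HttpClient setup from the question; the class name ReadTimeoutDemo, the helper names, and the scaled-down timings (500 ms instead of 10 s) are all made up for the demonstration. The read timeout (SO_TIMEOUT) is applied to each blocking read, so a slow trickle of bytes never trips it while a single long pause does.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

class ReadTimeoutDemo {

    /** One-shot loopback server: sleeps gapMillis before each of `count` one-byte writes, then closes. */
    static ServerSocket startTrickleServer(int count, long gapMillis) throws IOException {
        ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
        Thread worker = new Thread(() -> {
            try (Socket s = server.accept(); OutputStream out = s.getOutputStream()) {
                for (int i = 0; i < count; i++) {
                    Thread.sleep(gapMillis);
                    out.write('x');
                    out.flush();
                }
            } catch (IOException | InterruptedException ignored) {
                // Demo only: the client side reports the outcome.
            }
        });
        worker.setDaemon(true);
        worker.start();
        return server;
    }

    /** Reads until EOF; returns the byte count, or -1 if a single read waited longer than the timeout. */
    static int readAll(int count, long gapMillis, int readTimeoutMillis) {
        try (ServerSocket server = startTrickleServer(count, gapMillis);
             Socket client = new Socket(InetAddress.getLoopbackAddress(), server.getLocalPort())) {
            client.setSoTimeout(readTimeoutMillis); // applies to each read(), not to the whole response
            InputStream in = client.getInputStream();
            int total = 0;
            while (in.read() != -1) {
                total++;
            }
            return total;
        } catch (SocketTimeoutException e) {
            return -1;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // About 1 s in total, but every gap (100 ms) stays under the 500 ms timeout.
        System.out.println("trickle: " + readAll(10, 100, 500));
        // A single 1500 ms gap exceeds the 500 ms timeout.
        System.out.println("stalled: " + readAll(2, 1500, 500));
    }
}
```

The first call takes longer than the timeout overall and still completes; the second fails on its first read. That matches the behaviour described above with a 10-second readTimeout and a 2-minute response.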

We could increase the timeout on the Main side, but that would easily bloat or overload the listener pool (with regular traffic alone, even before any DDoS). We need a way to distinguish the case where the internal site is still working on the response from the cases where it has really crashed, lost its network connection, etc. The best option seems to be some kind of heartbeat (every 5 seconds) during the communication.

We thought Keep-Alive would save us, but it seems to cover only the gaps between requests (not the time during a request), and it does not appear to do any heartbeating during that gap; it just waits for the timeout.

We thought chunked encoding might save us by sending a heartbeat (zero-byte chunks) to keep the other side informed, but there seems to be no default implementation supporting a heartbeat this way; moreover, a zero-byte chunk is itself the end-of-data indicator.
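The point about the zero-byte chunk can be seen from the wire format itself. The following is a small sketch of chunk framing (size in hex, CRLF, data, CRLF); the class name ChunkFraming and its methods are hypothetical helpers, not part of any HTTP library. Framing an empty payload produces exactly the bytes that terminate a chunked body when there are no trailers.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

class ChunkFraming {

    /** Frames one chunk as: payload size in hex, CRLF, payload bytes, CRLF. */
    static byte[] frame(byte[] data) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] header = (Integer.toHexString(data.length) + "\r\n").getBytes(StandardCharsets.US_ASCII);
        out.write(header, 0, header.length);
        out.write(data, 0, data.length);
        out.write('\r');
        out.write('\n');
        return out.toByteArray();
    }

    /** Renders framed bytes with CR and LF made visible, for printing. */
    static String show(byte[] framed) {
        return new String(framed, StandardCharsets.US_ASCII)
                .replace("\r", "\\r")
                .replace("\n", "\\n");
    }

    public static void main(String[] args) {
        System.out.println(show(frame("<a/>".getBytes(StandardCharsets.US_ASCII))));
        // An empty payload yields the last-chunk line plus the final CRLF: the end of the body.
        System.out.println(show(frame(new byte[0])));
    }
}
```

So an empty chunk cannot double as a heartbeat: a receiver that sees it treats the response as finished.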

Question(s):

If we're correct in assuming that Keep-Alive and chunked encoding won't help us achieve a kept-alive connection, a heartbeat, or fast detection of a dead backend, then:

1) At which layer should such a heartbeat be implemented? HTTP? TCP?

2) Is there any standard framework, library, or setting that implements it already? (If possible: Java, REST.)

UPDATE

I've also looked for heartbeat implementations for WADL/WSDL, though I found none for REST, and I checked out WebSockets. I also looked into TCP keepalives, which seem to be the right feature for the task.

But according to what I read about them, I'd have to set up something like:

tcp_keepalive_time=5
tcp_keepalive_intvl=1
tcp_keepalive_probes=3

which seems to go against the recommendations (2 hours is the recommended value, and 10 minutes is already presented as an odd one). Is going down to 5 seconds sane and safe? If it is, it might be my solution.

Also, where should I configure this: on London04 alone, or on Main too? (If I set it up on Main, won't it flood the client-->Main frontend communication? And might NATs or other devices between the sites easily break the keepalive?)
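For reference, the same three values can be applied per socket from Java instead of system-wide. This is only a sketch under some assumptions: it needs Java 11 or later for jdk.net.ExtendedSocketOptions, those options are not available on every platform (hence the supportedOptions() check), and the class name KeepAliveConfig and its methods are made up here. Whether the HTTP client in use lets you reach the underlying socket is a separate question.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.Socket;
import java.net.SocketOption;
import java.net.StandardSocketOptions;
import jdk.net.ExtendedSocketOptions;

class KeepAliveConfig {

    /** Sets an option only if this platform/JDK supports it; returns whether it was applied. */
    static boolean trySet(Socket socket, SocketOption<Integer> option, int value) throws IOException {
        if (!socket.supportedOptions().contains(option)) {
            return false;
        }
        socket.setOption(option, value);
        return true;
    }

    /**
     * Turns keepalive on for this one socket and tries to apply the timings from the question:
     * 5 s idle before the first probe, 1 s between probes, 3 unanswered probes before giving up.
     * Returns true only if all three timings were applied.
     */
    static boolean enableFastKeepAlive(Socket socket) throws IOException {
        socket.setOption(StandardSocketOptions.SO_KEEPALIVE, true);
        boolean idle = trySet(socket, ExtendedSocketOptions.TCP_KEEPIDLE, 5);
        boolean interval = trySet(socket, ExtendedSocketOptions.TCP_KEEPINTERVAL, 1);
        boolean count = trySet(socket, ExtendedSocketOptions.TCP_KEEPCOUNT, 3);
        return idle && interval && count;
    }

    /** Applies the settings to an unconnected socket and reports whether SO_KEEPALIVE reads back as on. */
    static boolean demo() {
        try (Socket socket = new Socket()) {
            enableFastKeepAlive(socket);
            return socket.getOption(StandardSocketOptions.SO_KEEPALIVE);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("SO_KEEPALIVE enabled: " + demo());
    }
}
```

Two caveats apply. Keepalive is a per-connection setting, so enabling it on the Main-to-London04 sockets does not add probes to the client-to-Main connections. And a successful probe only shows that the peer's TCP stack is still answering; a backend process that is hung but whose host is up will keep passing it.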

P.S. any link to an RTFM is welcome - I might just be missing something obvious :)

Solution

My advice would be: don't use a heartbeat. Have your external-facing API return a 303 See Other with headers that indicate when and where the desired response might be available.

So you might call:

POST https://public.api/my/call

and get back

303 See Other
Location: https://public.api/my/call/results
Retry-After: 10

To the extent your server can guess how long a response will take to build, it should factor that into the Retry-After value. If a later GET is made to the new location and the results are not ready yet, return a response with an updated Retry-After value. So maybe you try 10 seconds, and if that isn't enough, you tell the client to wait another 110, which would be two minutes in total.
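A rough sketch of this flow, using only JDK classes (com.sun.net.httpserver.HttpServer for a stand-in server and java.net.http.HttpClient, Java 11+, for the caller). Several things here are assumptions for illustration rather than part of the advice above: the class name RetryAfterDemo, the use of 202 Accepted for the "not ready yet" response, a relative Location value, Retry-After given in seconds only, and the short timings.

```java
import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.net.InetSocketAddress;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

class RetryAfterDemo {

    /** Stand-in server: the result becomes available readyAfterMillis after the POST. */
    static HttpServer startServer(long readyAfterMillis) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
        long[] readyAt = {Long.MAX_VALUE};
        server.createContext("/my/call/results", ex -> {
            long remaining = readyAt[0] - System.currentTimeMillis();
            if (remaining > 0) {
                // Not ready: send an updated estimate, rounded up to whole seconds (202 is an assumption).
                ex.getResponseHeaders().set("Retry-After", String.valueOf((remaining + 999) / 1000));
                respond(ex, 202, "");
            } else {
                respond(ex, 200, "<result/>");
            }
        });
        server.createContext("/my/call", ex -> {
            readyAt[0] = System.currentTimeMillis() + readyAfterMillis;
            ex.getResponseHeaders().set("Location", "/my/call/results");
            ex.getResponseHeaders().set("Retry-After", "1");
            respond(ex, 303, "");
        });
        server.start();
        return server;
    }

    static void respond(HttpExchange ex, int status, String body) throws IOException {
        byte[] bytes = body.getBytes(StandardCharsets.UTF_8);
        // For HttpExchange, a response length of -1 means "no response body".
        ex.sendResponseHeaders(status, bytes.length == 0 ? -1 : bytes.length);
        if (bytes.length > 0) {
            try (OutputStream out = ex.getResponseBody()) {
                out.write(bytes);
            }
        }
        ex.close();
    }

    /** Posts the request, then polls the Location, sleeping for Retry-After seconds before each poll. */
    static String callAndPoll(URI postUri, int maxPolls) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newBuilder().version(HttpClient.Version.HTTP_1_1).build();
        HttpResponse<String> resp = client.send(
                HttpRequest.newBuilder(postUri).POST(HttpRequest.BodyPublishers.ofString("<request/>")).build(),
                HttpResponse.BodyHandlers.ofString());
        if (resp.statusCode() != 303) {
            throw new IOException("expected 303, got " + resp.statusCode());
        }
        URI results = postUri.resolve(resp.headers().firstValue("Location").orElseThrow());
        for (int i = 0; i < maxPolls; i++) {
            long waitSeconds = resp.headers().firstValue("Retry-After").map(Long::parseLong).orElse(1L);
            Thread.sleep(waitSeconds * 1000);
            resp = client.send(HttpRequest.newBuilder(results).GET().build(),
                    HttpResponse.BodyHandlers.ofString());
            if (resp.statusCode() == 200) {
                return resp.body();
            }
        }
        throw new IOException("result not ready after " + maxPolls + " polls");
    }

    /** Runs the whole exchange against a local server and returns the final response body. */
    static String demo(long readyAfterMillis) {
        HttpServer server = null;
        try {
            server = startServer(readyAfterMillis);
            URI uri = URI.create("http://127.0.0.1:" + server.getAddress().getPort() + "/my/call");
            return callAndPoll(uri, 5);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            if (server != null) {
                server.stop(0);
            }
        }
    }
}
```

With this shape, every individual HTTP exchange is short, so a tight read timeout on each call stays meaningful: a poll that hangs or fails points to a dead or unreachable backend rather than one that is merely slow.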

Alternatively, use a protocol that's designed to stay open for long periods of time, such as WebSockets.