Can we reliably keep HTTP/S connection open for a long time?
Asked Answered
C

6

6

My team maintains an application (written in Java) which processes long-running batch jobs. These jobs need to be run in a defined sequence. Hence, the application starts a socket server on a pre-defined port to accept job execution requests. It keeps the socket open until the job completes (with success or failure). This way the job scheduler knows when one job ends, and upon successful completion of the job it triggers the next job in the pre-defined sequence. If the job fails, the scheduler sends out an alert.

This is a setup we have had for over a decade. We have some jobs which run for a few minutes and others which take a couple of hours (depending on the volume) to complete. The setup has worked without any issues.

Now we need to move this application to a container platform (Red Hat OpenShift Container Platform), and the infra policy in place allows only the default HTTPS port to be exposed. The scheduler sits outside OCP and cannot access any port other than the default HTTPS port.

In theory, we could use HTTPS, set the client timeout to a very large duration, and try to mimic the current setup with the TCP socket. But would this setup be reliable enough, given that the HTTP protocol is designed to serve short-lived requests?

Chaeta answered 14/1, 2023 at 9:20 Comment(6)
HTTP itself will do that just fine (it is something layered on top of TCP), but in my experience, HTTP servers, clients, and middleware like load balancers, firewalls and proxies might be configured to terminate HTTP connections after a few minutes. It might be better to have the HTTP request submit the job, return a job-id, and let the client poll for completion of that job.Musick
Client polling is definitely something we have thought of, but we feel it isn't very elegant. What would the polling interval be? Job completion time may vary from a couple of seconds to hours. If you keep the interval small, it might be overkill for long-running jobs; if you keep it large, it would delay results for short-running jobs. Also, since there will be multiple instances (pods) of the batch service, each request will hit a different pod, so the batch service will need to store the job completion result in some persistent storage, adding another layer of complexity.Chaeta
I have absolutely no practical experience with them, but what about web sockets? I believe they use the same ports as HTTP(S). And as I understand it, they are designed for long-lived connections with two-way communication. Again, I don't know if they're appropriate to your use case, but thought I'd just throw the idea out there.Disagree
@Disagree This is an interesting idea and could potentially work. It will require us to do some R&D to make it work with the existing load balancers in place (a potential problem, like Mark mentioned in the first comment).Chaeta
It sounds a bit weird. What if a computer goes to sleep? It seems like there are lots of ways this task can fail especially if you're running a task for hours. When it does fail, do you expect it to just start over?Alabama
You need some state. In the past the state was in the connection that was kept open; it worked, you were not using that state directly, and it was prone to all sorts of issues. Once you move your workload to OpenShift you have multiple abstractions added: probably an ingress controller in the cluster, and very likely a load balancer in front of the cluster. Maybe also Cloudflare (which by default will terminate connections after some time as well). Like someone mentioned, all of them would have to support what you need. Options: polling (easy & dumb) or a more async architecture with callbacks when the job is done.Severn
P
2

There isn't a reliable way to keep a connection alive for a long period over the internet, because of the nodes (routers, load balancers, proxies, NAT gateways, etc.) that may be sitting between your client and server. They might drop the connection mid-stream under load, some of them will happily ignore your HTTP keep-alive request, and others have an internal maximum connection duration that will kill long-running TCP connections. You may find it works for you today, but there is no guarantee it will work for you tomorrow.

So you'll probably need to submit the job as a short-lived request and check the status via other means:

  • Push-based strategy: send a webhook URL as part of the job submission and have the server call it (possibly with retries) on job completion to notify interested parties.
  • Pull-based strategy: have the server return a job ID on submission, then have the client check periodically. Given how much your job durations vary, you may want to implement this with some form of exponential backoff up to a certain limit; for example, first check after waiting 2 seconds, then wait 4 seconds before the next check, then 8 seconds, and so on, up to the maximum time you are happy to wait between checks (see the sketch after this list). That way you find out about short job completions sooner and don't check too frequently for long jobs.
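A minimal Java sketch of the pull-based approach, assuming a hypothetical status endpoint GET /jobs/{id} that returns a plain-text status such as RUNNING, COMPLETED or FAILED (the endpoint, payload and the 5-minute backoff cap are illustrative, not something from your setup):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

public class JobPoller {

    private static final HttpClient CLIENT = HttpClient.newBuilder()
            .connectTimeout(Duration.ofSeconds(10))
            .build();

    // Polls GET <baseUrl>/jobs/<jobId> until the body is COMPLETED or FAILED,
    // doubling the wait between checks from 2 seconds up to a 5-minute cap.
    static String awaitCompletion(String baseUrl, String jobId) throws Exception {
        long waitMillis = 2_000;            // first check after 2 seconds
        final long maxWaitMillis = 300_000; // never wait more than 5 minutes between checks

        while (true) {
            Thread.sleep(waitMillis);

            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create(baseUrl + "/jobs/" + jobId))
                    .timeout(Duration.ofSeconds(30))   // each poll is a short-lived request
                    .GET()
                    .build();
            HttpResponse<String> response =
                    CLIENT.send(request, HttpResponse.BodyHandlers.ofString());

            String status = response.body().trim();
            if (status.equals("COMPLETED") || status.equals("FAILED")) {
                return status;
            }
            waitMillis = Math.min(waitMillis * 2, maxWaitMillis); // exponential backoff
        }
    }
}
```

Short jobs are noticed within a few seconds, while a job that runs for hours only costs you one request every five minutes.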
Prussia answered 23/1, 2023 at 8:56 Comment(0)
R
1

When you worked with sockets and the TCP protocol you were in control of how long to keep connections open. With HTTP you are only in control of logical connections, not physical ones. The actual connections are controlled by the OS, and usually IT people can configure all those timeouts. By default, though, even when you close a logical connection the real connection is not closed, in anticipation of the next communication. It is closed by the OS and not controlled by your code. However, even if it closes and your next request comes after that, it is reopened transparently to you. So it doesn't really matter whether it was closed or not; it should be transparent to your code. In short, I assume you can move to HTTP/HTTPS with no problems, but you will have to test and see.
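For what it's worth, a minimal sketch of the asker's "very large timeout" idea using the standard java.net.http.HttpClient; the endpoint is hypothetical, and as other answers note, intermediaries may still cut the connection well before the logical timeout expires:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

public class LongRequestExample {
    public static void main(String[] args) throws Exception {
        // The client manages (and reuses) the underlying TCP connections itself;
        // the code only sees logical request/response pairs.
        HttpClient client = HttpClient.newHttpClient();

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://batch.example.internal/jobs/run")) // hypothetical endpoint
                .timeout(Duration.ofHours(3)) // logical timeout; middleware may still drop the connection sooner
                .POST(HttpRequest.BodyPublishers.ofString("job-42"))
                .build();

        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println("Job finished with: " + response.body());
    }
}
```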

Also, for other options on server-to-client communication you can look at my answer to this question: How to continues send data from backend to frontend when something changes

Ramp answered 16/1, 2023 at 10:54 Comment(0)
M
1

We have had bad experiences with long-standing HTTP/HTTPS connections. We used to schedule short jobs (only a couple of minutes) via HTTP and wait for them to finish and send a response. This worked fine until the jobs got longer (hours) and some network infrastructure closed the inactive connections. We ended up only submitting the request via HTTP, getting an immediate response, and then implementing polling to wait for the result. At the time, the migration was pretty quick for us, but since then we have migrated it even further to use "webhooks", i.e. allowing the processor of the job to signal its state back to the server using a known webhook address.
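A rough sketch of such a webhook notification from the job processor's side, again with Java's built-in HttpClient; the callback URL, JSON payload and retry count are illustrative assumptions rather than details from our setup:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

public class WebhookNotifier {

    private static final HttpClient CLIENT = HttpClient.newHttpClient();

    // Called by the job processor when a job finishes; POSTs the final state to the
    // callback URL supplied with the original job submission, retrying a few times.
    static void notifyCompletion(String callbackUrl, String jobId, String state) {
        String payload = "{\"jobId\":\"" + jobId + "\",\"state\":\"" + state + "\"}";
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(callbackUrl))
                .timeout(Duration.ofSeconds(30))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(payload))
                .build();

        for (int attempt = 1; attempt <= 3; attempt++) {
            try {
                HttpResponse<Void> response =
                        CLIENT.send(request, HttpResponse.BodyHandlers.discarding());
                if (response.statusCode() < 300) {
                    return; // delivered
                }
            } catch (Exception e) {
                // network error: fall through and retry
            }
            try {
                Thread.sleep(attempt * 2_000L); // simple backoff between retries
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
                return;
            }
        }
        // after the last failed attempt, the event could be persisted for later redelivery
    }
}
```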

Menken answered 23/1, 2023 at 12:16 Comment(0)
I
0

IMHO, you should improve your scheduler to use a REST API; a WebSocket isn't effective in this scenario, as the connection would be inactive most of the time

Irbm answered 22/1, 2023 at 13:11 Comment(0)
A
0

The jobs can be short-lived or long-running. So, when a long-running job fails in the middle, how does the restart of the job happen? Does it start from the beginning again?

In a similar scenario, we had a database to keep track of the progress of the job (number of records successfully processed), so the jobs could resume after a failure. With such a design, another web service can monitor the status of the job by looking at the database, so the main process is not impacted by constant polling from the client.
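A minimal JDBC sketch of that idea, assuming a hypothetical job_progress table; the table, column names and statuses are made up for illustration:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class JobProgressStore {

    // Assumed table: job_progress(job_id VARCHAR PRIMARY KEY,
    //                             records_processed BIGINT, status VARCHAR)

    // The job executor updates its progress as it works through the records,
    // so a restarted job can resume from the last recorded count.
    static void recordProgress(Connection db, String jobId, long recordsProcessed) throws Exception {
        try (PreparedStatement ps = db.prepareStatement(
                "UPDATE job_progress SET records_processed = ?, status = 'RUNNING' WHERE job_id = ?")) {
            ps.setLong(1, recordsProcessed);
            ps.setString(2, jobId);
            ps.executeUpdate();
        }
    }

    // A separate status web service (or the scheduler) reads the same row,
    // so the running job is never interrupted by status queries.
    static String currentStatus(Connection db, String jobId) throws Exception {
        try (PreparedStatement ps = db.prepareStatement(
                "SELECT status, records_processed FROM job_progress WHERE job_id = ?")) {
            ps.setString(1, jobId);
            try (ResultSet rs = ps.executeQuery()) {
                if (rs.next()) {
                    return rs.getString("status") + " (" + rs.getLong("records_processed") + " records)";
                }
                return "UNKNOWN";
            }
        }
    }
}
```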

Arian answered 23/1, 2023 at 3:16 Comment(0)
A
0

How about the Job Scheduler posting a message to a request queue with a correlation id, and the job executor taking its own time to execute and posting a message to a different response queue with the same correlation id? The Job Scheduler can wake up when a message arrives in the response queue and then, based on the correlation id, figure out the next job and post it again to the request queue.
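A hedged sketch of that request/response-queue pattern using plain JMS 2.0; the queue names, payloads and status strings are assumptions, and any broker (ActiveMQ, IBM MQ, etc.) would work similarly:

```java
import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.MessageConsumer;
import javax.jms.MessageProducer;
import javax.jms.Queue;
import javax.jms.Session;
import javax.jms.TextMessage;

public class QueueBasedScheduler {

    // Submits a job to the request queue and waits, without holding an HTTP connection open,
    // for the matching completion message on the response queue.
    static String runJob(ConnectionFactory factory, String jobName, String correlationId) throws Exception {
        try (Connection connection = factory.createConnection()) {
            connection.start();
            Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);

            Queue requestQueue = session.createQueue("job.requests");
            Queue responseQueue = session.createQueue("job.responses");

            // Post the job request, stamped with the correlation id.
            TextMessage request = session.createTextMessage(jobName);
            request.setJMSCorrelationID(correlationId);
            try (MessageProducer producer = session.createProducer(requestQueue)) {
                producer.send(request);
            }

            // Wait only for the response carrying the same correlation id;
            // the executor posts it whenever the job finishes, minutes or hours later.
            String selector = "JMSCorrelationID = '" + correlationId + "'";
            try (MessageConsumer consumer = session.createConsumer(responseQueue, selector)) {
                TextMessage response = (TextMessage) consumer.receive(); // blocks until the reply arrives
                return response.getText(); // e.g. "SUCCESS" or "FAILED"; the scheduler then posts the next job
            }
        }
    }
}
```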

Arterialize answered 27/6, 2023 at 18:29 Comment(0)
