What should the HTTP Status Code of a Degraded Health Check Be?

4

16

I have a health check endpoint at /status that returns the following status codes and response bodies:

  • Healthy - 200 OK
  • Degraded - ?
  • Unhealthy - 503 Service Unavailable

What should the HTTP status code be for a degraded response? A 'degraded' check is one that succeeded but was slow or unstable. Which HTTP status code makes the most sense?

Collett answered 31/12, 2018 at 8:38 Comment(11)
I don't think your question makes sense. You need to decide what HTTP GET of /status should do. – Let
Exactly, so what should I decide? What HTTP status code makes the most sense? – Collett
You may decide and document that such a GET of /status returns some JSON data. The format and interpretation of that data are your decision. – Let
What do you believe your choices to be? If it's working we use 200 and return additional information if necessary. Really, it's up to you. – Shelf
Yes I get that it's up to me. I'm here for guidance, you know...what StackOverflow is here for. I think my choices are 200, 503, 202 Accepted or something else. – Collett
@MuhammadRehanSaeed Return a custom code within the 2xx Success range that is not already taken by the known/common codes, similar to some of the unofficial codes not supported by any standard. For example, 218 This is fine (Apache Web Server). – Mcgregor
@MuhammadRehanSaeed I also found this: tools.ietf.org/html/draft-inadarei-api-health-check-00 – Mcgregor
@Mcgregor That is a really nice find. Thank you! – Collett
@MuhammadRehanSaeed I hope you check the more recent version. They also suggest: "In case of the “warn” status, endpoints MUST return HTTP status in the 2xx-3xx range, and additional information SHOULD be provided, utilizing optional fields of the response." Here the warn status means "healthy, with some concerns", which I believe aligns closely with your model. – Mcgregor
The spec is not as prescriptive as I'd have hoped, but you should definitely add that as an answer instead of a comment. – Collett
It also provided some insight into one of your other questions, about dependencies on other components/microservices. – Mcgregor
13

The most suitable HTTP status code for a "Degraded" status response from a health endpoint is nothing other than 200 OK.

I say this because I can't find any better code in the official Hypertext Transfer Protocol (HTTP) Status Code Registry maintained by IANA, pointed to by [RFC7231] HTTP/1.1: Semantics and Content. Unofficial codes should be avoided, because they only make your API more difficult to understand.
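As a rough illustration of that recommendation, here is a minimal Python sketch that maps the three states from the question to status codes. The names `Health`, `STATUS_CODES` and `health_response`, and the shape of the body, are hypothetical and not part of any framework; the only point is that "Degraded" shares 200 OK with "Healthy" and the body carries the distinction.

```python
from enum import Enum
from http import HTTPStatus
from typing import Dict, Tuple


class Health(Enum):
    """The three states named in the question."""
    HEALTHY = "Healthy"
    DEGRADED = "Degraded"
    UNHEALTHY = "Unhealthy"


# Degraded reuses 200 OK; only the response body tells it apart from Healthy.
STATUS_CODES = {
    Health.HEALTHY: HTTPStatus.OK,
    Health.DEGRADED: HTTPStatus.OK,
    Health.UNHEALTHY: HTTPStatus.SERVICE_UNAVAILABLE,
}


def health_response(health: Health) -> Tuple[int, Dict[str, str]]:
    """Return (HTTP status code, JSON-serialisable body) for a health state."""
    return int(STATUS_CODES[health]), {"status": health.value}
```

A monitoring client that only looks at the status code would treat a degraded service as up, while one that parses the body can still raise a warning.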

You should design your APIs so that they become easy to use. Resource names, HTTP verbs, status codes, etc. should be more or less self-explanatory, so that people who already know "the REST language" can immediately understand how to use your API without having to decipher vague names or unusual status codes. Which brings me to the next part of my answer...

Other comments on your design

The most natural way to interpret a 5xx response to any request is that the operation in question failed.

So a 503 Service Unavailable response to a GET /status request means that the status checking operation itself failed. Such a response would only be useful if we can be certain that /status is a health endpoint, as pointed out in the API Health Check draft referred to in Nkosi's answer:

A health endpoint is only meaningful in the context of the component it indicates the health of. It has no other meaning or purpose. As such, its health is a conduit to the health of the component. Clients SHOULD assume that the HTTP response code returned by the health endpoint is applicable to the entire component (e.g. a larger API or a microservice).

But with a URL path of just /status, it is not completely obvious that this really is a health endpoint. From looking at the URL, we only know that it returns information about the status of something, but we can't really be sure what that "something" is.

Since you're also telling us that yes, it is in fact a health endpoint, I must suggest that you change the name to health. I would also suggest placing it under some base path, e.g. /things/health, to make it more clear which component it indicates the health of.

If, on the other hand, /status were actually a resource of its own, i.e. something that represents the status of some other component/thing (as its name currently suggests), then 200 OK would be the only reasonable status for successful invocations, even if the thing that it indicates the status of is "Unhealthy". In that case, a 5xx would mean that no status could be obtained, and details in the response payload would be assumed to be related to a failure in the /status service itself.

So be careful with how you name things and what status codes you use!

Bergess answered 8/1, 2019 at 17:49 Comment(0)
3

Consider returning a custom code within the 2xx Success range that is not already taken by the known/common status codes, similar to some of the unofficial codes not supported by any standard.

For example 218 This is fine (Apache Web Server)

Used as a catch-all error condition for allowing response bodies to flow through Apache when ProxyErrorOverride is enabled. When ProxyErrorOverride is enabled in Apache, response bodies that contain a status code of 4xx or 5xx are automatically discarded by Apache in favor of a generic response or a custom response specified by the ErrorDocument directive.

After doing some research, I came across a draft:

Health Check Response Format for HTTP APIs: draft-inadarei-api-health-check-03

The authors make a similar suggestion:

In case of the “warn” status, endpoints MUST return HTTP status in the 2xx-3xx range, and additional information SHOULD be provided, utilizing optional fields of the response.

where the warn status in the draft is healthy, with some concerns, which I believe aligns closely to your desired model.
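To make the draft's suggestion concrete, here is a hedged Python sketch that builds a "warn" response. The function name `build_health_response` and the exact codes chosen (200 for pass and warn, 503 for fail) are assumptions; the draft only constrains the range. The `notes` field is my reading of the draft's optional fields, so check the draft text before relying on it.

```python
import json
from typing import Dict, List, Tuple

# Assumed mapping: the draft only requires 2xx-3xx for "warn";
# the specific codes here are a choice, not a requirement.
HTTP_CODE_FOR = {"pass": 200, "warn": 200, "fail": 503}


def build_health_response(status: str, notes: List[str]) -> Tuple[int, Dict[str, str], str]:
    """Return (status code, headers, JSON body) for a draft-style health response."""
    if status not in HTTP_CODE_FOR:
        raise ValueError("unknown health status: " + status)
    body = {"status": status}
    if notes:
        # Optional extra detail; field name assumed from the draft.
        body["notes"] = notes
    headers = {"Content-Type": "application/health+json"}
    return HTTP_CODE_FOR[status], headers, json.dumps(body)
```

For example, `build_health_response("warn", ["database latency above threshold"])` yields a 200 response whose body still tells the caller that something needs attention.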

While not definitive, I believe it provides some ideas to help with the eventual design.

Mcgregor answered 3/1, 2019 at 9:1 Comment(1)
I contacted the author of the draft over Twitter (see twitter.com/RehanSaeedUK/status/1081121474667253760?s=20). His response was basically to refer to the HTTP RFC (which isn't much help) and avoid unofficial status codes. While not a complete answer, your input is valuable, so thank you! – Collett
2

I would be wary of splitting hairs like this on a healthcheck on the upstream server side. The service providing the healthcheck should be lightly (and concurrently) testing all its upstream dependencies based on its own set of policies or rules - request timeouts, connection failures and so on. In reality the healthcheck either works or it doesn't and the application shouldn't really need to be keeping track of the results of the healthcheck (other than capturing metrics about what happened). IMHO a stateful healthcheck is a recipe for disaster.

I typically use the following interface for application healthchecks:

204 - No Content, everything is working within tolerances

500 - Something failed, and here are some details in the response about what went wrong
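A minimal Python sketch of that two-state interface, assuming each upstream dependency is represented by a callable that raises on failure. The name `run_healthcheck` and the `checks` dictionary are hypothetical. The checks run concurrently and each result is awaited with a timeout.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Tuple


def run_healthcheck(
    checks: Dict[str, Callable[[], None]],
    timeout_seconds: float = 2.0,
) -> Tuple[int, Dict[str, str]]:
    """Run all checks concurrently; return (204, {}) or (500, failure details)."""
    if not checks:
        return 204, {}
    failures: Dict[str, str] = {}
    pool = ThreadPoolExecutor(max_workers=len(checks))
    try:
        futures = {name: pool.submit(check) for name, check in checks.items()}
        for name, future in futures.items():
            try:
                # Waits up to timeout_seconds for this result; a timeout
                # is treated the same as any other failure.
                future.result(timeout=timeout_seconds)
            except Exception as exc:
                failures[name] = type(exc).__name__ + ": " + str(exc)
    finally:
        # Do not block the response on checks that are still running.
        pool.shutdown(wait=False)
    return (500, failures) if failures else (204, {})
```

The caller gets no state back beyond the single answer, which matches the stateless approach described above; anything longer-lived belongs in metrics.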

Where it gets tricky depends on your architecture. You may have a VIP or reverse proxy that is interpreting this response and deciding if a given node is healthy or not, in which case it's going to either route the request to a healthy node or return a 503 Service Unavailable. This decision is going to be made on some policy basis - x healthcheck requests failed over a y time period across z upstream services.
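One way such a policy could look, sketched in Python for a single node: mark it unhealthy once x failures have occurred within the last y seconds. The class name `FailureWindow` and its parameters are hypothetical; real proxies and load balancers expose this through their own configuration rather than code like this.

```python
from collections import deque
from typing import Deque


class FailureWindow:
    """Tracks failed healthchecks for one node over a sliding time window."""

    def __init__(self, max_failures: int, window_seconds: float) -> None:
        self.max_failures = max_failures      # the "x" in the policy
        self.window_seconds = window_seconds  # the "y" in the policy
        self._failures: Deque[float] = deque()

    def record(self, succeeded: bool, now: float) -> None:
        """Record one healthcheck result observed at time `now` (seconds)."""
        if not succeeded:
            self._failures.append(now)

    def is_healthy(self, now: float) -> bool:
        """True while fewer than max_failures fall inside the window."""
        # Drop failures that are older than the window.
        while self._failures and now - self._failures[0] > self.window_seconds:
            self._failures.popleft()
        return len(self._failures) < self.max_failures
```

Passing `now` in explicitly keeps the sketch easy to test; in practice it would come from a monotonic clock.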

If you use a mesh then everyone can feed data back to the service registry to keep the health state up to date and it can be based on actual service calls rather than a healthcheck.

The client is perfectly placed to make a decision based on the health of the services it depends on, as it can keep track of the various responses from each service. Circuit breakers are an excellent way to handle that and can do it continuously on actual requests rather than just on the healthcheck. Circuit breaker libraries (such as resilience4j) will do this for you at the cost of setting up some policies about how many failed/slow requests constitute a bad service. Service registries like Netflix Eureka can help with discovery and ongoing monitoring.
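To show the idea, here is a small hand-rolled circuit breaker in Python. It is not resilience4j's API (that library is for Java), and the names `CircuitBreaker`, `CircuitOpenError`, `failure_threshold` and `reset_seconds` are made up for illustration. It counts only consecutive failures and is not thread-safe.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when a call is skipped because the circuit is open."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 3,
        reset_seconds: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_seconds = reset_seconds
        self._clock = clock
        self._consecutive_failures = 0
        self._opened_at: Optional[float] = None  # None means closed

    def call(self, func: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_seconds:
                raise CircuitOpenError("circuit open; call skipped")
            # Reset period elapsed: let this call through as a trial.
        try:
            result = func()
        except Exception:
            self._consecutive_failures += 1
            trial_failed = self._opened_at is not None
            if trial_failed or self._consecutive_failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and clears the failure count.
        self._consecutive_failures = 0
        self._opened_at = None
        return result
```

A client would wrap each real request, for example `breaker.call(lambda: fetch_orders())` where `fetch_orders` is whatever function performs the request, so the health decision is driven by actual traffic rather than a separate healthcheck.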

Middleaged answered 7/1, 2019 at 5:53 Comment(0)
0

Assuming you are referring to the status code of a service's liveness/healthcheck endpoint: to distinguish a degraded response from 200 OK, a 203 seems applicable, in line with the following:

HTTP/1.1 203 Non-Authoritative Information
Warning: 199 - "FooBar Warning Details"
Content-Type: application/health+json
Cache-Control: max-age=10
Connection: close

{"status": "warn"}
Fletcher answered 8/3, 2020 at 18:6 Comment(0)
