Backend keep-alive

If you see sporadic 503s on traffic Apoxy routes to your backend - and your backend pods aren't restarting or under load - the cause is almost always an HTTP keep-alive idle mismatch between your server and Apoxy's edge Envoy.

Symptom

Intermittent 503 responses on otherwise-healthy requests, often correlated with irregular request cadences (webhooks, cron-driven calls, low-traffic APIs). In Envoy access logs the response flag is UC:

access log entry (JSON)

{
  "envoy_response_flags": "UC",
  "http_request_duration_ms": "54",
  "http_response_status_code": "503"
}

UC means upstream connection termination - the backend sent an RST mid-request rather than a response. The short duration is the giveaway: Envoy wrote the request onto a pooled connection and got reset back immediately.
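To pull likely hits of this pattern out of a log stream, something like the following could work. It is a minimal sketch that assumes one JSON object per line with the three fields shown above; the function name and the 100 ms cutoff are illustrative choices, not Apoxy defaults.

```python
import json

def idle_race_candidates(lines, max_duration_ms=100):
    """Return parsed entries that look like the keep-alive race: 503 + UC + short duration."""
    hits = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        # Envoy joins multiple response flags with commas, so split before matching.
        flags = entry.get("envoy_response_flags", "").split(",")
        status = str(entry.get("http_response_status_code", ""))
        # Entries without a duration are treated as "not short" and skipped.
        duration_ms = float(entry.get("http_request_duration_ms", "inf"))
        if "UC" in flags and status == "503" and duration_ms <= max_duration_ms:
            hits.append(entry)
    return hits

sample = [
    '{"envoy_response_flags": "UC", "http_request_duration_ms": "54", "http_response_status_code": "503"}',
    '{"envoy_response_flags": "-", "http_request_duration_ms": "12", "http_response_status_code": "200"}',
]
print(len(idle_race_candidates(sample)))  # prints 1
```

Raise the cutoff if your backend sits behind a slower network path; the point is only to separate immediate resets from requests that failed after real work.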

Why it happens

HTTP/1.1 keep-alive connection pooling has a fundamental race:

1. The backend closes (FIN) a pooled connection because its keep-alive idle timer expired.
2. A new request arrives at Envoy a few milliseconds later, before Envoy's event loop has processed the FIN.
3. Envoy picks the (now half-closed) connection from the pool and writes the request.
4. The OS happily accepts the write() - half-closed sockets are still writable. The backend has no app state for the request and responds with an RST.
5. Envoy logs UC.

This race exists no matter how fast Envoy is. The fix is to make sure Envoy is the side that closes idle connections, not your backend.
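The "half-closed sockets are still writable" step can be reproduced on loopback. The sketch below is not Envoy's code path; it uses a plain TCP socket pair where one side plays the backend, and the host name in the request is a placeholder. It only demonstrates that a write issued after the peer's FIN is still accepted by the OS.

```python
import socket

def write_after_peer_close():
    """Return (peer_closed, bytes_accepted) for a write made after the peer sent FIN."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))   # any free loopback port
    listener.listen(1)
    client = socket.create_connection(listener.getsockname(), timeout=2.0)
    backend_side, _ = listener.accept()
    try:
        backend_side.close()          # "backend" idle timer fires: FIN is sent
        # An empty peek result means the FIN has already reached the client socket.
        peer_closed = client.recv(1, socket.MSG_PEEK) == b""
        request = b"GET / HTTP/1.1\r\nHost: backend.test\r\n\r\n"
        # The kernel still accepts the write; the reset only comes back afterwards.
        bytes_accepted = client.send(request)
        return peer_closed, bytes_accepted
    finally:
        client.close()
        listener.close()

peer_closed, bytes_accepted = write_after_peer_close()
print(peer_closed, bytes_accepted > 0)
```

Here the client even knows the peer has closed before it writes, and the write still goes through. A proxy that has not yet processed the FIN has no chance of avoiding it.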

What Apoxy already does

Edge Envoy is tuned to take the close side in most cases:

Upstream cluster idle timeout: 60s. Envoy initiates close on idle pooled connections after 60s, before any reasonably configured backend will.
TCP keepalive on upstream sockets: SO_KEEPALIVE on, TCP_KEEPIDLE=30s, TCP_KEEPINTVL=10s. Detects dead peers (network partition, peer crash) within ~2 minutes even when there's no in-flight request.
Single transparent retry on RST for idempotent methods. GET, HEAD, PUT, DELETE, OPTIONS, and TRACE retry one time on UC. POST and PATCH don't, since they may have already been processed upstream.
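The "~2 minutes" figure for dead-peer detection follows from the keepalive settings. A quick worked check, assuming the probe count is left at the Linux default of 9 (net.ipv4.tcp_keepalive_probes), which is not stated above:

```python
def dead_peer_detection_seconds(keepidle_s, keepintvl_s, keepcnt):
    """Worst-case time from last traffic until the kernel declares the peer dead."""
    # First probe after keepidle_s of silence, then keepcnt unanswered probes
    # spaced keepintvl_s apart before the connection is dropped.
    return keepidle_s + keepintvl_s * keepcnt

# TCP_KEEPIDLE=30 and TCP_KEEPINTVL=10 come from the settings above;
# keepcnt=9 is the assumed Linux default.
print(dead_peer_detection_seconds(30, 10, 9))  # prints 120
```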

Configure your backend to keep idle connections open for at least 90 seconds and Envoy will reliably close first.

Why 90?

It sits just above Apoxy's 60s cluster idle timeout, with margin. nginx (keepalive_timeout 75) and Cloudflare (900s edge, recommending ≥300 at origin) follow the same convention - the intermediary directly in front of the server should be the one to initiate close, never the server.

Recommended server settings

terminal (sh)

uvicorn app:app --timeout-keep-alive 90

Uvicorn's default is 5 seconds - far too short. This is the single most common cause of UC on Apoxy-routed FastAPI deployments.
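If you start Uvicorn from Python rather than from the command line, the same setting is available as the timeout_keep_alive argument to uvicorn.run. A minimal sketch follows; the ASGI app is a stand-in for your own app:app, and the host and port are placeholder values.

```python
async def app(scope, receive, send):
    """Minimal ASGI app standing in for app:app."""
    if scope["type"] != "http":
        return
    await send({
        "type": "http.response.start",
        "status": 200,
        "headers": [(b"content-type", b"text/plain")],
    })
    await send({"type": "http.response.body", "body": b"ok"})

def serve():
    """Run the app with a 90s keep-alive idle timeout (call this from your entrypoint)."""
    import uvicorn  # third-party; imported lazily so this module loads without it
    uvicorn.run(app, host="0.0.0.0", port=8000, timeout_keep_alive=90)
```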

Gunicorn's --keep-alive flag is a no-op when the worker class is uvicorn.workers.UvicornWorker (uvicorn owns HTTP, not gunicorn). Pass the value through an environment variable instead:

deployment.yaml (YAML)

env:
  - name: UVICORN_TIMEOUT_KEEP_ALIVE
    value: "90"

terminal (sh)

gunicorn app:app --keep-alive 90 --worker-class gthread --threads 8

Gunicorn's default is 2 seconds. Note that sync workers don't really support keep-alive - use gthread, gevent, or eventlet.
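The same settings can live in a Gunicorn config file instead of on the command line. This sketch mirrors the flags above one-for-one; the file name gunicorn.conf.py is Gunicorn's default config location in the working directory, and you can also pass it explicitly with -c.

```python
# gunicorn.conf.py - equivalent of:
#   gunicorn app:app --keep-alive 90 --worker-class gthread --threads 8
keepalive = 90            # seconds to hold an idle keep-alive connection open
worker_class = "gthread"  # threaded worker; sync workers don't really support keep-alive
threads = 8               # threads per worker process
```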

nginx.conf (nginx)

keepalive_timeout 90s;

The nginx default of 75s is borderline; bump to 90s for headroom.

server.js (JS)

server.keepAliveTimeout = 90_000;  // milliseconds
server.headersTimeout = 95_000;    // must be > keepAliveTimeout

Node's default keepAliveTimeout is 5 seconds.

main.go (Go)

srv := &http.Server{
    IdleTimeout: 90 * time.Second,
}

If IdleTimeout is unset, Go falls back to ReadTimeout, which is also unset by default, so idle connections stay open indefinitely. Setting an explicit IdleTimeout gives you predictable behavior.
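As a quick reference, the defaults quoted in this section can be checked against Apoxy's 60s cluster idle timeout and the 90s recommendation. This is an illustrative sketch: the category names are made up for the example, and treating an exact tie at 60s as a race is an assumption.

```python
import math

APOXY_CLUSTER_IDLE_S = 60   # Envoy upstream idle timeout stated earlier
RECOMMENDED_MIN_S = 90      # recommended backend minimum stated earlier

# Defaults quoted in this section; Go's unset IdleTimeout means no timeout.
DEFAULT_IDLE_S = {"uvicorn": 5, "gunicorn": 2, "nginx": 75, "node": 5, "go": math.inf}

def classify(idle_s):
    """Classify a backend idle timeout relative to the proxy's."""
    if idle_s <= APOXY_CLUSTER_IDLE_S:
        return "races"        # backend closes before (or alongside) Envoy
    if idle_s < RECOMMENDED_MIN_S:
        return "borderline"   # Envoy closes first, but with thin margin
    return "ok"

for name, idle_s in DEFAULT_IDLE_S.items():
    print(f"{name}: {classify(idle_s)}")
```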

Kubernetes pod lifecycle

If your backend runs in Kubernetes, the keep-alive setting interacts with pod termination:

deployment.yaml (YAML)

spec:
  terminationGracePeriodSeconds: 45    # > preStop sleep + graceful-timeout + slack
  containers:
  - name: app
    lifecycle:
      preStop:
        exec:
          command: ["sleep", "5"]      # let endpoint removal propagate

Why preStop sleep

Without preStop sleep, new requests can land on a pod that's already received SIGTERM - endpoint removal from EndpointSlice and SIGTERM happen in parallel and kube-proxy reconciliation is asynchronous across nodes. You'll see UC regardless of keep-alive tuning until the endpoint removal propagates.
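The grace period has to cover the preStop sleep as well as the server's own drain time, because Kubernetes starts the termination countdown before the preStop hook runs. A small budget check, assuming a 30s drain (Gunicorn's default --graceful-timeout) and 5s of slack, neither of which is stated in the manifest:

```python
def grace_period_ok(termination_grace_s, pre_stop_sleep_s, graceful_timeout_s, slack_s=5):
    """True if the pod has time to sleep, drain, and exit before SIGKILL."""
    return termination_grace_s >= pre_stop_sleep_s + graceful_timeout_s + slack_s

# 45s grace and 5s preStop sleep come from the manifest above;
# the 30s graceful timeout and 5s slack are assumptions.
print(grace_period_ok(45, 5, 30))   # prints True
print(grace_period_ok(30, 5, 30))   # prints False
```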

Disabling Apoxy's default retry

The single-attempt idempotent retry is on by default on every HTTPRoute. To turn it off or replace it, set spec.rules[].retry on the route - any non-nil block overrides our default:

httproute.yaml (YAML)

apiVersion: gateway.apoxy.dev/v1
kind: HTTPRoute
spec:
  rules:
    - retry:
        attempts: 0          # no retries on anything
      backendRefs: [...]

To keep retries but apply them to all methods (including POST/PATCH):

httproute.yaml (YAML)

spec:
  rules:
    - retry:
        attempts: 2          # retries fire on all methods, not just idempotent
        backoff: 200ms
      backendRefs: [...]

When you opt in, the idempotent scoping is dropped

A user-supplied retry block is honored as-given. We don't second-guess your idempotency assumptions, but it also means retries will fire on POST/PATCH - make sure your handlers are idempotent or that duplicate processing is acceptable.
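The override rule can be modeled in a few lines. This is a sketch of the behavior described here, not Apoxy's implementation; the function name and the dict shape for the retry block are hypothetical.

```python
IDEMPOTENT_METHODS = {"GET", "HEAD", "PUT", "DELETE", "OPTIONS", "TRACE"}

def retries_allowed(method, route_retry=None):
    """Retries a request may get on UC, given the route's retry block (None = not set)."""
    if route_retry is None:
        # Default policy: one retry, idempotent methods only.
        return 1 if method.upper() in IDEMPOTENT_METHODS else 0
    # Any user-supplied block replaces the default and applies to every method.
    return route_retry["attempts"]

print(retries_allowed("GET"))                     # prints 1
print(retries_allowed("POST"))                    # prints 0
print(retries_allowed("POST", {"attempts": 2}))   # prints 2
print(retries_allowed("GET", {"attempts": 0}))    # prints 0
```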

Verifying the fix

After tuning, check your access logs for UC response flags. The rate should drop to near-zero for normal traffic. Any residual UC is most likely:

A backend crash or OOMKill (check pod restarts).
A network event (check tunnel diag if traffic is tunneled).
A POST/PATCH that hit the race - idempotent methods are auto-retried; non-idempotent surface to the client by design.
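To compare before and after, it helps to turn the logs into a single number. A minimal sketch, assuming JSON-lines access logs with the envoy_response_flags field shown earlier; the function name is illustrative.

```python
import json

def uc_rate(lines):
    """Fraction of access-log entries whose response flags include UC."""
    total = 0
    uc = 0
    for line in lines:
        if not line.strip():
            continue
        entry = json.loads(line)
        total += 1
        # Envoy joins multiple response flags with commas.
        if "UC" in entry.get("envoy_response_flags", "").split(","):
            uc += 1
    return uc / total if total else 0.0

logs = ['{"envoy_response_flags": "UC"}'] + ['{"envoy_response_flags": "-"}'] * 3
print(uc_rate(logs))  # prints 0.25
```

Run it over a comparable time window before and after the change so that traffic volume does not skew the comparison.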

If you still see frequent UC after setting server keep-alive ≥90s, open a support ticket and include a few sample access log lines plus your server's keep-alive configuration.