Backend keep-alive
Tune your upstream server's HTTP keep-alive idle timeout so Apoxy doesn't see intermittent 503s on traffic that races a backend-initiated close.
If you see sporadic 503s on traffic Apoxy routes to your backend — and your backend pods aren't restarting or under load — the cause is almost always an HTTP keep-alive idle mismatch between your server and Apoxy's edge Envoy.
Symptom
Intermittent 503 responses on otherwise-healthy requests, often correlated with irregular request cadences (webhooks, cron-driven calls, low-traffic APIs). In Envoy access logs the response flag is UC:
{
  "envoy_response_flags": "UC",
  "http_request_duration_ms": "54",
  "http_response_status_code": "503"
}
UC means upstream connection termination: the backend sent an RST mid-request rather than a response. The short duration is the giveaway. Envoy wrote the request onto a pooled connection and got a reset back immediately.
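To pick these entries out of a larger log, a small filter can help. The following is a minimal sketch that assumes one JSON object per line with the three fields shown above. The function names and the 100 ms cutoff for a "short" duration are illustrative assumptions, not Apoxy defaults.

```python
import json


def is_keepalive_race(entry: dict, max_duration_ms: int = 100) -> bool:
    """Return True if an entry matches the UC + 503 + short-duration pattern."""
    flags = str(entry.get("envoy_response_flags", "")).split(",")
    status = str(entry.get("http_response_status_code", ""))
    duration_ms = int(entry.get("http_request_duration_ms", 0))
    return "UC" in flags and status == "503" and duration_ms <= max_duration_ms


def find_race_candidates(lines, max_duration_ms: int = 100) -> list:
    """Parse JSON log lines and keep the entries that look like the race."""
    hits = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        entry = json.loads(line)
        if is_keepalive_race(entry, max_duration_ms):
            hits.append(entry)
    return hits


sample = [
    '{"envoy_response_flags": "UC", "http_request_duration_ms": "54", "http_response_status_code": "503"}',
    '{"envoy_response_flags": "-", "http_request_duration_ms": "12", "http_response_status_code": "200"}',
]
print(len(find_race_candidates(sample)))  # prints 1
```

A UC with a long duration is more likely a backend crash or timeout than this race, which is why the sketch filters on duration as well as the flag.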
Why it happens
HTTP/1.1 keep-alive connection pooling has a fundamental race:
- Backend closes (FIN) a pooled connection because its keep-alive idle timer expired.
- A new request arrives at Envoy a few milliseconds later, before Envoy's event loop has processed the FIN.
- Envoy picks the (now half-closed) connection from the pool and writes the request.
- The OS happily accepts the write(), because half-closed sockets are still writable. The backend has no app state for the request and responds with an RST.
- Envoy logs UC.
This race exists no matter how fast Envoy is. The fix is to make sure Envoy is the side that closes idle connections, not your backend.
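The steps above can be reproduced on loopback with plain sockets. This is only a sketch of the TCP behavior, not of Envoy's connection pool: the function name is hypothetical, the "backend" is a listener that closes right after accepting, and the final outcome depends on timing and OS (on Linux the read usually fails with a reset, but it can also report a clean end-of-stream).

```python
import socket
import time


def write_after_peer_close(request: bytes = b"GET / HTTP/1.1\r\nHost: example\r\n\r\n"):
    """Write to a connection the peer already closed; return (bytes_sent, outcome)."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)

    # Proxy side: an idle, pooled connection to the backend.
    client = socket.create_connection(listener.getsockname())

    # Backend side: the keep-alive idle timer expires, so it closes (sends FIN).
    backend_conn, _ = listener.accept()
    backend_conn.close()
    time.sleep(0.1)  # the FIN has arrived, but the proxy has not acted on it yet

    try:
        # The write still succeeds locally: a half-closed socket is writable.
        sent = client.send(request)
        client.settimeout(2.0)
        try:
            data = client.recv(4096)
            outcome = "eof" if data == b"" else "data"
        except ConnectionResetError:
            outcome = "reset"  # the backend's kernel answered the write with RST
    finally:
        client.close()
        listener.close()
    return sent, outcome


sent, outcome = write_after_peer_close()
print(f"sent {sent} bytes, then got: {outcome}")
```

The point of the demo is that `send()` reports success even though no application will ever read the request, so the failure only becomes visible one step later.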
What Apoxy already does
Edge Envoy is tuned to take the close side in most cases:
- Upstream cluster idle timeout: 60s. Envoy initiates close on idle pooled connections after 60s, before any reasonably configured backend will.
- TCP keepalive on upstream sockets: SO_KEEPALIVE on, TCP_KEEPIDLE=30s, TCP_KEEPINTVL=10s. Detects dead peers (network partition, peer crash) within ~2 minutes even when there's no in-flight request.
- Single transparent retry on RST for idempotent methods. GET, HEAD, PUT, DELETE, OPTIONS, and TRACE retry one time on UC. POST and PATCH don't, since they may have already been processed upstream.
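The "~2 minutes" figure for dead-peer detection follows from the keepalive settings if the probe count is left at the Linux default of 9 (TCP_KEEPCNT). That count is an assumption here, since it is not listed above. A quick worked calculation:

```python
def dead_peer_detection_seconds(keepidle_s: int, keepintvl_s: int, keepcnt: int) -> int:
    """Worst-case time to declare an idle peer dead with TCP keepalive.

    The first probe goes out after keepidle_s of idleness; the connection is
    dropped after keepcnt unanswered probes spaced keepintvl_s apart.
    """
    return keepidle_s + keepintvl_s * keepcnt


# TCP_KEEPIDLE=30s and TCP_KEEPINTVL=10s come from the list above;
# keepcnt=9 is the assumed Linux default, not a documented Apoxy setting.
print(dead_peer_detection_seconds(30, 10, 9))  # prints 120
```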
Configure your backend to keep idle connections open for at least 90 seconds and Envoy will reliably close first.
Why 90 seconds: it is just above Apoxy's 60s cluster idle timeout, with margin. nginx (keepalive_timeout 75) and Cloudflare (900s edge, recommending ≥300 at origin) follow the same convention: the upstream-most intermediary should be the one to initiate close, never the server.
Recommended server settings
uvicorn app:app --timeout-keep-alive 90
Uvicorn's default is 5 seconds, which is far too short. This is the single most common cause of UC on Apoxy-routed FastAPI deployments.
Gunicorn's --keep-alive flag is a no-op when the worker class is uvicorn.workers.UvicornWorker (uvicorn owns HTTP, not gunicorn). Pass through via env var:
env:
  - name: UVICORN_TIMEOUT_KEEP_ALIVE
    value: "90"
gunicorn app:app --keep-alive 90 --worker-class gthread --threads 8
Gunicorn's default is 2 seconds. Note that sync workers don't really support keep-alive; use gthread, gevent, or eventlet.
keepalive_timeout 90s;
The nginx default of 75s is borderline; bump to 90s for headroom.
server.keepAliveTimeout = 90_000; // milliseconds
server.headersTimeout = 95_000; // must be > keepAliveTimeout
Node's default keepAliveTimeout is 5 seconds.
srv := &http.Server{
    IdleTimeout: 90 * time.Second,
}
Go's IdleTimeout falls back to ReadTimeout if unset, and ReadTimeout is also unset by default, so idle connections stay open indefinitely. Setting an explicit IdleTimeout gives you predictable behavior.
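As a quick sanity check, the defaults quoted in this section can be compared against Apoxy's 60s idle timeout and the 90s recommendation. The sketch below is illustrative only: the three labels and the `assess` helper are assumptions made for this example, and the numbers are the ones stated on this page.

```python
import math

ENVOY_IDLE_S = 60       # Apoxy upstream cluster idle timeout, per this page
RECOMMENDED_MIN_S = 90  # recommended backend keep-alive idle, per this page

# Server defaults as quoted on this page, in seconds.
DEFAULT_IDLE_S = {
    "uvicorn": 5,
    "gunicorn": 2,
    "nginx": 75,
    "node": 5,
    "go (IdleTimeout and ReadTimeout unset)": math.inf,  # never closes idle connections
}


def assess(backend_idle_s: float) -> str:
    """Classify a backend idle timeout relative to Envoy's 60s idle close."""
    if backend_idle_s <= ENVOY_IDLE_S:
        return "backend closes first"  # exposed to the race described above
    if backend_idle_s < RECOMMENDED_MIN_S:
        return "borderline"            # Envoy closes first, but with little margin
    return "ok"


for server, idle_s in DEFAULT_IDLE_S.items():
    print(f"{server}: {assess(idle_s)}")
```

Go's unset default lands in "ok" only in the sense that the backend never closes first; an explicit value is still the more predictable choice.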
Kubernetes pod lifecycle
If your backend runs in Kubernetes, the keep-alive setting interacts with pod termination:
spec:
  terminationGracePeriodSeconds: 45  # > graceful-timeout + slack
  containers:
    - name: app
      lifecycle:
        preStop:
          exec:
            command: ["sleep", "5"]  # let endpoint removal propagate
Without the preStop sleep, new requests can land on a pod that has already received SIGTERM. Endpoint removal from EndpointSlice and SIGTERM happen in parallel, and kube-proxy reconciliation is asynchronous across nodes. You'll see UC regardless of keep-alive tuning until the endpoint removal propagates.
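The grace period has to cover both the preStop sleep and the server's own graceful shutdown, because Kubernetes starts the termination countdown before the preStop hook runs. A small budget check, assuming a 30-second server graceful timeout (Gunicorn's default for --graceful-timeout; substitute your own server's value):

```python
def grace_period_slack(termination_grace_s: int, pre_stop_sleep_s: int,
                       graceful_timeout_s: int) -> int:
    """Seconds left over after the preStop sleep and graceful shutdown.

    A negative result means the kubelet may SIGKILL the container while
    in-flight requests are still draining.
    """
    return termination_grace_s - (pre_stop_sleep_s + graceful_timeout_s)


# 45 and 5 come from the manifest above; 30 is an assumed server graceful timeout.
print(grace_period_slack(45, 5, 30))  # prints 10
```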
Disabling Apoxy's default retry
The single-attempt idempotent retry is on by default on every HTTPRoute. To turn it off or replace it, set spec.rules[].retry on the route — any non-nil block overrides our default:
apiVersion: gateway.apoxy.dev/v1
kind: HTTPRoute
spec:
  rules:
    - retry:
        attempts: 0  # no retries on anything
      backendRefs: [...]
To keep retries but apply them to all methods (including POST/PATCH):
spec:
  rules:
    - retry:
        attempts: 2  # retries fire on all methods, not just idempotent
        backoff: 200ms
      backendRefs: [...]
A user-supplied retry block is honored as-given. We don't second-guess your idempotency assumptions, but it also means retries will fire on POST/PATCH. Make sure your handlers are idempotent or that duplicate processing is acceptable.
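The difference between the default and an explicit retry block can be summarized as a small decision function. This is a toy model of the behavior described in this section, not Apoxy's implementation; `retries_on_uc` is a hypothetical name, and `None` stands in for a route with no spec.rules[].retry block.

```python
from typing import Optional

# Methods the default retry applies to, as listed earlier on this page.
IDEMPOTENT_METHODS = frozenset({"GET", "HEAD", "PUT", "DELETE", "OPTIONS", "TRACE"})


def retries_on_uc(method: str, retry_attempts: Optional[int] = None) -> int:
    """How many times a request would be retried after a UC, per this page.

    retry_attempts=None models a route without a retry block (the default);
    any explicit value models a user-supplied block, honored for all methods.
    """
    if retry_attempts is None:
        return 1 if method.upper() in IDEMPOTENT_METHODS else 0
    return retry_attempts


print(retries_on_uc("GET"))      # prints 1
print(retries_on_uc("POST"))     # prints 0
print(retries_on_uc("POST", 2))  # prints 2
print(retries_on_uc("GET", 0))   # prints 0
```

The third case is the one to watch: once a block is supplied, non-idempotent requests are retried too.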
Verifying the fix
After tuning, check your access logs for UC response flags. The rate should drop to near-zero for normal traffic. Any residual UC is most likely:
- A backend crash or OOMKill (check pod restarts).
- A network event (check tunnel diag if traffic is tunneled).
- A POST/PATCH that hit the race. Idempotent methods are auto-retried; non-idempotent requests surface to the client by design.
If you still see frequent UC after setting server keep-alive ≥90s, open a support ticket and include a few sample access log lines plus your server's keep-alive configuration.