#kubernetes (2023-05)
Archive: https://archive.sweetops.com/kubernetes/

Hi everyone, I want to create a production-level, highly available Kubernetes cluster on premises. What should I take care of? Can you guide me or point me to some resources to read?

The etcd certificate will expire after a while, so you will need to renew it manually

If anyone is willing to field an Istio/Envoy question in here:
I have an issue where all services are behind the same Gateway (Istio 1.13.x) attached to a GCP LB, and the call chain svcA -> svcB -> svcC -> svcB -> svcD results in downstream_remote_disconnect in the istio-proxy logs for svcB. When I clone svcB in another namespace behind a different VirtualService and do svcA -> svcB -> svcC -> svcB(clone) -> svcD, I get the expected 200. Is there anything obvious as to why this is the case?
I am happy to provide more details as needed.
[2023-05-16T22:15:56.839Z] "GET /uri?query=test HTTP/1.1" 0 DC downstream_remote_disconnect - "-" 0 0 119998 - "34.29.X.X, 34.111.X.X,35.191.X.X" "axios/0.27.2" "53b3a438-a3dc-4df7-86a9-7229048d993e" "svcB.stage.XXX.com" "10.88.X.X:8080" inbound|8080|| 10.88.X.X:8080 35.191.X.X:0 outbound_.8080_._.svcB.svcB.svc.cluster.local default
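
For anyone reading that log line later, here is a minimal sketch that pulls out its leading fields. It assumes the line follows the field order of Istio's default access log format for the first ten fields (start time, request, response code, response flags, response code details, and so on); the names `LEADING_FIELDS` and `parse_leading_fields` are made up for this illustration, and a cluster with a customized log format would need a different list.

```python
import shlex

# Assumed order of the first ten fields in Istio's default access log format.
LEADING_FIELDS = [
    "start_time",
    "request",
    "response_code",
    "response_flags",
    "response_code_details",
    "connection_termination_details",
    "upstream_transport_failure_reason",
    "bytes_received",
    "bytes_sent",
    "duration_ms",
]


def parse_leading_fields(line):
    """Split on whitespace, keeping quoted fields together, and name the first ten."""
    tokens = shlex.split(line)
    return dict(zip(LEADING_FIELDS, tokens))


# The log line pasted in the thread, unchanged.
LOG_LINE = (
    '[2023-05-16T22:15:56.839Z] "GET /uri?query=test HTTP/1.1" 0 DC '
    'downstream_remote_disconnect - "-" 0 0 119998 - '
    '"34.29.X.X, 34.111.X.X,35.191.X.X" "axios/0.27.2" '
    '"53b3a438-a3dc-4df7-86a9-7229048d993e" "svcB.stage.XXX.com" '
    '"10.88.X.X:8080" inbound|8080|| 10.88.X.X:8080 35.191.X.X:0 '
    'outbound_.8080_._.svcB.svcB.svc.cluster.local default'
)

entry = parse_leading_fields(LOG_LINE)
duration_ms = int(entry["duration_ms"])
timeout_ms = 120 * 1000  # the 120s timeout mentioned later in the thread

print(entry["response_code"], entry["response_flags"], entry["response_code_details"])
print("ms short of the 120s timeout:", timeout_ms - duration_ms)
```

Read this way, the request carried response code 0 with the DC flag (downstream connection termination) and lasted 119998 ms, which is within a couple of milliseconds of 120s. That may hint that the caller gave up at its own timeout rather than that the proxy rejected the request, but it is only a reading of one log line.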

Is there a NetworkPolicy in the stack?

No defined NetworkPolicy

any logs in Envoy container?

The one I pasted above with the downstream_remote_disconnect. What other logs would you be looking for?

Istio has its own proxy pod or container, doesn't it?

That was from the sidecar container istio-proxy

oh sorry it is lol

Do the new pods in the other namespace run on the same node or a different node?

It is either a security group rule issue, or you need to SSH into the node and take a look

Let me check. It is definitely possible it is on another node

svcB and svcB(clone) are actually on the same node

any difference between VirtualServices?

only name?

Only the 3rd level domain

What is the output of describe?

Host: svcB
Number: 8080
Timeout: 120s

Very straightforward

any events or status in outputs?


Title: What are possible scenarios where we get downstream_remote_disconnect response?
We are using envoy proxy to route requests based on header value to respective upstreams. While doing performance testing, a few of our requests fail with response code details downstream_remote_disconnect. We wanted to understand when we can experience this.

This is set when the stream is terminated due to a downstream FIN. If you're encountering a situation where that detail is set and a wireshark trace shows downstream isn't sending the FIN we'd be happy to look into it, but by default we'd assume it's client-caused at which point there's not much we can do to diagnose what's going on :-)
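
To make "downstream FIN" concrete, here is a small loopback sketch, not Envoy itself: a client sends a request and closes its socket before any response arrives, and the server side observes the close as an empty read. The function name `server_sees_fin` and the request bytes are invented for this illustration; Envoy's own detection logic is more involved than this.

```python
import socket
import threading


def server_sees_fin(timeout=5.0):
    """Return True if the server reads EOF (the client's FIN) before it has replied."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
    srv.listen(1)
    port = srv.getsockname()[1]
    result = {}

    def serve():
        conn, _ = srv.accept()
        with conn:
            conn.settimeout(timeout)
            try:
                # Keep reading without replying, like a slow upstream.
                while True:
                    chunk = conn.recv(1024)
                    if not chunk:
                        # An empty read means the peer closed its side (FIN).
                        result["fin"] = True
                        break
            except OSError:
                result["fin"] = False

    worker = threading.Thread(target=serve)
    worker.start()

    client = socket.create_connection(("127.0.0.1", port), timeout=timeout)
    client.sendall(b"GET /uri HTTP/1.1\r\nHost: example\r\n\r\n")
    client.close()  # the client gives up first, which sends a FIN

    worker.join(timeout)
    srv.close()
    return result.get("fin", False)


if __name__ == "__main__":
    print("server observed client FIN:", server_sees_fin())
```

In the thread's scenario the "client" would be whatever sits in front of svcB's sidecar for that hop, which is why a packet capture on the svcB pod is the usual way to see who actually sent the FIN.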

It is the same source code, so the client and container image artifacts are identical. I did see that ticket but that can’t be accurate in my case

I am guessing there is a routing issue based on a call back to svcB after the initial downstream but I don’t see anything to support that or to resolve it

if the error was seen in svcB’s pods, there may be more logs in either C or D

C was sending back timeouts that I have set to 120s

But only in that flow. C does not send any timeouts outside of that specific call

is iptables used?

And I can test that call to the C pod from local and from Apollo without timeout

seems to run into a corner case

how about sending requests to C and D?

No issue other than that specific flow

or use tcpdump in B pods

Yeah I was going to escalate this to Wireshark but I was trying to avoid that

or try Cilium Mesh

which supports a sidecarless mode

Yeah, and Traefik Mesh too. Istio is releasing their own sidecarless option as well

How about Linkerd?

I’d rather not throw away Istio for this one potential bug

they rewrote 2.0 with Rust

yeah, that is a huge change

do you use CNI in the stack?

the default GKE CNI

oh, if GKE Dataplane V2 is used, note that it is implemented using Cilium

The legacy dataplane for GKE is implemented using Calico

Yes I believe it’s Calico

ok, v1 -> v2 may be worth a test

or there may be some leftover policies in the namespace