#kubernetes (2023-05)
Archive: https://archive.sweetops.com/kubernetes/
2023-05-03
Hi everyone, I want to create a production-grade, highly available Kubernetes cluster on premises. What are the things I should take care of? Can you guide me or point me to some resources I can read up on?
The etcd certificates will expire after a while; you'll need to renew them manually
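For a kubeadm-managed control plane, a minimal sketch of checking and renewing those certificates (assumes kubeadm >= 1.20; run on each control-plane node):

  # Show expiry dates for all control-plane certificates, including etcd
  kubeadm certs check-expiration

  # Renew everything, or a single certificate such as etcd-server
  kubeadm certs renew all
  kubeadm certs renew etcd-server

Afterwards the static control-plane pods (kube-apiserver, etcd, etc.) need a restart to pick up the new certificates. Note that kubeadm also renews all certificates automatically whenever you run kubeadm upgrade.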
2023-05-08
2023-05-17
If anyone is willing to field an Istio/Envoy question in here: I have an issue where all services are behind the same Gateway (Istio 1.13.x) attached to a GCP LB, and the calls are svcA -> svcB -> svcC -> svcB -> svcD, resulting in downstream_remote_disconnect in the istio-proxy logs for svcB. When I clone svcB into another namespace behind a different VirtualService and do svcA -> svcB -> svcC -> svcB(clone) -> svcD, I get the expected 200. Is there anything obvious as to why this is the case? I am happy to provide more details as needed.
[2023-05-16T22:15:56.839Z] "GET /uri?query=test HTTP/1.1" 0 DC downstream_remote_disconnect - "-" 0 0 119998 - "34.29.X.X, 34.111.X.X,35.191.X.X" "axios/0.27.2" "53b3a438-a3dc-4df7-86a9-7229048d993e" "svcB.stage.XXX.com" "10.88.X.X:8080" inbound|8080|| 127.0.0.6:46205 10.88.X.X:8080 35.191.X.X:0 outbound_.8080_._.svcB.svcB.svc.cluster.local default
Is there network policy in the stack?
No defined NetworkPolicy
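For anyone double-checking the same thing, a quick sketch (the Istio resources only exist if the Istio CRDs are installed):

  # List NetworkPolicies across all namespaces
  kubectl get networkpolicy --all-namespaces

  # Istio-level restrictions worth ruling out too
  kubectl get authorizationpolicy,peerauthentication --all-namespaces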
any logs in the Envoy container?
The one I pasted above with the downstream_remote_disconnect. What other logs would you be looking for?
doesn't Istio have proxy pods or containers?
That was from the sidecar container istio-proxy
oh sorry it is lol
Do the new pods in the other namespace run on the same node or a different one?
Either it's a security group rule issue, or you need to SSH into the node and take a look
Let me check. It is definitely possible it is on another node
svcB and svcB(clone) are actually on the same node
any difference between VirtualServices?
only name?
Only the 3rd level domain
What are the outputs of describe for them?
Spec:
  Gateways:
    istio-system/istio-default-external
  Hosts:
    svcB.stg.XXX.com
  Http:
    Route:
      Destination:
        Host: svcB
        Port:
          Number: 8080
    Timeout: 120s
Very straightforward
any events or status in outputs?
<none>
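For context, a sketch of what the clone's VirtualService presumably looks like, given that only the third-level domain differs (name, namespace, and hostname here are hypothetical):

  apiVersion: networking.istio.io/v1beta1
  kind: VirtualService
  metadata:
    name: svcb-clone
    namespace: clone-ns              # hypothetical clone namespace
  spec:
    gateways:
      - istio-system/istio-default-external
    hosts:
      - svcb-clone.stg.XXX.com       # only the host differs from the original
    http:
      - route:
          - destination:
              host: svcB
              port:
                number: 8080
        timeout: 120s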
Title: What are possible scenarios where we get a downstream_remote_disconnect response?
Description:
We are using Envoy proxy to route requests to the respective upstreams based on a header value. While doing performance testing, a few of our requests fail with the response code detail downstream_remote_disconnect. We want to understand when we can experience this.
This is set when the stream is terminated due to a downstream FIN. If you're encountering a situation where that detail is set and a wireshark trace shows downstream isn't sending the FIN we'd be happy to look into it, but by default we'd assume it's client-caused at which point there's not much we can do to diagnose what's going on :-)
It is the same source code, so the client/container image artifacts are identical. I did see that ticket, but that can't be accurate in my case
I am guessing there is a routing issue with the call back to svcB after the initial downstream hop, but I don't see anything to support that or a way to resolve it
if the error was seen in svcB’s pods, there may be more logs in either C or D
C was sending back timeouts at the 120s limit I have set, which lines up with the 119998 ms duration in the access log above
But only in that flow. C does not send timeouts outside of that specific call
is iptables used?
And I can make that call to the C pod from my local machine and from Apollo without a timeout
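For reference, the local test was along these lines (sketch; namespace and request path are assumptions based on the access log above):

  # Forward svcC's port locally, then replay the call that times out in-mesh
  kubectl -n stage port-forward svc/svcC 8080:8080 &
  curl -s -o /dev/null -w '%{http_code} %{time_total}s\n' \
    'http://localhost:8080/uri?query=test'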
seems like it's hitting a corner case
how about sending requests to C and D?
No issue other than that specific flow
or use tcpdump in B pods
Yeah I was going to escalate this to Wireshark but I was trying to avoid that
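If it does come to that, a capture sketch that avoids SSHing into the node (assumes ephemeral debug containers are enabled on the cluster; pod name and namespace are hypothetical):

  # Attach a tcpdump-capable debug container to the svcB pod, sharing the
  # istio-proxy container's network namespace
  kubectl -n stage debug -it svcB-6d9f7c5b8-abcde \
    --image=nicolaka/netshoot --target=istio-proxy -- \
    tcpdump -i any -w /tmp/svcb.pcap 'tcp port 8080'

Opening the pcap in Wireshark should show which side actually sends the FIN, which is exactly the question the Envoy maintainers raised above.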
or try Cilium Mesh
which can support a non-sidecar mode
Yeah, and Traefik Mesh as well. Istio is releasing its own sidecarless option too
How about Linkerd?
I’d rather not throw away Istio for this one potential bug
they rewrote 2.0 in Rust
yeah, that is a huge change
do you use CNI in the stack?
the default GKE CNI
oh, if GKE Dataplane V2 is in use: Dataplane V2 is implemented using Cilium
The legacy GKE dataplane is implemented using Calico
Yes I believe it’s Calico
ok, v1 -> v2 may be worth a test
or there may be some leftover policies in the namespace
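Two quick checks for both suggestions (sketch; cluster name, zone, and namespace are hypothetical):

  # ADVANCED_DATAPATH indicates Dataplane V2 (Cilium); anything else means
  # the legacy Calico-based dataplane
  gcloud container clusters describe my-cluster --zone us-central1-a \
    --format='value(networkConfig.datapathProvider)'

  # Look for leftover policies in the service's namespace
  kubectl -n stage get networkpolicy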