References:

  • Cluster sizing: https://cloud.google.com/kubernetes-engine/docs/concepts/alias-ips#cluster_sizing
  • Services are only available within the cluster: https://cloud.google.com/kubernetes-engine/docs/how-to/alias-ips
  • GKE network planning: https://wdenniss.com/gke-network-planning
  • Best practices for pod IP allocation in GKE: https://cloud.google.com/blog/products/containers-kubernetes/best-practices-for-kubernetes-pod-ip-allocation-in-gke
  • Sharing an internal load balancer IP: https://cloud.google.com/kubernetes-engine/docs/how-to/internal-load-balancing#terraform
  • Gateway recipe (regional L7 ILB): https://github.com/GoogleCloudPlatform/gke-networking-recipes/tree/main/gateway/single-cluster/regional-l7-ilb
  • Node NAT (IP masquerade agent): https://cloud.google.com/kubernetes-engine/docs/how-to/ip-masquerade-agent
  • "GKE networking model doesn't allow IP addresses to be reused across the network. When you migrate to GKE, you must plan your IP address allocation to Reduce internal IP address usage in GKE." (https://cloud.google.com/kubernetes-engine/docs/concepts/network-overview)
  • "Combining multiple Ingress resources into a single Google Cloud load balancer is not supported." (https://cloud.google.com/kubernetes-engine/docs/concepts/ingress)

GKE IP Address Usage Demo

This repo contains a terraform configuration that demonstrates efficient use of RFC-1918 IP addresses with GKE kubernetes clusters. It is NOT meant to be an example of best practices (for example, in real use I would use Flux to apply kubernetes manifests instead of terraform, I would use Horizontal Pod Autoscaling, and I would use node pool autoscaling); rather, it is a contrived example of nearly minimal RFC-1918 IP address consumption.

TL;DR

  • Service IP addresses are not accessible outside a cluster (see https://cloud.google.com/kubernetes-engine/docs/how-to/alias-ips)
  • Pod IP addresses are, by default, not accessible outside a cluster (see https://cloud.google.com/kubernetes-engine/docs/how-to/ip-masquerade-agent)
  • Therefore, we can use non-RFC-1918 ranges for pods and services (this demo draws service IPs from the 100.64.0.0/10 shared address space and pod IPs from the 240.0.0.0/4 Class E space; see https://cloud.google.com/blog/products/containers-kubernetes/best-practices-for-kubernetes-pod-ip-allocation-in-gke)
  • This is recommended by Google (see https://cloud.google.com/kubernetes-engine/docs/concepts/network-overview)

What is spun up

The terraform configuration spins up:

  • A Compute Engine virtual machine for you to use to test gce-internal ingresses.
  • Cloud DNS for your (sub)domain
  • ExternalDNS for automatically creating DNS records for each ingress
  • 14 clusters

And on each cluster it spins up:

  • 2 nodes
  • 1 gateway
  • 12 HTTP Routes
  • 12 services
  • 24 pods

For a grand total of:

  • 28 nodes (and 1 user machine)
  • 14 gateways
  • 168 HTTP Routes (and 168 subdomains)
  • 168 services
  • 336 pods

All of this while only using 10.10.10.0/26 from the RFC-1918 space (64 addresses).

What do I need to provide

To use the terraform configuration, you will need:

  1. An already existing Google Cloud project (there is no need to set anything up in the project; terraform will handle all of that, but this configuration does not create a project for you).
  2. gcloud authenticated with an account that has access to that project.
  3. A (sub)domain that can have its nameservers pointed at Google Cloud DNS.

IP Address Allocations

References: https://cloud.google.com/kubernetes-engine/docs/concepts/alias-ips#cluster_sizing_secondary_range_pods and https://cloud.google.com/vpc/docs/subnets#valid-ranges

Purpose            CIDR            Notes
Node IP range      10.10.10.0/26   1 address per node, 1 address per gateway, 1 address per cluster (control plane private endpoint)
Service IP range   non-RFC-1918    Taken from the 100.64.0.0/10 shared address space in this demo; see the Explanation section below
Pod IP range       non-RFC-1918    Taken from the 240.0.0.0/4 (Class E) space in this demo; see the Explanation section below
Envoy proxy range  non-RFC-1918    Used by the GKE Gateway/ingress controller's managed Envoy proxies. Consumes a /24 per network
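
As a rough sketch of how this kind of allocation can be expressed in terraform (the resource names, region, and the non-RFC-1918 CIDRs below are illustrative assumptions, not the exact values used by this repo):

# Node subnet: the only range carved out of RFC-1918 space.
resource "google_compute_subnetwork" "nodes" {
  name          = "gke-nodes"                    # assumed name
  region        = "us-central1"                  # assumed region
  network       = google_compute_network.vpc.id  # assumed network resource
  ip_cidr_range = "10.10.10.0/26"

  # Secondary ranges for pods and services come from non-RFC-1918 space.
  secondary_ip_range {
    range_name    = "cluster1-pods"
    ip_cidr_range = "240.10.0.0/24"    # Class E space; illustrative prefix length
  }
  secondary_ip_range {
    range_name    = "cluster1-services"
    ip_cidr_range = "100.64.0.0/20"    # RFC 6598 shared space; illustrative prefix length
  }
}

# Proxy-only subnet used by the Envoy-based regional internal load balancers.
resource "google_compute_subnetwork" "proxy_only" {
  name          = "proxy-only"
  region        = "us-central1"
  network       = google_compute_network.vpc.id
  purpose       = "REGIONAL_MANAGED_PROXY"
  role          = "ACTIVE"
  ip_cidr_range = "100.64.255.0/24"    # one /24 per network; illustrative value
}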

What consumes RFC-1918 IP addresses

Thing                                Quantity consumed   Notes
Unusable addresses                   4 addresses         The first two and last two addresses of a subnet's primary IP range are reserved by Google Cloud.
Each node                            1 address           This example uses 2 nodes per cluster so the node count is visibly distinct from the cluster count.
The user-machine virtual machine     1 address           Not needed in a production deployment.
Each Gateway                         1 address           Can be as few as 1 per cluster.
The control plane private endpoint   1 address           1 per cluster.

With our 64 addresses from 10.10.10.0/26, we lose 4 as reserved addresses and use another for the user machine, leaving 59. Each cluster then consumes 4 addresses (2 nodes + 1 gateway + 1 control plane private endpoint), which means we can fit 14 clusters with 3 IP addresses left over.
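
The same arithmetic, written as terraform expressions (purely illustrative; the repo does not necessarily compute it this way):

locals {
  subnet_size           = 64  # 10.10.10.0/26
  reserved_by_gcp       = 4   # first two and last two addresses of the primary range
  user_machine          = 1
  addresses_per_cluster = 4   # 2 nodes + 1 gateway + 1 control plane private endpoint

  usable       = local.subnet_size - local.reserved_by_gcp - local.user_machine  # 59
  max_clusters = floor(local.usable / local.addresses_per_cluster)               # 14
  left_over    = local.usable % local.addresses_per_cluster                      # 3
}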

Usage

To apply the terraform, authenticate with the gcloud CLI tool:

gcloud auth application-default login

Then go into the terraform folder and apply the configuration. We need to apply the config in two phases, toggled by the cluster_exists variable, because the kubernetes terraform provider does not have native support for the Gateway API and the kubernetes_manifest terraform resource has a shortcoming that requires the cluster to exist at plan time.
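
The gating pattern looks roughly like this (a simplified sketch; the actual resource names, listener setup, and Gateway spec in the repo will differ):

variable "cluster_exists" {
  type    = bool
  default = false
}

# Gateway API objects are applied with kubernetes_manifest, which needs a
# reachable cluster at plan time, so they are only created once
# cluster_exists is flipped to true on the second apply.
resource "kubernetes_manifest" "gateway" {
  count = var.cluster_exists ? 1 : 0

  manifest = {
    apiVersion = "gateway.networking.k8s.io/v1beta1"
    kind       = "Gateway"
    metadata = {
      name      = "internal-gateway"   # assumed name
      namespace = "default"
    }
    spec = {
      gatewayClassName = "gke-l7-rilb"   # regional internal Application Load Balancer
      listeners = [{
        name     = "http"
        protocol = "HTTP"
        port     = 80
      }]
    }
  }
}

With that in mind, the two-phase apply looks like: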

terraform apply -var dns_root="k8sdemo.mydomain.example." -var quota_email="MrManager@mydomain.example" -var quota_justification="Explain why you need quotas increased here." -var cluster_exists=false
terraform apply -var dns_root="k8sdemo.mydomain.example." -var quota_email="MrManager@mydomain.example" -var quota_justification="Explain why you need quotas increased here." -var cluster_exists=true

Please note that this will exceed the default quotas on new Google Cloud projects. The terraform configuration will automatically put in requests for quota increases but they can take multiple days to be approved or denied. You should be able to fit 3 clusters in the default quota until then.

Please note that the kubernetes clusters will take a couple of extra minutes to get fully set up and running after the terraform apply command has finished. During this time, the clusters are getting IP addresses assigned to Gateway objects and updating DNS records via ExternalDNS.
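
For reference, each of those DNS names comes from the hostname on an HTTPRoute; a single route looks roughly like this (a simplified sketch with assumed names; ExternalDNS, using its Gateway HTTPRoute source, creates a record for the hostname once the Gateway has an address):

resource "kubernetes_manifest" "service1_route" {
  count = var.cluster_exists ? 1 : 0

  manifest = {
    apiVersion = "gateway.networking.k8s.io/v1beta1"
    kind       = "HTTPRoute"
    metadata = {
      name      = "service1"        # assumed name
      namespace = "default"
    }
    spec = {
      parentRefs = [{ name = "internal-gateway" }]   # assumed Gateway name
      hostnames  = ["service1.cluster1.k8sdemo.mydomain.example"]
      rules = [{
        backendRefs = [{ name = "service1", port = 80 }]
      }]
    }
  }
}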

This will spin up the kubernetes clusters and output some helpful information. One such piece of information is the set of nameservers for Google Cloud DNS. We need to point our (sub)domain at those nameservers. If you want to get the list of nameservers again without having to wait for terraform apply, you can run terraform output dns_name_servers.
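
That output is most likely just surfacing the managed zone's nameservers, along the lines of (zone resource name assumed):

output "dns_name_servers" {
  value = google_dns_managed_zone.k8sdemo.name_servers  # "k8sdemo" is an assumed resource name
}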

Personally, I run PowerDNS, so as an example, I would first clear the old NS records from previous runs from k8sdemo.mydomain.example (if you are setting this up for the first time you can skip this step):

pdnsutil delete-rrset mydomain.example k8sdemo NS

And then I'd add the new records (naturally, you should use the nameservers output by terraform; they will change each time you add the domain to Cloud DNS):

pdnsutil add-record mydomain.example k8sdemo NS 600 ns-cloud-a1.googledomains.com.
pdnsutil add-record mydomain.example k8sdemo NS 600 ns-cloud-a2.googledomains.com.
pdnsutil add-record mydomain.example k8sdemo NS 600 ns-cloud-a3.googledomains.com.
pdnsutil add-record mydomain.example k8sdemo NS 600 ns-cloud-a4.googledomains.com.

Give some time for DNS caches to expire and then you should be able to access service<num>.cluster<num>.k8sdemo.mydomain.example by connecting to the user-machine over ssh and using curl to hit the internal ingresses. First, get the gcloud command to ssh into the user-machine:

terraform output user_machine_ssh_command

Then ssh into the machine (your command will be different):

gcloud compute ssh --zone 'us-central1-c' 'user-machine' --project 'k8s-ip-demo-1aa0405a'

and hit the various ingresses on the various clusters:

curl service1.cluster1.k8sdemo.mydomain.example

Clean Up

Just like we did a 2-stage apply by toggling the cluster_exists variable, we will need to do a 2-stage destroy. First we tear down any kubernetes resources by running apply with the cluster_exists variable set to false. Then we can destroy the entire project.

terraform apply -var dns_root="k8sdemo.mydomain.example." -var quota_email="MrManager@mydomain.example" -var quota_justification="Explain why you need quotas increased here." -var cluster_exists=false
terraform destroy -var dns_root="k8sdemo.mydomain.example." -var quota_email="MrManager@mydomain.example" -var quota_justification="Explain why you need quotas increased here." -var cluster_exists=false

Explanation

To conserve the RFC-1918 address space, we need to take advantage of two facts:

  1. Service IP addresses aren't real
  2. Pod IP addresses do not need to leave the cluster (and by default they do not on GKE)

Service IP Addresses

Service IP addresses are a fiction created by kubernetes: they are not routable from outside the cluster, and packets sent to them are never written to the wire. When a pod sends a packet to a service IP address, the packet is intercepted by iptables rules that DNAT it to a backing pod's IP address. We can see this on our GKE cluster by connecting to the compute engine instance for a node over ssh and inspecting its iptables rules.

gcloud compute ssh --zone 'us-central1-f' 'gke-cluster1-cluster1-pool-9d7804fe-fl8w' --project 'k8s-ip-demo-90bdaee2'

First, we look at the PREROUTING chain:

$ sudo /sbin/iptables --table nat --list PREROUTING
Chain PREROUTING (policy ACCEPT)
target     prot opt source               destination
KUBE-SERVICES  all  --  anywhere             anywhere             /* kubernetes service portals */
DNAT       tcp  --  anywhere             metadata.google.internal  tcp dpt:http-alt /* metadata-concealment: bridge traffic to metadata server goes to metadata proxy */ to:169.254.169.252:987
DNAT       tcp  --  anywhere             metadata.google.internal  tcp dpt:http /* metadata-concealment: bridge traffic to metadata server goes to metadata proxy */ to:169.254.169.252:988

That is sending all our traffic to the KUBE-SERVICES chain:

$ sudo /sbin/iptables --table nat --list KUBE-SERVICES
Chain KUBE-SERVICES (2 references)
target     prot opt source               destination
KUBE-SVC-XBBXYMVKK37OV7LG  tcp  --  anywhere             100.64.28.70         /* gmp-system/gmp-operator:webhook cluster IP */ tcp dpt:https
KUBE-SVC-GQKLSXF4KTGNIMSQ  tcp  --  anywhere             100.64.28.107        /* default/service11 cluster IP */ tcp dpt:http
KUBE-SVC-AI5DROXYLCYX27ZS  tcp  --  anywhere             100.64.11.22         /* default/service5 cluster IP */ tcp dpt:http
KUBE-SVC-F4AADAVBSY5MPKOB  tcp  --  anywhere             100.64.12.233        /* default/service6 cluster IP */ tcp dpt:http
KUBE-SVC-NPX46M4PTMTKRN6Y  tcp  --  anywhere             100.64.0.1           /* default/kubernetes:https cluster IP */ tcp dpt:https
KUBE-SVC-XP4WJ6VSLGWALMW5  tcp  --  anywhere             100.64.25.226        /* kube-system/default-http-backend:http cluster IP */ tcp dpt:http
KUBE-SVC-TCOU7JCQXEZGVUNU  udp  --  anywhere             100.64.0.10          /* kube-system/kube-dns:dns cluster IP */ udp dpt:domain
KUBE-SVC-QMWWTXBG7KFJQKLO  tcp  --  anywhere             100.64.7.174         /* kube-system/metrics-server cluster IP */ tcp dpt:https
KUBE-SVC-3ISFTUHJIYANB2XG  tcp  --  anywhere             100.64.9.63          /* default/service4 cluster IP */ tcp dpt:http
KUBE-SVC-T467R3VJHOQP3KAJ  tcp  --  anywhere             100.64.8.240         /* default/service9 cluster IP */ tcp dpt:http
KUBE-SVC-ERIFXISQEP7F7OF4  tcp  --  anywhere             100.64.0.10          /* kube-system/kube-dns:dns-tcp cluster IP */ tcp dpt:domain
KUBE-SVC-JOVDIF256A6Q5HDW  tcp  --  anywhere             100.64.16.250        /* default/service8 cluster IP */ tcp dpt:http
KUBE-SVC-E7SFLZD2Y2FAKTSV  tcp  --  anywhere             100.64.16.205        /* default/service2 cluster IP */ tcp dpt:http
KUBE-SVC-OA62VCLUSJYXZDQQ  tcp  --  anywhere             100.64.16.149        /* default/service10 cluster IP */ tcp dpt:http
KUBE-SVC-SAREEPXIBVBCS5LQ  tcp  --  anywhere             100.64.8.122         /* default/service12 cluster IP */ tcp dpt:http
KUBE-SVC-MVJGFDRMC5WIL772  tcp  --  anywhere             100.64.6.210         /* default/service7 cluster IP */ tcp dpt:http
KUBE-SVC-4RM6KDP54NYR4K6S  tcp  --  anywhere             100.64.22.23         /* default/service1 cluster IP */ tcp dpt:http
KUBE-SVC-Y7ZLLRVMCD5M4HRL  tcp  --  anywhere             100.64.12.22         /* default/service3 cluster IP */ tcp dpt:http
KUBE-NODEPORTS  all  --  anywhere             anywhere             /* kubernetes service nodeports; NOTE: this must be the last rule in this chain */ ADDRTYPE match dst-type LOCAL

This chain matches packets destined for each service IP address and sends them to the corresponding per-service chain. For service1, it matches packets destined for 100.64.22.23, which happens to be the service IP address of service1:

$ kubectl --kubeconfig /bridge/git/kubernetes_ip_demo/output/kubeconfig/cluster1.yaml get svc service1
NAME       TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)   AGE
service1   ClusterIP   100.64.22.23   <none>        80/TCP    34m

So it's matching packets destined for service1 and sending them to KUBE-SVC-4RM6KDP54NYR4K6S:

$ sudo /sbin/iptables --table nat --list KUBE-SVC-4RM6KDP54NYR4K6S
Chain KUBE-SVC-4RM6KDP54NYR4K6S (1 references)
target     prot opt source               destination
KUBE-MARK-MASQ  tcp  -- !240.10.0.0/24        100.64.22.23         /* default/service1 cluster IP */ tcp dpt:http
KUBE-SEP-XCTUYJ3QDWA727EN  all  --  anywhere             anywhere             /* default/service1 -> 240.10.0.24:8080 */ statistic mode random probability 0.50000000000
KUBE-SEP-5LQWHS2W6LUXXNGL  all  --  anywhere             anywhere             /* default/service1 -> 240.10.0.25:8080 */

This is how kubernetes load balances services: iptables rules on the machine opening the connection randomly distribute connections across the service's pods (note the statistic mode random probability 0.5 rule above, which picks the first endpoint half the time and falls through to the second otherwise). If we take a look at the chain for the first pod:

$ sudo /sbin/iptables --table nat --list KUBE-SEP-XCTUYJ3QDWA727EN
Chain KUBE-SEP-XCTUYJ3QDWA727EN (1 references)
target     prot opt source               destination
KUBE-MARK-MASQ  all  --  240.10.0.24          anywhere             /* default/service1 */
DNAT       tcp  --  anywhere             anywhere             /* default/service1 */ tcp to:240.10.0.24:8080

This corresponds to one of our pod IP addresses:

$ kubectl --kubeconfig /bridge/git/kubernetes_ip_demo/output/kubeconfig/cluster1.yaml get pods -l 'app=hello-app-1' -o wide
NAME                           READY   STATUS    RESTARTS   AGE   IP            NODE                                       NOMINATED NODE   READINESS GATES
deployment1-69bddf99b6-gjl94   1/1     Running   0          55m   240.10.0.24   gke-cluster1-cluster1-pool-9d7804fe-fl8w   <none>           <none>
deployment1-69bddf99b6-vrtc7   1/1     Running   0          55m   240.10.0.25   gke-cluster1-cluster1-pool-9d7804fe-fl8w   <none>           <none>

If we had launched a routes-based cluster instead of a VPC-native cluster, the IP addresses in KUBE-SERVICES would instead be drawn from an RFC-1918 service range:

$ sudo /sbin/iptables --table nat --list KUBE-SERVICES
Chain KUBE-SERVICES (2 references)
target     prot opt source               destination
KUBE-SVC-NPX46M4PTMTKRN6Y  tcp  --  anywhere             10.107.240.1         /* default/kubernetes:https cluster IP */ tcp dpt:https
KUBE-SVC-Y7ZLLRVMCD5M4HRL  tcp  --  anywhere             10.107.245.254       /* default/service3 cluster IP */ tcp dpt:http
KUBE-SVC-OA62VCLUSJYXZDQQ  tcp  --  anywhere             10.107.250.149       /* default/service10 cluster IP */ tcp dpt:http
KUBE-SVC-JOVDIF256A6Q5HDW  tcp  --  anywhere             10.107.250.156       /* default/service8 cluster IP */ tcp dpt:http
KUBE-SVC-4RM6KDP54NYR4K6S  tcp  --  anywhere             10.107.250.111       /* default/service1 cluster IP */ tcp dpt:http
KUBE-SVC-3ISFTUHJIYANB2XG  tcp  --  anywhere             10.107.241.148       /* default/service4 cluster IP */ tcp dpt:http
KUBE-SVC-E7SFLZD2Y2FAKTSV  tcp  --  anywhere             10.107.255.251       /* default/service2 cluster IP */ tcp dpt:http
KUBE-SVC-T467R3VJHOQP3KAJ  tcp  --  anywhere             10.107.246.240       /* default/service9 cluster IP */ tcp dpt:http
KUBE-SVC-AI5DROXYLCYX27ZS  tcp  --  anywhere             10.107.253.168       /* default/service5 cluster IP */ tcp dpt:http
KUBE-SVC-GQKLSXF4KTGNIMSQ  tcp  --  anywhere             10.107.255.31        /* default/service11 cluster IP */ tcp dpt:http
KUBE-SVC-XP4WJ6VSLGWALMW5  tcp  --  anywhere             10.107.252.203       /* kube-system/default-http-backend:http cluster IP */ tcp dpt:http
KUBE-SVC-SAREEPXIBVBCS5LQ  tcp  --  anywhere             10.107.249.4         /* default/service12 cluster IP */ tcp dpt:http
KUBE-SVC-F4AADAVBSY5MPKOB  tcp  --  anywhere             10.107.250.177       /* default/service6 cluster IP */ tcp dpt:http
KUBE-SVC-MVJGFDRMC5WIL772  tcp  --  anywhere             10.107.252.157       /* default/service7 cluster IP */ tcp dpt:http
KUBE-NODEPORTS  all  --  anywhere             anywhere             /* kubernetes service nodeports; NOTE: this must be the last rule in this chain */ ADDRTYPE match dst-type LOCAL

But regardless, the end result is the same: service IP addresses aren't real, so they can be almost anything. Despite their fictional nature, Google uses a "flat" network model that does not allow re-using IP addresses across multiple clusters, so Google recommends using non-RFC-1918 address space for service IP ranges (this demo draws its service ranges from the 100.64.0.0/10 shared address space).
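
In terraform, pointing a cluster at those non-RFC-1918 secondary ranges looks roughly like this (a simplified sketch; names and ranges are illustrative, not the repo's exact configuration):

resource "google_container_cluster" "cluster1" {
  name               = "cluster1"
  location           = "us-central1"                        # assumed region
  network            = google_compute_network.vpc.id        # assumed network resource
  subnetwork         = google_compute_subnetwork.nodes.id   # the 10.10.10.0/26 subnet
  initial_node_count = 2

  # Only the nodes consume RFC-1918 addresses; pods and services draw from
  # the non-RFC-1918 secondary ranges defined on the subnet.
  ip_allocation_policy {
    cluster_secondary_range_name  = "cluster1-pods"       # e.g. Class E (240.0.0.0/4) space
    services_secondary_range_name = "cluster1-services"   # e.g. shared space (100.64.0.0/10)
  }
}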

Pod IP Addresses

In the previous section we saw how sending a packet to service1 results in iptables intercepting that packet and rewriting the destination to a pod IP address. In a VPC-native GKE cluster, each node has a virtual network interface (a veth) for each pod running on it:

$ netstat -4nr
Kernel IP routing table
Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
0.0.0.0         10.10.10.1      0.0.0.0         UG        0 0          0 eth0
10.10.10.1      0.0.0.0         255.255.255.255 UH        0 0          0 eth0
169.254.123.0   0.0.0.0         255.255.255.0   U         0 0          0 docker0
169.254.169.254 10.10.10.1      255.255.255.255 UGH       0 0          0 eth0
240.10.0.2      0.0.0.0         255.255.255.255 UH        0 0          0 gke200305c2a96
240.10.0.3      0.0.0.0         255.255.255.255 UH        0 0          0 gke026d556ebe9
240.10.0.4      0.0.0.0         255.255.255.255 UH        0 0          0 gke7d4f3a7a7fe
240.10.0.5      0.0.0.0         255.255.255.255 UH        0 0          0 gke60f18655088
240.10.0.6      0.0.0.0         255.255.255.255 UH        0 0          0 gke1a72a682490
240.10.0.8      0.0.0.0         255.255.255.255 UH        0 0          0 gke1c4d51adb0d
240.10.0.9      0.0.0.0         255.255.255.255 UH        0 0          0 gke9d25513aa8f
240.10.0.10     0.0.0.0         255.255.255.255 UH        0 0          0 gke6a364803b2a
240.10.0.12     0.0.0.0         255.255.255.255 UH        0 0          0 gke6a63d89ef86
240.10.0.13     0.0.0.0         255.255.255.255 UH        0 0          0 gke35b91a8a487
240.10.0.14     0.0.0.0         255.255.255.255 UH        0 0          0 gke96c13f51f03
240.10.0.15     0.0.0.0         255.255.255.255 UH        0 0          0 gke84a95b2f8d9
240.10.0.16     0.0.0.0         255.255.255.255 UH        0 0          0 gkec88ce3d8bdb
240.10.0.17     0.0.0.0         255.255.255.255 UH        0 0          0 gkeacb4e0652ac
240.10.0.18     0.0.0.0         255.255.255.255 UH        0 0          0 gke49bb9e75be2
240.10.0.19     0.0.0.0         255.255.255.255 UH        0 0          0 gke0ece9ad356b
240.10.0.20     0.0.0.0         255.255.255.255 UH        0 0          0 gke0a1351c4ee3
240.10.0.21     0.0.0.0         255.255.255.255 UH        0 0          0 gke72a06fc23ca
240.10.0.22     0.0.0.0         255.255.255.255 UH        0 0          0 gke9845db36eb5
240.10.0.23     0.0.0.0         255.255.255.255 UH        0 0          0 gkecb6bf7230eb
240.10.0.24     0.0.0.0         255.255.255.255 UH        0 0          0 gke7dae60021d4
240.10.0.25     0.0.0.0         255.255.255.255 UH        0 0          0 gkeb8396784860
240.10.0.26     0.0.0.0         255.255.255.255 UH        0 0          0 gke4bd6d44f52d
240.10.0.27     0.0.0.0         255.255.255.255 UH        0 0          0 gke3adcfdc91bc
240.10.0.28     0.0.0.0         255.255.255.255 UH        0 0          0 gkefabe3212dac
240.10.0.29     0.0.0.0         255.255.255.255 UH        0 0          0 gke0f41cfda23e
240.10.0.30     0.0.0.0         255.255.255.255 UH        0 0          0 gke91fc0947c42
240.10.0.31     0.0.0.0         255.255.255.255 UH        0 0          0 gke9ee620217b1
240.10.0.32     0.0.0.0         255.255.255.255 UH        0 0          0 gke12336532836
240.10.0.33     0.0.0.0         255.255.255.255 UH        0 0          0 gke369d5150571
240.10.0.34     0.0.0.0         255.255.255.255 UH        0 0          0 gke97dfb4bceed
240.10.0.35     0.0.0.0         255.255.255.255 UH        0 0          0 gke085b5ff7d93

We can see that the pod IP addresses for service1, 240.10.0.24 and 240.10.0.25, route over gke7dae60021d4 and gkeb8396784860 respectively. For pods on other nodes, the packet leaves the node and Google's infrastructure (which knows each node's pod range as an alias IP range) takes over delivering it.
