REF "GKE networking model doesn't allow IP addresses to be reused across the network. When you migrate to GKE, you must plan your IP address allocation to Reduce internal IP address usage in GKE." : https://cloud.google.com/kubernetes-engine/docs/concepts/network-overview
REF "Combining multiple Ingress resources into a single Google Cloud load balancer is not supported." : https://cloud.google.com/kubernetes-engine/docs/concepts/ingress
REF "At minimum, the nonMasqueradeCIDRs property should include the node and Pod IP address ranges of your cluster." : https://cloud.google.com/kubernetes-engine/docs/how-to/ip-masquerade-agent
This repo contains a terraform configuration that demonstrates efficient use of RFC-1918 IP addresses with GKE kubernetes clusters. **IT IS NOT** meant to be an example of best practices (for example, in real use I would use [flux](https://github.com/fluxcd/flux2) to apply kubernetes manifests instead of terraform, I would use Horizontal Pod Autoscaling, and I would use node pool autoscaling); rather, it is a contrived example of nearly minimal RFC-1918 IP address consumption.
TL;DR
-----
- Service IP addresses are not accessible outside a cluster (TODO REF)

The terraform configuration spins up:
- A Compute Engine virtual machine for you to use to test `gce-internal` ingresses.
- Cloud DNS for your (sub)domain
- [ExternalDNS](https://github.com/kubernetes-sigs/external-dns) for automatically creating DNS records for each ingress
- 14 clusters
And on each cluster it spins up:
- 2 nodes
- 1 gateway
- 12 HTTP Routes
- 12 services
- 24 pods
For a grand total of:
- 28 nodes (and 1 user machine)
- 14 gateways
- 168 HTTP Routes (and 168 subdomains)
- 168 services
- 336 pods
All of this while only using `10.10.10.0/26` from the RFC-1918 space (64 addresses).
What do I need to provide
-------------------------
To use the terraform configuration, you will need:
1. An already existing Google Cloud project (there is no need to set anything up in the project, terraform will handle all of that, but this configuration does not create a project for you).
2. `gcloud` authenticated with an account that has access to that project.
3. A (sub)domain that can have its nameservers pointed at Google Cloud DNS.
Where do the IP addresses go
----------------------------

| What | Addresses | Notes |
| ---- | --------- | ----- |
| [Unusable addresses](https://cloud.google.com/vpc/docs/subnets#unusable-ip-addresses-in-every-subnet) | 4 addresses | The first two and last two addresses of a primary IP range are unusable. |
| Each Node | 1 address | This example uses 2 nodes per cluster so the node count is easy to tell apart from the cluster count. |
| The user-machine virtual machine | 1 address | This is not needed in a production deployment. |
| Each Gateway | 1 address | This can be 1 per cluster. |
| The control plane private endpoint | 1 address | 1 per cluster. |
With our 64 addresses from `10.10.10.0/26`, we lose 4 as unusable addresses, we use another for the user machine, and then we have 4 addresses per cluster which means we can fit 14 clusters with 3 IP addresses left over.
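Spelled out, using the numbers from the table above:

```
  64 addresses in 10.10.10.0/26
-  4 unusable
-  1 user-machine VM
------------------------------
  59 usable for clusters

59 = (14 clusters x 4 addresses each) + 3 spare
     where 4 = 2 nodes + 1 Gateway + 1 control plane endpoint
```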
Usage
=====
To apply the terraform, authenticate with the gcloud CLI tool:
```
gcloud auth application-default login
```
Then go into the `terraform` folder and apply the configuration. We need to apply the config in two phases via the `cluster_exists` variable because the kubernetes terraform provider does not have native support for the Gateway API and the `kubernetes_manifest` terraform resource [has a shortcoming that requires the cluster to exist at plan time](https://github.com/hashicorp/terraform-provider-kubernetes/issues/1775).
```
tf apply -var dns_root="k8sdemo.mydomain.example." -var quota_email="MrManager@mydomain.example" -var quota_justification="Explain why you need quotas increased here." -var cluster_exists=false
tf apply -var dns_root="k8sdemo.mydomain.example." -var quota_email="MrManager@mydomain.example" -var quota_justification="Explain why you need quotas increased here." -var cluster_exists=true
```
Please note that this will exceed the default quotas on new Google Cloud projects. The terraform configuration will automatically put in requests for quota increases but they can take multiple days to be approved or denied. You should be able to fit 3 clusters in the default quota until then.
Please note that the kubernetes clusters will take a couple of extra minutes to be fully up and running after the `tf apply` command has finished. During this time, the clusters are getting IP addresses assigned to `Gateway` objects and updating DNS records via `ExternalDNS`.
This will spin up the kubernetes clusters and output some helpful information. One such piece of information is the nameservers for Google Cloud DNS. We need to point our (sub)domain at those name servers. If you want to get the list of nameservers again without having to wait for `tf apply`, you can run `tf output dns_name_servers`.
Personally, I run [PowerDNS](https://github.com/PowerDNS/pdns), so as an example, I would first clear the old `NS` records from previous runs from `k8sdemo.mydomain.example` (if you are setting this up for the first time you can skip this step):
```
pdnsutil delete-rrset mydomain.example k8sdemo NS
```
And then I'd add the new records (naturally, you should use the nameservers output by `terraform`; they will change each time you add the domain to Cloud DNS):
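For example (the nameserver values below are placeholders; use the ones from `tf output dns_name_servers`):

```
pdnsutil add-record mydomain.example k8sdemo NS ns-cloud-a1.googledomains.com.
pdnsutil add-record mydomain.example k8sdemo NS ns-cloud-a2.googledomains.com.
pdnsutil add-record mydomain.example k8sdemo NS ns-cloud-a3.googledomains.com.
pdnsutil add-record mydomain.example k8sdemo NS ns-cloud-a4.googledomains.com.
```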
Give some time for DNS caches to expire and then you should be able to access `service<num>.cluster<num>.k8sdemo.mydomain.example` by connecting to the `user-machine` over `ssh` and using `curl` to hit the internal ingresses. First, get the `gcloud` command to `ssh` into the `user-machine`:
```
tf output user_machine_ssh_command
```
Then `ssh` into the machine (your command will be different):
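It will look roughly like this (the project and zone here are made-up placeholders; copy the exact command from the `tf output` above):

```
gcloud compute ssh --project=my-project-id --zone=us-central1-a user-machine
```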
Once you're on the machine, hit the various ingresses on the various clusters:
```
curl service1.cluster1.k8sdemo.mydomain.example
```
Clean Up
========
Just like we did a 2-stage apply by toggling the `cluster_exists` variable, we will need to do a 2-stage destroy. First we tear down any kubernetes resources by running *apply* with the `cluster_exists` variable set to `false`. Then we can destroy the entire project.
```
tf apply -var dns_root="k8sdemo.mydomain.example." -var quota_email="MrManager@mydomain.example" -var quota_justification="Explain why you need quotas increased here." -var cluster_exists=false
tf destroy -var dns_root="k8sdemo.mydomain.example." -var quota_email="MrManager@mydomain.example" -var quota_justification="Explain why you need quotas increased here." -var cluster_exists=false
```
Explanation
===========
To conserve the RFC-1918 address space, we need to take advantage of two facts: Service IP addresses never appear on the network outside a cluster, and pod IP addresses don't have to come from RFC-1918 space as long as their traffic can be masqueraded when needed.
Service IP addresses are a fiction created by kubernetes. Service IP addresses are not routable from outside the cluster and packets to service IP addresses are never written to the wire. When a pod sends a packet to a service IP address, it is intercepted by `iptables`, which performs DNAT to either the pod's or the node's IP address (depending on cluster type). We can see this on our GKE cluster by connecting to the compute engine instance for a node over `ssh` and inspecting its iptables rules (`iptables -t nat -L KUBE-SERVICES`):
```
KUBE-SVC-4RM6KDP54NYR4K6S tcp -- anywhere 100.64.22.23 /* default/service1 cluster IP */ tcp dpt:http
KUBE-NODEPORTS all -- anywhere anywhere /* kubernetes service nodeports; NOTE: this must be the last rule in this chain */ ADDRTYPE match dst-type LOCAL
```
This matches packets destined for each service IP address and sends them to their respective chains. For `service1`, it is matching packets destined for `100.64.22.23`. That happens to be our service IP address for `service1`:
```
$ kubectl --kubeconfig /bridge/git/kubernetes_ip_demo/output/kubeconfig/cluster1.yaml get svc service1
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service1 ClusterIP 100.64.22.23 <none> 80/TCP 34m
```
So it's matching packets destined for `service1` and sending them to `KUBE-SVC-4RM6KDP54NYR4K6S` (`iptables -t nat -L KUBE-SVC-4RM6KDP54NYR4K6S`):
```
KUBE-SEP-XCTUYJ3QDWA727EN all -- anywhere anywhere /* default/service1 -> 240.10.0.24:8080 */ statistic mode random probability 0.50000000000
KUBE-SEP-5LQWHS2W6LUXXNGL all -- anywhere anywhere /* default/service1 -> 240.10.0.25:8080 */
```
This is how kubernetes load balances services: it uses `iptables` on the machine opening the connection to randomly distribute connections across the various pods. If we take a look at the chain for the first pod, we find a `DNAT` rule that rewrites the packet's destination to that pod's IP address and port (`240.10.0.24:8080`).
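A minimal sketch of what that chain looks like (illustrative only; the exact columns vary by kube-proxy version, but the chain name and pod address come from the listing above):

```
$ sudo iptables -t nat -L KUBE-SEP-XCTUYJ3QDWA727EN
Chain KUBE-SEP-XCTUYJ3QDWA727EN (1 references)
target          prot opt source       destination
KUBE-MARK-MASQ  all  --  240.10.0.24  anywhere     /* default/service1 */
DNAT            tcp  --  anywhere     anywhere     /* default/service1 */ tcp to:240.10.0.24:8080
```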
Regardless, the end result is the same: Service IP addresses aren't real, so they can be anything. Despite their fictional nature, Google's "flat" network model does not allow IP address ranges to be reused across multiple clusters, so [Google recommends using slices of `100.64.0.0/10` for service IP ranges](https://cloud.google.com/blog/products/containers-kubernetes/best-practices-for-kubernetes-pod-ip-allocation-in-gke).
Pod IP addresses, on the other hand, do appear on the wire, but that doesn't mean that we need to use the valuable RFC-1918 IP address space for them. Instead, we can configure our cluster to perform SNAT to the node's IP address using kubernetes' [ip-masq-agent](https://github.com/kubernetes-sigs/ip-masq-agent). This frees us up to use other reserved-but-not-universally-supported IP address ranges, like slices of `240.0.0.0/4`.
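For illustration, this is roughly what those ranges look like when creating a cluster by hand with `gcloud` (the cluster name and CIDRs here are made-up examples; the terraform configuration in this repo sets up the equivalent declaratively):

```
# Pods come from 240.0.0.0/4 and services from 100.64.0.0/10, so only the
# node addresses consume RFC-1918 space. The exact CIDRs are illustrative.
gcloud container clusters create cluster1 \
    --enable-ip-alias \
    --cluster-ipv4-cidr=240.10.0.0/17 \
    --services-ipv4-cidr=100.64.16.0/20
```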
To demonstrate, we can apply the terraform config again, this time with the `enable_snat` variable set to `true`:
```
tf apply -var dns_root="k8sdemo.mydomain.example." -var quota_email="MrManager@mydomain.example" -var quota_justification="Explain why you need quotas increased here." -var cluster_exists=true -var enable_snat=true
```
Then, from our kubernetes pod, we can run the `curl` again; with SNAT enabled, the traffic leaves the node masqueraded to the node's IP address instead of the pod's `240.0.0.0/4` address.
So this means that, just like Service IP addresses, we can make the pod IP addresses anything. [Google recommends using slices of `240.0.0.0/4` for pod IP ranges](https://cloud.google.com/blog/products/containers-kubernetes/best-practices-for-kubernetes-pod-ip-allocation-in-gke), and then enabling SNAT if you need to talk to networks outside of Google Cloud.