A Kubernetes operator for renewing ECR credentials cluster-wide

If you have used Amazon ECR (the AWS container registry) before, you probably already know that its registry credentials are only valid for 12 hours. This is a plus from a security perspective, but from a user/developer experience perspective it can be annoying, especially if you run your containers in Kubernetes. To make sure that the registry secret has not expired, one has to delete the “kubernetes.io/dockerconfigjson” secret and recreate it with a fresh ECR token before each workload creation/update. For example, before creating the kube-ecr-secrets-operator, I ran the following script in my CI/CD pipelines before each deploy operation:

# Namespace in which the pull secret is (re)created
NAMESPACE=$1

# Delete the old secret (if any), then recreate it with a fresh ECR token
kubectl delete --ignore-not-found=true secret docker-registry-secret -n "$NAMESPACE" && \
kubectl create secret docker-registry docker-registry-secret \
  --docker-server=123456789.dkr.ecr.us-east-1.amazonaws.com \
  --docker-username=AWS \
  --docker-password="$(aws ecr get-authorization-token --region us-east-1 | jq --raw-output '.authorizationData[0].authorizationToken' | base64 -d | cut -d: -f2)" \
  -n "$NAMESPACE"
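
For readers who prefer to see the two steps of the script in code, here is a minimal Go sketch: it decodes an ECR authorization token (base64 of AWS:<password>) and assembles the payload that a kubernetes.io/dockerconfigjson secret carries. The function and type names (splitECRToken, dockerConfigJSON, registryAuth, dockerConfig) are made up for illustration and are not taken from the operator's source, and the token in main is a dummy value rather than a real AWS response.

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"fmt"
	"strings"
)

// registryAuth and dockerConfig mirror the JSON layout stored under the
// .dockerconfigjson key of a kubernetes.io/dockerconfigjson secret.
type registryAuth struct {
	Username string `json:"username"`
	Password string `json:"password"`
	Auth     string `json:"auth"`
}

type dockerConfig struct {
	Auths map[string]registryAuth `json:"auths"`
}

// splitECRToken (hypothetical helper) decodes the base64 authorization token
// and splits the "user:password" pair, like `base64 -d | cut -d:` in the script.
func splitECRToken(token string) (string, string, error) {
	raw, err := base64.StdEncoding.DecodeString(token)
	if err != nil {
		return "", "", fmt.Errorf("decoding token: %w", err)
	}
	user, password, ok := strings.Cut(string(raw), ":")
	if !ok {
		return "", "", fmt.Errorf("token is not in user:password form")
	}
	return user, password, nil
}

// dockerConfigJSON (hypothetical helper) builds the secret payload for one registry.
func dockerConfigJSON(server, user, password string) ([]byte, error) {
	auth := base64.StdEncoding.EncodeToString([]byte(user + ":" + password))
	return json.Marshal(dockerConfig{
		Auths: map[string]registryAuth{
			server: {Username: user, Password: password, Auth: auth},
		},
	})
}

func main() {
	// Dummy token standing in for the authorizationToken field returned by AWS.
	token := base64.StdEncoding.EncodeToString([]byte("AWS:example-password"))

	user, password, err := splitECRToken(token)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	payload, err := dockerConfigJSON("123456789.dkr.ecr.us-east-1.amazonaws.com", user, password)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(string(payload))
}
```

An operator can produce this payload itself and write it into a secret in each namespace, which is what removes the need for the delete-and-recreate step in the pipeline.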

Because I use AWS ECR extensively in my personal projects, I decided to approach the issue with a Kubernetes operator that takes care of renewing the ECR access secrets across all namespaces: https://github.com/zak905/kube-ecr-secrets-operator. I used kubebuilder, which has become the de facto standard tool for writing Kubernetes operators in Go.

The initial design (version 0.1.0):

The first idea I had was to leverage Kubernetes admission webhooks. Whenever a pod is created or updated, the admission webhook calls the operator's server, which first checks whether the pod's imagePullSecrets match the secret configured in the custom resource (defining a CRD is a given, since an operator in principle needs a custom resource to watch). If an imagePullSecret matches, the webhook endpoint checks whether the expiration time has been reached (there is an annotation on the secret for that) and, if so, updates the secret. The initial design of the custom resource looked like this:
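
To make the webhook decision concrete, here is a rough Go sketch of the check described above, written with plain structs instead of the Kubernetes API types so that it runs on its own. The annotation key example.com/expires-at, the pod and secret structs, and the function names are assumptions made for illustration; the article does not state the real annotation name or how the operator handles a missing annotation.

```go
package main

import (
	"fmt"
	"time"
)

// expiresAtAnnotation is a hypothetical key; the operator's real annotation
// name is not given in the article.
const expiresAtAnnotation = "example.com/expires-at"

// pod and secret are trimmed stand-ins for the Kubernetes API objects.
type pod struct {
	ImagePullSecrets []string
}

type secret struct {
	Name        string
	Annotations map[string]string
}

// mustTime parses an RFC 3339 timestamp and panics on malformed input.
func mustTime(s string) time.Time {
	t, err := time.Parse(time.RFC3339, s)
	if err != nil {
		panic(err)
	}
	return t
}

// needsRenewal reports whether the pod references the managed secret and the
// expiry recorded on that secret has been reached.
func needsRenewal(p pod, s secret, now time.Time) (bool, error) {
	referenced := false
	for _, name := range p.ImagePullSecrets {
		if name == s.Name {
			referenced = true
			break
		}
	}
	if !referenced {
		// The pod does not use the managed secret: nothing to do.
		return false, nil
	}
	raw, ok := s.Annotations[expiresAtAnnotation]
	if !ok {
		// Assumption: with no recorded expiry, renew to be on the safe side.
		return true, nil
	}
	expiresAt, err := time.Parse(time.RFC3339, raw)
	if err != nil {
		return false, fmt.Errorf("parsing %s: %w", expiresAtAnnotation, err)
	}
	return !now.Before(expiresAt), nil
}

func main() {
	managed := secret{
		Name:        "ecr-login",
		Annotations: map[string]string{expiresAtAnnotation: "2023-10-27T13:00:00Z"},
	}
	p := pod{ImagePullSecrets: []string{"ecr-login"}}

	renew, err := needsRenewal(p, managed, mustTime("2023-10-27T14:00:00Z"))
	fmt.Println("renew:", renew, "err:", err)
}
```

Note that even in this simplified form the check runs for every admitted pod, including pods that reference no pull secret at all.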

apiVersion: aws.zakariaamine.com/v1alpha1
kind: AWSECRCredential
metadata:
  name: my-ecr-credentials
spec:
  awsAccess:
    #secret containing AWS access used to get the ECR secret from AWS
    secretName: aws-access
    #optional namespace of the aws-access secret. Defaults to default.
    namespace: default
  #the name of the Kubernetes secret that will be created
  secretName: ecr-login
  #all the namespaces in which the operator will create and manage ecr secrets
  namespaces:
    - ns1
    - ns2
    - ns3
    - ns4

For this to work, a Kubernetes secret holding the AWS access credentials (the access key ID and the secret access key) needs to be present; its namespace and name are configured in .spec.awsAccess. After experimenting for a while, I realized that having an admission webhook on the CREATE and UPDATE operations of pods can become problematic. If an error occurs while renewing the ECR credentials, the pod creation or update is blocked. This can be mitigated with the failurePolicy property of the MutatingWebhookConfiguration or the ValidatingWebhookConfiguration, but admission requests would still be sent for every pod, even for pods that have nothing to do with ECR secrets or that have no imagePullSecrets at all. I decided to roll out a second version with some simplifications.

The later improvements (version 0.1.1):

In the second version, I removed the admission webhook on pod creation/update and used the Kubernetes controller requeue mechanism to tell the controller to reconcile again after a defined period of time. I also decided to inline the AWS access credentials in the resource spec instead of depending on a Kubernetes secret that has to be created before the object. An outline of the changes I introduced is available in the GitHub issue: https://github.com/zak905/kube-ecr-secrets-operator/issues/3
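
As an illustration of the requeue idea, the sketch below shows how a reconcile function can ask to be called again after a delay. It is not the operator's code: the result struct is a local stand-in for controller-runtime's ctrl.Result (which exposes a RequeueAfter field), and the reconcile signature, the renew callback, and the mustTime helper are assumptions chosen to keep the example runnable without any dependency.

```go
package main

import (
	"fmt"
	"time"
)

// renewalInterval matches the 12-hour lifetime of an ECR token.
const renewalInterval = 12 * time.Hour

// result is a local stand-in for controller-runtime's ctrl.Result.
type result struct {
	RequeueAfter time.Duration
}

// mustTime parses an RFC 3339 timestamp and panics on malformed input.
func mustTime(s string) time.Time {
	t, err := time.Parse(time.RFC3339, s)
	if err != nil {
		panic(err)
	}
	return t
}

// reconcile renews the secrets when they are due and always tells the
// controller when to come back.
func reconcile(lastRenewed, now time.Time, renew func() error) (result, error) {
	next := lastRenewed.Add(renewalInterval)
	if now.Before(next) {
		// Not due yet: come back exactly when the renewal is due.
		return result{RequeueAfter: next.Sub(now)}, nil
	}
	if err := renew(); err != nil {
		// Returning an error lets the controller retry with its own backoff.
		return result{}, err
	}
	// Renewed: schedule the next renewal one full interval from now.
	return result{RequeueAfter: renewalInterval}, nil
}

func main() {
	last := mustTime("2023-10-27T13:47:08Z")
	now := mustTime("2023-10-27T19:47:08Z")

	res, err := reconcile(last, now, func() error {
		fmt.Println("renewing ECR secrets")
		return nil
	})
	fmt.Println("requeue after:", res.RequeueAfter, "err:", err)
}
```

With this approach the renewal is driven by the controller's own schedule, so pod creation and updates are no longer on the critical path.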

AWSECRCredential in action:

Imagine you have three namespaces, ns1, ns2, and ns3, in which you need a pull secret for an ECR repository. After installing the operator, you simply create the following object:

apiVersion: aws.zakariaamine.com/v1alpha1
kind: AWSECRCredential
metadata:
  name: my-ecr-credentials
spec:
  awsAccess:
    accessKeyId: THE_AWS_ACCESS_ID
    secretAccessKey: THE_AWS_SECRET_ACCESS_KEY
    region: us-east-1
  secretName: ecr-credential
  namespaces:
    - ns1
    - ns2
    - ns3

After the object is submitted, three Kubernetes secrets of type kubernetes.io/dockerconfigjson are created, one in each namespace, and are scheduled for renewal every 12 hours. The status of the my-ecr-credentials object shows the following information:

  status:
    conditions:
    - lastTransitionTime: "2023-10-27T13:47:08Z"
      message: 'AWS ECR secret with type kubernetes.io/dockerconfigjson have been
        created/updated successfully in namespaces: [experimental stage production]
        next update at: 2023-10-28 01:47:08.792 +0000 UTC'
      reason: SecretsUpdated
      status: "True"
      type: Ready

In case anything goes wrong, the condition will have status: "False", and the message field will hold the detailed error message.
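
The success and failure cases of the Ready condition can be sketched in Go as follows. The condition struct is a trimmed stand-in for the usual Kubernetes condition shape, the readyCondition function is hypothetical, and the failure reason SecretsUpdateFailed is an assumption: the article only shows the SecretsUpdated reason used on success.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// condition is a trimmed stand-in for a Kubernetes status condition.
type condition struct {
	Type    string
	Status  string
	Reason  string
	Message string
}

// errDemo is a made-up error used only to exercise the failure path.
var errDemo = errors.New("demo: could not fetch the ECR token")

// mustTime parses an RFC 3339 timestamp and panics on malformed input.
func mustTime(s string) time.Time {
	t, err := time.Parse(time.RFC3339, s)
	if err != nil {
		panic(err)
	}
	return t
}

// readyCondition builds the Ready condition from the outcome of a renewal.
func readyCondition(namespaces []string, nextUpdate time.Time, err error) condition {
	if err != nil {
		return condition{
			Type:    "Ready",
			Status:  "False",
			Reason:  "SecretsUpdateFailed", // assumed name, not from the article
			Message: err.Error(),
		}
	}
	return condition{
		Type:   "Ready",
		Status: "True",
		Reason: "SecretsUpdated",
		Message: fmt.Sprintf(
			"secrets created/updated successfully in namespaces: %v next update at: %v",
			namespaces, nextUpdate),
	}
}

func main() {
	next := mustTime("2023-10-28T01:47:08Z")
	fmt.Printf("%+v\n", readyCondition([]string{"ns1", "ns2", "ns3"}, next, nil))
	fmt.Printf("%+v\n", readyCondition(nil, next, errDemo))
}
```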

Key learnings about Kubernetes operators: