Helm contribution note: fixing the post-install hook deletion failure caused by the before-hook-creation policy

After more than a year and four months, I was happy to see PR #11387, which I submitted to the Helm project, finally accepted and merged. I would like to take the opportunity to explain what problem my contribution solves. It all started when the following error began to show up from time to time when I ran helm install: Error: failed post-install: warning: Hook post-install testing-hooks-chart/templates/pod.yaml failed: object is being deleted: pods "random-pod" already exists

The cause of the issue:

The issue occurs when a post-install hook resource (pod, volume…) is not fully deleted after a Helm release is uninstalled. This can happen, for example, when a chart needs to be uninstalled and installed again. Even if those hook resources are not fully deleted, Helm still considers the uninstall successful (because hook resources are technically not part of the release). If the new installation finds that resources from the previous installation are still present, it fails with the error message mentioned above. You can find examples of users complaining about this error in the related issue. This mostly happens when the resources (pods, volumes, custom resources…) have finalizers or go through some lengthy process before deletion.
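To make the failure mode concrete, here is a toy Python model of the behaviour described above. It is only a sketch: ToyCluster and its methods are hypothetical names invented for this illustration, not Helm or Kubernetes APIs. It assumes that deleting an object that has a finalizer only marks it as terminating, and that creating an object under a name that is still taken is rejected.

```python
class AlreadyExists(Exception):
    """Raised when a pod with the same name is still present."""


class ToyCluster:
    """Tiny in-memory stand-in for the API server (illustration only)."""

    def __init__(self):
        # pod name -> {"finalizers": [...], "deleting": bool}
        self.pods = {}

    def create(self, name, finalizers=()):
        pod = self.pods.get(name)
        if pod is not None:
            prefix = "object is being deleted: " if pod["deleting"] else ""
            raise AlreadyExists(f'{prefix}pods "{name}" already exists')
        self.pods[name] = {"finalizers": list(finalizers), "deleting": False}

    def delete(self, name):
        pod = self.pods.get(name)
        if pod is None:
            return
        if pod["finalizers"]:
            # Marked for deletion, but it stays around until the
            # finalizers are cleared.
            pod["deleting"] = True
        else:
            del self.pods[name]

    def remove_finalizers(self, name):
        pod = self.pods.get(name)
        if pod is None:
            return
        pod["finalizers"] = []
        if pod["deleting"]:
            del self.pods[name]


if __name__ == "__main__":
    cluster = ToyCluster()
    # First install: the post-install hook creates the pod.
    cluster.create("random-pod", finalizers=["kubernetes"])
    # Second install: the old hook pod is deleted, then recreated right away.
    cluster.delete("random-pod")
    try:
        cluster.create("random-pod", finalizers=["kubernetes"])
    except AlreadyExists as err:
        print(err)
    # prints: object is being deleted: pods "random-pod" already exists
```

In this model the second create only succeeds once remove_finalizers has been called, which mirrors why the real error shows up intermittently: it depends on whether the old resource is gone by the time the hook is recreated.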

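To summarize, the error above can arise when a hook resource left over from a previous installation of the chart (for example, one held back by a finalizer or by a slow deletion) still exists at the moment the chart is installed again and the hook, which uses the before-hook-creation delete policy, is recreated.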

How to reproduce the issue:

First of all, the fix was released in Helm 3.14.0: https://github.com/helm/helm/releases/tag/v3.14.0, so the issue can be reproduced with any version of Helm prior to 3.14.0.

The issue can be reproduced using a simple helm chart that uses a post-install hook with a finalizer. For example, let’s assume we want to create a chart with an nginx deployment. On success, the chart launches a pod that prints something and exits.

file: deployment.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: testing-hooks
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: testing-hooks
  template:
    metadata:
      labels:
        app.kubernetes.io/name: testing-hooks
    spec:
      containers:
        - name: testing-hooks
          image: "nginx:latest"
          imagePullPolicy: IfNotPresent
          ports:
            - name: http
              containerPort: 80  # nginx's default HTTP port

file: pod-hook.yaml

apiVersion: v1
kind: Pod
metadata:
  name: random-pod
  annotations:
    helm.sh/hook: "post-install"
    helm.sh/hook-delete-policy: before-hook-creation
  finalizers:
    - kubernetes
spec:
  containers:
    - name: test
      image: "alpine"
      command: ['echo']
      args:
        - "bye bye"
  restartPolicy: Never

The chart structure looks as follows:

├── Chart.yaml
├── templates
│   ├── deployment.yaml
│   ├── _helpers.tpl
│   └── pod-hook.yaml
└── values.yaml

The issue can be reproduced by running the following commands sequentially: helm install test-hook ., then helm uninstall test-hook, then helm install test-hook . again.

The second install fails with the following message:

Error: failed post-install: warning: Hook post-install testing-hooks-chart/templates/pod.yaml failed: object is being deleted: pods "random-pod" already exists

How the patch fixes the issue:

The patch introduces waiting (with a timeout) for hook deletion when reinstalling the chart. If a hook resource, like the pod in the example, has a finalizer or takes time to delete, the second helm install hangs until the resource is deleted or the timeout is reached. If the timeout is reached and the hook resource is still not deleted, the chart install fails (depending on whether --atomic is set, the other chart resources are removed upon failure). For example, if we execute the steps mentioned above, the second install hangs for the duration of the timeout (the default is 5m, but it can be lowered with the --timeout flag for experimental purposes). Here is the output of the second helm install test-hook . --debug --timeout 10s:

client.go:142: [debug] creating 1 resource(s)
client.go:486: [debug] Starting delete for "random-pod" Pod
wait.go:66: [debug] beginning wait for 1 resources to be deleted with timeout of 10s
Error: INSTALLATION FAILED: failed post-install: context deadline exceeded
helm.go:84: [debug] failed post-install: context deadline exceeded

As you can see, the difference is that Helm now waits for the deletion of random-pod for the specified timeout and does not fail immediately. If the resource has a finalizer, as random-pod does here, one can switch to another terminal and remove the finalizer using kubectl edit.
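The wait-with-timeout idea can be illustrated with a short Python sketch. This is not the code from the PR (Helm is written in Go). The function name wait_for_deletion, its parameters and the polling interval are assumptions made for this example, and the error text simply mirrors the message in the debug output above.

```python
import time


def wait_for_deletion(still_exists, timeout, interval=0.01):
    """Poll still_exists() until it returns False or `timeout` seconds pass.

    Returns True once the resource is gone and raises TimeoutError if the
    deadline is reached first. Hypothetical helper, for illustration only.
    """
    deadline = time.monotonic() + timeout
    while still_exists():
        if time.monotonic() >= deadline:
            raise TimeoutError("context deadline exceeded")
        time.sleep(interval)
    return True


if __name__ == "__main__":
    # Case 1: the old hook pod disappears after a couple of polls.
    polls = iter([True, True, False])
    wait_for_deletion(lambda: next(polls), timeout=1.0)
    print("old hook resource is gone, the new hook can be created")

    # Case 2: a finalizer keeps the pod around, so the wait times out.
    try:
        wait_for_deletion(lambda: True, timeout=0.05)
    except TimeoutError as err:
        print(f"failed post-install: {err}")
```

The design point is that the create step is only attempted after the wait returns, so a slow deletion turns into a delay instead of an immediate "already exists" failure, and a stuck deletion turns into a clear timeout error.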