iconik outage post-mortem
On Thursday 2019-09-05 at 23:09 we received an alert from Google Cloud monitoring signaling that us.iconik.io and app.iconik.io were unreachable. Around the same time, we received a support ticket from a customer asking if the system was down. Our engineers immediately started investigating and quickly discovered that both Kubernetes pods serving the main webpage of iconik were down, and that both were running on the same physical node. Restarting these pods forced them to be rescheduled on another node, which restored the service. The time from the first alert until the issue was fixed was about 8 minutes.
The first occurrence of a 503 error status in the logs is at 22:51:22.880 CEST. The first alert was sent out at 23:09, to Slack and email. Our engineers restored system functionality at 23:17:36.537 CEST. After that, an initial investigation into the underlying cause started and continued the next business day.
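The intervals implied by these timestamps can be worked out with a few lines of Python. This is only an arithmetic sketch: the alert time is given to the minute in the timeline, so the seconds are assumed to be zero, and the variable names are ours.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S.%f"

# Timestamps from the timeline above (CEST).
# Assumption: the alert is only logged to the minute, so seconds are set to 0.
first_error = datetime.strptime("2019-09-05 22:51:22.880", FMT)
first_alert = datetime.strptime("2019-09-05 23:09:00.000", FMT)
restored = datetime.strptime("2019-09-05 23:17:36.537", FMT)

detection_delay = first_alert - first_error  # first 503 until first alert
repair_time = restored - first_alert         # first alert until service restored
total_downtime = restored - first_error      # first 503 until service restored


def minutes(delta):
    """Convert a timedelta to minutes, rounded to one decimal place."""
    return round(delta.total_seconds() / 60, 1)


print("detection delay:", minutes(detection_delay), "min")
print("repair time:", minutes(repair_time), "min")
print("total downtime:", minutes(total_downtime), "min")
```

Most of the elapsed time falls between the first error and the first alert, which is why alerting latency gets its own item further down.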
After the service was restored we continued to investigate and discovered that the node running the web pods was under extremely high load. This had caused the kubelet to miss several heartbeats to the control plane, so the node was marked as non-responsive. However, the node was not unresponsive enough for Kubernetes to consider it completely down. This left the pods running on that node in limbo, without being rescheduled on other nodes. We need to investigate what could cause this behavior.
The memory usage on the node was almost at 100% and processes were being killed with OOM errors. We could see that some Celery processes were using a lot more memory than expected. We have seen this in our pre-production environment and have implemented a fix which is scheduled for the next release. This excessive memory allocation is considered the root cause of the issue, and it is documented that Kubernetes nodes can become unstable if pods are allowed unbounded memory allocation. We should therefore implement memory limits on our pods so that they get killed and rescheduled rather than bringing the whole node down.
The outage occurred in our US region. https://eu.iconik.io was unaffected and users could have used that site if they had been aware of it, or if https://app.iconik.io had automatically redirected them there. The outage was isolated to the pods serving the static webpage which loads the iconik GUI. All the API endpoints remained operational, so any user who already had the GUI loaded was unaffected. Likewise, API integrations and automations remained functional. End users who tried to access https://app.iconik.io or https://us.iconik.io were greeted with a default nginx 503 error page.
This section outlines the improvements we can make to the product and our processes.
Since the outage was limited to the US region, an application load balancer that redirected users to https://eu.iconik.io would have kept the system operational, albeit with higher latency for US customers. We have a geographic load balancer under test and we should investigate how to configure it to also handle fail-over in cases like this.
Both of the redundant web pods were located on the same node. This should not have been the case, but because we had not configured pod anti-affinity for these pods, Kubernetes did not know to schedule them on different nodes. This is a configuration change we will implement for all parts of the system so that Kubernetes spreads the pods of each service across nodes. This should remove the single point of failure we suffered from.
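As an illustration of what such a rule looks like, the sketch below builds a pod spec fragment as a Python dictionary and prints it as JSON, which `kubectl` accepts alongside YAML. The `affinity`, `podAntiAffinity`, `requiredDuringSchedulingIgnoredDuringExecution`, `labelSelector` and `topologyKey` fields are part of the Kubernetes pod spec; the `app: web` label, the container name and the image are hypothetical placeholders, not our actual configuration.

```python
import json


def with_pod_anti_affinity(pod_spec, labels):
    """Return a copy of pod_spec that may not share a node with pods matching labels.

    Note: this replaces any existing "affinity" entry in the copy.
    """
    term = {
        "labelSelector": {"matchLabels": dict(labels)},
        # One pod matching the selector per node (nodes are keyed by hostname).
        "topologyKey": "kubernetes.io/hostname",
    }
    spec = dict(pod_spec)  # shallow copy, the input is left untouched
    spec["affinity"] = {
        "podAntiAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": [term]
        }
    }
    return spec


# Hypothetical pod spec for a web pod; name and image are placeholders.
web_pod_spec = {"containers": [{"name": "web", "image": "example/web:latest"}]}
patched = with_pod_anti_affinity(web_pod_spec, {"app": "web"})
print(json.dumps(patched, indent=2))
```

A hard `required...` rule leaves a replica unschedulable if no other node has room. Where that trade-off is unacceptable, the `preferredDuringSchedulingIgnoredDuringExecution` form expresses the same intent as a preference instead.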
We should apply to the production system the memory and CPU limits that are already in place in the pre-production system. A container that exceeds its memory limit will then be killed (CPU usage above the limit is throttled rather than killed), while the other pods on the same node keep running. Declaring resource requirements will also allow Kubernetes to schedule pods more effectively.
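A minimal sketch of such a container definition follows, again as a Python dictionary that can be serialized to JSON. The `resources`, `requests` and `limits` fields are standard Kubernetes container fields; the container name, image and the specific values (`256Mi`, `512Mi`, `250m`, `500m`) are made up for illustration and are not the limits used in our pre-production system.

```python
import json


def with_resource_limits(container, requests, limits):
    """Return a copy of a container definition with resource requests and limits set."""
    limited = dict(container)  # shallow copy, the input is left untouched
    limited["resources"] = {
        # Requests: what the scheduler reserves on a node for this container.
        "requests": dict(requests),
        # Limits: the ceiling. Exceeding the memory limit gets the container
        # OOM-killed; exceeding the CPU limit gets it throttled.
        "limits": dict(limits),
    }
    return limited


# Hypothetical worker container; all names and values are placeholders.
worker = {"name": "celery-worker", "image": "example/worker:latest"}
limited = with_resource_limits(
    worker,
    requests={"memory": "256Mi", "cpu": "250m"},
    limits={"memory": "512Mi", "cpu": "500m"},
)
print(json.dumps(limited, indent=2))
```

With a memory limit in place, a runaway process is contained to its own container instead of exhausting the node's memory.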
The first alert that app.iconik.io was down wasn’t sent until 18 minutes after the first error occurred. This should have happened much sooner to minimize the downtime.
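One common way to shorten detection time is to alert after a small number of consecutive failed probes rather than waiting for a long evaluation window. The sketch below shows only that rule in isolation; the function name, the threshold of three and the sample probe results are assumptions for illustration, not a description of how our monitoring is configured.

```python
def alert_index(status_codes, threshold=3):
    """Return the index of the probe at which an alert would fire, or None.

    An alert fires once `threshold` consecutive probes return a 5xx status.
    Any non-5xx response resets the counter.
    """
    consecutive = 0
    for i, code in enumerate(status_codes):
        if code >= 500:
            consecutive += 1
            if consecutive >= threshold:
                return i
        else:
            consecutive = 0
    return None


# Hypothetical probe results: one transient 503, then a sustained outage.
probes = [200, 200, 503, 200, 503, 503, 503, 503]
print(alert_index(probes))
```

With a probe every 30 seconds, for example, a threshold of three would raise the alert about a minute and a half into a sustained outage, while a single transient error would not page anyone.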
We should monitor whether all pods in a service are scheduled on the same node. This should not happen once the pod anti-affinity change described above is in place, but we should still monitor for this condition.
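The check itself is simple once the pod-to-node mapping is available, for example from the `spec.nodeName` field of each pod as reported by the Kubernetes API. The sketch below assumes that mapping has already been collected into tuples; the function name, service names, pod names and node names are all hypothetical.

```python
from collections import defaultdict


def services_on_single_node(pods):
    """Find services whose replicas all share one node.

    pods: iterable of (service, pod_name, node_name) tuples.
    Returns {service: node_name} for every service that has at least two
    pods and all of them on the same node.
    """
    nodes = defaultdict(set)
    counts = defaultdict(int)
    for service, _pod_name, node_name in pods:
        nodes[service].add(node_name)
        counts[service] += 1
    return {
        service: next(iter(node_set))
        for service, node_set in nodes.items()
        if counts[service] > 1 and len(node_set) == 1
    }


# Hypothetical snapshot resembling the situation during the outage.
snapshot = [
    ("web", "web-1", "node-a"),
    ("web", "web-2", "node-a"),
    ("api", "api-1", "node-a"),
    ("api", "api-2", "node-b"),
]
print(services_on_single_node(snapshot))
```

Services with a single replica are deliberately skipped here, since they cannot be spread across nodes; whether those deserve a separate warning is a policy decision.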
We should have better alerts on the memory usage on the individual nodes. It’s unclear if that would have helped in this case since the control plane lost contact with the faulty node periodically.
We should also have better alerts if all pods in a service are unresponsive. In this case that could have directed the engineers to the faulty service even faster.
Better end-user information
When users tried to access https://app.iconik.io they were met by a white page saying “503 Service Temporarily Unavailable”. That is a fairly unappealing message and we should replace it with something which provides more information about what is going on, including a place where our engineers can update users on the troubleshooting progress.
- Investigate what happens to a node in a non-autoscaling pool which becomes overloaded.
- Investigate how the load balancer can be configured to fail-over between regions.
The system was down for about 26 minutes, from the first 503 error at 22:51 until service was restored at 23:17 CEST. This could have been a lot shorter if the alert system had reacted faster. We can also make a number of configuration changes so that this would not have caused an outage, both through redundancy within the US region and through fail-over to the EU region.