Kubernetes Incident Response: Building Your Strategy
Kubernetes is a popular container orchestration platform, originally developed by Google, for managing large-scale containerized applications. It runs microservices applications across a distributed cluster of nodes and is built for resilience, with support for scaling, rollbacks, zero-downtime deployments, and self-healing containers.
The primary aim of Kubernetes is to hide the complexity of managing a large fleet of containers. It can run on bare-metal machines in an on-premises data center as well as on private or public cloud platforms such as Azure, OpenShift, and AWS.
Kubernetes security is a complex undertaking, and organizations everywhere are scrambling to secure their containerized workloads. One very specific and critical aspect of Kubernetes security is Kubernetes incident response. This includes:
- What to do when your Kubernetes cluster is attacked.
- How to coordinate efforts in your organization to deal with an attack.
- How to ensure you have an effective process as well as the necessary tools and data to investigate and recover from any security incident.
Kubernetes Incident Response Components
Incident response is a structured process that an organization uses to detect, manage, and recover from a cybersecurity event. The ultimate aim is to manage the incident so that recovery costs, downtime, and collateral damage (including business losses and reputational damage) are kept to a minimum.
Efficient incident response requires involving people from all areas of the organization. Depending on the escalation path, this can reach beyond the obvious technical and security teams to include Client Support, Human Resources, Legal, Compliance, and senior executives.
Because many incident response guides do not specifically address Kubernetes, an organization should make sure the following teams participate in its Kubernetes incident response process.
DevOps
Responding to a Kubernetes security incident almost always requires a deployment, a rollback, a change to cluster configuration, or some combination of these, carried out in line with the organization's Kubernetes deployment strategy. All of these are the purview of DevOps professionals. The DevOps team must have a clear process for identifying which build or configuration change resulted in a security incident, and for either reverting to a known good configuration or rolling forward with a fix.
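As a minimal sketch of the "revert to a known good configuration" step, the example below assumes the team keeps a record of which Deployment revisions passed security review. The Revision record, its verified_good flag, and the image names are hypothetical and only for illustration; kubectl rollout undo with --to-revision is the standard kubectl subcommand for rolling back a Deployment. The code only builds the command, so a responder can review it before running it.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Revision:
    """Hypothetical record of one Deployment revision kept by the DevOps team."""
    number: int          # revision number as listed by `kubectl rollout history`
    image: str           # container image deployed in that revision
    verified_good: bool  # assumption: set by the team after security review


def last_known_good(revisions: List[Revision], suspect: int) -> Optional[Revision]:
    """Return the newest verified revision that is older than the suspect one."""
    candidates = [r for r in revisions if r.verified_good and r.number < suspect]
    return max(candidates, key=lambda r: r.number, default=None)


def rollback_command(deployment: str, namespace: str, revision: int) -> List[str]:
    """Build (but do not run) the kubectl command that reverts the Deployment."""
    return [
        "kubectl", "rollout", "undo", f"deployment/{deployment}",
        f"--to-revision={revision}", "--namespace", namespace,
    ]


# Made-up history: revision 5 is the build suspected of causing the incident.
history = [
    Revision(3, "registry.example.com/shop:1.4.0", True),
    Revision(4, "registry.example.com/shop:1.4.1", True),
    Revision(5, "registry.example.com/shop:1.5.0", False),
]
target = last_known_good(history, suspect=5)
if target is not None:
    print(" ".join(rollback_command("shop", "prod", target.number)))
```

Keeping the selection logic separate from the command builder makes it easy to add an approval step between the two during a real incident.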
Software Development
A security incident usually indicates that a vulnerability exists in the containers or applications running in the Kubernetes cluster. Remediating a vulnerability requires software developers, so there must be a clear line of communication from incident responders to developers. Developers need to know the exact security issue, the component it affects, and the relevant lines of code. The Development team must also have a prioritized process for remediating vulnerabilities and pushing the fixes to production. This is standard practice for a strong change management program.
The goal is to make the hand-off of incidents routine and straightforward. It is also important to establish protocols for out-of-hours support in the case of severe incidents or actual breaches.
Core Infrastructure
Depending on the organization, core infrastructure may be managed by DevOps teams, Site Reliability Engineering (SRE) roles, or external cloud providers. Incident responders should know who is responsible for hardening servers and configurations for each Kubernetes deployment. If a vulnerability is discovered at the infrastructure level, there should be clear processes for obtaining support from infrastructure teams or from the security teams at cloud providers.
Building Your Kubernetes Incident Response Strategy
An incident response strategy can be built for a Kubernetes environment in two steps: building an incident response plan and preparing for container forensics.
Preparing an Incident Response Plan
It is critical to prepare an incident response plan for your Kubernetes environment. The plan should contain at least the following four stages, which can be expanded as required with the help of professional guidance.
Identification
This step aims to track security events to identify and report on suspected security incidents. Kubernetes monitoring tools should be used to report on activity in Kubernetes nodes and pods. To identify security-related issues such as container privilege escalations or malicious network communication, utilize dedicated Kubernetes security tools.
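As a rough illustration of what such a check looks for, the sketch below scans pod specifications for settings that commonly enable privilege escalation. It assumes the input has the shape produced by kubectl get pods --all-namespaces -o json; the function names and sample pods are hypothetical. This is a static check on configuration, not a substitute for dedicated runtime security tools.

```python
from typing import Any, Dict, List


def risky_settings(pod: Dict[str, Any]) -> List[str]:
    """List pod-level and container-level settings that weaken isolation."""
    findings: List[str] = []
    spec = pod.get("spec", {})
    # Sharing host namespaces gives the pod a view into the node itself.
    for flag in ("hostPID", "hostNetwork", "hostIPC"):
        if spec.get(flag):
            findings.append(flag)
    for container in spec.get("containers", []):
        ctx = container.get("securityContext") or {}
        if ctx.get("privileged"):
            findings.append(f"{container['name']}: privileged")
        # Only an explicit `true` is flagged here; unset values are ignored.
        if ctx.get("allowPrivilegeEscalation"):
            findings.append(f"{container['name']}: allowPrivilegeEscalation")
    return findings


def scan_pods(pod_list: Dict[str, Any]) -> Dict[str, List[str]]:
    """Map 'namespace/name' to findings for every pod that has any."""
    report: Dict[str, List[str]] = {}
    for pod in pod_list.get("items", []):
        findings = risky_settings(pod)
        if findings:
            meta = pod["metadata"]
            report[f"{meta['namespace']}/{meta['name']}"] = findings
    return report


# Made-up sample data in the shape of `kubectl get pods -o json` output.
sample_pods = {
    "items": [
        {
            "metadata": {"namespace": "prod", "name": "web-1"},
            "spec": {"containers": [{"name": "web"}]},
        },
        {
            "metadata": {"namespace": "prod", "name": "debug-1"},
            "spec": {
                "hostPID": True,
                "containers": [
                    {"name": "shell", "securityContext": {"privileged": True}}
                ],
            },
        },
    ]
}
report = scan_pods(sample_pods)
for pod_name, findings in report.items():
    print(pod_name, findings)
```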
Coordination
Once security analysts identify an incident, they should escalate it to senior analysts and involve others in the organization. This is where established processes with DevOps, development, and infrastructure teams will be extremely helpful. There should be a clear process, agreed in advance with senior management, for sharing details about vulnerabilities and receiving prioritized fixes.
Resolution
Even if DevOps and developers are doing their part, it remains the responsibility of the incident response team to resolve the incident. They must verify fixes, ensure the vulnerability can no longer be exploited, and clean intruders and malware from affected systems. Then, the appropriate staff must undertake the complex task of recovering production systems while working with the security team to ensure that the exploited vulnerabilities are remediated.
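One part of verifying a fix can be sketched as a check that no pod is still running the vulnerable image. The example assumes pod data in the shape of kubectl get pods -o json output; the function name, image names, and sample pods are hypothetical. It only confirms that the patched image has been rolled out, so it complements, rather than replaces, testing that the vulnerability can no longer be exploited.

```python
from typing import Any, Dict, List, Set, Tuple


def pods_still_vulnerable(
    pod_list: Dict[str, Any], vulnerable_images: Set[str]
) -> List[Tuple[str, str, str]]:
    """Return (pod, container, image) for containers still on a vulnerable image."""
    stale: List[Tuple[str, str, str]] = []
    for pod in pod_list.get("items", []):
        meta = pod["metadata"]
        spec = pod.get("spec", {})
        # Init containers can carry the same vulnerable image, so check both lists.
        containers = spec.get("initContainers", []) + spec.get("containers", [])
        for container in containers:
            image = container.get("image", "")
            if image in vulnerable_images:
                pod_name = f"{meta['namespace']}/{meta['name']}"
                stale.append((pod_name, container["name"], image))
    return stale


# Made-up sample: one pod has been patched, one has not.
BAD_IMAGE = "registry.example.com/shop:1.5.0"
running_pods = {
    "items": [
        {
            "metadata": {"namespace": "prod", "name": "shop-a"},
            "spec": {"containers": [
                {"name": "shop", "image": "registry.example.com/shop:1.5.1"}
            ]},
        },
        {
            "metadata": {"namespace": "prod", "name": "shop-b"},
            "spec": {"containers": [{"name": "shop", "image": BAD_IMAGE}]},
        },
    ]
}
stale = pods_still_vulnerable(running_pods, {BAD_IMAGE})
for pod_name, container_name, image in stale:
    print(f"{pod_name} container {container_name} still runs {image}")
```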
Continuous improvement
Every security incident is an opportunity to learn and improve. Beyond the emergency fixes performed during the crisis, incident responders should meet with technical teams to share lessons about broader security issues in the environment. Every incident should result in improved cluster configuration and the identification of weak or missing security controls.
Container Forensics
Once the required security protection measures for the Kubernetes environment are in place, the incident response plan should also ensure that the security team has access to all the information it needs for forensic analysis.
Logs
The following logs are vital for a full security investigation:
- Kubernetes logs from components, including the API server and the kubelet on individual nodes.
- Cloud infrastructure logs.
- Application logs.
- Operating system logs, with a particular focus on network connections, user logins, Secure Shell (SSH) sessions, and process execution.
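To show how API server logs can feed an investigation, here is a small sketch that scans Kubernetes audit log events for interactive access to pods. It assumes audit logging is enabled and written as one JSON event per line, and it reads the objectRef, user, sourceIPs, auditID, and requestReceivedTimestamp fields of the audit Event format. The function name and the sample events are made up for illustration.

```python
import json
from typing import Dict, Iterable, List

# Pod subresources that indicate someone reached into a running container.
INTERACTIVE_SUBRESOURCES = {"exec", "attach", "portforward"}


def interactive_sessions(lines: Iterable[str]) -> List[Dict[str, str]]:
    """Return one summary per audited request that opened a session to a pod."""
    seen_ids = set()
    sessions: List[Dict[str, str]] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        ref = event.get("objectRef") or {}
        if ref.get("resource") != "pods":
            continue
        if ref.get("subresource") not in INTERACTIVE_SUBRESOURCES:
            continue
        # The same request can be logged at several stages; report it once.
        audit_id = event.get("auditID")
        if audit_id in seen_ids:
            continue
        seen_ids.add(audit_id)
        sessions.append({
            "time": event.get("requestReceivedTimestamp", ""),
            "user": (event.get("user") or {}).get("username", ""),
            "action": ref["subresource"],
            "pod": f"{ref.get('namespace', '')}/{ref.get('name', '')}",
            "source_ips": ",".join(event.get("sourceIPs") or []),
        })
    return sessions


# Made-up audit events: one exec request logged at two stages, one pod listing.
sample_lines = [
    json.dumps({
        "auditID": "a1", "stage": "RequestReceived", "verb": "create",
        "requestReceivedTimestamp": "2024-01-01T10:00:00.000000Z",
        "user": {"username": "jane"}, "sourceIPs": ["203.0.113.7"],
        "objectRef": {"resource": "pods", "subresource": "exec",
                      "namespace": "prod", "name": "shop-b"},
    }),
    json.dumps({
        "auditID": "a1", "stage": "ResponseStarted", "verb": "create",
        "requestReceivedTimestamp": "2024-01-01T10:00:00.000000Z",
        "user": {"username": "jane"}, "sourceIPs": ["203.0.113.7"],
        "objectRef": {"resource": "pods", "subresource": "exec",
                      "namespace": "prod", "name": "shop-b"},
    }),
    json.dumps({
        "auditID": "a2", "stage": "ResponseComplete", "verb": "list",
        "user": {"username": "ci-bot"},
        "objectRef": {"resource": "pods", "namespace": "prod"},
    }),
]
sessions = interactive_sessions(sample_lines)
for session in sessions:
    print(session)
```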
Snapshot of the node
A simple, automated procedure for taking a snapshot of a node that is running a suspected malicious container should be mandatory for any deployment. Once the snapshot is taken, the node can be isolated, or the infected container can be removed, so that the rest of the environment can be restored.
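The snapshot itself depends on the cloud or virtualization platform and is not shown here. The isolation step that follows it can be sketched as below, assuming the cluster's network plugin enforces NetworkPolicy. The quarantine label, the policy name, and the function names are hypothetical; kubectl cordon and kubectl label are standard commands, and a NetworkPolicy that lists both policy types but no rules denies all traffic for the pods it selects.

```python
import json
from typing import Any, Dict, List

# Hypothetical label used to mark pods under investigation.
QUARANTINE_LABEL = ("quarantine", "true")


def quarantine_policy(namespace: str) -> Dict[str, Any]:
    """Build a NetworkPolicy that blocks all traffic to and from labeled pods."""
    key, value = QUARANTINE_LABEL
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "quarantine-deny-all", "namespace": namespace},
        "spec": {
            "podSelector": {"matchLabels": {key: value}},
            # Both types listed with no ingress or egress rules: deny everything.
            "policyTypes": ["Ingress", "Egress"],
        },
    }


def isolation_commands(node: str, namespace: str, pod: str) -> List[List[str]]:
    """Build (but do not run) the commands that isolate the node and the pod."""
    key, value = QUARANTINE_LABEL
    return [
        # Stop the scheduler from placing new pods on the suspect node.
        ["kubectl", "cordon", node],
        # Label the pod so the quarantine policy starts applying to it.
        ["kubectl", "label", "pod", pod, f"{key}={value}",
         "--overwrite", "--namespace", namespace],
    ]


policy = quarantine_policy("prod")
commands = isolation_commands("node-7", "prod", "shop-b")
print(json.dumps(policy, indent=2))  # JSON manifests can be passed to kubectl apply
for command in commands:
    print(" ".join(command))
```

Labeling the pod rather than deleting it keeps the container available for later forensic work while cutting its network access.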
Using the node snapshot enables analysis such as:
- Investigating and scanning disk images for malicious activity.
- Using docker inspect and other container engine tools to investigate malicious activity at the container level.
- Reviewing operating system activity in detail to identify if attackers managed to break out of containers to achieve root access.
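The container-level review can be sketched as a pass over the JSON that docker inspect prints for one or more containers. The example assumes the Docker field names HostConfig.Privileged, HostConfig.PidMode, HostConfig.NetworkMode, HostConfig.CapAdd, and Mounts; the list of sensitive host paths and the sample container are illustrative only. On nodes that use a different container runtime, the equivalent tool and its field names may differ.

```python
from typing import Any, Dict, List

# Illustrative list of host paths that deserve a closer look when mounted.
SENSITIVE_HOST_PATHS = {"/", "/etc", "/proc", "/var/run/docker.sock"}


def breakout_indicators(inspect_output: List[Dict[str, Any]]) -> Dict[str, List[str]]:
    """Map container name to settings that would make a breakout easier."""
    report: Dict[str, List[str]] = {}
    for container in inspect_output:
        host = container.get("HostConfig") or {}
        findings: List[str] = []
        if host.get("Privileged"):
            findings.append("privileged")
        if host.get("PidMode") == "host":
            findings.append("host PID namespace")
        if host.get("NetworkMode") == "host":
            findings.append("host network")
        # CapAdd is null in the JSON when no capabilities were added.
        for capability in host.get("CapAdd") or []:
            findings.append(f"added capability {capability}")
        for mount in container.get("Mounts") or []:
            if mount.get("Source") in SENSITIVE_HOST_PATHS:
                mode = "rw" if mount.get("RW") else "ro"
                findings.append(f"host mount {mount['Source']} ({mode})")
        if findings:
            name = container.get("Name") or container.get("Id", "unknown")
            report[name] = findings
    return report


# Made-up excerpt in the shape of `docker inspect` output.
inspected = [
    {
        "Id": "abc123",
        "Name": "/suspect",
        "HostConfig": {"Privileged": False, "PidMode": "host",
                       "NetworkMode": "default", "CapAdd": ["SYS_ADMIN"]},
        "Mounts": [{"Type": "bind", "Source": "/var/run/docker.sock",
                    "Destination": "/var/run/docker.sock", "RW": True}],
    },
    {
        "Id": "def456",
        "Name": "/web",
        "HostConfig": {"Privileged": False, "PidMode": "",
                       "NetworkMode": "default", "CapAdd": None},
        "Mounts": [],
    },
]
indicators = breakout_indicators(inspected)
for name, findings in indicators.items():
    print(name, findings)
```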
Container Visibility Tools
It is recommended that DevOps security analysts start with the tools available within Kubernetes and Docker, including the Docker stats API, to gather system metrics. These metrics are useful for analysts who only need to know how container load affects the system when it operates at scale.
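As a minimal sketch of turning raw stats into a usable metric, the functions below compute CPU and memory percentages from one sample returned by the Docker stats API. They assume the cpu_stats, precpu_stats, and memory_stats fields of that response and follow the CPU calculation described in the Docker Engine API documentation. The memory figure is simplified: it uses raw usage and does not subtract page cache. The sample values are made up.

```python
from typing import Any, Dict


def cpu_percent(stats: Dict[str, Any]) -> float:
    """CPU usage of the container as a percentage, from one stats sample."""
    cpu, previous = stats["cpu_stats"], stats["precpu_stats"]
    cpu_delta = (cpu["cpu_usage"]["total_usage"]
                 - previous["cpu_usage"]["total_usage"])
    system_delta = (cpu.get("system_cpu_usage", 0)
                    - previous.get("system_cpu_usage", 0))
    if cpu_delta <= 0 or system_delta <= 0:
        return 0.0
    # Fall back to one CPU if the sample does not report a CPU count.
    online_cpus = cpu.get("online_cpus") or 1
    return cpu_delta / system_delta * online_cpus * 100.0


def memory_percent(stats: Dict[str, Any]) -> float:
    """Memory usage as a percentage of the container's limit (cache included)."""
    memory = stats.get("memory_stats") or {}
    usage, limit = memory.get("usage", 0), memory.get("limit", 0)
    return usage / limit * 100.0 if limit else 0.0


# Made-up sample in the shape of a Docker stats API response.
sample_stats = {
    "cpu_stats": {
        "cpu_usage": {"total_usage": 400},
        "system_cpu_usage": 2000,
        "online_cpus": 2,
    },
    "precpu_stats": {
        "cpu_usage": {"total_usage": 200},
        "system_cpu_usage": 1000,
    },
    "memory_stats": {"usage": 256 * 1024 * 1024, "limit": 1024 * 1024 * 1024},
}
print(f"cpu: {cpu_percent(sample_stats):.1f}%")
print(f"memory: {memory_percent(sample_stats):.1f}%")
```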
Container visibility tools help DevOps find out what is occurring inside containers and pods. For example, they can help security teams understand whether important files are missing or unknown files have been added to a container, monitor real-time network communications, and identify anomalous behavior at the container or application level. All of this information must be available without requiring login credentials to the container.
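One simple way to see file changes from outside a container, on nodes that run Docker, is the docker diff command, which lists paths that were added (A), changed (C), or deleted (D) relative to the image. The sketch below parses that output; the function names, the watched directories, and the sample output are hypothetical. Dedicated visibility tools go well beyond this, so treat it only as an illustration of the idea.

```python
from typing import Dict, List

CHANGE_KINDS = {"A": "added", "C": "changed", "D": "deleted"}

# Hypothetical directories where new or modified files are worth a closer look.
WATCHED_PREFIXES = ("/bin/", "/usr/bin/", "/etc/", "/tmp/")


def parse_diff(output: str) -> Dict[str, List[str]]:
    """Group the paths printed by `docker diff` by kind of change."""
    changes: Dict[str, List[str]] = {"added": [], "changed": [], "deleted": []}
    for line in output.splitlines():
        kind, _, path = line.strip().partition(" ")
        if kind in CHANGE_KINDS and path:
            changes[CHANGE_KINDS[kind]].append(path)
    return changes


def suspicious_paths(changes: Dict[str, List[str]]) -> List[str]:
    """Return added or changed paths that fall under a watched directory."""
    candidates = changes["added"] + changes["changed"]
    return [path for path in candidates if path.startswith(WATCHED_PREFIXES)]


# Made-up output in the format printed by `docker diff <container>`.
sample_output = """C /etc
C /etc/passwd
A /tmp/miner
A /var/log/app.log
D /usr/share/doc/readme
"""
changes = parse_diff(sample_output)
flagged = suspicious_paths(changes)
print(changes)
print(flagged)
```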
Conclusion
For any organization that uses Kubernetes, the importance of including Kubernetes-specific actions in an incident response plan cannot be overstated. The keys to an effective plan include a clear DevOps process, integrated security processes, software development staff who are aware of security best practices, and hardened underlying infrastructure.
About the Author: Gilad David Maayan is a technology writer who has worked with over 150 technology companies including SAP, Imperva, Samsung NEXT, NetApp, and Ixia, producing technical and thought leadership content that elucidates technical solutions for developers and IT leadership. Today, he heads Agile SEO, the leading marketing agency in the technology industry.
LinkedIn: Gilad David Maayan
Twitter: @gilad_maayan
Editor’s Note: The opinions expressed in this guest author article are solely those of the contributor, and do not necessarily reflect those of Tripwire, Inc.