Post

Running Apache Kafka on Kubernetes: A Comprehensive Deployment Guide

A deep dive into deploying Apache Kafka on Kubernetes, covering node architecture considerations, deployment patterns, storage solutions, and best practices for production-ready clusters.

Welcome back to our Kafka series! In previous posts, we’ve explored the fundamentals of Apache Kafka, its architecture, and use cases for real-time data streaming. Today, we’re diving into deploying Kafka on Kubernetes (K8s), a powerful combination that enables scalable, resilient, and manageable Kafka clusters. This guide builds on our earlier discussions by focusing on production-ready deployments, with an emphasis on why node architecture directly impacts Kafka’s performance and reliability.

Whether you’re migrating from on-premises setups or building cloud-native applications, this post will equip you with the knowledge to run Kafka effectively on K8s. We’ll cover everything from infrastructure considerations to best practices, complete with code examples and YAML configurations.

1. Why Kubernetes is Important for Kafka Deployments

Kubernetes has become the de facto standard for container orchestration, and for good reason when it comes to Kafka:

  • Scalability and Elasticity: K8s allows dynamic scaling of Kafka brokers based on load, ensuring optimal resource utilization.
  • High Availability: Built-in replication, self-healing, and rolling updates minimize downtime.
  • Resource Management: Fine-grained control over CPU, memory, and storage allocations prevents resource contention.
  • Portability: Deploy the same Kafka setup across on-premises, cloud, or hybrid environments.
  • Ecosystem Integration: Seamless integration with other K8s-native tools like Prometheus for monitoring or Istio for service mesh.

Kafka’s distributed nature aligns perfectly with K8s’ pod-based architecture, where each broker can run as a pod. However, Kafka’s performance is sensitive to underlying infrastructure, making node architecture a critical factor.

2. Node Architecture Considerations

Kafka’s performance hinges on low-latency, high-throughput operations. Poor node design can lead to bottlenecks in message processing, replication, and disk I/O. Here’s what to prioritize:

CPU

  • Cores and Threads: Kafka benefits from multi-core CPUs. Allocate at least 4-8 cores per broker for handling concurrent requests. Hyper-threading can improve performance, but avoid over-subscription.
  • Clock Speed: Higher clock speeds (e.g., 3.0 GHz+) reduce processing latency for message serialization/deserialization.
  • NUMA Awareness: In multi-socket nodes, ensure pods are scheduled on the same NUMA node to avoid cross-socket memory access penalties.

Memory

  • Heap Size: Set JVM heap to 50-75% of available RAM. For example, on a 32GB node, allocate 16-24GB to the Kafka JVM.
  • Off-Heap Memory: Reserve memory for page cache (Linux buffer cache) – Kafka relies heavily on OS caching for log segments.
  • Avoid Swapping: Disable swap entirely; swapping can cause unpredictable latency spikes.

Storage

  • Disk Type: Use NVMe SSDs for low-latency I/O. HDDs are insufficient for high-throughput scenarios.
  • IOPS and Throughput: Aim for 10,000+ IOPS and 500MB/s+ throughput per disk.
  • RAID Configuration: RAID 10 for redundancy and performance, or use K8s persistent volumes with replication.
  • Log Directory: Mount separate disks for Kafka logs to isolate I/O from system operations.

Networking

  • Bandwidth: 10Gbps+ NICs are essential for inter-broker replication and client traffic.
  • Latency: Keep network latency under 1ms within the cluster.
  • MTU: Use jumbo frames (9000 MTU) to reduce CPU overhead on large packets.

Why It Matters: In a benchmark, a node with insufficient CPU cores saw 40% lower throughput during peak loads, while slow disks increased end-to-end latency by 200ms. Proper architecture ensures Kafka can handle millions of messages per second with sub-millisecond latency.

Example node specification for a Kafka broker:

apiVersion: v1 kind: Node metadata: labels: kafka-node: "true" spec: # Ensure nodes have at least: # - 8 CPU cores # - 32GB RAM # - NVMe SSDs with 1TB+ capacity # - 10Gbps networking

3. Deployment Patterns

Several approaches exist for deploying Kafka on K8s, each with trade-offs:

Strimzi is a K8s-native operator for Kafka, providing declarative deployments.

apiVersion: kafka.strimzi.io/v1beta2 kind: Kafka metadata: name: my-kafka-cluster spec: kafka: version: 3.6.0 replicas: 3 listeners: - name: plain port: 9092 type: internal tls: false config: offsets.topic.replication.factor: 3 transaction.state.log.replication.factor: 3 transaction.state.log.min.isr: 2 storage: type: persistent-claim size: 100Gi deleteClaim: false zookeeper: replicas: 3 storage: type: persistent-claim size: 10Gi entityOperator: topicOperator: {} userOperator: {}

Confluent Platform

Confluent’s operator offers enterprise features like Schema Registry and ksqlDB.

apiVersion: platform.confluent.io/v1beta1 kind: Kafka metadata: name: kafka spec: replicas: 3 image: application: confluentinc/cp-kafka:7.5.0 dataVolumeCapacity: 100Gi # Additional Confluent-specific configs

Custom Operators

For bespoke needs, build a custom operator using K8s SDKs. This allows fine-tuning but requires more maintenance.

Best Practice: Start with Strimzi for most use cases due to its active community and simplicity.

4. Storage Solutions for Kafka on K8s

Kafka requires durable, high-performance storage for log segments. K8s persistent volumes (PVs) are ideal:

  • Persistent Volume Claims (PVCs): Use dynamic provisioning with storage classes like AWS EBS, GCP Persistent Disk, or local SSDs.
  • Storage Classes: Define classes for different performance tiers.
apiVersion: storage.k8s.io/v1 kind: StorageClass metadata: name: kafka-storage provisioner: kubernetes.io/aws-ebs parameters: type: gp3 # General-purpose SSD iops: "3000" throughput: "125" reclaimPolicy: Retain
  • Local Storage: For ultra-low latency, use local NVMe with hostPath volumes, but ensure data persistence via backups.
  • Network Storage: NFS or Ceph for shared storage, though avoid for high-IOPS needs.

Pitfall: Never use ephemeral storage; Kafka logs must survive pod restarts.

5. Networking and Service Discovery

K8s networking ensures brokers communicate efficiently:

  • ClusterIP Services: For internal access, expose brokers via headless services for stable DNS.
  • Ingress Controllers: For external clients, use NGINX or Traefik ingress with TLS termination.
  • Service Mesh: Integrate with Istio for advanced routing, load balancing, and security.
apiVersion: v1 kind: Service metadata: name: kafka-service spec: selector: app: kafka ports: - port: 9092 targetPort: 9092 clusterIP: None # Headless for DNS-based discovery

Why It Matters: Proper networking prevents partition issues; misconfigured DNS can cause replication failures.

6. Scaling and Resource Management

Scale Kafka horizontally by adding brokers or vertically by adjusting resources:

  • Horizontal Pod Autoscaler (HPA): Scale based on CPU/memory metrics.
  • Kafka’s Built-in Scaling: Increase partition count dynamically.
  • Resource Limits: Set requests/limits to prevent noisy neighbor issues.
apiVersion: v1 kind: Pod metadata: name: kafka-broker spec: containers: - name: kafka image: confluentinc/cp-kafka:latest resources: requests: memory: "16Gi" cpu: "4" limits: memory: "24Gi" cpu: "8" # Affinity rules to spread across nodes affinity: podAntiAffinity: requiredDuringSchedulingIgnoredDuringExecution: - labelSelector: matchExpressions: - key: app operator: In values: - kafka topologyKey: kubernetes.io/hostname

Best Practice: Use taints/tolerations to dedicate nodes for Kafka workloads.

7. Monitoring and Observability in K8s

Monitor Kafka with Prometheus and Grafana:

  • Metrics: Expose JMX metrics via Kafka’s built-in exporter.
  • Dashboards: Use pre-built Grafana dashboards for Kafka (e.g., Confluent’s).
  • Logging: Centralize logs with Fluentd/Elasticsearch.
  • Alerts: Set up rules for under-replicated partitions or high latency.

Example Prometheus scrape config:

scrape_configs: - job_name: 'kafka' static_configs: - targets: ['kafka-broker-0:9092', 'kafka-broker-1:9092'] metrics_path: /metrics

Common Pitfall: Ignoring JVM metrics; monitor GC pauses to detect memory issues.

8. Security Considerations

Secure Kafka on K8s with:

  • Authentication: Use SASL/SCRAM or OAuth with K8s service accounts.
  • Authorization: ACLs via Kafka’s authorizer.
  • Encryption: TLS for in-transit data; encrypt at-rest with storage-level encryption.
  • Network Policies: Restrict pod-to-pod traffic.
apiVersion: networking.k8s.io/v1 kind: NetworkPolicy metadata: name: kafka-policy spec: podSelector: matchLabels: app: kafka policyTypes: - Ingress - Egress ingress: - from: - podSelector: matchLabels: app: kafka-client

Best Practice: Enable RBAC for K8s resources and use secrets for sensitive configs.

9. Best Practices and Common Pitfalls

  • Avoid Over-Provisioning: Start small and scale; over-provisioning wastes resources.
  • Test Failures: Simulate node failures to validate HA.
  • Version Upgrades: Use rolling updates; test compatibility.
  • Pitfalls: Ignoring disk I/O can cause log corruption; mismatched replication factors lead to data loss.
  • Performance Tuning: Adjust num.io.threads, num.network.threads, and buffer sizes based on load.

10. Real-World Examples and Case Studies

  • E-commerce Platform: A retail company scaled Kafka from 10 to 100 brokers on K8s, achieving 10x throughput by optimizing node CPU and storage.
  • IoT Data Pipeline: Used Strimzi to deploy Kafka on EKS, reducing operational overhead by 50% with automated scaling.
  • Financial Services: Implemented Confluent on K8s for low-latency trading data, ensuring <1ms latency through dedicated nodes and NVMe storage.

In conclusion, running Kafka on Kubernetes unlocks immense scalability and reliability, but success depends on thoughtful node architecture and configuration. Experiment with these patterns in a staging environment, and remember: performance is king for real-time systems.

Stay tuned for more in our Kafka series, including advanced topics like stream processing with Kafka Streams. What challenges have you faced deploying Kafka? Share in the comments!

References:

This post is licensed under CC BY 4.0 by the author.