Post

Kafka Monitoring, Operations, and Troubleshooting

Master Kafka operations: implement comprehensive monitoring, handle common issues, perform maintenance, and ensure high availability.

Kafka Monitoring, Operations, and Troubleshooting

Welcome to Part 10 of our Apache Kafka series! We’ve covered the fundamentals, architecture, producers, consumers, topics/partitions, Streams, Connect, Schema Registry, and security. Now we focus on operations - keeping your Kafka clusters running smoothly in production.

Effective monitoring and operations are crucial for maintaining reliable, performant Kafka deployments. This post covers monitoring strategies, operational procedures, and troubleshooting techniques.

Monitoring Overview

Key Monitoring Areas

  • Cluster Health: Broker status, leadership, replication
  • Performance: Throughput, latency, resource utilization
  • Data Flow: Producer/consumer metrics, lag monitoring
  • Errors: Failed operations, exceptions, timeouts
  • Capacity: Disk usage, network I/O, memory

Monitoring Tools

  • JMX: Built-in Kafka metrics
  • Prometheus + Grafana: Popular open-source stack
  • Confluent Control Center: Commercial monitoring solution
  • Kafka Manager: Open-source web UI
  • Burrow: Consumer lag monitoring

JMX Metrics

Enabling JMX

# Start broker with JMX export JMX_PORT=9999 export KAFKA_JMX_OPTS="-Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false" ./bin/kafka-server-start.sh config/server.properties

Key JMX Metrics

Broker Metrics

// Broker state kafka.server:type=BrokerState,name=BrokerState // 1=Starting, 2=Recovering, 3=Running, 4=PendingControlledShutdown, 5=BrokerShuttingDown, 6=Shutdown // Partition count kafka.server:type=BrokerTopicMetrics,name=Partitions // Leader count kafka.server:type=ReplicaManager,name=LeaderCount // Under-replicated partitions kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions

Topic Metrics

// Per-topic metrics kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec,topic=orders kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec,topic=orders kafka.server:type=BrokerTopicMetrics,name=BytesOutPerSec,topic=orders

Producer Metrics

// Producer metrics kafka.producer:type=producer-metrics,client-id=producer-1,name=records-send-rate kafka.producer:type=producer-metrics,client-id=producer-1,name=record-error-rate kafka.producer:type=producer-metrics,client-id=producer-1,name=request-latency-avg

Consumer Metrics

// Consumer metrics kafka.consumer:type=consumer-metrics,client-id=consumer-1,name=records-consumed-rate kafka.consumer:type=consumer-fetch-manager-metrics,client-id=consumer-1,name=records-lag

Prometheus Monitoring

Kafka Exporter

# docker-compose.yml version: '3.8' services: kafka-exporter: image: danielqsj/kafka-exporter ports: - "9308:9308" command: - '--kafka.server=localhost:9092' - '--kafka.version=2.8.0' depends_on: - kafka

Prometheus Configuration

# prometheus.yml global: scrape_interval: 15s scrape_configs: - job_name: 'kafka' static_configs: - targets: ['localhost:9308']

Grafana Dashboard

// Key panels for Kafka dashboard { "title": "Kafka Overview", "panels": [ { "title": "Active Brokers", "targets": [{"expr": "up{job=\"kafka\"}"}] }, { "title": "Messages In/Out", "targets": [ {"expr": "kafka_topic_partition_current_offset{topic=~\"$topic\"}"}, {"expr": "rate(kafka_topic_partition_current_offset[5m])"} ] }, { "title": "Consumer Lag", "targets": [{"expr": "kafka_consumergroup_lag"}] } ] }

Operational Procedures

Cluster Startup

# Start ZooKeeper ensemble ./bin/zookeeper-server-start.sh config/zookeeper.properties # Start brokers in order ./bin/kafka-server-start.sh config/server-1.properties ./bin/kafka-server-start.sh config/server-2.properties ./bin/kafka-server-start.sh config/server-3.properties # Verify cluster health ./bin/kafka-topics.sh --describe --bootstrap-server localhost:9092

Rolling Restart

# Check for under-replicated partitions ./bin/kafka-topics.sh --describe --under-replicated-partitions --bootstrap-server localhost:9092 # Restart brokers one by one ./bin/kafka-server-stop.sh config/server-1.properties ./bin/kafka-server-start.sh config/server-1.properties # Wait for full recovery # Check metrics: kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions == 0

Scaling the Cluster

# Add new broker # Update broker.id in server.properties broker.id=4 # Start new broker ./bin/kafka-server-start.sh config/server-4.properties # Reassign partitions (optional, for load balancing) ./bin/kafka-reassign-partitions.sh --execute --reassignment-json-file expand-cluster.json --bootstrap-server localhost:9092

Backup and Recovery

# Backup topics ./bin/kafka-mirror-maker.sh --consumer.config consumer.properties --producer.config producer.properties --whitelist ".*" # Backup consumer offsets ./bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group my-group --members --verbose # Restore from backup # Use MirrorMaker in reverse direction

Troubleshooting Common Issues

High Consumer Lag

Symptoms:

  • Consumer lag increasing
  • Consumers can’t keep up with producers

Diagnosis:

# Check consumer group status ./bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group my-consumer-group # Check consumer metrics # Look for: records-consumed-rate, records-lag

Solutions:

# Increase consumer instances # Increase partition count # Optimize consumer configuration fetch.min.bytes=1 fetch.max.wait.ms=500 max.poll.records=1000 # Check for processing bottlenecks # Monitor consumer thread utilization

Under-Replicated Partitions

Symptoms:

  • UnderReplicatedPartitions > 0
  • Potential data loss risk

Diagnosis:

# Find under-replicated partitions ./bin/kafka-topics.sh --describe --under-replicated-partitions --bootstrap-server localhost:9092 # Check broker status ./bin/kafka-broker-api-versions.sh --bootstrap-server localhost:9092

Solutions:

# Restart failed brokers # Check disk space # Verify network connectivity # Increase replication factor if needed ./bin/kafka-topics.sh --alter --topic my-topic --replication-factor 3 --bootstrap-server localhost:9092

Broker Unavailable

Symptoms:

  • Broker not responding to requests
  • Clients can’t connect

Diagnosis:

# Check broker logs tail -f /var/log/kafka/server.log # Check broker metrics # Look for: BrokerState, NetworkProcessorAvgIdlePercent # Test connectivity telnet localhost 9092

Solutions:

# Check system resources (CPU, memory, disk) # Verify configuration # Check for JVM issues (GC pauses) # Restart broker if necessary

Disk Full

Symptoms:

  • Broker crashes or becomes unresponsive
  • NoSpaceLeftOnDevice errors

Diagnosis:

# Check disk usage df -h /var/lib/kafka # Check log directory size du -sh /var/lib/kafka/data

Solutions:

# Increase retention time ./bin/kafka-configs.sh --alter --add-config retention.ms=86400000 --topic my-topic --bootstrap-server localhost:9092 # Enable log compaction ./bin/kafka-configs.sh --alter --add-config cleanup.policy=compact --topic my-topic --bootstrap-server localhost:9092 # Add more disk space # Clean up old logs

Network Issues

Symptoms:

  • Connection timeouts
  • High latency
  • Intermittent failures

Diagnosis:

# Check network connectivity ping broker1.example.com # Monitor network metrics # Look for: network-io-rate, request-queue-size # Check for packet loss mtr broker1.example.com

Solutions:

# Adjust timeout settings request.timeout.ms=30000 replica.lag.time.max.ms=30000 # Increase socket buffer sizes socket.send.buffer.bytes=1048576 socket.receive.buffer.bytes=1048576 # Check network hardware # Implement network redundancy

Performance Tuning

Broker Tuning

# server.properties - Performance settings num.network.threads=9 num.io.threads=16 num.replica.fetchers=2 # Buffer sizes socket.send.buffer.bytes=1048576 socket.receive.buffer.bytes=1048576 # Log settings num.partitions=8 log.flush.interval.messages=10000 log.flush.interval.ms=1000

Producer Tuning

# Producer performance batch.size=1048576 linger.ms=10 compression.type=lz4 acks=1 # Buffer settings buffer.memory=67108864 max.block.ms=60000

Consumer Tuning

# Consumer performance fetch.min.bytes=1024 fetch.max.wait.ms=500 max.poll.records=1000 enable.auto.commit=false # Processing settings max.poll.interval.ms=300000 session.timeout.ms=30000

Capacity Planning

Sizing Guidelines

# Disk capacity # Estimate: (messages_per_sec * message_size * retention_hours * 2) / 3600 # Network capacity # Estimate: messages_per_sec * (message_size + overhead) # Memory requirements # Heap: 4-8GB per broker # Page cache: 50-70% of system memory

Monitoring Capacity

# Disk usage alerts # Alert when > 80% disk used # Network saturation # Monitor network-io-rate vs capacity # CPU utilization # Alert when > 70% sustained CPU

Maintenance Tasks

Log Cleanup

# Manual log cleanup ./bin/kafka-log-dirs.sh --describe --bootstrap-server localhost:9092 # Force log compaction ./bin/kafka-configs.sh --alter --add-config cleanup.policy=compact,delete --topic my-topic --bootstrap-server localhost:9092

Index Rebuilding

# Rebuild indexes (requires broker restart) # Delete .index files # Broker will rebuild on startup

Configuration Updates

# Update broker configs dynamically ./bin/kafka-configs.sh --alter --add-config log.retention.hours=24 --entity-type brokers --entity-name 1 --bootstrap-server localhost:9092 # Update topic configs ./bin/kafka-configs.sh --alter --add-config retention.ms=86400000 --topic my-topic --bootstrap-server localhost:9092

Disaster Recovery

Multi-Cluster Setup

# MirrorMaker 2.0 for cross-cluster replication ./bin/connect-mirror-maker.sh config/mm2.properties # Configuration clusters=primary,secondary primary.bootstrap.servers=primary:9092 secondary.bootstrap.servers=secondary:9092

Backup Strategy

# Regular backups of: # - Configuration files # - Consumer offsets # - Schema Registry schemas # - Custom connectors # Test restore procedures regularly

Alerting

Key Alerts

# Prometheus alerting rules groups: - name: kafka rules: - alert: KafkaDown expr: up{job="kafka"} == 0 for: 5m labels: severity: critical - alert: HighConsumerLag expr: kafka_consumergroup_lag > 10000 for: 10m labels: severity: warning - alert: UnderReplicatedPartitions expr: kafka_server_replicamanager_underreplicatedpartitions > 0 for: 5m labels: severity: critical

Best Practices

1. Monitoring Strategy

# Implement comprehensive monitoring # Use multiple tools (JMX, Prometheus, logs) # Set up alerts for critical metrics # Monitor trends, not just current values # Document normal vs abnormal behavior

2. Operational Procedures

# Document all procedures # Test procedures in staging # Implement change management # Have rollback plans # Schedule maintenance windows

3. Capacity Management

# Monitor resource utilization trends # Plan for growth (3-6 months ahead) # Implement auto-scaling where possible # Regular capacity reviews

4. Incident Response

# Define severity levels # Establish response procedures # Set up communication channels # Conduct post-mortems # Implement preventive measures

Real-World Monitoring Dashboard

{ "dashboard": { "title": "Kafka Production Monitoring", "panels": [ { "title": "Cluster Health", "type": "stat", "targets": [ {"expr": "count(kafka_server_brokerstate == 3)"}, {"expr": "kafka_server_replicamanager_underreplicatedpartitions"} ] }, { "title": "Throughput", "type": "graph", "targets": [ {"expr": "rate(kafka_server_brokertopicmetrics_messagesin_total[5m])"}, {"expr": "rate(kafka_server_brokertopicmetrics_bytesin_total[5m])"} ] }, { "title": "Consumer Lag", "type": "table", "targets": [{"expr": "kafka_consumergroup_lag"}] }, { "title": "System Resources", "type": "graph", "targets": [ {"expr": "100 - (avg by(instance) (irate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)"}, {"expr": "(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100"} ] } ] } }

What’s Next?

In this comprehensive guide to Kafka monitoring and operations, we’ve covered:

  • JMX and Prometheus monitoring
  • Operational procedures (startup, scaling, backup)
  • Troubleshooting common issues
  • Performance tuning
  • Capacity planning
  • Maintenance tasks
  • Disaster recovery
  • Alerting and best practices

You should now be able to operate and maintain production Kafka clusters effectively.

In Part 11, we’ll explore real-world use cases and best practices - applying Kafka in various scenarios with proven patterns. For production deployments on Kubernetes, check out our comprehensive Kafka on Kubernetes deployment guide.

Additional Resources


*This is Part 10 of our comprehensive Apache Kafka series. Part 9: Kafka Security ← Part 11: Real-World Use Cases →*
This post is licensed under CC BY 4.0 by the author.