Scaling Smart: My Journey to Cost-Effective and Optimized EKS Clusters with Karpenter, KEDA, and Goldilocks

“Welcome to my first tech blog! I’ve spent a lot of time optimizing cloud infrastructure, and I’m excited to share my journey and insights on EKS cost optimization. I hope you find this look at Karpenter, KEDA, and Goldilocks both informative and engaging.”

Introduction

Navigating cloud infrastructure management, particularly with Amazon EKS (Elastic Kubernetes Service), is no small feat. Balancing performance, scalability, and cost efficiency can feel like a constant juggling act. Our multi-tenant EKS cluster, running more than 6,000 pods across more than 180 nodes, presented a significant financial challenge. This blog details how we optimized our EKS costs, ultimately saving $300,000 through the strategic implementation of Karpenter, KEDA, and Goldilocks.

The Challenge: Rising EKS Costs and the Need for Efficient Optimization

Our initial approach relied on the Cluster Autoscaler, which worked reasonably well but struggled to keep up with our scaling demands as our environment grew. Our resource allocation was akin to booking a grand banquet hall for a small gathering, resulting in wasted capacity and soaring costs. Recognizing the need to refine our strategy, we turned to more advanced solutions.

Our Solution: A Triumvirate of Tools for Enhanced Cost Efficiency

To address our cost challenges, we implemented three powerful tools: Karpenter, KEDA, and Goldilocks. Each tool addressed a different aspect of cost optimization and resource management, providing a holistic approach to our scaling and efficiency issues.

Karpenter: Transforming Node Provisioning

Karpenter changed how we provision nodes: instead of resizing fixed node groups, it launches capacity just in time, based on what the workloads actually need.

Key Benefits:

  • Dynamic Node Provisioning: Karpenter provisions nodes based on real-time application needs, spinning nodes up or down as required. This approach ensures we only pay for what we use, adjusting to traffic spikes and reducing costs during periods of lower demand.

  • Cost-Effective Instances: We configured our default nodepool to include a mix of on-demand and spot instances. Spot instances, being more cost-effective, were used for non-critical applications, while on-demand instances ensured reliability for high-priority workloads.

  • ARM64 (Graviton) Instances: Introducing the ARM64 architecture into our nodepools resulted in up to 20% cost savings, as ARM64 instances offer competitive performance at a lower price than their x86 counterparts.
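
To make the nodepool setup above more concrete, here is a minimal sketch of what such a configuration could look like. It is not our production configuration: it assumes the Karpenter `karpenter.sh/v1` NodePool API, an existing EC2NodeClass named `default`, and illustrative values for the CPU limit and consolidation settings. The function name `build_nodepool` is hypothetical. The manifest is built as a Python dictionary and printed as JSON, which `kubectl apply -f` also accepts.

```python
import json


def build_nodepool(name: str = "default") -> dict:
    """Build a Karpenter NodePool that allows Spot and On-Demand, arm64 and amd64."""
    return {
        "apiVersion": "karpenter.sh/v1",  # assumption: Karpenter v1 API
        "kind": "NodePool",
        "metadata": {"name": name},
        "spec": {
            "template": {
                "spec": {
                    # Assumption: an EC2NodeClass called "default" already exists.
                    "nodeClassRef": {
                        "group": "karpenter.k8s.aws",
                        "kind": "EC2NodeClass",
                        "name": "default",
                    },
                    "requirements": [
                        # Allow both purchase options in the same pool.
                        {
                            "key": "karpenter.sh/capacity-type",
                            "operator": "In",
                            "values": ["spot", "on-demand"],
                        },
                        # Allow Graviton (arm64) next to x86 (amd64).
                        {
                            "key": "kubernetes.io/arch",
                            "operator": "In",
                            "values": ["arm64", "amd64"],
                        },
                    ],
                }
            },
            # Illustrative ceiling on the total CPU this pool may provision.
            "limits": {"cpu": "1000"},
            # Illustrative consolidation settings for removing idle capacity.
            "disruption": {
                "consolidationPolicy": "WhenEmptyOrUnderutilized",
                "consolidateAfter": "1m",
            },
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_nodepool(), indent=2))
```

With a pool like this, a critical workload can still insist on On-Demand capacity by adding a node selector for `karpenter.sh/capacity-type: on-demand` to its pod spec.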

Challenges and Solutions:

  • Deployment Issues: Initial deployment conflicts arose when Karpenter was installed on the same nodes it was managing. We resolved this by redeploying Karpenter on Fargate, separating the management layer from the managed nodes.

  • Max Pods Issue: Karpenter’s handling of max pod limits with custom VPC CNI configurations caused problems. We addressed this by integrating a max_pods calculation script from AWS during node bootstrapping to ensure accurate pod density.
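
For readers wondering why custom VPC CNI settings change pod density, the arithmetic can be sketched as follows. This is my own simplified illustration, not the AWS script itself: it covers only secondary-IP mode (no prefix delegation), and the function name `max_pods` is hypothetical. Each network interface keeps one address for itself, two extra pods use host networking, and with custom networking the primary interface is not used for pods.

```python
def max_pods(enis: int, ips_per_eni: int, custom_networking: bool = False) -> int:
    """Estimate max pods per node for the VPC CNI in secondary-IP mode.

    enis: number of network interfaces the instance type supports.
    ips_per_eni: IPv4 addresses per interface for that instance type.
    custom_networking: True when the primary interface is not used for pods.
    """
    if enis < 1 or ips_per_eni < 1:
        raise ValueError("enis and ips_per_eni must be positive")
    usable_enis = enis - 1 if custom_networking else enis
    # One address per interface is the interface's own; +2 for host-network pods.
    return usable_enis * (ips_per_eni - 1) + 2


if __name__ == "__main__":
    # An m5.large supports 3 interfaces with 10 IPv4 addresses each.
    print(max_pods(3, 10))                          # 29
    print(max_pods(3, 10, custom_networking=True))  # 20
```

If the node is bootstrapped with the default value (29 in this example) while custom networking is enabled, the scheduler can place more pods on the node than it has addresses for, which is the kind of mismatch the bootstrap-time calculation avoids.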

KEDA: Scaling Event-Driven Workloads with Precision

KEDA (Kubernetes Event-driven Autoscaling) was instrumental in scaling workloads based on real-time metrics, such as API request counts and message queue lengths.

Key Benefits:

  • Custom Metrics: KEDA enabled scaling on metrics beyond CPU and memory usage. For example, we scaled applications based on queue length and request rates, so that replica counts tracked actual demand.

  • Cron-Based Scaling: KEDA’s cron scaler allowed us to scale non-production environments down to zero during off-hours, significantly reducing costs for development and staging environments that didn’t require 24/7 operation.
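
As an illustration of the cron-based approach, here is a minimal sketch of a KEDA ScaledObject that keeps a deployment running only during working hours. The schedule, time zone, replica count, and the function name `build_office_hours_scaler` are assumptions for the example, not our actual settings. Outside the cron window no trigger is active, so KEDA falls back to `minReplicaCount`, which is zero here.

```python
import json


def build_office_hours_scaler(
    deployment: str,
    namespace: str,
    replicas: int = 2,
    timezone: str = "Etc/UTC",
) -> dict:
    """Build a KEDA ScaledObject that runs `replicas` pods on weekdays, 08:00-20:00."""
    return {
        "apiVersion": "keda.sh/v1alpha1",
        "kind": "ScaledObject",
        "metadata": {"name": f"{deployment}-office-hours", "namespace": namespace},
        "spec": {
            "scaleTargetRef": {"name": deployment},
            "minReplicaCount": 0,  # scale to zero outside the window
            "maxReplicaCount": replicas,
            "triggers": [
                {
                    "type": "cron",
                    "metadata": {
                        "timezone": timezone,
                        "start": "0 8 * * 1-5",   # illustrative: 08:00, Mon-Fri
                        "end": "0 20 * * 1-5",    # illustrative: 20:00, Mon-Fri
                        "desiredReplicas": str(replicas),  # trigger metadata values are strings
                    },
                }
            ],
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_office_hours_scaler("web", "staging"), indent=2))
```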

Challenges and Solutions:

  • Implementation Complexity: Implementing KEDA involved updating our scaling strategies and training application teams. We provided comprehensive documentation and hands-on training sessions to facilitate effective use of KEDA’s features.

Goldilocks: Achieving the Perfect Resource Allocation

Goldilocks was crucial in identifying and eliminating over-provisioned resources, ensuring we only used and paid for what was necessary.

Key Benefits:

  • Resource Utilization Analysis: Goldilocks analyzed pod resource usage and offered recommendations for right-sizing. By integrating it with Prometheus, we used historical data to optimize resource allocation. Goldilocks showed that many pods were over-provisioned by 30-40%; right-sizing them led to significant cost reductions.

  • Visualization and Reporting: Goldilocks’ visualization capabilities helped pinpoint resource waste. We created custom Grafana dashboards with SSO integration to provide secure, real-time access to optimization recommendations.
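
To show what an over-provisioning figure like this means in practice, here is a small worked calculation. It assumes that "over-provisioned by X%" is the share of the current request that exceeds the recommendation; the numbers and the function name `overprovision_pct` are made up for illustration.

```python
def overprovision_pct(requested_millicores: int, recommended_millicores: int) -> float:
    """Return the share of the current CPU request that exceeds the recommendation."""
    if requested_millicores <= 0:
        raise ValueError("requested_millicores must be positive")
    return (requested_millicores - recommended_millicores) * 100 / requested_millicores


if __name__ == "__main__":
    # Hypothetical pod: requests 1000m CPU, recommendation is 650m.
    print(overprovision_pct(1000, 650))  # 35.0
```

In this hypothetical case, lowering the request from 1000m to 650m frees 350m of schedulable CPU per replica, which lets the node provisioner pack the same workloads onto fewer or smaller nodes.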

Challenges and Solutions:

  • Security and Accessibility: The open-source version of Goldilocks lacked certain security features and user access controls. We addressed this by developing custom Grafana dashboards with SSO integration, ensuring secure and convenient access to recommendations.

  • Automated Reporting: We created a Python script to generate and email optimization reports based on namespace labels, ensuring that even teams without direct dashboard access received actionable insights.
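
The reporting idea can be sketched roughly as follows. This is not our actual script: the record fields, the `team` label key, the addresses, and the function names are all hypothetical, and only the Python standard library is used. Recommendations are grouped by the team label on each namespace, and one plain-text email is built per team.

```python
from collections import defaultdict
from email.message import EmailMessage


def group_by_team(recommendations, namespace_labels, label_key="team"):
    """Group recommendation records by the team label of their namespace."""
    grouped = defaultdict(list)
    for rec in recommendations:
        labels = namespace_labels.get(rec["namespace"], {})
        grouped[labels.get(label_key, "unassigned")].append(rec)
    return dict(grouped)


def build_report(team, recs, sender, recipient):
    """Build a plain-text email listing current and recommended CPU requests."""
    msg = EmailMessage()
    msg["Subject"] = f"Right-sizing recommendations for {team}"
    msg["From"] = sender
    msg["To"] = recipient
    lines = [
        f"{r['namespace']}/{r['workload']}: cpu {r['cpu_request']} -> {r['cpu_recommended']}"
        for r in recs
    ]
    msg.set_content("\n".join(lines))
    return msg


if __name__ == "__main__":
    # Hypothetical sample data.
    recs = [
        {"namespace": "payments", "workload": "api",
         "cpu_request": "1000m", "cpu_recommended": "650m"},
    ]
    labels = {"payments": {"team": "fintech"}}
    for team, team_recs in group_by_team(recs, labels).items():
        print(build_report(team, team_recs, "platform@example.com", f"{team}@example.com"))
```

Sending the message is then a matter of passing it to `smtplib.SMTP(...).send_message(msg)` against whatever mail relay is available.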

Our Journey: Implementing and Optimizing

Phase One: Preparation and Initial Setup

  • Resource Quotas: Implemented resource quotas for each namespace to prevent over-provisioning and ensure fair distribution of resources. This approach helped manage our multi-tenant environment effectively.

  • Annotations & Labels: Added metadata to applications, including team names, owners, and support contacts. This metadata facilitated targeted communications and streamlined responsibility tracking during optimization.
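
As a rough sketch of this preparation step, the snippet below builds a namespace with ownership metadata and a matching ResourceQuota. The quota values, the `team` label, and the `example.com/...` annotation keys are placeholders I chose for illustration. Contact details go into annotations rather than labels because label values cannot contain characters such as "@".

```python
import json


def build_namespace(name: str, team: str, owner: str, support_contact: str) -> dict:
    """Build a Namespace manifest carrying team and ownership metadata."""
    return {
        "apiVersion": "v1",
        "kind": "Namespace",
        "metadata": {
            "name": name,
            "labels": {"team": team},  # hypothetical label key
            "annotations": {
                # Hypothetical annotation keys; values may be free-form text.
                "example.com/owner": owner,
                "example.com/support-contact": support_contact,
            },
        },
    }


def build_quota(namespace: str, cpu: str, memory: str, pods: int) -> dict:
    """Build a ResourceQuota capping total requests and pod count in a namespace."""
    return {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {"name": f"{namespace}-quota", "namespace": namespace},
        "spec": {
            "hard": {
                "requests.cpu": cpu,
                "requests.memory": memory,
                "pods": str(pods),
            }
        },
    }


if __name__ == "__main__":
    ns = build_namespace("payments", "fintech", "Jane Doe", "payments-oncall@example.com")
    quota = build_quota("payments", cpu="40", memory="80Gi", pods=200)
    print(json.dumps([ns, quota], indent=2))
```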

Phase Two: Migrating to Karpenter for Advanced Node Provisioning

  • Deployment and Configuration: Resolved initial deployment issues by moving Karpenter to Fargate and configuring nodepools with appropriate instance types. Introducing ARM64 instances into the default nodepool provided additional cost savings and performance benefits.

Phase Three: Right-Sizing with VPA and Goldilocks

  • Prometheus Integration: Deployed Prometheus to monitor resource usage and feed data into the Vertical Pod Autoscaler (VPA). VPA used 30 days of historical data for balanced recommendations, leading to more efficient resource allocation.

  • Goldilocks Dashboards: Custom Grafana dashboards with SSO integration provided secure access to optimization insights. Automated email reports ensured teams received timely information on resource recommendations.
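
For context on how recommendations are produced without touching running pods, here is a minimal sketch under a few assumptions. Goldilocks picks up namespaces labelled `goldilocks.fairwinds.com/enabled: "true"` and manages VPA objects for their workloads; the VPA below shows what a recommendation-only object looks like when `updateMode` is `Off`. The function names and the `-vpa` naming are hypothetical.

```python
import json

# Namespace label that opts a namespace in to Goldilocks.
GOLDILOCKS_LABEL = {"goldilocks.fairwinds.com/enabled": "true"}


def namespace_label_patch() -> dict:
    """Patch body that enables Goldilocks on a namespace."""
    return {"metadata": {"labels": dict(GOLDILOCKS_LABEL)}}


def build_vpa(deployment: str, namespace: str) -> dict:
    """Build a VPA that only records recommendations and never evicts pods."""
    return {
        "apiVersion": "autoscaling.k8s.io/v1",
        "kind": "VerticalPodAutoscaler",
        "metadata": {"name": f"{deployment}-vpa", "namespace": namespace},
        "spec": {
            "targetRef": {
                "apiVersion": "apps/v1",
                "kind": "Deployment",
                "name": deployment,
            },
            # "Off" means: compute recommendations, do not apply them.
            "updatePolicy": {"updateMode": "Off"},
        },
    }


if __name__ == "__main__":
    print(json.dumps(namespace_label_patch()))
    print(json.dumps(build_vpa("api", "payments"), indent=2))
```

Keeping the VPA in recommendation-only mode leaves the decision to change requests with the application teams, which fits the dashboard and email workflow described above.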

Phase Four: Dynamic Scaling with KEDA

  • KEDA Deployment: Deployed KEDA to manage dynamic scaling based on custom metrics, enhancing our ability to handle event-driven workloads efficiently. The cron scheduler was particularly effective for reducing costs in non-production environments.
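
To illustrate scaling on a custom metric, here is a minimal sketch of a ScaledObject driven by queue length. I am assuming an Amazon SQS queue and KEDA's `aws-sqs-queue` scaler purely as an example; the queue URL, region, thresholds, and the function name `build_queue_scaler` are placeholders, and the authentication configuration the scaler needs is left out.

```python
import json


def build_queue_scaler(
    deployment: str,
    namespace: str,
    queue_url: str,
    region: str,
    messages_per_replica: int = 5,
    max_replicas: int = 20,
) -> dict:
    """Build a KEDA ScaledObject that scales a deployment on SQS queue length."""
    return {
        "apiVersion": "keda.sh/v1alpha1",
        "kind": "ScaledObject",
        "metadata": {"name": f"{deployment}-queue", "namespace": namespace},
        "spec": {
            "scaleTargetRef": {"name": deployment},
            "minReplicaCount": 0,  # no messages, no pods
            "maxReplicaCount": max_replicas,
            "triggers": [
                {
                    "type": "aws-sqs-queue",
                    "metadata": {
                        "queueURL": queue_url,
                        # Target number of messages each replica should handle.
                        "queueLength": str(messages_per_replica),
                        "awsRegion": region,
                    },
                    # Authentication (for example a TriggerAuthentication
                    # reference) is intentionally omitted from this sketch.
                }
            ],
        },
    }


if __name__ == "__main__":
    # Placeholder queue URL and region.
    url = "https://sqs.us-east-1.amazonaws.com/123456789012/orders"
    print(json.dumps(build_queue_scaler("worker", "orders", url, "us-east-1"), indent=2))
```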

Lessons Learned and Challenges

  • Effective Communication: Keeping application teams informed and engaged was crucial for successful optimization. Regular updates, clear documentation, and training sessions drove adoption and alignment with new strategies.

  • Spot Instance Management: Spot instances provided substantial cost savings but posed risks of interruptions. We carefully selected workloads for spot instances and implemented fallback strategies for critical applications to mitigate potential disruptions.

  • Tooling Limitations: While Karpenter, KEDA, and Goldilocks offered powerful capabilities, we faced limitations that required custom solutions. For instance, integrating Karpenter with VPC CNI configurations and enhancing Goldilocks with secure visualization were necessary for realizing the full potential of these tools.

  • Continuous Monitoring: Optimization is an ongoing process. We set up continuous monitoring to track resource utilization and adjust configurations as needed to maintain efficiency and cost-effectiveness.

Key Takeaways

  • Holistic Approach: Combining Karpenter, KEDA, and Goldilocks provided a comprehensive solution for cost optimization, addressing node provisioning, scaling, and resource allocation.

  • Efficiency Through Automation: Automating node provisioning and scaling with Karpenter and KEDA led to substantial cost reductions and improved resource management.

  • Right-Sizing for Savings: Goldilocks’ insights into resource utilization helped us eliminate over-provisioning and achieve significant cost savings.

Practical Tips

  • Implement Resource Quotas: Set quotas for each namespace to manage resource allocation and prevent over-provisioning.

  • Use Annotations & Labels: Add metadata to applications for better tracking and communication.

  • Configure Karpenter Properly: Ensure accurate node provisioning by deploying Karpenter on separate infrastructure and using max_pods calculation scripts.

  • Leverage Goldilocks and VPA: Utilize these tools for precise right-sizing and visualization of resource usage.

  • Utilize KEDA Effectively: Implement dynamic scaling based on custom metrics and use cron scheduling for non-production environments.

Call to Action

Have you faced similar challenges with EKS cost management? Share your experiences in the comments below or reach out for more insights and tips. If you found this blog helpful, explore these tools further to optimize your own EKS deployments. Let’s drive innovation in cloud cost optimization together!