Optimizing Our ML Feature Store: Cutting Compute Costs by 55%

An inside look at how ODAIA's ML platform team replaced Fargate with Karpenter and swapped fixed memory limits for predictive models — without changing a line of application code.

Kayhan Babaee

Hetansh Mehta

Thai Chau Truong

Published on

May 11, 2026

TL;DR

In a previous post, we optimized our ML feature store's Parquet layer to cut processing time by ~90%. That targeted I/O. This post tackles the next bottleneck: the compute layer that runs feature generation jobs on those Parquet files.

Karpenter replaced Fargate for batch compute. Pod startup dropped from minutes to seconds. Multi-arch images (ARM64 + x86) doubled our spot pool. A 60/40 spot/on-demand split cut compute costs by ~55-60%.

Quantile regression models replaced fixed memory allocations. Because the feature store registers each feature with defined input schemas, job behavior stays consistent across runs, making memory profiles predictable. Per-job-type models predict 90th-percentile peak memory, cutting resource waste by ~40%.

The Problem

Our feature store's compute layer runs thousands of feature engineering jobs per cycle. Parallel pods process entity data, run ML inference, and write outputs to S3. We had two expensive problems.

Every pod paid a startup tax: All compute ran on AWS Fargate, where each pod took 1-3 minutes to provision. Across thousands of concurrent pods, that overhead added up to hours of aggregate billed compute time per run.
Half our allocated memory sat idle: Conservative provisioning kept us safe from OOM kills, but memory utilization hovered around 50% across most job types.

Track 1: Karpenter, From Fargate to Right-Priced EC2

Why EC2 Changes the Economics

Karpenter provisions EC2 instances for pending pods in 30-60 seconds. Once a node is warm, subsequent pods start in roughly 10-30 seconds. EC2 on-demand runs 20-30% cheaper than Fargate, and EC2 spot goes further: AWS advertises discounts of up to 90% off on-demand pricing; in our workload mix we see 60-80%.

Why Taints, Not Node Affinity

We needed feature compute nodes to be dedicated exclusively to feature compute pods. Node affinity pulls pods toward labeled nodes but cannot prevent other pods from landing there. Taints with NoExecute reject any pod lacking the toleration and evict already-running pods that don't belong. Karpenter reinforces this: it uses pod tolerations to match pending pods to the correct NodePool.

# Tolerations on compute pods matching Karpenter EC2 node taints
tolerations:
  - key: workload-type
    operator: Equal
    value: feature-compute
    effect: NoExecute
  - key: environment
    operator: Equal
    value: production
    effect: NoExecute
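
To make the "affinity attracts, taints enforce" distinction concrete, here is a minimal Python sketch of how toleration matching decides whether a pod may stay on a tainted node. It is a simplified model of the Kubernetes rule, not the scheduler's actual code: it ignores details such as empty keys and tolerationSeconds, and the class and function names are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Taint:
    key: str
    value: str
    effect: str  # e.g. "NoSchedule" or "NoExecute"


@dataclass(frozen=True)
class Toleration:
    key: str
    operator: str = "Equal"       # "Equal" or "Exists"
    value: Optional[str] = None
    effect: Optional[str] = None  # None means "tolerate any effect"


def tolerates(tol: Toleration, taint: Taint) -> bool:
    """Simplified toleration match: effect, key, then operator/value."""
    if tol.effect is not None and tol.effect != taint.effect:
        return False
    if tol.key != taint.key:
        return False
    if tol.operator == "Exists":
        return True
    return tol.value == taint.value


def can_run_on(tolerations: list, taints: list) -> bool:
    """A pod may stay on a node only if every taint is tolerated."""
    return all(any(tolerates(t, taint) for t in tolerations) for taint in taints)


# Taints mirroring the tolerations shown above.
NODE_TAINTS = [
    Taint("workload-type", "feature-compute", "NoExecute"),
    Taint("environment", "production", "NoExecute"),
]
COMPUTE_POD = [
    Toleration("workload-type", "Equal", "feature-compute", "NoExecute"),
    Toleration("environment", "Equal", "production", "NoExecute"),
]
UNRELATED_POD = []  # no tolerations at all

print(can_run_on(COMPUTE_POD, NODE_TAINTS))    # True
print(can_run_on(UNRELATED_POD, NODE_TAINTS))  # False
```

A pod that tolerates only one of the two taints is rejected as well, which is why both tolerations appear on the compute pods.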

One exception: our job scheduling daemon must run at all times. Karpenter's consolidation would evict it once the compute pods sharing its node finished and the node became a candidate for removal. We kept that pod on Fargate.

Multi-Arch Images to Expand the Spot Pool

With x86 only, spot terminations interrupted compute jobs. AWS maintains separate capacity pools for x86 and Graviton families. Supporting only amd64 bet all our spot capacity on one pool.

The fix: build multi-arch Docker images (docker buildx build --platform linux/amd64,linux/arm64 --push) and open the NodePool to both architectures. The --push flag publishes a multi-arch manifest; a subsequent single-arch push would overwrite it, so detect this in CI and skip the redundant push. Graviton instances also price ~20% below equivalent x86, so Karpenter selecting Graviton saves money with no extra configuration.
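
One way the CI check could work is to inspect the published image index and confirm that both platforms are present before deciding whether another push is needed. The sketch below assumes the index JSON has already been fetched (for example with docker buildx imagetools inspect --raw) and parsed into a dict; the function names and placeholder digests are hypothetical.

```python
REQUIRED_PLATFORMS = {("linux", "amd64"), ("linux", "arm64")}

# Shape of an OCI image index as published by a multi-arch build.
# Digests are placeholders, not real values.
SAMPLE_INDEX = {
    "schemaVersion": 2,
    "mediaType": "application/vnd.oci.image.index.v1+json",
    "manifests": [
        {"digest": "sha256:aaa", "platform": {"os": "linux", "architecture": "amd64"}},
        {"digest": "sha256:bbb", "platform": {"os": "linux", "architecture": "arm64"}},
        # buildx may add attestation entries with an "unknown" platform.
        {"digest": "sha256:ccc", "platform": {"os": "unknown", "architecture": "unknown"}},
    ],
}


def manifest_platforms(index: dict) -> set:
    """Collect (os, architecture) pairs, skipping unknown-platform entries."""
    platforms = set()
    for entry in index.get("manifests", []):
        platform = entry.get("platform", {})
        os_name = platform.get("os", "unknown")
        arch = platform.get("architecture", "unknown")
        if "unknown" in (os_name, arch):
            continue
        platforms.add((os_name, arch))
    return platforms


def is_multi_arch(index: dict, required=REQUIRED_PLATFORMS) -> bool:
    """True when every required platform is present in the index."""
    return set(required) <= manifest_platforms(index)


print(is_multi_arch(SAMPLE_INDEX))          # True
print(is_multi_arch({"schemaVersion": 2}))  # False: single-arch manifest, no index
```

In a pipeline, a true result would mean the tag already points at a multi-arch index, so the later single-arch push can be skipped.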

The 60/40 Spot/On-Demand Split

Multi-arch reduced interruptions, but AWS can still reclaim spot capacity with 2 minutes' notice. Running 100% spot risks cascading failure. We settled on 60% spot, 40% on-demand using Karpenter's capacity-spread topology: two NodePools with disjoint virtual topology domains.

# NodePool capacity-spread domains (simplified)
# Spot NodePool:      domains ["1","2","3"]  -> 60% of pods
# On-Demand NodePool: domains ["4","5"]      -> 40% of pods
spec:
  template:
    spec:
      requirements:
        - key: "karpenter.sh/capacity-type"
          operator: In
          values: [ "spot" ]
        - key: "capacity-spread"
          operator: In
          values: [ "1", "2", "3" ]
        - key: "kubernetes.io/arch"
          operator: In
          values: [ "arm64", "amd64" ]
        - key: "karpenter.k8s.aws/instance-category"
          operator: In
          values: [ "c", "m" ]

Compute pods carry topologySpreadConstraints targeting this key with maxSkew: 1. The Kubernetes scheduler distributes pods evenly across 5 virtual zones: 3 on spot, 2 on on-demand. Changing the ratio requires no application code. Adjust domain counts in Terraform: 4/1 for 80/20, 2/2 for 50/50.
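
The ratio arithmetic can be sketched in a few lines of Python. The pod-spec fragment is written as a dict of the kind a job launcher might build; the whenUnsatisfiable value and the label selector are assumptions, not our production settings, and round-robin placement is only an approximation of what the scheduler does under maxSkew: 1.

```python
from collections import Counter
from itertools import cycle

SPOT_DOMAINS = ["1", "2", "3"]
ON_DEMAND_DOMAINS = ["4", "5"]

# Hypothetical pod-spec fragment; selector labels are illustrative.
TOPOLOGY_SPREAD = [{
    "maxSkew": 1,
    "topologyKey": "capacity-spread",
    "whenUnsatisfiable": "DoNotSchedule",
    "labelSelector": {"matchLabels": {"app": "feature-compute"}},
}]


def expected_split(spot_domains, on_demand_domains):
    """Fraction of pods expected on spot vs. on-demand for given domain lists."""
    total = len(spot_domains) + len(on_demand_domains)
    return len(spot_domains) / total, len(on_demand_domains) / total


def simulate(pod_count, spot_domains, on_demand_domains):
    """Place pods round-robin across all domains and count each capacity type."""
    counts = Counter()
    domains = cycle(list(spot_domains) + list(on_demand_domains))
    for _ in range(pod_count):
        counts[next(domains)] += 1
    spot_pods = sum(counts[d] for d in spot_domains)
    return spot_pods, pod_count - spot_pods


print(expected_split(SPOT_DOMAINS, ON_DEMAND_DOMAINS))  # (0.6, 0.4)
print(simulate(1000, SPOT_DOMAINS, ON_DEMAND_DOMAINS))  # (600, 400)
```

Passing four spot domains and one on-demand domain to the same functions gives the 80/20 split mentioned above.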

Track 2: Memory Prediction, From Guesswork to Models

Memory prediction works for us because the feature store makes job behavior repeatable. Each feature is registered with a defined computation graph and known input schemas. Given the same feature definition and similar input data, a job's memory profile follows the same curve. That consistency is what makes modeling feasible.

The Old Approach

Our original allocation was a fixed formula: base_overhead + coefficient × row_count. It broke as job complexity grew. Hierarchical aggregation jobs in our feature store scale closer to quadratically with entity relationship density. Simpler jobs were over-provisioned by 40-50% because the conservative coefficient protecting complex jobs applied uniformly.

Row count alone does not explain memory variance. A downstream job joining multiple upstream feature partitions might match an export's row count but consume four times the memory, holding multiple datasets in-memory at once.

Better Training Data

We extended job tagging to capture richer metadata: entity counts, temporal dimensions, upstream partition sizes. Tags link to run IDs, which become join keys for correlating inputs with peak memory from our monitoring stack. We query peak container_memory_working_set_bytes over the full execution window. Peak matters because ML jobs spike during joins and inference.
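
A peak-memory lookup along these lines might look like the sketch below. It only builds the PromQL string and parses a response shaped like the Prometheus instant-query API; the pod label, the container!="" filter, the pod name, and the helper names are assumptions about a typical cAdvisor setup rather than a description of our monitoring stack.

```python
from typing import Optional


def peak_memory_query(pod_name: str, window: str) -> str:
    """PromQL for the highest working-set sample of a pod over `window`."""
    selector = (
        f'container_memory_working_set_bytes{{pod="{pod_name}", container!=""}}'
    )
    return f"max(max_over_time({selector}[{window}]))"


def parse_peak_bytes(response: dict) -> Optional[float]:
    """Extract the scalar from an instant-query response, or None if absent."""
    if response.get("status") != "success":
        return None
    results = response.get("data", {}).get("result", [])
    if not results:
        return None
    _timestamp, value = results[0]["value"]  # value arrives as a string
    return float(value)


# Hypothetical pod name and a hand-written response for illustration.
query = peak_memory_query("feature-job-abc123", "2h")
sample_response = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [{"metric": {}, "value": [1700000000.0, "3221225472"]}],
    },
}
peak_gib = parse_peak_bytes(sample_response) / 2**30
print(query)
print(peak_gib)  # 3.0
```

The window should cover the full execution of the run so that short spikes during joins and inference are not missed.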

Why Quantile Regression, Not MSE

Memory allocation is a worst-case avoidance problem. Traditional regression predicts the mean. We need the allocation that covers a job 90% of the time.

We use pinball loss at alpha = 0.9. Under-prediction gets penalized 9x more than over-prediction, forcing the model toward the upper bound.

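import numpy as np

def pinball_loss(y_true, y_pred, alpha=0.9):
    """Asymmetric loss for quantile regression.
    alpha=0.9: targets the 90th percentile of memory usage."""
    # Positive error means the job used more memory than predicted.
    error = np.asarray(y_true) - np.asarray(y_pred)
    # Under-prediction costs alpha per unit; over-prediction costs 1 - alpha.
    return np.mean(np.maximum(alpha * error, (alpha - 1) * error))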

A 90th-percentile prediction still under-allocates roughly 10% of runs. A 20% safety buffer added on top absorbs most of those cases.
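
The asymmetry is easy to verify by hand. The following pure-Python sketch (no NumPy, toy numbers chosen only for illustration) shows the 9x penalty ratio and that the constant prediction minimizing pinball loss sits near the top of the sample instead of at the mean.

```python
def pinball(y_true: float, y_pred: float, alpha: float = 0.9) -> float:
    """Pinball loss for a single observation."""
    error = y_true - y_pred
    return max(alpha * error, (alpha - 1) * error)


# Missing by 1 GB in each direction.
under = pinball(10.0, 9.0)   # predicted too little: about 0.9
over = pinball(10.0, 11.0)   # predicted too much: about 0.1
ratio = under / over         # about 9

# Toy peak-memory sample in GB: 1, 2, ..., 11.
samples = [float(v) for v in range(1, 12)]


def mean_loss(candidate: float) -> float:
    return sum(pinball(y, candidate) for y in samples) / len(samples)


best = min(samples, key=mean_loss)
mean = sum(samples) / len(samples)
print(round(ratio, 6))  # 9.0
print(best, mean)       # 10.0 6.0
```

A squared-error model would settle near the mean and leave roughly half the runs under-allocated, which is the behavior the pinball loss is chosen to avoid.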

Per-Job-Type Models and Weekly Retraining

We train a separate xgboost.XGBRegressor (with objective='reg:quantileerror' and quantile_alpha=0.9) per job category. A single model across all types would average across different scaling profiles and miss all of them.

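Each model trains on 80% of its runs and validates on the held-out 20%. Quantile coverage is the share of held-out runs whose actual peak memory stays at or below the prediction. If coverage falls below 90%, the model is under-predicting and needs more data or better features.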
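
As a minimal sketch of the coverage check, assuming actual peaks and predictions are available as parallel lists in the same unit (the function name and the numbers are illustrative, not production values):

```python
def quantile_coverage(actual_peaks, predicted):
    """Share of runs whose actual peak memory is at or below the prediction."""
    if len(actual_peaks) != len(predicted):
        raise ValueError("actual_peaks and predicted must have the same length")
    if not actual_peaks:
        raise ValueError("need at least one run to compute coverage")
    covered = sum(1 for a, p in zip(actual_peaks, predicted) if a <= p)
    return covered / len(actual_peaks)


# Illustrative held-out runs (GB): one of five exceeds its prediction.
actual = [4.0, 6.5, 8.0, 3.2, 9.1]
predicted = [5.0, 7.0, 7.5, 4.0, 10.0]

coverage = quantile_coverage(actual, predicted)
needs_attention = coverage < 0.9
print(coverage, needs_attention)  # 0.8 True
```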

A Dagster asset retrains all models weekly over a 30-day rolling window. At submission time, the job launcher predicts memory from the latest model, multiplies by 1.2 as a safety buffer, and falls back to a conservative default for unknown job types.
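
The submission-time logic can be sketched as follows. This is an illustration under stated assumptions, not our launcher: models are represented as plain callables keyed by job type, and the default value, feature names, and function name are hypothetical.

```python
import math
from typing import Callable, Mapping

SAFETY_BUFFER = 1.2          # 20% headroom on top of the prediction
DEFAULT_MEMORY_MIB = 16384   # hypothetical conservative fallback


def memory_request_mib(
    job_type: str,
    features: Mapping[str, float],
    models: Mapping[str, Callable[[Mapping[str, float]], float]],
    default_mib: int = DEFAULT_MEMORY_MIB,
    buffer: float = SAFETY_BUFFER,
) -> int:
    """Predicted peak memory plus buffer, or the default for unknown job types."""
    model = models.get(job_type)
    if model is None:
        return default_mib
    predicted_mib = model(features)
    return math.ceil(predicted_mib * buffer)


# Stand-in for a trained per-job-type model: 2 MiB per entity.
models = {"export": lambda f: 2.0 * f["entity_count"]}

known = memory_request_mib("export", {"entity_count": 1000}, models)
unknown = memory_request_mib("brand-new-job", {"entity_count": 1000}, models)
print(known, unknown)
```

Here the known job type receives the buffered prediction (about 2400 MiB for a 2000 MiB prediction), while the unseen job type falls back to the default.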

Pod Failure Sensor

Two transient failures hit our pipeline: OOM kills when the model under-allocates, and pod terminations from spot reclamation or node timeouts. Both produce a failed job and a paged engineer. We automated recovery with a Dagster @run_failure_sensor that:

Classifies the failure via the Prometheus API (OOM vs. infrastructure termination)
Respects a retry budget. Repeated failures on the same run signal a real bug
Cancels in-flight child runs to avoid retry storms
Bumps memory by 50% for OOM cases before retrying
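
The decision logic behind those four steps could look roughly like the sketch below. It deliberately leaves out the Dagster sensor wiring and the Prometheus lookup; the enum, the dataclass, the retry budget of 2, and the function name are assumptions made for illustration.

```python
import math
from dataclasses import dataclass
from enum import Enum


class FailureKind(Enum):
    OOM = "oom"
    INFRASTRUCTURE = "infrastructure"  # spot reclamation, node timeout
    UNKNOWN = "unknown"                # anything else, likely a real bug


@dataclass(frozen=True)
class RetryPlan:
    retry: bool
    memory_mib: int
    cancel_children: bool
    reason: str


MAX_RETRIES = 2        # hypothetical retry budget
OOM_MEMORY_BUMP = 1.5  # 50% more memory after an OOM kill


def plan_retry(kind: FailureKind, attempts_so_far: int, memory_mib: int,
               max_retries: int = MAX_RETRIES) -> RetryPlan:
    """Decide whether and how to retry a failed run."""
    if kind is FailureKind.UNKNOWN:
        return RetryPlan(False, memory_mib, True, "not a transient failure")
    if attempts_so_far >= max_retries:
        return RetryPlan(False, memory_mib, True, "retry budget exhausted")
    if kind is FailureKind.OOM:
        bumped = math.ceil(memory_mib * OOM_MEMORY_BUMP)
        return RetryPlan(True, bumped, True, "OOM: retry with 50% more memory")
    return RetryPlan(True, memory_mib, True, "infrastructure: retry unchanged")


print(plan_retry(FailureKind.OOM, 0, 4096))
print(plan_retry(FailureKind.INFRASTRUCTURE, 2, 4096))
```

Child runs are cancelled in every branch so that a retry never competes with leftovers from the failed attempt.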

After deployment, fewer than 1% of the feature engineering and aggregation jobs in our pipeline hit OOM. The sensor handles those without human intervention.

Results

Startup Time

Compute Backend	Pod Startup	vs. Fargate Baseline
Fargate (before)	1–3 minutes	—
EC2 via Karpenter, warm node	~10–30 seconds	~5–10x faster
EC2 via Karpenter, cold node provision	~30–60 seconds	~2–3x faster

Across thousands of concurrent pods, the aggregate billed compute time saved per run measures in hours, even though wall-clock improvement is closer to the 1-3 minutes each pod no longer waits.
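
A quick back-of-the-envelope calculation shows why per-pod minutes turn into aggregate hours. The pod count of 2,000 is a hypothetical stand-in for "thousands"; the per-pod wait times come from the table above.

```python
def aggregate_wait_hours(pod_count: int, wait_seconds_per_pod: float) -> float:
    """Total startup wait summed across all pods, in hours."""
    return pod_count * wait_seconds_per_pod / 3600


POD_COUNT = 2000  # hypothetical run size

fargate_low = aggregate_wait_hours(POD_COUNT, 60)    # 1 minute per pod
fargate_high = aggregate_wait_hours(POD_COUNT, 180)  # 3 minutes per pod
warm_node_high = aggregate_wait_hours(POD_COUNT, 30) # 30 seconds per pod

print(round(fargate_low, 1), round(fargate_high, 1), round(warm_node_high, 1))
# 33.3 100.0 16.7
```

Under these assumptions the aggregate startup wait falls from roughly 33-100 pod-hours to under 17, while any single pod only gains a minute or two of wall-clock time.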

Compute Cost

Cost Lever	Savings vs. Fargate On-Demand
EC2 on-demand vs. Fargate	~20–30% cheaper
EC2 spot vs. on-demand	60–80% cheaper in practice
Graviton vs. x86	~20% additional savings
60/40 blended	~55–60% total compute cost reduction
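
The blended figure can be sanity-checked from the individual levers. The sketch below is a simplification: it assumes the 60/40 pod split translates directly into a 60/40 split of billed compute, and it ignores the Graviton discount and any memory right-sizing. The function name is illustrative.

```python
def blended_cost_fraction(od_discount_vs_fargate: float,
                          spot_discount_vs_od: float,
                          spot_share: float = 0.6) -> float:
    """Blended cost as a fraction of the Fargate baseline."""
    on_demand = 1.0 - od_discount_vs_fargate
    spot = on_demand * (1.0 - spot_discount_vs_od)
    return spot_share * spot + (1.0 - spot_share) * on_demand


# Low end, midpoint, and high end of the ranges in the table.
pessimistic = 1.0 - blended_cost_fraction(0.20, 0.60)
midpoint = 1.0 - blended_cost_fraction(0.25, 0.70)
optimistic = 1.0 - blended_cost_fraction(0.30, 0.80)

print(round(pessimistic, 3), round(midpoint, 3), round(optimistic, 3))
# 0.488 0.565 0.636
```

The midpoint of the ranges lands at about 56% savings, consistent with the ~55-60% observed.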

Memory Prediction

Static and graph-complexity feature jobs saw the largest per-job reductions (70-80% in some categories). Data export jobs saw smaller per-job gains (~25%) but the largest aggregate impact due to volume.

Production Observations

Spot termination frequency dropped after adding ARM64. Since the 60/40 split, no pipeline has failed from a spot event.
Both ARM64 and x86 instances schedule and complete feature compute jobs.
Long-lived service pods remain stable on Fargate, unaffected by Karpenter consolidation.
Nodes drain and terminate within 5 minutes of emptying.

Implementation Checklist

Full Karpenter and EC2NodeClass docs cover the Terraform details.

Karpenter: (1) Tag private subnets for discovery. (2) Configure an EC2NodeClass with AMI family, IAM role, and subnet/security-group selectors. (3) Create two NodePools with disjoint capacity-spread domains (spot + on-demand). (4) Add NoExecute tolerations and topologySpreadConstraints to compute pods. (5) Build multi-arch images via docker buildx. (6) Keep long-lived services on Fargate.

Memory prediction: (1) Tag jobs at execution time with row counts, entity counts, upstream partition sizes, linked to run IDs. (2) Query peak container_memory_working_set_bytes per run. (3) Engineer features, including per-entity-type breakdowns for graph-complexity jobs. (4) Train per-job-type quantile models with pinball loss at alpha = 0.9; validate ~90% coverage on held-out data. (5) Apply predictions with a 20% buffer in the job launcher.

Limitations and Future Work

Karpenter: We are evaluating a node termination handler to drain pods before spot reclamation, reducing blast radius further. Performance profiling of compute-intensive ML workloads across ARM64 and x86 is ongoing.

Memory prediction: Model accuracy improves as the training set grows, especially for newer feature definitions with limited run history. Weekly retraining handles gradual drift; we are evaluating live coverage metrics as an early-warning signal for sudden shifts. Graph-complexity jobs have the most variable profiles and need more data for acceptable coverage.

Key Takeaways

Reducing compute unit cost (Karpenter) and reducing allocation waste (memory prediction) are complementary. Neither captures the full opportunity alone.
Fargate cold starts become a pipeline tax at scale. EC2 with Karpenter cuts pod startup from minutes to seconds on warm nodes.
Use NoExecute taints over node affinity for workload isolation. Affinity attracts; taints enforce.
Spot pool size matters more than spot price. Multi-arch support doubles your effective pool.
Karpenter's capacity-spread gives you ratio control with no application code changes.
Predict the 90th percentile, not the mean. Pinball loss targets OOM avoidance while minimizing waste.
Richer input features beat a more complex model. Multi-dimensional metadata improved accuracy far more than switching model families.