TL;DR
In a previous post, we optimized our ML feature store's Parquet layer to cut processing time by ~90%. That targeted I/O. This post tackles the next bottleneck: the compute layer that runs feature generation jobs on those Parquet files.
Karpenter replaced Fargate for batch compute. Pod startup dropped from minutes to seconds. Multi-arch images (ARM64 + x86) doubled our spot pool. A 60/40 spot/on-demand split cut compute costs by ~55-60%.
Quantile regression models replaced fixed memory allocations. Because the feature store registers each feature with defined input schemas, job behavior stays consistent across runs, making memory profiles predictable. Per-job-type models predict 90th-percentile peak memory, cutting resource waste by ~40%.
The Problem
Our feature store's compute layer runs thousands of feature engineering jobs per cycle. Parallel pods process entity data, run ML inference, and write outputs to S3. We had two expensive problems.
- Every pod paid a startup tax: All compute ran on AWS Fargate, where each pod took 1-3 minutes to provision. Across thousands of concurrent pods, that overhead added up to hours of aggregate billed compute time per run.
- Half our allocated memory sat idle: Conservative provisioning kept us safe from OOM kills, but memory utilization hovered around 50% across most job types.
Track 1: Karpenter, From Fargate to Right-Priced EC2
Why EC2 Changes the Economics
Karpenter provisions EC2 instances for pending pods in 30-60 seconds. Once a node is warm, subsequent pods start in under a second. EC2 on-demand runs 20-30% cheaper than Fargate. EC2 spot goes further: AWS advertises discounts of up to 90% off on-demand prices; in our workload mix we see 60-80%.
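As a back-of-the-envelope check, the figures above can be combined into a blended cost relative to Fargate. This is a rough sketch, not the billing model: it assumes the spot discount is measured against EC2 on-demand, uses the 60/40 spot/on-demand split from the TL;DR, and ignores Graviton pricing and node utilization. The function name and parameters are illustrative.

```python
def blended_relative_cost(spot_share, on_demand_vs_fargate, spot_discount):
    """Cost per compute-hour relative to Fargate (Fargate = 1.0).

    spot_share: fraction of capacity running on spot (0..1).
    on_demand_vs_fargate: fraction by which EC2 on-demand undercuts Fargate.
    spot_discount: fraction by which spot undercuts EC2 on-demand (assumption).
    """
    on_demand = 1.0 - on_demand_vs_fargate
    spot = on_demand * (1.0 - spot_discount)
    return spot_share * spot + (1.0 - spot_share) * on_demand


# Most and least favorable ends of the quoted ranges, at a 60% spot share.
best = blended_relative_cost(0.6, on_demand_vs_fargate=0.30, spot_discount=0.80)
worst = blended_relative_cost(0.6, on_demand_vs_fargate=0.20, spot_discount=0.60)
print(f"estimated savings vs Fargate: {1 - worst:.0%} to {1 - best:.0%}")
```

Under these assumptions the estimate brackets the ~55-60% reduction quoted in the TL;DR, which suggests the quoted ranges are at least mutually consistent.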
Why Taints, Not Node Affinity
We needed feature compute nodes to be dedicated exclusively to feature compute pods. Node affinity pulls pods toward labeled nodes but cannot prevent other pods from landing there. Taints with NoExecute reject any pod lacking the toleration and evict already-running pods that don't belong. Karpenter reinforces this: it uses pod tolerations to match pending pods to the correct NodePool.
```yaml
# Tolerations on compute pods matching Karpenter EC2 node taints
tolerations:
  - key: workload-type
    operator: Equal
    value: feature-compute
    effect: NoExecute
  - key: environment
    operator: Equal
    value: production
    effect: NoExecute
```

One exception: our job scheduling daemon must run at all times. Karpenter's consolidation would evict it when its node empties. We kept that pod on Fargate.
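The matching rule behind taints and tolerations can be sketched in a few lines. This is a simplified model of the Kubernetes semantics (it ignores tolerationSeconds), not the scheduler's implementation; the class and function names are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Taint:
    key: str
    value: str
    effect: str


@dataclass(frozen=True)
class Toleration:
    key: str
    operator: str      # "Equal" or "Exists"
    value: str = ""
    effect: str = ""   # an empty effect matches any taint effect


def tolerates(toleration, taint):
    """True if this toleration matches this taint."""
    if toleration.effect and toleration.effect != taint.effect:
        return False
    # An empty key with operator Exists matches every taint key.
    key_matches = toleration.key == taint.key or (
        toleration.key == "" and toleration.operator == "Exists"
    )
    if not key_matches:
        return False
    if toleration.operator == "Exists":
        return True
    return toleration.operator == "Equal" and toleration.value == taint.value


def pod_admitted(tolerations, taints):
    """A pod may run on a node only if every taint is tolerated."""
    return all(any(tolerates(tol, t) for tol in tolerations) for t in taints)


node_taints = [
    Taint("workload-type", "feature-compute", "NoExecute"),
    Taint("environment", "production", "NoExecute"),
]
compute_pod = [
    Toleration("workload-type", "Equal", "feature-compute", "NoExecute"),
    Toleration("environment", "Equal", "production", "NoExecute"),
]
print(pod_admitted(compute_pod, node_taints))  # compute pod carries both tolerations
print(pod_admitted([], node_taints))           # a pod with no tolerations
```

The key property is the `all(...)`: a pod missing even one of the two tolerations is rejected, which is what makes the nodes exclusive rather than merely preferred.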
Multi-Arch Images to Expand the Spot Pool
With x86 only, spot terminations interrupted compute jobs. AWS maintains separate capacity pools for x86 and Graviton families. Supporting only amd64 bet all our spot capacity on one pool.
The fix: build multi-arch Docker images (docker buildx build --platform linux/amd64,linux/arm64 --push) and open the NodePool to both architectures. The --push flag publishes a multi-arch manifest; a subsequent single-arch push to the same tag would overwrite it, so detect this in CI and skip the redundant push. Graviton instances are also priced ~20% below equivalent x86, so Karpenter selecting Graviton saves money with no extra configuration.
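One way to detect an existing multi-arch tag in CI is to inspect the published image index before pushing. The sketch below assumes the raw index JSON has already been fetched (for example with docker buildx imagetools inspect --raw) and only parses it; the function names and the sample document are illustrative, not taken from our pipeline.

```python
import json

REQUIRED_ARCHES = {"amd64", "arm64"}


def index_architectures(raw_index):
    """Return the Linux CPU architectures listed in an image index / manifest list."""
    index = json.loads(raw_index)
    return {
        entry["platform"]["architecture"]
        for entry in index.get("manifests", [])
        # Attestation entries report os "unknown"; keep only real Linux images.
        if entry.get("platform", {}).get("os") == "linux"
    }


def tag_is_multi_arch(raw_index):
    """True if the tag already covers every required architecture."""
    return REQUIRED_ARCHES.issubset(index_architectures(raw_index))


# Hypothetical index as published by a multi-platform buildx push.
sample = json.dumps({
    "schemaVersion": 2,
    "manifests": [
        {"digest": "sha256:aaa", "platform": {"architecture": "amd64", "os": "linux"}},
        {"digest": "sha256:bbb", "platform": {"architecture": "arm64", "os": "linux"}},
        {"digest": "sha256:ccc", "platform": {"architecture": "unknown", "os": "unknown"}},
    ],
})
skip_single_arch_push = tag_is_multi_arch(sample)
```

A single-arch image has no "manifests" list, so the function returns False for it and the push proceeds as usual.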
The 60/40 Spot/On-Demand Split
Multi-arch reduced interruptions, but AWS can still reclaim spot capacity with 2 minutes' notice. Running 100% spot risks cascading failure. We settled on 60% spot, 40% on-demand using Karpenter's capacity-spread topology: two NodePools with disjoint virtual topology domains.
```yaml
# NodePool capacity-spread domains (simplified)
# Spot NodePool: domains ["1","2","3"] -> 60% of pods
# On-Demand NodePool: domains ["4","5"] -> 40% of pods
spec:
  template:
    spec:
      requirements:
        - key: "karpenter.sh/capacity-type"
          operator: In
          values: [ "spot" ]
        - key: "capacity-spread"
          operator: In
          values: [ "1", "2", "3" ]
        - key: "kubernetes.io/arch"
          operator: In
          values: [ "arm64", "amd64" ]
        - key: "karpenter.k8s.aws/instance-category"
          operator: In
          values: [ "c", "m" ]
```

Compute pods carry topologySpreadConstraints targeting the capacity-spread key with maxSkew: 1. The Kubernetes scheduler distributes pods evenly across 5 virtual zones: 3 on spot, 2 on on-demand. Changing the ratio requires no application code. Adjust domain counts in Terraform: 4/1 for 80/20, 2/2 for 50/50.
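The ratio follows directly from the domain counts. The sketch below models an even spread as a round-robin assignment, which is one allocation that satisfies maxSkew: 1; the real scheduler's placement order differs, and the function name is illustrative.

```python
from collections import Counter
from itertools import cycle


def capacity_split(num_pods, spot_domains, on_demand_domains):
    """Count pods per capacity type when pods spread evenly over all domains."""
    domains = [(d, "spot") for d in spot_domains]
    domains += [(d, "on-demand") for d in on_demand_domains]
    counts = Counter()
    # Round-robin keeps every domain within one pod of every other domain.
    for _, (_domain, capacity_type) in zip(range(num_pods), cycle(domains)):
        counts[capacity_type] += 1
    return counts


split_60_40 = capacity_split(1000, ["1", "2", "3"], ["4", "5"])
split_80_20 = capacity_split(1000, ["1", "2", "3", "4"], ["5"])
print(dict(split_60_40), dict(split_80_20))
```

Because each domain receives an equal share, the spot fraction is simply the number of spot domains divided by the total number of domains.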
Track 2: Memory Prediction, From Guesswork to Models
Memory prediction works for us because the feature store makes job behavior repeatable. Each feature is registered with a defined computation graph and known input schemas. Given the same feature definition and similar input data, a job's memory profile follows the same curve. That consistency is what makes modeling feasible.
The Old Approach
Our original allocation was a fixed formula: base_overhead + coefficient x row_count. It broke as job complexity grew. Hierarchical aggregation jobs in our feature store scale closer to quadratically with entity relationship density. Simpler jobs were over-provisioned by 40-50% because the conservative coefficient protecting complex jobs applied uniformly.
Row count alone does not explain memory variance. A downstream job joining multiple upstream feature partitions might match an export's row count but consume four times the memory, holding multiple datasets in-memory at once.
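A toy comparison shows why one linear formula fails in both directions. Every number below is hypothetical and chosen only to illustrate the shape of the problem: a coefficient sized to protect a quadratically scaling job wastes memory on a linear job, yet still under-allocates once the quadratic job grows.

```python
def fixed_formula_mb(row_count, base_overhead_mb=512.0, mb_per_row=0.002):
    """The old allocation: base_overhead + coefficient x row_count."""
    return base_overhead_mb + mb_per_row * row_count


def simple_job_mb(row_count):
    """Hypothetical true usage of a job that scales linearly with rows."""
    return 256.0 + 0.001 * row_count


def aggregation_job_mb(row_count):
    """Hypothetical true usage of a job that scales quadratically with rows."""
    return 256.0 + 1e-9 * row_count ** 2


rows = 1_000_000
idle_fraction = 1 - simple_job_mb(rows) / fixed_formula_mb(rows)
print(f"simple job at {rows:,} rows: {idle_fraction:.0%} of the allocation is idle")

big = 4_000_000
shortfall = aggregation_job_mb(big) - fixed_formula_mb(big)
print(f"aggregation job at {big:,} rows: under-allocated by {shortfall:,.0f} MB")
```

Raising the coefficient to cover the larger aggregation job would only widen the idle fraction for every simple job, which is the trade-off the per-job-type models remove.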
Better Training Data
We extended job tagging to capture richer metadata: entity counts, temporal dimensions, upstream partition sizes. Tags link to run IDs, which become join keys for correlating inputs with peak memory from our monitoring stack. We query peak container_memory_working_set_bytes over the full execution window. Peak matters because ML jobs spike during joins and inference.
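A peak-memory lookup of this kind can be expressed as a single instant query evaluated at the end of the run. The sketch below only builds the parameters for the Prometheus /api/v1/query endpoint; the pod naming pattern and the use of the pod and container labels are assumptions about the cluster's metric labels, and the function name is illustrative.

```python
def peak_memory_request(pod_regex, start_ts, end_ts):
    """Build query parameters for the peak working set over one run.

    pod_regex: regex matching the run's pods (naming scheme is an assumption).
    start_ts, end_ts: Unix timestamps bounding the execution window.
    """
    window_s = max(1, int(end_ts - start_ts))
    query = (
        "max(max_over_time(container_memory_working_set_bytes"
        # container!="" drops the pod-level aggregate series.
        f'{{pod=~"{pod_regex}",container!=""}}[{window_s}s]))'
    )
    # Evaluating at end_ts makes the range selector cover the whole run.
    return {"query": query, "time": end_ts}


params = peak_memory_request("feature-job-abc-.*", start_ts=1000, end_ts=4600)
print(params["query"])
```

The inner max_over_time takes the peak per series across the window, and the outer max picks the largest container, so a short spike during a join is not averaged away.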
Why Quantile Regression, Not MSE
Memory allocation is a worst-case avoidance problem. Regression trained on mean squared error (MSE) predicts the conditional mean. We need the allocation that covers a job 90% of the time.
We use pinball loss at alpha = 0.9. Under-prediction gets penalized 9x more than over-prediction, forcing the model toward the upper bound.
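```python
import numpy as np

def pinball_loss(y_true, y_pred, alpha=0.9):
    """Asymmetric loss for quantile regression.
    alpha=0.9: targets the 90th percentile of memory usage."""
    # Positive error means under-prediction: the job used more than predicted.
    error = y_true - y_pred
    # Under-prediction costs alpha * error; over-prediction costs (1 - alpha) * |error|.
    return np.mean(np.maximum(alpha * error, (alpha - 1) * error))
```

Most of the remaining 10% of cases are absorbed by a 20% safety buffer added on top.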
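To make the asymmetry concrete, the following sketch checks the 9x penalty ratio and confirms numerically that the constant minimizing pinball loss at alpha = 0.9 sits at the 90th percentile. The log-normal sample is synthetic and stands in for peak memory readings; it is not our data.

```python
import numpy as np


def pinball_loss(y_true, y_pred, alpha=0.9):
    error = y_true - y_pred
    return np.mean(np.maximum(alpha * error, (alpha - 1) * error))


# Missing by one unit in each direction.
under = pinball_loss(np.array([10.0]), np.array([9.0]))   # predicted too little
over = pinball_loss(np.array([10.0]), np.array([11.0]))   # predicted too much
print(f"under-prediction costs {under / over:.1f}x more than over-prediction")

# Synthetic, right-skewed "peak memory" sample (assumption for illustration).
rng = np.random.default_rng(0)
usage = rng.lognormal(mean=1.0, sigma=0.5, size=5000)

# Brute-force search for the single allocation that minimizes the loss.
candidates = np.linspace(usage.min(), usage.max(), 2000)
losses = [pinball_loss(usage, c) for c in candidates]
best = candidates[int(np.argmin(losses))]
coverage = float(np.mean(usage <= best))
print(f"best constant covers {coverage:.1%} of the sample")
```

An MSE-trained model run on the same sample would settle near the mean instead, leaving a much larger share of runs above the allocation.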
Per-Job-Type Models and Weekly Retraining
We train a separate xgboost.XGBRegressor (with objective='reg:quantileerror' and quantile_alpha=0.9) per job category. A single model across all types would average across different scaling profiles and miss all of them.
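Each model is validated on the held-out 20% of an 80/20 train/test split. If quantile coverage (the share of held-out runs whose actual peak stays at or below the prediction) falls below 90%, the model is under-predicting and needs more data or better features.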
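Quantile coverage is cheap to compute: it is the fraction of held-out runs whose observed peak did not exceed the prediction. A minimal sketch, with illustrative function names and made-up values:

```python
import numpy as np


def quantile_coverage(y_true, y_pred):
    """Fraction of runs whose actual peak memory was at or below the prediction."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(y_true <= y_pred))


def is_under_predicting(y_true, y_pred, target=0.9):
    """True when coverage falls short of the target quantile."""
    return quantile_coverage(y_true, y_pred) < target


# Ten held-out runs with peaks of 1..10 GB (made-up values).
actual_gb = np.arange(1, 11)
print(quantile_coverage(actual_gb, np.full(10, 9.0)))   # 9 of 10 runs covered
print(is_under_predicting(actual_gb, np.full(10, 8.0)))  # only 8 of 10 covered
```

Coverage well above the target is also worth watching, since it means the model is over-allocating and giving back some of the savings.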
A Dagster asset retrains all models weekly over a 30-day rolling window. At submission time, the job launcher predicts memory from the latest model, multiplies by 1.2 as a safety buffer, and falls back to a conservative default for unknown job types.
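The launcher-side logic can be sketched as follows. The model registry is represented here as a plain mapping from job type to a callable, and the default value and all names are hypothetical; the real launcher loads trained models rather than lambdas.

```python
import math

SAFETY_BUFFER = 1.2         # 20% headroom on top of the predicted 90th percentile
DEFAULT_MEMORY_MB = 16_384  # hypothetical conservative fallback


def memory_request_mb(job_type, features, models, default_mb=DEFAULT_MEMORY_MB):
    """Return the memory request for a job, in MB.

    models: mapping of job type -> callable(features) -> predicted peak MB.
    """
    model = models.get(job_type)
    if model is None:
        # Unknown job type: no history to learn from, so stay conservative.
        return default_mb
    predicted_mb = float(model(features))
    return math.ceil(predicted_mb * SAFETY_BUFFER)


# Stand-in for a trained per-job-type model (illustrative only).
models = {"export": lambda features: 1000.0}

print(memory_request_mb("export", {"row_count": 50_000}, models))
print(memory_request_mb("brand-new-job", {}, models))
```

Keeping the buffer and the fallback in the launcher rather than in the model means they can be tuned without retraining.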
Pod Failure Sensor
Two transient failures hit our pipeline: OOM kills when the model under-allocates, and pod terminations from spot reclamation or node timeouts. Both produce a failed job and a paged engineer. We automated recovery with a Dagster @run_failure_sensor that:
- Classifies the failure via the Prometheus API (OOM vs. infrastructure termination)
- Respects a retry budget. Repeated failures on the same run signal a real bug
- Cancels in-flight child runs to avoid retry storms
- Bumps memory by 50% for OOM cases before retrying
After deployment, fewer than 1% of the feature engineering and aggregation jobs in our pipeline hit OOM. The sensor handles those without human intervention.
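The sensor's decision policy, separated from the Dagster and Prometheus plumbing, might look like the sketch below. The retry budget of 2, the failure labels, and every name are assumptions for illustration; the sketch also assumes child runs are cancelled only when a retry is issued.

```python
from dataclasses import dataclass

MAX_RETRIES = 2        # hypothetical retry budget per run
OOM_MEMORY_BUMP = 1.5  # retry OOM failures with 50% more memory


@dataclass(frozen=True)
class RetryDecision:
    retry: bool
    memory_mb: int
    cancel_children: bool
    reason: str


def decide_retry(failure_kind, attempts_so_far, memory_mb):
    """Decide how to handle a failed run.

    failure_kind: "oom", "infrastructure", or anything else for a non-transient failure.
    attempts_so_far: retries already spent on this run.
    """
    if failure_kind not in ("oom", "infrastructure"):
        return RetryDecision(False, memory_mb, False, "not a transient failure")
    if attempts_so_far >= MAX_RETRIES:
        # Repeated failures on the same run point to a real bug.
        return RetryDecision(False, memory_mb, False, "retry budget exhausted")
    if failure_kind == "oom":
        bumped = int(memory_mb * OOM_MEMORY_BUMP)
        return RetryDecision(True, bumped, True, "oom: retry with more memory")
    return RetryDecision(True, memory_mb, True, "infrastructure: retry unchanged")


print(decide_retry("oom", attempts_so_far=0, memory_mb=8192))
print(decide_retry("infrastructure", attempts_so_far=2, memory_mb=8192))
```

Keeping the policy as a pure function makes it straightforward to unit test without a running Dagster instance or Prometheus server.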
Results
Startup Time
Across thousands of concurrent pods, the aggregate billed compute time saved per run measures in hours, even though wall-clock improvement is closer to the 1-3 minutes each pod no longer waits.
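The gap between aggregate and wall-clock savings is simple arithmetic. The pod count below is a hypothetical stand-in for "thousands"; only the 1-3 minute startup range comes from the figures above.

```python
def aggregate_startup_hours(num_pods, startup_minutes):
    """Total billed time spent waiting on startup across all pods, in hours."""
    return num_pods * startup_minutes / 60.0


pods = 2000  # hypothetical pod count for one run
low = aggregate_startup_hours(pods, 1)
high = aggregate_startup_hours(pods, 3)
print(f"{pods} pods: {low:.0f} to {high:.0f} billed hours of startup overhead per run")
```

Because the pods start in parallel, the run itself finishes only a few minutes sooner, while the bill shrinks by the full aggregate.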
Compute Cost
Memory Prediction
Static and graph-complexity feature jobs saw the largest per-job reductions (70-80% in some categories). Data export jobs saw smaller per-job gains (~25%) but the largest aggregate impact due to volume.
Production Observations
- Spot termination frequency dropped after adding ARM64. Since the 60/40 split, no pipeline has failed from a spot event.
- Both ARM64 and x86 instances schedule and complete feature compute jobs.
- Long-lived service pods remain stable on Fargate, unaffected by Karpenter consolidation.
- Nodes drain and terminate within 5 minutes of emptying.
Implementation Checklist
Full Karpenter and EC2NodeClass docs cover the Terraform details.
Karpenter: (1) Tag private subnets for discovery. (2) Configure an EC2NodeClass with AMI family, IAM role, and subnet/security-group selectors. (3) Create two NodePools with disjoint capacity-spread domains (spot + on-demand). (4) Add NoExecute tolerations and topologySpreadConstraints to compute pods. (5) Build multi-arch images via docker buildx. (6) Keep long-lived services on Fargate.
Memory prediction: (1) Tag jobs at execution time with row counts, entity counts, upstream partition sizes, linked to run IDs. (2) Query peak container_memory_working_set_bytes per run. (3) Engineer features, including per-entity-type breakdowns for graph-complexity jobs. (4) Train per-job-type quantile models with pinball loss at alpha = 0.9; validate ~90% coverage on held-out data. (5) Apply predictions with a 20% buffer in the job launcher.
Limitations and Future Work
Karpenter: We are evaluating a node termination handler to drain pods before spot reclamation, reducing blast radius further. Performance profiling of compute-intensive ML workloads across ARM64 and x86 is ongoing.
Memory prediction: Model accuracy improves as the training set grows, especially for newer feature definitions with limited run history. Weekly retraining handles gradual drift; we are evaluating live coverage metrics as an early-warning signal for sudden shifts. Graph-complexity jobs have the most variable profiles and need more data for acceptable coverage.
Key Takeaways
- Reducing compute unit cost (Karpenter) and reducing allocation waste (memory prediction) are complementary. Neither captures the full opportunity alone.
- Fargate cold starts become a pipeline tax at scale. EC2 with Karpenter eliminates this for pods on warm nodes.
- Use NoExecute taints over node affinity for workload isolation. Affinity attracts; taints enforce.
- Spot pool size matters more than spot price. Multi-arch support doubles your effective pool.
- Karpenter's capacity-spread gives you ratio control with no application code changes.
- Predict the 90th percentile, not the mean. Pinball loss targets OOM avoidance while minimizing waste.
- Richer input features beat a more complex model. Multi-dimensional metadata improved accuracy far more than switching model families.
References
- ODAIA Engineering Blog: Building a Secure ML Feature Store: Cutting Processing Time by 90%
- AWS EC2 Spot Instances
- AWS Fargate Pricing
- AWS Graviton Instances
- Karpenter NodePools
- Karpenter Scheduling
- Kubernetes Taints and Tolerations
- Kubernetes topologySpreadConstraints
- Docker Buildx Multi-Platform Builds
- XGBoost Quantile Regression
- Prometheus Query API






