✦ WORKLOAD RECOMMENDATIONS

Best GPU for Machine Learning Training in 2026: Price vs Performance Guide

📅 May 25, 2026 ⏱ 9 min read 🗃 Updated 2026-06-03

Picking the wrong GPU for your training job is an expensive mistake. The gap between the best and worst choice for a given workload can easily be 10× in cost per hour, and if you're running hundreds of GPU-hours per week, that's real money. This guide uses 30-day spot pricing data from 7 cloud providers to help you match the right GPU to your training task based on what actually matters: VRAM, compute throughput, and $/TFLOP.

Cheapest H100 spot
$1.47/hr
Vast.ai · current spot (30-day avg $1.53/hr)
Cheapest A100 spot
$0.43/hr
Vast.ai · 30-day avg
Best value GPU
L40S
$0.79/hr · 730 FP16 TFLOPS
Lowest cost entry
RTX 4090
$0.01/hr min on Vast.ai

Why GPU Selection Matters More Than Ever in 2026

The GPU market has fractured. You no longer simply pick "the best GPU"; you pick the right GPU for your workload, your budget, and your interruption tolerance. An H100 averages about $46/hr on AWS but can be had for $1.47/hr on Vast.ai. An A100 that's $1.21/hr on CoreWeave can be $0.43/hr on Vast.ai on any given day. That variance changes which GPU is actually the right choice.

Three specs determine GPU suitability for ML training: VRAM (determines max batch size and model size), FP16 TFLOPS (raw throughput for matrix operations), and interconnect bandwidth (NVLink/NVSwitch for multi-GPU scaling). Price matters, but evaluate it only after these three.

The three questions to answer first: (1) What's your model's parameter count? (2) Can your workload tolerate spot interruptions? (3) Do you need multi-GPU scaling? The answers narrow your GPU tier immediately. Everything below flows from those three inputs.

GPU Tier List Based on Real Pricing Data

The tables below show current and 30-day average spot prices per GPU and provider. Use them to see where the actual price floor is: not the list price, but what you're likely to pay on a good day.

🪙 Budget Tier

Best $/TFLOP for fine-tuning, small models, and experiments

RTX 4090 24GB VRAM 330 FP16 TFLOPS
★ Top pick: Vast.ai at $0.13/hr

| Provider | Current spot | 30d avg | 30d range | Stability | Value |
| --- | --- | --- | --- | --- | --- |
| Vast.ai | $0.13/hr | $0.13/hr | $0.01 – $0.27/hr | ★★★ Very stable | ★★★★ Best |
| RunPod | $0.34/hr | $0.34/hr | $0.34 – $0.34/hr | ★★★ Very stable | ★★★★ Best |

Source: RoofRun price_snapshots, 30-day window ending 2026-06-03. Raw data via API.

โš–๏ธ Mid-Range Tier

The sweet spot for most production ML training workloads

A100 80GB VRAM 624 FP16 TFLOPS
★ Top pick: Vast.ai at $0.43/hr

| Provider | Current spot | 30d avg | 30d range | Stability | Value |
| --- | --- | --- | --- | --- | --- |
| Vast.ai | $0.43/hr | $0.43/hr | $0.43 – $0.85/hr | ★★ Stable | ★★★★ Best |
| RunPod | $1.00/hr | $1.02/hr | $1.00 – $1.39/hr | ★★ Stable | ★★★★ Best |
| CoreWeave | $1.21/hr | $1.21/hr | $1.19 – $1.23/hr | ★★★ Very stable | ★★★★ Best |
| Lambda Cloud | $1.29/hr | $1.29/hr | $1.28 – $1.30/hr | ★★★ Very stable | ★★★★ Best |
| Google Cloud Platform | $1.52/hr | $1.50/hr | $1.46 – $1.55/hr | ★★★ Very stable | ★★★★ Best |
| Microsoft Azure | $1.83/hr | $3.24/hr | $0.86 – $7.67/hr | ⚠ Volatile | ★★ Fair |
| Amazon Web Services | $5.73/hr | $11.54/hr | $5.73 – $24.48/hr | ⚠ Volatile | ★ Poor |

Source: RoofRun price_snapshots, 30-day window ending 2026-06-03. Raw data via API.

L40S 48GB VRAM 730 FP16 TFLOPS
★ Top pick: Vast.ai at $0.62/hr

| Provider | Current spot | 30d avg | 30d range | Stability | Value |
| --- | --- | --- | --- | --- | --- |
| Vast.ai | $0.43/hr | $0.62/hr | $0.43 – $0.85/hr | ★ Variable | ★★★★ Best |
| RunPod | $0.79/hr | $0.79/hr | $0.79 – $0.79/hr | ★★★ Very stable | ★★★★ Best |
| CoreWeave | $1.82/hr | $1.84/hr | $1.80 – $1.88/hr | ★★★ Very stable | ★★ Fair |

Source: RoofRun price_snapshots, 30-day window ending 2026-06-03. Raw data via API.

🚀 High-End Tier

Large model pretraining, multi-GPU training, frontier research

H100 80GB VRAM 1,979 FP16 TFLOPS
★ Top pick: Vast.ai at $1.53/hr

| Provider | Current spot | 30d avg | 30d range | Stability | Value |
| --- | --- | --- | --- | --- | --- |
| Vast.ai | $1.47/hr | $1.53/hr | $1.40 – $6.13/hr | ★ Variable | ★★★★ Best |
| CoreWeave | $2.07/hr | $2.06/hr | $2.02 – $2.10/hr | ★★★ Very stable | ★★★★ Best |
| Lambda Cloud | $2.49/hr | $2.49/hr | $2.47 – $2.51/hr | ★★★ Very stable | ★★★ Good |
| RunPod | $2.59/hr | $2.59/hr | $2.59 – $2.59/hr | ★★★ Very stable | ★★★ Good |
| Microsoft Azure | $2.16/hr | $3.19/hr | $2.04 – $4.43/hr | ★ Variable | ★★★ Good |
| Google Cloud Platform | $26.85/hr | $27.43/hr | $25.75 – $30.08/hr | ★★★ Very stable | ★ Poor |
| Amazon Web Services | $57.76/hr | $46.22/hr | $31.30 – $60.12/hr | ★ Variable | ★ Poor |

Source: RoofRun price_snapshots, 30-day window ending 2026-06-03. Raw data via API.

Provider Recommendations by Use Case

Price is one variable; availability, stability, and setup complexity are others. Here's the practical recommendation per use case based on 30-day data patterns:

| Use Case | Recommended GPU | Best Providers | Typical Spot Range | Notes |
| --- | --- | --- | --- | --- |
| Fine-tuning (7B–30B models) | A100 80GB | RunPod, Vast.ai | $1.00–$1.22/hr | Checkpoint every 100–500 steps; spot-friendly. |
| Pretraining (70B+ models) | H100 SXM | CoreWeave, Vast.ai | $1.47–$2.49/hr | Multi-node, NVLink critical. Reserve via CoreWeave for consistency. |
| RLHF / Reward Modeling | H100 or A100 | Lambda Cloud, RunPod | $1.28–$2.49/hr | High GPU memory demand. Spot works with checkpointing. |
| Inference (batch) | L40S, A100 | Vast.ai, RunPod | $0.13–$0.79/hr | Batch inference tolerates spot interruptions well. |
| Experiments / dev / small-model fine-tuning | RTX 4090 | Vast.ai, RunPod | $0.01–$0.34/hr | Cheapest path for non-critical dev workloads. |
| Enterprise / SLA-required | A100 or H100 | CoreWeave, GCP | $1.21–$2.06/hr | Higher price, but stability and support justify the premium. |

Understanding $/TFLOP: The Real Efficiency Metric

Price per hour is not the whole story. A GPU that costs twice as much but delivers 3× the throughput has better $/TFLOP, the metric that actually measures your cost efficiency per unit of compute. Here's how the tiers compare:

$/TFLOP reality check: At the 30-day average Vast.ai spot price of $0.13/hr, the RTX 4090 works out to about $0.0004 per FP16 TFLOPS-hour ($0.13 ÷ 330). The H100 on CoreWeave comes in at about $0.0010 per FP16 TFLOPS-hour ($2.06 ÷ 1,979). The absolute $/TFLOP winner depends heavily on which provider you use and on your interruption tolerance. At full on-demand pricing the picture flips, which is why we always recommend spot-first with checkpointing.
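As a rough illustration, the $/TFLOP calculation described in the Methodology section (average spot price divided by FP16 TFLOPS) can be sketched in a few lines of Python. The prices and TFLOPS figures are the 30-day averages and specs listed in the tables above; the function and variable names are hypothetical, not part of any RoofRun API.

```python
# 30-day average spot price ($/hr) and FP16 TFLOPS, as listed in this article's tables.
GPUS = {
    "RTX 4090 (Vast.ai)": (0.13, 330),
    "L40S (Vast.ai)": (0.62, 730),
    "H100 (Vast.ai)": (1.53, 1979),
    "H100 (CoreWeave)": (2.06, 1979),
}


def dollars_per_tflop_hour(price_per_hour, fp16_tflops):
    """Average spot price divided by FP16 TFLOPS."""
    return price_per_hour / fp16_tflops


def rank_by_value(gpus):
    """Return (name, $/TFLOPS-hour) pairs, cheapest compute first."""
    scored = [
        (name, dollars_per_tflop_hour(price, tflops))
        for name, (price, tflops) in gpus.items()
    ]
    return sorted(scored, key=lambda item: item[1])


if __name__ == "__main__":
    for name, value in rank_by_value(GPUS):
        print(f"{name:<20} ${value:.5f} per FP16 TFLOPS-hour")
```

Because the ranking uses spec-sheet TFLOPS, it says nothing about VRAM limits or interconnect, so treat it as a first filter rather than a final answer.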

When Spot Pricing Makes Sense vs. Reserved/On-Demand

✅ Use Spot Instances When:

  • Training with checkpointing (PyTorch torch.save, DeepSpeed save_checkpoint, etc.)
  • Batch jobs you can restart if interrupted
  • Hyperparameter sweeps: many parallel trials, fault-tolerant
  • Fine-tuning smaller models (7B–30B) where interruption risk is low
  • Dev/test environments where a restart is cheap
  • Inference batch jobs without strict SLA requirements

โŒ Use On-Demand/Reserved When:

  • Serving live inference APIs with latency SLA
  • Long multi-day pretraining runs with no checkpoint strategy
  • Stateful applications with no restart capability
  • Jobs under 30 minutes, where interruption overhead isn't worth the discount
  • Strict compliance or data residency requirements
  • Production RLHF pipelines where interruption corrupts state

Hybrid strategy: Run your training on spot with aggressive checkpointing (save every 100–500 steps depending on step duration). Keep a single on-demand or reserved instance for your primary model serving endpoint. This gives you spot economics for training while protecting your production inference SLA.
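The checkpoint-and-resume pattern behind this strategy can be sketched with the standard library alone. This is a toy loop, not a real trainer: `save_checkpoint`, `load_checkpoint`, and `train` are hypothetical helpers, the "training step" is a placeholder, and a real job would persist model and optimizer state (for example with `torch.save`) instead of a small JSON dict. The `stop_after` parameter exists only to simulate a spot interruption.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state atomically so an interruption mid-write cannot corrupt the file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)  # atomic rename within the same directory


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists yet."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "loss": None}


def train(path, total_steps, checkpoint_every=100, stop_after=None):
    """Run or resume a toy training loop, checkpointing every N steps."""
    state = load_checkpoint(path)
    while state["step"] < total_steps:
        if stop_after is not None and state["step"] >= stop_after:
            return state  # simulated spot interruption; unsaved steps are lost
        state["step"] += 1
        state["loss"] = 1.0 / state["step"]  # placeholder for a real training step
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state
```

If the loop is interrupted at step 250 with `checkpoint_every=100`, the next call to `train` resumes from step 200, so the work lost per interruption is bounded by the checkpoint interval.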

For large-scale pretraining runs where every hour of wall time translates to days of schedule delay, reserved capacity on CoreWeave or Lambda Cloud is worth the premium over spot. Both remain far cheaper than AWS or GCP on H100 (roughly 11–22× by the 30-day averages above). Consistent pricing matters more than the absolute floor when you're burning 100+ GPUs per run.

Multi-GPU Training: NVLink Changes the Math

If you're doing multi-GPU training, interconnect matters enormously. PCIe bandwidth between GPUs creates a bottleneck that NVLink largely removes. Only CoreWeave, Lambda Cloud, and the hyperscalers offer NVLink-connected H100 and A100 nodes. Vast.ai and RunPod primarily offer PCIe-connected GPUs.

For pretraining 70B+ models, you need NVLink or NVSwitch to maintain efficient tensor parallelism across 8+ GPUs. The NVLink-connected tier on CoreWeave (8×H100 SXM, ~$16–$20/hr) is the right choice. Trying to replicate this on PCIe-connected GPUs typically yields 30–50% lower effective throughput due to inter-GPU communication overhead.
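To see how that throughput penalty changes the price comparison, here is a minimal sketch that converts a per-GPU spot price into a price per hour of useful compute. It assumes the Vast.ai H100 is PCIe-connected and the CoreWeave H100 is NVLink-connected, and it applies the 30–50% penalty quoted above as a flat discount on throughput, which is a simplification. Function names are illustrative only.

```python
def effective_price(price_per_gpu_hour, throughput_penalty):
    """Price per hour of useful compute when a fraction of throughput is lost."""
    if not 0 <= throughput_penalty < 1:
        raise ValueError("throughput_penalty must be in [0, 1)")
    return price_per_gpu_hour / (1.0 - throughput_penalty)


def node_price(price_per_gpu_hour, gpus=8):
    """Hourly price of a multi-GPU node at a given per-GPU price."""
    return price_per_gpu_hour * gpus


NVLINK_H100 = 2.06  # CoreWeave 30-day avg, $/GPU-hr (from the H100 table)
PCIE_H100 = 1.53    # Vast.ai 30-day avg, $/GPU-hr (assumed PCIe-connected)

if __name__ == "__main__":
    print(f"8x NVLink H100 node: ${node_price(NVLINK_H100):.2f}/hr")
    for penalty in (0.30, 0.50):
        adjusted = effective_price(PCIE_H100, penalty)
        print(f"PCIe H100 at {penalty:.0%} penalty: ${adjusted:.2f} per useful GPU-hr")
```

Under these assumptions the cheaper PCIe price ends up at or above the NVLink price once the communication overhead is counted, which is the point of the paragraph above.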

For fine-tuning 7B–30B models, a single A100 or H100 with 80GB VRAM is sufficient for most use cases. Multi-GPU fine-tuning (FSDP, DeepSpeed ZeRO-3) is only necessary when the model doesn't fit in a single 80GB card, which is becoming rarer as quantization techniques improve.

Quick Decision Guide
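One way to condense the three questions from the introduction (parameter count, interruption tolerance, multi-GPU need) is a small lookup function. This is a sketch of one possible reading of the recommendations in this guide: the `recommend` function, the `Recommendation` type, and the exact thresholds (under 7B, 7B–30B, above 30B) are assumptions made for illustration, and the 30B–70B range is not covered explicitly by the use-case table.

```python
from dataclasses import dataclass


@dataclass
class Recommendation:
    gpu: str
    providers: tuple
    pricing: str  # "spot" or "on-demand/reserved"


def recommend(params_billion, tolerates_interruptions, needs_multi_gpu):
    """Map the three selection questions to a GPU tier (illustrative thresholds)."""
    pricing = "spot" if tolerates_interruptions else "on-demand/reserved"
    if needs_multi_gpu or params_billion > 30:
        # NVLink-capable providers named in the multi-GPU section.
        return Recommendation("H100 SXM (NVLink)", ("CoreWeave", "Lambda Cloud"), pricing)
    if params_billion >= 7:
        # Single 80GB card, per the fine-tuning row of the use-case table.
        return Recommendation("A100 80GB", ("RunPod", "Vast.ai"), pricing)
    # Small models, experiments, and dev work.
    return Recommendation("RTX 4090", ("Vast.ai", "RunPod"), pricing)


if __name__ == "__main__":
    print(recommend(13, tolerates_interruptions=True, needs_multi_gpu=False))
    print(recommend(70, tolerates_interruptions=False, needs_multi_gpu=True))
```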


Methodology

All pricing data in this article comes from RoofRun's continuous polling of provider APIs and pricing pages every 30 minutes. Prices shown are spot instance rates; on-demand pricing is typically 2–4× higher. The "current spot" column reflects the most recent available snapshot per GPU/provider; the "30-day avg" column is the mean across the trailing 30 days. GPU TFLOPS are NVIDIA-published FP16 Tensor Core specs. $/TFLOP is calculated as the 30-day average spot price divided by FP16 TFLOPS. Raw data is available via the public JSON API.
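The column definitions above can be sketched as a small aggregation over price snapshots. The snapshot shape used here, a list of (ISO-8601 UTC timestamp, price) pairs for one GPU/provider pair already limited to the trailing 30 days, is an assumption for illustration; the actual schema of RoofRun's price_snapshots data and JSON API is not described in this article.

```python
from statistics import mean


def summarize(snapshots):
    """Compute the table columns from (iso_timestamp, price_per_hour) pairs."""
    if not snapshots:
        raise ValueError("need at least one snapshot")
    # ISO-8601 strings in the same format and timezone sort chronologically.
    ordered = sorted(snapshots, key=lambda snap: snap[0])
    prices = [price for _, price in ordered]
    return {
        "current_spot": prices[-1],                  # most recent snapshot
        "avg_30d": mean(prices),                     # mean over the window
        "range_30d": (min(prices), max(prices)),     # low and high over the window
    }


if __name__ == "__main__":
    # Made-up sample values, not real RoofRun data.
    sample = [
        ("2026-05-10T00:00:00Z", 1.40),
        ("2026-05-20T00:00:00Z", 1.72),
        ("2026-06-03T00:00:00Z", 1.47),
    ]
    print(summarize(sample))
```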