Best GPU for Machine Learning Training in 2026: Price vs Performance Guide

📅 May 25, 2026 ⏱ 9 min read 🗃 Updated 2026-06-03

Picking the wrong GPU for your training job is an expensive mistake. The gap between the best and worst choice for a given workload can easily be 10× the cost per hour — and if you're running hundreds of GPU-hours per week, that's real money. This guide uses 30-day spot pricing data from 7 cloud providers to help you match the right GPU to your training task based on what actually matters: VRAM, compute throughput, and $/TFLOP.

Cheapest H100 spot: $1.47/hr (Vast.ai · current spot; 30-day avg $1.53/hr)

Cheapest A100 spot: $0.43/hr (Vast.ai · 30-day avg)

Best value GPU: L40S ($0.79/hr on RunPod · 730 FP16 TFLOPS)

Lowest cost entry: RTX 4090 ($0.01/hr min on Vast.ai)

Why GPU Selection Matters More Than Ever in 2026

The GPU market has fractured. You no longer simply pick "the best GPU" — you pick the right GPU for your workload, your budget, and your interruption tolerance. An H100 costs $46/hr on AWS but $1.47/hr on Vast.ai. An A100 that's $1.21/hr on CoreWeave can be $0.43/hr on Vast.ai on any given day. That variance changes which GPU is actually the right choice.

Three specs determine GPU suitability for ML training: VRAM (determines max batch size and model size), FP16 TFLOPS (raw throughput for matrix operations), and interconnect bandwidth (NVLink/NVSwitch for multi-GPU scaling). Price matters, but evaluate it only after these three.

The three questions to answer first: (1) What's your model's parameter count? (2) Can your workload tolerate spot interruptions? (3) Do you need multi-GPU scaling? The answers narrow your GPU tier immediately. Everything below flows from those three inputs.

GPU Tier List Based on Real Pricing Data

The tables below show the current spot price, 30-day average, and 30-day range per GPU per provider. Use them to see where the actual price floor is: not the list price, but what you're likely to pay on a good day.

🪙 Budget Tier

Best $/TFLOP for fine-tuning, small models, and experiments

RTX 4090 · 24GB VRAM · 330 FP16 TFLOPS

★ Top pick: Vast.ai at $0.13/hr

Provider	Current spot	30d avg	30d range	Stability	Value
Vast.ai	$0.13/hr	$0.13/hr	$0.01 – $0.27/hr	★★★ Very stable	★★★★ Best
RunPod	$0.34/hr	$0.34/hr	$0.34 – $0.34/hr	★★★ Very stable	★★★★ Best

Source: RoofRun price_snapshots, 30-day window ending 2026-06-03. Raw data via API.

⚖️ Mid-Range Tier

The sweet spot for most production ML training workloads

A100 · 80GB VRAM · 1,248 FP16 TFLOPS

★ Top pick: Vast.ai at $0.43/hr

Provider	Current spot	30d avg	30d range	Stability	Value
Vast.ai	$0.43/hr	$0.43/hr	$0.43 – $0.85/hr	★★ Stable	★★★★ Best
RunPod	$1.00/hr	$1.02/hr	$1.00 – $1.39/hr	★★ Stable	★★★★ Best
CoreWeave	$1.21/hr	$1.21/hr	$1.19 – $1.23/hr	★★★ Very stable	★★★★ Best
Lambda Cloud	$1.29/hr	$1.29/hr	$1.28 – $1.30/hr	★★★ Very stable	★★★★ Best
Google Cloud Platform	$1.52/hr	$1.50/hr	$1.46 – $1.55/hr	★★★ Very stable	★★★★ Best
Microsoft Azure	$1.83/hr	$3.24/hr	$0.86 – $7.67/hr	⚠ Volatile	★★ Fair
Amazon Web Services	$5.73/hr	$11.54/hr	$5.73 – $24.48/hr	⚠ Volatile	★ Poor

Source: RoofRun price_snapshots, 30-day window ending 2026-06-03. Raw data via API.

L40S · 48GB VRAM · 730 FP16 TFLOPS

★ Top pick: Vast.ai at $0.62/hr

Provider	Current spot	30d avg	30d range	Stability	Value
Vast.ai	$0.43/hr	$0.62/hr	$0.43 – $0.85/hr	★ Variable	★★★★ Best
RunPod	$0.79/hr	$0.79/hr	$0.79 – $0.79/hr	★★★ Very stable	★★★★ Best
CoreWeave	$1.82/hr	$1.84/hr	$1.80 – $1.88/hr	★★★ Very stable	★★ Fair

Source: RoofRun price_snapshots, 30-day window ending 2026-06-03. Raw data via API.

🚀 High-End Tier

Large model pretraining, multi-GPU training, frontier research

H100 · 80GB VRAM · 1,979 FP16 TFLOPS

★ Top pick: Vast.ai at $1.53/hr

Provider	Current spot	30d avg	30d range	Stability	Value
Vast.ai	$1.47/hr	$1.53/hr	$1.40 – $6.13/hr	★ Variable	★★★★ Best
CoreWeave	$2.07/hr	$2.06/hr	$2.02 – $2.10/hr	★★★ Very stable	★★★★ Best
Lambda Cloud	$2.49/hr	$2.49/hr	$2.47 – $2.51/hr	★★★ Very stable	★★★ Good
RunPod	$2.59/hr	$2.59/hr	$2.59 – $2.59/hr	★★★ Very stable	★★★ Good
Microsoft Azure	$2.16/hr	$3.19/hr	$2.04 – $4.43/hr	★ Variable	★★★ Good
Google Cloud Platform	$26.85/hr	$27.43/hr	$25.75 – $30.08/hr	★★★ Very stable	★ Poor
Amazon Web Services	$57.76/hr	$46.22/hr	$31.30 – $60.12/hr	★ Variable	★ Poor

Source: RoofRun price_snapshots, 30-day window ending 2026-06-03. Raw data via API.
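To make the H100 table easier to scan, here is a minimal Python sketch that sorts providers by 30-day average and attaches a relative price spread. The figures are copied from the table above. The `relative_spread` metric, (high - low) / average, is an illustrative assumption for this example and is not how the Stability column is derived.

```python
# H100 30-day figures copied from the table above: (avg, low, high) in $/hr.
H100_30D = {
    "Vast.ai": (1.53, 1.40, 6.13),
    "CoreWeave": (2.06, 2.02, 2.10),
    "Lambda Cloud": (2.49, 2.47, 2.51),
    "RunPod": (2.59, 2.59, 2.59),
    "Microsoft Azure": (3.19, 2.04, 4.43),
    "Google Cloud Platform": (27.43, 25.75, 30.08),
    "Amazon Web Services": (46.22, 31.30, 60.12),
}


def relative_spread(avg: float, low: float, high: float) -> float:
    """Width of the 30-day range as a fraction of the average price.

    Illustrative metric only; the article does not define its Stability column.
    """
    return (high - low) / avg


def summarize(prices: dict) -> list:
    """Return (provider, avg, spread) tuples sorted by 30-day average, cheapest first."""
    rows = [
        (provider, avg, relative_spread(avg, low, high))
        for provider, (avg, low, high) in prices.items()
    ]
    return sorted(rows, key=lambda row: row[1])


if __name__ == "__main__":
    for provider, avg, spread in summarize(H100_30D):
        print(f"{provider:24s} avg ${avg:6.2f}/hr   spread {spread:.0%}")
```

A low average paired with a wide spread (Vast.ai here) suits interruptible jobs. A slightly higher average with a narrow spread (CoreWeave, Lambda Cloud) is easier to budget against.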

Provider Recommendations by Use Case

Price is one variable; availability, stability, and setup complexity are others. Here's the practical recommendation per use case based on 30-day data patterns:

Use Case	Recommended GPU	Best Providers	Typical Spot Range	Notes
Fine-tuning (7B–30B models)	A100 80GB	RunPod, Vast.ai	$0.43–$1.22/hr	Checkpoint every 100–500 steps; spot-friendly.
Pretraining (70B+ models)	H100 SXM	CoreWeave, Vast.ai	$1.47–$2.49/hr	Multi-node; NVLink is critical. Reserve via CoreWeave for consistency.
RLHF / Reward Modeling	H100 or A100	Lambda Cloud, RunPod	$1.28–$2.49/hr	High GPU memory demand. Spot works with checkpointing.
Inference (batch)	L40S, A100	Vast.ai, RunPod	$0.43–$0.79/hr	Batch inference tolerates spot interruptions well.
Experiments / dev / small fine-tunes	RTX 4090	Vast.ai, RunPod	$0.01–$0.34/hr	Cheapest path for non-critical dev workloads.
Enterprise / SLA-required	A100 or H100	CoreWeave, GCP	$1.21–$2.06/hr	Higher price, but stability and support justify the premium.

Understanding $/TFLOP: The Real Efficiency Metric

Price per hour is not the whole story. A GPU that costs twice as much but delivers 3× the throughput has better $/TFLOP, the metric that actually measures your cost efficiency per unit of compute. Here's how the tiers compare:

RTX 4090 (budget): $0.13–$0.34/hr for 330 FP16 TFLOPS. Exceptional $/TFLOP at current spot prices. Best for fine-tuning 7B–13B models, experiments, and any workload where interruption is acceptable.
L40S (mid): $0.79/hr on RunPod, $1.84/hr on CoreWeave. 730 FP16 TFLOPS. Strong value for inference and medium-scale training without requiring H100-scale memory bandwidth.
A100 (mid): $0.43–$1.29/hr across Vast.ai, RunPod, Lambda, CoreWeave. 1,248 FP16 TFLOPS. The production workhorse — excellent $/TFLOP, widely available, mature tooling.
H100 (high): $1.47–$2.59/hr on specialists (Vast.ai, CoreWeave, Lambda, RunPod). 1,979 FP16 TFLOPS. Highest raw throughput of the four GPUs compared here. Essential for pretraining 70B+ models or multi-node training where time-to-train matters.

$/TFLOP reality check: At the 30-day average Vast.ai spot price of $0.13/hr, the RTX 4090 works out to about $0.0004 per FP16 TFLOPS-hour ($0.13 ÷ 330). The H100 on CoreWeave at $2.06/hr comes in at about $0.0010 ($2.06 ÷ 1,979), roughly 2.6× more per unit of peak compute. The absolute $/TFLOP winner depends heavily on which provider you access and on your interruption tolerance. At full on-demand pricing the picture flips, which is why we always recommend spot-first with checkpointing.
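The $/TFLOP arithmetic is simple enough to reproduce yourself. The sketch below follows the definition in the Methodology section (30-day average spot price divided by FP16 TFLOPS), using prices and TFLOPS figures as listed in this guide. These are peak spec numbers, so treat the result as a rough upper bound on efficiency rather than achieved training throughput. The function and dictionary names are illustrative.

```python
# (30-day average spot price in $/hr, FP16 TFLOPS) as listed in this guide.
GPUS = {
    "RTX 4090 (Vast.ai)": (0.13, 330),
    "L40S (RunPod)": (0.79, 730),
    "A100 (Vast.ai)": (0.43, 1248),
    "H100 (CoreWeave)": (2.06, 1979),
}


def dollars_per_tflops_hour(price_per_hour: float, fp16_tflops: float) -> float:
    """Average spot price divided by peak FP16 TFLOPS."""
    return price_per_hour / fp16_tflops


def rank_by_value(gpus: dict) -> list:
    """Return (name, $/TFLOPS-hour) pairs, cheapest per unit of peak compute first."""
    rows = [
        (name, dollars_per_tflops_hour(price, tflops))
        for name, (price, tflops) in gpus.items()
    ]
    return sorted(rows, key=lambda row: row[1])


if __name__ == "__main__":
    for name, cost in rank_by_value(GPUS):
        print(f"{name:20s} ${cost:.5f} per FP16 TFLOPS-hour")
```

Swap in the provider and price you can actually get. The ranking is sensitive to both, which is the point of comparing by $/TFLOP instead of by hourly rate.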

When Spot Pricing Makes Sense vs. Reserved/On-Demand

✅ Use Spot Instances When:

Training with checkpointing (PyTorch torch.save, DeepSpeed save_checkpoint, etc.)
Batch jobs you can restart if interrupted
Hyperparameter sweeps — many parallel trials, fault-tolerant
Fine-tuning smaller models (7B–30B) where interruption risk is low
Dev/test environments where a restart is cheap
Inference batch jobs without strict SLA requirements

❌ Use On-Demand/Reserved When:

Serving live inference APIs with latency SLA
Long multi-day pretraining runs with no checkpoint strategy
Stateful applications with no restart capability
Jobs under 30 minutes — interrupt overhead isn't worth the discount
Strict compliance or data residency requirements
Production RLHF pipelines where interruption corrupts state

Hybrid strategy: Run your training on spot with aggressive checkpointing (save every 100–500 steps depending on step duration). Keep a single on-demand or reserved instance for your primary model serving endpoint. This gives you spot economics for training while protecting your production inference SLA.
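The checkpointing half of that strategy can be sketched in a few lines. The example below is framework-agnostic and uses only the standard library: `pickle` stands in for whatever your framework provides (for example `torch.save` in PyTorch), and the "training step" is a placeholder. All function names and the `stop_after` parameter are hypothetical, chosen for illustration. The one design point worth copying is the write-then-rename pattern, so a preemption in the middle of a save cannot leave a truncated checkpoint.

```python
import os
import pickle
import tempfile
from typing import Optional

CHECKPOINT_EVERY = 100  # steps; this guide suggests 100-500 depending on step time


def save_checkpoint(state: dict, path: str) -> None:
    """Write to a temp file in the same directory, then rename over the target."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    # os.replace is atomic when source and target are on the same filesystem.
    os.replace(tmp_path, path)


def load_checkpoint(path: str) -> dict:
    """Return the saved state, or a fresh state if no checkpoint exists yet."""
    if not os.path.exists(path):
        return {"step": 0, "weights": 0.0}
    with open(path, "rb") as f:
        return pickle.load(f)


def train(total_steps: int, path: str, stop_after: Optional[int] = None) -> dict:
    """Resume from the last checkpoint and run until total_steps.

    stop_after simulates a spot preemption after that many steps in this run.
    """
    state = load_checkpoint(path)
    steps_this_run = 0
    while state["step"] < total_steps:
        state["weights"] += 0.01  # placeholder for a real optimizer step
        state["step"] += 1
        steps_this_run += 1
        if state["step"] % CHECKPOINT_EVERY == 0:
            save_checkpoint(state, path)
        if stop_after is not None and steps_this_run >= stop_after:
            return state  # interrupted: work since the last checkpoint is lost
    save_checkpoint(state, path)
    return state
```

With `CHECKPOINT_EVERY = 100`, a run interrupted at step 250 resumes from step 200, so each interruption costs at most one checkpoint interval of work. Shorter intervals lose less per interruption but spend more time writing checkpoints.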

For large-scale pretraining runs where every hour of wall time translates to days of schedule delay, reserved capacity on CoreWeave or Lambda Cloud (1.2–2.6× cheaper than AWS/GCP on H100) is worth the premium. The consistency of pricing matters more than the absolute floor when you're burning 100+ GPUs per run.

Multi-GPU Training: NVLink Changes the Math

If you're doing multi-GPU training, interconnect matters enormously. PCIe bandwidth between GPUs creates a bottleneck that NVLink largely removes. Among the providers tracked here, CoreWeave, Lambda Cloud, and the hyperscalers offer NVLink-connected H100 and A100 nodes; Vast.ai and RunPod primarily offer PCIe-connected GPUs.

For pretraining 70B+ models, you need NVLink or NVSwitch to maintain efficient tensor parallelism across 8+ GPUs. The NVLink-connected tier on CoreWeave (8×H100 SXM, ~$16–$20/hr) is the right choice. Trying to replicate this on PCIe-connected GPUs will produce 30–50% lower effective throughput due to inter-GPU communication overhead.
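A quick way to see why the cheaper PCIe option can lose is to divide the node price by its throughput relative to the NVLink baseline. The sketch below assumes a hypothetical 8×H100 PCIe node priced at Vast.ai's 30-day average per GPU ($1.53/hr), compared with 8×H100 on CoreWeave at $2.06/hr per GPU, and applies the 30–50% throughput loss cited above. Whether such a node is actually available at that price is an assumption; the function name is illustrative.

```python
def cost_per_unit_work(node_price_per_hour: float, relative_throughput: float) -> float:
    """Hourly node price divided by throughput relative to the NVLink baseline (1.0)."""
    if not 0 < relative_throughput <= 1:
        raise ValueError("relative_throughput must be in (0, 1]")
    return node_price_per_hour / relative_throughput


NVLINK_NODE = 8 * 2.06  # 8x H100 on CoreWeave at the 30-day average, $/hr
PCIE_NODE = 8 * 1.53    # hypothetical 8x H100 PCIe node at Vast.ai's 30-day average, $/hr

if __name__ == "__main__":
    print(f"NVLink node:                   ${NVLINK_NODE:.2f} per baseline-hour of work")
    for loss in (0.30, 0.50):
        effective = cost_per_unit_work(PCIE_NODE, 1.0 - loss)
        print(f"PCIe node at {loss:.0%} throughput loss: ${effective:.2f} per baseline-hour of work")
```

Under these assumptions the PCIe node costs $12.24/hr against $16.48/hr, but once the throughput loss is priced in it is the more expensive way to finish the same job, even at the optimistic end of the range.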

For fine-tuning 7B–30B models, a single A100 or H100 with 80GB VRAM is sufficient for most use cases. Multi-GPU fine-tuning (FSDP, DeepSpeed ZeRO-3) is only necessary when the model doesn't fit in a single 80GB card — which is becoming rarer as quantization techniques improve.
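For a first pass at "does it fit?", the weights alone give a useful lower bound: parameter count times bytes per parameter. The sketch below covers only that. It deliberately ignores gradients, optimizer state, and activations, which for full-parameter training can add several times the weight footprint, so treat a "fits" result as necessary but not sufficient. The `headroom` factor of 0.8 is an illustrative assumption, not a measured value.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(params_billions: float, precision: str) -> float:
    """Memory for the model weights alone, in GB (1 GB = 1e9 bytes).

    Excludes gradients, optimizer state, and activations.
    """
    # (params_billions * 1e9 params) * bytes_per_param / 1e9 bytes per GB
    return params_billions * BYTES_PER_PARAM[precision]


def weights_fit(params_billions: float, precision: str, vram_gb: float,
                headroom: float = 0.8) -> bool:
    """True if the weights use at most `headroom` of the card's VRAM.

    headroom is an illustrative assumption that leaves room for everything else.
    """
    return weight_memory_gb(params_billions, precision) <= vram_gb * headroom


if __name__ == "__main__":
    cases = [
        (7, "int4", 24),   # 7B at 4-bit on an RTX 4090
        (13, "fp16", 24),  # 13B at FP16 on an RTX 4090
        (30, "fp16", 80),  # 30B at FP16 on an A100/H100 80GB
        (70, "fp16", 80),  # 70B at FP16 on an A100/H100 80GB
    ]
    for params, precision, vram in cases:
        gb = weight_memory_gb(params, precision)
        verdict = "fits" if weights_fit(params, precision, vram) else "does not fit"
        print(f"{params}B @ {precision}: {gb:.1f} GB of weights, {verdict} in {vram} GB")
```

A 70B model at FP16 needs 140 GB for weights alone, which is why that tier is multi-GPU by default, while a 7B model at 4-bit needs only 3.5 GB and sits comfortably on a 24 GB card.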

Quick Decision Guide

Fitting a 7B model at 4-bit? → RTX 4090 on Vast.ai ($0.01–$0.34/hr). Cheap, fast enough.
Fine-tuning 7B–30B at FP16? → A100 80GB on RunPod or Vast.ai ($0.43–$1.22/hr). Best balance of VRAM and $/hr.
Pretraining 70B+ from scratch? → H100 SXM with NVLink on CoreWeave ($2.06/hr). Reserve for consistent runs.
Running large inference batches? → L40S on RunPod ($0.79/hr). High throughput, lower cost than A100 for inference-scale work.
Enterprise workload with SLA? → A100 or H100 on CoreWeave or Lambda Cloud. Stability and support justify the premium.

Methodology

All pricing data in this article comes from RoofRun's continuous polling of provider APIs and pricing pages every 30 minutes. Prices shown are spot instance rates; on-demand pricing is typically 2–4× higher. The "current spot" column reflects the most recent available snapshot per GPU/provider; the "30-day avg" column is the mean across the trailing 30 days. GPU TFLOPS are NVIDIA-published FP16 Tensor Core specs. $/TFLOP is calculated as the 30-day average spot price divided by FP16 TFLOPS. Raw data is available via the public JSON API.