In my previous article, I described how we use OpenAI’s Whisper model to transcribe radio and TV broadcasts for Monitorea, our media monitoring platform. At the time, we were running inference on RunPod - a serverless GPU platform that lets you deploy ML models without managing hardware. It was the right call to get started quickly. But as we scaled, the economics stopped making sense.
Here’s how we migrated to fully local inference in about a weekend, using MLX on Apple Silicon and a DGX Spark we call Sparky.
Why RunPod Worked - Until It Didn’t
RunPod was a great starting point. You pick a GPU tier, deploy a Docker container with your Whisper model, and get a REST API. No provisioning, no driver installs, no thermal management. For a small operation transcribing a few radio stations, it was ideal.
The problem was GPU availability. Our workload ran well on budget GPUs - we didn’t need A100s or H100s for Whisper medium. But those cheaper tiers became increasingly hard to get. Jobs would queue, cold starts would spike, and the platform would nudge us toward more expensive hardware. What started at a reasonable rate crept up to around $500/month for 27 stations.
At that spend, the math is simple: a one-time hardware purchase pays for itself in months.
The Constraint: Zero Downtime, Zero Code Changes
Monitorea processes audio segments continuously - radio and TV stations recording 24/7 across Puerto Rico and Washington D.C. We couldn’t afford a migration that required rewriting the transcription pipeline.
The key insight was designing the local server to be API-compatible with RunPod’s /v2/{endpoint_id}/runsync endpoint. Same request format, same response schema. The entire migration for each service was a single environment variable change:
```
# Before (RunPod)
WHISPER_API_BASE_URL=https://api.runpod.ai

# After (local MLX)
WHISPER_API_BASE_URL=http://mac-studio:8080
```
No code changes. No deployment. Just an env var swap and a service restart.
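To make the idea concrete, here is a minimal sketch of what a client built this way might look like. It is not our production client: the `WHISPER_ENDPOINT_ID` variable, the `audio_base64` field name, and the `build_request` helper are assumptions made for illustration. Only the `/v2/{endpoint_id}/runsync` path and the `WHISPER_API_BASE_URL` variable come from the setup described above.

```python
import base64
import json
import os
import urllib.request

# WHISPER_API_BASE_URL is the variable from the article. WHISPER_ENDPOINT_ID and
# the "audio_base64" field name are hypothetical, chosen only for this sketch.
BASE_URL = os.environ.get("WHISPER_API_BASE_URL", "http://mac-studio:8080")
ENDPOINT_ID = os.environ.get("WHISPER_ENDPOINT_ID", "my-endpoint")


def build_request(audio, base_url=BASE_URL, endpoint_id=ENDPOINT_ID, api_key=None):
    """Build the POST request; only the base URL differs between backends."""
    url = f"{base_url.rstrip('/')}/v2/{endpoint_id}/runsync"
    payload = {"input": {"audio_base64": base64.b64encode(audio).decode("ascii")}}
    headers = {"Content-Type": "application/json"}
    if api_key:  # a hosted backend needs a key; a local server may not
        headers["Authorization"] = f"Bearer {api_key}"
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def transcribe(audio):
    """Send the audio and return the parsed JSON response."""
    with urllib.request.urlopen(build_request(audio), timeout=120) as resp:
        return json.load(resp)
```

Because the URL is assembled from the environment, pointing the same code at a different backend only changes the host portion of the request; the path and body stay identical.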
Building monitorea-whisper with MLX
We built monitorea-whisper - a FastAPI server that wraps MLX Whisper, a Whisper implementation built on MLX, Apple’s machine learning framework optimized for the unified memory architecture of Apple Silicon. The design is straightforward:
- FastAPI handles HTTP requests on port 8080
- A process pool distributes work across multiple workers (configurable per hardware)
- Each worker loads the Whisper model independently into GPU memory via MLX
- Workers recycle after 60 tasks to prevent Metal/GPU memory accumulation - a quirk of running sustained MLX workloads
The server accepts base64-encoded audio, decodes it, runs inference, and returns the transcription with word-level timestamps. Because it matches RunPod’s API contract exactly, existing clients don’t know (or care) where inference is happening.
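The worker-recycling part of that design can be sketched with nothing but the standard library. This is an illustration under assumptions, not the actual monitorea-whisper code: the `transcribe` body is a stub standing in for MLX inference, and `make_pool` is a hypothetical helper. The 60-task limit and the 6 workers are the figures from this article.

```python
import base64
import multiprocessing as mp

TASKS_PER_WORKER = 60  # recycle threshold described in the article
NUM_WORKERS = 6        # worker count we run on the M3 Ultra


def decode_audio(audio_b64):
    """Decode the base64 payload into raw audio bytes."""
    return base64.b64decode(audio_b64, validate=True)


def transcribe(audio_b64):
    """Stub worker task: a real worker would run the Whisper model here."""
    audio = decode_audio(audio_b64)
    return {"text": "", "num_bytes": len(audio)}


def make_pool(workers=NUM_WORKERS):
    """Create a pool whose processes exit and respawn after a fixed task count.

    maxtasksperchild makes each worker process terminate after that many
    tasks, so whatever memory it accumulated is released with the process.
    """
    return mp.Pool(processes=workers, maxtasksperchild=TASKS_PER_WORKER)
```

A request handler would then submit work with something like `pool.apply_async(transcribe, (audio_b64,))` and wait on the result. The cost of recycling is that a fresh worker has to load the model again before it takes its next task.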
Hardware: Mac Studio M3 Ultra
For the primary inference node, we went with a Mac Studio M3 Ultra with 96GB of unified memory. The key finding from our benchmarks: GPU memory bandwidth is the bottleneck, not core count or model size. The M3 Ultra’s 819 GB/s bandwidth lets us run 6 parallel workers at roughly 27.7x real-time speed - meaning a 60-second audio clip transcribes in about 2 seconds.
| Device | GPU Cores | Bandwidth | Workers | Capacity |
|---|---|---|---|---|
| Mac Mini M4 16GB | 10 | 120 GB/s | 1 | ~10 stations |
| Mac Mini M4 Pro 24GB | 20 | 273 GB/s | 2 | ~16 stations |
| Mac Studio M4 Max 64GB | 40 | 546 GB/s | 4 | ~25 stations |
| Mac Studio M3 Ultra 96GB | 60 | 819 GB/s | 6 | ~45 stations |
With 27 active stations, we’re running at about 60% capacity - plenty of headroom for growth.
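For anyone who wants to redo this arithmetic with their own numbers, a small sketch follows. The function names are made up for illustration; the inputs are the figures quoted above.

```python
def transcribe_seconds(audio_seconds, realtime_factor):
    """Wall-clock time to transcribe a clip at a given real-time factor."""
    return audio_seconds / realtime_factor


def utilization(active_stations, station_capacity):
    """Fraction of the estimated station capacity currently in use."""
    return active_stations / station_capacity


# Figures from this article: 27.7x real-time, 27 stations, ~45 station capacity.
clip_time = transcribe_seconds(60, 27.7)  # a bit over 2 seconds
load = utilization(27, 45)                # 0.6, i.e. about 60%
```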
Adding Sparky: CUDA Inference on a DGX Spark
While the Mac Studio handles our production workload, we also built monitorea-whisper-cuda - a PyTorch-based variant that runs on NVIDIA GPUs. This runs on Sparky, our NVIDIA DGX Spark, which serves as both a development machine and a secondary inference node.
The CUDA implementation differs from MLX in a few ways:
- Single worker optimized for GPU throughput (CUDA handles parallelism differently than MLX)
- Direct model.generate() instead of Hugging Face pipelines (avoids a significant latency overhead we measured at ~8x)
- FFmpeg subprocess for audio decoding rather than in-process handling
- Runs on port 8081 to coexist with the MLX server
Same API. Same response format. The backend doesn’t care which hardware is transcribing - it’s just a URL.
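As an illustration of the FFmpeg subprocess approach, here is a minimal sketch that pipes encoded audio through `ffmpeg` and gets back 16 kHz mono samples, which is the input format Whisper expects. The helper names are hypothetical and this is not the exact code in monitorea-whisper-cuda; it assumes an `ffmpeg` binary is on the PATH and that the input container can be read from a pipe.

```python
import subprocess
import sys
from array import array

SAMPLE_RATE = 16_000  # Whisper models expect 16 kHz mono audio


def ffmpeg_command(sample_rate=SAMPLE_RATE):
    """Command that reads any audio from stdin and writes raw 16-bit PCM to stdout."""
    return [
        "ffmpeg", "-loglevel", "error",
        "-i", "pipe:0",           # read encoded audio from stdin
        "-f", "s16le",            # raw signed 16-bit little-endian samples
        "-ac", "1",               # downmix to mono
        "-ar", str(sample_rate),  # resample
        "pipe:1",                 # write to stdout
    ]


def pcm16_to_floats(pcm):
    """Convert little-endian 16-bit PCM bytes to floats in [-1.0, 1.0)."""
    samples = array("h")
    samples.frombytes(pcm)
    if sys.byteorder == "big":  # the stream is little-endian regardless of host
        samples.byteswap()
    return [s / 32768.0 for s in samples]


def decode_audio(data):
    """Run ffmpeg as a subprocess and return the decoded samples."""
    proc = subprocess.run(
        ffmpeg_command(), input=data,
        stdout=subprocess.PIPE, stderr=subprocess.PIPE, check=True,
    )
    return pcm16_to_floats(proc.stdout)
```

In a real server the float list would more likely be a NumPy array or tensor; a plain list keeps the sketch dependency-free.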
The Surprise: Better Quality
We expected cost savings. We didn’t expect quality improvements.
RunPod’s Whisper deployment had a known hallucination problem. Research from early 2025 confirmed what we were seeing: Whisper generates phantom text roughly 40% of the time on non-speech audio - music beds, silence, commercial jingles. We’d get segments full of repeated phrases, strings of exclamation marks, or gibberish Spanish.
MLX Whisper handles this significantly better:
| Segment Type | RunPod Output | MLX Output |
|---|---|---|
| Music | “¿Qué está pasando?” repeated 50x | Empty (correct) |
| Commercial break | 249 exclamation marks | Empty (correct) |
| Silence | “no, no, no, no…” | Empty (correct) |
| Actual speech | 818 chars | 845 chars (both correct) |
The difference comes down to better threshold tuning for silence detection and compression ratio filtering, which we can now control directly rather than relying on RunPod’s default configuration.
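To show what compression ratio filtering means in practice, here is a rough sketch of the idea. The reference openai-whisper implementation measures how well the decoded text compresses with zlib and treats a ratio above 2.4 (its default `compression_ratio_threshold`) as a sign of repetitive output. The `looks_hallucinated` helper below is hypothetical, and the threshold shown is that library default, not necessarily the value we settled on.

```python
import zlib

# Default compression_ratio_threshold in the reference openai-whisper code.
# The value actually used in production may differ.
COMPRESSION_RATIO_THRESHOLD = 2.4


def compression_ratio(text):
    """Ratio of raw UTF-8 size to zlib-compressed size; high means repetitive."""
    data = text.encode("utf-8")
    return len(data) / len(zlib.compress(data))


def looks_hallucinated(text, threshold=COMPRESSION_RATIO_THRESHOLD):
    """Flag non-empty output that compresses suspiciously well."""
    return bool(text) and compression_ratio(text) > threshold


repeated = "¿Qué está pasando? " * 50
normal = "The governor announced a new budget for public schools today."
```

A phrase repeated 50 times, or a long run of exclamation marks, compresses to a small fraction of its size and gets flagged, while an ordinary sentence barely compresses at all. Silence detection is a separate check based on the model's no-speech probability, which this sketch does not cover.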
The Numbers
| | RunPod | Local (Year 1) | Local (Year 2+) |
|---|---|---|---|
| Hardware | - | $3,999 (one-time) | $0 |
| Monthly cost | ~$500 | ~$15 (electricity) | ~$15 |
| Annual cost | ~$6,000 | ~$4,179 | ~$180 |
| Quality | Hallucinations on silence | Clean output | Clean output |
Payback period: ~8 months. After that, we save roughly $485/month - or $5,820/year.
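The payback figure is easy to reproduce. The sketch below uses the numbers from the table; the function names are made up for illustration, and you can swap in your own costs.

```python
HARDWARE_COST = 3999        # one-time, Mac Studio M3 Ultra
RUNPOD_MONTHLY = 500        # approximate RunPod spend
ELECTRICITY_MONTHLY = 15    # approximate local running cost


def monthly_savings(cloud=RUNPOD_MONTHLY, local=ELECTRICITY_MONTHLY):
    """Recurring savings per month once inference runs locally."""
    return cloud - local


def payback_months(hardware=HARDWARE_COST):
    """Months until cumulative savings cover the hardware purchase."""
    return hardware / monthly_savings()


def first_year_cost(hardware=HARDWARE_COST, local=ELECTRICITY_MONTHLY):
    """Hardware plus twelve months of electricity."""
    return hardware + 12 * local
```

With these inputs, `monthly_savings()` is 485, `payback_months()` comes out to about 8.2, and `first_year_cost()` is 4179, matching the table.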
Where It Landed: RTX 5060 Ti 16GB
After running inference across the Mac Studio and Sparky for a while, I settled on a different long-term setup. The DGX Spark is a fantastic machine, but keeping it tied up as a dedicated Whisper node was a waste of its potential - I wanted it free for experimenting with larger models, fine-tuning, and other optimization work.
The answer turned out to be an RTX 5060 Ti 16GB installed in an existing Proxmox server (an AMD Ryzen 5 5600 box we were already running). At around $450 for the card, it’s the best value option for this workload:
- 16GB of VRAM comfortably fits Whisper large-v3-turbo
- Runs monitorea-whisper-cuda with 4 concurrent workers via GPU passthrough
- Handles all 27 stations with headroom to spare
- Sits inside a VM alongside the rest of our local infrastructure - no extra box to manage
The total cost of the inference node is essentially the GPU card itself, since it went into hardware we already had. Compare that to the Mac Studio at $3,999 or tying up the Spark full-time.
This freed Sparky to go back to what it’s best at - running larger experiments, testing new model architectures, and general development work. The Whisper workload is predictable and well-understood; it doesn’t need flagship hardware.
What I’d Do Again
Design for API compatibility from day one. The fact that migration was a single env var change made the whole process low-risk. We tested with 2-3 stations, verified quality, then flipped the rest over in minutes. This same design is what made it painless to move from the Mac Studio to the RTX 5060 Ti later - just another URL swap.
Benchmark on bandwidth, not specs. We almost bought a Mac Studio M4 Max, which would have been tight on capacity. The M3 Ultra’s higher memory bandwidth made the difference, despite being a previous generation chip.
Plan for worker recycling. MLX (and Metal in general) accumulates GPU memory over sustained workloads. Restarting workers every 60 tasks is ugly but effective - it keeps memory stable without any observable latency impact. The CUDA variant doesn’t have this problem - PyTorch’s memory management is more mature for long-running processes.
Start expensive, then optimize down. RunPod let us validate the workload. The Mac Studio proved local inference was viable. The DGX Spark confirmed CUDA was the right long-term path. And the RTX 5060 Ti gave us the best economics. Each step informed the next.
If you’re running ML inference on serverless GPU platforms and your workload is predictable, do the math on local hardware. The upfront cost is real, but the payback is fast - and you get full control over model configuration, latency, and quality. And you don’t have to get the hardware right on the first try - as long as you design the software to be hardware-agnostic.