Migrating from RunPod to Local Whisper Inference with MLX and a DGX Spark

In my previous article, I described how we use OpenAI’s Whisper model to transcribe radio and TV broadcasts for Monitorea, our media monitoring platform. At the time, we were running inference on RunPod - a serverless GPU platform that lets you deploy ML models without managing hardware. It was the right call to get started quickly. But as we scaled, the economics stopped making sense.

Here’s how we migrated to fully local inference in about a weekend, using MLX on Apple Silicon and a DGX Spark we call Sparky.

Why RunPod Worked - Until It Didn’t

RunPod was a great starting point. You pick a GPU tier, deploy a Docker container with your Whisper model, and get a REST API. No provisioning, no driver installs, no thermal management. For a small operation transcribing a few radio stations, it was ideal.

The problem was GPU availability. Our workload ran well on budget GPUs - we didn’t need A100s or H100s for Whisper medium. But those cheaper tiers became increasingly hard to get. Jobs would queue, cold starts would spike, and the platform would nudge us toward more expensive hardware. What started at a reasonable rate crept up to around $500/month for 27 stations.

At that spend, the math is simple: a one-time hardware purchase pays for itself in months.

The Constraint: Zero Downtime, Zero Code Changes

Monitorea processes audio segments continuously - radio and TV stations recording 24/7 across Puerto Rico and Washington D.C. We couldn’t afford a migration that required rewriting the transcription pipeline.

The key insight was designing the local server to be API-compatible with RunPod’s /v2/{endpoint_id}/runsync endpoint. Same request format, same response schema. The entire migration for each service was a single environment variable change:

# Before (RunPod)
WHISPER_API_BASE_URL=https://api.runpod.ai

# After (local MLX)
WHISPER_API_BASE_URL=http://mac-studio:8080

No code changes. No redeploy. Just an env var swap and a service restart.
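To make the idea concrete, here is a minimal sketch of what a client built this way might look like. It is not the actual Monitorea client: the `audio_base64` field name, the function name, and the optional API key handling are assumptions for illustration. Only the `/v2/{endpoint_id}/runsync` path and the `WHISPER_API_BASE_URL` variable come from the setup described above.

```python
import base64
import json
import os
import urllib.request


def build_runsync_request(audio: bytes, endpoint_id: str, api_key: str = "") -> urllib.request.Request:
    """Build a POST to <base>/v2/<endpoint_id>/runsync (hypothetical helper)."""
    # The base URL is the only thing that differs between RunPod and local.
    base = os.environ.get("WHISPER_API_BASE_URL", "https://api.runpod.ai").rstrip("/")
    url = f"{base}/v2/{endpoint_id}/runsync"

    # RunPod wraps job parameters in an "input" object; the field name
    # "audio_base64" is an assumption, not the article's actual schema.
    body = json.dumps(
        {"input": {"audio_base64": base64.b64encode(audio).decode("ascii")}}
    ).encode("utf-8")

    headers = {"Content-Type": "application/json"}
    if api_key:  # a local server may not need a key at all
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(url, data=body, headers=headers, method="POST")
```

Because the URL is assembled from the environment at call time, pointing the same code at a different backend needs nothing more than a new value for `WHISPER_API_BASE_URL`.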

Building monitorea-whisper with MLX

We built monitorea-whisper - a FastAPI server that wraps MLX Whisper, a Whisper implementation built on MLX, Apple’s ML framework optimized for the unified memory architecture of Apple Silicon. The design is straightforward:

FastAPI handles HTTP requests on port 8080
A process pool distributes work across multiple workers (configurable per hardware)
Each worker loads the Whisper model independently into GPU memory via MLX
Workers recycle after 60 tasks to prevent Metal/GPU memory accumulation - a quirk of running sustained MLX workloads
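
The worker model in the list above maps fairly naturally onto Python's standard `multiprocessing.Pool`, whose `maxtasksperchild` argument retires a worker after a fixed number of tasks. The sketch below is an illustration under that assumption, not the monitorea-whisper source; `load_model` is a hypothetical stand-in for loading Whisper through MLX.

```python
import multiprocessing as mp
import os

TASKS_PER_WORKER = 60  # recycle each worker after 60 tasks, as described above
_model = None          # per-process model handle, set once in each worker


def load_model():
    """Hypothetical loader; a real one would load Whisper weights via MLX."""
    return lambda audio: {"text": f"<{len(audio)} bytes transcribed>", "pid": os.getpid()}


def init_worker():
    # Runs once in every new worker process, including recycled replacements,
    # so each worker holds its own independent copy of the model.
    global _model
    _model = load_model()


def transcribe_task(audio: bytes) -> dict:
    # Executes inside a worker and uses the model that worker loaded at startup.
    return _model(audio)


def make_pool(workers: int):
    # maxtasksperchild tells the pool to exit a worker after N tasks and start
    # a fresh process, which releases whatever memory the old one accumulated.
    return mp.Pool(
        processes=workers,
        initializer=init_worker,
        maxtasksperchild=TASKS_PER_WORKER,
    )
```

In a script, this would be used inside an `if __name__ == "__main__":` guard, for example `with make_pool(workers=6) as pool: pool.apply(transcribe_task, (audio_bytes,))`. The worker count is the per-hardware knob from the table further down.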

The server accepts base64-encoded audio, decodes it, runs inference, and returns the transcription with word-level timestamps. Because it matches RunPod’s API contract exactly, existing clients don’t know (or care) where inference is happening.
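
A rough sketch of that request flow, kept independent of any web framework so the contract is easy to see. The envelope (`input` in the request; `id`, `status`, and `output` in the response) follows RunPod's runsync conventions as I understand them. The `audio_base64` field and the shape of the transcription dict are assumptions, and `transcribe` is a hypothetical callable standing in for the model.

```python
import base64
import uuid
from typing import Callable


def handle_runsync(payload: dict, transcribe: Callable[[bytes], dict]) -> dict:
    """Decode the audio, run inference, and wrap the result RunPod-style."""
    job_id = str(uuid.uuid4())
    try:
        # validate=True rejects malformed base64 instead of silently
        # dropping invalid characters.
        audio = base64.b64decode(payload["input"]["audio_base64"], validate=True)
    except (KeyError, TypeError, ValueError) as exc:
        return {"id": job_id, "status": "FAILED", "error": f"bad request: {exc!r}"}

    # `transcribe` returns whatever the model wrapper produces, e.g.
    # {"text": "...", "words": [{"word": ..., "start": ..., "end": ...}]}
    return {"id": job_id, "status": "COMPLETED", "output": transcribe(audio)}
```

Keeping the handler a plain function means the same logic can sit behind a FastAPI route on the Mac or any other server, and can be tested with a stub in place of the model.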

Hardware: Mac Studio M3 Ultra

For the primary inference node, we went with a Mac Studio M3 Ultra with 96GB of unified memory. The key finding from our benchmarks: GPU memory bandwidth is the bottleneck, not core count or model size. The M3 Ultra’s 819 GB/s bandwidth lets us run 6 parallel workers at roughly 27.7x real-time speed - meaning a 60-second audio clip transcribes in about 2 seconds.

Device	GPU Cores	Bandwidth	Workers	Capacity
Mac Mini M4 16GB	10	120 GB/s	1	~10 stations
Mac Mini M4 Pro 24GB	20	273 GB/s	2	~16 stations
Mac Studio M4 Max 64GB	40	546 GB/s	4	~25 stations
Mac Studio M3 Ultra 96GB	60	819 GB/s	6	~45 stations

With 27 active stations, we’re running at about 60% capacity - plenty of headroom for growth.
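
For anyone who wants to rerun these numbers for their own hardware, the arithmetic is small enough to sketch. The function names are made up for illustration; the inputs are the figures quoted above.

```python
def seconds_to_transcribe(clip_seconds: float, realtime_factor: float) -> float:
    """Wall-clock time to transcribe a clip at a given real-time multiple."""
    return clip_seconds / realtime_factor


def utilization(active_stations: int, capacity_stations: int) -> float:
    """Fraction of estimated station capacity currently in use."""
    return active_stations / capacity_stations


# M3 Ultra figures from this section: 27.7x real-time, ~45-station capacity.
print(f"{seconds_to_transcribe(60, 27.7):.2f} s per 60 s clip")  # 2.17 s per 60 s clip
print(f"{utilization(27, 45):.0%} of capacity")                   # 60% of capacity
```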

Adding Sparky: CUDA Inference on a DGX Spark

While the Mac Studio handles our production workload, we also built monitorea-whisper-cuda - a PyTorch-based variant that runs on NVIDIA GPUs. This runs on Sparky, our NVIDIA DGX Spark, which serves as both a development machine and a secondary inference node.

The CUDA implementation differs from MLX in a few ways:

Single worker optimized for GPU throughput (CUDA handles parallelism differently than MLX)
Direct model.generate() instead of Hugging Face pipelines (avoids the pipeline’s latency overhead, which we measured at roughly 8x)
FFmpeg subprocess for audio decoding rather than in-process handling
Runs on port 8081 to coexist with the MLX server

Same API. Same response format. The backend doesn’t care which hardware is transcribing - it’s just a URL.
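
The FFmpeg step in the list above is easy to sketch with the standard library. This is an illustration rather than the monitorea-whisper-cuda source: the function names are hypothetical, and a real implementation would most likely hand the samples to NumPy or PyTorch instead of a Python list. It assumes an `ffmpeg` binary on the PATH and relies on the fact that Whisper models expect 16 kHz mono audio.

```python
import array
import subprocess
import sys

SAMPLE_RATE = 16_000  # Whisper models expect 16 kHz mono input


def ffmpeg_command(sample_rate: int = SAMPLE_RATE) -> list[str]:
    """Read any supported format on stdin, write raw 16-bit mono PCM to stdout."""
    return [
        "ffmpeg", "-hide_banner", "-loglevel", "error",
        "-i", "pipe:0",            # encoded audio arrives on stdin
        "-f", "s16le",             # raw signed 16-bit little-endian samples
        "-ac", "1",                # downmix to mono
        "-ar", str(sample_rate),   # resample to the target rate
        "pipe:1",                  # decoded PCM leaves on stdout
    ]


def pcm16_to_floats(pcm: bytes) -> list[float]:
    """Convert little-endian 16-bit PCM bytes to floats in [-1.0, 1.0)."""
    samples = array.array("h")
    samples.frombytes(pcm)
    if sys.byteorder == "big":  # the bytes are little-endian by construction
        samples.byteswap()
    return [s / 32768.0 for s in samples]


def decode_audio(encoded: bytes) -> list[float]:
    # check=True raises CalledProcessError if ffmpeg cannot decode the input.
    proc = subprocess.run(ffmpeg_command(), input=encoded, capture_output=True, check=True)
    return pcm16_to_floats(proc.stdout)
```

Running the decoder as a subprocess keeps codec handling out of the inference process, at the cost of a process spawn per segment.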

The Surprise: Better Quality

We expected cost savings. We didn’t expect quality improvements.

RunPod’s Whisper deployment had a known hallucination problem. Research from early 2025 confirmed what we were seeing: Whisper generates phantom text roughly 40% of the time on non-speech audio - music beds, silence, commercial jingles. We’d get segments full of repeated phrases, strings of exclamation marks, or gibberish Spanish.

MLX Whisper handles this significantly better:

Segment Type	RunPod Output	MLX Output
Music	“¿Qué está pasando?” repeated 50x	Empty (correct)
Commercial break	249 exclamation marks	Empty (correct)
Silence	“no, no, no, no…”	Empty (correct)
Actual speech	818 chars	845 chars (both correct)

The difference comes down to better threshold tuning for silence detection and compression ratio filtering, which we can now control directly rather than relying on RunPod’s default configuration.
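
To show why compression ratio is such an effective signal for this failure mode, here is a simplified sketch. The reference openai-whisper package computes the ratio the same way (UTF-8 bytes divided by their zlib-compressed size) and defaults to `compression_ratio_threshold=2.4` and `no_speech_threshold=0.6`. The `keep_segment` post-filter itself is a hypothetical simplification, not our exact configuration and not the decoder's full fallback logic.

```python
import zlib

COMPRESSION_RATIO_THRESHOLD = 2.4  # openai-whisper default
NO_SPEECH_THRESHOLD = 0.6          # openai-whisper default


def compression_ratio(text: str) -> float:
    """Highly repetitive text compresses well, so its ratio is large."""
    data = text.encode("utf-8")
    return len(data) / len(zlib.compress(data))


def keep_segment(
    text: str,
    no_speech_prob: float,
    ratio_threshold: float = COMPRESSION_RATIO_THRESHOLD,
    no_speech_threshold: float = NO_SPEECH_THRESHOLD,
) -> bool:
    """Hypothetical post-filter: drop empty, non-speech, or looping segments."""
    if not text.strip():
        return False
    if no_speech_prob > no_speech_threshold:
        return False
    return compression_ratio(text) <= ratio_threshold


looping = "¿Qué está pasando? " * 50
print(compression_ratio(looping) > COMPRESSION_RATIO_THRESHOLD)  # True
print(keep_segment(looping, no_speech_prob=0.1))                 # False
```

The repeated phrase and the run of exclamation marks from the table both compress far past the 2.4 cutoff, while ordinary speech stays well under it.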

The Numbers

	RunPod	Local (Year 1)	Local (Year 2+)
Hardware	-	$3,999 (one-time)	$0
Monthly cost	~$500	~$15 (electricity)	~$15
Annual cost	~$6,000	~$4,179	~$180
Quality	Hallucinations on silence	Clean output	Clean output

Payback period: ~8 months. After that, we save roughly $485/month - or $5,820/year.
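
The arithmetic behind those figures, written out so it is easy to plug in different hardware prices or cloud bills. The inputs are the numbers from the table; nothing else is assumed.

```python
HARDWARE_COST = 3_999   # one-time, USD
RUNPOD_MONTHLY = 500    # approximate cloud spend, USD
LOCAL_MONTHLY = 15      # approximate electricity, USD

monthly_savings = RUNPOD_MONTHLY - LOCAL_MONTHLY
payback_months = HARDWARE_COST / monthly_savings
year_one_local = HARDWARE_COST + 12 * LOCAL_MONTHLY
annual_savings_after_payback = 12 * monthly_savings

print(f"Monthly savings: ${monthly_savings}")               # Monthly savings: $485
print(f"Payback: {payback_months:.1f} months")              # Payback: 8.2 months
print(f"Year 1 local cost: ${year_one_local:,}")            # Year 1 local cost: $4,179
print(f"Savings/year after payback: ${annual_savings_after_payback:,}")  # $5,820
```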

What I’d Do Again

Benchmark on bandwidth, not specs. We almost bought a Mac Studio M4 Max, which at roughly 25 stations of capacity would have left no headroom for our 27. The M3 Ultra’s higher memory bandwidth made the difference, despite it being a previous-generation chip.

Where It Landed: RTX 5060 Ti 16GB

After running inference across the Mac Studio and Sparky for a while, I settled on a different long-term setup. The DGX Spark is a fantastic machine, but keeping it tied up as a dedicated Whisper node was a waste of its potential - I wanted it free for experimenting with larger models, fine-tuning, and other optimization work.

The answer turned out to be an RTX 5060 Ti 16GB installed in an existing Proxmox server (an AMD Ryzen 5 5600 box we were already running). At around $450 for the card, it’s the best value option for this workload:

16GB of VRAM comfortably fits Whisper large-v3-turbo
Runs monitorea-whisper-cuda with 4 concurrent workers via GPU passthrough
Handles all 27 stations with headroom to spare
Sits inside a VM alongside the rest of our local infrastructure - no extra box to manage

The total cost of the inference node is essentially the GPU card itself, since it went into hardware we already had. Compare that to the Mac Studio at $3,999 or tying up the Spark full-time.

This freed Sparky to go back to what it’s best at - running larger experiments, testing new model architectures, and general development work. The Whisper workload is predictable and well-understood; it doesn’t need flagship hardware.

What I’d Do Again

Design for API compatibility from day one. The fact that migration was a single env var change made the whole process low-risk. We tested with 2-3 stations, verified quality, then flipped the rest over in minutes. This same design is what made it painless to move from the Mac Studio to the RTX 5060 Ti later - just another URL swap.

Plan for worker recycling. MLX (and Metal in general) accumulates GPU memory over sustained workloads. Restarting workers every 60 tasks is ugly but effective - it keeps memory stable without any observable latency impact. The CUDA variant doesn’t have this problem - PyTorch’s memory management is more mature for long-running processes.

Start expensive, then optimize down. RunPod let us validate the workload. The Mac Studio proved local inference was viable. The DGX Spark confirmed CUDA was the right long-term path. And the RTX 5060 Ti gave us the best economics. Each step informed the next.

If you’re running ML inference on serverless GPU platforms and your workload is predictable, do the math on local hardware. The upfront cost is real, but the payback is fast - and you get full control over model configuration, latency, and quality. And you don’t have to get the hardware right on the first try - as long as you design the software to be hardware-agnostic.