Inference as a Side Business: Hosting LLM Endpoints on Decentralized Networks with SLAs
If you can serve fast, reliable LLM responses at a fair price, there's steady demand, from agencies building chat tools to startups needing overflow capacity. The twist: instead of buying expensive GPUs, you can rent compute on decentralized networks and still promise real SLAs. This guide shows the full playbook: packaging models, choosing runtimes, meeting latency targets, monitoring, pricing, and writing SLAs that won't trap you.
Note: This article is educational, not legal advice or a performance guarantee. Validate your stack, measure your latency, and tailor contracts to your jurisdiction.
Why this works now
- Runtimes matured: Modern servers (e.g., tensor-parallel engines, paged KV-cache) push high tokens/sec with stable latency.
- Decentralized marketplaces: You can lease GPUs on-demand across multiple providers and regions, hedging outages with multi-home routing.
- API expectations standardized: Most buyers expect OpenAI-compatible endpoints. If yours drops in as a base URL change, you remove friction.
- Composable contracts: Clear SLOs (p95 latency, 4xx/5xx error rates) + error budgets make “side business” reliability actually achievable.
Models, runtimes & quantization: pick for use-case, not hype
The right model depends on context length, response style, and budget. Your business lives or dies on predictable latency and cost per 1M tokens, not leaderboard bragging rights.
Typical side-biz use-cases
- Customer support copilots (short prompts, high concurrency).
- Marketing/SEO paraphrasers (batch jobs + streaming previews).
- Thin wrappers over RAG (kb search + summarize).
- Low-latency chat widgets (p95 < 1.2–1.8 s first token).
Model selection heuristics
- Sub-8k contexts, utility tasks: smaller instruction-tuned 7–14B models, quantized (Q4–Q6) if quality holds.
- Reasoning / tool use: 14–32B class with KV-cache paging to protect tail latency.
- Long-doc chat: prioritize models with efficient attention or windowing; consider server-side chunking + RAG.
Runtimes & memory math
- GPU VRAM planning: VRAM ≈ Params_bytes + KV_cache_bytes(concurrency, seq_len). KV dominates under high concurrency; paged KV or a multi-tenant cache is essential.
- Quantization: INT4/FP8 can halve VRAM and roughly double throughput at a small quality cost. Keep an A/B canary to verify quality before rolling to all tenants.
- Batching vs. streaming: Micro-batch 4–16 requests to raise tokens/sec/core while keeping time-to-first-token under SLO.
Back-of-envelope KV cache
KV_bytes ≈ layers × heads × 2 (K,V) × head_dim × dtype_bytes × (prompt_tokens + generated_tokens) × concurrent_seqs
Example (toy): 32 layers × 32 heads × 2 × 128 × 2 bytes × 2,048 tokens × 8 reqs ≈ 8.6 GB
(Real models vary; grouped-query attention uses far fewer KV heads, so always measure.)
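To sanity-check sizing before renting hardware, a tiny helper keeps the arithmetic honest. This is a sketch: the function name and toy parameters are illustrative, and real engines with paged caches or grouped-query attention will differ.

```python
# Rough KV-cache sizing helper (illustrative only; real engines pack and
# page memory differently, and GQA models use fewer KV heads).
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   dtype_bytes: int, tokens: int, concurrent_seqs: int) -> int:
    # 2x accounts for the K and V tensors per layer
    return layers * kv_heads * 2 * head_dim * dtype_bytes * tokens * concurrent_seqs

if __name__ == "__main__":
    b = kv_cache_bytes(layers=32, kv_heads=32, head_dim=128,
                       dtype_bytes=2, tokens=2048, concurrent_seqs=8)
    print(f"{b / 1e9:.1f} GB")  # ~8.6 GB for this toy config
```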
OpenAI-compatible API & gateway: drop-in for customers
Your gateway should expose /v1/chat/completions and /v1/completions, accept stream=true, and translate to your runtime (e.g., vLLM, TGI, llama.cpp server). Put all auth, rate limits, metering, and fallbacks here. Keep model workers stateless behind it.
Minimal FastAPI gateway (OpenAI-style)
```python
# main.py
import os
import time

import httpx
from fastapi import FastAPI, Header, HTTPException
from fastapi.responses import JSONResponse, StreamingResponse

UPSTREAM = os.getenv("UPSTREAM_URL", "http://worker:8000/generate")
API_KEYS = set(os.getenv("API_KEYS", "dev_123").split(","))

app = FastAPI()


def check_auth(key: str):
    if key not in API_KEYS:
        raise HTTPException(status_code=401, detail="invalid api key")


@app.post("/v1/chat/completions")
async def chat(payload: dict, authorization: str = Header(None)):
    # Auth: expect "Authorization: Bearer <key>"
    if not authorization or not authorization.startswith("Bearer "):
        raise HTTPException(401, "missing bearer token")
    check_auth(authorization.split(" ", 1)[1])

    # Translate the OpenAI-style payload to the worker format
    messages = payload.get("messages", [])
    stream = payload.get("stream", False)
    model = payload.get("model", "default")
    params = {k: payload[k] for k in ("temperature", "top_p", "max_tokens", "stop") if k in payload}
    body = {"messages": messages, "model": model, **params}

    if stream:
        async def gen():
            # Keep the HTTP client open for the lifetime of the stream
            async with httpx.AsyncClient(timeout=None) as client:
                async with client.stream("POST", UPSTREAM, json=body) as r:
                    async for chunk in r.aiter_text():
                        # wrap chunks as SSE for OpenAI-compatible streaming
                        yield f"data: {chunk}\n\n"
            yield "data: [DONE]\n\n"
        return StreamingResponse(gen(), media_type="text/event-stream")

    async with httpx.AsyncClient(timeout=30) as client:
        r = await client.post(UPSTREAM, json=body)
    data = r.json()
    return JSONResponse({
        "id": f"chatcmpl_{int(time.time() * 1000)}",
        "object": "chat.completion",
        "model": model,
        "created": int(time.time()),
        "choices": [{
            "index": 0,
            "finish_reason": data.get("finish_reason", "stop"),
            "message": {"role": "assistant", "content": data["text"]},
        }],
        "usage": data.get("usage", {}),
    })
```
Dockerfile (gateway)
```dockerfile
FROM python:3.11-slim
WORKDIR /app
COPY main.py /app/
RUN pip install fastapi uvicorn httpx
ENV UPSTREAM_URL=http://worker:8000/generate
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8080"]
```
Behind the gateway, run your preferred server (e.g., vLLM) pinned to specific GPUs. Put a thin adapter that returns OpenAI-style usage counts (prompt_tokens, completion_tokens) for billing.
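One way to produce those usage counts, sketched under the assumption that the served model's Hugging Face tokenizer is available locally; the model name and the count_usage helper are placeholders, not part of any runtime's API.

```python
# usage_adapter.py -- illustrative sketch: attach OpenAI-style usage counts
# to a worker response so the gateway can bill per token.
from transformers import AutoTokenizer

# Assumption: the served model's tokenizer is available on disk or the Hub.
tokenizer = AutoTokenizer.from_pretrained("your-org/your-model")  # hypothetical name

def count_usage(prompt: str, completion: str) -> dict:
    prompt_tokens = len(tokenizer.encode(prompt))
    completion_tokens = len(tokenizer.encode(completion, add_special_tokens=False))
    return {
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        "total_tokens": prompt_tokens + completion_tokens,
    }
```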
SLA & SLOs: define, measure, and budget for errors
SLAs are the contract (what you promise); SLOs are the targets you operate against. Use SLOs to compute an error budget, then guardrail deployments, maintenance, and scaling to stay within budget.
Recommended SLOs (per calendar month)
- Availability: ≥ 99.9% (allowed downtime ≈ 43.8 min/month).
- Latency: p95 time-to-first-token (TTFT) ≤ 1.5 s; p95 time-to-last-token (TTLT) ≤ 8 s @ 256 tokens.
- Error rate: 5xx ≤ 0.2%; 429 ≤ 1% (excluding customer bursts beyond quota).
Error budget
ErrorBudget = 1 − SLO
For SLO 99.9%, budget = 0.1% of minutes in the month ≈ 43.8 min.
Spend the budget on: deploys, provider swaps, maintenance.
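The arithmetic is trivial, but encoding it once keeps alerts and dashboards agreeing on the same number; a minimal sketch (the average-month length is an assumption):

```python
# Error-budget arithmetic for a monthly availability SLO (sketch).
def error_budget_minutes(slo: float, days_in_month: float = 30.42) -> float:
    minutes_in_month = days_in_month * 24 * 60
    return (1 - slo) * minutes_in_month

print(f"{error_budget_minutes(0.999):.1f} min")  # ~43.8 min for 99.9%
```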
Capacity planning (tokens/sec → GPUs)
Given:
- Target QPS (requests/sec)
- Avg prompt tokens (Tp), avg completion tokens (Tc)
- Runtime throughput (Rt) in tokens/sec/GPU (measured!)
- Batch efficiency factor (E), 0.6–0.9

Required GPUs ≈ (QPS × (Tp + Tc)) / (Rt × E)

Add headroom: × 1.3–1.5 for spikes and tail-latency control.
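As a worked example of the formula, here is a small sizing helper. All inputs are assumptions you should replace with your own measurements; the 1.4× headroom default is just the midpoint of the range above.

```python
# GPU sizing from measured throughput (sketch; replace inputs with your numbers).
import math

def required_gpus(qps: float, prompt_tokens: float, completion_tokens: float,
                  tokens_per_sec_per_gpu: float, batch_efficiency: float,
                  headroom: float = 1.4) -> int:
    tokens_per_sec_needed = qps * (prompt_tokens + completion_tokens)
    gpus = tokens_per_sec_needed / (tokens_per_sec_per_gpu * batch_efficiency)
    return math.ceil(gpus * headroom)

# Example: 3 req/s, 600-token prompts, 250-token completions,
# 1,600 tok/s/GPU measured, 0.75 batch efficiency, 1.4x headroom.
print(required_gpus(3, 600, 250, 1600, 0.75))  # -> 3 GPUs
```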
Decentralized GPU marketplaces: placement & hedging strategy
Decentralized networks let you rent GPUs from independent operators. You gain price flexibility and geographic spread, but face heterogeneous hardware and variable reliability. Your strategy:
- Abstract providers behind your gateway. Your customers never see individual nodes, only your endpoint. Model workers register with the gateway; health checks decide routing.
- Bid across at least two networks + one centralized fallback. Multi-home your fleet so you can drain traffic from a degraded provider without breaching SLOs.
- Pin runtime requirements in your container. Bundle model weights, tokenizer, and engine; don’t rely on host drivers beyond CUDA runtime compatibility. Assert GPU capability labels (SM version / VRAM).
- Preemption & churn plan. Treat nodes as ephemeral. Use checkpoint-warm restore and a “graceful drain” policy on SIGTERM.
- Regional routing. Anycast or geo DNS to land users within ~50–100 ms RTT of workers; it’s the cheapest latency win.
Provider selection rubric (score 1–5)
- Uptime track record (self-reported + your probes).
- GPU class & VRAM match to your model tier.
- Network egress throughput & RTT to your audience.
- Support/communication responsiveness.
- Price stability and contract terms (hourly vs reserved).
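To turn the rubric into a number your router can act on, a weighted score works; the weights below are illustrative, not a recommendation, so tune them to what actually hurts your SLOs.

```python
# Weighted provider score from the rubric above (sketch; weights are illustrative).
WEIGHTS = {
    "uptime": 0.30,
    "gpu_match": 0.25,
    "network": 0.20,
    "support": 0.10,
    "price_stability": 0.15,
}

def provider_score(ratings: dict[str, int]) -> float:
    # ratings: criterion -> 1..5 from the rubric
    return sum(WEIGHTS[k] * ratings[k] for k in WEIGHTS)

print(round(provider_score({"uptime": 4, "gpu_match": 5, "network": 3,
                            "support": 4, "price_stability": 3}), 2))  # -> 3.9
```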
Latency tuning: first token fast, tail stable
- Warm pools: Keep N workers hot per region with model loaded and KV cache pre-allocated.
- Request shaping: Cap max_tokens by tier; favor short outputs for free/low-tier tenants.
- Batching windows: 10–25 ms micro-batches often raise throughput without hurting TTFT.
- KV paging: Use paged KV cache to prevent OOM when a few long prompts arrive.
- Pinning: “Sticky” sessions for multi-turn chat keep KV local; expire after idle to reclaim memory.
- Circuit breakers: If p95 TTFT breaches threshold for 2–5 minutes, shed load by pausing long prompts or switching to a smaller model fallback.
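A minimal sketch of that circuit breaker, assuming the gateway records TTFT per request; the class name, threshold, and window are illustrative.

```python
# Latency circuit breaker (sketch): if p95 TTFT stays above the threshold
# over a sustained window, trip and shed load until it recovers.
import time
from collections import deque

class TTFTBreaker:
    def __init__(self, threshold_s: float = 1.5, window_s: int = 180):
        self.threshold_s = threshold_s
        self.window_s = window_s
        self.samples: deque[tuple[float, float]] = deque()  # (timestamp, ttft)
        self.tripped = False

    def record(self, ttft_s: float) -> None:
        now = time.time()
        self.samples.append((now, ttft_s))
        # Drop samples older than the rolling window
        while self.samples and self.samples[0][0] < now - self.window_s:
            self.samples.popleft()
        values = sorted(s[1] for s in self.samples)
        if values:
            p95 = values[int(0.95 * (len(values) - 1))]
            self.tripped = p95 > self.threshold_s

    def allow_long_prompt(self) -> bool:
        # While tripped: pause long prompts or fall back to a smaller model.
        return not self.tripped
```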
Observability & synthetic probes: trust the graphs
Without graphs, you’re guessing. Track these SLIs per model, region, and provider:
- TTFT (p50/p90/p95) and TTLT (p50/p95) at fixed token budgets (e.g., 256).
- Request rate, queue time, batch size, GPU utilization, VRAM headroom.
- Error rates by class (429/5xx), timeout counts, retry counts.
- Per-tenant usage (prompt/complete tokens) for billing.
Prometheus metrics sketch
```
# Gateway process
requests_total{route="/v1/chat/completions",status="200"} 1234
latency_ttf_seconds_bucket{le="0.25"} 100
...
gpu_utilization{provider="nodeA",region="eu"} 0.76
kvcache_bytes_in_use{model="X"} 5.4e9
errors_total{class="5xx"} 2
```
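On the gateway side, these can be emitted with the standard prometheus_client library; a sketch with illustrative metric names mirroring the ones above.

```python
# Gateway-side instrumentation sketch using prometheus_client.
from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("requests_total", "API requests", ["route", "status"])
TTFT = Histogram("latency_ttf_seconds", "Time to first token",
                 buckets=(0.25, 0.5, 1.0, 1.5, 2.5, 5.0))
ERRORS = Counter("errors_total", "Errors by class", ["cls"])

def record_request(route: str, status: int, ttft_s: float) -> None:
    REQUESTS.labels(route=route, status=str(status)).inc()
    TTFT.observe(ttft_s)
    if status >= 500:
        ERRORS.labels(cls="5xx").inc()

# Expose /metrics on a side port for Prometheus to scrape.
start_http_server(9100)
```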
Synthetic probes
Every 30 seconds per region, issue a known prompt and assert TTFT/TTLT limits. If probes fail, remove the provider from rotation and page yourself. Synthetic SLOs protect you even when customers are asleep.
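A probe can be as small as the sketch below, which streams a fixed prompt and treats time-to-first-chunk as TTFT; the endpoint, key, and threshold are placeholders.

```python
# Synthetic probe sketch: send a fixed prompt, measure TTFT, and eject the
# provider/region from rotation if it breaches the SLO.
import time
import httpx

PROBE_PAYLOAD = {"model": "default", "stream": True, "max_tokens": 32,
                 "messages": [{"role": "user", "content": "ping"}]}

def probe(endpoint: str, api_key: str, ttft_slo_s: float = 1.5) -> bool:
    headers = {"Authorization": f"Bearer {api_key}"}
    start = time.monotonic()
    with httpx.Client(timeout=10) as client:
        with client.stream("POST", f"{endpoint}/v1/chat/completions",
                           json=PROBE_PAYLOAD, headers=headers) as r:
            if r.status_code != 200:
                return False
            for _ in r.iter_text():
                # First chunk received: that's our TTFT measurement
                return (time.monotonic() - start) <= ttft_slo_s
    return False  # no data at all -> fail the probe
```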
Pricing & unit economics: margins per 1M tokens
Bill like your customers expect: per token, per request, or per minute. Per-token is the norm. Your COGS buckets:
- GPU rental (hourly or reserved), amortized to tokens served.
- Bandwidth egress (small but non-zero for streaming).
- Gateway infra (VMs/load balancers), monitoring, and storage.
- Payment fees and support time.
COGS per 1M tokens (quick model)
Inputs:
- GPU_price_per_hour = $X
- Tokens_per_sec_per_GPU = R (measured)
- Utilization = U (0–1)
- Seconds_per_hour = 3600

Tokens_per_hour = R × U × 3600
COGS_per_1M = (GPU_price_per_hour / Tokens_per_hour) × 1,000,000 + Overheads_per_1M

Example: $2.80/hr GPU, R = 1,600 tok/s, U = 0.6
Tokens/hr ≈ 1,600 × 0.6 × 3,600 ≈ 3,456,000
GPU cost/1M ≈ 2.80 / 3.456 ≈ $0.81
Add $0.10–$0.30 overhead → target COGS ≈ $0.91–$1.11 per 1M
Price with margin (e.g., $1.90–$3.50 per 1M) depending on model quality.
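The same model as a few lines of code, so you can re-price whenever GPU rates or measured throughput change; the overhead default is an assumption, not a benchmark.

```python
# COGS per 1M tokens (sketch of the model above).
def cogs_per_1m(gpu_price_per_hour: float, tokens_per_sec: float,
                utilization: float, overhead_per_1m: float = 0.20) -> float:
    tokens_per_hour = tokens_per_sec * utilization * 3600
    gpu_cost_per_1m = gpu_price_per_hour / tokens_per_hour * 1_000_000
    return gpu_cost_per_1m + overhead_per_1m

print(f"${cogs_per_1m(2.80, 1600, 0.6):.2f}")  # -> ~$1.01 per 1M tokens
```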
Tiering: Offer 3 tiers—Hobby (small model, shared), Pro (mid model, tighter SLOs), Business (bigger model, dedicated capacity). Reserve GPUs for Business to avoid noisy neighbors.
Security, privacy & data handling
- TLS everywhere: Terminate at edge. Encrypt node-gateway hops (mTLS or WireGuard).
- Key scopes: Per-tenant API keys with rate/credit limits; rotate on leak suspicions.
- Data retention: Default to no logging of prompts/completions, or redact by pattern. Offer opt-in logs for debugging with a 7–30 day TTL.
- PII handling: If customers send PII, provide a DPA template. Forbid credentials or health data unless you’re ready for compliance scope.
- Model safety: Add input/output filters for obviously harmful content and rate-limit abuse patterns.
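The per-key quotas and burst limits above can be enforced at the gateway with a classic token bucket; a minimal in-memory sketch follows (production setups usually back this with Redis or similar).

```python
# Per-key token-bucket rate limiter (sketch): enforce burst + sustained
# limits at the gateway before requests reach GPU workers.
import time

class TokenBucket:
    def __init__(self, rate_per_s: float, burst: int):
        self.rate = rate_per_s
        self.capacity = burst
        self.tokens = float(burst)
        self.updated = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at the burst size
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # caller should return HTTP 429

buckets: dict[str, TokenBucket] = {}

def check_rate(api_key: str) -> bool:
    bucket = buckets.setdefault(api_key, TokenBucket(rate_per_s=5, burst=20))
    return bucket.allow()
```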
Copy-paste SLA addendum (practical & fair)
Service: Managed LLM Inference API (OpenAI-compatible).

1) Definitions
• Monthly Uptime: % of 1-min intervals with successful responses (< 5 s TTFT and no 5xx).
• Latency SLO: p95 TTFT ≤ 1.5 s; p95 TTLT ≤ 8 s @ 256 tokens in-region.
• Error Budget: 0.1% of monthly minutes (≈ 43.8 min).

2) SLA
• Availability: 99.9%. Credit: 5% of monthly fee per 0.1% below SLA, capped at 25%.
• Latency: If p95 TTFT exceeds SLO for ≥ 3 consecutive hours (outside agreed burst caps), credit 5% of monthly fee.

3) Exclusions
• Customer network issues, misuse (exceeding rate/queue limits), force majeure, scheduled maintenance (≤ 60 min/mo with 48 h notice).

4) Reporting
• We publish per-region status & historical SLI charts. Customer may request logs for specific incidents.

5) Termination
• Either party may terminate on 30 days' notice; credits are the sole remedy for SLA breaches.
Incident runbook: five steps to protect the SLO
- Detect: Synthetic probes fail or p95 TTFT > SLO for 5 min. Alert on-call.
- Stabilize: Rate-limit long prompts; reduce max_tokens for shared tiers; drain bad providers; route to backup region/model.
- Diagnose: Check GPU utilization, VRAM headroom, queue times, and the class of the error spike. Inspect the provider health channel.
- Fix or failover: Restart degraded workers with warm images; re-pull weights only on checksum mismatch. If the marketplace itself is down, spin up the reserved fallback.
- Report: Post-mortem within 48h: impact window, root cause, actions, prevention. Deduct from error budget.
Launch checklist (one afternoon, two cups of coffee)
Tech
- Model & runtime container builds successfully with deterministic hash.
- Gateway exposes /v1/chat/completions & streaming; API keys enforced.
- Two providers live in rotation; a third as dark failover.
- Probes & dashboards show TTFT/TTLT and errors per region.
- Logs scrub PII; data retention set to default 0/7/30 days.
Business
- SLA addendum attached to order form; status page URL shared.
- Billing set: per-token price and quotas per tier; Stripe/invoicing ready.
- Runbook printed; on-call rotation (even if it’s “you”) documented.
- One customer-facing quickstart doc + curl examples.
Examples: SKUs & product pages that sell
| Tier | Model class | SLO | Burst policy | Price (per 1M) |
|---|---|---|---|---|
| Hobby | 7–8B quant, shared | 99.5% / p95 TTFT ≤ 2.5 s | Queue at peak | $1.99 |
| Pro | 14–32B mixed, pinned | 99.9% / p95 TTFT ≤ 1.5 s | Burst to reserved pool | $3.20 |
| Business | 32B+ dedicated | 99.95% / custom | Pre-allocated capacity | Custom (volume) |
Frequently Asked Questions
Can I actually hit 99.9% availability on decentralized GPUs?
Yes, if you abstract providers behind a health-checked gateway, multi-home across at least two networks, and keep a small reserved fallback. Your probes should eject sick nodes automatically and drain regions underperforming your SLOs.
How do I keep first token fast while batching?
Use micro-batches with a tiny queue window (10–25 ms) and pre-load the model. Cap maximum prompt tokens for low-tier tenants, and use paged KV to keep memory stable under load.
What about long contexts (100k tokens)?
Sell long-context as a distinct SKU with a higher price and looser latency SLO. For most customers, a RAG plan with 8–16k contexts is cheaper and faster than pushing giant contexts through one call.
Will quantization ruin quality?
Not for common utility tasks. Keep a shadow unquantized (or higher precision) canary and A/B a small slice of traffic. If outputs degrade for a tenant’s domain, route them to a higher-quality pool automatically.
How do I prevent abuse?
Per-key quotas, burst limits, and pattern-based blocklists (e.g., credential scraping prompts). Consider requiring a credit card for higher quotas and enforce content policies server-side.