Inference as a Side Business: Hosting LLM Endpoints on Decentralized Networks with SLAs
If you can serve fast, reliable LLM responses at a fair price, there's steady demand, from agencies building chat tools to startups needing overflow capacity. The twist: instead of buying expensive GPUs, you can rent compute on decentralized networks and still promise real SLAs. This guide shows the full playbook: packaging models, choosing runtimes, meeting latency targets, monitoring, pricing, and writing SLAs that won't trap you.
Note: This article is educational, not legal advice or a performance guarantee. Validate your stack, measure your latency, and tailor contracts to your jurisdiction.
Why this works now
- Runtimes matured: Modern servers (e.g., tensor-parallel engines, paged KV-cache) push high tokens/sec with stable latency.
- Decentralized marketplaces: You can lease GPUs on-demand across multiple providers and regions, hedging outages with multi-home routing.
- API expectations standardized: Most buyers expect OpenAI-compatible endpoints. If yours drops in as a base URL change, you remove friction.
- Composable contracts: Clear SLOs (p95 latency, 4xx/5xx error rates) + error budgets make “side business” reliability actually achievable.
Models, runtimes & quantization: pick for use-case, not hype
The right model depends on context length, response style, and budget. Your business lives or dies on predictable latency and cost per 1M tokens, not leaderboard bragging rights.
Typical side-biz use-cases
- Customer support copilots (short prompts, high concurrency).
- Marketing/SEO paraphrasers (batch jobs + streaming previews).
- Thin wrappers over RAG (kb search + summarize).
- Low-latency chat widgets (p95 < 1.2–1.8 s first token).
Model selection heuristics
- Sub-8k contexts, utility tasks: smaller instruction-tuned 7–14B models, quantized (Q4–Q6) if quality holds.
- Reasoning / tool use: 14–32B class with KV-cache paging to protect tail latency.
- Long-doc chat: prioritize models with efficient attention or windowing; consider server-side chunking + RAG.
Runtimes & memory math
- GPU VRAM planning: VRAM ≈ Params_bytes + KV_cache_bytes(concurrency, seq_len). KV dominates under high concurrency; paged KV or a multi-tenant cache is essential.
- Quantization: INT4/FP8 can halve VRAM and roughly double throughput at a small quality cost. Keep an A/B canary to verify quality before rolling to all tenants.
- Batching vs. streaming: Micro-batch 4–16 requests to raise tokens/sec/core while keeping time-to-first-token under SLO.
Back-of-envelope KV cache
KV_bytes ≈ layers × heads × 2 (K,V) × head_dim × dtype_bytes × (prompt_tokens + generated_tokens) × concurrent_seqs
Example (toy): 32 layers × 32 heads × 2 × 128 × 2 bytes × 2,048 tokens × 8 reqs ≈ 8.6 GB
(Real models vary; grouped-query attention uses far fewer KV heads, so always measure.)
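To sanity-check sizing before renting hardware, a tiny helper keeps the arithmetic honest. This is a sketch: the function name and toy parameters are illustrative, and real engines with paged caches or grouped-query attention will differ.

```python
# Rough KV-cache sizing helper (illustrative only; real engines pack and
# page memory differently, and GQA models use fewer KV heads).
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   dtype_bytes: int, tokens: int, concurrent_seqs: int) -> int:
    # 2x accounts for the K and V tensors per layer
    return layers * kv_heads * 2 * head_dim * dtype_bytes * tokens * concurrent_seqs

if __name__ == "__main__":
    b = kv_cache_bytes(layers=32, kv_heads=32, head_dim=128,
                       dtype_bytes=2, tokens=2048, concurrent_seqs=8)
    print(f"{b / 1e9:.1f} GB")  # ~8.6 GB for this toy config
```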
OpenAI-compatible API & gateway: drop-in for customers
Your gateway should expose /v1/chat/completions and /v1/completions, accept stream=true, and translate to your runtime (e.g., vLLM, TGI, llama.cpp server). Put all auth, rate limits, metering, and fallbacks here. Keep model workers stateless behind it.
Minimal FastAPI gateway (OpenAI-style)
```python
# main.py
import os
import time

import httpx
from fastapi import FastAPI, Header, HTTPException
from fastapi.responses import JSONResponse, StreamingResponse

UPSTREAM = os.getenv("UPSTREAM_URL", "http://worker:8000/generate")
API_KEYS = set(os.getenv("API_KEYS", "dev_123").split(","))

app = FastAPI()


def check_auth(key: str):
    if key not in API_KEYS:
        raise HTTPException(status_code=401, detail="invalid api key")


@app.post("/v1/chat/completions")
async def chat(payload: dict, authorization: str = Header(None)):
    # Auth: expect "Authorization: Bearer <key>"
    if not authorization or not authorization.startswith("Bearer "):
        raise HTTPException(401, "missing bearer token")
    check_auth(authorization.split(" ", 1)[1])

    # Translate the OpenAI-style payload to the worker format
    messages = payload.get("messages", [])
    stream = payload.get("stream", False)
    model = payload.get("model", "default")
    params = {k: payload[k] for k in ("temperature", "top_p", "max_tokens", "stop") if k in payload}
    body = {"messages": messages, "model": model, **params}

    if stream:
        async def gen():
            # Keep the HTTP client open for the lifetime of the stream
            async with httpx.AsyncClient(timeout=None) as client:
                async with client.stream("POST", UPSTREAM, json=body) as r:
                    async for chunk in r.aiter_text():
                        # wrap chunks as SSE for OpenAI-compatible streaming
                        yield f"data: {chunk}\n\n"
            yield "data: [DONE]\n\n"
        return StreamingResponse(gen(), media_type="text/event-stream")

    async with httpx.AsyncClient(timeout=30) as client:
        r = await client.post(UPSTREAM, json=body)
    data = r.json()
    return JSONResponse({
        "id": f"chatcmpl_{int(time.time() * 1000)}",
        "object": "chat.completion",
        "model": model,
        "created": int(time.time()),
        "choices": [{
            "index": 0,
            "finish_reason": data.get("finish_reason", "stop"),
            "message": {"role": "assistant", "content": data["text"]},
        }],
        "usage": data.get("usage", {}),
    })
```
Dockerfile (gateway)
```dockerfile
FROM python:3.11-slim
WORKDIR /app
COPY main.py /app/
RUN pip install fastapi uvicorn httpx
ENV UPSTREAM_URL=http://worker:8000/generate
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8080"]
```
Behind the gateway, run your preferred server (e.g., vLLM) pinned to specific GPUs. Put a thin adapter that returns OpenAI-style usage counts (prompt_tokens, completion_tokens) for billing.
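One way to produce those usage counts, sketched under the assumption that the served model's Hugging Face tokenizer is available locally; the model name and the count_usage helper are placeholders, not part of any runtime's API.

```python
# usage_adapter.py -- illustrative sketch: attach OpenAI-style usage counts
# to a worker response so the gateway can bill per token.
from transformers import AutoTokenizer

# Assumption: the served model's tokenizer is available on disk or the Hub.
tokenizer = AutoTokenizer.from_pretrained("your-org/your-model")  # hypothetical name

def count_usage(prompt: str, completion: str) -> dict:
    prompt_tokens = len(tokenizer.encode(prompt))
    completion_tokens = len(tokenizer.encode(completion, add_special_tokens=False))
    return {
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        "total_tokens": prompt_tokens + completion_tokens,
    }
```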
SLA & SLOs: define, measure, and budget for errors
SLAs are the contract (what you promise); SLOs are the targets you operate against. Use SLOs to compute an error budget, then guardrail deployments, maintenance, and scaling to stay within budget.
Recommended SLOs (per calendar month)
- Availability: ≥ 99.9% (allowed downtime ≈ 43.8 min/month).
- Latency: p95 time-to-first-token (TTFT) ≤ 1.5 s; p95 time-to-last-token (TTLT) ≤ 8 s @ 256 tokens.
- Error rate: 5xx ≤ 0.2%; 429 ≤ 1% (excluding customer bursts beyond quota).
Error budget
ErrorBudget = 1 − SLO
For SLO 99.9%, budget = 0.1% of minutes in the month ≈ 43.8 min.
Spend the budget on: deploys, provider swaps, maintenance.
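The arithmetic is trivial, but encoding it once keeps alerts and dashboards agreeing on the same number; a minimal sketch (the average-month length is an assumption):

```python
# Error-budget arithmetic for a monthly availability SLO (sketch).
def error_budget_minutes(slo: float, days_in_month: float = 30.42) -> float:
    minutes_in_month = days_in_month * 24 * 60
    return (1 - slo) * minutes_in_month

print(f"{error_budget_minutes(0.999):.1f} min")  # ~43.8 min for 99.9%
```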
Capacity planning (tokens/sec → GPUs)
Given:
- Target QPS (requests/sec)
- Avg prompt tokens (Tp), avg completion tokens (Tc)
- Runtime throughput (Rt) in tokens/sec/GPU (measured!)
- Batch efficiency factor (E), 0.6–0.9

Required GPUs ≈ (QPS × (Tp + Tc)) / (Rt × E)

Add headroom: × 1.3–1.5 for spikes and tail-latency control.
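As a worked example of the formula, here is a small sizing helper. All inputs are assumptions you should replace with your own measurements; the 1.4× headroom default is just the midpoint of the range above.

```python
# GPU sizing from measured throughput (sketch; replace inputs with your numbers).
import math

def required_gpus(qps: float, prompt_tokens: float, completion_tokens: float,
                  tokens_per_sec_per_gpu: float, batch_efficiency: float,
                  headroom: float = 1.4) -> int:
    tokens_per_sec_needed = qps * (prompt_tokens + completion_tokens)
    gpus = tokens_per_sec_needed / (tokens_per_sec_per_gpu * batch_efficiency)
    return math.ceil(gpus * headroom)

# Example: 3 req/s, 600-token prompts, 250-token completions,
# 1,600 tok/s/GPU measured, 0.75 batch efficiency, 1.4x headroom.
print(required_gpus(3, 600, 250, 1600, 0.75))  # -> 3 GPUs
```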
Decentralized GPU marketplaces: placement & hedging strategy
Decentralized networks let you rent GPUs from independent operators. You gain price flexibility and geographic spread, but face heterogeneous hardware and variable reliability. Your strategy:
- Abstract providers behind your gateway. Your customers never see individual nodes, only your endpoint. Model workers register with the gateway; health checks decide routing.
- Bid across at least two networks + one centralized fallback. Multi-home your fleet so you can drain traffic from a degraded provider without breaching SLOs.
- Pin runtime requirements in your container. Bundle model weights, tokenizer, and engine; don’t rely on host drivers beyond CUDA runtime compatibility. Assert GPU capability labels (SM version / VRAM).
- Preemption & churn plan. Treat nodes as ephemeral. Use checkpoint-warm restore and a “graceful drain” policy on SIGTERM.
- Regional routing. Anycast or geo DNS to land users within ~50–100 ms RTT of workers; it’s the cheapest latency win.
Provider selection rubric (score 1–5)
- Uptime track record (self-reported + your probes).
- GPU class & VRAM match to your model tier.
- Network egress throughput & RTT to your audience.
- Support/communication responsiveness.
- Price stability and contract terms (hourly vs reserved).
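To turn the rubric into a number your router can act on, a weighted score works; the weights below are illustrative, not a recommendation, so tune them to what actually hurts your SLOs.

```python
# Weighted provider score from the rubric above (sketch; weights are illustrative).
WEIGHTS = {
    "uptime": 0.30,
    "gpu_match": 0.25,
    "network": 0.20,
    "support": 0.10,
    "price_stability": 0.15,
}

def provider_score(ratings: dict[str, int]) -> float:
    # ratings: criterion -> 1..5 from the rubric
    return sum(WEIGHTS[k] * ratings[k] for k in WEIGHTS)

print(round(provider_score({"uptime": 4, "gpu_match": 5, "network": 3,
                            "support": 4, "price_stability": 3}), 2))  # -> 3.9
```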
Latency tuning: first token fast, tail stable
- Warm pools: Keep N workers hot per region with model loaded and KV cache pre-allocated.
- Request shaping: Cap max_tokens by tier; favor short outputs for free/low-tier tenants.
- Batching windows: 10–25 ms micro-batches often raise throughput without hurting TTFT.
- KV paging: Use paged KV cache to prevent OOM when a few long prompts arrive.
- Pinning: “Sticky” sessions for multi-turn chat keep KV local; expire after idle to reclaim memory.
- Circuit breakers: If p95 TTFT breaches threshold for 2–5 minutes, shed load by pausing long prompts or switching to a smaller model fallback.
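A minimal sketch of that circuit breaker, assuming the gateway records TTFT per request; the class name, threshold, and window are illustrative.

```python
# Latency circuit breaker (sketch): if p95 TTFT stays above the threshold
# over a sustained window, trip and shed load until it recovers.
import time
from collections import deque

class TTFTBreaker:
    def __init__(self, threshold_s: float = 1.5, window_s: int = 180):
        self.threshold_s = threshold_s
        self.window_s = window_s
        self.samples: deque[tuple[float, float]] = deque()  # (timestamp, ttft)
        self.tripped = False

    def record(self, ttft_s: float) -> None:
        now = time.time()
        self.samples.append((now, ttft_s))
        # Drop samples older than the rolling window
        while self.samples and self.samples[0][0] < now - self.window_s:
            self.samples.popleft()
        values = sorted(s[1] for s in self.samples)
        if values:
            p95 = values[int(0.95 * (len(values) - 1))]
            self.tripped = p95 > self.threshold_s

    def allow_long_prompt(self) -> bool:
        # While tripped: pause long prompts or fall back to a smaller model.
        return not self.tripped
```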
Observability & synthetic probes: trust the graphs
Without graphs, you’re guessing. Track these SLIs per model, region, and provider:
- TTFT (p50/p90/p95) and TTLT (p50/p95) at fixed token budgets (e.g., 256).
- Request rate, queue time, batch size, GPU utilization, VRAM headroom.
- Error rates by class (429/5xx), timeout counts, retry counts.
- Per-tenant usage (prompt/complete tokens) for billing.
Prometheus metrics sketch
```
# Gateway process
requests_total{route="/v1/chat/completions",status="200"} 1234
latency_ttf_seconds_bucket{le="0.25"} 100
...
gpu_utilization{provider="nodeA",region="eu"} 0.76
kvcache_bytes_in_use{model="X"} 5.4e9
errors_total{class="5xx"} 2
```
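On the gateway side, these can be emitted with the standard prometheus_client library; a sketch with illustrative metric names mirroring the ones above.

```python
# Gateway-side instrumentation sketch using prometheus_client.
from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("requests_total", "API requests", ["route", "status"])
TTFT = Histogram("latency_ttf_seconds", "Time to first token",
                 buckets=(0.25, 0.5, 1.0, 1.5, 2.5, 5.0))
ERRORS = Counter("errors_total", "Errors by class", ["cls"])

def record_request(route: str, status: int, ttft_s: float) -> None:
    REQUESTS.labels(route=route, status=str(status)).inc()
    TTFT.observe(ttft_s)
    if status >= 500:
        ERRORS.labels(cls="5xx").inc()

# Expose /metrics on a side port for Prometheus to scrape.
start_http_server(9100)
```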
Synthetic probes
Every 30 seconds per region, issue a known prompt and assert TTFT/TTLT limits. If probes fail, remove the provider from rotation and page yourself. Synthetic SLOs protect you even when customers are asleep.
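A probe can be as small as the sketch below, which streams a fixed prompt and treats time-to-first-chunk as TTFT; the endpoint, key, and threshold are placeholders.

```python
# Synthetic probe sketch: send a fixed prompt, measure TTFT, and eject the
# provider/region from rotation if it breaches the SLO.
import time
import httpx

PROBE_PAYLOAD = {"model": "default", "stream": True, "max_tokens": 32,
                 "messages": [{"role": "user", "content": "ping"}]}

def probe(endpoint: str, api_key: str, ttft_slo_s: float = 1.5) -> bool:
    headers = {"Authorization": f"Bearer {api_key}"}
    start = time.monotonic()
    with httpx.Client(timeout=10) as client:
        with client.stream("POST", f"{endpoint}/v1/chat/completions",
                           json=PROBE_PAYLOAD, headers=headers) as r:
            if r.status_code != 200:
                return False
            for _ in r.iter_text():
                # First chunk received: that's our TTFT measurement
                return (time.monotonic() - start) <= ttft_slo_s
    return False  # no data at all -> fail the probe
```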
Pricing & unit economics: margins per 1M tokens
Bill like your customers expect: per token, per request, or per minute. Per-token is the norm. Your COGS buckets:
- GPU rental (hourly or reserved), amortized to tokens served.
- Bandwidth egress (small but non-zero for streaming).
- Gateway infra (VMs/load balancers), monitoring, and storage.
- Payment fees and support time.
COGS per 1M tokens (quick model)
Inputs:
- GPU_price_per_hour = $X
- Tokens_per_sec_per_GPU = R (measured)
- Utilization = U (0–1)
- Seconds_per_hour = 3600

Tokens_per_hour = R × U × 3600
COGS_per_1M = (GPU_price_per_hour / Tokens_per_hour) × 1,000,000 + Overheads_per_1M

Example: $2.80/hr GPU, R = 1,600 tok/s, U = 0.6
Tokens/hr ≈ 1,600 × 0.6 × 3,600 ≈ 3,456,000
GPU cost/1M ≈ 2.80 / 3.456 ≈ $0.81
Add $0.10–$0.30 overhead → target COGS ≈ $0.91–$1.11 per 1M
Price with margin (e.g., $1.90–$3.50 per 1M) depending on model quality.
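The same model as a few lines of code, so you can re-price whenever GPU rates or measured throughput change; the overhead default is an assumption, not a benchmark.

```python
# COGS per 1M tokens (sketch of the model above).
def cogs_per_1m(gpu_price_per_hour: float, tokens_per_sec: float,
                utilization: float, overhead_per_1m: float = 0.20) -> float:
    tokens_per_hour = tokens_per_sec * utilization * 3600
    gpu_cost_per_1m = gpu_price_per_hour / tokens_per_hour * 1_000_000
    return gpu_cost_per_1m + overhead_per_1m

print(f"${cogs_per_1m(2.80, 1600, 0.6):.2f}")  # -> ~$1.01 per 1M tokens
```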
Tiering: Offer 3 tiers—Hobby (small model, shared), Pro (mid model, tighter SLOs), Business (bigger model, dedicated capacity). Reserve GPUs for Business to avoid noisy neighbors.
Security, privacy & data handling
- TLS everywhere: Terminate at edge. Encrypt node-gateway hops (mTLS or WireGuard).
- Key scopes: Per-tenant API keys with rate/credit limits; rotate on leak suspicions.
- Data retention: Default to no logging of prompts/completions, or redact by pattern. Offer opt-in logs for debugging with a 7–30 day TTL.
- PII handling: If customers send PII, provide a DPA template. Forbid credentials or health data unless you’re ready for compliance scope.
- Model safety: Add input/output filters for obviously harmful content and rate-limit abuse patterns.
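The per-key quotas and burst limits above can be enforced at the gateway with a classic token bucket; a minimal in-memory sketch follows (production setups usually back this with Redis or similar).

```python
# Per-key token-bucket rate limiter (sketch): enforce burst + sustained
# limits at the gateway before requests reach GPU workers.
import time

class TokenBucket:
    def __init__(self, rate_per_s: float, burst: int):
        self.rate = rate_per_s
        self.capacity = burst
        self.tokens = float(burst)
        self.updated = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at the burst size
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # caller should return HTTP 429

buckets: dict[str, TokenBucket] = {}

def check_rate(api_key: str) -> bool:
    bucket = buckets.setdefault(api_key, TokenBucket(rate_per_s=5, burst=20))
    return bucket.allow()
```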
Copy-paste SLA addendum (practical & fair)
Service: Managed LLM Inference API (OpenAI-compatible).

1) Definitions
• Monthly Uptime: % of 1-min intervals with successful responses (< 5 s TTFT and no 5xx).
• Latency SLO: p95 TTFT ≤ 1.5 s; p95 TTLT ≤ 8 s @ 256 tokens in-region.
• Error Budget: 0.1% of monthly minutes (≈ 43.8 min).

2) SLA
• Availability: 99.9%. Credit: 5% of monthly fee per 0.1% below SLA, capped at 25%.
• Latency: If p95 TTFT exceeds SLO for ≥ 3 consecutive hours (outside agreed burst caps), credit 5% of monthly fee.

3) Exclusions
• Customer network issues, misuse (exceeding rate/queue limits), force majeure, scheduled maintenance (≤ 60 min/mo with 48 h notice).

4) Reporting
• We publish per-region status & historical SLI charts. Customer may request logs for specific incidents.

5) Termination
• Either party may terminate on 30 days' notice; credits are the sole remedy for SLA breaches.
Incident runbook: five steps to protect the SLO
- Detect: Synthetic probes fail or p95 TTFT > SLO for 5 min. Alert on-call.
- Stabilize: Rate-limit long prompts; reduce max_tokens for shared tiers; drain bad providers; route to backup region/model.
- Diagnose: Check GPU utilization, VRAM headroom, queue times, and the class of the error spike. Inspect the provider health channel.
- Fix or failover: Restart degraded workers with warm images; re-pull weights only on checksum mismatch. If the marketplace itself is down, spin up the reserved fallback.
- Report: Post-mortem within 48h: impact window, root cause, actions, prevention. Deduct from error budget.
Launch checklist (one afternoon, two cups of coffee)
Tech
- Model & runtime container builds successfully with deterministic hash.
- Gateway exposes /v1/chat/completions & streaming; API keys enforced.
- Two providers live in rotation; a third as dark failover.
- Probes & dashboards show TTFT/TTLT and errors per region.
- Logs scrub PII; data retention set to default 0/7/30 days.
Business
- SLA addendum attached to order form; status page URL shared.
- Billing set: per-token price and quotas per tier; Stripe/invoicing ready.
- Runbook printed; on-call rotation (even if it’s “you”) documented.
- One customer-facing quickstart doc + curl examples.
Examples: SKUs & product pages that sell
| Tier | Model class | SLO | Burst policy | Price (per 1M) |
|---|---|---|---|---|
| Hobby | 7–8B quant, shared | 99.5% / p95 TTFT ≤ 2.5 s | Queue at peak | $1.99 |
| Pro | 14–32B mixed, pinned | 99.9% / p95 TTFT ≤ 1.5 s | Burst to reserved pool | $3.20 |
| Business | 32B+ dedicated | 99.95% / custom | Pre-allocated capacity | Custom (volume) |
Frequently Asked Questions
Can I actually hit 99.9% availability on decentralized GPUs?
Yes, if you abstract providers behind a health-checked gateway, multi-home across at least two networks, and keep a small reserved fallback. Your probes should eject sick nodes automatically and drain regions underperforming your SLOs.
How do I keep first token fast while batching?
Use micro-batches with a tiny queue window (10–25 ms) and pre-load the model. Cap maximum prompt tokens for low-tier tenants, and use paged KV to keep memory stable under load.
What about long contexts (100k tokens)?
Sell long-context as a distinct SKU with a higher price and looser latency SLO. For most customers, a RAG plan with 8–16k contexts is cheaper and faster than pushing giant contexts through one call.
Will quantization ruin quality?
Not for common utility tasks. Keep a shadow unquantized (or higher precision) canary and A/B a small slice of traffic. If outputs degrade for a tenant’s domain, route them to a higher-quality pool automatically.
How do I prevent abuse?
Per-key quotas, burst limits, and pattern-based blocklists (e.g., credential scraping prompts). Consider requiring a credit card for higher quotas and enforce content policies server-side.