Self-Hosted AI: How to Run Large Language Models on Your Own Server
The race to put AI into every product has a quiet counter-movement: teams choosing to run large language models on their own metal. Not through an API. Not through a managed cloud endpoint. On servers they control, behind firewalls they configure, with data that never leaves their network.
If you're evaluating whether to self-host LLMs — or you've already decided and need a practical roadmap — this guide walks through every layer: hardware sizing, model selection, deployment with Ollama and vLLM, API integration, performance tuning, and honest cost analysis.
Why Self-Host an LLM?
Let's start with the motivation. API-based LLM services like OpenAI, Anthropic, and Google are excellent, but they come with trade-offs that matter more as your usage scales.
Data Privacy and Compliance
When you send a prompt to a third-party API, that data leaves your environment. For regulated industries — healthcare (HIPAA), finance (SOC 2, PCI DSS), legal — this creates immediate compliance friction. Even with enterprise agreements and zero-data-retention policies, auditors and security teams often prefer data that physically cannot leave the network.
Self-hosting eliminates the question entirely. The model runs on your server. Your data stays on your server. The compliance argument becomes simple: nothing is transmitted to a third party, so there is nothing outside your network to intercept or retain.
Predictable Costs
API pricing is usage-based. At low volume, it's cheap. At scale, it's a budget line item that grows with every token you send. A team generating 50 million tokens per day on a frontier model can easily spend $1,500–$3,000 per month, every month, with no cap.
A self-hosted model has a fixed cost: the hardware (amortized) and electricity. Once the GPU is paid for, inference is effectively free at the margin. For high-volume workloads, the break-even point often arrives within 6–12 months.
No Rate Limits or Vendor Lock-In
API providers enforce rate limits, count and price tokens differently from one another, and occasionally deprecate models with little warning (looking at you, text-davinci-003). When you self-host, you choose the model, the version, and the parameters. Nobody throttles you. Nobody retires your model. Nobody changes the API contract overnight.
Customization and Fine-Tuning
Running a model locally means you can fine-tune it on your data, swap quantization levels dynamically, and chain it with custom pre/post-processing pipelines — all without third-party constraints. You can serve a fine-tuned model that embodies domain knowledge your organization has accumulated over years.
Hardware Requirements: Choosing the Right GPU
This is where most teams get stuck. LLM inference is memory-bandwidth-bound, which means GPU VRAM is the single most important spec.
VRAM Sizing by Model
Here's a practical reference for VRAM requirements at common configurations:
| Model | Parameters | FP16 (Full Precision) | INT8 Quantized | INT4 Quantized |
|---|---|---|---|---|
| Llama 3.2 (3B) | 3B | ~6 GB | ~3.5 GB | ~2 GB |
| Mistral 7B | 7B | ~14 GB | ~8 GB | ~5 GB |
| Llama 3.1 (8B) | 8B | ~16 GB | ~9 GB | ~6 GB |
| Qwen 2.5 (14B) | 14B | ~28 GB | ~16 GB | ~10 GB |
| Mixtral 8x7B | 47B | ~90 GB | ~49 GB | ~26 GB |
| Llama 3.1 (70B) | 70B | ~140 GB | ~75 GB | ~42 GB |
The golden rule: your GPU VRAM must exceed the model size plus working memory for the KV cache (context window). A 1B-parameter model at FP16 needs ~2 GB for weights, but a 32K-token context window adds another 1–2 GB of KV cache depending on architecture.
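To make the KV cache part of that rule concrete, here is a rough sizing sketch. It assumes a standard transformer that stores one key and one value vector per layer, per KV head, per token. The layer counts, head counts, and head dimensions in the example calls are illustrative assumptions, not figures from this guide, so substitute the values from your model's config.

```python
def kv_cache_gib(num_layers: int, num_kv_heads: int, head_dim: int,
                 context_tokens: int, bytes_per_value: int = 2) -> float:
    """Approximate KV cache size in GiB for a single sequence.

    bytes_per_value=2 corresponds to an FP16 cache.
    """
    # Keys and values (factor 2) are stored for every layer and every KV head.
    bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return bytes_per_token * context_tokens / 1024**3


# Hypothetical small model: 16 layers, 8 KV heads, head dimension 64.
small = kv_cache_gib(num_layers=16, num_kv_heads=8, head_dim=64, context_tokens=32 * 1024)

# Hypothetical 8B-class model: 32 layers, 8 KV heads, head dimension 128.
medium = kv_cache_gib(num_layers=32, num_kv_heads=8, head_dim=128, context_tokens=32 * 1024)

print(f"small: {small:.1f} GiB, medium: {medium:.1f} GiB")  # small: 1.0 GiB, medium: 4.0 GiB
```

The estimate is per sequence, so a server handling several long conversations at once needs that much cache for each of them.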
GPU Recommendations by Tier
Entry Level (7B–8B models, INT4/INT8):
- NVIDIA RTX 4060 Ti (16 GB) — ~$450
- NVIDIA RTX 4070 (12 GB) — ~$550
- Good for development, prototyping, and single-user workloads
Mid-Range (8B at FP16 or 14B at INT8 on a 24 GB card; 14B at FP16 or 70B at INT4 on a 48 GB card):
- NVIDIA RTX 4090 (24 GB) — ~$1,600
- NVIDIA RTX 6000 Ada (48 GB) — ~$4,500
- Suitable for small team inference and production workloads with moderate concurrency
Server-Grade (70B+ models, multi-GPU):
- NVIDIA A100 (80 GB) — ~$15,000–$20,000
- NVIDIA H100 (80 GB) — ~$30,000
- 2–4 GPU configurations for enterprise deployment
AMD note: ROCm support has improved significantly. The Radeon RX 7900 XTX (24 GB) works with most inference frameworks, though NVIDIA's CUDA ecosystem remains more mature for LLM tooling.
Other Hardware Considerations
- RAM: At least 2x your model size in system RAM for model loading and swapping
- Storage: NVMe SSD strongly recommended. Model files are large (a 70B FP16 checkpoint is ~140 GB). HDD load times will bottleneck you badly.
- Power: An RTX 4090 draws ~450W under load. Plan your PSU and cooling accordingly.
- CPU: Less critical for inference itself, but matters for data preprocessing and tokenization. Any modern 8-core CPU is sufficient.
Deploying with Ollama (The Easy Path)
Ollama is the most approachable way to run LLMs locally. It handles model downloading, quantization, and serving through a clean CLI and REST API.
Quick Start with Docker
# Pull and run Ollama in Docker
docker run -d \
  --name ollama \
  --gpus=all \
  -v ollama_models:/root/.ollama \
  -p 11434:11434 \
  ollama/ollama:latest
This single command starts an Ollama server with GPU access, persistent model storage, and the API exposed on port 11434.
Pulling and Running Models
# Enter the container to pull models
docker exec -it ollama ollama pull llama3.1:8b
# List downloaded models
docker exec -it ollama ollama list
# Run a model interactively
docker exec -it ollama ollama run llama3.1:8b
# Try a Mistral model
docker exec -it ollama ollama pull mistral:7b
# Try a Qwen model
docker exec -it ollama ollama pull qwen2.5:14b
Using the Ollama REST API
Ollama exposes a simple native HTTP API under /api and, in releases since the 0.1.x series, OpenAI-compatible endpoints under /v1:
# Generate a response
curl http://localhost:11434/api/generate -d '{
"model": "llama3.1:8b",
"prompt": "Explain container orchestration in three sentences.",
"stream": false
}'
# Chat endpoint (OpenAI-compatible); send the body explicitly as JSON
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1:8b",
    "messages": [
      {"role": "user", "content": "Write a Python function to check if a number is prime."}
    ],
    "temperature": 0.7
  }'
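The same native call can be made from Python with only the standard library. This is a minimal sketch that assumes the server from the Docker example is listening on localhost:11434; the helper names build_generate_request and generate are made up for illustration.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # assumed: default port from the Docker example


def build_generate_request(model: str, prompt: str,
                           base_url: str = OLLAMA_URL) -> urllib.request.Request:
    """Build a non-streaming POST request for Ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str) -> str:
    """Send the request and return the generated text (needs a running server)."""
    request = build_generate_request(model, prompt)
    with urllib.request.urlopen(request, timeout=120) as resp:
        return json.loads(resp.read())["response"]


if __name__ == "__main__":
    print(generate("llama3.1:8b", "Explain container orchestration in three sentences."))
```

Keeping request construction separate from the network call makes the payload easy to inspect or unit-test without a GPU box at hand.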
Docker Compose for Production
For a more maintainable deployment, use Docker Compose:
# docker-compose.yml
version: "3.9"
services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    restart: unless-stopped
    ports:
      - "11434:11434"
    volumes:
      - ollama_models:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
volumes:
  ollama_models:
docker compose up -d
Ollama is ideal for development, prototyping, and small-scale deployments. For high-throughput production workloads, vLLM is the better choice.
Deploying with vLLM (The Production Path)
vLLM is a high-performance inference engine that uses PagedAttention to maximize throughput. It supports continuous batching, tensor parallelism for multi-GPU setups, and serves an OpenAI-compatible API out of the box.
Running vLLM with Docker
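# Llama 3.1 is a gated model, so pass a Hugging Face token (HF_TOKEN must be set in your shell)
docker run --gpus all \
  -p 8000:8000 \
  --ipc=host \
  -e HUGGING_FACE_HUB_TOKEN=$HF_TOKEN \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:latest \
  --model meta-llama/Meta-Llama-3.1-8B-Instruct \
  --tensor-parallel-size 1 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90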
Key flags:
- --tensor-parallel-size: Set to the number of GPUs for multi-GPU inference
- --max-model-len: Maximum sequence length (affects KV cache allocation)
- --gpu-memory-utilization: Fraction of VRAM to use (default 0.90)
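To see how these flags interact, the sketch below estimates how many tokens of KV cache remain once the weights are loaded. It is deliberately simplified: it ignores activation memory and runtime overhead, so treat the result as an upper bound. The function name and the 128 KiB-per-token figure are assumptions for illustration, not values reported by vLLM.

```python
def kv_token_budget(vram_gib: float, gpu_memory_utilization: float,
                    weights_gib: float, kv_bytes_per_token: int,
                    num_gpus: int = 1) -> int:
    """Upper-bound estimate of tokens that fit in the KV cache.

    Simplification: the memory left after loading the weights is treated
    as if all of it were available for the cache.
    """
    usable_gib = vram_gib * num_gpus * gpu_memory_utilization
    kv_bytes = (usable_gib - weights_gib) * 1024**3
    return max(0, int(kv_bytes // kv_bytes_per_token))


# Assumed example: one 24 GiB card, 16 GiB of FP16 weights,
# and 128 KiB of KV cache per token.
budget = kv_token_budget(vram_gib=24, gpu_memory_utilization=0.90,
                         weights_gib=16, kv_bytes_per_token=128 * 1024)
max_model_len = 8192
print(budget, "tokens of cache,", budget // max_model_len, "full-length sequences")
```

Under these assumptions, lowering --max-model-len does not create memory, but it lets the same budget serve more concurrent sequences.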
vLLM with Docker Compose
version: "3.9"
services:
  vllm:
    image: vllm/vllm-openai:latest
    container_name: vllm
    restart: unless-stopped
    ports:
      - "8000:8000"
    ipc: host
    environment:
      - HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}
    volumes:
      - hf_cache:/root/.cache/huggingface
    command: >
      --model meta-llama/Meta-Llama-3.1-8B-Instruct
      --tensor-parallel-size 1
      --max-model-len 8192
      --gpu-memory-utilization 0.90
      --enforce-eager
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
volumes:
  hf_cache:
# Set your Hugging Face token (required for gated models)
export HF_TOKEN=hf_your_token_here
docker compose up -d
Testing the vLLM Endpoint
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
"messages": [
{"role": "system", "content": "You are a helpful DevOps assistant."},
{"role": "user", "content": "How do I debug a crashing container?"}
],
"temperature": 0.7,
"max_tokens": 512
}'
Choosing the Right Model
Model selection depends on your use case, hardware, and quality requirements. Here's a practical breakdown:
For General Chat and Q&A
- Llama 3.1 (8B): Excellent general-purpose model. Well-balanced reasoning and creativity. The default recommendation for most deployments.
- Mistral 7B (v0.3): Fast, efficient, and surprisingly capable. Great for resource-constrained setups.
- Qwen 2.5 (7B/14B): Alibaba's model with strong multilingual support, particularly for Chinese and Southeast Asian languages.
For Coding Tasks
- Qwen 2.5 Coder (7B/32B): Outstanding code generation and completion. Rivals GPT-4 on many benchmarks at the 32B size.
- DeepSeek Coder V2: Excellent multi-language code generation with a large context window.
- CodeLlama: Meta's code-specialized Llama derivative. Solid, though newer models have surpassed it.
For Long-Context Applications
- Mistral Nemo (12B): 128K context window. Good for document analysis and summarization.
- Llama 3.1 (8B/70B): Native 128K context support. The 8B variant fits on consumer hardware.
For Maximum Quality (Multi-GPU Required)
- Llama 3.1 (70B): The open-weight quality leader. Approaches GPT-4-level performance on many benchmarks. Requires at least 2× A100 80GB or equivalent at FP16; an INT4 build (~42 GB) can fit on a single 48 GB card at some cost in quality.
- Mixtral 8x22B: Mixture-of-experts model. High quality with faster inference than dense models of similar parameter counts.
- Qwen 2.5 (72B): Competitive with Llama 3.1 70B, particularly strong in Asian languages.
Practical Tip: Start Small, Scale Up
Don't start with a 70B model. Start with an 8B model, evaluate whether it meets your quality bar, and only scale up if necessary. The difference in infrastructure requirements between 8B and 70B is enormous, and many use cases are well-served by smaller models — especially with good prompting and RAG pipelines.
API Wrappers and Client Integration
Once your model is serving, you'll want to integrate it with applications. Both Ollama and vLLM expose OpenAI-compatible APIs, which means any OpenAI client library works once you point it at your endpoint.
Python Integration (OpenAI SDK)
from openai import OpenAI

# Point the client at your self-hosted endpoint
client = OpenAI(
    base_url="http://localhost:8000/v1",  # vLLM
    # base_url="http://localhost:11434/v1",  # Ollama
    api_key="not-needed",  # vLLM and Ollama don't require a real key
)

response = client.chat.completions.create(
    # For Ollama, use the tag you pulled instead, e.g. "llama3.1:8b"
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a technical writer."},
        {"role": "user", "content": "Explain Kubernetes pods to a beginner."},
    ],
    temperature=0.7,
    max_tokens=500,
)
print(response.choices[0].message.content)
JavaScript / TypeScript Integration
import OpenAI from "openai";
const client = new OpenAI({
baseURL: "http://localhost:8000/v1",
apiKey: "not-needed",
});
const response = await client.chat.completions.create({
model: "meta-llama/Meta-Llama-3.1-8B-Instruct",
messages: [
{ role: "user", content: "Summarize this article: ..." }
],
});
console.log(response.choices[0].message.content);
cURL (For Quick Testing)
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
"prompt": "The future of self-hosted AI is",
"max_tokens": 100,
"temperature": 0.8
}'
Drop-In Replacement for Existing Apps
If your application already uses the OpenAI SDK, switching to a self-hosted model is typically a small configuration change: update base_url and the model name. LangChain, LlamaIndex, AutoGen, and most other frameworks that speak the OpenAI API format accept a custom base URL and work without further modification.
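One way to keep that switch out of application code is to read the endpoint from the environment. The sketch below is one possible arrangement: the variable names LLM_BASE_URL, LLM_API_KEY, and LLM_MODEL are hypothetical, and the defaults simply mirror the vLLM examples in this guide.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class LLMEndpoint:
    base_url: str
    api_key: str
    model: str


def endpoint_from_env(env: Mapping[str, str] = os.environ) -> LLMEndpoint:
    """Resolve endpoint settings, falling back to the local vLLM defaults."""
    return LLMEndpoint(
        base_url=env.get("LLM_BASE_URL", "http://localhost:8000/v1"),
        api_key=env.get("LLM_API_KEY", "not-needed"),
        model=env.get("LLM_MODEL", "meta-llama/Meta-Llama-3.1-8B-Instruct"),
    )


cfg = endpoint_from_env()
# With the OpenAI SDK installed, the client would then be created as:
#   client = OpenAI(base_url=cfg.base_url, api_key=cfg.api_key)
#   client.chat.completions.create(model=cfg.model, messages=[...])
print(cfg.base_url, cfg.model)
```

Moving between a hosted API, vLLM, and Ollama then becomes a deployment setting, with no code change to review.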
Performance Optimization Tips
1. Choose the Right Quantization
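Quantization reduces model size at the cost of some quality loss, which is usually small:
- INT4 (4-bit): ~75% reduction in weight memory versus FP16, somewhat less in practice once format overhead is included. Quality drop is often imperceptible for chat and Q&A. Use GGUF format with Ollama for best results.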
- INT8 (8-bit): ~50% VRAM reduction. Near-identical quality to FP16. Good middle ground.
- FP16 (16-bit): Full quality. Use when VRAM allows and maximum accuracy matters.
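The percentages above follow directly from bits per parameter. As a back-of-the-envelope sketch (the function names are invented for illustration), raw weight memory is parameters × bits ÷ 8. Real quantized files come out somewhat larger, because they store scaling factors and often keep some tensors at higher precision, so read these numbers as lower bounds.

```python
BITS_PER_PARAM = {"fp16": 16, "int8": 8, "int4": 4}


def weight_gb(params_billion: float, precision: str) -> float:
    """Raw weight size in GB: billions of parameters times bytes per parameter."""
    return params_billion * BITS_PER_PARAM[precision] / 8


def fits(vram_gb: float, params_billion: float, precision: str,
         headroom_gb: float = 2.0) -> bool:
    """Rough check that weights plus an assumed headroom fit in VRAM."""
    return weight_gb(params_billion, precision) + headroom_gb <= vram_gb


for precision in ("fp16", "int8", "int4"):
    print(precision, weight_gb(8, precision), "GB, fits 12 GB card:", fits(12, 8, precision))
```

The 2 GB headroom default is an arbitrary placeholder for KV cache and runtime overhead; size it for your own context length.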
2. Use Continuous Batching (vLLM)
vLLM's continuous batching processes multiple requests simultaneously without waiting for batch completion. This dramatically improves throughput:
# Continuous batching is on by default in vLLM; these flags tune chunked prefill and the per-step token budget
vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct \
--enable-chunked-prefill \
--max-num-batched-tokens 4096
3. Optimize Context Length
Longer context windows consume more VRAM for KV cache. If your use case doesn't need 128K tokens, reduce it:
# Limit context to 4096 tokens to save memory
vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct \
--max-model-len 4096
4. Use Speculative Decoding
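Speculative decoding uses a small "draft" model to propose several tokens ahead, which the larger model then verifies in a single pass. Reported speedups reach 2–3x, though the gain depends on how often the draft's guesses are accepted:
# Flag names for speculative decoding have changed across vLLM releases; confirm with `vllm serve --help`
vllm serve meta-llama/Meta-Llama-3.1-70B-Instruct \
  --speculative-model meta-llama/Llama-3.2-1B-Instruct \
  --num-speculative-tokens 5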
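The propose-and-verify loop is easier to follow in a toy form. The sketch below implements the greedy variant only, with plain Python functions standing in for the two models. It is not vLLM's implementation, and a real engine verifies all proposed tokens in one batched forward pass and uses a probabilistic acceptance rule when sampling. The point it illustrates is that the output is identical to what the target model would have produced alone.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]  # maps a token prefix to the greedy next token


def greedy_decode(model: NextToken, prompt: List[int], max_new_tokens: int) -> List[int]:
    out = list(prompt)
    for _ in range(max_new_tokens):
        out.append(model(out))
    return out


def speculative_decode(target: NextToken, draft: NextToken, prompt: List[int],
                       max_new_tokens: int, k: int = 5) -> List[int]:
    out = list(prompt)
    goal = len(prompt) + max_new_tokens
    while len(out) < goal:
        # 1. The cheap draft model proposes up to k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(min(k, goal - len(out))):
            proposal.append(draft(out + proposal))
        # 2. The target checks the proposals in order.
        for tok in proposal:
            expected = target(out)
            if expected != tok:
                out.append(expected)  # first mismatch: keep the target's token, drop the rest
                break
            out.append(tok)  # match: the drafted token is accepted
        else:
            # Every proposal was accepted; the target contributes one more token if there is room.
            if len(out) < goal:
                out.append(target(out))
    return out


# Toy stand-ins: the draft agrees with the target except at every fourth position.
def toy_target(prefix: List[int]) -> int:
    return (prefix[-1] * 3 + 1) % 17


def toy_draft(prefix: List[int]) -> int:
    guess = toy_target(prefix)
    return guess if len(prefix) % 4 else (guess + 1) % 17


assert speculative_decode(toy_target, toy_draft, [1], 12) == greedy_decode(toy_target, [1], 12)
```

The speedup in practice comes from the target model scoring several drafted tokens per forward pass, which this sequential toy does not model.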
5. KV Cache Quantization
Reduce KV cache memory by quantizing it to FP8:
vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct \
--kv-cache-dtype fp8
FP8 values take half the space of FP16 values, so this roughly halves KV cache memory, usually with negligible quality impact.
Cost Analysis: Self-Hosted vs API
Let's break down the real costs. We'll compare a self-hosted 8B model against OpenAI's API for a mid-sized workload.
Scenario: 10 Million Tokens/Day
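API Costs (OpenAI GPT-4o-mini):
- Input tokens: ~$0.15 per 1M tokens
- Output tokens: ~$0.60 per 1M tokens
- Estimated monthly cost (70/30 input/output split): ~$86/month (7M × $0.15 + 3M × $0.60 = $2.85/day, × 30 days)
- For reference, a bill of ~$1,350/month at this volume corresponds to a blended rate of ~$4.50 per 1M tokens, which is larger-model pricing, not GPT-4o-mini pricing
Self-Hosted (RTX 4090 Server):
- RTX 4090: ~$1,600 (one-time)
- Server (CPU, RAM, PSU, case): ~$1,200 (one-time)
- Electricity (600W × 24h × 30 days × $0.12/kWh): ~$52/month
- Co-location or hosting: ~$100–$200/month
- Total upfront: ~$2,800
- Monthly operating: ~$150–$250/month
Break-even: none against GPT-4o-mini at this volume, because the API bill (~$86/month) is lower than the self-hosted operating cost. Against an API bill of ~$1,350/month, break-even is ~2.5 months ($2,800 ÷ ($1,350 − ~$225)).
In that second case, you save ~$1,100/month after the break-even point. Against a budget model like GPT-4o-mini, the cost case for self-hosting only appears at several times this volume, or when privacy and control justify paying more. For 70B models compared to GPT-4-class APIs, the per-token API price is 5–10x higher, so the gap can be wider still, though 70B-class hardware also costs considerably more up front than a single RTX 4090.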
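Because the outcome is sensitive to the inputs, it is worth rerunning this arithmetic with your own prices and volumes. The following is a small calculator sketch. The function names are invented here, and the 30-day month and constant 24-hour load are simplifying assumptions.

```python
from typing import Optional


def monthly_api_cost(tokens_per_day_millions: float, input_share: float,
                     input_price_per_m: float, output_price_per_m: float,
                     days: int = 30) -> float:
    """API bill per month for a given daily volume and input/output split."""
    blended = input_share * input_price_per_m + (1 - input_share) * output_price_per_m
    return tokens_per_day_millions * blended * days


def monthly_electricity(watts: float, price_per_kwh: float, days: int = 30) -> float:
    """Electricity cost per month, assuming constant load around the clock."""
    return watts / 1000 * 24 * days * price_per_kwh


def break_even_months(upfront: float, api_monthly: float,
                      self_hosted_monthly: float) -> Optional[float]:
    """Months until hardware is paid off, or None if the API is cheaper to run."""
    saving = api_monthly - self_hosted_monthly
    return upfront / saving if saving > 0 else None


api = monthly_api_cost(10, input_share=0.7, input_price_per_m=0.15, output_price_per_m=0.60)
ops = monthly_electricity(600, 0.12) + 150  # 150 = midpoint of the $100-200 hosting range
print(f"API: ${api:.2f}/month, self-hosted ops: ${ops:.2f}/month")
print("break-even (months):", break_even_months(2800, api, ops))
```

A result of None means the API bill is below your running costs, so the hardware never pays for itself on cost alone.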
Hidden Costs to Consider
- Maintenance: OS updates, security patches, model updates — budget 2–4 hours/month
- Redundancy: For production, you'll want at least two inference nodes
- Monitoring: GPU temperatures, VRAM usage, inference latency dashboards
- Scaling: Self-hosting doesn't auto-scale the way cloud APIs do
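For the monitoring item, a first step before building dashboards is to poll nvidia-smi and parse its CSV output. This is a minimal sketch assuming an NVIDIA driver with nvidia-smi on the PATH; the function names and dictionary keys are arbitrary choices, and a production setup would more likely export these values to an existing metrics system.

```python
import subprocess
from typing import Dict, List

QUERY = "temperature.gpu,memory.used,memory.total,utilization.gpu"


def parse_gpu_csv(text: str) -> List[Dict[str, float]]:
    """Parse `--format=csv,noheader,nounits` output, one row per GPU."""
    rows = []
    for line in text.strip().splitlines():
        temp, used, total, util = (float(field) for field in line.split(","))
        rows.append({
            "temp_c": temp,
            "vram_used_mib": used,
            "vram_total_mib": total,
            "util_pct": util,
            "vram_frac": used / total,
        })
    return rows


def read_gpus() -> List[Dict[str, float]]:
    """Query the local GPUs (requires nvidia-smi to be installed)."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_csv(result.stdout)


if __name__ == "__main__":
    for index, gpu in enumerate(read_gpus()):
        print(f"GPU {index}: {gpu['temp_c']:.0f} C, VRAM {gpu['vram_frac']:.0%}, util {gpu['util_pct']:.0f}%")
```

Running this on a timer and alerting on temperature or VRAM fraction covers the basics until proper dashboards are in place.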
Security Best Practices
When self-hosting, security is entirely your responsibility:
- Never expose the inference port directly to the internet. Use a reverse proxy (nginx, Caddy) with authentication.
- Add rate limiting at the proxy level to prevent abuse.
- Run containers as non-root when possible.
- Keep model files verified — download only from official Hugging Face or Ollama sources, and verify checksums.
- Network segmentation: Place inference servers on an internal network, accessible only through your API gateway.
Example nginx reverse proxy with auth:
server {
    listen 443 ssl;
    server_name llm.yourcompany.com;
    ssl_certificate /etc/ssl/certs/yourcompany.crt;
    ssl_certificate_key /etc/ssl/private/yourcompany.key;

    location / {
        proxy_pass http://127.0.0.1:8000;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;

        # Rate limiting
        limit_req zone=llm_api burst=20 nodelay;

        # Simple API key auth
        if ($http_authorization != "Bearer YOUR_SECRET_KEY") {
            return 401;
        }
    }
}
# Define rate limit zone in http block
# limit_req_zone $binary_remote_addr zone=llm_api:10m rate=10r/s;
Conclusion
Self-hosting large language models has crossed the threshold from experimental to practical. With tools like Ollama for quick prototyping and vLLM for production-grade serving, a developer with server management experience can deploy a capable LLM in an afternoon.
The decision ultimately comes down to volume, privacy requirements, and how much control you need. For high-volume workloads, regulated industries, or teams that simply want sovereignty over their AI infrastructure, self-hosting delivers compelling economics and total data control.
Start with a single GPU, an 8B model, and Ollama. Benchmark it against your actual workload. You may be surprised at how capable open-weight models have become — and how much you save by running them yourself.