But if you've spent any time reading tutorials online, you've probably noticed a problem: most of them are written for running models locally on a laptop, not for deploying them on a production-grade dedicated server.
Many guides will tell you to run a simple Python script using the transformers library. What they don't mention is that a basic script handles one request at a time: the moment two users hit your server simultaneously, requests queue up, latency climbs, and under sustained load the process can run out of GPU memory and crash.
In this guide, we are going to show you how to properly self-host LLaMA 3 on a dedicated GPU server using true production-ready inference engines.
The Hardware Reality Check: What Do You Actually Need?
A common misconception in the AI space right now concerns how much VRAM (Video RAM) you need. You cannot run a 70-billion-parameter model on a standard gaming rig without severely degrading its quality through aggressive quantization.
Furthermore, modern models like LLaMA 3 are natively trained in bfloat16 (BF16). Running them in their native precision avoids the overflow issues that FP16's narrower exponent range can cause, and BF16 is hardware-accelerated on Ampere, Ada, and newer NVIDIA GPUs.
If you are renting a dedicated server from Fit Servers, here is the actual GPU VRAM you need depending on the LLaMA 3 variant:
| Model Version | Precision | VRAM Required | Ideal Fit Servers Setup |
|---|---|---|---|
| LLaMA 3 8B | BF16 (Uncompressed) | ~16 GB | 1x RTX 4090 (24 GB) or RTX 5090 (32 GB) |
| LLaMA 3 8B | 4-bit Quantized | ~6 GB | 1x RTX 3090 (24 GB) or RTX 4090 (24 GB) |
| LLaMA 3 70B | BF16 (Uncompressed) | ~140 GB | 4x RTX 6000 Ada (192 GB) or 2x A100 80GB (160 GB, limited KV-cache headroom) |
| LLaMA 3 70B | 4-bit Quantized | ~40 GB | 2x RTX 3090/4090 (48 GB total) or 1x RTX 6000 Ada (48 GB) |
- Always leave a 20% VRAM buffer for context windows (KV Cache). If your model requires 40 GB to load, you need at least 48 GB of total VRAM to handle real user traffic without running out of memory.
- GPU Model Note: Within the RTX 50 series, only the RTX 5090 (32 GB) has sufficient VRAM for 8B BF16 with headroom. The RTX 5080 carries only 16 GB and will not meet the buffer requirement. Within the A100 line, always specify the 80 GB variant: two A100 40 GB cards (80 GB total) cannot fit the 70B BF16 model.
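The numbers in the table can be sanity-checked with some back-of-the-envelope arithmetic. The sketch below is a rough estimate, not a measurement: it counts weights as parameters times bits per parameter, and it assumes the LLaMA 3 8B architecture values from Meta's published model config (32 layers, 8 KV heads, head dimension 128), which are not stated elsewhere in this guide. Real 4-bit builds come out somewhat larger than the raw figure because of quantization scales and runtime overhead.

```python
def weight_gb(params_billion, bits_per_param):
    """Approximate size of the weights alone, in GB (10^9 bytes)."""
    return params_billion * bits_per_param / 8


def kv_cache_gb(tokens, layers, kv_heads, head_dim, bytes_per_value=2):
    """KV cache for `tokens` tokens: one key and one value vector
    per layer per KV head, stored in BF16 (2 bytes) by default."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens / 1e9


def vram_with_buffer(model_gb, buffer=0.20):
    """Apply the 20% headroom rule from the list above."""
    return model_gb * (1 + buffer)


print(weight_gb(8, 16))    # 16.0  -> the ~16 GB row
print(weight_gb(70, 16))   # 140.0 -> the ~140 GB row
print(weight_gb(70, 4))    # 35.0  -> raw weights; ~40 GB once overhead is added

# Assumed LLaMA 3 8B architecture: 32 layers, 8 KV heads, head_dim 128.
print(round(kv_cache_gb(8192, 32, 8, 128), 2))  # 1.07 GB per full 8192-token sequence

print(round(vram_with_buffer(40), 1))  # 48.0 -> matches the 40 GB / 48 GB example
```

Roughly 1 GB of KV cache per full-length sequence is why the buffer matters: every concurrent conversation claims its own share of it.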
The Software Stack: vLLM vs. Ollama
To serve LLaMA 3, you need an inference engine.
- Ollama: Brilliant for internal testing, dev environments, or single-user applications. It is incredibly easy to set up.
- vLLM: The gold standard for production. It uses a technique called PagedAttention, which drastically reduces memory waste and allows for high-throughput, concurrent user requests.
We will cover the production route (vLLM) first, as that is what a dedicated server is built for.
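To get a feel for why PagedAttention saves memory, here is a toy model of the idea. It is not vLLM's implementation: it simply compares reserving a full maximum-length KV cache slot per request against handing out small fixed-size blocks as each sequence grows. The block size of 16 tokens and the sequence lengths are illustrative values chosen for this example.

```python
import math

BLOCK_TOKENS = 16  # tokens per KV block (illustrative value)


def reserved_contiguous(seq_lens, max_len):
    """Token slots reserved when every request preallocates max_len."""
    return len(seq_lens) * max_len


def reserved_paged(seq_lens, block_tokens=BLOCK_TOKENS):
    """Token slots reserved when each request holds only the blocks it has touched."""
    return sum(math.ceil(n / block_tokens) * block_tokens for n in seq_lens)


seq_lens = [120, 900, 35, 2048]  # four hypothetical in-flight requests
used = sum(seq_lens)

print(used)                                 # 3103 slots actually needed
print(reserved_contiguous(seq_lens, 8192))  # 32768 slots reserved up front
print(reserved_paged(seq_lens))             # 3136 slots reserved in blocks
```

In this toy case the paged scheme wastes 33 token slots instead of more than 29,000, and that reclaimed memory is what lets more concurrent requests fit on the same GPU.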
Step 1: Prepare Your Dedicated Server (Ubuntu)
Assuming you have spun up an Ubuntu server with a GPU, you need the NVIDIA driver stack and Docker. Many tutorials skip the NVIDIA Container Toolkit, which results in Docker being unable to "see" your GPU.
1. Update & Install Docker
sudo apt update && sudo apt upgrade -y
sudo apt install docker.io -y
2. Install NVIDIA Container Toolkit
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
&& curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt update
sudo apt install -y nvidia-container-toolkit
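Crucial Fix: Configure Docker to use NVIDIA as the default runtime. Open the Docker daemon file with sudo nano /etc/docker/daemon.json and paste this exact configuration (if the file already contains settings, merge these keys into it rather than overwriting them):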
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
Restart Docker to apply the changes:
sudo systemctl restart docker
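A typo in daemon.json will stop Docker from starting, so it can be worth validating the file before you restart. The helper below is a minimal sketch; the function name is our own and it only checks the two keys used in this guide.

```python
import json


def nvidia_is_default_runtime(daemon_json_text):
    """Return True if the config parses as JSON, registers an 'nvidia'
    runtime, and selects it as the default runtime."""
    cfg = json.loads(daemon_json_text)  # raises ValueError on malformed JSON
    runtimes = cfg.get("runtimes", {})
    return cfg.get("default-runtime") == "nvidia" and "nvidia" in runtimes


sample = """
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {"path": "nvidia-container-runtime", "runtimeArgs": []}
  }
}
"""
print(nvidia_is_default_runtime(sample))  # True

# On the server itself you could check the real file:
# with open("/etc/docker/daemon.json") as f:
#     print(nvidia_is_default_runtime(f.read()))
```

After the restart, the output of docker info should also list nvidia under its runtimes.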
Step 2: The Security Check (Do Not Skip)
If you are running on a rented dedicated server, it likely has a public IP address. Do not expose an unauthenticated AI API to the open internet. Anyone with a port scanner can hijack your compute and run malicious prompts on your hardware.
To fix this, you must bind your Docker containers strictly to localhost (127.0.0.1) and use a reverse proxy (like Nginx or Caddy) to handle API key authentication safely. First, lock down your server using UFW for standard access:
sudo ufw allow ssh
sudo ufw enable
(We will handle the safe local port binding in the upcoming Docker command).
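Whichever reverse proxy you choose, the API key check it performs boils down to comparing a bearer token against a secret. The snippet below is only a sketch of that logic in Python, not an Nginx or Caddy configuration: the function name is hypothetical, and the header lookup is simplified (real HTTP header names are case-insensitive).

```python
import hmac


def is_authorized(headers, expected_key):
    """Accept only 'Authorization: Bearer <key>' where <key> matches the secret."""
    auth = headers.get("Authorization", "")
    scheme, _, presented = auth.partition(" ")
    if scheme.lower() != "bearer" or not presented:
        return False
    # compare_digest takes the same time whether or not the keys match,
    # which avoids leaking the key through response timing.
    return hmac.compare_digest(presented.encode(), expected_key.encode())


secret = "example-key-change-me"  # placeholder, not a real key
print(is_authorized({"Authorization": "Bearer example-key-change-me"}, secret))  # True
print(is_authorized({"Authorization": "Bearer wrong"}, secret))                  # False
print(is_authorized({}, secret))                                                 # False
```

Requests that fail this check should be rejected at the proxy so they never reach the inference engine.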
Step 3: The Missing Link – Hugging Face Authentication
LLaMA 3 is a gated model. You cannot just pull it anonymously.
- Go to Hugging Face and create an account.
- Navigate to the meta-llama/Meta-Llama-3-8B-Instruct repository and agree to Meta's terms. (Approval usually takes less than 5 minutes.)
- Go to your Hugging Face Settings > Access Tokens and generate a Read token.
- Export this token to your server environment:
export HF_TOKEN="your_huggingface_token_here"
Step 4: Deploying LLaMA 3 for Production (Using vLLM)
Instead of installing endless Python dependencies directly onto your bare metal, use the official vLLM Docker container. This keeps your server clean.
Create a directory to store the model weights so you don't have to redownload them every time the server restarts:
mkdir -p /opt/models
Run the vLLM container (Note the 127.0.0.1 binding for security and the --tensor-parallel-size argument):
docker run -d --gpus all \
-v /opt/models:/root/.cache/huggingface \
-e "HF_TOKEN=$HF_TOKEN" \
-p 127.0.0.1:8000:8000 \
--ipc=host \
vllm/vllm-openai:latest \
--model meta-llama/Meta-Llama-3-8B-Instruct \
--dtype bfloat16 \
--max-model-len 8192 \
--tensor-parallel-size 1
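- -p 127.0.0.1:8000:8000: This forces Docker to only expose the API locally. Docker writes its own iptables rules for published ports, so a port published on all interfaces would be reachable from the internet even with UFW enabled.
- --ipc=host: vLLM runs on PyTorch, which uses shared memory to pass data between processes, particularly for tensor-parallel inference. Docker's default shared-memory allocation is only 64 MB, so without this flag (or a larger --shm-size) multi-process workloads can fail.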
- --tensor-parallel-size 1: This tells vLLM how many GPUs to split the model across. If you are running the 70B model on multiple GPUs, you MUST change this number to match your GPU count (e.g., --tensor-parallel-size 2 for a dual-GPU setup); otherwise, the deployment will fail.
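Not every GPU count is a usable tensor-parallel size. To our knowledge, vLLM requires the model's attention head count to be evenly divisible by --tensor-parallel-size, which rules out odd setups such as three GPUs. The sketch below assumes the head counts from Meta's published model configs (32 for LLaMA 3 8B, 64 for 70B); the helper name is our own.

```python
def valid_tp_sizes(num_attention_heads, gpu_count):
    """List the tensor-parallel sizes up to gpu_count that divide
    the attention head count evenly."""
    return [n for n in range(1, gpu_count + 1) if num_attention_heads % n == 0]


LLAMA3_70B_HEADS = 64  # assumed from Meta's published config

print(valid_tp_sizes(LLAMA3_70B_HEADS, 2))  # [1, 2]
print(valid_tp_sizes(LLAMA3_70B_HEADS, 3))  # [1, 2]  (3 GPUs is not a valid split)
print(valid_tp_sizes(LLAMA3_70B_HEADS, 4))  # [1, 2, 4]
```

In practice this means planning multi-GPU servers in powers of two: 2, 4, or 8 cards.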
Your server is now hosting an OpenAI-compatible API locally on port 8000.
Step 5 (Alternative): The Quick Route (Using Ollama)
If you don't need high-concurrency production throughput and just want the model running for internal team use in 60 seconds, use Ollama. Its default model builds are already quantized (typically to 4-bit), so there is nothing to configure.
Install Ollama (this automatically sets it up as a background systemd service):
curl -fsSL https://ollama.com/install.sh | sh
By default, Ollama only listens on localhost (port 11434). If you use Nginx to reverse proxy to it, you don't need to change anything. Pull and run the model:
ollama run llama3
Step 6: Testing Your AI Server
Whether you used vLLM or Ollama, you can test your local endpoint using a simple curl command. (Run this from a terminal on the server itself, since both APIs are bound to localhost.)
If using vLLM (OpenAI Compatible):
curl http://127.0.0.1:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Meta-Llama-3-8B-Instruct",
"messages": [
{"role": "user", "content": "Explain quantum computing in one sentence."}
]
}'
If using Ollama:
curl http://127.0.0.1:11434/api/chat \
-H "Content-Type: application/json" \
-d '{
"model": "llama3",
"messages": [
{"role": "user", "content": "Explain quantum computing in one sentence."}
],
"stream": false
}'
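If you would rather test from a script than from curl, the same two requests can be sent with Python's standard library alone. This is a minimal sketch under the assumptions of this guide (the endpoints and model names used above); the function names are our own. It reads the reply from choices[0].message.content for the OpenAI-style response and from message.content for Ollama's non-streaming response.

```python
import json
import urllib.request


def build_chat_request(url, model, prompt, stream=None):
    """Build a POST request carrying a single-turn chat payload."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    if stream is not None:  # Ollama streams by default, so pass stream=False for it
        payload["stream"] = stream
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(body):
    """Pull the assistant text out of either response shape."""
    if "choices" in body:                       # OpenAI-compatible (vLLM)
        return body["choices"][0]["message"]["content"]
    return body["message"]["content"]           # Ollama /api/chat


def ask(url, model, prompt, stream=None, timeout=120):
    """Send the request and return the model's reply as a string."""
    req = build_chat_request(url, model, prompt, stream)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return extract_reply(json.load(resp))


# Example calls (run on the server, with the relevant engine started):
# ask("http://127.0.0.1:8000/v1/chat/completions",
#     "meta-llama/Meta-Llama-3-8B-Instruct", "Explain quantum computing in one sentence.")
# ask("http://127.0.0.1:11434/api/chat", "llama3",
#     "Explain quantum computing in one sentence.", stream=False)
```

Because the vLLM endpoint is OpenAI-compatible, existing OpenAI client libraries should also work once they are pointed at your server's base URL.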
Ready to Bring Your AI In-House?
Self-hosting LLaMA 3 guarantees that your company's proprietary data never leaves your infrastructure, all while eliminating unpredictable API billing. But software is only half the equation—you need bare-metal performance that won't buckle under heavy inference loads.
Looking for the perfect hardware to run your LLMs? Explore fitservers.com for high-performance, cost-effective dedicated GPU servers tailored for heavy AI workloads. Whether you need a single GPU for 8B models or a multi-GPU cluster for 70B parameter deployments, Fit Servers gives you the raw power and dedicated bandwidth required to keep your AI fast, secure, and always online.