The Production-Ready Guide to Self-Hosting LLaMA 3 on a GPU Dedicated Server

If you're looking to break free from third-party API costs, data privacy concerns, and rate limits, self-hosting Meta's LLaMA 3 is the logical next step.

But if you've spent any time reading tutorials online, you've probably noticed a problem: most of them are written for running models locally on a laptop, not for deploying them on a production-grade dedicated server.

Many guides will tell you to run a simple Python script using the transformers library. What they don't mention is that such a script typically handles one request at a time: the moment several users hit your server at once, requests queue behind each other, latency balloons, and the process can run out of GPU memory and crash.

In this guide, we are going to show you how to properly self-host LLaMA 3 on a dedicated GPU server using true production-ready inference engines.

The Hardware Reality Check: What Do You Actually Need?

A common misconception in the AI space right now concerns how much VRAM (video RAM) you need. You cannot run a 70-billion-parameter model on a standard gaming rig without severely degrading its quality through aggressive quantization.

Furthermore, modern models like LLaMA 3 are natively trained in bfloat16 (BF16). Running them in their native precision avoids the overflow issues that FP16 can hit, because BF16 keeps the same exponent range as FP32, and it uses the BF16 tensor-core support built into Ampere and Ada GPUs.

If you are renting a dedicated server from Fit Servers, here is the actual GPU VRAM you need depending on the LLaMA 3 variant:

Model Version | Precision           | VRAM Required | Ideal Fit Servers Setup
LLaMA 3 8B    | BF16 (Uncompressed) | ~16 GB        | 1x RTX 4090 (24 GB) or 1x RTX 5090 (32 GB)
LLaMA 3 8B    | 4-bit Quantized     | ~6 GB         | 1x RTX 3090 (24 GB) or 1x RTX 4090 (24 GB)
LLaMA 3 70B   | BF16 (Uncompressed) | ~140 GB       | 4x RTX 6000 Ada (192 GB) or 2x A100 80GB (160 GB, minimal headroom)
LLaMA 3 70B   | 4-bit Quantized     | ~40 GB        | 2x RTX 3090/4090 (48 GB total) or 1x RTX 6000 Ada (48 GB)

Hardware Context Notes
  • Always leave a 20% VRAM buffer for context windows (KV Cache). If your model requires 40 GB to load, you need at least 48 GB of total VRAM to handle real user traffic without running out of memory.
  • GPU Model Note: Within the RTX 50 series, only the RTX 5090 (32 GB) has sufficient VRAM for 8B BF16 with headroom. The RTX 5080 carries only 16 GB, which leaves no room for the buffer once the weights are loaded. Within the A100 line, always specify the 80 GB variant: two A100 40 GB cards (80 GB total) cannot fit the 70B BF16 model.
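
The figures in the table follow from simple arithmetic: weight memory is roughly the parameter count times the bytes per parameter, and the 20% rule adds KV-cache headroom on top. Here is a minimal sketch in Bash, assuming awk is available. The helper name estimate_vram_gb is hypothetical, and the result ignores quantization metadata and runtime overhead, which is why the table's 4-bit figures are somewhat higher than the raw estimate.

```shell
# Rough VRAM estimate (hypothetical helper, not part of any tool).
# Usage: estimate_vram_gb <params_in_billions> <bytes_per_param>
# Prints "<weights_gb> <weights_plus_20_percent_gb>".
estimate_vram_gb() {
  awk -v p="$1" -v b="$2" 'BEGIN {
    weights = p * b    # billions of params x bytes per param = GB (decimal)
    printf "%.1f %.1f\n", weights, weights * 1.2
  }'
}
```

For example, estimate_vram_gb 8 2 prints 16.0 19.2 (8B in BF16), and estimate_vram_gb 70 0.5 prints 35.0 42.0 (70B at 4 bits per weight).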

The Software Stack: vLLM vs. Ollama

To serve LLaMA 3, you need an inference engine.

  • Ollama: Brilliant for internal testing, dev environments, or single-user applications. It is incredibly easy to set up.
  • vLLM: The gold standard for production. It uses a technique called PagedAttention, which drastically reduces memory waste and allows for high-throughput, concurrent user requests.
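
To make the memory-waste point concrete, the sketch below compares two ways of sizing the KV cache: reserving space for the full context window up front, and allocating it in fixed-size blocks as tokens arrive, which is the idea behind PagedAttention. The shape values (32 layers, 8 KV heads, head dimension 128) are those of LLaMA 3 8B. The 16-token block size and both function names are assumptions for illustration, not vLLM internals.

```shell
# Bytes of KV cache per token:
# 2 (K and V) x layers x kv_heads x head_dim x bytes_per_value.
kv_bytes_per_token() {
  echo $(( 2 * $1 * $2 * $3 * $4 ))
}

# Number of fixed-size blocks needed for a sequence (ceiling division).
paged_blocks() {
  local tokens="$1" block_size="$2"
  echo $(( (tokens + block_size - 1) / block_size ))
}

per_token=$(kv_bytes_per_token 32 8 128 2)             # LLaMA 3 8B in BF16
reserved=$(( per_token * 8192 ))                       # reserve the full 8192-token window
paged=$(( per_token * 16 * $(paged_blocks 500 16) ))   # 500-token request in 16-token blocks

echo "per token: $per_token bytes"
echo "preallocated: $(( reserved / 1048576 )) MiB, paged: $(( paged / 1048576 )) MiB"
```

Under these assumptions a 500-token request occupies 64 MiB of cache when paged, against 1024 MiB if the whole window is reserved per request. That gap is what lets one GPU hold many concurrent sequences.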

We will cover the production route (vLLM) first, as that is what a dedicated server is built for.

Step 1: Prepare Your Dedicated Server (Ubuntu)

Assuming you have spun up an Ubuntu server with a GPU, you need the NVIDIA driver stack and Docker. Many tutorials skip the NVIDIA Container Toolkit, which results in Docker being unable to "see" your GPU.

1. Update & Install Docker

Bash
sudo apt update && sudo apt upgrade -y
sudo apt install docker.io -y

2. Install NVIDIA Container Toolkit

Bash
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

sudo apt update
sudo apt install -y nvidia-container-toolkit

Crucial Fix: Configure Docker to use NVIDIA as the default runtime. Open the Docker daemon file with sudo nano /etc/docker/daemon.json and add the configuration below. If the file already contains settings, merge these keys into it rather than overwriting it:

JSON
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}

Restart Docker to apply the changes:

Bash
sudo systemctl restart docker
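
Before moving on, it is worth confirming that Docker can actually see the GPU. Here is a minimal sketch, assuming the NVIDIA driver is already installed on the host and your user can run docker. The helper name verify_gpu_runtime is hypothetical, and the plain ubuntu image works here only because the NVIDIA runtime injects nvidia-smi into the container.

```shell
# Check that "nvidia" is Docker's default runtime, then run nvidia-smi in a
# throwaway container. Returns non-zero if either step fails.
verify_gpu_runtime() {
  local runtime
  runtime="$(docker info --format '{{.DefaultRuntime}}')" || return 1
  if [ "$runtime" != "nvidia" ]; then
    echo "default runtime is '$runtime', expected 'nvidia'" >&2
    return 1
  fi
  docker run --rm --gpus all ubuntu nvidia-smi
}
```

Calling verify_gpu_runtime should print the familiar nvidia-smi table listing your GPUs. An error at this point usually means the toolkit or the daemon.json change did not take effect.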

Step 2: The Security Check (Do Not Skip)

If you are running on a rented dedicated server, it likely has a public IP address. Do not expose an unauthenticated AI API to the open internet. Anyone with a port scanner can hijack your compute and run malicious prompts on your hardware.

The Docker Bypass Warning

A common mistake is relying entirely on UFW (Uncomplicated Firewall). Docker writes its own iptables rules for published ports, and those rules take effect ahead of UFW's. A container started with -p 8000:8000 is therefore reachable from the internet even if UFW blocks port 8000.

To fix this, you must bind your Docker containers strictly to localhost (127.0.0.1) and use a reverse proxy (like Nginx or Caddy) to handle API key authentication safely. First, lock down your server using UFW for standard access:

Bash
sudo ufw allow ssh
sudo ufw enable

(We will handle the safe local port binding in the upcoming Docker command).
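
The reverse-proxy half of that setup could look like the sketch below, which prints an Nginx server block that forwards only requests carrying the expected bearer token. It assumes Nginx is installed and that you already have a TLS certificate. The function name, the certificate paths, and the server name are all hypothetical placeholders.

```shell
# Print an Nginx server block that proxies /v1/ to the local API and
# rejects requests without the expected bearer token.
# Usage: render_nginx_conf <api_key> <server_name>
render_nginx_conf() {
  local api_key="$1" server_name="$2"
  cat <<EOF
server {
    listen 443 ssl;
    server_name ${server_name};

    # Hypothetical certificate paths; point these at your real files.
    ssl_certificate     /etc/ssl/certs/llm.crt;
    ssl_certificate_key /etc/ssl/private/llm.key;

    location /v1/ {
        # Reject any request without the expected bearer token.
        if (\$http_authorization != "Bearer ${api_key}") {
            return 401;
        }
        proxy_pass http://127.0.0.1:8000;
        proxy_buffering off;       # pass streamed tokens through immediately
        proxy_read_timeout 300s;   # long generations can exceed the 60s default
    }
}
EOF
}
```

One way to use it: generate a key with openssl rand -hex 32, write the function's output to /etc/nginx/conf.d/llm.conf, validate it with sudo nginx -t, reload Nginx, and open the HTTPS port with sudo ufw allow 443/tcp. Clients then send the key in an Authorization: Bearer header.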

Step 3: The Missing Link – Hugging Face Authentication

LLaMA 3 is a gated model. You cannot just pull it anonymously.

  1. Go to Hugging Face and create an account.
  2. Navigate to the meta-llama/Meta-Llama-3-8B-Instruct repository and agree to Meta's terms. (Approval usually takes less than 5 minutes.)
  3. Go to your Hugging Face Settings > Access Tokens and generate a Read token.
  4. Export this token to your server environment:

Bash
export HF_TOKEN="your_huggingface_token_here"
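
A quick way to confirm that the token works and that access to the gated repository has been granted is to request a small file from it. This is a minimal sketch, assuming HF_TOKEN is exported as above. The helper name check_hf_access is hypothetical, and the check relies on Hugging Face returning HTTP 200 for config.json only when the token is authorized.

```shell
# Return 0 if the current HF_TOKEN can read config.json from the given repo.
# Usage: check_hf_access <org/repo>
check_hf_access() {
  local repo="$1" code
  code="$(curl -s -L -o /dev/null -w '%{http_code}' \
    -H "Authorization: Bearer ${HF_TOKEN}" \
    "https://huggingface.co/${repo}/resolve/main/config.json")"
  [ "$code" = "200" ]
}
```

For example: check_hf_access meta-llama/Meta-Llama-3-8B-Instruct && echo "access OK". A failure here means the container in the next step would fail the same way, only after a longer wait.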

Step 4: Deploying LLaMA 3 for Production (Using vLLM)

Instead of installing endless Python dependencies directly onto your bare metal, use the official vLLM Docker container. This keeps your server clean.

Create a directory to store the model weights so you don't have to redownload them every time the container is recreated or the server restarts:

Bash
sudo mkdir -p /opt/models

Run the vLLM container (note the 127.0.0.1 binding for security and the --tensor-parallel-size argument). Prefix the command with sudo unless your user belongs to the docker group:

Bash
docker run -d --gpus all \
  -v /opt/models:/root/.cache/huggingface \
  -e "HF_TOKEN=$HF_TOKEN" \
  -p 127.0.0.1:8000:8000 \
  --ipc=host \
  vllm/vllm-openai:latest \
  --model meta-llama/Meta-Llama-3-8B-Instruct \
  --dtype bfloat16 \
  --max-model-len 8192 \
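  --tensor-parallel-size 1

  • -p 127.0.0.1:8000:8000: Publishes the API on the loopback interface only, so it is unreachable from outside the server even though Docker's iptables rules sit in front of UFW.
  • --ipc=host: vLLM runs on PyTorch, which uses shared memory to pass data between processes, particularly for tensor-parallel inference. Docker's default shared-memory segment is small (64 MB), so without this flag (or a larger --shm-size) worker processes can fail under concurrent load.
  • --tensor-parallel-size 1: This tells vLLM how many GPUs to split the model across. If you are running the 70B model on multiple GPUs, you MUST change this number to match the number of GPUs you want to use (e.g., --tensor-parallel-size 2 for a dual-GPU setup); otherwise the model will not fit and the deployment will fail. vLLM also requires the model's attention-head count (64 for LLaMA 3 70B) to be divisible by this value, so use 2, 4, or 8 rather than 3.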
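
Choosing the tensor-parallel value can be scripted. The sketch below counts the GPUs reported by nvidia-smi and picks the largest power of two that does not exceed that count. This keeps the value a divisor of the attention-head count for LLaMA 3 (32 heads for 8B, 64 for 70B). Both helper names are hypothetical, and the power-of-two rule is a simplifying assumption rather than a vLLM requirement.

```shell
# Count GPUs visible to the driver (nvidia-smi prints one line per GPU).
gpu_count() {
  nvidia-smi --query-gpu=name --format=csv,noheader | wc -l
}

# Largest power of two that does not exceed the given GPU count.
# Usage: pick_tp_size <gpu_count>
pick_tp_size() {
  local gpus="$1" tp=1
  while [ $(( tp * 2 )) -le "$gpus" ]; do
    tp=$(( tp * 2 ))
  done
  echo "$tp"
}
```

In the docker run command this would replace the hard-coded value, for example --tensor-parallel-size "$(pick_tp_size "$(gpu_count)")". On a three-GPU machine it yields 2, leaving one GPU unused rather than failing at startup.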

Your server is now hosting an OpenAI-compatible API locally on port 8000.
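
On first start the container has to download the weights, so the API may take several minutes to respond. The sketch below waits for the server to report healthy and then checks that port 8000 is listening on the loopback address only. It assumes vLLM's /health endpoint and the ss utility from iproute2. Both function names are hypothetical.

```shell
# Poll the /health endpoint until it returns 200 or the attempts run out.
# Usage: wait_for_vllm [attempts] [delay_seconds]
wait_for_vllm() {
  local attempts="${1:-60}" delay="${2:-10}" i code
  for i in $(seq 1 "$attempts"); do
    code="$(curl -s -o /dev/null -w '%{http_code}' http://127.0.0.1:8000/health)"
    [ "$code" = "200" ] && return 0
    sleep "$delay"
  done
  return 1
}

# Return 0 only if the port is listening and every listener is on loopback.
# Usage: port_is_local_only <port>
port_is_local_only() {
  local port="$1"
  # Field 4 of "ss -tln" output is the local address:port.
  ss -H -tln | awk -v p=":$port" '
    $4 ~ (p "$") {
      found = 1
      if ($4 !~ /^127\.0\.0\.1:/ && $4 !~ /^\[::1\]:/) bad = 1
    }
    END { if (found && !bad) exit 0; exit 1 }'
}
```

A typical check is wait_for_vllm && port_is_local_only 8000 && echo "API up and private". If the second function fails while the first succeeds, the port has been published on a public interface and the container should be recreated with the 127.0.0.1 binding.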

Step 5 (Alternative): The Quick Route (Using Ollama)

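If you don't need high-concurrency production throughput and just want the model running for internal team use in 60 seconds, use Ollama. Its model library ships pre-quantized builds (the default llama3 tag is a 4-bit 8B build at the time of writing), so there is no separate quantization step.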

Install Ollama (this automatically sets it up as a background systemd service):

Bash
curl -fsSL https://ollama.com/install.sh | sh

By default, Ollama listens only on localhost (127.0.0.1:11434). If you use Nginx to reverse proxy to it, you don't need to change anything.

Pull and run the model (this opens an interactive chat; type /bye to exit):

Bash
ollama run llama3
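
To confirm that the background service is up and the model was pulled, you can query Ollama's local HTTP API from a second terminal. This is a minimal sketch. The helper name ollama_has_model is hypothetical, and the grep pattern assumes the /api/tags response lists each model under a "name" key.

```shell
# Return 0 if the local Ollama service reports a model with the given name.
# Usage: ollama_has_model <model_name_without_tag>
ollama_has_model() {
  local model="$1"
  curl -s http://127.0.0.1:11434/api/tags \
    | grep -Eq "\"name\"[[:space:]]*:[[:space:]]*\"${model}[:\"]"
}
```

For example: ollama_has_model llama3 && echo "llama3 is available".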

Step 6: Testing Your AI Server

Whether you used vLLM or Ollama, you can test your local endpoint with a simple curl command. Run it from a terminal on the server itself, since the ports are bound to localhost.

If using vLLM (OpenAI Compatible):

Bash
curl http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3-8B-Instruct",
    "messages": [
      {"role": "user", "content": "Explain quantum computing in one sentence."}
    ]
  }'
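
Since concurrency is the reason for choosing vLLM, a single request is not much of a test. The sketch below sends several identical requests in parallel and prints one HTTP status code per request. The function name, the prompt, and the max_tokens value are illustrative assumptions. The endpoint and model name are the ones used above.

```shell
# Send N chat requests in parallel and print one HTTP status per request.
# Usage: concurrent_requests <n>
concurrent_requests() {
  local n="$1" i
  for i in $(seq 1 "$n"); do
    curl -s -o /dev/null -w '%{http_code}\n' \
      http://127.0.0.1:8000/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{"model": "meta-llama/Meta-Llama-3-8B-Instruct", "messages": [{"role": "user", "content": "Say hello."}], "max_tokens": 16}' &
  done
  wait   # block until every background curl has finished
}
```

Running concurrent_requests 8 | sort | uniq -c summarizes the results. A healthy server should report 200 for every request.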

If using Ollama:

Bash
curl http://127.0.0.1:11434/api/chat \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3",
    "messages": [
      {"role": "user", "content": "Explain quantum computing in one sentence."}
    ],
    "stream": false
  }'

Ready to Bring Your AI In-House?

Self-hosting LLaMA 3 keeps your company's proprietary data on your own infrastructure while eliminating unpredictable API billing. But software is only half the equation: you need bare-metal performance that won't buckle under heavy inference loads.

Looking for the perfect hardware to run your LLMs? Explore fitservers.com for high-performance, cost-effective dedicated GPU servers tailored for heavy AI workloads. Whether you need a single GPU for 8B models or a multi-GPU cluster for 70B parameter deployments, Fit Servers gives you the raw power and dedicated bandwidth required to keep your AI fast, secure, and always online.

Configure your AI server today!