Running Large Language Models (LLMs) used to require expensive NVIDIA GPUs. Today, thanks to the heavily optimized GGUF format and engines like llama.cpp (which powers Ollama), you can run highly capable quantized models on affordable CPU-only dedicated servers.

This guide walks through a hardened Ollama deployment on Ubuntu 24.04: the API stays firewalled and bound to localhost, and remote access goes through an SSH tunnel.

The Reality Check: What Most Tutorials Get Wrong

If you asked an AI or read a standard blog post on how to set this up, you likely received dangerous or counterproductive advice. Here is what other guides consistently get wrong:

The OLLAMA_HOST=0.0.0.0 Security Trap: Almost every tutorial tells you to expose the Ollama API to 0.0.0.0 so you can reach it remotely. Do not do this without a firewall. Ollama has zero built-in authentication. If you expose port 11434 to the internet, anyone can query your server, consume your CPU, and access your locally downloaded models. We will use secure SSH tunneling instead.
The "More Threads = More Speed" Hallucination: It is a common misconception to set OLLAMA_NUM_THREADS to your CPU's maximum logical core count. In CPU inference, the two logical threads of a hyperthreaded core share the same execution units and caches, so the extra threads mostly add contention and tend to slow down token generation. You should align your thread count with your physical cores, not logical threads.
Ignoring the Memory Bandwidth Bottleneck: CPU clock speed matters less than RAM speed. CPU inference is bottlenecked by memory bandwidth. A server with DDR5 memory will significantly outperform one with DDR4, even if the DDR4 server has a faster processor.
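The bandwidth argument can be made concrete with a back-of-envelope bound. The sketch below assumes that generating one token streams the full set of quantized weights from RAM once, so tokens per second cannot exceed bandwidth divided by model size. The helper name estimate_tps and the sample figures (theoretical peak bandwidth for dual-channel DDR4-3200 and DDR5-4800, and a 4.7 GB model file) are illustrative assumptions, not measurements.

```shell
# Rough upper bound on generation speed, assuming every token requires
# reading all model weights from RAM once (ignores caches and compute).
# estimate_tps is a hypothetical helper, not part of Ollama.
estimate_tps() {
  local bandwidth_gbs="$1"  # memory bandwidth in GB/s
  local model_gb="$2"       # size of the quantized weights in GB
  awk -v bw="$bandwidth_gbs" -v sz="$model_gb" 'BEGIN { printf "%.1f\n", bw / sz }'
}

# Theoretical peak bandwidth = channels x 8 bytes x transfer rate (MT/s).
echo "DDR4-3200 dual channel: at most $(estimate_tps 51.2 4.7) tokens/s"
echo "DDR5-4800 dual channel: at most $(estimate_tps 76.8 4.7) tokens/s"
```

Real throughput lands below these ceilings, but the ratio between the two memory types is what matters when comparing servers.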

Prerequisites

A dedicated server or VPS running Ubuntu 24.04 LTS.
RAM: 8 GB absolute minimum (for 1B–3B models); 16 GB recommended for 7B–8B models; 32 GB for 14B models.
A modern CPU with AVX2 support, ideally AVX-512 (Intel 12th Gen+ or AMD Ryzen/EPYC all provide AVX2; AVX-512 availability varies by model).
Root or sudo access.
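A quick way to check the CPU and RAM prerequisites on a Linux host is to read /proc/cpuinfo and /proc/meminfo. This is a minimal sketch; the helper names simd_flags and total_ram_gb are made up for illustration, and avx512f is used here as the marker for AVX-512 support.

```shell
# Report which SIMD extensions the CPU advertises and how much RAM is installed.
# Both helpers accept an optional file argument so they can read a saved copy.
simd_flags() {
  # Print avx2 and/or avx512f once each if they appear in the CPU flags.
  grep -o -w -E 'avx2|avx512f' "${1:-/proc/cpuinfo}" | sort -u | paste -sd ' ' -
}

total_ram_gb() {
  # MemTotal is reported in kB; convert to GiB, rounded to the nearest integer.
  awk '/^MemTotal:/ { printf "%d\n", $2 / 1024 / 1024 + 0.5 }' "${1:-/proc/meminfo}"
}

if [ -r /proc/cpuinfo ] && [ -r /proc/meminfo ]; then
  echo "SIMD: $(simd_flags)"
  echo "RAM:  $(total_ram_gb) GiB"
fi
```

An empty SIMD line means neither extension was found, in which case inference will be noticeably slower.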

Deployment Walkthrough

1. Update System and Lock Down the Firewall

Crucial for preventing unauthorized API access.

Before installing anything, update your package lists and ensure your firewall blocks public access to the default Ollama port.

sudo apt update && sudo apt upgrade -y
sudo apt install -y curl ufw

Configure Uncomplicated Firewall (UFW) to allow SSH access, but strictly deny port 11434 (Ollama's API) from the outside world.

sudo ufw allow 22/tcp
sudo ufw deny 11434/tcp
sudo ufw --force enable
sudo ufw status numbered

Note:

If your server uses a custom SSH port instead of the default port 22, ensure you allow that specific port before enabling the firewall.

2. Install Ollama via the Official Script

Deploys the runtime and creates a dedicated ollama system user and group.

Ollama provides a streamlined installation script. Because your server lacks a GPU, Ollama will automatically detect this and configure itself for CPU-only inference.

curl -fsSL https://ollama.com/install.sh | sh

Verify the installation and ensure the service is active:

ollama --version
systemctl status ollama --no-pager

3. Optimize Systemd for CPU Inference

Tuning thread allocation for maximum tokens per second.

To prevent hyperthreading overhead from slowing down your LLM, match Ollama's thread count to your physical cores.

First, check your physical core count per CPU.

Important note for dual-socket servers: Do NOT multiply this number by 2. Most dual-socket servers use a NUMA (Non-Uniform Memory Access) architecture. Crossing NUMA boundaries introduces significant memory latency because data must traverse the inter-socket communication fabric. It is highly recommended to set OLLAMA_NUM_THREADS to the physical core count of a single CPU socket for maximum tokens-per-second performance.

lscpu | grep "Core(s) per socket"
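If you prefer a number you can feed straight into a script, the parseable output of lscpu can be reduced to a physical core count for one socket. This is a sketch under the assumption that `lscpu -p=CORE,SOCKET` prints one "core,socket" line per logical CPU, so hyperthread siblings repeat the same pair; cores_on_socket is a hypothetical helper name.

```shell
# Count physical cores on one socket from `lscpu -p=CORE,SOCKET` output.
# Lines starting with '#' are comments; every other line is "core,socket"
# for one logical CPU, so duplicates are hyperthread siblings.
cores_on_socket() {
  local socket="${1:-0}"
  awk -F, -v s="$socket" \
    '!/^#/ && $2 == s && !seen[$1]++ { n++ } END { print n + 0 }'
}

if command -v lscpu >/dev/null 2>&1; then
  echo "Physical cores on socket 0: $(lscpu -p=CORE,SOCKET | cores_on_socket 0)"
fi
```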

Next, create a systemd override file for the Ollama service:

sudo systemctl edit ollama.service

Add the following lines between the ### comment blocks, replacing [YOUR_PHYSICAL_CORES] with the number you found above (e.g., 8):

[Service]
Environment="OLLAMA_NUM_THREADS=[YOUR_PHYSICAL_CORES]"
Environment="OLLAMA_HOST=127.0.0.1:11434"

Reload systemd and restart the service:

sudo systemctl daemon-reload
sudo systemctl restart ollama
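After the restart it is worth confirming that the override was picked up and that the API listens on the loopback interface only. The sketch below prints the environment systemd passes to the service and inspects the listening sockets; listens_only_on_loopback is a hypothetical helper that reads `ss -ltn` output, where the fourth column is the local address and port.

```shell
# Succeeds only if something listens on port 11434 and every such listener
# is bound to a loopback address. The ss header line is ignored because
# its fourth column does not end in :11434.
listens_only_on_loopback() {
  awk '$4 ~ /:11434$/ {
         found = 1
         if ($4 !~ /^(127\.0\.0\.1|\[::1\]):/) bad = 1
       }
       END { exit !(found && !bad) }'
}

if command -v ss >/dev/null 2>&1 && command -v systemctl >/dev/null 2>&1; then
  # Show the environment variables applied to the unit after the override.
  systemctl show ollama --property=Environment
  if ss -ltn | listens_only_on_loopback; then
    echo "OK: port 11434 is reachable from localhost only"
  else
    echo "WARNING: port 11434 is not listening or is bound to a non-loopback address"
  fi
fi
```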

4. Pull and Run a Quantized GGUF Model

Deploying RAM-efficient AI models.

Ollama runs models in the GGUF format, and the default tags in its library are usually published already quantized to 4 bits (typically Q4_K_M). This drastically reduces the RAM footprint of models without severely impacting their reasoning ability.

Llama 3.2 (3B) is an excellent choice for an 8 GB server, and Qwen 2.5 (7B) for a 16 GB server.

# Pull and run the model interactively
ollama run qwen2.5:7b

You are now in the interactive chat prompt. Type a message to test the generation speed. You should expect roughly 5 to 15 tokens per second depending on your CPU generation and RAM speed. Type /bye to exit.
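To get a number instead of a feeling, you can run the model with `ollama run qwen2.5:7b --verbose`, which prints timing statistics after each reply, or compute the rate from the API. The sketch below relies on the eval_count (generated tokens) and eval_duration (nanoseconds) fields that a non-streaming /api/generate response reports; tokens_per_sec is a hypothetical helper and the prompt is arbitrary.

```shell
# Compute tokens per second from a non-streaming /api/generate response
# read on stdin: eval_count / eval_duration (ns) * 1e9.
tokens_per_sec() {
  local json count duration
  json="$(cat)"
  count="$(printf '%s' "$json" | grep -oE '"eval_count":[[:space:]]*[0-9]+' | grep -oE '[0-9]+$')"
  duration="$(printf '%s' "$json" | grep -oE '"eval_duration":[[:space:]]*[0-9]+' | grep -oE '[0-9]+$')"
  awk -v c="$count" -v d="$duration" \
    'BEGIN { if (d > 0) printf "%.1f\n", c / d * 1e9; else print "n/a" }'
}

# Usage on the server (requires the model from this step to be pulled):
#   curl -s http://127.0.0.1:11434/api/generate \
#     -d '{"model": "qwen2.5:7b", "prompt": "Explain NUMA in one sentence.", "stream": false}' \
#     | tokens_per_sec
```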

5. Access the API Securely via SSH Tunneling

Zero-trust remote access from your local machine.

Since we blocked external access to port 11434 in Step 1 and bound Ollama to 127.0.0.1 in Step 3, you cannot hit the API directly via your server's IP. Instead, establish an SSH tunnel from your local computer.

Run this command on your local machine (not the server):

ssh -N -L 11434:127.0.0.1:11434 your_user@your_server_ip

Leave that terminal window running. You can now use tools like Open WebUI, AnythingLLM, or standard curl commands on your local machine by pointing them to http://localhost:11434. The traffic is securely tunneled to your dedicated server.
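A simple way to confirm the tunnel works is to list the models installed on the server through it. This sketch assumes the /api/tags endpoint, which returns the locally available models as JSON; list_model_names is a hypothetical helper that pulls out the name fields without requiring jq.

```shell
# Extract model names from an /api/tags response read on stdin.
list_model_names() {
  grep -oE '"name":[[:space:]]*"[^"]+"' | sed -E 's/.*"([^"]+)"$/\1/'
}

# Usage on the local machine, with the SSH tunnel open:
#   curl -s http://localhost:11434/api/tags | list_model_names
```

If the command prints nothing, check that the tunnel terminal is still open and that at least one model has been pulled on the server.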

Note for Production Apps

If you are deploying your own application (like a Node.js backend, a Python script, or a hosted UI) directly on this same dedicated server, you do not need an SSH tunnel. Because your app and Ollama share the same environment, your code can securely communicate with the API directly via http://127.0.0.1:11434.
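For an application on the same server, a request looks like the sketch below. It also shows the per-request num_thread option from Ollama's API, which is one way to set the thread count explicitly if the environment variable from Step 3 does not appear to take effect on your version. build_payload is a hypothetical helper, and the thread count of 8 is only an example value.

```shell
# Build a /api/generate request body that sets the thread count per request.
# No JSON escaping is performed, so the model and prompt must not contain
# double quotes or backslashes.
build_payload() {
  local model="$1" threads="$2" prompt="$3"
  printf '{"model": "%s", "prompt": "%s", "stream": false, "options": {"num_thread": %d}}\n' \
    "$model" "$prompt" "$threads"
}

# Usage from an app or script on the same server:
#   curl -s http://127.0.0.1:11434/api/generate \
#     -d "$(build_payload qwen2.5:7b 8 'Say hello.')"
```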

Frequently Asked Questions (FAQ)

Can I run 70B parameter models on a CPU?

Technically, yes. A 4-bit 70B model needs roughly 40 GB for its weights alone, so if your dedicated server has 64 GB of RAM, the model will load. However, the inference speed will likely be painfully slow (around 1 to 2 tokens per second). CPU-only servers are best utilized for models in the 1.5B to 14B parameter range.

Why does my server freeze when generating a response?

You likely ran out of physical RAM, forcing the server to use "swap" space on disk. Swap is orders of magnitude slower than RAM, so generation stalls while memory pages are moved back and forth. If this happens, switch to a smaller model (e.g., from a 14B model to a 7B model).

Pro Tip: On dedicated AI inference servers, consider disabling swap entirely (sudo swapoff -a, plus commenting out the swap entry in /etc/fstab so the change survives a reboot). If a model exceeds available RAM, Linux will terminate the Ollama process through the OOM (Out Of Memory) Killer rather than allowing the system to enter severe swap thrashing. This protects the operating system and typically results in a faster recovery and a more stable server environment.
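Before loading a model, you can compare its size against the memory the kernel reports as available. This is a minimal sketch: fits_in_ram is a hypothetical helper, and the default 2 GB of headroom for the context cache and the rest of the system is an assumption you should adjust.

```shell
# Succeeds when MemAvailable covers the model size plus some headroom.
# Arguments: model size in GB, headroom in GB (default 2), meminfo path.
fits_in_ram() {
  local model_gb="$1" headroom_gb="${2:-2}" meminfo="${3:-/proc/meminfo}"
  awk -v need="$model_gb" -v extra="$headroom_gb" \
    '/^MemAvailable:/ { avail = $2 / 1024 / 1024 }
     END { exit !(avail >= need + extra) }' "$meminfo"
}

# Usage: only start a 4.7 GB model if it is likely to fit.
#   fits_in_ram 4.7 && ollama run qwen2.5:7b
```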

What is GGUF and why does it matter for CPUs?

GGUF (GPT-Generated Unified Format) is the model file format used by the llama.cpp engine, which is heavily optimized for CPU inference. It supports "quantization": rounding the model's weights from 16-bit precision down to roughly 4 bits each. This shrinks the model's memory requirement by roughly 70%, usually with only a small loss in output quality.
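The size reduction follows from simple arithmetic: weight size is the parameter count times the bits per weight, divided by 8. The sketch below uses 4.5 bits per weight as an illustrative assumption, since 4-bit formats also store scaling data and the effective width varies by quantization type. The helper names are hypothetical.

```shell
# Approximate weight size in GB: parameters (billions) x bits per weight / 8.
# Excludes the context cache and runtime overhead, so real usage is higher.
weights_gb() {
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b / 8 }'
}

# Percentage saved relative to 16-bit weights.
savings_pct() {
  awk -v b="$1" 'BEGIN { printf "%.0f\n", (1 - b / 16) * 100 }'
}

echo "7B at 16 bits:    $(weights_gb 7 16) GB"
echo "7B at ~4.5 bits:  $(weights_gb 7 4.5) GB ($(savings_pct 4.5)% smaller)"
echo "70B at ~4.5 bits: $(weights_gb 70 4.5) GB"
```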

Do I need a GPU for embeddings or RAG?

No. You can run embedding models entirely on your CPU using Ollama. For example, running ollama pull nomic-embed-text will download a compact embedding model that works well for Retrieval-Augmented Generation (RAG) pipelines on CPU-only servers.
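Once the embedding model is pulled, a request to the /api/embed endpoint returns the vectors as JSON. The sketch below checks that a vector came back by counting the dimensions of the first embedding. embedding_dim is a hypothetical helper, the input sentence is arbitrary, and a real pipeline would parse the JSON properly and store the vector.

```shell
# Count the dimensions of the first embedding in an /api/embed response
# read on stdin. Prints 0 if no "embeddings" array is found.
embedding_dim() {
  sed -nE 's/.*"embeddings":[[:space:]]*\[\[([^]]*)\].*/\1/p' | tr ',' '\n' | grep -c .
}

# Usage on the server, after `ollama pull nomic-embed-text`:
#   curl -s http://127.0.0.1:11434/api/embed \
#     -d '{"model": "nomic-embed-text", "input": "CPU inference is bandwidth bound."}' \
#     | embedding_dim
```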

How to Install Ollama on a Dedicated Server (Ubuntu 24.04): Running Quantized LLMs on CPU-Only Hardware (GGUF)
