Running LLMs Locally with Docker Model Runner

Running large language models locally has never been easier. Docker Model Runner brings the simplicity of container workflows to AI models - pull, run, and interact with LLMs using familiar Docker commands. This guide walks you through everything you need to get started.

Why Run LLMs Locally?

Before diving in, here's why local LLMs matter:

•Privacy: Your data never leaves your machine
•No API costs: Zero usage fees or rate limits
•Offline access: Works without internet connectivity once the models are downloaded
•Low latency: No network round-trips

Prerequisites

You'll need:

•Docker Desktop 4.40+ (with Model Runner support)
•8GB RAM minimum (32GB recommended for larger models)
•20GB+ free disk space for model storage

Verify your Docker installation:

bash

docker --version

Enabling Docker Model Runner

Open Docker Desktop and navigate to Settings > AI. Enable Docker Model Runner and check host-side TCP support to access the API from your host machine. The default port is 12434, which you can change if needed. For CORS allowed origins, set it to All to allow requests from any origin, or configure a custom value if you want to restrict access to specific domains.

Click Apply and wait for Docker to restart. The inference engine may take a few minutes to initialize.

You can verify it's working by running:

bash

docker model list

# If the command isn't recognized on macOS, you may need to link the CLI plugin:
ln -s /Applications/Docker.app/Contents/Resources/cli-plugins/docker-model ~/.docker/cli-plugins/docker-model
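If `docker model list` works but API calls from your own code fail, it can help to confirm that something is actually listening on the TCP port. Here is a minimal sketch using only the Python standard library. It assumes the default port 12434 mentioned above, and `is_port_open` is a hypothetical helper name, not part of Docker's tooling:

```python
import socket


def is_port_open(host: str = "localhost", port: int = 12434, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolved host names.
        return False


if __name__ == "__main__":
    state = "reachable" if is_port_open() else "not reachable"
    print(f"Model Runner port 12434 is {state}")
```

A successful connection only tells you the port is open. It does not prove the inference engine has finished initializing.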

Understanding Inference Engines

Docker Model Runner supports two inference engines:

llama.cpp is the default and works everywhere - macOS (Apple Silicon), Windows, and Linux. It's optimized for running quantized models efficiently on consumer hardware.

vLLM is designed for production workloads with high throughput requirements. It requires NVIDIA GPUs on Linux x86_64 or Windows WSL2.

For most local development, llama.cpp is the right choice.

Model Selection Guide

Before pulling models, it helps to understand what's available. You can browse models directly in Docker Desktop by navigating to Models > Docker Hub. This gives you a visual catalog with descriptions, download counts, and one-click pulling.

Choosing the right model depends on your hardware and use case:

Lightweight Models (Under 3B parameters)

Fast responses, minimal resource usage. Ideal for development, testing, and edge devices.

Model	Parameters	Size	Best For
`ai/smollm2`	360M	~256MB	Quick prototyping, constrained devices
`ai/llama3.2:1B`	1B	~1.3GB	On-device apps, summarization
`ai/llama3.2:3B`	3B	~2GB	Instruction following, tool calling

bash

# SmolLM2 - Ultra-lightweight, great for testing
docker model pull ai/smollm2

# Llama 3.2 1B - Best quality for size
docker model pull ai/llama3.2:1B-Q8_0

# Llama 3.2 3B - Outperforms many larger models on specific tasks
docker model pull ai/llama3.2:3B-Q4_K_M

Mid-Range Models (7B-14B parameters)

Better quality, requires more resources. Good for most practical applications.

Model	Parameters	Size	Best For
`ai/mistral`	7B	~4.4GB	General purpose, enterprise use
`ai/gemma3`	4B (default tag)	~2.5GB	Research, reasoning tasks
`ai/phi4`	14B	~9GB	Complex reasoning
`ai/qwen2.5`	7B	~4.4GB	Multilingual, coding

bash

# Mistral 7B - Excellent all-around model
docker model pull ai/mistral

# Qwen 2.5 - Great for code generation
docker model pull ai/qwen2.5

Large Models (70B+ parameters)

Highest quality output, requires powerful hardware. The whole model has to fit in memory, so a ~42GB download needs well over 32GB of RAM or VRAM (64GB+ is a safer target).

Model	Parameters	Size	Best For
`ai/llama3.3`	70B	~42GB	Complex tasks, near-frontier quality
`ai/deepseek-r1-distill-llama`	70B	~42GB (the ~5GB default tag is the 8B variant)	Advanced reasoning

bash

# Llama 3.3 70B - State-of-the-art open model
docker model pull ai/llama3.3

Understanding Quantization

Model names often include a quantization suffix (like Q4_K_M or Q8_0) that indicates compression level:

•Q8_0: Highest quality, largest file size
•Q4_K_M: Good balance of quality and efficiency (recommended for most users)
•Q4_0: Smallest size, slightly lower quality

For most use cases, Q4_K_M offers the best trade-off between quality and performance.
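To get a feel for how quantization affects download size, you can estimate it from the parameter count and the average number of bits stored per weight. The sketch below is a rough approximation, not an official formula. The bits-per-weight figures are assumptions based on the llama.cpp block formats (Q4_K_M in particular is an average that varies by model), and `estimate_size_gb` is a hypothetical helper:

```python
# Approximate bits stored per weight for common llama.cpp quantization formats.
# Q8_0 and Q4_0 follow from their block layout; Q4_K_M is an average that
# varies slightly between models, so treat all of these as assumptions.
BITS_PER_WEIGHT = {
    "Q8_0": 8.5,
    "Q4_K_M": 4.85,
    "Q4_0": 4.5,
}


def estimate_size_gb(params_billions: float, quant: str) -> float:
    """Estimate model file size in GB (10^9 bytes) from the parameter count."""
    bits = BITS_PER_WEIGHT[quant]
    return params_billions * bits / 8


if __name__ == "__main__":
    for quant in BITS_PER_WEIGHT:
        print(f"7B at {quant}: ~{estimate_size_gb(7, quant):.1f} GB")
```

For a 70B model at Q4_K_M this gives roughly 42 GB, which is in line with the size listed for `ai/llama3.3`. Real files run a little larger or smaller because parameter counts are rounded and some tensors are stored at higher precision.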

Pulling Your First Model

Models are pulled from Docker Hub, just like container images:

bash

# Pull a lightweight model for testing
docker model pull ai/smollm2

# Or pull Llama 3.2 1B for better quality
docker model pull ai/llama3.2:1B-Q8_0

# List your downloaded models
docker model list

Models are cached locally after the first download, so subsequent runs skip the download and only need to load the model into memory.

Running a Model

Start a model with a simple command:

bash

docker model run ai/llama3.2:1B-Q8_0

This starts an interactive chat session in your terminal. Type your prompts and get responses directly.

Interacting via API

Docker Model Runner provides OpenAI-compatible, Anthropic-compatible, and Ollama-compatible APIs. This means you can point existing SDKs and tools at it by changing only the base URL.

API Endpoints

From	Base URL
Host machine	`http://localhost:12434`
Inside containers	`http://model-runner.docker.internal`

For OpenAI SDK compatibility, use: http://localhost:12434/engines/v1

Using cURL

bash

curl http://localhost:12434/engines/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ai/smollm2",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is Docker?"}
    ],
    "temperature": 0.7,
    "max_tokens": 500
  }'
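If you'd rather not install an SDK, the same request can be sent with Python's standard library. This is a minimal sketch that mirrors the cURL call above. `build_chat_request` and `ask` are hypothetical helper names, and the response parsing assumes the usual OpenAI-style `choices[0].message.content` shape:

```python
import json
import urllib.request

BASE_URL = "http://localhost:12434/engines/v1"


def build_chat_request(prompt: str, model: str = "ai/smollm2") -> urllib.request.Request:
    """Build a POST request equivalent to the cURL example."""
    payload = {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt},
        ],
        "temperature": 0.7,
        "max_tokens": 500,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=120) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Usage (requires Model Runner to be running with the model pulled):
# print(ask("What is Docker?"))
```

Building the request is kept separate from sending it, so you can inspect the URL and payload without a running server.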

Using Python with OpenAI SDK

Since the API is OpenAI-compatible, you can use the official OpenAI Python library:

python

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:12434/engines/v1",
    api_key="not-needed"  # No API key required for local models
)

response = client.chat.completions.create(
    model="ai/smollm2",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain Docker containers in simple terms."}
    ],
    temperature=0.7,
    max_tokens=500
)

print(response.choices[0].message.content)

Streaming Responses

For real-time output, enable streaming:

python

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:12434/engines/v1",
    api_key="not-needed"
)

stream = client.chat.completions.create(
    model="ai/smollm2",
    messages=[
        {"role": "user", "content": "Write a haiku about coding."}
    ],
    stream=True
)

for chunk in stream:
    # Some chunks carry no choices or no text; skip them
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()  # Finish with a newline
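When you stream, you often still want the complete reply afterwards, for example to store it in a conversation history. The sketch below shows one way to accumulate the deltas. `collect_stream` is a hypothetical helper, and `fake_chunk` is only a stand-in for the SDK's chunk objects so the example runs without a server:

```python
from types import SimpleNamespace
from typing import Iterable


def collect_stream(chunks: Iterable) -> str:
    """Join the text deltas of a streamed chat completion into one string."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:
            continue  # Skip chunks that carry no choices
        delta = chunk.choices[0].delta.content
        if delta:
            parts.append(delta)
    return "".join(parts)


def fake_chunk(text):
    """Stand-in for an SDK chunk, used only to demonstrate the helper offline."""
    return SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=text))])


if __name__ == "__main__":
    demo = [fake_chunk("Hello"), fake_chunk(None), fake_chunk(", world")]
    print(collect_stream(demo))
```

With a real stream you would pass the object returned by `client.chat.completions.create(..., stream=True)` straight to `collect_stream`.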

Building a Simple Chatbot

Let's build a complete example - a terminal chatbot that maintains conversation history:

python

from openai import OpenAI

def create_chatbot():
    client = OpenAI(
        base_url="http://localhost:12434/engines/v1",
        api_key="not-needed"
    )

    messages = [
        {"role": "system", "content": "You are a helpful AI assistant. Be concise and friendly."}
    ]

    print("Local LLM Chatbot (type 'quit' to exit)")
    print("-" * 40)

    while True:
        user_input = input("\nYou: ").strip()

        if user_input.lower() in ['quit', 'exit', 'q']:
            print("Goodbye!")
            break

        if not user_input:
            continue

        messages.append({"role": "user", "content": user_input})

        try:
            response = client.chat.completions.create(
                model="ai/smollm2",
                messages=messages,
                temperature=0.7,
                max_tokens=1000
            )

            assistant_message = response.choices[0].message.content
            messages.append({"role": "assistant", "content": assistant_message})

            print(f"\nAssistant: {assistant_message}")

        except Exception as e:
            print(f"\nError: {e}")
            messages.pop()  # Remove failed user message

if __name__ == "__main__":
    create_chatbot()

Save this as chatbot.py and run it:

bash

pip install openai
python chatbot.py
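One thing to watch in a chatbot like this is that the history grows with every turn, and small local models have limited context windows. A simple mitigation is to keep the system prompt plus only the most recent exchanges. The sketch below is one possible approach. `trim_history` and its `max_turns` default are hypothetical choices, not something Docker Model Runner requires:

```python
from typing import Dict, List

Message = Dict[str, str]


def trim_history(messages: List[Message], max_turns: int = 10) -> List[Message]:
    """Keep all system messages plus the most recent non-system messages.

    max_turns counts user/assistant pairs, so up to 2 * max_turns
    non-system messages are retained.
    """
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    # Guard against max_turns == 0, since rest[-0:] would keep everything.
    keep = rest[-2 * max_turns:] if max_turns > 0 else []
    return system + keep
```

In the chatbot loop you could pass `messages=trim_history(messages)` to `client.chat.completions.create` while still keeping the full list for display or logging.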

Using from Docker Containers

If your application runs in a Docker container, use the internal hostname:

python

from openai import OpenAI

client = OpenAI(
    base_url="http://model-runner.docker.internal/engines/v1",
    api_key="not-needed"
)

This allows containerized applications to access the local LLM without exposing ports to the host.
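If the same code needs to run both on the host and inside a container, you can choose the base URL at startup instead of hard-coding it. The sketch below is one way to do that. The environment variable names `LLM_BASE_URL` and `RUNNING_IN_CONTAINER` are hypothetical and chosen for this example only; Docker does not set them for you:

```python
import os

HOST_URL = "http://localhost:12434/engines/v1"
CONTAINER_URL = "http://model-runner.docker.internal/engines/v1"


def resolve_base_url(env=None) -> str:
    """Pick the API base URL from the environment.

    LLM_BASE_URL and RUNNING_IN_CONTAINER are hypothetical variable names
    used for illustration; set them yourself (for example in a Compose file).
    """
    env = os.environ if env is None else env
    if env.get("LLM_BASE_URL"):
        return env["LLM_BASE_URL"]  # An explicit override always wins
    if env.get("RUNNING_IN_CONTAINER") == "1":
        return CONTAINER_URL
    return HOST_URL


if __name__ == "__main__":
    print(resolve_base_url())
```

You would then create the client with `OpenAI(base_url=resolve_base_url(), api_key="not-needed")`.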

Troubleshooting

Model won't start

bash

# Check if Docker Model Runner is enabled and running
docker model status
docker model list

# Check the Model Runner logs for errors
docker model logs

Slow responses

•Try a smaller model or a more aggressive quantization (Q4 instead of Q8)
•Ensure no other heavy processes are running
•Check if GPU acceleration is being used
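To compare models or quantization levels objectively, it helps to measure generation speed rather than judge by feel. The sketch below computes tokens per second from a token count and a wall-clock time. `tokens_per_second` and `time_call` are hypothetical helpers, and reading the count from the response's `usage.completion_tokens` assumes the server fills in that field:

```python
import time
from typing import Callable, Tuple


def tokens_per_second(completion_tokens: int, elapsed_seconds: float) -> float:
    """Return generation throughput in tokens per second."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return completion_tokens / elapsed_seconds


def time_call(generate: Callable[[], int]) -> Tuple[int, float]:
    """Run generate(), which must return a token count, and time it."""
    start = time.perf_counter()
    tokens = generate()
    return tokens, time.perf_counter() - start


# Usage with the OpenAI SDK (requires a running model):
# tokens, elapsed = time_call(
#     lambda: client.chat.completions.create(
#         model="ai/smollm2",
#         messages=[{"role": "user", "content": "Count to twenty."}],
#     ).usage.completion_tokens
# )
# print(f"{tokens_per_second(tokens, elapsed):.1f} tokens/s")
```

The first request after a model loads includes load time, so take the measurement from a second or third request.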

Out of memory

•Use a smaller model
•Increase Docker's memory allocation in Docker Desktop settings
•Try a more aggressive quantization level

Conclusion

Docker Model Runner makes local LLM development accessible to everyone. With just a few commands, you can pull and run various open-source models, integrate with applications using familiar OpenAI-compatible APIs, and develop and test without cloud dependencies or costs.

Start with a small model like ai/smollm2 or ai/llama3.2:1B-Q8_0 to get familiar with the workflow, then scale up to larger models as your needs grow.

The combination of Docker's simplicity and local LLM capabilities opens up possibilities for privacy-focused applications, offline tools, and cost-effective AI development. The learning curve is minimal if you already know Docker.

The only question is what you will build with it!


Ibrahim Shittu