Using LLMs Locally via Ollama
Running large language models (LLMs) locally provides privacy, control, and cost savings compared to cloud-based AI services. Ollama simplifies the process of deploying and managing LLMs in containerized environments, making it accessible for developers and researchers who want to run AI models on their own infrastructure.
This guide covers the complete setup process for Ollama, from basic installation to advanced model customization using Hugging Face models. You’ll learn how to deploy models in containers, access them through various interfaces, and customize models for specific use cases.
Prerequisites
Before starting with Ollama, ensure you have:
- A Linux system with container runtime support
- Sufficient system resources (8GB+ RAM recommended for most models)
- GPU support (optional but recommended for performance)
- Network access for downloading models
This guide assumes you have containerd configured as your container runtime. If you need to set up containerd in rootless mode, see our comprehensive containerd setup guide which covers installation, configuration, and Docker migration strategies.
Core Ollama Setup
Running the Ollama Container
The foundation of using Ollama locally is deploying the main container that serves as the model runtime environment. This container handles model loading, inference, and API endpoints.
nerdctl run --name ollama --rm \
-v /vault/ai/ollama:/root/.ollama \
-p 11434:11434 \
--gpus all \
ollama/ollama
Command breakdown:
- --name ollama: Names the container for easy reference
- --rm: Automatically removes the container when stopped
- -v /vault/ai/ollama:/root/.ollama: Mounts a local directory for model storage
- -p 11434:11434: Exposes Ollama’s default API port
- --gpus all: Enables GPU acceleration (requires the NVIDIA Container Runtime)
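Once the container is up, confirm the API is reachable from the host. The root endpoint simply reports server status:
curl http://localhost:11434
If the container started correctly, this should print a short “Ollama is running” message.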
Basic Ollama Operations
Once the container is running, you can interact with Ollama using the built-in CLI tools. Access the help system to explore available commands:
nerdctl exec -it ollama ollama help
This command provides a comprehensive list of available operations including model management, inference, and system administration functions.
Model Management
Downloading Pre-built Models
Ollama provides access to a curated library of optimized language models. Browse the complete model catalog at the official Ollama library to find models suitable for your use case.
Download a model using the pull command:
nerdctl exec -it ollama ollama pull llama2
Popular models include:
- llama2: Meta’s Llama 2 model family
- codellama: Specialized for code generation
- mistral: Efficient general-purpose model
- neural-chat: Optimized for conversational AI
Running Models Interactively
After downloading a model, start an interactive session:
nerdctl exec -it ollama ollama run llama2
This launches a chat interface where you can interact directly with the model through the command line. The interface provides real-time responses and maintains conversation context throughout the session.
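The same models can also be queried programmatically over Ollama’s HTTP API, which is what the web and terminal frontends described below use under the hood. As a minimal sketch, a non-streaming generation request against the default endpoint looks like this (the prompt text is just an example):
curl http://localhost:11434/api/generate -d '{
  "model": "llama2",
  "prompt": "Explain what a container image is in one sentence.",
  "stream": false
}'
The response is a JSON object whose response field contains the generated text; drop "stream": false to receive incremental newline-delimited JSON chunks instead.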
Web Interface Integration
Setting Up Ollama WebUI
For users who prefer graphical interfaces, the Ollama WebUI provides a modern web-based frontend for model interaction. This interface offers features like conversation history, model switching, and enhanced formatting.
Clone the WebUI repository:
git clone https://github.com/ollama-webui/ollama-webui
cd ollama-webui
Build the WebUI container image:
nerdctl build -t ollama-webui -f Dockerfile .
Deploy the WebUI container with proper networking:
nerdctl run --name ollama-webui \
--env OLLAMA_API_BASE_URL=http://ollama:11434/api \
-p 3000:8080 \
--rm -it \
ollama-webui
Configuration details:
- OLLAMA_API_BASE_URL: Points to the Ollama API endpoint
- Port 3000: External access port for the web interface
- Port 8080: Internal WebUI application port
Access the interface at http://localhost:3000 to begin using the graphical interface.
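Note that OLLAMA_API_BASE_URL above uses the hostname ollama, which only resolves if both containers are attached to a network where container names are resolvable. One way to arrange this, assuming a recent nerdctl and an arbitrary network name of ollama-net, is a user-defined network:
nerdctl network create ollama-net
Then add --network ollama-net to both the ollama and ollama-webui run commands shown earlier. Alternatively, point OLLAMA_API_BASE_URL at the host’s IP address and published port 11434 instead of the container name.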
Terminal User Interface (TUI)
Installing and Using Oterm
For users who prefer terminal-based interfaces with enhanced features, Oterm provides a sophisticated TUI for Ollama interaction. This tool offers features like syntax highlighting, conversation management, and model switching within a terminal environment.
Set up a Python virtual environment for Oterm:
python -m venv venv
source venv/bin/activate
pip install oterm
Launch Oterm with the virtual environment active:
oterm
Oterm automatically detects running Ollama instances and provides an intuitive interface for model interaction with additional features like:
- Conversation history browsing
- Multi-model session management
- Export capabilities for conversations
- Customizable themes and layouts
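By default, Oterm looks for Ollama on localhost at port 11434. If your container is published elsewhere, it can usually be pointed at the right endpoint through the OLLAMA_HOST environment variable, the same variable the Ollama CLI uses; treat this as a version-dependent detail and check the Oterm README if it does not take effect:
export OLLAMA_HOST=127.0.0.1:11434
oterm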
Custom Model Integration
Using Hugging Face Models
Extend Ollama’s capabilities by integrating custom models from Hugging Face. This process allows you to use specialized models not available in the standard Ollama library.
Prerequisites for Custom Models
Install Git LFS for handling large model files using the pacman package manager:
sudo pacman -S git-lfs
git lfs install
For comprehensive pacman usage including installation options, dependency management, and troubleshooting, see our detailed pacman cheatsheet.
Downloading Custom Models
Browse the Hugging Face model hub to find models suitable for your requirements. Clone the desired model repository:
git clone https://huggingface.co/SweatyCrayfish/Linux-CodeLlama-2-7B
cd Linux-CodeLlama-2-7B
Model Format Conversion
Most Hugging Face models require conversion to GGUF format for Ollama compatibility. Use the Ollama quantization tool:
nerdctl run --rm -v "$PWD":/model ollama/quantize -q q4_0 /model
Quantization options:
- q4_0: 4-bit quantization (good balance of size and quality)
- q5_0: 5-bit quantization (higher quality, larger size)
- q8_0: 8-bit quantization (highest quality, largest size)
Creating Model Definitions
Create a Modelfile with the model configuration:
FROM ./q4_0.bin
TEMPLATE "[INST] {{ .Prompt }} [/INST]"
PARAMETER temperature 0.7
PARAMETER top_k 40
PARAMETER top_p 0.9
Configuration parameters:
- FROM: Specifies the model file location
- TEMPLATE: Defines the prompt format for the model
- PARAMETER: Sets inference parameters for model behavior
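To make the TEMPLATE line concrete: with the definition above, a prompt typed at the ollama run prompt, for example "How do I check disk usage?", is wrapped before it reaches the model roughly as:
[INST] How do I check disk usage? [/INST]
The [INST] ... [/INST] markers match the instruction format that Llama 2 chat and CodeLlama instruct variants were trained on, which is why getting the template right has a direct effect on answer quality.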
Building Custom Models
Create the custom model in Ollama:
nerdctl exec -it ollama ollama create linux-codellama2 -f Modelfile
The custom model is now available for use like any standard Ollama model:
nerdctl exec -it ollama ollama run linux-codellama2
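To confirm the model was registered and check its size on disk, list the locally available models:
nerdctl exec -it ollama ollama list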
Performance Optimization
GPU Configuration
For optimal performance with large models, ensure proper GPU configuration:
- Install NVIDIA Container Runtime for GPU support
- Verify GPU accessibility within containers
- Monitor GPU memory usage during model inference
- Consider model quantization for memory-constrained environments
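A quick way to verify GPU accessibility from inside the container, assuming the NVIDIA Container Runtime is installed and the container was started with --gpus all, is to run nvidia-smi within it:
nerdctl exec -it ollama nvidia-smi
If the GPU is visible it will be listed along with its memory usage; watching this output during inference also covers the memory-monitoring point above.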
Resource Management
Monitor system resources during model operation:
- RAM Usage: Large models require significant memory (7B models ~8GB, 13B models ~16GB)
- Storage: Models consume substantial disk space; plan storage accordingly
- CPU Usage: CPU inference is possible but significantly slower than GPU
- Network: Initial model downloads require good bandwidth
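Two quick checks cover most of these points: live container resource usage and the size of the model store on disk (the path below matches the volume mount used earlier in this guide):
nerdctl stats ollama
du -sh /vault/ai/ollama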
Container Runtime Configuration
For optimal Ollama deployment, ensure your container runtime is properly configured. If you’re using containerd (recommended), our containerd setup guide provides detailed instructions for:
- Rootless Operation: Enhanced security without requiring root privileges
- GPU Integration: Proper GPU acceleration setup for AI workloads
- Network Configuration: Container networking for multi-service deployments
- Performance Optimization: Resource management and optimization techniques
The containerd setup includes nerdctl installation and configuration, which is essential for the commands shown in this guide.
Troubleshooting Common Issues
Model Loading Problems
Issue: Models fail to load or respond slowly
Solutions:
- Verify sufficient system memory
- Check GPU memory availability
- Ensure model files aren’t corrupted
- Try smaller models if resources are limited
Container Connectivity Issues
Issue: WebUI cannot connect to Ollama API
Solutions:
- Verify container networking configuration
- Check port availability and binding
- Ensure firewall rules allow container communication
- Validate API endpoint URLs
- Reference our containerd troubleshooting guide for container-specific network issues
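As a first check, query the API directly from the host; if this fails, the problem lies with the Ollama container itself rather than the WebUI:
curl http://localhost:11434/api/tags
This endpoint returns the locally available models as JSON. If it responds from the host but the WebUI still cannot connect, the cause is usually the hostname in OLLAMA_API_BASE_URL or a missing shared network (see the networking note in the WebUI section).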
Performance Issues
Issue: Slow model inference or high latency
Solutions:
- Enable GPU acceleration if available
- Optimize model quantization settings
- Increase system memory allocation
- Consider using smaller, faster models for testing
For containerized deployments specifically, ensure your container runtime is optimized for AI workloads with proper resource allocation and GPU support.
Package Management Issues
Issue: Git LFS or other dependencies fail to install
Solutions:
- Update the package database with sudo pacman -Sy
- Check mirror connectivity and refresh mirror list
- For detailed package management troubleshooting, see our pacman troubleshooting section
- Verify sufficient disk space for large model downloads
References and Resources
- Official Ollama Documentation
- Ollama Model Import Guide
- Ollama WebUI Project
- Oterm Terminal Interface
- Hugging Face Model Hub
- Ollama Model Library
Questions Answered in This Document
Q: What is Ollama and why should I use it for running LLMs locally?
A: Ollama is a containerized solution for running large language models locally, providing privacy, cost control, and customization capabilities without relying on cloud services.
Q: What system requirements do I need to run Ollama effectively?
A: You need a Linux system with 8GB+ RAM, container runtime support, and optionally GPU acceleration for optimal performance with larger models.
Q: How do I download and run my first model with Ollama?
A: Use ollama pull llama2 to download a model, then ollama run llama2 to start an interactive session with the downloaded model.
Q: Can I use models from Hugging Face with Ollama?
A: Yes, you can integrate Hugging Face models by downloading them, converting them to GGUF format using Ollama’s quantization tool, and creating a custom Modelfile configuration.
Q: What interfaces are available for interacting with Ollama models?
A: Ollama supports command-line interaction, web-based interfaces through Ollama WebUI, terminal user interfaces via Oterm, and direct API access for custom applications.
Q: How do I optimize Ollama performance for my hardware?
A: Enable GPU acceleration, choose appropriate model quantization levels, ensure sufficient RAM allocation, and consider using smaller models if resources are constrained.
Q: How do I set up the Ollama WebUI for browser-based model interaction?
A: Clone the WebUI repository, build the container image, and run it with proper environment variables pointing to your Ollama API endpoint.
Q: Can I run multiple models simultaneously with Ollama?
A: Yes, Ollama supports loading multiple models simultaneously, though this requires adequate system resources and proper resource management.
Q: How do I create custom model configurations for specific use cases?
A: Create a Modelfile with custom templates, parameters, and model specifications, then use ollama create to build your customized model variant.