Using LLMs Locally via Ollama

Running large language models (LLMs) locally provides privacy, control, and cost savings compared to cloud-based AI services. Ollama simplifies downloading, serving, and managing LLMs, and it deploys cleanly in containerized environments, making local AI accessible to developers and researchers who want to run models on their own infrastructure.

This guide covers the complete setup process for Ollama, from basic installation to advanced model customization using Hugging Face models. You’ll learn how to deploy Ollama in a container, access models through several interfaces, and customize models for specific use cases.

Prerequisites

Before starting with Ollama, ensure you have:

  • A Linux system with container runtime support
  • Sufficient system resources (8GB+ RAM recommended for most models)
  • GPU support (optional but recommended for performance)
  • Network access for downloading models

This guide assumes you have containerd configured as your container runtime. If you need to set up containerd in rootless mode, see our comprehensive containerd setup guide which covers installation, configuration, and Docker migration strategies.
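As a quick sanity check that containerd and nerdctl are working before you start pulling multi-gigabyte models, a minimal test (the image choice is arbitrary) looks like this:

# Confirm nerdctl can reach containerd
nerdctl --version
nerdctl info

# Run a throwaway container end to end
nerdctl run --rm alpine echo "containerd is working"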

Core Ollama Setup

Running the Ollama Container

The foundation of using Ollama locally is deploying the main container that serves as the model runtime environment. This container handles model loading, inference, and API endpoints.

nerdctl run --name ollama --rm \
  -v /vault/ai/ollama:/root/.ollama \
  -p 11434:11434 \
  --gpus all \
  ollama/ollama

Command breakdown:

  • --name ollama: Names the container for easy reference
  • --rm: Automatically removes container when stopped
  • -v /vault/ai/ollama:/root/.ollama: Mounts local directory for model storage
  • -p 11434:11434: Exposes Ollama’s default API port
  • --gpus all: Enables GPU acceleration (requires NVIDIA Container Runtime)
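Because the container is started with --rm in the foreground rather than detached, keep it running and use a second terminal for the commands that follow. Once the server is up, a quick check from the host confirms the API is reachable on the published port:

# Should respond with "Ollama is running"
curl http://localhost:11434

# List locally stored models (empty on a fresh install)
curl http://localhost:11434/api/tags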

Basic Ollama Operations

Once the container is running, you can interact with Ollama using the built-in CLI tools. Access the help system to explore available commands:

nerdctl exec -it ollama -- ollama help

This command provides a comprehensive list of available operations including model management, inference, and system administration functions.

Model Management

Downloading Pre-built Models

Ollama provides access to a curated library of optimized language models. Browse the complete model catalog at the official Ollama library to find models suitable for your use case.

Download a model using the pull command:

nerdctl exec -it ollama -- ollama pull llama2

Popular models include:

  • llama2: Meta’s Llama 2 model family
  • codellama: Specialized for code generation
  • mistral: Efficient general-purpose model
  • neural-chat: Optimized for conversational AI
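Model names accept an optional tag for a specific size or quantization (for example llama2:13b), and ollama list shows what has already been downloaded, which is handy for verifying that a pull completed:

# Pull a specific model variant
nerdctl exec -it ollama -- ollama pull llama2:13b

# Show downloaded models and their on-disk sizes
nerdctl exec -it ollama -- ollama list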

Running Models Interactively

After downloading a model, start an interactive session:

nerdctl exec -it ollama -- ollama run llama2

This launches a chat interface where you can interact directly with the model through the command line. The interface provides real-time responses and maintains conversation context throughout the session.
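The same models are also exposed over Ollama’s HTTP API on the published port 11434, which is useful for scripting; a minimal example against the /api/generate endpoint (the prompt text is arbitrary):

curl http://localhost:11434/api/generate -d '{
  "model": "llama2",
  "prompt": "Explain what a container runtime does in one paragraph.",
  "stream": false
}'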

Web Interface Integration

Setting Up Ollama WebUI

For users who prefer graphical interfaces, the Ollama WebUI provides a modern web-based frontend for model interaction. This interface offers features like conversation history, model switching, and enhanced formatting.

Clone the WebUI repository:

git clone https://github.com/ollama-webui/ollama-webui
cd ollama-webui

Build the WebUI container image:

nerdctl build -f Dockerfile . -t ollama-webui

Deploy the WebUI container with proper networking:

nerdctl run --name ollama-webui \
  --env OLLAMA_API_BASE_URL=http://ollama:11434/api \
  -p 3000:8080 \
  --rm -it \
  ollama-webui

Configuration details:

  • OLLAMA_API_BASE_URL: Points to the Ollama API endpoint
  • Port 3000: External access port for the web interface
  • Port 8080: Internal WebUI application port

Access the interface at http://localhost:3000 to begin using the graphical interface.
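Note that OLLAMA_API_BASE_URL refers to the Ollama container by name, which only resolves if both containers share a network that provides name resolution. One way to arrange this is a user-defined network (a sketch; the network name ollama-net is arbitrary):

# Create a shared network so the WebUI can resolve the hostname "ollama"
nerdctl network create ollama-net

# Restart the Ollama container with the extra flag: --network ollama-net
# Then attach the WebUI to the same network:
nerdctl run --name ollama-webui --rm -it \
  --network ollama-net \
  --env OLLAMA_API_BASE_URL=http://ollama:11434/api \
  -p 3000:8080 \
  ollama-webui

If name resolution is not available in your setup, pointing OLLAMA_API_BASE_URL at the host’s IP address and the published port 11434 works as well.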

Terminal User Interface (TUI)

Installing and Using Oterm

For users who prefer terminal-based interfaces with enhanced features, Oterm provides a sophisticated TUI for Ollama interaction. This tool offers features like syntax highlighting, conversation management, and model switching within a terminal environment.

Set up a Python virtual environment for Oterm:

python -m venv venv
source venv/bin/activate
pip install oterm

Launch Oterm with the virtual environment active:

oterm

Oterm automatically detects running Ollama instances and provides an intuitive interface for model interaction with additional features like:

  • Conversation history browsing
  • Multi-model session management
  • Export capabilities for conversations
  • Customizable themes and layouts
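Because the API port is published to the host, Oterm’s defaults usually work out of the box. If your Ollama instance listens elsewhere, Oterm can typically be pointed at it through the standard OLLAMA_HOST environment variable (check Oterm’s documentation for the exact variables it honors); a hypothetical example:

# Assumes Oterm follows the standard OLLAMA_HOST convention
export OLLAMA_HOST=127.0.0.1:11434
oterm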

Custom Model Integration

Using Hugging Face Models

Extend Ollama’s capabilities by integrating custom models from Hugging Face. This process allows you to use specialized models not available in the standard Ollama library.

Prerequisites for Custom Models

Install Git LFS for handling large model files using the pacman package manager:

sudo pacman -S git-lfs
git lfs install

For comprehensive pacman usage including installation options, dependency management, and troubleshooting, see our detailed pacman cheatsheet.

Downloading Custom Models

Browse the Hugging Face model hub to find models suitable for your requirements. Clone the desired model repository:

git clone https://huggingface.co/SweatyCrayfish/Linux-CodeLlama-2-7B
cd Linux-CodeLlama-2-7B
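Because the weights are tracked with Git LFS, it is worth confirming they were actually downloaded rather than left as small pointer files:

# Fetch any LFS objects skipped during the clone
git lfs pull

# Weight files should be gigabytes in size, not a few hundred bytes
ls -lh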

Model Format Conversion

Most Hugging Face models require conversion to GGUF format for Ollama compatibility. Use the Ollama quantization tool:

nerdctl run --rm -v "$PWD":/model ollama/quantize -q q4_0 /model

Quantization options:

  • q4_0: 4-bit quantization (good balance of size and quality)
  • q5_0: 5-bit quantization (higher quality, larger size)
  • q8_0: 8-bit quantization (highest quality, largest size)

Creating Model Definitions

Create a Modelfile with model configuration:

FROM ./q4_0.bin
TEMPLATE "[INST] {{ .Prompt }} [/INST]"
PARAMETER temperature 0.7
PARAMETER top_k 40
PARAMETER top_p 0.9

Configuration parameters:

  • FROM: Specifies the model file location
  • TEMPLATE: Defines the prompt format for the model
  • PARAMETER: Sets inference parameters for model behavior
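Modelfiles support a few more directives than shown above; for example, a SYSTEM prompt and stop sequences can be added. A variant of the same definition (values are illustrative, not tuned):

FROM ./q4_0.bin
TEMPLATE "[INST] {{ .Prompt }} [/INST]"
SYSTEM "You are a concise assistant for Linux administration questions."
PARAMETER temperature 0.7
PARAMETER top_k 40
PARAMETER top_p 0.9
PARAMETER stop "[INST]"
PARAMETER stop "[/INST]"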

Building Custom Models

Create the custom model in Ollama. Since ollama create runs inside the container, the Modelfile and the quantized q4_0.bin must be placed somewhere the container can see them, for example in the mounted model directory (/vault/ai/ollama on the host, /root/.ollama inside the container):

nerdctl exec -it -w /root/.ollama ollama -- ollama create linux-codellama2 -f Modelfile

The custom model is now available for use like any standard Ollama model:

nerdctl exec -it ollama -- ollama run linux-codellama2
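To confirm the model was registered and to inspect the configuration Ollama stored for it:

# The custom model should appear alongside pulled models
nerdctl exec -it ollama -- ollama list

# Print the Modelfile Ollama recorded for the custom model
nerdctl exec -it ollama -- ollama show linux-codellama2 --modelfile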

Performance Optimization

GPU Configuration

For optimal performance with large models, ensure proper GPU configuration:

  • Install NVIDIA Container Runtime for GPU support
  • Verify GPU accessibility within containers (a quick check is sketched after this list)
  • Monitor GPU memory usage during model inference
  • Consider model quantization for memory-constrained environments
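A quick way to cover the second and third points is the usual nvidia-smi probe, run both on the host and inside the container (the NVIDIA Container Runtime normally injects the tool when --gpus is used):

# On the host: confirm the driver sees the GPU and watch memory during inference
nvidia-smi
watch -n 1 nvidia-smi

# Inside the container started earlier with --gpus all
nerdctl exec -it ollama nvidia-smi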

Resource Management

Monitor system resources during model operation:

  • RAM Usage: Large models require significant memory (7B models ~8GB, 13B models ~16GB)
  • Storage: Models consume substantial disk space; plan storage accordingly
  • CPU Usage: CPU inference is possible but significantly slower than GPU
  • Network: Initial model downloads require good bandwidth

Container Runtime Configuration

For optimal Ollama deployment, ensure your container runtime is properly configured. If you’re using containerd (recommended), our containerd setup guide provides detailed instructions for:

  • Rootless Operation: Enhanced security without requiring root privileges
  • GPU Integration: Proper GPU acceleration setup for AI workloads
  • Network Configuration: Container networking for multi-service deployments
  • Performance Optimization: Resource management and optimization techniques

The containerd setup includes nerdctl installation and configuration, which is essential for the commands shown in this guide.

Troubleshooting Common Issues

Model Loading Problems

Issue: Models fail to load or respond slowly

Solutions:

  • Verify sufficient system memory
  • Check GPU memory availability
  • Ensure model files aren’t corrupted
  • Try smaller models if resources are limited

Container Connectivity Issues

Issue: WebUI cannot connect to Ollama API

Solutions:

  • Verify container networking configuration
  • Check port availability and binding
  • Ensure firewall rules allow container communication
  • Validate API endpoint URLs
  • Reference our containerd troubleshooting guide for container-specific network issues

Performance Issues

Issue: Slow model inference or high latency

Solutions:

  • Enable GPU acceleration if available
  • Optimize model quantization settings
  • Increase system memory allocation
  • Consider using smaller, faster models for testing

For containerized deployments specifically, ensure your container runtime is optimized for AI workloads with proper resource allocation and GPU support.

Package Management Issues

Issue: Git LFS or other dependencies fail to install

Solutions:

  • Update package database with sudo pacman -Sy
  • Check mirror connectivity and refresh mirror list
  • For detailed package management troubleshooting, see our pacman troubleshooting section
  • Verify sufficient disk space for large model downloads

References and Resources

Questions Answered in This Document

Q: What is Ollama and why should I use it for running LLMs locally?
A: Ollama is a containerized solution for running large language models locally, providing privacy, cost control, and customization capabilities without relying on cloud services.

Q: What system requirements do I need to run Ollama effectively?
A: You need a Linux system with 8GB+ RAM, container runtime support, and optionally GPU acceleration for optimal performance with larger models.

Q: How do I download and run my first model with Ollama?
A: Use ollama pull llama2 to download a model, then ollama run llama2 to start an interactive session with the downloaded model.

Q: Can I use models from Hugging Face with Ollama?
A: Yes, you can integrate Hugging Face models by downloading them, converting to GGUF format using Ollama’s quantization tool, and creating a custom Modelfile configuration.

Q: What interfaces are available for interacting with Ollama models?
A: Ollama supports command-line interaction, web-based interfaces through Ollama WebUI, terminal user interfaces via Oterm, and direct API access for custom applications.

Q: How do I optimize Ollama performance for my hardware?
A: Enable GPU acceleration, choose appropriate model quantization levels, ensure sufficient RAM allocation, and consider using smaller models if resources are constrained.

Q: How do I set up the Ollama WebUI for browser-based model interaction?
A: Clone the WebUI repository, build the container image, and run it with proper environment variables pointing to your Ollama API endpoint.

Q: Can I run multiple models simultaneously with Ollama?
A: Yes, Ollama supports loading multiple models simultaneously, though this requires adequate system resources and proper resource management.

Q: How do I create custom model configurations for specific use cases?
A: Create a Modelfile with custom templates, parameters, and model specifications, then use ollama create to build your customized model variant.