Skip to main content

Local Models Setup

Running AI models directly on your machine offers compelling advantages over cloud APIs. Your data never leaves your computer, preserving complete privacy. Once you've downloaded a model, it works without internet connectivity, making it perfect for offline work or unstable connections. There are no per-request charges—after the one-time download, inference is free. Most importantly, you maintain total control over which models you use, how they're configured, and how your resources are allocated.

The two most popular tools for running local models are Ollama and LM Studio. Ollama provides a command-line interface and background service that makes model management straightforward through terminal commands. LM Studio offers a graphical application with an intuitive interface, making it accessible even if you're less comfortable with command-line tools. Both tools run models locally and expose them through OpenAI-compatible APIs that Lattice can use seamlessly.

Setting Up Ollama

Ollama has become the standard way to run local language models on macOS and Linux. It handles all the complexity of model loading, memory management, and API serving, letting you focus on using models rather than managing infrastructure.

Installing Ollama

On macOS, Homebrew provides the easiest installation path. Open your terminal and run brew install ollama to download and install the latest version. The Homebrew package includes everything you need, including automatic PATH configuration.

Linux users can use the official installation script that detects your distribution and handles setup automatically. Running curl -fsSL https://ollama.ai/install.sh | sh downloads and executes the installer, which configures Ollama as a system service and ensures it starts on boot.

Windows support is newer but works well through the official installer available at ollama.ai. Download the installer, run it, and follow the setup wizard. The installer handles all PATH configuration and sets up Ollama to run as a Windows service.

Getting Ollama Running

After installation, you need to start the Ollama service before you can use it. The ollama serve command starts the background service that manages model loading and serves the API:

bash
ollama serve

This command starts Ollama and keeps it running in your current terminal. For day-to-day use, you'll want Ollama to run in the background automatically. On macOS and Linux, installing Ollama as a system service handles this—it starts on boot and runs continuously without needing manual intervention.

Downloading Your First Model

Models in Ollama are downloaded on-demand through the ollama pull command. The most accessible starting point is Llama 3.2, which offers good performance on most machines without requiring excessive RAM:

bash
ollama pull llama3.2

This downloads the default 3B parameter version of Llama 3.2, which needs about 4GB of disk space and runs smoothly on machines with 8GB of RAM. The download takes a few minutes depending on your internet speed. Once complete, the model is available immediately for use.

If you have more powerful hardware—specifically 64GB of RAM or more—you might want the significantly more capable 70B parameter version. Download it with ollama pull llama3.2:70b, understanding that this requires around 50GB of disk space and substantial system resources.

For code-focused work, Code Llama specializes in programming tasks and often produces better results than general-purpose models. Get it with ollama pull codellama. Similarly, Mistral offers a nice balance of speed and capability through ollama pull mistral.

Configuring Lattice to Use Ollama

With Ollama running and models downloaded, you need to tell Lattice how to connect to it. Edit your configuration file at ~/.lovelace/lattice/config.toml and add the Ollama provider configuration:

toml
[orchestrator]
default_provider = "ollama"

[providers.ollama]
base_url = "http://localhost:11434"
default_model = "llama3.2"
enabled = true

This configuration tells Lattice that Ollama is your default provider, specifies where to find the Ollama API (the standard port 11434), sets which model to use by default, and enables the provider. Ollama runs on localhost port 11434 by default, so unless you've configured it differently, this base URL will work correctly.

After saving your configuration changes, restart the daemon to load the new settings:

bash
lattice-ctl daemon stop
lattice-ctl daemon start

Verify everything is working by testing the connection and checking available models:

bash
lattice-ctl provider test ollama
lattice-ctl provider models ollama

The test command should confirm successful connection to Ollama, and the models command should list all models you've downloaded. If either command fails, double-check that Ollama is actually running with ps aux | grep ollama and verify the base URL in your configuration matches where Ollama is listening.

LM Studio

Installation

Download from lmstudio.ai

Supported platforms:

  • macOS (Intel & Apple Silicon)
  • Windows
  • Linux

Setup

  1. Start LM Studio

  2. Download a model (in app):

    • Search for models (e.g., "llama-3.2")
    • Click download
    • Wait for completion
  3. Start local server:

    • Go to "Local Server" tab
    • Click "Start Server"
    • Note the URL (usually http://localhost:1234)
  4. Configure Lattice:

    toml
    # ~/.lovelace/lattice/config.toml
    
    [orchestrator]
    default_provider = "lmstudio"
    
    [providers.lmstudio]
    base_url = "http://localhost:1234/v1"
    default_model = "llama-3.2-3b"
    enabled = true
    
  5. Restart daemon:

    bash
    lattice-ctl daemon stop
    lattice-ctl daemon start
    
  6. Test:

    bash
    lattice-ctl provider test lmstudio
    

Model Selection

In LM Studio app:

  1. Go to "Local Server" tab
  2. Select loaded model from dropdown
  3. Model is now active for Lattice

Performance Tuning

Hardware Requirements

Minimum (3B models):

  • 8GB RAM
  • 4 CPU cores
  • 10GB disk space

Recommended (13B models):

  • 16GB RAM
  • 8 CPU cores
  • 20GB disk space

High-end (70B models):

  • 64GB RAM
  • 16+ CPU cores
  • 100GB disk space

Ollama Configuration

bash
# Set threads (CPU cores to use)
export OLLAMA_NUM_THREADS=8

# Set GPU layers (if you have GPU)
export OLLAMA_GPU_LAYERS=35

# Start with config
OLLAMA_NUM_THREADS=8 ollama serve

Resource Limits in Lattice

toml
# ~/.lovelace/lattice/config.toml

[resources]
# Limit memory per agent (MB)
max_memory_mb = 4096

# Limit CPU percentage
max_cpu_percent = 80

Choosing Models

For General Use

  • llama3.2 (3B) - Fast, good quality, works on most machines
  • mistral (7B) - Balanced performance and capability

For Coding

  • codellama - Purpose-built for code generation
  • deepseek-coder - Strong coding performance

For Low Resources

  • phi (2.7B) - Surprisingly capable, minimal resources
  • tinyllama (1.1B) - Ultra-fast, basic tasks

For Maximum Quality

  • llama3.2:70b (70B) - Best quality, needs 64GB RAM
  • mixtral (8x7B) - Excellent, uses mixture of experts

Troubleshooting

Ollama not starting

bash
# Check if running
ps aux | grep ollama

# Kill stuck process
pkill ollama

# Restart
ollama serve

Model download fails

bash
# Check disk space
df -h

# Try again with specific version
ollama pull llama3.2:latest

# Check Ollama logs
tail -f ~/.ollama/logs/server.log

LM Studio connection refused

  1. Verify server started in app
  2. Check port (usually 1234)
  3. Test with curl:
    bash
    curl http://localhost:1234/v1/models
    

Out of memory

bash
# Use smaller model
ollama pull llama3.2  # 3B instead of 70B

# Or configure limits
export OLLAMA_MAX_LOADED_MODELS=1

Best Practices

  1. Start small - Begin with 3B models
  2. Keep models updated - ollama pull regularly
  3. Monitor resources - Watch RAM/CPU usage
  4. Clean up - Remove unused models (ollama rm <model>)

Next Steps

Related

Managing Your Model Collection

As you work with Ollama, you'll accumulate models for different purposes. The ollama list command shows everything you've downloaded, including model sizes and when they were last modified. This helps you track disk usage and identify models you're no longer using.

Searching for new models to try is straightforward with ollama search. For example, ollama search llama shows all available Llama variants. The search results include model names, sizes, and brief descriptions to help you choose.

When you need a specific model version rather than the latest, append a version tag to the pull command. For instance, ollama pull llama3.2:13b downloads the 13-billion parameter version specifically, while ollama pull llama3.2:latest explicitly gets the newest version (though :latest is implied if you omit the tag).

Popular models each serve different purposes. The default Llama 3.2 (3B parameters) strikes an excellent balance between capability and resource requirements, making it ideal for most users. If you have substantial RAM—64GB or more—Llama 3.2 70B delivers significantly better results but demands correspondingly more resources. Code Llama specializes in programming tasks and often outperforms general models for code generation and explanation. Mistral offers impressive speed without sacrificing too much capability, while Phi surprises with its competence despite being only 2.7B parameters, making it perfect for resource-constrained environments.

Setting Up LM Studio

LM Studio takes a different approach to local models, providing a polished graphical interface instead of command-line tools. This makes it particularly accessible if you're less comfortable with terminal workflows or simply prefer visual interfaces.

Installing LM Studio

Download the appropriate version for your platform from lmstudio.ai. The site automatically detects your operating system and offers the correct download. LM Studio supports macOS (both Intel and Apple Silicon), Windows, and Linux, with native builds optimized for each platform.

After downloading, installation follows standard platform conventions. On macOS, drag the application to your Applications folder. On Windows, run the installer and follow the wizard. On Linux, extract the archive and run the executable. The application is self-contained and doesn't require additional system-level setup.

Downloading Models Through the GUI

Launch LM Studio and you'll see the main interface with several tabs. The "Discover" or "Search" tab lets you browse available models. You can search for specific models by name—for example, typing "llama 3.2" shows all Llama 3.2 variants with their sizes and descriptions.

When you find a model you want, click its download button. LM Studio shows download progress with percentage and speed metrics. Depending on model size and your internet connection, downloads can take anywhere from a few minutes to over an hour for the largest models. The downloaded models are stored locally on your machine, ready for immediate use.

Starting the Local Server

Before Lattice can use LM Studio's models, you need to enable its API server. Switch to the "Local Server" tab in LM Studio's interface. This tab shows server controls and configuration options. Click the "Start Server" button to begin serving models via API.

By default, LM Studio's server runs on http://localhost:1234 and exposes an OpenAI-compatible API. This means Lattice can talk to it using the same protocol it uses for OpenAI itself, making integration seamless. The server runs as long as LM Studio is open and continues serving whichever model you have loaded.

Configuring Lattice for LM Studio

With LM Studio's server running, configure Lattice to use it by editing ~/.lovelace/lattice/config.toml:

toml
[orchestrator]
default_provider = "lmstudio"

[providers.lmstudio]
base_url = "http://localhost:1234/v1"
default_model = "llama-3.2-3b"
enabled = true

The base URL includes /v1 at the end because LM Studio follows OpenAI's API conventions, where the v1 endpoint serves current API operations. The default model should match the name of the model you've loaded in LM Studio—you can verify the exact name in LM Studio's interface.

After saving the configuration, restart the Lattice daemon to apply changes:

bash
lattice-ctl daemon stop
lattice-ctl daemon start

Test that everything is working with lattice-ctl provider test lmstudio. If the test succeeds, Lattice can successfully communicate with LM Studio and you're ready to use local models through the graphical interface.

Selecting Active Models

LM Studio lets you load different models without restarting anything. In the "Local Server" tab, you'll see a dropdown menu showing available models. Select the model you want to use, and LM Studio loads it into memory, making it available through the API. Lattice will automatically use whichever model is currently loaded when you send requests.

This flexibility means you can switch between models easily. If you're working on code and want Code Llama's specialized capabilities, load that model. When you switch to general text work, load a model optimized for that purpose. Each model swap takes a moment as LM Studio loads it into memory, but no configuration changes are needed.