
Weaver supports running AI models locally for privacy, cost savings, and offline operation. This guide covers Ollama, vLLM, and custom OpenAI-compatible endpoints.

Supported Local Providers

  • Ollama: Easy local model deployment
  • vLLM: High-performance inference server
  • Custom Endpoints: Any OpenAI-compatible API

Ollama

Ollama provides the easiest way to run models locally.

1. Install Ollama

# macOS / Linux
curl -fsSL https://ollama.ai/install.sh | sh

# Windows
# Download from https://ollama.ai/download
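
To confirm the install, check the version; if the server is not already running, start it with ollama serve:

# Verify the install
ollama --version

# Start the server if it is not already running
ollama serve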

2. Pull Models

# Pull Llama 3.1
ollama pull llama3.1

# Pull Qwen 2.5
ollama pull qwen2.5:14b

# Pull Mistral
ollama pull mistral
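
Once the pulls finish, you can confirm which models are available locally:

# List installed models
ollama list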

3. Configure Weaver

Add to ~/.weaver/config.json:
{
  "providers": {
    "ollama": {
      "api_key": "",
      "api_base": "http://localhost:11434/v1"
    }
  },
  "agents": {
    "defaults": {
      "provider": "ollama",
      "model": "ollama/llama3.1"
    }
  }
}
API key is not required for Ollama. The api_key field can be empty or omitted.

4. Usage

# Use Llama 3.1
weaver chat --model ollama/llama3.1

# Use Qwen 2.5
weaver chat --model ollama/qwen2.5:14b

# Use Mistral
weaver chat --model ollama/mistral

Model Name Format

Ollama models use the format ollama/model-name:tag:
  • ollama/llama3.1 - Latest Llama 3.1
  • ollama/qwen2.5:14b - Qwen 2.5 14B parameter version
  • ollama/mistral:7b-instruct - Mistral 7B Instruct
Weaver automatically strips the ollama/ prefix before sending requests to the Ollama API. Source: pkg/providers/http_provider.go:55-62

vLLM

vLLM is a high-performance inference server for LLMs.

1. Install vLLM

pip install vllm

2. Start Server

# Start with Llama 3.1 8B
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --port 8000

# With GPU acceleration
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --tensor-parallel-size 2 \
  --port 8000
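
Once the server is up, you can send a test request to its OpenAI-compatible chat completions endpoint:

# Quick sanity check against the vLLM server
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Hello"}]
  }'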

3. Configure Weaver

Add to ~/.weaver/config.json:
{
  "providers": {
    "vllm": {
      "api_key": "",
      "api_base": "http://localhost:8000/v1"
    }
  },
  "agents": {
    "defaults": {
      "provider": "vllm",
      "model": "meta-llama/Llama-3.1-8B-Instruct"
    }
  }
}

4. Usage

weaver chat --model meta-llama/Llama-3.1-8B-Instruct

Custom OpenAI-Compatible Endpoints

Weaver works with any OpenAI-compatible API endpoint. The examples below reuse the vllm provider entry and simply point api_base at the local server.
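
Before wiring an endpoint into Weaver, you can confirm it speaks the OpenAI API by listing its models (substitute the port your server uses):

curl http://localhost:PORT/v1/models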

Local Server Examples

LocalAI

{
  "providers": {
    "vllm": {
      "api_key": "",
      "api_base": "http://localhost:8080/v1"
    }
  }
}

LM Studio

{
  "providers": {
    "vllm": {
      "api_key": "",
      "api_base": "http://localhost:1234/v1"
    }
  }
}

Jan

{
  "providers": {
    "vllm": {
      "api_key": "",
      "api_base": "http://localhost:1337/v1"
    }
  }
}

Configuration Options

  • api_key (string): API key for authentication (optional for local servers)
  • api_base (string): Local server endpoint URL (e.g., http://localhost:11434/v1)
  • proxy (string): HTTP/HTTPS proxy URL (optional)
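
Putting these together, a provider entry that routes traffic through a proxy might look like the following (the proxy address is a placeholder):

{
  "providers": {
    "ollama": {
      "api_key": "",
      "api_base": "http://localhost:11434/v1",
      "proxy": "http://127.0.0.1:7890"
    }
  }
}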

Model Parameters

Configure model behavior:
{
  "agents": {
    "defaults": {
      "model": "ollama/llama3.1",
      "max_tokens": 4096,
      "temperature": 0.7
    }
  }
}
  • max_tokens (integer, default 4096): Maximum tokens in the response
  • temperature (float, default 0.7): Controls randomness (0.0 = deterministic, 2.0 = very random)
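
For example, to make responses more repeatable (useful for automated runs), lower the temperature:

{
  "agents": {
    "defaults": {
      "model": "ollama/llama3.1",
      "max_tokens": 4096,
      "temperature": 0.0
    }
  }
}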

Recommended Models

Llama Family

# Llama 3.1 8B (Recommended)
ollama pull llama3.1

# Llama 3.1 70B (High capability)
ollama pull llama3.1:70b

# Llama 3.1 405B (Largest)
ollama pull llama3.1:405b

Qwen Family

# Qwen 2.5 7B
ollama pull qwen2.5

# Qwen 2.5 14B
ollama pull qwen2.5:14b

# Qwen 2.5 32B
ollama pull qwen2.5:32b

Other Models

# Mistral 7B
ollama pull mistral

# Mixtral 8x7B
ollama pull mixtral

# DeepSeek Coder
ollama pull deepseek-coder

# Phi-3
ollama pull phi3

Implementation Details

Weaver uses the HTTPProvider for all local model providers:
  • OpenAI-compatible API format
  • Standard /chat/completions endpoint
  • Automatic model namespace handling
  • Tool calling support (if supported by server)
Source: pkg/providers/http_provider.go

Automatic Provider Detection

Weaver automatically uses Ollama when:
// From http_provider.go:442-450
case (strings.Contains(lowerModel, "ollama") || strings.HasPrefix(model, "ollama/")) && cfg.Providers.Ollama.APIKey != "":
  apiKey = cfg.Providers.Ollama.APIKey
  apiBase = cfg.Providers.Ollama.APIBase
  proxy = cfg.Providers.Ollama.Proxy
  if apiBase == "" {
    apiBase = "http://localhost:11434/v1"
  }

Model Name Stripping

// From http_provider.go:55-62
if idx := strings.Index(model, "/"); idx != -1 {
  prefix := model[:idx]
  if prefix == "moonshot" || prefix == "nvidia" || prefix == "groq" || prefix == "ollama" {
    model = model[idx+1:]
  }
}
This strips known provider prefixes from the model name before it is sent to the API.
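
For example, a request for ollama/llama3.1 reaches Ollama with the bare model name, roughly equivalent to:

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1",
    "messages": [{"role": "user", "content": "Hello"}]
  }'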

Hardware Requirements

Model Size Guidelines

Model Size       RAM Required   GPU VRAM   Example Models
7B params        8GB            6GB        Llama 3.1 8B, Mistral 7B
13-14B params    16GB           12GB       Qwen 2.5 14B
30-34B params    32GB           24GB       Mixtral 8x7B
70B params       64GB           48GB       Llama 3.1 70B

Performance Tips

  1. Use GPU acceleration for faster inference
  2. Quantize models (4-bit, 8-bit) to reduce memory usage (see the example after this list)
  3. Use smaller models for development and testing
  4. Batch requests when processing multiple prompts
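
For tip 2, Ollama selects quantized builds by tag; the exact tags vary per model, so check the model's page in the Ollama library. An illustrative example:

# Pull a 4-bit quantized variant (tag shown is illustrative)
ollama pull llama3.1:8b-instruct-q4_0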

Troubleshooting

Ollama Issues

# Check if Ollama is running
curl http://localhost:11434/v1/models

# Restart Ollama
ollama serve

# List installed models
ollama list

vLLM Issues

# Check server status
curl http://localhost:8000/v1/models

# Review the server's command-line options
python -m vllm.entrypoints.openai.api_server --help

# Increase timeout for large models
export VLLM_TIMEOUT=600

Common Errors

Connection refused or cannot reach the server:
  • Verify the server is running
  • Check that the port number is correct
  • Ensure no firewall is blocking the connection
  • Try curl http://localhost:PORT/v1/models

Out-of-memory errors:
  • Use a smaller model
  • Enable quantization (4-bit or 8-bit)
  • Reduce max_tokens in the configuration
  • Close other applications

Slow responses:
  • Use GPU acceleration if available
  • Try a smaller model
  • Increase the vLLM tensor parallel size
  • Reduce the context window size

Model not found:
  • For Ollama: run ollama pull model-name
  • For vLLM: verify the model name matches the HuggingFace repository
  • Check ollama list to see installed models

Privacy Benefits

Running models locally provides:
  • Complete privacy: Data never leaves your machine
  • No API costs: No per-token pricing
  • Offline operation: Works without internet
  • Full control: Customize models and parameters
  • No rate limits: Process as many requests as your hardware allows

Next Steps

  • Provider Overview: Back to all providers
  • Model Selection: Choose the right model