Models
Lume works with any model available in Ollama. The right choice depends on which features you’re using and how much RAM your machine has.
Recommendations
| Model | Size | RAM | Best for |
|---|---|---|---|
| qwen2.5 | 7B | 6 GB | Recommended: best tool-calling reliability for actions and components |
| llama3.1 | 8B | 8 GB | Strong reasoning, good alternative to qwen2.5 |
| mistral | 7B | 8 GB | Solid general-purpose option |
| gemma3 | 4B | 4 GB | Lighter and faster; good for chat-only use without actions or components |
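The table can be read as a simple decision rule. The sketch below is one possible encoding of it; `pickModel` is a hypothetical helper (not part of Lume), and it assumes the RAM column is the memory you can spare for the model.

```typescript
// Hypothetical helper, not part of Lume: encodes the recommendations table.
type ModelName = "qwen2.5" | "llama3.1" | "mistral" | "gemma3";

function pickModel(ramGb: number, usesActionsOrComponents: boolean): ModelName | null {
  // qwen2.5 is the recommended choice whenever its ~6 GB fits.
  if (ramGb >= 6) return "qwen2.5";
  // gemma3 fits in ~4 GB but is only suggested for chat-only use.
  if (ramGb >= 4 && !usesActionsOrComponents) return "gemma3";
  // Nothing in the table is recommended for this combination.
  return null;
}
```

llama3.1 and mistral are left out of the rule because the table lists them as alternatives rather than first choices.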
Why qwen2.5
If you are using actions or structured UI components, qwen2.5 is strongly recommended. It was fine-tuned specifically for tool-calling and JSON output. Other models can work but are more likely to:
- Return JSON in the wrong format
- Add text before or after JSON blocks
- Trigger components on the wrong intent
For chat-only use without actions or components, any model works well.
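To make the "text before or after JSON blocks" failure mode concrete, here is a minimal sketch of tolerant parsing. It is illustrative only and is not how Lume handles model output (the source does not describe that); `extractJson` is a hypothetical name.

```typescript
// Illustrative only: pull a JSON object out of model output that may be
// wrapped in extra prose, e.g. 'Sure! {"action":"open"} Hope that helps.'
function extractJson(output: string): unknown {
  const start = output.indexOf("{");
  const end = output.lastIndexOf("}");
  // No braces, or a closing brace before the opening one: nothing to parse.
  if (start === -1 || end <= start) return null;
  try {
    return JSON.parse(output.slice(start, end + 1));
  } catch {
    // Malformed JSON: treat it as "no structured result".
    return null;
  }
}
```

This only recovers a single top-level object and will fail if the surrounding prose contains stray braces, which is part of why a model that emits clean JSON in the first place is the safer choice.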
Pulling a model

    ollama pull qwen2.5   # recommended
    ollama pull llama3.1  # alternative
    ollama pull gemma3    # lightweight

Apple Silicon
On M1/M2/M3 Macs, Ollama automatically uses the GPU via Metal. No configuration needed.
Approximate performance on an M2 MacBook Air (16 GB) with qwen2.5 7B:
| Metric | Value |
|---|---|
| Cold start | 3–5 seconds |
| First token | under 1 second |
| Generation speed | ~40–60 tokens/sec |
| Classification call | ~200–400ms |
This is fast enough to feel instant in a chat widget. The 14B model also runs on 16 GB but starts to crowd RAM if other apps are open, and the quality improvement for actions and components does not justify that cost.
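As a rough worked example of what the figures in the table imply, total reply time is approximately the time to first token plus the reply length divided by generation speed. The 150-token reply length and the 1-second first-token figure (the table's upper bound) are assumptions chosen for illustration.

```typescript
// Rough latency estimate: time to first token plus generation time.
function estimateReplySeconds(
  tokens: number,
  tokensPerSec: number,
  firstTokenSec: number,
): number {
  return firstTokenSec + tokens / tokensPerSec;
}

// Assumed 150-token reply across the table's 40–60 tokens/sec range.
const slow = estimateReplySeconds(150, 40, 1); // 1 + 3.75 = 4.75 seconds
const fast = estimateReplySeconds(150, 60, 1); // 1 + 2.5  = 3.5 seconds
```

If the reply is streamed, the user sees text from the first token onward, so the perceived wait is closer to the first-token figure than to the total.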
Switching models
Change the model prop at any time:

    <LumeProvider model="llama3.1" ...>
    // or
    <AssistantWidget model="gemma3" />

The same model is used for both the intent classifier and the response generator.
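Before switching, it can help to confirm that the new model has already been pulled. The sketch below assumes Lume does not pull models on its own (the source does not say) and that Ollama is listening on its default address, `http://localhost:11434`. It uses Ollama's `GET /api/tags` endpoint, which lists local models; `isPulled` and `checkModel` are hypothetical helpers, not part of Lume.

```typescript
// The part of Ollama's GET /api/tags response that this sketch relies on.
interface OllamaTags {
  models: { name: string }[];
}

// True if `model` is among the pulled models. A name without a tag is
// treated as ":latest", which is how Ollama resolves untagged names.
function isPulled(tags: OllamaTags, model: string): boolean {
  const wanted = model.includes(":") ? model : `${model}:latest`;
  return tags.models.some((m) => m.name === wanted);
}

// Hypothetical helper: asks the local Ollama server which models exist.
async function checkModel(
  model: string,
  baseUrl = "http://localhost:11434", // assumed default Ollama address
): Promise<boolean> {
  const res = await fetch(`${baseUrl}/api/tags`);
  if (!res.ok) throw new Error(`Ollama responded with ${res.status}`);
  return isPulled((await res.json()) as OllamaTags, model);
}
```

For example, `await checkModel("gemma3")` could gate the switch and prompt the user to run `ollama pull gemma3` when it returns `false`.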