Why Model Choice Matters
Most AI applications lock you into a single model provider. ChatGPT uses OpenAI. Copilot uses OpenAI. Gemini uses Google. You get whatever model the company chose, at whatever price they set, with whatever privacy policy they enforce. If the model is too expensive or too slow, or if it sends your data to servers you do not trust, your only option is to switch to an entirely different product.
Nemo takes a fundamentally different approach. It is a model-agnostic AI agent that supports five LLM providers out of the box. You choose which model powers your agent. You can change it at any time. You can even use different providers for different tasks simultaneously. This is not a theoretical feature — it is a practical necessity because different models genuinely excel at different things.
Claude is exceptional at nuanced writing and safety-conscious reasoning. GPT-4 has the broadest general capabilities and strong vision support. Ollama models run entirely on your hardware for complete privacy. OpenRouter gives you access to over 100 models, including free tiers for experimentation. And custom endpoints let enterprises use their own self-hosted models behind corporate firewalls.
The model you choose affects three things: quality (how well the agent performs tasks), cost (how much you pay per task), and privacy (where your data goes). This guide helps you make an informed decision across all three dimensions.
Anthropic (Claude)
Anthropic's Claude is our recommended model for users who prioritize quality. Claude is known for nuanced, thoughtful responses that follow instructions precisely. It excels at understanding complex multi-step tasks, producing well-structured output, and behaving predictably with safety-related instructions.
Strengths
- Instruction following: Claude is exceptionally good at following detailed system prompts, which is critical for skill-based AI agents. When Nemo's system prompt says "use the batch tool instead of calling tools one at a time," Claude consistently follows that instruction. This reduces LLM roundtrips and makes tasks faster.
- Nuanced writing: For email composition, document drafting, and any task involving natural language output, Claude produces text that sounds human, matches the appropriate tone, and avoids the "AI-generated" feel that plagues many models.
- Safety awareness: Claude is trained to be cautious with sensitive operations. When a task involves potentially destructive actions, Claude tends to ask for confirmation rather than proceeding blindly. This complements Sentinel's safety layer.
- Long context: Claude supports 200K token context windows, allowing Nemo to process large documents, long email threads, and complex multi-turn conversations without truncation.
Setup
To use Claude with Nemo, you need an Anthropic API key. Sign up at console.anthropic.com, create an API key, and enter it in Nemo's Settings under LLM Provider. The key is stored in Nemo's encrypted vault — never in plain text, never sent anywhere except Anthropic's API endpoint.
Best for
Email triage and composition, document writing, complex multi-step tasks that require precise instruction following, any task where output quality matters more than cost.
OpenAI (GPT-4)
OpenAI's GPT-4 family is the most widely used LLM ecosystem. GPT-4 and its variants (GPT-4 Turbo, GPT-4o) offer broad capabilities across virtually every task category. If you are already paying for OpenAI API access, using GPT-4 with Nemo is a natural choice.
Strengths
- Broad capabilities: GPT-4 performs well across a wide range of tasks without notable weak spots. From code generation to creative writing to data analysis, it delivers consistently good results.
- Vision support: GPT-4o includes multimodal vision capabilities, allowing Nemo to process screenshots and images. This is particularly useful for desktop automation tasks where the agent needs to understand what is on screen.
- Function calling: OpenAI pioneered the function calling format that Nemo's tool system is based on. GPT-4 has excellent support for structured tool calls, reducing parsing errors and improving reliability.
- Ecosystem: OpenAI has the largest model ecosystem, with frequent updates, new model variants, and extensive documentation. Compatibility issues are rare.
Setup
Sign up at platform.openai.com, generate an API key, and enter it in Nemo's Settings. OpenAI requires a paid account with API credits. There is no permanent free tier for API access, though new accounts typically receive a small starter credit.
Best for
General-purpose tasks, desktop automation with screenshot analysis (vision), users already in the OpenAI ecosystem, tasks that benefit from the latest model updates.
Ollama (Fully Local)
Ollama is an open-source tool that lets you run large language models entirely on your local hardware. No API keys, no cloud services, no usage fees, no data leaving your machine. It is the only provider option that makes Nemo truly cost-free and completely private.
How it works
Ollama downloads and manages open-source models on your computer. It provides a local API endpoint (typically http://localhost:11434) that is compatible with the same interface Nemo uses for cloud providers. From Nemo's perspective, Ollama is just another LLM provider — the fact that it is running on your local GPU or CPU is transparent.
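To make the "just another provider" point concrete, here is a minimal sketch of what an OpenAI-style chat request to a local Ollama server looks like. The endpoint path and payload shape follow the OpenAI chat completions format that Ollama mirrors; the helper names (build_chat_request, ask) are illustrative, not part of Nemo or Ollama.

```python
import json
import urllib.request

OLLAMA_BASE_URL = "http://localhost:11434"

def build_chat_request(base_url: str, model: str, prompt: str):
    """Return the URL and JSON payload for a chat completion call."""
    url = f"{base_url.rstrip('/')}/v1/chat/completions"
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return url, payload

def ask(base_url: str, model: str, prompt: str) -> str:
    """POST the request to the local server (requires Ollama running)."""
    url, payload = build_chat_request(base_url, model, prompt)
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Swap the base URL for a cloud provider's and the same request shape works, which is exactly why a model-agnostic client can treat Ollama like any other backend.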
Recommended models
- Llama 3 (8B) — Meta's latest open-source model. Excellent general performance. Requires about 5GB of RAM. Good balance of speed and capability for most Nemo skills.
- Llama 3 (70B) — The full-size version. Near-GPT-4 quality on many benchmarks. Requires 40GB+ RAM or a GPU with 40GB+ VRAM. For users with powerful hardware.
- Mistral (7B) — French AI lab's model. Fast, efficient, and strong at structured tasks. Great for form filling and data extraction. Requires about 4GB RAM.
- Phi-3 (3.8B) — Microsoft's compact model. Surprisingly capable for its size. Runs on low-end hardware. Good for simple tasks like file management and basic email categorization.
- CodeLlama (7B/13B) — Specialized for code-related tasks. If you use Nemo for development workflows, this is the best local option.
Performance on consumer hardware
Running LLMs locally is more practical than most people expect. On a modern laptop with 16GB RAM, Llama 3 8B generates about 20–30 tokens per second on CPU. With a discrete GPU (even a mid-range NVIDIA RTX 3060), speeds jump to 40–80 tokens per second. For comparison, cloud APIs typically deliver 30–60 tokens per second. The experience is comparable for most tasks.
The main limitation is the initial model download (4–40GB depending on the model) and the first-load time (10–30 seconds to load the model into memory). Once loaded, the model stays in memory and subsequent queries are fast. Ollama handles model management, caching, and memory optimization automatically.
The privacy advantage
When you use Ollama, your data never leaves your computer. Not for inference, not for safety screening (Sentinel already runs locally), not for anything. Your emails, documents, form data, and browsing activity stay on your hardware. No API provider sees your data. No cloud service stores your queries. This is the strongest privacy guarantee any AI system can offer.
Setup
Download Ollama from ollama.com, install it, and run ollama pull llama3 to download your first model. In Nemo's Settings, select Ollama as your provider and enter the local URL (http://localhost:11434). No API key needed.
Best for
Privacy-focused users, offline operation, zero-cost AI, users with decent hardware (16GB+ RAM), document summarization, desktop automation, any task where data sensitivity is paramount.
OpenRouter
OpenRouter is an API aggregator that provides access to over 100 models from multiple providers through a single, unified API. It uses the OpenAI-compatible API format, making it seamless to use with Nemo. OpenRouter is the best option for users who want to experiment with different models or access free-tier models without signing up for multiple providers.
How it works
OpenRouter acts as a proxy between Nemo and various model providers. You send your request to OpenRouter, which forwards it to the appropriate model provider and returns the response. The API format is identical to OpenAI's, so Nemo's existing integration works without modification. OpenRouter requires two custom headers (HTTP-Referer and X-Title) for attribution, which Nemo sends automatically.
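A sketch of the headers such a request carries. The bearer token and the two attribution headers (HTTP-Referer, X-Title) come from OpenRouter's API conventions; the helper name and the example app values are illustrative, not Nemo's actual internals.

```python
OPENROUTER_BASE_URL = "https://openrouter.ai/api/v1"

def openrouter_headers(api_key: str, app_url: str, app_title: str) -> dict:
    """Headers for an OpenRouter chat completions request."""
    return {
        "Authorization": f"Bearer {api_key}",
        "HTTP-Referer": app_url,    # attribution: which app made the request
        "X-Title": app_title,       # attribution: human-readable app name
        "Content-Type": "application/json",
    }
```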
Free tier models
OpenRouter hosts several models with free tiers, including variants of Llama, Mistral, and other open-source models. These free tiers have rate limits (typically 10–20 requests per minute) but no per-token costs. Nemo can discover available free models automatically using the OpenRouter API, so you always know what is available without checking the website.
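As a sketch of how that discovery can work: OpenRouter publishes its catalog via a model-listing endpoint, and free variants are conventionally marked with a ":free" suffix on the model id. The filtering helper and the sample data below are illustrative, not Nemo's actual code.

```python
def free_models(models: list[dict]) -> list[str]:
    """Return ids of models published with the ':free' suffix."""
    return [m["id"] for m in models if m["id"].endswith(":free")]
```

Feeding this the parsed JSON from the model-listing endpoint yields the currently available free-tier ids without ever checking the website by hand.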
Best for
Experimentation with multiple models, budget-conscious users who want to find the cheapest capable model, accessing newer or niche models that are not available directly through major providers, fallback provider when your primary provider is down.
Custom Endpoints
For enterprises and advanced users who host their own models, Nemo supports custom API endpoints. Any server that exposes an OpenAI-compatible chat completions API can be used as a provider. This includes self-hosted deployments of vLLM, text-generation-inference, LocalAI, and LiteLLM.
Use cases
- Corporate compliance: Companies that cannot send data to external APIs can host models on internal servers and point Nemo at those endpoints.
- Fine-tuned models: If you have fine-tuned a model for your specific domain (legal, medical, financial), you can run it on your infrastructure and use it with Nemo.
- GPU clusters: Organizations with dedicated GPU infrastructure can run the largest models (70B+, 180B) at speeds that exceed consumer hardware.
- Air-gapped environments: For maximum security, Nemo can operate in fully air-gapped environments where the custom endpoint is on the same local network with no internet connectivity at all.
Setup
In Nemo's Settings, select Custom as your provider. Enter the base URL of your OpenAI-compatible endpoint (e.g., http://your-server:8000/v1). Enter any required API key or leave blank if your endpoint does not require authentication. Nemo will test the connection and confirm that the endpoint responds correctly.
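An illustrative connection check in the spirit of the test Nemo runs when you save a custom provider: probe the model-listing route that OpenAI-compatible servers such as vLLM, LocalAI, and LiteLLM expose, and require a valid JSON response. Function names and the health-check strategy are assumptions for illustration.

```python
import json
import urllib.request

def models_probe_url(base_url: str) -> str:
    """Build the URL of the model-listing route used as a health check."""
    return base_url.rstrip("/") + "/models"

def check_endpoint(base_url: str, api_key: str = "") -> bool:
    """True if the endpoint answers the probe with valid JSON."""
    headers = {"Authorization": f"Bearer {api_key}"} if api_key else {}
    req = urllib.request.Request(models_probe_url(base_url), headers=headers)
    try:
        with urllib.request.urlopen(req, timeout=5) as resp:
            json.load(resp)          # must parse as JSON
            return resp.status == 200
    except Exception:
        return False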
Smart Routing Explained
Using a top-tier model like Claude or GPT-4 for every task is like driving a Ferrari to the grocery store. It works, but it is unnecessarily expensive. Many tasks that Nemo performs — reading a file listing, categorizing a simple email, extracting text from a document — do not require the reasoning capabilities of a frontier model. A smaller, cheaper model handles them just as well.
Smart routing is Nemo's automatic model selection system. When you configure it (in Settings under LLM Provider > Routing), you specify a primary model and a secondary model. Nemo then analyzes the complexity of each task and routes it to the appropriate model:
- Simple tasks — Routed to the secondary (cheaper) model. Examples: listing files, reading a web page, categorizing an email as urgent or not, extracting structured data from a form.
- Complex tasks — Routed to the primary (more capable) model. Examples: composing a nuanced email reply, filling a multi-field form with derived values, multi-step desktop automation sequences, summarizing a complex legal document.
The complexity classification uses a lightweight analysis of the task description, the number of tools available, and the expected number of tool calls. It does not require a separate LLM call — the classification itself is rule-based and adds no latency.
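A hypothetical sketch of such a rule-based classifier: a few cheap signals (task length, tool count, expected tool calls), no LLM call, no added latency. The thresholds and keyword list are invented for illustration; Nemo's actual rules are not documented here.

```python
# Keywords that hint at tasks needing the more capable model (illustrative).
COMPLEX_HINTS = ("compose", "summarize", "draft", "automate", "multi-step")

def classify(task: str, num_tools: int, expected_calls: int) -> str:
    """Return 'primary' for complex tasks, 'secondary' for simple ones."""
    score = 0
    if len(task.split()) > 30:      # long task descriptions tend to be complex
        score += 1
    if num_tools > 5:               # many tools in scope
        score += 1
    if expected_calls > 3:          # long tool-call chains
        score += 1
    if any(hint in task.lower() for hint in COMPLEX_HINTS):
        score += 1
    return "primary" if score >= 2 else "secondary"
```

A quick check against the examples above: "list files in Downloads" trips no rule and routes to the secondary model, while "compose a nuanced reply to this legal thread" with several tools in play routes to the primary.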
Smart routing also supports a judge model for the Sentinel safety layer. Instead of using your expensive primary model to run safety checks, you can assign a cheaper model specifically for Sentinel's screening decisions. Since safety screening is a simpler classification task (safe vs. unsafe, PII vs. no PII), a smaller model handles it effectively. This can reduce the total cost of Sentinel's overhead to near zero.
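The three roles can be pictured as a small configuration like the one below. The key names and model ids are examples only, not Nemo's actual settings schema.

```python
# Illustrative routing configuration: a capable primary, a cheap local
# secondary, and a small judge model reserved for Sentinel's screening.
routing_config = {
    "primary":   {"provider": "anthropic", "model": "claude-3-opus"},
    "secondary": {"provider": "ollama",    "model": "llama3"},
    "judge":     {"provider": "ollama",    "model": "phi3"},
}

def model_for(role: str) -> str:
    """Resolve a routing role to a provider/model identifier."""
    cfg = routing_config[role]
    return f'{cfg["provider"]}/{cfg["model"]}'
```

With a local judge like this, every Sentinel check runs at zero marginal cost while complex user-facing tasks still reach the frontier model.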
Typical savings
In our testing, smart routing reduces LLM costs by 40–60% compared to using a single frontier model for everything. For a user spending $10/month on Claude, enabling smart routing with Ollama as the secondary model can bring costs down to $4–6/month with no noticeable quality difference in task completion.
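The arithmetic behind that estimate can be worked through under the stated assumption that the secondary model (local Ollama) costs nothing: the bill simply scales by the fraction of calls that still hit the paid primary model.

```python
def blended_cost(monthly_cost: float, fraction_to_secondary: float,
                 secondary_cost_ratio: float = 0.0) -> float:
    """Monthly cost after routing a fraction of calls to the secondary.

    secondary_cost_ratio is the secondary's per-call cost relative to
    the primary's (0.0 for a free local model).
    """
    primary_share = 1.0 - fraction_to_secondary
    return monthly_cost * (primary_share
                           + fraction_to_secondary * secondary_cost_ratio)
```

Routing 40 to 60 percent of calls to a free secondary turns a $10/month bill into $6 or $4, matching the range above.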
Cost Comparison Table
Here is what each provider costs for typical Nemo usage patterns, based on February 2026 pricing: