Self-hosting an LLM used to mean compromising on quality. That's no longer true. The gap between frontier closed models and the best open-source alternatives has narrowed to the point where, for most production workloads, you can run locally and never miss a GPT-4-class API. The catch: hardware requirements, license restrictions, and inference trade-offs vary wildly. Here's how to pick without wasting a weekend.
How We Picked
We evaluated models on four criteria: benchmark quality relative to size, minimum viable hardware (using 4-bit quantization as the baseline), license permissiveness (Apache 2.0 / MIT beats custom community licenses for commercial use), and ecosystem support in tools like Ollama, vLLM, and TGI. Models that exist only as API-first offerings were excluded — everything here runs on hardware you control.
VRAM estimates use the INT4 rule of thumb: ~0.5 bytes per parameter, plus 10–20% KV-cache overhead. Ollama defaults to Q4KM quantization, which is a reasonable starting point for most users.
---
The 7 Best Open-Source LLMs for Self-Hosting
1. Mistral 7B — Best for Low-Latency, Single-GPU Deployments
License: Apache 2.0 | VRAM: ~14 GB | Context: 32K
Mistral 7B remains the most practical model for teams running a single consumer GPU (RTX 3090/4090). Apache 2.0 means zero legal friction for commercial products. Inference is fast enough for real-time chat without batching tricks.
Pros: Truly permissive license; runs on one GPU; excellent tokens/sec; huge community support. Cons: 7B parameters show their limits on complex multi-step reasoning; context window is smaller than newer models.
---
2. Mixtral 8×7B — Best Cost-Efficient MoE for General Tasks
License: Apache 2.0 | VRAM: ~28–32 GB (4-bit) | Active params: 12.9B | Context: 32K
Mixtral routes each token through 2 of 8 expert networks, so you get ~46.7B total parameters but only pay the compute cost of ~12.9B active ones. It hits ~70% on MMLU — competitive with much larger dense models — while fitting on two RTX 4090s at 4-bit. Apache 2.0 license makes it a clean commercial choice.
Pros: Strong benchmark scores for the inference cost; Apache 2.0; well-supported in vLLM and Ollama. Cons: MoE architecture means higher memory bandwidth requirements than a pure 13B dense model; not multimodal.
---
3. Gemma 3 27B — Best for Safety-Conscious Enterprise Deployments
License: Google Gemma Terms (commercial use permitted after acceptance) | VRAM: ~18 GB (4-bit) | Context: 128K | Languages: 140+
Gemma 3 27B outperforms Llama 3.1 405B on human preference evaluations while fitting on a single high-end consumer GPU — a remarkable size-to-quality ratio. It supports 140+ languages and multimodal input from the 4B variant up. Google's safety tuning makes it a defensible choice for regulated industries.
Pros: Single-GPU at 27B; 128K context; strong multilingual; Google's safety investment baked in. Cons: Custom license (not Apache/MIT) — read the terms before commercial deployment; Google can update the acceptable-use policy.
---
4. Llama 3.3 70B — Best All-Around Quality for Teams with Multi-GPU Hardware
License: Llama 3 Community License | VRAM: ~40 GB (Q4KM) | Context: 128K
For teams that can throw two A100s or four RTX 4090s at a model, Llama 3.3 70B is the current quality ceiling for open-weight text models. Instruction-following is excellent, RAG pipelines work reliably, and the ecosystem support is the best of any model on this list. Unlike Llama 4, there's no EU multimodal restriction.
Pros: Best-in-class quality for open-weight text; massive community; strong RAG and agent performance. Cons: Requires ~40 GB VRAM at 4-bit — that's real hardware cost; Llama community license has commercial caps (orgs >700M MAU need a separate agreement).
---
5. DeepSeek R1 (32B distill) — Best for Math, Reasoning, and Chain-of-Thought
License: MIT | VRAM: ~20 GB (4-bit) | Context: 128K
DeepSeek R1's 32B distilled variant achieves roughly 90% on AIME math benchmarks — a score that was frontier-model territory six months ago. MIT license is as permissive as it gets. The full 671B MoE model needs a multi-GPU server, but the 32B distill is the sweet spot: reasoning quality that punches well above its weight class, on hardware that's actually accessible.
Pros: MIT license; exceptional reasoning and math; distilled variants scale from 1.5B to 70B. Cons: Optimized for chain-of-thought tasks; slower for simple Q&A than a 7B generalist; Chinese company origin may be a procurement concern for some organizations.
---
6. Phi-4 (~14B) — Best for Edge and Memory-Constrained Environments
License: MIT | VRAM: ~9 GB (4-bit) | Context: 16K
Microsoft's Phi-4 consistently outperforms models twice its size on reasoning and knowledge benchmarks. MIT license, sub-10 GB VRAM at 4-bit, and solid coding ability make it the go-to for edge deployments, Raspberry Pi-class servers, or any setup where memory is the hard constraint.
Pros: MIT license; runs on a single mid-range GPU or Apple Silicon; strong reasoning per GB of VRAM. Cons: Shorter context window; not multimodal; smaller community than Llama or Mistral.
---
7. Qwen 2.5 72B — Best for Multilingual and Coding Workloads
License: Apache 2.0 | VRAM: ~40 GB (4-bit) | Context: 128K
Qwen 2.5 72B matches Llama 3.3 70B on most English benchmarks and pulls ahead on Chinese, math, and code. Apache 2.0 is a genuine advantage over Llama's community license for commercial teams. If your user base is multilingual or your core use case is code generation, Qwen 2.5 72B is the better pick.
Pros: Apache 2.0; best-in-class multilingual (especially CJK languages); strong code generation; competitive with Llama 3.3 70B overall. Cons: Same hardware requirements as Llama 3.3 70B; Alibaba origin may face the same procurement scrutiny as DeepSeek.
---
Comparison Table
| Model | Active Params | Min VRAM (4-bit) | Context | License | Best For |
|---|---|---|---|---|---|
| Mistral 7B | 7B | ~14 GB | 32K | Apache 2.0 | Low-latency, single GPU |
| Mixtral 8×7B | 12.9B | ~28–32 GB | 32K | Apache 2.0 | Cost-efficient general use |
| Gemma 3 27B | 27B | ~18 GB | 128K | Gemma Terms | Enterprise, safety, multilingual |
| Llama 3.3 70B | 70B | ~40 GB | 128K | Llama 3 Community | Best text quality, RAG |
| DeepSeek R1 32B | 32B | ~20 GB | 128K | MIT | Math, reasoning, CoT |
| Phi-4 14B | 14B | ~9 GB | 16K | MIT | Edge, low-memory |
| Qwen 2.5 72B | 72B | ~40 GB | 128K | Apache 2.0 | Multilingual, coding |
---
Recommendation by Use Case
You have one consumer GPU (16–24 GB VRAM): Start with Mistral 7B for speed or Gemma 3 12B for quality. Both run in Ollama with a single command.
You need the best quality and have two A100s or four RTX 4090s: Llama 3.3 70B for English-first workloads; Qwen 2.5 72B if you need Apache 2.0 or multilingual coverage.
Your workload is math, science, or multi-step reasoning: DeepSeek R1 32B distill is the clear winner. The MIT license removes any commercial friction.
You're deploying at the edge or on Apple Silicon with limited RAM: Phi-4 at 4-bit quantization. Nothing else at this memory footprint comes close on reasoning tasks.
You need a clean Apache 2.0 or MIT license for a commercial product: Eliminate Llama 4 (EU restrictions, community license) and Gemma (custom terms). Your shortlist is Mistral 7B, Mixtral 8×7B, Qwen 2.5 72B, DeepSeek R1, and Phi-4.
Cost reality check: Running a 7B model on a cloud GPU costs roughly $300–800/month. Break-even against a commercial API typically happens somewhere between 100K and 500K queries/month, depending on your API pricing tier. Below that threshold, self-hosting is an architectural choice, not a cost optimization.
This article was drafted with AI assistance, reviewed and edited by a human editor, and all hardware specs, license terms, and benchmark claims were verified against the cited research sources before publication.