Overview

DeepSeek released a complete ecosystem of open-source reasoning models in January 2025, ranging from lightweight 1.5B parameter versions for edge deployment to the full 671B parameter model for enterprise use. All models are released under the MIT License, enabling commercial use and modifications.

This guide focuses on the available open-source model options, the business economics of different deployment approaches, and practical implementation strategies. The critical insight for businesses: for most workloads, cloud API services like Together AI provide the most economically viable path to deploying powerful reasoning capabilities at scale.

Available DeepSeek Open Source Models

DeepSeek provides a complete ecosystem of reasoning models, from lightweight edge deployment to full-scale research models. Understanding your options is critical before implementation.

Official DeepSeek Models

Model | Parameters | Size | RAM Required | Use Case | License
DeepSeek-R1 | 671B (MoE) | 404GB | 512GB+ | Research/Enterprise | MIT
DeepSeek-V3 | 671B (MoE) | 400GB | 512GB+ | General purpose | MIT
DeepSeek-R1 (8B) | 8B | 5.2GB | 8GB+ | Local reasoning | MIT
DeepSeek-R1 (7B) | 7B | 4.7GB | 8GB+ | Code reasoning | MIT
DeepSeek-R1 (1.5B) | 1.5B | 1.1GB | 4GB+ | Edge devices | MIT

Distilled Models: The Key to Accessible Reasoning

Distilled models are smaller, faster versions that learn to mimic the reasoning behavior of the full 671B DeepSeek-R1 model. Think of it as having a master chef teach their techniques to apprentice chefs - the apprentices can’t match the master’s expertise perfectly, but they can deliver 75-85% of the quality at a fraction of the cost.

How DeepSeek Created These Models: DeepSeek used their full R1 model to generate 800,000 high-quality question-answer pairs with complete reasoning processes. They then fine-tuned smaller base models (Llama and Qwen architectures) on this dataset, teaching them to replicate the step-by-step thinking patterns.
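
The same recipe can be reproduced in miniature with standard tooling. The sketch below shows plain supervised fine-tuning of a small base model on teacher-generated reasoning traces; the base model (gated on Hugging Face), the single training example, and the training arguments are illustrative placeholders, not DeepSeek's actual pipeline, and real training would use the full ~800K-example dataset with proper hyperparameters and substantial GPU memory.

from transformers import AutoTokenizer, AutoModelForCausalLM, Trainer, TrainingArguments
from datasets import Dataset

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")

# One teacher-generated training example; the real dataset contains ~800K of these.
examples = [{
    "text": (
        "User: Solve 2x + 5 = 15\n"
        "[thinking]\nSubtract 5 from both sides: 2x = 10. Divide by 2: x = 5.\n[/thinking]\n"
        "The answer is x = 5."
    )
}]

def tokenize(example):
    tokens = tokenizer(example["text"], truncation=True, max_length=2048)
    tokens["labels"] = tokens["input_ids"].copy()   # standard next-token objective
    return tokens

dataset = Dataset.from_list(examples).map(tokenize, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="r1-distill-sketch", per_device_train_batch_size=1),
    train_dataset=dataset,
)
trainer.train()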

Why This Approach Works:

  • Perfect Teacher: The 671B R1 provides extremely high-quality reasoning examples
  • Pattern Learning: Smaller models learn to replicate thinking processes, not just final answers
  • Specialization: Different base architectures excel at different reasoning types
  • Accessibility: Get most reasoning benefits on consumer hardware

Example Comparison:

Original Llama-3.1-8B (No Reasoning):

User: Solve 2x + 5 = 15
Model: The answer is x = 5

Llama-3.1-8B-R1 (Distilled with Reasoning):

User: Solve 2x + 5 = 15
Model: [thinking]
I need to solve for x in the equation 2x + 5 = 15
First, I’ll subtract 5 from both sides:
2x + 5 - 5 = 15 - 5
2x = 10
Then divide both sides by 2:
x = 10/2 = 5
[/thinking]

The answer is x = 5

Model | Base Architecture | Parameters | Size | Specialization | Download
Llama-3.1-70B-R1 | Llama 3.1 | 70B | ~40GB | Production reasoning | HF/Ollama
Qwen2.5-72B-R1 | Qwen 2.5 | 72B | ~42GB | Math & science | HF/Ollama
Llama-3.1-8B-R1 | Llama 3.1 | 8B | 5.2GB | General reasoning | HF/Ollama
Qwen2.5-14B-R1 | Qwen 2.5 | 14B | ~8GB | Balanced performance | HF/Ollama
Qwen2.5-Coder-7B-R1 | Qwen 2.5 | 7B | 4.7GB | Code & programming | HF/Ollama
Qwen2.5-1.5B-R1 | Qwen 2.5 | 1.5B | 1.1GB | Mobile/embedded | HF/Ollama

Model Configurations and Capabilities

DeepSeek-R1 (671B) - Full Model:

  • Architecture: Mixture of Experts (MoE) with 671B total parameters
  • Active Parameters: ~37B per token (efficient inference)
  • Context Window: 128K tokens
  • Training Cost: $5.5M (base V3 model)
  • Performance: Comparable to OpenAI o1-level reasoning on benchmarks
  • Special Features: Full reasoning capability, self-correction, multi-step analysis

DeepSeek-R1 (8B) - Most Popular:

  • Architecture: Dense transformer model
  • Context Window: 128K tokens
  • Reasoning Quality: 85-90% of full R1 performance
  • Speed: 15-25 tokens/second (local)
  • Memory: 8GB VRAM minimum, 16GB recommended
  • Best For: Local deployment, development, testing

Distilled Models Features:

  • Training: Fine-tuned on 800K reasoning examples from R1
  • Performance: 75-85% of full R1 reasoning quality
  • Inference: 2-5x faster than full R1 model
  • Deployment: Optimized for local/edge deployment
  • Compatibility: Standard transformer architecture

Download and Installation Options

Ollama (Recommended for Local Use):

# Most popular choice
ollama pull deepseek-r1:8b

# Lightweight option
ollama pull deepseek-r1:1.5b

# High performance
ollama pull deepseek-r1:14b

# Full model (requires 512GB+ RAM)
ollama pull deepseek-r1:671b
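
After pulling a model, you can sanity-check it from code. Ollama exposes a local HTTP API on port 11434 by default; the model tag and prompt below are just examples.

import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "deepseek-r1:8b",
        "prompt": "Solve step by step: what is 17 * 24?",
        "stream": False,          # return one JSON object instead of a token stream
    },
    timeout=300,
)
print(resp.json()["response"])    # includes the model's thinking plus the final answer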

Hugging Face (For Development):

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load the 8B distilled reasoning model
model_name = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto"
)
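
From there, generation follows the usual transformers pattern. A minimal sketch; the prompt and decoding settings are illustrative (0.6 matches the temperature recommended later in this guide).

# Generate a response from the loaded model
prompt = "Solve step by step: 2x + 5 = 15"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512, temperature=0.6, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))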

API Access (No Local Setup):

  • Together AI: deepseek-ai/deepseek-r1
  • DeepInfra: deepseek-ai/deepseek-r1
  • OpenRouter: deepseek/deepseek-r1

Hardware Requirements by Model Size

Model Size | Minimum VRAM | Recommended RAM | Disk Space | Performance
1.5B | 2GB | 4GB | 2GB | Basic reasoning
7-8B | 8GB | 16GB | 6GB | Good reasoning
14B | 16GB | 24GB | 10GB | Very good reasoning
32B | 32GB | 48GB | 20GB | Excellent reasoning
70B | 48GB | 64GB | 45GB | Near full performance
671B | 400GB | 512GB | 450GB | Full performance

Business Case: Traditional Cloud GPU vs API Services

When deploying reasoning models for business applications, organizations face a critical choice between two fundamentally different cloud approaches: managing GPU infrastructure on traditional cloud platforms (AWS, Azure, GCP) versus using managed API services (Together AI, DeepInfra). The economic and operational differences are dramatic.

The Traditional Cloud GPU Reality

Traditional cloud GPU deployment for reasoning models creates prohibitive cost structures and operational complexity that makes them unsuitable for most business applications.

Massive Fixed Infrastructure Costs

Deploying reasoning models on AWS, Azure, or Google Cloud Platform requires provisioning massive GPU instances that remain online continuously to avoid cold start delays. A 70B reasoning model demands A100 or H100 GPU instances costing $3-8 per hour, resulting in monthly bills of $2,160 to $5,760 for 24/7 operation regardless of actual usage patterns.

The full 671B DeepSeek-R1 model requires multiple high-end GPUs with inter-GPU communication optimizations, pushing monthly costs to $15,000-30,000 just for compute resources before considering storage, networking, and data transfer costs that can add thousands more.

GPU Availability and Capacity Constraints

These cost projections assume organizations can actually access the required GPUs during peak demand periods. Securing A100 or H100 instances often requires reserved capacity commitments that lock in expenses regardless of actual usage patterns. Many organizations discover that GPU availability constraints force them into expensive multi-region deployments to ensure reliability, further multiplying infrastructure costs.

The scarcity of high-end GPU instances means organizations must often accept less optimal configurations or pay premium prices for guaranteed availability, making cost planning extremely difficult and often forcing budget overruns.

Engineering and Operational Overhead

Beyond raw compute costs, traditional cloud GPU deployment demands significant engineering overhead that most organizations underestimate. Teams must manage model loading optimization, implement comprehensive health monitoring, handle GPU instance failures gracefully, optimize memory usage patterns, and maintain complex deployment pipelines.

Most teams discover that GPU-optimized model serving requires specialized expertise in CUDA programming, distributed systems, and high-performance computing that diverts scarce engineering resources from core business development priorities.

Deployment Option | Setup Complexity | Monthly Cost (70B) | Monthly Cost (671B) | Operational Burden
AWS A100 Instances | High | $2,160-5,760 | $15,000-30,000 | Extensive
Azure H100 Instances | High | $2,880-6,480 | $18,000-35,000 | Extensive
GCP A100 Instances | High | $2,520-5,400 | $16,000-28,000 | Extensive
Together AI API | Minimal | $200-1,500* | $500-4,000* | None
DeepInfra API | Minimal | $180-1,200* | $400-3,500* | None

*Usage-dependent pricing

Why API Services Transform Business Economics

Managed API services fundamentally transform the economics of reasoning model deployment by eliminating fixed infrastructure costs and operational complexity through shared, optimized infrastructure.

Shared Infrastructure Cost Efficiency

Managed API services like Together AI and DeepInfra operate on a fundamentally different model that eliminates both the massive fixed costs and operational complexity of traditional cloud GPU deployment. Instead of provisioning and managing dedicated GPU infrastructure, organizations pay only for actual reasoning tokens consumed by their applications.

These platforms achieve dramatic cost efficiencies through shared infrastructure and optimized model serving. They maintain GPU clusters running multiple models simultaneously, distributing fixed infrastructure costs across thousands of applications rather than requiring dedicated instances per customer. Advanced techniques like dynamic batching, model quantization, and intelligent routing deliver the same reasoning capabilities at a fraction of traditional cloud costs.

Usage-Based Pricing Reality

Consider a realistic business scenario processing 2 million reasoning tokens monthly across various applications. Traditional cloud deployment requires maintaining expensive GPU instances continuously, resulting in costs of $2,160-5,760 monthly regardless of actual usage patterns or business seasonality.

API services charge only for consumed tokens, typically resulting in monthly costs of $200-1,500 for the same workload while providing superior reliability, automatic scaling, and no operational overhead. This usage-based model aligns costs with business value and eliminates the risk of massive fixed expenses for unpredictable usage patterns.

Operational Simplicity and Focus

The operational advantages prove equally compelling for business organizations. API services eliminate the need for specialized GPU expertise, infrastructure management, or model optimization. Development teams focus on business logic and user experience rather than managing CUDA drivers, optimizing memory allocation, or debugging distributed GPU systems.

This operational simplicity accelerates development cycles and allows technical teams to concentrate on features that directly impact business outcomes rather than infrastructure management that provides no competitive advantage.

Real-World Cost Analysis

Most business reasoning applications exhibit highly variable usage patterns with significant daily and seasonal fluctuations. Traditional cloud GPU deployment forces you to provision for peak capacity that remains idle during low-usage periods, creating massive cost inefficiencies.

A financial services company processing reasoning requests for risk analysis might handle 10,000 tokens during overnight batch processing and 500,000 tokens during market hours. Traditional cloud deployment requires maintaining expensive H100 instances continuously, costing approximately $6,000 monthly. API services charge only for actual tokens consumed, resulting in monthly costs around $800-1,200 while providing better reliability and global availability.

Strategic Implementation Framework

The optimal strategy begins with API services to establish baseline performance and validate business requirements. Start with Together AI or DeepInfra to deploy reasoning capabilities without infrastructure investment or technical complexity. This approach allows rapid iteration, accurate cost measurement, and validation of reasoning quality against actual business use cases.

Traditional cloud GPU deployment becomes viable only for applications with consistently high-volume usage patterns exceeding 50-100 million tokens monthly, combined with specific requirements for custom model architectures or specialized inference optimizations. Even then, the operational complexity often outweighs potential cost savings unless you maintain dedicated ML infrastructure teams.

For most organizations, API services provide superior economics, operational simplicity, and strategic agility. They eliminate the capital risk of infrastructure investment while enabling rapid scaling based on actual business growth rather than capacity planning projections.

Understanding Reasoning Models: When They Help

Reasoning models work differently from standard language models. Instead of immediately generating an answer, they first generate “thinking tokens” that show their internal reasoning process.

Standard Model Response:

User: What is 1 + 1?
Model: 2

Reasoning Model Response:

User: What is 1 + 1?
Model: [thinking]
This is a straightforward arithmetic question. 1 + 1 = 2.
Wait, let me double-check if this might be a trick question.
The user could be referring to binary arithmetic where 1 + 1 = 10,
or they could be asking about something metaphorical.
But in standard decimal arithmetic, 1 + 1 = 2.
[/thinking]

The answer is 2.

Reasoning models excel at complex math and logic problems where you need to see the step-by-step process. They provide significant value for multi-step programming challenges, analysis requiring multiple perspectives, and situations where you need to verify the reasoning process. Debugging and error-checking scenarios particularly benefit from the visible thinking process, as you can identify where the model’s logic breaks down.

Standard models work better for simple factual questions, creative writing and storytelling, and conversational chat where the reasoning process adds unnecessary overhead. Function calling and API integration typically perform better with standard models, as do any tasks requiring fast response times where the additional thinking tokens create unwanted latency.

Quick Start: Trying R1 Models

The fastest way to understand reasoning models is to try them. You can access DeepSeek-R1 through several platforms without setup:

Online Playgrounds

Together AI Playground:

  1. Visit together.ai/playground
  2. Select “DeepSeek-R1” from the model dropdown
  3. Try asking: “Solve this step by step: If I have 15 apples and give away 1/3, then buy 8 more, how many apples do I have?”

DeepInfra Console:

  1. Go to deepinfra.com
  2. Select DeepSeek-R1 model
  3. Test with coding problems or math questions

Example API Usage

import requests

response = requests.post(
    "https://api.together.xyz/v1/completions",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "model": "deepseek-ai/deepseek-r1",
        "prompt": "Write Python code to find the largest prime factor of 315",
        "max_tokens": 1500,
        "temperature": 0.6
    }
)

# The response includes a [thinking] section followed by the final answer
print(response.json()['choices'][0]['text'])

What You’ll Notice

  • Thinking tokens: Much longer responses showing step-by-step reasoning
  • Self-correction: Model catches and fixes its own mistakes
  • Multiple approaches: Explores different solution strategies
  • Higher latency: Takes longer due to reasoning process
  • Variable output length: Complex problems generate more thinking

Common Issues and Solutions

Implementing reasoning models presents unique challenges. Here are solutions to frequently encountered problems.

Model Not Showing Thinking Process

When reasoning models output answers without showing reasoning steps, the most effective solution is to force thinking tags directly in the prompt. Include explicit placeholder text like “[Please analyze this problem step by step here]” within thinking tags, followed by your actual problem. Alternatively, you can add explicit instructions stating “Before answering, you MUST show your reasoning process in thinking tags” to ensure the model follows the expected format.
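
In code, this can be as simple as a prompt template that pre-seeds the thinking tags on every request; the wording below is one illustrative template, not a required format.

def force_thinking(problem: str) -> str:
    # Pre-seed the thinking tags so the model fills them in before answering.
    return (
        "Before answering, you MUST show your reasoning process in thinking tags.\n\n"
        "[thinking]\n[Please analyze this problem step by step here]\n[/thinking]\n\n"
        f"Problem: {problem}"
    )

prompt = force_thinking("A train travels 120 km in 90 minutes. What is its average speed in km/h?")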

Inconsistent Output Format

Format consistency issues arise when models don’t follow the requested structure. The solution requires extreme explicitness in your formatting requirements. Specify exactly how you want the response structured, including thinking tags and answer sections, and explicitly state “Do not deviate from this format.” The more detailed and rigid your format specification, the more consistent the model’s adherence becomes.
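
A reusable preamble that spells out the exact structure helps keep responses parseable; the template below is one illustrative way to be that explicit.

FORMAT_SPEC = (
    "Respond in EXACTLY this structure. Do not deviate from this format.\n"
    "[thinking]\n"
    "<your step-by-step reasoning>\n"
    "[/thinking]\n"
    "Answer: <one concise final answer>\n"
)

def with_format(problem: str) -> str:
    # Prepend the rigid format specification to every request.
    return f"{FORMAT_SPEC}\nProblem: {problem}"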

Poor Reasoning Quality

Shallow or incorrect reasoning often stems from configuration issues rather than model limitations. Using the correct temperature setting of 0.6 significantly improves reasoning quality compared to 0.0 or 1.0. For critical applications, generate multiple responses and take a majority vote across 4-8 generations to improve reliability. Adding a verification step where you ask the model to double-check its work also enhances accuracy, and larger models (70B+) consistently demonstrate superior reasoning capabilities compared to smaller variants.
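
A small self-consistency wrapper around the API illustrates the majority-vote idea. The endpoint and the parsing convention (answer text after the closing [/thinking] tag) follow the examples earlier in this guide; the vote count and exact-match voting are simplifications to adapt for real use.

import requests
from collections import Counter

API_URL = "https://api.together.xyz/v1/completions"
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}

def one_answer(prompt: str) -> str:
    resp = requests.post(API_URL, headers=HEADERS, json={
        "model": "deepseek-ai/deepseek-r1",
        "prompt": prompt,
        "max_tokens": 1500,
        "temperature": 0.6,   # sampling diversity is what makes voting useful
    }, timeout=120)
    text = resp.json()["choices"][0]["text"]
    # Keep only what follows the reasoning block; fall back to the full text.
    return text.split("[/thinking]")[-1].strip()

def majority_vote(prompt: str, n: int = 5) -> str:
    # Exact-match voting is naive; normalize answers (e.g. strip formatting) for production.
    answers = [one_answer(prompt) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]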

High API Costs

Token consumption can become expensive with reasoning models due to extensive thinking processes. Distilled models provide 60% cost reduction while maintaining most reasoning capabilities. Optimize prompts by removing unnecessary context, implement response caching for repeated problems, and use a hybrid approach where V3 handles preprocessing while R1 focuses only on complex reasoning steps.
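
Response caching is straightforward to sketch: key on the exact prompt and generation parameters so repeated problems never hit the API twice. The in-memory dictionary below is illustrative; a production system would use Redis or a similar shared store.

import hashlib
import json

_cache: dict[str, str] = {}

def cached_reasoning(prompt: str, call_api, **params) -> str:
    # Hash the prompt plus generation parameters to form a stable cache key.
    key = hashlib.sha256(
        json.dumps({"prompt": prompt, **params}, sort_keys=True).encode()
    ).hexdigest()
    if key not in _cache:
        _cache[key] = call_api(prompt, **params)   # only pay for tokens on a cache miss
    return _cache[key]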

Slow Response Times

Response latency issues affect interactive applications where reasoning overhead becomes problematic. Smaller models (8B-14B) deliver 3-5x faster performance while maintaining reasonable reasoning quality. Local deployment eliminates network latency, setting reasonable max_tokens prevents excessive thinking, and streaming responses allow users to see the thinking process as it develops rather than waiting for completion.
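
Streaming keeps interactive UIs responsive while the model is still thinking. The sketch below assumes the endpoint supports the common OpenAI-style "stream": true server-sent-events convention; chunk field names may vary slightly by provider.

import json
import requests

resp = requests.post(
    "https://api.together.xyz/v1/completions",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "model": "deepseek-ai/deepseek-r1",
        "prompt": "Explain step by step why the sum of two odd numbers is even.",
        "max_tokens": 800,
        "temperature": 0.6,
        "stream": True,            # ask for server-sent events instead of one blob
    },
    stream=True,
    timeout=300,
)

for line in resp.iter_lines():
    if not line or not line.startswith(b"data: "):
        continue
    payload = line[len(b"data: "):]
    if payload == b"[DONE]":
        break
    chunk = json.loads(payload)
    # Print tokens (including the thinking) as they arrive.
    print(chunk["choices"][0].get("text", ""), end="", flush=True)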

Model Getting Stuck in Loops

Repetitive thinking without conclusions occurs when models enter circular reasoning patterns. Implement token limits to prevent infinite loops, add explicit stopping criteria that force conclusions after a reasonable number of reasoning steps, use timeouts to kill requests that exceed acceptable duration, and engineer prompts to request concise reasoning focused on reaching definitive answers.
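
Both safeguards, a hard token cap and a wall-clock timeout, fit in a small wrapper; the limits below are arbitrary starting points to tune for your workload.

import requests

def bounded_reasoning(prompt: str, max_tokens: int = 2000, timeout_s: int = 90) -> str:
    # max_tokens caps runaway thinking; the request timeout kills calls that hang.
    resp = requests.post(
        "https://api.together.xyz/v1/completions",
        headers={"Authorization": "Bearer YOUR_API_KEY"},
        json={
            "model": "deepseek-ai/deepseek-r1",
            "prompt": prompt + "\n\nKeep the reasoning concise and finish with a definitive answer.",
            "max_tokens": max_tokens,
            "temperature": 0.6,
        },
        timeout=timeout_s,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["text"]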

Integration with Existing Systems

Reasoning models often don’t integrate smoothly with existing LLM infrastructures designed for standard models. Parse thinking tokens separately from final answers to log reasoning processes independently, use wrapper functions that abstract reasoning complexity from your application logic, implement fallback mechanisms to standard V3 models when reasoning fails, and consider deploying reasoning capabilities as a dedicated microservice for complex problems while maintaining your existing architecture for standard tasks.
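
A thin wrapper can hide all of this from application code: split the [thinking] block from the final answer so reasoning can be logged separately, and fall back to a standard model if the reasoning call fails. The regex follows this guide's tag convention; call_r1 and call_v3 stand in for whatever client functions your system already uses.

import logging
import re

logging.basicConfig(level=logging.INFO)
THINKING_RE = re.compile(r"\[thinking\](.*?)\[/thinking\]", re.DOTALL)

def split_reasoning(raw: str) -> tuple[str, str]:
    # Returns (reasoning, answer); reasoning is empty if no thinking block is present.
    match = THINKING_RE.search(raw)
    reasoning = match.group(1).strip() if match else ""
    answer = THINKING_RE.sub("", raw).strip()
    return reasoning, answer

def answer_with_fallback(prompt: str, call_r1, call_v3) -> str:
    # call_r1 / call_v3 are your own client functions for the reasoning and standard models.
    try:
        reasoning, answer = split_reasoning(call_r1(prompt))
        logging.info("reasoning trace: %s", reasoning)   # log the thinking separately
        return answer
    except Exception:
        return call_v3(prompt)                           # degrade gracefully to the standard model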