Overview

DeepSeek released a complete ecosystem of open-source reasoning models in January 2025, ranging from lightweight 1.5B parameter versions for edge deployment to the full 671B parameter model for enterprise use. All models are released under the MIT License, enabling commercial use and modifications.

This guide focuses on the available open-source model options, the business economics of different deployment approaches, and practical implementation strategies. The critical insight for businesses: for most workloads, cloud API services like Together AI provide the most economically viable path to deploying powerful reasoning capabilities at scale.

Available DeepSeek Open Source Models

DeepSeek provides a complete ecosystem of reasoning models, from lightweight edge deployment to full-scale research models. Understanding your options is critical before implementation.

Official DeepSeek Models

Model | Parameters | Size | RAM Required | Use Case | License
DeepSeek-R1 | 671B (MoE) | 404GB | 512GB+ | Research/Enterprise | MIT
DeepSeek-V3 | 671B (MoE) | 400GB | 512GB+ | General purpose | MIT
DeepSeek-R1 (8B) | 8B | 5.2GB | 8GB+ | Local reasoning | MIT
DeepSeek-R1 (7B) | 7B | 4.7GB | 8GB+ | Code reasoning | MIT
DeepSeek-R1 (1.5B) | 1.5B | 1.1GB | 4GB+ | Edge devices | MIT

Distilled Models: The Key to Accessible Reasoning

Distilled models are smaller, faster versions that learn to mimic the reasoning behavior of the full 671B DeepSeek-R1 model. Think of it as having a master chef teach their techniques to apprentice chefs - the apprentices can’t match the master’s expertise perfectly, but they can deliver 75-85% of the quality at a fraction of the cost.

How DeepSeek Created These Models: DeepSeek used their full R1 model to generate 800,000 high-quality question-answer pairs with complete reasoning processes. They then fine-tuned smaller base models (Llama and Qwen architectures) on this dataset, teaching them to replicate the step-by-step thinking patterns.
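
The same recipe can be reproduced in miniature with standard tooling. The sketch below shows plain supervised fine-tuning of a small base model on teacher-generated reasoning traces; the base model (gated on Hugging Face), the single training example, and the training arguments are illustrative placeholders, not DeepSeek's actual pipeline, and real training would use the full ~800K-example dataset with proper hyperparameters and substantial GPU memory.

from transformers import AutoTokenizer, AutoModelForCausalLM, Trainer, TrainingArguments
from datasets import Dataset

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")

# One teacher-generated training example; the real dataset contains ~800K of these.
examples = [{
    "text": (
        "User: Solve 2x + 5 = 15\n"
        "[thinking]\nSubtract 5 from both sides: 2x = 10. Divide by 2: x = 5.\n[/thinking]\n"
        "The answer is x = 5."
    )
}]

def tokenize(example):
    tokens = tokenizer(example["text"], truncation=True, max_length=2048)
    tokens["labels"] = tokens["input_ids"].copy()   # standard next-token objective
    return tokens

dataset = Dataset.from_list(examples).map(tokenize, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="r1-distill-sketch", per_device_train_batch_size=1),
    train_dataset=dataset,
)
trainer.train()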

Why This Approach Works:

  • Perfect Teacher: The 671B R1 provides extremely high-quality reasoning examples
  • Pattern Learning: Smaller models learn to replicate thinking processes, not just final answers
  • Specialization: Different base architectures excel at different reasoning types
  • Accessibility: Get most reasoning benefits on consumer hardware

Example Comparison:

Original Llama-3.1-8B (No Reasoning):

User: Solve 2x + 5 = 15
Model: The answer is x = 5

Llama-3.1-8B-R1 (Distilled with Reasoning):

User: Solve 2x + 5 = 15
Model: [thinking]
I need to solve for x in the equation 2x + 5 = 15
First, I’ll subtract 5 from both sides:
2x + 5 - 5 = 15 - 5
2x = 10
Then divide both sides by 2:
x = 10/2 = 5
[/thinking]

The answer is x = 5

Model | Base Architecture | Parameters | Size | Specialization | Download
Llama-3.1-70B-R1 | Llama 3.1 | 70B | ~40GB | Production reasoning | HF/Ollama
Qwen2.5-72B-R1 | Qwen 2.5 | 72B | ~42GB | Math & science | HF/Ollama
Llama-3.1-8B-R1 | Llama 3.1 | 8B | 5.2GB | General reasoning | HF/Ollama
Qwen2.5-14B-R1 | Qwen 2.5 | 14B | ~8GB | Balanced performance | HF/Ollama
Qwen2.5-Coder-7B-R1 | Qwen 2.5 | 7B | 4.7GB | Code & programming | HF/Ollama
Qwen2.5-1.5B-R1 | Qwen 2.5 | 1.5B | 1.1GB | Mobile/embedded | HF/Ollama

Model Configurations and Capabilities

DeepSeek-R1 (671B) - Full Model:

  • Architecture: Mixture of Experts (MoE) with 671B total parameters
  • Active Parameters: ~37B per token (efficient inference)
  • Context Window: 128K tokens
  • Training Cost: $5.5M (base V3 model)
  • Performance: Comparable to OpenAI o1-level reasoning on benchmarks
  • Special Features: Full reasoning capability, self-correction, multi-step analysis

DeepSeek-R1 (8B) - Most Popular:

  • Architecture: Dense transformer model
  • Context Window: 128K tokens
  • Reasoning Quality: 85-90% of full R1 performance
  • Speed: 15-25 tokens/second (local)
  • Memory: 8GB VRAM minimum, 16GB recommended
  • Best For: Local deployment, development, testing

Distilled Models Features:

  • Training: Fine-tuned on 800K reasoning examples from R1
  • Performance: 75-85% of full R1 reasoning quality
  • Inference: 2-5x faster than full R1 model
  • Deployment: Optimized for local/edge deployment
  • Compatibility: Standard transformer architecture

Download and Installation Options

Ollama (Recommended for Local Use):

# Most popular choice
ollama pull deepseek-r1:8b

# Lightweight option
ollama pull deepseek-r1:1.5b

# High performance
ollama pull deepseek-r1:14b

# Full model (requires 512GB+ RAM)
ollama pull deepseek-r1:671b
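
After pulling a model, you can sanity-check it from code. Ollama exposes a local HTTP API on port 11434 by default; the model tag and prompt below are just examples.

import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "deepseek-r1:8b",
        "prompt": "Solve step by step: what is 17 * 24?",
        "stream": False,          # return one JSON object instead of a token stream
    },
    timeout=300,
)
print(resp.json()["response"])    # includes the model's thinking plus the final answer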

Hugging Face (For Development):

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load the 8B distilled reasoning model
model_name = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto"
)
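
From there, generation follows the usual transformers pattern. A minimal sketch; the prompt and decoding settings are illustrative (0.6 matches the temperature recommended later in this guide).

# Generate a response from the loaded model
prompt = "Solve step by step: 2x + 5 = 15"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512, temperature=0.6, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))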

API Access (No Local Setup):

  • Together AI: deepseek-ai/deepseek-r1
  • DeepInfra: deepseek-ai/deepseek-r1
  • OpenRouter: deepseek/deepseek-r1

Hardware Requirements by Model Size

Model Size | Minimum VRAM | Recommended RAM | Disk Space | Performance
1.5B | 2GB | 4GB | 2GB | Basic reasoning
7-8B | 8GB | 16GB | 6GB | Good reasoning
14B | 16GB | 24GB | 10GB | Very good reasoning
32B | 32GB | 48GB | 20GB | Excellent reasoning
70B | 48GB | 64GB | 45GB | Near full performance
671B | 400GB | 512GB | 450GB | Full performance

Business Case: Traditional Cloud GPU vs API Services

When deploying reasoning models for business applications, organizations face a critical choice between two fundamentally different cloud approaches: managing GPU infrastructure on traditional cloud platforms (AWS, Azure, GCP) versus using managed API services (Together AI, DeepInfra). The economic and operational differences are dramatic.

The Traditional Cloud GPU Reality

Traditional cloud GPU deployment for reasoning models creates prohibitive cost structures and operational complexity that makes them unsuitable for most business applications.

Massive Fixed Infrastructure Costs

Deploying reasoning models on AWS, Azure, or Google Cloud Platform requires provisioning massive GPU instances that remain online continuously to avoid cold start delays. A 70B reasoning model demands A100 or H100 GPU instances costing $3-8 per hour, resulting in monthly bills of $2,160 to $5,760 for 24/7 operation regardless of actual usage patterns.

The full 671B DeepSeek-R1 model requires multiple high-end GPUs with inter-GPU communication optimizations, pushing monthly costs to $15,000-30,000 just for compute resources before considering storage, networking, and data transfer costs that can add thousands more.

GPU Availability and Capacity Constraints

These cost projections assume organizations can actually access the required GPUs during peak demand periods. Securing A100 or H100 instances often requires reserved capacity commitments that lock in expenses regardless of actual usage patterns. Many organizations discover that GPU availability constraints force them into expensive multi-region deployments to ensure reliability, further multiplying infrastructure costs.

The scarcity of high-end GPU instances means organizations must often accept less optimal configurations or pay premium prices for guaranteed availability, making cost planning extremely difficult and often forcing budget overruns.

Engineering and Operational Overhead

Beyond raw compute costs, traditional cloud GPU deployment demands significant engineering overhead that most organizations underestimate. Teams must manage model loading optimization, implement comprehensive health monitoring, handle GPU instance failures gracefully, optimize memory usage patterns, and maintain complex deployment pipelines.

Most teams discover that GPU-optimized model serving requires specialized expertise in CUDA programming, distributed systems, and high-performance computing that diverts scarce engineering resources from core business development priorities.

Deployment Option | Setup Complexity | Monthly Cost (70B) | Monthly Cost (671B) | Operational Burden
AWS A100 Instances | High | $2,160-5,760 | $15,000-30,000 | Extensive
Azure H100 Instances | High | $2,880-6,480 | $18,000-35,000 | Extensive
GCP A100 Instances | High | $2,520-5,400 | $16,000-28,000 | Extensive
Together AI API | Minimal | $200-1,500* | $500-4,000* | None
DeepInfra API | Minimal | $180-1,200* | $400-3,500* | None

*Usage-dependent pricing

Why API Services Transform Business Economics

Managed API services fundamentally transform the economics of reasoning model deployment by eliminating fixed infrastructure costs and operational complexity through shared, optimized infrastructure.

Shared Infrastructure Cost Efficiency

Managed API services like Together AI and DeepInfra operate on a fundamentally different model that eliminates both the massive fixed costs and operational complexity of traditional cloud GPU deployment. Instead of provisioning and managing dedicated GPU infrastructure, organizations pay only for actual reasoning tokens consumed by their applications.

These platforms achieve dramatic cost efficiencies through shared infrastructure and optimized model serving. They maintain GPU clusters running multiple models simultaneously, distributing fixed infrastructure costs across thousands of applications rather than requiring dedicated instances per customer. Advanced techniques like dynamic batching, model quantization, and intelligent routing deliver the same reasoning capabilities at a fraction of traditional cloud costs.

Usage-Based Pricing Reality

Consider a realistic business scenario processing 2 million reasoning tokens monthly across various applications. Traditional cloud deployment requires maintaining expensive GPU instances continuously, resulting in costs of $2,160-5,760 monthly regardless of actual usage patterns or business seasonality.

API services charge only for consumed tokens, typically resulting in monthly costs of $200-1,500 for the same workload while providing superior reliability, automatic scaling, and no operational overhead. This usage-based model aligns costs with business value and eliminates the risk of massive fixed expenses for unpredictable usage patterns.

Operational Simplicity and Focus

The operational advantages prove equally compelling for business organizations. API services eliminate the need for specialized GPU expertise, infrastructure management, or model optimization. Development teams focus on business logic and user experience rather than managing CUDA drivers, optimizing memory allocation, or debugging distributed GPU systems.

This operational simplicity accelerates development cycles and allows technical teams to concentrate on features that directly impact business outcomes rather than infrastructure management that provides no competitive advantage.

Real-World Cost Analysis

Most business reasoning applications exhibit highly variable usage patterns with significant daily and seasonal fluctuations. Traditional cloud GPU deployment forces you to provision for peak capacity that remains idle during low-usage periods, creating massive cost inefficiencies.

A financial services company processing reasoning requests for risk analysis might handle 10,000 tokens during overnight batch processing and 500,000 tokens during market hours. Traditional cloud deployment requires maintaining expensive H100 instances continuously, costing approximately $6,000 monthly. API services charge only for actual tokens consumed, resulting in monthly costs around $800-1,200 while providing better reliability and global availability.

Strategic Implementation Framework

The optimal strategy begins with API services to establish baseline performance and validate business requirements. Start with Together AI or DeepInfra to deploy reasoning capabilities without infrastructure investment or technical complexity. This approach allows rapid iteration, accurate cost measurement, and validation of reasoning quality against actual business use cases.

Traditional cloud GPU deployment becomes viable only for applications with consistently high-volume usage patterns exceeding 50-100 million tokens monthly, combined with specific requirements for custom model architectures or specialized inference optimizations. Even then, the operational complexity often outweighs potential cost savings unless you maintain dedicated ML infrastructure teams.

For most organizations, API services provide superior economics, operational simplicity, and strategic agility. They eliminate the capital risk of infrastructure investment while enabling rapid scaling based on actual business growth rather than capacity planning projections.

Understanding Reasoning Models: When They Help

Reasoning models work differently from standard language models. Instead of immediately generating an answer, they first generate “thinking tokens” that show their internal reasoning process.

Standard Model Response:

User: What is 1 + 1?
Model: 2

Reasoning Model Response:

User: What is 1 + 1?
Model: [thinking]
This is a straightforward arithmetic question. 1 + 1 = 2.
Wait, let me double-check if this might be a trick question.
The user could be referring to binary arithmetic where 1 + 1 = 10,
or they could be asking about something metaphorical.
But in standard decimal arithmetic, 1 + 1 = 2.
[/thinking]

The answer is 2.

Reasoning models excel at complex math and logic problems where you need to see the step-by-step process. They provide significant value for multi-step programming challenges, analysis requiring multiple perspectives, and situations where you need to verify the reasoning process. Debugging and error-checking scenarios particularly benefit from the visible thinking process, as you can identify where the model’s logic breaks down.

Standard models work better for simple factual questions, creative writing and storytelling, and conversational chat where the reasoning process adds unnecessary overhead. Function calling and API integration typically perform better with standard models, as do any tasks requiring fast response times where the additional thinking tokens create unwanted latency.

Quick Start: Trying R1 Models

The fastest way to understand reasoning models is to try them. You can access DeepSeek-R1 through several platforms without setup:

Online Playgrounds

Together AI Playground:

  1. Visit together.ai/playground
  2. Select “DeepSeek-R1” from the model dropdown
  3. Try asking: “Solve this step by step: If I have 15 apples and give away 1/3, then buy 8 more, how many apples do I have?”

DeepInfra Console:

  1. Go to deepinfra.com
  2. Select DeepSeek-R1 model
  3. Test with coding problems or math questions

Example API Usage

import requests

response = requests.post(
    "https://api.together.xyz/v1/completions",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "model": "deepseek-ai/deepseek-r1",
        "prompt": "Write Python code to find the largest prime factor of 315",
        "max_tokens": 1500,
        "temperature": 0.6
    }
)

# The response includes a [thinking] section followed by the final answer
print(response.json()['choices'][0]['text'])

What You’ll Notice

  • Thinking tokens: Much longer responses showing step-by-step reasoning
  • Self-correction: Model catches and fixes its own mistakes
  • Multiple approaches: Explores different solution strategies
  • Higher latency: Takes longer due to reasoning process
  • Variable output length: Complex problems generate more thinking

Common Issues and Solutions

Implementing reasoning models presents unique challenges. Here are solutions to frequently encountered problems.

Model Not Showing Thinking Process

When reasoning models output answers without showing reasoning steps, the most effective solution is to force thinking tags directly in the prompt. Include explicit placeholder text like “[Please analyze this problem step by step here]” within thinking tags, followed by your actual problem. Alternatively, you can add explicit instructions stating “Before answering, you MUST show your reasoning process in thinking tags” to ensure the model follows the expected format.
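
In code, this can be as simple as a prompt template that pre-seeds the thinking tags on every request; the wording below is one illustrative template, not a required format.

def force_thinking(problem: str) -> str:
    # Pre-seed the thinking tags so the model fills them in before answering.
    return (
        "Before answering, you MUST show your reasoning process in thinking tags.\n\n"
        "[thinking]\n[Please analyze this problem step by step here]\n[/thinking]\n\n"
        f"Problem: {problem}"
    )

prompt = force_thinking("A train travels 120 km in 90 minutes. What is its average speed in km/h?")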

Inconsistent Output Format

Format consistency issues arise when models don’t follow the requested structure. The solution requires extreme explicitness in your formatting requirements. Specify exactly how you want the response structured, including thinking tags and answer sections, and explicitly state “Do not deviate from this format.” The more detailed and rigid your format specification, the more consistent the model’s adherence becomes.
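
A reusable preamble that spells out the exact structure helps keep responses parseable; the template below is one illustrative way to be that explicit.

FORMAT_SPEC = (
    "Respond in EXACTLY this structure. Do not deviate from this format.\n"
    "[thinking]\n"
    "<your step-by-step reasoning>\n"
    "[/thinking]\n"
    "Answer: <one concise final answer>\n"
)

def with_format(problem: str) -> str:
    # Prepend the rigid format specification to every request.
    return f"{FORMAT_SPEC}\nProblem: {problem}"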

Poor Reasoning Quality

Shallow or incorrect reasoning often stems from configuration issues rather than model limitations. Using the correct temperature setting of 0.6 significantly improves reasoning quality compared to 0.0 or 1.0. For critical applications, generate multiple responses and take a majority vote across 4-8 generations to improve reliability. Adding a verification step where you ask the model to double-check its work also enhances accuracy, and larger models (70B+) consistently demonstrate superior reasoning capabilities compared to smaller variants.
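
A small self-consistency wrapper around the API illustrates the majority-vote idea. The endpoint and the parsing convention (answer text after the closing [/thinking] tag) follow the examples earlier in this guide; the vote count and exact-match voting are simplifications to adapt for real use.

import requests
from collections import Counter

API_URL = "https://api.together.xyz/v1/completions"
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}

def one_answer(prompt: str) -> str:
    resp = requests.post(API_URL, headers=HEADERS, json={
        "model": "deepseek-ai/deepseek-r1",
        "prompt": prompt,
        "max_tokens": 1500,
        "temperature": 0.6,   # sampling diversity is what makes voting useful
    }, timeout=120)
    text = resp.json()["choices"][0]["text"]
    # Keep only what follows the reasoning block; fall back to the full text.
    return text.split("[/thinking]")[-1].strip()

def majority_vote(prompt: str, n: int = 5) -> str:
    # Exact-match voting is naive; normalize answers (e.g. strip formatting) for production.
    answers = [one_answer(prompt) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]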

High API Costs

Token consumption can become expensive with reasoning models due to extensive thinking processes. Distilled models provide 60% cost reduction while maintaining most reasoning capabilities. Optimize prompts by removing unnecessary context, implement response caching for repeated problems, and use a hybrid approach where V3 handles preprocessing while R1 focuses only on complex reasoning steps.
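
Response caching is straightforward to sketch: key on the exact prompt and generation parameters so repeated problems never hit the API twice. The in-memory dictionary below is illustrative; a production system would use Redis or a similar shared store.

import hashlib
import json

_cache: dict[str, str] = {}

def cached_reasoning(prompt: str, call_api, **params) -> str:
    # Hash the prompt plus generation parameters to form a stable cache key.
    key = hashlib.sha256(
        json.dumps({"prompt": prompt, **params}, sort_keys=True).encode()
    ).hexdigest()
    if key not in _cache:
        _cache[key] = call_api(prompt, **params)   # only pay for tokens on a cache miss
    return _cache[key]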

Slow Response Times

Response latency issues affect interactive applications where reasoning overhead becomes problematic. Smaller models (8B-14B) deliver 3-5x faster performance while maintaining reasonable reasoning quality. Local deployment eliminates network latency, setting reasonable max_tokens prevents excessive thinking, and streaming responses allow users to see the thinking process as it develops rather than waiting for completion.
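
Streaming keeps interactive UIs responsive while the model is still thinking. The sketch below assumes the endpoint supports the common OpenAI-style "stream": true server-sent-events convention; chunk field names may vary slightly by provider.

import json
import requests

resp = requests.post(
    "https://api.together.xyz/v1/completions",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "model": "deepseek-ai/deepseek-r1",
        "prompt": "Explain step by step why the sum of two odd numbers is even.",
        "max_tokens": 800,
        "temperature": 0.6,
        "stream": True,            # ask for server-sent events instead of one blob
    },
    stream=True,
    timeout=300,
)

for line in resp.iter_lines():
    if not line or not line.startswith(b"data: "):
        continue
    payload = line[len(b"data: "):]
    if payload == b"[DONE]":
        break
    chunk = json.loads(payload)
    # Print tokens (including the thinking) as they arrive.
    print(chunk["choices"][0].get("text", ""), end="", flush=True)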

Model Getting Stuck in Loops

Repetitive thinking without conclusions occurs when models enter circular reasoning patterns. Implement token limits to prevent infinite loops, add explicit stopping criteria that force conclusions after a reasonable number of reasoning steps, use timeouts to kill requests that exceed acceptable duration, and engineer prompts to request concise reasoning focused on reaching definitive answers.
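
Both safeguards, a hard token cap and a wall-clock timeout, fit in a small wrapper; the limits below are arbitrary starting points to tune for your workload.

import requests

def bounded_reasoning(prompt: str, max_tokens: int = 2000, timeout_s: int = 90) -> str:
    # max_tokens caps runaway thinking; the request timeout kills calls that hang.
    resp = requests.post(
        "https://api.together.xyz/v1/completions",
        headers={"Authorization": "Bearer YOUR_API_KEY"},
        json={
            "model": "deepseek-ai/deepseek-r1",
            "prompt": prompt + "\n\nKeep the reasoning concise and finish with a definitive answer.",
            "max_tokens": max_tokens,
            "temperature": 0.6,
        },
        timeout=timeout_s,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["text"]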

Integration with Existing Systems

Reasoning models often don’t integrate smoothly with existing LLM infrastructures designed for standard models. Parse thinking tokens separately from final answers to log reasoning processes independently, use wrapper functions that abstract reasoning complexity from your application logic, implement fallback mechanisms to standard V3 models when reasoning fails, and consider deploying reasoning capabilities as a dedicated microservice for complex problems while maintaining your existing architecture for standard tasks.
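
A thin wrapper can hide all of this from application code: split the [thinking] block from the final answer so reasoning can be logged separately, and fall back to a standard model if the reasoning call fails. The regex follows this guide's tag convention; call_r1 and call_v3 stand in for whatever client functions your system already uses.

import logging
import re

logging.basicConfig(level=logging.INFO)
THINKING_RE = re.compile(r"\[thinking\](.*?)\[/thinking\]", re.DOTALL)

def split_reasoning(raw: str) -> tuple[str, str]:
    # Returns (reasoning, answer); reasoning is empty if no thinking block is present.
    match = THINKING_RE.search(raw)
    reasoning = match.group(1).strip() if match else ""
    answer = THINKING_RE.sub("", raw).strip()
    return reasoning, answer

def answer_with_fallback(prompt: str, call_r1, call_v3) -> str:
    # call_r1 / call_v3 are your own client functions for the reasoning and standard models.
    try:
        reasoning, answer = split_reasoning(call_r1(prompt))
        logging.info("reasoning trace: %s", reasoning)   # log the thinking separately
        return answer
    except Exception:
        return call_v3(prompt)                           # degrade gracefully to the standard model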