#1 2025-02-01 12:56:04

LeilaniPan

DeepSeek R-1 Model Overview and how it Ranks against OpenAI's O1

DeepSeek is a Chinese AI company "dedicated to making AGI a reality" that open-sources all of its models. Founded in 2023, it has been making waves over the past month or so, and especially this past week, with the release of its two latest reasoning models: DeepSeek-R1-Zero and the more advanced DeepSeek-R1, also referred to as DeepSeek Reasoner.


They have released not only the models but also the code and evaluation prompts for public use, along with a detailed paper describing their approach.

Aside from producing two highly performant models that are on par with OpenAI's o1, the paper contains a great deal of valuable information about reinforcement learning, chain-of-thought reasoning, prompt engineering with reasoning models, and more.


We'll begin by focusing on the training process of DeepSeek-R1-Zero, which uniquely relied solely on reinforcement learning instead of traditional supervised learning. We'll then move on to DeepSeek-R1, how its reasoning works, and some prompt engineering best practices for reasoning models.


Hey everybody, Dan here, co-founder of PromptHub. Today, we're diving into DeepSeek's latest model release and comparing it with OpenAI's reasoning models, specifically the o1 and o1-mini models. We'll explore their training process, reasoning capabilities, and some key insights into prompt engineering for reasoning models.


DeepSeek is a Chinese AI company committed to open-source development. Their recent release, the R1 reasoning model, is groundbreaking due to its open-source nature and innovative training methods. This includes open access to the models, prompts, and research papers.


Released on January 20th, DeepSeek's R1 achieved impressive performance on various benchmarks, rivaling OpenAI's o1 models. Notably, they also released a precursor model, R1-Zero, which serves as the foundation for R1.


Training Process: R1-Zero to R1


R1-Zero: This model was trained exclusively using reinforcement learning without supervised fine-tuning, making it the first open-source model to achieve high performance through this approach. Training included:


- Rewarding correct answers on deterministic tasks (e.g., math problems).
- Encouraging structured reasoning outputs using templates with <think> and <answer> tags.


Through thousands of iterations, R1-Zero developed longer reasoning chains, self-verification, and even reflective behavior. For instance, during training, the model demonstrated "aha" moments and self-correction behaviors, which are rare in standard LLMs.


R1: Building on R1-Zero, R1 added several improvements:


- Curated datasets with long chain-of-thought examples.
- Incorporation of R1-Zero-generated reasoning chains.
- Human preference alignment for more refined responses.
- Distillation into smaller models (Llama 3.1 and 3.3 at various sizes).


Performance Benchmarks


DeepSeek's R1 model performs on par with OpenAI's o1 models across many reasoning benchmarks:


Reasoning and math tasks: R1 rivals or exceeds o1 models in accuracy and depth of reasoning.
Coding tasks: o1 models generally perform better on LiveCodeBench and CodeForces tasks.
SimpleQA: o1 outpaces R1 on simple factual QA tasks (e.g., 47% vs. 30% accuracy).


One notable finding is that longer reasoning chains generally improve performance. This aligns with insights from Microsoft's MedPrompt framework and OpenAI's observations on test-time compute and reasoning depth.


Challenges and Observations


Despite its strengths, R1 has some limitations:


- Mixing English and Chinese in responses, due to the absence of supervised fine-tuning.
- Less polished responses compared to chat models like OpenAI's GPT.


These issues were addressed during R1's refinement process, which included supervised fine-tuning and human feedback.


Prompt Engineering Insights


A fascinating takeaway from DeepSeek's research is how few-shot prompting degraded R1's performance compared to zero-shot or concise, tailored prompts. This aligns with findings from the MedPrompt paper and OpenAI's recommendation to limit context for reasoning models. Overcomplicating the input can overwhelm the model and reduce accuracy.


DeepSeek's R1 is a substantial step forward for open-source reasoning models, demonstrating capabilities that match OpenAI's o1. It's an exciting time to experiment with these models and their chat interface, which is free to use.


If you have questions or want to learn more, check out the resources linked below. See you next time!

Training DeepSeek-R1-Zero: A reinforcement learning-only approach


DeepSeek-R1-Zero stands out from most other state-of-the-art models because it was trained using only reinforcement learning (RL), with no supervised fine-tuning (SFT). This challenges the current conventional approach and opens new opportunities to train reasoning models with less human intervention and effort.


DeepSeek-R1-Zero is the first open-source model to validate that advanced reasoning capabilities can be developed purely through RL.


Without pre-labeled datasets, the model learns through trial and error, refining its behavior, parameters, and weights based solely on feedback from the solutions it generates.


DeepSeek-R1-Zero is the base model for DeepSeek-R1.


The RL process for DeepSeek-R1-Zero


The training process for DeepSeek-R1-Zero involved presenting the model with various reasoning tasks, ranging from math problems to abstract logic challenges. The model generated outputs and was evaluated based on its performance.


DeepSeek-R1-Zero received feedback through a reward system that helped guide its learning process:


Accuracy rewards: Evaluate whether the output is correct. Used when there are deterministic results (e.g., math problems).

Format rewards: Encourage the model to structure its reasoning within <think> and <answer> tags.
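
To make these two signals concrete, here is a minimal sketch of what rule-based accuracy and format rewards could look like in Python. The paper doesn't publish the reward code, so the function names and the 0/1 scoring used here are assumptions.

```python
import re

def accuracy_reward(model_output: str, reference_answer: str) -> float:
    """Reward 1.0 if the text inside <answer> tags exactly matches the
    reference answer (suitable for deterministic tasks like math), else 0.0."""
    match = re.search(r"<answer>(.*?)</answer>", model_output, re.DOTALL)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == reference_answer.strip() else 0.0

def format_reward(model_output: str) -> float:
    """Reward 1.0 if the output follows the <think>...</think><answer>...</answer>
    structure required by the training template, else 0.0."""
    pattern = r"^\s*<think>.*?</think>\s*<answer>.*?</answer>\s*$"
    return 1.0 if re.match(pattern, model_output, re.DOTALL) else 0.0

# Example: a well-formed response to "What is 2 + 2?"
output = "<think>2 plus 2 equals 4.</think><answer>4</answer>"
print(accuracy_reward(output, "4") + format_reward(output))  # 2.0
```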


Training prompt template


To train DeepSeek-R1-Zero to produce structured chain-of-thought sequences, the researchers used the following training prompt template, replacing "prompt" with the reasoning question. You can access it in PromptHub here.
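
The block below is a close paraphrase of that template, wrapped in a small Python helper for illustration; treat the exact wording as approximate rather than the verbatim prompt.

```python
# Close paraphrase of the R1-Zero training template described in the paper;
# the exact wording here is approximate.
TRAINING_TEMPLATE = (
    "A conversation between User and Assistant. The user asks a question, "
    "and the Assistant solves it. The Assistant first thinks about the "
    "reasoning process in the mind and then provides the user with the answer. "
    "The reasoning process and answer are enclosed within <think> </think> and "
    "<answer> </answer> tags, respectively, i.e., "
    "<think> reasoning process here </think> <answer> answer here </answer>. "
    "User: {prompt}. Assistant:"
)

def build_training_prompt(question: str) -> str:
    """Insert the reasoning question into the template."""
    return TRAINING_TEMPLATE.format(prompt=question)

print(build_training_prompt("What is the derivative of x^2?"))
```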


This template prompts the model to explicitly lay out its thought process within <think> tags before delivering the final answer in <answer> tags.


The power of RL in reasoning


With this training process, DeepSeek-R1-Zero began to produce sophisticated reasoning chains.


Through thousands of training steps, DeepSeek-R1-Zero evolved to solve increasingly complex problems. It learned to:


- Generate long reasoning chains that enabled deeper and more structured problem-solving.

- Perform self-verification to cross-check its own responses (more on this later).

- Correct its own mistakes, showcasing emerging self-reflective behaviors.




DeepSeek R1-Zero performance


While DeepSeek-R1-Zero is mostly a precursor to DeepSeek-R1, it still achieved high performance on numerous benchmarks. Let's dive into some of the experiments they ran.


Accuracy improvements during training


- Pass@1 accuracy started at 15.6% and improved to 71.0% by the end of training, comparable to OpenAI's o1-0912.

- The red solid line represents performance with majority voting (comparable to ensembling and self-consistency techniques), which increased accuracy further to 86.7%, surpassing o1-0912.
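
For context, here is a minimal sketch of how Pass@1 and majority-voting (cons@k) scores can be computed over sampled answers; the function names and toy data are illustrative, not the paper's evaluation code.

```python
from collections import Counter

def pass_at_1(sampled_answers: list[str], reference: str) -> float:
    """Fraction of sampled answers that are correct; averaged over questions,
    this approximates the Pass@1 metric reported in the paper."""
    return sum(answer == reference for answer in sampled_answers) / len(sampled_answers)

def majority_vote_correct(sampled_answers: list[str], reference: str) -> bool:
    """cons@k-style scoring: take the most common answer among k samples
    and check it against the reference (self-consistency / majority voting)."""
    most_common_answer, _ = Counter(sampled_answers).most_common(1)[0]
    return most_common_answer == reference

samples = ["4", "4", "5", "4"]                # hypothetical answers to one question
print(pass_at_1(samples, "4"))                 # 0.75
print(majority_vote_correct(samples, "4"))     # True
```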


Next, we'll look at a table comparing DeepSeek-R1-Zero's performance across multiple reasoning datasets against OpenAI's reasoning models.


- AIME 2024: 71.0% Pass@1, slightly below o1-0912 but above o1-mini; 86.7% cons@64, beating both o1-0912 and o1-mini.

- MATH-500: Achieved 95.9%, beating both o1-0912 and o1-mini.

- GPQA Diamond: Outperformed o1-mini with a score of 73.3%.

- Performed much worse on coding tasks (CodeForces and LiveCodeBench).




Next, we'll look at how response length increased throughout the RL training process.


This graph shows the length of the model's responses as training progresses. Each "step" represents one cycle of the model's learning process, where feedback is provided based on the output's performance, evaluated using the prompt template discussed earlier.


For each question (representing one step), 16 responses were sampled, and the average accuracy was calculated to ensure stable evaluation.


As training advances, the model produces longer reasoning chains, enabling it to solve increasingly complex reasoning tasks by leveraging more test-time compute.


While longer chains do not always guarantee better results, they generally correlate with improved performance, a trend also observed in the MedPrompt paper (read more about it here) and in the original o1 paper from OpenAI.


Aha moment and self-verification


One of the coolest aspects of DeepSeek-R1-Zero's development (which also applies to the flagship R-1 model) is just how good the model became at reasoning. Advanced reasoning behaviors emerged that were not explicitly programmed but developed through its reinforcement learning process.


Over thousands of training steps, the model began to self-correct, revisit flawed logic, and verify its own solutions, all within its chain of thought.


An example of this noted in the paper, described as the "aha moment," is shown below in red text.


In this instance, the model literally stated, "That's an aha moment." Through DeepSeek's chat feature (their version of ChatGPT), this type of reasoning typically emerges with phrases like "Wait a minute" or "Wait, but ..."


Limitations and challenges of DeepSeek-R1-Zero

While DeepSeek-R1-Zero was able to perform at a high level, the model had some drawbacks.


Language mixing and coherence issues: The model occasionally produced responses that mixed languages (Chinese and English).

Reinforcement learning trade-offs: The lack of supervised fine-tuning (SFT) meant that the model lacked the refinement needed for fully polished, human-aligned outputs.

DeepSeek-R1 was developed to address these problems!


What is DeepSeek-R1?


DeepSeek-R1 is an open-source reasoning model from the Chinese AI lab DeepSeek. It builds on DeepSeek-R1-Zero, which was trained entirely with reinforcement learning. Unlike its predecessor, DeepSeek-R1 incorporates supervised fine-tuning, making it more refined. Notably, it outperforms OpenAI's o1 model on a number of benchmarks, more on that later.


What are the main differences between DeepSeek-R1 and DeepSeek-R1-Zero?


DeepSeek-R1 builds on the foundation of DeepSeek-R1-Zero, which serves as the base model. The two differ in their training approaches and overall performance.


1. Training approach


DeepSeek-R1-Zero: Trained entirely with reinforcement learning (RL) and no supervised fine-tuning (SFT).

DeepSeek-R1: Uses a multi-stage training pipeline that starts with supervised fine-tuning (SFT), followed by the same reinforcement learning process that DeepSeek-R1-Zero went through. SFT helps improve coherence and readability.


2. Readability & Coherence


DeepSeek-R1-Zero: Struggled with language mixing (English and Chinese) and readability issues. Its reasoning was strong, but its outputs were less polished.

DeepSeek-R1: Addressed these issues with cold-start fine-tuning, making responses clearer and more structured.


3. Performance


DeepSeek-R1-Zero: Still a very strong reasoning model, sometimes beating OpenAI's o1, but its language mixing issues greatly reduced its usability.

DeepSeek-R1: Outperforms R1-Zero and OpenAI's o1 on most reasoning benchmarks, and its responses are much more polished.


Simply put, DeepSeek-R1-Zero was a proof of concept, while DeepSeek-R1 is the fully optimized version.


How DeepSeek-R1 was trained


To tackle the readability and coherence issues of R1-Zero, the researchers added a cold-start fine-tuning phase and a multi-stage training pipeline when building DeepSeek-R1:


Cold-Start Fine-Tuning:


- Researchers prepared a high-quality dataset of long chain-of-thought examples for initial supervised fine-tuning (SFT). This data was gathered using:

- Few-shot prompting with detailed CoT examples.

- Post-processed outputs from DeepSeek-R1-Zero, refined by human annotators.
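
As a rough illustration of that second source, here is a minimal sketch of filtering R1-Zero-style generations into a cold-start SFT dataset; the filtering heuristics and field names are assumptions, since the actual curation relied heavily on human annotators.

```python
import re

def is_usable_cold_start_example(output: str) -> bool:
    """Keep only generations that follow the <think>/<answer> format and look
    readable (a crude stand-in for the human review described in the paper)."""
    well_formed = re.match(
        r"^\s*<think>.*?</think>\s*<answer>.*?</answer>\s*$", output, re.DOTALL
    )
    # Crude readability heuristic: ASCII-only, so mixed Chinese/English
    # chains get filtered out. Real curation was done by people.
    return bool(well_formed) and output.isascii()

raw_generations = [
    {"question": "What is 2 + 2?",
     "output": "<think>2 plus 2 is 4.</think><answer>4</answer>"},
    {"question": "What is 3 * 3?",
     "output": "Answer: 9"},  # malformed: dropped
]
cold_start_dataset = [ex for ex in raw_generations
                      if is_usable_cold_start_example(ex["output"])]
print(len(cold_start_dataset))  # 1
```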




Reinforcement Learning:


DeepSeek-R1 went through the same RL process as DeepSeek-R1-Zero to further refine its reasoning abilities.


Human Preference Alignment:


- A secondary RL phase improved the model's helpfulness and harmlessness, ensuring better alignment with user needs.


Distillation to Smaller Models:


- DeepSeek-R1's reasoning capabilities were distilled into smaller, efficient models like Qwen, Llama-3.1-8B, and Llama-3.3-70B-Instruct.
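
In practice, this distillation step amounts to supervised fine-tuning of a smaller student model on reasoning traces generated by R1. Here is a minimal sketch using Hugging Face Transformers; the trace file, hyperparameters, and the choice of Llama-3.1-8B as the student are illustrative assumptions, not DeepSeek's actual recipe.

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

student_name = "meta-llama/Llama-3.1-8B"  # illustrative student choice
# Hypothetical JSONL file: each record's "text" field holds a prompt plus
# a full <think>...</think><answer>...</answer> trace generated by R1.
dataset = load_dataset("json", data_files="r1_reasoning_traces.jsonl")["train"]

tokenizer = AutoTokenizer.from_pretrained(student_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(student_name)

def tokenize(example):
    return tokenizer(example["text"], truncation=True, max_length=4096)

tokenized = dataset.map(tokenize, remove_columns=dataset.column_names)
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)  # causal-LM labels

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="r1-distill-llama-8b",
                           per_device_train_batch_size=1,
                           num_train_epochs=1),
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()
```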



DeepSeek R-1 benchmark performance


The researchers tested DeepSeek R-1 across a variety of benchmarks and against leading models: o1, o1-mini, GPT-4o, and Claude 3.5 Sonnet.


The benchmarks were broken down into several categories, shown in the table below: English, Code, Math, and Chinese.


Setup


The following settings were used across all models:


Maximum generation length: 32,768 tokens.

Sampling configuration:

- Temperature: 0.6.

- Top-p: 0.95.
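
For reference, here is roughly what that evaluation configuration looks like with Hugging Face Transformers' generate(); the checkpoint name and prompt are illustrative.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"  # illustrative checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Solve step by step: what is the sum of the first 100 positive integers?"
inputs = tokenizer(prompt, return_tensors="pt")

# Evaluation settings reported in the paper: temperature 0.6, top-p 0.95,
# and a 32,768-token generation budget.
outputs = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.6,
    top_p=0.95,
    max_new_tokens=32768,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```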




- DeepSeek R1 outperformed o1, Claude 3.5 Sonnet, and other models in the majority of reasoning benchmarks.

- o1 was the best-performing model in four out of the five coding-related benchmarks.

- DeepSeek performed well on creative and long-context tasks, like AlpacaEval 2.0 and ArenaHard, outperforming all other models.




Prompt engineering with reasoning models


My favorite part of the paper was the researchers' observation about DeepSeek-R1's sensitivity to prompts:

This is another data point that lines up with insights from our Prompt Engineering with Reasoning Models Guide, which references Microsoft's research on their MedPrompt framework. In their study with OpenAI's o1-preview model, they found that overwhelming reasoning models with few-shot context degraded performance, a sharp contrast to non-reasoning models.


The key takeaway? Zero-shot prompting with clear and concise instructions appears to work best with reasoning models.
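
To make that concrete, here is a hedged sketch contrasting the two prompting styles on an invented task; the wording is illustrative, not taken from the paper.

```python
# Zero-shot: a clear, concise instruction — the style that appears to work
# best with reasoning models like DeepSeek-R1.
zero_shot_prompt = (
    "Solve the problem and give only the final number as your answer: "
    "A train travels 120 km in 1.5 hours. What is its average speed in km/h?"
)

# Few-shot: the same task padded with worked examples. For reasoning models,
# this extra context was observed to degrade performance rather than help.
few_shot_prompt = (
    "Q: A car travels 60 km in 1 hour. What is its average speed?\n"
    "A: 60 km/h\n\n"
    "Q: A cyclist covers 45 km in 3 hours. What is their average speed?\n"
    "A: 15 km/h\n\n"
    "Q: A train travels 120 km in 1.5 hours. What is its average speed in km/h?\n"
    "A:"
)
```

In practice, leading with the bare instruction and letting the model generate its own chain of thought tends to work better than packing the prompt with worked examples.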


