DeepSeek is a Chinese AI company "dedicated to making AGI a reality" and open-sourcing all its models. They began in 2023, but have been making waves over the past month or two, and especially this past week, with the release of their two newest reasoning models: DeepSeek-R1-Zero and the more advanced DeepSeek-R1, also called DeepSeek Reasoner.
They have released not only the models but also the code and evaluation prompts for public use, along with a detailed paper describing their approach.
Aside from producing two highly performant models that are on par with OpenAI's o1 model, the paper has a lot of valuable information about reinforcement learning, chain of thought reasoning, prompt engineering with reasoning models, and more.
We'll begin by focusing on the training process of DeepSeek-R1-Zero, which uniquely relied exclusively on reinforcement learning rather than traditional supervised learning. We'll then move on to DeepSeek-R1, how its reasoning works, and some prompt engineering best practices for reasoning models.
Hey everybody, Dan here, co-founder of PromptHub. Today, we're diving into DeepSeek's latest model release and comparing it with OpenAI's reasoning models, particularly the o1 and o1-mini models. We'll explore their training process, reasoning abilities, and some key insights into prompt engineering for reasoning models.
DeepSeek is a Chinese-based AI company dedicated to open-source development. Their latest release, the R1 reasoning model, is groundbreaking due to its open-source nature and innovative training methods. This includes open access to the models, prompts, and research papers.
Released on January 20th, DeepSeek's R1 achieved impressive performance on numerous benchmarks, rivaling OpenAI's o1 models. Notably, they also introduced a precursor model, R1-Zero, which serves as the foundation for R1.
Training Process: R1-Zero to R1
R1-Zero: This model was trained exclusively using reinforcement learning without supervised fine-tuning, making it the first open-source model to achieve high performance through this approach. Training included:
- Rewarding proper answers in deterministic tasks (e.g., math problems).
- Encouraging structured reasoning outputs using prompt templates with <think> and <answer> tags.
Through thousands of iterations, R1-Zero developed longer reasoning chains, self-verification, and even reflective behaviors. For example, during training, the model demonstrated "aha" moments and self-correction behaviors, which are rare in standard LLMs.
R1: Building on R1-Zero, R1 added several improvements:
- Curated datasets with long Chain of Thought examples.
- Incorporation of R1-Zero-generated reasoning chains.
- Human preference alignment for polished responses.
- Distillation into smaller models (LLaMA 3.1 and 3.3 at different sizes).
Performance Benchmarks
DeepSeek's R1 model performs on par with OpenAI's o1 models across many reasoning benchmarks:
Reasoning and Math Tasks: R1 rivals or outperforms o1 models in accuracy and depth of reasoning.
Coding Tasks: o1 models generally perform better on LiveCodeBench and CodeForces tasks.
SimpleQA: R1 often outperforms o1 in structured QA tasks (e.g., 47% accuracy vs. 30%).
One notable finding is that longer reasoning chains generally improve performance. This aligns with insights from Microsoft's MedPrompt framework and OpenAI's observations on test-time compute and reasoning depth.
Challenges and Observations
Despite its strengths, R1-Zero has some limitations:
- Mixing English and Chinese responses due to a lack of supervised fine-tuning.
- Less polished responses compared to chat models like OpenAI's GPT.
These issues were addressed during R1's refinement process, which included supervised fine-tuning and human feedback.
Prompt Engineering Insights
An interesting takeaway from DeepSeek's research is how few-shot prompting degraded R1's performance compared to zero-shot or concise tailored prompts. This aligns with findings from the MedPrompt paper and OpenAI's recommendations to limit context in reasoning models. Overcomplicating the input can overwhelm the model and lower accuracy.
DeepSeek's R1 is a significant step forward for open-source reasoning models, demonstrating capabilities that rival OpenAI's o1. It's an exciting time to experiment with these models and their chat interface, which is free to use.
If you have questions or want to learn more, check out the resources linked below. See you next time!
Training DeepSeek-R1-Zero: A reinforcement-learning-only approach
DeepSeek-R1-Zero stands out from most other state-of-the-art models because it was trained using only reinforcement learning (RL), with no supervised fine-tuning (SFT). This challenges the current standard approach and opens up new opportunities to train reasoning models with less human intervention and effort.
DeepSeek-R1-Zero is the first open-source model to demonstrate that advanced reasoning capabilities can be developed purely through RL.
Without pre-labeled datasets, the model learns through trial and error, refining its behavior, parameters, and weights based solely on feedback from the solutions it generates.
DeepSeek-R1-Zero is the base model for DeepSeek-R1.
The RL process for DeepSeek-R1-Zero
The training process for DeepSeek-R1-Zero involved presenting the model with various reasoning tasks, ranging from math problems to abstract reasoning challenges. The model generated outputs and was evaluated based on its performance.
DeepSeek-R1-Zero received feedback through a reward system that helped guide its learning process:
Accuracy rewards: Evaluate whether the output is correct. Used when there are deterministic results (e.g., math problems).
Format rewards: Encouraged the model to structure its reasoning within <think> and </think> tags (a rough sketch of both reward signals follows below).
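To make that concrete, here's a minimal sketch of what those two reward signals could look like for a deterministic math task. The function names, scoring values, and the way they're combined are illustrative assumptions on my part; the paper describes rule-based rewards but doesn't publish reward code.

```python
import re

def accuracy_reward(model_output: str, expected_answer: str) -> float:
    """Reward 1.0 if the final answer inside <answer> tags matches the
    known-correct result for a deterministic task (e.g., a math problem)."""
    match = re.search(r"<answer>(.*?)</answer>", model_output, re.DOTALL)
    if match is None:
        return 0.0
    predicted = match.group(1).strip()
    return 1.0 if predicted == expected_answer.strip() else 0.0

def format_reward(model_output: str) -> float:
    """Reward outputs that wrap their reasoning in <think>...</think>
    before giving a final <answer>...</answer>."""
    has_think = re.search(r"<think>.*?</think>", model_output, re.DOTALL) is not None
    has_answer = re.search(r"<answer>.*?</answer>", model_output, re.DOTALL) is not None
    return 1.0 if (has_think and has_answer) else 0.0

def total_reward(model_output: str, expected_answer: str) -> float:
    # Illustrative combination; the actual weighting is not specified here.
    return accuracy_reward(model_output, expected_answer) + format_reward(model_output)
```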
Training prompt template
To train DeepSeek-R1-Zero to produce structured chain of thought sequences, the researchers used the following training prompt template, replacing prompt with the reasoning question. You can access it in PromptHub here.
This template prompted the model to explicitly lay out its thought process within <think> tags before delivering the final answer in <answer> tags.
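For reference, here's that template expressed as a Python string. The wording is a close paraphrase reconstructed from the paper rather than a verbatim copy, so use the PromptHub link above for the exact text.

```python
# Approximate reconstruction of the R1-Zero training template.
# {prompt} is replaced with the reasoning question.
R1_ZERO_TEMPLATE = (
    "A conversation between User and Assistant. The user asks a question, "
    "and the Assistant solves it. The Assistant first thinks about the "
    "reasoning process in the mind and then provides the user with the answer. "
    "The reasoning process and answer are enclosed within <think> </think> and "
    "<answer> </answer> tags, respectively, i.e., <think> reasoning process here "
    "</think> <answer> answer here </answer>. User: {prompt}. Assistant:"
)

filled = R1_ZERO_TEMPLATE.format(prompt="What is 17 * 24?")
print(filled)
```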
The power of RL in reasoning
With this training process, DeepSeek-R1-Zero began to produce sophisticated reasoning chains.
Through thousands of training steps, DeepSeek-R1-Zero evolved to solve increasingly complex problems. It learned to:
- Generate long reasoning chains that allow deeper and more structured problem-solving.
- Perform self-verification to cross-check its own answers (more on this later).
- Correct its own mistakes, showcasing emergent self-reflective behaviors.
DeepSeek-R1-Zero performance
While DeepSeek-R1-Zero is primarily a precursor to DeepSeek-R1, it still achieved high performance on several benchmarks. Let's dive into some of the experiments that were run.
Accuracy improvements during training
- Pass@1 accuracy started at 15.6% and by the end of training it improved to 71.0%, comparable to OpenAI's o1-0912 model.
- The solid red line represents performance with majority voting (similar to ensembling and self-consistency techniques), which pushed accuracy even higher to 86.7%, surpassing o1-0912 (a small sketch of majority voting follows below).
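Majority voting here just means sampling many completions per question and keeping the most common final answer. A minimal sketch, assuming the final answers have already been extracted from the <answer> tags:

```python
from collections import Counter

def majority_vote(sampled_answers: list[str]) -> str:
    """Return the most common final answer among k sampled completions
    (the idea behind cons@64 / self-consistency style evaluation)."""
    normalized = [a.strip() for a in sampled_answers]
    return Counter(normalized).most_common(1)[0][0]

# Example: 64 sampled answers for one AIME-style problem.
samples = ["42"] * 40 + ["41"] * 15 + ["7"] * 9
print(majority_vote(samples))  # -> "42"
```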
Next, we'll look at a table comparing DeepSeek-R1-Zero's performance across various reasoning datasets against OpenAI's reasoning models.
- AIME 2024: 71.0% Pass@1, slightly below o1-0912 but above o1-mini; 86.7% cons@64, beating both o1-0912 and o1-mini.
- MATH-500: Achieved 95.9%, beating both o1-0912 and o1-mini.
- GPQA Diamond: Outperformed o1-mini with a score of 73.3%.
- Performed much worse on coding tasks (CodeForces and LiveCodeBench).
Next, we'll look at how response length increased throughout the RL training process.
This chart shows the length of the model's responses as training progresses. Each "step" represents one cycle of the model's learning process, where feedback is provided based on the output's performance, evaluated using the prompt template discussed earlier.
For each question (representing one step), 16 responses were sampled, and the average accuracy was calculated to ensure stable evaluation.
As training progresses, the model generates longer reasoning chains, allowing it to solve increasingly complex reasoning tasks by leveraging more test-time compute.
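In other words, each evaluation step samples several responses per question and averages the results. Here's a rough sketch of that bookkeeping; the 16-sample count comes from the paper, while the `generate` and `is_correct` callables are stand-ins for whatever model-sampling and answer-checking code you have:

```python
from typing import Callable

def evaluate_step(
    generate: Callable[[str], str],      # stand-in: sample one model response
    is_correct: Callable[[str], bool],   # stand-in: check the final answer
    question: str,
    k: int = 16,
) -> tuple[float, float]:
    """Sample k responses for one question and report average accuracy and
    average response length, mirroring the per-step evaluation described above."""
    responses = [generate(question) for _ in range(k)]
    accuracy = sum(is_correct(r) for r in responses) / k
    avg_length = sum(len(r.split()) for r in responses) / k
    return accuracy, avg_length
```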
While longer chains don't always guarantee better outcomes, they generally correlate with improved performance, a trend also observed in the MedPrompt paper (learn more about it here) and in the original o1 paper from OpenAI.
Aha moment and self-verification
One of the coolest aspects of DeepSeek-R1-Zero's development (which also applies to the flagship R1 model) is just how good the model became at reasoning. Advanced reasoning behaviors emerged that were not explicitly programmed but developed through its reinforcement learning process.
Over thousands of training steps, the model began to self-correct, reevaluate flawed reasoning, and verify its own solutions, all within its chain of thought.
An example of this, noted in the paper and referred to as the "aha moment," is shown below in red text.
In this instance, the model literally said, "That's an aha moment." In DeepSeek's chat interface (their version of ChatGPT), this kind of reasoning usually surfaces with phrases like "Wait a minute" or "Wait, but ..."
Limitations and challenges in DeepSeek-R1-Zero
While DeepSeek-R1-Zero was able to perform at a high level, the model had some drawbacks.
Language mixing and coherence issues: The model occasionally produced responses that mixed languages (Chinese and English).
Reinforcement learning trade-offs: The lack of supervised fine-tuning (SFT) meant that the model lacked the refinement needed for fully polished, human-aligned outputs.
DeepSeek-R1 was developed to address these problems!
What is DeepSeek-R1?
DeepSeek-R1 is an open-source reasoning model from the Chinese AI lab DeepSeek. It builds on DeepSeek-R1-Zero, which was trained entirely with reinforcement learning. Unlike its predecessor, DeepSeek-R1 incorporates supervised fine-tuning, making it more refined. Notably, it outperforms OpenAI's o1 model on several benchmarks (more on that later).
What are the main differences between DeepSeek-R1 and DeepSeek-R1-Zero?
DeepSeek-R1 builds on the foundation of DeepSeek-R1-Zero, which serves as the base model. The two differ in their training methods and overall performance.
1. Training approach
DeepSeek-R1-Zero: Trained entirely with reinforcement learning (RL) and no supervised fine-tuning (SFT).
DeepSeek-R1: Uses a multi-stage training pipeline that starts with supervised fine-tuning (SFT), followed by the same reinforcement learning process that DeepSeek-R1-Zero went through. SFT helps improve coherence and readability.
2. Readability & Coherence
DeepSeek-R1-Zero: Struggled with language mixing (English and Chinese) and readability issues. Its reasoning was strong, but its outputs were less polished.
DeepSeek-R1: Addressed these problems with cold-start fine-tuning, making responses clearer and more structured.
3. Performance
DeepSeek-R1-Zero: Still a very strong reasoning model, often beating OpenAI's o1, but its language mixing issues reduced usability considerably.
DeepSeek-R1: Outperforms R1-Zero and OpenAI's o1 on most reasoning benchmarks, and its responses are much more polished.
In short, DeepSeek-R1-Zero was a proof of concept, while DeepSeek-R1 is the fully refined version.
How DeepSeek-R1 was trained
To tackle the readability and coherence problems of R1-Zero, the researchers incorporated a cold-start fine-tuning phase and a multi-stage training pipeline when building DeepSeek-R1:
Cold-Start Fine-Tuning:
- Researchers prepared a high-quality dataset of long chain of thought examples for initial supervised fine-tuning (SFT). This data was collected using:
- Few-shot prompting with detailed CoT examples.
- Post-processed outputs from DeepSeek-R1-Zero, refined by human annotators.
Reinforcement Learning:
- DeepSeek-R1 went through the same RL process as DeepSeek-R1-Zero to further refine its reasoning abilities.
Human Preference Alignment:
- A secondary RL stage improved the model's helpfulness and harmlessness, ensuring better alignment with user needs.
Distillation to Smaller Models:
- DeepSeek-R1's reasoning abilities were distilled into smaller, more efficient models, including Qwen and Llama variants (e.g., Llama-3.1-8B and Llama-3.3-70B-Instruct); a rough sketch of that process is below.
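Distillation here is ordinary supervised fine-tuning of a smaller student model on outputs generated by DeepSeek-R1, not a special procedure. Here's a minimal sketch of turning R1 generations into SFT examples; the record format is an assumption for illustration:

```python
def to_sft_example(prompt: str, r1_reasoning: str, r1_answer: str) -> dict:
    """Turn one DeepSeek-R1 generation into a supervised fine-tuning example
    for a smaller student model (e.g., a Llama or Qwen base model)."""
    target = f"<think>{r1_reasoning}</think><answer>{r1_answer}</answer>"
    return {"prompt": prompt, "completion": target}

# The resulting (prompt, completion) pairs are then used for ordinary SFT of the
# student model with whatever fine-tuning stack you prefer; per the paper, the
# distilled models get SFT only, with no additional RL stage of their own.
```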
DeepSeek-R1 benchmark performance
The researchers tested DeepSeek-R1 across a range of benchmarks and against leading models: o1, o1-mini, GPT-4o, and Claude 3.5 Sonnet.
The benchmarks were broken down into several categories, shown in the table below: English, Code, Math, and Chinese.
Setup
The following parameters were applied across all models:
- Maximum generation length: 32,768 tokens.
- Sampling configuration: temperature 0.6, top-p 0.95 (an example API call with these settings is shown below).
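For reference, here's roughly how those settings map onto an OpenAI-compatible chat completion call. The endpoint URL, API key, and model identifier below are placeholders, and whether a given hosted API actually honors these sampling parameters is up to the provider:

```python
from openai import OpenAI

# Hypothetical client setup; substitute the base URL, API key, and model name
# for whichever OpenAI-compatible endpoint is serving the model.
client = OpenAI(base_url="https://example-endpoint/v1", api_key="YOUR_KEY")

response = client.chat.completions.create(
    model="deepseek-reasoner",  # placeholder model identifier
    messages=[{"role": "user", "content": "Prove that sqrt(2) is irrational."}],
    temperature=0.6,            # matches the paper's evaluation setup
    top_p=0.95,                 # matches the paper's evaluation setup
    max_tokens=32768,           # maximum generation length used in the evals
)
print(response.choices[0].message.content)
```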
- DeepSeek-R1 surpassed o1, Claude 3.5 Sonnet, and other models in the majority of reasoning benchmarks.
- o1 was the best-performing model in 4 out of the 5 coding-related benchmarks.
- DeepSeek-R1 performed well on creative and long-context tasks, like AlpacaEval 2.0 and ArenaHard, outperforming all other models.
Prompt engineering with reasoning models
My favorite part of the paper was the researchers' observation about DeepSeek-R1's sensitivity to prompts:
This is another datapoint that aligns with insights from our Prompt Engineering with Reasoning Models Guide, which references Microsoft's research on their MedPrompt framework. In their study with OpenAI's o1-preview model, they found that overwhelming reasoning models with few-shot context degraded performance, a sharp contrast to non-reasoning models.
The key takeaway? Zero-shot prompting with clear and concise instructions seems to work best when using reasoning models.
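As a concrete illustration of that takeaway (the prompts below are my own examples, not from the paper), compare a concise zero-shot prompt with a few-shot version of the same task:

```python
# Preferred with reasoning models: concise, zero-shot, explicit about the output format.
zero_shot_prompt = (
    "Solve the following problem. Show your reasoning, then give the final answer "
    "on its own line prefixed with 'Answer:'.\n\n"
    "Problem: A train travels 120 km in 1.5 hours. What is its average speed in km/h?"
)

# Often counterproductive with reasoning models: padding the prompt with worked
# few-shot examples, which the DeepSeek-R1 and MedPrompt findings suggest can
# degrade performance rather than help.
few_shot_prompt = (
    "Q: What is 12 * 11? A: 132\n"
    "Q: A car travels 60 km in 0.75 hours. Average speed? A: 80 km/h\n"
    "Q: A train travels 120 km in 1.5 hours. What is its average speed in km/h? A:"
)
```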