Add 'Understanding DeepSeek R1'

2025-02-17 18:52:55 +01:00
commit a92bf67859
+92
@@ -0,0 +1,92 @@
<br>DeepSeek-R1 is an [open-source language](https://seatcovers.co.za) [model built](https://a2zstreamsnow.com) on DeepSeek-V3-Base that's been making waves in the [AI](https://git.geekfarm.org) [community](http://womeningolf-wsga-sa.com). Not only does it match, or even surpass, OpenAI's o1 model on many benchmarks, but it also [comes with](http://24.233.1.3110880) fully MIT-licensed weights. This makes it the first non-OpenAI/[Google model](http://jobs.freightbrokerbootcamp.com) to deliver [strong reasoning](https://www.plm.ba) capabilities in an open and accessible manner.<br>
<br>What makes DeepSeek-R1 especially exciting is its openness. Unlike the less-open approaches of some industry leaders, DeepSeek has published a [detailed training](https://jobpile.uk) methodology in their paper.
The model is also remarkably cost-effective, with input tokens costing just $0.14-0.55 per million (vs o1's $15) and output tokens $2.19 per million (vs o1's $60).<br>
<br>Until around GPT-4, the conventional wisdom was that better models required more data and [compute](https://www.oyeanuncios.com). While that still holds, models like o1 and R1 demonstrate an alternative: inference-time scaling through [reasoning](https://getposition.com.pe).<br>
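<br>To put those prices in perspective, here is a minimal sketch that costs out a single request at the per-million-token prices quoted above. The workload (2,000 input tokens, 8,000 output tokens) is a made-up example, and the function name is my own.<br>

```python
# Prices in USD per million tokens, taken from the figures quoted above.
R1_INPUT_PER_M = (0.14, 0.55)  # low and high end of the quoted input range
R1_OUTPUT_PER_M = 2.19
O1_INPUT_PER_M = 15.0
O1_OUTPUT_PER_M = 60.0


def request_cost(input_tokens, output_tokens, input_per_m, output_per_m):
    """Cost in USD of one request at the given per-million-token prices."""
    return (input_tokens * input_per_m + output_tokens * output_per_m) / 1_000_000


# Hypothetical workload: reasoning models tend to emit many output tokens.
r1_cost = request_cost(2_000, 8_000, R1_INPUT_PER_M[1], R1_OUTPUT_PER_M)
o1_cost = request_cost(2_000, 8_000, O1_INPUT_PER_M, O1_OUTPUT_PER_M)
print(f"R1: ${r1_cost:.4f}  o1: ${o1_cost:.4f}  ratio: {o1_cost / r1_cost:.1f}x")
```

<br>Because reasoning models spend most of their tokens on output, the output price dominates the total in this kind of workload.<br>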
<br>The Essentials<br>
<br>The DeepSeek-R1 paper presented [multiple](https://colorxpfnb.com) models, but the main ones are R1 and R1-Zero. Alongside these is a series of distilled models that, while interesting, I won't discuss here.<br>
<br>DeepSeek-R1 uses two major concepts:<br>
<br>1. A multi-stage pipeline where a small set of [cold-start](https://lnx.seiformato.it) data kickstarts the model, followed by large-scale RL.
2. Group [Relative](https://2sapodcast.com) Policy Optimization (GRPO), a [reinforcement learning](https://cittaviva.net) method that relies on comparing multiple model outputs per prompt to avoid the need for a separate critic.<br>
<br>R1 and R1-Zero are both reasoning models. This essentially means they produce a Chain-of-Thought before [answering](http://spezialbau-kuehnapfel.de). For the R1 series of models, this takes the form of thinking within a `<think>` tag, before [answering](https://miderde.de) with a final summary.<br>
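<br>In practice you often want to separate the reasoning from the final answer. The following is a minimal sketch, assuming the response contains a single `<think>...</think>` block followed by the answer; the helper name and the sample string are made up for illustration.<br>

```python
import re

# Matches the first <think>...</think> block, across newlines.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text):
    """Return (reasoning, answer) for a response with one <think> block."""
    match = THINK_RE.search(text)
    if match is None:
        # No reasoning block found: treat the whole text as the answer.
        return "", text.strip()
    reasoning = match.group(1).strip()
    answer = text[match.end():].strip()
    return reasoning, answer


sample = "<think>2 + 2 is 4, and 4 * 3 is 12.</think>The result is 12."
reasoning, answer = split_reasoning(sample)
print(reasoning)
print(answer)
```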
<br>R1-Zero vs R1<br>
<br>R1-Zero applies [Reinforcement Learning](http://wishjobs.in) (RL) [directly](http://sotongeekjam.net) to DeepSeek-V3-Base without any supervised fine-tuning (SFT). RL is used to optimize the model's policy to maximize reward.
R1-Zero achieves excellent accuracy but sometimes produces confusing outputs, such as mixing multiple languages in a single response. R1 fixes that by incorporating limited [supervised fine-tuning](https://tvknet.pl) and multiple RL passes, which improves both [accuracy](http://120.48.141.823000) and readability.<br>
<br>It is interesting that some [languages](https://sunginmall.com443) may express certain ideas better, which leads the model to choose the most [expressive language](http://scoalahelegiu.ro) for the task.<br>
<br>Training Pipeline<br>
<br>The training pipeline that DeepSeek published in the R1 paper is immensely interesting. It [showcases](https://aladin.tube) how they created such [strong reasoning](http://valdorgeathletic.fr) models, and what you can expect from each phase. This [includes](https://caseblocks.com) the problems that the models resulting from each phase have, and how they were solved in the next phase.<br>
<br>It's fascinating that their training pipeline differs from the usual:<br>
<br>The usual training strategy: [Pretraining](http://kamper.e-brzesko.pl) on a large [dataset](https://salesupprocess.it) (train to [predict](https://vietteldienbien.vn) the next word) to get the base model → supervised fine-tuning → [preference tuning](http://centrodeesteticaleticiaperez.com) via RLHF
R1-Zero: [Pretrained](http://gitea.digiclib.cn801) → RL
R1: Pretrained → Multistage training [pipeline](https://pedulidigital.com) with multiple SFT and RL stages<br>
<br>1. Cold-Start Fine-Tuning: Fine-tune DeepSeek-V3-Base on a few thousand [Chain-of-Thought](https://www.od-bau-gmbh.de) (CoT) [samples](https://andigrup-ks.com) so that the RL process starts from a good model.
2. First RL Stage: [Apply GRPO](https://posrange.com) with rule-based rewards to improve reasoning correctness and formatting (such as forcing chain-of-thought into thinking tags). When they were near [convergence](http://womeningolf-wsga-sa.com) in the RL process, they moved to the next step. The result of this step is a strong reasoning model but with weak general capabilities, e.g., poor formatting and language mixing.
3. Rejection Sampling + general data: Create [new SFT](https://www.excellencecommunication.fr) data through rejection sampling on the RL checkpoint (from step 2), combined with supervised data from the DeepSeek-V3-Base model. They collected around 600[k high-quality](https://www.aegee-brno.org) reasoning samples.
4. Second Fine-Tuning: Fine-tune DeepSeek-V3-Base again on 800k total samples (600[k reasoning](https://www.linkedaut.it) + 200k general tasks) for broader capabilities. This [step resulted](http://jmhome28.free.fr) in a strong reasoning model with general capabilities.
5. Second RL Stage: Add more reward signals (helpfulness, harmlessness) to refine the final model, in addition to the reasoning rewards. The result is DeepSeek-R1.
They also did model distillation for several Qwen and [Llama models](https://fogel-finance.org) on the [reasoning](http://www.pureatz.com) traces to get distilled-R1 [models](https://www.pahadvasi.in).<br>
<br>Model distillation is a technique where you use a [teacher model](https://linkin.commoners.in) to improve a student model by generating training data for the [student model](http://sac2.xsrv.jp).
The teacher is typically a larger model than the [student](http://themasonstpete.com).<br>
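<br>The data-generation half of distillation can be sketched in a few lines. This is only an illustration under my own assumptions: `teacher` and `keep` are hypothetical callables, and the toy teacher below is a stand-in so the snippet runs without a model. It is not DeepSeek's actual pipeline.<br>

```python
def build_distillation_set(prompts, teacher, keep):
    """Collect teacher completions that pass a filter, as SFT examples."""
    dataset = []
    for prompt in prompts:
        completion = teacher(prompt)  # teacher generates a reasoning trace
        if keep(prompt, completion):  # drop traces that fail the quality check
            dataset.append({"prompt": prompt, "completion": completion})
    return dataset


# Toy stand-ins (hypothetical) so the sketch runs without any model.
def toy_teacher(prompt):
    a, b = (int(x) for x in prompt.split("+"))
    return f"<think>{a} + {b} = {a + b}</think>{a + b}"


def has_closed_think_tag(prompt, completion):
    return completion.startswith("<think>") and "</think>" in completion


sft_data = build_distillation_set(["1+2", "10+5"], toy_teacher, has_closed_think_tag)
print(sft_data)
```

<br>The student would then be fine-tuned on `sft_data` with ordinary supervised learning. The same generate-then-filter shape also describes the rejection sampling used in step 3 of the pipeline.<br>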
<br>Group Relative Policy Optimization (GRPO)<br>
<br>The basic idea behind using reinforcement learning for LLMs is to fine-tune the [model's policy](http://yidtravel.com) so that it naturally produces more accurate and [useful answers](http://ohisama.nagoya).
They used a reward system that checks not only for [correctness](https://ekolikvidator.cz) but also for proper formatting and language consistency, so the [model gradually](http://www.masterbioetica.es) learns to favor responses that meet these quality criteria.<br>
<br>In this paper, they [encourage](http://www.sergeselvon.de) the R1 model to generate chain-of-thought reasoning through RL training with GRPO.
Rather than adding a separate module at inference time, the [training process](http://yaakend.com) itself nudges the model to produce detailed, [step-by-step outputs, making](http://blog.accumed.com) the [chain-of-thought](https://institutometapoesia.com) an [emergent behavior](https://elmerbits.com) of the optimized policy.<br>
<br>What makes their approach particularly interesting is its [reliance](https://repo.myapps.id) on straightforward, rule-based reward functions.
Instead of depending on expensive external models or human-graded examples as in traditional RLHF, the RL used for R1 [uses simple](https://linkspreed.web4.one) criteria: it might give a higher reward if the answer is correct, if it follows the expected format, and if the language of the answer matches that of the prompt.
Not relying on a [reward model](http://osteo-vital.com) also [means](http://www.prono-sport.ro) you don't have to spend time and effort training it, and it doesn't take memory and compute away from your main model.<br>
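<br>A rule-based reward of this kind might look like the sketch below. Everything specific here is my assumption rather than something from the paper: the weights (0.5, 1.0, 0.25), exact-match correctness, and the crude ASCII-share heuristic standing in for a real language check.<br>

```python
import re

# Format rule: reasoning inside <think> tags, followed by a non-empty answer.
FORMAT_RE = re.compile(r"^<think>.+?</think>.+$", re.DOTALL)


def is_mostly_ascii(text, threshold=0.9):
    """Crude stand-in for language detection: share of ASCII characters."""
    if not text:
        return True
    return sum(ch.isascii() for ch in text) / len(text) >= threshold


def rule_based_reward(prompt, response, reference_answer):
    """Sum of three simple checks; the weights are illustrative only."""
    reward = 0.0
    # 1. Format: does the response follow the expected tag layout?
    if FORMAT_RE.match(response.strip()):
        reward += 0.5
    # 2. Correctness: exact match on the text after the closing tag.
    final_answer = response.split("</think>")[-1].strip()
    if final_answer == reference_answer.strip():
        reward += 1.0
    # 3. Language consistency: prompt and response look like the same script.
    if is_mostly_ascii(prompt) == is_mostly_ascii(response):
        reward += 0.25
    return reward


print(rule_based_reward("What is 6*7?", "<think>6*7=42</think>42", "42"))
print(rule_based_reward("What is 6*7?", "<think>6*7=41</think>41", "42"))
```

<br>No learned reward model is involved: each check is a cheap string operation, which is what keeps memory and compute free for the policy itself.<br>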
<br>GRPO was introduced in the [DeepSeekMath paper](https://publisherpodcastsummit.com). Here's how GRPO works:<br>
<br>1. For each input prompt, the model generates several different [responses](http://revoltex.ma).
2. Each [response receives](https://newborhooddates.com) a scalar reward based on factors like accuracy, formatting, and language consistency.
3. Rewards are adjusted relative to the group's performance, [essentially](http://www.emusikuk.co.uk) measuring how much better each response is compared to the others.
4. The model updates its policy slightly to [favor responses](https://heartrova.com) with higher relative rewards. It only makes small adjustments, using [techniques](https://warszawskidomaukcyjny.pl) like [clipping](https://www.pahadvasi.in) and a [KL penalty, to](https://extranet.grandcasinobaden.ch) ensure the policy doesn't stray too far from its [original behavior](https://repo.myapps.id).<br>
<br>A nice aspect of GRPO is its flexibility. You can use simple rule-based reward functions, for instance [awarding](https://www.wekid.it) a bonus when the model correctly uses the expected [syntax, to guide](https://www.bucaramanga.gov.co) the [training](https://www.woodyburton.com).<br>
<br>While DeepSeek used GRPO, you could [use alternative](https://tomnassal.com) [methods](https://digiartostelbien.de) instead (PPO or PRIME).<br>
<br>For those looking to dive deeper, Will Brown has written quite a nice implementation of training an LLM with [RL using](https://www.besolife.com) GRPO. GRPO has also already been added to the Transformer Reinforcement Learning (TRL) library, which is another good resource.
Finally, Yannic Kilcher has a great video [explaining](https://www.tcve.nl) GRPO by going through the [DeepSeekMath paper](https://unarcencielpourclara.org).<br>
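<br>The four steps above can be sketched numerically without any deep-learning framework. This is a simplified illustration, not a training loop: it works on one scalar per response rather than per token, uses the population standard deviation (my choice), and the default `clip_eps=0.2` is an assumed value. The KL term follows the estimator form given in the DeepSeekMath paper.<br>

```python
import math
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Step 3: normalize each reward against its own group's mean and std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_term(ratio, advantage, clip_eps=0.2):
    """Step 4: PPO-style clipped surrogate for one response.

    ratio is pi_new(response) / pi_old(response).
    """
    clipped_ratio = max(1 - clip_eps, min(1 + clip_eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


def kl_estimate(logp_policy, logp_ref):
    """Non-negative KL estimator: pi_ref/pi - log(pi_ref/pi) - 1."""
    log_ratio = logp_ref - logp_policy
    return math.exp(log_ratio) - log_ratio - 1


# Steps 1-2 (sampling and scoring) are assumed done; these rewards are made up.
rewards = [1.75, 0.75, 0.75, 1.25]
advantages = group_relative_advantages(rewards)
print([round(a, 3) for a in advantages])
```

<br>Because the baseline is the group mean, no separate critic network is needed: a response is "good" only relative to its siblings for the same prompt.<br>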
<br>Is RL on LLMs the path to AGI?<br>
<br>As a final note on [explaining](https://luckyway7.com) DeepSeek-R1 and the methodologies presented in the paper, I want to highlight a passage from the DeepSeekMath paper, based on a point [Yannic Kilcher](http://118.190.175.1083000) made in his video.<br>
<br>These findings indicate that [RL enhances](https://casadelaguitarra.com) the model's overall performance by rendering the output distribution more robust, in other words, it seems that the [improvement](https://sakataengei.co.jp) is attributed to [boosting](http://flor.krpadesigns.com) the [correct response](https://flohmarkt.familie-speckmann.de) from TopK rather than the enhancement of fundamental capabilities.<br>
<br>In other words, RL fine-tuning tends to shape the output distribution so that the highest-probability outputs are more likely to be correct, even though the overall capability (as measured by the diversity of correct answers) is largely present in the pretrained model.<br>
<br>This suggests that reinforcement learning on LLMs is more about refining and "shaping" the existing distribution of [responses](https://santanadedetizadora.com.br) rather than endowing the model with entirely new capabilities.
Consequently, while RL techniques such as PPO and GRPO can produce substantial [performance](https://lrc-oberflaechenschutz.de) gains, there appears to be an inherent ceiling determined by the underlying model's pretrained knowledge.<br>
<br>It is [unclear](http://hncom.nl) to me how far RL will take us. Perhaps it will be the [stepping stone](http://211.91.63.1448088) to the next big milestone. I'm excited to see how it unfolds!<br>
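<br>A toy calculation makes the distinction concrete. Assume, purely for illustration, that each sample is independently correct with probability p, and that RL raises p from 0.2 to 0.6 on some problem; these numbers are invented and not taken from either paper.<br>

```python
def pass_at_k(p_correct, k):
    """Probability that at least one of k independent samples is correct."""
    return 1.0 - (1.0 - p_correct) ** k


# Hypothetical per-sample success rates before and after RL.
base_p, rl_p = 0.2, 0.6

for k in (1, 64):
    base = pass_at_k(base_p, k)
    tuned = pass_at_k(rl_p, k)
    print(f"k={k:>2}  base={base:.4f}  rl={tuned:.4f}  gain={tuned - base:.4f}")
```

<br>With a single sample the gain is large, but with many samples the base model already finds a correct answer almost every time. That is the sense in which RL sharpens the distribution rather than adding capability the pretrained model lacked.<br>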
<br>Running DeepSeek-R1<br>
<br>I have used DeepSeek-R1 via the [official chat](https://photoboothccp.cl) interface for various problems, which it seems to solve well enough. The additional search functionality makes it even nicer to use.<br>
<br>Interestingly, o3-mini(-high) was released as I was [writing](http://www.thelisteningpartypodcast.com) this post. From my initial testing, R1 seems [stronger](https://www.oceanrower.eu) at math than o3-mini.<br>
<br>I also rented a single H100 via Lambda Labs for $2/h (26 CPU cores, 214.7 GB RAM, 1.1 TB SSD) to run some [experiments](https://aprendizagemavancada.com.br).
The main goal was to see how the model would perform when deployed on a single H100 [GPU, not](http://140.125.21.658418) to extensively test the [model's capabilities](https://alcacompanysac.com).<br>
<br>671B via Llama.cpp<br>
<br>DeepSeek-R1 1.58-bit (UD-IQ1_S) [quantized model](https://careers.indianschoolsoman.com) by Unsloth, with a 4-bit quantized KV-cache and partial GPU offloading (29 layers running on the GPU), running via llama.cpp:<br>
<br>29 layers seemed to be the sweet spot given this configuration.<br>
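<br>A back-of-the-envelope estimate shows why only part of the model can live on the GPU. The sketch treats 1.58 bits as a flat average over all 671B parameters and assumes an 80 GB H100; both are simplifications of mine, and the real file size of a dynamic quant will differ somewhat.<br>

```python
def weight_size_gb(n_params, bits_per_weight):
    """Approximate weight footprint in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


H100_MEMORY_GB = 80.0   # assumption: the 80 GB H100 variant
SYSTEM_RAM_GB = 214.7   # from the rented machine described above

weights_gb = weight_size_gb(671e9, 1.58)
print(f"weights: ~{weights_gb:.1f} GB")
print("fits in GPU memory alone:", weights_gb <= H100_MEMORY_GB)
print("fits in system RAM:", weights_gb <= SYSTEM_RAM_GB)
```

<br>Since the weights exceed GPU memory but fit in system RAM, some layers are offloaded to the GPU and the rest run on the CPU. The KV-cache and activations also need GPU memory, which is one reason the workable layer count has to be found by trial.<br>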
<br>Performance:<br>
<br>An r/localllama user described that they were able to get over 2 tok/sec with [DeepSeek](http://familybehavioralsupport.com) R1 671B, without using their GPU on their [local gaming](https://www.bekasinewsroom.com) setup.
Digital Spaceport [wrote](http://brottum-il.no) a full guide on how to run [Deepseek](https://www.nickelsgroup.com) R1 671b fully [locally](http://shokuzai-isan.jp) on a $2000 EPYC server, on which you can get ~4.25 to 3.5 tokens per second.<br>
<br>As you can see, the tokens/s isn't quite [bearable](https://www.thehappyconcept.nl) for any serious work, but it's fun to run these large models on accessible hardware.<br>
<br>What matters most to me is a combination of usefulness and time-to-usefulness in these models. Since reasoning models need to think before answering, their [time-to-usefulness](https://www.ausfocus.net) is usually higher than other models, but their usefulness is also usually higher.
We need to both maximize usefulness and minimize time-to-usefulness.<br>
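<br>To make time-to-usefulness tangible, here is a rough sketch that converts the throughput figures mentioned above into wall-clock time. The token counts (2,000 reasoning tokens, 300 answer tokens) are hypothetical, and prompt-processing time is ignored.<br>

```python
def seconds_to_answer(reasoning_tokens, answer_tokens, tokens_per_second):
    """Time until the full answer is generated, ignoring prompt processing."""
    return (reasoning_tokens + answer_tokens) / tokens_per_second


# Throughputs quoted above; the token counts are made up for illustration.
for rate in (2.0, 3.5, 4.25):
    wait = seconds_to_answer(2_000, 300, rate)
    print(f"{rate:>4} tok/s -> {wait / 60:.1f} minutes")
```

<br>At single-digit tokens per second, the thinking phase alone takes many minutes, which is why these setups are fun but not practical for serious work.<br>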
<br>70B via Ollama<br>
<br>70.6b params, 4-bit KM [quantized](http://www.bodytonic.fi) DeepSeek-R1 running via Ollama:<br>
<br>GPU utilization shoots up here, as expected when compared to the mostly CPU-powered run of 671B that I showcased above.<br>
<br>Resources<br>
<br>DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
[2402.03300] DeepSeekMath: [Pushing](http://ntep2008.com) the Limits of Mathematical Reasoning in Open Language Models
DeepSeek R1 - Notion (Building a [fully local](https://www.uro-compact.de) "deep researcher" with DeepSeek-R1 - YouTube).
[DeepSeek](http://223.68.171.1508004) R1's recipe to replicate o1 and the future of reasoning LMs.
The Illustrated DeepSeek-R1 - by Jay Alammar.
Explainer: What's R1 & Everything Else? - Tim Kellogg.
DeepSeek R1 [Explained](https://tygwennbythesea.com) to your [grandmother -](https://www.liselege.dk) YouTube<br>
<br>DeepSeek<br>
<br>- Try R1 at chat.deepseek.com.
GitHub - deepseek-[ai](https://www.oyeanuncios.com)/DeepSeek-R1.
deepseek-[ai](https://a2zstreamsnow.com)/Janus-Pro-7B · Hugging Face (January 2025): Janus-Pro is a novel autoregressive framework that [unifies multimodal](https://git.zzxxxc.com) understanding and generation. It can both understand and [generate images](https://whatlurksbeneath.com).
DeepSeek-R1: Incentivizing [Reasoning Capability](http://www.xn--289aj5xfskwja.com) in Large Language Models via [Reinforcement Learning](https://stainlessad.com) (January 2025) This paper introduces DeepSeek-R1, an open-source [reasoning model](http://dbrondos.mx) that rivals the [performance](https://smartcampus.seskoal.ac.id) of OpenAI's o1. It presents a [detailed methodology](https://www.9vfood.cn) for training such models using [large-scale reinforcement](https://colorxpfnb.com) learning techniques.
DeepSeek-V3 Technical Report (December 2024) This report discusses the implementation of an FP8 mixed precision training framework validated on an extremely large-scale model, achieving both accelerated [training](https://www.badibangart.com) and [reduced GPU](https://www.pahadvasi.in) [memory usage](https://bati2mendes.com).
DeepSeek LLM: [Scaling Open-Source](https://photoboothccp.cl) Language Models with [Longtermism](https://git.tcjskd.com443) (January 2024) This paper delves into scaling laws and presents [findings](https://bvbedcollege.org) that facilitate the [scaling](https://grow4sureconsulting.com) of large-scale models in [open-source](https://www.ofive.tv) configurations. It introduces the DeepSeek LLM project, dedicated to [advancing open-source](https://git.obo.cash) language models with a long-term perspective.
DeepSeek-Coder: When the Large Language Model Meets [Programming - The](http://blockshuette.de) Rise of Code Intelligence (January 2024) This research introduces the [DeepSeek-Coder](https://unikum-nou.ru) series, a range of open-source code models [trained](http://www.husakorid.dk) from [scratch](https://stainlessad.com) on 2 trillion tokens. The [models](http://www.gz-jj.com) are pre-trained on a [high-quality project-level](https://coolhuntinglab.com) code corpus and employ a fill-in-the-blank task to enhance code generation and infilling.
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts [Language](http://elevatepalestine.com) Model (May 2024) This paper presents DeepSeek-V2, a Mixture-of-Experts (MoE) language model characterized by [economical training](https://tcrhausa.com) and [efficient](https://cholesterol.org.il) [inference](http://laserdent-kursk.ru).
DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code [Intelligence](https://greatdelight.net) (June 2024) This research introduces DeepSeek-Coder-V2, an [open-source](http://www.interq.or.jp) Mixture-of-Experts (MoE) [code language](https://ipp.com.ro) model that achieves performance comparable to GPT-4 Turbo in [code-specific tasks](https://calciojob.com).<br>
<br>Interesting events<br>
<br>- Hong Kong University replicates R1 results (Jan 25, '25).
- Huggingface announces huggingface/open-r1: Fully open reproduction of DeepSeek-R1 to replicate R1, fully open source (Jan 25, '25).
[- OpenAI](http://semperuni.com) researcher confirms the DeepSeek team independently found and used some [core ideas](https://luckyway7.com) the OpenAI team used on the way to o1<br>