Add 'Understanding DeepSeek R1'

2025-02-10 02:25:44 +01:00
commit 1b3c73aa22
+92
@@ -0,0 +1,92 @@
<br>DeepSeek-R1 is an open-source language model built on DeepSeek-V3-Base that has been making waves in the AI community. Not only does it match, or even surpass, OpenAI's o1 model on many benchmarks, but it also comes with fully MIT-licensed weights. This makes it the first non-OpenAI/Google model to deliver strong reasoning capabilities in an open and accessible way.<br>
<br>What makes DeepSeek-R1 especially exciting is its openness. Unlike the less-open approaches of some industry leaders, DeepSeek has published a detailed training methodology in their paper.
The model is also remarkably cost-effective, with input tokens costing just $0.14-0.55 per million (vs o1's $15) and output tokens costing $2.19 per million (vs o1's $60).<br>
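<br>To make the price gap concrete, here is a minimal sketch that applies the per-million-token prices quoted above to a single request. The workload (2,000 prompt tokens, 8,000 generated tokens) and the helper name request_cost are assumptions chosen for illustration only.<br>

```python
# Prices in USD per million tokens, taken from the figures quoted above.
R1_INPUT_RANGE = (0.14, 0.55)  # low and high input price for R1
R1_OUTPUT = 2.19
O1_INPUT = 15.00
O1_OUTPUT = 60.00


def request_cost(input_tokens: int, output_tokens: int,
                 input_price: float, output_price: float) -> float:
    """Cost in USD of one request, given per-million-token prices."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Hypothetical workload: 2,000 prompt tokens and 8,000 generated tokens.
r1_high = request_cost(2_000, 8_000, R1_INPUT_RANGE[1], R1_OUTPUT)
o1 = request_cost(2_000, 8_000, O1_INPUT, O1_OUTPUT)

print(f"R1 (upper bound): ${r1_high:.4f}")
print(f"o1:               ${o1:.4f}")
print(f"o1 / R1 ratio:    {o1 / r1_high:.1f}x")
```

<br>Because reasoning models emit long chains of thought, output tokens tend to dominate the bill, which is why the output price matters most in this comparison.<br>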
<br>Until around GPT-4, the conventional wisdom was that better models required more data and compute. While that is still valid, models like o1 and R1 demonstrate an alternative: inference-time scaling through reasoning.<br>
<br>The Essentials<br>
<br>The DeepSeek-R1 paper presented multiple models, but the main ones are R1 and R1-Zero. Alongside these is a series of distilled models that, while interesting, I won't discuss here.<br>
<br>DeepSeek-R1 uses two major ideas:<br>
<br>1. A multi-stage pipeline where a small set of cold-start data kickstarts the model, followed by large-scale RL.
2. Group Relative Policy Optimization (GRPO), a reinforcement learning method that relies on comparing multiple model outputs per prompt to avoid the need for a separate critic.<br>
<br>R1 and R1-Zero are both reasoning models. This essentially means they do Chain-of-Thought before answering. For the R1 series of models, this takes the form of thinking inside a &lt;think&gt; tag before answering with a final summary.<br>
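<br>As a rough illustration of that output shape, the sketch below splits a completion into its reasoning and its final answer. It assumes the reasoning is wrapped in think tags, as in R1's published outputs; the function name split_reasoning and the sample string are hypothetical.<br>

```python
import re

# Matches the first <think>...</think> block, including newlines inside it.
THINK_PATTERN = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text: str) -> tuple[str, str]:
    """Return (reasoning, answer) for a completion with a think block.

    If no think block is found, the reasoning is empty and the whole
    text is treated as the answer.
    """
    match = THINK_PATTERN.search(text)
    if match is None:
        return "", text.strip()
    reasoning = match.group(1).strip()
    answer = text[match.end():].strip()
    return reasoning, answer


# Hypothetical completion, for illustration only.
sample = "<think>2 + 2 is 4, times 3 is 12.</think>The result is 12."
reasoning, answer = split_reasoning(sample)
print("reasoning:", reasoning)
print("answer:", answer)
```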
<br>R1-Zero vs R1<br>
<br>R1-Zero applies Reinforcement Learning (RL) directly to DeepSeek-V3-Base with no supervised fine-tuning (SFT). RL is used to optimize the model's policy to maximize reward.
R1-Zero achieves excellent accuracy but sometimes produces confusing outputs, such as mixing multiple languages in a single response. R1 fixes that by incorporating limited supervised fine-tuning and multiple RL passes, which improves both accuracy and readability.<br>
<br>It is interesting how some languages may express certain ideas better, which leads the model to choose the most expressive language for the task.<br>
<br>Training Pipeline<br>
<br>The training pipeline that DeepSeek published in the R1 paper is immensely interesting. It shows how they created such strong reasoning models, and what you can expect from each phase. This includes the problems that the models resulting from each phase have, and how they were solved in the next phase.<br>
<br>It's interesting that their training pipeline differs from the usual:<br>
<br>The usual training strategy: Pretraining on a large dataset (train to predict the next word) to get the base model → supervised fine-tuning → preference tuning via RLHF
R1-Zero: Pretrained → RL
R1: Pretrained → Multistage training pipeline with multiple SFT and RL stages<br>
<br>1. Cold-Start Fine-Tuning: Fine-tune DeepSeek-V3-Base on a few thousand Chain-of-Thought (CoT) samples to ensure the RL process has a decent starting point. This gives a good model to start RL from.
2. First RL Stage: Apply GRPO with rule-based rewards to improve reasoning correctness and formatting (such as forcing the chain-of-thought into thinking tags). When they were near convergence in the RL process, they moved on to the next step. The result of this step is a strong reasoning model but with weak general capabilities, e.g., poor formatting and language mixing.
3. Rejection Sampling + general data: Create new SFT data through rejection sampling on the RL checkpoint (from step 2), combined with supervised data from the DeepSeek-V3-Base model. They collected around 600k high-quality reasoning samples.
4. Second Fine-Tuning: Fine-tune DeepSeek-V3-Base again on 800k total samples (600k reasoning + 200k general tasks) for broader capabilities. This step resulted in a strong reasoning model with general capabilities.
5. Second RL Stage: Add more reward signals (helpfulness, harmlessness) to refine the final model, in addition to the reasoning rewards. The result is DeepSeek-R1.
They also did model distillation for several Qwen and Llama models on the reasoning traces to get distilled-R1 models.<br>
<br>Model distillation is a technique where you use a teacher model to improve a student model by generating training data for the student model.
The teacher is typically a larger model than the student.<br>
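<br>The rejection-sampling step of the pipeline above can be sketched as follows: sample several completions per prompt from a checkpoint, keep only those that pass a check, and store the survivors as SFT rows. This is a toy sketch under stated assumptions, not DeepSeek's implementation; generate and is_acceptable are hypothetical callables standing in for the RL checkpoint and the acceptance check.<br>

```python
import random
from typing import Callable


def rejection_sample(
    prompt: str,
    generate: Callable[[str], str],
    is_acceptable: Callable[[str, str], bool],
    num_candidates: int = 8,
) -> list[dict[str, str]]:
    """Sample several completions and keep only those that pass the check."""
    kept = []
    for _ in range(num_candidates):
        completion = generate(prompt)
        if is_acceptable(prompt, completion):
            kept.append({"prompt": prompt, "completion": completion})
    return kept


# Toy stand-ins (assumptions): a "model" that guesses, and an exact-match checker.
rng = random.Random(0)


def toy_generate(prompt: str) -> str:
    return rng.choice(["4", "5", "four"])


def toy_check(prompt: str, completion: str) -> bool:
    return completion == "4"


sft_rows = rejection_sample("What is 2 + 2?", toy_generate, toy_check)
print(f"kept {len(sft_rows)} of 8 candidates")
```

<br>The same shape applies to distillation data: a teacher generates completions, a filter keeps the good ones, and the student is fine-tuned on the resulting rows.<br>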
<br>Group Relative Policy Optimization (GRPO)<br>
<br>The basic idea behind using reinforcement learning for LLMs is to fine-tune the model's policy so that it naturally produces more accurate and useful answers.
They used a reward system that checks not only for correctness but also for proper formatting and language consistency, so the model gradually learns to favor responses that meet these quality criteria.<br>
<br>In this paper, they encourage the R1 model to generate chain-of-thought reasoning through RL training with GRPO.
Rather than adding a separate module at inference time, the training process itself nudges the model to produce detailed, step-by-step outputs, making the chain-of-thought an emergent behavior of the optimized policy.<br>
<br>What makes their approach particularly interesting is its reliance on straightforward, rule-based reward functions.
Instead of depending on expensive external models or human-graded examples as in traditional RLHF, the RL used for R1 relies on simple criteria: it might give a higher reward if the answer is correct, if it follows the expected thinking/answer tag format, and if the language of the answer matches that of the prompt.
Not relying on a reward model also means you don't have to spend time and effort training it, and it doesn't take memory and compute away from your main model.<br>
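<br>A rule-based reward of this kind can be very small. The sketch below sums three checks: exact-match correctness, a format check, and a language-consistency check. The weights (1.0, 0.5, 0.25), the think-tag format, and the ASCII-share language proxy are all assumptions made for illustration; the paper does not publish reward code.<br>

```python
import re

# Assumed format: a think block followed by a non-empty final answer.
FORMAT_PATTERN = re.compile(r"^<think>.+?</think>\s*\S.*$", re.DOTALL)


def is_mostly_ascii(text: str, threshold: float = 0.9) -> bool:
    """Crude language proxy (assumption): share of ASCII characters."""
    if not text:
        return True
    return sum(ch.isascii() for ch in text) / len(text) >= threshold


def rule_based_reward(prompt: str, completion: str, reference_answer: str) -> float:
    """Sum three simple checks: correctness, format, language consistency."""
    reward = 0.0
    # Everything after the closing think tag is treated as the final answer.
    final_answer = completion.split("</think>")[-1].strip()
    if final_answer == reference_answer.strip():
        reward += 1.0   # accuracy reward (hypothetical weight)
    if FORMAT_PATTERN.match(completion.strip()):
        reward += 0.5   # format reward (hypothetical weight)
    if is_mostly_ascii(prompt) == is_mostly_ascii(completion):
        reward += 0.25  # language-consistency reward (hypothetical weight)
    return reward


good = rule_based_reward("What is 2 + 2?", "<think>2 + 2 = 4</think>4", "4")
bad = rule_based_reward("What is 2 + 2?", "5", "4")
print(good, bad)
```

<br>Every check is deterministic and cheap, which is what removes the need to train and host a separate reward model.<br>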
<br>GRPO was introduced in the DeepSeekMath paper. Here's how GRPO works:<br>
<br>1. For each input prompt, the model generates several different responses.
2. Each response receives a scalar reward based on factors like accuracy, formatting, and language consistency.
3. Rewards are normalized relative to the group's performance, essentially measuring how much better each response is compared to the others.
4. The model updates its policy slightly to favor responses with higher relative rewards. It only makes small adjustments, using techniques like clipping and a KL penalty, to ensure the policy doesn't stray too far from its original behavior.<br>
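<br>Steps 3 and 4 can be sketched numerically. The first function normalizes each reward against its own group (subtract the group mean, divide by the group standard deviation); the second applies a PPO-style clipped surrogate to a single sample. This is a simplified sketch: the KL penalty is omitted, the use of the population standard deviation and the clip range of 0.2 are assumptions, and the function names are hypothetical.<br>

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize each reward against its own group: (r - mean) / std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_term(ratio: float, advantage: float, clip_eps: float = 0.2) -> float:
    """Clipped surrogate for one sample, where ratio = pi_new / pi_old."""
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


# Hypothetical rewards for four responses to the same prompt.
rewards = [1.75, 0.25, 0.25, 1.75]
advantages = group_relative_advantages(rewards)
print([round(a, 3) for a in advantages])

# A large policy ratio gets no extra credit beyond the clip range.
print(clipped_term(ratio=1.5, advantage=1.0))
```

<br>Because the baseline is the group mean rather than a learned value function, no separate critic network is needed.<br>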
<br>A cool aspect of GRPO is its flexibility. You can use simple rule-based reward functions, for instance, awarding a bonus when the model correctly uses the expected thinking-tag syntax, to guide the training.<br>
<br>While DeepSeek used GRPO, you could use alternative methods instead (PPO or PRIME).<br>
<br>For those looking to dive deeper, Will Brown has written quite a nice implementation of training an LLM with RL using GRPO. GRPO has also already been added to the Transformer Reinforcement Learning (TRL) library, which is another good resource.
Finally, Yannic Kilcher has a great video explaining GRPO by going through the DeepSeekMath paper.<br>
<br>Is RL on LLMs the path to AGI?<br>
<br>As a final note on explaining DeepSeek-R1 and the methods they've presented in their paper, I want to highlight a passage from the DeepSeekMath paper, based on a point Yannic Kilcher made in his video.<br>
<br>These findings indicate that RL enhances the model's overall performance by rendering the output distribution more robust, in other words, it seems that the improvement is attributed to boosting the correct response from TopK rather than the enhancement of fundamental capabilities.<br>
<br>In other words, RL fine-tuning tends to shape the output distribution so that the highest-probability outputs are more likely to be correct, even though the overall capability (as measured by the diversity of correct answers) is largely present in the pretrained model.<br>
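<br>One way to picture this distinction is to compare two metrics over k sampled answers: Pass@K (is any sample correct?) and Maj@K (is the most frequent sample correct?). The sketch below uses made-up samples for a single question; the data and function names are hypothetical and only illustrate how Maj@K can improve while Pass@K stays flat.<br>

```python
from collections import Counter


def pass_at_k(samples: list[str], reference: str) -> bool:
    """True if any of the k sampled answers is correct."""
    return reference in samples


def maj_at_k(samples: list[str], reference: str) -> bool:
    """True if the most frequent sampled answer is correct."""
    most_common_answer, _ = Counter(samples).most_common(1)[0]
    return most_common_answer == reference


# Hypothetical samples for one question whose correct answer is "12".
before_rl = ["10", "12", "10", "14"]
after_rl = ["12", "12", "10", "12"]

for name, samples in (("before RL", before_rl), ("after RL", after_rl)):
    print(name, "pass@4:", pass_at_k(samples, "12"), "maj@4:", maj_at_k(samples, "12"))
```

<br>In this toy case the correct answer was already reachable before RL; what changes is how much probability mass it receives.<br>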
<br>This suggests that reinforcement learning on LLMs is more about refining and "shaping" the existing distribution of responses rather than endowing the model with entirely new capabilities.
Consequently, while RL techniques such as PPO and GRPO can produce substantial performance gains, there appears to be an inherent ceiling determined by the underlying model's pretrained knowledge.<br>
<br>It is unclear to me how far RL will take us. Perhaps it will be the stepping stone to the next big milestone. I'm excited to see how it unfolds!<br>
<br>Running DeepSeek-R1<br>
<br>I've used DeepSeek-R1 through the official chat interface for various problems, which it seems to solve well enough. The additional search functionality makes it even nicer to use.<br>
<br>Interestingly, o3-mini(-high) was released as I was writing this post. From my initial testing, R1 seems stronger at math than o3-mini.<br>
<br>I also rented a single H100 via Lambda Labs for $2/h (26 CPU cores, 214.7 GB RAM, 1.1 TB SSD) to run some experiments.
The main goal was to see how the model would perform when deployed on a single H100 GPU, not to extensively test the model's capabilities.<br>
<br>671B via Llama.cpp<br>
<br>DeepSeek-R1 1.58-bit (UD-IQ1_S) quantized model by Unsloth, with a 4-bit quantized KV-cache and partial GPU offloading (29 layers running on the GPU), running via llama.cpp:<br>
<br>29 layers seemed to be the sweet spot given this configuration.<br>
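<br>For readers who want to try a similar setup, the sketch below assembles a llama-cli invocation matching the configuration described above. It is not the exact command used for this run: the model filename and the context size of 8192 are assumptions, and only the K cache is quantized here because quantizing the V cache has additional requirements in llama.cpp.<br>

```python
import shlex


def llama_cli_command(model_path: str, gpu_layers: int, prompt: str) -> list[str]:
    """Build an argument list for llama.cpp's llama-cli binary."""
    return [
        "llama-cli",
        "--model", model_path,
        "--n-gpu-layers", str(gpu_layers),  # layers offloaded to the GPU
        "--cache-type-k", "q4_0",           # 4-bit quantized K cache
        "--ctx-size", "8192",               # assumed context length
        "--prompt", prompt,
    ]


# Hypothetical local path to the quantized GGUF file.
cmd = llama_cli_command("DeepSeek-R1-UD-IQ1_S.gguf", 29, "Why is the sky blue?")
print(shlex.join(cmd))
```

<br>The list can be passed to subprocess.run once the model file is in place; raising or lowering the number of GPU layers is the main knob for fitting the model into the available VRAM.<br>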
<br>Performance:<br>
<br>A r/localllama user described that they were able to get over 2 tok/sec with DeepSeek R1 671B, without using their GPU on their local gaming setup.
Digital Spaceport wrote a full guide on how to run Deepseek R1 671b fully locally on a $2000 EPYC server, on which you can get ~4.25 to 3.5 tokens per second. <br>
<br>As you can see, the tokens/s isn't quite bearable for any serious work, but it's fun to run these large models on accessible hardware.<br>
<br>What matters most to me is a combination of usefulness and time-to-usefulness in these models. Since reasoning models need to think before answering, their time-to-usefulness is usually higher than other models, but their usefulness is also usually higher.
We need to both maximize usefulness and minimize time-to-usefulness.<br>
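<br>A quick back-of-the-envelope calculation shows why throughput dominates time-to-usefulness for reasoning models. The token counts below (1,500 thinking tokens plus 300 answer tokens) are assumptions for illustration; the rates are the ones quoted above.<br>

```python
def seconds_to_answer(thinking_tokens: int, answer_tokens: int,
                      tokens_per_second: float) -> float:
    """Time until the full answer is available, ignoring prompt processing."""
    return (thinking_tokens + answer_tokens) / tokens_per_second


# Hypothetical response: 1,500 thinking tokens followed by a 300-token answer.
for rate in (2.0, 3.5, 4.25):
    wait = seconds_to_answer(1_500, 300, rate)
    print(f"{rate:>5} tok/s -> {wait / 60:.1f} minutes")
```

<br>At these rates the wait is measured in minutes per response, which matches the observation that local 671B inference is fun but not yet practical for serious work.<br>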
<br>70B via Ollama<br>
<br>70.6b params, 4-bit KM quantized DeepSeek-R1 running via Ollama:<br>
<br>GPU utilization shoots up here, as expected when compared to the mostly CPU-powered run of 671B that I showcased above.<br>
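<br>To measure throughput for a run like this yourself, one option is Ollama's local HTTP API. The sketch below builds a non-streaming request to the /api/generate endpoint and derives tokens per second from the eval_count and eval_duration fields of the response. It assumes a default local Ollama install on port 11434 and the deepseek-r1:70b model tag; nothing here is taken from the original run.<br>

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local endpoint


def build_request(prompt: str, model: str = "deepseek-r1:70b") -> urllib.request.Request:
    """Build a non-streaming generate request for a local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def tokens_per_second(response: dict) -> float:
    """Generation speed; eval_duration is reported in nanoseconds."""
    return response["eval_count"] / (response["eval_duration"] / 1e9)


def generate(prompt: str) -> dict:
    """Send the request; requires a running Ollama server with the model pulled."""
    with urllib.request.urlopen(build_request(prompt)) as resp:
        return json.loads(resp.read())
```

<br>Calling tokens_per_second(generate("Why is the sky blue?")) against a running server gives a number directly comparable to the tok/sec figures quoted earlier.<br>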
<br>Resources<br>
<br>DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
[2402.03300] DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
DeepSeek R1 - Notion (Building a fully local "deep researcher" with DeepSeek-R1 - YouTube).
DeepSeek R1's recipe to replicate o1 and the future of reasoning LMs.
The Illustrated DeepSeek-R1 - by Jay Alammar.
Explainer: What's R1 & Everything Else? - Tim .
DeepSeek R1 Explained to your grandmother - YouTube<br>
<br>DeepSeek<br>
<br>- Try R1 at chat.deepseek.com.
- GitHub - deepseek-ai/DeepSeek-R1.
- deepseek-ai/Janus-Pro-7B · Hugging Face (January 2025): Janus-Pro is a novel autoregressive framework that unifies multimodal understanding and generation. It can both understand and generate images.
- DeepSeek-R1: Incentivizing Reasoning Capability in Large Language Models via Reinforcement Learning (January 2025) This paper introduces DeepSeek-R1, an open-source reasoning model that rivals the performance of OpenAI's o1. It presents a detailed methodology for training such models using large-scale reinforcement learning techniques.
- DeepSeek-V3 Technical Report (December 2024) This report discusses the implementation of an FP8 mixed precision training framework validated on an extremely large-scale model, achieving both accelerated training and reduced GPU memory usage.
- DeepSeek LLM: Scaling Open-Source Language Models with Longtermism (January 2024) This paper delves into scaling laws and presents findings that facilitate the scaling of large-scale models in open-source configurations. It introduces the DeepSeek LLM project, dedicated to advancing open-source language models with a long-term perspective.
- DeepSeek-Coder: When the Large Language Model Meets Programming - The Rise of Code Intelligence (January 2024) This research introduces the DeepSeek-Coder series, a range of open-source code models trained from scratch on 2 trillion tokens. The models are pre-trained on a high-quality project-level code corpus and employ a fill-in-the-blank task to enhance code generation and infilling.
- DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model (May 2024) This paper presents DeepSeek-V2, a Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference.
- DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence (June 2024) This research introduces DeepSeek-Coder-V2, an open-source Mixture-of-Experts (MoE) code language model that achieves performance comparable to GPT-4 Turbo in code-specific tasks.<br>
<br>Interesting events<br>
<br>- Hong Kong University replicates R1 results (Jan 25, '25).
- Huggingface announces huggingface/open-r1: Fully open reproduction of DeepSeek-R1 to replicate R1, fully open source (Jan 25, '25).
- OpenAI researcher confirms the DeepSeek team independently found and used some core ideas the OpenAI team used on the way to o1<br>
<br>Liked this post? Join the newsletter.<br>