From a92bf678590c584a9c604097d702f63b72a22da3 Mon Sep 17 00:00:00 2001 From: ivachristy8205 Date: Mon, 17 Feb 2025 18:52:55 +0100 Subject: [PATCH] Add 'Understanding DeepSeek R1' --- Understanding-DeepSeek-R1.md | 92 ++++++++++++++++++++++++++++++++++++ 1 file changed, 92 insertions(+) create mode 100644 Understanding-DeepSeek-R1.md diff --git a/Understanding-DeepSeek-R1.md b/Understanding-DeepSeek-R1.md new file mode 100644 index 0000000..273141b --- /dev/null +++ b/Understanding-DeepSeek-R1.md @@ -0,0 +1,92 @@ +
DeepSeek-R1 is an open-source language model built on DeepSeek-V3-Base that's been making waves in the AI community. Not only does it match, or even surpass, OpenAI's o1 model in many benchmarks, but it also comes with fully MIT-licensed weights. This marks it as the first non-OpenAI/Google model to deliver strong reasoning capabilities in an open and accessible manner.
+
What makes DeepSeek-R1 especially exciting is its openness. Unlike the less-open approaches from some industry leaders, DeepSeek has published a detailed training methodology in their paper. +The model is also remarkably cost-effective, with input tokens costing just $0.14-0.55 per million (vs o1's $15) and output tokens at $2.19 per million (vs o1's $60).
+
Until ~GPT-4, the conventional wisdom was that better models required more data and compute. While that's still valid, models like o1 and R1 demonstrate an alternative: inference-time scaling through reasoning.
+
The Essentials
+
The DeepSeek-R1 paper presented multiple models, but chief among them were R1 and R1-Zero. Following these are a series of distilled models that, while interesting, I won't discuss here.
+
DeepSeek-R1 uses two major ideas:
+
1. A multi-stage pipeline where a small set of cold-start data kickstarts the model, followed by large-scale RL.
+2. Group Relative Policy Optimization (GRPO), a reinforcement learning method that relies on comparing multiple model outputs per prompt to avoid the need for a separate critic.
+
R1 and R1-Zero are both reasoning models. This essentially means they do Chain-of-Thought before answering. For the R1 series of models, this takes the form of thinking within a `<think>` tag, before answering with a final summary.
+
R1-Zero vs R1
+
R1-Zero applies Reinforcement Learning (RL) directly to DeepSeek-V3-Base without any supervised fine-tuning (SFT). RL is used to optimize the model's policy to maximize reward. +R1-Zero achieves excellent accuracy but sometimes produces confusing outputs, such as mixing multiple languages in a single response. R1 fixes that by incorporating limited supervised fine-tuning and multiple RL passes, which improves both correctness and readability.
+
It is interesting how some languages may express certain ideas better, which leads the model to choose the most expressive language for the task.
+
Training Pipeline
+
The training pipeline that DeepSeek published in the R1 paper is immensely interesting. It showcases how they created such strong reasoning models, and what you can expect from each stage. This includes the problems that the resulting models from each stage have, and how they solved them in the next stage.
+
It's fascinating that their training pipeline differs from the usual:
+
The usual training strategy: Pretraining on a large dataset (train to predict the next word) to get the base model → supervised fine-tuning → preference tuning via RLHF
+R1-Zero: Pretrained → RL
+R1: Pretrained → Multistage training pipeline with multiple SFT and RL stages
+
1. Cold-Start Fine-Tuning: Fine-tune DeepSeek-V3-Base on a few thousand Chain-of-Thought (CoT) samples so that the RL process has a decent starting point.
+2. First RL Stage: Apply GRPO with rule-based rewards to improve reasoning correctness and formatting (such as forcing the chain-of-thought into thinking tags). When they were near convergence in the RL process, they moved to the next step. The result of this step is a strong reasoning model but with weak general capabilities, e.g., poor formatting and language mixing.
+3. Rejection Sampling + general data: Create new SFT data through rejection sampling on the RL checkpoint (from step 2), combined with supervised data from the DeepSeek-V3-Base model. They collected around 600k high-quality reasoning samples.
+4. Second Fine-Tuning: Fine-tune DeepSeek-V3-Base again on 800k total samples (600k reasoning + 200k general tasks) for broader capabilities. This step resulted in a strong reasoning model with general capabilities.
+5. Second RL Stage: Add more reward signals (helpfulness, harmlessness) to refine the final model, in addition to the reasoning rewards. The result is DeepSeek-R1.
+They also did model distillation for several Qwen and Llama models on the reasoning traces to get distilled-R1 models.
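+
The rejection sampling step in the pipeline above can be illustrated with a minimal sketch: draw several candidate responses per prompt and keep only those that pass a check. The names `rejection_sample`, `generate`, and `is_acceptable` are hypothetical, and the toy generator stands in for a real model checkpoint; this is not DeepSeek's actual procedure.

```python
import random

def rejection_sample(prompt, generate, is_acceptable, num_candidates=8, seed=0):
    """Sample several candidates and keep only the ones that pass the filter.

    `generate` and `is_acceptable` are caller-supplied callables (hypothetical
    stand-ins for a model checkpoint and a correctness/quality check).
    """
    rng = random.Random(seed)
    candidates = [generate(prompt, rng) for _ in range(num_candidates)]
    return [c for c in candidates if is_acceptable(prompt, c)]

# Toy stand-ins: the "model" guesses a sum, the filter keeps correct guesses.
def toy_generate(prompt, rng):
    return str(rng.choice([3, 4, 5]))

def toy_is_acceptable(prompt, completion):
    return completion == "4"

kept = rejection_sample("What is 2+2?", toy_generate, toy_is_acceptable)
print(f"kept {len(kept)} of 8 candidates")
```

The surviving samples would then be used as supervised fine-tuning data; how strict the filter is determines the trade-off between data quality and data volume.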
+
Model distillation is a technique where you use a teacher model to improve a student model by generating training data for the student model. +The teacher is typically a larger model than the student.
+
Group Relative Policy Optimization (GRPO)
+
The basic idea behind using reinforcement learning for LLMs is to fine-tune the model's policy so that it naturally produces more accurate and useful answers. +They used a reward system that checks not only for correctness but also for proper formatting and language consistency, so the model gradually learns to favor responses that meet these quality criteria.
+
In this paper, they encourage the R1 model to generate chain-of-thought reasoning through RL training with GRPO. +Rather than adding a separate module at inference time, the training process itself nudges the model to produce detailed, step-by-step outputs, making the chain-of-thought an emergent behavior of the optimized policy.
+
What makes their approach particularly interesting is its reliance on straightforward, rule-based reward functions. +Instead of depending on expensive external models or human-graded examples as in traditional RLHF, the RL used for R1 relies on simple criteria: it might give a higher reward if the answer is correct, if it follows the expected format, and if the language of the answer matches that of the prompt. +Not relying on a reward model also means you don't have to spend time and effort training it, and it doesn't take memory and compute away from your main model.
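+
To make the idea of a rule-based reward concrete, here is a minimal sketch that scores a completion on format, correctness, and language consistency. Everything specific in it is an assumption for illustration: the `<think>`/`<answer>` tag layout, the reward weights, and the crude ASCII-based language check are not taken from the paper.

```python
import re

# Assumed tag layout for illustration only.
FORMAT_RE = re.compile(r"^<think>.+?</think>\s*<answer>(.+?)</answer>$", re.DOTALL)

def is_mostly_ascii(text, threshold=0.9):
    """Crude stand-in for a language check: is the text mostly ASCII?"""
    if not text:
        return False
    return sum(ch.isascii() for ch in text) / len(text) >= threshold

def rule_based_reward(prompt, completion, reference_answer):
    """Return a scalar reward from three simple rules (weights are arbitrary)."""
    reward = 0.0
    match = FORMAT_RE.match(completion.strip())
    if match:
        reward += 0.5  # format rule: reasoning and answer are properly tagged
        if match.group(1).strip() == reference_answer.strip():
            reward += 1.0  # accuracy rule: extracted answer matches the reference
    if is_mostly_ascii(prompt) == is_mostly_ascii(completion):
        reward += 0.25  # language rule: prompt and completion look alike
    return reward

good = "<think>2 + 2 equals 4.</think><answer>4</answer>"
print(rule_based_reward("What is 2+2?", good, "4"))
```

Because every rule is a cheap string check, no second model has to be trained or kept in memory, which is the practical benefit described above.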
+
GRPO was introduced in the DeepSeekMath paper. Here's how GRPO works:
+
1. For each input prompt, the model generates several different responses.
+2. Each response receives a scalar reward based on factors like accuracy, formatting, and language consistency.
+3. Rewards are adjusted relative to the group's performance, essentially measuring how much better each response is compared to the others.
+4. The model updates its strategy slightly to favor responses with higher relative rewards. It only makes slight adjustments, using techniques like clipping and a KL penalty, to ensure the policy doesn't stray too far from its original behavior.
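+
The steps above can be sketched numerically. The snippet below is a simplified, per-response illustration rather than a training loop: it normalizes rewards within a group, applies a PPO-style clipped term, and computes the per-sample KL estimate described in the DeepSeekMath paper. The function names, the use of the population standard deviation, and the default `clip_eps` are assumptions.

```python
import math
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Step 3: score each response relative to its group (mean 0, unit scale)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; eps avoids division by zero
    return [(r - mu) / (sigma + eps) for r in rewards]

def clipped_term(ratio, advantage, clip_eps=0.2):
    """Step 4: clipped surrogate; ratio = pi_new(o) / pi_old(o)."""
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)

def kl_estimate(logp_new, logp_ref):
    """Per-sample KL penalty estimate: r - log(r) - 1 with r = pi_ref / pi_new."""
    log_r = logp_ref - logp_new
    return math.exp(log_r) - log_r - 1.0

# Four sampled responses to one prompt, two of which were rewarded.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print([round(a, 3) for a in advantages])
```

Since the baseline is just the group mean, no learned value function (critic) is needed; the cost is that several responses must be sampled for every prompt.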
+
A cool aspect of GRPO is its flexibility. You can use simple rule-based reward functions, for instance awarding a bonus when the model correctly uses the expected syntax, to guide the training.
+
While DeepSeek used GRPO, you could use alternative methods instead (PPO or PRIME).
+
For those looking to dive deeper, Will Brown has written quite a nice implementation of training an LLM with RL using GRPO. GRPO has also already been added to the Transformer Reinforcement Learning (TRL) library, which is another good resource. +Finally, Yannic Kilcher has a great video explaining GRPO by going through the DeepSeekMath paper.
+
Is RL on LLMs the path to AGI?
+
As a final note on explaining DeepSeek-R1 and the methodologies they've presented in their paper, I want to highlight a passage from the DeepSeekMath paper, based on a point Yannic Kilcher made in his video.
+
These findings indicate that RL enhances the model's overall performance by rendering the output distribution more robust, in other words, it seems that the improvement is attributed to boosting the correct response from TopK rather than the enhancement of fundamental capabilities.
+
In other words, RL fine-tuning tends to shape the output distribution so that the highest-probability outputs are more likely to be correct, even though the overall capability (as measured by the diversity of correct answers) is largely present in the pretrained model.
+
This suggests that reinforcement learning on LLMs is more about refining and "shaping" the existing distribution of responses rather than endowing the model with entirely new capabilities. +Consequently, while RL techniques such as PPO and GRPO can produce substantial performance gains, there appears to be an inherent ceiling determined by the underlying model's pretrained knowledge.
+
It is unclear to me how far RL will take us. Perhaps it will be the stepping stone to the next big milestone. I'm excited to see how it unfolds!
+
Running DeepSeek-R1
+
I've used DeepSeek-R1 via the official chat interface for various problems, which it seems to solve well enough. The additional search functionality makes it even nicer to use.
+
Interestingly, o3-mini(-high) was released as I was writing this post. From my initial testing, R1 seems stronger at math than o3-mini.
+
I also rented a single H100 via Lambda Labs for $2/h (26 CPU cores, 214.7 GB RAM, 1.1 TB SSD) to run some experiments. +The main goal was to see how the model would perform when deployed on a single H100 GPU, not to extensively test the model's capabilities.
+
671B via Llama.cpp
+
DeepSeek-R1 1.58-bit (UD-IQ1_S) quantized model by Unsloth, with a 4-bit quantized KV-cache and partial GPU offloading (29 layers running on the GPU), running via llama.cpp:
+
29 layers seemed to be the sweet spot given this configuration.
+
Performance:
+
An r/localllama user described that they were able to get over 2 tok/sec with DeepSeek R1 671B, without using their GPU on their local gaming setup. +Digital Spaceport wrote a full guide on how to run DeepSeek R1 671B fully locally on a $2000 EPYC server, on which you can get ~4.25 to 3.5 tokens per second.
+
As you can see, the tokens/s isn't quite bearable for any serious work, but it's fun to run these large models on accessible hardware.
+
What matters most to me is a combination of usefulness and time-to-usefulness in these models. Since reasoning models need to think before answering, their time-to-usefulness is usually higher than other models, but their usefulness is also usually higher. +We need to both maximize usefulness and minimize time-to-usefulness.
+
70B via Ollama
+
70.6b params, 4-bit KM quantized DeepSeek-R1 running via Ollama:
+
GPU utilization shoots up here, as expected when compared to the mostly CPU-powered run of 671B that I showcased above.
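+
A rough back-of-the-envelope calculation helps explain the difference between the two runs. The sketch below estimates weight storage only, assuming 80 GB of memory on the H100 and a nominal 4 bits per weight for the 70B model (the real average for a mixed quantization scheme differs somewhat); it ignores the KV-cache and activations.

```python
def weight_footprint_gb(num_params, bits_per_weight):
    """Approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9

GPU_MEMORY_GB = 80  # assumption: 80 GB H100 variant

full_model_gb = weight_footprint_gb(671e9, 1.58)  # 671B at 1.58 bits
distilled_gb = weight_footprint_gb(70.6e9, 4)     # 70.6B at a nominal 4 bits

for name, size in [("671B @ 1.58-bit", full_model_gb), ("70.6B @ 4-bit", distilled_gb)]:
    fits = size <= GPU_MEMORY_GB
    print(f"{name}: ~{size:.1f} GB of weights, fits on one GPU: {fits}")
```

Under these assumptions the 671B weights exceed a single GPU's memory, which is consistent with offloading only part of the layers, while the 70B weights fit entirely and can keep the GPU busy.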
+
Resources
+
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
+[2402.03300] DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
+DeepSeek R1 - Notion (Building a fully local "deep researcher" with DeepSeek-R1 - YouTube).
+DeepSeek R1's recipe to replicate o1 and the future of reasoning LMs.
+The Illustrated DeepSeek-R1 - by Jay Alammar.
+Explainer: What's R1 & Everything Else? - Tim Kellogg.
+DeepSeek R1 Explained to your grandma - YouTube
+
DeepSeek
+
- Try R1 at chat.deepseek.com.
+- GitHub - deepseek-ai/DeepSeek-R1.
+- deepseek-ai/Janus-Pro-7B · Hugging Face (January 2025): Janus-Pro is a novel autoregressive framework that unifies multimodal understanding and generation. It can both understand and generate images.
+- DeepSeek-R1: Incentivizing Reasoning Capability in Large Language Models via Reinforcement Learning (January 2025) This paper introduces DeepSeek-R1, an open-source reasoning model that rivals the performance of OpenAI's o1. It presents a detailed methodology for training such models using large-scale reinforcement learning techniques.
+- DeepSeek-V3 Technical Report (December 2024) This report discusses the implementation of an FP8 mixed precision training framework validated on an extremely large-scale model, achieving both accelerated training and reduced GPU memory usage.
+- DeepSeek LLM: Scaling Open-Source Language Models with Longtermism (January 2024) This paper explores scaling laws and presents findings that facilitate the scaling of large-scale models in open-source configurations. It introduces the DeepSeek LLM project, dedicated to advancing open-source language models with a long-term perspective.
+- DeepSeek-Coder: When the Large Language Model Meets Programming - The Rise of Code Intelligence (January 2024) This research introduces the DeepSeek-Coder series, a range of open-source code models trained from scratch on 2 trillion tokens. The models are pre-trained on a high-quality project-level code corpus and employ a fill-in-the-blank task to enhance code generation and infilling.
+- DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model (May 2024) This paper presents DeepSeek-V2, a Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference.
+- DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence (June 2024) This research introduces DeepSeek-Coder-V2, an open-source Mixture-of-Experts (MoE) code language model that achieves performance comparable to GPT-4 Turbo in code-specific tasks.
+
Interesting events
+
- Hong Kong University replicates R1 results (Jan 25, '25).
+- Huggingface announces huggingface/open-r1: Fully open reproduction of DeepSeek-R1 to replicate R1, fully open source (Jan 25, '25).
+- OpenAI researcher confirms the DeepSeek team independently found and used some core ideas the OpenAI team used on the way to o1
\ No newline at end of file