cosmoplat


Optimizing LLMs to be great at specific tests backfires on Meta, Stability.



Hugging Face has launched its second LLM leaderboard to rank the best language models it has tested. The new leaderboard aims to be a more challenging uniform standard for testing open large language model (LLM) performance across a variety of tasks. Alibaba's Qwen models appear dominant in the leaderboard's inaugural rankings, taking three spots in the top 10.

Pumped to announce the brand new open LLM leaderboard. We burned 300 H100 to re-run new evaluations like MMLU-pro for all major open LLMs! Some learnings: - Qwen 72B is the king and Chinese open models are dominating overall - Previous evaluations have become too easy for current ... June 26, 2024

Hugging Face's second leaderboard tests language models across four tasks: knowledge testing, reasoning on extremely long contexts, complex math abilities, and instruction following. Six benchmarks are used to test these qualities, with tests including solving 1,000-word murder mysteries, explaining PhD-level questions in layman's terms, and, most daunting of all, high-school math equations. A full breakdown of the benchmarks used can be found on Hugging Face's blog.

The frontrunner of the new leaderboard is Qwen, Alibaba's LLM, which takes 1st, 3rd, and 10th place with its handful of variants. Also appearing are Llama3-70B, Meta's LLM, and a handful of smaller open-source projects that managed to outperform the pack. Notably absent is any sign of ChatGPT