09-04-2024, 11:04
#7
Senior Member
Joined: Jan 2008
Posts: 11186
Interesting paper from a Meta researcher:
https://arxiv.org/abs/2404.05405
Quote:
[Submitted on 8 Apr 2024]
Physics of Language Models: Part 3.3, Knowledge Capacity Scaling Laws
Scaling laws describe the relationship between the size of language models and their capabilities. Unlike prior studies that evaluate a model's capability via loss or benchmarks, we estimate the number of knowledge bits a model stores. We focus on factual knowledge represented as tuples, such as (USA, capital, Washington D.C.) from a Wikipedia page. Through multiple controlled datasets, we establish that language models can and only can store 2 bits of knowledge per parameter, even when quantized to int8, and such knowledge can be flexibly extracted for downstream applications. Consequently, a 7B model can store 14B bits of knowledge, surpassing the English Wikipedia and textbooks combined based on our estimation.
More broadly, we present 12 results on how (1) training duration, (2) model architecture, (3) quantization, (4) sparsity constraints such as MoE, and (5) data signal-to-noise ratio affect a model's knowledge storage capacity.
Notable insights include:
- The GPT-2 architecture, with rotary embedding, matches or even surpasses LLaMA/Mistral architectures in knowledge storage, particularly over shorter training durations. This arises because LLaMA/Mistral uses GatedMLP, which is less stable and harder to train.
- Prepending training data with domain names (e.g., wikipedia.org) significantly increases a model's knowledge capacity. Language models can autonomously identify and prioritize domains rich in knowledge, optimizing their storage capacity.
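To make the "knowledge bits" idea more concrete: the paper measures knowledge as (name, attribute, value) tuples, and the information content of such a tuple set can be estimated by counting how many bits it takes to pin down each value. A minimal sketch below; the attributes and counts are made up for illustration, and this is not the paper's exact synthetic-dataset construction.
Code:
import math

# Illustrative estimate (not the paper's exact formula): lower-bound the information
# content of a set of (name, attribute, value) tuples by the bits needed to pin down
# each value among its possible choices.
def knowledge_bits(num_entities, attributes):
    """attributes: dict mapping attribute name -> number of possible values."""
    bits_per_entity = sum(math.log2(n_values) for n_values in attributes.values())
    return num_entities * bits_per_entity

# Toy example: 1M people, each with a birth year (150 choices), a city (1,000 choices)
# and an employer (10,000 choices).
total = knowledge_bits(1_000_000, {"birth_year": 150, "city": 1_000, "employer": 10_000})
print(f"{total / 1e6:.1f} Mbits of factual knowledge")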
Twitter thread from the author: https://twitter.com/ZeyuanAllenZhu/s...13016592040248
Quote:
Result 1/2/3: LLMs can "consistently" achieve 2 bits per parameter in storing knowledge after sufficient training; this predicts that a 7B model is capable of storing knowledge from all the English wiki + textbooks based on our estimation.
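The capacity arithmetic behind that claim is straightforward: at 2 bits per parameter, capacity scales linearly with model size, and 7B parameters give the 14B bits quoted in the abstract. A quick sanity check (pure arithmetic, nothing model-specific):
Code:
# Back-of-the-envelope check of the quoted claim: capacity = 2 bits per parameter.
def knowledge_capacity_bits(num_params, bits_per_param=2):
    return num_params * bits_per_param

for size in (1e9, 7e9, 70e9):
    bits = knowledge_capacity_bits(size)
    print(f"{size / 1e9:.0f}B params -> {bits / 1e9:.0f}B bits (~{bits / 8 / 1e9:.2f} GB of facts)")
# 7B params -> 14B bits, matching the abstract's figure.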
Quote:
Result 4/5: "all" LLMs can achieve such 2bit/param if sufficiently trained, even if all the MLP layers are removed. This is quite a universal law.
// What if models are insufficiently trained --- or equivalently, pretrain data consist of rare knowledge? See result 6/7 next.
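"Even if all the MLP layers are removed" means a transformer built from attention sublayers only. For reference, a rough PyTorch sketch of what such a block looks like; the class name and hyperparameters are mine, chosen only for illustration, and this is not the paper's training setup.
Code:
import torch
import torch.nn as nn

# Illustrative "attention-only" transformer block: residual attention, no feed-forward sublayer.
class AttentionOnlyBlock(nn.Module):
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x):
        # Causal mask: True entries mark positions a token is NOT allowed to attend to.
        seq_len = x.size(1)
        mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool, device=x.device), diagonal=1)
        h = self.norm(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=mask)
        return x + attn_out  # residual connection; the usual MLP sublayer is simply absent

x = torch.randn(2, 16, 256)           # (batch, seq, d_model)
print(AttentionOnlyBlock()(x).shape)  # torch.Size([2, 16, 256])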
Quote:
Result 6/7: If insufficiently trained, GPT2_rotary works 30% better than LLaMA/Mistral architectures in terms of storing knowledge. A closer look reveals that GatedMLP is the cause: it is less stable to train and thus not friendly for acquiring "rare knowledge" in pretrain data.
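For context, the GatedMLP blamed here is the gated (SwiGLU-style) feed-forward block used by LLaMA/Mistral, as opposed to the plain two-layer MLP of GPT-2. A side-by-side sketch below; layer sizes are arbitrary and only meant to show the structural difference.
Code:
import torch
import torch.nn as nn
import torch.nn.functional as F

# GPT-2 style feed-forward: up-projection, nonlinearity, down-projection.
class StandardMLP(nn.Module):
    def __init__(self, d_model=256, d_hidden=1024):
        super().__init__()
        self.up = nn.Linear(d_model, d_hidden)
        self.down = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        return self.down(F.gelu(self.up(x)))

# LLaMA/Mistral style gated feed-forward: an extra gate branch multiplies the up branch.
class GatedMLP(nn.Module):
    def __init__(self, d_model=256, d_hidden=1024):
        super().__init__()
        self.gate = nn.Linear(d_model, d_hidden, bias=False)
        self.up = nn.Linear(d_model, d_hidden, bias=False)
        self.down = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x):
        return self.down(F.silu(self.gate(x)) * self.up(x))

x = torch.randn(2, 16, 256)
print(StandardMLP()(x).shape, GatedMLP()(x).shape)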
Quote:
Results 8/9: scaling laws for quantization and MoE.
// Quantization to int8 does not hurt knowledge capacity even for models at max capacity => 2 bits of knowledge per parameter can be stored in int8
// MoEs with even 32 experts have great capacity => knowledge can be stored evenly on experts.
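Result 8 concerns quantizing the trained weights to int8. As a reminder of what that operation does, here is a toy round-trip of symmetric per-tensor int8 quantization; this is my own simplified version, not necessarily the exact scheme used in the paper.
Code:
import torch

# Symmetric per-tensor int8 quantization: one scale factor for the whole weight matrix.
def quantize_int8(w: torch.Tensor):
    scale = w.abs().max() / 127.0
    q = torch.clamp(torch.round(w / scale), -128, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor):
    return q.float() * scale

w = torch.randn(4096, 4096) * 0.02          # toy weight matrix
q, scale = quantize_int8(w)
err = (dequantize(q, scale) - w).abs().max().item()
print(f"max round-trip error: {err:.2e}; storage: 8 bits/param instead of 32")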
Quote:
Result 10/11/12: surprisingly, when pre-training good data (e.g., Wiki) together with "junk" (e.g., Common Crawl), an LLM's capacity on good data may decrease by 20x! A simple fix: add domain tokens to your data; LLMs can auto-detect domains rich in knowledge and prioritize them.
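The "domain token" fix is literally prepending the source domain to every pre-training document, so the model can learn which sources are knowledge-rich. A trivial sketch with made-up documents and an arbitrary prefix format:
Code:
# Prepend the source domain to each pre-training document (documents below are placeholders).
docs = [
    {"domain": "wikipedia.org", "text": "Washington, D.C. is the capital of the United States."},
    {"domain": "example-crawl.com", "text": "click here for the best deals!!!"},
]

def with_domain_prefix(doc):
    return f"<<{doc['domain']}>> {doc['text']}"

for doc in docs:
    print(with_domain_prefix(doc))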
Last edited by s12a: 09-04-2024 at 11:33.