Old 03-04-2024, 13:01   #1
s12a
Machine learning and related topics

Since there is no dedicated section for this, I'm taking it upon myself to open a thread in the "Scienza e tecnica" section for new papers of possible interest on machine learning and related topics, in particular LLMs. I hope other users will be interested in discussing them and posting publications of interest.

---

I'll start with this paper from yesterday, by Anthropic:

https://www-cdn.anthropic.com/af5633...04_02_0936.pdf
Blog: https://www.anthropic.com/research/m...t-jailbreaking

Quote:
Many-shot Jailbreaking

We investigate a family of simple long-context attacks on large language models: prompting with hundreds of demonstrations of undesirable behavior. This is newly feasible with the larger context windows recently deployed by Anthropic, OpenAI and Google DeepMind. We find that in diverse, realistic circumstances, the effectiveness of this attack follows a power law, up to hundreds of shots. We demonstrate the success of this attack on the most widely used state-of-the-art closed-weight models, and across various tasks. Our results suggest very long contexts present a rich new attack surface for LLMs.
In short, this discusses how the "safety" guardrails of closed-weight LLMs can be bypassed thanks to their size (which favours in-context learning, i.e. the ability to reproduce a "job"/task given some examples) and their support for long contexts (context size). Personally, though, I've found that similar procedures also work quite well on high-end local models such as MistralAI's Mixtral-Instruct, which by default has only a mild level of "safety".
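
To make the mechanism concrete, here is a minimal sketch of how a many-shot prompt is typically assembled: a long list of fabricated user/assistant demonstration turns followed by the real request, all sent as a single prompt. This is only my own illustration (not Anthropic's code); the demo content and function names are placeholders for whatever backend and examples one actually uses.

Code:
# Minimal many-shot prompting sketch (illustrative only, not the paper's code).
def build_many_shot_prompt(demos, question, system="You are a helpful assistant."):
    """demos: list of (user, assistant) pairs presented as previous turns."""
    parts = [system]
    for user_msg, assistant_msg in demos:
        parts.append(f"User: {user_msg}")
        parts.append(f"Assistant: {assistant_msg}")
    parts.append(f"User: {question}")
    parts.append("Assistant:")            # the model continues from here
    return "\n".join(parts)

# The paper's point is that effectiveness grows with the number of shots
# (roughly a power law), so the same template is tested with 1, 8, 64, 256... demos.
demos = [("Example question 1", "Example answer 1"),
         ("Example question 2", "Example answer 2")]   # hundreds of pairs in practice
prompt = build_many_shot_prompt(demos * 128, "Target question")
print(len(prompt.split()), "words of context")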
Old 03-04-2024, 14:55   #2
s12a
However, the "dangers" highlighted by Anthropic may, for the time being, not affect open-source models all that much, because of certain limitations this paper points out (on the other hand, with those models it generally takes less effort anyway).

https://arxiv.org/abs/2404.02060
Quote:
[Submitted on 2 Apr 2024]
Long-context LLMs Struggle with Long In-context Learning

Large Language Models (LLMs) have made significant strides in handling long sequences exceeding 32K tokens. However, their performance evaluation has largely been confined to metrics like perplexity and synthetic tasks, which may not fully capture their abilities in more nuanced, real-world scenarios. This study introduces a specialized benchmark (LIConBench) focusing on long in-context learning within the realm of extreme-label classification. We meticulously selected six datasets with a label range spanning 28 to 174 classes covering different input (few-shot demonstration) length from 2K to 50K. Our benchmark requires LLMs to comprehend the entire input to recognize the massive label spaces to make correct prediction. We evaluate 13 long-context LLMs on our benchmarks. We find that the long-context LLMs perform relatively well under the token length of 20K and the performance benefits from utilizing the long context window. However, after the context window exceeds 20K, most LLMs except GPT-4 will dip dramatically. This suggests a notable gap in current LLM capabilities for processing and understanding long, context-rich sequences. Further analysis revealed a tendency among models to favor predictions for labels presented towards the end at the sequence. Their ability to reason over multiple pieces in the long sequence is yet to be improved. Our study reveals that long context understanding and reasoning is still a challenging task for the existing LLMs. We believe LIConBench could serve as a more realistic evaluation for the future long context LLMs.
Old 04-04-2024, 13:59   #3
s12a
Not a new paper, but loosely related to the Anthropic one from the other day.

https://arxiv.org/abs/2312.01552
Quote:
[Submitted on 4 Dec 2023]
The Unlocking Spell on Base LLMs: Rethinking Alignment via In-Context Learning

The alignment tuning process of large language models (LLMs) typically involves instruction learning through supervised fine-tuning (SFT) and preference tuning via reinforcement learning from human feedback (RLHF). A recent study, LIMA (Zhou et al. 2023), shows that using merely 1K examples for SFT can achieve significant alignment performance as well, suggesting that the effect of alignment tuning might be "superficial." This raises questions about how exactly the alignment tuning transforms a base LLM.
We analyze the effect of alignment tuning by examining the token distribution shift between base LLMs and their aligned counterpart. Our findings reveal that base LLMs and their alignment-tuned versions perform nearly identically in decoding on the majority of token positions. Most distribution shifts occur with stylistic tokens. These direct evidence strongly supports the Superficial Alignment Hypothesis suggested by LIMA.

Based on these findings, we rethink the alignment of LLMs by posing the research question: how effectively can we align base LLMs without SFT or RLHF? To address this, we introduce a simple, tuning-free alignment method, URIAL. URIAL achieves effective alignment purely through in-context learning (ICL) with base LLMs, requiring as few as three constant stylistic examples and a system prompt. We conduct a fine-grained and interpretable evaluation on a diverse set of examples, named JUST-EVAL-INSTRUCT. Results demonstrate that base LLMs with URIAL can match or even surpass the performance of LLMs aligned with SFT or SFT+RLHF. We show that the gap between tuning-free and tuning-based alignment methods can be significantly reduced through strategic prompting and ICL. Our findings on the superficial nature of alignment tuning and results with URIAL suggest that deeper analysis and theoretical understanding of alignment is crucial to future LLM research.
In short, it's already known that base models can be "aligned" simply by providing a few examples resembling "real" responses, achieving performance competitive with, or in some cases better than, that of chat models. So what Anthropic treats as "jailbreaking" can in practice simply be regarded as aligning the model to the user's preferences via in-context learning (ICL). And, unlike actual fine-tuning, it requires no significant computational resources, so even large models such as Llama-2-70B or Mixtral 8x7B can easily be turned into powerful chatbots without particular restrictions.
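
As a rough illustration of the URIAL recipe (my own sketch based on the abstract, not the authors' code): the base, non-chat model gets a short preamble plus a handful of fixed stylistic Q/A examples and is then asked to continue the same pattern for the new query. complete() below is a hypothetical stand-in for any base-model completion API.

Code:
# Sketch of URIAL-style tuning-free alignment via in-context learning (illustrative).
PREAMBLE = ("Below are conversations between a user and a helpful, polite, "
            "knowledgeable assistant.")

STYLISTIC_EXAMPLES = [            # a few constant, hand-written pairs
    ("What causes rainbows?",
     "Rainbows appear when sunlight is refracted and reflected inside raindrops..."),
    ("Can you suggest a beginner exercise routine?",
     "Sure! A simple starting point is three short sessions per week..."),
    ("Summarize the plot of Romeo and Juliet.",
     "Two young people from feuding families fall in love..."),
]

def urial_prompt(query: str) -> str:
    blocks = [PREAMBLE]
    for q, a in STYLISTIC_EXAMPLES:
        blocks.append(f"# User:\n{q}\n\n# Assistant:\n{a}")
    blocks.append(f"# User:\n{query}\n\n# Assistant:\n")
    return "\n\n".join(blocks)

# response = complete(urial_prompt("Explain what a Mixture-of-Experts model is."))
print(urial_prompt("Explain what a Mixture-of-Experts model is.")[:300])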
Last edited by s12a: 04-04-2024 at 14:03.
Old 05-04-2024, 00:35   #4
s12a
A new technique from Google DeepMind researchers could save compute (and time) during inference dynamically, depending on the token to be predicted.

https://arxiv.org/abs/2404.02258

Quote:
[Submitted on 2 Apr 2024]
Mixture-of-Depths: Dynamically allocating compute in transformer-based language models

Transformer-based language models spread FLOPs uniformly across input sequences. In this work we demonstrate that transformers can instead learn to dynamically allocate FLOPs (or compute) to specific positions in a sequence, optimising the allocation along the sequence for different layers across the model depth. Our method enforces a total compute budget by capping the number of tokens (k) that can participate in the self-attention and MLP computations at a given layer. The tokens to be processed are determined by the network using a top-k routing mechanism. Since k is defined a priori, this simple procedure uses a static computation graph with known tensor sizes, unlike other conditional computation techniques. Nevertheless, since the identities of the k tokens are fluid, this method can expend FLOPs non-uniformly across the time and model depth dimensions. Thus, compute expenditure is entirely predictable in sum total, but dynamic and context-sensitive at the token-level. Not only do models trained in this way learn to dynamically allocate compute, they do so efficiently. These models match baseline performance for equivalent FLOPS and wall-clock times to train, but require a fraction of the FLOPs per forward pass, and can be upwards of 50% faster to step during post-training sampling.
Twitter thread where it's explained in simple terms: https://twitter.com/TheSeaMouse/stat...Tu5ad4lXOgAtZQ
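
Here is how I picture the top-k routing described in the abstract, as a minimal PyTorch sketch (my own reading, not the DeepMind implementation): a learned router scores every token, only the k best tokens of each sequence go through attention and the MLP, and the rest skip the block via the residual path.

Code:
import torch
import torch.nn as nn

class MoDBlock(nn.Module):
    """Sketch of a Mixture-of-Depths-style block: only the top-k tokens per
    sequence (a fixed capacity fraction) are processed; the others pass through."""
    def __init__(self, d_model: int, n_heads: int, capacity: float = 0.25):
        super().__init__()
        self.router = nn.Linear(d_model, 1)           # scalar score per token
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.capacity = capacity                       # k / seq_len, fixed a priori

    def forward(self, x):                              # x: (batch, seq, d_model)
        b, s, d = x.shape
        k = max(1, int(s * self.capacity))
        scores = self.router(x).squeeze(-1)            # (batch, seq)
        topk = scores.topk(k, dim=-1).indices          # which tokens get compute
        idx = topk.unsqueeze(-1).expand(-1, -1, d)
        sel = torch.gather(x, 1, idx)                  # gather only the routed tokens
        h, _ = self.attn(sel, sel, sel)
        h = h + self.mlp(h)
        gate = torch.gather(scores, 1, topk).sigmoid().unsqueeze(-1)
        return x.scatter_add(1, idx, gate * h)         # residual add for routed tokens only

block = MoDBlock(d_model=64, n_heads=4)
y = block(torch.randn(2, 16, 64))
print(y.shape)   # (2, 16, 64): only 4 of the 16 tokens went through attention/MLP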





Old 05-04-2024, 11:24   #5
s12a
Not a paper, but of interest. From Yann LeCun (Meta's "Chief AI Scientist"):

https://twitter.com/ylecun/status/1776151785624801336

Quote:
Video of the Ding-Shum Lecture I gave at Harvard's Center of Mathematical Sciences and Applications on 2024-03-28.

"Objective-Driven AI: Towards AI systems that can learn, remember, reason, and plan"

Abstract: How could machines learn as efficiently as humans and animals?
How could machines learn how the world works and acquire common sense?
How could machines learn to reason and plan?

Current AI architectures, such as Auto-Regressive Large Language Models fall short. I will propose a modular cognitive architecture that may constitute a path towards answering these questions. The centerpiece of the architecture is a predictive world model that allows the system to predict the consequences of its actions and to plan a sequence of actions that optimize a set of objectives. The objectives include guardrails that guarantee the system's controllability and safety. The world model employs a Hierarchical Joint Embedding Predictive Architecture (H-JEPA) trained with self-supervised learning. The JEPA learns abstract representations of the percepts that are simultaneously maximally informative and maximally predictable.
Paper: https://openreview.net/forum?id=BZ5a1r-kVsf
Slides: https://drive.google.com/file/d/1Ymx...qbpd9k_bo/view
Presentation (video): https://www.youtube.com/watch?v=MiqLoAZFRSE

As is well known, the intelligence of autoregressive LLMs (whose output depends strictly on the preceding input) is for now only apparent: they are not really capable of thinking or of planning actions by looking ahead. LeCun has in mind an architecture that should solve most of these problems; at the moment at least parts of it exist, with positive results in limited applications.
Last edited by s12a: 05-04-2024 at 11:38.
Old 05-04-2024, 16:58   #6
s12a
Not a paper nor a scientific presentation, but every little bit helps.

Qwen releases Qwen1.5-32B. It uses GQA (Grouped Query Attention), so VRAM consumption with long contexts is lower than for other models in the same family (see the rough estimate sketched below). Apparently more capable than MistralAI's Mixtral 8x7B:

https://qwenlm.github.io/blog/qwen1.5-32b/
https://huggingface.co/Qwen/Qwen1.5-32B-Chat-GGUF
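
Back-of-the-envelope reason why GQA helps with long contexts (my own estimate with made-up layer/head numbers, not official Qwen figures): the KV cache scales with the number of key/value heads, so sharing them across groups of query heads shrinks it several times over.

Code:
# Rough KV-cache size: 2 (K and V) * layers * kv_heads * head_dim * seq_len * bytes.
def kv_cache_gib(layers, kv_heads, head_dim, seq_len, bytes_per_elem=2):
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem / 1024**3

# Hypothetical 32B-class config: 64 layers, head_dim 128, 32K context, FP16 cache.
full_mha = kv_cache_gib(layers=64, kv_heads=40, head_dim=128, seq_len=32768)
gqa_8    = kv_cache_gib(layers=64, kv_heads=8,  head_dim=128, seq_len=32768)
print(f"MHA (40 KV heads): {full_mha:.0f} GiB, GQA (8 KV heads): {gqa_8:.0f} GiB")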



At least in its 72B version, Qwen-Chat was among the best models, at least in benchmarks, and the first among the open-weight (downloadable) ones:
https://huggingface.co/spaces/lmsys/...na-leaderboard

Last edited by s12a: 05-04-2024 at 17:00.
Old 09-04-2024, 10:04   #7
s12a
Interesting paper from a Meta researcher:

https://arxiv.org/abs/2404.05405

Quote:
[Submitted on 8 Apr 2024]
Physics of Language Models: Part 3.3, Knowledge Capacity Scaling Laws

Scaling laws describe the relationship between the size of language models and their capabilities. Unlike prior studies that evaluate a model's capability via loss or benchmarks, we estimate the number of knowledge bits a model stores. We focus on factual knowledge represented as tuples, such as (USA, capital, Washington D.C.) from a Wikipedia page. Through multiple controlled datasets, we establish that language models can and only can store 2 bits of knowledge per parameter, even when quantized to int8, and such knowledge can be flexibly extracted for downstream applications. Consequently, a 7B model can store 14B bits of knowledge, surpassing the English Wikipedia and textbooks combined based on our estimation.

More broadly, we present 12 results on how (1) training duration, (2) model architecture, (3) quantization, (4) sparsity constraints such as MoE, and (5) data signal-to-noise ratio affect a model's knowledge storage capacity.

Notable insights include:
  • The GPT-2 architecture, with rotary embedding, matches or even surpasses LLaMA/Mistral architectures in knowledge storage, particularly over shorter training durations. This arises because LLaMA/Mistral uses GatedMLP, which is less stable and harder to train.
  • Prepending training data with domain names (e.g., wikipedia.org) significantly increases a model's knowledge capacity. Language models can autonomously identify and prioritize domains rich in knowledge, optimizing their storage capacity.
Twitter thread by the author: https://twitter.com/ZeyuanAllenZhu/s...13016592040248

Quote:
Result 1/2/3: LLMs can "consistently" achieve 2bit per parameter in storing knowledge after sufficient training; this predict a 7B model is capable of storing knowledge from all the English wiki + textbooks based on our estimation.


Quote:
Result 4/5: "all" LLMs can achieve such 2bit/param if sufficiently trained, even if all the MLP layers are removed. This is quite a universal law.
// What if models are insufficiently trained --- or equivalently, pretrain data consist of rare knowledge? See result 6/7 next.


Quote:
Result 6/7: If insufficiently trained, GPT2_rotary works 30% better than LLaMA/Mistral architectures in terms of storing knowledge. A closer look reveals that GatedMLP is the cause: it is less stable to train and thus not friendly for acquiring "rare knowledge" in pretrain data.


Quote:
Results 8/9: scaling laws for quantization and MoE.
// Quantization to int8 does not hurt knowledge capacity even for models at max capacity => 2bit of knowledge can be stored to int8
// MoEs with even 32 experts have great capacity => knowledge can be stored evenly on experts.


Quote:
Result 10/11/12: surprisingly, when pre-training good data (e.g., Wiki) together with "junks" (e.g., Common Crawl), LLM's capacity on good data may decrease by 20x times! A simple fix: add domain tokens to your data; LLMs can auto-detect domains rich in knowledge and prioritize.
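
Just to put the 2 bits/parameter figure in perspective, a quick bit of arithmetic (mine, nothing from the paper beyond the claim itself):

Code:
params = 7e9                          # a 7B-parameter model
knowledge_bits = 2 * params           # paper's claim: ~2 bits of knowledge per parameter
print(knowledge_bits / 8 / 1e9, "GB-equivalent of factual knowledge")   # 1.75
# The same model quantized to int8 weighs ~7 GB, and per the paper the capacity
# survives quantization: the 2 bits of knowledge fit inside each 8-bit parameter.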
Last edited by s12a: 09-04-2024 at 10:33.
Old 09-04-2024, 18:47   #8
s12a
LLMs and psychology. A suitably trained LLM can help users develop skills in social settings:

https://arxiv.org/abs/2404.04204

Quote:
[Submitted on 5 Apr 2024]
Social Skill Training with Large Language Models

People rely on social skills like conflict resolution to communicate effectively and to thrive in both work and personal life. However, practice environments for social skills are typically out of reach for most people. How can we make social skill training more available, accessible, and inviting? Drawing upon interdisciplinary research from communication and psychology, this perspective paper identifies social skill barriers to enter specialized fields. Then we present a solution that leverages large language models for social skill training via a generic framework. Our AI Partner, AI Mentor framework merges experiential learning with realistic practice and tailored feedback. This work ultimately calls for cross-disciplinary innovation to address the broader implications for workforce development and social equality.
Old 10-04-2024, 10:32   #9
s12a
MistralAI releases Mixtral-8x22B ... in case you have a few spare 4090s lying around...
The architecture should be similar to Mixtral 8x7B's (a MoE / Mixture of Experts model, with 2 "experts" active per token).

It appears not to be Mistral-Large (offered via API by Mistral and Microsoft) but a completely new model, and for the time being this is the base/foundation model, not the instruction-tuned version.

https://twitter.com/MistralAI/status...69263778291896
https://twitter.com/sophiamyang/stat...45947764297845



Given its size (the torrent is 262 GB for the FP16 weights), lower-precision quantized versions will be needed to use it in practice.
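
Rough numbers on why quantization is unavoidable here (my own estimate from the stated torrent size, not official figures):

Code:
fp16_gb = 262                         # size of the released FP16 weights
params_b = fp16_gb / 2                # ~131B parameters at 2 bytes each
for name, bits in [("FP16", 16), ("Q8", 8), ("Q4", 4)]:
    print(f"{name}: ~{params_b * bits / 8:.0f} GB of weights")
# Even at ~4 bits (~65 GB) the weights alone exceed two 24 GB GPUs,
# so CPU RAM offloading or more cards are needed in practice.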



A few technical details:
https://twitter.com/danielhanchen/st...12653580771674

Last edited by s12a: 10-04-2024 at 10:45.
Old 10-04-2024, 11:29   #10
s12a
RNN (Recurrent Neural Network) architectures, the ancestors of today's Transformer, can in some variants compete with or even improve on the latter. RWKV is a particularly interesting example, and today a paper was released describing in detail the improvements made in the latest versions, 5 and 6.

RWKV is also a series of open-source LLMs, not just open-weight.

https://arxiv.org/abs/2404.05892

Quote:
[Submitted on 8 Apr 2024]
Eagle and Finch: RWKV with Matrix-Valued States and Dynamic Recurrence

We present Eagle (RWKV-5) and Finch (RWKV-6), sequence models improving upon the RWKV (RWKV-4) architecture. Our architectural design advancements include multi-headed matrix-valued states and a dynamic recurrence mechanism that improve expressivity while maintaining the inference efficiency characteristics of RNNs. We introduce a new multilingual corpus with 1.12 trillion tokens and a fast tokenizer based on greedy matching for enhanced multilinguality. We trained four Eagle models, ranging from 0.46 to 7.5 billion parameters, and two Finch models with 1.6 and 3.1 billion parameters and find that they achieve competitive performance across a wide variety of benchmarks. We release all our models on HuggingFace under the Apache 2.0 license. Models at: https://huggingface.co/RWKV Training code at: https://github.com/RWKV/RWKV-LM Inference code at: https://github.com/RWKV/ChatRWKV Time-parallel training code at: https://github.com/RWKV/RWKV-infctx-trainer


Last edited by s12a: 10-04-2024 at 11:37.
Old 11-04-2024, 11:29   #11
s12a
https://arxiv.org/abs/2404.07143

Quote:
[Submitted on 10 Apr 2024]
Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention

This work introduces an efficient method to scale Transformer-based Large Language Models (LLMs) to infinitely long inputs with bounded memory and computation. A key component in our proposed approach is a new attention technique dubbed Infini-attention. The Infini-attention incorporates a compressive memory into the vanilla attention mechanism and builds in both masked local attention and long-term linear attention mechanisms in a single Transformer block. We demonstrate the effectiveness of our approach on long-context language modeling benchmarks, 1M sequence length passkey context block retrieval and 500K length book summarization tasks with 1B and 8B LLMs. Our approach introduces minimal bounded memory parameters and enables fast streaming inference for LLMs.
A paper from Google researchers describing a variant of the Transformer architecture able to extend the context (whose computational cost, with standard Transformers, normally scales with the square of its length) to values on the order of millions of tokens at fixed "cost".

With Google Gemini 1.5 Pro it had already been seen that very large contexts enable new uses and capabilities for LLMs, so techniques like the one described above are a step towards more versatile open-source (or at least open-weight) models in the future.
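
For intuition, here is a loose sketch of the compressive-memory idea as I read the abstract, using a generic linear-attention-style update (the real Infini-attention also keeps masked local attention within each segment and a learned gate mixing the two paths, which I'm omitting):

Code:
import numpy as np

class CompressiveMemory:
    """Fixed-size associative memory in the spirit of Infini-attention
    (simplified reading of the abstract, not the paper's exact formulation)."""
    def __init__(self, d: int):
        self.M = np.zeros((d, d))     # memory: size independent of sequence length
        self.z = np.zeros(d)          # normalization accumulator

    @staticmethod
    def _phi(x):                      # simple positive feature map (ELU+1-like)
        return np.where(x > 0, x + 1.0, np.exp(x))

    def write(self, K, V):            # absorb one segment's keys/values
        fK = self._phi(K)             # (seg_len, d)
        self.M += fK.T @ V
        self.z += fK.sum(axis=0)

    def read(self, Q):                # retrieve values for the current queries
        fQ = self._phi(Q)             # (q_len, d)
        return (fQ @ self.M) / (fQ @ self.z + 1e-6)[:, None]

d = 64
mem = CompressiveMemory(d)
for _ in range(1000):                 # stream 1000 segments: state stays O(d^2)
    K, V = np.random.randn(128, d), np.random.randn(128, d)
    mem.write(K, V)
print(mem.read(np.random.randn(16, d)).shape)   # (16, 64)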
Old 11-04-2024, 16:56   #12
s12a
https://arxiv.org/pdf/2404.06654.pdf

Quote:
[Submitted on 9 Apr 2024]
RULER: What's the Real Context Size of Your Long-Context Language Models?

The needle-in-a-haystack (NIAH) test, which examines the ability to retrieve a piece of information (the "needle") from long distractor texts (the "haystack"), has been widely adopted to evaluate long-context language models (LMs). However, this simple retrieval-based test is indicative of only a superficial form of long-context understanding. To provide a more comprehensive evaluation of long-context LMs, we create a new synthetic benchmark RULER with flexible configurations for customized sequence length and task complexity. RULER expands upon the vanilla NIAH test to encompass variations with diverse types and quantities of needles. Moreover, RULER introduces new task categories multi-hop tracing and aggregation to test behaviors beyond searching from context. We evaluate ten long-context LMs with 13 representative tasks in RULER. Despite achieving nearly perfect accuracy in the vanilla NIAH test, all models exhibit large performance drops as the context length increases. While these models all claim context sizes of 32K tokens or greater, only four models (GPT-4, Command-R, Yi-34B, and Mixtral) can maintain satisfactory performance at the length of 32K. Our analysis of Yi-34B, which supports context length of 200K, reveals large room for improvement as we increase input length and task complexity. We open source RULER to spur comprehensive evaluation of long-context LMs.
A paper from NVIDIA questioning the claimed ability of some LLMs to operate effectively with large contexts. The "needle in a haystack" test, i.e. finding a string buried in irrelevant text, is apparently not sufficient on its own to quantify the actual performance (comprehension, reasoning, etc.) of the various models on long texts.

This had also been pointed out in another paper linked here last week. https://www.hwupgrade.it/forum/showp...22&postcount=2
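
For reference, the vanilla needle-in-a-haystack test that RULER extends is easy to reproduce; a minimal sketch of mine (the filler text and ask_model() are placeholders):

Code:
import random

FILLER = "The grass is green. The sky is blue. The sun is bright. "
NEEDLE = "The magic number mentioned only once in this document is 74512. "

def build_haystack(n_filler: int, needle: str) -> str:
    sentences = [FILLER] * n_filler
    sentences.insert(random.randint(0, n_filler), needle)   # hide the needle anywhere
    return "".join(sentences)

context = build_haystack(5000, NEEDLE)             # tens of thousands of tokens
question = "What is the magic number mentioned in the document?"
# answer = ask_model(context + "\n\n" + question)  # hypothetical LLM call
# Pass if "74512" appears in the answer. RULER layers multiple/diverse needles,
# multi-hop tracing and aggregation tasks on top of this basic retrieval check.
print(len(context.split()), "words in the haystack")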


Last edited by s12a: 11-04-2024 at 17:02.
Old 12-04-2024, 13:30   #13
s12a
RecurrentGemma is a non-Transformer version of Google Gemma, which was released a few weeks ago (super-censored-safe, but that's another matter). Performance is comparable to the Transformer version despite being trained on fewer tokens.

https://arxiv.org/abs/2404.07839

Quote:
[Submitted on 11 Apr 2024]
RecurrentGemma: Moving Past Transformers for Efficient Open Language Models

We introduce RecurrentGemma, an open language model which uses Google's novel Griffin architecture. Griffin combines linear recurrences with local attention to achieve excellent performance on language. It has a fixed-sized state, which reduces memory use and enables efficient inference on long sequences. We provide a pre-trained model with 2B non-embedding parameters, and an instruction tuned variant. Both models achieve comparable performance to Gemma-2B despite being trained on fewer tokens.

It is possible to train small MoE (Mixture-of-Experts) models from scratch at relatively affordable cost. JetMoE is a fully open-source example (weights, data, code):

https://arxiv.org/abs/2404.07413

Quote:
[Submitted on 11 Apr 2024]
JetMoE: Reaching Llama2 Performance with 0.1M Dollars
Yikang Shen, Zhen Guo, Tianle Cai, Zengyi Qin

Large Language Models (LLMs) have achieved remarkable results, but their increasing resource demand has become a major obstacle to the development of powerful and accessible super-human intelligence. This report introduces JetMoE-8B, a new LLM trained with less than $0.1 million, using 1.25T tokens from carefully mixed open-source corpora and 30,000 H100 GPU hours. Despite its low cost, the JetMoE-8B demonstrates impressive performance, with JetMoE-8B outperforming the Llama2-7B model and JetMoE-8B-Chat surpassing the Llama2-13B-Chat model. These results suggest that LLM training can be much more cost-effective than generally thought. JetMoE-8B is based on an efficient Sparsely-gated Mixture-of-Experts (SMoE) architecture, composed of attention and feedforward experts. Both layers are sparsely activated, allowing JetMoE-8B to have 8B parameters while only activating 2B for each input token, reducing inference computation by about 70% compared to Llama2-7B. Moreover, JetMoE-8B is highly open and academia-friendly, using only public datasets and training code. All training parameters and data mixtures have been detailed in this report to facilitate future efforts in the development of open foundation models. This transparency aims to encourage collaboration and further advancements in the field of accessible and efficient LLMs. The model weights are publicly available at https://github.com/myshell-ai/JetMoE
They have a website here: https://research.myshell.ai/jetmoe
Old 16-04-2024, 09:03   #14
s12a
https://arxiv.org/abs/2404.08801

Quote:
[Submitted on 12 Apr 2024]
Megalodon: Efficient LLM Pretraining and Inference with Unlimited Context Length

The quadratic complexity and weak length extrapolation of Transformers limits their ability to scale to long sequences, and while sub-quadratic solutions like linear attention and state space models exist, they empirically underperform Transformers in pretraining efficiency and downstream task accuracy. We introduce Megalodon, a neural architecture for efficient sequence modeling with unlimited context length. Megalodon inherits the architecture of Mega (exponential moving average with gated attention), and further introduces multiple technical components to improve its capability and stability, including complex exponential moving average (CEMA), timestep normalization layer, normalized attention mechanism and pre-norm with two-hop residual configuration. In a controlled head-to-head comparison with Llama2, Megalodon achieves better efficiency than Transformer in the scale of 7 billion parameters and 2 trillion training tokens. Megalodon reaches a training loss of 1.70, landing mid-way between Llama2-7B (1.75) and 13B (1.67). Code: https://github.com/XuezheMax/megalodon
Researchers affiliated with Meta propose an architecture with linear complexity and memory requirements, to overcome the Transformer's limits in handling very large contexts. Who knows whether it will be adopted by the upcoming Llama 3.

It's interesting to note that they got enough resources from Meta to train a 7B model from scratch (!).

Old 18-04-2024, 14:07   #15
s12a
Paper from Google DeepMind researchers.

https://arxiv.org/abs/2404.11018

Quote:
[Submitted on 17 Apr 2024]
Many-Shot In-Context Learning

Large language models (LLMs) excel at few-shot in-context learning (ICL) -- learning from a few examples provided in context at inference, without any weight updates. Newly expanded context windows allow us to investigate ICL with hundreds or thousands of examples -- the many-shot regime. Going from few-shot to many-shot, we observe significant performance gains across a wide variety of generative and discriminative tasks. While promising, many-shot ICL can be bottlenecked by the available amount of human-generated examples. To mitigate this limitation, we explore two new settings: Reinforced and Unsupervised ICL. Reinforced ICL uses model-generated chain-of-thought rationales in place of human examples. Unsupervised ICL removes rationales from the prompt altogether, and prompts the model only with domain-specific questions. We find that both Reinforced and Unsupervised ICL can be quite effective in the many-shot regime, particularly on complex reasoning tasks. Finally, we demonstrate that, unlike few-shot learning, many-shot learning is effective at overriding pretraining biases and can learn high-dimensional functions with numerical inputs. Our analysis also reveals the limitations of next-token prediction loss as an indicator of downstream ICL performance.
Using Google Gemini 1.5 Pro it was possible to perform ICL (in-context learning) with a very large number of examples, since it supports very large contexts (on the order of 1 million tokens). Not too surprisingly, performance improves with many examples, but there are also indications that the pretraining biases of the underlying model can be overridden.

A similar principle is sometimes used for jailbreaking.
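
The "Reinforced ICL" variant mentioned in the abstract is easy to picture in code form; a sketch under my own assumptions (generate() and the answer parsing are hypothetical stand-ins for a real model API): the model's own chain-of-thought solutions, filtered by answer correctness, replace human-written rationales as the many-shot examples.

Code:
def generate(prompt: str) -> str:
    """Placeholder for a real LLM call (e.g. Gemini 1.5 Pro through its API)."""
    return "Step 1: ... Step 2: ... The answer is 42."

def final_answer(rationale: str) -> str:
    return rationale.rsplit("The answer is", 1)[-1].strip(" .")

def reinforced_icl_examples(problems, n_samples=4):
    """Keep model-generated chain-of-thought rationales whose final answer is
    correct, and use them as many-shot examples instead of human-written ones."""
    examples = []
    for question, gold in problems:
        for _ in range(n_samples):
            rationale = generate(f"Q: {question}\nThink step by step, then answer.")
            if final_answer(rationale) == gold:
                examples.append((question, rationale))
                break                 # one verified rationale per problem
    return examples

# The verified (question, rationale) pairs are then concatenated into one long
# prompt, exactly like ordinary many-shot ICL, before asking new questions.
print(reinforced_icl_examples([("What is 6 * 7?", "42")]))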
Old 18-04-2024, 17:30   #16
s12a
Meta Llama 3 released, at least in an initial version with 8 and 70 billion parameters. The paper will come once they finish training all the variants, apparently.

- https://llama.meta.com/llama3/
- https://ai.meta.com/blog/meta-llama-3/
- https://github.com/meta-llama/llama3
- https://github.com/meta-llama/llama3.../MODEL_CARD.md



Old Yesterday, 09:56   #17
s12a
https://arxiv.org/abs/2404.14219

Quote:
[Submitted on 22 Apr 2024]
Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone
We introduce phi-3-mini, a 3.8 billion parameter language model trained on 3.3 trillion tokens, whose overall performance, as measured by both academic benchmarks and internal testing, rivals that of models such as Mixtral 8x7B and GPT-3.5 (e.g., phi-3-mini achieves 69% on MMLU and 8.38 on MT-bench), despite being small enough to be deployed on a phone. The innovation lies entirely in our dataset for training, a scaled-up version of the one used for phi-2, composed of heavily filtered web data and synthetic data. The model is also further aligned for robustness, safety, and chat format. We also provide some initial parameter-scaling results with a 7B and 14B models trained for 4.8T tokens, called phi-3-small and phi-3-medium, both significantly more capable than phi-3-mini (e.g., respectively 75% and 78% on MMLU, and 8.7 and 8.9 on MT-bench).
Microsoft releases a paper on the upcoming Phi-3, an LLM trained on limited amounts of high-quality, heavily filtered web data and synthetic data. The benchmark results are staggering.