12-04-2024, 13:30   #13
s12a
RecurrentGemma is a non-transformer version of Google Gemma, released a few weeks ago (super-censored-"safe", but that's another story). It achieves performance comparable to the transformer version despite being trained on fewer tokens.

https://arxiv.org/abs/2404.07839

Quote:
[Submitted on 11 Apr 2024]
RecurrentGemma: Moving Past Transformers for Efficient Open Language Models

We introduce RecurrentGemma, an open language model which uses Google's novel Griffin architecture. Griffin combines linear recurrences with local attention to achieve excellent performance on language. It has a fixed-sized state, which reduces memory use and enables efficient inference on long sequences. We provide a pre-trained model with 2B non-embedding parameters, and an instruction tuned variant. Both models achieve comparable performance to Gemma-2B despite being trained on fewer tokens.
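
For anyone wondering what the "fixed-sized state" actually buys you: below is a toy sketch (mine, not code from the paper) of a per-channel linear recurrence. The point is that the state stays the same size no matter how long the sequence gets, whereas a transformer's KV cache grows linearly with sequence length. The gating here is a plain exponential moving average, much simpler than Griffin's actual RG-LRU, and the dimensions are made up for illustration.

Code:
# Toy sketch of the core idea behind a linear-recurrence layer
# (as in Griffin/RecurrentGemma): the recurrent state has a FIXED
# size regardless of sequence length, unlike a transformer KV cache.
# Simplified illustration, not the RG-LRU from the paper.
import numpy as np

def linear_recurrence(x, decay):
    """x: (seq_len, d) inputs; decay: (d,) per-channel gate in (0, 1)."""
    h = np.zeros(x.shape[1])       # fixed-size state: always d floats
    outputs = []
    for x_t in x:                  # O(1) memory per step at inference
        h = decay * h + (1.0 - decay) * x_t
        outputs.append(h.copy())
    return np.stack(outputs)

rng = np.random.default_rng(0)
seq = rng.normal(size=(1000, 64))  # 1000 tokens, 64 channels
out = linear_recurrence(seq, decay=np.full(64, 0.9))
print(out.shape)                   # (1000, 64); the state stayed (64,)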

It's possible to train small MoE (Mixture-of-Experts) models from scratch at a relatively affordable cost. JetMoE is a fully open-source example (weights, data, code):

https://arxiv.org/abs/2404.07413

Quote:
[Submitted on 11 Apr 2024]
JetMoE: Reaching Llama2 Performance with 0.1M Dollars
Yikang Shen, Zhen Guo, Tianle Cai, Zengyi Qin

Large Language Models (LLMs) have achieved remarkable results, but their increasing resource demand has become a major obstacle to the development of powerful and accessible super-human intelligence. This report introduces JetMoE-8B, a new LLM trained with less than $0.1 million, using 1.25T tokens from carefully mixed open-source corpora and 30,000 H100 GPU hours. Despite its low cost, the JetMoE-8B demonstrates impressive performance, with JetMoE-8B outperforming the Llama2-7B model and JetMoE-8B-Chat surpassing the Llama2-13B-Chat model. These results suggest that LLM training can be much more cost-effective than generally thought. JetMoE-8B is based on an efficient Sparsely-gated Mixture-of-Experts (SMoE) architecture, composed of attention and feedforward experts. Both layers are sparsely activated, allowing JetMoE-8B to have 8B parameters while only activating 2B for each input token, reducing inference computation by about 70% compared to Llama2-7B. Moreover, JetMoE-8B is highly open and academia-friendly, using only public datasets and training code. All training parameters and data mixtures have been detailed in this report to facilitate future efforts in the development of open foundation models. This transparency aims to encourage collaboration and further advancements in the field of accessible and efficient LLMs. The model weights are publicly available at https://github.com/myshell-ai/JetMoE
They have a website here: https://research.myshell.ai/jetmoe
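
To make the "8B parameters, 2B active per token" claim concrete, here's a minimal sketch of sparsely-gated MoE routing in the spirit of what the abstract describes: a router scores N experts per token and only the top-k actually run, so active compute is a fraction of total parameters. Expert count, shapes, and the plain top-k softmax gate are illustrative assumptions on my part, not JetMoE's actual configuration.

Code:
# Minimal sketch of sparsely-gated MoE (SMoE) routing: per token,
# only the top-k experts are evaluated. Illustrative sizes only.
import numpy as np

rng = np.random.default_rng(0)
d, n_experts, k = 64, 8, 2
router_w = rng.normal(size=(d, n_experts))
experts = [rng.normal(size=(d, d)) * 0.1 for _ in range(n_experts)]

def moe_forward(x):
    """x: (d,) one token. Only k of n_experts expert matmuls run."""
    logits = x @ router_w
    top = np.argsort(logits)[-k:]   # indices of the k highest router scores
    gates = np.exp(logits[top]) / np.exp(logits[top]).sum()  # softmax over chosen
    return sum(g * (x @ experts[i]) for g, i in zip(gates, top))

y = moe_forward(rng.normal(size=d))
print(y.shape)  # (64,): 2 of 8 experts active, so ~25% of the expert compute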