Partner: P. Miłoś
Conference papers
1. Pióro M., Ciebiera K.♦, Król K.♦, Ludziejewski J.♦, Krutul M.♦, Krajewski J.♦, Antoniak S.♦, Miłoś P.♦, Cygan M.♦, Jaszczur S.♦, MoE-Mamba: Efficient Selective State Space Models with Mixture of Experts, Next Generation of Sequence Modeling Architectures Workshop at International Conference on Machine Learning 2024, 2024-07-26/07-26, Vienna (AT), pp. 1-4, 2024
Abstract: State Space Models (SSMs) have become serious contenders in the field of sequential modeling, challenging the dominance of Transformers. At the same time, Mixture of Experts (MoE) has significantly improved Transformer-based Large Language Models, including recent state-of-the-art open models. We propose that to unlock the potential of SSMs for scaling, they should be combined with MoE. We showcase this on Mamba, a recent SSM-based model that achieves remarkable performance. Our model, MoE-Mamba, outperforms Mamba and matches the performance of Transformer-MoE. In particular, MoE-Mamba reaches the same performance as Mamba in 2.35x fewer training steps while preserving the inference performance gains of Mamba against Transformer.
Author affiliations:
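The abstract describes an architecture that interleaves Mamba (selective state space) blocks with Mixture-of-Experts feed-forward layers. Below is a minimal, hypothetical PyTorch sketch of that interleaving pattern only; it is not the authors' implementation. MambaBlock here is a simplified stand-in (a gated causal convolution, not a real selective SSM), and all class names, dimensions, and the top-1 routing scheme are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F


class MambaBlock(nn.Module):
    """Placeholder for a selective state space (Mamba) block; NOT the real SSM."""
    def __init__(self, d_model: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        # Stand-in sequence mixing: depthwise causal conv plus a gate.
        self.conv = nn.Conv1d(d_model, d_model, kernel_size=4, padding=3, groups=d_model)
        self.gate = nn.Linear(d_model, d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x):                      # x: (batch, seq, d_model)
        h = self.norm(x)
        mixed = self.conv(h.transpose(1, 2))[..., : x.size(1)].transpose(1, 2)
        return x + self.out(mixed * F.silu(self.gate(h)))


class MoELayer(nn.Module):
    """Top-1 (switch-style) mixture of expert feed-forward networks."""
    def __init__(self, d_model: int, d_ff: int, num_experts: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x):                      # x: (batch, seq, d_model)
        h = self.norm(x)
        flat = h.reshape(-1, h.size(-1))       # route each token independently
        probs = F.softmax(self.router(flat), dim=-1)
        weight, choice = probs.max(dim=-1)     # top-1 expert per token
        out = torch.zeros_like(flat)
        for i, expert in enumerate(self.experts):
            mask = choice == i
            if mask.any():
                out[mask] = weight[mask, None] * expert(flat[mask])
        return x + out.reshape_as(x)


class MoEMambaSketch(nn.Module):
    """Alternating Mamba-style and MoE layers: the pattern the abstract describes."""
    def __init__(self, vocab=1000, d_model=256, depth=4, experts=8):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        layers = []
        for _ in range(depth):
            layers += [MambaBlock(d_model), MoELayer(d_model, 4 * d_model, experts)]
        self.layers = nn.Sequential(*layers)
        self.head = nn.Linear(d_model, vocab)

    def forward(self, tokens):                 # tokens: (batch, seq)
        return self.head(self.layers(self.embed(tokens)))


if __name__ == "__main__":
    model = MoEMambaSketch()
    logits = model(torch.randint(0, 1000, (2, 16)))
    print(logits.shape)                        # torch.Size([2, 16, 1000])

The sketch keeps the MoE layer in the position a Transformer's feed-forward block would occupy, so each token activates only one expert per layer; the paper's reported gains concern training efficiency of this kind of combination, not this specific stand-in code.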