
Bagel: the creator of TikTok unveils a promising new multimodal AI

After Baidu, DeepSeek, and Alibaba, another Chinese giant, ByteDance, has joined the AI race. If the company's name means nothing to you, its flagship creation certainly will: TikTok. A few days ago, ByteDance unveiled Bagel, a model presented as a generalist, with 7 billion active parameters (14 billion in total). It can ingest text, images, or video, then respond in any of these formats without changing architecture. Along the way, ByteDance released the code, the weights, and the documentation under an Apache 2.0 license, confirming a strategy clearly oriented toward openness. You can also try the model via this demo interface.
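For those who want to experiment locally, the open weights can be fetched from Hugging Face. A minimal sketch, assuming the model is published under a repo id of the form shown below (check ByteDance's official page for the exact identifier):

```python
# Minimal sketch: fetching the open weights from Hugging Face.
# The repo id below is an assumption based on the announcement;
# verify the exact identifier on the official model page.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="ByteDance-Seed/BAGEL-7B-MoT",  # assumed repo id
)
print(f"Weights downloaded to {local_dir}")
```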
Architecture: from MoT to grouped prediction
At the heart of Bagel is a Mixture-of-Transformer-Experts (MoT) architecture. Two distinct encoders capture, respectively, pixel-level cues and the semantic dimension of the visuals. The model is trained on billions of interleaved multimodal tokens and follows the "Next Group of Token Prediction" paradigm, which lets it produce or complete text, images, or video sequences interchangeably.
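To make the idea concrete, here is a minimal PyTorch sketch of an MoT-style block, assuming the common formulation in which all tokens share a single self-attention pass while each modality routes through its own feed-forward expert. The dimensions and the two-expert routing are illustrative, not BAGEL's actual configuration:

```python
# Illustrative Mixture-of-Transformer-Experts (MoT) block: shared attention
# over the interleaved sequence, modality-specific feed-forward experts.
import torch
import torch.nn as nn

class MoTBlock(nn.Module):
    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        # One feed-forward expert per modality: 0 = text, 1 = vision.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(2)
        ])

    def forward(self, x: torch.Tensor, modality: torch.Tensor) -> torch.Tensor:
        # Shared self-attention over the full interleaved sequence.
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)
        x = x + attn_out
        # Route each token to the expert matching its modality.
        h = self.norm2(x)
        out = torch.zeros_like(h)
        for m, expert in enumerate(self.experts):
            mask = modality == m
            out[mask] = expert(h[mask])
        return x + out

# Toy usage: one sequence of 6 interleaved text (0) and image (1) tokens.
block = MoTBlock()
tokens = torch.randn(1, 6, 512)
modality = torch.tensor([[0, 0, 1, 1, 1, 0]])
print(block(tokens, modality).shape)  # torch.Size([1, 6, 512])
```

Because only the experts matching the tokens' modalities do work, the number of parameters active per token stays well below the total, which is the mechanism behind the efficiency claims discussed further down.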
Convincing first results
According to the company, the first figures speak for themselves. On the GAIA benchmark, Bagel scores 82.42, ahead of Qwen2.5-VL and InternVL-2.5. On MME (2388), MMBench (85.0), and MM-Vet (67.2), it outdistances the best open-source models of the same size. In text-to-image generation, it reaches 0.88 on the GenEval test, a score close to that of Stable Diffusion 3, the current reference in the field. On the image-editing side, the GEdit-Bench-EN indicator shows 7.36, and IntelligentBench reaches 44.0, confirming fine-grained visual manipulation from the first public release.
Operational capabilities
Bagel is not limited to describing images: it can also generate 4K visuals from text descriptions, predict future frames in a video, or transform the style of a photograph. Its designers highlight the integrated "reasoning chain": the model can explain its logical steps across several dialogue turns, a useful capability for 3D navigation or the analysis of complex documents.
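As an illustration of how such a multimodal round trip might look in code, here is a purely hypothetical sketch; BagelPipeline, its methods, and the show_reasoning flag are placeholder names invented for this example, not BAGEL's published API:

```python
# Hypothetical usage sketch: every name below is an illustrative placeholder
# mirroring the capabilities described above, not the model's real interface.
from my_bagel_wrapper import BagelPipeline  # hypothetical wrapper module

pipe = BagelPipeline.from_pretrained("path/to/bagel-weights")

# Text-to-image generation from a plain description.
image = pipe.generate_image("a cyberpunk street at night, 4K")

# Multi-turn dialogue where the model exposes its reasoning chain.
answer = pipe.chat(
    messages=[{"role": "user", "content": "Describe this scene", "image": image}],
    show_reasoning=True,  # hypothetical flag for the integrated reasoning chain
)
print(answer)
```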
Efficient and economical
Thanks to the dynamic activation of only 7 of its 14 billion parameters, inference costs drop by around 40% compared with a dense model of the same size. An internal test cites a "cyberpunk" image generated in three seconds, with a 15% fidelity gain as measured by SSIM, a measure of similarity between two digital images.
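For reference, SSIM comparisons of this kind can be computed with scikit-image. A minimal sketch, with the two file names standing in for a reference image and a generated one:

```python
# Comparing two images with SSIM (structural similarity index) using
# scikit-image. The file names are placeholders for a reference image
# and a model-generated one.
from skimage.metrics import structural_similarity as ssim
from skimage import io, color

ref = color.rgb2gray(io.imread("reference.png"))
gen = color.rgb2gray(io.imread("generated.png"))

# SSIM ranges from -1 to 1; 1 means the two images are identical.
score = ssim(ref, gen, data_range=gen.max() - gen.min())
print(f"SSIM: {score:.3f}")
```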
It is also stated that the whole model runs on a single NVIDIA A100 graphics card, making local deployment more affordable for independent labs and studios. While it is too early to know whether this promise fully holds, it is clearly a major selling point. If Bagel really is this efficient on a single A100, it will no doubt see wide adoption, both in creative studios and in academic laboratories. In any case, its release drew 50,000 visits on Hugging Face and already 3,000 GitHub stars in barely twenty-four hours.