As Google unveiled a spate of artificial intelligence announcements at its Cloud Next conference, Mistral AI decided to step up and introduce its latest sparse mixture of experts (SMoE) model: Mixtral 8x22B.
The Paris-based startup, which raised Europe’s largest-ever seed round in June 2023 and has quickly established itself as an AI powerhouse, skipped the conventional channels of a video or blog post for its release announcement. Instead, it took an unconventional route, posting a torrent link directly on X so anyone could download and test the model.
Over the last several days, OpenAI has made GPT-4 Turbo with Vision generally available and Google has rolled out Gemini 1.5 Pro, while Meta has hinted that Llama 3 will be made available next month.
Mistral’s huge torrent file of experts
Though Mistral has not yet published details on Mixtral 8x22B, AI enthusiasts were quick to point out its size – the torrent comprises four files totalling 262GB – which makes running it locally a challenge.
“When I purchased my M1 Max Macbook I thought 32GB would be more than sufficient for what I do; after all I don’t work in art or design. Never would have predicted my interest in AI would suddenly make that space far from enough!” wrote one Reddit user discussing the new model.
Shortly after Mistral posted the Mixtral 8x22B torrent on X, the model appeared on Hugging Face, where users can further train and deploy it. Notably, since the model is pretrained, it ships without any moderation mechanisms in place. Together AI also made the model available so users can try it out for themselves.
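For readers who want to experiment once the weights are on Hugging Face, a minimal sketch using the transformers library might look like the following. The repository id and generation settings are illustrative assumptions, not details confirmed by Mistral, and a 262GB checkpoint needs multiple high-memory GPUs or aggressive quantization to run.

```python
# Minimal sketch: loading a Mixtral-style checkpoint from Hugging Face.
# The repo id below is an assumption for illustration; check the actual
# listing on Hugging Face before running.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "mistralai/Mixtral-8x22B-v0.1"  # hypothetical repo id

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    device_map="auto",   # spread the weights across available GPUs
    torch_dtype="auto",  # load in the checkpoint's native precision
)

inputs = tokenizer("Mixtral 8x22B was released via", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```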
Mistral’s sparse MoE approach combines several smaller models (the “experts”), each suited to a different category of tasks, in order to optimize both performance and cost.
According to Mistral’s website, a router network selects two of these groups (the “experts”) to process each token and combines their outputs additively. This technique increases the model’s total parameter count while limiting cost and latency, since only a fraction of the parameters is used per token.
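To make the routing mechanism concrete, here is a minimal PyTorch sketch of a top-2 sparse MoE layer in the spirit Mistral describes. The layer sizes, expert count, and feed-forward design are simplified assumptions, not Mixtral’s actual architecture.

```python
# Minimal sketch of a top-2 sparse mixture-of-experts layer
# (sizes and expert design are illustrative, not Mixtral's config).
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top2MoE(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts)  # scores each expert per token
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):                      # x: (tokens, d_model)
        scores = self.router(x)                # (tokens, num_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)   # normalize over the chosen experts
        out = torch.zeros_like(x)
        # Each token is processed only by its top-k experts; their outputs
        # are combined additively, weighted by the router's scores.
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k].unsqueeze(-1) * expert(x[mask])
        return out

tokens = torch.randn(4, 512)    # 4 tokens through the layer
print(Top2MoE()(tokens).shape)  # torch.Size([4, 512])
```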
Mistral had previously released Mixtral 8x7B using this approach. That model contains 46.7B total parameters but uses only 12.9B per token, meaning it processes input and generates output at the same speed and cost as a 12.9B-parameter model. For the new model, Reddit users estimate around 130B total parameters, with roughly 38B active when generating each token.
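As a rough back-of-the-envelope illustration of why the active parameter count sits so far below the total, the sketch below applies the “two experts out of eight” split. The per-component figures are assumptions chosen to roughly match the reported Mixtral 8x7B numbers, not a published breakdown.

```python
# Rough illustration of total vs. active parameters in a top-2-of-8 MoE.
# The split between shared and per-expert weights is an assumption chosen
# to roughly reproduce the reported 46.7B / 12.9B figures for Mixtral 8x7B.
num_experts, active_experts = 8, 2
shared_params = 1.7e9        # attention, embeddings, norms (assumed)
params_per_expert = 5.625e9  # feed-forward expert weights (assumed)

total = shared_params + num_experts * params_per_expert
active = shared_params + active_experts * params_per_expert
print(f"total:  {total/1e9:.1f}B parameters")  # ~46.7B
print(f"active: {active/1e9:.1f}B per token")  # ~13.0B
```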
At this stage, Mixtral 8x22B’s performance on benchmarks remains unknown. Users anticipate it will build on its predecessor, Mixtral 8x7B, which outperformed both Meta’s Llama 2 70B and OpenAI’s GPT-3.5 on benchmarks such as GSM-8K and MMLU while offering faster inference than either.