The introductory blog post ( https://microsoft.ai/news/introducingmai-code-1-flash/ ) and the model card ( https://microsoft.ai/pdf/MAI-Code-1-Flash-Model-Card.PDF ) have a lot more information. The broader announcement of 7 MAI models seems to be where the "5B active" in the title comes from: https://microsoft.ai/news/building-a-hillclimbing-machine-la...
onlyrealcuzzo 2026-06-03 04:05
Gemma 4 26B-A4B scored exceptionally well with 20% fewer params, so this isn't unprecedented.
hootz 2026-06-03 04:12
I'd love to see a tokens per second metric. I always prioritize speed over raw intelligence for flash models.
capten 2026-06-03 04:12
It's so weird to me that the benchmarks remain so low, but the models are marketed as revolutionary. And if you say that low coding capabilities aren't a problem, tell that to the token price hike and the 'general use' model setup. Why not sell it as a math agent? Why do I have to set up 4 agents to check each other's work?
ajyoon 2026-06-03 04:13
The scroll wheel is hijacked on this entire domain.
tosh 2026-06-03 04:17
Not open weight, or at least I did not find anything indicating that it is.
matchbok3 2026-06-03 04:18
Yeah this website is horrendous to use. What were they thinking?
freediddy 2026-06-03 04:20
Is 51% good enough to reliably use? There's no world in which I use an AI agent where it gets even 15% of the code wrong; that's as bad as Tesla FSD, where you need to pay attention to the road while engaging FSD. What's the point? My attention is what I'm trying to relieve, not mostly correct functionality. The only thing that matters is whether you can one-shot code like Claude or Codex. I'm not interested in a small but mostly-okay-but-annoyingly-buggy-every-now-and-then AI.
bguberfain 2026-06-03 04:21
It is good to see big companies like Microsoft launching LLMs. They have a large amount of compute power and good scientists to create useful models.
mattlondon 2026-06-03 04:22
Comparing against Claude 4.5? Aren't we up to 4.8? Bit disingenuous?
0vermorrow 2026-06-03 04:23
The latest Haiku (the smallest Anthropic model) is version 4.5; they haven't released a newer version, hence the comparison to that.
VygmraMGVl 2026-06-03 04:25
Claude Opus 4.6 scores 51.9% on the same benchmark. Microsoft's result is quite good.
klardotsh 2026-06-03 04:25
They're comparing to Haiku, not Opus. Haiku is currently at 4.5. Even if it were Opus, comparing to a version number makes for an interesting snapshot-in-time comparison: if you know how a model performed at the time it was in vogue, you can say "well, it looks like Model X is about 6 months/1 year/etc. behind the frontier SOTA", which is exactly the discussion that happens in the open-weight/local LLM space. (Interestingly, MAI-Code-1-Flash does not appear to be such an open-weight model, following the western trend of locking models up.)
mentos 2026-06-03 04:26
Shouldn’t the next model’s focus be on system design rather than code? It seems like the work of going from a good system design to code is practically solved. Now it’s a matter of the design of the system. Or is that represented in these evals?