Hacker News

Microsoft's MAI-Code-1-Flash Scores 51% SWE-Bench Pro with Just 5B Active Params

517 points 243 comments EvanZhouDev 2026-06-03 03:47

Comments (16)
OsrsNeedsf2P 2026-06-03 03:55
So it's trained on the SWE Bench Pro evalset
lemonish97 2026-06-03 03:56
What is your evidence for this claim?
AntiRush 2026-06-03 03:57
The introductory blog post has a lot more information: https://microsoft.ai/news/introducingmai-code-1-flash/ as does the model card: https://microsoft.ai/pdf/MAI-Code-1-Flash-Model-Card.PDF The broader announcement of 7 MAI models seems to be where the "5B active" in the title comes from: https://microsoft.ai/news/building-a-hillclimbing-machine-la...
onlyrealcuzzo 2026-06-03 04:05
Gemma 4 26B-A4B scored exceptionally well with 20% fewer params, so this isn't unprecedented.
hootz 2026-06-03 04:12
I'd love to see a tokens per second metric. I always prioritize speed over raw intelligence for flash models.
capten 2026-06-03 04:12
It's so weird to me that the benchmarks remain so low, but the models are marketed as revolutionary. And if you say that low coding capabilities aren't a problem, say that to the token price hike and 'general use' model setup. Why not sell it as a math agent? Why do I have to set up 4 agents to check each others' work?
ajyoon 2026-06-03 04:13
Scroll wheel hijacked on this entire domain
tosh 2026-06-03 04:17
not open weight or at least I did not find anything indicating open weight
matchbok3 2026-06-03 04:18
Yeah this website is horrendous to use. What were they thinking?
freediddy 2026-06-03 04:20
Is 51% good enough to reliably use? There's no world in which I use an AI agent where it gets even 15% of the code wrong; that's as bad as Tesla FSD, where you need to pay attention to the road while engaging FSD. What's the point? My attention is what I'm trying to relieve, not mostly correct functionality. The only thing that matters is whether you can one-shot code like Claude or Codex. I'm not interested in a small but mostly-okay-but-annoyingly-buggy-every-now-and-then AI.
bguberfain 2026-06-03 04:21
It is good to see big companies like Microsoft launching LLMs. They have large amounts of compute power and good scientists to create useful models.
mattlondon 2026-06-03 04:22
Comparing against Claude 4.5? Aren't we up to 4.8? A bit disingenuous?
0vermorrow 2026-06-03 04:23
The latest Haiku (the smallest Anthropic model) is version 4.5. They haven't released a new version, hence the comparison to that.
VygmraMGVl 2026-06-03 04:25
Claude Opus 4.6 scores 51.9% on the same benchmark. Microsoft's result is quite good.
klardotsh 2026-06-03 04:25
They're comparing to Haiku, not Opus. Haiku is currently at 4.5. Even if it were Opus, comparing to a version number makes for an interesting snapshot-in-time comparison: if you knew how a model performed at whatever time it was in vogue, you can say "well, it looks like Model X is about 6 months/1 year/etc. behind the frontier SOTA" - which is exactly the discussion that happens in the open-weight/local LLM space. (Interestingly, MAI-Code-1-Flash does not appear to be such an open-weight model, following the western trend of locking models up.)
mentos 2026-06-03 04:26
Shouldn't the focus of the next model be system design rather than code? It seems like the work of going from a good system design to code is practically solved. Now it's a matter of the design of the system. Or is that represented in these evals?