SwellJoe 1 day ago [-]
I added this to a benchmark I've been doing of how well agents find security bugs, specifically security bugs originally found by Mythos. It performed poorly with only read/grep/ls tools, but in a follow-up test with a full shell and Python, it doubled its findings (still a poor showing, but it does at least indicate it is doing what it says on the tin: making tools to help it solve problems). It also did worse than Qwen AgentWorld, another recent post-train of Qwen 3.6 MoE intended for agentic use.
It would be really interesting to see how the Qwen 3.6 35B model compares to the 27B on your benchmark.
kordlessagain 20 hours ago [-]
Good to know. Thanks for the research!
Balinares 12 hours ago [-]
I'd have expected this to get more HN attention. Qwen 3.6 35B capability in a 9B model is a bonkers claim.
juliangoldsmith 6 hours ago [-]
It looks like they're comparing Ornith 9B to Qwen 3.5 35B, not Qwen 3.6. I guess it kind of makes sense since it's a finetune of 3.5, but I totally missed it until I looked closely.
In my brief tests, Ornith 35B performed quite well. It won't replace DeepSeek V4 Flash for me, but if it was fast and cheap enough it might.
I don't remember being super impressed with Ornith 9B, but I could see it being on par with Qwen 3.5 35B.
chid 11 hours ago [-]
I thought so too when I read the headline, but I expect it's basically Qwen3.5-9B.
nzach 12 hours ago [-]
Instead of training the model to directly answer questions, we trained the model to always write and execute the code that would solve the question?
If that is the case, isn't this just a fancy way to perform prompt optimization?
https://swelljoe.com/post/will-it-mythos/