StepFun Step 3.7 Flash MoE Vision-Language Model Release
TECH

StepFun Step 3.7 Flash MoE Vision-Language Model Release

28+
Signals

Strategic Overview

  • 01.
    StepFun open-sourced Step 3.7 Flash on May 28-29, 2026 — a 198B-parameter sparse Mixture-of-Experts vision-language model (196B language backbone + 1.8B ViT) that activates roughly 11B parameters per token and supports a 256K context window with three selectable reasoning levels.
  • 02.
    Weights ship under Apache 2.0 in BF16, FP8, NVFP4, and GGUF formats on Hugging Face, and the model is deployable through vLLM, SGLang, TensorRT-LLM, llama.cpp, Ollama, LM Studio, and NVIDIA NIM containers — making it one of the most broadly deployable open-weight VLMs at this capability tier.
  • 03.
    The headline pitch is Advisor Mode: Step 3.7 Flash drives the full agent loop and escalates to a larger advisor model only at planning inflection points, reportedly reaching 97% of Claude Opus 4.6's SWE-Bench Verified score ($0.19 vs $1.76 per task).
  • 04.
    Native multimodality (images, video, GUI, documents) is new versus Step 3.5 Flash, which was text-only — positioning this release as a direct competitor to Gemini 2.5 Flash and Claude Sonnet 4.6 in the mid-tier agentic market.
  • 05.
    API list pricing is $0.20/M input tokens (cache miss), $0.04/M input (cache hit), and $1.15/M output, with the model also offered free for 30 days via the Nous Portal.

Advisor Mode: a 'distress signal' architecture that flips the cost curve

The cleverest piece of Step 3.7 Flash is not the model itself but how it spends compute. In Advisor Mode, the 11B-active executor drives the entire agent trajectory — tool calls, file reads, code edits — and only escalates to a larger advisor model at planning inflection points or after repeated failures [1]. The Communeify deep-dive describes it as a 'distress signal' pattern: the small model does the work, and only phones the expensive specialist when it gets stuck [2]. This is structurally different from the dominant pattern of routing every turn through a frontier model. By concentrating expensive reasoning at the few moments it actually changes the outcome, Step 3.7 Flash reportedly closes 97% of the gap to Claude Opus 4.6 on SWE-Bench Verified while spending $0.19 per task instead of $1.76 — a 9x cost ratio [1]. If the result holds outside StepFun's harness, the takeaway is that the next year of agent cost reduction may come less from cheaper frontier models and more from rationing when you call them at all.

The benchmark numbers Anthropic has to look at

Stripping out the marketing, the benchmark deltas are what give this release commercial bite. Step 3.7 Flash posts 56.26% on SWE-Bench Pro (up from 51.3% on Step 3.5 Flash), 59.55% on Terminal-Bench 2.1 (up from 53.37%), and 67.1 on ClawEval-1.1 versus 59.8 for the next competitor [3]. Cross-harness variance also tightened dramatically — Step 3.5 Flash swung between 43% and 73% depending on the agent scaffold, while Step 3.7 Flash stays in a 64.5%-71.5% band — meaning enterprises don't have to over-engineer the harness to get the headline number [3]. Pair that with $0.20 per million input tokens and $1.15 per million output, and the implied per-task cost for high-volume coding agents drops by an order of magnitude versus Opus-class APIs [4]. MarkTechPost reads this as a direct play for the same enterprise workloads Gemini 2.5 Flash and Claude Sonnet 4.6 are courting, with the open-weight wrinkle that customers can also self-host [3].

The local-runner story: why the open-weights community is rallying around it

Step 3.7 Flash is unusually friendly to the self-hosting community for a 198B-parameter model. Weights ship in NVFP4 and GGUF alongside BF16 and FP8, and the launch came with day-0 support across the open inference stack [5]. The framing matters: an 11B-active-parameter MoE in NVFP4 can plausibly run on a single 128GB workstation, which is exactly the niche that has been waiting for a credible open agentic VLM. The r/LocalLLaMA thread on launch day skewed toward practical 'will this run on my Strix Halo / Mac Studio / DGX Spark' threads rather than the usual benchmark-vs-DeepSeek debate, and one widely-quoted comment summed up the community's verdict — that the model's chain of thought is hard to read but the final answer rivals models several times its size. NVIDIA's own deployment guide reinforces the trajectory: DGX Station with 748GB coherent memory is pitched specifically for running the full 256K context locally, and DGX Spark is the recommended smaller setup [5]. This is what 'enterprise-ready local AI' looks like in 2026.

The skeptics' read: Goodhart, thinking-token overhead, and harness shopping

Reception is not uncritical. The most pointed pushback came from r/LocalLLaMA, where a post celebrating Step 3.7 Flash passing a popular reasoning riddle was met with a wave of Goodhart's-Law replies: the community treats any model that arrives with a near-Opus benchmark and a 9x cost claim as a candidate for training-data contamination. The more concrete complaint came from a separate r/unsloth quants thread reporting real-world Strix Halo (128GB) results where Step 3.7 Flash spent minutes on a problem a 35B Qwen variant solved in seconds on the same hardware. The model 'wins on quality' but it does so by burning thinking tokens — which means the published per-task dollar cost depends heavily on how aggressively you set the reasoning level (low/medium/high) and how cleanly Advisor Mode escalates [2]. The honest read is that Step 3.7 Flash's Advisor-mode numbers are likely real on the harnesses StepFun tested, but practitioners should expect to tune reasoning level and harness before trusting the dollar figure on their own workload.

Pre-IPO momentum and the China open-weight flywheel

The timing is not a coincidence. StepFun closed a B+ round exceeding RMB 5 billion (~$717M) in January 2026, led by Tencent, Qiming Venture Partners, and Shanghai State-owned Capital, and is reportedly preparing a Hong Kong IPO [6]. Open-sourcing a frontier-adjacent agentic VLM under Apache 2.0 four months later, on the same week that NVIDIA publishes a co-marketing deployment guide [5], is the kind of move that builds the developer mindshare an IPO prospectus needs. Zoom out and the picture is the same one DeepSeek and Moonshot have been painting all year: the Chinese frontier labs are not just shipping models, they are shipping permissively-licensed weights with day-0 vLLM, llama.cpp, and NIM support, in formats (NVFP4, GGUF) that make self-hosting cheap. Each release tightens that flywheel and makes 'just use the closed Western API' a harder default for a cost-sensitive enterprise CTO [3].

Historical Context

2023-04
Shanghai Jieyue Xingchen Intelligent Technology founded by former Microsoft researchers Jiang Daxin, Zhu Yibo, and Jiao Binxing.
2024-07
StepFun unveils Step-2 (trillion-parameter LLM), Step-1.5V (multimodal), and Step-1X (image generation) — establishing its multimodal track record.
2026-01-26
Closed B+ funding round exceeding RMB 5 billion (~$717M) led by Tencent, Qiming Venture Partners, and Shanghai State-owned Capital — the war chest behind 2026's release cadence.
2026-02
Step-3.5-Flash released — same 196B/11B MoE shape under Apache 2.0, but text-only. The architectural ancestor of Step 3.7 Flash.
2026-05-28
Step 3.7 Flash released — adds native vision (1.8B ViT), Advisor Mode, three reasoning tiers, and a 256K context window under Apache 2.0.

Power Map

Key Players
Subject

StepFun Step 3.7 Flash MoE Vision-Language Model Release

ST

StepFun (Shanghai Jieyue Xingchen Intelligent Technology)

Shanghai-based AI lab founded April 2023 by ex-Microsoft researchers Jiang Daxin, Zhu Yibo, and Jiao Binxing; developed and open-sourced Step 3.7 Flash. Reportedly preparing a Hong Kong IPO.

NO

Nous Research / Hermes Agent

Distribution partner — Step 3.7 Flash is offered free for 30 days via Nous Portal and is wired into mainstream agent harnesses including Hermes Agent.

NV

NVIDIA

Hardware and deployment partner — provides NIM containers, TensorRT-LLM, and NeMo-AutoModel support; DGX Station's 748GB coherent memory is positioned as the reference hardware for running the full 256K context.

TE

Tencent, Qiming Venture Partners, Shanghai State-owned Capital

Investors behind StepFun's B+ round in January 2026 exceeding RMB 5B (~$717M), funding the aggressive 2026 release cadence.

HU

Hugging Face

Primary weight distribution platform — hosts BF16, FP8, NVFP4, and GGUF quantizations under Apache 2.0.

AN

Anthropic (Claude Opus 4.6)

Benchmark reference and implicit commercial target — StepFun positions Advisor Mode as reaching 97% of Opus 4.6 coding quality at ~1/9 the per-task cost.

Fact Check

7 cited
  1. [1] Step 3.7 Flash
  2. [2] Step 3.7 Flash MoE Vision-Language Model Agentic Workflow Analysis
  3. [3] StepFun Releases Step 3.7 Flash: A 198B MoE Vision-Language Model for Coding Agents and Search Workflows
  4. [4] stepfun-ai/Step-3.7-Flash
  5. [5] Run Step 3.7 Flash on NVIDIA GPUs with Enterprise-Ready Multimodal AI
  6. [6] Shanghai AI unicorn StepFun raises over $718 million in B round
  7. [7] Step 3.7 Flash throughput report

Source Articles

Top 3

THE SIGNAL.

Analysts

"Frames Step 3.7 Flash as a direct mid-tier agentic competitor to Gemini 2.5 Flash and Claude Sonnet 4.6, with pricing and deployment options engineered to pull enterprise workloads off Western providers."

MarkTechPost analysis
AI industry publication

"Treats Advisor Mode as the key efficiency innovation — a small, fast executor with a 'distress signal' escalation to a larger advisor — and frames it as the new playbook for cost-efficient agentic AI. Communeify writes: 'Only when it gets stuck—for example, encountering a critical bottleneck requiring complex planning or repeated failures—does it send a distress signal to the larger Advisor model upstairs.'"

Communeify deep-dive review
Independent technical blog

"Positions Step 3.7 Flash as production-ready for enterprises on NVIDIA hardware via NIM, TensorRT-LLM, vLLM, and SGLang; highlights DGX Station as ideal for full 256K context. NVIDIA writes: 'With 748 GB of coherent memory, DGX Station is ideal for running Step 3.7 Flash with increased headroom for the full 256k context length, and faster local developer iteration.'"

NVIDIA Developer team
Deployment partner
The Crowd

"⚡️ Step 3.7 Flash is here: The new frontier is agent efficiency. #1 ClawEval-1.1 (67.1), #1 SimpleVQA Search (79.2), #2 SWE-PRO (56.3), 95.3 on V* Python. Open weights under Apache 2.0. Built for agentic, coding, search, and multimodal workflows — balancing speed, cost, and"

@@StepFun_ai1495

"🎉 Congrats to @StepFun_ai on releasing Step-3.7-Flash, with day-0 support in vLLM. - 198B sparse MoE vision-language model, ~11B active params per token, native image + text input - 256K context window for long docs, multi-file repos, and dense visual interfaces - FP8 and NVFP4"

@@vllm_project354

"Step 3.7 Flash was another one I was really looking for! Big jump compared to 3.5, multi modal and even better than DeepSeek V4 Flash in some benchmarks! 🔥 This could become my go to model on 128GB! Download in progress right now!"

@@ivanfioravanti201

"StepFun 3.7 Flash"

@u/Everlier383
Broadcast
Step 3.7 Flash - 198B Open Source Model That Does Everything; Does it Really?

Step 3.7 Flash - 198B Open Source Model That Does Everything; Does it Really?

Step 3.7 Flash: The Open-Source AI for AI Agents

Step 3.7 Flash: The Open-Source AI for AI Agents

I rarely praise a model like this! Step-3.7-Flash's frontend generation capability is insanely strong

I rarely praise a model like this! Step-3.7-Flash's frontend generation capability is insanely strong