Why SRAM in-memory compute matters: the inference memory wall
The technical center of this story is that Fractile is not building a faster GPU; it is attacking a different bottleneck. In a conventional GPU stack, transformer inference repeatedly shuttles model weights between high-bandwidth memory (HBM) and the compute units for every token generated. Once models are large and batches are small, that data movement, not raw FLOPs, dictates how many tokens per second a chip can produce and how much energy each token costs. Fractile's design instead stores the data a computation needs directly next to the transistors that perform the arithmetic: an in-memory compute (IMC) architecture built on SRAM and RISC-V control logic.
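To make the bottleneck concrete, here is a back-of-envelope sketch, using assumed, non-Fractile numbers (a 70B-parameter FP16 model and roughly H100-class HBM bandwidth), of how memory traffic alone caps single-stream decode speed:

```python
# Illustrative arithmetic only: assumed model size and bandwidth, not Fractile's figures.
# At batch size 1, generating a token means streaming essentially every weight
# from HBM once, so bandwidth, not FLOPs, sets the ceiling on tokens per second.

def bandwidth_bound_tokens_per_sec(params: float, bytes_per_param: float,
                                   mem_bw_bytes_per_sec: float) -> float:
    """Upper bound on decode rate when each token must read all weights once."""
    bytes_per_token = params * bytes_per_param
    return mem_bw_bytes_per_sec / bytes_per_token

# Assumptions: 70e9 parameters, FP16 (2 bytes each), ~3.35 TB/s of HBM bandwidth.
rate = bandwidth_bound_tokens_per_sec(70e9, 2, 3.35e12)
print(f"~{rate:.0f} tokens/s per user stream")  # ~24 tokens/s, regardless of available FLOPs
```

Under those assumptions the ceiling is roughly 24 tokens per second per stream, which is why serving large models at small batch sizes is so expensive on conventional hardware.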
The payoff Fractile claims from collapsing that memory hierarchy is dramatic: roughly a hundred-fold increase in effective bandwidth and, by extension, the option to either serve users 100x faster or let the model 'think 100 times harder' at the same throughput. The public marketing claims run to 100x speed, a 10x cost reduction, and 20x better performance per watt versus Nvidia GPUs, with ComputerWeekly citing a figure of 50x the speed at 10% of the cost for trained-model inference. Whether the silicon actually hits those numbers in production is a separate question, but the architectural premise, that the inference workload is memory-bound rather than compute-bound, is exactly the bet several non-GPU challengers are now making, and it is what makes Fractile interesting to a buyer like Anthropic rather than just to a sovereign-AI policy file.
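A quick roofline-style check makes that premise tangible. The figures below are assumed H100-class public specs, not anything Fractile has published, but the gap they expose is the whole argument:

```python
# Roofline sketch with assumed H100-class specs (not Fractile data): compare how many
# FLOPs the chip can perform per byte it moves ("machine balance") with how many FLOPs
# batch-1 decoding actually performs per byte of weights read.

peak_flops = 989e12            # assumed dense FP16 throughput, FLOP/s
mem_bw = 3.35e12               # assumed HBM bandwidth, bytes/s
machine_balance = peak_flops / mem_bw   # ~295 FLOPs available per byte moved

# Batch-1 decoding is dominated by matrix-vector products: ~2 FLOPs per weight,
# with each FP16 weight costing 2 bytes of traffic, so ~1 FLOP per byte.
decode_intensity = 2 / 2

print(f"machine balance: ~{machine_balance:.0f} FLOPs/byte")
print(f"decode intensity: ~{decode_intensity:.0f} FLOP/byte")
# The workload sits two orders of magnitude below the balance point, so the compute
# units idle while weights stream in; raising effective bandwidth (for example by
# holding weights in on-die SRAM) is the lever the IMC pitch pulls.
```

On these assumptions the chip could perform roughly 295 floating-point operations for every byte it fetches, while low-batch decoding only asks for about one, which is the quantitative sense in which this workload is memory-bound rather than compute-bound.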


