DevHunt

Forge generates optimized GPU kernels from any PyTorch or HuggingFace model. It runs 32 parallel Coder+Judge agents that compete to find the fastest CUDA/Triton implementation. Results are up to 5× faster than torch.compile(mode='max-autotune') with 97.6% correctness. Enter a HuggingFace model ID and get optimized kernels for every layer. Powered by an optimized NVIDIA Nemotron 3 Nano 30B running at 250k tokens/sec. "Full refund if we don't beat torch.compile."
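The "candidates compete, a judge picks the fastest correct one" idea can be illustrated with a small sketch. This is not Forge's implementation: the function names (`reference`, `candidate_ok`, `candidate_wrong`, `is_correct`, `best_time`, `select_fastest`) are hypothetical, and plain Python functions stand in for GPU kernels. It only shows the general pattern of checking each candidate against a baseline output and then keeping the fastest one that passes.

```python
import time


def reference(xs):
    """Baseline output the candidates must match (a stand-in for a layer)."""
    return [2.0 * x + 1.0 for x in xs]


def candidate_ok(xs):
    """Hypothetical candidate that computes the same result a different way."""
    return [x + x + 1.0 for x in xs]


def candidate_wrong(xs):
    """Hypothetical candidate with a bug: it drops the +1.0 term."""
    return [2.0 * x for x in xs]


def is_correct(candidate, baseline, inputs, tol=1e-6):
    """Judge step: compare candidate output with the baseline elementwise."""
    got, want = candidate(inputs), baseline(inputs)
    if len(got) != len(want):
        return False
    return all(abs(g - w) <= tol for g, w in zip(got, want))


def best_time(fn, inputs, repeats=5):
    """Return the lowest wall-clock time over several runs, in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(inputs)
        best = min(best, time.perf_counter() - start)
    return best


def select_fastest(candidates, baseline, inputs):
    """Return (name, seconds) for the fastest correct candidate, or None."""
    winner = None
    for name, fn in candidates.items():
        if not is_correct(fn, baseline, inputs):
            continue  # incorrect candidates are never timed or selected
        elapsed = best_time(fn, inputs)
        if winner is None or elapsed < winner[1]:
            winner = (name, elapsed)
    return winner


if __name__ == "__main__":
    data = [float(i) for i in range(1000)]
    pool = {"ok": candidate_ok, "wrong": candidate_wrong}
    print(select_fastest(pool, reference, data))
```

A real GPU comparison would also need warm-up runs and device synchronization before reading the clock, which this CPU-only sketch leaves out.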

Added January 6, 2026
