Forge CLI

107 upvotes

Forge generates optimized GPU kernels from any PyTorch or HuggingFace model. 32 parallel Coder+Judge agents compete to find the fastest CUDA/Triton implementation. Up to 5× faster than torch.compile(mode='max-autotune') with 97.6% correctness. Enter a HuggingFace model ID and get optimized kernels for every layer. Powered by an optimized NVIDIA Nemotron 3 Nano 30B running at 250k tokens/sec. "Full refund if we don't beat torch.compile."
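The "compete to find the fastest implementation" idea can be pictured as a small selection loop: check each candidate against a reference, then keep the fastest one that passes. The sketch below is only an illustration under assumptions, not Forge's actual pipeline or API. The names `is_correct`, `median_seconds`, and `pick_fastest_correct` are hypothetical, and plain Python functions stand in for GPU kernels.

```python
import math
import time
from typing import Callable, Dict, Optional, Sequence

# Hypothetical stand-in for a kernel: takes a list of floats, returns a list of floats.
Kernel = Callable[[Sequence[float]], Sequence[float]]


def is_correct(candidate: Kernel, reference: Kernel,
               inputs: Sequence[float], rel_tol: float = 1e-6) -> bool:
    """Judge step (assumed): compare candidate output to the reference within a tolerance."""
    expected = reference(inputs)
    actual = candidate(inputs)
    if len(actual) != len(expected):
        return False
    return all(math.isclose(a, e, rel_tol=rel_tol, abs_tol=1e-9)
               for a, e in zip(actual, expected))


def median_seconds(fn: Kernel, inputs: Sequence[float], repeats: int = 5) -> float:
    """Time several runs and return the median wall-clock duration in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(inputs)
        timings.append(time.perf_counter() - start)
    timings.sort()
    return timings[len(timings) // 2]


def pick_fastest_correct(candidates: Dict[str, Kernel], reference: Kernel,
                         inputs: Sequence[float]) -> Optional[str]:
    """Return the name of the fastest candidate that passes the correctness check."""
    best_name: Optional[str] = None
    best_time = math.inf
    for name, fn in candidates.items():
        if not is_correct(fn, reference, inputs):
            continue  # incorrect candidates never win, however fast they are
        elapsed = median_seconds(fn, inputs)
        if elapsed < best_time:
            best_name, best_time = name, elapsed
    return best_name


# Toy usage: one wrong candidate, one correct candidate.
reference = lambda xs: [x * 2 for x in xs]
candidates = {
    "wrong": lambda xs: [x * 2 + 1 for x in xs],
    "good": lambda xs: [x + x for x in xs],
}
winner = pick_fastest_correct(candidates, reference, [1.0, 2.0, 3.0])
```

In this toy run only `good` matches the reference, so it is selected regardless of timing. A real GPU comparison would also need warm-up runs and device synchronization before reading timers.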

Added January 6, 2026

More in Developer Libraries

SuperX

SuperX is an all-in-one growth toolkit for 𝕏. Get daily inspiration based on viral posts in your niche, trend-based research, and fast rewrites in your voice. Schedule posts at the best time, engage with the right accounts to get discovered, and track what works with built-in analytics.

822

Developer Libraries

Surgeflow

See the plan. Approve it. Watch it execute. SurgeFlow turns browser chaos into transparent automation across multiple tabs. Research, shopping, job applications—all with one command. Works in your browser—no need to download or install a new one.

715

Developer Libraries

Fluently Accent Guru

Are you a non-native English speaker? Check out our free English accent test. Fluently Accent Guru listens to your voice and guesses your accent in just 30 seconds. It’s fast, free, and surprisingly accurate. Try it to improve your English speaking skills!

676

Developer Libraries

Lovon AI Therapy

AI therapy you can actually talk to. Just speak naturally and get support anytime you need it.

637

Developer Libraries