Cheng Zeyi (成泽毅)

Founder & CEO, WaveSpeedAI

Education

2016-2020

Bachelor of Science in Computer Science and Technology, Wuhan University

GPA: 3.65/4.0

Majored in Computer Science with a keen interest in compilers and computer architecture. For my graduation project, I built a smart Java debugger that launches a Java program and, with the aid of the Java Debug Wire Protocol (JDWP), can adjust the program's state from a crash report to reproduce the crash, helping developers pinpoint the root cause.

Current Role

Founder & CEO of WaveSpeedAI — Building the Fastest Infrastructure for AI Image and Video Generation

Expertise

Performance Optimization for Deep Learning Inference (CUDA, CUTLASS, Triton, TorchInductor)

Performance Optimization for Computer Vision Algorithms (CUDA, OpenCV, OpenMP)

Python/C++ Hybrid Programming

GPU Video Processing

Work Experience

Founder & CEO, WaveSpeedAI

2025.3-Present

Co-founded WaveSpeedAI, a Singapore-based company building the core acceleration engine for the multimodal AI era.

WaveSpeedAI aggregates 700+ advanced AI models for image, video, and audio generation, making them faster, more efficient, and accessible. The platform delivers up to 6x faster inference and reduces compute costs by up to 67% compared to traditional cloud solutions.


Inference Engineer

2024.8-2025.3

Published Comfy-WaveSpeed, a state-of-the-art inference acceleration solution for ComfyUI that achieves a 2x speedup on popular image and video generation models including FLUX, LTXV, and HunyuanVideo, with cross-platform support for NVIDIA GPUs, Apple MPS, and AMD ROCm on Linux, Windows, and macOS.

Published ParaAttention, providing efficient Context Parallelism for multi-GPU DiT inference acceleration, along with First Block Cache, a novel caching technique that accelerates DiT inference with minimal quality loss. Adopted by leading AI inference platforms as a core component of their acceleration pipelines.
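The first-block-cache idea can be sketched as follows. This is a minimal, hypothetical illustration in NumPy, not the actual ParaAttention code; the function name, the callable `blocks` list, and the threshold are all illustrative:

```python
import numpy as np

def run_step(x, blocks, cache, threshold=0.1):
    """One denoising step with a first-block cache (illustrative sketch).

    `blocks` is a list of callables standing in for DiT transformer
    blocks. If the first block's output residual barely changes relative
    to the previous step, the remaining blocks are skipped and the cached
    residual of the rest of the stack is reused.
    """
    first_out = blocks[0](x)
    residual = first_out - x
    prev = cache.get("first_residual")
    if prev is not None:
        rel_change = np.abs(residual - prev).mean() / (np.abs(prev).mean() + 1e-8)
        if rel_change < threshold:
            # Cache hit: reuse the cached output of the remaining blocks.
            return first_out + cache["rest_residual"]
    # Cache miss: run the full stack and refresh the cache.
    h = first_out
    for blk in blocks[1:]:
        h = blk(h)
    cache["first_residual"] = residual
    cache["rest_residual"] = h - first_out
    return h
```

Because the first block's residual is a cheap proxy for how much the whole model's output will change between adjacent denoising steps, most steps can skip the bulk of the compute with little quality loss.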

Software Engineer, SiliconFlow Inc

2023.12-2024.8

Open-sourced stable-fast, an inference optimization framework for HuggingFace Diffusers with 1000+ stars on GitHub.

Developed and maintained AIGC inference optimizations for SiliconFlow, including OneDiff, achieving best-in-class performance.

Built xelerate, an acceleration framework leveraging NVIDIA CUDA hardware features and PyTorch 2.0 APIs. Achieved peak performance on A100, H100, and RTX 4090 GPUs with INT8/FP8 quantization support.

Technical Expert, Alibaba Group

2022-2023.10

Led inference performance optimization for Quark Intelligent Scanner, a cloud-based document scanning service processing 10+ million images daily with fewer than 200 GPUs.

Built GPU-accelerated image preprocessing pipelines and optimized TorchScript Engine with operator fusion, graph rewriting, and memory optimization.

Developed ICE acceleration framework for Quark’s inference services:

Accelerated Transformer OCR, Swin Transformer, NAFNET, Swin2SR, and RealESRGAN models.

Led AIGC initiatives with Stable Diffusion, enabling on-the-fly fine-tuning with user-uploaded images. Achieved 60 it/s inference on NVIDIA A100 for Stable Diffusion v1.5 (512x512), delivering images in under a second with full ControlNet and LoRA compatibility.

Senior Software Engineer, Alibaba Group

2021-2022

Developed OCR and document format restoration services for Quark browser. Designed an XML document protocol and implemented a graph-based algorithm for Excel table structure restoration.

Built cross-platform ML model deployment framework supporting cloud, desktop, and mobile with unified APIs. Framework powers over half of Quark’s client-side ML projects.

Software Engineer, Alibaba Group

2020-2021

Rewrote a GPU-based video encoding/decoding framework from scratch. Redesigned APIs and data structures based on NVIDIA CUDA documentation and FFmpeg patterns, with rigorous performance and compatibility testing. The system uses NVCODEC for acceleration with fallback for compatibility, processing over 20 million short videos daily.

Deployed compact ML models via MNN and implemented CV algorithms in native C++ for Quark browser.

Backend Software Development Intern, ByteDance

2019

Developed payment systems.

Languages

Chinese (Native or Bilingual Proficiency)

English (Professional Proficiency)

IELTS English Language Test: 7.0

GRE Exam: Verbal Reasoning 154, Quantitative Reasoning 169, Analytical Writing 4.0

Projects

Comfy-WaveSpeed (open source): A SOTA Inference Acceleration Solution for ComfyUI

Technologies: Dynamic Caching, PyTorch, ComfyUI

500+ stars on GitHub. View Project

WaveSpeedAI: Multimodal AI Acceleration Platform

Technologies: CUDA, PyTorch, Distributed Systems, Cloud Infrastructure

Co-founded and built WaveSpeedAI, a global platform providing unified API access to 700+ AI models with industry-leading inference speeds. The platform powers real-time image generation and video generation with up to 6x faster inference.

View Platform

ParaAttention (open source): Efficient Context Parallelism for DiT Inference

Technologies: Attention Mechanism, PyTorch, Distributed Computing

100+ stars on GitHub. View Project

piflux (closed source): Accelerating FLUX Inference with Multiple GPUs

Technologies: CUDA, PyTorch, PyTorch Distributed, Diffusion Transformer

Multi-GPU FLUX inference framework with fine-grained sequence-level parallelism and attention KV cache strategies. Integrates seamlessly with xelerate. Achieves 1.7s per 1024x1024 image (28 steps) on 2x H100 GPUs with near-original quality.
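The sequence-parallel attention underlying this kind of multi-GPU setup can be simulated on a single device. This is an illustrative sketch under simplifying assumptions, not piflux's implementation; the concatenations below stand in for the real all-gather collectives:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def context_parallel_attention(q_shards, k_shards, v_shards):
    """Simulated sequence-parallel attention across 'ranks' (sketch).

    Each rank holds one shard of the sequence. K and V are all-gathered
    so every rank can attend its local queries against the full sequence;
    concatenating the per-rank outputs reproduces single-device attention.
    """
    k_full = np.concatenate(k_shards, axis=0)  # stands in for all-gather(K)
    v_full = np.concatenate(v_shards, axis=0)  # stands in for all-gather(V)
    outs = []
    for q in q_shards:  # each "rank" computes only on its local queries
        scores = q @ k_full.T / np.sqrt(q.shape[-1])
        outs.append(softmax(scores) @ v_full)
    return np.concatenate(outs, axis=0)
```

Since attention is the only operation that mixes information across sequence positions in a DiT block, sharding the sequence this way parallelizes the whole model while keeping each device's per-step work proportional to its shard.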

xelerate (closed source): High-Performance PyTorch Inference Optimization Framework

Technologies: C++, CUDA, PyTorch, OpenAI Triton, TorchDynamo, TorchInductor

High-performance inference optimization framework matching TensorRT 10 performance with superior PyTorch 2.0 compatibility. Achieves peak performance on A100, H100, and RTX 4090 GPUs with INT8/FP8 quantization. Powers FLUX and CogVideoX inference.

stable-fast (open source): A Lightweight Inference Performance Optimization Framework for Stable Diffusion

Technologies: C++, CUDA, PyTorch, OpenAI Triton

1000+ stars on GitHub. View Project

ICE Deep Learning Computational Acceleration Framework

Technologies: C++, CUDA, PyTorch, OpenAI Triton

Internal acceleration framework with operator extensions supporting both forward and backward propagation:

Compatible with TorchDynamo, TorchScript, FuncTorch, and mainstream frameworks (Transformers, Diffusers).

NVJpeg Image Encoding Extension

Technologies: C++, CUDA, PyTorch

PyTorch-compatible GPU image encoding extension. Encodes 1000+ images per second on RTX 3090Ti with support for various sampling formats.

OCR-Based Excel Table Structure Restoration Algorithm

Technologies: Python, NumPy

Restores complex table structures from discrete line detection results. Powers Quark File Scanner and Quark Table Restoration services.
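One way such a restoration can work, purely as an illustrative sketch and not the production algorithm, is to treat grid cells as graph nodes and merge neighbors that lack a detected separator line between them:

```python
def restore_cells(n_rows, n_cols, missing_edges):
    """Recover merged cells from line-detection output (illustrative sketch).

    The page is modeled as an n_rows x n_cols grid graph: each grid cell
    is a node, and `missing_edges` lists pairs of adjacent cells with NO
    detected separator line between them. Connected components of this
    graph (found with union-find) are the logical, possibly merged,
    table cells.
    """
    parent = list(range(n_rows * n_cols))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    for (r1, c1), (r2, c2) in missing_edges:
        union(r1 * n_cols + c1, r2 * n_cols + c2)

    groups = {}
    for r in range(n_rows):
        for c in range(n_cols):
            groups.setdefault(find(r * n_cols + c), []).append((r, c))
    return list(groups.values())
```

The row span and column span of each merged cell then follow directly from the bounding box of its component.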

Fixed-Size Memory Allocation Library Based on Multiway Trees

Technologies: C++, Linux

High-performance memory allocator using multiway tree data structures with __builtin_ctzll. Outperforms TCMalloc in certain scenarios with minimal fragmentation. Integrated into production stream processing framework.
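The core trick, finding the lowest free slot in O(1) with a count-trailing-zeros instruction, can be sketched in Python. This shows a single-level 64-bit bitmap only; the actual library is written in C++ and generalizes the idea to a multiway tree of bitmaps:

```python
def ctz(x):
    """Count trailing zeros: the Python analogue of GCC's __builtin_ctzll."""
    return (x & -x).bit_length() - 1

class BitmapAllocator:
    """Fixed-size slot allocator over a 64-bit free bitmap (sketch).

    Bit i set means slot i is free. A tree of such bitmaps lets the real
    allocator locate a free slot among millions with a handful of ctz
    operations, keeping fragmentation minimal.
    """
    def __init__(self):
        self.free_mask = (1 << 64) - 1  # all 64 slots start free

    def alloc(self):
        if self.free_mask == 0:
            return None                       # no free slot
        slot = ctz(self.free_mask)            # lowest free slot in O(1)
        self.free_mask &= self.free_mask - 1  # clear that bit
        return slot

    def free(self, slot):
        self.free_mask |= 1 << slot           # mark the slot free again
```

Always handing out the lowest free slot keeps live allocations densely packed at the front of each arena, which is where the low-fragmentation behavior comes from.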

Performance Optimization of G’MIC (CImg) Image Processing Library

Technologies: C++, Linux, OpenMP

Achieved a 4-10x performance improvement across the image processing algorithms of G'MIC, a popular open-source image processing framework built on the CImg library, using OpenMP acceleration and template programming.