Cheng Zeyi (成泽毅)

Software Engineer

Education

2016-2020

Bachelor of Science in Computer Science and Technology, Wuhan University

GPA: 3.65/4.0

Majored in Computer Science with a keen interest in compilers and computer architecture. For my graduation project, I developed a smart Java debugger that launches a Java program and, in principle, uses the Java Debug Protocol to adjust the program’s state based on crash reports, reproducing crashes to help developers identify the cause.

Current Role

Software Engineer, SiliconFlow Inc

Expertise

Performance Optimization for Deep Learning Inference (CUDA, CUTLASS, Triton, PyTorch)

Performance Optimization for Computer Vision Algorithms (CUDA, OpenCV, OpenMP)

Python/C++ Hybrid Programming

GPU Video Processing

Work Experience

Software Engineer, SiliconFlow Inc

2023.12-Present

I open-sourced stable-fast (https://github.com/chengzeyi/stable-fast), an inference performance optimization framework, on GitHub. The project is effective at optimizing Hugging Face Diffusers and has more than 1,000 stars. It is under active development, and support for other implementations of Stable Diffusion is coming soon.

I also develop and maintain AIGC-related inference performance optimization projects for SiliconFlow, including OneDiff, which outperform all other implementations.

I am now focusing on xelerate, a new acceleration framework that uses the latest NVIDIA CUDA hardware features and PyTorch 2.0 APIs. It can achieve peak performance on NVIDIA A100, H100, and RTX 4090 GPUs, supports a wide range of optimization techniques, including INT8 and FP8 quantization, and is expected to work seamlessly with newer models such as FLUX, CogVideoX, and SD3. On a single NVIDIA H100 GPU, it reduces the inference time of FLUX.1-dev for a single 1024x1024 image with 28 steps to 2.7 seconds, while keeping nearly the same quality and fully supporting dynamic shapes and dynamic LoRA switching. Even on a single NVIDIA RTX 4090 GPU, it achieves 6.1 seconds per 1024x1024 image with 28 steps for FLUX.1-dev.

Some key features of xelerate include:

Technical Expert, Alibaba Group

2022-2023.10

Primarily responsible for inference performance optimization and the development and maintenance of the Quark Intelligent Scanner project, aiming to tap into the expansive camera scanner application market. Our project employs complex deep learning models on cloud servers, distinguishing us from competitors deploying traditional computer vision algorithms locally on smartphones.

Our system mainly focuses on optimizing the TorchScript Engine, a promising deployment and optimization technique I identified two years ago. I’ve implemented GPU-accelerated image preprocessing and post-processing algorithms, as well as traditional CV image processing algorithms. After continuous enhancements emphasizing operator fusion, graph rewriting, and memory optimization, the system has become stable and efficient.

I developed the ICE computational acceleration framework, which has significantly accelerated computations for Quark’s online inference services. The ICE framework integrates various techniques, such as:

To date, the ICE acceleration framework has significantly accelerated models such as Transformer OCR, Swin Transformer, NAFNet, Swin2SR, and Real-ESRGAN.

Recently, we have aimed to make a business breakthrough in the AIGC domain with the Stable Diffusion model. This project requires on-the-fly fine-tuning of the base model on personalized sample images uploaded by users, followed by large-scale inference. Using the ICE engine, I have achieved an inference performance of 60 it/s on an NVIDIA A100 for Stable Diffusion v1.5 (512x512 resolution), delivering images in half a second, and it’s compatible with ControlNet, LoRA, and other…

Given our high expenditure on GPU-intensive computations, cost optimization is crucial. Our system processes over 10 million user-uploaded document images daily with fewer than 200 GPU cards, generating considerable profit. As our competitors continue to struggle with inefficient systems, their users are gradually transitioning to our product.

Senior Software Engineer, Alibaba Group

2021-2022

I was responsible for developing and maintaining our OCR and image document format restoration services. I designed an XML document format protocol for the Quark browser and, drawing on my understanding of graph algorithms from discrete mathematics, researched an algorithm to restore Excel table structures for our Word/Excel structural restoration product. The system was still under active development when I moved to a new role to focus on other priorities.

Additionally, I developed a framework to integrate models into the Quark browser. This framework allows developers to write model invocation code once and deploy it across multiple platforms like cloud servers, desktops, and smartphones, each interfacing with distinct inference acceleration frameworks. This framework has facilitated over half of the client-side ML projects in the Quark browser.

Software Engineer, Alibaba Group

2020-2021

I inherited a poorly maintained GPU-based video encoding and decoding framework. Recognizing its limitations, I decided to completely rewrite the system. By studying NVIDIA CUDA programming documentation and prominent video codec frameworks like FFmpeg, I redesigned the API and data structures, rigorously testing for performance and format compatibility. This system efficiently utilizes NVCODEC for acceleration and switches to other implementations when compatibility is required. The project’s rewrite pr…

Additionally, I wrote native C++ code that invokes MNN to deploy compact ML models, and implemented CV algorithms in the Quark browser.

Backend Software Development Intern, ByteDance

2019

Developed payment systems.

Languages

Chinese (Native or Bilingual Proficiency)

English (Professional Proficiency)

IELTS English Language Test: 7.0

GRE Exam: Verbal Reasoning 154, Quantitative Reasoning 169, Analytical Writing 4.0

Projects

piflux (closed source): Accelerating FLUX Inference with Multiple GPUs.

Technologies: CUDA, PyTorch, PyTorch Distributed, Diffusion Transformer

piflux is one of the fastest frameworks for multi-GPU FLUX inference. It is not open-sourced yet. It is designed to work seamlessly with xelerate, using very fine-grained sequence-level parallelism and attention KV cache strategies. On 2 NVIDIA H100 GPUs, it reduces the inference time of FLUX.1-dev for a single 1024x1024 image with 28 steps to 1.7 seconds, while keeping nearly the same quality.

xelerate (closed source): Best PyTorch Inference Optimization Framework

Technologies: C++, CUDA, PyTorch, OpenAI Triton, TorchDynamo, TorchInductor

xelerate is the fastest inference performance optimization framework for deep learning models. It is not open-sourced yet, but it outperforms all other implementations. It is on par with NVIDIA TensorRT 10 while offering more flexibility and better compatibility with PyTorch 2.0 APIs, and it can achieve peak performance on NVIDIA A100, H100, and RTX 4090 GPUs. It also supports a wide range of quantization techniques, including INT8 and FP8 quantization, and is expected to work seamlessly with newer models such as FLUX and CogVideoX.

stable-fast (open source): A Lightweight Inference Performance Optimization Framework for Stable Diffusion

Technologies: C++, CUDA, PyTorch, OpenAI Triton

Open-sourced on GitHub, with more than 1,000 stars.

stable-fast (https://github.com/chengzeyi/stable-fast)

ICE Deep Learning Computational Acceleration Framework

Technologies: C++, CUDA, PyTorch, OpenAI Triton

I led its design and development. The framework includes basic operator extensions, all of which support backward propagation and reduce GPU memory requirements during training through a degree of recomputation:

This acceleration framework supports multiple frontends (TorchDynamo, TorchScript, functorch) and is highly compatible with mainstream algorithm frameworks (Hugging Face Transformers, Diffusers, etc.).

nvJPEG Image Encoding Extension

Technologies: C++, CUDA, PyTorch

Fully compatible with PyTorch, this extension supports various chroma subsampling formats and is fast: on an RTX 3090 Ti, it can encode over 1,000 images per second.

OCR-Based Excel Table Structure Restoration Algorithm

Technologies: Python, NumPy

This algorithm can restore intricate table structures from discrete line detection results. It’s employed in multiple online services of the Quark browser, such as Quark File Scanner and Quark Table Restoration.

Fixed-Size Memory Allocation Library Based on Multiway Trees

Technologies: C++, Linux

Built around __builtin_ctzll (count trailing zeros), this multiway tree data structure allows fast allocation and release of fixed-size memory blocks. In some scenarios it is faster than TCMalloc, with minimal fragmentation, and it is integrated into our stream processing framework.
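
The idea can be illustrated with a minimal sketch. This is not the library’s actual code: it is written in Python for brevity rather than C++, it assumes a two-level tree with a fan-out of 64, and the names BitmapTreeAllocator and ctz are hypothetical. Python has no __builtin_ctzll, so the count-trailing-zeros step is emulated with integer bit operations.

```python
FANOUT = 64  # assumed fan-out: one 64-bit word per tree node


def ctz(x):
    """Count trailing zeros of a positive integer (stand-in for __builtin_ctzll)."""
    if x <= 0:
        raise ValueError("ctz is only defined here for positive integers")
    return (x & -x).bit_length() - 1


class BitmapTreeAllocator:
    """Hypothetical fixed-size slot allocator: a two-level tree of 64-bit bitmaps."""

    def __init__(self, num_slots):
        if not 0 < num_slots <= FANOUT * FANOUT:
            raise ValueError("this sketch supports 1..4096 slots")
        self.num_slots = num_slots
        num_leaves = (num_slots + FANOUT - 1) // FANOUT
        # In a leaf word, bit j set means slot (leaf * 64 + j) is free.
        self.leaves = [
            (1 << min(FANOUT, num_slots - i * FANOUT)) - 1
            for i in range(num_leaves)
        ]
        # In the root word, bit i set means leaf i still has a free slot.
        self.root = (1 << num_leaves) - 1

    def allocate(self):
        """Return the lowest free slot index, or None if the pool is exhausted."""
        if self.root == 0:
            return None
        leaf = ctz(self.root)         # first leaf with a free slot
        bit = ctz(self.leaves[leaf])  # first free slot inside that leaf
        self.leaves[leaf] &= ~(1 << bit)
        if self.leaves[leaf] == 0:    # leaf is now full: clear its root bit
            self.root &= ~(1 << leaf)
        return leaf * FANOUT + bit

    def free(self, slot):
        """Mark a previously allocated slot as free again."""
        if not 0 <= slot < self.num_slots:
            raise ValueError("slot out of range")
        leaf, bit = divmod(slot, FANOUT)
        if (self.leaves[leaf] >> bit) & 1:
            raise ValueError("slot is already free")
        self.leaves[leaf] |= 1 << bit
        self.root |= 1 << leaf
```

In this sketch each allocation or release touches one word per tree level, so the cost depends on the tree depth rather than on the number of slots. A real implementation would map slot indices to addresses in a preallocated memory region.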

Performance Optimization of G’MIC (CImg) Image Processing Library

Technologies: C++, Linux, OpenMP

G’MIC is one of the most popular digital image processing frameworks among GNU users. Through OpenMP acceleration and template programming, I achieved a 4-10x performance boost in all its image processing algorithms.