About me
Hello, I'm Jie Sun. I am a fourth-year Ph.D. student in the Department of Computer Science at Zhejiang University, supervised by Zeke Wang and Fei Wu. My research interests include machine learning systems, graph computing, and recommendation systems. I enjoy building efficient and scalable machine learning systems (for GNNs, DLRMs, and LLMs) that leverage heterogeneous hardware, such as NVMe SSDs and GPUs, to address large-scale challenges from industry. Currently, I am collaborating with Alibaba on a large-scale recommendation system that is expected to be released in the coming months.
Publications
Jie Sun, Li Su, Zuocheng Shi, Wenting Shen, Zeke Wang, Lei Wang, Jie Zhang, Wenyuan Yu, Yong Li, Jingren Zhou, Fei Wu
USENIX Annual Technical Conference (ATC), 2023
[Paper] [Code]
We build Legion, which co-designs GPU-topology-aware hierarchical graph partitioning with an NVLink-enhanced multi-GPU unified cache to accelerate large-scale GNN training. Legion minimizes CPU-GPU PCIe traffic, achieving throughput close to pure in-GPU systems even on billion-scale graphs.
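To give a flavor of the mechanism the multi-GPU unified cache builds on, here is a minimal CUDA sketch of a kernel on one GPU reading cached features that physically reside on a peer GPU through CUDA peer-to-peer access (the kind of transfer NVLink accelerates). The kernel, buffer names, and sizes are illustrative only and are not Legion's actual data structures.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread gathers one cached feature value; the cache buffer lives on a
// peer GPU, so with P2P access enabled the load travels over NVLink.
__global__ void gather_from_peer_cache(const float* peer_cache,
                                       const int* node_ids,
                                       float* out, int num_nodes, int dim) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= num_nodes * dim) return;
    int node = node_ids[i / dim];
    out[i] = peer_cache[node * dim + i % dim];  // direct remote read
}

int main() {
    const int dim = 8, num_nodes = 4, cache_nodes = 16;
    cudaSetDevice(1);                                 // feature cache on GPU 1
    float* peer_cache;
    cudaMalloc(&peer_cache, cache_nodes * dim * sizeof(float));
    cudaMemset(peer_cache, 0, cache_nodes * dim * sizeof(float));
    cudaSetDevice(0);                                 // compute on GPU 0
    cudaDeviceEnablePeerAccess(1, 0);                 // let GPU 0 read GPU 1
    float* out;     cudaMalloc(&out, num_nodes * dim * sizeof(float));
    int* node_ids;  cudaMalloc(&node_ids, num_nodes * sizeof(int));
    cudaMemset(node_ids, 0, num_nodes * sizeof(int)); // dummy node ids
    gather_from_peer_cache<<<1, 64>>>(peer_cache, node_ids, out, num_nodes, dim);
    cudaDeviceSynchronize();
    printf("gathered %d cached features from the peer GPU\n", num_nodes);
    return 0;
}
```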
Jie Sun, Mo Sun, Zheng Zhang, Zuocheng Shi, Jun Xie, Zihan Yang, Jie Zhang, Fei Wu, Zeke Wang
IEEE International Conference on Data Engineering (ICDE), 2025
[Paper] [Code]
We build Hyperion, a cost-efficient out-of-core GNN training system that achieves in-memory-like throughput on terabyte-scale graphs using only a few inexpensive NVMe SSDs. We also propose a GPU-initiated asynchronous disk IO stack that saturates the SSDs with only a few GPU cores. We believe this asynchronous disk IO stack can also benefit other out-of-core applications such as DLRM, LLM inference (KVCache on disk), and RAG systems.
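As a rough illustration of the idea behind GPU-initiated IO, the CUDA sketch below has GPU threads claim slots in a submission queue kept in pinned host memory and fill in read requests. The queue layout, the IoRequest struct, and the (omitted) NVMe engine that would drain the queue are hypothetical assumptions for this sketch, not Hyperion's actual IO stack.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

struct IoRequest {            // one page-sized read request (hypothetical)
    unsigned long long lba;   // disk logical block address
    void* dst;                // destination buffer on the GPU
};

// Each thread claims a queue slot with an atomic and writes its request,
// so a handful of GPU threads can keep many IOs in flight concurrently.
__global__ void submit_reads(IoRequest* sq, unsigned int* sq_tail,
                             const unsigned long long* lbas,
                             char* dst_pool, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    unsigned int slot = atomicAdd(sq_tail, 1u);   // claim a submission slot
    sq[slot].lba = lbas[i];
    sq[slot].dst = dst_pool + (size_t)slot * 4096;
    __threadfence_system();                       // publish to the host/SSD side
}

int main() {
    const int n = 64;
    IoRequest* sq;            cudaMallocHost(&sq, n * sizeof(IoRequest));
    unsigned int* sq_tail;    cudaMallocHost(&sq_tail, sizeof(unsigned int));
    *sq_tail = 0;
    unsigned long long* lbas; cudaMalloc(&lbas, n * sizeof(unsigned long long));
    cudaMemset(lbas, 0, n * sizeof(unsigned long long));
    char* dst_pool;           cudaMalloc(&dst_pool, (size_t)n * 4096);
    submit_reads<<<1, n>>>(sq, sq_tail, lbas, dst_pool, n);
    cudaDeviceSynchronize();
    printf("%u read requests submitted from the GPU\n", *sq_tail);
    return 0;
}
```

Because each thread only enqueues a small descriptor, a few GPU cores suffice to keep many outstanding IOs in flight while the rest of the GPU continues training.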
Jie Sun*, Zuocheng Shi*, Li Su, Wenting Shen, Zeke Wang, Yong Li, Wenyuan Yu, Wei Lin, Fei Wu, Bingsheng He, Jingren Zhou. *: Contributed equally to this project.
Annual Symposium on Principles and Practice of Parallel Programming (PPoPP), 2025
[Paper] [Code]
We build Helios, a distributed dynamic graph sampling service for online GNN inference. Helios achieves millisecond-level sampling latency on rapidly updated dynamic graphs and scales out linearly. Helios is now part of Alibaba Graph-Learn, an industrial GNN framework; see the dynamic sampling service documentation for more details (https://graph-learn.readthedocs.io/en/latest/en/dgs/intro.html).
Meng Zhang*, Jie Sun*, Qinghao Hu, Peng Sun, Zeke Wang, Yonggang Wen, Tianwei Zhang. *: Contributed equally to this project.
International Conference for High Performance Computing, Networking, Storage, and Analysis (SC), 2024
[Paper]
Graph Transformers capture global/long-range dependencies better than GNNs, but the quadratic cost of self-attention makes them hard to scale. We propose TorchGT, an algorithm-system co-design that accelerates Graph Transformer training and scales to sequence lengths of over 1M.
Qi Liu, Mo Sun, Jie Sun, Liqiang Lu, Jieru Zhao, Zeke Wang
International Conference on Field-Programmable Technology (FPT), 2023
Jie Zhang, Hongjing Huang, Jie Sun, Juan Gómez Luna, Onur Mutlu, Zeke Wang
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD), 2023
Hongjing Huang, Yingtao Li, Jie Sun, Xueying Zhu, Jie Zhang, Liang Luo, Jialin Li, Zeke Wang
IEEE Transactions on Parallel and Distributed Systems (TPDS), 2023
Education
[Sep. 2021 - Present] Zhejiang University, Ph.D. student in Computer Science (CS)
[Sep. 2017 - Jun. 2021] Zhejiang University, B.S. in Electronic Engineering (EE)
Internship
[Jun. 2024 - Present] Research Intern, NUS Xtra Group, supervised by Bingsheng He
[Nov. 2020 - Jun. 2024] Research Intern, Alibaba Group
Awards
[Jan. 2024] Alibaba Outstanding Research Intern (by Tongyi Lab)
[Jun. 2023] Outstanding Graduate Student of Zhejiang University
[Jan. 2023] EuroSys Best Poster Award, for early work on Helios
[Jan. 2021] Alibaba-Zhejiang University Joint Institute of Frontier Technologies (AZFT) Annual Outstanding Research Intern