TORCHSPARSE: EFFICIENT POINT CLOUD INFERENCE ENGINE
3D Sparse Convolution; optimize Gather-Matmul-Scatter dataflow; Adaptive Matmul Grouping; Quantized and Vectorized Memory access
4
3
4
2023
MLSys
THU&&SJTU
EXPLOITING HARDWARE UTILIZATION AND ADAPTIVE DATAFLOW FOR EFFICIENT SPARSE CONVOLUTION IN 3D POINT CLOUDS
3D Sparse Convolution; optimize Gather-Matmul-Scatter and fetch-on-demand dataflow; Dynamic dataflow changing; coded-CSR mapping; Parallel Processing of different workloads without padding; Pointer
4
3
3
2023
MICRO
MIT
TorchSparse++: Efficient Training and Inference Framework for Sparse Convolution on GPUs
3D Sparse Convolution; optimize Implicit Gather-Matmul-Scatter; CUDA Sparse Kernel; Sparse Autotuner by detailed workload
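The Gather-Matmul-Scatter dataflow named in the tags above can be illustrated with a minimal pure-Python sketch. It is not code from any of the listed papers: the function names (`build_kernel_map`, `sparse_conv`), the nested-list feature layout, and the submanifold assumption (output coordinates equal input coordinates) are all hypothetical choices made for illustration.

```python
from itertools import product


def build_kernel_map(coords, kernel_size=3):
    """Map each kernel offset to the (input_idx, output_idx) pairs it connects."""
    index_of = {c: i for i, c in enumerate(coords)}
    r = kernel_size // 2
    kmap = {}
    for offset in product(range(-r, r + 1), repeat=3):
        pairs = []
        for out_idx, (x, y, z) in enumerate(coords):
            neighbor = (x + offset[0], y + offset[1], z + offset[2])
            in_idx = index_of.get(neighbor)
            if in_idx is not None:
                pairs.append((in_idx, out_idx))
        if pairs:
            kmap[offset] = pairs
    return kmap


def matmul(a, b):
    """Dense (n x k) @ (k x m) product on nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def sparse_conv(coords, feats, weights, kernel_size=3):
    """Submanifold sparse convolution: gather, matmul, scatter-add per offset.

    Assumption: `weights` maps a kernel offset (dx, dy, dz) to a C_in x C_out matrix.
    """
    c_out = len(next(iter(weights.values()))[0])
    out = [[0.0] * c_out for _ in coords]
    for offset, pairs in build_kernel_map(coords, kernel_size).items():
        w = weights.get(offset)
        if w is None:
            continue
        gathered = [feats[i] for i, _ in pairs]      # gather input rows
        partial = matmul(gathered, w)                # one dense matmul per offset
        for (_, o), row in zip(pairs, partial):      # scatter-add into outputs
            for c, v in enumerate(row):
                out[o][c] += v
    return out


# Three points, one input channel, all-ones 3x3x3 kernel with one output channel.
coords = [(0, 0, 0), (1, 0, 0), (5, 5, 5)]
feats = [[1.0], [2.0], [3.0]]
weights = {off: [[1.0]] for off in product((-1, 0, 1), repeat=3)}
print(sparse_conv(coords, feats, weights))  # [[3.0], [3.0], [3.0]]
```

Each offset yields a matmul of a different size, which is the irregularity that grouping and autotuning approaches such as those tagged above aim to handle; the sketch itself performs no grouping.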
Challenge: CXL and NVM offer byte-level access with higher speed and bandwidth than storage devices. Memory disaggregation that combines DRAM (high speed and bandwidth, small capacity) with NVM (lower speed and bandwidth, large capacity) faces latency, bandwidth, and consistency challenges.
Year
Venue
Authors
Title
Tags
P
E
N
2025
ASPLOS
Purdue
EDM: An Ultra-Low Latency Ethernet Fabric for Memory Disaggregation
RMAI: Rethinking Memory for AI (Inference), In-Kernel Remote Shared Memory as a Software Alternative to CXL
CXL-like transparent remote data placement and direct addressing on RDMA; remote shared PGAS-like memory arch for expert loading
3
2
3
2026
EuroSys
RUC
LightDSA: Enabling Efficient DSA Through Hardware-Aware Transparent Optimization
5 insights of data center-aware DSA optimizations; lightweight DSA API library for contiguous allocation and 64-byte alignment; optimized recycling algorithm for out-of-order completion behavior
Challenge: Previous cache-based memory systems are not aligned with the latency, bandwidth, and access patterns of RDMA, NVM, and CXL.
Year
Venue
Authors
Title
Tags
P
E
N
2020
OSDI
MIT
AIFM: High-Performance, Application-Integrated Far Memory
object-level swapping with remoteable pointers and dereference scopes; pauseless memory evacuator using green thread co-scheduling
2
3
4
2023
SOSP
UCSD
Mira: A Program-Behavior-Guided Far Memory System
profiling-guided customizable cache section partitioning; remote pointer optimization with adaptive prefetching and eviction hints; automated computation-and-network-aware function offloading
3
3
3
2024
ASPLOS
Northwestern
Getting a Handle on Unmanaged Memory
compiler-automated translation and hosting of memory handles; thread-private stack-allocated pin sets for atomic-free tracking; extensible object-mobility runtime service interface
3
2
4
2025
ATC
THU
DSA-2LM: A CPU-Free Tiered Memory Architecture with Intel DSA
CPU-free page migration in tiered memory via data streaming accelerator; adaptable migration algorithm for mixed 4KB/2MB pages; direct in-kernel DSA integration bypassing DMA
3
3
4
2026
OSDI
UW-Madison
OBASE: Object-Based Address-Space Engineering to Improve Memory Tiering
dynamic address-space reorganization for mitigating hotness fragmentation; lightweight access tracking via guide pointer metadata; pauseless lock-free object migration
NeoMem: Hardware-Software Co-Design for CXL-Native Memory Tiering
device-side memory profiling unit; sketch-based hot page detector with error-bound estimation; dynamic hotness threshold adjustment based on statistics
2
2
3
2025
ASPLOS
Yale
PULSE: Accelerating Distributed Pointer-Traversals on Disaggregated Memory
Survey of Disaggregated Memory: Cross-layer Technique Insights for Next-Generation Datacenters
Cross-layer classification of DM techniques; hardware-level categories; architectural-level classifications; system and runtime-level groupings; application-level optimizations such as general-purpose and domain-specific approaches
Challenge: Current chip designs are often monolithic and inflexible, leading to high costs and limited performance optimization opportunities.
Solution: Use chiplets to enable more flexible and cost-effective system designs by integrating specialized dies, each manufactured with its optimal process, which improves performance and yield.
Challenge: Scaling single monolithic AI accelerators is limited by yield and reticle size. Multi-Chip-Module (MCM) architectures solve this but introduce NUMA effects and severe inter-chiplet communication bottlenecks, requiring merged pipelining and hardware-software co-design.
Year
Venue
Authors
Title
Tags
P
E
N
2019
MICRO
NVIDIA
Simba: Scaling Deep-Learning Inference with Multi-Chip-Module-Based Architecture
M2M: A Fine-Grained Mapping Framework to Accelerate Multiple DNNs on a Multi-Chiplet Architecture
PCA (Principal Component Analysis) and hierarchical clustering for network partitioning; simulated annealing algorithm for communication-aware block mapping; fine-tuned QoS (Quality of Service) policy for NoP (Network-on-Package) links
4
3
3
2026
ASP-DAC
THU
Scope: A Scalable Merged Pipeline Framework for Multi-Chip-Module NN Accelerators
Solution: 3D-IC technology enables higher integration density, shorter interconnects, and improved performance by stacking multiple active layers in a single device.
Open3DBench: Open-Source Benchmark for 3D-IC Backend Implementation and PPA Evaluation
open-source 3D-IC benchmark; modular 3D partitioning and placement; Open3D-DMP algorithm for cross-die co-placement; comprehensive PPA evaluation with thermal simulation
Accelerating Sparse Data Orchestration via Dynamic Reflexive Tiling (DRT)
Dynamic Reflexive Tiling (DRT) algorithm; dynamically adjust tile shapes at runtime based on sparsity of tensors; assembling uniform micro tiles into non-uniform macro tiles
3
3
2
2023
MICRO
MIT && NVIDIA
Tailors: Accelerating Sparse Tensor Algebra by Overbooking Buffer Capacity
HYTE: Flexible Tiling for Sparse Accelerators via Hybrid Static-Dynamic Approaches
hybrid static-dynamic framework; selecting a near-optimal initial tiling scheme; dynamic fine-tuning of tile shapes; coordinates efficient management of both data and metadata in on-chip/off-chip buffers
SPADA: Accelerating Sparse Matrix Multiplication with Adaptive Dataflow
highly diverse sparsity patterns; Window-based Adaptive Dataflow; dynamically select the optimal window shape configuration based on the similarity of sparse patterns
3
2
3
2023
ASPLOS
Universidad de Murcia && Georgia Tech && NVIDIA
Flexagon: A Multi-dataflow Sparse-Sparse Matrix Multiplication Accelerator for Efficient DNN Processing
dynamically adaptable multi-dataflow SpMSpM accelerator; Merger-Reduction Network; configurable tree-based topology; a customized L1 memory hierarchy comprising a read-only FIFO, a low-power cache, and a PSRAM for partial sums
3
3
3
2025
MICRO
University of Maryland
Misam: Machine Learning Assisted Dataflow Selection in Accelerators for Sparse Matrix Multiplication