Parallel and Multi-Processor Architecture

Heterogeneous Architecture

Challenge: Classic heterogeneous architectures struggle with data movement and memory access patterns, which leads to performance bottlenecks.

Year Venue Authors Title Tags P E N
2017 TACO Intel HAShCache: Heterogeneity-Aware Shared DRAMCache for Integrated Heterogeneous Systems heterogeneity-aware DRAMCache scheduling PrIS; temporal bypass ByE; spatial occupancy control chaining
2018 ICS NC State ProfDP: A Lightweight Profiler to Guide Data Placement in Heterogeneous Memory Systems latency sensitivity; bandwidth sensitivity; moving factor based data placement
2023 HPCA THU Baryon: Efficient Hybrid Memory Management with Compression and Sub-Blocking stage area and selective commit for stable block; dual-format metadata scheme; cacheline-aligned compression and two-level replacements

Multiple Domain Specific Accelerator

Year Venue Authors Title Tags P E N
2024 HPCA UCSD && Univ. of Kansas Data Motion Acceleration: Chaining Cross-Domain Multi Accelerators Proposes Data Motion Acceleration (DMX); Data Restructuring Accelerator (DRX) 3 4 3
2025 ISCA HyperAccel Hybe: GPU-NPU Hybrid System for Efficient LLM Inference with Million-Token Context Window GPU-NPU hybrid system for LLM; prefill-decode stage separation; fine-grained KV cache transmission; stage-wise pipelining 4 3 2

CPU-GPU System

Year Venue Authors Title Tags P E N
2024 arXiv KTH Harnessing Integrated CPU-GPU System Memory for HPC: a first look into Grace Hopper Grace Hopper system memory characterization; integrated CPU-GPU page table analysis; first-touch policy impact study; system page size impact study; access-counter page migration evaluation 2 4 3
2025 ATC THU HYPERECA: Distributed Heterogeneous In-Memory Embedding Database for Training Recommender Models embedding database in host memory and GPU memory; 2-Fold Parallel strategy; contention-free ring schedule 2 3 3

GPU System

Year Venue Authors Title Tags P E N
2022 MLSys MIT TorchSparse: Efficient Point Cloud Inference Engine 3D Sparse Convolution; optimize Gather-Matmul-Scatter dataflow; Adaptive Matmul Grouping; Quantized and Vectorized Memory access 4 3 4
2023 MLSys THU&&SJTU Exploiting Hardware Utilization and Adaptive Dataflow for Efficient Sparse Convolution in 3D Point Clouds 3D Sparse Convolution; optimize Gather-Matmul-Scatter and fetch-on-demand dataflow; Dynamic dataflow changing; coded-CSR mapping; Parallel Processing of different workloads without padding; Pointer 4 3 3
2023 MICRO MIT TorchSparse++: Efficient Training and Inference Framework for Sparse Convolution on GPUs 3D Sparse Convolution; optimize Implicit Gather-Matmul-Scatter; CUDA Sparse Kernel; Sparse Autotuner by detailed workload 4 3 4
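
Several entries above are tagged with the Gather-Matmul-Scatter dataflow. The sketch below shows the general idea in plain Python, under stated assumptions: the function names, the `kernel_maps` layout (one list of `(input_index, output_index)` pairs per kernel offset), and the toy data are all hypothetical, and this is not the implementation used by any of the listed systems.

```python
def matvec(vector, weight):
    """Multiply a length-C_in row vector by a C_in x C_out weight matrix."""
    c_out = len(weight[0])
    return [sum(vector[i] * weight[i][j] for i in range(len(vector)))
            for j in range(c_out)]


def sparse_conv_gather_matmul_scatter(features, kernel_maps, weights, num_outputs):
    """Sparse convolution as gather -> matmul -> scatter-add.

    features:    list of per-point feature vectors (length C_in each)
    kernel_maps: kernel_maps[k] lists (input_index, output_index) pairs
                 that are connected through kernel offset k (assumed layout)
    weights:     weights[k] is the C_in x C_out matrix for kernel offset k
    """
    c_out = len(weights[0][0])
    out = [[0.0] * c_out for _ in range(num_outputs)]
    for k, pairs in enumerate(kernel_maps):
        if not pairs:
            continue
        # Gather: collect the input features this kernel offset touches.
        gathered = [features[in_idx] for in_idx, _ in pairs]
        # Matmul: one dense product per kernel offset.
        products = [matvec(row, weights[k]) for row in gathered]
        # Scatter: accumulate partial sums into the output points.
        for (_, out_idx), product in zip(pairs, products):
            for c in range(c_out):
                out[out_idx][c] += product[c]
    return out


# Toy example: 2 points, C_in = 2, C_out = 1, 2 kernel offsets.
features = [[1.0, 2.0], [3.0, 4.0]]
weights = [[[1.0], [1.0]], [[2.0], [0.0]]]
kernel_maps = [[(0, 0), (1, 1)], [(1, 0)]]
result = sparse_conv_gather_matmul_scatter(features, kernel_maps, weights, 2)
print(result)
```

The gather and scatter steps are pure data movement, which is why the tags above focus on memory access and dataflow rather than on the matrix multiply itself.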

Disaggregated Memory

Challenge: CXL and NVM offer byte-level access with higher speed and bandwidth than storage devices. Memory disaggregation that combines DRAM (high speed and bandwidth, small capacity) with NVM (lower speed and bandwidth, large capacity) faces latency, bandwidth, and consistency challenges.

Year Venue Authors Title Tags P E N
2025 ASPLOS Purdue EDM: An Ultra-Low Latency Ethernet Fabric for Memory Disaggregation Ethernet PHY network stack; PHY in-network scheduler; PHY intra-frame preemption 4 4 4
2025 HOTOS MSR Storage Class Memory is Dead, All Hail Managed-Retention Memory: Rethinking Memory for the AI Era Managed-Retention Memory class; relaxed retention non-volatile memory; dynamically Configurable Memory 3 3 2
2025 arXiv Edinburgh RMAI: Rethinking Memory for AI (Inference), In-Kernel Remote Shared Memory as a Software Alternative to CXL CXL-like transparent remote data placement and direct addressing on RDMA; remote shared PGAS-like memory arch for expert loading 3 2 3
2026 EuroSys RUC LightDSA: Enabling Efficient DSA Through Hardware-Aware Transparent Optimization 5 insights of data center-aware DSA optimizations; lightweight DSA API library for contiguous allocation and 64-byte alignment; optimized recycling algorithm for out-of-order completion behavior 4 4 4

Memory Tiering

Challenge: Earlier cache-based memory systems were not designed around the latency, bandwidth, and access patterns of RDMA, NVM, and CXL.
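
To make the latency side of this challenge concrete, the sketch below computes the average access latency of a two-tier memory as a weighted sum over the fast and slow tiers. The function name and the latency values (100 ns and 400 ns) are assumptions chosen only for illustration, not measurements from any of the listed papers.

```python
def average_access_latency_ns(fast_hit_fraction, fast_ns, slow_ns):
    """Expected latency when a given fraction of accesses hits the fast tier."""
    if not 0.0 <= fast_hit_fraction <= 1.0:
        raise ValueError("fast_hit_fraction must be in [0, 1]")
    return fast_hit_fraction * fast_ns + (1.0 - fast_hit_fraction) * slow_ns


# Hypothetical latencies, for illustration only.
LOCAL_DRAM_NS = 100.0
FAR_MEMORY_NS = 400.0

for fraction in (0.5, 0.9, 0.99):
    latency = average_access_latency_ns(fraction, LOCAL_DRAM_NS, FAR_MEMORY_NS)
    print(f"{fraction:.2f} of accesses in fast tier: {latency:.0f} ns on average")
```

Under these assumed numbers, even a small share of accesses that fall through to the slow tier adds noticeably to the average, which is one reason the systems below invest in hotness tracking and placement.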

Year Venue Authors Title Tags P E N
2020 OSDI MIT AIFM: High-Performance, Application-Integrated Far Memory object-level swapping with remoteable pointers and dereference scopes; pauseless memory evacuator using green thread co-scheduling 2 3 4
2023 SOSP UCSD Mira: A Program-Behavior-Guided Far Memory System profiling-guided customizable cache section partitioning; remote pointer optimization with adaptive prefetching and eviction hints; automated computation-and-network-aware function offloading 3 3 3
2024 ASPLOS Northwestern Getting a Handle on Unmanaged Memory compiler-automated translation and hosting of memory handles; thread-private stack-allocated pin sets for atomic-free tracking; extensible object-mobility runtime service interface 3 2 4
2025 ATC THU DSA-2LM: A CPU-Free Tiered Memory Architecture with Intel DSA CPU-free page migration in tiered memory via data streaming accelerator; adaptable migration algorithm for mixed 4KB/2MB pages; direct in-kernel DSA integration bypassing DMA 3 3 4
2026 OSDI UW-Madison OBASE: Object-Based Address-Space Engineering to Improve Memory Tiering dynamic address-space reorganization for mitigating hotness fragmentation; lightweight access tracking via guide pointer metadata; pauseless lock-free object migration 4 4 3

CXL-based Disaggregated Memory

Year Venue Authors Title Tags P E N
2024 MICRO PKU NeoMem: Hardware-Software Co-Design for CXL-Native Memory Tiering device-side memory profiling unit; sketch-based hot page detector with error-bound estimation; dynamic hotness threshold adjustment based on statistics 2 2 3
2025 ASPLOS Yale PULSE: Accelerating Distributed Pointer-Traversals on Disaggregated Memory iterator-based programming model; disaggregated accelerator architecture; in-network routing for distributed traversal 3 4 3
2025 arXiv Micron Architectural and System Implications of CXL-enabled Tiered Memory CXL parallelism bottleneck analysis; MIKU dynamic request control; ToR-based service time estimation 4 4 3
2025 arXiv PKU Enabling Efficient Transaction Processing on CXL-Based Memory Sharing hybrid coherence primitive for transactional data; hardware-assisted loose coherence 3 2 2
2025 ASPLOS PKU CTXNL: A Software-Hardware Co-designed Solution for Efficient CXL-Based Transaction Processing decouple coherence from memory access; software synchronization primitives at transaction commit 4 2 3

Survey

Year Venue Authors Title Tags P E N
2025 arXiv SJTU Survey of Disaggregated Memory: Cross-layer Technique Insights for Next-Generation Datacenters Cross-layer classification of DM techniques; hardware-level categories; architectural-level classifications; system and runtime-level groupings; application-level optimizations such as general-purpose and domain-specific approaches

Chiplets

Challenge: Current chip designs are often monolithic and inflexible, which leads to high costs and limited opportunities for performance optimization.

Solution: Chiplets enable more flexible and cost-effective system designs by integrating specialized dies, each manufactured in the process best suited to it, which improves performance and yield.
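
The yield argument can be illustrated with a rough calculation. The sketch below assumes a simple Poisson yield model, Y = exp(-A * D0), which is a common first-order approximation and not something taken from the papers listed here. The die areas, the defect density, and the function names are hypothetical, and packaging cost, bonding yield, and inter-chiplet interface area are ignored.

```python
import math


def poisson_yield(die_area_mm2, defects_per_cm2):
    """First-order Poisson yield model: Y = exp(-A * D0), with A in cm^2."""
    area_cm2 = die_area_mm2 / 100.0
    return math.exp(-area_cm2 * defects_per_cm2)


def silicon_per_good_system_mm2(die_area_mm2, dies_per_system, defects_per_cm2):
    """Average wafer area consumed to obtain one system built from good dies."""
    die_yield = poisson_yield(die_area_mm2, defects_per_cm2)
    return dies_per_system * die_area_mm2 / die_yield


DEFECTS_PER_CM2 = 0.1  # hypothetical defect density

monolithic = silicon_per_good_system_mm2(800.0, 1, DEFECTS_PER_CM2)
chiplets = silicon_per_good_system_mm2(200.0, 4, DEFECTS_PER_CM2)

print(f"monolithic 800 mm^2 die: {monolithic:.0f} mm^2 of silicon per good system")
print(f"four 200 mm^2 chiplets:  {chiplets:.0f} mm^2 of silicon per good system")
```

Under this model a defect discards only one small die instead of the whole system, so the chiplet configuration consumes less silicon per working system. A real cost comparison would also need the packaging and test terms left out here.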

Survey

Year Venue Authors Title Tags P E N
2020 Electronics NUDT Chiplet Heterogeneous Integration Technology—Status and Challenges heterogeneous integration technology; interconnect interfaces and protocols; packaging technology
2022 CCF THPC ICT Survey on chiplets: interface, interconnect and integration methodology development history; interfaces and protocols; packaging technology; EDA tool; standardization of chiplet technology
2024 IEEE CASS THU Chiplet Heterogeneous Integration Technology—Status and Challenges wafer-scale chip architecture; compiler tool chain; integration technology; wafer-scale system; fault tolerance

Multi-model AI chiplets

Year Venue Authors Title Tags P E N
2024 MICRO CA SCAR: Scheduling Multi-Model AI Workloads on Heterogeneous Multi-Chiplet Module Accelerators Hierarchical Scheduling Framework; Time Windowing; Chiplet-level Scheduling 3 3 3

MCM Architecture & Scheduling

Challenge: Scaling single monolithic AI accelerators is limited by yield and reticle size. Multi-Chip-Module (MCM) architectures solve this but introduce NUMA effects and severe inter-chiplet communication bottlenecks, requiring merged pipelining and hardware-software co-design.

Year Venue Authors Title Tags P E N
2019 MICRO NVIDIA Simba: Scaling Deep-Learning Inference with Multi-Chip-Module-Based Architecture MCM (Multi-Chip-Module) accelerator; hierarchical mesh interconnection; GRS (Ground-Referenced Signaling) 4 5 4
2024 TVLSI SJTU M2M: A Fine-Grained Mapping Framework to Accelerate Multiple DNNs on a Multi-Chiplet Architecture PCA (Principal Component Analysis) and hierarchical clustering for network partitioning; simulated annealing algorithm for communication-aware block mapping; fine-tuned QoS (Quality of Service) policy for NoP (Network-on-Package) links 4 3 3
2026 ASP-DAC THU Scope: A Scalable Merged Pipeline Framework for Multi-Chip-Module NN Accelerators cluster-dimension DSE (Design Space Exploration); DP-based (Dynamic Programming) cluster allocation algorithm; WSP-to-ISP (Weight/Input-Shared Partitioning) transition search 4 3 3
2026 HPCA PKU COMET: Communication and Memory Co-Design for Fine-Grained AI Inference in MCM Accelerators DMA (Direct Memory Access) Coalescing; Hardware-Aware Genetic Algorithm; adaptive on-chip memory address mapping 4 3 3

Cost Analysis

Year Venue Authors Title Tags P E N
2021 ISCA THU NN-Baton: DNN Workload Orchestration and Chiplet Granularity Exploration for Multichip Accelerators NN-Baton automatic tool; critical capacity critical position framework for communication overhead; chiplet granularity exploration 4 3 4
2025 arXiv ASU CATCH: a Cost Analysis Tool for Co-optimization of chiplet-based Heterogeneous systems heterogeneous chiplet system modeling; DSE on chiplet size, I/O, and connection

3D IC

Solution: 3D IC technology stacks multiple active layers in a single device, which enables higher integration density, shorter interconnects, and improved performance.
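
The claim about shorter interconnects can be sketched with simple geometry. Assuming a square footprint and the same total active area split evenly across tiers, the footprint side shrinks with the square root of the tier count, and so does the longest in-plane (Manhattan) wire span. The function name and the 400 mm^2 area are hypothetical, and vertical via length and thermal limits are ignored.

```python
import math


def max_in_plane_span_mm(total_area_mm2, tiers):
    """Corner-to-corner Manhattan distance of a square footprint per tier."""
    if tiers < 1:
        raise ValueError("tiers must be at least 1")
    side_mm = math.sqrt(total_area_mm2 / tiers)
    return 2.0 * side_mm


TOTAL_AREA_MM2 = 400.0  # hypothetical total active area

for tiers in (1, 2, 4):
    span = max_in_plane_span_mm(TOTAL_AREA_MM2, tiers)
    print(f"{tiers} tier(s): longest in-plane span is {span:.1f} mm")
```

The same stacking that shortens wires also concentrates heat in a smaller footprint, which is the motivation for the thermal-aware work listed in the tables below.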

General 3D IC

Year Venue Authors Title Tags P E N
2019 GLSVLSI Boston University An Overview of Thermal Challenges and Opportunities for Monolithic 3D ICs TSV-based 3D integration; Mono3D integration with nanoscale monolithic inter-tier vias; influence of lateral heat flow and interconnection
2019 ECTC TSMC System on Integrated Chips (SoIC) for 3D Heterogeneous Integration system on integrated chips; SoIC package integration; reliability of SoIC bond, TSV, and TDV
2020 DATE Georgia Tech Macro-3D: A Physical Design Methodology for Face-to-Face-Stacked Heterogeneous 3D ICs face-to-face stack; separate 2D floorplans generation; memory-on-logic projection
2022 IEEE Micro Cerebras Cerebras Architecture Deep Dive: First Look Inside the Hardware/Software Co-Design for Deep Learning fine-grained dataflow scheduling; high-bandwidth, low-latency fabric design; weight streaming

Interconnection

Year Venue Authors Title Tags P E N
2025 HPCA Fudan EIGEN: Enabling Efficient 3DIC Interconnect with Heterogeneous Dual-Layer Network-on-Active-Interposer Dual-layer interconnect architecture; Reinforcement learning routing; Switch-programmable interconnect 3 2 3

Design Space Exploration

Year Venue Authors Title Tags P E N
2025 arXiv SJTU Cool-3D: An End-to-End Thermal-Aware Framework for Early-Phase Design Space Exploration of Microfluidic-Cooled 3DICs end-to-end thermal-aware framework; microfluidic cooling integration; Pre-RTL design space exploration; floorplan designer; microfluidic cooling strategy generator

Benchmarks

Year Venue Authors Title Tags P E N
2025 arXiv NJU Open3DBench: Open-Source Benchmark for 3D-IC Backend Implementation and PPA Evaluation open-source 3D-IC benchmark; modular 3D partitioning and placement; Open3D-DMP algorithm for cross-die co-placement; comprehensive PPA evaluation with thermal simulation

SpMM, SpGEMM, SDDMM hardware accelerator

Tiling hardware

Year Venue Authors Title Tags P E N
2023 ASPLOS UC && UIUC && NVIDIA Accelerating Sparse Data Orchestration via Dynamic Reflexive Tiling (DRT) Dynamic Reflexive Tiling (DRT) algorithm; dynamically adjust tile shapes at runtime based on sparsity of tensors; assembling uniform micro tiles into non-uniform macro tiles 3 3 2
2023 MICRO MIT && NVIDIA Tailors: Accelerating Sparse Tensor Algebra by Overbooking Buffer Capacity overbooking tiling strategy; Swiftiles statistical sampling method; low-overhead hardware mechanism Tailors 3 3 3
2025 HPCA THU HYTE: Flexible Tiling for Sparse Accelerators via Hybrid Static-Dynamic Approaches hybrid static-dynamic framework; selecting a near-optimal initial tiling scheme; dynamic fine-tuning of tile shapes; coordinates efficient management of both data and metadata in on-chip/off-chip buffers 3 3 3

Dataflow hardware

Year Venue Authors Title Tags P E N
2023 ASPLOS THU && DAMO && Northwestern University SPADA: Accelerating Sparse Matrix Multiplication with Adaptive Dataflow highly diverse sparsity patterns; Window-based Adaptive Dataflow; dynamically select the optimal window shape configuration based on the similarity of sparse patterns 3 2 3
2023 ASPLOS Universidad de Murcia && Georgia Tech && NVIDIA Flexagon: A Multi-dataflow Sparse-Sparse Matrix Multiplication Accelerator for Efficient DNN Processing dynamically adaptable multi-dataflow SpMSpM accelerator; Merger-Reduction Network; configurable tree-based topology; a customized L1 memory hierarchy comprising a read-only FIFO, a low-power cache, and a PSRAM for partial sums 3 3 3
2025 MICRO University of Maryland Misam: Machine Learning Assisted Dataflow Selection in Accelerators for Sparse Matrix Multiplication FPGA; a machine learning-based framework for dynamic dataflow selection; intelligent hardware reconfiguration; decision tree; intelligent reconfiguration engine 3 2 4
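
For readers new to the term, "dataflow" in the entries above refers to the loop order in which a sparse matrix product is computed. The sketch below shows one widely used choice, the row-wise (Gustavson) dataflow, in plain Python. The dict-of-dicts storage format and the function name are assumptions made for readability. This is not the design of any accelerator listed here; those systems select among this and other dataflows, such as inner-product and outer-product orders.

```python
def spgemm_row_wise(a_rows, b_rows):
    """Compute C = A * B with the row-wise (Gustavson) dataflow.

    Matrices are stored as {row_index: {col_index: value}} and hold
    only nonzero entries (an assumed format chosen for clarity).
    """
    c_rows = {}
    for i, a_row in a_rows.items():
        accumulator = {}  # partial sums for output row i
        for k, a_ik in a_row.items():
            # Each nonzero A[i, k] scales row k of B into output row i.
            for j, b_kj in b_rows.get(k, {}).items():
                accumulator[j] = accumulator.get(j, 0.0) + a_ik * b_kj
        if accumulator:
            c_rows[i] = accumulator
    return c_rows


# Toy example: A is 2 x 3, B is 3 x 2.
A = {0: {0: 1.0, 2: 2.0}, 1: {1: 3.0}}
B = {0: {1: 4.0}, 1: {0: 5.0}, 2: {1: 6.0}}
C = spgemm_row_wise(A, B)
print(C)
```

How much reuse each loop order gets from A, B, and the partial sums depends on the sparsity pattern, which is why the papers above adapt the dataflow per workload instead of fixing one.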