Parallel and Multi-Processor Architecture¶

Heterogeneous Architecture¶

Challenge: Classic heterogeneous architectures struggle with data movement and memory access patterns, which leads to performance bottlenecks.

Year	Venue	Authors	Title	Tags
2017	TACO	Intel	HAShCache: Heterogeneity-Aware Shared DRAMCache for Integrated Heterogeneous Systems	heterogeneity-aware DRAMCache scheduling PrIS; temporal bypass ByE; spatial occupancy control chaining
2018	ICS	NC State	ProfDP: A Lightweight Profiler to Guide Data Placement in Heterogeneous Memory Systems	latency sensitivity; bandwidth sensitivity; moving factor based data placement
2023	HPCA	THU	Baryon: Efficient Hybrid Memory Management with Compression and Sub-Blocking	stage area and selective commit for stable block; dual-format metadata scheme; cacheline-aligned compression and two-level replacements

Multiple Domain Specific Accelerator¶

Year	Venue	Authors	Title	Tags	P	E	N
2024	HPCA	UCSD && Univ. of Kansas	Data Motion Acceleration: Chaining Cross-Domain Multi Accelerators	Data Motion Acceleration (DMX); Data Restructuring Accelerator (DRX)	3	4	3
2025	ISCA	HyperAccel	Hybe: GPU-NPU Hybrid System for Efficient LLM Inference with Million-Token Context Window	GPU-NPU hybrid system for LLM; prefill-decode stage separation; fine-grained KV cache transmission; stage-wise pipelining	4	3	2

CPU-GPU System¶

Year	Venue	Authors	Title	Tags	P	E	N
2024	arXiv	KTH	Harnessing Integrated CPU-GPU System Memory for HPC: a first look into Grace Hopper	Grace Hopper system memory characterization; integrated CPU-GPU page table analysis; first-touch policy impact study; system page size impact study; access-counter page migration evaluation	2	4	3
2025	ATC	THU	HYPERECA: Distributed Heterogeneous In-Memory Embedding Database for Training Recommender Models	embedding database in host memory and GPU memory; 2-Fold Parallel strategy; contention-free ring schedule	2	3	3

GPU System¶

Year	Venue	Authors	Title	Tags	P	E	N
2022	MLSys	MIT	TorchSparse: Efficient Point Cloud Inference Engine	3D sparse convolution; optimized Gather-Matmul-Scatter dataflow; adaptive matmul grouping; quantized and vectorized memory access	4	3	4
2023	MLSys	THU&&SJTU	Exploiting Hardware Utilization and Adaptive Dataflow for Efficient Sparse Convolution in 3D Point Clouds	3D sparse convolution; optimized Gather-Matmul-Scatter and fetch-on-demand dataflow; dynamic dataflow switching; coded-CSR mapping; parallel processing of different workloads without padding; Pointer	4	3	3
2023	MICRO	MIT	TorchSparse++: Efficient Training and Inference Framework for Sparse Convolution on GPUs	3D sparse convolution; optimized implicit Gather-Matmul-Scatter; CUDA sparse kernel; sparse autotuner driven by detailed workload	4	3	4
2025	MICRO	UPC	Dissecting and Modeling the Architecture of Modern GPU Cores	CGGTY (Compiler Guided Greedy Then Youngest) issue scheduling; software-managed dependencies via stall and dependence counters	4	4	2

Disaggregated Memory¶

Challenge: CXL and NVM offer byte-addressable access with lower latency and higher bandwidth than storage devices. Memory disaggregation that combines DRAM (high speed and bandwidth, small capacity) with NVM (lower speed and bandwidth, large capacity) faces latency, bandwidth, and consistency challenges.

Year	Venue	Authors	Title	Tags	P	E	N
2025	ASPLOS	Purdue	EDM: An Ultra-Low Latency Ethernet Fabric for Memory Disaggregation	Ethernet PHY network stack; PHY in-network scheduler; PHY intra-frame preemption	4	4	4
2025	HOTOS	MSR	Storage Class Memory is Dead, All Hail Managed-Retention Memory: Rethinking Memory for the AI Era	Managed-Retention Memory class; relaxed retention non-volatile memory; dynamically Configurable Memory	3	3	2
2025	arXiv	Edinburgh	RMAI: Rethinking Memory for AI (Inference), In-Kernel Remote Shared Memory as a Software Alternative to CXL	CXL-like transparent remote data placement and direct addressing on RDMA; remote shared PGAS-like memory arch for expert loading	3	2	3
2026	EuroSys	RUC	LightDSA: Enabling Efficient DSA Through Hardware-Aware Transparent Optimization	5 insights for data-center-aware DSA optimization; lightweight DSA API library for contiguous allocation and 64-byte alignment; optimized recycling algorithm for out-of-order completion behavior	4	4	4

Memory Tiering¶

Challenge: Earlier cache-based memory systems were not designed for the latency, bandwidth, and access patterns of RDMA, NVM, and CXL.
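
As a rough illustration of why placement matters across tiers with different latencies, the sketch below pins the most frequently accessed pages in a small fast tier and estimates the average access latency over a trace. The function names, the trace, and the latency values are hypothetical and are not taken from any of the papers listed here; real systems track hotness online rather than from a complete trace.

```python
from collections import Counter

def place_hot_pages(trace, fast_capacity):
    """Return the set of pages to keep in the fast tier: the most accessed ones."""
    counts = Counter(trace)
    return {page for page, _ in counts.most_common(fast_capacity)}

def average_latency_ns(trace, fast_pages, fast_ns, slow_ns):
    """Average latency when fast-tier pages cost fast_ns and all others cost slow_ns."""
    hits = sum(1 for page in trace if page in fast_pages)
    misses = len(trace) - hits
    return (hits * fast_ns + misses * slow_ns) / len(trace)

if __name__ == "__main__":
    # Hypothetical page-access trace and tier latencies (illustrative only).
    trace = [0, 0, 0, 1, 2, 0, 3, 0]
    fast_pages = place_hot_pages(trace, fast_capacity=1)
    print(fast_pages, average_latency_ns(trace, fast_pages, fast_ns=100, slow_ns=300))
```

The model ignores bandwidth limits and migration cost, which are central concerns of the tiering systems in the table below.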

Year	Venue	Authors	Title	Tags	P	E	N
2020	OSDI	MIT	AIFM: High-Performance, Application-Integrated Far Memory	object-level swapping with remoteable pointers and dereference scopes; pauseless memory evacuator using green thread co-scheduling	2	3	4
2023	SOSP	UCSD	Mira: A Program-Behavior-Guided Far Memory System	profiling-guided customizable cache section partitioning; remote pointer optimization with adaptive prefetching and eviction hints; automated computation-and-network-aware function offloading	3	3	3
2024	ASPLOS	Northwestern	Getting a Handle on Unmanaged Memory	compiler-automated translation and hosting of memory handles; thread-private stack-allocated pin sets for atomic-free tracking; extensible object-mobility runtime service interface	3	2	4
2025	ATC	THU	DSA-2LM: A CPU-Free Tiered Memory Architecture with Intel DSA	CPU-free page migration in tiered memory via data streaming accelerator; adaptable migration algorithm for mixed 4KB/2MB pages; direct in-kernel DSA integration bypassing DMA	3	3	4
2026	OSDI	UW-Madison	OBASE: Object-Based Address-Space Engineering to Improve Memory Tiering	dynamic address-space reorganization for mitigating hotness fragmentation; lightweight access tracking via guide pointer metadata; pauseless lock-free object migration	4	4	3

CXL-based Disaggregated Memory¶

Year	Venue	Authors	Title	Tags	P	E	N
2024	MICRO	PKU	NeoMem: Hardware-Software Co-Design for CXL-Native Memory Tiering	device-side memory profiling unit; sketch-based hot page detector with error-bound estimation; dynamic hotness threshold adjustment based on statistics	2	2	3
2025	ASPLOS	Yale	PULSE: Accelerating Distributed Pointer-Traversals on Disaggregated Memory	iterator-based programming model; disaggregated accelerator architecture; in-network routing for distributed traversal	3	4	3
2025	arXiv	Micron	Architectural and System Implications of CXL-enabled Tiered Memory	CXL parallelism bottleneck analysis; MIKU dynamic request control; ToR-based service time estimation	4	4	3
2025	arXiv	PKU	Enabling Efficient Transaction Processing on CXL-Based Memory Sharing	hybrid coherence primitive for transactional data; hardware-assisted loose coherence	3	2	2
2025	ASPLOS	PKU	CTXNL: A Software-Hardware Co-designed Solution for Efficient CXL-Based Transaction Processing	decouple coherence from memory access; software synchronization primitives at transaction commit	4	2	3

Survey¶

Year	Venue	Authors	Title	Tags	P	E	N
2025	arXiv	SJTU	Survey of Disaggregated Memory: Cross-layer Technique Insights for Next-Generation Datacenters	Cross-layer classification of DM techniques; hardware-level categories; architectural-level classifications; system and runtime-level groupings; application-level optimizations such as general-purpose and domain-specific approaches

Chiplets¶

Challenge: Current chip designs are often monolithic and inflexible, leading to high costs and limited opportunities for performance optimization.

Solution: Use chiplets to enable more flexible and cost-effective system designs by integrating specialized dies, each manufactured in the process best suited to it, which improves performance and yield.
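
The yield argument can be made concrete with the textbook Poisson yield model, in which a die of area A and defect density D0 is defect-free with probability exp(-A * D0). The sketch below is a simplification under stated assumptions: the defect density and areas are hypothetical, only known-good dies are assembled, and packaging cost, die-to-die interface area, and assembly yield are ignored.

```python
import math

def die_yield(area_mm2: float, defects_per_mm2: float) -> float:
    """Poisson yield model: probability that a die of this area has zero defects."""
    return math.exp(-area_mm2 * defects_per_mm2)

def silicon_per_good_system(total_area_mm2: float, num_chiplets: int,
                            defects_per_mm2: float) -> float:
    """Expected silicon area fabricated per working system.

    Assumes the system is split into equal chiplets and only good chiplets
    are assembled (known-good-die testing); assembly losses are ignored.
    """
    chiplet_area = total_area_mm2 / num_chiplets
    return total_area_mm2 / die_yield(chiplet_area, defects_per_mm2)

if __name__ == "__main__":
    # Hypothetical numbers: 800 mm^2 of logic, 0.001 defects per mm^2.
    for n in (1, 2, 4, 8):
        print(n, round(silicon_per_good_system(800.0, n, 0.001), 1))
```

Under this model, splitting the same logic into more, smaller dies lowers the silicon wasted per good system; the cost-analysis tools listed later in this section account for the overheads this sketch leaves out.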

Survey¶

Year	Venue	Authors	Title	Tags
2020	Electronics	NUDT	Chiplet Heterogeneous Integration Technology—Status and Challenges	heterogeneous integration technology; interconnect interfaces and protocols; packaging technology
2022	CCF THPC	ICT	Survey on chiplets: interface, interconnect and integration methodology	development history; interfaces and protocols; packaging technology; EDA tool; standardization of chiplet technology
2024	IEEE CASS	THU	Wafer-scale Computing: Advancements, Challenges, and Future Perspectives	wafer-scale chip architecture; compiler tool chain; integration technology; wafer-scale system; fault tolerance

Multi-Model AI chiplets¶

Year	Venue	Authors	Title	Tags	P	E	N
2024	MICRO	CA	SCAR: Scheduling Multi-Model AI Workloads on Heterogeneous Multi-Chiplet Module Accelerators	hierarchical scheduling framework; time windowing; chiplet-level scheduling	3	3	3

MCM Architecture & Scheduling¶

Challenge: Scaling single monolithic AI accelerators is limited by yield and reticle size. Multi-Chip-Module (MCM) architectures solve this but introduce NUMA effects and severe inter-chiplet communication bottlenecks, requiring merged pipelining and hardware-software co-design.

Year	Venue	Authors	Title	Tags	P	E	N
2019	MICRO	NVIDIA	Simba: Scaling Deep-Learning Inference with Multi-Chip-Module-Based Architecture	MCM (Multi-Chip-Module) accelerator; hierarchical mesh interconnection; GRS (Ground-Reference Signaling)	4	5	4
2024	TVLSI	SJTU	M2M: A Fine-Grained Mapping Framework to Accelerate Multiple DNNs on a Multi-Chiplet Architecture	PCA (Principal Component Analysis) and hierarchical clustering for network partitioning; simulated annealing algorithm for communication-aware block mapping; fine-tuned QoS (Quality of Service) policy for NoP (Network-on-Package) links	4	3	3
2026	ASP-DAC	THU	Scope: A Scalable Merged Pipeline Framework for Multi-Chip-Module NN Accelerators	cluster-dimension DSE (Design Space Exploration); DP-based (Dynamic Programming) cluster allocation algorithm; WSP-to-ISP (Weight/Input-Shared Partitioning) transition search	4	3	3
2026	HPCA	PKU	COMET: Communication and Memory Co-Design for Fine-Grained AI Inference in MCM Accelerators	DMA (Direct Memory Access) coalescing; hardware-aware genetic algorithm; adaptive on-chip memory address mapping	4	3	3

Cost Analysis¶

Year	Venue	Authors	Title	Tags	P	E	N
2021	ISCA	THU	NN-Baton: DNN Workload Orchestration and Chiplet Granularity Exploration for Multichip Accelerators	NN-Baton automatic tool; critical-capacity critical-position framework for communication overhead; chiplet granularity exploration	4	3	4
2025	arXiv	ASU	CATCH: a Cost Analysis Tool for Co-optimization of chiplet-based Heterogeneous systems	heterogeneous chiplet system modeling; DSE on chiplet size, I/O, and connection

3D IC¶

Solution: 3D IC technology enables higher integration density, shorter interconnects, and improved performance by stacking multiple active layers in a single device.

General 3D IC¶

Year	Venue	Authors	Title	Tags
2019	GLSVLSI	Boston University	An Overview of Thermal Challenges and Opportunities for Monolithic 3D ICs	TSV-based 3D integration; Mono3D integration with nanoscale monolithic inter-tier vias; influence of lateral heat flow and interconnect
2019	ECTC	TSMC	System on Integrated Chips (SoIC) for 3D Heterogeneous Integration	system on integrated chips; SoIC package integration; reliability of SoIC bond, TSV, and TDV
2020	DATE	Georgia Tech	Macro-3D: A Physical Design Methodology for Face-to-Face-Stacked Heterogeneous 3D ICs	face-to-face stack; separate 2D floorplans generation; memory-on-logic projection
2022	IEEE Micro	Cerebras	Cerebras Architecture Deep Dive: First Look Inside the Hardware/Software Co-Design for Deep Learning	fine-grained dataflow scheduling; high-bandwidth, low-latency fabric design; weight streaming

Interconnection¶

Year	Venue	Authors	Title	Tags	P	E	N
2025	HPCA	Fudan	EIGEN: Enabling Efficient 3DIC Interconnect with Heterogeneous Dual-Layer Network-on-Active-Interposer	dual-layer interconnect architecture; reinforcement learning routing; switch-programmable interconnect	3	2	3

Design Space Exploration¶

Year	Venue	Authors	Title	Tags	P	E	N
2025	arXiv	SJTU	Cool-3D: An End-to-End Thermal-Aware Framework for Early-Phase Design Space Exploration of Microfluidic-Cooled 3DICs	end-to-end thermal-aware framework; microfluidic cooling integration; Pre-RTL design space exploration; floorplan designer; microfluidic cooling strategy generator

Benchmarks¶

Year	Venue	Authors	Title	Tags	P	E	N
2025	arXiv	NJU	Open3DBench: Open-Source Benchmark for 3D-IC Backend Implementation and PPA Evaluation	open-source 3D-IC benchmark; modular 3D partitioning and placement; Open3D-DMP algorithm for cross-die co-placement; comprehensive PPA evaluation with thermal simulation

SpMM, SpGEMM, SDDMM hardware accelerator¶

Tiling hardware¶

Year	Venue	Authors	Title	Tags	P	E	N
2023	ASPLOS	UC && UIUC && NVIDIA	Accelerating Sparse Data Orchestration via Dynamic Reflexive Tiling (DRT)	Dynamic Reflexive Tiling (DRT) algorithm; dynamically adjusts tile shapes at runtime based on tensor sparsity; assembles uniform micro tiles into non-uniform macro tiles	3	3	2
2023	MICRO	MIT && NVIDIA	Tailors: Accelerating Sparse Tensor Algebra by Overbooking Buffer Capacity	overbooking tiling strategy; Swiftiles statistical sampling method; low-overhead hardware mechanism Tailors	3	3	3
2025	HPCA	THU	HYTE: Flexible Tiling for Sparse Accelerators via Hybrid Static-Dynamic Approaches	hybrid static-dynamic framework; selection of a near-optimal initial tiling scheme; dynamic fine-tuning of tile shapes; coordinated management of data and metadata in on-chip/off-chip buffers	3	3	3

Dataflow hardware¶

Year	Venue	Authors	Title	Tags	P	E	N
2023	ASPLOS	THU && DAMO && Northwestern University	SPADA: Accelerating Sparse Matrix Multiplication with Adaptive Dataflow	highly diverse sparsity patterns; window-based adaptive dataflow; dynamic selection of the optimal window shape configuration based on the similarity of sparse patterns	3	2	3
2023	ASPLOS	Universidad de Murcia && Georgia Tech && NVIDIA	Flexagon: A Multi-dataflow Sparse-Sparse Matrix Multiplication Accelerator for Efficient DNN Processing	dynamically adaptable multi-dataflow SpMSpM accelerator; Merger-Reduction Network; configurable tree-based topology; customized L1 memory hierarchy comprising a read-only FIFO, a low-power cache, and a PSRAM for partial sums	3	3	3
2025	MICRO	University of Maryland	Misam: Machine Learning Assisted Dataflow Selection in Accelerators for Sparse Matrix Multiplication	FPGA; machine-learning-based framework for dynamic dataflow selection; intelligent hardware reconfiguration; decision tree; intelligent reconfiguration engine	3	2	4
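
For readers unfamiliar with the term, "dataflow" here refers to the loop order in which a sparse product C = A x B is computed. The software sketch below contrasts two common orders, row-wise (Gustavson) and outer-product, on a dict-of-dicts sparse format. It is an illustrative assumption rather than the design of any accelerator in the table; the function names and the storage format are hypothetical.

```python
def spgemm_rowwise(A, B):
    """Row-wise (Gustavson) dataflow: each nonzero A[i][k] scales row k of B
    and the result is accumulated into row i of C."""
    C = {}
    for i, row_a in A.items():
        acc = {}
        for k, a_ik in row_a.items():
            for j, b_kj in B.get(k, {}).items():
                acc[j] = acc.get(j, 0) + a_ik * b_kj
        if acc:
            C[i] = acc
    return C

def spgemm_outer(A, B):
    """Outer-product dataflow: for each k, column k of A is combined with
    row k of B, and the partial products are merged into C."""
    # Regroup A by column so that each k can be processed in turn.
    cols = {}
    for i, row_a in A.items():
        for k, a_ik in row_a.items():
            cols.setdefault(k, {})[i] = a_ik
    C = {}
    for k, col_a in cols.items():
        for i, a_ik in col_a.items():
            for j, b_kj in B.get(k, {}).items():
                row = C.setdefault(i, {})
                row[j] = row.get(j, 0) + a_ik * b_kj
    return C

if __name__ == "__main__":
    # Sparse matrices stored as {row: {col: value}}; values are arbitrary.
    A = {0: {0: 1, 2: 2}, 1: {1: 3}}
    B = {0: {1: 4}, 1: {0: 5}, 2: {1: 6}}
    print(spgemm_rowwise(A, B))
    print(spgemm_outer(A, B))
```

Both orders produce the same result but differ in data reuse and in how partial sums must be merged, which is why the best choice depends on the sparsity pattern.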