Solution: Use layer fusion to combine multiple layers of a neural network into a single processing pass. This reduces the number of computations and off-chip memory accesses required during inference, leading to faster execution and lower power consumption.
| Year | Venue | Authors | Title | Tags | P | E | N |
|------|-------|---------|-------|------|---|---|---|
| 2016 | MICRO | SBU | Fused-Layer CNN Accelerators | fuses the processing of multiple CNN layers by modifying the order in which input data are brought on chip | | | |
| 2025 | TC | KU Leuven | Stream: Design Space Exploration of Layer-Fused DNNs on Heterogeneous Dataflow Accelerators | fine-grained mapping paradigm; mapping of layer-fused DNNs onto heterogeneous dataflow accelerator architectures; memory- and communication-aware latency analysis; constraint optimization | | | |
| 2024 | SOCC | IIT Hyderabad | Hardware-Aware Network Adaptation using Width and Depth Shrinking including Convolutional and Fully Connected Layer Merging | width shrinking: reduces the number of feature maps in CNN layers; depth shrinking: merges conv and FC layers | | | |
| 2024 | ICSAI | MIT | LoopTree: Exploring the Fused-Layer Dataflow Accelerator Design Space | design space covering tiling, recomputation, and retention choices and their combinations; a model that validates the design space | | | |
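The core idea above can be shown in a minimal NumPy sketch (not any paper's actual dataflow; the tile size and 3-tap kernels are illustrative assumptions): two chained 1-D convolutions are computed tile by tile, so only the small slice of the intermediate feature map that one output tile depends on is ever produced.

```python
import numpy as np

def conv1d(x, w):
    """Valid-mode 1-D correlation with a k-tap kernel."""
    k = len(w)
    return np.array([np.dot(x[i:i + k], w) for i in range(len(x) - k + 1)])

def unfused(x, w1, w2):
    """Layer-by-layer: the full intermediate feature map is materialized."""
    inter = conv1d(x, w1)            # length len(x) - 2
    return conv1d(inter, w2)         # length len(x) - 4

def fused(x, w1, w2, tile=4):
    """Layer-fused: produce the output tile by tile; each tile only
    computes the intermediate slice it actually needs."""
    out_len = len(x) - 4             # two valid 3-tap convs shrink by 4
    out = np.empty(out_len)
    for t in range(0, out_len, tile):
        hi = min(t + tile, out_len)
        # out[t:hi] needs inter[t:hi+2], which needs x[t:hi+4]
        inter_tile = conv1d(x[t:hi + 4], w1)
        out[t:hi] = conv1d(inter_tile, w2)
    return out
```

In the fused version the intermediate working set is `tile + 2` elements instead of `len(x) - 2`, which is exactly the on-chip-buffer saving that fused-layer accelerators trade against the small halo recomputation at tile borders.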
Challenge: LLMs require large amounts of memory bandwidth to store and access model parameters and intermediate results, which can lead to memory bottlenecks and reduced performance.
| Year | Venue | Authors | Title | Tags | P | E | N |
|------|-------|---------|-------|------|---|---|---|
| 2024 | ISCA | Furiosa AI | TCP: A Tensor Contraction Processor for AI Workloads | tensor contraction as a hardware primitive; circuit-switched fetch network for hierarchical data reuse; Einstein summation for tactic exploration | 4 | 3 | 3 |
| 2024 | DATE | NTU | ViTA: A Highly Efficient Dataflow and Architecture for Vision Transformers | highly efficient memory-centric dataflow; fused special-function module for non-linear functions; comprehensive DSE of ViTA kernels and VMUs | | | |
| 2025 | arXiv | SJTU | ROMA: A Read-Only-Memory-based Accelerator for QLoRA-based On-Device LLM | hybrid ROM-SRAM architecture for on-device LLM; B-ROM design for area-efficient ROM; fused-cell integration of ROM and compute unit; QLoRA rank adaptation for task-specific tuning; on-chip storage optimization for quantized models | | | |
| 2025 | ISCA | Duke | Ecco: Improving Memory Bandwidth and Capacity for LLMs via Entropy-Aware Cache Compression | | | | |
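Why this challenge bites can be quantified with a back-of-the-envelope roofline calculation. The sketch below assumes illustrative hardware numbers (300 TFLOP/s peak, 3 TB/s HBM), not any specific accelerator from the table:

```python
def attainable_gflops(ai, peak_gflops, bw_gbs):
    """Roofline model: performance is capped by compute or by memory,
    whichever roof the arithmetic intensity (FLOP/byte) hits first."""
    return min(peak_gflops, ai * bw_gbs)

# GEMV for one decoded token: y = W @ x with W an (n, n) fp16 matrix.
# 2*n*n FLOPs against ~2*n*n bytes of weight traffic -> AI ~ 1 FLOP/byte.
n = 4096
flops = 2 * n * n
bytes_moved = 2 * n * n          # fp16 weights dominate the traffic
ai = flops / bytes_moved         # = 1.0 FLOP/byte

# Hypothetical accelerator: 300 TFLOP/s peak, 3 TB/s HBM (illustrative).
perf = attainable_gflops(ai, peak_gflops=300_000, bw_gbs=3_000)
# perf is ~3 TFLOP/s, i.e. about 1% of peak: firmly memory-bound,
# which is why compression and on-chip-storage techniques pay off.
```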
Challenge: Reactive hardware architectures (like GPUs) devote significant area and power to dynamic scheduling and caching, which leads to unpredictable tail latencies and limits the compute utilization in large-scale distributed systems.
| Year | Venue | Authors | Title | Tags | P | E | N |
|------|-------|---------|-------|------|---|---|---|
| 2020 | ISCA | Groq | Think Fast: A Tensor Streaming Processor (TSP) for Accelerating Deep Learning Workloads | deterministic space-time execution model; functionally-sliced, data-stream microarchitecture | 4 | 5 | 4 |
| 2022 | ISCA | Groq | A software-defined tensor streaming multiprocessor for large-scale machine learning | multi-TSP network; removal of hardware flow control; deterministic load balancing | | | |
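The deterministic alternative these papers describe can be illustrated with a toy statically scheduled program (the unit names and 1-cycle latency model below are hypothetical, loosely echoing the functional-slice idea, not Groq's actual ISA):

```python
# Hypothetical statically scheduled program: (issue cycle, unit, op).
# The compiler fixes every issue slot at build time, so no runtime
# arbitration, caching, or flow control is needed.
program = [
    (0, "MXM", "matmul t0"),
    (0, "SXM", "shift   t1"),   # different slices may issue the same cycle
    (3, "VXM", "add     t2"),   # placed only after t0/t1 are known ready
    (5, "MEM", "store   t2"),
]

def makespan(prog):
    """Completion time is a compile-time constant: the last issue slot
    plus that unit's fixed latency (1 cycle in this toy model)."""
    return max(cycle for cycle, _, _ in prog) + 1

# Latency is known exactly before the program ever runs -- the property
# that eliminates the unpredictable tail latencies of reactive designs.
```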
Solution: Adopt holistic hardware-software co-design to integrate high-bandwidth Network-on-Wafer (NoW) architectures with fault-tolerant mapping strategies, thereby mitigating yield limitations and communication bottlenecks to maximize system throughput.
| Year | Venue | Authors | Title | Tags | P | E | N |
|------|-------|---------|-------|------|---|---|---|
| 2025 | ISCA | THU | PD Constraint-aware Physical/Logical Topology Co-Design for Network on Wafer | | | | |
Solution: Quantized DNN accelerators efficiently execute quantized neural networks, which use lower-precision representations for weights and activations.
| Year | Venue | Authors | Title | Tags | P | E | N |
|------|-------|---------|-------|------|---|---|---|
| 2018 | ISCA | SNU | Energy-Efficient Neural Network Accelerator Based on Outlier-Aware Low-Precision Computation | accelerator architecture for outlier-aware quantized models; outlier-aware low-precision computation; separate outlier MAC unit | 4 | 3 | 2 |
| 2024 | DAC | ASU | Algorithm-Hardware Co-Design of Distribution-Aware Logarithmic-Posit Encodings for Efficient DNN Inference | composite data type Logarithmic Posits (LP); automated post-training LP quantization (LPQ) framework based on genetic algorithms; mixed-precision LP accelerator (LPA) | 3 | 3 | 2 |
| 2023 | HPCA | UPC | Mix-GEMM: An efficient HW-SW Architecture for Mixed-Precision Quantized Deep Neural Networks Inference on Edge Devices | complete mixed-precision flexibility; hardware accelerator and BLIS-based library with custom RISC-V ISA extensions | | | |
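The common scheme underlying these designs can be sketched in a few lines of NumPy (symmetric per-tensor int8 with a wide integer accumulator; a generic textbook recipe, not the specific encoding of any paper above):

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor int8 quantization: x is approximated by scale * q."""
    scale = np.max(np.abs(x)) / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def int8_matmul(qa, sa, qb, sb):
    """Integer GEMM in a wide (int32) accumulator, with a single
    floating-point rescale at the end -- the MAC datapath such
    accelerators implement in hardware."""
    acc = qa.astype(np.int32) @ qb.astype(np.int32)
    return acc * (sa * sb)

rng = np.random.default_rng(0)
a, b = rng.standard_normal((4, 8)), rng.standard_normal((8, 4))
qa, sa = quantize_int8(a)
qb, sb = quantize_int8(b)
err = np.max(np.abs(int8_matmul(qa, sa, qb, sb) - a @ b))
# err stays small because rounding error is bounded by half a scale step
```

The outlier-aware and mixed-precision papers in the table refine exactly this recipe: they spend extra bits (or a separate MAC unit) only on the few values where a single shared scale would clip or lose precision.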
Solution: Bit-sliced DNN accelerators break operands down into smaller bit-slices, allowing more efficient processing and reduced memory and compute requirements.
| Year | Venue | Authors | Title | Tags | P | E | N |
|------|-------|---------|-------|------|---|---|---|
| 2018 | ISCA | Georgia Tech | Bit Fusion: Bit-Level Dynamically Composable Architecture for Accelerating Deep Neural Networks | accelerator for layer-aware quantized DNNs; bit-flexible computation unit; block-structured instruction set architecture | 4 | 3 | 3 |
| 2023 | HPCA | KAIST | Sibia: Signed Bit-slice Architecture for Dense DNN Acceleration with Slice-level Sparsity Exploitation | signed bit-slice representation; flexible zero-skipping processing element | 3 | 3 | 4 |
| 2024 | HPCA | KU Leuven | BitWave: Exploiting Column-Based Bit-Level Sparsity for Deep Learning Acceleration | bit-column sparsity for both computation reduction and data compression; single-shot bit-flip post-training | 3 | 3 | 3 |
| 2025 | HPCA | POSTECH | Panacea: Novel DNN Accelerator using Accuracy-Preserving Asymmetric Quantization and Energy-Saving Bit-Slice Sparsity | asymmetrically-quantized bit-slice GEMM; zero-point manipulation and distribution-based bit-slicing to increase sparsity | 3 | 3 | 4 |
| 2025 | HPCA | Yonsei | Bit-slice Architecture for DNN Acceleration with Slice-level Sparsity Enhancement and Exploitation | both input AND weight sparsity at the bit-slice level; 8-bit data processing with 4-bit multipliers; scale regularization during training to enhance sparsity | | | |
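A minimal sketch of the slicing arithmetic itself (unsigned 8-bit operands split into 4-bit slices, matching the "8-bit data with 4-bit multipliers" idea in the last entry; signed representations and sparsity skipping as in Sibia/Panacea are omitted for brevity):

```python
def slices4(v):
    """Split an unsigned 8-bit value into (low, high) 4-bit slices."""
    return v & 0xF, (v >> 4) & 0xF

def mul8_with_4bit_units(a, b):
    """An 8x8 multiply built from four 4x4 partial products, shift-added
    by slice significance -- the recombination a bit-slice PE performs.
    A zero slice makes its partial products trivially skippable, which
    is the slice-level sparsity these accelerators exploit."""
    al, ah = slices4(a)
    bl, bh = slices4(b)
    return (al * bl) + ((al * bh + ah * bl) << 4) + ((ah * bh) << 8)
```

For example, `mul8_with_4bit_units(200, 123)` reproduces `200 * 123` exactly, since the four shifted partial products tile the full 16-bit result.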
Solution: Reconfigurable accelerators break the trade-off between flexibility and performance, letting hardware adapt to algorithm changes as quickly as software while maintaining high energy efficiency.
| Year | Venue | Authors | Title | Tags | P | E | N |
|------|-------|---------|-------|------|---|---|---|
| 2018 | ASPLOS | Georgia Tech | MAERI: Enabling Flexible Dataflow Mapping over DNN Accelerators via Reconfigurable Interconnects | augmented reduction tree (ART) to avoid link conflicts; chubby distribution tree for bandwidth optimization; ART-based virtual neuron construction | 4 | 3 | 2 |
| 2019 | JETCAS | MIT | Eyeriss v2: A Flexible Accelerator for Emerging Deep Neural Networks on Mobile Devices | hierarchical mesh NoC for multiple transmission modes; sparse PE architecture | 5 | 4 | 2 |
| 2023 | ASPLOS | UM & Georgia Tech | Flexagon: A Multi-dataflow Sparse-Sparse Matrix Multiplication Accelerator for Efficient DNN Processing | merger-reduction network for area efficiency; compression-format conversion without a dedicated hardware module; dedicated L1 memory architecture for different access patterns | 4 | 3 | 2 |
| 2023 | MICRO | MIT | HighLight: Efficient and Flexible DNN Acceleration with Hierarchical Structured Sparsity | | | | |
Solution: Dataflow architectures execute instructions based on the availability of data rather than a predetermined sequence, leading to more efficient use of resources and better performance in parallel processing and real-time systems.
| Year | Venue | Authors | Title | Tags | P | E | N |
|------|-------|---------|-------|------|---|---|---|
| 2019 | ASPLOS | THU | Tangram: Optimized Coarse-Grained Dataflow for Scalable NN Accelerators | | | | |
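The firing rule stated in the solution above can be demonstrated with a tiny interpreter (a generic illustration of dataflow semantics, not Tangram's coarse-grained scheme): a node executes as soon as all its operands exist, in no predetermined order.

```python
def dataflow_run(graph, inputs):
    """Fire any node whose operands are all available, regardless of
    textual program order -- the core dataflow execution rule."""
    values = dict(inputs)
    pending = dict(graph)                 # node -> (op, operand names)
    while pending:
        ready = [n for n, (_, deps) in pending.items()
                 if all(d in values for d in deps)]
        for n in ready:
            op, deps = pending.pop(n)
            values[n] = op(*(values[d] for d in deps))
    return values

# (x + y) * (x - y) as a graph; 'add' and 'sub' have no mutual
# dependence, so a spatial machine could fire them in the same cycle.
g = {
    "add": (lambda a, b: a + b, ("x", "y")),
    "sub": (lambda a, b: a - b, ("x", "y")),
    "mul": (lambda a, b: a * b, ("add", "sub")),
}
result = dataflow_run(g, {"x": 5, "y": 3})["mul"]   # 8 * 2 = 16
```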
Solution: Optimally map an application's data flow graph onto the hardware fabric by simultaneously solving the tightly-coupled problems of scheduling, placement, and routing under strict spatial and temporal resource constraints.
| Year | Venue | Authors | Title | Tags | P | E | N |
|------|-------|---------|-------|------|---|---|---|
| 2017 | ASAP | Toronto | CGRA-ME: A Unified Framework for CGRA Modelling and Exploration | | | | |
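The coupled placement-and-routing problem named in the solution above can be shown at toy scale (a hypothetical 2x2 fabric and 3-op graph; real CGRA mappers like CGRA-ME solve far larger instances with ILP or heuristics, and must also handle time-multiplexing and port limits):

```python
from itertools import permutations

# A toy 2x2 CGRA fabric (PE coordinates) and a 3-op dataflow graph.
pes = [(0, 0), (0, 1), (1, 0), (1, 1)]
ops = ["load", "mul", "store"]
edges = [("load", "mul"), ("mul", "store")]

def routing_cost(placement):
    """Total Manhattan hop count over all dataflow edges -- the routing
    term that placement must minimize on a spatial fabric."""
    return sum(abs(placement[a][0] - placement[b][0]) +
               abs(placement[a][1] - placement[b][1])
               for a, b in edges)

# Exhaustive placement search: assign each op to a distinct PE and keep
# the assignment whose routing is cheapest.
best = min((dict(zip(ops, p)) for p in permutations(pes, len(ops))),
           key=routing_cost)
# The optimum places each producer adjacent to its consumer (cost 2).
```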
Challenge: Many-core architectures are designed to host a large number of cores, but they face challenges in power consumption, performance, and resource allocation.
Challenge: Cores share resources with one another; coordinating access among cores to prevent conflicts and ensure data consistency, while still achieving high performance, remains a problem.
| Year | Venue | Authors | Title | Tags | P | E | N |
|------|-------|---------|-------|------|---|---|---|
| 2015 | HPCA | Cornell | Increasing Multicore System Efficiency through Intelligent Bandwidth Shifting | | | | |
Challenge: The built-in crossbar of HBM FPGAs suffers from contention and low bandwidth during many-to-many unicast access, and standard HLS lacks support for efficient burst buffering.
| Year | Venue | Authors | Title | Tags | P | E | N |
|------|-------|---------|-------|------|---|---|---|
| 2021 | FPGA | UCLA | HBM Connect: High-Performance HLS Interconnect for FPGA HBM | | | | |