Domain-Specific Accelerators

Sorting & Merging Accelerators

Challenge: High-throughput 2-way mergers on FPGAs suffer from long feedback critical paths, high resource utilization from redundant comparators, and tie-record issues in feedback-less designs.

Solution: Relax the sorted-input requirement of the bitonic partial merger by replacing its first stage with distributed MAX units, eliminating barrel shifters and redundant merger blocks while achieving the lowest comparator count.

Year	Venue	Authors	Title	Tags	P	E	N
2022	TC	Imperial & dunnhumby	FLiMS: a Fast Lightweight 2-way Merger for Sorting	distributed MAX selector stage eliminating barrel shifter rotation; single 2w-to-w bitonic partial merger with minimum comparator count; skewness optimisation via dir register oscillation for balanced dequeue; FLiMSj whole-row dequeue variant via cR register buffer; SIMD AVX2 implementation for CPU merge sort	4	4	4
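
The idea of a bitonic partial merger whose first stage is a row of MAX units can be sketched in software. The Python below is a minimal illustration of the textbook construction, not the FLiMS hardware design; the function name `top_w_of_two_sorted` and the list-based interface are assumptions made for this example.

```python
def top_w_of_two_sorted(a, b):
    """Return the w largest items of two ascending runs of length w, ascending.

    Hypothetical helper for illustration; w must be a power of two.
    """
    w = len(a)
    assert len(b) == w and w > 0 and w & (w - 1) == 0
    # Stage 1: w independent MAX units, pairing a[i] with b[w - 1 - i].
    # The result is a bitonic sequence that holds the w largest items.
    stage = [max(a[i], b[w - 1 - i]) for i in range(w)]
    # Remaining stages: a bitonic merge network with log2(w) layers
    # of compare-exchange units.
    step = w // 2
    while step >= 1:
        for i in range(w):
            if i & step == 0:
                j = i + step
                if stage[i] > stage[j]:
                    stage[i], stage[j] = stage[j], stage[i]
        step //= 2
    return stage


print(top_w_of_two_sorted([1, 4, 6, 9], [2, 3, 7, 8]))  # [6, 7, 8, 9]
```

A full 2-way merger would call such a block once per cycle and dequeue w items from the inputs each time; that feedback logic is what the hardware designs above optimise and is left out here.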

Homomorphic Encryption Accelerators

Challenge: Current HE accelerators are slow as they rely on complex NTT operations that ignore data sparsity and error tolerance.

Year	Venue	Authors	Title	Tags	P	E	N
2025	DATE	PKU	FLASH: An Efficient Hardware Accelerator Leveraging Approximate and Sparse FFT for Homomorphic Encryption	approximate FFT for modular reduction elimination; sparsity-aware FFT dataflow using skipping and merging methods; DSE for Pareto-optimal computation precision	4	2	2
2023	ISCA	Seoul National	SHARP: A Short-Word Hierarchical Accelerator for Robust and Practical Fully Homomorphic Encryption	36-bit word length optimization for FHE precision-efficiency trade-off; double-prime scaling unit for instruction fusion and short-word scaling	3	2	3

Graph Accelerators

Challenge: Massive memory requirements and irregular (non-sequential) memory access.

Survey

Challenge: Lack of systematic categorization and review of diverse graph accelerator implementations spanning different architectures and programming paradigms.

Year	Venue	Authors	Title	Tags	P	E	N
2019	JCST	HUST	A Survey on Graph Processing Accelerators: Challenges and Opportunities	vertex-centric vs edge-centric iterative paradigms; graph layout reorganization and ordering; source/destination/grid graph partitioning; runtime scheduling execution models
2022	IEEE Micro	UCLA	Systematically Understanding Graph Accelerator Dimensions and the Value of Hardware Flexibility	Taskflow execution model unifying task and dataflow parallelism; graph algorithm variant taxonomy; multi-level spatial partitioning; asynchronous hardware task scheduling priority queue

Pipelined & Event-Driven Graph Accelerators

Solution: Optimize data flow and execution models using pipelining and event-driven mechanisms to handle irregular graph workloads across hardware architectures.

Year	Venue	Authors	Title	Tags	P	E	N
2016	MICRO	Princeton	Graphicionado: A High-Performance and Energy-Efficient Accelerator for Graph Analytics	vertex-programming based pipeline; on-chip scratchpad optimization; source/destination-oriented parallel streams	4	4	4
2016	ISCA	Bilkent	Energy Efficient Architecture for Graph Analytics Accelerators	configurable SystemC-based GAS architecture template with application-pluggable data structures; monotonic-rank RAW/WAR hazard detection; bit-vector/queue dual structure; lock-free active vertex scheduling	3	3	3
2020	MICRO	UCR	GraphPulse: An Event-Driven Hardware Accelerator for Asynchronous Graph Processing	asynchronous event-driven model; in-place event coalescing; delta-based accumulative processing	3	3	4
2024	FPGA	HKUST	GraFlex: Flexible Graph Processing on FPGAs through Customized Scalable Interconnection Network	scatter-gather BSP paradigm; customizable multi-stage butterfly interconnection network with virtual-channel flow control; HLS-level coalesced memory; throughput-matching design methodology	3	4	3
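
The event-driven, delta-accumulating style listed above can be illustrated in software. The sketch below is a generic asynchronous PageRank with an in-place coalescing event store, written under our own assumptions (the name `event_driven_pagerank`, the dict used as the event queue, and the `eps` cut-off are all hypothetical); it does not reproduce any of the listed accelerators.

```python
def event_driven_pagerank(out_edges, damping=0.85, eps=1e-12):
    """Asynchronous delta-based PageRank (unnormalised form).

    out_edges[v] is the list of successors of vertex v.
    """
    n = len(out_edges)
    rank = [0.0] * n
    # At most one pending event per vertex: events that target the same
    # vertex are coalesced in place by summing their deltas.
    pending = {v: 1.0 - damping for v in range(n)}
    while pending:
        v, delta = pending.popitem()      # any processing order is valid
        rank[v] += delta                  # accumulate the delta
        if not out_edges[v]:
            continue
        share = damping * delta / len(out_edges[v])
        if share < eps:                   # contribution too small to matter
            continue
        for u in out_edges[v]:
            pending[u] = pending.get(u, 0.0) + share
    return rank


print(event_driven_pagerank([[1], [2], [0]]))  # each rank is close to 1.0
```

Because only vertices with a pending delta are touched, the amount of work follows the active set rather than the whole graph, which is the property the event-driven designs exploit.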

FPGA Stream-Partitioned Graph Accelerators

Solution: Partition graphs into intervals and shards, streaming edges through FPGA on-chip memory while caching vertex data to maximize bandwidth utilization and reduce preprocessing overhead.

Year	Venue	Authors	Title	Tags	P	E	N
2016	FPGA	THU & UCB	FPGP: Graph Processing Framework on FPGA A Case Study of Breadth-First Search	interval-shard based vertex-centric graph processing on FPGA; on-chip BFS ping-pong vertex caching; analytical performance model with N_pk crossover point	3	4	2
2017	FPGA	THU & MSR	ForeGraph: Exploring Large-scale Graph Processing on Multi-FPGA Architecture	two-level interval-shard partitioning across multi-FPGA boards; index-based O(m) preprocessing without edge sorting; destination-first replacement strategy minimizing off-chip traffic; PE-level edge shuffling for load balancing; bitmap-based block skipping for sparse iterations	4	4	3
2018	FCCM	THU & UCB	NewGraph: Balanced Large-scale Graph Processing on FPGAs with Low Preprocessing Overheads	URAM-based large partitions reducing count by 3 orders of magnitude vs ForeGraph; FIFO-based dynamic crossbar eliminating pre-sorting; balanced workload without static edge shuffling	3	4	2
2019	FPL	NUS	On-The-Fly Parallel Data Shuffling for Graph Processing on OpenCL-Based FPGAs	on-the-fly parallel data shuffling; OpenCL-based data dispatcher; runtime data dependency resolution; decoder-filter shuffling architecture	3	4	2
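
As a rough software analogue of interval-shard partitioning, the sketch below splits vertices into fixed-size intervals, groups edges into shards keyed by (source interval, destination interval), and runs level-synchronous BFS by streaming one shard at a time. The function names, the dict-of-lists shard layout, and the loop order are assumptions for illustration, not the scheme of any specific paper above.

```python
def partition_into_shards(edges, num_vertices, interval_size):
    """Group edges into shards keyed by (source interval, destination interval)."""
    num_intervals = -(-num_vertices // interval_size)  # ceiling division
    shards = {}
    for src, dst in edges:
        key = (src // interval_size, dst // interval_size)
        shards.setdefault(key, []).append((src, dst))
    return num_intervals, shards


def bfs_sweep(shards, num_intervals, level, current):
    """One BFS level: stream every shard, touching two intervals at a time."""
    updated = False
    for j in range(num_intervals):        # destination interval stays cached
        for i in range(num_intervals):    # source interval loaded per shard
            for src, dst in shards.get((i, j), []):
                if level[src] == current and level[dst] is None:
                    level[dst] = current + 1
                    updated = True
    return updated


n, shards = partition_into_shards([(0, 1), (1, 2), (2, 3), (0, 3)], 4, 2)
level = [0, None, None, None]             # BFS from vertex 0
cur = 0
while bfs_sweep(shards, n, level, cur):
    cur += 1
print(level)  # [0, 1, 2, 1]
```

While a shard is streamed, only the vertex data of its two intervals is needed, which is what allows an FPGA design to keep that data in on-chip memory.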

Memory-Optimized Graph Accelerators

Solution: Address the memory bottleneck (bandwidth and latency) through specialized memory hierarchies, HBM optimizations, or caching mechanisms.

Year	Venue	Authors	Title	Tags	P	E	N
2014	IPDPSW	ISU	CyGraph: A Reconfigurable Architecture for Parallel Breadth-First Search	custom 64-bit CSR combining visited-flag/row-offset/neighbor-count in single memory word; token-ring kernel-to-kernel interface for distributed next-queue coordination across multi-FPGA	3	4	2
2019	FPGA	HUST	Improving Performance of Graph Processing on FPGA-DRAM Platform by Two-level Vertex Caching	two-level vertex caching (L1 BRAM/L2 UltraRAM); Hilbert-order window sliding for locality; dual-pipeline computation-communication overlapping	4	4	3
2021	ICCAD	Cornell	GraphLily: Accelerating Graph Linear Algebra on HBM-Equipped FPGAs	GraphBLAS; FPGA overlay; HBM-optimized; SpMV/SpMSpV accelerator; CPSR sparse format; software-hardware co-design	4	4	3
2024	TRETS	HUST	ScalaBFS2: A High-performance BFS Accelerator on an HBM-enhanced FPGA Chip	HBM-enhanced BFS accelerator; independent HBM Reader; hybrid-mode PE; multi-layer crossbar	3	4	3
2019	ASPDAC	THU & UCB	GraphSAR: A Sparsity-Aware Processing-in-Memory Architecture for Large-Scale Graph Processing on ReRAMs	sparsity-aware recursive block partitioning with density threshold 0.5; hybrid-centric block-list and edge-list processing model; lightweight graph clustering via vertex index remapping; single-bit ReRAM cell implementation for unweighted algorithms	3	3	3

Dynamic Graph Accelerators

Challenge: Supporting efficient edge updates and designing the graph-store data structure.

Year	Venue	Authors	Title	Tags	P	E	N
2021	FPGA	HUST	GraSU: A Fast Graph Update Library for FPGA-based Dynamic Graph Processing	differential data management based on spatial similarity; Incremental Value Measurer (IVM); Value-Aware Memory Manager (VMM)	4	4	3
2021	MICRO	UCR	JetStream: Graph Analytics on Streaming Data with Event-Driven Hardware Accelerator	first streaming graph accelerator; asynchronous incremental algorithms; asynchronous edge deletion handling (VAP; DAP)	3	3	4
2022	ISCA	HUST	TDGraph: A Topology-Driven Accelerator for High-Performance Streaming Graph Processing	topology-driven incremental execution; streaming graph processing; regularized state propagation; vertex states coalescing	4	4	3
2023	MICRO	UCR	MEGA Evolving Graph Accelerator	first evolving graph accelerator; Batch-Oriented Execution (BOE); deletion-free based on CommonGraph; Batch Pipelining	3	3	4
2024	TRETS	UoV	Dynamic-ACTS - A Dynamic Graph Analytics Accelerator For HBM-Enabled FPGAs	a novel edge packing format (ACTPACK); hashed edge updates; low-overhead online partitioning	4	4	3

Hypergraph Accelerators

Solution: Identify and exploit the parts shared among overlapping hyperedges.

Year	Venue	Authors	Title	Tags	P	E	N
2022	MICRO	HUST	A Data-Centric Accelerator for High-Performance Hypergraph Processing	Data-Centric; Load-Trigger-Reduce (LTR); Adaptive Data Loading	4	4	3
2025	HPCA	HUST	MeHyper: Accelerating Hypergraph Neural Networks by Exploring Implicit Dataflows	Microedge; Microedge-Centric Dataflow; RePAG Execution Model	4	3	4

Graph Mining Accelerators

Challenge: Complex graph mining algorithms and irregular memory access patterns.

Year	Venue	Authors	Title	Tags	P	E	N
2021	ISCA	MIT	FlexMiner: A Pattern-Aware Accelerator for Graph Pattern Mining	pattern-aware GPM accelerator; software/hardware co-design; pattern-specific execution plan; connectivity map (c-map)	4	4	3
2026	SIGMOD	Heidelberg & SAP	GraphMatch: Subgraph Query Processing on FPGAs	AllCompare FPGA-native parallel set intersection replacing sequential LeapFrog; worst-case optimal join (WCOJ)-based matching source-filter-extender-sink pipeline; cached fetcher architecture for repeated input set reuse	3	4	3

Graph Random Walk Accelerators

Challenge: Graph random walks require step-by-step probabilistic traversal with strong inter-step data dependencies and highly irregular random memory access; existing FPGA accelerators suffer from pipeline bubbles caused by workload imbalance and static scheduling that cannot adapt to stochastic query termination.

Year	Venue	Authors	Title	Tags	P	E	N
2023	SIGMOD	NUS	LightRW: FPGA Accelerated Graph Dynamic Random Walks	parallelized Weighted Reservoir Sampling (WRS) for FPGA pipelined graph dynamic random walks (GDRW); degree-aware cache replacing low-degree with high-degree vertex neighbor addresses; dynamic burst engine scheduling hybrid long/short burst lengths at runtime	3	4	3
2026	HPCA	NUS	RidgeWalker: Perfectly Pipelined Graph Random Walks on FPGAs	Markov-based stateless task decomposition for out-of-order GRW execution; zero-bubble query scheduler via M/M/1[N] queuing model with formal bubble-free guarantee; butterfly-interconnect Task Router for per-hop data-aware pipeline routing	4	4	3
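
Weighted reservoir sampling, mentioned in the tags above, lets a walker choose its next hop in a single pass over a neighbor list without first building a prefix-sum table. The sketch below is the sequential textbook form for a reservoir of size one; `sample_next_hop` and its parameters are hypothetical names, and the parallel hardware variants are not reproduced.

```python
import random


def sample_next_hop(neighbors, weights, rng=random):
    """Pick one neighbor with probability proportional to its weight.

    Single pass, O(1) state: the running weight sum and the current choice.
    """
    chosen, total = None, 0.0
    for v, w in zip(neighbors, weights):
        if w <= 0:
            continue                      # zero-weight edges are never taken
        total += w
        # Replace the current choice with probability w / total.
        if rng.random() < w / total:
            chosen = v
    return chosen


rng = random.Random(0)                    # seeded only to make the demo repeatable
picks = [sample_next_hop(["a", "b"], [1.0, 3.0], rng) for _ in range(20000)]
print(picks.count("b") / len(picks))      # close to 0.75
```

Because the weights are consumed as a stream, the same routine works when transition weights depend on the walker's state and cannot be precomputed, which is the case for dynamic random walks.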

DNN Accelerators

GEMM

Year	Venue	Authors	Title	Tags	P	E	N
2020	HPCA	Georgia Tech	SIGMA: A Sparse and Irregular GEMM Accelerator with Flexible Interconnects for DNN Training	flexible dot product engine; forwarding adder network	4	2	3
2022	DAC	Georgia Tech	Self Adaptive Reconfigurable Arrays (SARA): Learning Flexible GEMM Accelerator Configuration and Mapping-space using ML	dedicated hardware recommender core; pipelined bypass links; real-time hardware reconfiguration	3	3	2
2025	TCAS-I	Edin.	DiP: A Scalable, Energy-Efficient Systolic Array for Matrix Multiplication Acceleration	Permutated Weight; FIFO-less Architecture	4	4	3

Layer Fusion Accelerators

Solution: Use layer fusion to execute several consecutive layers of a neural network as one fused unit, so that intermediate feature maps stay on chip. This reduces the off-chip memory accesses required during inference, leading to faster execution and lower power consumption.

Year	Venue	Authors	Title	Tags
2016	MICRO	SBU	Fused-Layer CNN Accelerators	fuse the processing of multiple CNN layers by modifying the order in which the input data are brought on chip
2025	TC	KU Leuven	Stream: Design Space Exploration of Layer-Fused DNNs on Heterogeneous Dataflow Accelerators	fine-grain mapping paradigm; mapping of layer-fused DNNs on heterogeneous dataflow accelerator architectures; memory- and communication-aware latency analysis; constraint optimization
2024	SOCC	IIT Hyderabad	Hardware-Aware Network Adaptation using Width and Depth Shrinking including Convolutional and Fully Connected Layer Merging	Width Shrinking: reduces the number of feature maps in CNN layers; Depth Shrinking: Merge of conv layer and fc layer
2024	ICSAI	MIT	LoopTree: Exploring the Fused-Layer Dataflow Accelerator Design Space	design space that supports set of tiling, recomputation, retention choices, and their combinations; model that validates design space
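
The effect of fusing layers can be seen on a toy pair of 1-D convolutions. The sketch below compares a layer-by-layer schedule, which materialises the whole intermediate feature map, with a fused schedule that computes each output tile from just the input window it depends on. All names (`conv1d`, `layer_by_layer`, `fused`, `tile`) are assumptions for this illustration, and it is not the dataflow of any specific paper listed above.

```python
def conv1d(x, k):
    """Valid 1-D correlation of list x with kernel k."""
    n = len(k)
    return [sum(x[i + j] * k[j] for j in range(n)) for i in range(len(x) - n + 1)]


def layer_by_layer(x, k1, k2):
    """Baseline: the full intermediate feature map is produced, then consumed."""
    return conv1d(conv1d(x, k1), k2)


def fused(x, k1, k2, tile=4):
    """Fused schedule: only a tile-sized intermediate exists at any time."""
    halo = (len(k1) - 1) + (len(k2) - 1)  # extra inputs needed per tile
    out_len = len(x) - halo
    out = []
    for start in range(0, out_len, tile):
        stop = min(start + tile, out_len)
        window = x[start:stop + halo]     # input tile plus halo
        mid = conv1d(window, k1)          # small, tile-local intermediate
        out.extend(conv1d(mid, k2))
    return out


x = list(range(10))
assert fused(x, [1, 2, 1], [1, -1]) == layer_by_layer(x, [1, 2, 1], [1, -1])
```

The halo elements are read and recomputed at every tile boundary, so this simple form trades some redundant computation for a much smaller intermediate buffer; retaining the overlap instead of recomputing it is the other end of that trade-off.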

LLM Accelerators

Challenge: LLM accelerators are constrained by memory bandwidth, power consumption, and the need for efficient data movement.

FPGA-Based Transformer Inference Accelerators

Challenge: Deploying Transformer inference on FPGAs/ACAPs suffers from severe shape mismatch between diverse layer sizes and fixed hardware resources, creating an inherent latency-throughput tradeoff that neither purely sequential nor fully spatial accelerator strategies can resolve simultaneously.

Solution: Explore sequential-spatial hybrid accelerator architectures with automated layer-to-accelerator scheduling and inter-accelerator communication co-design to achieve a superior latency-throughput Pareto front.

Year	Venue	Authors	Title	Tags	P	E	N
2024	FPGA	Pitt & UMD	SSR: Spatial Sequential Hybrid Architecture for Latency Throughput Tradeoff in Transformer Acceleration	latency-throughput Pareto front via sequential-spatial hybrid design; force-partition strategy for inter-accelerator memory bank conflict elimination; fine-grained line-buffer pipeline for nonlinear kernels (LayerNorm/Softmax)	4	4	3
2024	FPGA	THU & SJTU	FlightLLM: Efficient Large Language Model Inference with a Complete Mapping Flow on FPGAs	configurable sparse DSP chain for N:M and block sparsity; always-on-chip decode with mixed-precision dequantization unit; length adaptive compilation reducing instruction storage by 500 times	4	4	3

SSM/Mamba Accelerators

Challenge: Mamba's element-wise operations are incompatible with Tensor Core reduction trees; nonlinear functions (exp, SiLU) require large dedicated hardware units; and element-wise ops have limited data sharing making standard tiling inapplicable.

Solution: Reconfigurable PE array that can disable reduction trees for element-wise ops, decompose nonlinear functions into element-wise primitives, and apply intra/inter-operation buffer management.

Year	Venue	Authors	Title	Tags	P	E	N
2024	ICCAD	SJTU	MARCA: Mamba Accelerator with ReConfigurable Architecture	reduction alternative PE array architecture toggling reduction tree for linear vs element-wise ops; fast biased exponential algorithm decomposing exp into shift and element-wise ops; piecewise SiLU approximation reusing reconfigurable PEs; intra-operation and inter-operation buffer management strategy	4	3	4

Accelerators Facing the Memory Wall

Challenge: LLMs need large memory capacity and bandwidth to store and access model parameters and intermediate results, which creates memory bottlenecks and reduces performance.

Year	Venue	Authors	Title	Tags	P	E	N
2023	ASPLOS	Georgia Tech	FLAT: An Optimized Dataflow for Mitigating Attention Bottlenecks	inter-operator tensor-tensor fusion; fine-grained execution tiling; map-space exploration	3	3	2
2024	ISCA	Furiosa AI	TCP: A Tensor Contraction Processor for AI Workloads	tensor contraction as a hardware primitive; circuit-switched fetch network for hierarchical data reuse; Einstein summation for tactic exploration	4	3	3
2024	DATE	NTU	ViTA: A Highly Efficient Dataflow and Architecture for Vision Transformers	highly efficient memory-centric dataflow; fused special function module for non-linear functions; A comprehensive DSE of ViTA Kernels and VMUs
2025	arXiv	SJTU	ROMA: A Read-Only-Memory-based Accelerator for QLoRA-based On-Device LLM	hybrid ROM-SRAM architecture for on-device LLM; B-ROM design for area-efficient ROM; fused cell integration of ROM and compute unit; QLoRA rank adaptation for task-specific tuning; on-chip storage optimization for quantized models
2025	ISCA	Duke	Ecco: Improving Memory Bandwidth and Capacity for LLMs via Entropy-Aware Cache Compression	entropy-aware cache compression for LLMs; group-wise non-uniform quantization; shared k-means patterns; parallel Huffman hardware decoder	4	3	3

Compiler-Scheduled Cacheless Architecture

Challenge: Reactive hardware architectures (like GPUs) devote significant area and power to dynamic scheduling and caching, which leads to unpredictable tail latencies and limits the compute utilization in large-scale distributed systems.

Year	Venue	Authors	Title	Tags	P	E	N
2020	ISCA	Groq	Think Fast: A Tensor Streaming Processor (TSP) for Accelerating Deep Learning Workloads	Space-time determined; Functionally Sliced and Data stream Microarchitecture	4	5	4
2022	ISCA	Groq	A software-defined tensor streaming multiprocessor for large-scale machine learning	Multi-TSP network; Removal of Hardware Flow Control; Deterministic Load Balancing	4	5	4

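Algorithmic Accelerators

Solution: Co-design the algorithm with the hardware, using techniques such as approximation, pruning, speculative inference, and fine-grained pipelining to reduce the work that LLM inference requires.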

Year	Venue	Authors	Title	Tags	P	E	N
2020	HPCA	Seoul National	A3: Accelerating Attention Mechanisms in Neural Networks with Approximation	greedy candidate search for reducing search targets; post-scoring selection with dynamic thresholding for softmax; lookup-table-based exponent modules	4	3	2
2021	HPCA	MIT	SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning	cascade pruning; high-parallelism top-K engine; progressive quantization	3	4	4
2024	ASPLOS	CMU	SpecInfer: Accelerating Large Language Model Serving with Tree-based Speculative Inference and Verification	tree-based speculative inference; topology-aware causal mask; multi-step speculative sampling	4	4	4
2025	MICRO	KAIST	HLX: A Unified Pipelined Architecture for Optimized Performance of Hybrid Transformer-Mamba Language Models	PipeFlash fine-grained pipelining for Attention; PipeSSD fused pipelined execution for Mamba-2; Unified Reconfigurable Streamlined Core (URSC); inter-operation dependency mitigation	4	3	3

Wafer-Scale Accelerators

Solution: Adopt holistic hardware-software co-design to integrate high-bandwidth Network-on-Wafer (NoW) architectures with fault-tolerant mapping strategies, thereby mitigating yield limitations and communication bottlenecks to maximize system throughput.

Year	Venue	Authors	Title	Tags	P	E	N
2025	ISCA	THU	PD Constraint-aware Physical/Logical Topology Co-Design for Network on Wafer	mesh-switch physical topology; dual-granularity logical topology	4	3	2
2025	ISCA	THU	WSC-LLM: Efficient LLM Service and Architecture Co-exploration for Wafer-scale Chips	optimal resource partition algorithm; optimal KV cache placement algorithm	4	3	2
2026	HPCA	THU	WATOS: Efficient LLM Training Strategies and Architecture Co-exploration for Wafer-scale Chip	globally coordinated memory-efficient recomputation; location-aware resource placement; 3-stage system-level robustness design	4	4	2

Quantized DNN Accelerators

Solution: Quantized DNN accelerators are designed to efficiently execute quantized neural networks, which use lower precision representations for weights and activations.

General-Purpose and Edge Quantized DNN Accelerators

Challenge: General quantized-DNN accelerators must support mixed precision or nonuniform datatypes without losing the throughput and energy benefits of low-bit execution.

Year	Venue	Authors	Title	Tags	P	E	N
2018	ISCA	SNU	Energy-Efficient Neural Network Accelerator Based on Outlier-Aware Low-Precision Computation	accelerator architecture for outlier-aware quantized models; outlier-aware low-precision computation; separate outlier MAC unit	4	3	2
2023	HPCA	UPC	Mix-GEMM: An efficient HW-SW Architecture for Mixed-Precision Quantized Deep Neural Networks Inference on Edge Devices	Complete mixed-precision flexibility; hardware accelerator & BLIS-based library with custom RISC-V ISA extensions	3	2	3
2024	DAC	ASU	Algorithm-Hardware Co-Design of Distribution-Aware Logarithmic-Posit Encodings for Efficient DNN Inference	composite data type Logarithmic Posits (LP); automated post training LP Quantization (LPQ) Framework based on genetic algorithms; mixed-precision LP Accelerator (LPA)	3	3	2

Quantized Transformer and Foundation-Model Accelerators

Challenge: Transformer-scale quantized accelerators need to compress weights and activations while preserving model accuracy and reducing expensive memory traffic.

Year	Venue	Authors	Title	Tags	P	E	N
2020	MICRO	UToronto	GOBO: Quantizing Attention-Based NLP Models for Low Latency and Energy Efficient Inference	Gaussian-outlier weight splitting; per-layer centroid dictionary quantization; 3-bit index storage for BERT weights; GOBO low-bit accelerator	4	3	4
2022	ISCA	UToronto	Mokey: Enabling Narrow Fixed-Point Inference for Out-of-the-Box Floating-Point Transformer Models	Golden Dictionary post-training quantization; 4-bit index weights and activations with fixed-point centroids; exponential-fit centroid arithmetic replacing MACs with narrow additions; Mokey accelerator and memory compression assist	4	4	4
2025	ISCA	Georgia Tech & Intel	MicroScopiQ: Accelerating Foundational Models through Outlier-Aware Microscaling Quantization	outlier-aware MX quantization with pruning-based bit redistribution; ReCoN NoC for high-precision outlier merging; multi-precision INT PE systolic accelerator for LLM/VLM inference	4	3	3

Bit-Sliced DNN Accelerators

Solution: Bit-sliced DNN accelerators break data down into smaller bit-slices, allowing more efficient processing and reducing memory and compute resource requirements.

Year	Venue	Authors	Title	Tags	P	E	N
2018	ISCA	Georgia Tech	Bit Fusion: Bit-Level Dynamically Composable Architecture for Accelerating Deep Neural Network	accelerator for layer-aware quantized DNN; bit-flexible computation unit; block-structured instruction set architecture	4	3	3
2023	HPCA	KAIST	Sibia: Signed Bit-slice Architecture for Dense DNN Acceleration with Slice-level Sparsity Exploitation	signed bit-slice representation; flexible zero skipping processing element	3	3	4
2024	HPCA	KU Leuven	BitWave: Exploiting Column-Based Bit-Level Sparsity for Deep Learning Acceleration	Bit-column sparsity for both computation reduction and data compression; Single-shot Bit-Flip post-training	3	3	3
2025	HPCA	POSTECH	Panacea: Novel DNN Accelerator using Accuracy-Preserving Asymmetric Quantization and Energy-Saving Bit-Slice Sparsity	Asymmetrically-Quantized bit-Slice GEMM; Zero-Point Manipulation and Distribution-based Bit-Slicing to increase sparsity	3	3	4
2025	HPCA	Yonsei	Bit-slice Architecture for DNN Acceleration with Slice-level Sparsity Enhancement and Exploitation	both input AND weight sparsity at bit-slice level; 8-bit data processing with 4-bit multipliers; Scale regularization during training to enhance sparsity	3	2	2
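
The arithmetic behind bit-slicing is easy to check in software. The sketch below splits unsigned 8-bit operands into 4-bit slices, multiplies slice by slice, and skips any product that involves a zero slice. The helper names and the unsigned-only treatment are assumptions for this example; the signed slice representations used by some of the designs above work differently.

```python
def slices(value, slice_bits=4, num_slices=2):
    """Split an unsigned integer into little-endian bit-slices."""
    mask = (1 << slice_bits) - 1
    return [(value >> (i * slice_bits)) & mask for i in range(num_slices)]


def bit_sliced_multiply(a, b, slice_bits=4, num_slices=2):
    """Multiply two unsigned values using only narrow slice products.

    Returns (product, number of slice products skipped because of a zero slice).
    """
    total = 0
    skipped = 0
    for i, sa in enumerate(slices(a, slice_bits, num_slices)):
        for j, sb in enumerate(slices(b, slice_bits, num_slices)):
            if sa == 0 or sb == 0:
                skipped += 1              # a zero slice contributes nothing
                continue
            # Shift each partial product to the position of its slices.
            total += (sa * sb) << ((i + j) * slice_bits)
    return total, skipped


print(bit_sliced_multiply(0x3A, 0x07))  # (406, 2)
```

Small-magnitude values have all-zero upper slices, so a dense 8-bit tensor can still expose substantial slice-level sparsity, which is the effect these accelerators exploit.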

Reconfigurable Dataflow & Interconnect Accelerators

Solution: Use dynamically reconfigurable interconnects and multi-dataflow engines so that mappings can adapt the spatial architecture to different DNN variants.

Year	Venue	Authors	Title	Tags	P	E	N
2018	ASPLOS	Georgia Tech	MAERI: Enabling Flexible Dataflow Mapping over DNN Accelerators via Reconfigurable Interconnects	augmented reduction tree (ART) for link conflict; chubby distribution tree for bandwidth optimization; ART based virtual neuron construction	4	3	2
2019	JETCAS	MIT	Eyeriss v2: A Flexible Accelerator for Emerging Deep Neural Networks on Mobile Devices	hierarchical mesh NoC for multiple transmission modes; sparse PE architecture	5	4	2
2023	ASPLOS	UM & Georgia Tech	Flexagon: A Multi-dataflow Sparse-Sparse Matrix Multiplication Accelerator for Efficient DNN Processing	merger-reduction network for area efficiency; compression format conversion without hardware module; dedicated L1 memory architecture for different access pattern	4	3	2
2024	ISCA	Georgia Tech	FEATHER: A Reconfigurable Accelerator with Data Reordering Support for Low-Cost On-Chip Dataflow Switching	reorder in reduction; dataflow-layout co-switching; butterfly interconnect for reduction and reordering (BIRRD)	4	4	2
2025	MICRO	Maryland	Misam: Machine Learning Assisted Dataflow Selection in Accelerators for Sparse Matrix Multiplication	ML-assisted runtime dataflow selection; lightweight decision tree predictor; intelligent reconfiguration engine for FPGAs; cost-benefit analysis for hardware switching	3	4	3

Application & Sparsity-Specific Reconfigurable Accelerators

Solution: Accelerators employing structured sparsity and environment-specific pipelining, enabling them to reconfigure flexibly for structured sparsity masks or specific AI domains (like RL).

Structured Sparsity DNN Accelerators

Solution: Accelerators employing hierarchical structured sparsity masks for flexible DNN acceleration.

Year	Venue	Authors	Title	Tags	P	E	N
2023	MLSys	Georgia Tech	SUSHI: SUbgraph Stationary Hardware-software Inference Co-design	subgraph-stationary dataflow; dedicated persistent buffer; cache-state-aware scheduling; moving-average subgraph prediction	4	4	2
2023	MICRO	MIT	HighLight: Efficient and Flexible DNN Acceleration with Hierarchical Structured Sparsity	hierarchical structured sparsity (HSS); modularized sparsity acceleration architecture; systematic flexibility on sparsity patterns	3	3	3

RL Environment Accelerators

Solution: FPGA-based accelerators targeting the parallel environment execution bottleneck in reinforcement learning training pipelines.

Year	Venue	Authors	Title	Tags	P	E	N
2025	DATE	PKU	PEARL: FPGA-Based Reinforcement Learning Acceleration with Pipelined Parallel Environments	pipelined parallel RL environment execution on FPGA; PCIe data compression and local-store optimization; modular parametric environment template for user customization	3	3	2

Secure DNN Accelerators

Solution: Integrating on-chip cryptographic engines and co-optimizing the hardware architecture, memory authentication schemes, and data scheduling to efficiently enable Trusted Execution Environments (TEEs) for secure DNN computation with minimal overhead.

Year	Venue	Authors	Title	Tags	P	E	N
2023	MICRO	MIT	SecureLoop: Design Space Exploration of Secure DNN Accelerators	cryptographic-engine-aware loopnest scheduling; analytical authentication block formulation; simulated annealing-based cross-layer tuning	3	3	2

Benchmarks

Year	Venue	Authors	Title	Tags	P	E	N
2025	arXiv	Cambridge	Benchmarking Ultra-Low-Power µNPUs	Comparative µNPU Benchmarking (µNPU: microcontroller-scale Neural Processing Unit); open-source model compilation framework; µNPU memory I/O bottleneck identification	4	4	2

Dataflow Architecture

Reconfigurable Dataflow Architectures

Challenge: Bridging the gap between the high efficiency of dataflow execution and the need for handling irregular control flows (e.g., MIMD threads) and diverse parallel patterns in general-purpose computing.

Year	Venue	Authors	Title	Tags	P	E	N
2017	ISCA	Stanford	Plasticine: A Reconfigurable Architecture for Parallel Patterns	Pattern Compute Unit (PCU); Pattern Memory Unit (PMU); parallel pattern-based ISA; hierarchical reconfigurable interconnect	5	3	5
2022	ISCA	Stanford	Aurochs: An Architecture for Dataflow Threads	dataflow threads for MIMD execution; resource elasticity and dynamic context switching; hardware extensions for irregular workloads	4	3	4
2023	ISCA	UIUC	MESA: Microarchitecture Extensions for Spatial Architecture Generation	MESA hardware block; runtime spatial dataflow graph; performance-counter feedback loop; data-driven instruction mapping algorithm	3	3	4
2024	MICRO	CMU	The TYR Dataflow Architecture: Improving Locality by Taming Parallelism	local tag spaces technique; space tag managing instruction set; CT based concurrent-block communication

Tensor-Centric Dataflow Architectures

Challenge: Overcoming the memory wall and utilization bottlenecks in Deep Learning workloads by optimizing inter-operator dataflow, exploiting sparsity, and managing tensor contraction sequences.

Year	Venue	Authors	Title	Tags	P	E	N
2019	ASPLOS	THU	Tangram: Optimized Coarse-Grained Dataflow for Scalable NN Accelerators	buffer sharing dataflow (BSD); alternate layer loop ordering (ALLO) dataflow; heuristics spatial layer mapping algorithm
2024	MICRO	UCR	Sparsepipe: Sparse Inter-operator Dataflow Architecture with Cross-Iteration Reuse	producer-consumer reuse; cross-iteration reuse; sub-tensor dependency; OEI dataflow; sparsepipe architecture
2025	arXiv	UCSB	FETTA: Flexible and Efficient Hardware Accelerator for Tensorized Neural Network Training	contraction sequence search engine; tensor contraction unit; distribution/reduction network	3	4	3

Data Mapping

Solution: Assign data to specific locations in memory or storage to optimize performance, reduce latency, and improve resource utilization.

Survey

Year	Venue	Authors	Title	Tags	P	E	N
2013	DAC	NUS	Mapping on Multi/Many-core Systems: Survey of Current and Emerging Trends	design-time/run-time mapping; centralized/distributed management; hybrid mapping

Heuristic Algorithm

Year	Venue	Authors	Title	Tags
2021	HPCA	Georgia Tech	MAGMA: An Optimization Framework for Mapping Multiple DNNs on Multiple Accelerator Cores	sub-accelerator selection; fine-grained job prioritization; MAGMA crossover genetic operators
2023	ISCA	THU	MapZero: Mapping for Coarse-grained Reconfigurable Architectures with Reinforcement Learning and Monte-Carlo Tree Search	GAT based DFG and CGRA embedding; routing penalty based reinforcement learning; Monte-Carlo tree search space exploration
2023	VLSI	IIT Kharagpur	Application Mapping Onto Manycore Processor Architectures Using Active Search Framework	RNN based active search framework; IP-Core Numbering Scheme; active search with/without pretraining

Optimization Modeling

Year	Venue	Authors	Title	Tags
2020	FPGA	ETH Zurich	Flexible Communication Avoiding Matrix Multiplication on FPGA with High-Level Synthesis	computation and I/O decomposition model for matrix multiplication; 1D array collapse mapping method; internal double buffering
2021	HPCA	Georgia Tech	Heterogeneous Dataflow Accelerators for Multi-DNN Workloads	heterogeneous dataflow accelerators (HDAs) for DNN; dataflow flexibility; high utilization across the sub-accelerators
2023	MICRO	Alibaba & CUHK	ArchExplorer: Microarchitecture Exploration Via Bottleneck Analysis	dynamic event-dependence graph (DEG); induced DEG based critical path construction; bottleneck-removal-driven DSE
2023	ISCA	THU	Inter-layer Scheduling Space Definition and Exploration for Tiled Accelerators	inter-layer encoding method; temporal cut; spatial cut; RA tree analysis

Fault Tolerant Mapping

Year	Venue	Authors	Title	Tags	P	E	N
2017	SC	NIT	High-performance and energy-efficient fault-tolerance core mapping in NoC	weighted communication energy; placing unmapped vertices region; application core graph; spare core placement algorithm
2019	IVLSI	UESTC	Optimized mapping algorithm to extend lifetime of both NoC and cores in many-core system	lifetime budget metric; LBC-LBL mapping algorithm; electro-migration fault model

Communication Optimized Mapping

Challenge: Efficiently exploring the vast design space to balance limited on-chip buffer resources with scarce DRAM bandwidth, aiming to minimize off-chip communication latency through optimized layer fusion and scheduling strategies.

Year	Venue	Authors	Title	Tags	P	E	N
2024	ASPLOS	THU	Cocco: Hardware-Mapping Co-Exploration towards Memory Capacity-Communication Optimization	consumption-centric flow based subgraph execution scheme; main/side region based memory management
2025	DAC	THU	Buffer Prospector: Discovering and Exploiting Untapped Buffer Resources in Many-Core DNN Accelerators	data-compute ratio; buffer requirement calculator; layer-pipeline (LP) mapping optimization; greedy based buffer allocator	4	3	2

Reliability Management

Year	Venue	Authors	Title	Tags	P	E	N
2018	DATE	NUST	Variation-Aware Task Allocation and Scheduling for Improving Reliability of Real-Time MPSoCs	variation-aware task allocation; soft-error reliability maximization; cross entropy based task scheduling	4	3	2
2019	DAC	NTU	LifeGuard: A Reinforcement Learning-Based Task Mapping Strategy for Performance-Centric Aging Management	performance-centric aging management; frequency-based core binning; DRL based task mapping	3	3	2
2020	DATE	Turku	Thermal-Cycling-aware Dynamic Reliability Management in Many-Core System-on-Chip	Coffin-Manson equation based reliability model; reliability-aware mapping/scheduling; dynamic power management
2024	arXiv	WUSTL	A Two-Level Thermal Cycling-Aware Task Mapping Technique for Reliability Management in Manycore Systems	temperature based bin packing; task-to-bin assignment; thermal cycling-aware based task-to-core mapping
2024	arXiv	WUSTL	A Reinforcement Learning-Based Task Mapping Method to Improve the Reliability of Clustered Manycores	mean time to failure; density-based spatial clustering of applications with noise algorithm

CGRA¶

Solution: Optimally map an application's data flow graph onto the hardware fabric by simultaneously solving the tightly-coupled problems of scheduling, placement, and routing under strict spatial and temporal resource constraints.

Year	Venue	Authors	Title	Tags	P	E	N
2017	ASAP	Toronto	CGRA-ME: A Unified Framework for CGRA Modelling and Exploration	XML-based CGRA description; LLVM-based simulated annealing mapper; modulo routing resource graph	3	2	2
2019	TCAD	Tsinghua	Data-Flow Graph Mapping Optimization for CGRA With Deep Reinforcement Learning	neighbor PE interchange defined action space; local pattern based reward function	3	3	2
2022	HPCA	NUS	LISA: Graph Neural Network based Portable Mapping on Spatial Accelerators	label abstraction for quality mapping; GNN based label-aware mapping; label-aware simulated annealing	4	3	3
2024	MICRO	NUS	ICED: An Integrated CGRA Framework Enabling DVFS-Aware Acceleration	tile based CGRA configuration; DVFS labeling; DVFS-Aware DFG Mapping	4	3	2
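
Several of the mappers listed above build on simulated annealing. The sketch below illustrates only the placement sub-problem, under strong simplifying assumptions: each DFG node occupies one PE of a `rows x cols` grid, the cost is the total Manhattan distance of DFG edges, and scheduling and routing are ignored. The function names (`wirelength`, `anneal_placement`) and the parameters (`steps`, `t0`, `alpha`) are hypothetical and do not come from any of the listed frameworks.

```python
import math
import random


def wirelength(placement, edges):
    """Total Manhattan distance over all DFG edges (lower is better)."""
    return sum(
        abs(placement[u][0] - placement[v][0]) + abs(placement[u][1] - placement[v][1])
        for u, v in edges
    )


def anneal_placement(num_nodes, edges, rows, cols, steps=5000, t0=2.0, alpha=0.999, seed=0):
    """Place DFG nodes 0..num_nodes-1 on distinct PEs with simulated annealing."""
    rng = random.Random(seed)
    slots = [(r, c) for r in range(rows) for c in range(cols)]
    assert num_nodes <= len(slots), "more DFG nodes than PEs"
    rng.shuffle(slots)
    # slots[i] for i < num_nodes is the PE of node i; the remaining entries are free PEs.
    cost = wirelength(slots, edges)
    best, best_cost = slots[:num_nodes], cost
    temp = t0
    for _ in range(steps):
        i = rng.randrange(num_nodes)   # a placed node
        j = rng.randrange(len(slots))  # another node or a free PE
        if i == j:
            continue
        slots[i], slots[j] = slots[j], slots[i]
        new_cost = wirelength(slots, edges)
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if new_cost <= cost or rng.random() < math.exp((cost - new_cost) / temp):
            cost = new_cost
            if cost < best_cost:
                best, best_cost = slots[:num_nodes], cost
        else:
            slots[i], slots[j] = slots[j], slots[i]  # undo the rejected move
        temp *= alpha  # geometric cooling
    return best, best_cost


# Usage: a 4-node chain DFG on a 2x2 PE array.
chain_edges = [(0, 1), (1, 2), (2, 3)]
placement, cost = anneal_placement(4, chain_edges, rows=2, cols=2)
```

A real CGRA mapper would additionally assign each operation a time slot and check routing resources, for example on a modulo routing resource graph.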

Task Scheduling¶

Year	Venue	Authors	Title	Tags	P	E	N
2023	ICCAD	PKU	Memory-aware Scheduling for Complex Wired Networks with Iterative Graph Optimization	topology-aware pruning algorithm; integer linear programming scheduling method; sub-graph fusion algorithm; memory-aware graph partitioning	4	3	2
2023	MICRO	Duke	Si-Kintsugi: Towards Recovering Golden-Like Performance of Defective Many-Core Spatial Architectures for AI	graph alignment algorithm for dataflow graph and platform PE graph; producer-consumer pattern dataflow generation algorithm

Many-core Architecture¶

Challenge: Many-core architectures integrate a large number of cores, but they face challenges in power consumption, performance, and resource allocation.

Resource Management¶

Challenge: Cores share resources, so access must be coordinated among cores to prevent conflicts and ensure data consistency while still achieving high performance.

Year	Venue	Authors	Title	Tags
2015	HPCA	Cornell	Increasing Multicore System Efficiency through Intelligent Bandwidth Shifting	online bandwidth shifting mechanism; prefetch usefulness (PU) level
2015	HPCA	IBM	XChange: A Market-based Approach to Scalable Dynamic Multi-resource Allocation in Multicore Architectures	CMP multiresource allocation mechanism XChange; market framework based modeling
2018	MICRO	SNU	RpStacks-MT: A High-throughput Design Evaluation Methodology for Multi-core Processors	graph-based multi-core performance model; distance-based memory system model; dynamic scheduling reconstruction method
2023	MICRO	Yonsei	McCore: A Holistic Management of High-Performance Heterogeneous Multicores	cluster partitioning via index hash function; partitions balancing method; hardware support for RL based scheduling

Hardware Design¶

Solution: Hardware implementation for many-core architecture to achieve massive parallelism.

Year	Venue	Authors	Title	Tags	P	E	N
2016	SCIS	THU&BNU&CAS	The Sunway TaihuLight supercomputer: system and applications	Sunway TaihuLight's composition; scientific computing applications on TaihuLight	3	4	2
2017	IPDPSW	SJTU&Tokyo Tech	Benchmarking SW26010 Many-core Processor	hand-coded assembly benchmark for SW26010; CPE pipeline&memory hierarchy&RLC mechanism benchmarking	3	4	2
2020	MICRO	UCSD	Planaria: Dynamic Architecture Fission for Spatial Multi-Tenant Acceleration of Deep Neural Networks	dynamic architecture fission; spatial multi-tenant acceleration; omni-directional systolic arrays	4	3	2
2023	MICRO	THU	MAICC: A Lightweight Many-core Architecture with In-Cache Computing for Multi-DNN Parallel Inference	slice improved and hardware-implemented reduction CIM; ISA extension for CIM; CNN layer segmentation and mapping algorithm

Application Optimization¶

Year	Venue	Authors	Title	Tags
2023	SC	NUDT	Optimizing Direct Convolutions on ARM Multi-Cores	direct convolution algorithm NDirect; loop ordering algorithm; micro convolution kernel for computing & packing
2023	SC	NUDT	Optimizing MPI Collectives on Shared Memory Multi-Cores	intra-node reduction algorithm for redundant data movements; fine grained non-temporal store based adaptive collectives
2024	PPoPP	NUDT	Towards Scalable Unstructured Mesh Computations on Shared Memory Many-Cores	task dependency tree (TDT); tree traversal based parallel algorithm for CPU/GPU

Architecture DSE¶

Challenge: It is crucial to find the optimal hardware configurations that meet performance, power, and area constraints for specific applications.
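
As a minimal illustration of what "optimal under constraints" can mean here, the sketch below first filters candidate configurations by per-metric limits and then keeps only the Pareto-optimal ones. It assumes every metric is "lower is better" (for example latency, power, area); the helper names `dominates` and `pareto_front` and the sample numbers are hypothetical and are not taken from any listed paper.

```python
def dominates(a, b):
    """True if design a is no worse than b in every metric and strictly better in one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(designs, limits):
    """Return names of feasible, non-dominated designs.

    designs: dict mapping name -> tuple of metrics (lower is better)
    limits:  tuple of upper bounds, one per metric
    """
    feasible = {
        name: metrics
        for name, metrics in designs.items()
        if all(value <= limit for value, limit in zip(metrics, limits))
    }
    return sorted(
        name
        for name, metrics in feasible.items()
        if not any(dominates(other, metrics) for other in feasible.values())
    )


# Usage: metrics are (latency, power, area) in arbitrary illustrative units.
candidates = {
    "A": (10, 2.0, 5.0),
    "B": (8, 3.0, 5.0),
    "C": (12, 2.5, 6.0),  # dominated by A
    "D": (6, 9.0, 4.0),   # violates the power limit
}
front = pareto_front(candidates, limits=(15, 5.0, 8.0))
```

Exhaustive filtering like this only works for small candidate sets; the works listed in the following subsections rely on search or learning because real design spaces are far too large to enumerate.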

NoC DSE¶

Year	Venue	Authors	Title	Tags	P	E	N
2018	ICCAD	WSU	Hybrid On-Chip Communication Architectures for Heterogeneous Manycore Systems	many-to-few communication patterns; long range shortcut based wireless NoC; 3D-TSV based heterogeneous NoC
2018	IEEE TC	WSU	On-Chip Communication Network for Efficient Training of Deep Convolutional Networks on Heterogeneous Manycore Systems	wireless-enabled heterogeneous NoC; archived multi-objective simulated annealing for network connectivity

Mapping & Co-Exploration DSE¶

Challenge: Efficiently co-optimize DNN mapping and hardware architecture under complex constraints.

Year	Venue	Authors	Title	Tags	P	E	N
2020	ICCAD	UIUC	DNNExplorer: A Framework for Modeling and Exploring a Novel Paradigm of FPGA-based DNN Accelerator	two-level (global and local) automatic DSE engine; dynamic design space exploration framework; high-dimensional design space support	4	4	4
2020	ICCAD	Gatech	GAMMA: Automating the HW Mapping of DNN Models on Accelerators via Genetic Algorithm	domain-specific genetic representation; growth and aging evolutionary operators; two-stage inter-layer optimization	4	3	2
2022	TACO	Gatech	Marvel: A Data-Centric Approach for Mapping Deep Learning Operators on Spatial Accelerators	dimension dependence graph (DDG); maestro data-centric notation; decoupled off-chip/on-chip mapping exploration	3	3	2
2024	HPCA	THU	Gemini: Mapping and Architecture Co-exploration for Large-scale DNN Chiplet Accelerators	layer-centric encoding method; DP-based graph partition algorithm; SA based D2D link communication optimization
2024	ASPDAC	CUHK	SoC-Tuner: An Importance-guided Exploration Framework for DNN-targeting SoC Design	intercluster distance algorithm; importance-based pruning and initialization	3	2	2
2024	arXiv	Georgia Tech	PIPEORGAN: Efficient Inter-operation Pipelining with Flexible Spatial Org	spatial organization strategy pipeorgan for inter-operator pipelining; augmented mesh for pipelining (AMP) topology	4	2	2
2025	ASPDAC	THU	KAPLA: Scalable NN Accelerator Dataflow Design Space Structuring and Fast Exploring	tensor-centric dataflow directives; bottom-up cost descending; inter-layer pruning and decoupling	3	3	2

Microarchitecture & Cross-Architecture DSE¶

Challenge: Efficiently explore and optimize design spaces across microarchitectures and heterogeneous hardware.

Year	Venue	Authors	Title	Tags	P	E	N
2019	MICRO	Georgia Tech	Understanding Reuse, Performance, and Hardware Cost of DNN Dataflows: A Data-Centric Approach	MAESTRO analytical model; data-centric directives; spatial and temporal reuse analysis; hardware cost estimation; NoC (Network-on-Chip) modeling	4	4	4
2023	DATE	Gatech	AIRCHITECT: Learning Custom Architecture Design and Mapping Space	recommendation neural network; constant time prediction; systolic-array-based accelerators	3	4	2
2025	arXiv	THU & Macau	MLDSE: Scaling Design Space Exploration Infrastructure for Multi-Level Hardware	IR and builder based hardware modeling; cross-architecture DSE; spatial-level DSE	3	3	2
2025	arXiv	PKU	DiffuSE: Cross-Layer Design Space Exploration of DNN Accelerator via Diffusion-Driven Optimization	diffusion-based design generation; conditional sampling	3	4	3
2026	arXiv	Berkeley	ArchAgent: Agentic AI-driven Computer Architecture Discovery	LLM-driven automated tuning of runtime-configurable parameters for workload-specific optimization; automated detection of simulator escapes (AI-exploited logic loopholes in simulators)	3	3	3

Data Access Accelerators¶

Challenge: Indirect and sparse memory access patterns; low memory bandwidth utilization; core structural limitations

Year	Venue	Authors	Title	Tags	P	E	N
2025	ISCA	UMich	DX100: Programmable Data Access Accelerator for Indirection	indirect memory access accelerator; bulk memory access reordering; DRAM row-buffer hit rate optimization; programmable data access ISA	4	3	2

HBM Interconnects¶

Challenge: The built-in crossbar of HBM FPGAs suffers from contention and low bandwidth during many-to-many unicast access, and standard HLS lacks support for efficient burst buffering.

Year	Venue	Authors	Title	Tags	P	E	N
2021	FPGA	UCLA	HBM Connect: High-Performance HLS Interconnect for FPGA HBM	HBM Connect; HLS Virtual Buffer (HVB); Mux-Demux Switch; butterfly custom crossbar; many-to-many unicast; pseudo-channel optimization	4	4	2

Reinforcement Learning Accelerators¶

Challenge: Resolving the memory bottlenecks caused by the irregular, low-arithmetic-intensity operations of experience replay while efficiently synchronizing high-throughput neural network training with sequential data collection.

Year	Venue	Authors	Title	Tags	P	E	N
2026	TVLSI	NUAA	E2CAP: An Energy-Efficient FPGA Accelerator for Deep Reinforcement Learning With Experience Compression and Configurable PE Array	fss+css two-stage experience compression strategy; intra-PE MADD/MAC mode switching to eliminate weight transposition; inter-PE configurable group-slice interconnection for computation imbalance reduction	4	4	2
2022	IEEE Access	Osnabrueck	A Survey of Domain-Specific Architectures for Reinforcement Learning	experience replay bottleneck; lack of on-chip training; normalized efficiency metrics (IPS/LUT)	3	2	2
2022	CF	USC	FPGA Acceleration of Deep Reinforcement Learning using On-Chip Replay Management	on-chip replay management module; k-ary sum tree data structure; hardware pipelining with conflict-free memory access	3	4	2
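
The sum tree mentioned in the tags above is a common structure for prioritized experience replay: leaves hold per-experience priorities, internal nodes hold subtree sums, and sampling descends from the root in logarithmic time. The sketch below is a plain software version with a binary tree (k = 2) rather than the k-ary hardware design of the listed paper; the class name `SumTree` and its methods are assumptions made for illustration.

```python
import random


class SumTree:
    """Array-backed binary sum tree over experience priorities."""

    def __init__(self, capacity):
        # Round up to a power of two so the top-down descent is well defined.
        self.size = 1
        while self.size < capacity:
            self.size *= 2
        # nodes[1] is the root; leaves occupy nodes[size : 2 * size].
        self.nodes = [0.0] * (2 * self.size)

    def update(self, index, priority):
        """Set the priority of one experience and refresh the sums above it."""
        pos = index + self.size
        self.nodes[pos] = priority
        pos //= 2
        while pos >= 1:
            self.nodes[pos] = self.nodes[2 * pos] + self.nodes[2 * pos + 1]
            pos //= 2

    def total(self):
        return self.nodes[1]

    def find(self, value):
        """Return the leaf index whose cumulative-priority interval contains value."""
        pos = 1
        while pos < self.size:
            left = 2 * pos
            if value < self.nodes[left]:
                pos = left
            else:
                value -= self.nodes[left]
                pos = left + 1
        return pos - self.size

    def sample(self, rng=random):
        """Draw an experience index with probability proportional to its priority."""
        return self.find(rng.random() * self.total())


# Usage: four experiences with priorities 1, 2, 3, 4.
tree = SumTree(4)
for i, p in enumerate([1.0, 2.0, 3.0, 4.0]):
    tree.update(i, p)
index = tree.sample()
```

Each `update` and `find` touches one node per tree level, which is the access pattern that hardware designs pipeline across levels.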