Domain-Specific Accelerators
Sorting & Merging Accelerators
Challenge: High-throughput 2-way mergers on FPGAs suffer from long feedback critical paths, high resource utilization from redundant comparators, and tie-record issues in feedback-less designs.
Solution: Relax the sorted-input requirement of the bitonic partial merger by replacing its first stage with distributed MAX units, eliminating barrel shifters and redundant merger blocks while achieving the lowest comparator count.
| Year | Venue | Authors | Title | Tags | P | E | N |
|---|---|---|---|---|---|---|---|
| 2022 | TC | Imperial & dunnhumby | FLiMS: a Fast Lightweight 2-way Merger for Sorting | distributed MAX selector stage eliminating barrel shifter rotation; single 2w-to-w bitonic partial merger with minimum comparator count; skewness optimisation via dir register oscillation for balanced dequeue; FLiMSj whole-row dequeue variant via cR register buffer; SIMD AVX2 implementation for CPU merge sort | 4 | 4 | 4 |
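
As a rough illustration of the 2w-to-w bitonic partial merger named in the Solution line, the sketch below selects the w largest records from two sorted w-wide inputs with a MAX stage and then sorts them with a bitonic merge network. It models the generic textbook structure in software and is not the FLiMS design itself; the function names `bitonic_merge` and `top_w_of_two_sorted` are hypothetical, and w is assumed to be a power of two.

```python
def bitonic_merge(seq):
    """Sort a bitonic sequence (length a power of two) into ascending order."""
    n = len(seq)
    if n == 1:
        return list(seq)
    half = n // 2
    # Half-cleaner: one column of compare-exchange units.
    lo = [min(seq[i], seq[i + half]) for i in range(half)]
    hi = [max(seq[i], seq[i + half]) for i in range(half)]
    return bitonic_merge(lo) + bitonic_merge(hi)


def top_w_of_two_sorted(a, b):
    """Return the w largest elements of two ascending w-element lists, sorted."""
    w = len(a)
    assert len(b) == w and w & (w - 1) == 0, "w must be a power of two"
    # MAX stage: pairing a[i] with b[w-1-i] keeps exactly the w largest records,
    # and the selected records form a bitonic sequence.
    selected = [max(a[i], b[w - 1 - i]) for i in range(w)]
    return bitonic_merge(selected)
```

For example, `top_w_of_two_sorted([1, 4, 6, 9], [2, 3, 7, 8])` returns `[6, 7, 8, 9]`. A full merger would repeat this step each cycle while dequeuing new records from the two input streams, which is the part the FLiMS paper optimises.
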
Homomorphic Encryption Accelerators
Challenge: Current HE accelerators are slow as they rely on complex NTT operations that ignore data sparsity and error tolerance.
| Year | Venue | Authors | Title | Tags | P | E | N |
|---|---|---|---|---|---|---|---|
| 2025 | DATE | PKU | FLASH: An Efficient Hardware Accelerator Leveraging Approximate and Sparse FFT for Homomorphic Encryption | approximate FFT for modular reduction elimination; sparsity-aware FFT dataflow using skipping and merging methods; DSE for Pareto-optimal computation precision | 4 | 2 | 2 |
| 2023 | ISCA | Seoul National | SHARP: A Short-Word Hierarchical Accelerator for Robust and Practical Fully Homomorphic Encryption | 36-bit word length optimization for FHE precision-efficiency trade-off; double-prime scaling unit for instruction fusion and short-word scaling | 3 | 2 | 3 |
Graph Accelerators
Challenge: Massive memory requirements and irregular (non-sequential) memory accesses.
Survey
Challenge: Lack of systematic categorization and review of diverse graph accelerator implementations spanning different architectures and programming paradigms.
| Year | Venue | Authors | Title | Tags | P | E | N |
|---|---|---|---|---|---|---|---|
| 2019 | JCST | HUST | A Survey on Graph Processing Accelerators: Challenges and Opportunities | vertex-centric vs edge-centric iterative paradigms; graph layout reorganization and ordering; source/destination/grid graph partitioning; runtime scheduling execution models | |||
| 2022 | IEEE Micro | UCLA | Systematically Understanding Graph Accelerator Dimensions and the Value of Hardware Flexibility | Taskflow execution model unifying task and dataflow parallelism; graph algorithm variant taxonomy; multi-level spatial partitioning; asynchronous hardware task scheduling priority queue | |||
Pipelined & Event-Driven Graph Accelerators
Solution: Optimize data flow and execution models using pipelining and event-driven mechanisms to handle irregular graph workloads across hardware architectures.
| Year | Venue | Authors | Title | Tags | P | E | N |
|---|---|---|---|---|---|---|---|
| 2016 | MICRO | Princeton | Graphicionado: A High-Performance and Energy-Efficient Accelerator for Graph Analytics | vertex-programming based pipeline; on-chip scratchpad optimization; source/destination-oriented parallel streams | 4 | 4 | 4 |
| 2016 | ISCA | Bilkent | Energy Efficient Architecture for Graph Analytics Accelerators | configurable SystemC-based GAS architecture template with application-pluggable data structures; monotonic-rank RAW/WAR hazard detection; bit-vector/queue dual structure; lock-free active vertex scheduling | 3 | 3 | 3 |
| 2020 | MICRO | UCR | GraphPulse: An Event-Driven Hardware Accelerator for Asynchronous Graph Processing | asynchronous event-driven model; in-place event coalescing; delta-based accumulative processing | 3 | 3 | 4 |
| 2024 | FPGA | HKUST | GraFlex: Flexible Graph Processing on FPGAs through Customized Scalable Interconnection Network | scatter-gather BSP paradigm; customizable multi-stage butterfly interconnection network with virtual-channel flow control; HLS-level coalesced memory; throughput-matching design methodology | 3 | 4 | 3 |
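
To make the event-driven idea more concrete, here is a minimal software sketch of asynchronous single-source shortest paths in which pending updates are coalesced per destination vertex. It is an assumption-laden illustration rather than the design of any accelerator listed above: the function name `event_driven_sssp`, the dictionary used as the event store, and the choice of `min` as the coalescing operator are all hypothetical.

```python
def event_driven_sssp(edges, source, num_vertices):
    """edges maps a vertex to a list of (neighbour, non-negative weight)."""
    INF = float("inf")
    dist = [INF] * num_vertices
    # Event store: at most one pending event per destination vertex.
    pending = {source: 0}
    while pending:
        v, candidate = pending.popitem()  # any order is valid (asynchronous)
        if candidate >= dist[v]:
            continue  # stale event: vertex state would not change
        dist[v] = candidate
        for nbr, weight in edges.get(v, []):
            new = candidate + weight
            # In-place coalescing: keep only the best pending update for nbr.
            if new < pending.get(nbr, INF):
                pending[nbr] = new
    return dist
```

With `edges = {0: [(1, 4), (2, 1)], 2: [(1, 2)], 1: [(3, 5)]}`, calling `event_driven_sssp(edges, 0, 4)` returns `[0, 3, 1, 8]`. Coalescing bounds the event store to one entry per vertex, which is what makes an on-chip event queue practical.
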
FPGA Stream-Partitioned Graph Accelerators
Solution: Partition graphs into intervals and shards, streaming edges through FPGA on-chip memory while caching vertex data to maximize bandwidth utilization and reduce preprocessing overhead.
| Year | Venue | Authors | Title | Tags | P | E | N |
|---|---|---|---|---|---|---|---|
| 2016 | FPGA | THU & UCB | FPGP: Graph Processing Framework on FPGA A Case Study of Breadth-First Search | interval-shard based vertex-centric graph processing on FPGA; on-chip BFS ping-pong vertex caching; analytical performance model with N_pk crossover point | 3 | 4 | 2 |
| 2017 | FPGA | THU & MSR | ForeGraph: Exploring Large-scale Graph Processing on Multi-FPGA Architecture | two-level interval-shard partitioning across multi-FPGA boards; index-based O(m) preprocessing without edge sorting; destination-first replacement strategy minimizing off-chip traffic; PE-level edge shuffling for load balancing; bitmap-based block skipping for sparse iterations | 4 | 4 | 3 |
| 2018 | FCCM | THU & UCB | NewGraph: Balanced Large-scale Graph Processing on FPGAs with Low Preprocessing Overheads | URAM-based large partitions reducing count by 3 orders of magnitude vs ForeGraph; FIFO-based dynamic crossbar eliminating pre-sorting; balanced workload without static edge shuffling | 3 | 4 | 2 |
| 2019 | FPL | NUS | On-The-Fly Parallel Data Shuffling for Graph Processing on OpenCL-Based FPGAs | on-the-fly parallel data shuffling; OpenCL-based data dispatcher; runtime data dependency resolution; decoder-filter shuffling architecture | 3 | 4 | 2 |
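
The interval and shard partitioning described in the Solution line can be sketched in a few lines. This is a simplified model under stated assumptions, not the exact scheme of any paper above: vertices are assumed to be numbered contiguously, intervals have a fixed size, and the names `interval_of`, `partition_edges`, and `stream_blocks` are hypothetical.

```python
from collections import defaultdict


def interval_of(vertex, interval_size):
    """Map a vertex id to the index of the interval that contains it."""
    return vertex // interval_size


def partition_edges(edges, interval_size):
    """Group edges into blocks keyed by (source interval, destination interval)."""
    blocks = defaultdict(list)
    for src, dst in edges:
        key = (interval_of(src, interval_size), interval_of(dst, interval_size))
        blocks[key].append((src, dst))
    return dict(blocks)


def stream_blocks(blocks, num_intervals):
    """Yield non-empty blocks with the destination interval as the outer loop,
    so one destination interval stays cached while source intervals stream by."""
    for dst_iv in range(num_intervals):
        for src_iv in range(num_intervals):
            block = blocks.get((src_iv, dst_iv))
            if block:
                yield (src_iv, dst_iv), block
```

Because each block touches only two vertex intervals, the vertex data it needs fits in on-chip memory while the edges are read sequentially from DRAM. Note that a single pass over the edge list is enough to build the blocks; no sorting is required.
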
Memory-Optimized Graph Accelerators
Solution: Address the memory bottleneck (bandwidth and latency) through specialized memory hierarchies, HBM optimizations, or caching mechanisms.
| Year | Venue | Authors | Title | Tags | P | E | N |
|---|---|---|---|---|---|---|---|
| 2014 | IPDPSW | ISU | CyGraph: A Reconfigurable Architecture for Parallel Breadth-First Search | custom 64-bit CSR combining visited-flag/row-offset/neighbor-count in single memory word; token-ring kernel-to-kernel interface for distributed next-queue coordination across multi-FPGA | 3 | 4 | 2 |
| 2019 | FPGA | HUST | Improving Performance of Graph Processing on FPGA-DRAM Platform by Two-level Vertex Caching | two-level vertex caching (L1 BRAM/L2 UltraRAM); Hilbert-order window sliding for locality; dual-pipeline computation-communication overlapping | 4 | 4 | 3 |
| 2021 | ICCAD | Cornell | GraphLily: Accelerating Graph Linear Algebra on HBM-Equipped FPGAs | GraphBLAS; FPGA overlay; HBM-optimized; SpMV/SpMSpV accelerator; CPSR sparse format; software-hardware co-design | 4 | 4 | 3 |
| 2024 | TRETS | HUST | ScalaBFS2: A High-performance BFS Accelerator on an HBM-enhanced FPGA Chip | HBM-enhanced BFS accelerator; independent HBM Reader; hybrid-mode PE; multi-layer crossbar | 3 | 4 | 3 |
| 2019 | ASPDAC | THU & UCB | GraphSAR: A Sparsity-Aware Processing-in-Memory Architecture for Large-Scale Graph Processing on ReRAMs | sparsity-aware recursive block partitioning with density threshold 0.5; hybrid-centric block-list and edge-list processing model; lightweight graph clustering via vertex index remapping; single-bit ReRAM cell implementation for unweighted algorithms | 3 | 3 | 3 |
Dynamic Graph Accelerators
Challenge: Efficient edge updates and the design of the graph-store data structure.
| Year | Venue | Authors | Title | Tags | P | E | N |
|---|---|---|---|---|---|---|---|
| 2021 | FPGA | HUST | GraSU: A Fast Graph Update Library for FPGA-based Dynamic Graph Processing | differential data management based on spatial similarity; Incremental Value Measurer (IVM); Value-Aware Memory Manager (VMM) | 4 | 4 | 3 |
| 2021 | MICRO | UCR | JetStream: Graph Analytics on Streaming Data with Event-Driven Hardware Accelerator | first streaming graph accelerator; asynchronous incremental algorithms; asynchronous edge deletion handling (VAP; DAP) | 3 | 3 | 4 |
| 2022 | ISCA | HUST | TDGraph: A Topology-Driven Accelerator for High-Performance Streaming Graph Processing | topology-driven incremental execution; streaming graph processing; regularized state propagation; vertex states coalescing | 4 | 4 | 3 |
| 2023 | MICRO | UCR | MEGA Evolving Graph Accelerator | first evolving graph accelerator; Batch-Oriented Execution (BOE); deletion-free based on CommonGraph; Batch Pipelining | 3 | 3 | 4 |
| 2024 | TRETS | UoV | Dynamic-ACTS - A Dynamic Graph Analytics Accelerator For HBM-Enabled FPGAs | a novel edge packing format (ACTPACK); hashed edge updates; low-overhead online partitioning | 4 | 4 | 3 |
Hypergraph Accelerators
Solution: Identify and exploit the parts shared among hyperedges.
| Year | Venue | Authors | Title | Tags | P | E | N |
|---|---|---|---|---|---|---|---|
| 2022 | MICRO | HUST | A Data-Centric Accelerator for High-Performance Hypergraph Processing | Data-Centric; Load-Trigger-Reduce (LTR); Adaptive Data Loading | 4 | 4 | 3 |
| 2025 | HPCA | HUST | MeHyper: Accelerating Hypergraph Neural Networks by Exploring Implicit Dataflows | Microedge; Microedge-Centric Dataflow; RePAG Execution Model | 4 | 3 | 4 |
Graph Mining Accelerators
Challenge: Complex graph algorithms and irregular access patterns.
| Year | Venue | Authors | Title | Tags | P | E | N |
|---|---|---|---|---|---|---|---|
| 2021 | ISCA | MIT | FlexMiner: A Pattern-Aware Accelerator for Graph Pattern Mining | pattern-aware GPM accelerator; software/hardware co-design; pattern-specific execution plan; connectivity map (c-map) | 4 | 4 | 3 |
DNN Accelerators
GEMM
| Year | Venue | Authors | Title | Tags | P | E | N |
|---|---|---|---|---|---|---|---|
| 2020 | HPCA | Georgia Tech | SIGMA: A Sparse and Irregular GEMM Accelerator with Flexible Interconnects for DNN Training | flexible dot product engine; forwarding adder network | 4 | 2 | 3 |
| 2022 | DAC | Gatech | Self Adaptive Reconfigurable Arrays (SARA): Learning Flexible GEMM Accelerator Configuration and Mapping-space using ML | dedicated hardware recommender core; pipelined bypass links; real-time hardware reconfiguration | 3 | 3 | 2 |
| 2025 | TCAS-I | Edin. | DiP: A Scalable, Energy-Efficient Systolic Array for Matrix Multiplication Acceleration | Permutated Weight; FIFO-less Architecture | 4 | 4 | 3 |
Layer Fusion Accelerators
Solution: Use layer fusion to combine multiple layers of a neural network into a single layer. This can help reduce the number of computations and memory accesses required during inference, leading to faster execution times and lower power consumption.
| Year | Venue | Authors | Title | Tags | P | E | N |
|---|---|---|---|---|---|---|---|
| 2016 | MICRO | SBU | Fused-Layer CNN Accelerators | fuse the processing of multiple CNN layers by modifying the order in which the input data are brought on chip | |||
| 2025 | TC | KU Leuven | Stream: Design Space Exploration of Layer-Fused DNNs on Heterogeneous Dataflow Accelerators | fine-grain mapping paradigm; mapping of layer-fused DNNs on heterogeneous dataflow accelerator architectures; memory- and communication-aware latency analysis; constraint optimization | |||
| 2024 | SOCC | IIT Hyderabad | Hardware-Aware Network Adaptation using Width and Depth Shrinking including Convolutional and Fully Connected Layer Merging | Width Shrinking: reduces the number of feature maps in CNN layers; Depth Shrinking: Merge of conv layer and fc layer | |||
| 2024 | ICSAI | MIT | LoopTree: Exploring the Fused-Layer Dataflow Accelerator Design Space | design space that supports a set of tiling, recomputation, retention choices, and their combinations; model that validates design space | |||
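
The memory-access argument behind layer fusion can be illustrated with two back-to-back 1-D convolutions. The sketch below is a toy model under simplifying assumptions (valid convolution, stride 1, one channel) and is not taken from any paper above; `conv1d`, `layer_by_layer`, and `fused` are hypothetical names. The fused version never materialises the full intermediate feature map and only holds one small tile of it at a time.

```python
def conv1d(x, kernel):
    """Valid 1-D convolution (no padding, stride 1)."""
    k = len(kernel)
    return [sum(x[i + j] * kernel[j] for j in range(k))
            for i in range(len(x) - k + 1)]


def layer_by_layer(x, k1, k2):
    """Baseline: the whole intermediate map is produced before layer 2 starts."""
    return conv1d(conv1d(x, k1), k2)


def fused(x, k1, k2, tile):
    """Fused execution: compute the output tile by tile, producing only the
    slice of the intermediate map that each output tile depends on."""
    out_len = len(x) - len(k1) - len(k2) + 2
    out = []
    for start in range(0, out_len, tile):
        stop = min(start + tile, out_len)
        # Outputs [start, stop) depend on inputs [start, stop + len(k1) + len(k2) - 2).
        x_slice = x[start:stop + len(k1) + len(k2) - 2]
        mid = conv1d(x_slice, k1)  # small intermediate tile, kept on chip
        out.extend(conv1d(mid, k2))
    return out
```

Both functions return the same result for any tile size. In this sketch adjacent tiles recompute the intermediate values in their overlap; retaining those values in a buffer instead is the other side of the recomputation versus retention trade-off mentioned in the table.
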
LLM Accelerators
Challenge: LLM accelerators are constrained by memory bandwidth, power consumption, and the need for efficient data movement.
FPGA-Based Transformer Inference Accelerators
Challenge: Deploying Transformer inference on FPGAs/ACAPs suffers from severe shape mismatch between diverse layer sizes and fixed hardware resources, creating an inherent latency-throughput tradeoff that neither purely sequential nor fully spatial accelerator strategies can resolve simultaneously.
Solution: Explore sequential-spatial hybrid accelerator architectures with automated layer-to-accelerator scheduling and inter-accelerator communication co-design to achieve a superior latency-throughput Pareto front.
| Year | Venue | Authors | Title | Tags | P | E | N |
|---|---|---|---|---|---|---|---|
| 2024 | FPGA | Pitt & UMD | SSR: Spatial Sequential Hybrid Architecture for Latency Throughput Tradeoff in Transformer Acceleration | latency-throughput Pareto front via sequential-spatial hybrid design; force-partition strategy for inter-accelerator memory bank conflict elimination; fine-grained line-buffer pipeline for nonlinear kernels (LayerNorm/Softmax) | 4 | 4 | 3 |
| 2024 | FPGA | THU & SJTU | FlightLLM: Efficient Large Language Model Inference with a Complete Mapping Flow on FPGAs | configurable sparse DSP chain for N:M and block sparsity; always-on-chip decode with mixed-precision dequantization unit; length adaptive compilation reducing instruction storage by 500 times | 4 | 4 | 3 |
SSM/Mamba Accelerators
Challenge: Mamba's element-wise operations are incompatible with Tensor Core reduction trees; nonlinear functions (exp, SiLU) require large dedicated hardware units; and element-wise ops have limited data sharing making standard tiling inapplicable.
Solution: Reconfigurable PE array that can disable reduction trees for element-wise ops, decompose nonlinear functions into element-wise primitives, and apply intra/inter-operation buffer management.
| Year | Venue | Authors | Title | Tags | P | E | N |
|---|---|---|---|---|---|---|---|
| 2024 | ICCAD | SJTU | MARCA: Mamba Accelerator with ReConfigurable Architecture | reduction alternative PE array architecture toggling reduction tree for linear vs element-wise ops; fast biased exponential algorithm decomposing exp into shift and element-wise ops; piecewise SiLU approximation reusing reconfigurable PEs; intra-operation and inter-operation buffer management strategy | 4 | 3 | 4 |
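
One way to see how a nonlinear function can be broken into a shift plus element-wise operations is the following generic approximation of the exponential. It is a sketch under stated assumptions and is not MARCA's fast biased exponential or its piecewise SiLU: the linear fit of the fractional power of two, and the names `approx_exp` and `approx_silu`, are illustrative choices only.

```python
import math

LOG2E = 1.4426950408889634  # log2(e)


def approx_exp(x):
    """Approximate e**x as 2**n * (1 + f), where n + f = x * log2(e)."""
    t = x * LOG2E                  # element-wise multiply
    n = math.floor(t)              # integer part: becomes a shift / exponent add
    f = t - n                      # fractional part in [0, 1)
    return math.ldexp(1.0 + f, n)  # (1 + f) * 2**n, linear fit of 2**f


def approx_silu(x):
    """SiLU(x) = x * sigmoid(x), built on the approximate exponential."""
    return x / (1.0 + approx_exp(-x))
```

The linear fit overestimates `2**f` by at most about 6.2 percent, so the result is coarse; a hardware design would typically add a bias or a small correction term to tighten it. The point of the sketch is only that the whole computation reduces to a multiply, a floor, an add, and a shift, all of which are element-wise.
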
Accelerators facing Memory Wall
Challenge: LLMs require large amounts of memory bandwidth to store and access the model parameters and intermediate results which can lead to memory bottlenecks and reduced performance.
| Year | Venue | Authors | Title | Tags | P | E | N |
|---|---|---|---|---|---|---|---|
| 2023 | ASPLOS | Gatech | FLAT: An Optimized Dataflow for Mitigating Attention Bottlenecks | inter-operator tensor-tensor fusion; fine-grained execution tiling; map-space exploration | 3 | 3 | 2 |
| 2024 | ISCA | Furiosa AI | TCP: A Tensor Contraction Processor for AI Workloads | tensor contraction as a hardware primitive; circuit-switched fetch network for hierarchical data reuse; Einstein summation for tactic exploration | 4 | 3 | 3 |
| 2024 | DATE | NTU | ViTA: A Highly Efficient Dataflow and Architecture for Vision Transformers | highly efficient memory-centric dataflow; fused special function module for non-linear functions; A comprehensive DSE of ViTA Kernels and VMUs | |||
| 2025 | arXiv | SJTU | ROMA: A Read-Only-Memory-based Accelerator for QLoRA-based On-Device LLM | hybrid ROM-SRAM architecture for on-device LLM; B-ROM design for area-efficient ROM; fused cell integration of ROM and compute unit; QLoRA rank adaptation for task-specific tuning; on-chip storage optimization for quantized models | |||
| 2025 | ISCA | Duke | Ecco: Improving Memory Bandwidth and Capacity for LLMs via Entropy-Aware Cache Compression | entropy-aware cache compression for LLMs; group-wise non-uniform quantization; shared k-means patterns; parallel Huffman hardware decoder | 4 | 3 | 3 |
Compiler-Scheduled Cacheless Architecture
Challenge: Reactive hardware architectures (like GPUs) devote significant area and power to dynamic scheduling and caching, which leads to unpredictable tail latencies and limits the compute utilization in large-scale distributed systems.
| Year | Venue | Authors | Title | Tags | P | E | N |
|---|---|---|---|---|---|---|---|
| 2020 | ISCA | Groq | Think Fast: A Tensor Streaming Processor (TSP) for Accelerating Deep Learning Workloads | Space-time determined; Functionally Sliced and Data stream Microarchitecture | 4 | 5 | 4 |
| 2022 | ISCA | Groq | A software-defined tensor streaming multiprocessor for large-scale machine learning | Multi-TSP network; Removal of Hardware Flow Control; Deterministic Load Balancing | 4 | 5 | 4 |
Algorithmic Accelerators
Solution: Co-design algorithm-level techniques, such as approximate attention, token and head pruning, and speculative inference, with the hardware that executes them to improve LLM performance.
| Year | Venue | Authors | Title | Tags | P | E | N |
|---|---|---|---|---|---|---|---|
| 2020 | HPCA | Seoul National | A3: Accelerating Attention Mechanisms in Neural Networks with Approximation | greedy candidate search for reducing search targets; post-scoring selection with dynamic thresholding for softmax; lookup-table-based exponent modules | 4 | 3 | 2 |
| 2021 | HPCA | MIT | SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning | cascade pruning; high-parallelism top-K engine; progressive quantization | 3 | 4 | 4 |
| 2024 | ASPLOS | CMU | SpecInfer: Accelerating Large Language Model Serving with Tree-based Speculative Inference and Verification | tree-based speculative inference; topology-aware causal mask; multi-step speculative sampling | 4 | 4 | 4 |
| 2025 | MICRO | KAIST | HLX: A Unified Pipelined Architecture for Optimized Performance of Hybrid Transformer-Mamba Language Models | PipeFlash fine-grained pipelining for Attention; PipeSSD fused pipelined execution for Mamba-2; Unified Reconfigurable Streamlined Core (URSC); inter-operation dependency mitigation | 4 | 3 | 3 |
Wafer-Scale Accelerators
Solution: Adopt holistic hardware-software co-design to integrate high-bandwidth Network-on-Wafer (NoW) architectures with fault-tolerant mapping strategies, thereby mitigating yield limitations and communication bottlenecks to maximize system throughput.
| Year | Venue | Authors | Title | Tags | P | E | N |
|---|---|---|---|---|---|---|---|
| 2025 | ISCA | THU | PD Constraint-aware Physical/Logical Topology Co-Design for Network on Wafer | mesh-switch physical topology; dual-granularity logical topology | 4 | 3 | 2 |
| 2025 | ISCA | THU | WSC-LLM: Efficient LLM Service and Architecture Co-exploration for Wafer-scale Chips | optimal resource partition algorithm; optimal KV cache placement algorithm | 4 | 3 | 2 |
| 2026 | HPCA | THU | WATOS: Efficient LLM Training Strategies and Architecture Co-exploration for Wafer-scale Chip | globally coordinated memory-efficient recomputation; location-aware resource placement; 3-stage system-level robustness design | 4 | 4 | 2 |
Quantized DNN Accelerators
Solution: Quantized DNN accelerators are designed to efficiently execute quantized neural networks, which use lower precision representations for weights and activations.
General-Purpose and Edge Quantized DNN Accelerators
Challenge: General quantized-DNN accelerators must support mixed precision or nonuniform datatypes without losing the throughput and energy benefits of low-bit execution.
| Year | Venue | Authors | Title | Tags | P | E | N |
|---|---|---|---|---|---|---|---|
| 2018 | ISCA | SNU | Energy-Efficient Neural Network Accelerator Based on Outlier-Aware Low-Precision Computation | accelerator architecture for outlier-aware quantized models; outlier-aware low-precision computation; separate outlier MAC unit | 4 | 3 | 2 |
| 2023 | HPCA | UPC | Mix-GEMM: An efficient HW-SW Architecture for Mixed-Precision Quantized Deep Neural Networks Inference on Edge Devices | Complete mixed-precision flexibility; hardware accelerator & BLIS-based library with custom RISC-V ISA extensions | 3 | 2 | 3 |
| 2024 | DAC | ASU | Algorithm-Hardware Co-Design of Distribution-Aware Logarithmic-Posit Encodings for Efficient DNN Inference | composite data type Logarithmic Posits (LP); automated post training LP Quantization (LPQ) Framework based on genetic algorithms; mixed-precision LP Accelerator (LPA) | 3 | 3 | 2 |
Quantized Transformer and Foundation-Model Accelerators
Challenge: Transformer-scale quantized accelerators need to compress weights and activations while preserving model accuracy and reducing expensive memory traffic.
| Year | Venue | Authors | Title | Tags | P | E | N |
|---|---|---|---|---|---|---|---|
| 2020 | MICRO | UToronto | GOBO: Quantizing Attention-Based NLP Models for Low Latency and Energy Efficient Inference | Gaussian-outlier weight splitting; per-layer centroid dictionary quantization; 3-bit index storage for BERT weights; GOBO low-bit accelerator | 4 | 3 | 4 |
| 2022 | ISCA | UToronto | Mokey: Enabling Narrow Fixed-Point Inference for Out-of-the-Box Floating-Point Transformer Models | Golden Dictionary post-training quantization; 4-bit index weights and activations with fixed-point centroids; exponential-fit centroid arithmetic replacing MACs with narrow additions; Mokey accelerator and memory compression assist | 4 | 4 | 4 |
| 2025 | ISCA | Georgia Tech & Intel | MicroScopiQ: Accelerating Foundational Models through Outlier-Aware Microscaling Quantization | outlier-aware MX quantization with pruning-based bit redistribution; ReCoN NoC for high-precision outlier merging; multi-precision INT PE systolic accelerator for LLM/VLM inference | 4 | 3 | 3 |
Bit-Sliced DNN Accelerators
Solution: Bit-sliced DNN accelerators break data down into smaller bit-slices, allowing more efficient processing and reducing memory and compute resource requirements.
| Year | Venue | Authors | Title | Tags | P | E | N |
|---|---|---|---|---|---|---|---|
| 2018 | ISCA | Georgia Tech | Bit Fusion: Bit-Level Dynamically Composable Architecture for Accelerating Deep Neural Network | accelerator for layer-aware quantized DNN; bit-flexible computation unit; block-structured instruction set architecture | 4 | 3 | 3 |
| 2023 | HPCA | KAIST | Sibia: Signed Bit-slice Architecture for Dense DNN Acceleration with Slice-level Sparsity Exploitation | signed bit-slice representation; flexible zero skipping processing element | 3 | 3 | 4 |
| 2024 | HPCA | KU Leuven | BitWave: Exploiting Column-Based Bit-Level Sparsity for Deep Learning Acceleration | Bit-column sparsity for both computation reduction and data compression; Single-shot Bit-Flip post-training | 3 | 3 | 3 |
| 2025 | HPCA | POSTECH | Panacea: Novel DNN Accelerator using Accuracy-Preserving Asymmetric Quantization and Energy-Saving Bit-Slice Sparsity | Asymmetrically-Quantized bit-Slice GEMM; Zero-Point Manipulation and Distribution-based Bit-Slicing to increase sparsity | 3 | 3 | 4 |
| 2025 | HPCA | Yonsei | Bit-slice Architecture for DNN Acceleration with Slice-level Sparsity Enhancement and Exploitation | both input AND weight sparsity at bit-slice level; 8-bit data processing with 4-bit multipliers; Scale regularization during training to enhance sparsity | 3 | 2 | 2 |
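
The basic arithmetic behind bit-slicing can be shown with an 8-bit multiplication built from 4-bit multipliers. The sketch below assumes unsigned operands and a fixed 4-bit slice width, which is simpler than the signed or asymmetric schemes in the papers above; `slices4` and `bit_sliced_multiply` are hypothetical names.

```python
def slices4(value):
    """Split an unsigned 8-bit value into its (low, high) 4-bit slices."""
    assert 0 <= value < 256
    return value & 0xF, value >> 4


def bit_sliced_multiply(a, b):
    """Multiply two unsigned 8-bit values using only 4-bit x 4-bit products.

    Returns (product, number of slice multiplications actually performed).
    """
    product, mults = 0, 0
    for i, a_slice in enumerate(slices4(a)):
        for j, b_slice in enumerate(slices4(b)):
            if a_slice == 0 or b_slice == 0:
                continue  # slice-level sparsity: a zero slice contributes nothing
            # Slice i of a and slice j of b carry weight 2**(4 * (i + j)).
            product += (a_slice * b_slice) << (4 * (i + j))
            mults += 1
    return product, mults
```

For example, `bit_sliced_multiply(0x30, 0x05)` returns `(240, 1)`: three of the four slice products are skipped because one operand slice is zero. This is why techniques that increase the number of zero slices translate directly into saved work.
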
Reconfigurable Dataflow & Interconnect Accelerators
Solution: Dynamically adaptive interconnections and multi-dataflow engines allowing adaptable mapping to match spatial architectures with DNN variants.
| Year | Venue | Authors | Title | Tags | P | E | N |
|---|---|---|---|---|---|---|---|
| 2018 | ASPLOS | Georgia Tech | MAERI: Enabling Flexible Dataflow Mapping over DNN Accelerators via Reconfigurable Interconnects | augmented reduction tree (ART) for link conflict; chubby distribution tree for bandwidth optimization; ART based virtual neuron construction | 4 | 3 | 2 |
| 2019 | JETCAS | MIT | Eyeriss v2: A Flexible Accelerator for Emerging Deep Neural Networks on Mobile Devices | hierarchical mesh NoC for multiple transmission modes; sparse PE architecture | 5 | 4 | 2 |
| 2023 | ASPLOS | UM & Georgia Tech | Flexagon: A Multi-dataflow Sparse-Sparse Matrix Multiplication Accelerator for Efficient DNN Processing | merger-reduction network for area efficiency; compression format conversion without hardware module; dedicated L1 memory architecture for different access pattern | 4 | 3 | 2 |
| 2024 | ISCA | Gatech | FEATHER: A Reconfigurable Accelerator with Data Reordering Support for Low-Cost On-Chip Dataflow Switching | reorder in reduction; dataflow-layout co-switching; butterfly interconnect for reduction and reordering (BIRRD) | 4 | 4 | 2 |
| 2025 | MICRO | Maryland | Misam: Machine Learning Assisted Dataflow Selection in Accelerators for Sparse Matrix Multiplication | ML-assisted runtime dataflow selection; lightweight decision tree predictor; intelligent reconfiguration engine for FPGAs; cost-benefit analysis for hardware switching | 3 | 4 | 3 |
Application & Sparsity-Specific Reconfigurable Accelerators
Solution: Accelerators employing structured sparsity and environment-specific pipelining, enabling them to reconfigure flexibly for structured sparsity masks or specific AI domains (like RL).
Structured Sparsity DNN Accelerators
Solution: Accelerators employing hierarchical structured sparsity masks for flexible DNN acceleration.
| Year | Venue | Authors | Title | Tags | P | E | N |
|---|---|---|---|---|---|---|---|
| 2023 | MLSys | Gatech | SUSHI: SUbgraph Stationary Hardware-software Inference Co-design | subgraph-stationary dataflow; dedicated persistent buffer; cache-state-aware scheduling; moving-average subgraph prediction | 4 | 4 | 2 |
| 2023 | MICRO | MIT | HighLight: Efficient and Flexible DNN Acceleration with Hierarchical Structured Sparsity | hierarchical structured sparsity (HSS); modularized sparsity acceleration architecture; systematic flexibility on sparsity patterns | 3 | 3 | 3 |
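
As a small illustration of what a structured sparsity mask looks like, the sketch below applies single-level N:M pruning, keeping the N largest-magnitude weights in every group of M. This is only the basic building block and not the hierarchical scheme from HighLight, which composes such patterns across several levels; `prune_n_of_m` is a hypothetical name.

```python
def prune_n_of_m(weights, n, m):
    """Keep the n largest-magnitude weights in every group of m, zero the rest."""
    assert len(weights) % m == 0 and 0 <= n <= m
    pruned = []
    for g in range(0, len(weights), m):
        group = weights[g:g + m]
        # Indices of the n largest magnitudes within this group.
        keep = set(sorted(range(m), key=lambda i: abs(group[i]), reverse=True)[:n])
        pruned.extend(w if i in keep else 0 for i, w in enumerate(group))
    return pruned
```

For example, `prune_n_of_m([1, -5, 3, 0.5, 2, 4, -7, 1], 2, 4)` returns `[0, -5, 3, 0, 0, 4, -7, 0]`. Because every group has exactly the same number of non-zeros, hardware can use fixed-width multiplexers and compact per-group indices instead of general sparse formats.
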
RL Environment Accelerators
Solution: FPGA-based accelerators targeting the parallel environment execution bottleneck in reinforcement learning training pipelines.
| Year | Venue | Authors | Title | Tags | P | E | N |
|---|---|---|---|---|---|---|---|
| 2025 | DATE | PKU | PEARL: FPGA-Based Reinforcement Learning Acceleration with Pipelined Parallel Environments | pipelined parallel RL environment execution on FPGA; PCIe data compression and local-store optimization; modular parametric environment template for user customization | 3 | 3 | 2 |
Secure DNN Accelerators
Solution: Integrating on-chip cryptographic engines and co-optimizing the hardware architecture, memory authentication schemes, and data scheduling to efficiently enable Trusted Execution Environments (TEEs) for secure DNN computation with minimal overhead.
| Year | Venue | Authors | Title | Tags | P | E | N |
|---|---|---|---|---|---|---|---|
| 2023 | MICRO | MIT | SecureLoop: Design Space Exploration of Secure DNN Accelerators | cryptographic-engine-aware loopnest scheduling; analytical authentication block formulation; simulated annealing-based cross-layer tuning | 3 | 3 | 2 |
Benchmarks
| Year | Venue | Authors | Title | Tags | P | E | N |
|---|---|---|---|---|---|---|---|
| 2025 | arXiv | Cambridge | Benchmarking Ultra-Low-Power µNPUs | Comparative µNPU Benchmarking (µNPU: microcontroller-scale Neural Processing Unit); open-source model compilation framework; µNPU memory I/O bottleneck identification | 4 | 4 | 2 |
Dataflow Architecture
Reconfigurable Dataflow Architectures
Challenge: Bridging the gap between the high efficiency of dataflow execution and the need for handling irregular control flows (e.g., MIMD threads) and diverse parallel patterns in general-purpose computing.
| Year | Venue | Authors | Title | Tags | P | E | N |
|---|---|---|---|---|---|---|---|
| 2017 | ISCA | Stanford | Plasticine: A Reconfigurable Architecture for Parallel Patterns | Pattern Compute Unit (PCU); Pattern Memory Unit (PMU); parallel pattern-based ISA; hierarchical reconfigurable interconnect | 5 | 3 | 5 |
| 2022 | ISCA | Stanford | Aurochs: An Architecture for Dataflow Threads | dataflow threads for MIMD execution; resource elasticity and dynamic context switching; hardware extensions for irregular workloads | 4 | 3 | 4 |
| 2024 | MICRO | CMU | The TYR Dataflow Architecture: Improving Locality by Taming Parallelism | local tag spaces technique; space tag managing instruction set; CT based concurrent-block communication | |||
Tensor-Centric Dataflow Architectures
Challenge: Overcoming the memory wall and utilization bottlenecks in Deep Learning workloads by optimizing inter-operator dataflow, exploiting sparsity, and managing tensor contraction sequences.
| Year | Venue | Authors | Title | Tags | P | E | N |
|---|---|---|---|---|---|---|---|
| 2019 | ASPLOS | THU | Tangram: Optimized Coarse-Grained Dataflow for Scalable NN Accelerators | buffer sharing dataflow (BSD); alternate layer loop ordering (ALLO) dataflow; heuristics spatial layer mapping algorithm | |||
| 2024 | MICRO | UCR | Sparsepipe: Sparse Inter-operator Dataflow Architecture with Cross-Iteration Reuse | producer-consumer reuse; cross-iteration reuse; sub-tensor dependency; OEI dataflow; sparsepipe architecture | |||
| 2025 | arXiv | UCSB | FETTA: Flexible and Efficient Hardware Accelerator for Tensorized Neural Network Training | contraction sequence search engine; tensor contraction unit; distribution/reduction network | 3 | 4 | 3 |
Data Mapping
Solution: Assign data to specific locations in memory or storage to optimize performance, reduce latency, and improve resource utilization.
Survey
| Year | Venue | Authors | Title | Tags | P | E | N |
|---|---|---|---|---|---|---|---|
| 2013 | DAC | NUS | Mapping on Multi/Many-core Systems: Survey of Current and Emerging Trends | design-time/run-time mapping; centralized/distributed management; hybrid mapping | |||
Heuristic Algorithm
| Year | Venue | Authors | Title | Tags | P | E | N |
|---|---|---|---|---|---|---|---|
| 2021 | HPCA | Georgia Tech | MAGMA: An Optimization Framework for Mapping Multiple DNNs on Multiple Accelerator Cores | sub-accelerator selection; fine-grained job prioritization; MAGMA crossover genetic operators | |||
| 2023 | ISCA | THU | MapZero: Mapping for Coarse-grained Reconfigurable Architectures with Reinforcement Learning and Monte-Carlo Tree Search | GAT based DFG and CGRA embedding; routing penalty based reinforcement learning; Monte-Carlo tree search space exploration | |||
| 2023 | VLSI | IIT Kharagpur | Application Mapping Onto Manycore Processor Architectures Using Active Search Framework | RNN based active search framework; IP-Core Numbering Scheme; active search with/without pretraining | |||
Optimization Modeling
| Year | Venue | Authors | Title | Tags | P | E | N |
|---|---|---|---|---|---|---|---|
| 2020 | FPGA | ETH Zurich | Flexible Communication Avoiding Matrix Multiplication on FPGA with High-Level Synthesis | computation and I/O decomposition model for matrix multiplication; 1D array collapse mapping method; internal double buffering | |||
| 2021 | HPCA | Georgia Tech | Heterogeneous Dataflow Accelerators for Multi-DNN Workloads | heterogeneous dataflow accelerators (HDAs) for DNN; dataflow flexibility; high utilization across the sub-accelerators | |||
| 2023 | MICRO | Alibaba & CUHK | ArchExplorer: Microarchitecture Exploration Via Bottleneck Analysis | dynamic event-dependence graph (DEG); induced DEG based critical path construction; bottleneck-removal-driven DSE | |||
| 2023 | ISCA | THU | Inter-layer Scheduling Space Definition and Exploration for Tiled Accelerators | inter-layer encoding method; temporal cut; spatial cut; RA tree analysis | |||
Fault Tolerant Mapping
| Year | Venue | Authors | Title | Tags | P | E | N |
|---|---|---|---|---|---|---|---|
| 2017 | SC | NIT | High-performance and energy-efficient fault-tolerance core mapping in NoC | weighted communication energy; placing unmapped vertices region; application core graph; spare core placement algorithm | |||
| 2019 | IVLSI | UESTC | Optimized mapping algorithm to extend lifetime of both NoC and cores in many-core system | lifetime budget metric; LBC-LBL mapping algorithm; electro-migration fault model | |||
Communication Optimized Mapping
Challenge: Efficiently exploring the vast design space to balance limited on-chip buffer resources with scarce DRAM bandwidth, aiming to minimize off-chip communication latency through optimized layer fusion and scheduling strategies.
| Year | Venue | Authors | Title | Tags | P | E | N |
|---|---|---|---|---|---|---|---|
| 2024 | ASPLOS | THU | Cocco: Hardware-Mapping Co-Exploration towards Memory Capacity-Communication Optimization | consumption-centric flow based subgraph execution scheme; main/side region based memory management | |||
| 2025 | DAC | THU | Buffer Prospector: Discovering and Exploiting Untapped Buffer Resources in Many-Core DNN Accelerators | data-compute ratio; buffer requirement calculator; layer-pipeline (LP) mapping optimization; greedy based buffer allocator | 4 | 3 | 2 |
Reliability Management¶
| Year | Venue | Authors | Title | Tags | P | E | N |
|---|---|---|---|---|---|---|---|
| 2018 | DATE | NUST | Variation-Aware Task Allocation and Scheduling for Improving Reliability of Real-Time MPSoCs | variation-aware task allocation; soft-error reliability maximization; cross entropy based task scheduling | 4 | 3 | 2 |
| 2019 | DAC | NTU | LifeGuard: A Reinforcement Learning-Based Task Mapping Strategy for Performance-Centric Aging Management | performance-centric aging management; frequency-based core binning; DRL based task mapping | 3 | 3 | 2 |
| 2020 | DATE | Turku | Thermal-Cycling-aware Dynamic Reliability Management in Many-Core System-on-Chip | Coffin-Manson equation based reliability model; reliability-aware mapping/scheduling; dynamic power management | |||
| 2024 | arXiv | WUSTL | A Two-Level Thermal Cycling-Aware Task Mapping Technique for Reliability Management in Manycore Systems | temperature based bin packing; task-to-bin assignment; thermal cycling-aware based task-to-core mapping | |||
| 2024 | arXiv | WUSTL | A Reinforcement Learning-Based Task Mapping Method to Improve the Reliability of Clustered Manycores | mean time to failure; density-based spatial clustering of applications with noise algorithm | |||
CGRA¶
Solution: Optimally map an application's data flow graph onto the hardware fabric by simultaneously solving the tightly-coupled problems of scheduling, placement, and routing under strict spatial and temporal resource constraints.
| Year | Venue | Authors | Title | Tags | P | E | N |
|---|---|---|---|---|---|---|---|
| 2017 | ASAP | Toronto | CGRA-ME: A Unified Framework for CGRA Modelling and Exploration | XML-based CGRA description; LLVM-based simulated annealing mapper; modulo routing resource graph | 3 | 2 | 2 |
| 2019 | TCAD | Tsinghua | Data-Flow Graph Mapping Optimization for CGRA With Deep Reinforcement Learning | neighbor PE interchange defined action space; local pattern based reward function | 3 | 3 | 2 |
| 2022 | HPCA | NUS | LISA: Graph Neural Network based Portable Mapping on Spatial Accelerators | label abstraction for quality mapping; GNN based label-aware mapping; label-aware simulated annealing | 4 | 3 | 3 |
| 2024 | MICRO | NUS | ICED: An Integrated CGRA Framework Enabling DVFS-Aware Acceleration | tile based CGRA configuration; DVFS labeling; DVFS-Aware DFG Mapping | 4 | 3 | 2 |
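To make the placement part of the mapping problem above concrete, here is a minimal Python sketch of simulated-annealing placement of a data flow graph onto a grid of processing elements. It is an illustration under stated assumptions, not the method of any paper in the table: the names `wirelength` and `anneal_placement`, the Manhattan-distance cost, the move set, and the cooling schedule are all hypothetical, and scheduling (time slots) and routing are deliberately ignored.

```python
import math
import random


def wirelength(edges, placement):
    """Total Manhattan distance between the PEs of connected DFG nodes."""
    return sum(
        abs(placement[u][0] - placement[v][0]) + abs(placement[u][1] - placement[v][1])
        for u, v in edges
    )


def anneal_placement(nodes, edges, rows, cols, steps=2000, seed=0):
    """Place each DFG node on a distinct PE of a rows x cols grid (assumed cost: wirelength)."""
    if len(nodes) > rows * cols:
        raise ValueError("more DFG nodes than PEs")
    rng = random.Random(seed)
    slots = [(r, c) for r in range(rows) for c in range(cols)]
    rng.shuffle(slots)
    placement = {n: slots[i] for i, n in enumerate(nodes)}  # random initial placement
    free = slots[len(nodes):]                               # currently unused PEs
    cost = wirelength(edges, placement)
    best, best_cost = dict(placement), cost
    temperature = 2.0
    for _ in range(steps):
        a = rng.choice(nodes)
        if free and rng.random() < 0.5:
            # Move node `a` to a free PE; its old PE becomes free.
            i = rng.randrange(len(free))
            placement[a], free[i] = free[i], placement[a]
            undo = ("free", a, i)
        else:
            # Swap the PEs of two nodes (a no-op if both picks coincide).
            b = rng.choice(nodes)
            placement[a], placement[b] = placement[b], placement[a]
            undo = ("swap", a, b)
        new_cost = wirelength(edges, placement)
        delta = new_cost - cost
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            cost = new_cost  # accept the move
            if cost < best_cost:
                best, best_cost = dict(placement), cost
        elif undo[0] == "free":
            _, a, i = undo
            placement[a], free[i] = free[i], placement[a]  # revert the move
        else:
            _, a, b = undo
            placement[a], placement[b] = placement[b], placement[a]  # revert the swap
        temperature = max(0.01, temperature * 0.995)  # geometric cooling
    return best, best_cost


if __name__ == "__main__":
    dfg_nodes = ["ld0", "ld1", "mul", "add", "st"]
    dfg_edges = [("ld0", "mul"), ("ld1", "mul"), ("mul", "add"), ("add", "st")]
    mapping, cost = anneal_placement(dfg_nodes, dfg_edges, rows=3, cols=3)
    print(cost, mapping)
```

A real CGRA mapper would additionally assign each operation a cycle in a modulo schedule and check that every edge can be routed through the interconnect, which is why the three sub-problems are described as tightly coupled.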
Task Scheduling¶
| Year | Venue | Authors | Title | Tags | P | E | N |
|---|---|---|---|---|---|---|---|
| 2023 | ICCAD | PKU | Memory-aware Scheduling for Complex Wired Networks with Iterative Graph Optimization | topology-aware pruning algorithm; integer linear programming scheduling method; sub-graph fusion algorithm; memory-aware graph partitioning | 4 | 3 | 2 |
| 2023 | MICRO | Duke | Si-Kintsugi: Towards Recovering Golden-Like Performance of Defective Many-Core Spatial Architectures for AI | graph alignment algorithm for dataflow graph and platform PE graph; producer-consumer pattern dataflow generation algorithm | |||
Many-core Architecture¶
Challenge: Many-core architectures integrate a large number of cores on one chip, but they face challenges in power consumption, performance, and resource allocation.
Resource Management¶
Challenge: Cores share resources with each other, so access must be coordinated among cores to prevent conflicts and ensure data consistency while still achieving high performance.
| Year | Venue | Authors | Title | Tags | P | E | N |
|---|---|---|---|---|---|---|---|
| 2015 | HPCA | Cornell | Increasing Multicore System Efficiency through Intelligent Bandwidth Shifting | online bandwidth shifting mechanism; prefetch usefulness (PU) level | |||
| 2015 | HPCA | IBM | XChange: A Market-based Approach to Scalable Dynamic Multi-resource Allocation in Multicore Architectures | CMP multiresource allocation mechanism XChange; market framework based modeling | |||
| 2018 | MICRO | SNU | RpStacks-MT: A High-throughput Design Evaluation Methodology for Multi-core Processors | graph-based multi-core performance model; distance-based memory system model; dynamic scheduling reconstruction method | |||
| 2023 | MICRO | Yonsei | McCore: A Holistic Management of High-Performance Heterogeneous Multicores | cluster partitioning via index hash function; partitions balancing method; hardware support for RL based scheduling | |||
Hardware Design¶
Solution: Hardware implementation for many-core architecture to achieve massive parallelism.
| Year | Venue | Authors | Title | Tags | P | E | N |
|---|---|---|---|---|---|---|---|
| 2016 | SCIS | THU&BNU&CAS | The Sunway TaihuLight supercomputer: system and applications | Sunway TaihuLight's composition; scientific computing applications on TaihuLight | 3 | 4 | 2 |
| 2017 | IPDPSW | SJTU&Tokyo Tech | Benchmarking SW26010 Many-core Processor | hand-coded assembly benchmark for SW26010; CPE pipeline&memory hierarchy&RLC mechanism benchmarking | 3 | 4 | 2 |
| 2020 | MICRO | UCSD | Planaria: Dynamic Architecture Fission for Spatial Multi-Tenant Acceleration of Deep Neural Networks | dynamic architecture fission; spatial multi-tenant acceleration; omni-directional systolic arrays | 4 | 3 | 2 |
| 2023 | MICRO | THU | MAICC: A Lightweight Many-core Architecture with In-Cache Computing for Multi-DNN Parallel Inference | slice improved and hardware-implemented reduction CIM; ISA extension for CIM; CNN layer segmentation and mapping algorithm | |||
Application Optimization¶
| Year | Venue | Authors | Title | Tags | P | E | N |
|---|---|---|---|---|---|---|---|
| 2023 | SC | NUDT | Optimizing Direct Convolutions on ARM Multi-Cores | direct convolution algorithm NDirect; loop ordering algorithm; micro convolution kernel for computing & packeting | |||
| 2023 | SC | NUDT | Optimizing MPI Collectives on Shared Memory Multi-Cores | intra-node reduction algorithm for redundant data movements; fine grained non-temporal store based adaptive collectives | |||
| 2024 | PPoPP | NUDT | Towards Scalable Unstructured Mesh Computations on Shared Memory Many-Cores | task dependency tree (TDT); tree traversal based parallel algorithm for CPU/GPU | |||
Architecture DSE¶
Challenge: Finding hardware configurations that meet the performance, power, and area constraints of specific applications.
NOC DSE¶
| Year | Venue | Authors | Title | Tags | P | E | N |
|---|---|---|---|---|---|---|---|
| 2018 | ICCAD | WSU | Hybrid On-Chip Communication Architectures for Heterogeneous Manycore Systems | many-to-few communication patterns; long range shortcut based wireless NoC; 3D-TSV based heterogeneous NoC | |||
| 2018 | IEEE TC | WSU | On-Chip Communication Network for Efficient Training of Deep Convolutional Networks on Heterogeneous Manycore Systems | wireless-enabled heterogeneous NoC; archived multi-objective simulated annealing for network connectivity | |||
Mapping & Co-Exploration DSE¶
Challenge: Efficiently co-optimize DNN mapping and hardware architecture under complex constraints.
| Year | Venue | Authors | Title | Tags | P | E | N |
|---|---|---|---|---|---|---|---|
| 2020 | ICCAD | UIUC | DNNExplorer: A Framework for Modeling and Exploring a Novel Paradigm of FPGA-based DNN Accelerator | two-level (global and local) automatic DSE engine; dynamic design space exploration framework; high-dimensional design space support | 4 | 4 | 4 |
| 2020 | ICCAD | Gatech | GAMMA: Automating the HW Mapping of DNN Models on Accelerators via Genetic Algorithm | domain-specific genetic representation; growth and aging evolutionary operators; two-stage inter-layer optimization | 4 | 3 | 2 |
| 2022 | TACO | Gatech | Marvel: A Data-Centric Approach for Mapping Deep Learning Operators on Spatial Accelerators | dimension dependence graph (DDG); maestro data-centric notation; decoupled off-chip/on-chip mapping exploration | 3 | 3 | 2 |
| 2024 | HPCA | THU | Gemini: Mapping and Architecture Co-exploration for Large-scale DNN Chiplet Accelerators | layer-centric encoding method; DP-based graph partition algorithm; SA based D2D link communication optimization | |||
| 2024 | ASPDAC | CUHK | SoC-Tuner: An Importance-guided Exploration Framework for DNN-targeting SoC Design | intercluster distance algorithm; importance-based pruning and initialization | 3 | 2 | 2 |
| 2024 | arXiv | Georgia Tech | PIPEORGAN: Efficient Inter-operation Pipelining with Flexible Spatial Org | spatial organization strategy pipeorgan for inter-operator pipelining; augmented mesh for pipelining(AMP) topology | 4 | 2 | 2 |
| 2025 | ASPDAC | THU | KAPLA: Scalable NN Accelerator Dataflow Design Space Structuring and Fast Exploring | tensor-centric dataflow directives; bottom-up cost descending; inter-layer pruning and decoupling | 3 | 3 | 2 |
Microarchitecture & Cross-Architecture DSE¶
Challenge: Efficiently explore and optimize design spaces across microarchitectures and heterogeneous hardware.
| Year | Venue | Authors | Title | Tags | P | E | N |
|---|---|---|---|---|---|---|---|
| 2019 | MICRO | Georgia Tech | Understanding Reuse, Performance, and Hardware Cost of DNN Dataflows: A Data-Centric Approach | MAESTRO analytical model; data-centric directives; spatial and temporal reuse analysis; hardware cost estimation; NoC (Network-on-Chip) modeling | 4 | 4 | 4 |
| 2023 | DATE | Gatech | AIRCHITECT: Learning Custom Architecture Design and Mapping Space | recommendation neural network; constant time prediction; systolic-array-based accelerators | 3 | 4 | 2 |
| 2025 | arXiv | THU & Macau | MLDSE: Scaling Design Space Exploration Infrastructure for Multi-Level Hardware | IR and builder based hardware modeling; cross-architecture DSE; spatial-level DSE | 3 | 3 | 2 |
| 2025 | arXiv | PKU | DiffuSE: Cross-Layer Design Space Exploration of DNN Accelerator via Diffusion-Driven Optimization | diffusion-based design generation; conditional sampling | 3 | 4 | 3 |
| 2026 | arXiv | Berkeley | ArchAgent: Agentic AI-driven Computer Architecture Discovery | LLM-driven automated tuning of runtime-configurable parameters for workload-specific optimization; automated detection of simulator escapes (AI-exploited logic loopholes in simulators) | 3 | 3 | 3 |
Data Access Accelerators¶
Challenge: Indirect and sparse memory access patterns; low memory bandwidth utilization; core structural limitations
| Year | Venue | Authors | Title | Tags | P | E | N |
|---|---|---|---|---|---|---|---|
| 2025 | ISCA | UMich | DX100: Programmable Data Access Accelerator for Indirection | indirect memory access accelerator; bulk memory access reordering; DRAM row-buffer hit rate optimization; programmable data access ISA | 4 | 3 | 2 |
HBM Interconnects¶
Challenge: The built-in crossbar of HBM FPGAs suffers from contention and low bandwidth during many-to-many unicast access, and standard HLS lacks support for efficient burst buffering.
| Year | Venue | Authors | Title | Tags | P | E | N |
|---|---|---|---|---|---|---|---|
| 2021 | FPGA | UCLA | HBM Connect: High-Performance HLS Interconnect for FPGA HBM | HBM Connect; HLS Virtual Buffer (HVB); Mux-Demux Switch; butterfly custom crossbar; many-to-many unicast; pseudo-channel optimization | 4 | 4 | 2 |
Reinforcement Learning Accelerators¶
Challenge: Resolving the memory bottlenecks caused by the irregular, low-arithmetic-intensity operations of experience replay while efficiently synchronizing high-throughput neural network training with sequential data collection.
| Year | Venue | Authors | Title | Tags | P | E | N |
|---|---|---|---|---|---|---|---|
| 2026 | TVLSI | NUAA | E2CAP: An Energy-Efficient FPGA Accelerator for Deep Reinforcement Learning With Experience Compression and Configurable PE Array | fss+css two-stage experience compression strategy; intra-PE MADD/MAC mode switching to eliminate weight transposition; inter-PE configurable group-slice interconnection for computation imbalance reduction | 4 | 4 | 2 |
| 2022 | IEEE Access | Osnabrueck | A Survey of Domain-Specific Architectures for Reinforcement Learning | experience replay bottleneck; lack of on-chip training; normalized efficiency metrics (IPS/LUT) | 3 | 2 | 2 |
| 2022 | CF | USC | FPGA Acceleration of Deep Reinforcement Learning using On-Chip Replay Management | on-chip replay management module; k-ary sum tree data structure; hardware pipelining with conflict-free memory access | 3 | 4 | 2 |
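The sum tree named in the tags above is the data structure that makes prioritized experience replay's sampling irregular: each sample walks a root-to-leaf path chosen by the data. The following Python sketch shows a binary sum tree (the k = 2 case of a k-ary tree) as a software illustration only. The class name `SumTree`, its methods, and the power-of-two capacity restriction are assumptions made here and do not describe the hardware design of any listed paper.

```python
import random


class SumTree:
    """Binary sum tree over `capacity` priorities; each inner node stores the sum of its children."""

    def __init__(self, capacity: int) -> None:
        if capacity <= 0 or capacity & (capacity - 1):
            raise ValueError("capacity must be a positive power of two")
        self.capacity = capacity
        # Node 1 is the root; leaves occupy indices [capacity, 2 * capacity).
        self.nodes = [0.0] * (2 * capacity)

    def update(self, index: int, priority: float) -> None:
        """Set the priority of experience `index` and refresh the sums on its path to the root."""
        node = index + self.capacity
        self.nodes[node] = priority
        node //= 2
        while node >= 1:
            self.nodes[node] = self.nodes[2 * node] + self.nodes[2 * node + 1]
            node //= 2

    def total(self) -> float:
        """Sum of all priorities (the value stored at the root)."""
        return self.nodes[1]

    def find(self, value: float) -> int:
        """Return the experience whose cumulative-priority interval contains `value`."""
        node = 1
        while node < self.capacity:
            left = 2 * node
            if value < self.nodes[left]:
                node = left
            else:
                value -= self.nodes[left]
                node = left + 1
        return node - self.capacity

    def sample(self, rng: random.Random) -> int:
        """Draw one experience index with probability proportional to its priority."""
        return self.find(rng.random() * self.total())


if __name__ == "__main__":
    tree = SumTree(4)
    for i, p in enumerate([1.0, 2.0, 3.0, 4.0]):
        tree.update(i, p)
    print(tree.total(), tree.find(3.0), tree.sample(random.Random(0)))
```

Both `update` and `find` touch one node per tree level, so each costs O(log n) dependent memory accesses with little arithmetic, which is consistent with the low arithmetic intensity described in the challenge statement.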