Domain-Specific Accelerators

Sorting & Merging Accelerators

Challenge: High-throughput 2-way mergers on FPGAs suffer from long feedback critical paths, high resource utilization from redundant comparators, and tie-record issues in feedback-less designs.

Solution: Relax the sorted-input requirement of the bitonic partial merger by replacing its first stage with distributed MAX units, eliminating barrel shifters and redundant merger blocks while achieving the lowest comparator count.

Year Venue Authors Title Tags P E N
2022 TC Imperial & dunnhumby FLiMS: a Fast Lightweight 2-way Merger for Sorting distributed MAX selector stage eliminating barrel shifter rotation; single 2w-to-w bitonic partial merger with minimum comparator count; skewness optimisation via dir register oscillation for balanced dequeue; FLiMSj whole-row dequeue variant via cR register buffer; SIMD AVX2 implementation for CPU merge sort 4 4 4
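As a rough illustration of the bitonic partial merger that the FLiMS entry builds on, the sketch below selects the w largest records from two sorted w-record inputs. It is a software model of the textbook network, not the FLiMS hardware design; the function names and the descending sort order are assumptions made for this example.

```python
def bitonic_merge_desc(seq):
    """Sort a bitonic sequence (length a power of two) into descending order."""
    n = len(seq)
    if n == 1:
        return list(seq)
    half = n // 2
    # Half-cleaner stage: one comparator per pair (i, i + half).
    upper = [max(seq[i], seq[i + half]) for i in range(half)]
    lower = [min(seq[i], seq[i + half]) for i in range(half)]
    return bitonic_merge_desc(upper) + bitonic_merge_desc(lower)


def partial_merge_top_w(a, b):
    """Return the w largest items of a + b in descending order.

    a and b must both be sorted descending and have the same length w,
    which is assumed to be a power of two.
    """
    w = len(a)
    assert len(b) == w and w > 0 and w & (w - 1) == 0
    # First stage: compare a[i] with b[w - 1 - i]. The maxima are the w
    # largest records overall and form a bitonic sequence.
    selected = [max(a[i], b[w - 1 - i]) for i in range(w)]
    return bitonic_merge_desc(selected)


print(partial_merge_top_w([9, 7, 4, 1], [8, 6, 5, 2]))  # [9, 8, 7, 6]
```

A 2w-to-w partial merger of this kind uses w comparators in the first stage plus (w/2)·log2(w) in the merge network, which is why the comparator count is a natural metric for such designs.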

Homomorphic Encryption Accelerators

Challenge: Current HE accelerators are slow as they rely on complex NTT operations that ignore data sparsity and error tolerance.

Year Venue Authors Title Tags P E N
2025 DATE PKU FLASH: An Efficient Hardware Accelerator Leveraging Approximate and Sparse FFT for Homomorphic Encryption approximate FFT for modular reduction elimination; sparsity-aware FFT dataflow using skipping and merging methods; DSE for Pareto-optimal computation precision 4 2 2
2023 ISCA Seoul National SHARP: A Short-Word Hierarchical Accelerator for Robust and Practical Fully Homomorphic Encryption 36-bit word length optimization for FHE precision-efficiency trade-off; double-prime scaling unit for instruction fusion and short-word scaling 3 2 3

Graph Accelerators

Challenge: Massive memory requirements and irregular (non-sequential) memory access patterns.

Survey

Challenge: Lack of systematic categorization and review of diverse graph accelerator implementations spanning different architectures and programming paradigms.

Year Venue Authors Title Tags P E N
2019 JCST HUST A Survey on Graph Processing Accelerators: Challenges and Opportunities vertex-centric vs edge-centric iterative paradigms; graph layout reorganization and ordering; source/destination/grid graph partitioning; runtime scheduling execution models
2022 IEEE Micro UCLA Systematically Understanding Graph Accelerator Dimensions and the Value of Hardware Flexibility Taskflow execution model unifying task and dataflow parallelism; graph algorithm variant taxonomy; multi-level spatial partitioning; asynchronous hardware task scheduling priority queue

Pipelined & Event-Driven Graph Accelerators

Solution: Optimize data flow and execution models using pipelining and event-driven mechanisms to handle irregular graph workloads across hardware architectures.

Year Venue Authors Title Tags P E N
2016 MICRO Princeton Graphicionado: A High-Performance and Energy-Efficient Accelerator for Graph Analytics vertex-programming based pipeline; on-chip scratchpad optimization; source/destination-oriented parallel streams 4 4 4
2016 ISCA Bilkent Energy Efficient Architecture for Graph Analytics Accelerators configurable SystemC-based GAS architecture template with application-pluggable data structures; monotonic-rank RAW/WAR hazard detection; bit-vector/queue dual structure; lock-free active vertex scheduling 3 3 3
2020 MICRO UCR GraphPulse: An Event-Driven Hardware Accelerator for Asynchronous Graph Processing asynchronous event-driven model; in-place event coalescing; delta-based accumulative processing 3 3 4
2024 FPGA HKUST GraFlex: Flexible Graph Processing on FPGAs through Customized Scalable Interconnection Network scatter-gather BSP paradigm; customizable multi-stage butterfly interconnection network with virtual-channel flow control; HLS-level coalesced memory; throughput-matching design methodology 3 4 3
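To make the event-driven, asynchronous model in this group more concrete, here is a minimal software sketch of single-source shortest paths driven by coalesced events. It only illustrates the general idea (one pending event per vertex, processed in arbitrary order); it is not the microarchitecture of any listed accelerator, and the names `event_driven_sssp` and `adjacency` are made up for the example.

```python
from math import inf


def event_driven_sssp(adjacency, source):
    """Shortest distances from source.

    adjacency maps a vertex to a list of (neighbor, weight) pairs with
    non-negative weights.
    """
    dist = {}
    # Pending events, coalesced in place: at most one event per vertex,
    # keeping only the smallest candidate distance seen so far.
    pending = {source: 0}
    while pending:
        # Any processing order is valid; that is the asynchronous part.
        vertex, candidate = pending.popitem()
        if candidate >= dist.get(vertex, inf):
            continue  # stale event, no improvement
        dist[vertex] = candidate
        for neighbor, weight in adjacency.get(vertex, []):
            new_candidate = candidate + weight
            if (new_candidate < dist.get(neighbor, inf)
                    and new_candidate < pending.get(neighbor, inf)):
                pending[neighbor] = new_candidate  # coalesce with old event
    return dist


graph = {"a": [("b", 4), ("c", 1)], "c": [("b", 2)], "b": [("d", 5)]}
print(event_driven_sssp(graph, "a"))
```

Coalescing bounds the number of outstanding events by the number of vertices, which is what makes an on-chip event queue plausible in hardware.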

FPGA Stream-Partitioned Graph Accelerators

Solution: Partition graphs into intervals and shards, streaming edges through FPGA on-chip memory while caching vertex data to maximize bandwidth utilization and reduce preprocessing overhead.

Year Venue Authors Title Tags P E N
2016 FPGA THU & UCB FPGP: Graph Processing Framework on FPGA: A Case Study of Breadth-First Search interval-shard based vertex-centric graph processing on FPGA; on-chip BFS ping-pong vertex caching; analytical performance model with N_pk crossover point 3 4 2
2017 FPGA THU & MSR ForeGraph: Exploring Large-scale Graph Processing on Multi-FPGA Architecture two-level interval-shard partitioning across multi-FPGA boards; index-based O(m) preprocessing without edge sorting; destination-first replacement strategy minimizing off-chip traffic; PE-level edge shuffling for load balancing; bitmap-based block skipping for sparse iterations 4 4 3
2018 FCCM THU & UCB NewGraph: Balanced Large-scale Graph Processing on FPGAs with Low Preprocessing Overheads URAM-based large partitions reducing count by 3 orders of magnitude vs ForeGraph; FIFO-based dynamic crossbar eliminating pre-sorting; balanced workload without static edge shuffling 3 4 2
2019 FPL NUS On-The-Fly Parallel Data Shuffling for Graph Processing on OpenCL-Based FPGAs on-the-fly parallel data shuffling; OpenCL-based data dispatcher; runtime data dependency resolution; decoder-filter shuffling architecture 3 4 2
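The interval and shard idea can be sketched in software as follows. Vertices are split into equal-sized intervals, edges are grouped into blocks by (source interval, destination interval), and BFS streams one block at a time while only the two touched intervals would need to sit in on-chip memory. This is a simplified model under those assumptions; the function names and the block layout are illustrative and do not reproduce the exact scheme of any listed framework.

```python
from collections import defaultdict


def partition_into_blocks(edges, num_vertices, num_intervals):
    """Group edges by (source interval, destination interval)."""
    size = -(-num_vertices // num_intervals)  # ceiling division
    blocks = defaultdict(list)
    for src, dst in edges:
        blocks[(src // size, dst // size)].append((src, dst))
    return blocks


def bfs_levels(edges, num_vertices, num_intervals, root):
    """Level-synchronous BFS that streams edges block by block."""
    blocks = partition_into_blocks(edges, num_vertices, num_intervals)
    level = [None] * num_vertices
    level[root] = 0
    depth = 0
    changed = True
    while changed:
        changed = False
        for dst_iv in range(num_intervals):      # destination interval kept "on chip"
            for src_iv in range(num_intervals):  # source interval loaded, edges streamed
                for src, dst in blocks.get((src_iv, dst_iv), []):
                    if level[src] == depth and level[dst] is None:
                        level[dst] = depth + 1
                        changed = True
        depth += 1
    return level


edges = [(0, 1), (0, 2), (1, 3), (2, 3), (3, 4)]
print(bfs_levels(edges, num_vertices=5, num_intervals=2, root=0))  # [0, 1, 1, 2, 3]
```

Because every edge in a block touches only two intervals, random vertex accesses stay within a bounded working set while the edges themselves are read sequentially.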

Memory-Optimized Graph Accelerators

Solution: Address the memory bottleneck (bandwidth and latency) through specialized memory hierarchies, HBM optimizations, or caching mechanisms.

Year Venue Authors Title Tags P E N
2014 IPDPSW ISU CyGraph: A Reconfigurable Architecture for Parallel Breadth-First Search custom 64-bit CSR combining visited-flag/row-offset/neighbor-count in single memory word; token-ring kernel-to-kernel interface for distributed next-queue coordination across multi-FPGA 3 4 2
2019 FPGA HUST Improving Performance of Graph Processing on FPGA-DRAM Platform by Two-level Vertex Caching two-level vertex caching (L1 BRAM/L2 UltraRAM); Hilbert-order window sliding for locality; dual-pipeline computation-communication overlapping 4 4 3
2021 ICCAD Cornell GraphLily: Accelerating Graph Linear Algebra on HBM-Equipped FPGAs GraphBLAS; FPGA overlay; HBM-optimized; SpMV/SpMSpV accelerator; CPSR sparse format; software-hardware co-design 4 4 3
2024 TRETS HUST ScalaBFS2: A High-performance BFS Accelerator on an HBM-enhanced FPGA Chip HBM-enhanced BFS accelerator; independent HBM Reader; hybrid-mode PE; multi-layer crossbar 3 4 3
2019 ASPDAC THU & UCB GraphSAR: A Sparsity-Aware Processing-in-Memory Architecture for Large-Scale Graph Processing on ReRAMs sparsity-aware recursive block partitioning with density threshold 0.5; hybrid-centric block-list and edge-list processing model; lightweight graph clustering via vertex index remapping; single-bit ReRAM cell implementation for unweighted algorithms 3 3 3

Dynamic Graph Accelerators

Challenge: Supporting efficient edge updates and designing the graph-store data structure.

Year Venue Authors Title Tags P E N
2021 FPGA HUST GraSU: A Fast Graph Update Library for FPGA-based Dynamic Graph Processing differential data management based on spatial similarity; Incremental Value Measurer (IVM); Value-Aware Memory Manager (VMM) 4 4 3
2021 MICRO UCR JetStream: Graph Analytics on Streaming Data with Event-Driven Hardware Accelerator first streaming graph accelerator; asynchronous incremental algorithms; asynchronous edge deletion handling (VAP; DAP) 3 3 4
2022 ISCA HUST TDGraph: A Topology-Driven Accelerator for High-Performance Streaming Graph Processing topology-driven incremental execution; streaming graph processing; regularized state propagation; vertex states coalescing 4 4 3
2023 MICRO UCR MEGA Evolving Graph Accelerator first evolving graph accelerator; Batch-Oriented Execution (BOE); deletion-free based on CommonGraph; Batch Pipelining 3 3 4
2024 TRETS UoV Dynamic-ACTS - A Dynamic Graph Analytics Accelerator For HBM-Enabled FPGAs a novel edge packing format (ACTPACK); hashed edge updates; low-overhead online partitioning 4 4 3

Hypergraph Accelerators

Solution: Identify and exploit the parts shared among hyperedges.

Year Venue Authors Title Tags P E N
2022 MICRO HUST A Data-Centric Accelerator for High-Performance Hypergraph Processing Data-Centric; Load-Trigger-Reduce (LTR); Adaptive Data Loading 4 4 3
2025 HPCA HUST MeHyper: Accelerating Hypergraph Neural Networks by Exploring Implicit Dataflows Microedge; Microedge-Centric Dataflow; RePAG Execution Model 4 3 4

Graph Mining Accelerators

Challenge: Complex graph algorithms and irregular access patterns.

Year Venue Authors Title Tags P E N
2021 ISCA MIT FlexMiner: A Pattern-Aware Accelerator for Graph Pattern Mining pattern-aware GPM accelerator; software/hardware co-design; pattern-specific execution plan; connectivity map (c-map) 4 4 3

DNN Accelerators

GEMM

Year Venue Authors Title Tags P E N
2020 HPCA Georgia Tech SIGMA: A Sparse and Irregular GEMM Accelerator with Flexible Interconnects for DNN Training flexible dot product engine; forwarding adder network 4 2 3
2022 DAC Gatech Self Adaptive Reconfigurable Arrays (SARA): Learning Flexible GEMM Accelerator Configuration and Mapping-space using ML dedicated hardware recommender core; pipelined bypass links; real-time hardware reconfiguration 3 3 2
2025 TCAS-I Edin. DiP: A Scalable, Energy-Efficient Systolic Array for Matrix Multiplication Acceleration Permutated Weight; FIFO-less Architecture 4 4 3

Layer Fusion Accelerators

Solution: Use layer fusion to process multiple layers of a neural network together, so that intermediate results stay on chip instead of being written to and read back from off-chip memory. This reduces memory accesses during inference, leading to faster execution and lower power consumption.

Year Venue Authors Title Tags P E N
2016 MICRO SBU Fused-Layer CNN Accelerators fuse the processing of multiple CNN layers by modifying the order in which the input data are brought on chip
2025 TC KU Leuven Stream: Design Space Exploration of Layer-Fused DNNs on Heterogeneous Dataflow Accelerators fine-grain mapping paradigm; mapping of layer-fused DNNs on heterogeneous dataflow accelerator architectures; memory- and communication-aware latency analysis; constraint optimization
2024 SOCC IIT Hyderabad Hardware-Aware Network Adaptation using Width and Depth Shrinking including Convolutional and Fully Connected Layer Merging Width Shrinking: reduces the number of feature maps in CNN layers; Depth Shrinking: Merge of conv layer and fc layer
2024 ICSAI MIT LoopTree: Exploring the Fused-Layer Dataflow Accelerator Design Space design space that supports set of tiling, recomputation, retention choices, and their combinations; model that validates design space
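A small sketch may help show what fusing layers means in terms of data movement. Below, two 1D convolutions are evaluated either layer by layer (materializing the full intermediate feature map) or fused tile by tile (only a small intermediate tile exists at any time, at the cost of recomputing the overlapping halo). The 1D setting, the function names, and the tile size are simplifying assumptions for illustration, not the scheme of any particular paper above.

```python
def conv1d(x, kernel):
    """Valid (no padding) 1D convolution written as a sliding dot product."""
    n = len(x) - len(kernel) + 1
    return [sum(x[i + j] * kernel[j] for j in range(len(kernel))) for i in range(n)]


def fused_two_layers(x, k1, k2, tile):
    """Compute conv1d(conv1d(x, k1), k2) one output tile at a time."""
    out_len = len(x) - len(k1) - len(k2) + 2
    out = []
    for start in range(0, out_len, tile):
        stop = min(start + tile, out_len)
        # Outputs [start, stop) need layer-1 values [start, stop + len(k2) - 1),
        # which in turn need inputs [start, stop + len(k1) + len(k2) - 2).
        x_tile = x[start:stop + len(k1) + len(k2) - 2]
        intermediate_tile = conv1d(x_tile, k1)  # small, stays "on chip"
        out.extend(conv1d(intermediate_tile, k2))
    return out


x = list(range(10))
k1, k2 = [1, 0, -1], [1, 1]
layer_by_layer = conv1d(conv1d(x, k1), k2)
assert fused_two_layers(x, k1, k2, tile=3) == layer_by_layer
```

The tile size trades on-chip buffer capacity against recomputation of the halo region, which is the kind of choice the design-space exploration tools in this table search over.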

LLM Accelerators

Challenge: LLM accelerators face challenges in memory bandwidth, power consumption, and efficient data movement.

FPGA-Based Transformer Inference Accelerators

Challenge: Deploying Transformer inference on FPGAs/ACAPs suffers from severe shape mismatch between diverse layer sizes and fixed hardware resources, creating an inherent latency-throughput tradeoff that neither purely sequential nor fully spatial accelerator strategies can resolve simultaneously.

Solution: Explore sequential-spatial hybrid accelerator architectures with automated layer-to-accelerator scheduling and inter-accelerator communication co-design to achieve a superior latency-throughput Pareto front.

Year Venue Authors Title Tags P E N
2024 FPGA Pitt & UMD SSR: Spatial Sequential Hybrid Architecture for Latency Throughput Tradeoff in Transformer Acceleration latency-throughput Pareto front via sequential-spatial hybrid design; force-partition strategy for inter-accelerator memory bank conflict elimination; fine-grained line-buffer pipeline for nonlinear kernels (LayerNorm/Softmax) 4 4 3
2024 FPGA THU & SJTU FlightLLM: Efficient Large Language Model Inference with a Complete Mapping Flow on FPGAs configurable sparse DSP chain for N:M and block sparsity; always-on-chip decode with mixed-precision dequantization unit; length adaptive compilation reducing instruction storage by 500 times 4 4 3

SSM/Mamba Accelerators

Challenge: Mamba's element-wise operations are incompatible with Tensor Core reduction trees; nonlinear functions (exp, SiLU) require large dedicated hardware units; and element-wise ops have limited data sharing making standard tiling inapplicable.

Solution: Reconfigurable PE array that can disable reduction trees for element-wise ops, decompose nonlinear functions into element-wise primitives, and apply intra/inter-operation buffer management.

Year Venue Authors Title Tags P E N
2024 ICCAD SJTU MARCA: Mamba Accelerator with ReConfigurable Architecture reduction alternative PE array architecture toggling reduction tree for linear vs element-wise ops; fast biased exponential algorithm decomposing exp into shift and element-wise ops; piecewise SiLU approximation reusing reconfigurable PEs; intra-operation and inter-operation buffer management strategy 4 3 4

Accelerators facing Memory Wall

Challenge: LLMs require large memory capacity and bandwidth to store and access model parameters and intermediate results, which can lead to memory bottlenecks and reduced performance.

Year Venue Authors Title Tags P E N
2023 ASPLOS Gatech FLAT: An Optimized Dataflow for Mitigating Attention Bottlenecks inter-operator tensor-tensor fusion; fine-grained execution tiling; map-space exploration 3 3 2
2024 ISCA Furiosa AI TCP: A Tensor Contraction Processor for AI Workloads tensor contraction as a hardware primitive; circuit-switched fetch network for hierarchical data reuse; Einstein summation for tactic exploration 4 3 3
2024 DATE NTU ViTA: A Highly Efficient Dataflow and Architecture for Vision Transformers highly efficient memory-centric dataflow; fused special function module for non-linear functions; A comprehensive DSE of ViTA Kernels and VMUs
2025 arXiv SJTU ROMA: A Read-Only-Memory-based Accelerator for QLoRA-based On-Device LLM hybrid ROM-SRAM architecture for on-device LLM; B-ROM design for area-efficient ROM; fused cell integration of ROM and compute unit; QLoRA rank adaptation for task-specific tuning; on-chip storage optimization for quantized models
2025 ISCA Duke Ecco: Improving Memory Bandwidth and Capacity for LLMs via Entropy-Aware Cache Compression entropy-aware cache compression for LLMs; group-wise non-uniform quantization; shared k-means patterns; parallel Huffman hardware decoder 4 3 3

Compiler-Scheduled Cacheless Architecture

Challenge: Reactive hardware architectures (like GPUs) devote significant area and power to dynamic scheduling and caching, which leads to unpredictable tail latencies and limits the compute utilization in large-scale distributed systems.

Year Venue Authors Title Tags P E N
2020 ISCA Groq Think Fast: A Tensor Streaming Processor (TSP) for Accelerating Deep Learning Workloads Space-time determined; Functionally Sliced and Data stream Microarchitecture 4 5 4
2022 ISCA Groq A software-defined tensor streaming multiprocessor for large-scale machine learning Multi-TSP network; Removal of Hardware Flow Control; Deterministic Load Balancing 4 5 4

Algorithmic Accelerators

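Solution: Algorithmic accelerators co-design the algorithm with the hardware, using techniques such as approximation, token and head pruning, and speculative inference to reduce the work that attention and LLM inference require.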

Year Venue Authors Title Tags P E N
2020 HPCA Seoul National A3: Accelerating Attention Mechanisms in Neural Networks with Approximation greedy candidate search for reducing search targets; post-scoring selection with dynamic thresholding for softmax; lookup-table-based exponent modules 4 3 2
2021 HPCA MIT SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning cascade pruning; high-parallelism top-K engine; progressive quantization 3 4 4
2024 ASPLOS CMU SpecInfer: Accelerating Large Language Model Serving with Tree-based Speculative Inference and Verification tree-based speculative inference; topology-aware causal mask; multi-step speculative sampling 4 4 4
2025 MICRO KAIST HLX: A Unified Pipelined Architecture for Optimized Performance of Hybrid Transformer-Mamba Language Models PipeFlash fine-grained pipelining for Attention; PipeSSD fused pipelined execution for Mamba-2; Unified Reconfigurable Streamlined Core (URSC); inter-operation dependency mitigation 4 3 3
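One of the algorithmic ideas named in this table, pruning tokens by attention importance, can be sketched in a few lines. The version below accumulates each key token's softmax probability over all query rows and keeps the top-k tokens. It is a generic illustration with made-up names (`prune_tokens`, `keep`), assuming importance is the summed attention probability; it does not reproduce the exact pruning rule or hardware top-K engine of any listed paper.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of attention logits."""
    peak = max(row)
    exps = [math.exp(value - peak) for value in row]
    total = sum(exps)
    return [e / total for e in exps]


def prune_tokens(logits, keep):
    """Return the indices of the `keep` most attended key tokens.

    logits is a list of query rows, each holding one raw attention
    score per key token.
    """
    num_tokens = len(logits[0])
    importance = [0.0] * num_tokens
    for row in logits:
        for j, prob in enumerate(softmax(row)):
            importance[j] += prob  # accumulate attention received by token j
    ranked = sorted(range(num_tokens), key=lambda j: importance[j], reverse=True)
    return sorted(ranked[:keep])  # keep original token order


logits = [[5.0, 0.0, 4.0, 0.0],
          [4.0, 0.1, 5.0, 0.0]]
print(prune_tokens(logits, keep=2))  # [0, 2]
```

Tokens that are dropped no longer need their keys and values fetched in later steps, which is where the memory-traffic saving comes from.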

Wafer-Scale Accelerators

Solution: Adopt holistic hardware-software co-design to integrate high-bandwidth Network-on-Wafer (NoW) architectures with fault-tolerant mapping strategies, thereby mitigating yield limitations and communication bottlenecks to maximize system throughput.

Year Venue Authors Title Tags P E N
2025 ISCA THU PD Constraint-aware Physical/Logical Topology Co-Design for Network on Wafer mesh-switch physical topology; dual-granularity logical topology 4 3 2
2025 ISCA THU WSC-LLM: Efficient LLM Service and Architecture Co-exploration for Wafer-scale Chips optimal resource partition algorithm; optimal KV cache placement algorithm 4 3 2
2026 HPCA THU WATOS: Efficient LLM Training Strategies and Architecture Co-exploration for Wafer-scale Chip globally coordinated memory-efficient recomputation; location-aware resource placement; 3-stage system-level robustness design 4 4 2

Quantized DNN Accelerators

Solution: Quantized DNN accelerators are designed to efficiently execute quantized neural networks, which use lower precision representations for weights and activations.

General-Purpose and Edge Quantized DNN Accelerators

Challenge: General quantized-DNN accelerators must support mixed precision or nonuniform datatypes without losing the throughput and energy benefits of low-bit execution.

Year Venue Authors Title Tags P E N
2018 ISCA SNU Energy-Efficient Neural Network Accelerator Based on Outlier-Aware Low-Precision Computation accelerator architecture for outlier-aware quantized models; outlier-aware low-precision computation; separate outlier MAC unit 4 3 2
2023 HPCA UPC Mix-GEMM: An efficient HW-SW Architecture for Mixed-Precision Quantized Deep Neural Networks Inference on Edge Devices Complete mixed-precision flexibility; hardware accelerator & BLIS-based library with custom RISC-V ISA extensions 3 2 3
2024 DAC ASU Algorithm-Hardware Co-Design of Distribution-Aware Logarithmic-Posit Encodings for Efficient DNN Inference composite data type Logarithmic Posits (LP); automated post training LP Quantization (LPQ) Framework based on genetic algorithms; mixed-precision LP Accelerator (LPA) 3 3 2

Quantized Transformer and Foundation-Model Accelerators

Challenge: Transformer-scale quantized accelerators need to compress weights and activations while preserving model accuracy and reducing expensive memory traffic.

Year Venue Authors Title Tags P E N
2020 MICRO UToronto GOBO: Quantizing Attention-Based NLP Models for Low Latency and Energy Efficient Inference Gaussian-outlier weight splitting; per-layer centroid dictionary quantization; 3-bit index storage for BERT weights; GOBO low-bit accelerator 4 3 4
2022 ISCA UToronto Mokey: Enabling Narrow Fixed-Point Inference for Out-of-the-Box Floating-Point Transformer Models Golden Dictionary post-training quantization; 4-bit index weights and activations with fixed-point centroids; exponential-fit centroid arithmetic replacing MACs with narrow additions; Mokey accelerator and memory compression assist 4 4 4
2025 ISCA Georgia Tech & Intel MicroScopiQ: Accelerating Foundational Models through Outlier-Aware Microscaling Quantization outlier-aware MX quantization with pruning-based bit redistribution; ReCoN NoC for high-precision outlier merging; multi-precision INT PE systolic accelerator for LLM/VLM inference 4 3 3
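Several entries in the quantized-accelerator tables treat outliers separately from the bulk of the values. The sketch below shows the basic idea under simple assumptions: the largest-magnitude values are stored unquantized, and the remaining inliers share one uniform low-bit scale. The function names, the symmetric uniform quantizer, and the `outlier_fraction` parameter are illustrative choices, not the scheme of any specific paper listed here.

```python
def outlier_aware_quantize(values, bits=4, outlier_fraction=0.1):
    """Split values into full-precision outliers and low-bit inlier codes.

    Returns (scale, codes, outliers) where codes[i] is a signed integer
    code for inliers (0 for outlier slots) and outliers maps index -> value.
    """
    num_outliers = int(len(values) * outlier_fraction)
    by_magnitude = sorted(range(len(values)), key=lambda i: abs(values[i]), reverse=True)
    outliers = {i: values[i] for i in by_magnitude[:num_outliers]}
    inlier_max = max((abs(values[i]) for i in by_magnitude[num_outliers:]), default=0.0)
    qmax = 2 ** (bits - 1) - 1
    # The scale is set by the inliers only, so outliers do not stretch the grid.
    scale = inlier_max / qmax if inlier_max > 0 else 1.0
    codes = [0 if i in outliers else round(v / scale) for i, v in enumerate(values)]
    return scale, codes, outliers


def dequantize(scale, codes, outliers):
    """Rebuild approximate values from inlier codes and stored outliers."""
    return [outliers.get(i, code * scale) for i, code in enumerate(codes)]


weights = [0.1, -0.2, 0.05, 8.0, 0.3, -0.15, 0.25, -0.05, 0.12, 0.07]
scale, codes, outliers = outlier_aware_quantize(weights)
restored = dequantize(scale, codes, outliers)
print(outliers)  # {3: 8.0}
```

Without the split, the single value 8.0 would set the scale and every other weight here would round to zero at 4 bits.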

Bit-Sliced DNN Accelerators

Solution: Bit-sliced DNN accelerators break data down into smaller bit-slices, allowing more efficient processing and reducing memory and compute requirements.

Year Venue Authors Title Tags P E N
2018 ISCA Georgia Tech Bit Fusion: Bit-Level Dynamically Composable Architecture for Accelerating Deep Neural Networks accelerator for layer-aware quantized DNN; bit-flexible computation unit; block-structured instruction set architecture 4 3 3
2023 HPCA KAIST Sibia: Signed Bit-slice Architecture for Dense DNN Acceleration with Slice-level Sparsity Exploitation signed bit-slice representation; flexible zero skipping processing element 3 3 4
2024 HPCA KU Leuven BitWave: Exploiting Column-Based Bit-Level Sparsity for Deep Learning Acceleration Bit-column sparsity for both computation reduction and data compression; Single-shot Bit-Flip post-training 3 3 3
2025 HPCA POSTECH Panacea: Novel DNN Accelerator using Accuracy-Preserving Asymmetric Quantization and Energy-Saving Bit-Slice Sparsity Asymmetrically-Quantized bit-Slice GEMM; Zero-Point Manipulation and Distribution-based Bit-Slicing to increase sparsity 3 3 4
2025 HPCA Yonsei Bit-slice Architecture for DNN Acceleration with Slice-level Sparsity Enhancement and Exploitation both input AND weight sparsity at bit-slice level; 8-bit data processing with 4-bit multipliers; Scale regularization during training to enhance sparsity 3 2 2
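The arithmetic behind bit-slicing is easy to show in software. The sketch below multiplies two unsigned 8-bit values using only 4-bit by 4-bit products and skips any slice that is zero, which is the slice-level sparsity these designs exploit. Unsigned operands and the helper names are assumptions for this example; the signed and asymmetric variants in the listed papers are more involved.

```python
SLICE_BITS = 4
TOTAL_BITS = 8


def to_slices(value):
    """Split an unsigned 8-bit value into 4-bit slices, least significant first."""
    mask = (1 << SLICE_BITS) - 1
    return [(value >> shift) & mask for shift in range(0, TOTAL_BITS, SLICE_BITS)]


def bit_sliced_mul(a, b):
    """Multiply a and b with 4-bit multipliers, skipping zero slices.

    Returns (product, number of 4-bit multiplications actually performed).
    """
    product = 0
    multiplications = 0
    for i, slice_a in enumerate(to_slices(a)):
        if slice_a == 0:
            continue  # zero slice contributes nothing
        for j, slice_b in enumerate(to_slices(b)):
            if slice_b == 0:
                continue
            # Each partial product is shifted by the combined slice position.
            product += (slice_a * slice_b) << (SLICE_BITS * (i + j))
            multiplications += 1
    return product, multiplications


print(bit_sliced_mul(0x3C, 0x05))  # (300, 2)
```

A dense 8-bit multiply needs four 4-bit products; here the zero high slice of 0x05 halves that, and small-magnitude values common in quantized DNNs produce many such zero slices.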

Reconfigurable Dataflow & Interconnect Accelerators

Solution: Dynamically adaptive interconnections and multi-dataflow engines allowing adaptable mapping to match spatial architectures with DNN variants.

Year Venue Authors Title Tags P E N
2018 ASPLOS Georgia Tech MAERI: Enabling Flexible Dataflow Mapping over DNN Accelerators via Reconfigurable Interconnects augmented reduction tree (ART) for link conflict; chubby distribution tree for bandwidth optimization; ART based virtual neuron construction 4 3 2
2019 JETCAS MIT Eyeriss v2: A Flexible Accelerator for Emerging Deep Neural Networks on Mobile Devices hierarchical mesh NoC for multiple transmission modes; sparse PE architecture 5 4 2
2023 ASPLOS UM & Georgia Tech Flexagon: A Multi-dataflow Sparse-Sparse Matrix Multiplication Accelerator for Efficient DNN Processing merger-reduction network for area efficiency; compression format conversion without hardware module; dedicated L1 memory architecture for different access pattern 4 3 2
2024 ISCA Gatech FEATHER: A Reconfigurable Accelerator with Data Reordering Support for Low-Cost On-Chip Dataflow Switching reorder in reduction; dataflow-layout co-switching; butterfly interconnect for reduction and reordering (BIRRD) 4 4 2
2025 MICRO Maryland Misam: Machine Learning Assisted Dataflow Selection in Accelerators for Sparse Matrix Multiplication ML-assisted runtime dataflow selection; lightweight decision tree predictor; intelligent reconfiguration engine for FPGAs; cost-benefit analysis for hardware switching 3 4 3

Application & Sparsity-Specific Reconfigurable Accelerators

Solution: Accelerators employing structured sparsity and environment-specific pipelining, enabling them to reconfigure flexibly for structured sparsity masks or specific AI domains (like RL).

Structured Sparsity DNN Accelerators

Solution: Accelerators employing hierarchical structured sparsity masks for flexible DNN acceleration.

Year Venue Authors Title Tags P E N
2023 MLSys Gatech SUSHI: SUbgraph Stationary Hardware-software Inference Co-design subgraph-stationary dataflow; dedicated persistent buffer; cache-state-aware scheduling; moving-average subgraph prediction 4 4 2
2023 MICRO MIT HighLight: Efficient and Flexible DNN Acceleration with Hierarchical Structured Sparsity hierarchical structured sparsity (HSS); modularized sparsity acceleration architecture; systematic flexibility on sparsity patterns 3 3 3

RL Environment Accelerators

Solution: FPGA-based accelerators targeting the parallel environment execution bottleneck in reinforcement learning training pipelines.

Year Venue Authors Title Tags P E N
2025 DATE PKU PEARL: FPGA-Based Reinforcement Learning Acceleration with Pipelined Parallel Environments pipelined parallel RL environment execution on FPGA; PCIe data compression and local-store optimization; modular parametric environment template for user customization 3 3 2

Secure DNN Accelerators

Solution: Integrating on-chip cryptographic engines and co-optimizing the hardware architecture, memory authentication schemes, and data scheduling to efficiently enable Trusted Execution Environments (TEEs) for secure DNN computation with minimal overhead.

Year Venue Authors Title Tags P E N
2023 MICRO MIT SecureLoop: Design Space Exploration of Secure DNN Accelerators cryptographic-engine-aware loopnest scheduling; analytical authentication block formulation; simulated annealing-based cross-layer tuning 3 3 2

Benchmarks

Year Venue Authors Title Tags P E N
2025 arXiv Cambridge Benchmarking Ultra-Low-Power µNPUs Comparative µNPU Benchmarking (µNPU: microcontroller-scale Neural Processing Unit); open-source model compilation framework; µNPU memory I/O bottleneck identification 4 4 2

Dataflow Architecture

Reconfigurable Dataflow Architectures

Challenge: Bridging the gap between the high efficiency of dataflow execution and the need for handling irregular control flows (e.g., MIMD threads) and diverse parallel patterns in general-purpose computing.

Year Venue Authors Title Tags P E N
2017 ISCA Stanford Plasticine: A Reconfigurable Architecture for Parallel Patterns Pattern Compute Unit (PCU); Pattern Memory Unit (PMU); parallel pattern-based ISA; hierarchical reconfigurable interconnect 5 3 5
2022 ISCA Stanford Aurochs: An Architecture for Dataflow Threads dataflow threads for MIMD execution; resource elasticity and dynamic context switching; hardware extensions for irregular workloads 4 3 4
2024 MICRO CMU The TYR Dataflow Architecture: Improving Locality by Taming Parallelism local tag spaces technique; space tag managing instruction set; CT based concurrent-block communication

Tensor-Centric Dataflow Architectures

Challenge: Overcoming the memory wall and utilization bottlenecks in Deep Learning workloads by optimizing inter-operator dataflow, exploiting sparsity, and managing tensor contraction sequences.

Year Venue Authors Title Tags P E N
2019 ASPLOS THU Tangram: Optimized Coarse-Grained Dataflow for Scalable NN Accelerators buffer sharing dataflow (BSD); alternate layer loop ordering (ALLO) dataflow; heuristics spatial layer mapping algorithm
2024 MICRO UCR Sparsepipe: Sparse Inter-operator Dataflow Architecture with Cross-Iteration Reuse producer-consumer reuse; cross-iteration reuse; sub-tensor dependency; OEI dataflow; sparsepipe architecture
2025 arXiv UCSB FETTA: Flexible and Efficient Hardware Accelerator for Tensorized Neural Network Training contraction sequence search engine; tensor contraction unit; distribution/reduction network 3 4 3

Data Mapping

Solution: Assign data to specific locations in memory or storage to optimize performance, reduce latency, and improve resource utilization.

Survey

Year Venue Authors Title Tags P E N
2013 DAC NUS Mapping on Multi/Many-core Systems: Survey of Current and Emerging Trends dense/run-time mapping; centralized/distributed management; hybrid mapping

Heuristic Algorithm

Year Venue Authors Title Tags P E N
2021 HPCA Georgia Tech MAGMA: An Optimization Framework for Mapping Multiple DNNs on Multiple Accelerator Cores sub-accelerator selection; fine-grained job prioritization; MAGMA crossover genetic operators
2023 ISCA THU MapZero: Mapping for Coarse-grained Reconfigurable Architectures with Reinforcement Learning and Monte-Carlo Tree Search GAT based DFG and CGRA embedding; routing penalty based reinforcement learning; Monte-Carlo tree search space exploration
2023 VLSI IIT Kharagpur Application Mapping Onto Manycore Processor Architectures Using Active Search Framework RNN based active search framework; IP-Core Numbering Scheme; active search with/without pretraining

Optimization Modeling

Year Venue Authors Title Tags P E N
2020 FPGA ETH Zurich Flexible Communication Avoiding Matrix Multiplication on FPGA with High-Level Synthesis computation and I/O decomposition model for matrix multiplication; 1D array collapse mapping method; internal double buffering
2021 HPCA Georgia Tech Heterogeneous Dataflow Accelerators for Multi-DNN Workloads heterogeneous dataflow accelerators (HDAs) for DNN; dataflow flexibility; high utilization across the sub-accelerators
2023 MICRO Alibaba; CUHK ArchExplorer: Microarchitecture Exploration Via Bottleneck Analysis dynamic event-dependence graph (DEG); induced DEG based critical path construction; bottleneck-removal-driven DSE
2023 ISCA THU Inter-layer Scheduling Space Definition and Exploration for Tiled Accelerators inter-layer encoding method; temporal cut; spatial cut; RA tree analysis

Fault Tolerant Mapping

Year Venue Authors Title Tags P E N
2017 SC NIT High-performance and energy-efficient fault-tolerance core mapping in NoC weighted communication energy; placing unmapped vertices region; application core graph; spare core placement algorithm
2019 IVLSI UESTC Optimized mapping algorithm to extend lifetime of both NoC and cores in many-core system lifetime budget metric; LBC-LBL mapping algorithm; electro-migration fault model

Communication Optimized Mapping

Challenge: Efficiently exploring the vast design space to balance limited on-chip buffer resources with scarce DRAM bandwidth, aiming to minimize off-chip communication latency through optimized layer fusion and scheduling strategies.

Year Venue Authors Title Tags P E N
2024 ASPLOS THU Cocco: Hardware-Mapping Co-Exploration towards Memory Capacity-Communication Optimization consumption-centric flow based subgraph execution scheme; main/side region based memory management
2025 DAC THU Buffer Prospector: Discovering and Exploiting Untapped Buffer Resources in Many-Core DNN Accelerators data-compute ratio; buffer requirement calculator; layer-pipeline (LP) mapping optimization; greedy based buffer allocator 4 3 2

Reliability Management

Year Venue Authors Title Tags P E N
2018 DATE NUST Variation-Aware Task Allocation and Scheduling for Improving Reliability of Real-Time MPSoCs variation-aware task allocation; soft-error reliability maximization; cross entropy based task scheduling 4 3 2
2019 DAC NTU LifeGuard: A Reinforcement Learning-Based Task Mapping Strategy for Performance-Centric Aging Management performance-centric aging management; frequency-based core binning; DRL based task mapping 3 3 2
2020 DATE Turku Thermal-Cycling-aware Dynamic Reliability Management in Many-Core System-on-Chip Coffin-Manson equation based reliability model; reliability-aware mapping/scheduling; dynamic power management
2024 arXiv WUSTL A Two-Level Thermal Cycling-Aware Task Mapping Technique for Reliability Management in Manycore Systems temperature based bin packing; task-to-bin assignment; thermal cycling-aware based task-to-core mapping
2024 arXiv WUSTL A Reinforcement Learning-Based Task Mapping Method to Improve the Reliability of Clustered Manycores mean time to failure; density-based spatial clustering of applications with noise algorithm

CGRA

Solution: Optimally map an application's data flow graph onto the hardware fabric by simultaneously solving the tightly-coupled problems of scheduling, placement, and routing under strict spatial and temporal resource constraints.

Year Venue Authors Title Tags P E N
2017 ASAP Toronto CGRA-ME: A Unified Framework for CGRA Modelling and Exploration XML-based CGRA description; LLVM-based simulated annealing mapper; modulo routing resource graph 3 2 2
2019 TCAD Tsinghua Data-Flow Graph Mapping Optimization for CGRA With Deep Reinforcement Learning neighbor PE interchange defined action space; local pattern based reward function 3 3 2
2022 HPCA NUS LISA: Graph Neural Network based Portable Mapping on Spatial Accelerators label abstraction for quality mapping; GNN based label-aware mapping; label-aware simulated annealing 4 3 3
2024 MICRO NUS ICED: An Integrated CGRA Framework Enabling DVFS-Aware Acceleration tile based CGRA configuration; DVFS labeling; DVFS-Aware DFG Mapping 4 3 2

Task Scheduling

Year Venue Authors Title Tags P E N
2023 ICCAD PKU Memory-aware Scheduling for Complex Wired Networks with Iterative Graph Optimization topology-aware pruning algorithm; integer linear programming scheduling method; sub-graph fusion algorithm; memory-aware graph partitioning 4 3 2
2023 MICRO Duke Si-Kintsugi: Towards Recovering Golden-Like Performance of Defective Many-Core Spatial Architectures for AI graph alignment algorithm for dataflow graph and platform PE graph; producer-consumer pattern dataflow generation algorithm

Many-core Architecture

Challenge: Many-core architectures integrate a large number of cores on one chip, which makes power consumption, performance scaling, and resource allocation hard to manage.

Resource Management

Challenge: Cores share resources such as caches and memory bandwidth, so access must be coordinated among cores to prevent conflicts and preserve data consistency without sacrificing performance.

Year Venue Authors Title Tags P E N
2015 HPCA Cornell Increasing Multicore System Efficiency through Intelligent Bandwidth Shifting online bandwidth shifting mechanism; prefetch usefulness (PU) level
2015 HPCA IBM XChange: A Market-based Approach to Scalable Dynamic Multi-resource Allocation in Multicore Architectures CMP multiresource allocation mechanism XChange; market framework based modeling
2018 MICRO SNU RpStacks-MT: A High-throughput Design Evaluation Methodology for Multi-core Processors graph-based multi-core performance model; distance-based memory system model; dynamic scheduling reconstruction method
2023 MICRO Yonsei McCore: A Holistic Management of High-Performance Heterogeneous Multicores cluster partitioning via index hash function; partitions balancing method; hardware support for RL based scheduling

Hardware Design

Solution: Hardware implementation for many-core architecture to achieve massive parallelism.

Year Venue Authors Title Tags P E N
2016 SCIS THU&BNU&CAS The Sunway TaihuLight supercomputer: system and applications Sunway TaihuLight's composition; scientific computing applications on TaihuLight 3 4 2
2017 IPDPSW SJTU&Tokyo Tech Benchmarking SW26010 Many-core Processor hand-coded assembly benchmark for SW26010; CPE pipeline&memory hierarchy&RLC mechanism benchmarking 3 4 2
2020 MICRO UCSD Planaria: Dynamic Architecture Fission for Spatial Multi-Tenant Acceleration of Deep Neural Networks dynamic architecture fission; spatial multi-tenant acceleration; omni-directional systolic arrays 4 3 2
2023 MICRO THU MAICC: A Lightweight Many-core Architecture with In-Cache Computing for Multi-DNN Parallel Inference slice improved and hardware-implemented reduction CIM; ISA extension for CIM; CNN layer segmentation and mapping algorithm

Application Optimization

Year Venue Authors Title Tags P E N
2023 SC NUDT Optimizing Direct Convolutions on ARM Multi-Cores direct convolution algorithm NDirect; loop ordering algorithm; micro convolution kernel for computing & packing
2023 SC NUDT Optimizing MPI Collectives on Shared Memory Multi-Cores intra-node reduction algorithm for redundant data movements; fine grained non-temporal store based adaptive collectives
2024 PPoPP NUDT Towards Scalable Unstructured Mesh Computations on Shared Memory Many-Cores task dependency tree (TDT); tree traversal based parallel algorithm for CPU/GPU

Architecture DSE

Challenge: Find hardware configurations that meet the performance, power, and area constraints of specific applications within a very large design space.

NoC DSE

Year Venue Authors Title Tags P E N
2018 ICCAD WSU Hybrid On-Chip Communication Architectures for Heterogeneous Manycore Systems many-to-few communication patterns; long range shortcut based wireless NoC; 3D-TSV based heterogeneous NoC
2018 IEEE TC WSU On-Chip Communication Network for Efficient Training of Deep Convolutional Networks on Heterogeneous Manycore Systems wireless-enabled heterogeneous NoC; archived multi-objective simulated annealing for network connectivity

Mapping & Co-Exploration DSE

Challenge: Efficiently co-optimize DNN mapping and hardware architecture under complex constraints.

Year Venue Authors Title Tags P E N
2020 ICCAD UIUC DNNExplorer: A Framework for Modeling and Exploring a Novel Paradigm of FPGA-based DNN Accelerator two-level (global and local) automatic DSE engine; dynamic design space exploration framework; high-dimensional design space support 4 4 4
2020 ICCAD Gatech GAMMA: Automating the HW Mapping of DNN Models on Accelerators via Genetic Algorithm domain-specific genetic representation; growth and aging evolutionary operators; two-stage inter-layer optimization 4 3 2
2022 TACO Gatech Marvel: A Data-Centric Approach for Mapping Deep Learning Operators on Spatial Accelerators dimension dependence graph (DDG); maestro data-centric notation; decoupled off-chip/on-chip mapping exploration 3 3 2
2024 HPCA THU Gemini: Mapping and Architecture Co-exploration for Large-scale DNN Chiplet Accelerators layer-centric encoding method; DP-based graph partition algorithm; SA based D2D link communication optimization
2024 ASPDAC CUHK SoC-Tuner: An Importance-guided Exploration Framework for DNN-targeting SoC Design intercluster distance algorithm; importance-based pruning and initialization 3 2 2
2024 arXiv Georgia Tech PIPEORGAN: Efficient Inter-operation Pipelining with Flexible Spatial Org spatial organization strategy pipeorgan for inter-operator pipelining; augmented mesh for pipelining (AMP) topology 4 2 2
2025 ASPDAC THU KAPLA: Scalable NN Accelerator Dataflow Design Space Structuring and Fast Exploring tensor-centric dataflow directives; bottom-up cost descending; inter-layer pruning and decoupling 3 3 2

Microarchitecture & Cross-Architecture DSE

Challenge: Efficiently explore and optimize design spaces across microarchitectures and heterogeneous hardware.

Year Venue Authors Title Tags P E N
2019 MICRO Georgia Tech Understanding Reuse, Performance, and Hardware Cost of DNN Dataflows: A Data-Centric Approach MAESTRO analytical model; data-centric directives; spatial and temporal reuse analysis; hardware cost estimation; NoC (Network-on-Chip) modeling 4 4 4
2023 DATE Gatech AIRCHITECT: Learning Custom Architecture Design and Mapping Space recommendation neural network; constant time prediction; systolic-array-based accelerators 3 4 2
2025 arXiv THU & Macau MLDSE: Scaling Design Space Exploration Infrastructure for Multi-Level Hardware IR and builder based hardware modeling; cross-architecture DSE; spatial-level DSE 3 3 2
2025 arXiv PKU DiffuSE: Cross-Layer Design Space Exploration of DNN Accelerator via Diffusion-Driven Optimization diffusion-based design generation; conditional sampling 3 4 3
2026 arXiv Berkeley ArchAgent: Agentic AI-driven Computer Architecture Discovery automated runtime-configurable parameter tuning for workload-specific optimization by LLM; automated detection of simulator escapes (AI-exploited logic loopholes in simulators) 3 3 3

Data Access Accelerators

Challenge: Indirect and sparse memory access patterns; low memory bandwidth utilization; core structural limitations

Year Venue Authors Title Tags P E N
2025 ISCA UMich DX100: Programmable Data Access Accelerator for Indirection indirect memory access accelerator; bulk memory access reordering; DRAM row-buffer hit rate optimization; programmable data access ISA 4 3 2

HBM Interconnects

Challenge: The built-in crossbar of HBM FPGAs suffers from contention and low bandwidth during many-to-many unicast access, and standard HLS lacks support for efficient burst buffering.

Year Venue Authors Title Tags P E N
2021 FPGA UCLA HBM Connect: High-Performance HLS Interconnect for FPGA HBM HBM Connect; HLS Virtual Buffer (HVB); Mux-Demux Switch; butterfly custom crossbar; many-to-many unicast; pseudo-channel optimization 4 4 2

Reinforcement Learning Accelerators

Challenge: Resolving the memory bottlenecks caused by the irregular, low-arithmetic-intensity operations of experience replay while efficiently synchronizing high-throughput neural network training with sequential data collection.

Year Venue Authors Title Tags P E N
2026 TVLSI NUAA E2CAP: An Energy-Efficient FPGA Accelerator for Deep Reinforcement Learning With Experience Compression and Configurable PE Array fss+css two-stage experience compression strategy; intra-PE MADD/MAC mode switching to eliminate weight transposition; inter-PE configurable group-slice interconnection for computation imbalance reduction 4 4 2
2022 IEEE Access Osnabrueck A Survey of Domain-Specific Architectures for Reinforcement Learning experience replay bottleneck; lack of on-chip training; normalized efficiency metrics (IPS/LUT) 3 2 2
2022 CF USC FPGA Acceleration of Deep Reinforcement Learning using On-Chip Replay Management on-chip replay management module; k-ary sum tree data structure; hardware pipelining with conflict-free memory access 3 4 2
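
The sum tree mentioned in the last entry is the standard structure behind prioritized experience replay: leaves hold per-experience priorities, internal nodes hold the sum of their children, and sampling walks from the root to a leaf in O(log n). The sketch below is a plain software version with fan-out 2 rather than the k-ary hardware design of the paper. The class name `SumTree` and its methods are illustrative assumptions, and the capacity is assumed to be a power of two.

```python
import random


class SumTree:
    """Binary sum tree over `capacity` priorities (capacity: power of two)."""

    def __init__(self, capacity):
        assert capacity > 0 and capacity & (capacity - 1) == 0
        self.capacity = capacity
        # tree[1] is the root; leaves live in tree[capacity : 2 * capacity].
        self.tree = [0.0] * (2 * capacity)

    def update(self, index, priority):
        """Set the priority of one leaf and refresh the sums on its root path."""
        pos = index + self.capacity
        self.tree[pos] = priority
        pos //= 2
        while pos >= 1:
            self.tree[pos] = self.tree[2 * pos] + self.tree[2 * pos + 1]
            pos //= 2

    def total(self):
        return self.tree[1]

    def find(self, value):
        """Return the leaf whose cumulative-priority interval contains value."""
        pos = 1
        while pos < self.capacity:
            left = 2 * pos
            if value < self.tree[left]:
                pos = left
            else:
                value -= self.tree[left]
                pos = left + 1
        return pos - self.capacity

    def sample(self):
        """Draw a leaf index with probability proportional to its priority."""
        return self.find(random.random() * self.total())
```

Each `update` and `sample` touches one node per tree level, which is the access pattern that a pipelined hardware implementation has to serve without memory conflicts.
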