Domain-Specific Accelerators

Graph Accelerators

Challenge: Massive memory footprint and irregular (non-sequential) memory access

Pipelined & Event-Driven Graph Accelerators

Solution: Optimize data flow and execution models using pipelining, event-driven mechanisms, or data shuffling to handle irregular graph workloads.
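To make the event-driven idea concrete, here is a minimal software sketch of asynchronous shortest-path computation driven by a coalescing event queue. It is only an illustration of the general execution model named in the tags below (asynchronous events, in-place coalescing); it is not the design of any listed accelerator, and the function name `event_driven_sssp` and the dictionary-based graph layout are assumptions made for this example.

```python
from collections import deque


def event_driven_sssp(edges, source):
    """Asynchronous single-source shortest paths with a coalescing event queue.

    edges: dict mapping every vertex to a list of (neighbor, weight) pairs
           (hypothetical layout; weights are assumed non-negative).
    """
    dist = {v: float("inf") for v in edges}
    pending = {source: 0.0}   # at most one pending event per vertex
    order = deque([source])   # FIFO of vertices that currently have an event

    while order:
        v = order.popleft()
        candidate = pending.pop(v)
        if candidate >= dist[v]:
            continue          # stale event: no improvement, nothing to propagate
        dist[v] = candidate
        for nbr, w in edges[v]:
            new = candidate + w
            if nbr in pending:
                # Coalesce: keep only the best value for an already queued vertex.
                pending[nbr] = min(pending[nbr], new)
            else:
                pending[nbr] = new
                order.append(nbr)
    return dist


graph = {
    "a": [("b", 4), ("c", 1)],
    "b": [("d", 1)],
    "c": [("b", 2), ("d", 5)],
    "d": [],
}
print(event_driven_sssp(graph, "a"))  # {'a': 0.0, 'b': 3.0, 'c': 1.0, 'd': 4.0}
```

Coalescing keeps the queue bounded by the number of vertices, which is one reason an event-driven model can suit irregular workloads: work is generated only where a value actually changes.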

Year Venue Authors Title Tags P E N
2016 MICRO Princeton Graphicionado: A High-Performance and Energy-Efficient Accelerator for Graph Analytics vertex-programming based pipeline; on-chip scratchpad optimization; source/destination-oriented parallel streams 4 4 4
2019 FPL NUS On-The-Fly Parallel Data Shuffling for Graph Processing on OpenCL-Based FPGAs on-the-fly parallel data shuffling; OpenCL-based data dispatcher; runtime data dependency resolution; decoder-filter shuffling architecture 3 4 2
2020 MICRO UCR GraphPulse: An Event-Driven Hardware Accelerator for Asynchronous Graph Processing asynchronous event-driven model; in-place event coalescing; delta-based accumulative processing 3 3 4

Memory-Optimized Graph Accelerators

Solution: Address the memory bottleneck (bandwidth and latency) through specialized memory hierarchies, HBM optimizations, or caching mechanisms.

Year Venue Authors Title Tags P E N
2019 FPGA HUST Improving Performance of Graph Processing on FPGA-DRAM Platform by Two-level Vertex Caching two-level vertex caching (L1 BRAM/L2 UltraRAM); Hilbert-order window sliding for locality; dual-pipeline computation-communication overlapping 4 4 3
2021 ICCAD Cornell GraphLily: Accelerating Graph Linear Algebra on HBM-Equipped FPGAs GraphBLAS; FPGA overlay; HBM-optimized; SpMV/SpMSpV accelerator; CPSR sparse format; software-hardware co-design 4 4 3
2024 TRETS HUST ScalaBFS2: A High-performance BFS Accelerator on an HBM-enhanced FPGA Chip HBM-enhanced BFS accelerator; independent HBM Reader; hybrid-mode PE; multi-layer crossbar 3 4 3

Dynamic Graph Accelerators

Challenge: Efficient edge updates and the design of the graph storage data structure

Year Venue Authors Title Tags P E N
2021 FPGA HUST GraSU: A Fast Graph Update Library for FPGA-based Dynamic Graph Processing differential data management based on spatial similarity; Incremental Value Measurer (IVM); Value-Aware Memory Manager (VMM) 4 4 3
2021 MICRO UCR JetStream: Graph Analytics on Streaming Data with Event-Driven Hardware Accelerator first streaming graph accelerator; asynchronous incremental algorithms; asynchronous edge deletion handling (VAP; DAP) 3 3 4
2022 ISCA HUST TDGraph: A Topology-Driven Accelerator for High-Performance Streaming Graph Processing topology-driven incremental execution; streaming graph processing; regularized state propagation; vertex states coalescing 4 4 3
2023 MICRO UCR MEGA Evolving Graph Accelerator first evolving graph accelerator; Batch-Oriented Execution (BOE); deletion-free based on CommonGraph; Batch Pipelining 3 3 4
2024 TRETS UoV Dynamic-ACTS - A Dynamic Graph Analytics Accelerator For HBM-Enabled FPGAs a novel edge packing format (ACTPACK); hashed edge updates; low-overhead online partitioning 4 4 3

Hypergraph Accelerators

Solution: Exploit the parts shared among overlapping hyperedges

Year Venue Authors Title Tags P E N
2022 MICRO HUST A Data-Centric Accelerator for High-Performance Hypergraph Processing Data-Centric; Load-Trigger-Reduce (LTR); Adaptive Data Loading 4 4 3
2025 HPCA HUST MeHyper: Accelerating Hypergraph Neural Networks by Exploring Implicit Dataflows Microedge; Microedge-Centric Dataflow; RePAG Execution Model 4 3 4

Graph Mining Accelerators

Challenge: Complex graph algorithms, Irregular access patterns

Year Venue Authors Title Tags P E N
2021 ISCA MIT FlexMiner: A Pattern-Aware Accelerator for Graph Pattern Mining pattern-aware GPM accelerator; software/hardware co-design; pattern-specific execution plan; connectivity map (c-map) 4 4 3

DNN Accelerators

GEMM

Year Venue Authors Title Tags P E N
2020 HPCA Georgia Tech SIGMA: A Sparse and Irregular GEMM Accelerator with Flexible Interconnects for DNN Training flexible dot product engine; forwarding adder network 4 2 3
2025 TCAS-I Edin. DiP: A Scalable, Energy-Efficient Systolic Array for Matrix Multiplication Acceleration Permutated Weight;FIFO-less Architecture 4 4 3

Layer Fusion Accelerators

Solution: Use layer fusion to combine multiple layers of a neural network into a single fused layer. This can reduce the computations and memory accesses required during inference, leading to faster execution and lower power consumption.
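A minimal sketch of the idea, assuming two back-to-back 1D convolutions: the fused version produces the output tile by tile, so only a small slice of the intermediate feature map exists at any time. The function names and the tile size are hypothetical, and this sketch simply recomputes the intermediate values that overlap between neighboring tiles instead of caching them.

```python
def conv1d(x, k):
    """Valid 1D convolution (correlation form) of list x with kernel k."""
    n = len(x) - len(k) + 1
    return [sum(x[i + j] * k[j] for j in range(len(k))) for i in range(n)]


def layer_by_layer(x, k1, k2):
    """Baseline: the full intermediate feature map is materialized."""
    return conv1d(conv1d(x, k1), k2)


def fused(x, k1, k2, tile=4):
    """Fused execution: each output tile reads only the input window it needs."""
    out_len = len(x) - len(k1) - len(k2) + 2
    halo = len(k1) + len(k2) - 2      # extra input elements required per tile
    out = []
    for start in range(0, out_len, tile):
        stop = min(start + tile, out_len)
        window = x[start:stop + halo]
        mid = conv1d(window, k1)      # small intermediate tile, never stored whole
        out.extend(conv1d(mid, k2))
    return out


x = list(range(10))
k1, k2 = [1, 0, -1], [1, 2, 1]
assert fused(x, k1, k2) == layer_by_layer(x, k1, k2)
```

The trade-off visible here is between recomputing overlapping intermediates and buffering them, which is a choice any fused-layer design has to make.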

Year Venue Authors Title Tags P E N
2016 MICRO SBU Fused-Layer CNN Accelerators fuse the processing of multiple CNN layers by modifying the order in which the input data are brought on chip
2025 TC KU Leuven Stream: Design Space Exploration of Layer-Fused DNNs on Heterogeneous Dataflow Accelerators fine-grain mapping paradigm; mapping of layer-fused DNNs on heterogeneous dataflow accelerator architectures; memory- and communication-aware latency analysis; constraint optimization
2024 SOCC IIT Hyderabad Hardware-Aware Network Adaptation using Width and Depth Shrinking including Convolutional and Fully Connected Layer Merging Width Shrinking: reduces the number of feature maps in CNN layers; Depth Shrinking: Merge of conv layer and fc layer
2024 ICSAI MIT LoopTree: Exploring the Fused-Layer Dataflow Accelerator Design Space design space that supports set of tiling, recomputation, retention choices, and their combinations; model that validates design space

LLM Accelerators

Challenge: LLM accelerators face challenges in terms of memory bandwidth, power consumption, and the need for efficient data movement.

Accelerators facing Memory Wall

Challenge: LLMs require large amounts of memory capacity and bandwidth to store and access the model parameters and intermediate results, which can lead to memory bottlenecks and reduced performance.

Year Venue Authors Title Tags P E N
2024 ISCA Furiosa AI TCP: A Tensor Contraction Processor for AI Workloads tensor contraction as a hardware primitive; circuit-switched fetch network for hierarchical data reuse; Einstein summation for tactic exploration 4 3 3
2024 DATE NTU ViTA: A Highly Efficient Dataflow and Architecture for Vision Transformers highly efficient memory-centric dataflow; fused special function module for non-linear functions; A comprehensive DSE of ViTA Kernels and VMUs
2025 arXiv SJTU ROMA: A Read-Only-Memory-based Accelerator for QLoRA-based On-Device LLM hybrid ROM-SRAM architecture for on-device LLM; B-ROM design for area-efficient ROM; fused cell integration of ROM and compute unit; QLoRA rank adaptation for task-specific tuning; on-chip storage optimization for quantized models
2025 ISCA Duke Ecco: Improving Memory Bandwidth and Capacity for LLMs via Entropy-Aware Cache Compression entropy-aware cache compression for LLMs; group-wise non-uniform quantization; shared k-means patterns; parallel Huffman hardware decoder 4 3 3

Compiler-Scheduled Cacheless Architecture

Challenge: Reactive hardware architectures (like GPUs) devote significant area and power to dynamic scheduling and caching, which leads to unpredictable tail latencies and limits the compute utilization in large-scale distributed systems.

Year Venue Authors Title Tags P E N
2020 ISCA Groq Think Fast: A Tensor Streaming Processor (TSP) for Accelerating Deep Learning Workloads Space-time determined;Functionally Sliced and Data stream Microarchitecture 4 5 4
2022 ISCA Groq A software-defined tensor streaming multiprocessor for large-scale machine learning Multi-TSP network;Removal of Hardware Flow Control;Deterministic Load Balancing 4 5 4

Algorithmic Accelerators

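Solution: Algorithmic accelerators co-design the algorithm and the hardware, using techniques such as approximation, pruning, speculation, and fine-grained pipelining to reduce the work that LLM inference requires.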

Year Venue Authors Title Tags P E N
2020 HPCA Seoul National A3: Accelerating Attention Mechanisms in Neural Networks with Approximation greedy candidate search for reducing search targets; post-scoring selection with dynamic thresholding for softmax; lookup-table-based exponent modules 4 3 2
2021 HPCA MIT SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning cascade pruning; high-parallelism top-K engine; progressive quantization 3 4 4
2024 ASPLOS CMU SpecInfer: Accelerating Large Language Model Serving with Tree-based Speculative Inference and Verification tree-based speculative inference; topology-aware causal mask; multi-step speculative sampling 4 4 4
2025 MICRO KAIST HLX: A Unified Pipelined Architecture for Optimized Performance of Hybrid Transformer-Mamba Language Models PipeFlash fine-grained pipelining for Attention; PipeSSD fused pipelined execution for Mamba-2; Unified Reconfigurable Streamlined Core (URSC);inter-operation dependency mitigation 4 3 3

Wafer-Scale Accelerators

Solution: Adopt holistic hardware-software co-design to integrate high-bandwidth Network-on-Wafer (NoW) architectures with fault-tolerant mapping strategies, thereby mitigating yield limitations and communication bottlenecks to maximize system throughput.

Year Venue Authors Title Tags P E N
2025 ISCA THU PD Constraint-aware Physical/Logical Topology Co-Design for Network on Wafer mesh-switch physical topology; dual-granularity logical topology 4 3 2
2025 ISCA THU WSC-LLM: Efficient LLM Service and Architecture Co-exploration for Wafer-scale Chips optimal resource partition algorithm; optimal KV cache placement algorithm 4 3 2
2026 HPCA THU WATOS: Efficient LLM Training Strategies and Architecture Co-exploration for Wafer-scale Chip globally coordinated memory-efficient recomputation; location-aware resource placement; 3-stage system-level robustness design 4 4 2

Quantized DNN Accelerators

Solution: Quantized DNN accelerators are designed to efficiently execute quantized neural networks, which use lower precision representations for weights and activations.
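As a rough illustration of what "lower precision representations" means in practice, the sketch below applies symmetric uniform quantization and computes a dot product with integer multiply-accumulates. The scheme and the function names are assumptions chosen for simplicity; the papers listed below use more elaborate encodings (outlier-aware, logarithmic-posit, mixed-precision).

```python
def quantize(values, num_bits=8):
    """Symmetric uniform quantization; returns (integer codes, scale)."""
    qmax = 2 ** (num_bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate real values."""
    return [c * scale for c in codes]


def quantized_dot(a, b, num_bits=8):
    """Dot product using integer MACs, rescaled once at the end."""
    qa, sa = quantize(a, num_bits)
    qb, sb = quantize(b, num_bits)
    acc = sum(x * y for x, y in zip(qa, qb))  # integer-only accumulation
    return acc * sa * sb


weights = [0.5, -1.0, 0.25]
activations = [1.0, 2.0, -4.0]
exact = sum(w * a for w, a in zip(weights, activations))
print(exact, quantized_dot(weights, activations))
```

The single rescale after accumulation is what lets the inner loop run entirely on narrow integer hardware.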

Year Venue Authors Title Tags P E N
2018 ISCA SNU Energy-Efficient Neural Network Accelerator Based on Outlier-Aware Low-Precision Computation accelerator architecture for outlier-aware quantized models; outlier-aware low-precision computation; separate outlier MAC unit 4 3 2
2024 DAC ASU Algorithm-Hardware Co-Design of Distribution-Aware Logarithmic-Posit Encodings for Efficient DNN Inference composite data type Logarithmic Posits (LP); automated post training LP Quantization (LPQ) Framework based on genetic algorithms; mixed-precision LP Accelerator (LPA) 3 3 2
2023 HPCA UPC Mix-GEMM: An efficient HW-SW Architecture for Mixed-Precision Quantized Deep Neural Networks Inference on Edge Devices Complete mixed-precision flexibility; hardware accelerator & BLIS-based library with custom RISC-V ISA extensions 3 2 3

Bit-Sliced DNN Accelerators

Solution: Bit-sliced DNN accelerators break data down into smaller bit-slices, allowing more efficient processing and reducing memory and compute resource requirements.
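The decomposition can be sketched in a few lines. The example below, which assumes unsigned operands for simplicity, multiplies two 8-bit values using only 4-bit by 4-bit partial products and skips any partial product with a zero slice. The function names are hypothetical, and the signed bit-slice representations used by some of the listed papers are not modeled here.

```python
def split_slices(value, slice_bits=4, num_slices=2):
    """Split an unsigned integer into little-endian slices of slice_bits each."""
    mask = (1 << slice_bits) - 1
    return [(value >> (i * slice_bits)) & mask for i in range(num_slices)]


def bit_sliced_multiply(a, b, slice_bits=4, num_slices=2):
    """Multiply two unsigned ints using only slice_bits x slice_bits products."""
    total = 0
    for i, a_s in enumerate(split_slices(a, slice_bits, num_slices)):
        for j, b_s in enumerate(split_slices(b, slice_bits, num_slices)):
            if a_s == 0 or b_s == 0:
                continue  # slice-level sparsity: a zero slice contributes nothing
            # Shift each partial product to the position of its slices.
            total += (a_s * b_s) << ((i + j) * slice_bits)
    return total


assert split_slices(0xA3) == [0x3, 0xA]
assert bit_sliced_multiply(0xA3, 0x17) == 0xA3 * 0x17
```

Small values have an all-zero upper slice, so their high partial products are skipped entirely, which is the kind of sparsity that bit-slice designs aim to exploit.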

Year Venue Authors Title Tags P E N
2018 ISCA Georgia Tech Bit Fusion: Bit-Level Dynamically Composable Architecture for Accelerating Deep Neural Network accelerator for layer-aware quantized DNN; bit-flexible computation unit; block-structured instruction set architecture 4 3 3
2023 HPCA KAIST Sibia: Signed Bit-slice Architecture for Dense DNN Acceleration with Slice-level Sparsity Exploitation signed bit-slice representation;flexible zero skipping processing element 3 3 4
2024 HPCA KU Leuven BitWave: Exploiting Column-Based Bit-Level Sparsity for Deep Learning Acceleration Bit-column sparsity for both computation reduction and data compression; Single-shot Bit-Flip post-training 3 3 3
2025 HPCA POSTECH Panacea: Novel DNN Accelerator using Accuracy-Preserving Asymmetric Quantization and Energy-Saving Bit-Slice Sparsity Asymmetrically-Quantized bit-Slice GEMM; Zero-Point Manipulation and Distribution-based Bit-Slicing to increase sparsity 3 3 4
2025 HPCA Yonsei Bit-slice Architecture for DNN Acceleration with Slice-level Sparsity Enhancement and Exploitation both input AND weight sparsity at bit-slice level; 8-bit data processing with 4-bit multipliers; Scale regularization during training to enhance sparsity 3 2 2

Reconfigurable Accelerators

Solution: Reconfigurable accelerators not only break the trade-off between flexibility and performance, but also enable hardware to adapt to algorithm changes as quickly as software while maintaining high energy efficiency.

Year Venue Authors Title Tags P E N
2018 ASPLOS Georgia Tech MAERI: Enabling Flexible Dataflow Mapping over DNN Accelerators via Reconfigurable Interconnects augmented reduction tree (ART) for link conflict; chubby distribution tree for bandwidth optimization; ART based virtual neuron construction 4 3 2
2019 JETCAS MIT Eyeriss v2: A Flexible Accelerator for Emerging Deep Neural Networks on Mobile Devices hierarchical mesh NoC for multiple transmission modes; sparse PE architecture 5 4 2
2023 ASPLOS UM & Georgia Tech Flexagon: A Multi-dataflow Sparse-Sparse Matrix Multiplication Accelerator for Efficient DNN Processing merger-reduction network for area efficiency; compression format conversion without hardware module; dedicated L1 memory architecture for different access pattern 4 3 2
2023 MICRO MIT HighLight: Efficient and Flexible DNN Acceleration with Hierarchical Structured Sparsity hierarchical structured sparsity (HSS) enables systematic flexibility on sparsity patterns; modularized hierarchical sparsity acceleration architecture 3 3 3
2025 DATE PKU PEARL: FPGA-Based Reinforcement Learning Acceleration with Pipelined Parallel Environments RL environment accelerator; PCIe communication optimization; IP-based design 3 3 2
2025 MICRO Maryland Misam: Machine Learning Assisted Dataflow Selection in Accelerators for Sparse Matrix Multiplication ML-assisted runtime dataflow selection; lightweight decision tree predictor; intelligent reconfiguration engine for FPGAs; cost-benefit analysis for hardware switching 3 4 3

Benchmarks

Year Venue Authors Title Tags P E N
2025 arXiv Cambridge Benchmarking Ultra-Low-Power µNPUs Comparative µNPU Benchmarking (µNPU: microcontroller-scale Neural Processing Unit); open-source model compilation framework; µNPU memory I/O bottleneck identification 4 4 2

Dataflow Architecture

Solution: Dataflow architecture allows the execution of instructions based on the availability of data rather than a predetermined sequence, leading to more efficient use of resources and better performance in parallel processing and real-time systems.
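The firing rule can be illustrated with a tiny interpreter in which a node executes as soon as all of its operands are available, regardless of the order in which nodes were declared. This is a conceptual sketch only; `run_dataflow` and the graph encoding are assumptions for this example and do not correspond to any listed architecture.

```python
import operator


def run_dataflow(nodes, inputs):
    """Execute a dataflow graph: a node fires once all of its operands exist.

    nodes:  dict name -> (function, [operand names])  (hypothetical encoding)
    inputs: dict name -> initial value
    Returns (all computed values, order in which nodes fired).
    """
    values = dict(inputs)
    waiting = dict(nodes)
    fired = []
    while waiting:
        ready = [n for n, (_, ops) in waiting.items()
                 if all(o in values for o in ops)]
        if not ready:
            raise ValueError("deadlock: remaining nodes have unavailable operands")
        for name in ready:            # all ready nodes could fire in parallel
            fn, ops = waiting.pop(name)
            values[name] = fn(*(values[o] for o in ops))
            fired.append(name)
    return values, fired


# Declared "out of order": `out` is listed first but fires last.
graph = {
    "out": (operator.mul, ["s", "d"]),
    "s": (operator.add, ["a", "b"]),
    "d": (operator.sub, ["a", "b"]),
}
values, fired = run_dataflow(graph, {"a": 5, "b": 3})
print(values["out"], fired)  # 16 ['s', 'd', 'out']
```

Nodes `s` and `d` become ready in the same step, which is where a dataflow machine finds parallelism without a program counter dictating the sequence.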

Year Venue Authors Title Tags P E N
2019 ASPLOS THU Tangram: Optimized Coarse-Grained Dataflow for Scalable NN Accelerators buffer sharing dataflow(BSD); alternate layer loop ordering (ALLO) dataflow; heuristics spatial layer mapping algorithm
2024 MICRO CMU The TYR Dataflow Architecture: Improving Locality by Taming Parallelism local tag spaces technique; space tag managing instruction set; CT based concurrent-block communication
2024 MICRO UCR Sparsepipe: Sparse Inter-operator Dataflow Architecture with Cross-Iteration Reuse producer-consumer reuse; cross-iteration reuse; sub-tensor dependency; OEI dataflow; sparsepipe architecture
2025 arXiv UCSB FETTA: Flexible and Efficient Hardware Accelerator for Tensorized Neural Network Training contraction sequence search engine; tensor contraction unit; distribution/reduction network 3 4 3

Data Mapping

Solution: Assign data to specific locations in memory or storage to optimize performance, reduce latency, and improve resource utilization.

Survey
Year Venue Authors Title Tags P E N
2013 DAC NUS Mapping on Multi/Many-core Systems: Survey of Current and Emerging Trends dense/run-time mapping; centralized/distributed management; hybrid mapping
Heuristic Algorithm
Year Venue Authors Title Tags P E N
2021 HPCA Georgia Tech MAGMA: An Optimization Framework for Mapping Multiple DNNs on Multiple Accelerator Cores sub-accelerator selection; fine-grained job prioritization; MAGMA crossover genetic operators
2023 ISCA THU MapZero: Mapping for Coarse-grained Reconfigurable Architectures with Reinforcement Learning and Monte-Carlo Tree Search GAT based DFG and CGRA embedding; routing penalty based reinforcement learning; Monte-Carlo tree search space exploration
2023 VLSI IIT Kharagpur Application Mapping Onto Manycore Processor Architectures Using Active Search Framework RNN based active search framework; IP-Core Numbering Scheme; active search with/without pretraining
Optimization Modeling
Year Venue Authors Title Tags P E N
2020 FPGA ETH Zurich Flexible Communication Avoiding Matrix Multiplication on FPGA with High-Level Synthesis computation and I/O decomposition model for matrix multiplication; 1D array collapse mapping method; internal double buffering
2021 HPCA Georgia Tech Heterogeneous Dataflow Accelerators for Multi-DNN Workloads heterogeneous dataflow accelerators (HDAs) for DNN; dataflow flexibility; high utilization across the sub-accelerators
2023 MICRO Alibaba; CUHK ArchExplorer: Microarchitecture Exploration Via Bottleneck Analysis dynamic event-dependence graph (DEG); induced DEG based critical path construction; bottleneck-removal-driven DSE
2023 ISCA THU Inter-layer Scheduling Space Definition and Exploration for Tiled Accelerators inter-layer encoding method; temporal cut; spatial cut; RA tree analysis
Fault Tolerant Mapping
Year Venue Authors Title Tags P E N
2017 SC NIT High-performance and energy-efficient fault-tolerance core mapping in NoC weighted communication energy; placing unmapped vertices region; application core graph; spare core placement algorithm
2019 IVLSI UESTC Optimized mapping algorithm to extend lifetime of both NoC and cores in many-core system lifetime budget metric; LBC-LBL mapping algorithm; electro-migration fault model
Reliability Management
Year Venue Authors Title Tags P E N
2020 DATE Turku Thermal-Cycling-aware Dynamic Reliability Management in Many-Core System-on-Chip Coffin-Manson equation based reliability model; reliability-aware mapping/scheduling; dynamic power management
2024 arXiv WUSTL A Two-Level Thermal Cycling-Aware Task Mapping Technique for Reliability Management in Manycore Systems temperature based bin packing; task-to-bin assignment; thermal cycling-aware based task-to-core mapping
2024 arXiv WUSTL A Reinforcement Learning-Based Task Mapping Method to Improve the Reliability of Clustered Manycores mean time to failure; density-based spatial clustering of applications with noise algorithm

CGRA

Solution: Optimally map an application's data flow graph onto the hardware fabric by simultaneously solving the tightly-coupled problems of scheduling, placement, and routing under strict spatial and temporal resource constraints.

Year Venue Authors Title Tags P E N
2017 ASAP Toronto CGRA-ME: A Unified Framework for CGRA Modelling and Exploration XML-based CGRA description; LLVM-based simulated annealing mapper; modulo routing resource graph 3 2 2
2019 TCAD Tsinghua Data-Flow Graph Mapping Optimization for CGRA With Deep Reinforcement Learning neighbor PE interchange defined action space; local pattern based reward function 3 3 2
2022 HPCA NUS LISA: Graph Neural Network based Portable Mapping on Spatial Accelerators label abstraction for quality mapping; GNN based label-aware mapping; label-aware simulated annealing 4 3 3
2024 MICRO NUS ICED: An Integrated CGRA Framework Enabling DVFS-Aware Acceleration tile based CGRA configuration; DVFS labeling; DVFS-Aware DFG Mapping 4 3 2

Task Scheduling

Year Venue Authors Title Tags P E N
2023 ICCAD PKU Memory-aware Scheduling for Complex Wired Networks with Iterative Graph Optimization topology-aware pruning algorithm; integer linear programming scheduling method; sub-graph fusion algorithm ; memory-aware graph partitioning
2023 MICRO Duke Si-Kintsugi: Towards Recovering Golden-Like Performance of Defective Many-Core Spatial Architectures for AI graph alignment algorithm for dataflow graph and platform PE graph; producer-consumer pattern dataflow generation algorithm

Many-core Architecture

Challenge: Many-core architectures are designed to handle a large number of cores, but they face challenges in terms of power consumption, performance, and resource allocation.

Resource Management

Challenge: Cores share resources with one another. The problem is how to achieve high performance by coordinating access among cores so that conflicts are prevented and data stays consistent.

Year Venue Authors Title Tags P E N
2015 HPCA Cornell Increasing Multicore System Efficiency through Intelligent Bandwidth Shifting online bandwidth shifting mechanism; prefetch usefulness (PU) level
2015 HPCA IBM XChange: A Market-based Approach to Scalable Dynamic Multi-resource Allocation in Multicore Architectures CMP multiresource allocation mechanism XChange; market framework based modeling
2018 MICRO SNU RpStacks-MT: A High-throughput Design Evaluation Methodology for Multi-core Processors graph-based multi-core performance model; distance-based memory system model; dynamic scheduling reconstruction method
2023 MICRO Yonsei McCore: A Holistic Management of High-Performance Heterogeneous Multicores cluster partitioning via index hash function; partitions balancing method; hardware support for RL based scheduling

Hardware Design

Solution: Hardware implementation for many-core architecture to achieve massive parallelism.

Year Venue Authors Title Tags P E N
2016 SCIS THU&BNU&CAS The Sunway TaihuLight supercomputer: system and applications Sunway TaihuLight's composition; scientific computing applications on TaihuLight 3 4 2
2017 IPDPSW SJTU&Tokyo Tech Benchmarking SW26010 Many-core Processor hand-coded assembly benchmark for SW26010; CPE pipeline&memory hierarchy&RLC mechanism benchmarking 3 4 2
2023 MICRO THU MAICC: A Lightweight Many-core Architecture with In-Cache Computing for Multi-DNN Parallel Inference slice improved and hardware-implemented reduction CIM; ISA extension for CIM; CNN layer segmentation and mapping algorithm

Application Optimization

Year Venue Authors Title Tags P E N
2023 SC NUDT Optimizing Direct Convolutions on ARM Multi-Cores direct convolution algorithm NDirect; loop ordering algorithm; micro convolution kernel for computing & packeting
2023 SC NUDT Optimizing MPI Collectives on Shared Memory Multi-Cores intra-node reduction algorithm for redundant data movements; fine grained non-temporal store based adaptive collectives
2024 PPoPP NUDT Towards Scalable Unstructured Mesh Computations on Shared Memory Many-Cores task dependency tree(TDT); tree traversal based parallel algorithm for CPU/GPU

Architecture DSE

Challenge: It's crucial to find the optimal hardware configurations that meet performance, power, and area constraints for specific applications.

NOC DSE

Year Venue Authors Title Tags P E N
2018 ICCAD WSU Hybrid On-Chip Communication Architectures for Heterogeneous Manycore Systems many-to-few communication patterns; long range shortcut based wireless NoC ; 3D-TSV based heterogeneous NoC
2018 IEEE TC WSU On-Chip Communication Network for Efficient Training of Deep Convolutional Networks on Heterogeneous Manycore Systems wireless-enabled heterogeneous NoC; archived multi-objective simulated annealing for network connectivity

Mapping & Co-Exploration DSE

Challenge: Efficiently co-optimize DNN mapping and hardware architecture under complex constraints.

Year Venue Authors Title Tags P E N
2020 ICCAD UIUC DNNExplorer: A Framework for Modeling and Exploring a Novel Paradigm of FPGA-based DNN Accelerator two-level (global and local) automatic DSE engine; dynamic design space exploration framework; high-dimensional design space support 4 4 4
2024 HPCA THU Gemini: Mapping and Architecture Co-exploration for Large-scale DNN Chiplet Accelerators layer-centric encoding method; DP-based graph partition algorithm; SA based D2D link communication optimization
2024 ASPLOS THU Cocco: Hardware-Mapping Co-Exploration towards Memory Capacity-Communication Optimization consumption-centric flow based subgraph execution scheme; main/side region based memory management
2024 ASPDAC CUHK SoC-Tuner: An Importance-guided Exploration Framework for DNN-targeting SoC Design intercluster distance algorithm; importance-based pruning and initialization 3 2 2
2024 arXiv Georgia Tech PIPEORGAN: Efficient Inter-operation Pipelining with Flexible Spatial Org spatial organization strategy pipeorgan for inter-operator pipelining; augmented mesh for pipelining(AMP) topology 4 2 2

Microarchitecture & Cross-Architecture DSE

Challenge: Efficiently explore and optimize design spaces across microarchitectures and heterogeneous hardware.

Year Venue Authors Title Tags P E N
2025 arXiv THU & Macau MLDSE: Scaling Design Space Exploration Infrastructure for Multi-Level Hardware IR and builder based hardware modeling; cross-architecture DSE; spatial-level DSE 3 3 2
2025 arXiv PKU DiffuSE: Cross-Layer Design Space Exploration of DNN Accelerator via Diffusion-Driven Optimization diffusion-based design generation; conditional sampling 3 4 3

Data Access Accelerators

Challenge: Indirect and sparse memory access patterns; low memory bandwidth utilization; core structural limitations

Year Venue Authors Title Tags P E N
2025 ISCA UMich DX100: Programmable Data Access Accelerator for Indirection indirect memory access accelerator; bulk memory access reordering; DRAM row-buffer hit rate optimization; programmable data access ISA 4 3 2

HBM Interconnects

Challenge: The built-in crossbar of HBM FPGAs suffers from contention and low bandwidth during many-to-many unicast access, and standard HLS lacks support for efficient burst buffering.

Year Venue Authors Title Tags P E N
2021 FPGA UCLA HBM Connect: High-Performance HLS Interconnect for FPGA HBM HBM Connect; HLS Virtual Buffer (HVB); Mux-Demux Switch; butterfly custom crossbar; many-to-many unicast; pseudo-channel optimization 4 4 2