Solution: Use layer fusion to combine multiple layers of a neural network into a single processing pass. This reduces the number of computations and off-chip memory accesses required during inference, leading to faster execution and lower power consumption.
| Year | Venue | Authors | Title | Tags | P | E | N |
|------|-------|---------|-------|------|---|---|---|
| 2016 | MICRO | SBU | Fused-Layer CNN Accelerators | fuses the processing of multiple CNN layers by modifying the order in which input data are brought on chip | | | |
| 2025 | TC | KU Leuven | Stream: Design Space Exploration of Layer-Fused DNNs on Heterogeneous Dataflow Accelerators | fine-grained mapping paradigm; mapping of layer-fused DNNs onto heterogeneous dataflow accelerator architectures; memory- and communication-aware latency analysis; constraint optimization | | | |
| 2024 | SOCC | IIT Hyderabad | Hardware-Aware Network Adaptation using Width and Depth Shrinking including Convolutional and Fully Connected Layer Merging | width shrinking: reduces the number of feature maps in CNN layers; depth shrinking: merges conv and FC layers | | | |
| 2024 | ICSAI | MIT | LoopTree: Exploring the Fused-Layer Dataflow Accelerator Design Space | design space covering tiling, recomputation, and retention choices and their combinations; a model that validates the design space | | | |
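The core idea above can be shown in a minimal NumPy sketch (not any paper's actual dataflow; the tile size and 3-tap kernels are illustrative assumptions): two chained 1-D convolutions are computed tile by tile, so only the small slice of the intermediate feature map that one output tile depends on is ever produced.

```python
import numpy as np

def conv1d(x, w):
    """Valid-mode 1-D correlation with a k-tap kernel."""
    k = len(w)
    return np.array([np.dot(x[i:i + k], w) for i in range(len(x) - k + 1)])

def unfused(x, w1, w2):
    """Layer-by-layer: the full intermediate feature map is materialized."""
    inter = conv1d(x, w1)            # length len(x) - 2
    return conv1d(inter, w2)         # length len(x) - 4

def fused(x, w1, w2, tile=4):
    """Layer-fused: produce the output tile by tile; each tile only
    computes the intermediate slice it actually needs."""
    out_len = len(x) - 4             # two valid 3-tap convs shrink by 4
    out = np.empty(out_len)
    for t in range(0, out_len, tile):
        hi = min(t + tile, out_len)
        # out[t:hi] needs inter[t:hi+2], which needs x[t:hi+4]
        inter_tile = conv1d(x[t:hi + 4], w1)
        out[t:hi] = conv1d(inter_tile, w2)
    return out
```

In the fused version the intermediate working set is `tile + 2` elements instead of `len(x) - 2`, which is exactly the on-chip-buffer saving that fused-layer accelerators trade against the small halo recomputation at tile borders.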
Challenge: LLMs require large amounts of memory bandwidth to store and access model parameters and intermediate results, which can lead to memory bottlenecks and reduced performance.
| Year | Venue | Authors | Title | Tags | P | E | N |
|------|-------|---------|-------|------|---|---|---|
| 2024 | ISCA | Furiosa AI | TCP: A Tensor Contraction Processor for AI Workloads | tensor contraction as a hardware primitive; circuit-switched fetch network for hierarchical data reuse; Einstein summation for tactic exploration | 4 | 3 | 3 |
| 2024 | DATE | NTU | ViTA: A Highly Efficient Dataflow and Architecture for Vision Transformers | highly efficient memory-centric dataflow; fused special-function module for non-linear functions; comprehensive DSE of ViTA kernels and VMUs | | | |
| 2025 | arXiv | SJTU | ROMA: A Read-Only-Memory-based Accelerator for QLoRA-based On-Device LLM | hybrid ROM-SRAM architecture for on-device LLM; B-ROM design for area-efficient ROM; fused-cell integration of ROM and compute unit; QLoRA rank adaptation for task-specific tuning; on-chip storage optimization for quantized models | | | |
| 2025 | ISCA | Duke | Ecco: Improving Memory Bandwidth and Capacity for LLMs via Entropy-Aware Cache Compression | | | | |
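Why this challenge bites can be quantified with a back-of-the-envelope roofline calculation. The sketch below assumes illustrative hardware numbers (300 TFLOP/s peak, 3 TB/s HBM), not any specific accelerator from the table:

```python
def attainable_gflops(ai, peak_gflops, bw_gbs):
    """Roofline model: performance is capped by compute or by memory,
    whichever roof the arithmetic intensity (FLOP/byte) hits first."""
    return min(peak_gflops, ai * bw_gbs)

# GEMV for one decoded token: y = W @ x with W an (n, n) fp16 matrix.
# 2*n*n FLOPs against ~2*n*n bytes of weight traffic -> AI ~ 1 FLOP/byte.
n = 4096
flops = 2 * n * n
bytes_moved = 2 * n * n          # fp16 weights dominate the traffic
ai = flops / bytes_moved         # = 1.0 FLOP/byte

# Hypothetical accelerator: 300 TFLOP/s peak, 3 TB/s HBM (illustrative).
perf = attainable_gflops(ai, peak_gflops=300_000, bw_gbs=3_000)
# perf is ~3 TFLOP/s, i.e. about 1% of peak: firmly memory-bound,
# which is why compression and on-chip-storage techniques pay off.
```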
Challenge: Reactive hardware architectures (like GPUs) devote significant area and power to dynamic scheduling and caching, which leads to unpredictable tail latencies and limits the compute utilization in large-scale distributed systems.
| Year | Venue | Authors | Title | Tags | P | E | N |
|------|-------|---------|-------|------|---|---|---|
| 2020 | ISCA | Groq | Think Fast: A Tensor Streaming Processor (TSP) for Accelerating Deep Learning Workloads | deterministic space-time execution model; functionally-sliced, data-stream microarchitecture | 4 | 5 | 4 |
| 2022 | ISCA | Groq | A software-defined tensor streaming multiprocessor for large-scale machine learning | multi-TSP network; removal of hardware flow control; deterministic load balancing | | | |
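The deterministic alternative these papers describe can be illustrated with a toy statically scheduled program (the unit names and 1-cycle latency model below are hypothetical, loosely echoing the functional-slice idea, not Groq's actual ISA):

```python
# Hypothetical statically scheduled program: (issue cycle, unit, op).
# The compiler fixes every issue slot at build time, so no runtime
# arbitration, caching, or flow control is needed.
program = [
    (0, "MXM", "matmul t0"),
    (0, "SXM", "shift   t1"),   # different slices may issue the same cycle
    (3, "VXM", "add     t2"),   # placed only after t0/t1 are known ready
    (5, "MEM", "store   t2"),
]

def makespan(prog):
    """Completion time is a compile-time constant: the last issue slot
    plus that unit's fixed latency (1 cycle in this toy model)."""
    return max(cycle for cycle, _, _ in prog) + 1

# Latency is known exactly before the program ever runs -- the property
# that eliminates the unpredictable tail latencies of reactive designs.
```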
Solution: Adopt holistic hardware-software co-design to integrate high-bandwidth Network-on-Wafer (NoW) architectures with fault-tolerant mapping strategies, thereby mitigating yield limitations and communication bottlenecks to maximize system throughput.
| Year | Venue | Authors | Title | Tags | P | E | N |
|------|-------|---------|-------|------|---|---|---|
| 2025 | ISCA | THU | PD Constraint-aware Physical/Logical Topology Co-Design for Network on Wafer | | | | |
Solution: Quantized DNN accelerators efficiently execute quantized neural networks, which use lower-precision representations for weights and activations.
| Year | Venue | Authors | Title | Tags | P | E | N |
|------|-------|---------|-------|------|---|---|---|
| 2018 | ISCA | SNU | Energy-Efficient Neural Network Accelerator Based on Outlier-Aware Low-Precision Computation | accelerator architecture for outlier-aware quantized models; outlier-aware low-precision computation; separate outlier MAC unit | 4 | 3 | 2 |
| 2024 | DAC | ASU | Algorithm-Hardware Co-Design of Distribution-Aware Logarithmic-Posit Encodings for Efficient DNN Inference | composite data type Logarithmic Posits (LP); automated post-training LP quantization (LPQ) framework based on genetic algorithms; mixed-precision LP accelerator (LPA) | 3 | 3 | 2 |
| 2023 | HPCA | UPC | Mix-GEMM: An efficient HW-SW Architecture for Mixed-Precision Quantized Deep Neural Networks Inference on Edge Devices | complete mixed-precision flexibility; hardware accelerator and BLIS-based library with custom RISC-V ISA extensions | | | |
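The common scheme underlying these designs can be sketched in a few lines of NumPy (symmetric per-tensor int8 with a wide integer accumulator; a generic textbook recipe, not the specific encoding of any paper above):

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor int8 quantization: x is approximated by scale * q."""
    scale = np.max(np.abs(x)) / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def int8_matmul(qa, sa, qb, sb):
    """Integer GEMM in a wide (int32) accumulator, with a single
    floating-point rescale at the end -- the MAC datapath such
    accelerators implement in hardware."""
    acc = qa.astype(np.int32) @ qb.astype(np.int32)
    return acc * (sa * sb)

rng = np.random.default_rng(0)
a, b = rng.standard_normal((4, 8)), rng.standard_normal((8, 4))
qa, sa = quantize_int8(a)
qb, sb = quantize_int8(b)
err = np.max(np.abs(int8_matmul(qa, sa, qb, sb) - a @ b))
# err stays small because rounding error is bounded by half a scale step
```

The outlier-aware and mixed-precision papers in the table refine exactly this recipe: they spend extra bits (or a separate MAC unit) only on the few values where a single shared scale would clip or lose precision.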
Solution: Bit-sliced DNN accelerators break operands down into smaller bit-slices, allowing more efficient processing and reduced memory and compute requirements.
| Year | Venue | Authors | Title | Tags | P | E | N |
|------|-------|---------|-------|------|---|---|---|
| 2018 | ISCA | Georgia Tech | Bit Fusion: Bit-Level Dynamically Composable Architecture for Accelerating Deep Neural Networks | accelerator for layer-aware quantized DNNs; bit-flexible computation unit; block-structured instruction set architecture | 4 | 3 | 3 |
| 2023 | HPCA | KAIST | Sibia: Signed Bit-slice Architecture for Dense DNN Acceleration with Slice-level Sparsity Exploitation | signed bit-slice representation; flexible zero-skipping processing element | 3 | 3 | 4 |
| 2024 | HPCA | KU Leuven | BitWave: Exploiting Column-Based Bit-Level Sparsity for Deep Learning Acceleration | bit-column sparsity for both computation reduction and data compression; single-shot bit-flip post-training | 3 | 3 | 3 |
| 2025 | HPCA | POSTECH | Panacea: Novel DNN Accelerator using Accuracy-Preserving Asymmetric Quantization and Energy-Saving Bit-Slice Sparsity | asymmetrically-quantized bit-slice GEMM; zero-point manipulation and distribution-based bit-slicing to increase sparsity | 3 | 3 | 4 |
| 2025 | HPCA | Yonsei | Bit-slice Architecture for DNN Acceleration with Slice-level Sparsity Enhancement and Exploitation | both input AND weight sparsity at the bit-slice level; 8-bit data processing with 4-bit multipliers; scale regularization during training to enhance sparsity | | | |
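A minimal sketch of the slicing arithmetic itself (unsigned 8-bit operands split into 4-bit slices, matching the "8-bit data with 4-bit multipliers" idea in the last entry; signed representations and sparsity skipping as in Sibia/Panacea are omitted for brevity):

```python
def slices4(v):
    """Split an unsigned 8-bit value into (low, high) 4-bit slices."""
    return v & 0xF, (v >> 4) & 0xF

def mul8_with_4bit_units(a, b):
    """An 8x8 multiply built from four 4x4 partial products, shift-added
    by slice significance -- the recombination a bit-slice PE performs.
    A zero slice makes its partial products trivially skippable, which
    is the slice-level sparsity these accelerators exploit."""
    al, ah = slices4(a)
    bl, bh = slices4(b)
    return (al * bl) + ((al * bh + ah * bl) << 4) + ((ah * bh) << 8)
```

For example, `mul8_with_4bit_units(200, 123)` reproduces `200 * 123` exactly, since the four shifted partial products tile the full 16-bit result.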
Solution: Reconfigurable accelerators break the trade-off between flexibility and performance, letting hardware adapt to algorithm changes as quickly as software while maintaining high energy efficiency.
| Year | Venue | Authors | Title | Tags | P | E | N |
|------|-------|---------|-------|------|---|---|---|
| 2018 | ASPLOS | Georgia Tech | MAERI: Enabling Flexible Dataflow Mapping over DNN Accelerators via Reconfigurable Interconnects | augmented reduction tree (ART) to avoid link conflicts; chubby distribution tree for bandwidth optimization; ART-based virtual neuron construction | 4 | 3 | 2 |
| 2019 | JETCAS | MIT | Eyeriss v2: A Flexible Accelerator for Emerging Deep Neural Networks on Mobile Devices | hierarchical mesh NoC for multiple transmission modes; sparse PE architecture | 5 | 4 | 2 |
| 2023 | ASPLOS | UM & Georgia Tech | Flexagon: A Multi-dataflow Sparse-Sparse Matrix Multiplication Accelerator for Efficient DNN Processing | merger-reduction network for area efficiency; compression-format conversion without a dedicated hardware module; dedicated L1 memory architecture for different access patterns | 4 | 3 | 2 |
| 2023 | MICRO | MIT | HighLight: Efficient and Flexible DNN Acceleration with Hierarchical Structured Sparsity | | | | |
Solution: Dataflow architectures execute instructions based on the availability of data rather than a predetermined sequence, leading to more efficient use of resources and better performance in parallel processing and real-time systems.
| Year | Venue | Authors | Title | Tags | P | E | N |
|------|-------|---------|-------|------|---|---|---|
| 2019 | ASPLOS | THU | Tangram: Optimized Coarse-Grained Dataflow for Scalable NN Accelerators | | | | |
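The firing rule stated in the solution above can be demonstrated with a tiny interpreter (a generic illustration of dataflow semantics, not Tangram's coarse-grained scheme): a node executes as soon as all its operands exist, in no predetermined order.

```python
def dataflow_run(graph, inputs):
    """Fire any node whose operands are all available, regardless of
    textual program order -- the core dataflow execution rule."""
    values = dict(inputs)
    pending = dict(graph)                 # node -> (op, operand names)
    while pending:
        ready = [n for n, (_, deps) in pending.items()
                 if all(d in values for d in deps)]
        for n in ready:
            op, deps = pending.pop(n)
            values[n] = op(*(values[d] for d in deps))
    return values

# (x + y) * (x - y) as a graph; 'add' and 'sub' have no mutual
# dependence, so a spatial machine could fire them in the same cycle.
g = {
    "add": (lambda a, b: a + b, ("x", "y")),
    "sub": (lambda a, b: a - b, ("x", "y")),
    "mul": (lambda a, b: a * b, ("add", "sub")),
}
result = dataflow_run(g, {"x": 5, "y": 3})["mul"]   # 8 * 2 = 16
```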
Solution: Optimally map an application's data flow graph onto the hardware fabric by simultaneously solving the tightly-coupled problems of scheduling, placement, and routing under strict spatial and temporal resource constraints.
| Year | Venue | Authors | Title | Tags | P | E | N |
|------|-------|---------|-------|------|---|---|---|
| 2017 | ASAP | Toronto | CGRA-ME: A Unified Framework for CGRA Modelling and Exploration | | | | |
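The coupled placement-and-routing problem named in the solution above can be shown at toy scale (a hypothetical 2x2 fabric and 3-op graph; real CGRA mappers like CGRA-ME solve far larger instances with ILP or heuristics, and must also handle time-multiplexing and port limits):

```python
from itertools import permutations

# A toy 2x2 CGRA fabric (PE coordinates) and a 3-op dataflow graph.
pes = [(0, 0), (0, 1), (1, 0), (1, 1)]
ops = ["load", "mul", "store"]
edges = [("load", "mul"), ("mul", "store")]

def routing_cost(placement):
    """Total Manhattan hop count over all dataflow edges -- the routing
    term that placement must minimize on a spatial fabric."""
    return sum(abs(placement[a][0] - placement[b][0]) +
               abs(placement[a][1] - placement[b][1])
               for a, b in edges)

# Exhaustive placement search: assign each op to a distinct PE and keep
# the assignment whose routing is cheapest.
best = min((dict(zip(ops, p)) for p in permutations(pes, len(ops))),
           key=routing_cost)
# The optimum places each producer adjacent to its consumer (cost 2).
```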
Challenge: Many-core architectures are designed to host a large number of cores, but they face challenges in power consumption, performance, and resource allocation.
Challenge: Cores share resources with one another; coordinating access among cores to prevent conflicts and ensure data consistency, while still achieving high performance, remains a problem.
| Year | Venue | Authors | Title | Tags | P | E | N |
|------|-------|---------|-------|------|---|---|---|
| 2015 | HPCA | Cornell | Increasing Multicore System Efficiency through Intelligent Bandwidth Shifting | | | | |
Challenge: The built-in crossbar of HBM FPGAs suffers from contention and low bandwidth during many-to-many unicast access, and standard HLS lacks support for efficient burst buffering.
| Year | Venue | Authors | Title | Tags | P | E | N |
|------|-------|---------|-------|------|---|---|---|
| 2021 | FPGA | UCLA | HBM Connect: High-Performance HLS Interconnect for FPGA HBM | | | | |