Challenge: PIM cores can be placed at the bank level, at the off-chip buffer level, or even inside the SSD controller, which makes communication across these PIM/NDP levels difficult.
| Year | Venue | Authors | Title | Tags | P | E | N |
|---|---|---|---|---|---|---|---|
| 2026 | arXiv | ICT | PAM: Processing Across Memory Hierarchy for Efficient KV-centric LLM Serving System | HBM-PIM/DDR-PIM/SSD-PIM heterogeneous 3-tier request scheduling; PAMattention via online softmax for distributed token-wise parallelism; importance-aware greedy KV scheduling for load balancing | 4 | 4 | 3 |
| 2026 | HPCA | ETHZ | Conduit: Programmer-Transparent Near-Data Processing Using Multiple Compute-Capable Resources in Solid State Drives | loop auto-vectorization to align with SSD page layout; instruction-granularity offloading via holistic cost function | | | |
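The PAM entry above tags "online softmax for distributed token-wise parallelism". The following is a minimal pure-Python sketch of the general online-softmax merge rule, not PAM's actual implementation: each memory tier computes attention over its own slice of the KV cache, and the partial results are rescaled and combined. The function names, the two-tier split, and the toy vectors are all assumptions made for illustration.

```python
import math

def partial_attention(q, keys, values):
    """Attention over one KV shard.

    Returns (running max, sum of exponentials, unnormalised output).
    """
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    denom = sum(weights)
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return m, denom, out

def merge(a, b):
    """Combine two partial results by rescaling both to a shared max."""
    m_a, l_a, o_a = a
    m_b, l_b, o_b = b
    m = max(m_a, m_b)
    s_a, s_b = math.exp(m_a - m), math.exp(m_b - m)
    l = l_a * s_a + l_b * s_b
    o = [x * s_a + y * s_b for x, y in zip(o_a, o_b)]
    return m, l, o

def finalize(state):
    """Normalise the accumulated output by the accumulated denominator."""
    _, l, o = state
    return [x / l for x in o]

# Toy data (hypothetical): one query, four cached tokens.
q = [0.5, -1.0]
keys = [[1.0, 0.0], [0.0, 1.0], [2.0, 1.0], [-1.0, 3.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]

# Assumed split: the first two tokens live on one tier, the rest on another.
tier_a = partial_attention(q, keys[:2], values[:2])
tier_b = partial_attention(q, keys[2:], values[2:])
merged = finalize(merge(tier_a, tier_b))

# Single-shot attention over all tokens, used as the reference.
full = finalize(partial_attention(q, keys, values))
```

Because `merge` is associative, the same rule extends to any number of tiers, and only the triple (max, denominator, output) has to cross tier boundaries.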
Solution: Integrate compute units into the SSD controller to serve capacity-sensitive applications.
Year
Venue
Authors
Title
Tags
P
E
N
2024
HPCA
UCLA
BeaconGNN: Large-Scale GNN Acceleration with Out-of-Order Streaming In-Storage Computing
DirectGraph format for out-of-order sampling; die-level processing units; channel-level command router
4
2
3
2025
ISCA
ETHZ
REIS: A High-Performance and Energy-Efficient Retrieval System with In-Storage Processing
In-Storage processing
2
4
3
2025
ISCA
UCSD
In-Storage Acceleration of Retrieval Augmented Generation as a Service
metamorphic in-storage accelerator; Metadata Navigation Unit for dynamic data access
4
3
2
2025
arXiv
ETHZ
MARS: Processing-In-Memory Acceleration of Raw Signal Genome Analysis Inside the Storage Subsystem
PIM module inside the SSD controller; early signal quantization; read filtering
3
3
2
2025
ICCAD
SNU
LLM-on-the-Palm: Mobile LLM Inference with PIM-Enhanced NAND Flash Memory
a single MAC unit per plane; selective layer-wise mapping strategy offloading FC layers to PIM and attention to NPU; pipelined MAC and input broadcast via extended commands
Turbocharge ANNS on Real Processing-in-Memory by Enabling Fine-Grained Per-PIM-Core Scheduling
per-PU scheduling; persistent PIM kernel; per-PU dispatching with selective replication
3
4
4
2025
HPCA
UC Davis
NOVA: A Novel Vertex Management Architecture for Scalable Graph Processing
message-driven processors capable of executing algorithms; a direct-mapped cache with a write-back policy; support both asynchronous and bulk synchronous parallel execution models
GradPIM: A Practical Processing-in-DRAM Architecture for Gradient Descent
fixed-function PIM architecture for DNN gradient descent; non-invasive PIM operations using reserved DDR commands
3
3
2
2024
ASPLOS
PKU
PIM-DL: Expanding the Applicability of Commodity DRAM-PIMs for Deep Learning via Algorithm-System Co-Optimization
algorithm for DNN to look-up-table conversion; auto-tuner for optimizing LUT-NN mapping on DRAM-PIMs
3
4
3
2025
arXiv
National Tech Univ. of Athens
PIMfused: Near-Bank DRAM-PIM with Fused-layer Dataflow for CNN Data Transfer Optimization
hybrid dataflow combining fused-layer and layer-by-layer strategies
3
2
2
2026
HPCA
Seoul National
LoCaLUT: Harnessing Capacity-Computation Tradeoffs for LUT-Based Inference in DRAM-PIM
operation-packed LUT canonicalization via multiset indexing; reordering LUT for weight permutation remapping; LUT slice streaming for DRAM-buffer hierarchy
4
4
3
2026
HPCA
Seoul National
RoMe: Row Granularity Access Memory System for Large Language Models
row-granularity access interface for LLM streaming; virtual bank to eliminate bank groups and pseudo channels; logic-die command generator for C/A pin reduction and simpler MC
Challenge: Graph processing is fundamentally limited by memory bandwidth and requires frequent random accesses, which are not efficiently supported by non-interleaved, bank-level PIM architectures.
| Year | Venue | Authors | Title | Tags | P | E | N |
|---|---|---|---|---|---|---|---|
| 2022 | PACT | PKU | GNNear: Accelerating Full-Batch Training of Graph Neural Networks with Near-Memory Processing | splitting reduce operations to NDP units; narrow-shard strategy for data reuse; hybrid graph partition strategy for load balancing | 4 | 3 | 3 |
| 2025 | HPCA | ZJU | GOPIM: GCN-Oriented Pipeline Optimization for PIM Accelerators | ML-based replica resource allocation for pipeline streamlining; interleaved mapping with adaptive selective vertex updating | 2 | 3 | 3 |
| 2025 | MICRO | Seoul National | FALA: Locality-Aware PIM-Host Cooperation for Graph Processing with Fine-Grained Column Access | 8-byte-level granularity for fine-grained vertex access with HBM2-PIM; multiple non-contiguous column accesses within single activated DRAM row | 2 | 2 | 2 |
| 2026 | ASPLOS | Uppsala | CoGraf: Fully Accelerating Graph Applications with Fine-Grained PIM | tuple-based LLC for coalescing at flexible granularity; multi-column Fine-Grained PIM instructions to utilize row-level parallelism; predicated bank-parallel instructions for conditional apply-phase operations | | | |
Challenge: MoE models often have higher Op/Byte ratios, making bank-level PIM easily compute-bound and limiting speedup.
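A back-of-the-envelope roofline check makes the compute-bound concern concrete. The sketch below assumes an expert layer of the form `y = x @ W`, counts only weight traffic, and uses made-up peak-throughput and bandwidth figures for a bank-level PIM device. None of these numbers come from the listed papers.

```python
def arithmetic_intensity(tokens_per_expert, bytes_per_weight=2):
    """FLOPs per weight byte for y = x @ W.

    Each weight takes part in one multiply-add (2 FLOPs) per token;
    activation traffic is ignored for simplicity.
    """
    return 2.0 * tokens_per_expert / bytes_per_weight

def ridge_point(peak_flops, bandwidth_bytes_per_s):
    """Op/Byte ratio above which the device is compute-bound."""
    return peak_flops / bandwidth_bytes_per_s

def is_compute_bound(tokens_per_expert, peak_flops, bandwidth_bytes_per_s,
                     bytes_per_weight=2):
    intensity = arithmetic_intensity(tokens_per_expert, bytes_per_weight)
    return intensity > ridge_point(peak_flops, bandwidth_bytes_per_s)

# Hypothetical bank-level PIM: 4 TFLOP/s peak, 2 TB/s internal bandwidth,
# so the ridge point sits at 2 FLOPs per byte.
PIM_PEAK_FLOPS = 4e12
PIM_BANDWIDTH = 2e12

# Map tokens routed to one expert -> whether the expert GEMM is compute-bound.
report = {t: is_compute_bound(t, PIM_PEAK_FLOPS, PIM_BANDWIDTH)
          for t in (1, 4, 16)}
```

Under these assumptions a single token per expert stays memory-bound, while a handful of tokens already pushes the kernel past the ridge point, which is the regime where bank-level PIM stops delivering speedup.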
Year
Venue
Authors
Title
Tags
P
E
N
2024
DAC
Seoul National
MoNDE: Mixture of Near-Data Experts for Large-Scale Sparse Models
activation movement strategy to replace costly parameter movement; dynamic GPU-MoNDE load balancing for hot/cold experts
4
4
2
2025
MICRO
Samsung
Duplex: A Device for Large Language Models with Mixture of Experts, Grouped Query Attention, and Continuous Batching
replace the GPU HBM memory die with HBM-PIM die; expert and attention co-processing for dynamic workload splitting within MoE/attn layers
4
4
4
2025
ICCAD
PKU
HD-MoE: Hybrid and Dynamic Parallelism for Mixture-of-Expert LLMs with 3D Near-Memory Processing
LP-based hybrid TP and EP mapping; Bayesian optimization for topology-aware link balancing; online dynamic expert placement with predictive pre-broadcast
REPA: Reconfigurable PIM for the Joint Acceleration of KV Cache Offloading and Processing
reconfigurable ReRAM-PIM for KV cache offload and in-situ processing; bulk-wise memory setting instructions for wordline parallelism; locality-aware data mapping and transfer overlapping
2
2
2
2026
ASPLOS
RPI
STARC: Selective Token Access with Remapping and Clustering for Efficient LLM Decoding on PIM Systems
semantic KV clustering for PIM row-level alignment; hardware-friendly cosine K-means via PIM primitives; incremental append-only remapping for KV cache sparsity
Challenge: Host pages need to enable interleaving to improve concurrent throughput, while PIM pages need to disable it to maintain better locality, creating a conflict.
| Year | Venue | Authors | Title | Tags | P | E | N |
|---|---|---|---|---|---|---|---|
| 2023 | DAC | Georgia Tech | vPIM: Efficient Virtual Address Translation for Scalable Processing-in-Memory Architectures | network-contention-aware hashing to minimize cross-stack page table walks; pre-translation using repurposed PIM cores to move page table walks off the critical path | 4 | 4 | 3 |
| 2024 | ISCA | SJTU | UM-PIM: DRAM-based PIM with Uniform & Shared Memory Space | Uniform shared CPU-PIM memory; dual-track memory management; zero-copy data re-layout | | | |
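The interleaving conflict named in the challenge for this group can be illustrated with two toy address-to-bank mappings. This is a simplified sketch: the bank count, line size, and per-bank region size are assumed values, and real memory controllers use more elaborate hash functions.

```python
def interleaved_bank(addr, num_banks=8, line_bytes=64):
    """Fine-grained interleaving: consecutive cache lines rotate across banks."""
    return (addr // line_bytes) % num_banks

def localized_bank(addr, num_banks=8, bank_bytes=1 << 20):
    """No interleaving: each bank owns one contiguous 1 MiB region."""
    return (addr // bank_bytes) % num_banks

# A 4 KiB buffer starting at address 0, sampled once per 64-byte line.
buffer_addrs = range(0, 4096, 64)

# Host view: the buffer is spread over every bank, so accesses proceed in parallel.
banks_interleaved = {interleaved_bank(a) for a in buffer_addrs}

# PIM view: the buffer stays in one bank, so a bank-local core sees all of it.
banks_localized = {localized_bank(a) for a in buffer_addrs}
```

With interleaving, a bank-local PIM core would hold only every eighth line of the buffer. Without it, the host loses bank-level parallelism on the same buffer. The works in this group try to reconcile the two layouts.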
Challenge: Existing compilers are not optimized for locality-aware PIM architectures and require specialized programming models to fully utilize PIM capabilities.
Year
Venue
Authors
Title
Tags
P
E
N
2025
ISCA
POSTECH
ATIM: Autotuning Tensor Programs for Processing-in-DRAM
autotuning framework for DRAM PIM; search-based optimizing tensor compiler; balanced evolutionary search algorithm
3
3
4
2025
ISCA
ETHZ
OptiPIM: Optimizing Processing-in-Memory Acceleration Using Integer Linear Programming
layout-aware nested loop representation; Integer Linear Programming formulation for PIM mapping; analytical cost modeling for data layout enforcement
3
3
4
2026
HPCA
Hanyang
PIMphony: Overcoming Bandwidth and Capacity Inefficiency in PIM-based Long-Context LLM Inference System
token-centric PIM partitioning for high channel utilization; dynamic PIM command scheduling for out-of-order I/O-compute overlap; dynamic PIM access for dynamic virtual-to-physical translation
EasyDRAM: An FPGA-based Infrastructure for Fast and Accurate End-to-End Evaluation of Emerging DRAM Techniques
FPGA-based DRAM evaluation framework; C++ high-level language for description; time scaling for accurate modeling
3
4
3
2026
arXiv
PKU
A Full-Stack Performance Evaluation Infrastructure for 3D-DRAM-based LLM Accelerators
ATLAS framework for hybrid bonding 3D DRAM NMP; hierarchical SPMD/MPMD programming model; grid-based transient thermal analyzer for temperature-constrained architecture exploration
Piccolo: Large-Scale Graph Processing with Fine-Grained In-Memory Scatter-Gather
In-DRAM fine-grained scatter-gather via data bus offsets; fine-grained cache architecture using fg-tags; Standard DDR command interpretation for FIM control; Combined graph tiling with fine-grained memory access
3
3
4
2024
arXiv
Seoul National
PID-Comm: A Fast and Flexible Collective Communication Framework for Commodity Processing-in-DIMM Devices
Virtual hypercube PIM model; PE-assisted data reordering; in-register and cross-domain data modulation
3
4
3
2025
ISCA
KAIST
PIMnet: A Domain-Specific Network for Efficient Collective Communication in Scalable PIM
COCOTree: A Computation-Capable Architecture for Collective Communication in Scalable PIM
hierarchical binary tree topology for inter-PE communication; in-network computation via computation-capable nodes; two-phase packet-based protocol with configuration-computation decoupling
DIMM-Link: Enabling Efficient Inter-DIMM Communication for Near-Memory Processing
high-speed hardware link bridges between DIMMs; direct intra-group P2P communication & broadcast; hybrid routing mechanism for inter-group communication
2025
HPCA
SJTU
AsyncDIMM: Achieving Asynchronous Execution in DIMM-Based Near-Memory Processing
Application-Transparent Near-Memory Processing Architecture with Memory Channel Network
integrates a processor on a buffered DIMM; application-transparent near-memory processing; leverages memory channels for high-bandwidth/low-latency inter-processor communication
ComPASS: A Compatible PIM Protocol Architecture and Scheduling Solution for Processor-PIM Collaboration
PIM-ACT new memory command for multi-bank PIM operations; PIM request generator to offload host processor; static and adaptive throughput balancers for PIM and non-PIM request scheduling
4
2
2
2025
ASPLOS
SJTU
PUSHtap: PIM-based In-Memory HTAP with Unified Data Storage Format
PIM-specific HTAP storage data format; semi-interleaved data layout for CPU and PIM concurrent data access
2
3
3
2026
DAC
ICT
TriMoE: Augmenting GPU with AMX-Enabled CPU and DIMM-NDP for High-Throughput MoE Inference via Offloading
tri-domain offloading architecture coordinating GPU/AMX-enabled CPU/DIMM-NDP; bottleneck-aware greedy expert scheduling; prediction-driven expert relayout and rebalancing via DIMM-Link
HEAT: NPU-NDP HEterogeneous Architecture for Transformer-Empowered Graph Neural Networks
topology-aware mixed-precision encoding for transformer; subgraph bundling and reordering for GNN memory efficiency; decoupled dataflow for NPU-NDP concurrent execution
2
3
3
2025
ICCAD
PKU
LP-Spec: Leveraging LPDDR PIM for Efficient LLM Mobile Speculative Inference with Architecture-Dataflow Co-Optimization
GEMM-enhanced hybrid LPDDR5 PIM; near-data memory controller for concurrent NPU-PIM execution & data reallocation; hardware-aware draft token pruning
3
4
2
2025
arXiv
Cornell
P3-LLM: An Integrated NPU-PIM Accelerator for LLM Inference Using Hybrid Numerical Formats
low-precision PIM compute unit with temporal data reuse; operator fusion for quantized dataflow to minimize dequantization overhead
Challenge: The original UPMEM API library is not well suited to all workloads, especially those that require cross-bank communication.
Year
Venue
Authors
Title
Tags
P
E
N
2023
arXiv
ETHZ
A Framework for High-throughput Sequence Alignment using Real Processing-in-Memory Systems
Alignment-in-Memory framework; hybrid WRAM-MRAM sketch data management for PIM
2
3
4
2025
arXiv
ETHZ
PIMDAL: Mitigating the Memory Bottleneck in Data Analytics using a Real Processing-in-Memory System
PIMDAL library on UPMEM PIM system for data analytics; scatter/gather-aware transfers for inter-PIM communication; Apache Arrow for host memory management
Challenge: DIMM-based NDP architectures have no direct physical connectivity between banks, and the limited number of DDR channels leads to poor scalability.
Solution: Introduce CXL-based interconnects to enable direct communication between memory banks; Use CXL memory pools and CXL switches to enable scalable NDP architecture.
| Year | Venue | Authors | Title | Tags | P | E | N |
|---|---|---|---|---|---|---|---|
| 2022 | MICRO | UCSB | BEACON: Scalable Near-Data-Processing Accelerators for Genome Analysis near Memory Pool with the CXL Support | scalable hardware accelerator inside CXL switch or bank; lossless memory expansion for CXL memory pools | | | |
| 2024 | HPCA | Samsung | An LPDDR-based CXL-PNM Platform for TCO-efficient Inference of Transformer-based Large Language Models | LPDDR5X-based CXL memory module; Processing-Near-Memory controller; software stack via direct access driver for transparent host-accelerator memory sharing | | | |
Challenge: There are no direct physical interconnection paths in DIMM-based, bank-level uniform NDP systems such as UPMEM.
Solution: Place the logic (compute) layer at the bottom of the stack and stack DRAM layers on top of it, using TSVs to build thousands of physical paths between the logic layer and the DRAM layers.
3D-PATH: A Hierarchy LUT Processing-in-memory Accelerator with Thermal-aware Hybrid Bonding Integration
sparse-aware hierarchical slow-fast LUT design; multiplier-free floating-point operation by LUT; hotspot-aware hardware with self-throttling sense amplifier
4
2
2
2025
MICRO
Univ. of British Columbia
RayN: Ray Tracing Acceleration with Near-memory Computing
ray tracing units in 3D stacked DRAM logic layer; BLAS Breaking algorithm to partition BVH tree for load balancing; hybrid memory controller for concurrent GPU and near-memory access
4
3
2
2025
ICCAD
PKU
FENIX: Flexible and Efficient Hybrid HE/MPC Acceleration with Near-Memory Processing
fine-grained oblivious transfer partitioning to overlap HE and OT operations; batch-aware flexible encoding to reduce rotation overhead; near-bank NMP to offload memory-bound HE/OT primitives
Solution: HBM2-PIM is the first commercial HBM-PIM product, and 4 out of 8 DRAM layers are PIM-enabled layers, while the other 4 layers are standard DRAM layers.
| Year | Venue | Authors | Title | Tags | P | E | N |
|---|---|---|---|---|---|---|---|
| 2021 | ISCA | Samsung | Hardware Architecture and Software Stack for PIM Based on Commercial DRAM Technology Industrial Product | drop-in replacement for standard HBM2; bank-level parallelism using standard DRAM commands; address aligned mode to tolerate host-side command reordering | 3 | 5 | 3 |
| 2022 | Hot Chips | Samsung | Aquabolt-XL HBM2-PIM, LPDDR5-PIM With In-Memory Processing, and AXDIMM With Acceleration Buffer | HBM2-PIM with bank-level SIMD programmable computing units; Acceleration DIMM with acceleration buffers for rank-level parallelism | 2 | 5 | 3 |
| 2023 | MICRO | Yonsei | AESPA: Asynchronous Execution Scheme to Exploit Bank-Level Parallelism of Processing-in-Memory | Single-Instruction Long-Data execution model; asynchronous bank operation via long-data commands; column-major GEMV dataflow with shared accumulators | | | |
Challenge: Different PIM architectures have different characteristics and performance trade-offs; communicating between different PIM architectures is challenging.
Year
Venue
Authors
Title
Tags
P
E
N
2025
arXiv
NUS
LEAP: LLM Inference on Scalable PIM-NoC Architecture with Balanced Dataflow and Fine-Grained Parallelism
data dynamicity-aware task assignment to PIM or NoC; fine-grained model partitioning and heuristically optimized spatial mapping strategy
3
4
3
2025
arXiv
THU
CompAir: Synergizing Complementary PIMs and In-Transit NoC Computation for Efficient LLM Acceleration
heterogeneous DRAM-PIM and SRAM-PIM architecture with hybrid bonding; in-transit NoC computation with Curry ALU; hierarchical ISA for hybrid PIM systems
3
4
2
2025
arXiv
BUAA
HPIM: Heterogeneous Processing-In-Memory-based Accelerator for Large Language Models Inference
Multi-Objective Neural Architecture Search for In-Memory Computing
neural architecture search methodology; integration of Hyperopt, PyTorch and MNSIM
2024
arXiv
Intel
CiMNet: Towards Joint Optimization for DNN Architecture and Configuration for Compute-In-Memory Hardware
framework that jointly searches for optimal sub-networks and hardware configurations for CiM architectures; multi-objective evolutionary search method
4
2
4
2025
AICAS
UVA
Optimizing and Exploring System Performance in Compact Processing-in-Memory-based Chips
Pipeline Method for Compact PIM Designs; Dynamic Duplication Method (DDM); Maximum NN Size Estimation & Deployment in Compact PIM Design
2025
NeurIPS
IBM
Analog Foundation Models
synthetic-data distillation for analog LLM adaptation; iterative weight clipping for high-SNR conductance mapping; static DAC/ADC range learning; per-channel hardware-noise injection
Challenge: Need fast and accurate estimators for area, latency, energy, and system-level behavior across varied CiM hardware designs.
| Year | Venue | Authors | Title | Tags | P | E | N |
|---|---|---|---|---|---|---|---|
| 2018 | TCAD | ASU | NeuroSim: A Circuit-Level Macro Model for Benchmarking Neuro-Inspired Architectures in Online Learning | estimates the circuit-level performance of neuro-inspired architectures; estimates the area, latency, dynamic energy, and leakage power; supports both SRAM and eNVM; tested on 2-layer MLP NN, MNIST | | | |
| 2020 | TCAD | ZJU | Eva-CiM: A System-Level Performance and Energy Evaluation Framework for Computing-in-Memory Architectures | models for capturing memory access and dependency-aware ISA traces; models for quantifying interactions between the host CPU and the CiM module | | | |
| 2022 | ICCAD | Purdue | Design Space and Memory Technology Co-Exploration for In-Memory Computing Based Machine Learning Accelerators | simulation framework to evaluate the system-level performance of IMC architecture; area-aware weight mapping strategy | 4 | 3 | 2 |
| 2024 | ISPASS | MIT | CiMLoop: A Flexible, Accurate, and Fast Compute-In-Memory Modeling Tool | flexible specification to describe CiM systems; accurate model/fast statistical model of data-value-dependent component energy | | | |
| 2025 | ASPDAC | HKUST | MICSim: A Modular Simulator for Mixed-signal Compute-in-Memory based AI Accelerator | modularized NeuroSim; data-statistics-based average mode instead of trace-based mode | | | |
Challenge: Compilers for CIM are not well studied. Existing compilers either target a specific architecture or are inefficient.
Year
Venue
Authors
Title
Tags
P
E
N
2023
TACO
HUST
A Compilation Tool for Computation Offloading in ReRAM-based CIM Architectures
compilation tool to migrate legacy programs to CPU/CIM heterogeneous architectures; a model to quantify the performance gain
2023
DAC
CAS
PIMCOMP: A Universal Compilation Framework for Crossbar-based PIM DNN Accelerators
compiler based on Crossbar/IMA/Tile/Chip hierarchy; low latency and high throughput mode; genetic algorithm to optimize weight replication and core mapping; scheduling algorithms for complex DNN
2024
ASPLOS
CAS
CIM-MLC: A Multi-level Compilation Stack for Computing-In-Memory Accelerators
compilation stack for various CIM accelerators; multi-level DNN scheduling approach
2024
DATE
RWTH Aachen University
CLSA-CIM: A Cross-Layer Scheduling Approach for Computing-in-Memory Architectures
algorithm to decide which parts of NN are duplicated to reduce inference latency; cross layer scheduling on tiled CIM architectures
2024
TCAD
NJU
A Compilation Framework for SRAM Computing-in-Memory Systems With Optimized Weight Mapping and Error
input/output side parallelism (IOSP); partition-based MAQE (duplicating MSB storage)
A 273.48 TOPS/W and 1.58 Mb/mm2 Analog-Digital Hybrid CIM Processor with Transpose Ternary-eDRAM Bitcell
analog DRAM CIM for partial sum and digital adder
1
4
2
2025
arXiv
KAIST
RED: Energy Optimization Framework for eDRAM-based PIM with Reconfigurable Voltage Swing and Retention-aware Scheduling
RED framework for energy optimization; reconfigurable eDRAM design; retention-aware scheduling; trade-off analysis between RBL voltage swing, sense amplifier power, and retention time; refresh skipping and sense amplifier power gating
Challenge: The memory wall causes high data-transfer latency between the CPU and memory, while DIMM-based NDP suffers from high energy consumption, area overhead, and low performance efficiency.
Solution: Modify the physical structure of the SRAM itself to enable in-memory computing, rather than placing logic units into the SRAM.
Challenge: Charge-domain SRAM CIM improves energy efficiency, but multi-bit analog operation is constrained by ADC precision limits and error amplification during reconstruction.
Solution: Rework charge-domain encoding, accumulation, or readout structures so analog SRAM CIM can preserve accuracy without giving up its efficiency advantage.
Year
Venue
Authors
Title
Tags
P
E
N
2026
HPCA
ICT
Cambricon-CIM: Enabling Energy-Efficient and Error-Resilient Analog CIM Acceleration via Reformation of Coding Bases
minimal non-binary coding bases for multi-bit slicing; runtime vector centering for dynamic-range compression; LUT-based base-length selection for charge-domain SRAM CIM
4
2
3
2024
ESSCIRC
THU
A 65nm 8b-Activation 8b-Weight SRAM-Based Charge-Domain Computing-in-Memory Macro Using A Fully-Parallel Analog Adder Network and A Single-ADC Interface
SRAM-based CD-CiM architecture; charge-domain analog adder tree; ReLU-optimized ADC
4
4
4
2018
JSSC
MIT
CONV-SRAM: An Energy-Efficient SRAM With In-Memory Dot-Product Computation for Low-Power Convolutional Neural Networks
SRAM-embedded convolution (dot-product) computation architecture for BNN; support multi-bit input-output
MemTorch: A Simulation Framework for Deep Memristive Cross-Bar Architectures
supports both GPUs and CPUs; integrates directly with PyTorch; simulate non-idealities of memristive devices within cross-bar, tested on VGG-16, CIFAR-10
2021
TCAD
Georgia Tech
DNN+NeuroSim V2.0: An End-to-End Benchmarking Framework for Compute-in-Memory Accelerators for On-Chip Training
effect of non-ideal NVM device properties on on-chip training
3
3
2
2025
DAC
BUAA
CIMFlow: An Integrated Framework for Systematic Design and Evaluation of Digital CIM Architectures
workflow for implementing and evaluating DNN workloads on digital CIM architectures; CIM-specific ISA design; compilation flow built on the MLIR infrastructure
Challenge: Transformer architecture is widely used in NLP and CV tasks. Existing SRAM CIM architectures are not suitable for transformer acceleration.
| Year | Venue | Authors | Title | Tags | P | E | N |
|---|---|---|---|---|---|---|---|
| 2025 | DATE | PKU | Leveraging Compute-in-Memory for Efficient Generative Model Inference in TPUs | architecture model and simulator for CIM-based TPUs; designed for LLM inference | 4 | 2 | 4 |
| 2023 | arXiv | Keio | An 818-TOPS/W CSNR-31dB SQNR-45dB 10-bit Capacitor-Reconfiguring Computing-in-Memory Macro with Software-Analog Co-Design for Transformers | Capacitor-Reconfiguring analog CIM architecture | 1 | 4 | 3 |
| 2025 | arXiv | Purdue | Hardware-Software Co-Design for Accelerating Transformer Inference Leveraging Compute-in-Memory | SRAM based softmax-friendly CIM architecture for transformer; finer-granularity pipelining strategy | 4 | 3 | 2 |
| 2025 | arXiv | PKU | Leveraging Compute-in-Memory for Efficient Generative Model Inference in TPUs | Energy-efficient CIM core integration in TPUs (replace the original MXU); CIM-MXU with systolic data path; Array dimension scaling for CIM-MXU; Area-efficient CIM macro design; Mapping engine for generative model inference | | | |
| 2024 | JSSC | THU | MulTCIM: Digital Computing-in-Memory-Based Multimodal Transformer Accelerator With Attention-Token-Bit Hybrid Sparsity | long reuse elimination scheduler (LRES) to dynamically reshape the attention matrix; runtime token pruner (RTP) to remove insignificant tokens; modal-adaptive CIM network (MACN) to dynamically divide CIM cores into Pipeline; effective-bits-balanced CIM (EBBCIM) macro architecture | | | |
Challenge: RRAM devices are non-volatile and dense, which makes them suitable for CIM applications. However, their non-ideal effects can cause significant performance degradation.
PUMA: A Programmable Ultra-efficient Memristor-based Accelerator for Machine Learning Inference
Programmable and general-purpose ReRAM based ML Accelerator; Supports an instruction set; Has potential for DNN training; Provides simulator that accepts model
2018
ICRC
Purdue & HP
Hardware-Software Co-Design for an Analog-Digital Accelerator for Machine Learning
compiler to translate model to ISA; ONNX interpreter to support models in common DL frameworks; simulator to evaluate performance
2024
VLSI-SoC
RWTH Aachen University
Architecture-Compiler Co-design for ReRAM-Based Multi-core CIM Architectures
inference latency predictions and analysis of the crossbar utilization for CNN
2024
arXiv
CAS
A Fully Hardware Implemented Accelerator Design in ReRAM Analog Computing without ADCs
Based on Stochastic Binary Neural Networks; Winner-Take-All (WTA) strategy; Hardware implemented sigmoid and softmax
Challenge: RRAM crossbars struggle with signed weights and high-precision analog representation, so numeric formats must be reshaped to preserve accuracy without excessive peripheral cost.
Year
Venue
Authors
Title
Tags
P
E
N
2021
ISCA
Northeastern et al.
FORMS: Fine-grained Polarized ReRAM-based In-situ Computation for Mixed-signal DNN Accelerator
ADMM-based fragment polarization for same-sign ReRAM columns; fine-grained sub-array computation; input zero-skipping for small fragments
4
2
4
2024
Science
USC
Programming memristor arrays with arbitrarily high precision for analog computing
represent high-precision numbers using multiple relatively low-precision analog devices; using RRAM CIM to solve PDEs
Accurate and Energy-Efficient Bit-Slicing for RRAM-Based Neural Networks
unbalanced bit-slicing scheme for higher accuracy; holistic solution using 2's complement
2024
MICRO
HUST
DRCTL: A Disorder-Resistant Computation Translation Layer Enhancing the Lifetime and Performance of Memristive CIM Architecture
address conversion method for dynamic scheduling; hierarchical wear-leveling (HWL) strategy for reliability improvement; data layout-aware selective remapping (LASR) to improve communication locality and reduce latency
2024
TC
SJTU
ERA-BS: Boosting the Efficiency of ReRAM-Based PIM Accelerator With Fine-Grained Bit-Level Sparsity
bit-level sparsity in both weights and activations; bit-flip scheme; dynamic activation sparsity exploitation scheme
Accurate Prediction of ReRAM Crossbar Performance Under I-V Nonlinearity and IR Drop
IRP-Net (IR Drop Prediction Network); iterative refinement
2
3
3
2023
Nature
TetraMem
Thousands of conductance levels in memristors integrated on CMOS
2048 conductance levels (11-bit); linear weight update protocol; Bayesian hyperparameter optimization for inference
4
5
3
2024
AICAS
RWTH Aachen University
A Calibratable Model for Fast Energy Estimation of MVM Operations on RRAM Crossbars
system energy model for MVM on ReRAM crossbars; methodology to study the effect of the selection transistor and wire parasitics in 1T1R crossbar arrays
2024
arXiv
MIT
Modeling Analog-Digital-Converter Energy and Area for Compute-In-Memory Accelerator Design
architecture-level model that estimates ADC energy and area
4
3
3
2024
Nat. Commun.
KAUST
Hardware implementation of memristor-based artificial neural networks
ITT-RNA: Imperfection Tolerable Training for RRAM-Crossbar-Based Deep Neural-Network Accelerator
prevent the large-weight synapses from being mapped to the imperfect memristor cells; off-device training algorithm to alleviate the accumulation of errors across multiple layers; bit-wise mechanism to compensate the resistance variations
3
3
2
2023
arXiv
UND
U-SWIM: Universal Selective Write-Verify for Computing-in-Memory Neural Accelerators
only do write-verify for important weights; based on weight second derivatives as a guide
3
3
3
2023
Adv. Mater.
UMich
Bulk‐Switching Memristor‐Based Compute‐In‐Memory Module for Deep Neural Network Training
Bulk-ReRAM based digital-CIM hybrid architecture for training; CIM for forward, digital for backward
4
4
1
2024
APIN
SWU
Multi-optimization scheme for in-situ training of memristor neural network based on contrastive learning
optimizations to the deployment method, loss function and gradient calculation; compensation measures for non-ideal effects
2025
TNNLS
SNU
Efficient Hybrid Training Method for Neuromorphic Hardware Using Analog Nonvolatile Memory
Challenge: Convolutional layers are the most compute-intensive layers in CNNs. RRAM CIM architectures suit convolutional operations well but face challenges from non-ideal effects and the resulting performance degradation.
fabrication of high-yield, high-performance and uniform memristor crossbar arrays; hybrid-training method; replication of multiple identical kernels for processing different inputs in parallel
2019
TED
PKU
Convolutional Neural Networks Based on RRAM Devices for Image Recognition and Online Learning Tasks
RRAM-based hardware implementation of CNN; expand kernel to the size of image
2025
TVLSI
NBU
A 578-TOPS/W RRAM-Based Binary Convolutional Neural Network Macro for Tiny AI Edge Devices
ReRAM XNOR cell; BCNN CIM macro with FPGA as the control core
Mapping of CNNs on multi-core RRAM-based CIM architectures
architecture optimized for communication; compiler algorithms for conv2D layer; cycle-accurate simulator
2023
TODAES
UCAS
Mathematical Framework for Optimizing Crossbar Allocation for ReRAM-based CNN Accelerators
formulate a crossbar allocation problem for ReRAM-based CNN accelerators; dynamic programming based solver; models the performance considering allocation problem
2025
IEEE Access
UTehran
SCiMA: A Systolic CiM-Based Accelerator With a New Weight Mapping for CNNs—A Virtual Framework Approach
Challenge: A single CIM substrate rarely supports complete applications efficiently; these works coordinate multiple compute domains or expose a broader programming interface so hybrid CIM can execute full workloads instead of isolated kernels.
| Year | Venue | Authors | Title | Tags | P | E | N |
|---|---|---|---|---|---|---|---|
| 2023 | GLSVLSI | USC | Heterogeneous Integration of In-Memory Analog Computing Architectures with Tensor Processing Units | hybrid TPU-IMAC architecture; TPU for conv, CIM for fc | | | |
| 2023 | NANOARCH | HUST | Heterogeneous Instruction Set Architecture for RRAM-enabled In-memory Computing | General ISA for RRAM CiM & digital heterogeneous architecture; a tile-processing unit-array three-level architecture | | | |
| 2025 | ASPLOS | CAS | PAPI: Exploiting Dynamic Parallelism in Large Language Model Decoding with a Processing-In-Memory-Enabled Computing System | dynamic parallelism-aware task scheduling for LLM decoding; online kernel characterization for heterogeneous architectures; hybrid PIM units for compute-bound and memory-bound kernels | | | |
| 2025 | DAC | Chung-Ang Univ. | HH-PIM: Dynamic Optimization of Power and Performance with Heterogeneous-Hybrid PIM for Edge AI Devices | heterogeneous-hybrid PIM with HP/LP modules and MRAM/SRAM; dynamic data placement algorithm for energy optimization; dual PIM controller design | 3 | 4 | 2 |
| 2026 | ASPLOS | UIUC | DARTH-PUM: A Hybrid Processing-Using-Memory Architecture | hybrid analog-digital ReRAM PUM architecture; ACE-DCE coordinating hardware for full-kernel in-memory execution; instruction injection unit for shift-add amortization; vACore abstraction for variable-width operands | | | |
Challenge: Analog or mixed-signal CIM loses efficiency when noise, device variation, or limited precision force conservative design; these works add digital or SRAM assistance to recover accuracy while preserving most of the CIM speed/energy benefits.
| Year | Venue | Authors | Title | Tags | P | E | N |
|---|---|---|---|---|---|---|---|
| 2024 | Science | NTHU | Fusion of memristor and digital compute-in-memory processing for energy-efficient edge computing | Fusion of ReRAM and SRAM CiM; ReRAM SLC & MLC Hybrid; Current quantization; Weight shifting with compensation | | | |
| 2024 | IPDPS | Georgia Tech | Harmonica: Hybrid Accelerator to Overcome Imperfections of Mixed-signal DNN Accelerators | select and transfer imperfection-sensitive weights to digital accelerator; hybrid quantization (weights on the analog part are quantized more aggressively) | | | |
| 2024 | ASP-DAC | Keio | OSA-HCIM: On-The-Fly Saliency-Aware Hybrid SRAM CIM with Dynamic Precision Configuration | On-the-fly Saliency-Aware precision configuration scheme; Hybrid CIM Array for DCIM and ACIM using split-port SRAM | | | |
| 2025 | arXiv | AaltoU | Acore-CIM: build accurate and reliable mixed-signal CIM cores with RISC-V controlled self-calibration | reliability-focused MAC cell; proof-of-concept SoC composed of a CIM core and a RISC-V control processor; automated Built-In Self-Calibration (BISC) routine | 3 | 3 | 4 |
| 2026 | DATE | ESI & IBM | HILAL: Hessian-Informed Layer Allocation for Heterogeneous Analog–Digital Inference | Hessian-informed analog/digital layer allocation; k-means robust/sensitive layer clustering; layer-wise analog fine-tuning with noise injection | | | |
Challenge: Combining multiple memory/computing media in one accelerator introduces restore, density, and dataflow mismatches; these papers redesign the macro or mapping/dataflow stack so the hybrid substrate is actually usable.
Year
Venue
Authors
Title
Tags
P
E
N
2023
ICCAD
SJTU
TL-nvSRAM-CIM: Ultra-High-Density Three-Level ReRAM-Assisted Computing-in-nvSRAM with DC-Power Free Restore and Ternary MAC Operations
DC-power-free weight restore from ReRAM; ternary SRAM-CIM mechanism with differential computing scheme
2025
TCAD
HKUST
Configurable Dataflow and Adaptive Mapping Optimization for Hybrid ReRAM and SRAM Compute-in-Memory Accelerator
Hybrid Macro Unit (HMU); adaptive mapping optimization; configurable dataflow control
4
3
3
2025
Nature
TSMC
A mixed-precision memristor and SRAM compute-in-memory AI processor
layer-based INT-FP hybrid architecture; kernel-based mix-CIM (SRAM/ReRAM/digital hybrid architecture)
5
5
2
Hybrid CIM: Transformer and Attention Acceleration
Challenge: Transformer workloads mix dense analog-friendly matrix operations with irregular, sparse, or precision-sensitive attention steps, motivating hybrid CIM architectures that split roles across analog and digital domains.
Year
Venue
Authors
Title
Tags
P
E
N
2023
arXiv
HP
RACE-IT: A Reconfigurable Analog CAM-Crossbar Engine for In-Memory Transformer Acceleration
Compute Analog Content Addressable Memory (Compute-ACAM) structure; accelerator based on crossbars and Compute-ACAMs; encoding-based optimization
3
3
4
2024
VLSI
FDU
HARDSEA: Hybrid Analog-ReRAM Clustering and Digital-SRAM In-Memory Computing Accelerator for Dynamic Sparse Self-Attention in Transformer
product-quantization-based sparse self-attention algorithm; ADC-free ReRAM-CIM macro; ReRAM-CIM for front-end attention sparsification, SRAM-CIM for back-end sparse attention
4
3
3
2024
DAC
SJTU
HEIRS: Hybrid Three-Dimension RRAM- and SRAM-CIM Architecture for Multi-task Transformer Acceleration
Hybrid Distributive Accumulation; Local Recovery Unit (LRU) in 3D
3
3
3
2024
ESSERC
UCSD
An Analog and Digital Hybrid Attention Accelerator for Transformers with Charge-based In-memory Computing
analog CIM for low-score tokens, digital processor for high-score tokens
Challenge: LLM deployment on hybrid CIM needs software-hardware partitioning that keeps large static weights dense and energy-efficient while protecting task-specific or precision-sensitive paths.
Year
Venue
Authors
Title
Tags
P
E
N
2025
arXiv
South Carolina
PIM-LLM: A High-Throughput Hybrid PIM Architecture for 1-bit LLMs
hybrid PIM-digital architecture; analog PIM for low-precision MatMul; digital systolic array for high-precision MatMul
4
3
1
2026
TODAES
HKU
HaLoRA: Hardware-aware Low-Rank Adaptation for Large Language Models Based on Hybrid Compute-in-Memory Architecture
pretrained weights on RRAM-CIM and LoRA branches on SRAM-CIM; noise-aware LoRA regularization for RRAM non-ideality; hybrid CIM energy reduction for LLM inference
Challenge: ADC resolution is constrained by its precision, area, and power trade-off, and certain CIM devices such as RRAM are not suited to high-precision computation (e.g., FP32), so data must be quantized to lower precision.
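As a rough illustration of why ADC resolution drives these quantization works, the sketch below models an analog dot product split across several crossbar arrays, where each array's partial sum passes through a signed ADC of limited bit width before digital accumulation. It is not taken from any paper listed here: the names `adc_read` and `cim_dot`, the array size, and the integer weight and input ranges are all assumptions chosen for the example.

```python
import random

def adc_read(value, bits, full_scale):
    """Model a signed ADC: clip to +/- full_scale, then round to the nearest code."""
    levels = 2 ** (bits - 1) - 1           # number of positive output codes
    step = full_scale / levels             # analog value of one LSB
    clipped = max(-full_scale, min(full_scale, value))
    return round(clipped / step) * step

def cim_dot(weights, inputs, rows_per_array, adc_bits, w_max, x_max):
    """Dot product where each array's partial sum is digitized by the ADC.

    Hypothetical model: every group of `rows_per_array` rows is one crossbar
    array, and the ADC full scale covers the largest possible partial sum.
    """
    full_scale = rows_per_array * w_max * x_max
    total = 0.0
    for start in range(0, len(weights), rows_per_array):
        w = weights[start:start + rows_per_array]
        x = inputs[start:start + rows_per_array]
        partial = sum(wi * xi for wi, xi in zip(w, x))    # analog accumulation
        total += adc_read(partial, adc_bits, full_scale)  # digital accumulation
    return total

# Assumed example sizes: 4-bit signed weights, 4-bit unsigned inputs, 64-row arrays.
random.seed(0)
W_MAX, X_MAX, ROWS = 7, 15, 64
weights = [random.randint(-W_MAX, W_MAX) for _ in range(256)]
inputs = [random.randint(0, X_MAX) for _ in range(256)]
exact = sum(w * x for w, x in zip(weights, inputs))

for bits in (4, 6, 8, 12):
    approx = cim_dot(weights, inputs, ROWS, bits, W_MAX, X_MAX)
    print(f"ADC bits={bits:2d}  exact={exact}  error={abs(approx - exact):.2f}")
```

In this model each conversion contributes at most half an LSB of error, and the LSB size roughly halves with every extra ADC bit. Lowering the partial-sum range (for example through bit-sparse or arraywise quantization) is one way to tolerate a cheaper ADC.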
BitS-Net: Bit-Sparse Deep Neural Network for Energy-Efficient RRAM-Based Compute-In-Memory
bit-sparsity quantization; bias-shifted MVM; hardware-aware loss function
3
2
3
2023
AICAS
TU Delft
Mapping-aware Biased Training for Accurate Memristor-based Neural Networks
favorability constraint analysis to find important weight values; mapping-aware biased training to restrict weight values to low-variance RRAM states
3
4
2
2024
TCAD
BUAA
CIMQ: A Hardware-Efficient Quantization Framework for Computing-In-Memory-Based Neural Network Accelerators
bit-level sparsity induced activation quantization; quantizing partial sums to decrease required resolution of ADCs; arraywise quantization granularity
2024
TCAD
BUAA
CIM²PQ: An Arraywise and Hardware-Friendly Mixed Precision Quantization Method for Analog Computing-In-Memory
mixed-precision quantization method based on an evolutionary algorithm; arraywise quantization granularity; evaluation method to estimate a quantization strategy's performance on CIM
2024
ICCAD
TU Delft
Hardware-Aware Quantization for Accurate Memristor-Based Neural Networks
analysis of fixed-point quantization impact on conductance variation; weight quantization tuning technique; approach to reduce the residual error
Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference
integer-only inference arithmetic; quantizes both weights and activations as 8-bit integers, with 32-bit biases; provides both a quantized inference framework and a training framework
2023
ICCD
SJTU
PSQ: An Automatic Search Framework for Data-Free Quantization on PIM-based Architecture
post-training quantization framework without retraining; hardware-aware block reassembly
2025
arXiv
UHK
Binary Weight Multi-Bit Activation Quantization for Compute-in-Memory CNN Accelerators
a quantization framework that considers CIM's mixed-signal constraints; closed-form layer-specific weight binarization method; differentiable function for uniform multi-bit quantization
Challenge: Speculative prefetch requests can cause undesirable effects on the system (e.g., increased memory bandwidth consumption, cache pollution, memory access interference).
Year
Venue
Authors
Title
Tags
P
E
N
2021
MICRO
ETHZ
Pythia: A Customizable Hardware Prefetching Framework Using Online Reinforcement Learning
formulating prefetching as a reinforcement learning problem; holistic learning from multiple program features and system feedback; customizable prefetching objective via configuration registers
3
3
2
2025
MICRO
NUDT
Elevating Temporal Prefetching Through Instruction Correlation
critical instruction detection based on miss contribution; coverage-based classification for metadata utility; adaptive metadata cache partitioning via controller