Programming Languages and Software Engineering¶
Language design and semantics¶
Solution: user-friendly, resource-efficient, and secure programming languages
Compiler construction and optimization¶
Solution: improving performance, reducing resource usage, and ensuring correctness
| Year | Venue | Authors | Title | Tags | P | E | N |
|---|---|---|---|---|---|---|---|
| 2024 | MICRO | Georgia Tech | Unleashing CPU Potential for Executing GPU Programs through Compiler/Runtime Optimizations | anti-coalescing transformation; block size invariant analysis; tail block adaptive synchronization; GPU-block dynamic tiling | 2 | 4 | 3 |
Program Optimization and Rewriting Frameworks¶
Challenge: Traditional sequential compiler optimizations suffer from the phase-ordering problem, while existing equational reasoning tools are often too slow or rigid for domain-specific, non-syntactic analyses.
Solution: Utilize equality saturation with efficient data structures (e.g., e-graphs) and extensible analysis mechanisms to explore the space of equivalent programs without strict ordering.
| Year | Venue | Authors | Title | Tags | P | E | N |
|---|---|---|---|---|---|---|---|
| 2021 | POPL | UW | egg: Fast and Extensible Equality Saturation | equality saturation algorithm; e-graphs (Equality Graphs); deferred rebuilding technique | 3 | 4 | 4 |
| 2026 | ASPLOS | PKU | Finding Reusable Instructions via E-Graph Anti-Unification | e-graph anti-unification algorithm for custom instruction identification; pattern vectorization; hardware-aware cost model | 4 | 4 | 4 |
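
To make the phase-ordering point concrete, here is a minimal sketch of saturation-style rewriting. It is not an e-graph: it keeps an explicit set of whole terms, which is a deliberately simpler stand-in (an e-graph stores the same equivalences compactly by sharing subterms). The four toy rules, the tuple term encoding, and the names `rewrites_at_root`, `one_step`, `saturate`, and `best` are assumptions made for illustration only.

```python
def rewrites_at_root(t):
    """Yield terms obtained by applying one toy rule at the root of t."""
    if not isinstance(t, tuple):
        return
    op, x, y = t
    if op == "*" and y == 2:
        yield ("<<", x, 1)                    # x * 2       ->  x << 1
    if op == "/" and isinstance(x, tuple) and x[0] == "*":
        yield ("*", x[1], ("/", x[2], y))     # (a * b) / c ->  a * (b / c)
    if op == "/" and isinstance(x, int) and x == y and x != 0:
        yield 1                               # c / c       ->  1 (nonzero constant)
    if op == "*" and y == 1:
        yield x                               # x * 1       ->  x


def one_step(t):
    """All terms reachable from t by one rewrite at any position."""
    out = set(rewrites_at_root(t))
    if isinstance(t, tuple):
        op, x, y = t
        out |= {(op, x2, y) for x2 in one_step(x)}
        out |= {(op, x, y2) for y2 in one_step(y)}
    return out


def saturate(start, max_terms=1000):
    """Collect every term reachable from start, in no particular rule order."""
    seen, frontier = {start}, [start]
    while frontier and len(seen) < max_terms:
        for u in one_step(frontier.pop()):
            if u not in seen:
                seen.add(u)
                frontier.append(u)
    return seen


def size(t):
    """Cost model: number of nodes in the term."""
    return 1 if not isinstance(t, tuple) else 1 + size(t[1]) + size(t[2])


def best(start):
    """Extract the smallest term among all discovered equivalents."""
    return min(saturate(start), key=size)


start = ("/", ("*", "a", 2), 2)              # (a * 2) / 2
greedy = ("/", ("<<", "a", 1), 2)            # after eagerly applying x * 2 -> x << 1
print(one_step(greedy))                      # no rule applies any more
print(best(start))                           # the smallest equivalent term
```

Applying the strength-reduction rule first leaves a term that no other rule can simplify, while exploring all rewrites without committing to an order still reaches the smallest form.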
Deep Learning Compilers¶
Solution: Graph transformations, kernel fusion, and tensor-level optimization of compute and memory
Foundational DSLs & Compilers¶
Challenge: Bridging high-level algorithmic expression with low-level hardware performance for array/tensor computations.
Solution: Decouple algorithm specification from its execution schedule to enable performance portability and automated optimization.
| Year | Venue | Authors | Title | Tags | P | E | N |
|---|---|---|---|---|---|---|---|
| 2013 | PLDI | MIT | Halide: A Language and Compiler for Optimizing Parallelism, Locality, and Recomputation in Image Processing Pipelines | algorithm-schedule decoupling; image processing DSL; explicit scheduling language | 4 | 4 | 4 |
| 2018 | OSDI | UW | TVM: An Automated End-to-End Optimizing Compiler for Deep Learning | operator fusion; graph-level DL compiler; automatic code generation; tensor expression simplification | 4 | 4 | 4 |
| 2023 | ASPLOS | CMU | TensorIR: An Abstraction for Automatic Tensorized Program Optimization | block abstraction for computation isolation and loop transformations; tensor intrinsic matching algorithm; evolutionary search-based automatic scheduling | 4 | 4 | 4 |
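
The idea of decoupling the algorithm from its schedule can be sketched in plain Python. This is not Halide or TVM syntax: `blur_x`, `schedule_row_major`, `schedule_tiled`, and `realize` are hypothetical names chosen for illustration. The algorithm says what each output pixel is, and the schedule only decides the order in which pixels are visited.

```python
def blur_x(inp, x, y):
    """Algorithm: the value of output pixel (x, y); no loop order implied."""
    return (inp[y][x] + inp[y][x + 1] + inp[y][x + 2]) / 3


def schedule_row_major(width, height):
    """Schedule 1: visit pixels row by row."""
    for y in range(height):
        for x in range(width):
            yield x, y


def schedule_tiled(width, height, tile=2):
    """Schedule 2: visit pixels tile by tile to improve locality."""
    for ty in range(0, height, tile):
        for tx in range(0, width, tile):
            for y in range(ty, min(ty + tile, height)):
                for x in range(tx, min(tx + tile, width)):
                    yield x, y


def realize(algorithm, inp, width, height, schedule):
    """Evaluate the algorithm over the output domain in schedule order."""
    out = [[None] * width for _ in range(height)]
    for x, y in schedule(width, height):
        out[y][x] = algorithm(inp, x, y)
    return out


# Input is two columns wider than the output because blur_x reads x..x+2.
image = [[float(x + 10 * y) for x in range(5)] for y in range(3)]
a = realize(blur_x, image, 3, 3, schedule_row_major)
b = realize(blur_x, image, 3, 3, schedule_tiled)
print(a == b)  # both schedules compute the same image
```

Changing the schedule changes the traversal order and therefore the performance characteristics, but never the result, which is what makes automated schedule search safe.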
Tensor & Pipeline Optimization¶
Challenge: Manually writing high-performance kernels for diverse tensor operations is difficult and not portable.
Solution: Use MLIR-based abstractions, novel programming models, and automatic tuning to generate efficient code for tensor programs and pipelines.
| Year | Venue | Authors | Title | Tags | P | E | N |
|---|---|---|---|---|---|---|---|
| 2023 | OSDI | Microsoft Research | WELDER: Scheduling Deep Learning Memory Access via Tile-graph | Tile-graph abstraction for fine-grained memory management; inter-layer independence for optimization space decoupling; tile traffic-based cost model | 3 | 4 | 4 |
| 2025 | PPoPP | THU | FlashTensor: Optimizing Tensor Programs by Leveraging Fine-grained Tensor Property | dataflow-centered code recognition and optimization; two-stage heuristic algorithm to optimize tensor computation; kernel fusion | 4 | 4 | 2 |
| 2026 | ICLR | PKU | TILELANG: Bridge Programmability and Performance in Modern Neural Kernels | TVM-based compiler; more flexible than Triton | 4 | 4 | 3 |
| 2025 | OSDI | PKU | PipeThreader: Software-Defined Pipelining for Efficient DNN Execution | Pipeline programming abstraction and orchestration mechanism for heterogeneous computing units; tile size and pipeline stage number tradeoff | 4 | 4 | 4 |
| 2025 | arXiv | CMU | Mirage Persistent Kernel: A Compiler and Runtime for Mega-Kernelizing Tensor Programs | mega-kernelization; SM level parallel optimization; SM level representation | 3 | 3 | 3 |
Graph-Level Transformation and Optimization¶
Challenge: Large DNN topologies and massive tensor sizes easily exceed hardware memory capacities and execution efficiency limits, making isolated operator-level optimizations insufficient.
Solution: Apply coordinated graph-level transformations (such as operator fission and fusion) and holistic topological scheduling to systematically optimize the entire computation graph for peak memory and execution latency.
| Year | Venue | Authors | Title | Tags | P | E | N |
|---|---|---|---|---|---|---|---|
| 2021 | MLSys | MIT | IOS: Inter-Operator Scheduler for CNN Acceleration | inter-operator parallelism; dynamic programming based scheduler; concurrent execution; operator fusion | 4 | 4 | 2 |
| 2023 | DAC | PKU | Memory and Computation Coordinated Mapping of DNNs onto Complex Heterogeneous SoC | dataflow grouping; location-aware accelerator mapping; hybrid scheduling algorithm | 4 | 3 | 2 |
| 2024 | DAC | THU | GSPO: A Graph Substitution and Parallelization Joint Optimization Framework for DNN Inference | flow-based graph partition; joint optimization computational graph(JOCG); joint cost model; backtracking search algorithm | 4 | 4 | 2 |
| 2024 | ASPLOS | PKU | MAGIS: Memory Optimization via Coordinated Graph Transformation and Scheduling for DNN | Fission Transformation (F-Trans) for graph splitting; Dimension Graph (D-Graph) representation; incremental scheduling algorithm for fast evaluation | 4 | 4 | 4 |
TinyML & Edge Compilers¶
Solution: address extreme memory and compute constraints on microcontrollers through specialized compilation strategies like fine-grained memory planning and redundancy-free tensor splitting.
| Year | Venue | Authors | Title | Tags | P | E | N |
|---|---|---|---|---|---|---|---|
| 2024 | HPCA | NYCU | TinyTS: Memory-Efficient TinyML Model Compiler Framework on Microcontrollers | dependency-free tensor splitting model; virtual feature map (VFP) for zero-copy concatenation; fine-grained life-cycle aware memory planner | 4 | 4 | 3 |
| 2025 | HPCA | NYCU | EDA: Energy-Efficient Inter-Layer Model Compilation for Edge DNN Inference Acceleration | Inter-Layer Operator Scheduling; SRAM-Constrained Tiling | 4 | 3 | 3 |
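
As a rough illustration of life-cycle-aware memory planning, the sketch below assigns buffer offsets so that tensors whose lifetimes never overlap can share the same bytes. It is a generic greedy first-fit heuristic, not the planner from any paper listed above; `plan_memory`, the tuple layout, and the example sizes are assumptions.

```python
def plan_memory(tensors):
    """Greedy first-fit offset assignment.

    tensors: list of (name, size, start, end), where start/end are the first
    and last operator indices that use the tensor (inclusive).
    Returns (offsets, peak) with peak = total arena size needed.
    """
    placed = []   # (offset, size, start, end) of tensors already assigned
    offsets = {}
    # Placing large tensors first tends to reduce fragmentation.
    for name, size, start, end in sorted(tensors, key=lambda t: -t[1]):
        # Address ranges occupied by tensors alive at the same time.
        busy = sorted(
            (off, off + sz)
            for off, sz, s0, e0 in placed
            if not (e0 < start or end < s0)
        )
        offset = 0
        for lo, hi in busy:
            if offset + size <= lo:   # fits in the gap before this range
                break
            offset = max(offset, hi)  # otherwise skip past it
        placed.append((offset, size, start, end))
        offsets[name] = offset
    peak = max(off + sz for off, sz, _, _ in placed)
    return offsets, peak


layers = [
    ("input", 4, 0, 1),
    ("conv1", 8, 1, 2),
    ("conv2", 4, 2, 3),
    ("out",   2, 3, 4),
]
offsets, peak = plan_memory(layers)
print(offsets, peak)  # peak is below the 18 units a no-reuse layout would need
```

Here `conv2` reuses the bytes freed by `input`, and `out` reuses the start of the arena once `conv1` is dead.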
Compiler for Accelerators¶
Challenge: The semantic gap and astronomically large scheduling space of spatial accelerators make traditional compilers and manual tuning ineffective.
Solution: High-level programming models, automatic code generation, and performance optimization for specialized hardware
| Year | Venue | Authors | Title | Tags | P | E | N |
|---|---|---|---|---|---|---|---|
| 2021 | ISCA | UCB | CoSA: Scheduling by Constrained Optimization for Spatial Accelerators | mixed-integer programming (MIP) for scheduling; prime-factor allocation; constant binary matrices for algorithm-hardware constraints | 5 | 3 | 4 |
| 2022 | PLDI | MIT | Exocompilation for Productive Programming of Hardware Accelerators | Exocompilation; externalized accelerator specification; user-defined instructions; rewrite-based scheduling; effect analysis for safety | 4 | 4 | 3 |
| 2024 | HPCA | Stanford | Revet: A Language and Compiler for Dataflow Threads | dataflow threads execution model for vRDA; structured-link tensor format (SLTF) for control flow encoding; compiler lowering from imperative control flow to streaming dataflow | 4 | 3 | 4 |
| 2026 | ASPLOS | Stanford | Streaming Tensor Programs: A Streaming Abstraction for Dynamic Parallelism | asynchronous dataflow streaming abstraction; symbolic shape notation for dynamic tiling; dynamic hardware configuration time-multiplexing for MoE | 4 | 3 | 4 |
Embedded DSL Compiler Frameworks¶
Challenge: Developing high-performance parallel DSLs for heterogeneous hardware requires significant repetitive effort in building IRs, optimizers, and code generators from scratch.
| Year | Venue | Authors | Title | Tags | P | E | N |
|---|---|---|---|---|---|---|---|
| 2011 | IEEE Micro | Stanford & EPFL | Implementing Domain-Specific Languages for Heterogeneous Parallel Computing | Delite compiler framework; language virtualization; multi-view IR (Generic/Parallel/Domain-Specific); lightweight modular staging (LMS); heterogeneous code generation | 4 | 4 | 3 |
| 2011 | PPoPP | Stanford | A Domain-Specific Approach To Heterogeneous Parallelism | Delite runtime; deferred execution model; dynamic task graph; Delite op archetypes; GPU memory manager; run-ahead model | 4 | 4 | 4 |
Sparse Tensor Compilers¶
Challenge: Compared to dense tensors, sparse tensors have more complex data structures and computation patterns.
| Year | Venue | Authors | Title | Tags | P | E | N |
|---|---|---|---|---|---|---|---|
| 2023 | ASPLOS | UW | SparseTIR: Composable Abstractions for Sparse Compilation in Deep Learning | composable formats for expressing sparse matrices; computation divided into stages that reuse existing optimizations; TensorIR-based sparse compiler | 4 | 3 | 3 |
| 2024 | OOPSLA | Cornell | UniSparse: An Intermediate Language for General Sparse Format Customization | language-based holistic format abstraction; decoupled logical (data structure) vs. physical (memory layout) representation; index map & orthogonal primitives (mutation/layout/query) for customization | 4 | 4 | 4 |
Graph Mining Compilers¶
Challenge: Replace generic runtime algorithms by automatically compiling high-level specifications into efficient code.
| Year | Venue | Authors | Title | Tags | P | E | N |
|---|---|---|---|---|---|---|---|
| 2019 | SOSP | CSM | AutoMine: Harmonizing High-Level Abstraction and High Performance for Graph Mining | automatic algorithm generation; set-based embedding representation; schedule generation via graph tournament | 4 | 4 | 4 |
Domain-Specific Languages¶
Solution: formal semantics definition, tool generation automation, cross-domain generalization
Machine Learning DSLs¶
Challenge: Using general-purpose languages for ML requires explicit, complex parallelization for heterogeneous hardware, limiting productivity and performance portability.
| Year | Venue | Authors | Title | Tags | P | E | N |
|---|---|---|---|---|---|---|---|
| 2011 | ICML | Stanford | OptiML: An Implicitly Parallel Domain-Specific Language for Machine Learning | OptiML; implicitly parallel DSL; best-effort computing; relaxed dependencies; domain-specific intermediate representation | 4 | 4 | 4 |
Graph DSLs¶
Challenge: balancing expressiveness, usability, and performance in graph DSL design
| Year | Venue | Authors | Title | Tags | P | E | N |
|---|---|---|---|---|---|---|---|
| 2012 | ASPLOS | Stanford | Green-Marl: A DSL for Easy and Efficient Graph Analysis | domain-specific breadth-first/depth-first ordered traversal primitives; deferred data-parallel assignment for bulk synchronous consistency; architecture-independent loop fusion and reduction bounds relaxation | 4 | 4 | 4 |
| 2018 | OOPSLA | MIT | GraphIt: A High-Performance Graph DSL | algorithm-schedule decoupling for graph; Graph Iteration Space (GIS); scheduling language for traversal strategies; compiler-guided autotuning | 4 | 4 | 4 |
Sparse Tensor Algebra Compilers¶
Solution: multi-format iteration efficiency, format combination optimization, architecture-agnostic code generation
Format Abstraction and Conversion¶
Challenge: Sparse tensor algebra compilers need a compact way to describe many tensor formats and generate format-specific iteration code without hand-writing every format combination.
Solution: Use tensor index notation, merge-based sparse iteration, and format-level abstractions to generate code for many sparse and dense layouts.
| Year | Venue | Authors | Title | Tags | P | E | N |
|---|---|---|---|---|---|---|---|
| 2017 | OOPSLA | MIT | The Tensor Algebra Compiler | TACO compiler; iteration graph; merge lattice; compound tensor algebra codegen | 4 | 4 | 5 |
| 2018 | OOPSLA | MIT | Format Abstraction for Sparse Tensor Algebra Compilers | coordinate hierarchies; level formats abstraction; property-based merge lattice optimizations; level iterator conversion | 4 | 4 | 3 |
| 2020 | PLDI | MIT | Automatic Generation of Efficient Sparse Tensor Format Conversion Routines | coordinate remapping notation; attribute query language; tensor assembly abstract interface; three-phase conversion decomposition | 4 | 4 | 4 |
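
Merge-based sparse iteration can be illustrated on two sparse vectors stored as sorted coordinate lists. The sketch below is hand-written to show the kind of loop such compilers generate, not output from any of them; `sparse_mul`, `sparse_add`, and the parallel index/value list layout are assumptions. Multiplication iterates over the intersection of the nonzero coordinates, and addition over their union.

```python
def sparse_mul(a_idx, a_val, b_idx, b_val):
    """Elementwise product: co-iterate and keep only shared coordinates."""
    i = j = 0
    out_idx, out_val = [], []
    while i < len(a_idx) and j < len(b_idx):
        if a_idx[i] == b_idx[j]:
            out_idx.append(a_idx[i])
            out_val.append(a_val[i] * b_val[j])
            i += 1
            j += 1
        elif a_idx[i] < b_idx[j]:
            i += 1          # coordinate only in a: product is zero, skip it
        else:
            j += 1          # coordinate only in b: product is zero, skip it
    return out_idx, out_val


def sparse_add(a_idx, a_val, b_idx, b_val):
    """Elementwise sum: co-iterate and keep every coordinate from either side."""
    i = j = 0
    out_idx, out_val = [], []
    while i < len(a_idx) and j < len(b_idx):
        if a_idx[i] == b_idx[j]:
            out_idx.append(a_idx[i])
            out_val.append(a_val[i] + b_val[j])
            i += 1
            j += 1
        elif a_idx[i] < b_idx[j]:
            out_idx.append(a_idx[i])
            out_val.append(a_val[i])
            i += 1
        else:
            out_idx.append(b_idx[j])
            out_val.append(b_val[j])
            j += 1
    # One operand is exhausted; copy the tail of the other.
    out_idx += a_idx[i:] + b_idx[j:]
    out_val += a_val[i:] + b_val[j:]
    return out_idx, out_val


a = ([0, 2, 5], [1.0, 2.0, 3.0])
b = ([2, 3, 5], [10.0, 20.0, 30.0])
print(sparse_mul(*a, *b))  # ([2, 5], [20.0, 90.0])
print(sparse_add(*a, *b))  # ([0, 2, 3, 5], [1.0, 12.0, 20.0, 33.0])
```

The two functions differ only in which merge cases emit output, which is the kind of distinction a merge lattice is meant to capture so that the loops need not be written by hand for every expression and format combination.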
Sparse Tensor Scheduling and Assembly¶
Scheduling and Interoperability¶
Challenge: Sparse tensor programs require schedule choices and library interoperation that depend on sparse formats, nonzero structure, and external optimized kernels.
Solution: Expose sparse iteration transformations, asymptotic schedule selection, and verified external-function binding as compiler-level mechanisms.
| Year | Venue | Authors | Title | Tags | P | E | N |
|---|---|---|---|---|---|---|---|
| 2020 | OOPSLA | Reservoir Labs & Stanford | A Sparse Iteration Space Transformation Framework for Sparse Tensor Algebra | sparse iteration space transformations; derived iteration spaces; nonzero-space tiling; sparse scheduling API | 4 | 4 | 4 |
| 2022 | PLDI | MIT | Autoscheduling for Sparse Tensor Algebra with an Asymptotic Cost Model | automatic asymptotic scheduler; asymptotic cost model; Pareto frontier schedule pruning; TACO schedule search | 4 | 4 | 4 |
| 2023 | PLDI | Stanford | Mosaic: An Interoperable Compiler for Tensor Algebra | verified external-function binding; Mosaic function interface; automatic binding search; heterogeneous tensor algebra codegen | 4 | 4 | 4 |
Dynamic Assembly and Workspaces¶
Challenge: Sparse tensor computations often need dynamic updates or scattered writes into sparse results whose formats do not support efficient random insertion.
Solution: Generate update-friendly dynamic formats and intermediate sparse workspaces that adapt scatter-heavy computation to sparse result assembly.
| Year | Venue | Authors | Title | Tags | P | E | N |
|---|---|---|---|---|---|---|---|
| 2022 | OOPSLA | MIT | Compilation of Dynamic Sparse Tensor Algebra | node schema language; assembly abstract interface; map function generation; iterator optimization; dynamic tensor format composition | 4 | 4 | 3 |
| 2024 | PLDI | Stanford | Compilation of Modular and General Sparse Workspaces | sparse workspace insertion; workspace insertion algorithm template; sparse scattering detection; modular workspace implementations | 4 | 4 | 4 |
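
The role of a workspace can be sketched with row-by-row sparse matrix multiplication. Contributions to one output row arrive in scattered column order, so they are accumulated in a temporary structure that supports random insertion and then assembled into the compressed result in one pass. The sketch uses a Python dict as a hash-map workspace; `spgemm` and the row-of-pairs matrix layout are assumptions for illustration, not code generated by the systems above.

```python
def spgemm(a_rows, b_rows):
    """Compute C = A * B with both matrices stored as lists of sparse rows.

    Each row is a list of (column, value) pairs sorted by column.
    """
    c_rows = []
    for a_row in a_rows:
        workspace = {}                       # column -> accumulated value
        for k, a_ik in a_row:
            for j, b_kj in b_rows[k]:
                # Scattered write: j arrives in no particular global order.
                workspace[j] = workspace.get(j, 0.0) + a_ik * b_kj
        # Assemble the compressed output row in sorted column order.
        c_rows.append(sorted(workspace.items()))
    return c_rows


A = [[(0, 1.0), (1, 2.0)],
     [(1, 3.0)]]
B = [[(1, 4.0)],
     [(0, 5.0), (1, 6.0)]]
print(spgemm(A, B))
# [[(0, 10.0), (1, 16.0)], [(0, 15.0), (1, 18.0)]]
```

Writing directly into a compressed row would require shifting entries on every insertion; the workspace absorbs the scatter and the result format only sees an ordered append.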
Transpilers¶
Solution: automatic, correct, and performant source-to-source code translation across different hardware ecosystems
| Year | Venue | Authors | Title | Tags | P | E | N |
|---|---|---|---|---|---|---|---|
| 2025 | OSDI | CAS | QiMeng-Xpiler: Transcompiling Tensor Programs for Deep Learning Systems with a Neural-Symbolic Approach | neural-symbolic synthesis; LLM-assisted transcompilation; SMT-based code repair; hierarchical auto-tuning | 3 | 4 | 2 |
Hardware Description Languages¶
Solution: expressive hardware specification, efficient simulation and synthesis, robust verification methodologies
HDL Language Design¶
Challenge: balancing expressiveness, usability, and synthesis efficiency in HDL design
| Year | Venue | Authors | Title | Tags | P | E | N |
|---|---|---|---|---|---|---|---|
| 2024 | FPGA | PKU | Cement: Streamlining FPGA Hardware Design with Cycle-Deterministic eHDL and Synthesis | incorporates an event layer and the ctrl sub-language; event-based extension; cycle-level timing analysis and control synthesis techniques | 4 | 4 | 4 |
Streaming Computation Models¶
Solution: high-throughput data processing, real-time analytics, efficient resource utilization for continuous data
| Year | Venue | Authors | Title | Tags | P | E | N |
|---|---|---|---|---|---|---|---|
| 2020 | ASPLOS | Stanford | Fleet: A Framework for Massively Parallel Streaming on FPGAs | users write serial code that is executed in parallel; multi-stream parallelism; ready-valid signaling | 3 | 4 | 3 |
| 2020 | PLDI | Stanford | Type-Directed Scheduling of Streaming Accelerators | SSeq/TSeq space-time types; static throughput matching via types; invalid-bubble encoding in type system; type-directed scheduling | 4 | 4 | 3 |
HLS Code Generation and Automation¶
Solution: bridging high-level languages to hardware, design space exploration, QoR improvement automation
Predictable HLS Programming Models¶
Challenge: Legacy software languages repurposed for HLS rely on complex, unsystematic heuristics, leading to unpredictable area-performance trade-offs.
Solution: Use formal type systems and language constraints (like time-sensitive affine types) to guarantee predictable and optimal hardware generation.
| Year | Venue | Authors | Title | Tags | P | E | N |
|---|---|---|---|---|---|---|---|
| 2020 | PLDI | Cornell | Predictable Accelerator Design with Time-Sensitive Affine Types | Dahlia language; time-sensitive affine type system representing consumable hardware resources; logical time steps encoded in types; memory views for decoupling iteration from memory banking | 3 | 3 | 3 |
General HLS Optimizations and Techniques¶
Solution: Develop general techniques to enhance HLS QoR by optimizing key aspects like timing, resource management, and code structure.
| Year | Venue | Authors | Title | Tags | P | E | N |
|---|---|---|---|---|---|---|---|
| 2020 | DAC | UCLA | Analysis and Optimization of the Implicit Broadcasts in FPGA HLS to Improve Maximum Frequency | systematic classification of HLS implicit broadcasts (data/control/pipeline); broadcast-aware scheduling with calibrated delay models; skid-buffer-based pipeline flow control to eliminate stall signal broadcasting | 4 | 4 | 4 |
| 2022 | ASPLOS | UCLA | HeteroGen: transpiling C to heterogeneous HLS code with automated test generation and program repair | automated test generation; dependence-guided search space pruning; early candidate rejection using coding styles | 3 | 4 | 3 |
| 2022 | FPGA | Cornell | HeteroFlow: An Accelerator Programming Model with Decoupled Data Placement for Software-Defined FPGAs | Decoupled data placement; Unified data placement primitive; Multi-level memory hierarchy optimization | 4 | 4 | 4 |
| 2024 | DATE | UIUC | Subgraph Extraction-Based Feedback-Guided Iterative Scheduling for HLS | ISDC iterative SDC scheduling; subgraph extraction-based low-level feedback; fanout and window-based subgraph extraction mechanism | 4 | 4 | 4 |
| 2025 | FPGA | University of Glasgow | Dynamic Loop Fusion in High-Level Synthesis | Dynamic loop fusion; HLS; Irregular memory access; Address monotonicity; Decoupled Access/Execute (DAE); Program-order schedule; Data Unit (DU) | 4 | 4 | 4 |
High-Level Language to HLS Abstractions¶
Solution: Raise the abstraction level by enabling HLS code generation from high-level languages like Python, simplifying hardware design for non-experts.
| Year | Venue | Authors | Title | Tags | P | E | N |
|---|---|---|---|---|---|---|---|
| 2019 | FPGA | UCLA | HeteroCL: A Multi-Paradigm Programming Infrastructure for Software-Defined Reconfigurable Computing | algorithm-schedule decoupling; Python DSL; tensor-based computation; quantitative data types; design space exploration | 4 | 4 | 4 |
| 2021 | TC | UIUC | PyLog: An Algorithm-Centric Python-Based FPGA Programming and Synthesis Flow | Python-based HLS flow; algorithm-centric operators (map/dot); automatic hardware type inference; automatic HLS pragma insertion | 4 | 4 | 3 |
| 2024 | MICRO | HUST | A Scalable Efficient and Robust Dynamic Memory Management Library for HLS-based FPGAs | DMM as graph analytics; request-guided graph traversal; data-centric concurrent traversal; shortcut-assisted fast traversal | 4 | 4 | 3 |
MLIR-based HLS Compiler Frameworks¶
Solution: Leverage the MLIR infrastructure to build modular, extensible, and reusable HLS frameworks for better analysis, transformation, and code generation.
| Year | Venue | Authors | Title | Tags | P | E | N |
|---|---|---|---|---|---|---|---|
| 2022 | ICCAD | PKU | HECTOR: A Multi-level Intermediate Representation for Hardware Synthesis Methodologies | high-level topological representation; low-level hierarchical elastic component; time graph transformation | 4 | 4 | 3 |
| 2022 | HPCA | UIUC | ScaleHLS: A New Scalable High-Level Synthesis Framework on Multi-Level Intermediate Representation | multi-level IR for HLS; HLS-dedicated analysis/transform library; MLIR-based HLS framework | 4 | 4 | 3 |
| 2023 | ASPLOS | IISc | HIR: An MLIR-based Intermediate Representation for Hardware Accelerator Description | MLIR-based hardware IR; datapath + schedule model; explicit scheduling via time variables; automatic FSM synthesis | 4 | 4 | 4 |
| 2024 | HPCA | SJTU | An Optimizing Framework on MLIR for Efficient FPGA-based Accelerator Generation | polyhedral-based dependence analysis for loop transformations; bottleneck-oriented design space exploration; Dependence/Polyhedral/Affine IR hierarchy for FPGA HLS | 4 | 4 | 4 |
| 2024 | PLDI | Cornell | Allo: A Programming Model for Composable Accelerator Design | composable programming model; decoupled hardware customizations; bottom-up type-safe composition; hierarchical dataflow graph; memory layout composition via type inference | 4 | 4 | 3 |
HLS Verification and Debugging¶
Solution: Develop automated tools for bug detection, formal verification, and cross-level debugging to ensure the correctness of HLS designs.
| Year | Venue | Authors | Title | Tags | P | E | N |
|---|---|---|---|---|---|---|---|
| 2024 | FPGA | Cornell | Formal Verification of Source-to-Source Transformations for HLS | hybrid verification via concrete interpretation of control-flow and symbolic analysis of dataflow; Computation Directed Acyclic Graph (CDAG) as a syntax-agnostic semantic representation; formal equivalence proof for Statically Interpretable Control-Flow (SICF) programs | 4 | 4 | 4 |
| 2024 | MICRO | PKU | Hestia: An Efficient Cross-level Debugger for High-level Synthesis | inspection at multiple granularities; correspondence established across levels; multi-level interpreter covering three levels | 4 | 4 | 4 |
| 2024 | LAD | UIUC | An Iteratively-refined Dataset for High-Level Synthesis Functional Verification through LLM-Aided Bug Injection | Chrysalis dataset with bug injection; ICL+RAG+CoT bug injection methodology; iteratively-refined HLS verification dataset | 4 | 4 | 4 |
Dataflow-centric HLS Acceleration¶
Solution: Exploit task-level parallelism by automatically transforming and scheduling designs as dataflow graphs to maximize pipeline throughput and resource utilization.
| Year | Venue | Authors | Title | Tags | P | E | N |
|---|---|---|---|---|---|---|---|
| 2024 | ASPLOS | UIUC | HIDA: A Hierarchical Dataflow Compiler for High-Level Synthesis | hierarchical dataflow IR (HIDA-IR); multi-level dataflow optimizer (HIDA-OPT); pattern-driven task fusion | 3 | 5 | 4 |
| 2025 | FPGA | UCLA | Stream-HLS: Towards Automatic Dataflow Acceleration | automatic dataflow HLS; global scheduling for streaming; MINLP for HLS optimization | 4 | 4 | 4 |
| 2025 | MICRO | UIUC | StreamTensor: Make Tensors Stream in Dataflow Accelerators for LLMs | itensor iterative tensor type system; automatic stream-based kernel fusion; LP-based FIFO sizing; unified dataflow component generation | 4 | 4 | 3 |
HLS for Systolic Arrays and AI Engines¶
Solution: Provide automated compilation flows and programming models to efficiently map algorithms, especially for AI workloads, onto specialized compute fabrics like systolic arrays and AI Engines.
| Year | Venue | Authors | Title | Tags | P | E | N |
|---|---|---|---|---|---|---|---|
| 2017 | DAC | PKU & Falcon & UCLA | Automated Systolic Array Architecture Synthesis for High Throughput CNN Inference on FPGAs | 2D systolic array architecture; analytical model for performance and resource; two-phase design space exploration; end-to-end C-to-FPGA automation flow | 3 | 4 | 3 |
| 2019 | ISCAS | PKU & UCLA | Frequency Improvement of Systolic Array-Based CNNs on FPGAs | front-end accumulation chain segmentation; back-end topology-aware floorplanning constraints; frequency optimization for systolic arrays | 3 | 4 | 3 |
| 2025 | FPGA | Brown University | ARIES: An Agile MLIR-Based Compilation Flow for Reconfigurable Devices with AI Engines | MLIR-based AIE compilation; Unified AIE+PL IR; Tile-based parallelism; ADF dialect; Automated AIE placement | 4 | 5 | 4 |
HLS for Advanced Memory/Packaging¶
Solution: Develop HLS methodologies and co-design techniques to effectively utilize advanced hardware features like High-Bandwidth Memory (HBM), multi-die packaging, and direct storage access.
| Year | Venue | Authors | Title | Tags | P | E | N |
|---|---|---|---|---|---|---|---|
| 2022 | FPGA | Cornell | High-Performance Sparse Linear Algebra on HBM-Equipped FPGAs Using HLS: A Case Study on SpMV | HLS methodology; HBM FPGA; SpMV accelerator; split-kernel design; microarchitecture in HLS; load-store forwarding; pipelined arbiter | 4 | 4 | 4 |
| 2023 | FPGA | UoP | DONGLE: Direct FPGA-Orchestrated NVMe Storage for HLS | HLS direct NVMe access; FPGA-orchestrated storage; Unified HLS storage interface; Single-source HLS for storage; DONGLE architecture | 4 | 4 | 4 |
| 2023 | FPGA | HKUST | FADO: Floorplan-Aware Directive Optimization for High-Level Synthesis Designs on Multi-Die FPGAs | Floorplan-aware HLS; Multi-die FPGA optimization; Directive-floorplan co-optimization; Incremental floorplanning for HLS; MMBP for HLS DSE | 3 | 4 | 4 |
| 2024 | ASPLOS | UCLA | TAPA-CS: Enabling Scalable Accelerator Design on Distributed HBM-FPGAs | two-layer ILP-based inter/intra-FPGA floorplanning coupled with interconnect pipelining during HLS; topology-aware communication cost model; latency-insensitive cross-FPGA partitioning; automatic multi-FPGA design partitioning with cut-set pipeline balancing | 3 | 4 | 3 |
Program analysis¶
Solution: statically or dynamically analyzing programs to understand their behavior, detect errors, and optimize performance
Domain-specific program analysis¶
Solution: leveraging domain knowledge for precise analysis, specialized bug detection, targeted optimization insights
| Year | Venue | Authors | Title | Tags | P | E | N |
|---|---|---|---|---|---|---|---|
| 2024 | PPoPP | Information Engineering University | A Holistic Approach to Automatic Mixed-Precision Code Generation and Tuning for Affine Programs | holistic code generation and tuning; polyhedral model for mixed-precision; model-driven autotuning | 4 | 5 | 4 |
| 2024 | PPoPP | University of Delaware | Recurrence Analysis for Automatic Parallelization of Subscripted Subscripts | recurrence analysis for parallelization; subscripted subscript analysis; intermittent monotonicity detection | 3 | 4 | 3 |
HLS program analysis¶
Solution: verifying functional correctness of HLS, analyzing performance bottlenecks, ensuring interface compatibility
| Year | Venue | Authors | Title | Tags | P | E | N |
|---|---|---|---|---|---|---|---|
| 2025 | FPGA | UoE | Latency Insensitivity Testing for Dataflow HLS Designs | Automated Latency Insensitivity Testing; Parallel Hardware-Accelerated Testing Platform; Test space reduction; Stalling Units (SU) | 4 | 4 | 4 |