Algorithmic-Hardware Co-Design for Efficient Embedded Intelligence

Angioli, Marco

The rapid growth of Artificial Intelligence (AI) has enabled unprecedented capabilities, but it is also driving a computational and energy efficiency crisis that affects the entire compute stack. This crisis is most acute at the edge, where over half of all new AI workloads are expected to be deployed before 2030. Here, embedded systems must sense, decide, and react to their environment under extreme power, memory, and real-time constraints, requirements that are fundamentally incompatible with the escalating resource demands of modern AI models. This dissertation addresses this crisis in the context of online decision-making at the edge by adopting contextual bandit algorithms as the central abstraction and pursuing algorithmic-hardware co-design to enable efficient, adaptive intelligence on resource-constrained devices. The thesis progresses through three parts: from arithmetic and algorithmic optimization of linear models, to a neuro-inspired paradigm shift via hyperdimensional computing, and finally to the design of dedicated hardware accelerators. Part I addresses the dominant bottlenecks in linear bandit algorithms for embedded processors. Starting from the observation that integer division alone can account for up to 70% of execution time on these platforms, we introduce a novel data-dependent hardware division unit that reduces average latency by up to 20.65x. By integrating this unit into a vector accelerator and reformulating learning updates to avoid explicit matrix inversion, we achieve up to a 58x speedup and a 50x energy reduction, while reducing the computational complexity from cubic to quadratic. Yet, even with these optimizations, the inherent limitations of dense linear algebra remain. Recognizing this, Part II changes the nature of the computation itself by adopting hyperdimensional computing (HDC), a neuro-inspired paradigm based on high-dimensional distributed representations.
We introduce HD-CB, the first framework that models and solves contextual bandit problems directly in high-dimensional space using simple, highly parallel element-wise operations, and progressively develop a family of six variants spanning increasingly advanced learning rules and compact representations. Extensive evaluation on synthetic datasets and real-world benchmarks demonstrates that HD-CB is a strong candidate for hardware acceleration and online decision-making at the edge: it achieves faster convergence, exhibits linear complexity with problem size, and outperforms the 32-bit floating-point linear baseline using just 4 bits per component. Part III bridges algorithms and silicon by designing a family of hardware accelerators for HDC. We first introduce a configurable coprocessor unit for binary HDC, integrated into a RISC-V core; we then extend hardware support to modular composite representations, demonstrating that this model achieves a unique balance among information capacity, classification accuracy, and hardware efficiency for embedded systems. Finally, we develop the first hardware-friendly accelerator for the Fourier Holographic Reduced Representations model and introduce AeneasHDC, an open-source framework that automates accelerator design-space exploration and reduces deployment time from weeks to minutes. Together, these contributions demonstrate that algorithmic-hardware co-design, grounded in neuro-inspired representations, can make autonomous online decision-making at the edge a practical reality.
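
As an illustration of the inversion-free learning update mentioned in Part I, the sketch below uses the Sherman-Morrison identity, a standard way to maintain the inverse of a design matrix under rank-one updates in quadratic rather than cubic time. The abstract does not name the specific reformulation used in the thesis, so the choice of identity, the function name sherman_morrison_update, and the pure-Python list representation are all assumptions made for illustration only.

```python
def sherman_morrison_update(A_inv, x):
    """Return inv(A + x x^T) given A_inv = inv(A), using O(d^2) operations.

    Illustrative assumption: A is symmetric positive definite (for example,
    a ridge-regularised design matrix), so A_inv is symmetric as well.
    """
    d = len(x)
    # u = A_inv @ x, a matrix-vector product costing O(d^2)
    u = [sum(A_inv[i][j] * x[j] for j in range(d)) for i in range(d)]
    # Scalar 1 + x^T A_inv x; its reciprocal is computed once, so the whole
    # update needs a single division
    inv_denom = 1.0 / (1.0 + sum(x[i] * u[i] for i in range(d)))
    # Rank-one correction A_inv - (u u^T) / denom, again O(d^2)
    return [[A_inv[i][j] - u[i] * u[j] * inv_denom for j in range(d)]
            for i in range(d)]


def identity(d):
    """d x d identity matrix as nested lists."""
    return [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]


# Start from A = I and fold in two hypothetical context vectors.
A_inv = identity(2)
A_inv = sherman_morrison_update(A_inv, [1.0, 0.0])  # A = [[2, 0], [0, 1]]
A_inv = sherman_morrison_update(A_inv, [1.0, 1.0])  # A = [[3, 1], [1, 2]]
```

Each call replaces a full inversion, which costs cubic time in the context dimension d, with one matrix-vector product and one outer-product correction, both quadratic in d.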

Algorithmic-Hardware Co-Design for Efficient Embedded Intelligence / Angioli, M.. - (2026 Jun 04).