#2 Energy Efficient Software Hardware CoDesign for Machine Learning: From TinyML to Large Language Models

Score: 24.0 | Matched keywords: ai, large language models, llm, machine learning

Detailed Summary (EN)

Read-like-fullpaper digest

This paper tackles the energy cost of machine learning at both extremes of scale. For TinyML systems, even milliwatt-level inefficiencies can render deployments impractical [2], while large-scale LLM infrastructures require thousands of accelerators, leading to megawatt-scale power draw and substantial operational costs [1], [9]. Recent studies further indicate that inference now accounts for more than half of total LLM lifecycle emissions [4], underscoring the urgency of energy-efficient deployment strategies across the full spectrum of machine learning systems. Many co-design techniques lack generalization across model families and deployment scales, with approaches optimized for convolutional networks often failing to transfer effectively to attention-based architectures [11], [20].

The core proposal follows from the observation that traditional approaches, which optimize machine learning algorithms independently of their execution platforms, are increasingly inadequate [10], [11]. Software–hardware co-design has therefore emerged as a key paradigm, jointly optimizing models, compilers, architectures, and runtime systems to improve energy efficiency, throughput, and adaptability [12], [13]. Existing efforts rarely provide unified frameworks that bridge optimization strategies across TinyML, […]. arXiv listing: [cs.AR], 24 Mar 2026.

In battery-powered and energy-harvesting IoT devices, inefficiencies directly translate into reduced lifetime or impractical maintenance costs, motivating adaptive architectures that dynamically adjust precision and computation depth based on available power [2]. Early accelerator designs such as Eyeriss demonstrated that careful dataflow and memory hierarchy optimization can significantly mitigate the data-movement bottleneck by maximizing data reuse and minimizing off-chip access [5], [25].

The central reported finding is that energy inefficiency on such devices motivates adaptive architectures that adjust precision and computation depth to the available power [2].

The paper also makes it clear that throughput-centric metrics obscure broader sustainability impacts: infrastructure-aware analyses show that environmental cost extends beyond energy to include water usage and embodied carbon, and that FLOP-efficient models may still incur high operational emissions due to inefficient memory behavior [1], [4]. Techniques optimized for TinyML, such as zero-buffer dataflows, do not generalize to transformer attention with irregular access patterns [6], [8], while LLM-specific optimizations such as KV-cache management and speculative decoding are infeasible on memory-constrained edge devices [1]. Representative works demonstrate FPGA and CIM advantages for power-constrained deployments, with efficiency gains of 16–59× over baseline implementations. Overall, the paper is most convincing where its proposed method is directly supported by the reported comparisons, but the scope of the claim should still be read in light of the evaluation setup and stated limitations.

Final takeaway

Main takeaway: In battery-powered and energy-harvesting IoT devices, inefficiencies directly translate into reduced lifetime or impractical maintenance costs, motivating adaptive architectures that dynamically adjust precision and computation depth based on available power [2].
Important caution: Throughput-centric metrics obscure broader sustainability impacts; infrastructure-aware analyses show that environmental cost extends beyond energy to include water usage and embodied carbon, and that FLOP-efficient models may still incur high operational emissions due to inefficient memory behavior [1], [4].
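
The caution about FLOP-efficient models can be illustrated with a simple two-term energy model that charges separately for arithmetic and for off-chip data movement. This is a minimal sketch: the per-operation energy constants and both model profiles are placeholder assumptions chosen for illustration, not values reported in the paper.

```python
# Illustrative two-term energy model: compute energy plus off-chip memory energy.
# Both constants are placeholder assumptions, not measured or reported values.
PJ_PER_FLOP = 1.0          # assumed energy per arithmetic operation (picojoules)
PJ_PER_DRAM_BYTE = 100.0   # assumed energy per byte moved off-chip (picojoules)


def inference_energy_joules(flops: float, dram_bytes: float) -> float:
    """Estimate the energy of one inference from FLOPs and off-chip traffic."""
    picojoules = flops * PJ_PER_FLOP + dram_bytes * PJ_PER_DRAM_BYTE
    return picojoules * 1e-12


# Hypothetical models: B needs half the FLOPs of A but moves four times the data.
model_a = inference_energy_joules(flops=2e9, dram_bytes=1e7)
model_b = inference_energy_joules(flops=1e9, dram_bytes=4e7)

print(f"A: {model_a * 1e3:.1f} mJ, B: {model_b * 1e3:.1f} mJ")
```

Under these assumed constants, the model with fewer FLOPs consumes more energy because data movement dominates, which is the effect the caution describes.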

Problem definition

For TinyML systems, even milliwatt-level inefficiencies can render deployments impractical [2], while large-scale LLM infrastructures require thousands of accelerators, leading to megawatt-scale power draw and substantial operational costs [1], [9].
Recent studies further indicate that inference now accounts for more than half of total LLM lifecycle emissions [4], underscoring the urgency of energy-efficient deployment strategies across the full spectrum of machine learning systems.
Many co-design techniques lack generalization across model families and deployment scales, with approaches optimized for convolutional networks often failing to transfer effectively to attention-based architectures [11], [20].
Software–hardware co-design has therefore emerged as a key paradigm, jointly optimizing models, compilers, architectures, and runtime systems to improve energy efficiency, throughput, and adaptability [12], [13].
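
The two ends of the scale in this problem statement can be made concrete with back-of-the-envelope arithmetic. The sketch below is illustrative only: the battery capacity, power figures, and accelerator counts are hypothetical numbers, not data from the paper.

```python
def battery_lifetime_hours(capacity_mwh: float, average_power_mw: float) -> float:
    """Hours a battery lasts at a constant average power draw."""
    return capacity_mwh / average_power_mw


def cluster_power_megawatts(num_accelerators: int, watts_each: float) -> float:
    """Aggregate accelerator power in megawatts (cooling and hosts excluded)."""
    return num_accelerators * watts_each / 1e6


# Hypothetical TinyML node: a 1000 mWh battery, with and without 1 mW of overhead.
baseline_hours = battery_lifetime_hours(1000.0, 1.0)
wasteful_hours = battery_lifetime_hours(1000.0, 2.0)

# Hypothetical LLM cluster: 4000 accelerators drawing 500 W each.
cluster_mw = cluster_power_megawatts(4000, 500.0)

print(baseline_hours, wasteful_hours, cluster_mw)
```

In this hypothetical, one extra milliwatt halves the node's lifetime, while the cluster draws power on the order of megawatts before cooling is counted.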

Core idea & method

Traditional approaches that optimize machine learning algorithms independently of their execution platforms are increasingly inadequate [10], [11].
Software–hardware co-design has therefore emerged as a key paradigm, jointly optimizing models, compilers, architectures, and runtime systems to improve energy efficiency, throughput, and adaptability [12], [13].
Existing efforts rarely provide unified frameworks that bridge optimization strategies across TinyML, […].
Figure: Timeline of representative software–hardware co-design approaches for energy-efficient machine learning (2016–2025), categorized by architectural paradigm (recoverable labels: Eyeriss, DNNBuilder, Cambricon-S, Analog-Dig.).

Actual findings

In battery-powered and energy-harvesting IoT devices, inefficiencies directly translate into reduced lifetime or impractical maintenance costs, motivating adaptive architectures that dynamically adjust precision and computation depth based on available power [2].
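
One way to picture an adaptive architecture of this kind is a runtime policy that picks a precision and a computation depth from the power currently available. The following is a minimal sketch under stated assumptions: `InferenceConfig`, `CONFIGS`, `select_config`, and every operating point are hypothetical and do not come from the paper or from any specific system it cites.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class InferenceConfig:
    bits: int         # numeric precision of weights and activations
    num_layers: int   # layers executed before an early exit
    power_mw: float   # assumed average power draw of this configuration


# Hypothetical operating points, ordered from most to least capable.
CONFIGS = [
    InferenceConfig(bits=8, num_layers=12, power_mw=5.0),
    InferenceConfig(bits=4, num_layers=12, power_mw=3.0),
    InferenceConfig(bits=4, num_layers=6, power_mw=1.5),
    InferenceConfig(bits=2, num_layers=3, power_mw=0.5),
]


def select_config(available_power_mw: float) -> Optional[InferenceConfig]:
    """Return the most capable configuration that fits the power budget."""
    for config in CONFIGS:
        if config.power_mw <= available_power_mw:
            return config
    return None  # not enough power; the caller should defer inference


chosen = select_config(3.2)
print(chosen)
```

A table lookup keeps the decision cheap enough to run on the device itself; a real system would also have to account for the cost of switching configurations and for the accuracy lost at each operating point.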

How the conclusion was reached

Step 1 — Proposed approach: Many co-design techniques lack generalization across model families and deployment scales, with approaches optimized for convolutional networks often failing to transfer effectively to attention-based architectures [11], [20].
Step 3 — Main reported evidence: In battery-powered and energy-harvesting IoT devices, inefficiencies directly translate into reduced lifetime or impractical maintenance costs, motivating adaptive architectures that dynamically adjust precision and computation depth based on available power [2].
Step 5 — Claim boundary / limitation: Throughput-centric metrics obscure broader sustainability impacts; infrastructure-aware analyses show that environmental cost extends beyond energy to include water usage and embodied carbon, and that FLOP-efficient models may still incur high operational emissions due to inefficient memory behavior [1], [4].

Experimental setup & results

In battery-powered and energy-harvesting IoT devices, inefficiencies directly translate into reduced lifetime or impractical maintenance costs, motivating adaptive architectures that dynamically adjust precision and computation depth based on available power [2].
Early accelerator designs such as Eyeriss demonstrated that careful dataflow and memory hierarchy optimization can significantly mitigate the data-movement bottleneck by maximizing data reuse and minimizing off-chip access [5], [25].
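
The effect of data reuse on off-chip access can be sketched by counting operand fetches for a matrix multiplication with and without on-chip tiling. This is a generic tiling illustration, not the Eyeriss dataflow itself, and the function names and matrix sizes are assumptions made for the example.

```python
def offchip_reads_naive(m: int, n: int, k: int) -> int:
    """Operand reads when every multiply-accumulate fetches both inputs off-chip."""
    return 2 * m * n * k


def offchip_reads_tiled(m: int, n: int, k: int, tile: int) -> int:
    """Operand reads when each tile x tile output block is computed from
    operand slices that are fetched once and then reused on-chip."""
    if m % tile or n % tile:
        raise ValueError("tile must divide both m and n")
    num_tiles = (m // tile) * (n // tile)
    reads_per_tile = tile * k + k * tile  # one slice of A plus one slice of B
    return num_tiles * reads_per_tile


naive = offchip_reads_naive(64, 64, 64)
tiled = offchip_reads_tiled(64, 64, 64, tile=8)
print(naive, tiled, naive // tiled)
```

In this simplified count, off-chip reads fall by a factor equal to the tile size, which is why on-chip buffer capacity and dataflow choice are treated together in accelerator design.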

Limitations & risks

Throughput-centric metrics obscure broader sustainability impacts; infrastructure-aware analyses show that environmental cost extends beyond energy to include water usage and embodied carbon, and that FLOP-efficient models may still incur high operational emissions due to inefficient memory behavior [1], [4].
Techniques optimized for TinyML, such as zero-buffer dataflows, do not generalize to transformer attention with irregular access patterns [6], [8], while LLM-specific optimizations such as KV-cache management and speculative decoding are infeasible on memory-constrained edge devices [1].
Representative works demonstrate FPGA and CIM advantages for power-constrained deployments, with efficiency gains of 16–59× over baseline implementations.
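
The point about KV-cache management on memory-constrained edge devices becomes concrete once the cache is sized. The sketch below uses the standard accounting of one key tensor and one value tensor per layer; the model dimensions and the 512 KiB SRAM budget are hypothetical and are not taken from the paper.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_value: int = 2,
                   batch_size: int = 1) -> int:
    """Bytes needed to cache keys and values for every layer and token."""
    keys_and_values = 2  # one tensor for keys, one for values
    return (keys_and_values * num_layers * num_kv_heads * head_dim
            * seq_len * bytes_per_value * batch_size)


# Hypothetical transformer: 32 layers, 32 KV heads of width 128,
# a 2048-token context, and 16-bit cached values.
cache = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=2048)
cache_gib = cache / 2**30

# Hypothetical microcontroller with 512 KiB of SRAM.
sram_bytes = 512 * 1024
fits_on_device = cache <= sram_bytes

print(f"{cache_gib:.1f} GiB cache, fits in SRAM: {fits_on_device}")
```

Under these assumptions the cache alone exceeds the device's SRAM by more than three orders of magnitude, which is consistent with the claim that such optimizations do not carry over to TinyML-class hardware.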

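Detailed Summary (KO)

Read-like-fullpaper digest

This paper tackles the energy cost of machine learning at both extremes of scale. For TinyML systems, even milliwatt-level inefficiencies can render deployments impractical [2], while large-scale LLM infrastructures require thousands of accelerators, leading to megawatt-scale power draw and substantial operational costs [1], [9]. Recent studies indicate that inference now accounts for more than half of total LLM lifecycle emissions [4], underscoring the urgency of energy-efficient deployment strategies across machine learning systems as a whole. Many co-design techniques lack generalization across model families and deployment scales, and approaches optimized for convolutional networks often fail to transfer effectively to attention-based architectures [11], [20].

The core proposal follows from the observation that traditional approaches, which optimize machine learning algorithms independently of their execution platforms, are increasingly inadequate [10], [11]. Software–hardware co-design has therefore emerged as a key paradigm, jointly optimizing models, compilers, architectures, and runtime systems to improve energy efficiency, throughput, and adaptability [12], [13]. Existing efforts rarely provide unified frameworks that bridge optimization strategies across TinyML, […].

In battery-powered and energy-harvesting IoT devices, inefficiencies translate directly into reduced lifetime or impractical maintenance costs, motivating adaptive architectures that dynamically adjust precision and computation depth based on available power [2]. Early accelerator designs such as Eyeriss showed that careful dataflow and memory hierarchy optimization can significantly mitigate the data-movement bottleneck by maximizing data reuse and minimizing off-chip access [5], [25].

The central reported finding is that energy inefficiency on such devices motivates adaptive architectures that adjust precision and computation depth to the available power [2].

The paper also makes it clear that throughput-centric metrics obscure broader sustainability impacts: infrastructure-aware analyses show that environmental cost extends beyond energy to include water usage and embodied carbon, and that FLOP-efficient models may still incur high operational emissions due to inefficient memory behavior [1], [4]. Techniques optimized for TinyML, such as zero-buffer dataflows, do not generalize to transformer attention with irregular access patterns [6], [8], while LLM-specific optimizations such as KV-cache management and speculative decoding are infeasible on memory-constrained edge devices [1]. Representative works demonstrate FPGA and CIM advantages for power-constrained deployments, with efficiency gains of 16–59× over baseline implementations. Overall, the paper is most convincing where its proposed method is directly supported by the reported comparisons, but the scope of the claim should still be read in light of the evaluation setup and stated limitations.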

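Final takeaway

Main takeaway: In battery-powered and energy-harvesting IoT devices, inefficiencies translate directly into reduced lifetime or impractical maintenance costs, motivating adaptive architectures that dynamically adjust precision and computation depth based on available power [2].
Important caution: Throughput-centric metrics obscure broader sustainability impacts; infrastructure-aware analyses show that environmental cost extends beyond energy to include water usage and embodied carbon, and that FLOP-efficient models may still incur high operational emissions due to inefficient memory behavior [1], [4].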

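Problem definition

For TinyML systems, even milliwatt-level inefficiencies can render deployments impractical [2], while large-scale LLM infrastructures require thousands of accelerators, leading to megawatt-scale power draw and substantial operational costs [1], [9].
Recent studies indicate that inference now accounts for more than half of total LLM lifecycle emissions [4], underscoring the urgency of energy-efficient deployment strategies across machine learning systems as a whole.
Many co-design techniques lack generalization across model families and deployment scales, and approaches optimized for convolutional networks often fail to transfer effectively to attention-based architectures [11], [20].
Software–hardware co-design has therefore emerged as a key paradigm, jointly optimizing models, compilers, architectures, and runtime systems to improve energy efficiency, throughput, and adaptability [12], [13].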

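Core idea & method

Many co-design techniques lack generalization across model families and deployment scales, and approaches optimized for convolutional networks often fail to transfer effectively to attention-based architectures [11], [20].
Traditional approaches that optimize machine learning algorithms independently of their execution platforms are increasingly inadequate [10], [11].
Software–hardware co-design has therefore emerged as a key paradigm, jointly optimizing models, compilers, architectures, and runtime systems to improve energy efficiency, throughput, and adaptability [12], [13].
Existing efforts rarely provide unified frameworks that bridge optimization strategies across TinyML, […].
Figure: Timeline of representative software–hardware co-design approaches for energy-efficient machine learning (2016–2025), categorized by architectural paradigm (recoverable labels: Eyeriss, DNNBuilder, Cambricon-S, Analog-Dig.).
For TinyML systems, even milliwatt-level inefficiencies can render deployments impractical [2], while large-scale LLM infrastructures require thousands of accelerators, leading to megawatt-scale power draw and substantial operational costs [1], [9].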

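Actual findings

In battery-powered and energy-harvesting IoT devices, inefficiencies translate directly into reduced lifetime or impractical maintenance costs, motivating adaptive architectures that dynamically adjust precision and computation depth based on available power [2].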

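How the conclusion was reached

Step 1 — Proposed approach: Many co-design techniques lack generalization across model families and deployment scales, and approaches optimized for convolutional networks often fail to transfer effectively to attention-based architectures [11], [20].
Step 3 — Main reported evidence: In battery-powered and energy-harvesting IoT devices, inefficiencies translate directly into reduced lifetime or impractical maintenance costs, motivating adaptive architectures that dynamically adjust precision and computation depth based on available power [2].
Step 5 — Claim boundary / limitation: Throughput-centric metrics obscure broader sustainability impacts; infrastructure-aware analyses show that environmental cost extends beyond energy to include water usage and embodied carbon, and that FLOP-efficient models may still incur high operational emissions due to inefficient memory behavior [1], [4].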

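Experimental setup & results

In battery-powered and energy-harvesting IoT devices, inefficiencies translate directly into reduced lifetime or impractical maintenance costs, motivating adaptive architectures that dynamically adjust precision and computation depth based on available power [2].
Early accelerator designs such as Eyeriss showed that careful dataflow and memory hierarchy optimization can significantly mitigate the data-movement bottleneck by maximizing data reuse and minimizing off-chip access [5], [25].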

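Limitations & risks

Throughput-centric metrics obscure broader sustainability impacts; infrastructure-aware analyses show that environmental cost extends beyond energy to include water usage and embodied carbon, and that FLOP-efficient models may still incur high operational emissions due to inefficient memory behavior [1], [4].
Techniques optimized for TinyML, such as zero-buffer dataflows, do not generalize to transformer attention with irregular access patterns [6], [8], while LLM-specific optimizations such as KV-cache management and speculative decoding are infeasible on memory-constrained edge devices [1].
Representative works demonstrate FPGA and CIM advantages for power-constrained deployments, with efficiency gains of 16–59× over baseline implementations.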