Definition

Research Question

How does model quantization affect system resource efficiency and correctness when deploying DL systems?

Search String

(( "machine learning" OR "ML" OR "deep learning" OR "DL" OR "large language model" OR "LLM?" OR "neural network" OR "?NN" OR "foundational model" OR "agent" ) AND ( "quantization" OR "quantize" OR "quantized" ) AND ( "energy consumption" OR "energy efficien*" OR "sustain*" OR "carbon footprint" OR "carbon emission" ) AND NOT ( "FL" OR "federated learning" ) )
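The wildcard operators in the search string ('*' for any suffix, '?' for at most one extra character) can also be applied locally when screening retrieved titles and abstracts. A minimal sketch, assuming Python's `re` module and the common digital-library wildcard semantics (actual syntax varies per database, so adapt before a real search):

```python
import re

# Minimal sketch: expand a search-string term's wildcards into a regex for
# local title/abstract screening. Assumes '*' means any word suffix and
# '?' means at most one extra word character; these are illustrative
# semantics, not a specific database's syntax.
def wildcard_to_regex(term: str) -> re.Pattern:
    escaped = re.escape(term)  # escape regex metacharacters, then re-enable wildcards
    pattern = escaped.replace(r"\*", r"\w*").replace(r"\?", r"\w?")
    return re.compile(rf"\b{pattern}\b", re.IGNORECASE)

# "energy efficien*" should match both "efficient" and "efficiency"
print(bool(wildcard_to_regex("energy efficien*").search(
    "On the Energy Efficiency of Quantized DNNs")))  # → True
```

The same helper covers terms like "LLM?", which matches both "LLM" and "LLMs" without matching unrelated words.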

Inclusion Criteria

  • The study regards the application of model quantization to optimize a DL model.
  • The study regards the environmental sustainability and/or energy efficiency of applying model quantization.
  • The study analyzes the application of model quantization for model inference.
  • The study regards the application of model quantization at the software level.
  • The study controls the experimental factors in each trial, avoiding uncontrolled variation across runs.

Exclusion Criteria

  • The study combines model quantization with other optimization techniques.
  • The study does not report a non-quantized baseline.
  • The study is a secondary or tertiary study.
  • The study is not written in English.
  • The study is in the form of editorials, tutorials, books, extended abstracts, and so on.

Papers

  • Impact of Memory Voltage Scaling on Accuracy and Resilience of Deep Learning Based Edge Devices (fxp_4_8 evidence)
  • POQ: Is There a Pareto-Optimal Quantization Strategy for Deep Neural Networks? (w8a8 evidence)
  • Experimental Energy Consumption Analysis of Neural Network Model Compression Methods on Microcontrollers with Applications in Bird Call Classification (weights fp16 - activations int8 evidence)
  • Understanding the Impact of Precision Quantization on the Accuracy and Energy of Neural Networks (q0.4 evidence)
  • Language Models in Software Development Tasks: An Experimental Analysis of Energy and Accuracy (8bit evidence)
  • Energy-Efficient Respiratory Anomaly Detection in Premature Newborn Infants (16-bit evidence)
  • Activation Density Based Mixed-Precision Quantization for Energy Efficient Neural Networks
  • Q_YOLOv5m: A Quantization-based Approach for Accelerating Object Detection on Embedded Platforms (w8a8 QAT evidence)
  • Energy Cost Modelling for Optimizing Large Language Model Inference on Hardware Accelerators (4-bit evidence)
  • POQ: Is There a Pareto-Optimal Quantization Strategy for Deep Neural Networks? (w2a2 evidence)
  • POQ: Is There a Pareto-Optimal Quantization Strategy for Deep Neural Networks? (full-int4 evidence)
  • Impact of ML Optimization Tactics on Greener Pre-Trained ML Models
  • Experimental Energy Consumption Analysis of Neural Network Model Compression Methods on Microcontrollers with Applications in Bird Call Classification (weights int8 - activations int8 evidence)
  • Experimental Energy Consumption Analysis of Neural Network Model Compression Methods on Microcontrollers with Applications in Bird Call Classification (weights fp32 - activations int8 evidence)
  • Implementing Deep Neural Networks on ARM-Based Microcontrollers: Application for Ventricular Fibrillation Detection (full model int8 evidence)
  • Impact of Memory Voltage Scaling on Accuracy and Resilience of Deep Learning Based Edge Devices (fxp_4_32 evidence)
  • Energy Cost Modelling for Optimizing Large Language Model Inference on Hardware Accelerators (1-bit evidence)
  • Verifiable and Energy Efficient Medical Image Analysis with Quantised Self-attentive Deep Neural Networks
  • Efficient Expiration Date Recognition in Food Packages for Mobile Applications (w8a8 evidence)
  • POQ: Is There a Pareto-Optimal Quantization Strategy for Deep Neural Networks? (w-int4, a-int8 evidence)
  • POQ: Is There a Pareto-Optimal Quantization Strategy for Deep Neural Networks? (w-int2, a-int4 evidence)
  • Energy-Efficient Respiratory Anomaly Detection in Premature Newborn Infants (4-bit evidence)
  • Implementing Deep Neural Networks on ARM-Based Microcontrollers: Application for Ventricular Fibrillation Detection (w8a8 evidence)
  • Experimental Energy Consumption Analysis of Neural Network Model Compression Methods on Microcontrollers with Applications in Bird Call Classification (weights int8 - activations fp32 evidence)
  • Energy-Efficient Respiratory Anomaly Detection in Premature Newborn Infants (2-bit evidence)
  • Edge AI-Powered System Architecture for Aloe Vera Plant Disease Detection
  • Energy-Efficient Deep Learning for Cloud Detection Onboard Nanosatellite
  • Experimental Energy Consumption Analysis of Neural Network Model Compression Methods on Microcontrollers with Applications in Bird Call Classification (weights fp32 - activations fp16 evidence)
  • Language Models in Software Development Tasks: An Experimental Analysis of Energy and Accuracy (4bit evidence)
  • Understanding the Impact of Precision Quantization on the Accuracy and Energy of Neural Networks (q0.32 evidence)
  • Quantized Object Detection for Real-Time Inference on Embedded GPU Architectures (int8 evidence)
  • Understanding the Impact of Precision Quantization on the Accuracy and Energy of Neural Networks (q0.16 evidence)
  • Energy-Efficient Respiratory Anomaly Detection in Premature Newborn Infants (32-bit evidence)
  • Q_YOLOv5m: A Quantization-based Approach for Accelerating Object Detection on Embedded Platforms (w8a8 PTQ evidence)
  • Understanding the Impact of Precision Quantization on the Accuracy and Energy of Neural Networks (int1 evidence)
  • POQ: Is There a Pareto-Optimal Quantization Strategy for Deep Neural Networks? (w4a4 evidence)
  • Experimental Energy Consumption Analysis of Neural Network Model Compression Methods on Microcontrollers with Applications in Bird Call Classification (weights int8 - activations fp16 evidence)
  • Experimental Energy Consumption Analysis of Neural Network Model Compression Methods on Microcontrollers with Applications in Bird Call Classification (weights fp16 - activations fp32 evidence)
  • Impact of Memory Voltage Scaling on Accuracy and Resilience of Deep Learning Based Edge Devices (fxp_8_16 evidence)
  • Understanding the Impact of Precision Quantization on the Accuracy and Energy of Neural Networks (int6 evidence)
  • Experimental Energy Consumption Analysis of Neural Network Model Compression Methods on Microcontrollers with Applications in Bird Call Classification (weights fp16 - activations fp16 evidence)
  • POQ: Is There a Pareto-Optimal Quantization Strategy for Deep Neural Networks? (w-int2, a-int8 evidence)
  • Efficient Expiration Date Recognition in Food Packages for Mobile Applications (w16a16 evidence)
  • POQ: Is There a Pareto-Optimal Quantization Strategy for Deep Neural Networks? (full-int8 evidence)
  • Understanding the Impact of Precision Quantization on the Accuracy and Energy of Neural Networks (q0.8 evidence)
  • Energy-Efficient Respiratory Anomaly Detection in Premature Newborn Infants (8-bit evidence)
  • Quantized Object Detection for Real-Time Inference on Embedded GPU Architectures (fp16 evidence)

Evidence

  • Experimental Energy Consumption Analysis of Neural Network Model Compression Methods on Microcontrollers with Applications in Bird Call Classification
  • QUANOS: Adversarial Noise Sensitivity Driven Hybrid Quantization of Neural Networks (5bit evidence)
  • Understanding the Impact of Precision Quantization on the Accuracy and Energy of Neural Networks (int1 evidence)
  • Energy-Efficient Respiratory Anomaly Detection in Premature Newborn Infants (8-bit evidence)
  • Impact of ML Optimization Tactics on Greener Pre-Trained ML Models
  • Experimental Energy Consumption Analysis of Neural Network Model Compression Methods on Microcontrollers with Applications in Bird Call Classification (weights fp16 - activations int8 evidence)
  • Verifiable and Energy Efficient Medical Image Analysis with Quantised Self-attentive Deep Neural Networks
  • Green My LLM: Studying the Key Factors Affecting the Energy Consumption of Code Assistants
  • Understanding the Impact of Precision Quantization on the Accuracy and Energy of Neural Networks (q0.8 evidence)
  • UAV-deployed Deep Learning Network for Real-Time Multi-Class Damage Detection Using Model Quantization Techniques (INT8 PTQ evidence)
  • Activation Density Based Mixed-Precision Quantization for Energy Efficient Neural Networks
  • Optimizing Convolutional Neural Networks for IoT Devices: Performance and Energy Efficiency of Quantization Techniques (fp16 PTQ evidence)
  • Q_YOLOv5m: A Quantization-based Approach for Accelerating Object Detection on Embedded Platforms (w8a8 PTQ evidence)
  • Energy Cost Modelling for Optimizing Large Language Model Inference on Hardware Accelerators
  • QUANOS: Adversarial Noise Sensitivity Driven Hybrid Quantization of Neural Networks (8bit evidence)
  • POQ: Is There a Pareto-Optimal Quantization Strategy for Deep Neural Networks? (w-int4, a-int8 evidence)
  • Optimization Strategies for Enhancing Resource Efficiency in Transformers & Large Language Models
  • POQ: Is There a Pareto-Optimal Quantization Strategy for Deep Neural Networks? (full-int4 evidence)
  • Q_YOLOv5m: A Quantization-based Approach for Accelerating Object Detection on Embedded Platforms (w8a8 QAT evidence)
  • Experimental Energy Consumption Analysis of Neural Network Model Compression Methods on Microcontrollers with Applications in Bird Call Classification (weights int8 - activations fp32 evidence)
  • Efficient Expiration Date Recognition in Food Packages for Mobile Applications (w8a8 evidence)
  • Efficient Expiration Date Recognition in Food Packages for Mobile Applications (w16a16 evidence)
  • Experimental Energy Consumption Analysis of Neural Network Model Compression Methods on Microcontrollers with Applications in Bird Call Classification (weights fp32 - activations int8 evidence)
  • Impact of Memory Voltage Scaling on Accuracy and Resilience of Deep Learning Based Edge Devices (fxp_4_32 evidence)
  • UAV-deployed Deep Learning Network for Real-Time Multi-Class Damage Detection Using Model Quantization Techniques (INT8 partial QAT evidence)
  • Experimental Energy Consumption Analysis of Neural Network Model Compression Methods on Microcontrollers with Applications in Bird Call Classification (weights int8 - activations fp16 evidence)
  • Optimizing Convolutional Neural Networks for IoT Devices: Performance and Energy Efficiency of Quantization Techniques (int8 QAT evidence)
  • Optimizing Convolutional Neural Networks for IoT Devices: Performance and Energy Efficiency of Quantization Techniques (int8 PTQ evidence)
  • Energy-Efficient Respiratory Anomaly Detection in Premature Newborn Infants
  • QUANOS: Adversarial Noise Sensitivity Driven Hybrid Quantization of Neural Networks (QUANOS evidence)
  • Energy Cost Modelling for Optimizing Large Language Model Inference on Hardware Accelerators (4-bit evidence)
  • Energy-Efficient Respiratory Anomaly Detection in Premature Newborn Infants (4-bit evidence)
  • Energy-Efficient Respiratory Anomaly Detection in Premature Newborn Infants (32-bit evidence)
  • Experimental Energy Consumption Analysis of Neural Network Model Compression Methods on Microcontrollers with Applications in Bird Call Classification (weights fp16 - activations fp32 evidence)
  • Implementing Deep Neural Networks on ARM-Based Microcontrollers: Application for Ventricular Fibrillation Detection (full model int8 evidence)
  • Impact of Memory Voltage Scaling on Accuracy and Resilience of Deep Learning Based Edge Devices (fxp_4_8 evidence)
  • Experimental Energy Consumption Analysis of Neural Network Model Compression Methods on Microcontrollers with Applications in Bird Call Classification (weights fp16 - activations fp16 evidence)
  • Experimental Energy Consumption Analysis of Neural Network Model Compression Methods on Microcontrollers with Applications in Bird Call Classification (weights fp32 - activations fp16 evidence)
  • Experimental Energy Consumption Analysis of Neural Network Model Compression Methods on Microcontrollers with Applications in Bird Call Classification (weights int8 - activations int8 evidence)
  • Energy-Efficient Respiratory Anomaly Detection in Premature Newborn Infants (2-bit evidence)
  • UAV-deployed Deep Learning Network for Real-Time Multi-Class Damage Detection Using Model Quantization Techniques (half-precision training evidence)
  • Language Models in Software Development Tasks: An Experimental Analysis of Energy and Accuracy (8bit evidence)
  • Quantized Object Detection for Real-Time Inference on Embedded GPU Architectures (int8 evidence)
  • Energy-Efficient Respiratory Anomaly Detection in Premature Newborn Infants (16-bit evidence)
  • Understanding the Impact of Precision Quantization on the Accuracy and Energy of Neural Networks (q0.4 evidence)
  • Language Models in Software Development Tasks: An Experimental Analysis of Energy and Accuracy (4bit evidence)
  • Energy Efficiency of Deep Learning Compression Techniques in Wearable Human Activity Recognition
  • POQ: Is There a Pareto-Optimal Quantization Strategy for Deep Neural Networks? (full-int8 evidence)
  • Optimization Strategies for Enhancing Resource Efficiency in Transformers & Large Language Models (INT8 evidence)
  • Energy-Efficient Deep Learning for Cloud Detection Onboard Nanosatellite
  • Understanding the Impact of Precision Quantization on the Accuracy and Energy of Neural Networks (q0.16 evidence)
  • Green My LLM: Studying the Key Factors Affecting the Energy Consumption of Code Assistants (BitsAndBytes FP4 evidence)
  • UAV-deployed Deep Learning Network for Real-Time Multi-Class Damage Detection Using Model Quantization Techniques (INT8 QAT evidence)
  • Implementing Deep Neural Networks on ARM-Based Microcontrollers: Application for Ventricular Fibrillation Detection (w8a8 evidence)
  • Edge AI-Powered System Architecture for Aloe Vera Plant Disease Detection
  • Energy Cost Modelling for Optimizing Large Language Model Inference on Hardware Accelerators (1-bit evidence)
  • POQ: Is There a Pareto-Optimal Quantization Strategy for Deep Neural Networks? (w4a4 evidence)
  • Language Models in Software Development Tasks: An Experimental Analysis of Energy and Accuracy
  • A Methodological Framework for Optimizing the Energy Consumption of Deep Neural Networks: A Case Study of a Cyber Threat Detector
  • Impact of Memory Voltage Scaling on Accuracy and Resilience of Deep Learning Based Edge Devices (fxp_8_16 evidence)
  • POQ: Is There a Pareto-Optimal Quantization Strategy for Deep Neural Networks? (w-int2, a-int4 evidence)
  • Green My LLM: Studying the Key Factors Affecting the Energy Consumption of Code Assistants (BitsAndBytes NF4 evidence)
  • Understanding the Impact of Precision Quantization on the Accuracy and Energy of Neural Networks (q0.32 evidence)
  • Understanding the Impact of Precision Quantization on the Accuracy and Energy of Neural Networks (int6 evidence)
  • POQ: Is There a Pareto-Optimal Quantization Strategy for Deep Neural Networks? (w2a2 evidence)
  • QUANOS: Adversarial Noise Sensitivity Driven Hybrid Quantization of Neural Networks (4bit evidence)
  • POQ: Is There a Pareto-Optimal Quantization Strategy for Deep Neural Networks? (w-int2, a-int8 evidence)
  • POQ: Is There a Pareto-Optimal Quantization Strategy for Deep Neural Networks? (w8a8 evidence)
  • Green My LLM: Studying the Key Factors Affecting the Energy Consumption of Code Assistants (EETQ INT8 evidence)
  • Optimization Strategies for Enhancing Resource Efficiency in Transformers & Large Language Models (FP4 evidence)
  • Quantized Object Detection for Real-Time Inference on Embedded GPU Architectures (fp16 evidence)

Aggregated Evidence

Conclusion

Research Question

Proposed theory: Model quantization causes positive effects on DL systems’ resource efficiency. Strongly positive effects are observed for storage size and GPU energy consumption. Inference power draw is weakly positively affected, while indifferent-to-weakly-positive effects are observed for GPU power draw and inference latency. Model quantization also causes weakly negative effects on accuracy.
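The storage-size and accuracy directions of the theory can be illustrated with a toy sketch of symmetric per-tensor post-training int8 quantization (an illustrative example, not any individual study's setup):

```python
import numpy as np

# Toy illustration of the theory's storage and accuracy directions:
# symmetric per-tensor post-training quantization of fp32 weights to int8.
def quantize_int8(w: np.ndarray):
    scale = float(np.max(np.abs(w))) / 127.0  # largest magnitude maps to 127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
w = rng.standard_normal((256, 256)).astype(np.float32)  # stand-in for trained weights
q, scale = quantize_int8(w)

print(w.nbytes // q.nbytes)  # → 4  (fp32 -> int8 shrinks storage 4x)
round_trip_error = float(np.abs(q.astype(np.float32) * scale - w).max())
print(round_trip_error <= scale / 2 + 1e-6)  # → True (error bounded by half a step)
```

Real studies replace the toy weights with a trained model and measure energy alongside task accuracy; the latency and power-draw effects depend on hardware support for low-precision kernels, which this sketch does not model.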

Full Aggregation