Definition

Research Question

How does model quantization affect system resource efficiency and correctness when deploying DL systems?

Search String

(( "machine learning" OR "ML" OR "deep learning" OR "DL" OR "large language model" OR "LLM?" OR "neural network" OR "?NN" OR "foundational model" OR "agent" ) AND ( "quantization" OR "quantize" OR "quantized" ) AND ( "energy consumption" OR "energy efficien*" OR "sustain*" OR "carbon footprint" OR "carbon emission" ) AND NOT ( "FL" OR "federated learning" ) ) AND PUBYEAR > 2019

Inclusion Criteria

  • The study concerns the application of model quantization to optimize a DL model.
  • The study addresses the environmental sustainability and/or energy efficiency of applying model quantization.
  • The study analyzes the application of model quantization for model inference.
  • The study applies model quantization at the software level.
  • The study controls experimental factors in each trial, avoiding free variation among runs.
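The last criterion can be made concrete with a minimal sketch of a controlled trial loop: the seed and workload are fixed so that differences across measured runs reflect the treatment (e.g. quantization) rather than free variation. All names here (`controlled_trials`, `run_inference`) are illustrative and not taken from any of the listed studies.

```python
import random
import statistics
import time

def controlled_trials(run_inference, n_trials=10, seed=42):
    """Repeat identical trials under a fixed seed and report median latency (s)."""
    latencies = []
    for _ in range(n_trials):
        random.seed(seed)  # identical inputs/ordering in every trial
        t0 = time.perf_counter()
        run_inference()
        latencies.append(time.perf_counter() - t0)
    return statistics.median(latencies)

# Stand-in workload; a real study would invoke the (quantized) model here.
median_s = controlled_trials(lambda: sum(i * i for i in range(50_000)))
```

The median is reported rather than the mean to reduce sensitivity to occasional scheduling outliers.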

Exclusion Criteria

  • The study combines model quantization with other optimization techniques.
  • The study does not report a non-quantized baseline.
  • The study is a secondary or tertiary study.
  • The study is not written in English.
  • The study is an editorial, tutorial, book, extended abstract, or similar non-primary publication.

Papers

  • Experimental Energy Consumption Analysis of Neural Network Model Compression Methods on Microcontrollers with Applications in Bird Call Classification (weights int8 - activations fp32 evidence)
  • Energy-Efficient Respiratory Anomaly Detection in Premature Newborn Infants (2-bit evidence)
  • Experimental Energy Consumption Analysis of Neural Network Model Compression Methods on Microcontrollers with Applications in Bird Call Classification (weights fp16 - activations fp16 evidence)
  • Experimental Energy Consumption Analysis of Neural Network Model Compression Methods on Microcontrollers with Applications in Bird Call Classification (weights int8 - activations int8 evidence)
  • Energy Cost Modelling for Optimizing Large Language Model Inference on Hardware Accelerators (1-bit evidence)
  • Experimental Energy Consumption Analysis of Neural Network Model Compression Methods on Microcontrollers with Applications in Bird Call Classification (weights fp32 - activations fp16 evidence)
  • Energy Cost Modelling for Optimizing Large Language Model Inference on Hardware Accelerators (4-bit evidence)
  • Experimental Energy Consumption Analysis of Neural Network Model Compression Methods on Microcontrollers with Applications in Bird Call Classification (weights fp16 - activations fp32 evidence)
  • Energy-Efficient Respiratory Anomaly Detection in Premature Newborn Infants (32-bit evidence)
  • Experimental Energy Consumption Analysis of Neural Network Model Compression Methods on Microcontrollers with Applications in Bird Call Classification (weights int8 - activations fp16 evidence)
  • Experimental Energy Consumption Analysis of Neural Network Model Compression Methods on Microcontrollers with Applications in Bird Call Classification (weights fp16 - activations int8 evidence)
  • Energy-Efficient Respiratory Anomaly Detection in Premature Newborn Infants (4-bit evidence)
  • Energy-Efficient Respiratory Anomaly Detection in Premature Newborn Infants (16-bit evidence)
  • Experimental Energy Consumption Analysis of Neural Network Model Compression Methods on Microcontrollers with Applications in Bird Call Classification (weights fp32 - activations int8 evidence)
  • Verifiable and Energy Efficient Medical Image Analysis with Quantised Self-attentive Deep Neural Networks
  • Energy-Efficient Respiratory Anomaly Detection in Premature Newborn Infants (8-bit evidence)
  • Impact of ML Optimization Tactics on Greener Pre-Trained ML Models
  • Language Models in Software Development Tasks: An Experimental Analysis of Energy and Accuracy (4-bit evidence)
  • Language Models in Software Development Tasks: An Experimental Analysis of Energy and Accuracy (8-bit evidence)

Evidence

  • Experimental Energy Consumption Analysis of Neural Network Model Compression Methods on Microcontrollers with Applications in Bird Call Classification
  • QUANOS: Adversarial Noise Sensitivity Driven Hybrid Quantization of Neural Networks (5-bit evidence)
  • Energy-Efficient Respiratory Anomaly Detection in Premature Newborn Infants (8-bit evidence)
  • Impact of ML Optimization Tactics on Greener Pre-Trained ML Models
  • Experimental Energy Consumption Analysis of Neural Network Model Compression Methods on Microcontrollers with Applications in Bird Call Classification (weights fp16 - activations fp16 evidence)
  • Experimental Energy Consumption Analysis of Neural Network Model Compression Methods on Microcontrollers with Applications in Bird Call Classification (weights fp32 - activations fp16 evidence)
  • Experimental Energy Consumption Analysis of Neural Network Model Compression Methods on Microcontrollers with Applications in Bird Call Classification (weights fp16 - activations int8 evidence)
  • Energy-Efficient Respiratory Anomaly Detection in Premature Newborn Infants (2-bit evidence)
  • Experimental Energy Consumption Analysis of Neural Network Model Compression Methods on Microcontrollers with Applications in Bird Call Classification (weights int8 - activations int8 evidence)
  • Verifiable and Energy Efficient Medical Image Analysis with Quantised Self-attentive Deep Neural Networks
  • UAV-deployed Deep Learning Network for Real-Time Multi-Class Damage Detection Using Model Quantization Techniques (half-precision training evidence)
  • Language Models in Software Development Tasks: An Experimental Analysis of Energy and Accuracy (8-bit evidence)
  • Green My LLM: Studying the Key Factors Affecting the Energy Consumption of Code Assistants
  • UAV-deployed Deep Learning Network for Real-Time Multi-Class Damage Detection Using Model Quantization Techniques (INT8 PTQ evidence)
  • Energy-Efficient Respiratory Anomaly Detection in Premature Newborn Infants (16-bit evidence)
  • Optimizing Convolutional Neural Networks for IoT Devices: Performance and Energy Efficiency of Quantization Techniques (fp16 PTQ evidence)
  • Language Models in Software Development Tasks: An Experimental Analysis of Energy and Accuracy (4-bit evidence)
  • Energy Efficiency of Deep Learning Compression Techniques in Wearable Human Activity Recognition
  • Energy Cost Modelling for Optimizing Large Language Model Inference on Hardware Accelerators
  • Optimization Strategies for Enhancing Resource Efficiency in Transformers & Large Language Models (INT8 evidence)
  • QUANOS: Adversarial Noise Sensitivity Driven Hybrid Quantization of Neural Networks (8-bit evidence)
  • Optimization Strategies for Enhancing Resource Efficiency in Transformers & Large Language Models
  • Green My LLM: Studying the Key Factors Affecting the Energy Consumption of Code Assistants (BitsAndBytes FP4 evidence)
  • UAV-deployed Deep Learning Network for Real-Time Multi-Class Damage Detection Using Model Quantization Techniques (INT8 QAT evidence)
  • Energy Cost Modelling for Optimizing Large Language Model Inference on Hardware Accelerators (1-bit evidence)
  • Experimental Energy Consumption Analysis of Neural Network Model Compression Methods on Microcontrollers with Applications in Bird Call Classification (weights int8 - activations fp32 evidence)
  • Language Models in Software Development Tasks: An Experimental Analysis of Energy and Accuracy
  • A Methodological Framework for Optimizing the Energy Consumption of Deep Neural Networks: A Case Study of a Cyber Threat Detector
  • Experimental Energy Consumption Analysis of Neural Network Model Compression Methods on Microcontrollers with Applications in Bird Call Classification (weights fp32 - activations int8 evidence)
  • UAV-deployed Deep Learning Network for Real-Time Multi-Class Damage Detection Using Model Quantization Techniques (INT8 partial QAT evidence)
  • Green My LLM: Studying the Key Factors Affecting the Energy Consumption of Code Assistants (BitsAndBytes NF4 evidence)
  • Experimental Energy Consumption Analysis of Neural Network Model Compression Methods on Microcontrollers with Applications in Bird Call Classification (weights int8 - activations fp16 evidence)
  • Optimizing Convolutional Neural Networks for IoT Devices: Performance and Energy Efficiency of Quantization Techniques (int8 QAT evidence)
  • Energy-Efficient Respiratory Anomaly Detection in Premature Newborn Infants
  • QUANOS: Adversarial Noise Sensitivity Driven Hybrid Quantization of Neural Networks (4-bit evidence)
  • Optimizing Convolutional Neural Networks for IoT Devices: Performance and Energy Efficiency of Quantization Techniques (int8 PTQ evidence)
  • QUANOS: Adversarial Noise Sensitivity Driven Hybrid Quantization of Neural Networks (QUANOS evidence)
  • Energy Cost Modelling for Optimizing Large Language Model Inference on Hardware Accelerators (4-bit evidence)
  • Energy-Efficient Respiratory Anomaly Detection in Premature Newborn Infants (4-bit evidence)
  • Green My LLM: Studying the Key Factors Affecting the Energy Consumption of Code Assistants (EETQ INT8 evidence)
  • Optimization Strategies for Enhancing Resource Efficiency in Transformers & Large Language Models (FP4 evidence)
  • Energy-Efficient Respiratory Anomaly Detection in Premature Newborn Infants (32-bit evidence)
  • Experimental Energy Consumption Analysis of Neural Network Model Compression Methods on Microcontrollers with Applications in Bird Call Classification (weights fp16 - activations fp32 evidence)

Aggregated Evidence

Conclusion

Research Question

Proposed theory: Model quantization positively affects DL systems’ resource efficiency. Strongly positive effects are observed for storage size and GPU energy consumption. Inference power draw is weakly positively affected, while effects ranging from indifferent to weakly positive are observed for GPU power draw and inference latency. Model quantization also weakly negatively affects accuracy.

Full aggregation

Model quantization from fp32 to int8
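The fp32-to-int8 mapping aggregated here can be illustrated with a toy NumPy sketch of symmetric post-training quantization; it is not drawn from any of the listed studies, but it shows why storage size is the most strongly affected dimension (one byte per weight instead of four).

```python
import numpy as np

def quantize_int8(w):
    # Symmetric PTQ: map the fp32 range [-max|w|, max|w|] onto int8 [-127, 127].
    scale = float(np.abs(w).max()) / 127.0 or 1.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((256, 256)).astype(np.float32)
q, scale = quantize_int8(w)

storage_ratio = w.nbytes // q.nbytes          # 4: int8 weights are 4x smaller
max_error = float(np.abs(dequantize(q, scale) - w).max())  # bounded by scale/2
```

The bounded round-off error (`max_error` is at most half a quantization step) is the mechanism behind the weakly negative accuracy effect noted above.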