Research Question
How does model quantization affect system resource efficiency and correctness when deploying DL systems?
Search String
(( "machine learning" OR "ML" OR "deep learning" OR "DL" OR "large language model" OR "LLM?" OR "neural network" OR "?NN" OR "foundational model" OR "agent" ) AND ( "quantization" OR "quantize" OR "quantized" ) AND ( "energy consumption" OR "energy efficien*" OR "sustain*" OR "carbon footprint" OR "carbon emission" ) AND NOT ( "FL" OR "federated learning" ) ) AND PUBYEAR > 2019
Inclusion Criteria
- The study applies model quantization to optimize a DL model.
- The study addresses the environmental sustainability and/or energy efficiency of applying model quantization.
- The study analyzes the application of model quantization for model inference.
- The study applies model quantization at the software level.
- The study controls the experimental factors in each trial, avoiding uncontrolled variation across runs.
Exclusion Criteria
- The study combines model quantization with other optimization techniques.
- The study does not report a non-quantized baseline.
- The study is a secondary or tertiary study.
- The study is not written in English.
- The study is an editorial, tutorial, book, extended abstract, or similar format.
Research Question
Proposed theory: Model quantization has positive effects on DL systems’ resource efficiency. Strongly positive effects are observed for storage size and GPU energy consumption. Inference power draw is weakly positively affected, while indifferent-to-weakly-positive effects are observed for GPU power draw and inference latency. Model quantization also has weakly negative effects on accuracy.
Full aggregation
Model quantization from fp32 to int8
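To make the fp32-to-int8 transformation concrete, the following is a minimal sketch of symmetric per-tensor post-training quantization using plain NumPy. The function names and the example weight tensor are illustrative, not taken from any specific framework studied; real deployments typically use a framework's own quantization toolkit.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor quantization of an fp32 array to int8."""
    max_abs = float(np.max(np.abs(w)))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Map int8 values back to approximate fp32 values."""
    return q.astype(np.float32) * scale

# Hypothetical fp32 weight tensor
w = np.random.randn(256, 256).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

# int8 storage is 4x smaller than fp32; rounding error is bounded by ~scale/2
print(q.nbytes / w.nbytes)
print(np.max(np.abs(w - w_hat)))
```

This illustrates the trade-off the proposed theory describes: storage drops by a factor of four, at the cost of a bounded per-weight rounding error that can slightly degrade accuracy.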