Quantization is a model compression technique that reduces the precision of numeric representations, for example storing weights as 8-bit integers instead of 32-bit floats, to shrink model size and speed up inference.
Proper quantization lowers memory use and cost on CPUs and edge devices while often preserving acceptable accuracy. It requires calibration and testing, because aggressive quantization can degrade model quality. Quantization is a useful tool for deploying large models in constrained production environments, but it must be validated for each task and hardware target.
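To make the idea concrete, here is a minimal sketch of symmetric per-tensor int8 quantization in plain Python. The function names and the choice of a symmetric scheme are illustrative assumptions, not a reference to any specific library's API: each float weight is divided by a scale derived from the largest absolute value, rounded to an integer in [-127, 127], and later multiplied by the scale to recover an approximation of the original value.

```python
def quantize_int8(weights):
    # Symmetric per-tensor quantization (illustrative sketch):
    # choose the scale so the largest-magnitude weight maps to +/-127.
    scale = max(abs(w) for w in weights) / 127.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize_int8(q, scale):
    # Recover approximate float weights from the int8 codes.
    return [v * scale for v in q]

weights = [0.5, -1.27, 0.02, 1.0]
q, scale = quantize_int8(weights)
approx = dequantize_int8(q, scale)
```

The gap between `weights` and `approx` is the quantization error; calibration (choosing scales from representative data) and per-channel scales are the usual ways real toolchains keep that error small.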