"""
Title: GPTQ Quantization in Keras
Author: [Jyotinder Singh](https://x.com/Jyotinder_Singh)
Date created: 2025/10/16
Last modified: 2025/10/16
Description: How to run weight-only GPTQ quantization for Keras & KerasHub models.
Accelerator: GPU
"""

"""
## What is GPTQ?

GPTQ ("Generative Pre-Training Quantization") is a post-training, weight-only
quantization method that uses a second-order approximation of the loss (via a
Hessian estimate) to minimize the error introduced when compressing weights to
lower precision, typically 4-bit integers.

Unlike post-training techniques that also quantize activations, GPTQ keeps
activations in higher precision and quantizes only the weights. This often
preserves model quality in low bit-width settings while still providing large
storage and memory savings.

Keras supports GPTQ quantization for KerasHub models via the
`keras.quantizers.GPTQConfig` class.
"""
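"""
To make the Hessian-based error estimate concrete, here is a minimal NumPy
sketch. It assumes the layer-wise formulation commonly used to describe GPTQ
(minimize the squared difference between a linear layer's outputs before and
after quantization). The toy shapes, the `round_to_grid` helper, and the plain
round-to-nearest quantizer are illustrative assumptions, not the Keras
implementation. The sketch only shows that the output error is a quadratic form
in the weight error, with the Hessian `H = 2 * X @ X.T` built from calibration
activations.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))   # toy weights: (out_features, in_features)
X = rng.normal(size=(16, 64))  # toy calibration activations: (in_features, tokens)


def round_to_grid(w, bits=4):
    # Hypothetical helper: asymmetric round-to-nearest, one grid per output row.
    levels = 2**bits - 1
    w_min = w.min(axis=1, keepdims=True)
    w_max = w.max(axis=1, keepdims=True)
    scale = (w_max - w_min) / levels
    q = np.clip(np.round((w - w_min) / scale), 0, levels)
    return q * scale + w_min


W_hat = round_to_grid(W)
delta = W - W_hat

# Hessian of the layer-wise squared output error with respect to a weight row.
H = 2.0 * X @ X.T

# The error measured on the layer outputs ...
output_error = np.sum((delta @ X) ** 2)
# ... equals the quadratic form defined by the Hessian estimate.
quadratic_form = 0.5 * np.trace(delta @ H @ delta.T)

print(output_error, quadratic_form)
```

GPTQ goes further than this sketch: it uses the Hessian to adjust the weights
that are not yet quantized so that they compensate for the rounding error of the
weights that already are.
"""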

"""
## Load a KerasHub model

This guide uses the `Gemma3CausalLM` model from KerasHub, a small (1B
parameter) causal language model.

"""
import keras
from keras_hub.models import Gemma3CausalLM
from datasets import load_dataset

prompt = "Keras is a"

model = Gemma3CausalLM.from_preset("gemma3_1b")

outputs = model.generate(prompt, max_length=30)
print(outputs)

"""
## Configure & run GPTQ quantization

You can configure GPTQ quantization via the `keras.quantizers.GPTQConfig` class.

The GPTQ configuration requires a calibration dataset and tokenizer, which it
uses to estimate the Hessian and quantization error. Here, we use a small slice
of the WikiText-2 dataset for calibration.

You can tune several parameters to trade off speed, memory, and accuracy. The
most important of these are `weight_bits` (the bit-width to quantize weights to)
and `group_size` (the number of weights that share one set of quantization
parameters). The group size controls the granularity of quantization: smaller
groups typically yield better accuracy but are slower to quantize and may use
more memory. A good starting point is `group_size=128` for 4-bit quantization
(`weight_bits=4`).

In this example, we first prepare a tiny calibration set, and then run GPTQ on
the model using the `.quantize(...)` API.
"""
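"""
The group-size trade-off described above can be illustrated with a small NumPy
sketch. It uses plain asymmetric round-to-nearest rather than GPTQ, and the
`quantize_groupwise` and `effective_bits` helpers, the toy weight matrix, and
the 32 bits of per-group metadata are all assumptions made for illustration.
How Keras actually packs scales and zero points may differ.

```python
import numpy as np


def quantize_groupwise(w, weight_bits=4, group_size=128):
    # Each run of `group_size` consecutive weights in a row gets its own
    # scale and zero point. Assumes the row length is a multiple of group_size.
    levels = 2**weight_bits - 1
    rows, cols = w.shape
    groups = w.reshape(rows, cols // group_size, group_size)
    w_min = groups.min(axis=-1, keepdims=True)
    w_max = groups.max(axis=-1, keepdims=True)
    scale = (w_max - w_min) / levels
    q = np.clip(np.round((groups - w_min) / scale), 0, levels)
    return (q * scale + w_min).reshape(rows, cols)


def effective_bits(weight_bits=4, group_size=128, param_bits=32):
    # Hypothetical storage model: every group stores `param_bits` of
    # scale/zero-point metadata on top of its packed weights.
    return weight_bits + param_bits / group_size


rng = np.random.default_rng(0)
w = rng.normal(size=(64, 1024))

for group_size in (128, 32):
    w_hat = quantize_groupwise(w, group_size=group_size)
    mse = np.mean((w - w_hat) ** 2)
    bits = effective_bits(group_size=group_size)
    print(f"group_size={group_size}: mse={mse:.5f}, ~{bits:.2f} bits/weight")
```

On this toy matrix the smaller group gives a lower reconstruction error at the
cost of more metadata per weight, which is the trade-off to keep in mind when
tuning `group_size`.
"""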

# Calibration slice (use a larger/representative set in practice)
texts = load_dataset("wikitext", "wikitext-2-raw-v1", split="train[:1%]")["text"]

calibration_dataset = [
    s + "." for text in texts for s in map(str.strip, text.split(".")) if s
]
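"""
To see what the list comprehension above produces, here is the same expression
applied to a few toy strings. The `split_into_sentences` wrapper and the sample
strings are hypothetical; they are only meant to resemble raw dataset lines.

```python
def split_into_sentences(texts):
    # Same comprehension as in the calibration code, wrapped in a helper.
    return [s + "." for text in texts for s in map(str.strip, text.split(".")) if s]


toy_texts = [
    " = Valkyria Chronicles III = ",
    "",
    " It was released in 2011. It sold well. ",
    " The ratio was 3.5 to 1. ",
]

for sample in split_into_sentences(toy_texts):
    print(repr(sample))
```

Empty lines are dropped, headings receive a trailing period, and decimals such
as `3.5` are split in two. That is acceptable for a demo, but a proper sentence
splitter may be preferable for a real calibration set.
"""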

gptq_config = keras.quantizers.GPTQConfig(
    dataset=calibration_dataset,
    tokenizer=model.preprocessor.tokenizer,
    weight_bits=4,
    group_size=128,
    num_samples=256,
    sequence_length=256,
    hessian_damping=0.01,
    symmetric=False,
    activation_order=False,
)

model.quantize("gptq", config=gptq_config)

outputs = model.generate(prompt, max_length=30)
print(outputs)
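"""
The `hessian_damping` value passed above is not explained in this guide. In the
GPTQ literature, damping usually means adding a small fraction of the mean
Hessian diagonal back onto the diagonal so that the matrix can be inverted
reliably. Whether Keras applies it in exactly this form is an assumption. A
minimal NumPy sketch of the idea, with a hypothetical `damp_hessian` helper:

```python
import numpy as np


def damp_hessian(H, damping=0.01):
    # Add `damping * mean(diag(H))` to every diagonal entry (on a copy).
    H = H.copy()
    H[np.diag_indices_from(H)] += damping * np.mean(np.diag(H))
    return H


rng = np.random.default_rng(0)
# Fewer calibration tokens (8) than input features (16): H is rank-deficient.
X = rng.normal(size=(16, 8))
H = 2.0 * X @ X.T

H_damped = damp_hessian(H, damping=0.01)

print("rank before damping:", np.linalg.matrix_rank(H))
print("smallest eigenvalue after damping:", np.linalg.eigvalsh(H_damped).min())
```

After damping, every eigenvalue is strictly positive, so a Cholesky
factorization or inverse of the Hessian is well defined.
"""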

"""
## Model Export

The GPTQ-quantized model can be saved to a preset and reloaded elsewhere, just
like any other KerasHub model.
"""

model.save_to_preset("gemma3_gptq_w4gs128_preset")
model_from_preset = Gemma3CausalLM.from_preset("gemma3_gptq_w4gs128_preset")
output = model_from_preset.generate(prompt, max_length=30)
print(output)
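"""
If you want to check how much disk space a saved preset occupies, a small
standard-library helper is enough. The `directory_size_bytes` and
`percent_change` functions below are hypothetical helpers, not part of Keras or
KerasHub, and the temporary directory only stands in for a real preset folder.

```python
import os
import tempfile


def directory_size_bytes(path):
    # Sum the sizes of all regular files under `path`, recursively.
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total


def percent_change(baseline, quantized):
    # Negative values mean the quantized artifact is smaller.
    return 100.0 * (quantized - baseline) / baseline


# Stand-in for a preset directory: one 1024-byte file.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "weights.bin"), "wb") as f:
        f.write(bytes(1024))
    demo_size = directory_size_bytes(tmp)

print(demo_size)
```

In practice you would call `directory_size_bytes("gemma3_gptq_w4gs128_preset")`
and compare it against the size of a preset saved from the unquantized model.
"""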

"""
## Performance & Benchmarking

Micro-benchmarks collected on a single NVIDIA 4070 Ti Super (16 GB).
Baselines are FP32, and every Δ value is the change relative to that baseline.

Dataset: WikiText-2.


| Model (preset)                    | Perplexity Increase % (↓ better) | Disk Storage Δ % (↓ better) | VRAM Δ % (↓ better) | First-token Latency Δ % (↓ better) | Throughput Δ % (↑ better) |
| --------------------------------- | -------------------------------: | --------------------------: | ------------------: | ---------------------------------: | ------------------------: |
| GPT2 (gpt2_base_en_cnn_dailymail) |                             1.0% |                    -50.1% ↓ |            -41.1% ↓ |                            +0.7% ↑ |                  +20.1% ↑ |
| OPT (opt_125m_en)                 |                            10.0% |                    -49.8% ↓ |            -47.0% ↓ |                            +6.7% ↑ |                  -15.7% ↓ |
| Bloom (bloom_1.1b_multi)          |                             7.0% |                    -47.0% ↓ |            -54.0% ↓ |                            +1.8% ↑ |                  -15.7% ↓ |
| Gemma3 (gemma3_1b)                |                             3.0% |                    -51.5% ↓ |            -51.8% ↓ |                           +39.5% ↑ |                   +5.7% ↑ |


Detailed benchmarking numbers and scripts are available
[here](https://github.com/keras-team/keras/pull/21641).

### Analysis

There is a notable reduction in disk space and VRAM usage across all models,
with disk space savings around 50% and VRAM savings ranging from 41% to 54%. The
reported disk savings understate the true weight compression because presets
also include non-weight assets.

Perplexity increases by between 1% (GPT2) and 10% (OPT), indicating that model
quality is largely preserved after quantization.

Speed results are mixed: first-token latency rises for every model (from +0.7%
to +39.5%), while throughput improves for GPT2 and Gemma3 but drops by 15.7% for
OPT and Bloom.
"""
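"""
For reference, the perplexity metric in the table can be sketched as follows.
This assumes the standard definition (the exponential of the mean negative
log-likelihood per token); the exact procedure used by the benchmark scripts is
not stated in this guide, and the log-probabilities below are made-up toy
values.

```python
import math


def perplexity(token_log_probs):
    # exp of the mean negative log-likelihood per token (natural log).
    nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(nll)


def percent_increase(baseline, quantized):
    return 100.0 * (quantized - baseline) / baseline


# Toy values: the quantized model assigns slightly lower probability
# to each reference token than the baseline does.
baseline_log_probs = [math.log(0.25)] * 4
quantized_log_probs = [math.log(0.20)] * 4

ppl_base = perplexity(baseline_log_probs)
ppl_quant = perplexity(quantized_log_probs)

print(ppl_base, ppl_quant, percent_increase(ppl_base, ppl_quant))
```

A lower perplexity means the model assigns higher probability to the reference
text, so a small percentage increase after quantization indicates little loss
in quality.
"""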

"""
## Practical tips

* GPTQ is a post-training technique; training after quantization is not supported.
* Always use the model's own tokenizer for calibration.
* Use a representative calibration set; small slices are only for demos.
* Start with `weight_bits=4` and `group_size=128`; tune per model/task.
"""