"""
Title: GPTQ Quantization in Keras
Author: [Jyotinder Singh](https://x.com/Jyotinder_Singh)
Date created: 2025/10/16
Last modified: 2025/10/16
Description: How to run weight-only GPTQ quantization for Keras & KerasHub models.
Accelerator: GPU
"""

"""
## What is GPTQ?

GPTQ ("Generative Pre-Training Quantization") is a post-training, weight-only
quantization method that uses a second-order approximation of the loss (via a
Hessian estimate) to minimize the error introduced when compressing weights to
lower precision, typically 4-bit integers.

Unlike post-training techniques that also quantize activations, GPTQ keeps
activations in higher precision and quantizes only the weights. This often
preserves model quality in low bit-width settings while still providing large
storage and memory savings.

Keras supports GPTQ quantization for KerasHub models via the
`keras.quantizers.GPTQConfig` class.
"""

"""
## Load a KerasHub model

This guide uses the `Gemma3CausalLM` model from KerasHub, a small (1B
parameter) causal language model.
"""
import keras
from keras_hub.models import Gemma3CausalLM
from datasets import load_dataset

prompt = "Keras is a"

model = Gemma3CausalLM.from_preset("gemma3_1b")

# Generate once before quantization so the output can be compared afterwards.
outputs = model.generate(prompt, max_length=30)
print(outputs)

"""
## Configure & run GPTQ quantization

You can configure GPTQ quantization via the `keras.quantizers.GPTQConfig` class.

The GPTQ configuration requires a calibration dataset and tokenizer, which it
uses to estimate the Hessian and quantization error. Here, we use a small slice
of the WikiText-2 dataset for calibration.

You can tune several parameters to trade off speed, memory, and accuracy. The
most important of these are `weight_bits` (the bit-width to quantize weights to)
and `group_size` (the number of weights to quantize together). The group size
controls the granularity of quantization: smaller groups typically yield better
accuracy but are slower to quantize and may use more memory. A good starting
point is `group_size=128` for 4-bit quantization (`weight_bits=4`).

In this example, we first prepare a tiny calibration set, and then run GPTQ on
the model using the `.quantize(...)` API.
"""

# Calibration slice (use a larger/representative set in practice)
texts = load_dataset("wikitext", "wikitext-2-raw-v1", split="train[:1%]")["text"]

# Split each record into non-empty sentences and restore the trailing period.
calibration_dataset = [
    s + "." for text in texts for s in map(str.strip, text.split(".")) if s
]

gptq_config = keras.quantizers.GPTQConfig(
    dataset=calibration_dataset,
    tokenizer=model.preprocessor.tokenizer,
    weight_bits=4,
    group_size=128,
    num_samples=256,
    sequence_length=256,
    hessian_damping=0.01,
    symmetric=False,
    activation_order=False,
)

model.quantize("gptq", config=gptq_config)

# Generate again with the quantized weights.
outputs = model.generate(prompt, max_length=30)
print(outputs)

"""
## Model Export

The GPTQ quantized model can be saved to a preset and reloaded elsewhere, just
like any other KerasHub model.
"""

model.save_to_preset("gemma3_gptq_w4gs128_preset")
model_from_preset = Gemma3CausalLM.from_preset("gemma3_gptq_w4gs128_preset")
output = model_from_preset.generate(prompt, max_length=30)
print(output)

"""
## Performance & Benchmarking

Micro-benchmarks collected on a single NVIDIA 4070 Ti Super (16 GB).
Baselines are FP32. All Δ columns are changes relative to that baseline.

Dataset: WikiText-2.


| Model (preset)                    | Perplexity Increase % (↓ better) | Disk Storage Δ % (↓ better) | VRAM Δ % (↓ better) | First-token Latency Δ % (↓ better) | Throughput Δ % (↑ better) |
| --------------------------------- | -------------------------------: | --------------------------: | ------------------: | ---------------------------------: | ------------------------: |
| GPT2 (gpt2_base_en_cnn_dailymail) |                             1.0% |                    -50.1% ↓ |            -41.1% ↓ |                            +0.7% ↑ |                  +20.1% ↑ |
| OPT (opt_125m_en)                 |                            10.0% |                    -49.8% ↓ |            -47.0% ↓ |                            +6.7% ↑ |                  -15.7% ↓ |
| Bloom (bloom_1.1b_multi)          |                             7.0% |                    -47.0% ↓ |            -54.0% ↓ |                            +1.8% ↑ |                  -15.7% ↓ |
| Gemma3 (gemma3_1b)                |                             3.0% |                    -51.5% ↓ |            -51.8% ↓ |                           +39.5% ↑ |                   +5.7% ↑ |


Detailed benchmarking numbers and scripts are available
[here](https://github.com/keras-team/keras/pull/21641).

### Analysis

Disk space and VRAM usage drop notably across all models: disk savings are
around 50%, and VRAM savings range from 41% to 54%. The reported disk savings
understate the true weight compression because presets also include non-weight
assets.

Perplexity increases by 1% to 10% depending on the model, indicating that model
quality is largely preserved after quantization.
"""

"""
## Practical tips

* GPTQ is a post-training technique; training after quantization is not supported.
* Always use the model's own tokenizer for calibration.
* Use a representative calibration set; small slices are only for demos.
* Start with `weight_bits=4` and `group_size=128`, then tune per model/task.
"""
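
"""
## Illustration: what `weight_bits` and `group_size` control

The `weight_bits`, `group_size`, and `symmetric=False` settings used in the
configuration section can be hard to picture in the abstract. The following is
a minimal NumPy sketch of group-wise asymmetric quantization, offered only as
an illustration and not as the Keras implementation: it uses plain
round-to-nearest and omits the Hessian-based error compensation that defines
GPTQ. The helper names `quantize_groupwise` and `dequantize_groupwise` are
hypothetical, and the random weight matrix is a stand-in for a real layer.

```python
import numpy as np


def quantize_groupwise(weights, weight_bits=4, group_size=128):
    # Asymmetric round-to-nearest quantization with one (scale, zero point)
    # pair per group of `group_size` consecutive input weights.
    # Assumption: this is NOT GPTQ; there is no Hessian-based compensation.
    out_features, in_features = weights.shape
    assert in_features % group_size == 0, "in_features must divide evenly"
    qmax = 2**weight_bits - 1  # 15 for 4-bit codes
    groups = weights.reshape(out_features, in_features // group_size, group_size)
    w_min = groups.min(axis=-1, keepdims=True)
    w_max = groups.max(axis=-1, keepdims=True)
    # Guard against a zero range so the division below stays finite.
    scale = np.maximum(w_max - w_min, 1e-8) / qmax
    zero_point = np.round(-w_min / scale)
    codes = np.clip(np.round(groups / scale) + zero_point, 0, qmax)
    return codes.astype(np.uint8), scale, zero_point


def dequantize_groupwise(codes, scale, zero_point, shape):
    # Map integer codes back to floating point using the per-group parameters.
    return ((codes.astype(np.float32) - zero_point) * scale).reshape(shape)


rng = np.random.default_rng(0)
weights = rng.normal(size=(8, 512)).astype(np.float32)

for group_size in (512, 128, 32):
    codes, scale, zero_point = quantize_groupwise(weights, 4, group_size)
    restored = dequantize_groupwise(codes, scale, zero_point, weights.shape)
    error = float(np.mean((weights - restored) ** 2))
    print(f"group_size={group_size:>3}  mse={error:.6f}  scales stored={scale.size}")
```

On random weights like these, smaller groups should give a lower
reconstruction error while storing more scales and zero points, which is the
trade-off that `group_size` tunes.
"""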