21 October 2024
In this tutorial, we'll quantize a pre-trained model, the new IBM® Granite-3.0-8B-Instruct, in a few different ways to compare the sizes of the resulting models and how they perform on a task.
Quantization of large language models (LLMs) is a model optimization technique that reduces memory footprint and latency at the cost of some model accuracy. Large transformer-based models such as LLMs often require significant GPU resources to run, whereas a quantized model can run machine learning inference on limited GPUs or even on a CPU. Frameworks such as TensorFlow Lite (tflite) can run quantized TensorFlow models on edge devices, including phones and microcontrollers. In the era of ever larger LLMs, quantization is an essential technique during the training, fine-tuning and inference stages of modeling. It is especially helpful for users who want to run models locally on limited hardware, and low-resource devices that include a machine learning accelerator can run quantized models very efficiently.
Quantization is the process of mapping input values from a large set of continuous elements to a smaller set with a finite number of elements. Quantization methods have evolved rapidly and remain an area of active research. A simple example is integer quantization, which scales 32-bit floating point (f32) numbers to 8-bit integers (int8); when the mapping also uses an integer offset to represent zero, the technique is often called zero-point quantization. More sophisticated techniques use FP8, an 8-bit floating point format whose dynamic range can be tuned to the model. In this tutorial, we'll use k-means quantization to create very small models. That saves us from needing to do model calibration or the time-intensive step of creating an importance matrix that defines the importance of each activation in the neural network.
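To make the zero-point idea concrete, here is a minimal NumPy sketch that maps float32 values to unsigned 8-bit integers and back. It is purely illustrative and is not how llama.cpp implements its quantization kernels; the function and variable names are our own.

import numpy as np

def quantize_int8(x):
    # Map float32 values onto the 256 levels of an unsigned 8-bit integer.
    x_min, x_max = float(x.min()), float(x.max())
    scale = (x_max - x_min) / 255.0           # width of one quantization step
    zero_point = int(round(-x_min / scale))   # the integer that represents 0.0
    q = np.clip(np.round(x / scale) + zero_point, 0, 255).astype(np.uint8)
    return q, scale, zero_point

def dequantize_int8(q, scale, zero_point):
    # Approximate reconstruction of the original float32 values.
    return (q.astype(np.float32) - zero_point) * scale

weights = np.random.randn(4, 4).astype(np.float32)
q, scale, zp = quantize_int8(weights)
error = np.abs(weights - dequantize_int8(q, scale, zp)).max()
print(q.nbytes, weights.nbytes, error)  # uint8 uses a quarter of the memory, with a small error

The reconstruction error is bounded by about half a quantization step, which is the basic trade-off all of the schemes in this tutorial make in one form or another.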
We'll focus on post-training quantization (PTQ), which decreases the precision (and thus the resource demands) of a model after it has been trained. Quantization-aware training (QAT) is a common technique for mitigating the accuracy and perplexity degradation that quantization introduces, but it is more advanced and has more limited use cases. In particular, we'll use k-means quantization via llama.cpp, an open source library that can convert and quantize PyTorch models.
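As a conceptual sketch of the k-means idea (not what llama.cpp does internally), the weights of a layer can be clustered into 2^bits centroid values so that each weight is stored as a tiny integer index into a shared codebook. The toy NumPy code below is only an illustration of that principle.

import numpy as np

def kmeans_quantize(weights, bits=2, iters=20):
    # Cluster the weights into 2**bits centroid values (the "codebook").
    flat = weights.ravel().astype(np.float32)
    k = 2 ** bits
    centroids = np.linspace(flat.min(), flat.max(), k)  # simple initialization
    for _ in range(iters):
        idx = np.abs(flat[:, None] - centroids[None, :]).argmin(axis=1)
        for c in range(k):
            if np.any(idx == c):
                centroids[c] = flat[idx == c].mean()
    # Each weight is now just a small index; the codebook holds the float values.
    return idx.astype(np.uint8).reshape(weights.shape), centroids

weights = np.random.randn(64, 64).astype(np.float32)
indices, codebook = kmeans_quantize(weights, bits=2)
approx = codebook[indices]              # dequantize by looking each index up
print(indices.nbytes, weights.nbytes)   # 2-bit indices could be packed 4 per byte for further savings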
When working with LLMs, model quantization allows us to convert the high-precision floating-point numbers in the neural network layers to low-precision numbers that consume much less space. We'll be converting models to GPT-Generated Unified Format (GGUF) so that we can run them efficiently in constrained-resource scenarios. GGUF is a binary format optimized for quick loading and saving of models, which makes it efficient for inference. It achieves this efficiency by packaging the model parameters (weights and biases) together with the metadata needed to execute the model. Because it's compatible with various programming languages such as Python and R, and supports fine-tuning so users can adapt LLMs to specialized applications, it has become a popular format.
In this tutorial, we'll quantize the IBM® Granite-3.0-8B-Instruct model in a few different ways, compare the sizes of the resulting models and see how each performs on a task. To view more Granite tutorials, check out the IBM Granite Community. This tutorial is also available on GitHub.
The steps below can be run in a terminal on macOS or Linux, or in VS Code on Windows.
First, we need to create a Python virtual environment to keep our libraries isolated.
python3 -m venv .
Next, we'll want to install the open source HuggingFace Hub library so that we can use its API to download the Granite model files.
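Inside the virtual environment we just created, that install can be done with pip, for example as follows; the package name on PyPI is huggingface_hub.

./bin/pip install huggingface_hub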
Next, either save the following script to a file and run it, or simply start a Python3 session and run it there.
from huggingface_hub import snapshot_download
Now we can copy the files into our local directory for easier access.
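If you would rather have the script download straight into a local folder and skip the copy step, a minimal sketch looks like the following; the Hugging Face repo id and the local_dir value here are assumptions, so adjust them to the model revision you actually want.

from huggingface_hub import snapshot_download

# Assumed repo id for Granite-3.0-8B-Instruct; verify it on Hugging Face before running.
snapshot_download(
    repo_id="ibm-granite/granite-3.0-8b-instruct",
    local_dir="granite-3.0-8b-instruct",
)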
Next up, we need to install llama.cpp at a system level. The instructions to build it from source are rather involved but are well documented in the llama.cpp repository.
Alternatively, on macOS you can use Homebrew.
brew install llama.cpp
On Mac and Linux, the Nix package manager can be used.
nix profile install nixpkgs#llama-cpp
You can also look for prebuilt binaries on the GitHub Releases page.
Once we have llama.cpp installed, we can install the libraries to run the llama-cpp scripts in the virtual environment.
./bin/pip install 'llama-cpp-python[server]'
We need to get the entire repository for llama.cpp in order to convert models to GGUF format.
git clone https://github.com/ggerganov/llama.cpp
Now we install libraries that we'll need in order to run the GGUF conversion:
bin/pip install -r llama.cpp/requirements.txt
Now convert the model to GGUF using a script inside the repository.
bin/python3 llama.cpp/convert_hf_to_gguf.py granite-3.0-8b-instruct
This gives us a new GGUF file based on our original model files.
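If you want to control the name and precision of that file, convert_hf_to_gguf.py accepts --outfile and --outtype flags in recent llama.cpp versions (check --help on your copy); the output file name below is just an example.

bin/python3 llama.cpp/convert_hf_to_gguf.py granite-3.0-8b-instruct \
    --outfile granite-3.0-8b-instruct/granite-3.0-8B-instruct-F16.gguf \
    --outtype f16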
Now we're ready to quantize. Common quantization schemes supported by GGUF include Q2_K, Q3_K_M, Q4_K_M, Q5_K_M, Q6_K and Q8_0, where the number roughly indicates how many bits each weight uses.
For the smallest possible model, we can use 2 bit quantization.
/opt/homebrew/bin/llama-quantize \
For a medium-sized quantized model, we can use 4 bit quantization.
/opt/homebrew/bin/llama-quantize \
For a larger quantized model on a machine with more resources, we can use 6 bit quantization.
/opt/homebrew/bin/llama-quantize \
As a size comparison, Q2_K has a size of 3.17 GB, Q4_K_M is 5.06 GB, and finally, Q6_K takes up 6.87 GB.
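The general form of each of those commands is llama-quantize <input GGUF> <output GGUF> <quantization type>. A sketch of the Q4_K_M step, assuming the F16 file name from the conversion example above, would look like this; swap in whatever your converter actually produced.

/opt/homebrew/bin/llama-quantize \
    granite-3.0-8b-instruct/granite-3.0-8B-instruct-F16.gguf \
    granite-3.0-8b-instruct/granite-8B-instruct-Q4_K_M.gguf \
    Q4_K_M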
Each of these steps may take up to 15 minutes, but when they're done, we have multiple versions of the model that we can compare.
We could run the model in llama.cpp with the following command.
llama-server -m granite-3.0-8b-instruct/granite-8B-instruct-Q4_K_M.gguf --port 8080
This allows you to open a webpage at localhost:8080 which hosts an interactive session that runs Granite Instruct as a helpful assistant.
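Recent llama.cpp server builds also expose an OpenAI-compatible HTTP API on the same port, so you can query the model programmatically as well; a quick check from another terminal might look like this (endpoint path per the llama.cpp server documentation).

curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"messages": [{"role": "user", "content": "What does quantization do to a model?"}]}'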
We could also add the file to ollama. To do this, first we need to create a modelfile. Open a text editor and enter the following:
# Modelfile (FROM should point at the GGUF created earlier; adjust the path if yours differs)
FROM ./granite-3.0-8b-instruct/granite-8B-instruct-Q4_K_M.gguf
Save this file as GCI_8b_modelfile_Q4 and start ollama. Then load the model into ollama:
ollama create Granite_Instruct:8b_Q4 -f GCI_8b_modelfile_Q4
Now, you're ready to run the quantized model.
ollama run Granite_Instruct:8b_Q4
We repeat the process with two other modelfiles, one for the 2 bit and one for the 6 bit quantization.
Create a similar modelfile whose FROM line points at the Q2_K GGUF, save it as GCI_8b_modelfile_Q2 and load it into ollama with the following command.
ollama create Granite_Instruct:8b_Q2 -f GCI_8b_modelfile_Q2
One last modelfile for the Q6 version:
# Modelfile (adjust the FROM path to match the name of your Q6_K GGUF)
FROM ./granite-3.0-8b-instruct/granite-8B-instruct-Q6_K.gguf
Save this as GCI_8b_modelfile_Q6 and load it into ollama with the following command.
ollama create Granite_Instruct:8b_Q6 -f GCI_8b_modelfile_Q6
Now, we're ready to compare our models.
Let's compare how the different versions do by asking the model to fix bugs in some Python code as a validation exercise. First, we'll load the 2 bit version to use that as a baseline.
ollama run Granite_Instruct:8b_Q2
Once that's running, we can paste our prompt.
>>> Fix any errors in the following Python function and format it correctly:
With the 2 bit quantization, the model doesn't perform quite as well as we might hope. It misses that the equations for both the area and the perimeter of the circle are wrong. It also doesn't explain any of the errors.
I have fixed the errors in your Python code and formatted it as follows:
Now let's load the Q4.
ollama run Granite_Instruct:8b_Q4
Copying and pasting the previous prompt, we get a much better response, although the third item doesn't make sense.
There are a few errors in the provided Python code:
Finally, we can load the 6 bit quantized model:
ollama run Granite_Instruct:8b_Q6
This answer is much better than either of the two previous versions. It catches all of the bugs in the code, gives us the correct formula for both area and perimeter, and correctly changes the last call to the print_circle_properties method.
The provided Python function contains two errors:
We can see that the 2 bit quantized model does save space but is less adept at picking out, fixing and explaining the errors in our code. The 4 bit model corrects all of the code errors but doesn't fully explain its fixes. The 6 bit model corrects all of the errors and explains them correctly and in greater detail.
In this tutorial, you learned about quantizing models and used several different quantization methods to create differently sized models with llama.cpp. Then you learned how to run them locally using both llama.cpp and ollama. Finally, you tested three different versions of the same IBM® Granite-3.0-8B-Instruct model to see how each responded to a prompt to fix bugs in Python code.