IBM Granite 13b foundation model version 1 model card
Granite 13 Billion Model (granite.13b) Details
IBM Generative AI Large Language Foundation Models are enterprise-level English-language models trained on enterprise-relevant datasets from five domains – internet, academic, code, legal and finance – all of which have been curated for business use by IBM. All datasets were scrutinized to exclude objectionable content and benchmarked against internal and external models to enable responsible deployment and to address key issues including governance, risk assessment, privacy concerns, and bias mitigation. The Granite Base 13 Billion (granite.13b.base) V1.0 model was trained on over 1T tokens. This is the base model, from which other variants were fine-tuned to target downstream tasks. The Granite family of models will support all five language tasks (Q&A, Generate, Extract, Summarize, and Classify).
Besides the base variant (granite.13b.base), two other models have been fine-tuned. The first variant (granite.13b.instruct) is an instruction-tuned Supervised Fine-Tuning (SFT) model, which was further tuned using a combination of the Flan Collection, 15K samples from Dolly, Anthropic's human preference data about helpfulness and harmlessness, Instructv3, and internal synthetic datasets specifically designed for summarization and dialog tasks (~700K samples). The second variant (granite.13b.chat) is a Contrastive Fine-Tuning (CFT) variant, which uses a new objective called unlikelihood training that penalizes unlikely generations by assigning them a lower probability.
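To make the objective concrete, here is a minimal sketch of token-level unlikelihood training in the spirit of the description above. It is not IBM's implementation: the choice of negative candidates, the weighting term `alpha`, and the tensor shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def unlikelihood_loss(logits, target_ids, negative_ids, alpha=1.0):
    """Standard next-token loss plus an unlikelihood penalty.

    logits:       (seq_len, vocab_size) model outputs for one sequence
    target_ids:   (seq_len,) gold next tokens (likelihood term)
    negative_ids: (seq_len, k) candidate tokens to penalize at each step
    alpha:        weight of the unlikelihood term (assumed, not Granite's value)
    """
    log_probs = F.log_softmax(logits, dim=-1)

    # Likelihood term: the usual cross-entropy on the gold next token.
    mle = F.nll_loss(log_probs, target_ids)

    # Unlikelihood term: maximize log(1 - p(c)) for each negative candidate c,
    # i.e. explicitly push down the probability of unwanted generations.
    neg_probs = log_probs.gather(1, negative_ids).exp()
    ul = -torch.log((1.0 - neg_probs).clamp(min=1e-6)).mean()

    return mle + alpha * ul
```

In this formulation the model keeps its usual language-modeling loss but is additionally penalized whenever it puts probability mass on the listed negative candidates, which is what assigning a lower probability to unlikely generations amounts to in practice.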
The table below lists current official variants released by IBM Research. The Massive Multitask Language Understanding (MMLU) benchmark is used to show the performance of each variant.
Variant | Description / Intended Use | Pre-training Data Seen | MMLU (5-shot) |
---|---|---|---|
granite.13b.instruct | This variant is a Supervised Fine-Tuned (SFT) version of the base model, tuned to improve its instruction-following. It was tuned using a mix of FLAN and other datasets (Dolly, HHRLHF, IBM internal datasets, etc.). This model is intended as a starting point to help bootstrap further downstream alignment or task-specific tuning. | 1000B Tokens | 42.05 |
granite.13b.chat | This variant is a further-aligned version of the granite.13b.instruct variant. It was aligned using Contrastive Fine-Tuning (CFT) to improve its harmlessness and the quality of its generated responses. This model should be used when looking to prompt-engineer out of the box, particularly when longer responses are desired. It may also be helpful as a starting point for further downstream fine-tuning. | 1000B Tokens | 42.07 |
- Person or organization developing the model:
- Granite (13B) was developed by IBM Research.
- Model release date and version:
- Granite (13B) version 1.0 was released on 09/21/2023.
- Model type:
- Granite (13B) V1.0 is a large decoder-only transformer model.
- The following features were used in the design of the model:
- Decoder-only model
- Multi-Query Attention (see the sketch after this list)
- 50K GPT-NeoX tokenizer
- Flash Attention
- 8k context length
- Absolute (learnt) position embeddings
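Of the features listed above, multi-query attention is the one most worth illustrating: every query head shares a single key/value head, which shrinks the KV cache and speeds up decoding at long (e.g., 8k) context lengths. The sketch below is a minimal, assumed illustration of the idea rather than Granite's actual code; the hidden size, head count, and the omission of dropout and biases are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiQueryAttention(nn.Module):
    """Minimal multi-query attention: many query heads, one shared K/V head."""

    def __init__(self, d_model=512, n_heads=8):   # sizes are illustrative, not Granite's
        super().__init__()
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model)            # one projection per query head
        self.kv_proj = nn.Linear(d_model, 2 * self.d_head)   # a single shared K and V head
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x):                                     # x: (batch, seq, d_model)
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        k, v = self.kv_proj(x).split(self.d_head, dim=-1)
        # Broadcast the single K/V head across every query head.
        k = k.unsqueeze(1).expand(b, self.n_heads, t, self.d_head)
        v = v.unsqueeze(1).expand(b, self.n_heads, t, self.d_head)
        # Causal (decoder-only) scaled dot-product attention; recent PyTorch
        # versions dispatch this to a FlashAttention-style fused kernel.
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.out_proj(out.transpose(1, 2).reshape(b, t, -1))

# usage: MultiQueryAttention()(torch.randn(2, 16, 512)).shape -> torch.Size([2, 16, 512])
```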
- Information about training algorithms, parameters, fairness constraints or other applied approaches, and features:
- Model was trained using 4x Tensor Parallel + 4x Pipeline Parallel + 32x ZeRO-1 + Sequence Parallel + Flash Attention using a fork of Megatron-LM.
- Cluster: CCC
- GPUs: 256x A100 80GB
- Interconnect: 200 gigabit Infiniband
- Dataset streamed over GPFS
- Paper or other resource for more information:
- Granite Paper: https://www.ibm.com/downloads/cas/X9W4O6BM
- License:
- Available only through IBM products and offerings. Contact IBM for licensing terms.
Intended Use
- Primary intended uses:
- .chat / .instruct: The Granite series of models is a family of IBM-trained decoder-only models used for text generation, summarization, question and answer, classification, and extraction.
- base: The base model is primarily intended for fine-tuning on downstream language tasks.
- Primary intended users:
- The primary users are IBM Enterprise clients looking to bolster their portfolios with Enterprise-level generative AI models.
- Out-of-scope use cases:
- The granite.13b models are not designed, tested, or supported for code use cases of any kind.
Factors
- Relevant factors: Models work with proper English text. All datasets have been cleansed of any type of tagging (e.g., HTML), and all media has been removed as well.
- Evaluation factors: Evaluation datasets must be proper English and are limited to text only.
Metrics
IBM has built a comprehensive test framework, FM eval, that is used throughout the model's life cycle. It can be used both to evaluate IBM's own models and models trained by third parties, allowing models to be measured against a variety of benchmarks. The evaluation framework runs on an OpenShift cluster with GPU support and uses several AI model evaluation frameworks: EleutherAI's Language Model Evaluation Harness (lm-eval), Stanford's Holistic Evaluation of Language Models (HELM), the Beyond the Imitation Game Benchmark (BIG-bench), as well as IBM-internal datasets.
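As a concrete, hedged example of how one of these harnesses is typically driven (this is not FM eval itself), the snippet below scores a local checkpoint on 5-shot MMLU with EleutherAI's lm-eval (v0.4+ API). The checkpoint path is a placeholder, since Granite weights are only distributed through IBM products and offerings.

```python
# Illustrative use of EleutherAI's lm-evaluation-harness, one of the frameworks
# listed above -- not IBM's FM eval pipeline itself.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                             # Hugging Face backend
    model_args="pretrained=/path/to/granite.13b.instruct",  # placeholder local path
    tasks=["mmlu"],                                         # benchmark used in the table above
    num_fewshot=5,                                          # matches the 5-shot scores reported
)
print(results["results"])
```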
Performance Metrics
The evaluation of the Granite 13 Billion (13B) variants can be found in the Granite Paper: https://www.ibm.com/downloads/cas/X9W4O6BM
Data, Limitations, and Recommendations
- Data selection for training:
- The Granite Base (13B) V1.0 model was trained using IBM's curated pre-training dataset. A breakdown of the datasets sampled for training is shown in the table below.
Data Composition and Sampling
Dataset sampling for Granite Base (13B) V1.0
Dataset | Description |
---|---|
Common Crawl | Open repository of web crawl data. |
Webhose | Unstructured web content converted into machine-readable data feeds acquired by IBM. |
arXiv | Over 1.8 million scientific paper pre-prints posted to arXiv. |
wikimedia | Eight English Wikimedia projects (enwiki, enwikibooks, enwikinews, enwikiquote, enwikisource, enwikiversity, enwikivoyage, enwiktionary), containing extracted plain text from pages and articles. |
OpenWeb Text | Open-source version of OpenAI’s Web Text corpus containing web pages through 2019. |
Stack Exchange | Anonymized set of all user-contributed content on the Stack Exchange network, a popular collection of websites centered around user-contributed questions and answers. |
Hacker News | News on computer science and entrepreneurship, collected between 2007 and 2018. |
Project Gutenberg PG19 | A repository of free e-books with a focus on older works for which U.S. copyright has expired. |
GitHub Clean | Code data from CodeParrot covering a variety of coding languages. |
Pubmed Central | Biomedical and life sciences papers. |
Free Law | Public-domain legal opinions from US federal and state courts. |
SEC Filings | 10-K/Q filings from the US Securities and Exchange Commission (SEC) for the years 1934-2022. |
USPTO | US patents granted from 1975 to May 2023, excluding design patents. |
DeepMind Mathematics | Mathematical question and answer pairs data. |
- Tokenizer used:
- GPT-NeoX 20B (see the sketch below)
- 1.03T tokens
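As a small, hedged illustration of the tokenizer noted above, the snippet below loads the publicly available GPT-NeoX-20B tokenizer from Hugging Face and inspects its roughly 50K-entry vocabulary. This mirrors the tokenizer family named in the model features but is not IBM's data-processing pipeline.

```python
# Illustrative only: inspect the public GPT-NeoX-20B tokenizer, not IBM's pipeline.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")
print(tok.vocab_size)                                          # ~50K entries
print(tok("Granite is a decoder-only transformer.").input_ids)
```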
Dataset sampling for Granite Instruct (13B) V1.0
The Granite Instruct model was initialized from Granite 13B Base and Supervised Fine-Tuned (SFT) with a mixture of datasets from different sources. The SFT data includes a subset of the Flan Collection, 15K samples from Dolly, Anthropic's human preference data about helpfulness and harmlessness, Instructv3, and internal synthetic datasets specifically designed for summarization and dialog tasks.
Dataset sampling for Granite Chat (13B) V1.0
The Granite Chat model was initialized from Granite 13B Instruct and fine-tuned with a mixture of instruction-tuning datasets. It was aligned using Contrastive Fine-Tuning (CFT) to improve its harmlessness and the quality of its generated responses. The datasets for CFT are paired samples from Anthropic's human preference data about helpfulness and harmlessness that have been filtered using the OpenAssist reward model, samples from Dolly, and samples from ProsocialDialog.