The granite-20b-code-base-sql-gen model card

The granite-20b-code-base-sql-gen foundation model is a 20B-parameter model fine-tuned from Granite-20B-Code-Base on IBM's SQLInstruct dataset, Spider 1.0, and the BIRD training set. The result is a schema-independent SQL generation model that can produce SQL to answer questions over unseen databases.

Usage

This model generates SQL from a natural language question, a relevant schema, and content-linking information, and is intended as the final stage of a Text-to-SQL system. Its primary use case is letting data analysts and business analysts gain insights from data through a conversational interface.
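A Text-to-SQL pipeline therefore has to assemble these pieces into a single prompt before calling the model. The sketch below shows one plausible way to do that; the section layout (note, question, CREATE TABLE schema, sample values and content-linking hints, then the question repeated with "Generate SQL:") mirrors the example in the Generation section, and the helper name build_prompt is our own, not part of any released API.

```python
def build_prompt(question: str, schema_ddl: str, hints: str = "", note: str = "") -> str:
    """Assemble the prompt sections a Text-to-SQL pipeline would supply.

    Hypothetical helper: the exact section order is inferred from the
    worked example in the Generation section of this card.
    """
    parts = []
    if note:
        parts.append(f"Note:\n{note}")
    parts.append(question)
    parts.append(schema_ddl)
    if hints:
        parts.append(hints)
    if note:
        # The example prompt repeats the note before the final question.
        parts.append(f"Note:\n{note}")
    parts.append(f"{question}\nGenerate SQL:")
    return "\n\n".join(parts)
```

The resulting string is what gets passed to the tokenizer in the Generation example below.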

Generation

This is a simple example of how to use the Granite-20B-Code-Base-SQL-Gen model.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
device = "cuda" # or "cpu"
model_path = "your_directory/granite-20b-code-base-sql-gen"
tokenizer = AutoTokenizer.from_pretrained('ibm-granite/granite-20b-code-base')
# drop device_map if running on CPU
model = AutoModelForCausalLM.from_pretrained(model_path, device_map=device)
model.eval()
# change input text as desired
prompt = 'Note:\nthe youngest user refers to MIN(Age)\n\nWhat is the id of the youngest user?\n\nCREATE TABLE users (\n id INTEGER PRIMARY KEY, -- the user id\n displayname TEXT, -- the user\'s display name\n age INTEGER, -- user\'s age\n);\n\nusers.id\nusers.displayname: \'   Raid-Mahbouba\', \'0 kelvin\', \'0012\'\nusers.age: 13, 68, 73\n\nselected: users.id, users.age, None, users.displayname\njoin: None, users.id\ncondition: users.age, None, users.id\norder: users.age, None\ngroup: None, users.id, users.age\n\nComparisons: None\n\nNote:\nthe youngest user refers to MIN(Age)\n\nWhat is the id of the youngest user?\nGenerate SQL:'
# tokenize
input_tokens = tokenizer.batch_encode_plus([prompt], return_tensors="pt", max_length=3000, padding="longest", truncation=True)
# transfer tokenized inputs to the device
for i in input_tokens:
    input_tokens[i] = input_tokens[i].to(device)
# generate output tokens
output = model.generate(**input_tokens, max_new_tokens=300)
# decode output tokens into text
output = tokenizer.batch_decode(output)
# loop over the batch to print, in this example the batch size is 1
for i in output:
    print(i)
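Because the model is a causal language model, the decoded output repeats the prompt before the generated SQL, so a pipeline typically strips the prompt prefix and any trailing special tokens before using the result. A minimal sketch (extract_sql is our own helper name, and the end-of-text marker is an assumption about the tokenizer's special tokens):

```python
def extract_sql(decoded: str, prompt: str, eos: str = "<|endoftext|>") -> str:
    """Strip the echoed prompt and special tokens from a decoded generation."""
    sql = decoded[len(prompt):] if decoded.startswith(prompt) else decoded
    return sql.replace(eos, "").strip()
```

In the example above this would be called as `extract_sql(output[0], prompt)`.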

Training Data

Granite Code text-to-SQL models are trained on IBM's SQLInstruct dataset, Spider 1.0, and the BIRD training set.

Infrastructure

We train the Granite Code models on IBM's Cognitive Compute Cluster (CCC), which uses NVIDIA A100 GPUs and provides a scalable, efficient infrastructure for training our models across multiple GPUs.

Ethical Considerations and Limitations

Granite Code text-to-SQL models are primarily fine-tuned on question/SQL pairs over SQLite databases, so their performance on other SQL dialects may be limited unless post-processing for dialect transpilation is applied. Moreover, developers should perform safety testing and target-specific tuning before deploying these models in critical applications. The model also inherits the ethical considerations and limitations of its base model. For more information, please refer to the Granite-20B-Code-Base model card.
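To illustrate the kind of dialect post-processing meant here, the sketch below rewrites two SQLite-specific function calls into their PostgreSQL equivalents. This is a deliberately tiny, incomplete example: real systems use a full SQL transpiler (such as the sqlglot library), and the rule table and function name here are our own.

```python
import re

# Illustrative, incomplete rewrite rules: SQLite's INSTR and IFNULL map to
# PostgreSQL's STRPOS and COALESCE (same argument order in both cases).
SQLITE_TO_POSTGRES = [
    (re.compile(r"\bINSTR\s*\(", re.IGNORECASE), "STRPOS("),
    (re.compile(r"\bIFNULL\s*\(", re.IGNORECASE), "COALESCE("),
]

def transpile_sqlite_to_postgres(sql: str) -> str:
    """Apply each rewrite rule in turn to the generated SQL string."""
    for pattern, replacement in SQLITE_TO_POSTGRES:
        sql = pattern.sub(replacement, sql)
    return sql
```

A production pipeline would run a step like this on the model's output before executing it against a non-SQLite database.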