InstructLab is a method for training AI models designed to significantly improve the large language models (LLMs) used in the development of generative AI (gen AI) applications.
InstructLab was developed by IBM Research and Red Hat. It is an open-source project, meaning it relies on a global community of developers (known as the InstructLab community) to build and maintain it.
The InstructLab project was created to address problems constraining the development of LLMs, most notably the cost and complexity of training and data collection, and the difficulty of contributing new skills and knowledge.
According to Forbes, InstructLab has increased LLM performance and resolved several scaling challenges of traditional LLM training, eliminating the need for enterprises to build and maintain multiple LLMs. This is largely possible because of an LLM training method known as Large-scale Alignment for chatBots, or LAB, developed by IBM.
Today’s most powerful chatbots, like Siri, Alexa and ChatGPT, all depend on LLMs that are pre-trained, allowing them to learn tasks quickly during the AI alignment process. But getting artificial intelligence to that level can be expensive and time-consuming, and the models that emerge often lack the depth necessary to handle complex, nuanced, human-like interactions. According to the IBM Institute for Business Value, executives expect the average cost of computing to climb almost 90%, driven primarily by the demands of building LLMs for gen AI applications.
Large-scale Alignment for chatBots (LAB) is a method of generating data synthetically for specific tasks an organization needs a chatbot to accomplish. Unlike traditional training methods, it enables chatbots to quickly assimilate new information and learn new skills without overwriting things they’ve already learned.
InstructLab’s approach to LLM development and maintenance differs from others in that it puts the process firmly in the hands of a worldwide community of developers, a practice known as open-source AI. Just as open-source software enables developers to contribute to the development of code and features, open-source AI allows them to add new skills and capabilities and rapidly iterate on existing models.
Underpinned by the LAB method, InstructLab’s approach to building LLMs differs from others in three critical ways: taxonomy-driven data curation, large-scale synthetic data generation and iterative, large-scale alignment tuning.
In the training of an LLM, a taxonomy is a hierarchical structure that categorizes the skills and knowledge areas critical to the LLM’s intended application. For example, the taxonomy for an LLM that will be applied to an autonomous vehicle would differ significantly from one applied to medical research, in the same way a race car driver has to learn different skills than a doctor.
InstructLab’s data is structured in a way that makes the model’s existing skills and knowledge base easy to understand. The simplicity of this structure makes it straightforward for developers to identify gaps and fill in knowledge and skills where necessary. This taxonomy-driven data curation also allows models to be targeted at new use cases, like research or a specific Internet of Things (IoT) application, and given the appropriate skills.
Towards this end, InstructLab’s approach relies heavily on YAML (“YAML Ain’t Markup Language,” originally “Yet Another Markup Language”), a standardized format for representing data in a way that’s easy for both humans and machines to interpret. The YAML approach paves the way for the next key step in InstructLab’s process: large-scale synthetic data generation.
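To make the taxonomy idea concrete, here is a minimal Python sketch that parses and checks a contribution written in the qna.yaml style used by the InstructLab taxonomy (seed question-and-answer pairs under a seed_examples key). The example content and the validation rules are illustrative assumptions; the exact schema varies by taxonomy version.

```python
# Minimal sketch of a taxonomy contribution, assuming a qna.yaml-style file
# with seed question-and-answer pairs; exact field names may differ by
# InstructLab taxonomy schema version.
import yaml  # PyYAML

EXAMPLE_QNA = """
created_by: example-contributor
seed_examples:
  - question: What does a "teacher" model do in synthetic data generation?
    answer: It generates labeled examples that a smaller "student" model learns from.
  - question: What format do InstructLab taxonomy contributions use?
    answer: YAML files organized in a hierarchical directory of skills and knowledge.
"""

def validate_contribution(text: str) -> dict:
    """Parse a qna.yaml-style contribution and check the fields this sketch expects."""
    data = yaml.safe_load(text)
    assert "seed_examples" in data, "contribution needs seed question/answer pairs"
    for example in data["seed_examples"]:
        assert "question" in example and "answer" in example
    return data

if __name__ == "__main__":
    contribution = validate_contribution(EXAMPLE_QNA)
    print(f"{len(contribution['seed_examples'])} seed examples loaded")
```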
Once the data for a specific model to train on has been curated, the model is ready to generate its own data based on that training data, a process known as synthetic data generation. What distinguishes InstructLab’s approach to this step in the training of an LLM is the scale on which it is done and the accuracy of the data it can generate. Relying once again on the LAB method, InstructLab’s approach adds an automated step that further refines the answers the LLM generates to ensure their accuracy.
Generating this new data, a step critical to the training of all LLMs, not just InstructLab’s, relies on what’s known as a “teacher” model: a larger model that generates labels and data for a smaller, more efficient “student” model to learn from.
With the LAB method, InstructLab’s LLMs don’t actually use data stored by the teacher model but rather specific prompts that exponentially increase the dataset while simultaneously ensuring that examples generated by the “student” model remain in line with the LLM’s intended purpose.
According to IBM Research, this approach “Systematically generates synthetic data for the tasks you want your chatbot to accomplish, and for assimilating new knowledge and capabilities into the foundation model, without overwriting what the model has already learned.”
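A rough sketch of that teacher-and-student loop, in Python, might look like the following. The teacher model and the quality filter here are stand-in stubs (stub_teacher, stub_filter), not InstructLab’s actual pipeline, which prompts a large teacher LLM and applies automated critique and filtering; the sketch only shows how a small set of seed examples can be grown into a much larger synthetic dataset.

```python
# Schematic sketch of taxonomy-guided synthetic data generation.
# The teacher model and the quality filter are stand-ins; a real pipeline
# prompts a large teacher LLM and applies automated critique steps.
from typing import Callable, Dict, List

QAPair = Dict[str, str]

def expand_seed_examples(
    seeds: List[QAPair],
    teacher: Callable[[QAPair], List[QAPair]],
    keep: Callable[[QAPair], bool],
    rounds: int = 2,
) -> List[QAPair]:
    """Grow a small set of human-written seed pairs into a larger synthetic set.

    Each round, the teacher proposes new pairs modeled on existing ones, and a
    filter discards candidates that fail a quality check.
    """
    dataset = list(seeds)
    for _ in range(rounds):
        candidates = [pair for seed in dataset for pair in teacher(seed)]
        dataset.extend(p for p in candidates if keep(p))
    return dataset

# Stub teacher and filter so the sketch runs end to end.
def stub_teacher(seed: QAPair) -> List[QAPair]:
    return [{"question": seed["question"] + " (variation)",
             "answer": seed["answer"]}]

def stub_filter(pair: QAPair) -> bool:
    return len(pair["answer"]) > 0  # a real pipeline applies much stricter checks

seeds = [{"question": "How do I file an expense report?",
          "answer": "Submit the completed form through the internal portal."}]
print(len(expand_seed_examples(seeds, stub_teacher, stub_filter)))
```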
In the final step of the InstructLab/LAB process, the LLM is retrained on the synthetic data it’s been learning from, refining its skills and improving the accuracy of its answers. This last step is broken into two phases: knowledge tuning, in which the model absorbs the new information generated for it, followed by skills tuning, in which it learns how to apply that knowledge to the tasks it’s meant to perform.
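The two-phase idea can be sketched in a few lines of Python. This is an illustrative sketch, not InstructLab’s training code: train() is a placeholder for a real supervised fine-tuning run, and replaying a portion of the earlier knowledge data during skills tuning is one simple way to reflect the goal, described above, of adding new learning without overwriting old.

```python
# Illustrative sketch of phased alignment tuning: tune first on
# knowledge-oriented samples, then on skills-oriented samples, replaying some
# earlier data so new learning doesn't overwrite old. train() is a placeholder.
import random
from typing import List

def train(model: str, batch: List[str]) -> str:
    # Placeholder: a real implementation would run supervised fine-tuning here.
    return f"{model}+{len(batch)}samples"

def phased_tuning(model: str, knowledge: List[str], skills: List[str],
                  replay_fraction: float = 0.2) -> str:
    # Phase 1: knowledge tuning.
    model = train(model, knowledge)
    # Phase 2: skills tuning, mixed with a replay sample of phase-1 data.
    replay = random.sample(knowledge, max(1, int(len(knowledge) * replay_fraction)))
    model = train(model, skills + replay)
    return model

print(phased_tuning("student-model", ["fact-1", "fact-2", "fact-3"], ["skill-1"]))
```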
LLMs trained with more traditional methods typically use a process called retrieval-augmented generation (RAG) to supplement their knowledge with more focused, domain-specific information at inference time. RAG is a useful tool for organizations that need to add proprietary data to an existing base model for a specific purpose without giving up control over that proprietary data.
The InstructLab/LAB method can serve the same purpose as a more traditional RAG process, but rather than supplying existing, specific knowledge at query time, it focuses on contributions from its end-user community to build relevant knowledge and skills into the model itself. Organizations seeking to fine-tune LLMs for a specific purpose can use RAG and InstructLab/LAB together to achieve ideal outcomes.
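For contrast, the RAG pattern mentioned above can be illustrated in a few lines. This is a toy sketch of RAG in general, not of any InstructLab component: word overlap stands in for a real embedding-based vector search, and generate() is a placeholder for an actual LLM call.

```python
# Toy illustration of retrieval-augmented generation (RAG): retrieve the most
# relevant proprietary document and prepend it to the prompt, so the base model
# answers with information it was never trained on.
from typing import List

DOCUMENTS = [
    "Policy A-12: customers may return items within 30 days with a receipt.",
    "Policy B-7: premium members receive free shipping on all orders.",
]

def retrieve(query: str, documents: List[str]) -> str:
    """Return the document sharing the most words with the query (toy scoring)."""
    query_words = set(query.lower().split())
    return max(documents, key=lambda d: len(query_words & set(d.lower().split())))

def generate(prompt: str) -> str:
    return f"[LLM would answer based on]: {prompt}"  # placeholder for a model call

def rag_answer(query: str) -> str:
    context = retrieve(query, DOCUMENTS)
    return generate(f"Context: {context}\nQuestion: {query}")

print(rag_answer("How many days do customers have to return an item?"))
```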
As AI applications become more demanding, the LLMs that support them are getting larger and more complex, placing ever more rigorous demands on the underlying AI infrastructure. InstructLab/LAB, like other advanced model-training methods, depends on GPU-intensive infrastructure capable of meeting the performance benchmarks needed to continually retrain AI models on the contributions from its global open-source community at github.com/instructlab.
Fortunately, IBM is dedicated to providing all necessary data storage, management, workflows and practices for the success of LLM projects.
Today, LLMs underpin the most exciting AI use cases, from generative AI chatbots and coding assistants to edge computing, Internet of Things (IoT) applications and more. They can be proprietary models, like OpenAI’s GPT models and Anthropic’s Claude, or models that rely on open-source principles for the pretraining data they use, like Mistral, Llama 2 and IBM’s Granite models.
InstructLab excels in its capacity to match and even exceed the performance of proprietary models using publicly available ones. IBM watsonx, an AI and data platform designed to help businesses scale and accelerate the impact of AI, relies on it extensively. For example, Merlinite-7B, a recent LAB-trained model, outperformed several proprietary models in key areas, according to an IBM Research paper.
To meet the requirements of advanced generative AI applications, developers often rely on an existing LLM that they adapt to meet a specific business need. Take, for example, an insurance company seeking to build a gen AI application to help employees glean insights from proprietary customer data. Today, they would probably purchase an existing LLM built for chatbots and modify it according to their needs. But this approach has several important limitations, chief among them the cost and complexity of fine-tuning, the amount of human-generated data required and the difficulty of adding new knowledge and skills without overwriting what the model has already learned.
The InstructLab method can train LLMs using fewer human-generated inputs and far fewer computing resources. The foundation of the training method behind most modern LLMs, especially the ones that underpin powerful chatbots, is extensive pretraining on large datasets of unstructured text. While this approach enables LLMs to acquire new skills relatively quickly in the alignment stage, it’s costly and requires extensive human input.
The LAB approach, developed by IBM Research, uses taxonomy-guided synthetic data generation to reduce cost and the need for human input. Coupled with InstructLab’s open-source, community-driven approach to development, this approach effectively democratizes the development of LLMs needed for generative AI applications.
InstructLab’s command-line interface (CLI), the set of commands developers use to work with it, is even built to run on widely used devices like personal laptops, and developers are encouraged to contribute new knowledge or skills via the AI community Hugging Face.
InstructLab takes an open-source, community-based approach to fine-tuning LLMs for a wide range of use cases. Here are a few of the most common.
LLMs developed using the InstructLab approach can be trained to acquire new skills and knowledge for many applications in the healthcare industry, from scouring volumes of clinical data to help scientists make breakthroughs in medical research to assessing patient risk from medical history and more.
In banking, the InstructLab approach can build LLMs with an emphasis on trade analysis and model projection to help spot trends and forecast risk associated with trading strategies. It can also be used to train LLMs for gen AI applications in personal finance, such as saving for retirement, budgeting and more.
LLMs trained using the InstructLab approach can power intelligent chatbots trained in specific areas of customer service, such as returning an item or requesting a specific product. Beyond that, the LAB method can help fine-tune LLMs to be virtual assistants with a complex set of skills, like scheduling appointments, booking travel, filing taxes and more.
The InstructLab method helps fine-tune LLMs behind gen AI applications in marketing for a variety of purposes. They can learn to scour customer data for insights into behavior, product preference and even future product design. They can also acquire the necessary skills to offer tailored product advice, such as shoe or clothing size, color preference and more.
Applying the InstructLab method to train LLMs for the DevOps lifecycle can benefit developers in several important ways. LLMs trained using the InstructLab method can generate code and scripts, automate infrastructure provisioning through Infrastructure as Code (IaC) applications, and shorten and improve routine problem-solving, troubleshooting and even code analysis and review.