Large language models (LLMs) are foundation models that use artificial intelligence (AI), deep learning and massive data sets, including websites, articles and books, to generate text, translate between languages and write many types of content. There are two types of these generative AI models: proprietary large language models and open source large language models.
Proprietary LLMs are owned by a company and can only be used by customers that purchase a license. The license may restrict how the LLM can be used. On the other hand, open source LLMs are free and available for anyone to access, use for any purpose, modify and distribute.
The term “open source” refers to the LLM code and underlying architecture being accessible to the public, meaning developers and researchers are free to use, improve or otherwise modify the model.
What are the benefits of open source LLMs?
Previously it seemed that the bigger an LLM was, the better, but now enterprises are realizing they can be prohibitively expensive in terms of research and innovation. In response, an open source model ecosystem began showing promise and challenging the LLM business model.
Transparency and flexibility
Enterprises that don’t have in-house machine learning talent can use open source LLMs, which provide transparency and flexibility, within their own infrastructure, whether in the cloud or on premises. That gives them full control over their data and means sensitive information stays within their network. All this reduces the risk of a data leak or unauthorized access.
An open source LLM offers transparency regarding how it works, its architecture and training data and methodologies, and how it’s used. Being able to inspect code and having visibility into algorithms allows an enterprise more trust, assists regarding audits and helps ensure ethical and legal compliance. Additionally, efficiently optimizing an open source LLM can reduce latency and increase performance.
They are generally much less expensive in the long term than proprietary LLMs because no licensing fees are involved. However, the cost of operating an LLM does include the cloud or on-premises infrastructure costs, and they typically involve a significant initial rollout cost.
Added features and community contributions
Pre-trained, open source LLMs allow fine-tuning. Enterprises can add features to the LLM that benefit their specific use, and the LLMs can also be trained on specific datasets. Making these changes or specifications on a proprietary LLM entails working with a vendor and costs time and money.
While proprietary LLMs mean an enterprise must rely on a single provider, an open source one lets the enterprise take advantage of community contributions, multiple service providers and possibly internal teams to handle updates, development, maintenance and support. Open source allows enterprises to experiment and use contributions from people with varying perspectives. That can result in solutions allowing enterprises to stay at the cutting edge of technology. It also gives businesses using open source LLMs more control over their technology and decisions regarding how they use it.
What types of projects can open source LLM models enable?
Organizations can use open source LLM models to create virtually any project useful to their employees or, when the open source license allows, that can be offered as commercial products. These include:
Open source LLM models allow you to create an app with language generation abilities, such as writing emails, blog posts or creative stories. An LLM like Falcon-40B, offered under an Apache 2.0 license, can respond to a prompt with high-quality text suggestions you can then refine and polish.
Open source LLMs trained on existing code and programming languages can assist developers in building applications and finding errors and security-related faults.
Open source LLMs let you create applications that offer personalized learning experiences, which can be customized and fine-tuned to particular learning styles.
An open source LLM tool that summarizes long articles, news stories, research reports and more can make it easy to extract key data.
These can understand and answer questions, offer suggestions and engage in natural language conversation.
Open source LLMs that train on multilingual datasets can provide accurate and fluent translations in many languages.
LLMs can analyze text to determine emotional or sentiment tone, which is valuable in brand reputation management and analysis of customer feedback.
Content filtering and moderation
LLMs can be valuable in identifying and filtering out inappropriate or harmful online content, which is a huge help in maintaining a safer online environment.
What kinds of organizations use open source LLMs?
A wide range of organization types use open source LLMs. For example, IBM and NASA developed an open source LLM trained on geospatial data to help scientists and their organizations fight climate change.
Publishers and journalists use open source LLMs internally to analyze, identify and summarize information without sharing proprietary data outside the newsroom.
Some healthcare organizations use open source LLMs for healthcare software, including diagnosis tools, treatment optimizations and tools handling patient information, public health and more.
The open source LLM FinGPT was developed specifically for the financial industry.
Some of the best open source, curated LLMs
The Open LLM Leaderboard aims to track, rank and evaluate open source LLMs and chatbots on different benchmarks.
One well-performing open source LLM with a license that allows agreements for commercial use is LLaMa 2 by Meta AI, which encompasses pre-trained and fine-tuned generative text models with 7 to 70 billion parameters and is available in the Watsonx.ai studio. It’s also available through the Hugging Face ecosystem and transformer library.
Vicuna and Alpaca were created on top of the LLaMa model and, like Google’s Bard and OpenAI’s ChatGPT, are fine-tuned to follow instructions. Vicuna, which outperforms Alpaca, matches GPT-4 performance.
Bloom by BigScience is a multilingual language model created by more than 1,000 AI researchers. It’s the first multilingual LLM trained in complete transparency.
The Falcon LLM from Technology Innovation Institute (TII) can be used with chatbots to generate creative text, solve complex problems and reduce and automate repetitive tasks. Both Falcon 6B and 40B are available as raw models for fine-tuning or as already instruction-tuned models that can be used as-is. Falcon uses only about 75% of GPT-3’s training compute budget and significantly outperforms it.
MPT-7B and MPT-30B are open source LLMs licensed for commercial use from MosaicML (recently acquired by Databricks). MPT-7B matches the performance of LlaMA. MPT-30B outperforms GPT-3. Both are trained on 1T tokens.
FLAN-T5, launched by Google AI, can handle more than 1,800 diverse tasks.
StarCoder from Hugging Face is an open source LLM coding assistant trained on permissive code from GitHub.
RedPajama-INCITE, licensed under Apache-2, is a 6.9B parameter pre-trained language model developed by Together and leaders from various institutions, including the University of Montreal and the Stanford Center for Research on Foundation Models.
Cerebras-GPT from Cerebras is a family of seven GPT models that range from 111 million to 13 billion parameters.
StableLM is an open source LLM from Stability AI, which made the AI image generator Stable Diffusion. It trained on a dataset containing 1.5 trillion tokens called “The Pile” and is fine-tuned with a combination of open source datasets from Alpaca, GPT4All (which offers a range of models based on GPT-J, MPT and LlaMa), Dolly, ShareGPT and HH.
Risks associated with large language models
Although LLM outputs sound fluent and authoritative, there can be risks that include offering information based on “hallucinations” as well as problems with bias, consent or security. Education on these risks is one answer to these issues of data and AI.
Hallucinations, or falsehoods, can result from the LLM being trained on incomplete, contradictory, or inaccurate data or from predicting the next accurate word based on context without understanding meaning.
Bias happens when the source of data is not diverse or representative.
Consent refers to whether the training data was gathered with accountability, meaning it follows AI governance processes that make it compliant with laws and regulations and offers ways for people to incorporate feedback.
Security problems can include leaking PII, cyber criminals using the LLM for malicious tasks such as phishing and spamming, and hackers changing original programming.
Open source large language models and IBM
AI models, particularly LLMs, will be one of the most transformative technologies of the next decade. As new AI regulations impose guidelines around the use of AI, it is critical to not just manage and govern AI models but, equally importantly, to govern the data put into the AI.
To help organizations address these needs and multiply the impact of AI, IBM offers watsonx, our enterprise-ready AI and data platform. Together, watsonx offers organizations the ability to:
Train, tune and deploy AI across your business with watsonx.ai
Scale AI workloads, for all your data, anywhere with watsonx.data
Enable responsible, transparent and explainable data and AI workflows with watsonx.governance
Beyond conversational search, watsonx Assistant continues to collaborate with IBM Research and watsonx to develop customized watsonx LLMs that specialize in classification, reasoning, information extraction, summarization and other conversational use cases. Watsonx Assistant has already achieved major advancements in its ability to understand customers with less effort using large language models.