Date: 2024-10-30

Imagine a company hires a new employee. Their resume is excellent and they complete all of their tasks quickly and efficiently. Their work is technically getting done, but is it getting done well? Is it high quality, accurate and reliable?

Managers take time to review any new hire’s work to make sure that it meets company standards and performs appropriately. As artificial intelligence (AI) plays a larger role in business output and decisions, companies need to do the same for large language models.

Large language models (LLMs) are foundation models that are trained on immense amounts of data and used for tasks that are related to understanding and generating text. For example, this type of AI system is especially useful for work such as content creation, summarization and sentiment analysis.

LLMs revolutionized the field of natural language processing (NLP) and brought generative AI into the public eye in new ways. OpenAI’s GPT-3 and GPT-4, along with Meta’s Llama, are among the best-known examples, but a wide range of LLMs is used in various domains. LLMs power AI tools such as chatbots, virtual assistants, language translation tools and code generation systems.

As LLM applications are adopted more broadly, especially for use in high-stakes industries such as healthcare and finance, testing their output is increasingly important. That’s where LLM evaluation comes in.