March 22, 2024 in Large Language Models

Three Essential Elements of the LLMOps Tech Stack

And 10 steps to ensure LLM app quality

Anupam Datta


https://doi.org/10.1287/LYTX.2024.02.01

Key Takeaways

Large language models (LLMs) are an exciting new frontier for delivering services but also require a new kind of tech stack that supports app development, deployment and maintenance, with a focus on quality and accuracy.
The three essential elements of the LLMOps tech stack are (1) observability, (2) compute and (3) storage.
There are 10 typical workflow steps that need to occur, in order, to ensure LLM app quality in the real world.

Large language models (LLMs) are proliferating and have the potential to change the world. The technology is being applied in many areas of our lives, from healthcare to customer service.

LLMs are an exciting new frontier for delivering services, but they also require a new kind of tech stack that supports app development, deployment and maintenance, with a focus on quality and accuracy. For the past year, I’ve been working directly with developers, helping them build out their infrastructure for LLMs. We’ve learned that the three essential elements of the LLMOps tech stack are (1) observability, (2) compute and (3) storage (see Figure 1). All three are required for LLM apps to function successfully.

Observability consists of testing, debugging and monitoring.
Compute consists of training, experimentation, model serving, fine-tuning, prompt engineering and foundation model application programming interfaces (FM APIs).
Storage consists of a model repository, a feature store, a vector database and a data lake/warehouse.
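
As a rough illustration, the three layers and their components can be written down as a simple lookup table. The names `LLMOPS_STACK` and `layer_of` are hypothetical and used here only for illustration; the component lists are taken directly from the text above.

```python
# The three layers of the LLMOps tech stack and their components,
# as listed in the article. The structure itself is only a sketch.
LLMOPS_STACK = {
    "observability": ["testing", "debugging", "monitoring"],
    "compute": [
        "training",
        "experimentation",
        "model serving",
        "fine-tuning",
        "prompt engineering",
        "FM APIs",
    ],
    "storage": [
        "model repository",
        "feature store",
        "vector database",
        "data lake/warehouse",
    ],
}


def layer_of(component: str) -> str:
    """Return the name of the stack layer that contains the given component."""
    for layer, components in LLMOPS_STACK.items():
        if component in components:
            return layer
    raise KeyError(f"unknown component: {component}")
```

For example, `layer_of("vector database")` returns `"storage"`, which is a quick way to check which layer a tool under consideration belongs to.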

Figure 1: The LLMOps tech stack.

Spanning these three layers are 10 typical workflow steps that need to occur, in order. I’ve seen firsthand how these steps combine to ensure LLM app quality in the real world:

Foundation model training: This is composed of generative pretraining and supervised learning. During generative pretraining, an LLM is trained on vast amounts of data to predict the next word after a sequence of text. This allows generative language models to be proficient at producing human-like text. The training process is accelerated using advanced hardware and software stacks, from vendors such as Intel and NVIDIA, that enable massive parallelization. The next step is supervised learning, in which language models are trained on specific examples of human-provided prompts and responses to guide their behavior.
Data preparation: There are several flavors of data preparation tasks, each with associated tools. LLM creators such as OpenAI, Google and Meta prepare data for generative pretraining – often very large volumes of unlabeled data. More recently, carefully curated, smaller data sets have been used to train LLMs that are orders of magnitude smaller than state-of-the-art models but competitive on certain tasks. These data sets could, for example, include a set of prompts and responses to fine-tune a chatbot for a domain-specific use case such as customer service for e-commerce. LLM applications can also leverage retrieval-augmented generation (RAG), which involves augmenting LLMs with a knowledge base of documents that serve as a source of truth and can be queried. One example of this paradigm is Morgan Stanley’s use of OpenAI models to create a chatbot for its financial advisors.
Vector database index construction: RAG applications, such as the Morgan Stanley wealth management chatbot, require the knowledge base of documents to be split into chunks, converted into embeddings and stored in a vector database, which is indexed to support querying. Vector databases, such as Pinecone and Weaviate, are thus seeing rapid adoption.
Model fine-tuning: LLMs need to be fine-tuned, often on private data held by enterprises and small businesses. Such data could, for example, include a set of prompts and responses to fine-tune a chatbot for a domain-specific use case such as customer service for e-commerce. LLM providers such as OpenAI and Google are increasingly making fine-tuning APIs available for their models. Fine-tuning APIs are also available for open-source LLMs hosted on services such as Amazon SageMaker JumpStart.
App creation: LLMs are often connected to other tools, such as vector databases, search indices or other APIs. RAG apps, described earlier in this list, and agents are two popular classes of LLM applications. Building by chaining has emerged as a popular paradigm, with tools like Haystack, LlamaIndex and LangChain seeing widespread developer adoption.
Prompt engineering, tracking, collaboration: When developers create prompts tailored for a specific use case, the process often involves experimentation: the developer creates a prompt, observes the results and then iterates on the prompt to improve the effectiveness of the app. Tools such as TruLens and W&B Prompts help developers with this process by tracking prompts, responses and intermediate results of apps and enabling this information to be shared across developer teams.
Evaluation and debugging: Systematic evaluation and debugging of LLMs and LLM apps based on RAG and agents are essential before they are moved into production. A first step is often to use human evaluations and benchmark data sets to evaluate LLMs. Although useful, these methods do not scale. Recent work, including OpenAI Evals and TruLens, has shown the power of programmatic methods for evaluating LLMs and LLM apps. Evaluations help ensure that LLM apps are honest, helpful and harmless.
Model deployment and inference: LLMs are deployed and made available via APIs by the major cloud providers, including AWS, Google Cloud Platform and Azure, and are also hosted by LLM providers such as OpenAI and Anthropic. Platform companies, such as Databricks, MosaicML and Snowflake, also offer model deployment services.
App hosting: App hosting services, such as Vercel, are increasingly being used to deploy LLM apps faster.
Monitoring: Deployed LLMs and LLM apps need to be monitored for quality metrics (in line with the evaluation metrics described in this list) as well as for cost, latency and more. Vendors include TruEra and HoneyHive.
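
To make the vector database index construction and app creation steps more concrete, here is a minimal sketch of the RAG pattern: split documents into chunks, embed them, index them and retrieve the best match for a question. Everything in it is an assumption made for illustration. The word-count "embedding" stands in for a real embedding model, the in-memory `ToyVectorIndex` class stands in for a vector database such as Pinecone or Weaviate, and the sample documents are invented. It does not use any vendor's API.

```python
import math
import re
from collections import Counter


def chunk(text: str, max_words: int = 40) -> list[str]:
    """Split a document into chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (a real app would call an embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class ToyVectorIndex:
    """Hypothetical in-memory stand-in for a vector database."""

    def __init__(self) -> None:
        self.entries: list[tuple[str, Counter]] = []

    def add_document(self, text: str, max_words: int = 40) -> None:
        """Chunk the document, embed each chunk and store it."""
        for piece in chunk(text, max_words):
            self.entries.append((piece, embed(piece)))

    def query(self, question: str, k: int = 2) -> list[str]:
        """Return the k stored chunks most similar to the question."""
        q = embed(question)
        ranked = sorted(self.entries, key=lambda e: cosine(q, e[1]), reverse=True)
        return [text for text, _ in ranked[:k]]


def build_prompt(question: str, context: list[str]) -> str:
    """Assemble the prompt that would be sent to an LLM."""
    joined = "\n".join(f"- {c}" for c in context)
    return f"Answer using only the context below.\n\nContext:\n{joined}\n\nQuestion: {question}"


# Invented sample knowledge base for an e-commerce customer service bot.
index = ToyVectorIndex()
index.add_document(
    "Returns are accepted within 30 days of delivery. "
    "Refunds are issued to the original payment method."
)
index.add_document(
    "Standard shipping takes three to five business days. "
    "Express shipping takes one business day."
)

question = "How long does shipping take?"
top_chunks = index.query(question, k=1)
prompt = build_prompt(question, top_chunks)
```

In this sketch the shipping chunk is retrieved because it is the only one that shares a word with the question. A production app would swap in a learned embedding model and a managed vector database, and would pass `prompt` to an LLM.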

The LLM space is rapidly evolving, and developers are looking to scale. Establishing a strong tech stack and repeatable workflow provides the structure needed to both accelerate LLM app deployment and ensure quality.
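
One way to make the workflow repeatable is to use the same programmatic checks in both the evaluation step and the monitoring step. The sketch below is a minimal illustration under stated assumptions. The `groundedness` score is a crude word-overlap measure invented for this example, the two "apps" are hypothetical stand-ins for an LLM app, and the 0.8 threshold is arbitrary. It is not the API of TruLens, OpenAI Evals or any other tool.

```python
import re
import time
from dataclasses import dataclass
from typing import Callable


def tokens(text: str) -> set[str]:
    """Lowercase word set used by the toy scoring function."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def groundedness(answer: str, context: str) -> float:
    """Toy score: fraction of answer words that also appear in the context."""
    answer_tokens = tokens(answer)
    if not answer_tokens:
        return 0.0
    return len(answer_tokens & tokens(context)) / len(answer_tokens)


@dataclass
class Record:
    question: str
    answer: str
    score: float
    latency_s: float


def evaluate(app: Callable[[str, str], str], cases: list[tuple[str, str]]) -> list[Record]:
    """Run the app on each (question, context) case, recording quality and latency."""
    records = []
    for question, context in cases:
        start = time.perf_counter()
        answer = app(question, context)
        latency = time.perf_counter() - start
        records.append(Record(question, answer, groundedness(answer, context), latency))
    return records


def passes_gate(records: list[Record], min_score: float = 0.8) -> bool:
    """Quality gate: every record must meet the (arbitrary) minimum score."""
    return all(r.score >= min_score for r in records)


# Hypothetical stand-ins for an LLM app; a real app would call a model.
def extractive_app(question: str, context: str) -> str:
    return context.split(".")[0] + "."


def ungrounded_app(question: str, context: str) -> str:
    return "Shipping is always free worldwide."


cases = [(
    "How long does shipping take?",
    "Standard shipping takes three to five business days. "
    "Express shipping takes one business day.",
)]
good = evaluate(extractive_app, cases)
bad = evaluate(ungrounded_app, cases)
```

Here `passes_gate(good)` is true and `passes_gate(bad)` is false, so the same gate could block a release before deployment and raise an alert afterward. Real evaluations would use stronger measures than word overlap.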

Anupam Datta is co-founder, president and chief scientist at TruEra.
