
Road To Prompt Reliability Engineering

Introduction

This article introduces the concept of Prompt Reliability Engineering as the final step on the prompt-ops maturity ladder. We will walk up this ladder and talk through the concepts and best practices at each of the steps. If you just deployed your first GenAI app and you're starting to get overwhelmed scrolling through the application logs, or if prompt engineering feels like taking one step forward and two steps back, then this article is for you.

Prompt-ops maturity levels

1. Yoloing prompts

This is where every project starts and where many end. If you update a piece of text in VS Code, hit save, and then refresh a web page to try the changes in a chat app, you're likely in this stage. Hopefully you're using GitHub or some other kind of source code versioning to save the changes to your prompts. Bonus points if you've set up some type of CI/CD workflow that also deploys committed changes to your production app.
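A minimal sketch of what "prompts under version control" can look like in practice: prompt templates live as plain files in the same repo as the app code, so every change goes through the usual commit and deploy flow. The directory layout and file names here are assumptions for illustration.

```python
# A minimal sketch of treating prompts as versioned files rather than
# hard-coded strings. File layout and names are assumptions for illustration.
from pathlib import Path

PROMPT_DIR = Path("prompts")  # checked into the same repo as the app code

def load_prompt(name: str, **variables: str) -> str:
    """Load a prompt template from the repo and fill in its variables."""
    template = (PROMPT_DIR / f"{name}.txt").read_text(encoding="utf-8")
    return template.format(**variables)

# Example: prompts/summarize.txt contains "Summarize the following text:\n{text}"
prompt = load_prompt("summarize", text="GenAI observability is hard.")
```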

2. Logging & tracing

Once the LLM app becomes more sophisticated and the developer starts chaining multiple LLM invocations, observability becomes an issue. Debugging poor app behaviour becomes laborious, especially with large prompts. Dedicated tooling for GenAI app observability solves this. Tools that offer visualizations of chained invocations and traces can be particularly helpful for debugging complex agentic behaviour.
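To make the idea concrete, here is a minimal sketch of per-invocation tracing built with only the standard library: each step in the chain logs a span with a shared trace id and its latency. The `extract_entities` and `summarize` functions are hypothetical stand-ins for real LLM calls; dedicated observability tools give you much richer versions of this out of the box.

```python
# A minimal sketch of tracing chained LLM invocations with the standard
# library only. Dedicated observability tools provide richer versions of this.
import functools, logging, time, uuid

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("llm-trace")

def traced(step_name):
    """Decorator that logs a span (trace id, step, latency) per invocation."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, trace_id=None, **kwargs):
            trace_id = trace_id or uuid.uuid4().hex[:8]
            start = time.perf_counter()
            result = fn(*args, **kwargs)
            elapsed_ms = (time.perf_counter() - start) * 1000
            log.info("trace=%s step=%s latency_ms=%.0f", trace_id, step_name, elapsed_ms)
            return result, trace_id
        return wrapper
    return decorator

@traced("extract")
def extract_entities(text):
    # A hypothetical stand-in for an LLM call that extracts entities.
    return ["entity-a", "entity-b"]

@traced("summarize")
def summarize(entities):
    # A hypothetical stand-in for a second, chained LLM call.
    return f"Summary of {len(entities)} entities"

entities, tid = extract_entities("some document")
summary, _ = summarize(entities, trace_id=tid)  # the shared trace id links the chain
```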

3. EVALs 🤝 User feedback

Prompt engineering by gut feel will hit its limits even with the best observability tooling available. LLM models will inevitably drift, introducing new system instructions to address new failure modes can degrade performance on failure modes you had already handled, or you might simply want to switch to a different model provider altogether. Your GenAI application will start degrading in unexpected ways.

Drawing from traditional SWE experience, the solution is obvious: a testing framework. This is where EVALs [1] come in. EVALs test your LLM application against a fixed set of labeled input prompts. Labeled in the sense that we know what output we're expecting for each of these input prompts.
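A minimal sketch of such an eval harness, assuming a hypothetical `call_llm` function standing in for however your app invokes the model, and a crude containment check as the scoring rule:

```python
# A minimal sketch of an EVAL run: labeled inputs with expected outputs,
# scored with a simple containment check. call_llm is a hypothetical stand-in.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("wire this up to your model/API of choice")

EVAL_SET = [
    {"input": "What is the capital of France?", "expected": "Paris"},
    {"input": "2 + 2 =", "expected": "4"},
]

def run_evals(cases):
    passed = 0
    for case in cases:
        output = call_llm(case["input"])
        ok = case["expected"].lower() in output.lower()  # crude containment check
        passed += ok
        print(f"{'PASS' if ok else 'FAIL'}: {case['input']!r} -> {output!r}")
    print(f"pass rate: {passed}/{len(cases)}")
```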

The key insight here comes from yet another comparison to the world of traditional engineering. A start-up operating on this step of the ladder will treat its EVALs like its source code: they are hard to acquire [2]. After all, you need users to bump into all the failure modes of the LLM app hiding in remote nooks and crannies of the problem space. The prompts of a start-up operating on this step of the ladder will be more akin to compiled code: replaceable and maybe even shareable.

GenAI evaluation is a rich field, with different modes, criteria, and metrics.
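Those criteria can range from strict exact-match through keyword checks to LLM-as-judge grading. Here is a sketch of what pluggable scoring functions might look like; the metric names are illustrative and not tied to any particular framework, and `call_llm` is the same hypothetical stub as in the harness above.

```python
# A sketch of pluggable scoring criteria; names are illustrative,
# not taken from any particular evaluation framework.
def exact_match(output: str, expected: str) -> float:
    return float(output.strip() == expected.strip())

def contains_keywords(output: str, keywords: list[str]) -> float:
    hits = sum(kw.lower() in output.lower() for kw in keywords)
    return hits / len(keywords)

def judge_with_llm(output: str, rubric: str) -> float:
    # LLM-as-judge: ask a (separate) model to grade the output against a rubric.
    # Hypothetical; in practice you would robustly parse a score from the reply.
    reply = call_llm(f"Rubric: {rubric}\nAnswer: {output}\nScore 0-1:")
    return float(reply.strip())
```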

4. Prompt Reliability Engineering

Prompt Reliability Engineering (PRE) is the practice of applying Site Reliability Engineering (SRE) principles to AI-powered applications. It provides a framework for measuring and maintaining the quality and reliability of your LLM applications. SRE has been a revolutionary force in how we think about building and maintaining traditional software systems. It provides a data-driven approach to balancing reliability with the need to innovate and release new features. The core idea is to define what reliability means for your application, measure it, and then use that data to make informed decisions. Let's explore how we can apply these battle-hardened SRE principles to the new frontier of GenAI applications.

The core concepts of SRE can be mapped to PRE (a rough sketch of the mapping, with a worked example below):

Service Level Indicators (SLIs) become quality indicators, such as the fraction of responses that pass your EVALs or receive positive user feedback.

Service Level Objectives (SLOs) become quality targets, for example "95% of responses meet the quality bar over a rolling seven-day window".

Error budgets become the share of degraded or failed responses you are willing to tolerate before you freeze prompt changes and model upgrades.

Monitoring and alerting become continuous evaluation: running your EVALs on every prompt change, every model version bump, and on a schedule against production traffic.
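A minimal sketch of the arithmetic, mirroring SRE practice; the scores, threshold, and SLO value are illustrative assumptions:

```python
# A minimal sketch of PRE arithmetic, mirroring SRE: an SLI computed from
# eval results, checked against an SLO, with the remaining error budget.
# Numbers and thresholds are illustrative assumptions.
def quality_sli(scores: list[float], pass_threshold: float = 0.8) -> float:
    """Fraction of evaluated responses that meet the quality bar."""
    return sum(s >= pass_threshold for s in scores) / len(scores)

SLO = 0.95                      # target: 95% of responses meet the bar
scores = [0.9, 0.85, 0.4, 1.0, 0.95, 0.7, 0.9, 0.88, 0.92, 0.99]

sli = quality_sli(scores)
error_budget = 1.0 - SLO        # tolerated fraction of bad responses
budget_spent = 1.0 - sli

print(f"SLI={sli:.2f} SLO={SLO:.2f} budget spent={budget_spent:.2f}/{error_budget:.2f}")
if budget_spent > error_budget:
    print("Error budget exhausted: freeze prompt/model changes, fix quality first.")
```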

Tooling

Several tools can help you implement PRE. In the Google Cloud ecosystem, the Vertex AI suite includes the GenAI Evaluation Service, the Agent Evaluation Service, and the Prompt Optimizer. BigQuery can be used as a backend for storing and analyzing evaluation data, and the Agent Development Kit (ADK) is also relevant here.
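As an example of the BigQuery-as-backend idea, here is a sketch of persisting eval results for later analysis. The project, dataset, and table names are assumptions, and it presumes the google-cloud-bigquery client library is installed and credentials are configured.

```python
# A sketch of persisting eval results to BigQuery for later analysis.
# The table name is hypothetical; adjust to your own project/dataset.
from datetime import datetime, timezone
from google.cloud import bigquery

client = bigquery.Client()
TABLE = "my-project.prompt_ops.eval_results"  # hypothetical table

rows = [{
    "run_at": datetime.now(timezone.utc).isoformat(),
    "prompt_version": "summarize@v12",   # e.g. a git tag or commit SHA
    "model": "some-model-id",
    "case_id": "capital-of-france",
    "score": 1.0,
}]

errors = client.insert_rows_json(TABLE, rows)  # streaming insert
if errors:
    print("BigQuery insert errors:", errors)
```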

Appendix

[1] Here we're referring to EVALs in the context of building applications on top of models that have been trained or fine-tuned by others. We of course also need evaluation frameworks when training and fine-tuning models (optimizing for perplexity, loss functions, etc.), but we're not talking about those types of evaluations here. In other words, we don't have access to the individual values of the model's final layer; we're only working with whatever tokens the API sends us.

[2] Synthetic EVAL generation is a counterargument to this, and it will surely be a space that an industrious founder will want to explore.