
Five Step Program to Get your Prompts into Shape


Foreword

To further AI understanding and adoption in the insurance industry, Lazarus is producing a series of articles titled Artificial Intelligence - Insights for Insurance, or “AI-II.” An earlier Insight (AI-II Insight #2: “Coming Soon to a Screen Near You…A Prompt Engineer”) provided direction on how leaders should think about the discipline of prompting. The guidance here applies most directly to the enterprise-level prompting discussion referenced in Insight #2. This Insight goes deeper, drawing heavily and directly on the daily experience of Lazarus prompting expert Kelly Daniel.

Introduction

This Insight distills state-of-the-art prompting experience into a five-step program. Some aspects of this program will be consistent with other technology implementations; others are unique to prompting.

Explainability Explained

Let’s start with some principles.

First Principle: “Don’t get blinded by the light.” Like all technology implementations, prompts must be tested and proven. The blinding speed of responses from generic language models can be misleading, causing people to think no effort is needed for enterprise-scale prompting. Getting prompts to generate rapid responses is trivial; getting accurate prompts at enterprise scale is a very different matter. Knowing that the responses are accurate, and understanding the limitations of enterprise-scale prompts, takes effort. Make the effort.

Second Principle: “Don’t jump into the deep end of the pool.” Start slowly and build on initial success. For example, if your corpus is a large set of documents, start with very few and select use cases with clear answers. Don’t start with edge cases. A key corollary of this principle is that not everything should be prompted. If an edge case is eating up a lot of time as you try to get the prompts right, step back and look at the use case. If it can be cost-effectively and consistently handled with existing approaches, then an AI solution, and therefore prompting, may not be needed. Do not inflict random acts of prompting that won’t add value.

Third Principle: “Undertake enterprise-level prompting with a spirit of exploration.” In the prompting process, there may be more to learn from wrong answers than from right answers. Prompting is as much art (language) as it is science (engineering), and this spirit of exploration will ensure you capture meaningful data and insights to improve the prompting. Within this spirit of exploration, the right mindset is needed from the beginning: think of prompting as directing a very intelligent, well-meaning intern who needs explicit instructions.

Don’t get blinded by the light.

Five Step Process to Get your Prompts into Shape

Step 1: Test on one document

When starting with documents, Lazarus’ best practice is to pick a single source document. Ensure this document is straightforward but representative of the larger pool of documents. Starting with one document, learning, and then progressing is far more effective than jumping straight to batches of documents. Going slower actually saves time and money, and up-front thinking organizes and focuses the effort.

Selecting the right place to start will get easier with experience; over time, prompters can judge the difficulty of a use case simply by looking at the file and running a few ad hoc prompts.

Once the first document succeeds and its challenges and difficulties are understood, the scope can be expanded. Success here means prompts return correct answers at an acceptable percentage. If humans interpret a document correctly only 35% of the time, expecting prompting to deliver 100% correct interpretation is unrealistic and will keep an organization at Step 1 forever.

However, no organization should give up on Step 1 in an hour, as prompting does take effort and thoughtfulness. But if Step 1 takes over a full day, it may be time to reconsider the first use case.
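To make Step 1 concrete, here is a minimal sketch of a single-document test harness in Python. It assumes a hypothetical call_model function standing in for whatever model endpoint is in use, along with made-up prompt fields and expected answers; it illustrates the working pattern rather than a specific Lazarus implementation.

```python
def call_model(prompt: str, document_text: str) -> str:
    """Hypothetical stand-in for whatever model endpoint is in use."""
    raise NotImplementedError("Wire this to your own model endpoint.")


def test_single_document(path: str, prompts: dict[str, str], expected: dict[str, str]) -> None:
    """Run every prompt against one document and compare to known answers."""
    with open(path, encoding="utf-8") as f:
        document_text = f.read()

    for field, prompt in prompts.items():
        answer = call_model(prompt, document_text)
        status = "OK" if answer.strip() == expected[field].strip() else "CHECK"
        print(f"{field:20s} {status:6s} model said: {answer!r}")


# Example usage with made-up fields for an insurance document:
# test_single_document(
#     "sample_policy.txt",
#     prompts={"policy_number": "What is the policy number?"},
#     expected={"policy_number": "ABC-12345"},
# )
```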

Step 2: Expand testing slightly (3 to 5 documents)

Step 1 is not likely to take very long if the principles outlined above have been followed. Lazarus’ experience shows that Step 2 can take much longer. One possibility is that the Second Principle has been violated, perhaps inadvertently, by starting with an edge case. This can cause “overfitting”: a narrow “success” that doesn’t work broadly.

Running your prompts on a handful of different documents will help you make adjustments early and generalize the prompts so they work across the whole document set.
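Continuing the sketch from Step 1, the same idea extends to a handful of documents: run one prompt across each file and flag the answers that miss the known values. Again, call_model is a hypothetical stand-in for the model endpoint, and the comparison logic is only illustrative.

```python
def call_model(prompt: str, document_text: str) -> str:
    """Hypothetical stand-in for whatever model endpoint is in use."""
    raise NotImplementedError("Wire this to your own model endpoint.")


def compare_across_documents(paths: list[str], prompt: str, expected: dict[str, str]) -> None:
    """Run one prompt across a few documents and flag answers that miss."""
    for path in paths:
        with open(path, encoding="utf-8") as f:
            answer = call_model(prompt, f.read())
        flag = "" if answer.strip() == expected.get(path, "").strip() else "  <-- review"
        print(f"{path:30s} {answer!r}{flag}")
```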

Step 3: Move to small batches

After successfully completing Step 2, the scaling up begins with running small batches.

This step is about increasing accuracy across the whole set. Small changes, even just adjusting the order of words or rules, can have a big impact. A prompt that fixes the data extraction for one document may return inaccurate results for three others. That is not a welcome outcome, but given the exploratory nature of this process, the prompter should expect some unwelcome responses.

This is why in Step 3 it’s now important to start looking at the aggregate prompt performance statistics and worry less about any one document. The exception, of course, is if that document represents a central and uncompromisable use case.

This is also a good time to make sure results are being captured in a structured way. Lazarus has found a simple spreadsheet organized by file name and prompt field to be the most effective method. Capturing the correct answer for each field also allows for some automated scoring of prompt results, which saves time as the batches increase in size.
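As an illustration of that spreadsheet, the sketch below assumes a hypothetical CSV named prompt_answers.csv with columns for file name, prompt field, expected answer, and model answer, and reports per-field accuracy so aggregate performance is visible at a glance. The exact columns and matching rules will vary by organization.

```python
import csv
from collections import defaultdict


def score_batch(csv_path: str) -> None:
    """Tally correct vs. total answers per prompt field from the scoring CSV."""
    correct: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)

    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            field = row["field"]
            total[field] += 1
            if row["model_answer"].strip().lower() == row["expected_answer"].strip().lower():
                correct[field] += 1

    for field in sorted(total):
        pct = 100 * correct[field] / total[field]
        print(f"{field:25s} {correct[field]:3d}/{total[field]:3d}  {pct:5.1f}%")


# score_batch("prompt_answers.csv")
```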

When stuck trying to improve a prompt, sometimes it pays to focus on a wrong answer or two and ask the model “Why?”

Step 4: Interrogate your inaccurate results

When stuck trying to improve a prompt, sometimes it pays to focus on a wrong answer or two and ask the model “Why?”

As silly as it seems, prompting a model “Is the phrase ‘x’ present in this document?” (“x” being the data the prompter is trying to extract) tells a lot about how the model is interpreting the document. With a business-optimized language model, as opposed to a generically trained one, the prompter can check the context window for the answer provided and determine which section or data the model is pulling from. With this insight, the prompter understands the model better, and with that understanding the prompts improve.

Allowing the model to reply with verbose answers, rather than restricting its format, can also provide insights. Adding a phrase such as “explain your answer” can yield valuable information. The prompter can use these insights to reword the prompt, use synonyms, or reorganize the prompt to increase effectiveness before Step 5.
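The sketch below shows what this interrogation pattern can look like in code: a presence check for the phrase being extracted, followed by a re-run of the original prompt with an “explain your answer” instruction. The call_model function is again a hypothetical stand-in for the model endpoint, and the prompt wording is only one way to phrase the questions.

```python
def call_model(prompt: str, document_text: str) -> str:
    """Hypothetical stand-in for whatever model endpoint is in use."""
    raise NotImplementedError("Wire this to your own model endpoint.")


def interrogate(document_text: str, target_phrase: str, original_prompt: str) -> None:
    """Ask diagnostic questions about a wrong answer instead of blindly rewording."""
    # 1. Does the model even see the data being asked for?
    presence = call_model(
        f'Is the phrase "{target_phrase}" present in this document? '
        "Answer yes or no, then quote the surrounding text.",
        document_text,
    )
    print("Presence check:", presence)

    # 2. Re-run the original prompt without format restrictions and ask why.
    verbose = call_model(
        original_prompt + " Explain your answer and show where in the document you found it.",
        document_text,
    )
    print("Verbose answer:", verbose)
```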

Step 5: Go Big or Go…

Now you are ready for full scale…or to start over again. If you are comfortable with the answers your prompts are generating, and understand the wrong answers, then move to full-scale validation.

The time and effort in this stage will vary greatly. Factors include the total number of unique prompts, how many prompts have known answers, the risk in the business process the prompts support, how heterogeneous the document types are, and how consistent the data quality is, among others.

Files that have a consistent structure can often be validated more easily because the prompts target fields specific to that document type. More heterogeneous documents (e.g., pools with varied or unstructured formats) will require testing a larger set of documents.

As referenced, there is another possibility here: poor results may mean a restart. This likely means one of the principles was violated. Even so, the situation is not a total loss. It is almost guaranteed that going through these steps has developed the prompter’s ability to write clearer, more direct prompts, and the steps will be executed more quickly in the next run-through.

A fair question: if a restart is needed, how many times should an organization go through this five-step process? The answer depends on preliminary results. Say, for the sake of example, the target accuracy is 78%. If the first run-through of the process resulted in 35% accuracy, the second 58%, and a third 65%, the results are converging in a positive direction, and Lazarus’ counsel would be to continue through another round.

On the other hand, if the results are initially 15% accuracy, then 21%, and then drop to 4%, the results are not converging in a positive direction, and Lazarus’ recommendation is to look for external expertise.
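For illustration, that convergence check can be reduced to a few lines of Python; the sample numbers are the ones used in the example above, not real results.

```python
def converging(accuracies: list[float], target: float) -> bool:
    """Trending toward the target if every run-through improves on the last
    and the latest run has not yet reached the target (otherwise you are done)."""
    improving = all(later > earlier for earlier, later in zip(accuracies, accuracies[1:]))
    return improving and accuracies[-1] < target


print(converging([0.35, 0.58, 0.65], target=0.78))  # True  -> run another round
print(converging([0.15, 0.21, 0.04], target=0.78))  # False -> seek outside expertise
```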

...prompting is both art and science and prompting will continue to evolve quickly.

Summary

This Insight presents the five steps needed for success in the world of prompting. As noted earlier, prompting is both art and science, and it will continue to evolve quickly. Following the steps and guidance here will maximize the probability of success; ignoring them will increase risk, cost, and time. Continue to watch Lazarus for Insights as this evolution occurs.

About Lazarus AI

Lazarus is an AI technology company that develops Large Language Models (LLMs) and associated solutions for industries such as insurance. The team at Lazarus is available to discuss all your AI needs regardless of use case or industry. Contact john@lazarusai.com for any feedback.
