What Is Prompt Engineering?

Introduction

A prompt is an instruction, or a set of instructions, that you give to a tool or a person. Whether it is a keyword typed into a search engine, a command for a computer program, or a question for a friend, a prompt tells the recipient what you are looking for or what you want done.

Prompt engineering, the art and science of crafting effective prompts, has become increasingly important with the rise of Large Language Models (LLMs), because a well-crafted prompt lets you make fuller use of an LLM's capabilities.

This article will help you master prompt engineering through the lens of language models.

Prompts and LLMs

When working on prompt engineering, you generally interact with the LLM through an API. These APIs expose a set of generation hyperparameters that can be adjusted to achieve the desired output. Here we examine the Hugging Face Inference API (as shown in the code below) and explore the role of each parameter.

from huggingface_hub import InferenceClient

# HF Inference Endpoints parameter
endpoint_url = "https://YOUR_ENDPOINT.endpoints.huggingface.cloud"
hf_token = "hf_YOUR_TOKEN"

# Streaming Client
client = InferenceClient(endpoint_url, token=hf_token)

# Generation parameters
gen_kwargs = {
    "max_new_tokens": 512,
    "top_k": 50,  # Adjusting top-k sampling parameter
    "top_p": 0.8,  # Adjusting nucleus sampling parameter
    "temperature": 0.5,  # Adjusting temperature for randomness
    "repetition_penalty": 1.5,  # Adjusting repetition penalty to avoid repetitive responses
    "stop_sequences": ["\nUser:", "", "</s>"],
}

# Prompt
prompt = "What are the effects of climate change on"

# Text generation
stream = client.text_generation(prompt, stream=True, details=True, **gen_kwargs)

As mentioned above, these hyperparameters can be adjusted to influence the quality and diversity of the generated text. Let's take a closer look at them, starting with those in the gen_kwargs dictionary above.

Hyperparameters

Temperature

Temperature is like adjusting the spice level in your cooking: a higher temperature means more randomness, like adding spice for flavour, while a lower temperature keeps things predictable, like sticking to a recipe. For example, in creative writing tasks such as generating poetry or brainstorming story ideas, a higher temperature setting can result in more diverse and imaginative text.
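To make the effect concrete, here is a minimal sketch of how temperature is commonly applied: the model's raw scores (logits) are divided by the temperature before being turned into probabilities. The function name and the three logits are made up for illustration and are not taken from any particular API.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for three candidate tokens
logits = [2.0, 1.0, 0.1]
cool = softmax_with_temperature(logits, temperature=0.5)  # sharper distribution
warm = softmax_with_temperature(logits, temperature=1.5)  # flatter distribution
print([round(p, 3) for p in cool])
print([round(p, 3) for p in warm])
```

With the lower temperature the top token takes most of the probability mass, while the higher temperature gives the less likely tokens a better chance of being sampled.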

Top_k

Think of top_k as narrowing down the choices in a library to the most popular books. At each step of text generation, the model samples only from the k most probable tokens and ignores the rest. Consider a customer service chatbot that assists users with common queries: a small top_k keeps its wording close to the most likely continuations, which reduces the chance of odd or off-topic phrasing.
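As a rough illustration of top-k filtering, the sketch below keeps the k most probable tokens, renormalises their probabilities, and samples from that shortlist. The token probabilities and the helper name are invented for the example; a real model works over its full vocabulary.

```python
import random

def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalise their probabilities."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    total = sum(p for _, p in ranked)
    return {token: p / total for token, p in ranked}

# Hypothetical next-token distribution
probs = {"the": 0.5, "a": 0.3, "cat": 0.15, "zebra": 0.05}
shortlist = top_k_filter(probs, k=2)

# Sample the next token from the shortlist only
rng = random.Random(0)
next_token = rng.choices(list(shortlist), weights=list(shortlist.values()))[0]
print(shortlist, next_token)
```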

Top_p

Top_p (nucleus sampling) limits the tokens considered to the smallest set, taken in order of decreasing probability, whose cumulative probability reaches p. Both top_k and top_p are used to control the diversity and quality of the output.
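The cumulative cut-off can be sketched in a few lines. This is an illustrative version only, with made-up probabilities and a hypothetical helper name: tokens are added in order of decreasing probability until the running total reaches p.

```python
def top_p_filter(probs, p):
    """Keep the most probable tokens until their cumulative probability reaches p."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}

# Hypothetical next-token distribution
probs = {"the": 0.5, "a": 0.3, "cat": 0.15, "zebra": 0.05}
nucleus = top_p_filter(probs, p=0.75)
print(nucleus)
```

Unlike top_k, the number of tokens kept varies from step to step: a confident distribution yields a short list, a flat one yields a longer list.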

Max_new_tokens

Max_new_tokens is like setting a word limit for an essay: it caps how many tokens the model can generate in a response. For instance, if you're generating responses for a chatbot, a maximum token limit keeps the responses concise, or you can increase the limit when longer outputs are needed.

Repetition_penalty

Repetition_penalty discourages the model from reusing tokens, promoting diversity in the generated text. In a conversational AI application, such as a virtual assistant, setting a repetition_penalty ensures that the assistant's responses remain varied and natural during extended interactions.
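One common way to implement a repetition penalty is to scale down the score of every token that has already been generated: positive scores are divided by the penalty and negative scores are multiplied by it. The sketch below follows that scheme with made-up scores; the function name is hypothetical, and a given serving backend may implement the penalty differently.

```python
def apply_repetition_penalty(logits, generated_tokens, penalty):
    """Make already-generated tokens less likely (penalty > 1 discourages reuse)."""
    adjusted = dict(logits)
    for token in set(generated_tokens):
        if token in adjusted:
            score = adjusted[token]
            # Dividing a positive score or multiplying a negative one both lower it
            adjusted[token] = score / penalty if score > 0 else score * penalty
    return adjusted

# Hypothetical scores; "great" and "bad" have already been generated
logits = {"great": 3.0, "good": 2.0, "bad": -1.0}
adjusted = apply_repetition_penalty(logits, ["great", "bad"], penalty=1.5)
print(adjusted)
```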

Frequency Penalty

Frequency penalty lowers the likelihood of a token in proportion to how often it has already appeared in the generated text, which reduces verbatim repetition. Suppose you're developing a news aggregator app that summarizes articles from various sources. Applying a frequency penalty makes the summaries less likely to repeat the same words and phrases over and over.

Presence Penalty

Presence penalty lowers the likelihood of any token that has already appeared in the generated text at least once, regardless of how many times, which nudges the model towards new words and topics. It does not filter content by itself: a content moderation system for online forums, for example, still needs a separate check to detect discriminatory comments and generate a warning message. Neither frequency penalty nor presence penalty is part of the gen_kwargs above; both are offered by OpenAI-style APIs.
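In OpenAI-style APIs, the frequency and presence penalties are described as simple subtractions from a token's score: the frequency penalty is multiplied by the number of times the token has already appeared, while the presence penalty is applied once if the token has appeared at all. The following is a minimal sketch of that formulation, with invented scores and a hypothetical function name.

```python
from collections import Counter

def apply_penalties(logits, generated_tokens, frequency_penalty=0.0, presence_penalty=0.0):
    """Subtract count-based and presence-based penalties from each token's score."""
    counts = Counter(generated_tokens)
    adjusted = {}
    for token, score in logits.items():
        seen = counts[token]
        presence = presence_penalty if seen > 0 else 0.0
        adjusted[token] = score - seen * frequency_penalty - presence
    return adjusted

# Hypothetical scores and generation history
logits = {"market": 2.0, "economy": 2.0, "weather": 2.0}
history = ["market", "market", "market", "economy"]
adjusted = apply_penalties(logits, history, frequency_penalty=0.5, presence_penalty=0.25)
print(adjusted)
```

A token used three times is penalised more heavily than one used once, and a token that has not appeared yet keeps its original score.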

General tips for writing better prompts

Starting with Simplicity

Begin with simple, straightforward prompts and gradually introduce complexity through refinement, rather than loading all the information into the first attempt. When dealing with a big task, try to break it down into smaller subtasks.

Clear Guidance

Commands should be clear and explicit. For example,

Poor prompt: "The quick brown fox jumps over the lazy dog, Translate this."

Better prompt: "Translate the following English text into Spanish: 'The quick brown fox jumps over the lazy dog.'"

Being Specific

Enhance prompt clarity by including relevant examples and detailed instructions.

Poor prompt: “Write about social media and its effects.”

Better prompt: “Write a 500-word essay discussing social media's impact on teenagers' mental health. Include statistics from reputable sources such as the American Psychological Association and provide real-life examples of individuals affected by excessive social media use.”

Avoiding Confusion

Ensure prompts are clear and direct to prevent ambiguity in model responses.

Continuous Improvement

Iteratively refine prompts based on feedback and performance to optimise results.

Tools, frameworks, and libraries for prompt engineering in LLMs

Hugging Face Transformers

It provides interfaces for fine-tuning models on specific tasks and allows for the creation of custom prompts.

OpenAI Codex API

OpenAI's API includes capabilities for prompt engineering, allowing users to generate code or text based on prompts.

GPT-3 Playground

Platforms like the GPT-3 Playground offer interactive environments for experimenting with prompts and observing model responses in real-time.

Google's T5 Framework

T5 (Text-To-Text Transfer Transformer) is a framework that casts all NLP tasks into a text-to-text format, simplifying prompt engineering for various tasks.

AllenNLP

AllenNLP provides tools for building and evaluating NLP models, including mechanisms for prompt design and evaluation.

Advanced Techniques for Prompt Engineering

In this section, I will talk about some SOTA (state-of-the-art) techniques that fall under the umbrella of prompt engineering in natural language processing (NLP), along with related methods for adapting language models. There may be slight overlap with the methods mentioned above; consider this a more formal and structured approach to understanding the techniques:

Zero-shot Prompting

As the term ‘zero-shot’ suggests, the model is queried directly, without any examples in the prompt. It generates results based on its pre-existing knowledge.

Prompt:

Who was the first man in space?

Output:

The first man in space was Yuri Gagarin, a Soviet cosmonaut.

Few-shot Prompting

Few-shot prompting is all about giving the Large Language Model (LLM) a bit of a heads-up before asking the question, typically a handful of worked examples placed in the prompt. This extra context serves two big purposes: first, it helps the LLM grasp what's going on, so it can give a more spot-on response. Second, it lets the LLM pick up a task or output format it hasn't come across before, without any retraining.

Prompt:

Compose a poem with three stanzas, each focusing on a different theme: love, nature, and solitude.

Output:

In your embrace, I find my solace,
A symphony of hearts beating as one,
Lost in the dance of passion's embrace.

Behold the whispering trees, their leaves aglow,
Nature's canvas painted with hues divine,
Serenity is found in the murmuring flow.

In the quietude of the night, I am alone,
A solitary soul under the starry sky,
Finding solace in the silence, all on my own.
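The prompt above contains an instruction but no worked examples. A few-shot prompt in the stricter sense also includes a handful of input/output pairs before the actual query. Here is a minimal sketch of assembling such a prompt for a sentiment task; the task, the example reviews, and the function name are all made up for illustration.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Place a few labelled examples before the query so the model can infer the pattern."""
    lines = [instruction, ""]
    for text, label in examples:
        lines.append(f"Review: {text}")
        lines.append(f"Sentiment: {label}")
        lines.append("")
    lines.append(f"Review: {query}")
    lines.append("Sentiment:")  # left open for the model to complete
    return "\n".join(lines)

# Hypothetical labelled examples
examples = [
    ("The battery lasts all day.", "positive"),
    ("The screen cracked within a week.", "negative"),
]
few_shot_prompt = build_few_shot_prompt(
    "Classify the sentiment of each review as positive or negative.",
    examples,
    "Setup was quick and painless.",
)
print(few_shot_prompt)
```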

Prompt Chaining

Prompt chaining is a powerful technique for breaking down complex tasks into more manageable steps. By feeding the output of one prompt into the input of another, you can streamline processes and tackle intricate problems efficiently.

This approach offers several advantages: simpler instructions, the ability to pinpoint areas of difficulty for focused troubleshooting, and better validation of results. Additionally, subtasks that do not depend on each other's output can be executed in parallel, while dependent steps still run in sequence.

Prompt-1:

There is a workout routine, present on the website <link>:

Please identify any exercises mentioned on the website. Provide them as direct quotes, enclosed in <quotes></quotes> XML tags. If there are no exercises, please say "There are no exercises mentioned."

Output-1:

<quotes> "10 push-ups" "20 squats" </quotes>

Prompt-2:

Here are exercises mentioned in the routine, enclosed in <quotes></quotes> XML tags:

<quotes> "10 push-ups" "20 squats" </quotes>

Please use these exercises to construct a workout routine. Ensure that your instructions are clear and easy to follow.

Output-2:

# Generates a comprehensive workout plan
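The two-step chain above can be sketched in code as follows. This is an illustration under stated assumptions: `llm` is a hypothetical callable that takes a prompt string and returns the model's reply, the page text is passed in directly rather than fetched from a link, and `fake_llm` is a canned stand-in so the example runs without a real model.

```python
import re

def extract_quotes(text):
    """Return the content of the first <quotes>...</quotes> block, or an empty string."""
    match = re.search(r"<quotes>(.*?)</quotes>", text, re.DOTALL)
    return match.group(1).strip() if match else ""

def run_chain(page_text, llm):
    """Step 1 extracts the exercises; step 2 turns them into a routine."""
    prompt_1 = (
        "Here is the text of a workout routine:\n\n"
        f"{page_text}\n\n"
        "Please identify any exercises mentioned. Provide them as direct quotes, "
        "enclosed in <quotes></quotes> XML tags. If there are no exercises, "
        'please say "There are no exercises mentioned."'
    )
    output_1 = llm(prompt_1)
    quotes = extract_quotes(output_1)
    if not quotes:
        return output_1  # nothing to build on, so stop the chain here
    prompt_2 = (
        "Here are exercises mentioned in the routine, enclosed in "
        "<quotes></quotes> XML tags:\n\n"
        f"<quotes> {quotes} </quotes>\n\n"
        "Please use these exercises to construct a workout routine. "
        "Ensure that your instructions are clear and easy to follow."
    )
    return llm(prompt_2)

def fake_llm(prompt):
    """Canned stand-in for a real model call (an assumption for this sketch)."""
    if "Please identify" in prompt:
        return '<quotes> "10 push-ups" "20 squats" </quotes>'
    exercises = re.findall(r'"([^"]+)"', prompt)
    return "Routine: " + ", then ".join(exercises)

plan = run_chain("Warm up, then do 10 push-ups and 20 squats.", fake_llm)
print(plan)
```

Checking the intermediate output before building the second prompt is what makes each link in the chain easy to validate and debug in isolation.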

Chain-of-Thought Prompting

Chain-of-Thought prompting mirrors the scenario where a student is shown a worked example, including the intermediate reasoning steps, and is then challenged to solve similar problems in the same way.

Prompt:

Q: Sally has 50 apples. She gives 15 to her friend and then buys three times as many as she gave away. How many apples does Sally have now?

A: Sally started with 50 apples. After giving away 15, she has 50 - 15 = 35 apples left. Then she buys three times as many as she gave away, which is 3 * 15 = 45 apples. Adding the apples she bought to what she had left, Sally now has 35 + 45 = 80 apples. Therefore, Sally has 80 apples.

Q: Joe has 20 eggs. He buys 2 more cartons of eggs. Each carton contains 12 eggs. How many eggs does Joe have now?

Output:

A: Joe started with 20 eggs. 2 cartons of 12 eggs is 24 eggs. 20 + 24 = 44. Therefore, Joe has 44 eggs, and the answer is 44.

When you have few or no examples, adding a phrase like "Let's think step by step" to the original prompt can be effective at improving the model's performance.
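Both variants can be produced by one small helper. The sketch below is illustrative only and the function name is hypothetical: with worked demonstrations it builds a few-shot chain-of-thought prompt, and without them it falls back to the "Let's think step by step" cue.

```python
def build_cot_prompt(question, demonstrations=None):
    """Build a chain-of-thought prompt, with or without worked examples."""
    if demonstrations:
        parts = [f"Q: {q}\nA: {a}" for q, a in demonstrations]
        parts.append(f"Q: {question}\nA:")
        return "\n\n".join(parts)
    # No examples available: ask the model to reason step by step
    return f"Q: {question}\nA: Let's think step by step."

sally = (
    "Sally has 50 apples. She gives 15 to her friend and then buys three times "
    "as many as she gave away. How many apples does Sally have now?",
    "Sally started with 50 apples. After giving away 15, she has 50 - 15 = 35 apples left. "
    "She buys 3 * 15 = 45 apples. 35 + 45 = 80. Therefore, Sally has 80 apples.",
)
joe = (
    "Joe has 20 eggs. He buys 2 more cartons of eggs. "
    "Each carton contains 12 eggs. How many eggs does Joe have now?"
)

few_shot_cot = build_cot_prompt(joe, demonstrations=[sally])
zero_shot_cot = build_cot_prompt(joe)
print(few_shot_cot)
print(zero_shot_cot)
```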

Automatic Chain-of-Thought (Auto-CoT)

Auto-CoT automatically generates examples that show the LLM how to solve problems. These examples are called "demonstrations", and they are created by prompting the LLM to articulate its thought process and explain how it would approach a problem.

How does Auto-CoT work?

Auto-CoT works in two stages:

Question clustering: First, it groups similar questions. This helps to ensure that the demonstrations cover a wide range of different types of problems.
Demonstration sampling: Then, it selects a representative question from each cluster and asks the LLM to generate a demonstration for it. The LLM does this by thinking out loud and explaining its reasoning step-by-step.
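The two stages above can be sketched in a few lines. Note the simplifications: this toy version groups questions by word overlap with a greedy pass, which is a stand-in for clustering on sentence embeddings, and it simply takes the first question of each cluster as the representative. The `llm` callable, the threshold, and all function names are hypothetical.

```python
def tokens(question):
    """Lowercased word set with basic punctuation removed."""
    return set(question.lower().replace("?", "").replace(".", "").split())

def jaccard(a, b):
    """Word-overlap similarity between two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cluster_questions(questions, threshold=0.3):
    """Stage 1: greedily group each question with the first similar cluster."""
    clusters = []
    for question in questions:
        for cluster in clusters:
            if jaccard(tokens(question), tokens(cluster[0])) >= threshold:
                cluster.append(question)
                break
        else:
            clusters.append([question])
    return clusters

def build_demonstrations(questions, llm, threshold=0.3):
    """Stage 2: ask the model to reason step by step on one question per cluster."""
    demos = []
    for cluster in cluster_questions(questions, threshold):
        representative = cluster[0]
        reasoning = llm(f"Q: {representative}\nA: Let's think step by step.")
        demos.append((representative, reasoning))
    return demos

questions = [
    "How many apples does Sally have after buying 5 more?",
    "How many apples does Tom have after eating 2?",
    "What is the capital of France?",
    "What is the capital of Japan?",
]
# Canned stand-in for a real model call (an assumption for this sketch)
demos = build_demonstrations(questions, llm=lambda prompt: "(model reasoning)")
print(demos)
```

The resulting (question, reasoning) pairs can then be placed in front of a new question as chain-of-thought demonstrations.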

Fine-Tuning of LLMs

Fine-tuning is often not regarded as part of prompt engineering, but like prompt engineering it is a method for adapting large language models (LLMs) to specific tasks.
It involves further training an already trained model on a specialised labelled dataset, thus adjusting its parameters. While the last layers are often the ones adjusted to suit the new data, fine-tuning can involve tweaking parameters across multiple layers to better capture domain-specific features while retaining the knowledge learned from the original training.

Traditionally, fine-tuning was a complex and resource-intensive process that required powerful hardware, expertise in machine learning, and large amounts of labelled data.

However, now with platforms like Hugging Face, which provide pre-trained models and easy-to-use fine-tuning pipelines, fine-tuning has become more accessible and efficient. By integrating the capabilities of Hugging Face with traditional fine-tuning approaches, we can leverage pre-trained models as starting points, reducing the need for vast amounts of labelled data and expertise.

Truefoundry also provides a facility for fine-tuning your LLMs. With its intuitive and simple interface, you can fine-tune your models in 3 simple steps:

Connect your database.
Compare fine-tuning jobs and choose the right one for your needs.
Deploy your fine-tuned model.

Retrieval Augmented Generation (RAG)

In RAG, a retrieval component is used alongside generation to enhance the model's performance in tasks such as question answering and text generation. RAG suits scenarios where facts evolve, which is valuable because an LLM's knowledge is fixed at training time and can't keep up. Instead of retraining the model, RAG retrieves the latest information at query time and passes it to the model, helping it produce more dependable outputs.
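A bare-bones version of this retrieve-then-generate flow might look like the sketch below. It is an illustration under stated assumptions: documents are ranked by simple word overlap, which stands in for the embedding search a real system would use, and the documents, the prompt wording, and the function names are invented for the example.

```python
import re

def words(text):
    """Lowercased set of words, ignoring punctuation."""
    return set(re.findall(r"\w+", text.lower()))

def retrieve(query, documents, top_n=2):
    """Rank documents by how many words they share with the query."""
    return sorted(documents, key=lambda doc: len(words(query) & words(doc)), reverse=True)[:top_n]

def build_rag_prompt(query, documents, top_n=2):
    """Put the retrieved passages in front of the question before calling the model."""
    context = "\n".join(f"- {doc}" for doc in retrieve(query, documents, top_n))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )

# Hypothetical knowledge base
documents = [
    "The 2024 release added streaming support.",
    "Our office is closed on public holidays.",
    "Streaming support requires version 2 of the client.",
]
query = "Which version has streaming support?"
rag_prompt = build_rag_prompt(query, documents)
print(rag_prompt)
```

Updating the system's knowledge then means updating the document collection, not the model weights.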

In recent years, RAG systems have progressed from basic Naive RAG to more sophisticated Advanced RAG and Modular RAG models.

Naive RAG retrieves information based on user input but struggles with accuracy due to outdated data and irrelevant responses. Advanced RAG improves this by fine-tuning the retrieval process, making it more precise and relevant.

Modular RAG takes it further by offering different customizable modules like search and memory, allowing for flexibility in solving specific problems. Overall, these advancements aim to make conversation systems smarter and more reliable by better managing information retrieval and response generation.

Truefoundry also offers an end-to-end interface for RAG with the ability to integrate with any metadata store, embedding model, or LLM.

Reinforcement learning from human feedback (RLHF)

For quite a while, training a language model using reinforcement learning seemed infeasible because of both engineering and algorithmic challenges. Understanding the technicalities of RLHF requires various reinforcement learning prerequisites, so I will try to keep the explanation very general.

Consider a problem where our goal is to train a robot to navigate a maze. Traditionally in Reinforcement Learning (RL), the robot aims to reach its goal quickly and gets feedback based on how well it performs in the maze. But Reinforcement Learning from Human Feedback (RLHF) takes it a step further by letting humans give extra input. They can comment on more than just speed, like whether the robot avoids obstacles or takes a path that looks good.

For example, if the robot picks a path that dodges obstacles or follows a route that humans like, it might get some bonus points. This way, the robot learns not just to reach the goal fast, but also to consider what humans prefer.

For large language models (LLMs), RLHF complements prompt engineering. Rather than changing the prompt, it tunes the model itself on human preference feedback, so that its responses to prompts reflect our preferences and can be updated as those preferences change over time. By including human input, it helps make sure the results are closer to what we're looking for, across different tasks and fields.
