Fine Tuning: OpenAI Models + Your Confluence Data

March 16, 2023

If ChatGPT is the iPhone, then the App Store is yet to be built, and it will be a suite of vertical applications built on top of ChatGPT. Every industry, every business, and every individual is going to build these applications in small and big ways: from a healthcare- or manufacturing-specific ChatGPT, to writing domain-specific emails for marketing, to answering enterprise-specific questions from an internal knowledge base, to personal search engines that answer questions like "What is John’s address?" or "When is my mother-in-law’s birthday?" We covered a bunch of examples in our previous blog post.

Why is ChatGPT not enough in itself for these vertical applications?

To understand this, it’s important to understand the difference between the clear web and the deep web:

  • Clear Web: publicly accessible web pages indexed by search engines. e.g. Wikipedia, books, social media posts, etc.
  • Deep Web: the part of the internet that sits behind an authentication system, e.g. your email or SaaS platforms. This constitutes 96% of the web.

Models like ChatGPT are trained on a massive dataset, but all of it comes from the clear web. So you can’t ask them a question whose answer depends on anything in the deep web, like your email or private docs. However, in the process of learning from the massive clear web dataset, models like ChatGPT learn so much about language and semantics that it becomes much easier for them to pick up new information from small quantities of data for a specific task.

Enter fine-tuning

Fine-tuning is a powerful technique that lets us leverage the knowledge of a pre-trained model like ChatGPT and improve its performance on a new task by training it on a smaller, task-specific dataset. For example, let’s say you want to build a question-answering system for your internal company docs stored in Confluence. You can take all the text content from your Confluence and fine-tune a GPT model on it.

Sounds simple. Why is everyone not doing it?

While OpenAI has reduced a lot of friction through its fine-tuning APIs, fine-tuning still requires a fair bit of consideration, planning, and effort along several axes:

Data Preparation

The data that you need to fine-tune the models has to be in a specific prompt-completion pair format.

{"prompt": "<prompt text>", "completion": "<ideal generated text>"}

A few pointers to consider:

  1. It is non-trivial to take arbitrary documents like company Confluence docs and convert them into high-quality examples in the above format.
  2. You need a few hundred examples at the very minimum, which requires human effort and subject matter expertise.
  3. You need carefully chosen separators that mark the end of the prompt and the end of the completion and that do not appear anywhere in the text itself. You also need to ensure that the prompts you send at inference time use the same separators.
  4. Increasing the number of examples improves the performance of the model.
  5. Something as simple as adding a space before the completion starts can improve performance.
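The pointers above can be sketched in a few lines of Python. This is a minimal illustration, not the exact pipeline we used: the separator `\n\n###\n\n` and the stop sequence ` END` are assumed conventions chosen for the example, and `make_example` and `to_jsonl` are hypothetical helper names.

```python
import json

# Assumed conventions for this sketch: a fixed separator that ends every prompt
# and a stop sequence that ends every completion.
PROMPT_SEPARATOR = "\n\n###\n\n"
COMPLETION_STOP = " END"


def make_example(question, answer):
    """Build one prompt-completion pair in the fine-tuning format."""
    for text in (question, answer):
        # The markers must never occur inside the content itself.
        if "###" in text or COMPLETION_STOP in text:
            raise ValueError("separator or stop sequence appears in the text")
    return {
        "prompt": question.strip() + PROMPT_SEPARATOR,
        # Leading space before the completion, as suggested in pointer 5.
        "completion": " " + answer.strip() + COMPLETION_STOP,
    }


def to_jsonl(pairs):
    """Serialize (question, answer) pairs as one JSON object per line."""
    return "\n".join(json.dumps(make_example(q, a)) for q, a in pairs)


print(to_jsonl([("Where is the VPN guide?", "It lives in the IT space.")]))
```

At inference time, the same separator would be appended to each incoming prompt and the stop sequence passed as the stop parameter, so that the model sees the format it was trained on.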

Model Selection

OpenAI has multiple models that can be fine-tuned, and each of them has tradeoffs that need to be considered:

  1. Ada: The smallest and fastest model, and the cheapest to fine-tune. It works well on simpler tasks like classification and can handle basic language generation and question answering. Given that it has fewer parameters, it is also less capable than the others and can compromise precision.
  2. Babbage: A step up from Ada in size and capability while still fast and inexpensive. It is a reasonable choice for straightforward tasks but is less suited to complex ones.
  3. Curie: Larger and more powerful than Ada and Babbage, and general purpose. Because of its larger number of parameters, it is harder to fine-tune well on small datasets.
  4. Davinci: The largest and most powerful general-purpose model. It is the slowest and most expensive to fine-tune and use, so it is best reserved for complex NLP tasks.

Note: You can also incrementally fine-tune a previously fine-tuned model. Because of API changes, this only works for models that were fine-tuned after April 21, 2022.

Hyper-Parameter Selection

We tried OpenAI’s fine-tuning APIs with the default parameters, and they worked very well on some tasks, but we saw improvements of as much as 40% after tweaking the hyper-parameters.

  1. Learning Rate Multiplier: defaults to 0.05, 0.1, or 0.2 depending on the final batch_size. The fine-tuning learning rate is the original learning rate used for pretraining multiplied by this multiplier. The learning rate determines how quickly the model adjusts its weights during training. A learning rate that is too high can make training unstable or cause the model to converge to a suboptimal solution, while a learning rate that is too low can cause the model to converge too slowly or get stuck in a local minimum. OpenAI recommends larger learning rates for larger batch sizes, typically in the range of 0.02 to 0.2.
  2. Batch Size: defaults to ~0.2% of the number of examples in the training set, capped at 256. The batch size determines how many examples are processed in each training iteration. A larger batch size can lead to faster training times, but it can also cause the model to overfit or result in higher memory usage. It is generally recommended to use a batch size that is as large as possible without causing excessive memory usage.
  3. Number of Epochs: defaults to 4. The number of epochs determines how many times the model will be trained on the entire dataset. Too few epochs can result in underfitting, while too many epochs can result in overfitting. It is generally recommended to monitor the model's performance on a validation set and stop training when the performance stops improving.
  4. Prompt Loss Weight: defaults to 0.01. The weight to use for loss on the prompt tokens. This controls how much the model tries to learn to generate the prompt (as compared to the completion which always has a weight of 1.0), and can add a stabilizing effect to training when completions are short. For short prompts, OpenAI recommends increasing this number and for long prompts a small weight works the best.
  5. Model: defaults to curie. The choice of model, covered in the list above, is also a hyper-parameter, but we felt it deserved a special mention given how important it is as a starting point in this process.
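To make the knobs above concrete, here is a hedged sketch that assembles the hyper-parameters into a request body. The field names (`training_file`, `model`, `n_epochs`, `batch_size`, `learning_rate_multiplier`, `prompt_loss_weight`) follow the fine-tunes endpoint as it existed when this post was written and may have changed since. The helper names, the rounding in `default_batch_size`, the range guard, and the file ID `file-abc` are assumptions made for illustration. Nothing is sent over the network.

```python
def default_batch_size(n_examples):
    """Roughly 0.2% of the training set, capped at 256 (rounding is assumed)."""
    return max(1, min(256, round(n_examples * 0.002)))


def build_fine_tune_request(training_file_id, n_examples, model="curie",
                            n_epochs=4, learning_rate_multiplier=None,
                            prompt_loss_weight=0.01, batch_size=None):
    """Return the request body as a dict; defaults mirror the list above."""
    if learning_rate_multiplier is not None:
        # Guard chosen for this sketch: stay inside the typical 0.02 to 0.2 range.
        if not 0.02 <= learning_rate_multiplier <= 0.2:
            raise ValueError("learning_rate_multiplier outside 0.02 to 0.2")
    request = {
        "training_file": training_file_id,
        "model": model,
        "n_epochs": n_epochs,
        "batch_size": batch_size or default_batch_size(n_examples),
        "prompt_loss_weight": prompt_loss_weight,
    }
    # Leave the multiplier out so the service can pick its batch-size-dependent default.
    if learning_rate_multiplier is not None:
        request["learning_rate_multiplier"] = learning_rate_multiplier
    return request


# "file-abc" is a placeholder ID, not a real uploaded file.
print(build_fine_tune_request("file-abc", 50_000, model="ada",
                              n_epochs=6, learning_rate_multiplier=0.05))
```

Keeping the request as plain data makes it easy to log each experiment's settings next to its results when comparing runs.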

Cost consideration

Fine-tuning costs vary a lot depending on the parameters you choose, so it’s very important to have a solid understanding of them.


Training cost per 1M tokens:
- Ada: $0.4
- Babbage: $0.6
- Curie: $3
- Davinci: $30

  1. 1 token is roughly 4 characters or 0.75 words.
  2. Davinci is 75x more expensive than Ada.
  3. Curie is 7.5x more expensive than Ada and is the default model.
  4. To put it in perspective, Wikipedia has roughly 5B tokens, so a single epoch of fine-tuning Curie on it would cost a whopping $15,000 (5,000 x $3), and the default 4 epochs would cost 4x that. This is only an academic statement as these models are already pre-trained on Wikipedia. Please save your $$.
  5. Number of training tokens also depends on your training epochs. Basically total training tokens = tokens in your file * number of epochs.
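The arithmetic in this list can be written down as a small estimator. This is a back-of-the-envelope sketch using the training prices listed above; `estimate_tokens` and `training_cost` are hypothetical helper names, and the 0.75 words-per-token figure is only the rule of thumb from point 1.

```python
# Training prices in USD per 1M tokens, copied from the list above.
TRAINING_PRICE_PER_1M = {"ada": 0.4, "babbage": 0.6, "curie": 3.0, "davinci": 30.0}


def estimate_tokens(word_count):
    """Rule of thumb: 1 token is roughly 0.75 words."""
    return round(word_count / 0.75)


def training_cost(file_tokens, model, n_epochs=4):
    """Total training tokens = tokens in the file * number of epochs."""
    total_tokens = file_tokens * n_epochs
    return total_tokens * TRAINING_PRICE_PER_1M[model] / 1_000_000


# The Wikipedia example: roughly 5B tokens on Curie.
print(training_cost(5_000_000_000, "curie", n_epochs=1))  # 15000.0
print(training_cost(5_000_000_000, "curie"))              # 60000.0 with the default 4 epochs
```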


Using a fine-tuned model is significantly more expensive (roughly 4x to 6x) than using its pretrained counterpart.

Usage cost per 1M tokens (fine-tuned vs. base model):
- Ada: $1.6 vs $0.4
- Babbage: $2.4 vs $0.5
- Curie: $12 vs $2
- Davinci: $120 vs $20

When you use the model, parameters like best_of and n also affect your cost because you end up generating multiple completions for a single prompt. Consider capping the response length with max_tokens, or reducing best_of and n, to save cost.
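As a rough illustration of how these parameters multiply cost, the sketch below computes a worst-case bill for a single request against a fine-tuned model. It uses the fine-tuned usage prices listed above and assumes the prompt is billed once while every candidate completion is billed up to `max_tokens`; the function name is hypothetical and actual billing may differ.

```python
# Fine-tuned usage prices in USD per 1M tokens, copied from the list above.
FINE_TUNED_PRICE_PER_1M = {"ada": 1.6, "babbage": 2.4, "curie": 12.0, "davinci": 120.0}


def worst_case_request_cost(model, prompt_tokens, max_tokens, n=1, best_of=1):
    """Upper bound on the cost of one completion request.

    Assumption: best_of candidates are generated server-side and n of them are
    returned, so best_of drives the number of generated tokens.
    """
    if best_of < n:
        raise ValueError("best_of must be at least n")
    generated_tokens = best_of * max_tokens
    total_tokens = prompt_tokens + generated_tokens
    return total_tokens * FINE_TUNED_PRICE_PER_1M[model] / 1_000_000


single = worst_case_request_cost("curie", prompt_tokens=20, max_tokens=180)
sampled = worst_case_request_cost("curie", prompt_tokens=20, max_tokens=180, n=1, best_of=5)
print(f"single: ${single:.5f}, best_of=5: ${sampled:.5f}")
```

Under these assumptions, raising best_of from 1 to 5 makes the request more than four times as expensive even though only one completion is returned.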

Case Study: Fine-Tuning With Confluence Docs

Building a fine-tuned model for our own Confluence dataset was not trivial. It involved the following steps.

Data Fetching

It took a bit of effort to figure out how to read all the data from Confluence through its APIs, because Confluence pages hold rich content with tables, headings, and subheadings, and I had to convert that into simple text for ease of use. Managing permissions and the right level of access was not trivial either. In the end, we were able to create a very simple form where you submit the URL, username, and API key, and we fetch the data.
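A minimal sketch of this step, using only the Python standard library, might look like the following. It is not the code behind our form. It assumes the Confluence REST content endpoint (`/rest/api/content` with `expand=body.storage`, paginated via `start` and `limit`) and basic authentication with a username and API key; the helper names and the set of tags treated as line breaks are choices made for this example.

```python
import base64
import json
import urllib.request
from html.parser import HTMLParser

# Tags that should start a new line in the plain-text output (assumed set).
BLOCK_TAGS = {"p", "br", "li", "tr", "div", "h1", "h2", "h3", "h4", "h5", "h6"}


class _TextExtractor(HTMLParser):
    """Collects visible text and breaks lines at block-level tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in {"td", "th"}:
            self.parts.append(" | ")  # keep table cells distinguishable

    def handle_data(self, data):
        self.parts.append(data)


def storage_to_text(html):
    """Flatten Confluence storage-format markup into simple text."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    raw = "".join(parser.parts)
    cleaned = (" ".join(line.split()).strip(" |") for line in raw.splitlines())
    return "\n".join(line for line in cleaned if line)


def fetch_pages(base_url, username, api_key, limit=25):
    """Yield (title, text) for every page the credentials can read."""
    token = base64.b64encode(f"{username}:{api_key}".encode()).decode()
    start = 0
    while True:
        url = (f"{base_url}/rest/api/content?type=page"
               f"&expand=body.storage&start={start}&limit={limit}")
        request = urllib.request.Request(url, headers={"Authorization": f"Basic {token}"})
        with urllib.request.urlopen(request) as response:
            results = json.load(response)["results"]
        if not results:
            break
        for page in results:
            yield page["title"], storage_to_text(page["body"]["storage"]["value"])
        start += len(results)


print(storage_to_text("<h1>Setup</h1><p>Install <b>foo</b>.</p>"))
```

Only pages visible to the supplied credentials are returned, which is where the permission handling mentioned above comes in.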

Data Preprocessing

This was the hardest part. We tried multiple approaches:

  1. Random splitting of sentences: splitting based on the number of words in prompts and completions.
  2. Regex: separating out sentences at words like "and", "but", "however", etc.
  3. ChatGPT: giving context to ChatGPT and asking it to generate questions.

We eventually selected most pairs from #3 but also threw in some random samples from #1 and #2 above. For our dataset, we generated over 50,000 pairs of prompts and completions, with an average length of 17 words per prompt and 133 words per completion, for a total of 150 words per pair. Ideally, this step would need much more experimentation.
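Approach #1 is the easiest to illustrate. The sketch below walks through a document and cuts it into consecutive prompt and completion chunks of randomly chosen word lengths. It is an approximation of the idea rather than our exact script: the function name and the length ranges are assumptions, picked so that the averages land near the 17 and 133 words reported above.

```python
import random


def split_into_pairs(text, prompt_range=(10, 25), completion_range=(100, 165), seed=0):
    """Cut text into consecutive (prompt, completion) chunks of random word lengths."""
    rng = random.Random(seed)  # seeded so that runs are reproducible
    words = text.split()
    pairs = []
    start = 0
    while True:
        prompt_len = rng.randint(*prompt_range)
        completion_len = rng.randint(*completion_range)
        end = start + prompt_len + completion_len
        if end > len(words):
            break  # drop the tail that cannot fill a whole pair
        prompt = " ".join(words[start:start + prompt_len])
        completion = " ".join(words[start + prompt_len:end])
        pairs.append((prompt, completion))
        start = end
    return pairs


document = " ".join(f"word{i}" for i in range(1000))
pairs = split_into_pairs(document)
print(len(pairs), "pairs; first prompt has", len(pairs[0][0].split()), "words")
```

Pairs produced this way teach the model to continue company text rather than answer questions, which is one reason to mix them with question-style pairs instead of relying on them alone.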

Model fine-tuning

We experimented with different models and hyper-parameters and found that Curie tends to perform better than Ada or Babbage, but we felt Ada was good enough given the cost tradeoff. We didn’t try Davinci. We tweaked the learning rate multiplier and settled on 0.05, and we trained for 6 epochs. One training run cost about $30 for Curie and about $4 for Ada. Overall, across a bunch of experiments (including a few bad runs), fine-tuning cost us roughly $400 in OpenAI credits.


We noticed that the fine-tuned model performed strictly better than the original model on questions related to our internal dataset. We had our team try about 100 prompts and rate the responses manually. This is not scientific, but it worked for this simple use case. Interestingly, in some cases the fine-tuned model performed worse than the original model on general queries. We still need to debug what’s happening here.

This was a fun exercise, and we will keep experimenting with fine-tuning on other datasets.


If you want any specific datasets, or want me to open up this app to be used for your own Confluence, reach out to me at

TrueFoundry is a ML Deployment PaaS over Kubernetes to speed up developer workflows while allowing them full flexibility in testing and deploying models while ensuring full security and control for the Infra team. Through our platform, we enable Machine learning Teams to deploy and monitor models in 15 minutes with 100% reliability, scalability, and the ability to roll back in seconds - allowing them to save cost and release Models to production faster, enabling real business value realisation.  
