
Fine-Tuning Large Language Models with Your Own Data to Mimic Your Style (Part II)

Updated: Mar 10

Fine-tuning LLMs to your tasks has never been easier! In this post, we will explore different ways of fine-tuning LLMs and compare the pros and cons of each method. We will review and compare the fine-tuned models and share some tricks for squeezing a 14B LLM into a cheap, consumer-grade 16GB GPU, putting the pedal to the metal!

In the previous post, we explored ways to synthetically generate datasets from unstructured data sources and reviewed methods for improving the quality of the extracted data. As a tip, always invest some time in improving your data quality, since it is the most important part of solving your task.


Fine-tuning


Fine-tuning language models has become increasingly common in recent years due to advances in natural language processing (NLP). However, before we dive into the pros and cons of fine-tuning large language models (LLMs), let's first define what it means. Fine-tuning an LLM involves training the model on a specific set of data customized toward a particular use case or domain. This approach allows us to leverage the pre-trained weights learned during previous stages while adapting them specifically for our intended purposes. Fine-tuning is actually a special type of transfer learning.


Pros

  • Task-Specific Adaptation: Fine-tuning allows LLMs to be adapted to specific tasks or domains, making them more accurate and effective for those tasks. For example, an LLM can be fine-tuned for sentiment analysis, machine translation, or question-answering.


  • Improved Performance: By fine-tuning on task-specific data, LLMs can achieve better performance than when using only pre-trained models. This is because the model learns the nuances and patterns of the task at hand, leading to higher accuracy and better results.


  • Reduced Training Time: Fine-tuning often requires less training time compared to training from scratch. This is because the LLM already has a solid foundation of knowledge and only needs to learn the specific task-related information.


  • Domain-Specific Knowledge: Fine-tuning helps the model acquire domain-specific knowledge, making it more proficient in that area. For instance, an LLM can be fine-tuned for legal document analysis, medical diagnosis, or financial forecasting.


Cons

  • Data Quality and Quantity: Fine-tuning requires high-quality data, which can be expensive and time-consuming to collect and curate. The amount of data needed can also be substantial, especially for more complex tasks.


  • Computational Resources (not in our case :P): Fine-tuning LLMs is computationally intensive and requires significant hardware resources, including powerful GPUs and large amounts of memory. This can lead to high costs in terms of infrastructure and energy consumption.


  • Overfitting: There is a risk of overfitting the model to the specific dataset used for fine-tuning, which can result in poor generalization to new, unseen data. This is particularly problematic if the dataset has biases or is not representative of the broader domain.


  • Model Size and Complexity: Fine-tuning large models can be difficult due to their size and complexity. Managing these models effectively during training and inference can be challenging.


Fine-tuning: SFT or DPO?


Model fine-tuning is an essential step in developing open-ended dialogue systems, as it enables the model to learn the dialogue context and other target properties, such as an engaging conversational style, from expert demonstrations. We present two approaches to fine-tuning chat models: Supervised Fine-tuning (SFT) and Direct Preference Optimization (DPO).


In SFT, we use a standard supervised learning approach to train the model on labeled input-output pairs, in our case instruction/response pairs. SFT is straightforward, efficient, and effective when high-quality datasets are available, but it lacks direct preference alignment, which limits its flexibility to adapt to nuanced human expectations, and it can inherit biases from the dataset, leading to overly rigid responses.
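As a rough illustration of what those pairs look like as training text (the exact prompt template is an assumption, since the post does not specify one), the instruction/response pairs from the previous post could be flattened like this:

def to_text(example):
    # Hypothetical template; any consistent instruction/response layout works for SFT.
    return {"text": f"### Instruction:\n{example['instruction']}\n\n### Response:\n{example['response']}"}

train_dataset = dataset.map(to_text)  # `dataset`: a Hugging Face Dataset of instruction/response pairs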


In contrast, the DPO approach fine-tunes the model's parameters by directly optimizing a preference-based contrastive loss. DPO does not rely on a separate reward model as RLHF does, which makes the optimization process computationally more efficient. We show that although DPO does not require additional supervision beyond the preference pairs, it can achieve better model performance than SFT in terms of overall response quality and human preference, while inheriting the advantages of SFT.
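To make "preference-based contrastive loss" concrete, here is a minimal sketch of the DPO objective. Given log-probabilities of the chosen and rejected responses under the trainable policy and the frozen reference model, the loss rewards widening the policy's preference margin beyond the reference's (beta=0.1 is an assumed, commonly used default):

import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps, ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Log-prob margins between the preferred ("chosen") and dispreferred ("rejected") responses.
    policy_margin = policy_chosen_logps - policy_rejected_logps
    ref_margin = ref_chosen_logps - ref_rejected_logps
    # Maximize the log-sigmoid of the scaled margin difference (negated, since we minimize the loss).
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()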


Our experiments show that DPO outperforms SFT in terms of overall response quality and human preferences, while still benefiting from the strengths of SFT such as efficiency and effectiveness. Our results also demonstrate that the combination of SFT and DPO can further improve model performance and provide more flexibility in adapting to different human preferences.


Squeezing the bytes out of GPUs


Fine-tuning sounds expensive, right? Well, it depends on your task and your experience. In some cases, training or fine-tuning an LLM can evaporate your budget (and bankrupt you), but done right, it can cost you just a few bucks here and there. Here, we are going to explore cost-effective ways to make the most of every single bit of your hardware, without a dramatic loss in performance.


We are not going to reinvent the wheel, but rather use popular open-source libraries to get the most out of our hardware. Our base model is going to be Qwen2.5 14B, which is a pretty big model for cheap hardware. Now, to fit this beast into 16GB of VRAM, we rely on quantization techniques that reduce the precision of the floating-point weights, but we can live with some drop in performance.


There is a plethora of posts discussing how many bits of precision should be used. In our case, we apply 4-bit quantization to the 14B model. By "compressing" the model this way, we can fit it into less than 13GB of VRAM, which is pretty cool (code snippet below; we are using unsloth):


from unsloth import FastLanguageModel
model, tokenizer = FastLanguageModel.from_pretrained(model_name=model_name, max_seq_length=2048, load_in_4bit=True)  # model_name: a Qwen2.5 14B checkpoint
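The post does not spell out the adapter setup, but in practice training a 4-bit model means attaching small LoRA adapters on top of the frozen quantized weights (QLoRA-style). Here is a sketch of that step with unsloth, where the rank and target modules are assumptions rather than the post's actual values; for scale, 14B parameters at 4 bits is roughly 7GB of weight memory, with the rest of the ~13GB going to activations, adapter optimizer state, and the KV cache:

model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # LoRA rank (assumed)
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing=True,  # trade compute for memory
)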

We are going to use the data from the previous post for both the SFT and DPO fine-tuning. For SFT, we split the data into chunks and feed them to the model iteratively to reduce memory consumption. Below we can see how the model's loss drops over the training epochs.

In this case, and after some hyperparameter selection (e.g., batch size, learning rate, weight decay, gradient accumulation steps), the training is progressing well: the loss is decreasing and the accuracy is increasing.
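For reference, here is a hedged sketch of what the SFT loop can look like with trl's SFTTrainer (argument names vary across trl versions, and every hyperparameter value below is an assumption, not the post's actual configuration):

from trl import SFTTrainer
from transformers import TrainingArguments

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,  # the formatted instruction/response text
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        per_device_train_batch_size=2,  # small batches plus accumulation keep VRAM usage low
        gradient_accumulation_steps=8,
        learning_rate=2e-4,
        weight_decay=0.01,
        num_train_epochs=3,
        logging_steps=10,
        output_dir="sft-qwen2.5-14b",
    ),
)
trainer.train()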


In DPO, we do not start from the base model, but rather from the fine-tuned model produced by the SFT step. The idea is to enhance the fine-tuning through preference optimization, which has been shown in various cases to be superior to SFT and is also really easy to implement. There is no need for deep RL expertise with this method, because the heavy lifting is actually done internally against the frozen-weights reference model.

In DPO's case, a decreasing loss and an increasing reward margin mean that training is proceeding well.
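A corresponding sketch with trl's DPOTrainer, starting from the SFT checkpoint. Again, argument names differ across trl versions and the values are assumptions; the preference dataset is expected to contain prompt, chosen, and rejected columns:

from trl import DPOTrainer
from transformers import TrainingArguments

dpo_trainer = DPOTrainer(
    model=model,  # the SFT-tuned model with LoRA adapters
    ref_model=None,  # with adapters attached, trl recovers the frozen reference by disabling them
    beta=0.1,  # strength of the preference margin (assumed)
    train_dataset=preference_dataset,  # columns: "prompt", "chosen", "rejected"
    tokenizer=tokenizer,
    args=TrainingArguments(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        learning_rate=5e-6,
        output_dir="dpo-qwen2.5-14b",
    ),
)
dpo_trainer.train()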


Comparison


Now let's compare these two fine-tuned versions and see: 1) whether they retain the new information that was fed to them, and 2) which one mimics the writing style better. For this experiment, we use the same prompt and the same hyperparameters to compare the base model, the SFT model, and the SFT+DPO model.

prompt = "Write me a paragraph about AdaFair and how it works"

Base (Qwen2.5-14B):

AdaFair is an online platform that connects buyers and sellers of used cars in India. It offers a range of services such as car valuation, inspection, financing options, and insurance coverage to ensure a seamless buying experience for customers. The company uses advanced technology to provide accurate vehicle history reports, which helps buyers make informed decisions. Additionally, AdaFair also provides after-sales support to its customers, including warranty coverage and roadside assistance. Overall, AdaFair aims to revolutionize the way people buy and sell used cars by making the process more transparent, convenient, and hassle-free.

SFT:

AdaFair is a fairness-aware algorithm that aims to mitigate bias in machine learning models by adjusting their decision boundaries based on demographic information. The algorithm works by incorporating fairness constraints into the optimization process, which ensures that similar individuals from different groups receive similar treatment. This approach helps prevent discrimination against protected classes such as race or gender while maintaining high accuracy in predictions. By considering both predictive performance and fairness criteria during training, AdaFair provides more equitable outcomes across various demographics, making it particularly useful for applications where unbiased decisions are crucial.

SFT+DPO:

AdaFair is a fairness-aware machine learning framework designed to mitigate bias in predictive models. It operates by incorporating fairness constraints into the training process, ensuring that the model's predictions are not influenced by sensitive attributes such as race or gender. The core idea behind AdaFair is to adjust the weights of different data points during training so that the model learns from all groups equally, thereby reducing discrimination against any particular subgroup. This approach helps in achieving more equitable outcomes across various demographic categories while maintaining high accuracy on overall performance metrics

It's quite clear that the base model's answer is completely wrong, while SFT and SFT+DPO both answered the prompt correctly. Of course, it is a bit harder to distinguish which one mimics the writing style better, so let's try another prompt.

prompt = "Rephrase this text for a scientific post: Reading is one of the most powerful tools for personal and intellectual growth, offering countless benefits for the mind and soul. It enhances cognitive abilities, improves vocabulary, and strengthens critical thinking skills by exposing readers to diverse perspectives and complex ideas. Regular reading has also been linked to better memory retention and mental agility, reducing the risk of cognitive decline over time. Beyond its intellectual advantages, reading provides an escape from daily stress, fostering relaxation and emotional well-being."

SFT response:

Reading offers numerous benefits for both personal and intellectual growth. It not only enhances cognitive abilities but also improves vocabulary and strengthens critical thinking skills through exposure to various perspectives and ideas. Moreover, regular reading has been associated with improved memory retention and mental agility, potentially reducing the risk of cognitive decline. In addition to these intellectual advantages, reading serves as an effective means of escaping daily stress, promoting relaxation and emotional well-being.<|endoftext|>

SFT+DPO response:

Reading is a highly effective means of fostering personal and intellectual development, providing numerous advantages for the mind and spirit. It strengthens cognitive abilities, improves vocabulary, and enhances critical thinking skills by exposing readers to a wide range of perspectives and intricate ideas. Engaging in regular reading has been linked to better memory retention and mental agility, which can help reduce the risk of cognitive decline over time. In addition to its intellectual benefits, reading offers an escape from daily stress, promoting relaxation and emotional well-being.<|endoftext|>

Even though both answers seem fine, at this point SFT matches the writing style better! Other prompts showed similar behavior, which indicates that the negative samples are not allowing the model to pick up the writing style effectively. There are a couple of ways to fix this: 1) re-create the negative samples with better k-shot prompting, 2) use more advanced models for the synthetic generation, 3) create a manually annotated dataset to improve quality, and so on. In any case, the SFT method provides good enough results, having been fine-tuned on only 2k samples.


Conclusion


Fine-tuning large language models (LLMs) has become an accessible and effective technique for increasing model performance, despite initial concerns about cost and complexity. In our study, we examined various fine-tuning methods, including supervised fine-tuning (SFT) and Direct Preference Optimization (DPO), and their effects on model behavior. We found that while DPO improves preference alignment, SFT excels at preserving writing style.


To make fine-tuning more efficient, we employed quantization techniques, enabling the process on consumer-grade GPUs. The optimal approach varies based on the specific application and limitations. As fine-tuning methods advance, refining negative samples, improving data quality, and utilizing advanced synthetic generation techniques will further enhance model capabilities.



** If you are interested in other ML use cases, please contact me using the form (and include a publicly available dataset for your case; I'm always curious to explore new problems).
