In today's digital landscape, AI has transformed our relationship with technology. Large Language Models (LLMs), in particular, have changed the way we communicate with machines. If you've ever wished a language model could better represent your unique style and voice, fine-tuning it with your data is the answer. Re-training the whole model on your data would also be an option, but it risks overwriting the model's existing knowledge (catastrophic forgetting) and is far more expensive and time-consuming, depending on the hardware. For our use case, we will use publicly available research papers that were written some time ago (check here).

In this blog post, we will focus on data extraction, preprocessing, and generating instruction-response datasets: three critical steps to customize language models to fit your needs. In the second part of this series, we will use this dataset to fine-tune our open-source (base) LLM and deploy it locally as an endpoint for HTTP requests.
Why Fine-Tuning Matters
Fine-tuning a pre-trained language model tailors it to resonate with your specific style, tone, and content preferences. This isn't merely about making a sophisticated model smarter; it’s about making it truly yours. When you customize a model, you enhance its ability to engage in meaningful ways. This personalized approach leads to more enjoyable experiences, whether you are drafting emails, creating engaging content, or automating responses.
Data Extraction: The First Step
Data extraction lays the foundation for fine-tuning any language model. This process involves collecting text that reflects your writing style or the content you regularly interact with. In a previous post, we showed how you can crawl data daily to build up your data source.
Begin by identifying where your writing style is best captured. For example, in the LLM Engineer's Handbook, the authors crawl data from their publicly available blog posts. You can also consider your own blog posts, emails, social media updates, or any documentation you've written. Aim to pull in sources that represent your style.
In terms of quantity, the usual advice is to gather as much as possible, but I would suggest focusing on quality: you can achieve more with, say, 1,000 high-quality samples than with 10,000 low-quality ones. A diverse array of topics, ranging from technical articles to personal anecdotes, can also enhance the model's understanding of your writing style, so try to avoid a repetitive theme. (In our case, we target my academic writing style and want to see if the LLM can replicate it.)
Data Preprocessing: Refining Your Dataset
After gathering your data, the next critical step is preprocessing. This phase prepares your raw data for effective training.
Cleaning the Data
Start with cleaning your text to ensure clarity and coherence in training. Focus on eliminating:
Duplicate sentences
Formatting errors
Unnecessary special characters
By doing this, you'll allow the model to grasp your writing style better, ensuring it learns the nuances that make your voice unique. For academic papers, there are open-source Python libraries that parse the data. We used "scipdf", which requires the "en_core_web_sm" model from "spaCy" and a GROBID service running locally. Instructions on how to install them can be found here.
Of course, parsing does not guarantee high-quality text, which in practice means we have to dive into the data and write rules to increase its quality. For our collection of files, the following function cleans the text.
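import re

def clean_scientific_text(text: str) -> str:
    """
    Cleans parsed scientific text by removing non-ASCII characters, newlines,
    excessive whitespace, and special symbols.
    :param text: Input string.
    :return: Cleaned string.
    """
    # Drop non-ASCII characters (math symbols, ligatures, mis-decoded bytes)
    text = re.sub(r'[^\x00-\x7F]+', '', text)
    # Drop numeric citation markers such as [12]
    text = re.sub(r'\[\d+\]', '', text)
    # Drop author-year citations such as (Smith et al., 2020);
    # [^()] keeps the match inside a single pair of parentheses
    text = re.sub(r'\([^()]*?et al\., \d{4}\)', '', text)
    # Keep only letters, digits, basic punctuation, and whitespace
    text = re.sub(r'[^a-zA-Z0-9.,;:\-\s]', '', text)
    # Drop isolated single characters (leftovers of stripped formulas;
    # note that this also removes the article "a")
    text = re.sub(r'(?<=\s)([a-zA-Z0-9])(?=\s)', '', text)
    # Collapse newlines and repeated whitespace into single spaces
    text = re.sub(r'\s+', ' ', text)
    # Re-attach punctuation that was left floating
    text = text.replace(' . ', '. ').replace(' , ', ', ')
    return text.strip()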
This function receives a string and returns a cleaned version of it. It is advisable to understand the structure of your data to make the most of it. The preprocessing is done sequentially, section by section and paragraph by paragraph. Paragraphs shorter than 500 characters are discarded to keep the quality high and the samples meaningful. It is also good practice to keep metadata, e.g., the section to which the extracted content belongs. In various use cases this information helps with the task; in RAG, for example, you may want to filter out content that comes from the abstract section.
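To make the paragraph-level filtering concrete, here is a minimal sketch. It assumes each parsed section is a dict with a "heading" and a "text" key and that paragraphs are separated by newlines; both are assumptions about the parser output, and the names Sample and build_samples are hypothetical. In a real pipeline the cleaning function would be applied to each paragraph before the length check.

```python
from dataclasses import dataclass

MIN_PARAGRAPH_CHARS = 500  # threshold mentioned in the text


@dataclass
class Sample:
    paper: str    # hypothetical identifier of the source paper
    section: str  # heading of the section the paragraph came from
    text: str     # the paragraph itself


def build_samples(paper_id, sections, min_chars=MIN_PARAGRAPH_CHARS):
    """Split each section into paragraphs and keep the long ones with metadata."""
    samples = []
    for section in sections:
        for paragraph in section["text"].split("\n"):
            paragraph = paragraph.strip()
            if len(paragraph) < min_chars:
                continue  # too short to be a meaningful sample
            samples.append(Sample(paper_id, section["heading"], paragraph))
    return samples


# Toy input: one short abstract, one long and one short method paragraph
sections = [
    {"heading": "Abstract", "text": "Too short."},
    {"heading": "Method", "text": "x" * 600 + "\n" + "short"},
]
samples = build_samples("paper-001", sections)
print([(s.section, len(s.text)) for s in samples])  # [('Method', 600)]
```

Keeping the section heading on every sample makes later filtering a one-liner, for example dropping everything whose section is "Abstract".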
My dataset after some further filtering comes down to ~500 samples, and looks like this:

This means that we have around 500 long paragraphs of text which we can use for generating the dataset for the LLM's fine-tuning.
Generating the Dataset Through Back-Translation
Once your data is cleaned and stored, it's time to create a training dataset. This step involves producing samples that meet the requirements of the model you wish to fine-tune. Our end goal is to fine-tune a base model through pairs of instructions and answers, and having raw content can be very handy when it comes to generating instruction-answer pairs.
Back-translation is a technique commonly used in machine translation, but it can also be a powerful tool for dataset generation, especially for creating instruction-answer pairs from existing content.
Back-translation for translation tasks is the process of taking a piece of text, translating it into another language, and then translating it back into the original language. The idea is that by forcing the text through this transformation, you introduce slight variations while preserving the meaning. In instruction-answer generation, back-translation takes the unstructured raw text and uses an LLM to generate instruction-answer pairs from it.
In our use case, to get a good variety of instruction-answer pairs, we used an ensemble of open-source LLMs from the Hugging Face Hub, such as Llama-3.2-3B-Instruct, vicgalle/Configurable-Llama-3.1-8B-Instruct, and tiiuae/Falcon3-10B-Instruct-GPTQ-Int4. We picked models with a high IFEval score (Instruction-Following Evaluation, which tests a model's ability to follow explicit formatting instructions) and a small parameter count (< 14B) so that they fit on our consumer hardware (#GPU_Poor, #low_cost_solutions).
Prompting for Back-Translation
Prompting plays a crucial role in back-translation, especially when using large language models (LLMs) to generate high-quality instruction-answer pairs. A well-designed prompt ensures that the model preserves key details while introducing natural variations in phrasing. Without clear guidance, the output might drift too far from the original meaning or become overly simplified.
By carefully structuring prompts (specifying tone, detail level, and expected output format), you can generate diverse, high-quality dataset entries that enhance the robustness of your fine-tuned model. Effective prompting transforms back-translation from a simple linguistic exercise into a powerful tool for dataset augmentation. An interesting observation is that mentioning a tip to the LLM ($$$ grease the wheels baby $$$) gives better results (reddit post). I noticed that the output, which was supposed to be a JSON object, failed less often when I "provided" a tip in my prompt. The prompt also uses 2-shot prompting, meaning it gives the LLM two examples to improve the output (it does). Below I include the system role and user prompt:
"You are an advanced writing assistant, helping users craft clear, engaging, and grammatically correct content. You adapt your tone based on user preferences. You get 200$ for each answer that satisfies the user's requirements."
Based on the following EXTRACT, generate three instruction-answer triples. Each triple should consist of:
1. An instruction asking about a specific topic in the context.
2. A generated answer that attempts to answer the instruction based on the context, named as 'rejected'.
3. An extracted answer that is a relevant excerpt directly from the given context, named as 'accepted'.
Only use concepts from the context to generate the instructions. Instructions must never explicitly mention a context, a system, an author, a course, or an extract.
Instructions must be self-contained and general.
Important:
- Ensure that the accepted answer is a verbatim copy from the context, including all punctuation and apostrophes and must imitate the writing style of the EXTRACT.
- Pay attention to the format and writing style of the EXTRACT. The accepted answer must be lengthy, and imitating the writing style of EXTRACT.
- The accepted answer must be at least two sentences or more, so try to create instructions that correspond to such accepted answers.
- Instructions must be self-contained and general, without explicitly mentioning a context, a system, a study, a course, an author, or extract.
The question and answer should be derived only from the given text. Use the following (IMPORTANT) JSON format:
[
{
"instruction": "...",
"rejected": "...",
"accepted": "..."
},
...
]
### **Example 1, EXTRACT**:
"The industrial revolution marked a pivotal shift in economic structures, leading to rapid urbanization and the emergence of mechanized production. Factories, once rare, became the dominant means of manufacturing, altering traditional labor patterns and reshaping societal hierarchies. Workers moved en masse from rural areas to cities, seeking employment in these newly established industries. However, this transition was not without consequence; working conditions were often harsh, hours were long, and wages remained low, particularly for women and children. Despite these challenges, technological advancements continued to drive production efficiency, setting the stage for economic expansion and global trade growth."
### **Example 1, Response**:
{
"instruction": "What were the key consequences of the industrial revolution on labor and urban life?",
"rejected": "The industrial revolution made life easier for everyone as technology improved, and workers had more free time.",
"accepted": "Workers moved en masse from rural areas to cities, seeking employment in these newly established industries. However, this transition was not without consequence; working conditions were often harsh, hours were long, and wages remained low, particularly for women and children."
}
### **Example 2, EXTRACT**:
"Spending time with loved ones is one of life's greatest joys! Whether it's sharing a laugh, enjoying a cozy afternoon together, or just being present, these moments truly bring happiness. The simple things often have the biggest impact, leaving us with warm memories that last forever."
### **Example 2, Response**:
{
"instruction": "What makes spending time with others such a meaningful experience?",
"rejected": "Time with others is valuable because it brings happiness and strengthens bonds.",
"accepted": "Spending time with loved ones is one of life's greatest joys! Whether it's sharing a laugh, enjoying a cozy afternoon together, or just being present, these moments truly bring happiness."
}
### **EXTRACT**:
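The paragraph to process is appended after the final EXTRACT marker. Below is a minimal sketch of how the prompt might be assembled into chat messages and how a model reply could be validated, since the JSON output does not always come back well-formed. The helper names build_messages and parse_triples are hypothetical, and the call that actually sends the messages to a model is left out on purpose.

```python
import json

REQUIRED_KEYS = {"instruction", "rejected", "accepted"}


def build_messages(system_prompt, user_prompt_template, extract):
    """Append the paragraph to the user prompt and wrap both prompts as chat messages."""
    user_prompt = f'{user_prompt_template}\n"{extract}"'
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]


def parse_triples(raw_output):
    """Return the well-formed triples found in a model reply, or [] on failure."""
    # Models often wrap the JSON in extra chatter, so cut out the outermost [...]
    start, end = raw_output.find("["), raw_output.rfind("]")
    if start == -1 or end <= start:
        return []
    try:
        data = json.loads(raw_output[start:end + 1])
    except json.JSONDecodeError:
        return []
    if not isinstance(data, list):
        return []
    # Keep only dicts that carry all three non-empty string fields
    return [
        item for item in data
        if isinstance(item, dict)
        and REQUIRED_KEYS.issubset(item)
        and all(isinstance(item[k], str) and item[k].strip() for k in REQUIRED_KEYS)
    ]


# Toy reply: one valid triple and one incomplete entry
reply = (
    'Sure! [{"instruction": "Q?", "rejected": "r", "accepted": "a. b."}, '
    '{"instruction": "broken"}]'
)
triples = parse_triples(reply)
print(len(triples))  # 1
```

Returning an empty list on any failure keeps the generation loop simple: a paragraph with no valid triples can be retried or skipped.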
Dataset Evaluation
While LLMs can generate large datasets efficiently, manual inspection remains crucial to ensure accuracy, coherence, and alignment with the intended use case. Automated generation often introduces subtle errors, inconsistencies, or biases that may go unnoticed without human review. By manually verifying a subset of the dataset, you can catch factual inaccuracies, unnatural phrasing, or instruction-answer mismatches that could degrade model performance.
Additionally, human oversight helps maintain diversity and ensures the dataset reflects real-world queries. A careful balance between automation and manual validation leads to higher-quality training data, ultimately improving the reliability and trustworthiness of the fine-tuned model.
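Drawing the subset for manual review can be as simple as the sketch below. The subset size and the seed are arbitrary assumptions, and sample_for_review is a hypothetical helper name.

```python
import random


def sample_for_review(pairs, k=20, seed=42):
    """Pick up to k pairs at random; a fixed seed keeps the selection reproducible."""
    rng = random.Random(seed)
    return rng.sample(pairs, min(k, len(pairs)))


# Toy dataset of 100 generated pairs
pairs = [{"instruction": f"question {i}", "accepted": f"answer {i}"} for i in range(100)]
subset = sample_for_review(pairs, k=5)
print(len(subset))  # 5
```

A fixed seed means the same samples are drawn on every run, which is handy when review notes refer to specific entries.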

Let's have a look at a couple of instruction-answer pairs for ourselves!
This one looks fine: the instruction-answer pair aligns with the content.
{'instruction': 'What are the limitations and trade-offs of using FABBOO in predictive modeling and execution time?', 'rejected': 'FABBOO is a good predictor because it has low unfair outcomes and high accuracy.', 'accepted': 'We observe that FABBOO has poor predictive performance across all datasets; however, its unfair outcomes remain low. As the number of weak learners increases, so does balanced accuracy but also the run time linearly e.g., for to 10 balanced accuracy is increased by for Adult cen., 18 for Bank, for Compas, for Default, 16 for KDD cen., 20 for Law Sc., for Loan, for NYPD and for synthetic. Similarly, in terms of execution time, the addition of more weak learners increases the time linearly. Interestingly, after sufficient number of weak learners e.g., 10 FABBOOs, results do not change significantly.', 'content': 'In Figure 4a, FABBOO is tuned to mitigate unfair outcomes based on statistical parity. We observe that for FABBOO has poor predictive performance across all datasets; however, its unfair outcomes remain low. As increases, so does balanced accuracy but also the run time linearly e.g., for to 10 balanced accuracy is increased by for Adult cen., 18 for Bank, for Compas, for Default, 16 for KDD cen., 20 for Law Sc., for Loan, for NYPD and for synthetic. Similar behavior can also be observed for equal opportunity Figure 4. In terms of execution time, the addition of more weak learners increases the time linearly. Interestingly after sufficient number of weak learners e.g., 10 FABBOOs results do not change significantly.'}
There are, however, cases where the LLM did not follow the instructions and copied the 2-shot prompting example when generating pairs, e.g.:
{'instruction': 'What are the key consequences of the industrial revolution on labor and urban life?', 'rejected': 'The industrial revolution made life easier for everyone as technology improved, and workers had more free time.', 'accepted': 'Workers moved en masse from rural areas to cities, seeking employment in these newly established industries. However, this transition was not without consequence; working conditions were often harsh, hours were long, and wages remained low, particularly for women and children.', 'content': 'From our experiments, we have observed that FABBOOs ability to maintain good predictive performance does not rely strongly on the hyper-parameter as long as is sufficiently large. As seen from the results, the performance does not change significantly, neither does FABBOO ability to mitigate unfair outcomes. In this section, we analyze FABBOOs behaviour by varying the class imbalance ratio over time -recall that we do not assume fixed minority class. We show how FABBOO is affected and the impact of parameter on the online class imbalance monitor of Equation 5. For this purpose, we have generated synthetic data streams of varying class ratios over time Figure 5. For, we consider values in range of 0, 0.99 with 0.1 incremental step. Recall that low means lower contribution of historical data higher decay and higher contribution of recent data. We report on balanced accuracy, recall and Cum.S.P. for visibility purposes Cum.S.P. was multiplied by 10 .'},
Such cases need to be removed from the dataset to maintain high quality. There are various ways to do that. One is to define a set of keywords that filter out these responses. Another is to use an LLM as a judge and prompt it to decide whether an instruction matches the content. Using sentence embeddings is also a good way to discard semantically irrelevant pairs. In our case, we kept it simple: we defined a set of keywords that matched our filtering rules and finalized the instruction-answer dataset. We also removed samples with short (<100 characters) or very long (>650 characters) answers. The "answer length" distribution shows that the majority of our samples are between 100 and 200 characters. After the filtering process, we are ready to move on to fine-tuning the LLM for our awesome task!
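A minimal sketch of the keyword and length filters could look like this. The keyword list is hypothetical (it is simply taken from the two few-shot examples, not the list actually used), and the sketch assumes the length limits apply to the 'accepted' answer.

```python
# Hypothetical keywords taken from the two few-shot examples in the prompt
BLOCKED_KEYWORDS = ("industrial revolution", "loved ones")
MIN_ANSWER_CHARS = 100
MAX_ANSWER_CHARS = 650


def keep_pair(pair):
    """Return True if the pair passes the answer-length and keyword filters."""
    answer = pair["accepted"]
    if not MIN_ANSWER_CHARS <= len(answer) <= MAX_ANSWER_CHARS:
        return False  # answer too short or too long
    haystack = f'{pair["instruction"]} {answer}'.lower()
    # Reject pairs that leak content from the few-shot examples
    return not any(keyword in haystack for keyword in BLOCKED_KEYWORDS)


# Toy pairs: a leaked few-shot example, a too-short answer, and a valid pair
pairs = [
    {"instruction": "What are the key consequences of the industrial revolution?",
     "accepted": "Workers moved en masse from rural areas to cities. " * 3},
    {"instruction": "What is prequential evaluation?",
     "accepted": "Too short."},
    {"instruction": "How does run time change with more weak learners?",
     "accepted": "The addition of more weak learners increases the time linearly. " * 2},
]
filtered = [p for p in pairs if keep_pair(p)]
print(len(filtered))  # 1
```

The same predicate is a natural place to plug in an embedding-based or LLM-as-a-judge check later, since it already decides pair by pair.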

A sneak peek of the training data (JSON is a common structure to store the instruction-response entries):
[{"role": "user", "content": "What are the characteristics of the SentiWordNet database that make it prone to errors compared to the distant supervision approach?"}, {"role": "assistant", "content": "On the other side, SentiWordNet contains high-quality sentiment annotations comparing to the distant supervision approach that relies on emoticons as proxies for sentiment, and therefore, it is prone to errors."}],
[{"role": "user", "content": "What are some common methods used to tackle discrimination in machine learning models?"}, {"role": "assistant", "content": "Among the most popular methods in this category are class-label swapping, instance re-weighting, sampling, and instance transformation."}],
[{"role": "user", "content": "What is the purpose of using labeled data for training and testing in a prequential evaluation approach?"}, {"role": "assistant", "content": "We refer to this as prequential evaluation. We also consider holdout evaluation: we split the original dataset into training and testing set. The evaluation procedure is similar to prequential evaluation, the only difference is that we use for training testing only data from the training testing, accordingly set."}],
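The entries above can be produced from the filtered triples with a few lines. This sketch assumes the 'accepted' answer becomes the assistant turn and that the 'rejected' answer is simply left out of this format; to_chat_sample and to_training_json are hypothetical helper names.

```python
import json


def to_chat_sample(pair):
    """Turn one instruction-answer pair into a user/assistant exchange."""
    return [
        {"role": "user", "content": pair["instruction"]},
        {"role": "assistant", "content": pair["accepted"]},
    ]


def to_training_json(pairs):
    """Serialize all pairs as a JSON list of chat exchanges."""
    samples = [to_chat_sample(pair) for pair in pairs]
    return json.dumps(samples, ensure_ascii=False, indent=2)


# Toy pair; the 'rejected' field is ignored by this sketch
pairs = [
    {"instruction": "What is prequential evaluation?",
     "rejected": "It is a kind of test.",
     "accepted": "We refer to this as prequential evaluation."},
]
training_json = to_training_json(pairs)
print(training_json)
```

The resulting string can be written to a file as-is and loaded back with json.loads when the fine-tuning step needs it.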
Conclusions
In this first part, we explored the essentials of data extraction, preprocessing, and dataset generation, the key steps toward creating a model that captures your authentic style. In the next part, we will explore the fine-tuning process itself, discussing the training methods and tweaks that can help you achieve your desired results.
** If you are interested in other ML use-cases, please contact me using the form (and also include a publicly available dataset for this case, I'm always curious to explore new problems).