In today's data-driven world, extracting useful information from PDF documents is more important than ever. PDF files, commonly used for sharing knowledge, pose unique challenges for Retrieval-Augmented Generation (RAG) and Natural Language Processing (NLP) tasks. The right PDF parser can dramatically improve how effectively your application retrieves and processes data from these files. In this blog post, we evaluate six open-source PDF parsers on a large-scale, manually annotated dataset to help you choose the best tool for your needs.
Significance of PDF Parsers in RAG and NLP
When dealing with RAG or NLP tasks, it is crucial to extract relevant information accurately from documents such as PDFs. These documents often contain complex structures like multi-column layouts, tables, and images, making high-quality PDF parsers essential. Without accurate extraction, a RAG system may struggle to find meaningful content, leading to incorrect or incomplete answers. A poorly parsed document might also jumble sections or fail to represent hierarchical structures (e.g., headings and subheadings) correctly, losing context. As a result, the retrieval mechanism in RAG can pull out irrelevant or fragmented text, making it difficult for the model to generate a coherent, contextually relevant response.
Key Factors to Consider When Evaluating PDF Parsers
When assessing PDF parsers, keep these key factors in mind to ensure optimal performance in your RAG applications.
1. Accuracy in Text Extraction
The most important aspect of any PDF parser is its accuracy in extracting text. Errors can arise from misinterpreted fonts, scanned documents, or poor optical character recognition (OCR).
To evaluate the accuracy, run tests on a variety of sample documents that include both text-based PDFs and scanned files. Look for discrepancies in the extracted text; even minor errors can change the meaning of critical information.
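As a quick first pass, a character-level diff against a manually verified reference transcription can surface such discrepancies. Below is a minimal sketch using pypdf and Python's standard difflib; the file path and reference text are placeholders:

```python
import difflib

from pypdf import PdfReader

# Placeholders: substitute a real test document and a manually verified
# transcription of its contents.
reference_text = "Quarterly revenue increased by 4.2% year over year."
reader = PdfReader("sample.pdf")
extracted_text = "\n".join(page.extract_text() or "" for page in reader.pages)

# A ratio close to 1.0 means near-identical text; flag low-ratio documents.
ratio = difflib.SequenceMatcher(None, reference_text, extracted_text).ratio()
print(f"Similarity ratio: {ratio:.3f}")

# Show the concrete differences for manual review.
for diff_line in difflib.unified_diff(
    reference_text.splitlines(), extracted_text.splitlines(), lineterm=""
):
    print(diff_line)
```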
2. Handling of Various PDF Formats
PDFs come in diverse formats, including those with images, tables, and multi-column layouts. A robust PDF parser should navigate these variations effectively.
For instance, when evaluating a parser, test it with documents that have tables or complex layouts. You should assess whether the extracted text flows logically and maintains its intended structure. A parser that handles these formats well minimizes the risk of losing important context.
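One way to run such a test is with a library that exposes table structure directly. The sketch below uses pdfplumber (one of the parsers evaluated later) to dump every detected table so you can compare it against the original layout; the file path is a placeholder:

```python
import pdfplumber

# Placeholder path: use a test document that contains at least one table.
with pdfplumber.open("table_sample.pdf") as pdf:
    for page_number, page in enumerate(pdf.pages, start=1):
        # extract_tables() returns each table as a list of rows,
        # where each row is a list of cell strings (or None).
        for table in page.extract_tables():
            print(f"Table on page {page_number}:")
            for row in table:
                print(row)
```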
3. Speed of Extraction
Speed is vital for maintaining efficiency: fast extraction keeps ingestion workflows running smoothly, while long extraction times can significantly hamper RAG and NLP tasks, especially when immediate results are needed.
During tests, measure how long it takes for different readers to extract text from files of various sizes. Aim for a reader that achieves a good balance between speed and accuracy.
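A simple wall-clock benchmark is usually enough for this. The sketch below times two of the parsers evaluated in this post over a directory of test files; the directory path is a placeholder, and the same pattern extends to the other libraries:

```python
import time
from pathlib import Path

import pymupdf  # PyMuPDF (older versions use `import fitz`)
from pypdf import PdfReader


def extract_with_pymupdf(path: Path) -> str:
    with pymupdf.open(str(path)) as doc:
        return "".join(page.get_text() for page in doc)


def extract_with_pypdf(path: Path) -> str:
    return "".join(page.extract_text() or "" for page in PdfReader(path).pages)


extractors = {"pymupdf": extract_with_pymupdf, "pypdf": extract_with_pypdf}

# Placeholder directory containing test PDFs of various sizes.
for pdf_path in sorted(Path("test_pdfs").glob("*.pdf")):
    for name, extract in extractors.items():
        start = time.perf_counter()
        text = extract(pdf_path)
        elapsed = time.perf_counter() - start
        print(f"{pdf_path.name} | {name}: {elapsed:.3f}s, {len(text)} chars")
```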
4. Preservation of Document Structure and Formatting
Preserving the original structure and formatting of PDFs is another critical quality measure. Each document’s layout, including headings and footnotes, often contains contextual clues vital for understanding the content.
To assess this, compare the extracted text against the original PDF, checking for maintained hierarchies and formatting. Parsers that successfully preserve visual structure can enhance comprehension, which is particularly important in RAG and NLP tasks.
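Font metadata offers a practical proxy for this check: headings usually carry larger font sizes than body text, so you can dump each text span with its size and verify the hierarchy against the original. A minimal sketch with PyMuPDF (the file path is a placeholder):

```python
import pymupdf  # PyMuPDF (older versions use `import fitz`)

# Placeholder path: a document with a clear heading hierarchy.
with pymupdf.open("structured_sample.pdf") as doc:
    page = doc[0]
    # The "dict" output preserves block/line/span structure with font metadata.
    for block in page.get_text("dict")["blocks"]:
        for line in block.get("lines", []):  # image blocks have no "lines"
            for span in line["spans"]:
                # Larger font sizes usually indicate headings; compare against
                # the original PDF to verify the hierarchy survives extraction.
                print(f"{span['size']:5.1f}pt  {span['text']!r}")
```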
Benchmarking PDF Parsers with the DocLayNet Dataset
DocLayNet is a large-scale, human-annotated document layout segmentation dataset containing 80,863 pages from diverse domains such as finance, science, and patents. Human experts manually labeled each page, drawing bounding boxes around layout elements across 11 distinct classes, including text, tables, images, and headers. This dataset is valuable for training and evaluating document understanding models, enabling tasks like information extraction, document summarization, and search.
For our analysis, we sampled around 2,000 PDFs of varying sizes, requiring each to contain more than 100 tokens. Below, we highlight some characteristics of our sample with respect to length (tokens) as well as categories.
The following figure shows the distribution of tokens per document in this sample of 2,000 documents.
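For reproducibility, the sampling step itself can be as simple as the sketch below. The corpus records and the tokenizer are placeholders: in practice, each record pairs a DocLayNet PDF with its annotated ground-truth text, and whitespace splitting stands in for whatever tokenizer defines the token counts.

```python
import random

# Placeholder corpus: in practice, pair each DocLayNet PDF with the
# ground-truth text assembled from its annotations (~80k pages).
documents = [
    {"path": "doc_0001.pdf", "text": "..."},
    # ...
]


def token_count(text: str) -> int:
    # Whitespace splitting is a stand-in for the actual tokenizer.
    return len(text.split())


# Keep only documents with more than 100 tokens, then sample ~2,000 of them.
eligible = [d for d in documents if token_count(d["text"]) > 100]
random.seed(42)  # fixed seed for a reproducible sample
sample = random.sample(eligible, k=min(2000, len(eligible)))
```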
Comparison Mechanism: Embedding Models
In our analysis, we’ve used embedding models as a mechanism to compare two sentences based on their semantic similarity. Embedding models, such as those based on transformer architectures like BERT or Sentence-BERT, represent sentences as dense vectors in a high-dimensional space. These embeddings capture the underlying meaning of the sentences, encoding not just the individual words but also the relationships between them in context, which makes them ideal for RAG systems.
By comparing the cosine similarity between the sentence embeddings, we can effectively measure how similar two sentences are in terms of their content and meaning, rather than just their lexical overlap. This approach makes sense because it transcends simple keyword matching, allowing for a more nuanced comparison that accounts for synonyms, word order, and contextual differences.
We used the best-performing model with the fewest parameters on the Hugging Face embedding leaderboard at the time of writing: stella_en_400M_v5, which ranks 6th with only 435M parameters.
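A sketch of the scoring step with the sentence-transformers library is shown below. The repository id and the trust_remote_code flag follow the model card at the time of writing and may change; the two example sentences are placeholders:

```python
from sentence_transformers import SentenceTransformer, util

# Repository id per the model card at the time of writing; loading details
# such as trust_remote_code may vary across model versions.
model = SentenceTransformer("dunzhang/stella_en_400M_v5", trust_remote_code=True)

ground_truth = "The company reported a net profit of $3.2 million in Q4."
extracted = "Net profit of $3.2 million was reported by the company in Q4."

# Encode both texts and score them by cosine similarity (1.0 = identical).
embeddings = model.encode([ground_truth, extracted], convert_to_tensor=True)
score = util.cos_sim(embeddings[0], embeddings[1]).item()
print(f"Cosine similarity: {score:.3f}")
```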
Open Source PDF Parsers
PyPDF: A versatile PDF library for various operations like splitting, merging, cropping, and text extraction (v5.1.0)
pdfminer.six: A powerful tool for extracting text and metadata from PDF documents, providing detailed information about text location, font, and color (v20231228)
PyMuPDF: A high-performance library for data extraction, analysis, conversion, and manipulation of PDF documents (v1.24.13)
Docling: A library for reading and understanding various document formats, including advanced PDF parsing for layout analysis and table structure extraction (v2.4.2)
pdfplumber: A library for extracting text and tables from PDFs, providing detailed information about each character, rectangle, and line (v0.11.4)
pypdfium2: A Python binding to PDFium, a powerful library for PDF rendering, inspection, manipulation, and creation (v4.30.0)
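For reference, here is a minimal sketch of how each of these libraries can be invoked for plain-text extraction, using default settings throughout; the file path is a placeholder, and each project's documentation covers the many available options:

```python
# pypdf
from pypdf import PdfReader
text = "".join(page.extract_text() or "" for page in PdfReader("sample.pdf").pages)

# pdfminer.six
from pdfminer.high_level import extract_text
text = extract_text("sample.pdf")

# PyMuPDF
import pymupdf
with pymupdf.open("sample.pdf") as doc:
    text = "".join(page.get_text() for page in doc)

# Docling (converts to a structured document; here exported as Markdown)
from docling.document_converter import DocumentConverter
text = DocumentConverter().convert("sample.pdf").document.export_to_markdown()

# pdfplumber
import pdfplumber
with pdfplumber.open("sample.pdf") as pdf:
    text = "\n".join(page.extract_text() or "" for page in pdf.pages)

# pypdfium2
import pypdfium2 as pdfium
pdf_doc = pdfium.PdfDocument("sample.pdf")
text = "".join(page.get_textpage().get_text_range() for page in pdf_doc)
```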
Results
The following figure shows the performance in terms of mean cosine similarity between the ground truth and the extracted text for each method (higher is better). Across the sample of 2,000 randomly selected documents, pymupdf and pypdfium2 are the clear winners.
We also evaluated performance on PDFs with less text and more figures: the following figure shows results for PDFs with fewer than 1,024 tokens, which amounts to 1,218 documents. The docling library performs worse on these files, whereas the other parsers are not significantly affected.
In terms of extraction time (lower is better), pymupdf and pypdfium2 again come out on top, while docling and pdfplumber are the slowest of the evaluated packages.
Our analysis aligns well with a recent paper that compared similar open-source PDF parsers on the same dataset.
Conclusions
In this blog post, we have evaluated a set of open-source PDF parsers. Such analysis is essential for efficient data extraction. By focusing on key factors such as accuracy, handling of diverse formats, speed, and structure preservation, you can select the right tool for your needs. From our analysis, we conclude that pymupdf and pypdfium2 are the best methods for this dataset, which spans a variety of domains.
Incorporating both qualitative and quantitative assessments into your evaluation process can provide a clearer picture of performance. Ultimately, your choice of PDF parser directly affects the ability of your systems to generate accurate responses and drive effective workflows.
If you are interested in other ML use cases, please contact me using the form (and include a publicly available dataset for your case; I'm always curious to explore new problems).