Milvian Insights

Capturing Complete Context in PDFs using X-RAG

Written by Dakota Smith | Jan 27, 2025 7:22:05 PM

“X-RAG” – Extended Retrieval-Augmented Generation for Complex PDF Content Extraction

Introduction

Complex PDF documents often blend multi-page text with rich visual information, presenting a significant challenge for conventional data extraction. Traditional “text-only” pipelines typically parse each page as an isolated text chunk, overlooking crucial cross-page references and visual cues embedded within images. This gap leads to incomplete or fragmented information retrieval.

To address these limitations, we introduce “X-RAG,” or Extended Retrieval-Augmented Generation. By coupling page-to-image conversion with large language model (LLM) processing, X-RAG ensures the full context—including multi-page text flows and intricate visual details—is captured and transformed into a structured representation (e.g., XML).

 

The X-RAG Method at a Glance

X-RAG extends traditional Retrieval-Augmented Generation in two key ways:

  • Cross-Page Contextualization
    • Problem: Information about a given topic (e.g., a product feature) may span multiple PDF pages or appear in images scattered throughout the document.
    • Solution: X-RAG processes these pages in overlapping batches, referencing previously generated content so that context isn’t “lost” when moving from one batch to the next.
  • Visual Element Integration
    • Problem: Images within PDFs can contain text, diagrams, or other key insights (e.g., annotated schematics, process flows). Conventional text extraction or embeddings ignore these elements.
    • Solution: X-RAG uses a page-to-image approach and provides these images to the LLM for image-aware reasoning. The system then generates XML tags that incorporate detailed image descriptions.
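
The overlapping-batch idea can be sketched in a few lines. This is a minimal illustration rather than the actual X-RAG implementation: the function name `overlapping_batches` and the `batch_size` and `overlap` parameters are assumptions made for the example.

```python
from typing import List


def overlapping_batches(num_pages: int, batch_size: int = 3, overlap: int = 1) -> List[List[int]]:
    """Group 1-based page numbers into batches that share `overlap` pages."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    if not 0 <= overlap < batch_size:
        raise ValueError("overlap must satisfy 0 <= overlap < batch_size")

    step = batch_size - overlap  # how far the window advances each time
    batches: List[List[int]] = []
    start = 1
    while start <= num_pages:
        end = min(start + batch_size - 1, num_pages)
        batches.append(list(range(start, end + 1)))
        if end == num_pages:  # last page reached, stop sliding
            break
        start += step
    return batches


if __name__ == "__main__":
    # A 7-page document, 3 pages per batch, 1 page shared between neighbours.
    print(overlapping_batches(7, batch_size=3, overlap=1))
```

With an overlap of one page, the last page of each batch is seen again at the start of the next, which gives the model a chance to pick up a sentence or figure that straddles the boundary.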

 

Workflow Overview

1. Input PDF

The process begins with a PDF document.

2. Convert PDF Pages to Images

Each page of the PDF is rendered as an image so that its visual elements (diagrams, figures, text embedded in graphics) can be processed downstream, for example by an AI model or an OCR system.

3. Batching + Context

Before further processing, the page images (and/or their text segments) are grouped into batches. This step also gathers the context needed for downstream tasks, such as metadata or other supporting information. Using the LLM's multimodal capabilities, we assess how the page images relate to one another and batch pages with related content together.

4. PDF Text Extraction

In parallel (or as a complementary process), the text layer is extracted from the original PDF. This might be done via OCR (if the PDF does not have a searchable text layer) or by simply reading the embedded text if available.

5. LLM Image Enrichment

Using a Large Language Model (LLM), the images are analyzed or “enriched.” This typically means generating descriptions, identifying key elements, or extracting metadata that augments the extracted PDF text with more detailed context or semantic information.

6. XML Assembly

All of the outputs (the extracted text, image descriptions, and any other contextual data) are combined to form an XML structure. This involves mapping the text and image metadata into a specified XML schema or format.
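
As a rough sketch of the assembly step, the snippet below builds the hierarchy with Python's standard `xml.etree.ElementTree` module. The <manual>, <section>, <page>, and <image> element names follow the ones mentioned in this article; the <text> element, the attribute names, and the shape of the `pages` input are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List


def assemble_manual(title: str, pages: List[Dict]) -> ET.Element:
    """Combine extracted text and image descriptions into one XML tree.

    Each entry in `pages` is assumed to look like:
    {"number": 1, "text": "...", "images": ["description", ...]}
    """
    manual = ET.Element("manual", {"title": title})
    # A single section keeps the sketch short; a real pipeline would
    # open one <section> per detected heading.
    section = ET.SubElement(manual, "section", {"name": "body"})
    for page in pages:
        page_el = ET.SubElement(section, "page", {"number": str(page["number"])})
        text_el = ET.SubElement(page_el, "text")
        text_el.text = page["text"]
        for description in page.get("images", []):
            image_el = ET.SubElement(page_el, "image")
            image_el.text = description
    return manual


if __name__ == "__main__":
    root = assemble_manual(
        "Demo",
        [{"number": 1, "text": "Intro & setup", "images": ["Wiring diagram"]}],
    )
    # ElementTree escapes special characters such as "&" on serialization.
    print(ET.tostring(root, encoding="unicode"))
```

Building the tree through an XML library instead of string concatenation means escaping is handled automatically, which removes one common source of malformed output.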

7. XML Repair & Validation

The assembled XML is then checked for correctness. Any structural or syntactic errors are repaired, and the content is validated against a target schema or set of rules to ensure completeness and compliance.
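
One possible shape for this step is shown below, using only the Python standard library. The two repairs (stripping code fences that LLMs often wrap around their output, and escaping bare ampersands) and the two rules checked are illustrative assumptions. The standard library has no XSD validator, so the rule check here stands in for validation against a real schema.

```python
import re
import xml.etree.ElementTree as ET
from typing import List, Tuple

FENCE = "`" * 3  # code-fence marker that LLM output is often wrapped in
# An "&" that does not start a predefined or numeric entity.
BARE_AMP_RE = re.compile(r"&(?!(?:amp|lt|gt|quot|apos|#\d+|#x[0-9a-fA-F]+);)")


def repair_xml(raw: str) -> str:
    """Apply two cheap repairs: drop wrapping fences, escape bare ampersands."""
    lines = raw.strip().splitlines()
    if lines and lines[0].startswith(FENCE):
        lines = lines[1:]
    if lines and lines[-1].strip() == FENCE:
        lines = lines[:-1]
    return BARE_AMP_RE.sub("&amp;", "\n".join(lines))


def validate_xml(xml_text: str) -> Tuple[bool, List[str]]:
    """Check well-formedness, then a few hypothetical structural rules."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        return False, [f"not well-formed: {exc}"]
    problems: List[str] = []
    if root.tag != "manual":
        problems.append(f"unexpected root element <{root.tag}>")
    for page in root.iter("page"):
        if not page.get("number", "").isdigit():
            problems.append("<page> without a numeric 'number' attribute")
    return (not problems), problems


if __name__ == "__main__":
    raw = FENCE + 'xml\n<manual><page number="1">R&D notes</page></manual>\n' + FENCE
    print(validate_xml(repair_xml(raw)))
```

Anything the cheap repairs cannot fix surfaces as a parse error, at which point a pipeline could retry the batch or ask the LLM to correct its own output.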

8. Final XML

The process concludes with a properly structured, validated XML file that integrates all the information extracted and enriched from the PDF.


Unique Advantages of X-RAG

Retains Multi-Page Context

By carrying forward “in-progress” XML from earlier batches, the system avoids disjointed references. Sections spanning multiple pages become a coherent, single narrative.
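
The carry-forward loop might look roughly like the sketch below. The `call_llm` callable is a placeholder for whatever multimodal model call a pipeline uses, `fake_llm` is a stand-in so the example runs without a model, and the `context_chars` cap is an assumed way of keeping the carried XML small.

```python
from typing import Callable, List, Sequence


def process_batches(
    batches: Sequence[Sequence[int]],
    call_llm: Callable[[Sequence[int], str], str],
    context_chars: int = 2000,
) -> List[str]:
    """Run each batch through the LLM, passing along the XML produced so far."""
    fragments: List[str] = []
    carried = ""  # tail of the in-progress XML, given to the next call
    for pages in batches:
        fragment = call_llm(pages, carried)
        fragments.append(fragment)
        # Keep only the most recent characters so the context stays bounded.
        # The cut may land mid-tag; it is reference text, not parsed input.
        carried = (carried + fragment)[-context_chars:]
    return fragments


def fake_llm(pages: Sequence[int], previous_xml: str) -> str:
    """Stand-in for a real model call; it only records whether context arrived."""
    note = "continued" if previous_xml else "start"
    return "".join(f'<page number="{n}" note="{note}"/>' for n in pages)


if __name__ == "__main__":
    for fragment in process_batches([[1, 2], [2, 3]], fake_llm):
        print(fragment)
```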

Full Visual Intelligence

Enabling the LLM to “see” page images means it can interpret diagrams, call out captions, and even handle textual overlays that standard text extraction might miss.

Highly Structured Output

The final XML preserves hierarchical organization, from high-level <manual> tags, to <section> tags, down to <page> and <image> elements, making it far simpler to index, search, or reuse the data.

Scalability

X-RAG can scale to very large PDFs by adjusting batch size, ensuring the LLM remains within token limits.
The approach is modular: each batch is processed in its own step, and the resulting XML pieces are concatenated and cleaned.
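
A simple version of the concatenate-and-clean step is sketched below. It assumes each batch returns a flat run of <page> elements carrying a `number` attribute, and that when batches overlap, the first copy of a page is the one to keep. Both are assumptions for the example, not documented behaviour.

```python
import xml.etree.ElementTree as ET
from typing import Iterable


def merge_fragments(fragments: Iterable[str]) -> ET.Element:
    """Concatenate per-batch XML and drop pages repeated by overlapping batches."""
    manual = ET.Element("manual")
    seen = set()
    for fragment in fragments:
        # A fragment may hold several top-level elements, so wrap it
        # in a temporary root before parsing.
        wrapper = ET.fromstring(f"<batch>{fragment}</batch>")
        for page in wrapper.findall("page"):
            number = page.get("number")
            if number in seen:
                continue  # already emitted by an earlier batch
            seen.add(number)
            manual.append(page)
    return manual


if __name__ == "__main__":
    merged = merge_fragments([
        '<page number="1">a</page><page number="2">b</page>',
        '<page number="2">b again</page><page number="3">c</page>',
    ])
    print(ET.tostring(merged, encoding="unicode"))
```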

 

Key Challenges and Considerations

Prompt Design

Effective instructions are crucial. The prompt must clearly explain how to handle page boundaries, references, and images to maintain continuity and avoid duplication.
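
To make this concrete, here is one way a per-batch prompt could be put together. The wording of the instructions, the function name `build_prompt`, and the `already_covered` parameter are illustrative assumptions, not the prompt X-RAG actually uses.

```python
from typing import Sequence


def build_prompt(
    pages: Sequence[int],
    previous_xml: str,
    already_covered: Sequence[int] = (),
) -> str:
    """Build the text instructions sent alongside a batch of page images."""
    page_list = ", ".join(str(p) for p in pages)
    lines = [
        f"You are converting PDF pages {page_list} into XML.",
        "Use <section>, <page> and <image> elements, and describe every image inside an <image> tag.",
        "If a section continues from the previous batch, continue it instead of opening a new one.",
    ]
    if already_covered:
        covered = ", ".join(str(p) for p in already_covered)
        lines.append(
            f"Page(s) {covered} already appear in the XML so far; "
            "use them for context only and do not repeat them."
        )
    if previous_xml:
        lines.append("XML generated so far (for reference):")
        lines.append(previous_xml)
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_prompt([3, 4, 5], '<page number="3"/>', already_covered=[3]))
```

Spelling out which pages are context-only addresses the duplication risk directly, and including the earlier XML gives the model the references it needs to continue a section across a page boundary.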

Token Management

LLMs have context window limits. Large documents require careful batching (e.g., 2–3 pages at a time) to manage token usage.
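
A back-of-the-envelope budget shows how a batch size of a few pages can fall out of the arithmetic. Every number in the example is hypothetical: real per-image token costs and context limits depend on the model and the image resolution.

```python
def pages_per_batch(
    context_limit: int,
    tokens_per_page_image: int,
    prompt_tokens: int,
    carried_context_tokens: int,
    reserved_output_tokens: int,
) -> int:
    """Return how many page images fit in one request under a token budget."""
    if tokens_per_page_image <= 0:
        raise ValueError("tokens_per_page_image must be positive")
    available = (
        context_limit
        - prompt_tokens
        - carried_context_tokens
        - reserved_output_tokens
    )
    if available <= 0:
        return 0  # nothing left for images; shrink the carried context
    return available // tokens_per_page_image


if __name__ == "__main__":
    # Hypothetical figures: an 8,000-token window, 1,100 tokens per page image,
    # 500 for instructions, 1,000 for carried XML, 3,000 kept for the answer.
    print(pages_per_batch(8000, 1100, 500, 1000, 3000))
```

Under those assumed figures, 3,500 tokens remain for images, which is enough for three pages per batch.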

Inconsistent or Noisy PDF Layouts

In reality, pages may vary widely in design (column-based text, large images, embedded fonts). A robust image rendering process is essential to capture content reliably.

Validation

The system’s final step validates XML structure to prevent malformed tags or partial outputs from earlier incomplete steps.

 

Applications of X-RAG

Technical Manuals

Long user guides with cross-referenced diagrams. X-RAG ensures the transitions between textual instructions and diagrams are preserved.

Legal & Regulatory Documents

Important references can span multiple pages, and images or tables may contain key provisions. X-RAG keeps each reference in context with relevant visual cues.

Academic Papers & Research Reports

Multi-page figures (e.g., a figure split across consecutive pages) remain logically coherent. Supporting text is properly tied to each figure element.

Large-Scale Archiving

X-RAG’s structured output is ideal for digital repositories, e-discovery tools, or knowledge bases needing advanced search and analytics.

 

Conclusion

X-RAG marries the strengths of large language models with extended retrieval across page-spanning context and images. By preserving visual, textual, and structural continuity in an XML output, X-RAG is designed to keep critical information from being lost, supporting richer search and analytics over complex, image-heavy PDF documents.

This method surmounts the limitations of purely text-centric pipelines, offering a robust, holistic view of the document’s information. Whether it’s technical manuals, regulatory filings, or research papers, X-RAG yields complete and cohesive datasets ready for advanced downstream processing.
