Generating clean text for LLM processing
Preprocessing hundreds of complex documents into a format suitable for future analysis via AI APIs.
Challenge
- Provide organization-level context for single-document edits during the review phase.
- Prepare hundreds of complex technical documents for custom LLM training.
Intended results
- A Python function that cleans the output of a Sphinx build into plain text optimized for machine reading. To do this in a structured way, we build the documentation as a single, hierarchical HTML file and then strip out the HTML tags until only plain text remains (a sketch follows this list). We could also generate unstructured XML or TXT files via Sphinx's built-in builders.
- Fine-tuning of an existing LLM to cover an organization's full suite of technical documents. For this, we would need to create clean text outputs of all our documents (as above) and fine-tune our chosen base model on them.
- A pipeline YAML script that creates a machine-readable text output of proposed changes and additions.
- A merge request function that summarizes these changes and looks for inconsistencies against the organization's entire technical documentation suite.
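As a rough illustration of the cleaning step, the sketch below strips a single-file Sphinx HTML build down to plain text. It assumes the build was produced with Sphinx's singlehtml builder and that BeautifulSoup is available; the file path and the list of tags to drop are placeholders, not a definitive implementation.

```python
from pathlib import Path

from bs4 import BeautifulSoup  # assumes beautifulsoup4 is installed


def sphinx_html_to_text(html_path: str) -> str:
    """Strip a single-file Sphinx HTML build down to plain text."""
    soup = BeautifulSoup(Path(html_path).read_text(encoding="utf-8"), "html.parser")

    # Drop elements that carry no documentation content.
    for tag in soup(["script", "style", "nav", "header", "footer"]):
        tag.decompose()

    # Collapse the remaining markup into newline-separated text.
    text = soup.get_text(separator="\n")

    # Remove blank lines and stray whitespace left behind by the markup.
    lines = (line.strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)


# Example usage (the path is a placeholder for the singlehtml build output):
# print(sphinx_html_to_text("_build/singlehtml/index.html"))
```

The same function can be run over every document in the suite to produce the clean text corpus needed for the fine-tuning and comparison steps.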
Background
Applications of Custom LLMs in Document Analysis
- Content Summarization: Generate summaries of long documents to provide quick overviews of changes or new content.
- Version Control: Manage different versions of documents, tracking changes over time and maintaining a comprehensive history of document evolution.
- Quality Assurance: Ensure that new documents meet quality standards by checking for completeness, accuracy, and adherence to templates or guidelines.
- Contextual Awareness: Understand the context and semantics of technical content across different documents. Capture subtle nuances and relationships between concepts.
- Document Alignment: Compare changes with the existing corpus of documents. Identify similarities, differences, and potential contradictions by analyzing the semantic meaning rather than exact text matches.
- Fuzzy Matching: Quantify how closely related or contradictory new information is to existing documents. Identify potential inconsistencies that may not be obvious through simple keyword matching (a sketch of this follows the list).
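As a minimal sketch of the document-alignment and fuzzy-matching ideas above, the snippet below embeds a proposed passage and the existing corpus, then ranks corpus passages by cosine similarity. The sentence-transformers library and the model name are assumptions made for illustration; any embedding model, including one exposed through an LLM API, could take their place.

```python
from sentence_transformers import SentenceTransformer, util  # assumed dependency

# Model choice is an assumption; any embedding model would do.
model = SentenceTransformer("all-MiniLM-L6-v2")


def rank_related_passages(proposed: str, corpus: list[str], top_k: int = 5):
    """Return the corpus passages most semantically similar to the proposed text."""
    proposed_vec = model.encode(proposed, convert_to_tensor=True)
    corpus_vecs = model.encode(corpus, convert_to_tensor=True)

    # Cosine similarity between the proposed passage and each corpus passage.
    scores = util.cos_sim(proposed_vec, corpus_vecs)[0]

    ranked = sorted(zip(corpus, scores.tolist()), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_k]
```

Reviewers would then inspect the closest matches for contradictions that a plain keyword search would miss.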
Merge requests
We use merge requests as our key review platform for proposing new documents or edits to existing documents. As part of these requests, we script a number of checks that run automatically and whose output is provided for reviewers to consider. For example, we run style and grammar checks on the proposed content to flag any use of incorrect terms or spelling. If there are errors, they are listed in a report and the document release is blocked until they are fixed.
We would add this type of document analysis as an automated phase of the document review pipeline.
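A sketch of what that phase might look like is below: a small script that a pipeline job could call with the cleaned text of the proposed change, writing a report for reviewers and exiting non-zero to block the release when the analysis flags a problem. The query_llm helper, the file paths, and the report-checking heuristic are hypothetical stand-ins for whichever LLM API and pipeline conventions the organization actually uses.

```python
import sys
from pathlib import Path


def query_llm(prompt: str) -> str:
    """Hypothetical wrapper around the organization's chosen LLM API."""
    raise NotImplementedError("Replace with a real API call.")


def review_change(changed_text: str, corpus_summary: str) -> str:
    """Ask the model to summarize the change and flag inconsistencies."""
    prompt = (
        "Summarize the following documentation change and list any statements "
        "that contradict the existing documentation.\n\n"
        f"Existing documentation (summary):\n{corpus_summary}\n\n"
        f"Proposed change:\n{changed_text}"
    )
    return query_llm(prompt)


if __name__ == "__main__":
    # Paths are placeholders supplied by the pipeline job.
    changed = Path("change.txt").read_text(encoding="utf-8")
    corpus = Path("corpus_summary.txt").read_text(encoding="utf-8")

    report = review_change(changed, corpus)
    Path("review_report.txt").write_text(report, encoding="utf-8")

    # Block the release if the report flags a contradiction (heuristic placeholder).
    if "CONTRADICTION" in report.upper():
        sys.exit(1)
```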