Source data to customer content flow

Using Python to extract data from engineering systems, clean it, and parse it into user documentation.

On a recent project, we had CSV tables of 30,000+ lines that we needed to publish in a searchable, readable format for customers. After some trials, I found that parsing the information into a non-indexed HTML format that relies on the web browser's built-in text search addressed the problem: generating 30,000+ indexable locations in each file burdens both document generation and access to the files themselves.
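The browser-search approach can be sketched as a small wrapper that joins pre-rendered text tables into one plain HTML page with no anchors or index; the function name and sample content below are illustrative, not taken from our pipeline:

```python
from html import escape

def tables_to_html(tables, title="Register tables"):
    """Wrap pre-rendered text tables in one minimal HTML page.

    No per-entry anchors or index are generated; readers locate
    entries with the browser's built-in text search (Ctrl+F).
    """
    body = "\n\n".join(f"<pre>{escape(t)}</pre>" for t in tables)
    return (f"<!DOCTYPE html><html><head><title>{escape(title)}</title>"
            f"</head><body>{body}</body></html>")

page = tables_to_html(["REG_CTRL  0x0000  control register"])
```

Because the page is just escaped text inside `<pre>` blocks, file size and render time stay flat no matter how many rows it carries.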

We used the Python pandas and texttable libraries to generate the 30,000+ readable text tables from the content. We also created a single overview table containing the key searchable content and a link to each individual table.
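A minimal sketch of that overview table, assuming hypothetical column names and output paths: pandas can render a link column directly if HTML escaping is disabled.

```python
import pandas as pd

# Hypothetical columns; the real engineering CSV schema differs.
df = pd.DataFrame({
    "name":   ["REG_CTRL", "REG_STATUS"],
    "offset": ["0x0000", "0x0004"],
})

# One overview row per register: the searchable key fields plus a
# link to the individual generated table file.
df["link"] = df["name"].map(lambda n: f'<a href="tables/{n}.html">{n}</a>')
overview_html = df[["link", "offset"]].to_html(escape=False, index=False)
```

Setting `escape=False` keeps the anchor markup intact in the rendered cell; everything else is plain, searchable text.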

After importing the source CSV files from engineering, at a high level we:

  1. Applied regex cleaning of the data fields.
  2. Created a single text file for each line, containing a description and a sub-table of key content, as shown below.
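The two steps above can be sketched roughly as follows; the cleaning patterns, column names, and output paths are illustrative assumptions, not the project's actual rules:

```python
import re
from pathlib import Path
import pandas as pd

# Hypothetical raw data standing in for the engineering CSV import.
df = pd.DataFrame({
    "name": ["REG CTRL ", "REG-STATUS"],
    "help": ["Control\x00 register", "Status register"],
})

# Step 1: regex cleaning of the data fields -- normalize separators
# in names, strip non-printable characters from descriptions.
df["name"] = df["name"].str.replace(r"[\s\-]+", "_", regex=True).str.strip("_")
df["help"] = df["help"].str.replace(r"[^\x20-\x7E]", "", regex=True)

# Step 2: one text file per row, description first.
out = Path("docs/tables")
out.mkdir(parents=True, exist_ok=True)
for _, row in df.iterrows():
    (out / f"{row['name']}.txt").write_text(f"{row['help']}\n")
```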

Our pipeline built the generated text files on the fly. This allowed us to import and parse the source data as often as needed, without adding a huge amount of data to the document repository.


# snippet: table generation from a pandas DataFrame row
import texttable

tableObj = texttable.Texttable(max_width=118)  # set table width
tableObj.header(["Range", "Name", "Type", "Reset", "Description"])
for i, row in df.loc[[index]].iterrows():  # iterate individual register fields as a table
    description = str(row['help']) + "\r\n\r\n" + str(row['map'])
    # Column keys below are assumed to mirror the header names above.
    tableObj.add_row([row['range'], row['name'], row['type'], row['reset'], description])
print(tableObj.draw())  # display the table
# .gitlab-ci.yml pipeline example
# (job name and script names are placeholders; the originals were elided)
build-tables:
  stage: build
  script:
    - python clean_fields.py
    - python generate_tables.py
  artifacts:
    paths:
      - _static/fields_*
      - docs/tables/*.CSV_overview

See also