I recently finished a project where we had CSV tables of 30,000+ lines that we needed to present in a searchable, readable format for customers. After some trials, I found that parsing the data into non-indexed HTML that relies on the browser's built-in text search solved the problem: generating 30,000+ indexable locations in each file burdens both document generation and actual file access.
We used the Python pandas and texttable libraries to generate the 30,000+ readable text tables from the content. We also created a single basic overview table that included the key searchable content and a link to each individual table.
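A minimal sketch of how such an overview might be assembled with pandas; the column names and the `fields_<name>.txt` link targets are illustrative assumptions, not the project's actual schema:

```python
# Hypothetical sketch: build a single overview listing that points at the
# per-register text files. Column names and file naming are assumptions.
import pandas as pd

df = pd.DataFrame({
    "name": ["CTRL", "STATUS"],
    "range": ["0x0000", "0x0004"],
})

lines = []
for i, row in df.iterrows():
    # Each overview entry carries the searchable key content plus a link
    # to the generated per-register table file.
    lines.append(f"{row['range']}  {row['name']}  -> fields_{row['name']}.txt")

overview = "\n".join(lines)
print(overview)
```

The overview stays small and fully searchable, while the heavy per-register detail lives in the linked files.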
After importing the source CSV files from engineering, in outline we:
- Applied regex cleaning of the data fields.
- Created a text file for each line, containing a description and a sub-table of key content, as shown below.
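The cleaning step above might look something like this sketch; the actual patterns used in the project are not shown in the source, so the rules here (collapsing whitespace, normalising missing values) are assumptions for illustration:

```python
# Hypothetical regex clean-up of a data field column. The 'help' column
# name and the specific patterns are assumptions, not the project's rules.
import re
import pandas as pd

df = pd.DataFrame({"help": ["Enable\r\n  the   block", None]})

def clean(value):
    # Normalise missing values to empty strings and collapse runs of
    # whitespace (including stray carriage returns) into single spaces.
    text = "" if value is None else str(value)
    return re.sub(r"\s+", " ", text).strip()

df["help"] = df["help"].map(clean)
print(df["help"].tolist())
```

Cleaning before table generation keeps the rendered text tables free of the line-break and spacing artifacts that often ride along in engineering CSV exports.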
Our pipeline built the generated text files on the fly. This allowed us to import and parse the source data as often as we needed, without adding a huge amount of data to the actual document repository.
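The on-the-fly output step can be sketched as below; the `_static/` directory and the `fields_<name>.txt` naming are assumptions inferred from the CI artifact paths, not confirmed project conventions:

```python
# Sketch: write each rendered table to its own file under _static/ so
# only the source CSVs need to live in the repository. Directory layout
# and file naming are assumptions based on the CI artifact paths.
import os
import tempfile

out_dir = os.path.join(tempfile.mkdtemp(), "_static")
os.makedirs(out_dir, exist_ok=True)

# Placeholder rendered tables; in the real pipeline these would come
# from texttable's draw() output.
registers = {"CTRL": "rendered table text", "STATUS": "rendered table text"}

written = []
for name, rendered in registers.items():
    path = os.path.join(out_dir, f"fields_{name}.txt")
    with open(path, "w") as fh:
        fh.write(rendered)
    written.append(path)

print(written)
```

Because the files are regenerated on every pipeline run, a re-import of the source CSVs is just a rebuild rather than a repository change.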
```python
# generateMemoryMapRegisters.py snippet
# table generation from https://pypi.org/project/texttable/
import texttable

tableObj = texttable.Texttable(max_width=118)

# Set columns
tableObj.header(["Range", "Name", "Type", "Reset", "Description"])

# Iterate the individual fields of one register as a table
for i, row in df.loc[[index]].iterrows():
    description = str(row['help']) + "\r\n\r\n" + str(row['map'])
    tableObj.add_row(
        [str(row['range']), str(row['name']), str(row['type']),
         str(row['reset']), description]
    )

# Display table
print(tableObj.draw())
```
```yaml
# .gitlab-ci.yml pipeline example
fields:
  stage: build
  script:
    - python generateMemoryMapOverview.py
    - python generateMemoryMapRegisters.py
  artifacts:
    paths:
      - _static/fields_*
      - docs/tables/*.CSV_overview
```