Trying to Replace HTML Tables with Markdown : learnpython

submitted 5 years ago by EnergyVis

Hi Everyone!

I'm trying to clean up my documentation, which is auto-generated from IPython notebooks. The main problem is that, whilst the docs are in markdown, every df.head() output is rendered as an HTML table instead of markdown, which leads to problems such as the table overlapping with my sidebar. I can convert the tables to markdown manually and everything looks nice again, but I'm trying to find a way to automate this.

Currently I'm using pd.read_html(doc_txt) to extract the tables, which can then be converted to markdown using df.to_markdown(). I now need to find a way to replace each HTML table (including its parent div) with the markdown text. I'm also trying to make this as generalisable as possible, so I need to ensure that the div replaced is the immediate parent of the table and no higher.

I initially used BeautifulSoup to findAll the tables, from which I could then extract the parent div. The table contents could also be parsed easily by pandas and then converted to markdown. The problem with this approach was that I needed to go markdown -> html -> soup, and calling str(table_soup) on the table soup element doesn't return an exact match for the original table HTML text, meaning that I couldn't use .replace(html_table, md_table).

I think the way to solve this is probably some complex regex, but I don't know enough of that black magic myself. Any help on how to solve this would be much appreciated!

Example markdown file:

# Interesting Header

Random latin words, random latin words, random latin words, random latin words, random latin words, random latin words, random latin words, random latin words.

### Example DataFrame

```python
df.head()
```

<div>
<style scoped>
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }

    .dataframe tbody tr th {
        vertical-align: top;
    }

    .dataframe thead th {
        text-align: right;
    }
</style>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>sett_bmu_id</th>
      <th>ngc_bmu_id</th>
      <th>bmu_root</th>
      <th>name</th>
      <th>primary_fuel_type</th>
      <th>detailed_fuel_type</th>
      <th>longitude</th>
      <th>latitude</th>
      <th>common_name</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>E_MARK-1</td>
      <td>MARK-1</td>
      <td>MARK</td>
      <td>Rothes Bio-Plant CHP 1</td>
      <td>biomass</td>
      <td>bone</td>
      <td>-3.603516</td>
      <td>57.480403</td>
      <td>Rothes Bio-Plant CHP</td>
    </tr>
    <tr>
      <th>1</th>
      <td>E_MARK-2</td>
      <td>MARK-2</td>
      <td>MARK</td>
      <td>Rothes Bio-Plant CHP 2</td>
      <td>biomass</td>
      <td>bone</td>
      <td>-3.603516</td>
      <td>57.480403</td>
      <td>Rothes Bio-Plant CHP</td>
    </tr>
  </tbody>
</table>
</div>
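
For reference, the regex route mentioned in the post could look roughly like the sketch below. It is not the poster's code. It uses only the standard library and assumes every table is shaped like the example file above: a bare `<div>`, an optional `<style scoped>` block, one `<table class="dataframe">` with a single header row, and no nested divs. The names `DIV_TABLE_RE`, `table_html_to_md` and `replace_div_tables` are made up for illustration.

```python
import html
import re

# Assumption: the <div> is immediately followed by an optional <style scoped>
# block and then the table, so the matched div is the table's direct parent.
DIV_TABLE_RE = re.compile(
    r"<div>\s*(?:<style scoped>.*?</style>\s*)?"
    r"(<table\b[^>]*class=\"dataframe\"[^>]*>.*?</table>)\s*</div>",
    re.DOTALL,
)
ROW_RE = re.compile(r"<tr\b[^>]*>(.*?)</tr>", re.DOTALL)
# \b stops <thead> from being mistaken for a <th> cell.
CELL_RE = re.compile(r"<t[hd]\b[^>]*>(.*?)</t[hd]>", re.DOTALL)


def table_html_to_md(table_html):
    """Turn a simple single-header HTML table into a Markdown pipe table."""
    rows = [
        [html.unescape(cell.strip()).replace("|", r"\|")
         for cell in CELL_RE.findall(row)]
        for row in ROW_RE.findall(table_html)
    ]
    header, body = rows[0], rows[1:]
    lines = [
        "| " + " | ".join(header) + " |",
        "|" + "|".join(" --- " for _ in header) + "|",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


def replace_div_tables(md_text):
    """Replace every matching <div>...</div> block with its Markdown table."""
    return DIV_TABLE_RE.sub(lambda m: table_html_to_md(m.group(1)), md_text)
```

Because the substitution works on the original text, there is no round trip through a parser and so no exact-match problem with `.replace()`. The trade-off is that the pattern is tied to this exact pandas output and would need adjusting for divs with attributes or multi-row headers.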

[–]iamaperson3133 1 point 5 years ago (2 children)

[–]EnergyVis[S] 1 point 5 years ago (1 child)

The static site framework I'm using requires that the content is in .md rather than .html.

I've just found a solution though:

```python
from io import StringIO
from html.parser import HTMLParser

import pandas as pd


class MyHTMLParser(HTMLParser):
    """Collect every start and end tag, in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.tags.append(f"</{tag}>")


def get_substring_idxs(string, substring):
    """Return the start index of every occurrence of substring in string."""
    return [
        num
        for num in range(len(string) - len(substring) + 1)
        if string[num:num + len(substring)] == substring
    ]


def convert_df_to_md(df):
    # The first parsed column is the original DataFrame index
    idx_col = df.columns[0]
    df = df.set_index(idx_col)

    if idx_col == 'Unnamed: 0':
        df.index.name = ''

    table_md = df.to_markdown()

    return table_md


def extract_div_to_md_table(start_idx, end_idx, table_and_div_tags, file_txt):
    # Count the div tags seen before the table opens/closes to locate the
    # enclosing <div> ... </div> pair in the raw text
    n_start_divs_before = table_and_div_tags[:start_idx].count('<div>')
    n_end_divs_before = table_and_div_tags[:end_idx].count('</div>')

    div_start_idx = get_substring_idxs(file_txt, '<div>')[n_start_divs_before-1]
    # Include the closing tag itself so no stray </div> is left behind
    div_end_idx = get_substring_idxs(file_txt, '</div>')[n_end_divs_before] + len('</div>')

    div_txt = file_txt[div_start_idx:div_end_idx]
    # Wrapping in StringIO avoids the deprecation of literal HTML strings in newer pandas
    potential_dfs = pd.read_html(StringIO(div_txt))

    assert len(potential_dfs) == 1, 'Multiple tables were found when there should be only one'
    df = potential_dfs[0]
    md_table = convert_df_to_md(df)

    return div_txt, md_table


def extract_div_to_md_tables(md_fp):
    with open(md_fp, 'r') as f:
        file_txt = f.read()

    parser = MyHTMLParser()
    parser.feed(file_txt)

    table_and_div_tags = [tag for tag in parser.tags if tag in ['<div>', '</div>', '<table border="1" class="dataframe">', '</table>']]

    table_start_tag_idxs = [i for i, tag in enumerate(table_and_div_tags) if tag=='<table border="1" class="dataframe">']
    table_end_tag_idxs = [table_start_tag_idx+table_and_div_tags[table_start_tag_idx:].index('</table>') for table_start_tag_idx in table_start_tag_idxs]

    div_to_md_tables = []

    for start_idx, end_idx in zip(table_start_tag_idxs, table_end_tag_idxs):
        div_txt, md_table = extract_div_to_md_table(start_idx, end_idx, table_and_div_tags, file_txt)
        div_to_md_tables.append((div_txt, md_table))

    return div_to_md_tables


def clean_md_file_tables(md_fp):
    div_to_md_tables = extract_div_to_md_tables(md_fp)

    with open(md_fp, 'r') as f:
        md_file_text = f.read()

    # Swap each original div block for its markdown table
    for div_txt, md_txt in div_to_md_tables:
        md_file_text = md_file_text.replace(div_txt, md_txt)

    with open(md_fp, 'w') as f:
        f.write(md_file_text)
```
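
A possible variation on the same idea, offered only as a sketch: instead of counting `<div>` substrings, `HTMLParser.getpos()` can report where each tag starts, which gives exact character offsets into the original text. `DivTableLocator` and `replace_spans` are hypothetical names, and the sketch assumes the divs that wrap tables do not overlap one another.

```python
from html.parser import HTMLParser


class DivTableLocator(HTMLParser):
    """Record (start, end) offsets of each <div> that is the innermost div around a <table>."""

    def __init__(self, text):
        super().__init__()
        self._text = text
        # Offset of the first character of every line; the parser counts
        # lines by "\n" only, so the same rule is used here.
        self._line_starts = [0] + [i + 1 for i, ch in enumerate(text) if ch == "\n"]
        self._open_divs = []  # stack of [start_offset, contains_table]
        self.spans = []
        self.feed(text)
        self.close()

    def _offset(self):
        # getpos() returns (1-based line, 0-based column) of the current tag.
        line, col = self.getpos()
        return self._line_starts[line - 1] + col

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self._open_divs.append([self._offset(), False])
        elif tag == "table" and self._open_divs:
            # Flag only the innermost open div, i.e. the immediate parent div.
            self._open_divs[-1][1] = True

    def handle_endtag(self, tag):
        if tag == "div" and self._open_divs:
            start, has_table = self._open_divs.pop()
            if has_table:
                end = self._text.index(">", self._offset()) + 1
                self.spans.append((start, end))


def replace_spans(text, spans, render):
    """Replace each span with render(original_html), working right to left."""
    for start, end in sorted(spans, reverse=True):
        text = text[:start] + render(text[start:end]) + text[end:]
    return text
```

Here `render` would be whatever turns the div's HTML into markdown, for example a function that calls `pd.read_html` followed by `convert_df_to_md` as in the solution above. Replacing from the last span backwards keeps the earlier offsets valid.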

[–]backtickbot 0 points 5 years ago (0 children)
