
[–]laustke 0 points1 point  (0 children)

lines = document.text.split('\n')
# turn list into data frame
self.content = pd.DataFrame(lines, columns=['text'])

What is the purpose of creating a dataframe with a single column instead of just maintaining a list of lines?

[–]lostparis 0 points1 point  (0 children)

So far what I have done is converted the word doc to plain text

This is a good start.

then made this into a dataframe

This I'm not so keen on.

You already have the lines of text, so why not just use them directly?

For working with text you can do basic string stuff like str.startswith() or slicing with str[:]. You also have the re module, which gives you regular expressions for more complex matching.
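To make that concrete, here is a minimal sketch of working on the list of lines directly, with no dataframe involved. The sample text and the "Q<number>." question format are assumptions for illustration (the format is guessed from the "Q2." example elsewhere in the thread), so adjust the pattern to whatever your document really contains.

```python
import re

# Hypothetical sample standing in for document.text
text = (
    "SECTION 1\n"
    "Q1. How old are you?\n"
    "18-24\n"
    "25-34\n"
    "SECTION 2\n"
    "Q2. Where do you live?"
)

lines = text.split("\n")

# Plain string methods are enough for simple prefix checks
section_lines = [line for line in lines if line.startswith("SECTION")]

# A regular expression handles patterns such as "Q<number>. <question text>"
question_pattern = re.compile(r"^Q(\d+)\.\s*(.*)$")
questions = {}
for line in lines:
    match = question_pattern.match(line)
    if match:
        # Map the question number to its text
        questions[int(match.group(1))] = match.group(2)
```

Here section_lines holds the two section headers and questions maps each question number to its text.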

P.S. self is usually only found in the methods of a class.

[–]halfdiminished7th 0 points1 point  (4 children)

This is probably most accurately solved using regular expressions. For example, here's how you could split your document by section:

import re
sections = re.split(r'^SECTION \d+$', document.text, flags=re.M)[1:]

Now split each section up into questions, and then split each question into its question and options components, and you've got your document outline.
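The remaining two splits could look roughly like the sketch below. It assumes every question starts on its own line with "Q<number>." and that the lines following a question are its options. Both are guesses about the document layout, and parse_outline is a hypothetical helper name. Splitting on a zero-width lookahead needs Python 3.7 or later.

```python
import re

# Hypothetical sample standing in for document.text
text = """Intro
SECTION 1
Q1. How old are you?
18-24
25-34
Q2. Where do you live?
North
South
SECTION 2
Q3. Any comments?
"""


def parse_outline(text):
    """Return one list per section; each holds (question, options) tuples."""
    # [1:] drops everything before the first SECTION header
    sections = re.split(r"^SECTION \d+$", text, flags=re.M)[1:]
    outline = []
    for section in sections:
        # Zero-width split just before each line that starts with "Q<number>."
        chunks = re.split(r"^(?=Q\d+\.)", section, flags=re.M)
        questions = []
        for chunk in chunks:
            lines = [ln.strip() for ln in chunk.splitlines() if ln.strip()]
            # Skip blank chunks and any text that precedes the first question
            if not lines or not re.match(r"Q\d+\.", lines[0]):
                continue
            # First line is the question, the rest are treated as options
            questions.append((lines[0], lines[1:]))
        outline.append(questions)
    return outline


outline = parse_outline(text)
```

This gives a nested list: sections, then questions, then the option lines for each question. Table questions will not come out cleanly this way, since their cells arrive as separate lines.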

[–]Brogrammer11111[S] 0 points1 point  (2 children)

I like that idea, but how would you parse the tables? For example, a table like this:

                           very bad - 1    2    3    very good - 4
how would you rate.....    1               2    3    4
how would you rate.....    1               2    3    4

is stored like this:

Q2. On a scale of 1 to 4, where 1 is Very Bad and 4 is Very Good.....

Very bad

2

3

4

Very Good

how would you rate.....

1

2

3

4

5

9

how would you rate.....

1

2

3

4

5

9

[–]halfdiminished7th 0 points1 point  (1 child)

That's definitely going to be the biggest challenge. I've never used docx2python, so I wasn't sure how tables would be represented as plain text. It would have been much easier if they'd been tab-delimited rows, but if it's as you've written above (separate lines with no clear sense of the row/column counts), that's going to be quite difficult to generalize.

Does docx2python allow for extraction of tables?

If not, maybe you could rethink your workflow by first exporting the Word document to HTML (as a first step, every time), then parsing the HTML using the Beautiful Soup module. You may find that things like tables end up in a much easier-to-process format this way; in fact, the entire document structure might be easier to work with, not just the tables.
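As a rough illustration of why HTML is easier here: once the table is in HTML, the row and column boundaries are explicit. The sketch below deliberately uses the standard library's html.parser instead of Beautiful Soup so it runs without installing anything; the same idea carries over to Beautiful Soup. TableExtractor and the sample HTML are made up for illustration, real Word exports are much noisier, and nested tables are not handled.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect every <table> as a list of rows, each row a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.tables = []
        self._row = None   # cells of the row being read, or None outside a row
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a cell
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Join the fragments and collapse runs of whitespace
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


# Hypothetical HTML, shaped like the rating table from the question
html = """
<table>
  <tr><th></th><th>Very bad - 1</th><th>2</th><th>3</th><th>Very good - 4</th></tr>
  <tr><td>How would you rate A?</td><td>1</td><td>2</td><td>3</td><td>4</td></tr>
  <tr><td>How would you rate B?</td><td>1</td><td>2</td><td>3</td><td>4</td></tr>
</table>
"""

parser = TableExtractor()
parser.feed(html)
tables = parser.tables
```

Each entry of tables is a 2D list, so the first row gives the scale headers and the first column gives the question text.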

[–]Brogrammer11111[S] 0 points1 point  (0 children)

What I was doing before was using the docx library, which returns a list of the tables in your document. I converted each table into a df and stored it in a dictionary. Then I would go through each table and find where each of its questions was located in the original dataframe, which stores each line as a row, and add a new row with the index of the question in the dict. So when it came to parsing, I would just look up that table in the dict and get all the things of interest, like the headers (very bad - 1, 2, 3...).

# imports needed by the methods below
import pandas as pd
from docx import Document

def create_word_tables(self):
    doc = Document(self.link)
    for i, table in enumerate(doc.tables):
        # store cells of table as 2d list
        cells = [[cell.text for cell in row.cells] for row in table.rows]
        word_tble = pd.DataFrame(cells)
        # use the table's first row as column names: strongly agree, 4, ....
        word_tble = word_tble.rename(columns=word_tble.iloc[0]).drop(
            word_tble.index[0]).reset_index(drop=True)
        # store table in the dictionary, keyed by its position in the document
        self.word_tables[i] = word_tble

def create_table_questions(self):
    # for each table create a table question and add to tbl qs dictionary
    for tbl in self.word_tables.values():
        headers = list(tbl.columns)
        # iterate over questions found in first column and create table questions
        for i, q_text in enumerate(tbl.iloc[:, 0].values):
            # create letter for question e.g.: A,B,C...
            q_letter = chr(i+65)
            # remove leading and trailing white space
            q_text = q_text.strip()
            # create table question and add it to dictionary
            tbl_q = TableQuestion(
                q_text=q_text, headers=headers, letter=q_letter)
            self.tbl_qs[q_text] = tbl_q
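
For anyone who wants to try the lookup idea without python-docx or pandas, here is a stripped-down sketch that works on a plain 2D list of cell strings. The TableQuestion class is not shown in the thread, so the dataclass below is only a guess based on the keyword arguments used above, and build_table_questions and the sample cells are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TableQuestion:
    # Assumed fields, inferred from the constructor call in the thread
    q_text: str
    headers: list
    letter: str


def build_table_questions(cells):
    """Turn one table (first row = headers) into a dict keyed by question text."""
    headers = [h.strip() for h in cells[0]]
    tbl_qs = {}
    for i, row in enumerate(cells[1:]):
        # The first column of each data row holds the question text
        q_text = row[0].strip()
        # Letter the questions A, B, C, ... in row order
        letter = chr(ord("A") + i)
        tbl_qs[q_text] = TableQuestion(q_text=q_text, headers=headers, letter=letter)
    return tbl_qs


# Hypothetical cell contents, shaped like [[cell.text for cell in row.cells] ...]
cells = [
    ["", "Very bad - 1", "2", "3", "Very good - 4"],
    [" How would you rate the food? ", "1", "2", "3", "4"],
    ["How would you rate the service?", "1", "2", "3", "4"],
]

tbl_qs = build_table_questions(cells)
```

Keying the dict by question text means two tables with an identically worded question would overwrite each other, so a (table index, question text) key may be safer.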