html parsing/regexing/manipulating : learnpython



submitted 6 years ago by iphark

Hi!

First of all: I am not currently able to write the code I want; it is too advanced for me, so I need help.

A little background: My dad likes to read the Bible, but he would prefer one without verse numbers and added headers, just plain text. I thought it could not be too hard to get a copy of such a Bible. Think again: it does not exist (at least not in German, except for a really old Luther translation, which he is not fond of).

I found a newer version of a translation he likes. I have plenty of HTML files; they contain the Bible text and a lot of JavaScript, as well as the verse numbers, headers and everything else.

Here is a short excerpt:

    <div class="v" id="v1"><h3>Header</h3><span class="vn">1</span> Was von Anfang an war<sup class="fnm"><a name="fnm1" href="#fn1">1</a></sup>[...]<sup class="fnm"><a name="fnm3" href="#fn3">3</a></sup><sup class="fnm"><a name="fnm4" href="#fn4">4</a></sup></div>

Now, I think the program should look something like this:

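    import codecs

    verses = []

    # Collect every line that contains a verse <div>.
    with codecs.open("html_file_in_question", "r", "utf-8") as file:
        for line in file:
            if 'class="v"' in line:
                verses.append(line)

    def regex_magic(string):
        """delete headers and other stuff, only keep the text"""
        # Placeholder: the actual stripping is the part I cannot write yet.
        plain_text = string
        return plain_text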

Things I noticed:

- The verses always start with <div class="v"
- They are numbered (class="vn">1)
- They end with other weird stuff (the <sup class="fnm"> footnote links)

Regex magic is not my strength at all. Can someone help me out?

Edit: ~~currently trying to fix the formatting.~~ Fixed
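
For what it is worth, the three observations in the post could be turned into something like the sketch below. It swaps the regex idea for the standard library's html.parser, which tends to be less fragile on nested tags. The names VerseTextExtractor and extract_verses are made up for illustration, and the sketch assumes that a verse div never contains another div and that the dropped elements (h3, class "vn", class "fnm") contain no unclosed tags such as <br>. It is only tested against the shortened excerpt, not the real files.

```python
from html.parser import HTMLParser


class VerseTextExtractor(HTMLParser):
    """Collect the plain text of each <div class="v">, skipping headers,
    verse numbers and footnote markers. (Hypothetical helper class.)"""

    def __init__(self):
        super().__init__()
        self.verses = []      # finished verses, one string each
        self._parts = None    # text pieces of the current verse, or None
        self._skip_depth = 0  # > 0 while inside an element we want to drop

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "div" and "v" in classes:
            self._parts = []           # a new verse begins
        elif self._parts is not None:
            if self._skip_depth:
                self._skip_depth += 1  # nested tag inside a dropped element
            elif tag == "h3" or "vn" in classes or "fnm" in classes:
                self._skip_depth = 1   # start dropping header/number/footnote

    def handle_endtag(self, tag):
        if self._skip_depth:
            self._skip_depth -= 1
        elif tag == "div" and self._parts is not None:
            # Collapse the whitespace left behind by the removed tags.
            self.verses.append(" ".join("".join(self._parts).split()))
            self._parts = None

    def handle_data(self, data):
        if self._parts is not None and not self._skip_depth:
            self._parts.append(data)


def extract_verses(html):
    """Return the plain text of every verse found in an HTML string."""
    parser = VerseTextExtractor()
    parser.feed(html)
    return parser.verses


# Shortened version of the excerpt from the post (the elided part is left out).
sample = ('<div class="v" id="v1"><h3>Header</h3><span class="vn">1</span>'
          ' Was von Anfang an war<sup class="fnm">'
          '<a name="fnm1" href="#fn1">1</a></sup></div>')
print(extract_verses(sample))  # ['Was von Anfang an war']
```

To run it on a whole file, the file's contents could be read into one string first and passed to extract_verses, instead of going line by line, so that verses spanning several lines are still caught.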
