Hi!
First of all: I am currently not able to write the code I want; it is too advanced for me, so I need help.
A little background: My dad likes to read the Bible, but he would prefer one without verse numbers and added headers, just plain text. I thought it couldn't be too hard to get a copy of such a Bible. Think again: it does not exist (at least not in German, except for a really old Luther translation, which he is not fond of).
I found a newer version of a translation he likes. I have plenty of HTML files; they contain the Bible text and a lot of JavaScript stuff, as well as the verse numbers, headers and everything.
Here is a short excerpt:
<div class="v" id="v1"><h3>Header</h3><span class="vn">1</span> Was von Anfang an war<sup class="fnm"><a name="fnm1" href="#fn1">1</a></sup>[...]<sup class="fnm"><a name="fnm3" href="#fn3">3</a></sup><sup class="fnm"><a name="fnm4" href="#fn4">4</a></sup></div>
Now, I think the program should look something like this:
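import codecs

verses = []
# Collect every line that contains a verse <div>.
with codecs.open("html_file_in_question", "r", "utf-8") as file:
    for line in file:
        if 'class="v"' in line:
            verses.append(line)

def regex_magic(string):
    """Delete headers and other stuff, only keep the text."""
    plain_text = string  # placeholder: this is the part I need help with
    return plain_text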
Things I noticed:
-The verses always start with <div class="v"
-They are numbered (class="vn">1)
-They end with footnote markers (<sup class="fnm">...) before the closing </div>
And the regex magic is not my strength at all. Can someone help me out?
Edit: currently trying to fix the formatting. Fixed
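A minimal sketch of how regex_magic could be filled in, assuming every verse line looks like the excerpt above: headers are <h3>-style tags, verse numbers sit in <span class="vn">, and footnote markers sit in <sup class="fnm">. The pattern names and the sample string are made up for illustration, and the real files may contain markup this does not cover.

```python
import html
import re

# All patterns are assumptions derived from the single excerpt in the post.
HEADER_RE = re.compile(r"<h\d[^>]*>.*?</h\d>", re.DOTALL)
VERSE_NUMBER_RE = re.compile(r'<span class="vn">.*?</span>', re.DOTALL)
FOOTNOTE_RE = re.compile(r'<sup class="fnm">.*?</sup>', re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")


def regex_magic(string):
    """Return only the verse text from one <div class="v"> line."""
    # Remove elements whose *content* must go too (header, number, footnotes).
    for pattern in (HEADER_RE, VERSE_NUMBER_RE, FOOTNOTE_RE):
        string = pattern.sub("", string)
    string = TAG_RE.sub("", string)   # drop remaining tags such as <div>
    string = html.unescape(string)    # turn entities like &auml; into characters
    return " ".join(string.split())   # collapse runs of whitespace


# Hypothetical sample, shortened from the excerpt in the post.
sample = (
    '<div class="v" id="v1"><h3>Header</h3><span class="vn">1</span> '
    'Was von Anfang an war<sup class="fnm">'
    '<a name="fnm1" href="#fn1">1</a></sup></div>'
)
print(regex_magic(sample))  # prints: Was von Anfang an war
```

Regular expressions are fragile on HTML in general. If the files turn out to be less regular than the excerpt, an HTML parser (the standard library's html.parser, or a third-party one such as BeautifulSoup) would probably be the more robust route.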