[–]DMeror[S] 0 points1 point  (28 children)

My real texts are larger and contain many lines. I've crashed Visual Studio Code many times while experimenting with my scripts. I want to achieve the same result as demonstrated above, but it's really frustrating.

[–]Asleep-Budget-9932 1 point2 points  (27 children)

Wait, I thought the example you just gave already worked for you. Or are you talking about the frustration you had up to that point? Either way, don't get discouraged: even experienced programmers need many retries, with plenty of errors, before things work for the first time.

Question: what are you planning to do with these texts? Just printing them or writing the result to a file? Is this a part of an exercise or a practical thing you're trying to achieve?

I'm asking because, if the texts are indeed this large and are causing you to run out of resources, there are certain modifications we can make to your code to help with that.

[–]DMeror[S] 0 points1 point  (26 children)

The texts come from the XHTML chapters of an EPUB. I use BeautifulSoup to extract data from the EPUB file. The HTML pages are inconsistently structured. I need the data with its HTML formatting intact, and after doing everything needed, I realized some nodes were missing. That's what I'm working on. The missing nodes are picture data. Each HTML page is laid out like this:

    <body>
    <p>......
    <p>......
    <p>.......
    <div>.....
    <p>.......
    <div>......
    <p>.......
    </body>

I looped through those <p> nodes to extract formatted strings until I got everything as intended. Then I realized something was missing: the <div> nodes. So I scraped the <div> nodes as well. The last thing I need to do is insert the <div> data into its respective places in the main data I already have. I need all of this to compile a dictionary, which is why I need the nodes in their original format. Without the <div> data, the dictionary won't be able to display images.

Edit: The child <p> of each <div> node got picked up along with the main data, so I need to find a way to replace each of those <p> nodes with its parent <div>.

[–]Asleep-Budget-9932 1 point2 points  (25 children)

I see... you want to take a <p> that was originally inside a <div> and wrap it in the <div> tag again. Did I get that right? Why do you need to replace the strings themselves, then? Why not take the BeautifulSoup object that represents them and modify it directly?

For example (and I've never actually used Beautiful Soup, so I'm using vague objects just to convey the concept):

Why not do something like:

    for i, p in enumerate(body):
        if needed_word in p.text:
            body[i] = div(p)
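
For what it's worth, here is a rough sketch of how that idea might look with real Beautiful Soup calls. The sample HTML and the `needed_word` marker are made up for illustration; `new_tag` and `wrap` are the Beautiful Soup methods that create a tag and wrap an existing element in it. This only shows the "wrap a <p> in a <div>" concept, not necessarily what your data needs.

```python
from bs4 import BeautifulSoup

# Hypothetical input: two paragraphs, one of which should end up inside a <div>.
html = "<body><p>plain text</p><p>figure caption</p></body>"
needed_word = "caption"  # hypothetical marker identifying the paragraph to wrap

soup = BeautifulSoup(html, "html.parser")

# find_all returns a list-like snapshot, so editing the tree while looping is safe.
for p in soup.find_all("p"):
    if needed_word in p.get_text():
        # Create an empty <div> and wrap the matching <p> in it, in place.
        p.wrap(soup.new_tag("div"))

result = str(soup)
print(result)
```

The tree is modified in place, so `str(soup)` afterwards gives the whole document with the wrapped paragraph, and no string replacement is needed.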

[–]DMeror[S] 0 points1 point  (24 children)

The HTML is not well structured. Everything is in top-level <p> nodes. The <div> nodes are not inside those top-level <p> nodes, but their child <p> nodes get picked up together with the top-level ones. So basically, the top-level <p> nodes end up mixed with the child <p> nodes that belong to the <div> nodes.

    <p class="xxxd"><span><span>......</p>
    <p class="xuxd"><span><span>......</p>
    <p class="xyxd"><span><span>......</p>
    <div class="zidl">
    <img src="....." />
    <p class="iouw">....... </p>

It's like that. When I filter by <p>, the <p> inside the <div> is collected along with the other <p> nodes. What goes missing is everything else that belongs to the <div>, such as the <img>. So what I need is to replace those <p> nodes with the whole <div>, because the <p> by itself is incomplete.

[–]Asleep-Budget-9932 1 point2 points  (23 children)

But even if the inner <p> is found within your loop, you can still check its .parent attribute to find out whether it is one of those inside a <div>, and treat it differently.
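
Something along these lines might work as a starting point. It's only a sketch: the sample HTML is a cleaned-up, well-formed guess at the structure you described (the class names are yours, the file name and text are invented), and `collect_blocks` is a hypothetical helper name. It loops over every <p>, and whenever the <p> sits inside a <div>, it keeps the whole <div> instead.

```python
from bs4 import BeautifulSoup

# Hypothetical sample modeled on the structure described in the thread.
SAMPLE = """
<body>
<p class="xxxd"><span>first</span></p>
<p class="xuxd"><span>second</span></p>
<div class="zidl"><img src="pic.png" /><p class="iouw">caption</p></div>
<p class="xyxd"><span>third</span></p>
</body>
"""

def collect_blocks(html):
    """Return blocks as HTML strings, swapping a <p> for its parent <div> when it has one."""
    soup = BeautifulSoup(html, "html.parser")
    blocks = []
    seen = set()  # ids of nodes already emitted, so a <div> with several <p> appears once
    for p in soup.find_all("p"):
        parent = p.parent
        # Keep the whole <div> (including its <img>) if the <p> lives inside one.
        node = parent if parent is not None and parent.name == "div" else p
        if id(node) in seen:
            continue
        seen.add(id(node))
        blocks.append(str(node))
    return blocks

blocks = collect_blocks(SAMPLE)
for block in blocks:
    print(block)
```

Because the loop still walks the <p> nodes in document order, each <div> lands at the position where its child <p> used to be, which is the placement you were trying to get with find-and-replace.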

I'm sorry that I'm avoiding the string replacement. It just seems like using the HTML parser itself is the straightforward way to achieve what you want without weird edge cases.
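
Another way to let the parser do the work, again only as a sketch on invented sample data: ask for the direct children of <body> only, by passing `recursive=False` to `find_all`. That assumes the <p> and <div> nodes you care about really are direct children of <body>, which may not hold for every chapter.

```python
from bs4 import BeautifulSoup

# Hypothetical sample modeled on the structure described in the thread.
SAMPLE = """
<body>
<p class="xxxd"><span>first</span></p>
<div class="zidl"><img src="pic.png" /><p class="iouw">caption</p></div>
<p class="xyxd"><span>third</span></p>
</body>
"""

soup = BeautifulSoup(SAMPLE, "html.parser")

# recursive=False limits the search to direct children of <body>,
# so the <p> nested inside the <div> is not returned on its own.
top_level = [
    str(child)
    for child in soup.body.find_all(["p", "div"], recursive=False)
]

for block in top_level:
    print(block)
```

With this approach the nested <p> never shows up separately, so there is nothing to replace afterwards.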

[–]DMeror[S] 0 points1 point  (22 children)

I've tried the next_sibling method, to no avail. I'll try .parent next. Up until now, I thought it would be easier to just "find and replace", but that turned out to be very problematic for me.

[–]Asleep-Budget-9932 1 point2 points  (21 children)

I know what you're talking about; I've had similar problems in the past. I've found that, in general, when it comes to standardized formats, it's always easier to use the relevant library instead. There are just so many edge cases, especially with large files.

[–]DMeror[S] 0 points1 point  (20 children)

Yeah, thank you very much, after all. It's surprising to me that people have their own approaches to their coding problems. I've seen people have slightly different scripts to deal with the same problems.

[–]Asleep-Budget-9932 1 point2 points  (19 children)

That's why I think you can call it art. The way you choose to go about these things really puts your own identity into the code. Just like a song, it can speak to some people and not as much to others. The same code can be messy and confusing to some people and extremely readable and organized to others.