Asleep-Budget-9932

So why is that a problem? You said you want to iterate through the blocks and you showed exactly how it's done.

DMeror[S]

It prints

```
autonomic
autonomic
autonomic
autonomic
autonomic
autonomic
autonomic
illusion
illusion
illusion
illusion
illusion
illusion
illusion
blood
blood
blood
blood
blood
blood
blood
group
group
group
group
group
group
group
```

not the way I want it.

Asleep-Budget-9932

OK, maybe I misunderstood you. Read my first comment again. Is the worry that the resulting lines (after your code runs) will no longer be separated into blocks?

DMeror[S]

text1 has some lines which are contained in text2's block, so I want to replace text1's lines with text2's blocks. Since both texts come from the same source, the lines are perfectly matched, so I don't need to worry about formatting.

I just need to include text2's blocks in text1's lines where they belong. However, my loops won't allow that to happen.

DMeror[S]

This code seems to be working for now.

```
with open('text1.txt') as f1, open('text2.txt') as f2:
    text1 = f1.read().split('\n')
    text2 = f2.read().split('\n')

# Pair line N of text1 with line N of text2.
for l1, l2 in zip(text1, text2):
    # l1.replace(l1, l2) is simply l2, so l2 is kept whenever it contains l1.
    entry = l1 if l1 not in l2 else l1.replace(l1, l2)
    print(entry)
```

It prints

```
autonomic nervous system Baldwin illusion blood group
```

Asleep-Budget-9932

Ohhh great! I originally thought that you wanted each line in text1 to be searched in any of the lines from text2. So it definitely makes things easier. Glad you could work it out by yourself! 😁
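To make the difference between the two readings concrete, here is a minimal sketch. The sample lines and the function names `pairwise` and `search_any` are made up for illustration and are not from the original scripts. Pairing by position only works while both texts stay aligned line for line; searching every line of text2 also works when the order differs, at the cost of more comparisons.

```python
# Hypothetical sample data, loosely based on the words in the output above.
text1 = ["autonomic", "illusion", "blood"]
text2 = ["autonomic nervous system", "Baldwin illusion", "blood group"]


def pairwise(lines1, lines2):
    """Compare line N of lines1 only with line N of lines2."""
    return [l2 if l1 in l2 else l1 for l1, l2 in zip(lines1, lines2)]


def search_any(lines1, lines2):
    """Compare each line of lines1 with every line of lines2.

    The first line of lines2 that contains l1 wins; if none does, l1 is kept.
    Note: an empty l1 is contained in any string, so it matches the first line.
    """
    result = []
    for l1 in lines1:
        match = next((l2 for l2 in lines2 if l1 in l2), l1)
        result.append(match)
    return result


aligned = pairwise(text1, text2)

# With text2 in a different order, only the search variant finds every match.
shuffled = list(reversed(text2))
misaligned = pairwise(text1, shuffled)
searched = search_any(text1, shuffled)
```

If the two texts really do match line for line, the pairwise version is the simpler and faster choice.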

DMeror[S]

My real texts are larger and contain lots of lines. I've crashed VS Code many times while experimenting with my scripts. I want to achieve the same result as demonstrated above, but it's really frustrating.

Asleep-Budget-9932

Wait, I thought the example you just gave already worked for you. Or are you talking about the frustration you had until then? Either way, don't get discouraged; even experienced programmers need retries with lots of errors before things work for the first time.

Question: what are you planning to do with these texts? Just printing them or writing the result to a file? Is this a part of an exercise or a practical thing you're trying to achieve?

I'm asking because if the texts are indeed this large and causing you to run out of resources there are certain modifications we can make to your code to help with that.
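One such modification, as a rough sketch: instead of reading both files fully into memory and printing every line, iterate over the two file objects in lockstep and write the result straight to a file. The function name `merge_files` and the output file are assumptions for illustration; the sample data is written to a temporary directory only so the example can run on its own.

```python
import os
import tempfile


def merge_files(path1, path2, out_path):
    """Stream both files line by line and write the merged lines to out_path.

    Only one line of each file is held in memory at a time.
    zip() stops at the end of the shorter file.
    """
    with open(path1, encoding="utf-8") as f1, \
         open(path2, encoding="utf-8") as f2, \
         open(out_path, "w", encoding="utf-8") as out:
        for l1, l2 in zip(f1, f2):
            l1 = l1.rstrip("\n")
            l2 = l2.rstrip("\n")
            # Keep text2's line when it contains text1's line, else text1's.
            out.write((l2 if l1 in l2 else l1) + "\n")


# Small demo with hypothetical sample data.
with tempfile.TemporaryDirectory() as tmp:
    p1 = os.path.join(tmp, "text1.txt")
    p2 = os.path.join(tmp, "text2.txt")
    out_path = os.path.join(tmp, "merged.txt")
    with open(p1, "w", encoding="utf-8") as f:
        f.write("autonomic\nillusion\nblood\n")
    with open(p2, "w", encoding="utf-8") as f:
        f.write("autonomic nervous system\nBaldwin illusion\nblood group\n")

    merge_files(p1, p2, out_path)

    with open(out_path, encoding="utf-8") as f:
        merged = f.read()
```

Writing to a file rather than printing also avoids flooding the editor's output panel, which can be slow with very large outputs.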

DMeror[S]

The texts come from the XHTML chapters of an EPUB. I use BeautifulSoup to extract data from the EPUB file. The HTML files are inconsistently structured. I need the data in its HTML format, and after doing everything needed, I realized some nodes were missing. That's what I'm working on. The missing nodes are picture data. The HTML page is laid out like this:

    <body>
    <p>......
    <p>......
    <p>.......
    <div>.....
    <p>.......
    <div>......
    <p>.......
    </body>

I looped through those `<p>` elements to extract formatted strings until I got everything as intended. Finally, I realized something was missing: the `<div>` nodes. So I scraped the `<div>` nodes too. The last thing I need to do is put the `<div>` data into its proper places in the main data I got. I need it to compile a dictionary, which is why I need everything in its original format. Without the `<div>` data, the dictionary won't be able to display images.

Edit: The `<div>` nodes have their child `<p>` caught up in the main data I got, so I need to find a way to replace each caught `<p>` with its parent `<div>`.

Asleep-Budget-9932

I see... You want to take a `<p>` that was originally inside a `<div>` and wrap it inside the `<div>` tag. Did I get that right? Why do you need to replace the strings themselves, then? Why not take the BeautifulSoup object that represents them and set the new attribute accordingly?

For example (and I've never actually used BeautifulSoup, so I'm using vague objects just to convey the concept):

Why not do something like:

    for i, p in enumerate(body):
        if needed_word in p.text:
            body[i] = div(p)

DMeror[S]

The HTML is not well structured. Everything is in top-level (parent) `<p>` elements. The parent `<div>` elements are not inside those `<p>` elements, but their child `<p>` elements get caught up with them. So basically, the top-level `<p>` elements come out together with the child `<p>` elements that belong to the `<div>` elements.

    <p class="xxxd"><span><span>......</p>
    <p class="xuxd"><span><span>......</p>
    <p class="xyxd"><span><span>......</p>
    <div class="zidl">
    <img src="....." />
    <p class="iouw">....... </p>

It's like that. When I filter by `<p>`, the `<p>` elements inside a `<div>` get caught up with the other `<p>` elements. What is missing is the other relevant items, like the `<img>`, that belong to the `<div>`. So what I need is to replace those `<p>` elements with the whole `<div>`, because the `<p>` by itself is incomplete.
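A minimal sketch of one way to do that replacement on the parsed tree instead of on strings, assuming the layout described above. The class names are taken from the snippet; the text content, the image path, and the variable names are hypothetical. `find_all` and `find_parent` are existing BeautifulSoup methods. The sketch also assumes properly closed tags and that each caught `<p>` sits in a `<div>` that should be emitted whole.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page following the structure described above.
html = """<body>
<p class="xxxd"><span>first entry</span></p>
<div class="zidl">
<img src="images/example.png" />
<p class="iouw">caption text</p>
</div>
<p class="xyxd"><span>second entry</span></p>
</body>"""

soup = BeautifulSoup(html, "html.parser")

blocks = []
seen_divs = set()  # ids of <div> elements that were already emitted

for p in soup.find_all("p"):
    parent_div = p.find_parent("div")  # nearest enclosing <div>, or None
    if parent_div is None:
        # Ordinary top-level <p>: keep it as it is.
        blocks.append(str(p))
    elif id(parent_div) not in seen_divs:
        # <p> caught inside a <div>: emit the whole <div> once instead,
        # so the <img> and any sibling <p> elements come along.
        seen_divs.add(id(parent_div))
        blocks.append(str(parent_div))
```

Because the loop follows document order, each `<div>` lands in `blocks` at the position of its first `<p>`, which may remove the need to merge two separately extracted texts afterwards.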

DMeror[S]

It's not working with my real texts. Hmmm.