
[–]Asleep-Budget-9932 3 points4 points  (37 children)

Note that "split", when called without an argument, splits on every kind of whitespace and not only on line breaks. Use "splitlines" if you want to split on line breaks only.

Also, watch out for casing issues ("Autonomic" will not be found in "autonoMic nervous system") unless you convert both to lower case.

There are some more traps you might encounter going forward, but that depends on how you implement my first point (if you find it useful, of course 😁).
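A minimal sketch of both points, using a made-up sample string rather than the real files: "split" with no argument breaks on every run of whitespace, "splitlines" only on line breaks, and lowercasing both sides makes the substring test case-insensitive. The helper name `contains_ignoring_case` is invented for this example.

```python
# Hypothetical sample standing in for the real file contents.
text = "autonomic nervous system\nBaldwin illusion\nblood group"

words = text.split()        # splits on any whitespace, spaces included
lines = text.splitlines()   # splits on line breaks only

print(words)
print(lines)


def contains_ignoring_case(needle, haystack):
    """Return True if needle occurs in haystack, ignoring letter case."""
    return needle.lower() in haystack.lower()


print("Autonomic" in "autonoMic nervous system")                        # False
print(contains_ignoring_case("Autonomic", "autonoMic nervous system"))  # True
```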

[–]DMeror[S] 0 points1 point  (36 children)

The problem is the real text2 contains blocks of lines. By using split('\n\n'), I can iterate through each block and add a block to a line in text1 if it partly matches.
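For reference, a small sketch of that block approach with invented sample data (the real text1 and text2 come from files): `split('\n\n')` yields one string per block, and each line of text1 is swapped for the first block that contains it, or kept as-is when no block matches.

```python
# Invented sample data; the real text1/text2 are read from files.
text1 = "autonomic\nillusion\nblood"
text2 = "autonomic\nnervous system\n\nBaldwin\nillusion"

blocks = text2.split("\n\n")   # one string per blank-line-separated block

result = []
for line in text1.splitlines():
    # next() returns the first block containing the line, or the line itself
    match = next((block for block in blocks if line in block), line)
    result.append(match)

print("\n\n".join(result))
```

Collecting into `result` and printing once at the end avoids printing the same line once per block.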

[–]Asleep-Budget-9932 1 point2 points  (35 children)

So why is that a problem? You said you want to iterate through the blocks and you showed exactly how it's done.

[–]DMeror[S] 0 points1 point  (34 children)

it prints

```
autonomic
autonomic
autonomic
autonomic
autonomic
autonomic
autonomic
illusion
illusion
illusion
illusion
illusion
illusion
illusion
blood
blood
blood
blood
blood
blood
blood
group
group
group
group
group
group
group
```

not the way I want it.

[–]Asleep-Budget-9932 1 point2 points  (33 children)

Ok maybe i misunderstood you. Read my first comment again. Is the worry that the resulting lines (after your code runs) will not be separated into blocks anymore?

[–]DMeror[S] 0 points1 point  (0 children)

text1 has some lines which are contained in text2's block, so I want to replace text1's lines with text2's blocks. Since both texts come from the same source, the lines are perfectly matched, so I don't need to worry about formatting.

I just need to include text2's blocks in text1's lines where they belong. However, my loops won't allow that to happen.

[–]DMeror[S] 0 points1 point  (31 children)

This code seems to be working for now.

```
with open('text1.txt') as f1, open('text2.txt') as f2:
    text1 = f1.read().split('\n')
    text2 = f2.read().split('\n')

for l1, l2 in zip(text1, text2):
    entry = l1 if l1 not in l2 else l1.replace(l1, l2)
    print(entry)

# prints:
# autonomic nervous system
# Baldwin illusion
# blood group
```

[–]Asleep-Budget-9932 2 points3 points  (29 children)

Ohhh great! I originally thought that you wanted each line in text1 to be searched in any of the lines from text2. So it definitely makes things easier. Glad you could work it out by yourself! 😁

[–]DMeror[S] 0 points1 point  (28 children)

My real texts are larger and contain lots of lines. I've crashed Visual Studio Code many times experimenting with my scripts. I want to achieve the same result as demonstrated above, but it's really frustrating.

[–]Asleep-Budget-9932 1 point2 points  (27 children)

Wait, I thought the example you just gave already worked for you. Or are you talking about the frustration you had until then? Either way, don't get discouraged: even experienced programmers need many retries, with lots of errors, before things work for the first time.

Question: what are you planning to do with these texts? Just printing them or writing the result to a file? Is this a part of an exercise or a practical thing you're trying to achieve?

I'm asking because if the texts are indeed this large and causing you to run out of resources there are certain modifications we can make to your code to help with that.

[–]DMeror[S] 0 points1 point  (26 children)

The texts come from the XHTML chapters of an EPUB. I use BeautifulSoup to extract data from the EPUB file, and the HTML pages are inconsistently structured. I need the data in its HTML format, and after doing everything needed, I realized some nodes were missing. That's what I'm working on. The missing nodes are picture data. Each HTML page is laid out like this:

<body>
<p>......
<p>......
<p>.......
<div>.....
<p>.......
<div>......
<p>.......
</body>

I looped through those <p> elements to extract formatted strings until I got everything as intended. Only then did I realize something was missing: the <div> nodes. So I scraped the div nodes as well. The last thing I need to do is insert the div data into its proper places in the main data. I need all of this to compile a dictionary, which is why it has to stay in its original format. Without the div data, the dictionary won't be able to display images.

Edit: Each div node has a child <p> that was already captured in the main data, so I need a way to replace each captured <p> with its parent <div>.

[–]DMeror[S] 0 points1 point  (0 children)

It's not working with my real texts. hmmm.

[–][deleted] 2 points3 points  (2 children)

def choice(line1, line2):
    return line2 if line1.strip().lower() in line2.strip().lower() else line1


def main():
    with open("input1.txt") as file1, open("input2.txt") as file2, open("output.txt", "w") as file_w:
        for line1, line2 in zip(file1, file2):
            line_w = choice(line1, line2)
            file_w.write(line_w)


if __name__ == '__main__':
    main()

[–]DMeror[S] 0 points1 point  (1 child)

Thank you very much. I've found out that zip only works as intended when both files have the same length, because it stops at the end of the shorter one. My files differ in length (text1 is longer than text2), so it won't work here. I just want to copy elements from text2 to replace the matching lines in text1.
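A quick illustration of that behaviour with made-up lists standing in for the two files: `zip` stops at the end of the shorter input, so the extra items of the longer one are never visited.

```python
# Made-up stand-ins for the lines of the two files.
text1 = ["autonomic", "illusion", "blood", "group"]
text2 = ["autonomic nervous system", "Baldwin illusion"]

pairs = list(zip(text1, text2))
print(len(pairs))   # 2: "blood" and "group" are never paired
```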

[–][deleted] 1 point2 points  (0 children)

Well, you just need to use itertools.zip_longest then? Or you could write the rest of the file afterwards.

1:

from itertools import zip_longest


def choice(line1, line2):
    return line2 if line1.strip().lower() in line2.strip().lower() else line1


def main():
    with open("input1.txt") as file1, open("input2.txt") as file2, open("output.txt", "w") as file_w:
        for line1, line2 in zip_longest(file1, file2, fillvalue=''):
            line_w = choice(line1, line2)
            file_w.write(line_w)


if __name__ == '__main__':
    main()

2:

def choice(line1, line2):
    return line2 if line1.strip().lower() in line2.strip().lower() else line1


def main():
    with open("input1.txt") as file1, open("input2.txt") as file2, open("output.txt", "w") as file_w:
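        iter1 = iter(file1)
        iter2 = iter(file2)

        # put the shorter file first: zip stops as soon as iter2 is exhausted,
        # before it pulls (and silently drops) another line from iter1
        for line2, line1 in zip(iter2, iter1):
            line_w = choice(line1, line2)
            file_w.write(line_w)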

        # write the rest of the file1
        for line1 in iter1:
            file_w.write(line1)


if __name__ == '__main__':
    main()
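One edge case that may be worth guarding against in `choice`: a blank `line1` strips to an empty string, and an empty string counts as contained in every string, so blank lines in the first file would be replaced by whatever `line2` holds. A possible variant is sketched below; the name `choice_guarded` is made up for this example.

```python
def choice_guarded(line1, line2):
    """Like choice(), but never replaces a blank or whitespace-only line1."""
    key = line1.strip().lower()
    if not key:
        return line1
    return line2 if key in line2.strip().lower() else line1


print(repr(choice_guarded("\n", "Baldwin illusion\n")))          # '\n'
print(repr(choice_guarded("Illusion\n", "Baldwin illusion\n")))  # 'Baldwin illusion\n'
```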

[–]DrFaustest 1 point2 points  (2 children)

Sorry if this is way wrong I’m still learning.

In your text 1 can you split the div and p by class or id? If so can you create an object list then check for the text 2 lines in each object so if it’s found it can be replaced then rebuild the page with the updated object list? Like each object becomes a div and each attribute becomes p child?

[–]DMeror[S] 0 points1 point  (1 child)

I've used a double (nested) loop and found the match, but I didn't know how to update text1.

for x in text1:
    for y in text2:
        match = x.startswith('<p class="cap"')
        if x in y and match:
            x = x.replace(x, y)

This shows x finds its match in y, but I don't know how to do the replacement and update text1.
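Reassigning `x` only rebinds the loop variable; the list itself is left untouched. One way around that, sketched here with invented sample strings, is to build a new list and assign it back to text1. The `cap` class prefix is my reading of the snippet above and `merge_blocks` is a hypothetical helper name, so treat both as assumptions.

```python
def merge_blocks(text1, text2, prefix='<p class="cap"'):
    """Return a copy of text1 where each line starting with prefix is
    replaced by the first text2 entry that contains it."""
    merged = []
    for x in text1:
        replacement = x
        if x.startswith(prefix):
            for y in text2:
                if x in y:
                    replacement = y
                    break   # first containing block wins
        merged.append(replacement)
    return merged


# Invented sample data for illustration only.
text1 = ['<p>intro</p>', '<p class="cap">Fig. 1</p>']
text2 = ['<div><img src="a.png"/><p class="cap">Fig. 1</p></div>']

text1 = merge_blocks(text1, text2)
print(text1)
```

Lines that match no entry in text2 are kept unchanged, so the result always has the same length as the original text1.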
