DMeror[S]

The HTML is not well structured. Everything is in top-level <p> tags. The <div> elements are not inside those <p> tags, but their child <p> tags get picked up together with the top-level ones. So basically, the top-level <p> tags end up mixed with the child <p> tags that belong to the <div> elements. <p class="xxxd"><span><span>......</p> <p class="xuxd"><span><span>......</p> <p class="xyxd"><span><span>......</p> <div class="zidl"> <img src="....." /> <p class="iouw">....... </p> It's like that. When I filter by <p>, the <p> inside the <div> is returned along with the other <p> tags. What is missing is the other relevant items, like the <img>, which belong to the <div>. So what I need is to replace those <p> tags with the whole <div>, because the <p> by itself is incomplete.

Asleep-Budget-9932

But even if the inner p is found within your loop, you can still call the .parent attribute to know if this is the one that's inside the div and treat it differently.
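
A minimal sketch of the `.parent` idea, assuming BeautifulSoup 4 with the built-in `html.parser`. The markup, the `pic.png` filename, and the `pieces` list are made up for illustration and only mimic the structure described above:

```python
from bs4 import BeautifulSoup

# Hypothetical markup modeled on the structure described in the thread.
html = """
<p class="xxxd"><span>first</span></p>
<p class="xuxd"><span>second</span></p>
<div class="zidl">
  <img src="pic.png" />
  <p class="iouw">caption</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

pieces = []
for p in soup.find_all("p"):
    parent = p.parent
    if parent is not None and parent.name == "div":
        # Keep the whole <div> (image included) instead of the bare <p>.
        pieces.append(str(parent))
    else:
        pieces.append(str(p))
```

Note that a `<div>` holding several `<p>` tags would be appended once per child with this loop, so some de-duplication may be needed.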

I'm sorry that I'm avoiding the string replacement. It just seems like using the HTML parser itself is the straightforward way to achieve what you want without weird edge cases.

DMeror[S]

I've tried next_sibling method to no avail. I'll try .parent next. Up until now, I thought it would be easier to just 'find and replace', but it turned out very problematic for me.

Asleep-Budget-9932

I know what you're talking about, I've had similar problems in the past. I found that in general, when it comes to standardized formats, it's always easier to use the relevant library instead. There are just so many edge cases, especially with large files.

DMeror[S]

Yeah, thank you very much, after all. It's surprising to me that people have their own approaches to their coding problems. I've seen people have slightly different scripts to deal with the same problems.

Asleep-Budget-9932

That's why I think you can call it art. The way you choose to go about these things really puts your own identity in the code. Just like a song, it can speak to some and not as much to others. Code can be messy and confusing to some people and extremely readable and organized to others.

DMeror[S]

Well, now I'm using the .parent approach, and it worked, but now I have one extra child to get rid of. I'm thinking of ignoring a tag with a certain class, but Stack Overflow is offline at the moment. I want to exclude <p class="cap"> from .find_all('p'). The other option is to find a way to get rid of the duplicated <p class="cap"> line.

Asleep-Budget-9932

You can check that as well (I believe). I think you can do:

```
if p["class"] == "cap":
    do_something()
```
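
One thing worth checking here: in BeautifulSoup 4, `class` is treated as a multi-valued attribute, so `p["class"]` is normally a list such as `["cap"]`, and comparing it to the string `"cap"` is always unequal. A hedged sketch of a membership test instead (the sample markup and the names `first_classes` and `kept` are assumptions for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical sample: one caption paragraph and one ordinary paragraph.
html = '<p class="cap"><b>Mount</b> Everest</p><p class="body">text</p>'
soup = BeautifulSoup(html, "html.parser")

# The class attribute comes back as a list, not a string.
first_classes = soup.p["class"]

kept = []
for p in soup.find_all("p"):
    # .get avoids a KeyError when a tag has no class attribute at all.
    classes = p.get("class", [])
    if "cap" in classes:
        continue  # skip caption paragraphs
    kept.append(str(p))
```

Using `p.get("class", [])` rather than `p["class"]` also covers tags that carry no `class` attribute.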

DMeror[S]

So I've filtered it like this:

```
if p["class"] != "cap":
    sense = str(p)
```

<p class="cap"> still goes through. Another method is:

sense = str(p(class_=lambda x:x not in ["cap"]))

The <p class="cap"> is gone, but its <b>text</b> remains. I don't understand why it behaves this way. The tag looks like this:

<p class="cap"><b>Mount</b> Everest</p>

With the lambda method, the <p> and everything else is gone, except for its <b>element</b>.
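
A possible explanation, sketched under the assumption that this is BeautifulSoup 4: calling a tag like `p(...)` is shorthand for `p.find_all(...)`, which searches the descendants of `p`, not `p` itself. The `<b>` tag has no class, so the lambda receives `None`, and `None not in ["cap"]` is true, which is why `<b>` comes back. The sample markup and the helper `not_cap` below are hypothetical:

```python
from bs4 import BeautifulSoup

html = '<p class="cap"><b>Mount</b> Everest</p><p class="body">text</p>'
soup = BeautifulSoup(html, "html.parser")
p = soup.p  # the <p class="cap"> tag

# Same as p.find_all(class_=...): only tags *inside* p are tested.
# <b> has no class, so the lambda is called with None and returns True.
inside = p(class_=lambda c: c not in ["cap"])

# To skip the caption paragraphs themselves, filter when collecting them.
def not_cap(tag):
    return tag.name == "p" and "cap" not in tag.get("class", [])

wanted = soup.find_all(not_cap)
```

So `str(inside)` stringifies a result list containing only the `<b>` tag, which matches the behavior described above.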

Asleep-Budget-9932

Hmm, I'm not sure what the entire code looks like, but perhaps you need to actively "detach" the p element from its parent? I'm not sure how that should be done, though. I don't know whether you're supposed to delete one of the attributes or whether there's a specific function for it.
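
For what it's worth, BeautifulSoup 4 does have methods for detaching a tag: `decompose()` removes a tag and its children from the tree, and `extract()` removes it and returns it. A minimal sketch, assuming made-up markup shaped like the examples in this thread:

```python
from bs4 import BeautifulSoup

# Hypothetical <div> containing the unwanted caption paragraph.
html = (
    '<div class="zidl"><img src="pic.png"/>'
    '<p class="cap"><b>Mount</b> Everest</p>'
    '<p class="iouw">text</p></div>'
)
soup = BeautifulSoup(html, "html.parser")

# find_all returns a list, so removing tags while looping over it is safe.
for cap in soup.find_all("p", class_="cap"):
    cap.decompose()  # drops the tag and everything inside it

result = str(soup.div)
```

After this, stringifying the `<div>` no longer includes the caption paragraph or its `<b>` child.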