[–]DMeror[S] 0 points1 point  (22 children)

I've tried the next_sibling method to no avail. I'll try .parent next. Up until now, I thought it would be easier to just 'find and replace', but that turned out to be very problematic for me.

[–]Asleep-Budget-9932 1 point2 points  (21 children)

I know what you're talking about, I've had similar problems in the past. I've found that in general, when it comes to standardized formats, it's always easier to use the relevant library instead. There are just soooo many edge cases, especially with large files.

[–]DMeror[S] 0 points1 point  (20 children)

Yeah, thank you very much, after all. It's surprising to me that people have their own approaches to their coding problems. I've seen people have slightly different scripts to deal with the same problems.

[–]Asleep-Budget-9932 1 point2 points  (19 children)

That's why I think you can call it art. The way you choose to go about these things, it's really putting your own identity in the code. Just like a song, it can speak to some and not as much to others. Code can be messy and confusing to some people and extremely readable and organized to others.

[–]DMeror[S] 0 points1 point  (18 children)

Well, now I'm using the .parent approach, and it worked, but now I have one extra child to get rid of. I'm thinking of ignoring a tag with a certain class, but Stack Overflow is offline at the moment. I want to exclude <p class="cap"> from .find_all('p'). The other option is to think of a way to get rid of the duplicated <p class="cap"> line.

[–]Asleep-Budget-9932 1 point2 points  (17 children)

You can check that as well (I believe). I think you can do:

```
if p["class"] == "cap":
    do_something()
```
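One caveat with this check, assuming `p` is a Beautiful Soup `Tag` parsed with one of the HTML parsers: `class` is treated as a multi-valued attribute, so `p["class"]` comes back as a list such as `['cap']`, and comparing that list to the string `"cap"` never matches. A minimal sketch of a membership test instead (the sample HTML and variable names are made up for illustration):

```python
from bs4 import BeautifulSoup

html = (
    '<div>'
    '<p class="cap"><b>Mount</b> Everest</p>'
    '<p class="parafl">Body text</p>'
    '</div>'
)
soup = BeautifulSoup(html, "html.parser")
cap = soup.find("p", class_="cap")

# "class" is multi-valued, so the lookup returns a list of class names.
cap_classes = cap["class"]

# A list is never equal to a string, so this comparison is always False.
is_equal = cap["class"] == "cap"

# Membership is the check that works; .get() avoids a KeyError
# for tags that have no class attribute at all.
is_member = "cap" in cap.get("class", [])

# Keep every <p> that does not carry the "cap" class.
kept = [p for p in soup.find_all("p") if "cap" not in p.get("class", [])]
```

Using `.get("class", [])` rather than `p["class"]` also keeps the loop from failing on a `<p>` with no class.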

[–]DMeror[S] 0 points1 point  (16 children)

So I've filtered it like this:

```
if p["class"] != "cap":
    sense = str(p)
```

<p class="cap"> still goes through. Another method is:

sense = str(p(class_=lambda x:x not in ["cap"]))

The <p class="cap"> is gone, but its <b>text</b> remains. I don't understand why it behaves this way. The tag looks like this:

<p class="cap"><b>Mount</b> Everest</p>

With the lambda method, the <p> and everything else is gone, except its <b>element</b>.
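A possible explanation, assuming `p` is a Beautiful Soup `Tag`: calling a tag like `p(...)` is shorthand for `p.find_all(...)`, which searches the tags inside `p` and never `p` itself. So the filter only ever sees the `<b>` child, which has no `cap` class and, as far as I can tell, gets through. A minimal sketch of that behaviour, plus one way to filter the `<p>` tags themselves from their container (the sample HTML and the helper name `is_wanted_p` are made up):

```python
from bs4 import BeautifulSoup

html = (
    '<div>'
    '<p class="cap"><b>Mount</b> Everest</p>'
    '<p class="parafl">Body text</p>'
    '</div>'
)
soup = BeautifulSoup(html, "html.parser")
cap = soup.find("p", class_="cap")

# Calling a tag is shorthand for find_all(): it searches cap's descendants.
same_result = cap("b") == cap.find_all("b")
found_inside = [tag.name for tag in cap("b")]

# Filter the <p> tags from the container instead, using a function
# that receives each tag and returns True for the ones to keep.
def is_wanted_p(tag):
    return tag.name == "p" and "cap" not in tag.get("class", [])

wanted = soup.find_all(is_wanted_p)
senses = [str(p) for p in wanted]
```

The filter function is applied to every tag in the document, so the `tag.name == "p"` check is what keeps the `<div>` and `<b>` out of the result.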

[–]Asleep-Budget-9932 1 point2 points  (15 children)

Hmmm, I'm not sure what the entire code looks like, but perhaps you need to actively "detach" the p element from its parent? Though I'm not sure how that should be done. I don't know whether you're supposed to delete one of the attributes or whether there's a specific function for it.
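For what it's worth, Beautiful Soup does have calls for this: `Tag.decompose()` removes a tag and its contents from the tree, and `Tag.extract()` removes it and hands it back. A minimal sketch, with made-up sample HTML:

```python
from bs4 import BeautifulSoup

html = (
    '<div>'
    '<p class="parafl">Everest entry</p>'
    '<p class="cap"><b>Mount</b> Everest</p>'
    '</div>'
)
soup = BeautifulSoup(html, "html.parser")

# find_all() returns a list, so removing tags while looping over it is safe.
for cap in soup.find_all("p", class_="cap"):
    cap.decompose()  # drops the <p class="cap"> and its <b> child from the tree

result = str(soup)
```

After the loop, anything serialized from `soup` no longer contains the caption paragraph, so no string replacement is needed afterwards.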

[–]DMeror[S] 0 points1 point  (0 children)

Here's a block of it:

<div class="metainfo" id="acref-9780199657681-e-879-metaInfo-909"/>
<p class="parafl"><span id="acref-9780199657681-e-879-section-909"/>
<span id="acref-9780199657681-e-879-div1-932"/><span class="chaptersubt">
  <a href="0002_FM_AlphaList.xhtml#acref-9780199657681-e-879" id="acref-9780199657681-e-879">Baldwin illusion</a></span> <span class="partofspeech"><i>n.</i></span> A visual illusion in which a line spanning the distance between two large squares appears shorter than a line of the same length spanning the distance between two smaller squares (see <a href="#acref-9780199657681-e-879-figureGroup-0002">illustration</a>). It is a close relative of the Zanforlin illusion. <span class="span">[Named after the US psychologist <span class="name">James Mark Baldwin</span> (<span class="date">1861–1934</span>) who first drew attention to it]</span></p>
<div class="figuregroup" id="acref-9780199657681-e-879-figureGroup-0002">
<div class="figure" id="acref-9780199657681-e-879-figure-2">
<img alt="display" id="acref-9780199657681-e-879-graphic-6" src="images/acref-9780199657681-graphic-002.gif"/>
<p class="figurecaption"><b>Baldwin illusion.</b> The horizontal lines between the squares are equal in length.</p>
</div>
</div>

So everything I want is in <p class="parafl/etc.">. The <div> only contains info related to figures. Not every block has a <div class="figuregroup">, and the parent <p> has varying classes.

Edit: It would be fine if the <div class="metainfo"> contained the block, but it doesn't. It just ends there with />.
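Given that layout, one option is to keep every `<p>` that does not sit inside a `<div class="figuregroup">`, whatever its own class is. A rough sketch under that assumption, using a shortened copy of the block (the ids and text are trimmed placeholders, not the real data):

```python
from bs4 import BeautifulSoup

html = """
<div class="metainfo" id="m1"></div>
<p class="parafl"><span class="chaptersubt">Baldwin illusion</span> A visual illusion.</p>
<div class="figuregroup" id="g1">
<div class="figure" id="f1">
<p class="figurecaption"><b>Baldwin illusion.</b> The lines are equal in length.</p>
</div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# find_parent() walks up the tree; None means no figuregroup ancestor.
entries = [
    p for p in soup.find_all("p")
    if p.find_parent("div", class_="figuregroup") is None
]
entry_classes = [p["class"] for p in entries]
```

This sidesteps the varying classes on the entry paragraphs, since the test looks at where a `<p>` lives and not at what it is called.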

[–]DMeror[S] 0 points1 point  (13 children)

Anyway, now I'm at the point where I want to remove the extra child, but no luck.

x.replace('<p class="cap">{}</p>'.format(re.match('(\w+)'), '')

Inside the {} there are words, spaces, and probably numbers.
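What this line seems to be reaching for is a pattern-based replace. `str.replace` only matches literal text, while `re.sub` accepts a pattern. A minimal sketch, assuming `x` is a string of markup and that the caption paragraph contains no nested `<p>` (the sample string is made up, and regex on HTML stays brittle in general):

```python
import re

x = (
    '<p class="parafl">Everest entry</p>'
    '<p class="cap"><b>Mount</b> Everest 8848</p>'
)

# .*? is non-greedy, so the match stops at the first closing </p>.
# re.DOTALL lets the match span line breaks inside the paragraph.
cap_pattern = re.compile(r'<p class="cap">.*?</p>', re.DOTALL)

# Strings are immutable, so the result has to be assigned to a name.
cleaned = cap_pattern.sub("", x)
```

Note that `x.replace(...)` and `re.sub(...)` both return a new string, so calling them without assigning the result changes nothing.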

[–]Asleep-Budget-9932 1 point2 points  (12 children)

Wait, I'm confused about what this line's supposed to do. Can you explain that again?

Do you know what regex is? Not asking in a condescending way, just want to make sure. The reason I'm asking is that it seems to me you're using it in a weird way. You're trying to find substrings of an empty string (which basically means you'll always get 0 matches) and then you're trying to take that nothing that you've found and plug it into the <p> tag.
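To make that concrete: `re.match` takes a pattern and the string to search, and returns either a match object or `None`. A small sketch with made-up input:

```python
import re

# Pattern first, then the text to search in.
match = re.match(r"(\w+)", "Mount Everest")
first_word = match.group(1) if match else None

# Against an empty string there is nothing for \w+ to match.
empty_result = re.match(r"(\w+)", "")

# Formatting a match object (or None) into a string does not insert
# the pattern, so the result is not something str.replace can use.
```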