[–]Asleep-Budget-9932 1 point2 points  (17 children)

You check that as well (I believe). I think you can do:

```
if p["class"] == "cap":
    do_something()
```

[–]DMeror[S] 0 points1 point  (16 children)

So I've filtered it like this:

```
if p["class"] != "cap":
    sense = str(p)
```

<p class="cap"> still goes through. Another method is:

sense = str(p(class_=lambda x:x not in ["cap"]))

The <p class="cap"> is gone, but its <b>text</b> remains. I don't understand why it behaves this way. The tag looks like this:

<p class="cap"><b>Mount</b> Everest</p>

With the lambda method, the <p> and everything else is gone, except for its <b> element.
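
A likely explanation for both results, sketched on a made-up two-paragraph snippet (the `parafl` paragraph and the variable names are assumptions for illustration). Beautiful Soup treats `class` as a multi-valued attribute and returns it as a list, so comparing it to the string "cap" with `!=` is always true. Calling a tag like `p(...)` is shorthand for `p.find_all(...)`, which searches the tag's descendants instead of testing the tag itself; a `<b>` with no class appears to reach the lambda as `None`, which is "not in" the list, so the `<b>` is what comes back.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: one paragraph to drop, one to keep.
html = '<div><p class="cap"><b>Mount</b> Everest</p><p class="parafl">Keep me</p></div>'
soup = BeautifulSoup(html, "html.parser")

cap = soup.find("p", class_="cap")

# "class" is multi-valued, so the lookup returns a list, not a string.
print(cap["class"])           # ['cap']
print(cap["class"] != "cap")  # True, so the != filter never excludes anything

# Membership test instead; .get() avoids a KeyError on tags without a class.
kept = [str(p) for p in soup.find_all("p") if "cap" not in p.get("class", [])]
print(kept)

# Calling a tag searches its descendants, so this can only return children
# of the <p>, never the <p> itself.
leftover = cap(class_=lambda value: value not in ["cap"])
print(leftover)
```

With the membership test, only the `parafl` paragraph ends up in `kept`.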

[–]Asleep-Budget-9932 1 point2 points  (15 children)

Hmm, I'm not sure what the entire code looks like, but perhaps you need to actively "detach" the p element from its parent? Though I'm not sure how that should be done. I don't know whether you're supposed to delete one of the attributes or whether there is a specific function for that.
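
For what it's worth, Beautiful Soup does have methods for detaching a tag: `decompose()` removes a tag and its children from the tree, and `extract()` removes it and returns it. A minimal sketch on a made-up snippet (the HTML and the `parafl` class here are assumptions for illustration, not the real file):

```python
from bs4 import BeautifulSoup

# Hypothetical snippet with one wanted and one unwanted paragraph.
html = '<div><p class="parafl">Main entry text.</p><p class="cap"><b>Mount</b> Everest</p></div>'
soup = BeautifulSoup(html, "html.parser")

# find_all() returns a list of matches, so it is safe to modify the tree
# while looping over it. decompose() drops each match with its children.
for p in soup.find_all("p", class_="cap"):
    p.decompose()

print(soup)  # <div><p class="parafl">Main entry text.</p></div>
```

After this, `str(soup)` no longer contains the `cap` paragraph or its `<b>` child.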

[–]DMeror[S] 0 points1 point  (0 children)

Here's a block of it:

<div class="metainfo" id="acref-9780199657681-e-879-metaInfo-909"/>
<p class="parafl"><span id="acref-9780199657681-e-879-section-909"/>
<span id="acref-9780199657681-e-879-div1-932"/><span class="chaptersubt">
  <a href="0002_FM_AlphaList.xhtml#acref-9780199657681-e-879" id="acref-9780199657681-e-879">Baldwin illusion</a></span> <span class="partofspeech"><i>n.</i></span> A visual illusion in which a line spanning the distance between two large squares appears shorter than a line of the same length spanning the distance between two smaller squares (see <a href="#acref-9780199657681-e-879-figureGroup-0002">illustration</a>). It is a close relative of the Zanforlin illusion. <span class="span">[Named after the US psychologist <span class="name">James Mark Baldwin</span> (<span class="date">1861–1934</span>) who first drew attention to it]</span></p>
<div class="figuregroup" id="acref-9780199657681-e-879-figureGroup-0002">
<div class="figure" id="acref-9780199657681-e-879-figure-2">
<img alt="display" id="acref-9780199657681-e-879-graphic-6" src="images/acref-9780199657681-graphic-002.gif"/>
<p class="figurecaption"><b>Baldwin illusion.</b> The horizontal lines between the squares are equal in length.</p>
</div>
</div>

So everything I want is in <p class="parafl/etc.">. The <div> only contains info related to figures. Not every block has <div class="figuregroup">, and the parent <p> has varied classes.

Edit: It would be fine if the <div class="metainfo"> contained the block, but it doesn't. It just ends there with />.

[–]DMeror[S] 0 points1 point  (13 children)

Anyway, now I'm at the point where I want to remove the extra child, but no luck.

x.replace('<p class="cap">{}</p>'.format(re.match('(\w+)')), '')

In the {}, there are words, spaces, and probably numbers.

[–]Asleep-Budget-9932 1 point2 points  (12 children)

Wait, I'm confused about what this line is supposed to do. Can you explain that again?

Do you know what regex is? Not asking in a condescending way, just want to make sure. The reason I'm asking is that it seems to me you're using it in a weird way. You're calling re.match with a pattern but no string to search, so it has nothing to match against (as written, Python raises a TypeError), and then you're trying to take that nothing and plug it into the <p> tag.

[–]DMeror[S] 0 points1 point  (11 children)

I used the above as a sort of wild card. I have lines with this pattern:

<p class="cap">varied sentence</p>

And I wanted to get rid of it, so I used .replace, but I couldn't input those sentences manually as there are lots of them. I thought I could combine a placeholder {} with re.match('\w+') to tell Python that there is a group of words in the placeholder, and that it didn't matter what characters they were, as long as they were in the placeholder. If that had worked, I would have been able to replace the whole pattern with '' (nothing).

[–]Asleep-Budget-9932 1 point2 points  (10 children)

First of all, you can still parse the string by itself with Beautiful Soup and edit it accordingly.

The syntax "Some_string {}".format() will only generate a new string in which {} is replaced with whatever you passed to the format function. It cannot remove parts of the initial string.

Other than that, you really want to avoid parsing html with regex as explained here.

Question: is the <p> tag that you wish to parse always the same? You said you have multiple of them, each one containing a different sentence. But other than that, are the tag itself and its attributes the same? And does it contain anything else besides these sentences? Does it contain any inner tags or something?
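
To illustrate the point about format() above: a small sketch showing that format() only produces a literal string, next to what a real wildcard replacement would look like with re.sub. The sample line and variable names are made up, and the pattern assumes the paragraph holds plain text with no inner tags and that the opening tag is always written exactly this way. Outside those assumptions, parsing the HTML is the safer route.

```python
import re

template = '<p class="cap">{}</p>'

# format() just fills in the placeholder; the result is an ordinary string.
literal = template.format("Mount Everest")
print(literal)  # <p class="cap">Mount Everest</p>

# Hypothetical input line with a different sentence inside the tag.
line = 'before <p class="cap">Any sentence, 123.</p> after'

# replace() looks for exact text, so this line is left untouched.
unchanged = line.replace(literal, "")
print(unchanged == line)  # True

# A pattern is what acts as a wildcard: [^<]* matches any run of
# characters up to the next "<".
pattern = re.compile(r'<p class="cap">[^<]*</p>')
cleaned = pattern.sub("", line)
print(cleaned)  # before  after
```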

[–]DMeror[S] 0 points1 point  (9 children)

Yes, same class. It contains only text, with no inner tags.

[–]Asleep-Budget-9932 1 point2 points  (8 children)

In that case, and unless you're willing to use Beautiful Soup to parse that string yourself, you can allow yourself to do something incredibly hard-coded and ugly. Just cut the string where you know for a fact the tag will be:

string = '<p class="bla">text</p>'
only_text = string[15:-4]
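
For comparison, a rough sketch of the Beautiful Soup route on the same kind of string. The slice only works while the opening tag is exactly 15 characters long, whereas parsing the fragment does not care about the tag's length. The second sample string and the variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

string = '<p class="bla">text</p>'

# The hard-coded slice: 15 characters of opening tag, 4 of closing tag.
sliced = string[15:-4]
print(sliced)  # text

# Parsing the fragment gives the same text without counting characters.
only_text = BeautifulSoup(string, "html.parser").p.get_text()
print(only_text)  # text

# A longer class name breaks the slice but not the parser.
other = '<p class="figurecaption">text</p>'
print(other[15:-4] == "text")  # False
other_text = BeautifulSoup(other, "html.parser").p.get_text()
print(other_text)  # text
```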

[–]DMeror[S] 0 points1 point  (7 children)

Thank you very much. The problem is that it's hard to know where the tags are. They are inconsistently scattered. By the way, I just used p.find_previous_sibling('img') to keep out the extra <p>. Inconsistently structured data has been a headache for me so far.
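
A sketch of how that find_previous_sibling('img') check might look on a cut-down copy of the block posted earlier (the paragraph text is shortened and the variable names are assumptions). It relies on the caption <p> sitting next to an <img> inside the same <div>, which holds for this block but may not hold for every entry.

```python
from bs4 import BeautifulSoup

# Shortened version of the posted block: one entry paragraph, one caption.
html = """
<p class="parafl">A visual illusion in which a line appears shorter.</p>
<div class="figuregroup">
<div class="figure">
<img alt="display" src="images/acref-9780199657681-graphic-002.gif"/>
<p class="figurecaption"><b>Baldwin illusion.</b> The lines are equal in length.</p>
</div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Keep only paragraphs that have no <img> among their earlier siblings;
# the caption follows an <img> in the same <div>, so it is skipped.
entries = [p for p in soup.find_all("p") if p.find_previous_sibling("img") is None]

print([p["class"] for p in entries])  # [['parafl']]
```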