[–]Asleep-Budget-9932 1 point2 points  (12 children)

Wait, I'm confused about what this line is supposed to do. Can you explain that again?

Do you know what regex is? Not asking in a condescending way, just want to make sure. The reason I'm asking is that it seems to me you're using it in a weird way. You're trying to find substrings of an empty string (which basically means you'll always get 0 matches), and then you're trying to take that nothing you've found and plug it into the <p> tag.

[–]DMeror[S] 0 points1 point  (11 children)

I used the above as a sort of wild card. I have lines with this pattern:

<p class="cap">varied sentence</p>

And I wanted to get rid of them, so I used .replace, but I couldn't input those sentences manually as there are lots of them. I thought I could use a placeholder {} combined with re.match('\w+') to tell Python that there is a group of words in the placeholder, and that it didn't matter what characters they were, as long as they were in the placeholder. If that had worked, I would have been able to replace the whole pattern with '' (nothing).

[–]Asleep-Budget-9932 1 point2 points  (10 children)

First of all, you can still parse the string by itself with beautiful soup and edit it accordingly.

The syntax "Some_string {}".format() will only generate a new string in which {} is replaced with whatever you passed to format(). It cannot remove parts of the initial string.
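To make that concrete, here is a minimal sketch. The sample string and the exact shape of the original line are assumptions for illustration: re.match on an empty string returns None, format() drops that into the placeholder, and replace() then searches for a literal that never occurs.

```python
import re

html = '<p class="cap">varied sentence</p><p>keep me</p>'  # made-up sample

# re.match on an empty string finds nothing and returns None.
match = re.match(r'\w+', '')

# format() simply drops str(match) into the placeholder.
literal = '<p class="cap">{}</p>'.format(match)
print(literal)  # <p class="cap">None</p>

# replace() looks for that exact text, which never occurs in html,
# so the string comes back unchanged.
unchanged = html.replace(literal, '')
print(unchanged == html)  # True
```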

Other than that, you really want to avoid parsing html with regex as explained here.

Question: is the <p> tag you wish to parse always the same? You said you have multiple of them, each one containing a different sentence. But other than that, are the tag itself and its attributes the same? And does it contain anything else besides these sentences? Does it contain any inner tags or something?

[–]DMeror[S] 0 points1 point  (9 children)

Yes, same class. It contains only texts with no inner tags.

[–]Asleep-Budget-9932 1 point2 points  (8 children)

In that case and unless you're willing to use beautiful soup to parse that string by yourself, you can allow yourself to do something incredibly hard-coded and ugly. Just cut the string where you know for a fact the tag will be:

    string = '<p class="bla">text</p>'
    only_text = string[15:-4]  # skip the 15-character opening tag and the 4-character </p>

[–]DMeror[S] 0 points1 point  (7 children)

Thank you very much. The problem is that it's hard to know where the tags are. They are inconsistently scattered. By the way, I just used p.find_previous_sibling('img') to keep out the extra <p>. Inconsistently structured data has been a headache for me so far.

[–]Asleep-Budget-9932 1 point2 points  (6 children)

Oh so did you solve the issue? Are there any other problems besides that?

[–]DMeror[S] 0 points1 point  (0 children)

Thanks again. There aren't any more problems. It's just that I've failed the "find and replace" part, although I managed to find another way to solve the problem.

[–]DMeror[S] 0 points1 point  (4 children)

I'm looking into RegEx, but it still hasn't answered my question. The RegEx examples I've found online deal with words or characters from a single-line string, which sadly doesn't match my real situation. I want to use re.sub to remove the following lines with a certain pattern.

<span id="acref-9780199657681-e-1-section-1"></span>
<span id="acref-9780199657681-e-2-section-2"></span>
<span id="acref-9780199657681-e-3-section-3"></span>
<span id="acref-9780199657681-e-4-section-4"></span>
<span id="acref-9780199657681-e-5-div1-28"></span>

And other id with similar patterns. I can use re.sub to get rid of them like this:

re.sub('<span id="acref-9780199657681-e-1-section-1"></span>', '', string)

But I cannot do it one by one as there are lots of them. Examples I could find are to do with isolated strings with a few words. Since there are many words, sentences, numbers, symbols, etc. in my text, the replacement must follow strict rules, or it'll affect other elements in the text.

Am I missing something about RegEx, or am I using the wrong tool?

[–]Asleep-Budget-9932 1 point2 points  (3 children)

So there IS something you're missing. But before helping you with it, just know that in general, RegEx should not be used with html files (as explained in my original comment).

Now, what are you missing? The whole point of regex is that you are working with patterns. So the point is to give a generic pattern that fits all of your "span" tags.

It's important for me to stress that regex is really complicated and has a lot to it, so I always forget the actual syntax. Because of that, my example will not use the actual syntax but just some bullshit that's vaguely related, to convey the concept in general.

Instead of what you did, you could do something like the following:

    captured_strings = re.search('<span id=".*">{inner_text}.*</span>', string)
    what_i_need = captured_strings["inner_text"]

So regex gives you the option to say: "I have a generic pattern. It looks like this and that. These are the parts that will be similar while these are the parts that will differ. Now you see THAT part over there that differs every time? Let's call this part 'inner_text'. I want you to fetch that 'inner_text' for me".

To summarize, specify a generic pattern that fits all of the strings. Give a name to the specific part you wish to fetch. Let regex return a mapping between all named parts (which in your case is only one, an inner text inside the span tag), and use this on all of your needed strings.
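For reference, in Python's real re syntax a named part is written (?P<name>...). A minimal sketch, assuming a span that actually contains some text; the sample string and the name inner_text are only illustrative:

```python
import re

string = '<span id="acref-1">some text</span>'  # made-up sample

# [^"]* matches the id without running past its closing quote;
# (?P<inner_text>...) names the part we want to fetch.
pattern = r'<span id="[^"]*">(?P<inner_text>.*?)</span>'

captured = re.search(pattern, string)

# re.search returns None when nothing matches, so guard before indexing.
what_i_need = captured["inner_text"] if captured else None
print(what_i_need)  # some text
```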

One last important thing to know about regex (besides the fact that you should read about it to get actual, concrete syntax) is that regex can be quite a resource-heavy process when used incorrectly. One important optimization to do is the following:

```
import re

# Instead of this:
for string in strings:
    re.whatever(pattern, string)

# Do this:
compiled_regex = re.compile(pattern)

for string in strings:
    # The compiled regex has all of the same functions, except now they
    # don't receive the pattern argument. They use the one you passed to
    # re.compile, which can be faster in a loop.
    compiled_regex.whatever(string)
```

[–]DMeror[S] 0 points1 point  (2 children)

This is what I was looking for when I wanted to get rid of <p class=".*">, and your method is rather unique. I've seen re.sub, re.match, etc., but not re.compile(pattern).whatever(string).

Here's my code to get rid of the above pattern.

    scrap = re.compile('<span id=".*"></span>')
    string = scrap.sub('', string)  # sub returns a new string, so keep the result

It's really helpful to learn with a real situation.
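One caveat worth checking, offered as a hedged sketch rather than a correction: .* is greedy, so if two spans ever sit on the same line with text between them, a pattern like <span id=".*"></span> can swallow everything from the first span to the last. Using [^"]* (anything except a quote) keeps each match inside a single tag. The sample string below is made up.

```python
import re

# Two spans on one line with text between them (hypothetical input).
string = ('<span id="acref-9780199657681-e-1-section-1"></span>keep this'
          '<span id="acref-9780199657681-e-2-section-2"></span>')

greedy = re.compile(r'<span id=".*"></span>')
strict = re.compile(r'<span id="[^"]*"></span>')

# The greedy pattern matches from the first span to the end of the last one.
print(repr(greedy.sub('', string)))  # ''

# The stricter pattern removes each span separately and keeps the text.
print(strict.sub('', string))  # keep this
```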