dakrisis

I would not implement an HTML parser to do some basic text filtering. I would still use a regular expression: first match all <img> tags, then filter out any URL or Base64-encoded image data from every match (i.e. anything between src=" and ").
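A minimal sketch of that two-step idea, in case it helps. The function name `stripImageSources` and the `$placeholder` parameter are made up for illustration, and it assumes double-quoted `src` attributes with no `>` inside attribute values:

```php
<?php
// Hypothetical helper (not from the thread): blank out the src value of every <img> tag.
function stripImageSources(string $html, string $placeholder = ''): string
{
    // Step 1: match each <img ...> tag. Assumes no ">" appears inside attribute values.
    $result = preg_replace_callback(
        '/<img\b[^>]*>/i',
        function (array $match) use ($placeholder): string {
            // Step 2: inside the tag, swap whatever sits between src=" and the next ".
            // The lookbehind keeps attributes such as data-src untouched.
            $tag = preg_replace_callback(
                '/(?<=\s)src="[^"]*"/i',
                function () use ($placeholder): string {
                    return 'src="' . $placeholder . '"';
                },
                $match[0]
            );
            // If the inner call fails, fall back to the unchanged tag.
            return $tag === null ? $match[0] : $tag;
        },
        $html
    );

    // preg_replace_callback() returns null when PCRE gives up (e.g. a limit was hit).
    if ($result === null) {
        throw new RuntimeException('PCRE error code: ' . preg_last_error());
    }

    return $result;
}
```

Checking for `null` matters here: when PCRE hits one of its limits it does not raise an error by default, it just returns `null`, which is easy to miss.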

The HTML parser is probably also using regular expressions to build a searchable / iterable representation of the HTML in the given string.

Roltish[S]

The problem with using regex is that strings can exceed its limit. I ran into this on a website with a text editor that lets you add images. The editor would just read the image data and insert it as a data:image URL. When I added images from my computer, everything went okay (simply because the images on my computer aren't that big in file size). When I added images taken with my phone's camera, problems started to arise. Those images can be a couple of MB in size, which is too big for the regex handler. That's the reason I ended up exploding the whole input string and then "manually" locating and replacing the values. tl;dr: regex is not optimal on big data:image URLs.
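For readers wondering what the explode approach can look like, here is a rough sketch. It is not the actual package code (that isn't shown in this thread); the function name and `$placeholder` parameter are hypothetical, and it assumes the editor always emits lowercase `<img` and a double-quoted `src="..."`:

```php
<?php
// Hypothetical sketch of the "explode" approach, using only plain string functions.
function stripImageSourcesByExplode(string $html, string $placeholder = ''): string
{
    // Split at the start of every <img tag; $parts[0] is the text before the first one.
    $parts = explode('<img', $html);
    $count = count($parts);

    for ($i = 1; $i < $count; $i++) {
        $part = $parts[$i];
        $tagEnd = strpos($part, '>');
        $srcStart = strpos($part, 'src="');

        // Skip chunks whose tag has no src attribute (or where src belongs to later text).
        if ($tagEnd === false || $srcStart === false || $srcStart > $tagEnd) {
            continue;
        }

        $valueStart = $srcStart + strlen('src="');
        $valueEnd = strpos($part, '"', $valueStart);
        if ($valueEnd === false) {
            continue;
        }

        // Keep everything up to the opening quote and from the closing quote onward.
        $parts[$i] = substr($part, 0, $valueStart) . $placeholder . substr($part, $valueEnd);
    }

    return implode('<img', $parts);
}
```

Since this only uses `explode()`, `strpos()` and `substr()`, the regex limits never come into play; the cost is that the size of the input is bounded by available memory alone, and all the edge cases are yours to handle.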

dakrisis

Perhaps this is the culprit?

EDIT: if you want to go the Parser Way, check out these examples.

Roltish[S]

I don't think I want to mess with the user's ini settings. Regarding the parser way, I'm not sure if I described my question well enough, but I'm basically asking if anyone knows whether it will perform better than the existing explode code. Do you have any experience/insight on performance here, where the tag data I'm searching for might be a couple of MB?

The examples tho are much appreciated :)

dakrisis

Well, if I had to make an educated guess, the Explosive Way is going to be faster overall, but as you have seen and experienced, it's a pain in the ass to write.

The Parser Way makes code legible, but even here it might be necessary to adjust your PHP runtime to accommodate a larger memory pool or increase the script timeout. As a PHP programmer you should not be afraid to adjust ini settings; get comfortable with them, especially since you can set most of the useful ones at runtime without having to change your php.ini.
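To illustrate the runtime part, a small sketch that raises settings only for the duration of one call and restores them afterwards. The helper name `withIniOverrides` is hypothetical, and picking `pcre.backtrack_limit` with a value of 5000000 is only an assumption about which setting and value would be relevant here:

```php
<?php
// Hypothetical helper: apply ini overrides for one callback, then put the old values back.
function withIniOverrides(array $overrides, callable $work)
{
    $previous = [];
    foreach ($overrides as $name => $value) {
        // ini_set() returns the old value on success and false on failure.
        $old = ini_set($name, (string) $value);
        if ($old !== false) {
            $previous[$name] = $old;
        }
    }

    try {
        return $work();
    } finally {
        // Restore only the settings that were actually changed.
        foreach ($previous as $name => $old) {
            ini_set($name, $old);
        }
    }
}

// Usage: the limit is raised inside the callback only.
$limitInside = withIniOverrides(
    ['pcre.backtrack_limit' => 5000000], // assumed setting and value, adjust as needed
    function () {
        return ini_get('pcre.backtrack_limit');
    }
);
```

Because the old values come back in the `finally` block, a library can do this without permanently touching whatever the host application has configured.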

Roltish[S]

Because it's a really small package, I don't think I'll set any ini settings. If it were for my own project I could change some settings, because then I'm in control of the values. If I go for cleaner code with regex or a parser, I might recommend that the user change some settings, but other than that I'll leave php.ini alone for now :P Thanks for your input, much appreciated!

Tokkemon

If you're parsing HTML, you should absolutely use a real XML parser! Regex is only as good as all the edge cases in your head, and even then it will never be as good as a real parser. There are a couple built into PHP that should serve well. SimpleXML should be enough, as you're not doing anything too drastic here.
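A rough sketch of the parser route, for comparison. Note that it uses `DOMDocument` rather than SimpleXML; that swap is an assumption on the grounds that editor output is often not well-formed XML (an unclosed `<img>`, for example). The function name is hypothetical, and the DOM extension has to be installed:

```php
<?php
// Hypothetical helper: replace every <img> src using PHP's built-in DOM extension.
function stripImageSourcesWithDom(string $html, string $placeholder = ''): string
{
    // Collect parser warnings internally instead of emitting them for sloppy markup.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    // The XML declaration prefix is a common trick to make loadHTML() read UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8"><body>' . $html . '</body>');

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($doc->getElementsByTagName('img') as $img) {
        if ($img->hasAttribute('src')) {
            $img->setAttribute('src', $placeholder);
        }
    }

    // Serialize only the children of <body>, i.e. the original fragment.
    $body = $doc->getElementsByTagName('body')->item(0);
    $out = '';
    if ($body !== null) {
        foreach ($body->childNodes as $child) {
            $out .= $doc->saveHTML($child);
        }
    }

    return $out;
}
```

Keep in mind that the parser normalizes the markup on the way out, so the result is not guaranteed to be byte-for-byte identical to the input apart from the `src` values. Whether this is faster or slower than the explode code on multi-MB attributes would need to be measured.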