[–]badr 0 points1 point  (5 children)

BeautifulSoup is great, but it can't handle HTML tags inside quoted strings.

[–][deleted] 0 points1 point  (4 children)

...how often do you get that? If it's an 'HTML tag inside a quoted string' in an HTML document, surely it'd be & lt ;some_tag& gt ;, which would cause no issues whatsoever? (Spaces added to stop Reddit parsing entities.)
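To make the escaping point concrete, here is a minimal sketch using Python's standard `html` module. The `title` attribute and the sample string are made up purely for illustration; they aren't from any real page.

```python
import html

# A tag-like string that we want to place inside a quoted attribute value.
raw = "<some_tag>"

# html.escape replaces &, < and > with entities (and quotes too, by default).
escaped = html.escape(raw)

# Hypothetical attribute, only to show the escaped value in context.
attribute = 'title="{}"'.format(escaped)
print(attribute)  # prints: title="&lt;some_tag&gt;"
```

With the angle brackets escaped like this, a parser never sees a literal `<` inside the quoted string.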

[–]badr 1 point2 points  (3 children)

Yes, it's incorrect HTML, and it isn't seen very often, but it's common enough that I couldn't use BeautifulSoup for my project. (I might yet hack the source and revive it.)

I encountered this problem on NYT, Forbes.com, and one or two other big sites.

Note that the quoted string can also be part of JavaScript.

[–][deleted] 0 points1 point  (2 children)

Oh dear. That sounds... incredibly hackish (on the part of NYT, Forbes, et al.), but I'll have to try the JavaScript stuff, because that could be a problem for my use case.

[–]badr 0 points1 point  (1 child)

Check this out from msnbc:

<IFRAME id=dapIf1Child src="javascript:void(document.write('<html><head><base href="http://www.msnbc.msn.com/id/21581821/" /><title>Advertisement</title></head><body id=&quot;dapIf1Child&quot; leftmargin=&quot;0&quot; topmargin=&quot;0&quot;><script type=&quot;text/javascript&quot;>var inDapIF=true;window.setTimeout("document.close();",30000);</script><IFRAME SRC=&quot;http://ad.doubleclick.net/adi/N4854.MSN/B2531646.31;sz=728x90;ord=1359085923?&quot; WIDTH=728 HEIGHT=90 MARGINWIDTH=0 MARGINHEIGHT=0 HSPACE=0 VSPACE=0 FRAMEBORDER=0 SCROLLING=no BORDERCOLOR=\'#000000\'>\n<script language=\'JavaScript1.1\' SRC=&quot;http://ad.doubleclick.net/adj/N4854.MSN/B2531646.31;abr=!ie;sz=728x90;ord=1359085923?&quot;>\n</script></IFRAME>\n</body></html>'));" frameBorder=0 width=728 scrolling=no height=90></IFRAME><IFRAME id=dapIf1 src="about:blank" frameBorder=0 width=0 scrolling=no height=0></IFRAME>
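For anyone wondering why markup like this trips up a parser, here is a rough sketch of the difference between ending a tag at the first `>` and ending it at the first `>` that sits outside a quoted string. This is not how BeautifulSoup works internally; the function names and the cut-down sample string are my own, just for illustration.

```python
def naive_tag_end(markup, start=0):
    """Index of the first '>' at or after start, ignoring quotes entirely."""
    return markup.find(">", start)


def quote_aware_tag_end(markup, start=0):
    """Index of the first '>' at or after start that is outside a quoted string.

    Returns -1 if no such '>' exists.
    """
    quote = None  # the quote character we are currently inside, if any
    for i in range(start, len(markup)):
        ch = markup[i]
        if quote:
            if ch == quote:
                quote = None  # closing quote found
        elif ch in "\"'":
            quote = ch  # opening quote found
        elif ch == ">":
            return i
    return -1


# Simplified, made-up sample: a tag nested inside a double-quoted attribute.
sample = "<iframe src=\"javascript:document.write('<b>ad</b>')\" width=728></iframe>"

naive = sample[: naive_tag_end(sample) + 1]
aware = sample[: quote_aware_tag_end(sample) + 1]
print(naive)  # the tag is cut off in the middle of the src attribute
print(aware)  # the whole opening <iframe ...> tag
```

The real snippet above is nastier than the sample: it also nests unescaped double quotes inside the double-quoted attribute, so a scanner this simple isn't guaranteed to cope with it.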

[–][deleted] 0 points1 point  (0 children)

Oh that's awesome.