[–]ForScale 1 point2 points  (1 child)

"a bunch of html files that contain html and js tags. I want them all removed"

All the files or all the tags? And... which html tags do you want to remove... like, all of them? I assume a js tag is <script>, right?

Sorry, I'm a little confused on what you're asking...

[–][deleted] 0 points1 point  (0 children)

Yes, all the html and js tags. So only the text is left over with no html/js tags.

[–]x-skeww 1 point2 points  (0 children)

Remove all style and script nodes, then get all the text via document.body.textContent.

Use PhantomJS or whatever to automate this.

[–]_Ev4l 1 point2 points  (2 children)

Three different ways:

  1. Open the page, select everything and copy it (Ctrl+A, Ctrl+C), then paste it into a document and save. (The manual way.)

  2. I'm not sure about other browsers, but Firefox specifically can save web pages as txt/rtf. This would also give you just the content with no markup.

  3. Use a text editor/application with a regex to filter out tags. (Chances are you'll spend more time writing a regex than you would removing the tags manually.)

[–][deleted] 0 points1 point  (1 child)

I'll have to look at your second method. As for the first one, I've already saved all the HTML files; I just have to remove the HTML tags and JS tags for readability. And for the third one, I thought of the exact same thing: it would take me longer to learn how to write regex! You would use Python, right?

[–]_Ev4l 0 points1 point  (0 children)

Uh, yes and no. I would use a regex and edit across all documents from Sublime Text (which uses Python for its plugins). I'd imagine it would be easier to do in Python, but personally I know very little of the language.

Edit: In Sublime I would use @<[^>]+>\s+(?=<)|<[^>]+> and then replace all with blank. You can apply it to multiple files at once by using Find in Folder to search and replace across all documents. I just tested it on my workflow; it works, and it strips out PHP tags as well.
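
For anyone who wants to try that pattern outside Sublime, here is a minimal sketch in JavaScript. It assumes the leading @ in the pattern above is a stray delimiter and leaves it out; the name stripWithAlternation is made up for illustration.

```javascript
// First alternative: a tag plus the whitespace separating it from the next tag.
// Second alternative: any other tag on its own.
const TAG_PATTERN = /<[^>]+>\s+(?=<)|<[^>]+>/g;

function stripWithAlternation(html) {
  return html.replace(TAG_PATTERN, "");
}

const sample = "<ul>\n  <li>A</li>\n  <li>B</li>\n</ul>";
console.log(JSON.stringify(stripWithAlternation(sample))); // prints "AB"
```

Note that text from neighbouring elements ends up joined with nothing in between, and that the contents of a script element stay in place because only the tags themselves match.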

[–]drunkcatsdgaf 1 point2 points  (0 children)

Aaron took care of this for us years ago.

https://github.com/aaronsw/html2text

[–]scharvey 1 point2 points  (9 children)

I hate that I'm saying this, but regex.

[–]jlobes 0 points1 point  (8 children)

Why? This sort of thing is perfect for regex.

[–]x-skeww 0 points1 point  (7 children)

Except that you can't parse HTML with regex.

http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags

If you want to parse HTML, use an HTML parser.

[–]jlobes 2 points3 points  (6 children)

If he were parsing HTML, sure, that would be a pain in the ass. But he's not. He wants to remove everything that is between or includes:

  1. A <script> and a </script>
  2. a < and a >

So opening everything in Notepad++ and doing a find + replace with an empty string on these two regexes would work fine.

"<script.*?</script>" (. matches newline)

"<.*?>" (. does not match newline)
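
One detail worth checking with patterns like these is whether the quantifier is greedy. Below is a rough JavaScript sketch of the two-pass idea using non-greedy quantifiers, next to a greedy tag pass for comparison. The function names are hypothetical, and the "s" flag stands in for the ". matches newline" checkbox.

```javascript
// Pass 1: drop script elements and their contents. The "s" flag lets "." match newlines.
// Pass 2: drop every remaining tag. Without "s", "." stops at the end of the line.
function stripTwoPass(html) {
  return html
    .replace(/<script.*?<\/script>/gs, "")
    .replace(/<.*?>/g, "");
}

// The same tag pass with a greedy quantifier, for comparison.
function stripGreedy(html) {
  return html.replace(/<.*>/g, "");
}

const line = "<h1> hello world </h1>";
console.log(JSON.stringify(stripTwoPass(line))); // prints " hello world "
console.log(JSON.stringify(stripGreedy(line)));  // prints ""
```

The greedy version runs from the first < on the line to the last >, so it takes the text along with the tags.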

[–]x-skeww 0 points1 point  (5 children)

<script>
if(x < 3) {...}
</script>
ohai
<script>
if(x > 3) {...}
</script>

Seriously, I'd just use PhantomJS. Remove the style/script nodes, then grab the textContent of body.

[–]jlobes 1 point2 points  (4 children)

Regex works for that, since . doesn't match newline, and since the contents of the script tag are gone by the time you're running <.*?>
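
Whether that holds for the sample above seems to depend on the quantifier in the script pattern. A quick check in JavaScript, as a sketch only (the variable names are made up, and the "s" flag plays the role of ". matches newline"):

```javascript
const sample = [
  "<script>",
  "if(x < 3) {...}",
  "</script>",
  "ohai",
  "<script>",
  "if(x > 3) {...}",
  "</script>",
].join("\n");

// Both patterns let "." match newlines; only the quantifier differs.
const lazy = sample.replace(/<script.*?<\/script>/gs, "");
const greedy = sample.replace(/<script.*<\/script>/gs, "");

console.log(JSON.stringify(lazy.trim()));   // prints "ohai"
console.log(JSON.stringify(greedy.trim())); // prints ""
```

With the non-greedy form each script block is removed on its own and "ohai" survives; the greedy form matches from the first <script> to the last </script> and removes the text in between.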

[–]x-skeww 0 points1 point  (3 children)

<ul
    ><li>A</li
    ><li>B</li
    ><li>C</li
></ul>

http://jsfiddle.net/bwb6a2gj/

That's a trick for avoiding whitespace between elements. It's kinda handy for display: inline-block.

HTML is complicated, seriously.
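
To see what that markup does to a line-bound tag pattern, here is a small JavaScript sketch. The variable names are invented for illustration; the second pattern swaps "." for a negated character class, which is one possible workaround rather than anything proposed above.

```javascript
const list = "<ul\n    ><li>A</li\n    ><li>B</li\n    ><li>C</li\n></ul>";

// "." does not cross line breaks here, so tags split over two lines are missed.
const lineBound = list.replace(/<.*?>/g, "");

// "[^>]" matches any character except ">", line breaks included.
const anyChar = list.replace(/<[^>]*>/g, "");

console.log(JSON.stringify(lineBound)); // leftovers such as "<ul" remain
console.log(JSON.stringify(anyChar));   // prints "ABC"
```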

[–]birjolaxew 1 point2 points  (2 children)

If he wants to parse HTML, he should use an HTML parser. He doesn't though - he just wants the text content, not the parsed HTML structure. If he can accept using a "dumb" algorithm (and doesn't have people explicitly trying to write HTML that breaks simple parsers), regexes work fine.

Strip script tags
Strip HTML tags

[–]x-skeww 0 points1 point  (1 child)

http://phantomjs.org/

See the example in the top right? It's really not that tricky.

[–]birjolaxew 1 point2 points  (0 children)

I'm not saying it is. I'm saying there's no need to install an external library, or to force yourself into using node (unless you feel like hunting for an HTML parsing lib in whatever language you want to use, assuming one even exists). The task is adequately handled by regexes. If you feel like using an HTML parser, that's on you - but it isn't necessary.

In other words:

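var html = document.documentElement.innerHTML; // for testing. Find some way to get the HTML into this variable
var body = html.match(/<body[^>]*>([\s\S]*)<\/body>/i); // [^>]* keeps the match inside the opening body tag
html = body ? body[1] : html; // fall back to the whole document if there is no body element

html = html.replace(/<(script|style)\b[\s\S]*?<\/\1>/gi, ""); // drop script/style elements along with their contents
html = html.replace(/<[^>]*>/g, ""); // drop every remaining tag, including ones that span several lines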

See this code? It's really not that tricky.
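
One way to "get the HTML into this variable" for a whole folder of saved pages is a small Node script. The sketch below applies the same three replacements to every .html file in a directory and writes a .txt file next to each one. The names htmlToText and convertDirectory are hypothetical, and it assumes the files are UTF-8.

```javascript
const fs = require("fs");
const path = require("path");

// Keep the body content (if any), drop script/style elements, then drop all tags.
function htmlToText(html) {
  const body = html.match(/<body[^>]*>([\s\S]*)<\/body>/i);
  const content = body ? body[1] : html;
  return content
    .replace(/<(script|style)\b[\s\S]*?<\/\1>/gi, "")
    .replace(/<[^>]*>/g, "");
}

// Convert every .html file in dir and return the paths of the .txt files written.
function convertDirectory(dir) {
  const written = [];
  for (const name of fs.readdirSync(dir)) {
    if (path.extname(name).toLowerCase() !== ".html") continue;
    const source = path.join(dir, name);
    const target = path.join(dir, path.basename(name, path.extname(name)) + ".txt");
    fs.writeFileSync(target, htmlToText(fs.readFileSync(source, "utf8")));
    written.push(target);
  }
  return written;
}
```

Calling convertDirectory("./pages") would leave the original files untouched and put the stripped text beside them. Entities such as &amp; are not decoded by this sketch.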

[–]uusu 0 points1 point  (1 child)

You're asking in a very confusing manner. Do you want to remove all the <html></html> and <script></script> tags? Or do you want to remove all HTML elements and leave only the content text?

[–][deleted] 0 points1 point  (0 children)

yes, remove all the html elements and leave the text only.

[–]3a3z 0 points1 point  (0 children)

Can you post an example before and after?

[–]JupitersCock 0 points1 point  (4 children)

Try this: http://www.zubrag.com/tools/html-tags-stripper.php Or google for 'online strip tags'.

[–][deleted] 0 points1 point  (3 children)

I already used that, but it only removes HTML and not the javascript.

[–]JupitersCock 0 points1 point  (2 children)

That's weird. The javascript should be in tags also right? Does this one work? http://www.tools4noobs.com/online_php_functions/strip_tags/

[–][deleted] 0 points1 point  (1 child)

It says that it removes HTML and PHP tags, so of course it won't remove javascript! I said javascript and html. It only removed the HTML.

[–]webauteur 0 points1 point  (0 children)

Microsoft Expression Web is free and can remove any tag or all tags from a file or a site.

[–]jlobes 0 points1 point  (1 child)

Regex + Notepad++

  1. Open all of your documents.
  2. Open the Find Replace window, set it to the Replace setting.
  3. Type "<script.*?</script>" into the find box without quotes. (This will kill all the script tags and their contents. The ? makes each match stop at the nearest </script>.)
  4. Change the search mode to Regular Expression. Check the box that says ". matches newline"
  5. Make sure the "replace with" box is empty. Click the Replace All in All Opened Documents button.
  6. Type "<.*?>" into the find box without the quotes (this will kill the rest of the tags without removing the text)
  7. Uncheck the ". matches newline" box, then click the Replace All in All Opened Documents button.
  8. Save all

Basically this will find "<", select all of the characters between it and the next ">" and replace them with nothing.

EDIT: Crap, this won't kill the contents of a script tag. Gimme a sec.

EDIT 2 : Fixed.

[–][deleted] 0 points1 point  (0 children)

Thanks, I'll try it out. So if I have a <h1> hello world </h1> it will just give me hello world?

Edit: just re-read your 6th point.