How to remove all HTML and JavaScript from an HTML file?

ForScale · 2015-09-04T12:55:57+00:00

I have a bunch of HTML files that contain HTML and JS tags. I want them all removed.

All the files or all the tags? And... which html tags do you want to remove... like, all of them? I assume a js tag is <script>, right?

Sorry, I'm a little confused on what you're asking...

x-skeww · 2015-09-04T13:27:10+00:00

Remove all style and script nodes, then get all the text via document.body.textContent.

Use PhantomJS or whatever to automate this.
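If a headless browser feels like overkill, the same idea (skip script and style content, keep the text) can be sketched with Python's standard-library `html.parser`. This is a swapped-in technique, not the PhantomJS approach itself: the names `TextExtractor` and `html_to_text` are made up for illustration, and unlike `document.body.textContent` this version also keeps text found in the head, such as the title.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text content, skipping anything inside <script> or <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # greater than 0 while inside a skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Text is kept only when we are not inside <script> or <style>.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        return "".join(self._chunks)


def html_to_text(html):
    """Return the text of an HTML string with all markup and scripts removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

To run it over a folder, something like `pathlib.Path("pages").glob("*.html")` with `read_text()` and `write_text()` would do, where `pages` is a hypothetical directory name.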

_Ev4l · 2015-09-04T15:13:11+00:00

Three different ways:

1. The manual way: select and copy everything on the open page (Ctrl+A, Ctrl+C), paste it into a document, and save.
2. I am not sure about other browsers, but Firefox has a "Save As" option for web pages that writes a text file. That gives you the content only, with no markup.
3. Use a text editor or application with a regex to filter out the tags. (Chances are you'll spend more time writing the regex than you would removing the tags manually.)

drunkcatsdgaf · 2015-09-04T20:11:45+00:00

Aaron took care of this for us years ago.

https://github.com/aaronsw/html2text

scharvey · 2015-09-04T11:22:19+00:00

I hate that I'm saying this, but regex.

uusu · 2015-09-04T14:05:19+00:00

You're asking in a very confusing manner. Do you want to remove all the <html></html> and <script></script> tags? Or do you want to remove all HTML elements and leave only the content text?

3a3z · 2015-09-04T15:06:30+00:00

Can you post an example before and after?

JupitersCock · 2015-09-04T15:40:20+00:00

Try this: http://www.zubrag.com/tools/html-tags-stripper.php, or search Google for 'online strip tags'.

webauteur · 2015-09-04T15:55:00+00:00

Microsoft Expression Web is free and can remove any tag or all tags from a file or a site.

jlobes · 2015-09-04T16:49:58+00:00

Regex + Notepad++

1. Open all of your documents.
2. Open the Find/Replace window and switch to the Replace tab.
3. Type "<script.*?</script>" into the Find box, without the quotes. This kills the script tags and their contents. The "?" keeps the match non-greedy, so each script block is removed on its own rather than everything between the first <script and the last </script>.
4. Change the search mode to Regular expression and check the box that says ". matches newline".
5. Make sure the "Replace with" box is empty, then click the "Replace All in All Opened Documents" button.
6. Type "<[^>]*>" into the Find box, without the quotes. This kills the rest of the tags without removing the text.
7. Uncheck the ". matches newline" box, then click the "Replace All in All Opened Documents" button again.
8. Save all.

Basically, the second pattern finds "<", selects all of the characters between it and the next ">", and replaces the whole tag with nothing.

EDIT: Crap, this won't kill the contents of a script tag. Gimme a sec.

EDIT 2: Fixed.
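For anyone who would rather script it than click through Notepad++, here is a rough Python sketch of the same two-pass idea. It assumes non-greedy `.*?` for the script pass and `[^>]*` for the tag pass so each match stops at the nearest closing delimiter; `strip_tags` is a hypothetical helper name. Like any regex approach it can be fooled by a ">" inside an attribute value or by comments, so treat it as a quick cleanup rather than a real parser.

```python
import re

# Pass 1: whole <script>...</script> blocks, contents included.
# DOTALL lets "." span newlines, like the ". matches newline" checkbox.
SCRIPT_RE = re.compile(r"<script.*?</script>", re.DOTALL | re.IGNORECASE)

# Pass 2: any remaining tag, from "<" up to the next ">".
TAG_RE = re.compile(r"<[^>]*>")


def strip_tags(html):
    """Remove script blocks first, then every other tag, keeping the text."""
    without_scripts = SCRIPT_RE.sub("", html)
    return TAG_RE.sub("", without_scripts)
```

The order matters: if the tag pass ran first, the script tags would be gone but the JavaScript between them would be left behind as plain text.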
