Clean Up HTML

This page cleans up 'dirty', non-compliant HTML of the kind often produced by web-design programs.  Using regular expressions, either the defaults or those you enter below, it removes unwanted tags, attributes, and so on, and optionally indents the result, reducing the wearisome work of hand-editing to achieve compliance.

To use this online version, copy and paste dirty HTML into the HTML Input field, set parameters as required, press Submit, and then copy the cleaned HTML from the HTML Output field.  The parameters that you can set are described in the clickable Help section.  As this script is still under development, there is as yet no console version available for download.

Help

By default, the configuration fields, which you can edit as required, contain comma-separated lists of regular expressions representing HTML tags or attributes.  Unless otherwise specified, a closing tag is assumed to be the same as its opening tag prefixed by a forward slash, as in <tag> … </tag>.  Tags that differ from this pattern can be entered as opening and closing tags separated by a tilde ~; for example, the comment tag would be entered as !--~--.  Note that for tags the enclosing angle brackets < and > are added automatically and should not be entered anywhere here.
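The field syntax described above can be sketched in Python as follows.  This is only an illustration under stated assumptions, not the page's actual code: the function name parse_tag_specs is hypothetical, and the naive split on commas assumes that no regular expression in the field itself contains a comma.

```python
def parse_tag_specs(field):
    """Split a comma-separated field into (opening, closing) regex pairs.

    Hypothetical helper: assumes no regex in the field contains a comma.
    """
    pairs = []
    for entry in field.split(","):
        entry = entry.strip()
        if not entry:
            continue  # ignore empty entries such as a trailing comma
        if "~" in entry:
            # Explicit form: opening~closing
            opening, closing = entry.split("~", 1)
        else:
            # Default form: closing tag is the opening tag prefixed by "/"
            opening, closing = entry, "/" + entry
        pairs.append((opening, closing))
    return pairs


print(parse_tag_specs("font, span, !--~--"))
```

The angle brackets are deliberately absent from the returned patterns, since the Help text says they are added automatically.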

The configuration fields that you can set are as follows …

First, there is a field controlling HTML tags to be excised entirely, that is to say the entire HTML including and between the opening tag and the closing tag will be removed.  By default, this field contains regular expressions to remove non-compliant tags often inserted by Microsoft products: O\:[^>]+ and !-*\[IF[^\]]*\]~!\[ENDIF\]-*.
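As a rough illustration of excision, the following Python sketch applies the two default expressions to small samples.  The function name excise, the case-insensitive matching, and the non-greedy match between opening and closing tags are assumptions made for this example; the page's own implementation may differ.

```python
import re


def excise(html, spec):
    """Remove everything from the opening tag through the closing tag."""
    # Assumed convention from the Help text: "open~close",
    # otherwise the closing tag is "/" followed by the opening pattern.
    if "~" in spec:
        opening, closing = spec.split("~", 1)
    else:
        opening, closing = spec, "/" + spec
    pattern = re.compile(
        "<(?:%s)>.*?<(?:%s)>" % (opening, closing),
        re.IGNORECASE | re.DOTALL,  # tags may differ in case and span lines
    )
    return pattern.sub("", html)


print(excise("<p>Hi<o:p>junk</o:p> there</p>", r"O\:[^>]+"))
print(excise("a<!--[if gte mso 9]><xml>x</xml><![endif]-->b",
             r"!-*\[IF[^\]]*\]~!\[ENDIF\]-*"))
```

A non-greedy match stops at the first closing tag, so nested tags of the same name would need extra handling.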

Next, there is a field controlling HTML tags to be removed, that is to say both opening and closing tags will be removed, but intervening HTML will be left for further processing.  By default, this contains the following tags: font and span.
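Removing tags while keeping their contents might look like the Python sketch below.  The function name unwrap_tags is hypothetical, and the pattern assumes that attribute values do not contain a literal > character.

```python
import re


def unwrap_tags(html, names):
    """Delete opening and closing tags for the given names, keeping inner HTML."""
    pattern = re.compile(
        r"</?(?:%s)\b[^>]*>" % "|".join(names),  # e.g. <font ...> or </font>
        re.IGNORECASE,
    )
    return pattern.sub("", html)


sample = '<p><font color="red">Hi <span class="x">there</span></font></p>'
print(unwrap_tags(sample, ["font", "span"]))
```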

Next, there is a field controlling HTML tags to be cleaned entirely by removing all attributes.  By default, this contains the following tags: html, body, table, and tr.
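One possible way to strip every attribute from the listed tags is sketched below in Python.  The name strip_attributes is hypothetical, and the sketch assumes attribute values contain no literal > character.

```python
import re


def strip_attributes(html, names):
    """Rewrite opening tags for the given names so they carry no attributes."""
    pattern = re.compile(r"<(%s)\b[^>]*>" % "|".join(names), re.IGNORECASE)
    # Keep only the tag name captured in group 1; closing tags never match
    # because the name must follow "<" directly.
    return pattern.sub(lambda m: "<" + m.group(1) + ">", html)


sample = '<table border="1"><tr bgcolor="#fff"><td width="5">x</td></tr></table>'
print(strip_attributes(sample, ["html", "body", "table", "tr"]))
```

Tags not in the list, such as td here, keep their attributes for later filtering.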

Next, there is a field controlling HTML tags to be left unchanged, that is to say the single tag or opening and closing tags will be normalised as to case and quoting but otherwise left unchanged.  By default, this contains the following tags: !DOCTYPE, iframe, link, meta, script, style, and comment.

Next, there is a field controlling tag pairs that may meaningfully be empty, and therefore will not be removed if empty.  By default, this contains the following tags: applet, iframe, menuitem, object, output, script, textarea, td, th, and tr.
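The wording above implies that other empty tag pairs are removed.  A minimal Python sketch of that behaviour follows; the name remove_empty_pairs is hypothetical, whitespace-only content is assumed to count as empty, and for simplicity the exemption list is treated as plain tag names rather than regular expressions.

```python
import re


def remove_empty_pairs(html, may_be_empty):
    """Repeatedly drop empty <x></x> pairs unless x is in may_be_empty."""
    pattern = re.compile(r"<([a-z][a-z0-9]*)\b[^>]*>\s*</\1\s*>", re.IGNORECASE)

    def replace(match):
        # Keep pairs whose tag name is exempt; delete all others.
        return match.group(0) if match.group(1).lower() in may_be_empty else ""

    previous = None
    while previous != html:  # repeat so newly emptied parents are caught
        previous = html
        html = pattern.sub(replace, html)
    return html


print(remove_empty_pairs("<div><p></p><td></td><b> </b></div>", {"td", "tr"}))
```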

Next, there is a field controlling tag attributes to be left unchanged.  By default, this contains the following attributes: accesskey, class, dir, id, lang, style, tabindex, title, content, href, http-equiv, name, on.*, src, target, type, value, colspan, rowspan, cols, rows, height, and width.
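Filtering attributes against such a list could be sketched in Python as below.  The names ATTR and filter_attributes are hypothetical, and the sketch assumes that each list entry must match the whole attribute name and that attributes not in the list are dropped.

```python
import re

# One attribute: name, optionally followed by a quoted or bare value.
ATTR = re.compile(r'\s+([^\s=>/]+)(?:\s*=\s*("[^"]*"|\'[^\']*\'|[^\s>]+))?')


def filter_attributes(tag, allowed):
    """Keep only attributes whose names fully match an allowed regex.

    tag is a single opening tag such as '<a href="x" onclick="f()">'.
    """
    name = re.match(r"<([^\s>/]+)", tag).group(1)
    keep = re.compile("|".join(allowed), re.IGNORECASE)
    kept = []
    for attr_name, value in ATTR.findall(tag):
        if keep.fullmatch(attr_name):
            kept.append("%s=%s" % (attr_name, value) if value else attr_name)
    return "<" + " ".join([name] + kept) + ">"


tag = '<a href="x.html" onclick="go()" bgcolor=red data-x="1">'
print(filter_attributes(tag, ["href", "on.*", "class"]))
```

Matching the whole name means that on.* covers onclick and onload but does not accidentally accept a name that merely contains "on".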

Next, there is a field controlling the indentation style of the HTML output, which sets what is inserted for each level of indent.  A positive integer inserts the given number of spaces, 0 gives no indentation, and a negative integer inserts a single tab, which for the purposes of line-length calculations is taken to represent the equivalent positive number of spaces.  By default, this contains -4, a tab representing 4 spaces.
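The rule for this setting can be expressed as a small Python sketch.  The function name indent_unit and the returned pair are assumptions chosen for illustration.

```python
def indent_unit(setting):
    """Return (text inserted per indent level, width counted per level)."""
    if setting > 0:
        return " " * setting, setting  # that many spaces
    if setting < 0:
        return "\t", -setting  # one tab, counted as abs(setting) spaces
    return "", 0  # no indentation


print(indent_unit(-4))
```

With the default of -4, each level is written as one tab but counts as four characters toward the line length.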

Next, there is a field controlling the maximum length of a line before forcing line wrap in the HTML output.  By default, this contains 80 characters.  As the name suggests, the inner HTML between the opening and closing tags in the leave-unchanged list is not wrapped.

Next, there is a field controlling whether tags may be linewrapped internally.  By default, this is set to false.

Next, there is a field controlling opening and closing tag pairs that will be linewrapped after the closing tag.  By default, this contains the following tags: caption, heading h[1-6], label, legend, option, and title.

Next, there is a field controlling tag pairs that must not be linewrapped.  By default, this contains the following tags: a, abbr, acronym, b, big, center, del, em, font, i, ins, q, s, samp, small, span, strike, strong, sub, sup, tt, and u.

Next, there is a field controlling which EOL characters to use.  By default, this is set to: Browser/OS.

Next, there is a field controlling whether to display the cleaned HTML in an internal frame.  By default, this is set to true.

An editable comment can be added to the HTML source, or omitted altogether by blanking it.  Its default value is: Cleaned by: <This URL>, <Date>.

Notes

Apologies for these inconveniences.

Updated     Description
11/07/2014  v1.2  Fixed partial failure when leave-alone tags field empty
27/05/2014  v1.1  Added saving and restoring of defaults to/from cookies
22/05/2014  Created v1.0