Class HtmlSanitizer
HTML sanitizer for user-generated content. Uses Jsoup with explicit allowlists to neutralize
cross-site-scripting payloads (script tags, event handler attributes, javascript: URIs, etc.)
while preserving safe markup produced by the CMS rich-text editor or annotation/comment
authors. The Jsoup dependency is fully encapsulated — callers never import
org.jsoup.*.
Two profiles are exposed, each with a sanitizing variant and a read-only validation variant that share the same allowlist definition:
- cleanRichText(String) / isCleanRichText(String): for TinyMCE-style rich-text editor output (CMS htmltext components, license placeholder descriptions). Allows the structural and inline tag set, including tables, headings, lists, links, images and figure/figcaption.
- cleanComment(String) / isCleanComment(String): for short user-authored snippets (comments, annotation bodies). Allows only minimal inline formatting; preserves plain-text line breaks by converting them to <br> before sanitization.
- cleanCommentPlainText(String): for consumers that must produce plain text (no HTML markup), such as IIIF Search hit selectors or other non-HTML JSON payloads. Strips all tags and preserves plain-text newlines verbatim (no <br> injection).
Replaces the regex-based StringTools.stripJS(String) which only removed
<script> and <svg> blocks and was bypassable through any other XSS vector
(event-handler attributes, javascript: URIs, etc.).
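The weakness of a block-stripping regex can be illustrated with a minimal sketch. The pattern below is a hypothetical stand-in written for this example; it is not the actual StringTools.stripJS implementation, only an approximation of "remove <script> and <svg> blocks".

```java
import java.util.regex.Pattern;

final class RegexStripDemo {

    // Hypothetical stand-in for a regex-based stripper (NOT the real
    // StringTools.stripJS): removes whole <script>...</script> and
    // <svg>...</svg> blocks and nothing else.
    private static final Pattern SCRIPT_OR_SVG = Pattern.compile(
            "<(script|svg)\\b[^>]*>.*?</\\1\\s*>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    static String stripScriptAndSvg(String input) {
        if (input == null) {
            return null;
        }
        return SCRIPT_OR_SVG.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        // The script block is removed as intended.
        System.out.println(stripScriptAndSvg("<p>Hi</p><script>alert(1)</script>"));
        // The event-handler attribute is not covered by the pattern and survives.
        System.out.println(stripScriptAndSvg("<img src=\"x\" onerror=\"alert(1)\">"));
        // The javascript: URI survives as well.
        System.out.println(stripScriptAndSvg("<a href=\"javascript:alert(1)\">click</a>"));
    }
}
```

Both of the last two inputs pass through unchanged, which is the class of bypass an allowlist-based sanitizer is meant to close.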
-
Method Summary
Modifier and Type | Method | Description
static String | cleanComment(String input) | Sanitize short user-authored snippets such as comments and annotation bodies.
static String | cleanCommentPlainText(String input) | Sanitize user-authored content for consumers that must emit plain text (no HTML markup), for example IIIF Search hit selectors whose prefix/suffix fields are plain text per the W3C Web Annotation spec, or any other non-HTML JSON payload that may be rendered downstream.
static String | cleanRichText(String input) | Sanitize rich-text HTML produced by TinyMCE-style editors.
static boolean | isCleanComment(String input) | Validate whether the given input would survive cleanComment(String) unchanged.
static boolean | isCleanRichText(String input) | Validate whether the given input would survive cleanRichText(String) unchanged (modulo Jsoup's internal parser representation).
-
Method Details
-
cleanRichText
Sanitize rich-text HTML produced by TinyMCE-style editors. Allows the typical structural and inline tag set, with an explicit URL-scheme allowlist for anchors and images. Output is generated with prettyPrint=false so byte-equality round-trips are preserved for non-pathological input (no whitespace collapsing).
Parameters:
input - raw HTML string from a CMS rich-text editor; may be null
Returns:
sanitized HTML containing only allowlisted tags and attributes; null if input was null
-
isCleanRichText
Validate whether the given input would survive cleanRichText(String) unchanged (modulo Jsoup's internal parser representation). Use this for warn-or-clean detection branches instead of comparing input to cleanRichText(input); the latter would trigger on harmless attribute reordering or tag-case normalization.
Parameters:
input - HTML string to validate; may be null
Returns:
true if input contains only allowlisted tags and attributes (or is null/empty); false otherwise
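A warn-or-clean branch of the kind mentioned above might look like the following sketch. The class WarnOrCleanDemo, its Outcome type and the warnOrClean helper are hypothetical names introduced for illustration. The validator and sanitizer are passed in as functions so the example compiles on its own; in real code they would presumably be HtmlSanitizer::isCleanRichText and HtmlSanitizer::cleanRichText.

```java
import java.util.function.Predicate;
import java.util.function.UnaryOperator;

final class WarnOrCleanDemo {

    // Hypothetical result holder: the HTML to keep and whether it was altered.
    static final class Outcome {
        final String html;
        final boolean wasModified;

        Outcome(String html, boolean wasModified) {
            this.html = html;
            this.wasModified = wasModified;
        }
    }

    // Branch on the read-only validator rather than on
    // input.equals(clean.apply(input)), so harmless normalization by the
    // sanitizer does not count as a finding.
    static Outcome warnOrClean(String input,
                               Predicate<String> isClean,
                               UnaryOperator<String> clean) {
        if (isClean.test(input)) {
            return new Outcome(input, false);
        }
        return new Outcome(clean.apply(input), true);
    }

    public static void main(String[] args) {
        // Toy stand-ins so the sketch runs without the real class.
        // They are NOT real sanitizers.
        Predicate<String> isClean = s -> s == null || !s.contains("<script");
        UnaryOperator<String> clean = s -> s.replace("<script>alert(1)</script>", "");

        Outcome outcome = warnOrClean("<p>Hi</p><script>alert(1)</script>", isClean, clean);
        if (outcome.wasModified) {
            System.out.println("warning: unsafe markup was removed");
        }
        System.out.println(outcome.html);
    }
}
```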
-
cleanComment
Sanitize short user-authored snippets such as comments and annotation bodies. Allows only minimal inline formatting; rejects images, tables, headings and any block-level structural markup. Plain-text line breaks (\n) are converted to <br> before sanitization so they survive Jsoup's whitespace collapsing.
Parameters:
input - raw string; may be null
Returns:
sanitized string with allowlisted inline tags only; null if input was null
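The line-break pre-processing described for cleanComment can be sketched as below. The helper name newlinesToBr is hypothetical, and the handling of \r\n and bare \r as well as the exact replacement string are assumptions of this example; the description above only states that \n is converted to <br> before sanitization.

```java
final class LineBreakDemo {

    // Hypothetical pre-processing step: rewrite plain-text line breaks as <br>
    // so that a later HTML sanitization pass does not collapse them away.
    // Treating \r\n and \r like \n is an assumption made for this sketch.
    static String newlinesToBr(String input) {
        if (input == null) {
            return null;
        }
        return input.replaceAll("\\r\\n|\\r|\\n", "<br>");
    }

    public static void main(String[] args) {
        // A two-line plain-text comment becomes a single line with a <br> tag.
        System.out.println(newlinesToBr("first line\nsecond line"));
    }
}
```

The result would then be passed to the allowlist-based sanitizer, which has to permit the br tag for the line breaks to survive.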
-
cleanCommentPlainText
Sanitize user-authored content for consumers that must emit plain text (no HTML markup), for example IIIF Search hit selectors whose prefix/suffix fields are plain text per the W3C Web Annotation spec, or any other non-HTML JSON payload that may be rendered downstream.
Differs from cleanComment(String) in two ways: the deny-by-default Safelist.none() strips ALL tags (including the inline formatting that cleanComment preserves), and plain-text newlines are kept as \n rather than rewritten to <br>\n. Output therefore round-trips a plain-text input unchanged. prettyPrint(false) prevents Jsoup from collapsing whitespace in surviving text nodes.
Parameters:
input - raw string; may be null
Returns:
plain-text string with all HTML tags removed and \n preserved; null if input was null
-
isCleanComment
Validate whether the given input would survive cleanComment(String) unchanged. Plain-text line breaks are first converted to <br> (matching the sanitize path) so a multi-line plain-text comment is considered clean.
Parameters:
input - string to validate; may be null
Returns:
true if input contains only allowlisted tags (or is null/empty); false otherwise
-