Class HtmlSanitizer
HTML sanitizer for user-generated content. Uses Jsoup with explicit allowlists to neutralize
cross-site-scripting payloads (script tags, event handler attributes, javascript: URIs, etc.)
while preserving safe markup produced by the CMS rich-text editor or annotation/comment
authors. The Jsoup dependency is fully encapsulated — callers never import
org.jsoup.*.
Two profiles are exposed, each with a sanitizing variant and a read-only validation variant that share the same allowlist definition; the comment profile additionally has a plain-text variant:
- cleanRichText(String) / isCleanRichText(String): for TinyMCE-style rich-text editor output (CMS htmltext components, license placeholder descriptions). Allows the structural and inline tag set including tables, headings, lists, links, images and figure/figcaption.
- cleanComment(String) / isCleanComment(String): for short user-authored snippets (comments, annotation bodies). Allows only minimal inline formatting; preserves plain-text line breaks by converting them to <br> before sanitization.
- cleanCommentPlainText(String): for consumers that must produce plain text (no HTML markup), such as IIIF Search hit selectors or other non-HTML JSON payloads. Strips all tags and preserves plain-text newlines verbatim (no <br> injection).
Method Summary
Modifier and Type | Method | Description
static String | cleanComment(String input) | Sanitize short user-authored snippets such as comments and annotation bodies.
static String | cleanCommentPlainText(String input) | Sanitize user-authored content for consumers that must emit plain text (no HTML markup), for example IIIF Search hit selectors whose prefix/suffix fields are plain text per the W3C Web Annotation spec, or any other non-HTML JSON payload that may be rendered downstream.
static String | cleanFulltextSnippet(String input) | Sanitize fulltext snippets produced by the Solr highlighter for display in search-result hit boxes.
static String | cleanFulltextWithNamedEntities(String input) | Sanitize page-level fulltext output produced by the ALTO reading pipeline (see ALTOTools.getFulltext(String, String, boolean) and the NamedEntityEnricher it pipes through).
static String | cleanRichText(String input) | Sanitize rich-text HTML produced by TinyMCE-style editors.
static boolean | isCleanComment(String input) | Validate whether the given input would survive cleanComment(String) unchanged.
static boolean | isCleanRichText(String input) | Validate whether the given input would survive cleanRichText(String) unchanged (modulo Jsoup's internal parser representation).
Method Details
cleanRichText
Sanitize rich-text HTML produced by TinyMCE-style editors. Allows the typical structural and inline tag set, with an explicit URL-scheme allowlist for anchors and images. Output is generated with prettyPrint=false, so byte-equality round-trips are preserved for non-pathological input (no whitespace collapsing).
Parameters:
input - raw HTML string from a CMS rich-text editor; may be null
Returns:
sanitized HTML containing only allowlisted tags and attributes; null if input was null
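To illustrate what an explicit URL-scheme allowlist means for anchors and images, the following is a minimal standalone sketch that uses only the JDK. It is not the class's implementation, which delegates this check to Jsoup. The class name SchemeAllowlistSketch, the method hasAllowedScheme, the example scheme set (http, https, mailto) and the decision to reject relative URLs are all assumptions made for illustration.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;
import java.util.Set;

public class SchemeAllowlistSketch {

    // Hypothetical scheme set; the real allowlist is defined inside HtmlSanitizer.
    public static final Set<String> ALLOWED_SCHEMES = Set.of("http", "https", "mailto");

    /** Returns true only if the URL parses and its scheme is in the allowlist. */
    public static boolean hasAllowedScheme(String url) {
        if (url == null) {
            return false;
        }
        try {
            String scheme = new URI(url.trim()).getScheme();
            // Assumption: relative URLs (no scheme) are rejected in this sketch.
            return scheme != null
                    && ALLOWED_SCHEMES.contains(scheme.toLowerCase(Locale.ROOT));
        } catch (URISyntaxException e) {
            // Values that do not parse as a URI are treated as unsafe (fail closed).
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(hasAllowedScheme("https://example.org/page")); // true
        System.out.println(hasAllowedScheme("javascript:alert(1)"));      // false
    }
}
```

The check is deny-by-default: a scheme must be named in the set to pass, so mixed-case variants such as JavaScript: and malformed values are rejected without needing a blocklist entry.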
isCleanRichText
Validate whether the given input would survive cleanRichText(String) unchanged (modulo Jsoup's internal parser representation). Use this for warn-or-clean detection branches instead of comparing input to cleanRichText(input); the latter would trigger on harmless attribute reordering or tag-case normalization.
Parameters:
input - HTML string to validate; may be null
Returns:
true if input contains only allowlisted tags and attributes (or is null/empty); false otherwise
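The warn-or-clean branch mentioned above could look roughly like the sketch below. So that it runs without Jsoup or HtmlSanitizer on the classpath, the validator and sanitizer are passed in as functions; in real use they would presumably be HtmlSanitizer::isCleanRichText and HtmlSanitizer::cleanRichText. The class WarnOrCleanSketch, its Outcome type and the toy stand-ins in main are hypothetical and are not part of the documented API.

```java
import java.util.function.Predicate;
import java.util.function.UnaryOperator;

public class WarnOrCleanSketch {

    /** Holds the text to persist and whether the input had to be cleaned. */
    public static final class Outcome {
        public final String text;
        public final boolean wasCleaned;

        Outcome(String text, boolean wasCleaned) {
            this.text = text;
            this.wasCleaned = wasCleaned;
        }
    }

    /** Validate first; sanitize only when validation fails. */
    public static Outcome warnOrClean(String input,
                                      Predicate<String> isClean,
                                      UnaryOperator<String> clean) {
        if (isClean.test(input)) {
            // Keep the input exactly as submitted; no string comparison needed.
            return new Outcome(input, false);
        }
        // The caller can log a warning when wasCleaned is true.
        return new Outcome(clean.apply(input), true);
    }

    public static void main(String[] args) {
        // Toy stand-ins so the sketch is runnable; they are NOT real sanitizers.
        Predicate<String> toyIsClean = s -> s == null || !s.contains("<script");
        UnaryOperator<String> toyClean =
                s -> s == null ? null : s.replace("<script", "&lt;script");

        Outcome ok = warnOrClean("<p>Hello</p>", toyIsClean, toyClean);
        Outcome bad = warnOrClean("<script>x</script>", toyIsClean, toyClean);
        System.out.println(ok.wasCleaned + " " + bad.wasCleaned); // false true
    }
}
```

Branching on the validation result rather than on input.equals(clean(input)) avoids false warnings when the sanitizer merely normalizes harmless details of otherwise clean markup.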
cleanComment
Sanitize short user-authored snippets such as comments and annotation bodies. Allows only minimal inline formatting; rejects images, tables, headings and any block-level structural markup. Plain-text line breaks (\n) are converted to <br> before sanitization so they survive Jsoup's whitespace collapsing.
Parameters:
input - raw string; may be null
Returns:
sanitized string with allowlisted inline tags only; null if input was null
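The line-break pre-processing step can be sketched with plain string operations. This is an illustration only, not the class's code: the name LineBreakSketch.newlinesToBr is hypothetical, the choice to emit <br> followed by the original newline is inferred from the cleanCommentPlainText description, and the normalization of Windows line endings is an added assumption.

```java
public class LineBreakSketch {

    /** Rewrites each plain-text newline to "<br>\n"; a null input stays null. */
    public static String newlinesToBr(String input) {
        if (input == null) {
            return null;
        }
        // Assumption: "\r\n" is normalized to "\n" before the rewrite.
        return input.replace("\r\n", "\n").replace("\n", "<br>\n");
    }

    public static void main(String[] args) {
        // A two-line comment gains an explicit <br> at the line boundary.
        System.out.println(newlinesToBr("first line\nsecond line"));
    }
}
```

Running this step before sanitization means the line boundary is carried by an allowlisted tag rather than by whitespace, which an HTML parser is free to collapse.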
cleanCommentPlainText
Sanitize user-authored content for consumers that must emit plain text (no HTML markup), for example IIIF Search hit selectors whose prefix/suffix fields are plain text per the W3C Web Annotation spec, or any other non-HTML JSON payload that may be rendered downstream.
Differs from cleanComment(String) in two ways: the deny-by-default Safelist.none() strips ALL tags (including the inline formatting that cleanComment preserves), and plain-text newlines are kept as \n rather than rewritten to <br>\n. Output therefore round-trips a plain-text input unchanged. prettyPrint(false) prevents Jsoup from collapsing whitespace in surviving text nodes.
Parameters:
input - raw string; may be null
Returns:
plain-text string with all HTML tags removed and \n preserved; null if input was null
cleanFulltextSnippet
Sanitize fulltext snippets produced by the Solr highlighter for display in search-result hit boxes. Allows only the <mark> tag with the single class attribute, matching the markup emitted by SearchHelper.replaceHighlightingPlaceholders(String) (which produces <mark class="search-list--highlight">…</mark>). Every other tag and every other attribute is stripped.
If a future highlight emitter introduces additional markup (for example <em> or phrase-level wrappers), the allowlist must be extended explicitly; silent acceptance of new tags is exactly what this profile prevents.
Parameters:
input - raw snippet HTML; may be null
Returns:
sanitized snippet containing only allowlisted <mark> markup; null if input was null
cleanFulltextWithNamedEntities
Sanitize page-level fulltext output produced by the ALTO reading pipeline (see ALTOTools.getFulltext(String, String, boolean) and the NamedEntityEnricher it pipes through). Allows only the <button> tag with the exact attribute set that NamedEntityEnricher.CONTENT_TEMPLATE emits: class, type, and the four attributes data-entity-id, data-entity-type, data-entity-authority-data-uri and data-entity-authority-data-search. No href, no other URL-bearing attribute, no other tag.
Used for the ALTO branch of PhysicalElement.getFullText(). The plain-fulltext branch (server-trusted indexer-pipeline content such as the KHI theme files) is deliberately not sanitized; see the audit memory entry for HIGH 5.
Parameters:
input - raw fulltext HTML from the ALTO pipeline; may be null
Returns:
sanitized HTML containing only allowlisted <button> markup; null if input was null
isCleanComment
Validate whether the given input would survive cleanComment(String) unchanged. Plain-text line breaks are first converted to <br> (matching the sanitize path) so a multi-line plain-text comment is considered clean.
Parameters:
input - string to validate; may be null
Returns:
true if input contains only allowlisted tags (or is null/empty); false otherwise