Class HtmlSanitizer

java.lang.Object
io.goobi.viewer.controller.HtmlSanitizer

public final class HtmlSanitizer extends Object

HTML sanitizer for user-generated content. Uses Jsoup with explicit allowlists to neutralize cross-site-scripting payloads (script tags, event handler attributes, javascript: URIs, etc.) while preserving safe markup produced by the CMS rich-text editor or annotation/comment authors. The Jsoup dependency is fully encapsulated — callers never import org.jsoup.*.

Two profiles are exposed, each with a sanitizing variant and a read-only validation variant that share the same allowlist definition. The comment profile additionally offers a plain-text variant, which has no validation counterpart:

  • cleanRichText(String) / isCleanRichText(String) — for TinyMCE-style rich-text editor output (CMS htmltext components, license placeholder descriptions). Allows the structural and inline tag set including tables, headings, lists, links, images and figure/figcaption.
  • cleanComment(String) / isCleanComment(String) — for short user-authored snippets (comments, annotation bodies). Allows only minimal inline formatting; preserves plain-text line breaks by converting them to <br> before sanitization.
  • cleanCommentPlainText(String) — for consumers that must produce plain text (no HTML markup) such as IIIF Search hit selectors or other non-HTML JSON payloads. Strips all tags and preserves plain-text newlines verbatim (no <br> injection).

Replaces the regex-based StringTools.stripJS(String), which only removed <script> and <svg> blocks and could be bypassed through any other XSS vector (event-handler attributes, javascript: URIs, etc.).
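
To illustrate why block-stripping alone is not enough, here is a minimal, self-contained sketch. The class RegexStripDemo and its pattern are hypothetical stand-ins written for this example; they are not the actual StringTools.stripJS source, whose regular expression is not shown in this documentation. The sketch only assumes the documented behavior (removal of <script> and <svg> blocks).

```java
import java.util.regex.Pattern;

/** Hypothetical stand-in for a block-stripping regex filter (not the real StringTools.stripJS). */
final class RegexStripDemo {

    // Assumed behavior: remove complete <script>...</script> and <svg>...</svg> blocks only.
    private static final Pattern BLOCKS = Pattern.compile(
            "<(script|svg)\\b[^>]*>.*?</\\1\\s*>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    static String stripBlocks(String input) {
        return input == null ? null : BLOCKS.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        // The script block is removed as intended.
        System.out.println(stripBlocks("a<script>alert(1)</script>b"));
        // An event-handler attribute passes through untouched.
        System.out.println(stripBlocks("<img src=\"x\" onerror=\"alert(1)\">"));
        // A javascript: URI passes through untouched.
        System.out.println(stripBlocks("<a href=\"javascript:alert(1)\">click</a>"));
    }
}
```

The second and third inputs contain no <script> or <svg> element, so a filter of this shape returns them unchanged. An allowlist approach avoids this class of bypass because anything not explicitly permitted is dropped.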

  • Method Details

    • cleanRichText

      public static String cleanRichText(String input)
      Sanitize rich-text HTML produced by TinyMCE-style editors. Allows the typical structural and inline tag set, with an explicit URL-scheme allowlist for anchors and images. Output is generated with prettyPrint=false, so non-pathological input round-trips byte-for-byte (no whitespace collapsing).
      Parameters:
      input - raw HTML string from a CMS rich-text editor; may be null
      Returns:
      sanitized HTML containing only allowlisted tags and attributes; null if input was null
    • isCleanRichText

      public static boolean isCleanRichText(String input)
      Validate whether the given input would survive cleanRichText(String) unchanged (modulo Jsoup's internal parser representation). Use this for warn-or-clean detection branches instead of comparing input to cleanRichText(input). That comparison would also trigger on harmless attribute reordering or tag-case normalization.
      Parameters:
      input - HTML string to validate; may be null
      Returns:
      true if input contains only allowlisted tags and attributes (or is null/empty); false otherwise
    • cleanComment

      public static String cleanComment(String input)
      Sanitize short user-authored snippets such as comments and annotation bodies. Allows only minimal inline formatting; rejects images, tables, headings and any block-level structural markup. Plain-text line breaks (\n) are converted to <br> before sanitization so they survive Jsoup's whitespace collapsing.
      Parameters:
      input - raw string; may be null
      Returns:
      sanitized string with allowlisted inline tags only; null if input was null
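
The newline pre-step described for cleanComment can be pictured as a plain string replacement applied before the allowlist pass. The sketch below is an assumption about the shape of that step, not the class's actual source. LineBreakDemo and newlinesToBr are hypothetical names, and \r\n handling is left out because the documentation only mentions \n.

```java
/** Hypothetical illustration of the newline-to-<br> pre-step; not the HtmlSanitizer source. */
final class LineBreakDemo {

    // Rewrites each plain-text line break as a <br> tag so that a later
    // HTML parse, which may collapse whitespace, cannot lose the break.
    static String newlinesToBr(String input) {
        if (input == null) {
            return null; // mirrors the documented null-in, null-out contract
        }
        return input.replace("\n", "<br>");
    }

    public static void main(String[] args) {
        System.out.println(newlinesToBr("first line\nsecond line"));
    }
}
```

Applying the same replacement in the validation path is what lets a multi-line plain-text comment count as clean.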
    • cleanCommentPlainText

      public static String cleanCommentPlainText(String input)
      Sanitize user-authored content for consumers that must emit plain text (no HTML markup), for example IIIF Search hit selectors whose prefix/suffix fields are plain text per the W3C Web Annotation spec, or any other non-HTML JSON payload that may be rendered downstream.

      Differs from cleanComment(String) in two ways: the deny-by-default Safelist.none() strips ALL tags (including the inline formatting that cleanComment preserves), and plain-text newlines are kept as \n rather than rewritten to <br>\n. Output therefore round-trips a plain-text input unchanged. prettyPrint(false) prevents Jsoup from collapsing whitespace in surviving text nodes.

      Parameters:
      input - raw string; may be null
      Returns:
      plain-text string with all HTML tags removed and \n preserved; null if input was null
    • isCleanComment

      public static boolean isCleanComment(String input)
      Validate whether the given input would survive cleanComment(String) unchanged. Plain-text line breaks are first converted to <br> (matching the sanitize path) so a multi-line plain-text comment is considered clean.
      Parameters:
      input - string to validate; may be null
      Returns:
      true if input contains only allowlisted tags (or is null/empty); false otherwise
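
The validation variants lend themselves to a warn-or-clean branch: keep the input as-is when it validates, otherwise sanitize it and report that something was removed. The following is a minimal sketch of that pattern under stated assumptions. WarnOrCleanDemo and warnOrClean are hypothetical names, and the validator and sanitizer are passed in as functions so the example runs without the viewer library on the classpath. The lambdas in main are toy stand-ins and are not safe sanitizers.

```java
import java.util.function.Consumer;
import java.util.function.Predicate;
import java.util.function.UnaryOperator;

/** Hypothetical caller-side helper showing the warn-or-clean pattern. */
final class WarnOrCleanDemo {

    // Returns the input unchanged when it validates; otherwise emits a
    // warning and returns the sanitized form.
    static String warnOrClean(String input,
                              Predicate<String> isClean,
                              UnaryOperator<String> clean,
                              Consumer<String> warn) {
        if (isClean.test(input)) {
            return input;
        }
        warn.accept("Input contained disallowed markup and was sanitized");
        return clean.apply(input);
    }

    public static void main(String[] args) {
        // Toy stand-ins for the real validator and sanitizer (illustration only).
        Predicate<String> isClean = s -> s == null || !s.toLowerCase().contains("<script");
        UnaryOperator<String> clean = s -> s.replaceAll("(?is)<script.*?</script>", "");
        Consumer<String> warn = msg -> System.err.println("WARN: " + msg);

        System.out.println(warnOrClean("<p>hello</p>", isClean, clean, warn));
        System.out.println(warnOrClean("<p>hi</p><script>x()</script>", isClean, clean, warn));
    }
}
```

In application code the two function arguments would presumably be method references such as HtmlSanitizer::isCleanRichText and HtmlSanitizer::cleanRichText, or the corresponding comment pair, since their signatures match Predicate<String> and UnaryOperator<String>.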