Class ALTOTools

java.lang.Object
io.goobi.viewer.controller.ALTOTools

public final class ALTOTools extends Object
Utility class for parsing and processing ALTO (Analyzed Layout and Text Object) XML documents.
  • Field Details

    • TAG_LABEL_IGNORE_REGEX

      public static final String TAG_LABEL_IGNORE_REGEX
      Regular expression matching tag labels to ignore.
    • ALTO_PROBLEMATIC_CHARS

      public static final String ALTO_PROBLEMATIC_CHARS
      Characters that can cause an "Invalid UTF-8 middle byte" error in the parser.
  • Method Details

    • getFulltext

      public static String getFulltext(Path path, String encoding) throws IOException
      Reads the plain full text from an ALTO file. Words split across line breaks are not merged.
      Parameters:
      path - path to the ALTO file
      encoding - character encoding to use when reading the file
      Returns:
      String containing plain text from ALTO at the given path
      Throws:
      IOException
    • getFulltext

      public static String getFulltext(String alto, String charset, boolean mergeLineBreakWords)
      Extracts the plain text from an ALTO XML document, optionally merging words split across line breaks.
      Parameters:
      alto - ALTO XML document as a string
      charset - character encoding of the ALTO document
      mergeLineBreakWords - merge words split across line breaks into one
      Returns:
      the plain text extracted from the ALTO XML document, or null on error
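The extraction described above can be illustrated with a minimal, self-contained sketch. It is not the ALTOTools implementation: the class name AltoTextSketch and the method extractText are hypothetical, and the handling of hyphenated words (using the ALTO attributes SUBS_TYPE and SUBS_CONTENT, and dropping the space that follows a merged second half) is an assumption about what merging line-break words could look like.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical sketch; not part of io.goobi.viewer.controller.ALTOTools.
final class AltoTextSketch {

    /** Collects String/@CONTENT values; SP becomes a space, each TextLine end a newline. */
    static String extractText(String alto, boolean mergeLineBreakWords) {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Disable DTDs and external entities, since the input may be untrusted.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, Boolean.FALSE);
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, Boolean.FALSE);
        StringBuilder sb = new StringBuilder();
        boolean skipSpace = false; // true right after a merged second word half
        try {
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(alto));
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    String name = reader.getLocalName();
                    if ("String".equals(name)) {
                        String content = reader.getAttributeValue(null, "CONTENT");
                        String subsType = reader.getAttributeValue(null, "SUBS_TYPE");
                        String subsContent = reader.getAttributeValue(null, "SUBS_CONTENT");
                        boolean hyphenated = mergeLineBreakWords && subsContent != null;
                        if (hyphenated && "HypPart1".equals(subsType)) {
                            sb.append(subsContent); // emit the whole word once
                            skipSpace = false;
                        } else if (hyphenated && "HypPart2".equals(subsType)) {
                            skipSpace = true; // second half was already emitted
                        } else if (content != null) {
                            sb.append(content);
                            skipSpace = false;
                        }
                    } else if ("SP".equals(name)) {
                        if (skipSpace) {
                            skipSpace = false;
                        } else {
                            sb.append(' ');
                        }
                    } else if ("HYP".equals(name) && !mergeLineBreakWords) {
                        String hyphen = reader.getAttributeValue(null, "CONTENT");
                        sb.append(hyphen != null ? hyphen : "-");
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT
                        && "TextLine".equals(reader.getLocalName())) {
                    sb.append('\n');
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Not well-formed ALTO XML", e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String alto = "<alto><TextLine><String CONTENT=\"exam\" SUBS_TYPE=\"HypPart1\""
                + " SUBS_CONTENT=\"example\"/><HYP CONTENT=\"-\"/></TextLine>"
                + "<TextLine><String CONTENT=\"ple\" SUBS_TYPE=\"HypPart2\""
                + " SUBS_CONTENT=\"example\"/></TextLine></alto>";
        System.out.print(extractText(alto, false));
        System.out.print(extractText(alto, true));
    }
}
```

Unlike getFulltext, which is documented to return null on error, this sketch throws an IllegalArgumentException for malformed input so that failures stay visible.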
    • getNERTags

      public static List<TagCount> getNERTags(String alto, String inCharset, NERTag.Type type)
      Extracts named-entity recognition (NER) tags and their counts from an ALTO XML document.
      Parameters:
      alto - ALTO XML document as a string
      inCharset - character encoding of the ALTO document
      type - NER tag type to filter by; null returns all types
      Returns:
      a list of NER TagCount objects extracted from the given ALTO document
    • alto2Txt

      protected static String alto2Txt(String alto, String charset, boolean mergeLineBreakWords) throws IOException, org.jdom2.JDOMException
      Converts an ALTO XML document to plain text.
      Parameters:
      alto - ALTO XML document as a string
      charset - character encoding of the ALTO document
      mergeLineBreakWords - merge words split across line breaks into one
      Returns:
      the plain text extracted from the ALTO XML document
      Throws:
      IOException - if any.
      XMLStreamException - if any.
      org.jdom2.JDOMException
    • createXmlParser

      public static XMLStreamReader createXmlParser(InputStream is) throws FactoryConfigurationError, XMLStreamException
      Throws:
      FactoryConfigurationError
      XMLStreamException
    • getWordCoords

      public static List<String> getWordCoords(String altoString, String charset, Set<String> searchTerms)
      Returns the coordinates of all words in the ALTO document that match any of the given search terms.
      Parameters:
      altoString - ALTO XML document as a string
      charset - character encoding of the ALTO document
      searchTerms - set of terms whose coordinates to locate
      Returns:
      a list of coordinate strings for words matching any of the given search terms in the ALTO document
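As a rough illustration of locating word coordinates, the following sketch scans the String elements of an ALTO document and collects the position of every word whose CONTENT equals one of the search terms. It is a simplified stand-in, not the ALTOTools implementation: the names AltoWordCoordsSketch and findWordCoords are hypothetical, matching is assumed to be a case-insensitive comparison of whole words, and the output layout "HPOS,VPOS,WIDTH,HEIGHT" is an assumption based on the (x, y, width, height) format described for getALTOCoords.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical sketch; not part of io.goobi.viewer.controller.ALTOTools.
final class AltoWordCoordsSketch {

    /** Returns "HPOS,VPOS,WIDTH,HEIGHT" for each String element matching a search term. */
    static List<String> findWordCoords(String alto, Set<String> searchTerms) {
        // Lower-case the terms once so the comparison is case-insensitive (assumption).
        Set<String> terms = new HashSet<>();
        for (String term : searchTerms) {
            terms.add(term.toLowerCase(Locale.ROOT));
        }
        List<String> coords = new ArrayList<>();
        XMLInputFactory factory = XMLInputFactory.newInstance();
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, Boolean.FALSE);
        try {
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(alto));
            while (reader.hasNext()) {
                if (reader.next() != XMLStreamConstants.START_ELEMENT
                        || !"String".equals(reader.getLocalName())) {
                    continue;
                }
                String content = reader.getAttributeValue(null, "CONTENT");
                if (content == null || !terms.contains(content.toLowerCase(Locale.ROOT))) {
                    continue;
                }
                String x = reader.getAttributeValue(null, "HPOS");
                String y = reader.getAttributeValue(null, "VPOS");
                String w = reader.getAttributeValue(null, "WIDTH");
                String h = reader.getAttributeValue(null, "HEIGHT");
                // Skip words without a complete bounding box.
                if (x != null && y != null && w != null && h != null) {
                    coords.add(String.join(",", x, y, w, h));
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Not well-formed ALTO XML", e);
        }
        return coords;
    }

    public static void main(String[] args) {
        String alto = "<alto><TextLine>"
                + "<String CONTENT=\"Hello\" HPOS=\"10\" VPOS=\"20\" WIDTH=\"30\" HEIGHT=\"40\"/>"
                + "<String CONTENT=\"world\" HPOS=\"50\" VPOS=\"20\" WIDTH=\"35\" HEIGHT=\"40\"/>"
                + "</TextLine></alto>";
        System.out.println(findWordCoords(alto, Set.of("hello")));
    }
}
```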
    • getRotatedCoordinates

      public static String getRotatedCoordinates(String inCoords, int rotation, Dimension pageSize)
      Rotates the given coordinates to match a rotated page image.
      Parameters:
      inCoords - comma-separated coordinate string to rotate
      rotation - rotation angle in degrees (0, 90, 180, 270)
      pageSize - dimensions of the page image
      Returns:
      the rotated bounding box as a comma-separated coordinate string
    • getWordCoords

      public static List<String> getWordCoords(String altoString, String charset, Set<String> searchTerms, int rotation)
      Parameters:
      altoString - ALTO XML document as a string
      charset - character encoding of the ALTO document
      searchTerms - set of terms whose coordinates to locate
      rotation - image rotation in degrees
      Returns:
      a list of coordinate strings for words matching any of the given search terms, rotated to match the image orientation
    • getWordCoords

      public static List<String> getWordCoords(String altoString, String charset, Set<String> searchTerms, int proximitySearchDistance, int rotation)
    • rotate

      protected static Rectangle rotate(Rectangle rect, int rotation, Dimension imageSize)
      Rotates the given rectangle around the image center.
      Parameters:
      rect - rectangle to rotate around the image center
      rotation - rotation angle in degrees (90, 180, 270)
      imageSize - dimensions of the image for computing the rotation
      Returns:
      the rotated bounding rectangle
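The geometry behind such a rotation can be sketched as follows. This is an illustration under stated assumptions, not the ALTOTools code: the class name RotateSketch is hypothetical, the rotation is assumed to be clockwise in image coordinates (origin at the top left, y pointing down), and imageSize is assumed to be the size of the unrotated image. After a 90 or 270 degree rotation the image width and height swap, so the returned rectangle is expressed in the coordinate system of the rotated image.

```java
import java.awt.Dimension;
import java.awt.Rectangle;

// Hypothetical sketch; not part of io.goobi.viewer.controller.ALTOTools.
final class RotateSketch {

    /** Rotates rect clockwise by a multiple of 90 degrees within an image of the given size. */
    static Rectangle rotate(Rectangle rect, int rotation, Dimension imageSize) {
        int normalized = ((rotation % 360) + 360) % 360; // also maps -90 to 270
        switch (normalized) {
            case 0:
                return new Rectangle(rect);
            case 90:
                // The old bottom edge becomes the new left edge.
                return new Rectangle(imageSize.height - rect.y - rect.height, rect.x,
                        rect.height, rect.width);
            case 180:
                return new Rectangle(imageSize.width - rect.x - rect.width,
                        imageSize.height - rect.y - rect.height, rect.width, rect.height);
            case 270:
                // The old right edge becomes the new top edge.
                return new Rectangle(rect.y, imageSize.width - rect.x - rect.width,
                        rect.height, rect.width);
            default:
                throw new IllegalArgumentException("Unsupported rotation: " + rotation);
        }
    }

    public static void main(String[] args) {
        Rectangle word = new Rectangle(10, 20, 30, 40);
        Dimension page = new Dimension(100, 200);
        System.out.println(rotate(word, 90, page));
    }
}
```

Rotating the result of a 90 degree rotation by 270 degrees, with the swapped image size, returns the original rectangle, which is a quick way to check the formulas.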
    • getMatchALTOWord

      public static int getMatchALTOWord(de.intranda.digiverso.ocr.alto.model.structureclasses.lineelements.Word eleWord, String[] words)
      Checks whether the content of the given ALTO word element matches the given search words.
      Parameters:
      eleWord - ALTO word element whose content to match against
      words - array of search words to match against the element
      Returns:
      1 if there is a match; 0 otherwise
    • getALTOCoords

      public static String getALTOCoords(de.intranda.digiverso.ocr.alto.model.superclasses.GeometricData element)
      Returns the bounding box of the given ALTO element as a coordinate string.
      Parameters:
      element - ALTO geometric element whose bounding box to extract
      Returns:
      the bounding box of the given element as a comma-separated coordinate string (x, y, width, height)