Class ALTOTools

java.lang.Object
io.goobi.viewer.controller.ALTOTools

public final class ALTOTools extends Object

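Utility methods for working with ALTO XML documents: full-text extraction, named-entity (NER) tag retrieval, and word coordinate lookup and rotation.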

  • Field Details

    • TAG_LABEL_IGNORE_REGEX

      public static final String TAG_LABEL_IGNORE_REGEX
      Constant TAG_LABEL_IGNORE_REGEX.
    • ALTO_PROBLEMATIC_CHARS

      public static final String ALTO_PROBLEMATIC_CHARS
      Characters that can cause an "Invalid UTF-8 middle byte" error in the parser.
  • Method Details

    • getFulltext

      public static String getFulltext(Path path, String encoding) throws IOException
      Reads the plain full text from an ALTO file. Words split across line breaks are not merged.
      Parameters:
      path - Path to the ALTO file
      encoding - Character encoding of the ALTO file
      Returns:
      String containing plain text from ALTO at the given path
      Throws:
      IOException
    • getFulltext

      public static String getFulltext(String alto, String charset, boolean mergeLineBreakWords)

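      Extracts the plain full text from an ALTO document passed as a string.

      Parameters:
      alto - String containing the ALTO XML document
      charset - ALTO charset
      mergeLineBreakWords - whether words split across line breaks should be merged
      Returns:
      String containing the plain text of the ALTO document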
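      The effect of a flag such as mergeLineBreakWords can be illustrated with a minimal, self-contained sketch. It is not the ALTOTools implementation: the class AltoTextSketch, its methods, and the sample document are hypothetical, and the sketch assumes that hyphenated words carry the standard ALTO attributes SUBS_TYPE (HypPart1/HypPart2) and SUBS_CONTENT and that the hyphen itself is an HYP element.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Hypothetical sketch; this is not the ALTOTools implementation. */
final class AltoTextSketch {

    /** Two text lines; "Bibliothek" is hyphenated across the line break. */
    static final String SAMPLE = "<alto><Layout><Page><PrintSpace><TextBlock>"
            + "<TextLine><String CONTENT=\"Die\"/><SP/>"
            + "<String CONTENT=\"Biblio\" SUBS_TYPE=\"HypPart1\" SUBS_CONTENT=\"Bibliothek\"/>"
            + "<HYP CONTENT=\"-\"/></TextLine>"
            + "<TextLine><String CONTENT=\"thek\" SUBS_TYPE=\"HypPart2\" SUBS_CONTENT=\"Bibliothek\"/>"
            + "<SP/><String CONTENT=\"ist\"/></TextLine>"
            + "</TextBlock></PrintSpace></Page></Layout></alto>";

    /** Returns one output line per ALTO TextLine, words joined by single spaces. */
    static String extractText(String alto, boolean mergeLineBreakWords) {
        List<String> lines = new ArrayList<>();
        List<String> words = new ArrayList<>();
        try {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            factory.setProperty(XMLInputFactory.SUPPORT_DTD, Boolean.FALSE);
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(alto));
            try {
                while (reader.hasNext()) {
                    int event = reader.next();
                    if (event == XMLStreamConstants.START_ELEMENT) {
                        String name = reader.getLocalName();
                        if ("String".equals(name)) {
                            String word = pickWord(reader, mergeLineBreakWords);
                            if (word != null) {
                                words.add(word);
                            }
                        } else if ("HYP".equals(name) && !mergeLineBreakWords && !words.isEmpty()) {
                            // Unmerged output keeps the hyphen on the last word of the line.
                            String hyphen = reader.getAttributeValue(null, "CONTENT");
                            int last = words.size() - 1;
                            words.set(last, words.get(last) + (hyphen == null ? "-" : hyphen));
                        }
                    } else if (event == XMLStreamConstants.END_ELEMENT
                            && "TextLine".equals(reader.getLocalName())) {
                        lines.add(String.join(" ", words));
                        words.clear();
                    }
                }
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Not well-formed ALTO XML", e);
        }
        return String.join("\n", lines);
    }

    /** Picks the text for one String element, or null if the element is skipped. */
    private static String pickWord(XMLStreamReader reader, boolean merge) {
        String subsType = reader.getAttributeValue(null, "SUBS_TYPE");
        String subsContent = reader.getAttributeValue(null, "SUBS_CONTENT");
        if (merge && subsContent != null) {
            if ("HypPart1".equals(subsType)) {
                return subsContent; // emit the whole word once, on the first line
            }
            if ("HypPart2".equals(subsType)) {
                return null; // second half is already covered by SUBS_CONTENT
            }
        }
        return reader.getAttributeValue(null, "CONTENT");
    }

    public static void main(String[] args) {
        System.out.println(extractText(SAMPLE, false));
        System.out.println(extractText(SAMPLE, true));
    }
}
```

      With merging enabled, the sketch emits the complete word on the line where it starts and drops the second half from the following line. Whether ALTOTools places the merged word the same way is not stated in this documentation.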
    • getNERTags

      public static List<TagCount> getNERTags(String alto, String inCharset, NERTag.Type type)

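      Collects named-entity (NER) tags from an ALTO document.

      Parameters:
      alto - String containing the ALTO XML document
      inCharset - ALTO charset
      type - NERTag.Type of the tags to collect
      Returns:
      List of TagCount objects for the tags found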
    • alto2Txt

      protected static String alto2Txt(String alto, String charset, boolean mergeLineBreakWords) throws IOException, XMLStreamException, org.jdom2.JDOMException

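      Converts an ALTO document to plain text.

      Parameters:
      alto - String containing the ALTO XML document
      charset - ALTO charset
      mergeLineBreakWords - whether words split across line breaks should be merged
      Returns:
      String containing the plain text of the ALTO document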
      Throws:
      IOException - if any.
      XMLStreamException - if any.
      org.jdom2.JDOMException
    • createXmlParser

      public static XMLStreamReader createXmlParser(InputStream is) throws FactoryConfigurationError, XMLStreamException
      Creates an XMLStreamReader for the given input stream.
      Throws:
      FactoryConfigurationError
      XMLStreamException
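      A reader with this signature can be obtained from the JDK's StAX API. The following is a minimal sketch rather than the actual body of createXmlParser: the class XmlParserSketch and its methods are hypothetical, and disabling DTDs and external entities is an assumed hardening step that this documentation does not confirm.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Hypothetical sketch; this is not the ALTOTools implementation. */
final class XmlParserSketch {

    /** Creates a StAX reader; XMLInputFactory.newInstance() may throw FactoryConfigurationError. */
    static XMLStreamReader newReader(InputStream is) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Assumption: untrusted input, so DTDs and external entities are switched off.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, Boolean.FALSE);
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, Boolean.FALSE);
        return factory.createXMLStreamReader(is);
    }

    /** Returns the local name of the root element, or null if there is none. */
    static String rootElementName(byte[] xml) {
        try (InputStream is = new ByteArrayInputStream(xml)) {
            XMLStreamReader reader = newReader(is);
            try {
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                        return reader.getLocalName();
                    }
                }
                return null;
            } finally {
                // XMLStreamReader is not AutoCloseable, so it is closed by hand.
                reader.close();
            }
        } catch (XMLStreamException | IOException e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        byte[] xml = "<alto><Layout/></alto>".getBytes(StandardCharsets.UTF_8);
        System.out.println(rootElementName(xml));
    }
}
```

      Closing the XMLStreamReader does not close the underlying InputStream, so the caller remains responsible for the stream.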
    • getWordCoords

      public static List<String> getWordCoords(String altoString, String charset, Set<String> searchTerms)

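      Finds the coordinates of words in an ALTO document that match the given search terms.

      Parameters:
      altoString - String containing the ALTO XML document
      charset - ALTO charset
      searchTerms - Set of search terms
      Returns:
      List of coordinate strings for the matching words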
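      The general idea of a word coordinate lookup can be sketched with StAX. This is an illustration under stated assumptions, not the ALTOTools algorithm: the class WordCoordsSketch is hypothetical, matching is a plain case-insensitive comparison of the CONTENT attribute, and the "x1,y1,x2,y2" output format is invented for the example (the format returned by getWordCoords is not documented here).

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Hypothetical sketch; this is not the ALTOTools implementation. */
final class WordCoordsSketch {

    static final String SAMPLE = "<alto><Layout><Page><PrintSpace><TextBlock><TextLine>"
            + "<String CONTENT=\"Goobi\" HPOS=\"10\" VPOS=\"20\" WIDTH=\"30\" HEIGHT=\"5\"/>"
            + "<SP/>"
            + "<String CONTENT=\"viewer\" HPOS=\"45\" VPOS=\"20\" WIDTH=\"40\" HEIGHT=\"5\"/>"
            + "</TextLine></TextBlock></PrintSpace></Page></Layout></alto>";

    /** Returns one "x1,y1,x2,y2" string (assumed format) per matching String element. */
    static List<String> findWordCoords(String alto, Set<String> searchTerms) {
        Set<String> terms = new HashSet<>();
        for (String term : searchTerms) {
            terms.add(term.toLowerCase(Locale.ROOT));
        }
        List<String> coords = new ArrayList<>();
        try {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            factory.setProperty(XMLInputFactory.SUPPORT_DTD, Boolean.FALSE);
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(alto));
            try {
                while (reader.hasNext()) {
                    if (reader.next() != XMLStreamConstants.START_ELEMENT
                            || !"String".equals(reader.getLocalName())) {
                        continue;
                    }
                    String content = reader.getAttributeValue(null, "CONTENT");
                    if (content != null && terms.contains(content.toLowerCase(Locale.ROOT))) {
                        int x = number(reader, "HPOS");
                        int y = number(reader, "VPOS");
                        int w = number(reader, "WIDTH");
                        int h = number(reader, "HEIGHT");
                        coords.add(x + "," + y + "," + (x + w) + "," + (y + h));
                    }
                }
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Not well-formed ALTO XML", e);
        }
        return coords;
    }

    /** Reads a numeric ALTO attribute (may be fractional); a missing attribute counts as 0. */
    private static int number(XMLStreamReader reader, String attribute) {
        String value = reader.getAttributeValue(null, attribute);
        return value == null ? 0 : Math.round(Float.parseFloat(value));
    }

    public static void main(String[] args) {
        System.out.println(findWordCoords(SAMPLE, Collections.singleton("goobi")));
    }
}
```

      The sketch ignores phrase and proximity matching, which the overloads taking proximitySearchDistance suggest the real method supports.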
    • getRotatedCoordinates

      public static String getRotatedCoordinates(String inCoords, int rotation, Dimension pageSize)

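      Rotates a coordinate string by the given angle, relative to a page of the given size.

      Parameters:
      inCoords - String containing the coordinates to rotate
      rotation - Rotation in degrees
      pageSize - Dimension of the page
      Returns:
      String containing the rotated coordinates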
    • getWordCoords

      public static List<String> getWordCoords(String altoString, String charset, Set<String> searchTerms, int rotation)
      Parameters:
      altoString - String containing the ALTO XML document
      charset - ALTO charset
      searchTerms - Set of search terms
      rotation - Image rotation in degrees
      Returns:
      List of coordinate strings for the matching words
    • getWordCoords

      public static List<String> getWordCoords(String altoString, String charset, Set<String> searchTerms, int proximitySearchDistance, int rotation)
    • rotate

      protected static Rectangle rotate(Rectangle rect, int rotation, Dimension imageSize)

      Rotates a rectangle by the given angle, relative to an image of the given size.

      Parameters:
      rect - Rectangle to rotate
      rotation - Rotation in degrees
      imageSize - Dimension of the image
      Returns:
      The rotated Rectangle
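      The geometry of such a rotation can be worked through in a short sketch. It is not the body of rotate: the class RotateSketch is hypothetical, and it assumes clockwise rotation in multiples of 90 degrees, a top-left origin, and that imageSize is the size of the image before rotation. None of these conventions is confirmed by this documentation.

```java
import java.awt.Dimension;
import java.awt.Rectangle;

/** Hypothetical sketch; this is not the ALTOTools implementation. */
final class RotateSketch {

    /**
     * Rotates rect clockwise around the image, assuming a top-left origin and
     * that imageSize describes the unrotated image.
     */
    static Rectangle rotate(Rectangle rect, int rotation, Dimension imageSize) {
        int angle = ((rotation % 360) + 360) % 360; // normalise, e.g. -90 becomes 270
        switch (angle) {
            case 0:
                return new Rectangle(rect);
            case 90:
                // The left edge of the rotated image was the bottom edge of the original.
                return new Rectangle(imageSize.height - (rect.y + rect.height), rect.x,
                        rect.height, rect.width);
            case 180:
                return new Rectangle(imageSize.width - (rect.x + rect.width),
                        imageSize.height - (rect.y + rect.height), rect.width, rect.height);
            case 270:
                // The top edge of the rotated image was the right edge of the original.
                return new Rectangle(rect.y, imageSize.width - (rect.x + rect.width),
                        rect.height, rect.width);
            default:
                throw new IllegalArgumentException("Only multiples of 90 degrees are handled: " + rotation);
        }
    }

    public static void main(String[] args) {
        Rectangle word = new Rectangle(10, 20, 30, 40);
        Dimension image = new Dimension(100, 200);
        System.out.println(rotate(word, 90, image));
    }
}
```

      For 90 and 270 degrees the width and height of the rectangle swap, because the rotated image itself is 200 by 100 in this example rather than 100 by 200.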
    • getMatchALTOWord

      public static int getMatchALTOWord(de.intranda.digiverso.ocr.alto.model.structureclasses.lineelements.Word eleWord, String[] words)

      Checks whether an ALTO word element matches the given words.

      Parameters:
      eleWord - ALTO Word element to check
      words - Array of words to match against
      Returns:
      1 if there is a match; 0 otherwise
    • getALTOCoords

      public static String getALTOCoords(de.intranda.digiverso.ocr.alto.model.superclasses.GeometricData element)

      Returns the coordinates of an ALTO element as a string.

      Parameters:
      element - ALTO element with geometric data
      Returns:
      String containing the coordinates of the element