Package io.goobi.viewer.controller
Class ALTOTools
java.lang.Object
io.goobi.viewer.controller.ALTOTools
Utility class for parsing and processing ALTO (Analyzed Layout and Text Object) XML documents.
-
Field Summary
Fields -
Method Summary
Modifier and Type | Method | Description
protected static String | alto2Txt(String alto, String charset, boolean mergeLineBreakWords) | Converts an ALTO XML document to plain text.
static XMLStreamReader | createXmlParser(InputStream is) |
static String | getALTOCoords(de.intranda.digiverso.ocr.alto.model.superclasses.GeometricData element) | Returns the bounding box of an ALTO element as a coordinate string.
static String | getFulltext(String alto, String charset, boolean mergeLineBreakWords) | Extracts the plain text from an ALTO XML document.
static String | getFulltext(Path path, String encoding) | Reads the plain full text from an ALTO file.
static int | getMatchALTOWord(de.intranda.digiverso.ocr.alto.model.structureclasses.lineelements.Word eleWord, String[] words) | Checks whether an ALTO word element matches any of the given search words.
 | getNERTags(String alto, String inCharset, NERTag.Type type) | Extracts NER tags from an ALTO document.
static String | getRotatedCoordinates(String inCoords, int rotation, Dimension pageSize) | Rotates a coordinate string to match a rotated page image.
 | getWordCoords(String altoString, String charset, Set<String> searchTerms) | Finds the coordinates of words matching the given search terms.
 | getWordCoords(String altoString, String charset, Set<String> searchTerms, int rotation) |
 | getWordCoords(String altoString, String charset, Set<String> searchTerms, int proximitySearchDistance, int rotation) |
protected static Rectangle | rotate | Rotates a rectangle around the image center.
-
Field Details
-
TAG_LABEL_IGNORE_REGEX
Constant TAG_LABEL_IGNORE_REGEX.
-
ALTO_PROBLEMATIC_CHARS
Characters that can cause an "Invalid UTF-8 middle byte" error in the parser.
-
-
Method Details
-
getFulltext
Reads the plain full text from an ALTO file. Words split across line breaks are not merged.
Parameters:
path - path to the ALTO file
encoding - character encoding to use when reading the file
Returns:
String containing plain text from ALTO at the given path
Throws:
IOException
-
getFulltext
Extracts the plain text from an ALTO XML document.
Parameters:
alto - ALTO XML document as a string
charset - character encoding of the ALTO document
mergeLineBreakWords - if true, words split across line breaks are merged into one
Returns:
the plain text extracted from the ALTO XML document, or null on error
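The effect of mergeLineBreakWords can be illustrated with a minimal, self-contained sketch. It is not the ALTOTools implementation: the class AltoTextSketch and the method extractText are hypothetical, the charset parameter is omitted because the input is already a decoded string, and the sketch assumes hyphenated words are marked with the ALTO attributes SUBS_TYPE (HypPart1, HypPart2) and SUBS_CONTENT.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Hypothetical illustration; not part of ALTOTools. */
final class AltoTextSketch {

    /** Returns the plain text of an ALTO document, or null if it cannot be parsed. */
    static String extractText(String alto, boolean mergeLineBreakWords) {
        StringBuilder text = new StringBuilder();
        boolean skipSecondHalf = false; // true after a merged HypPart1 word was written
        boolean skipNextSpace = false;  // avoids a leading space after a skipped HypPart2
        try {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            factory.setProperty(XMLInputFactory.SUPPORT_DTD, Boolean.FALSE);
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(alto));
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    String name = reader.getLocalName();
                    if ("String".equals(name)) {
                        String content = reader.getAttributeValue(null, "CONTENT");
                        String subsType = reader.getAttributeValue(null, "SUBS_TYPE");
                        String subsContent = reader.getAttributeValue(null, "SUBS_CONTENT");
                        if (mergeLineBreakWords && skipSecondHalf && "HypPart2".equals(subsType)) {
                            // Second half was already written as part of SUBS_CONTENT.
                            skipSecondHalf = false;
                            skipNextSpace = true;
                        } else if (mergeLineBreakWords && "HypPart1".equals(subsType) && subsContent != null) {
                            // Write the full word in place of its first half.
                            text.append(subsContent);
                            skipSecondHalf = true;
                            skipNextSpace = false;
                        } else {
                            if (content != null) {
                                text.append(content);
                            }
                            skipNextSpace = false;
                        }
                    } else if ("SP".equals(name)) {
                        if (!skipNextSpace) {
                            text.append(' ');
                        }
                        skipNextSpace = false;
                    } else if ("HYP".equals(name) && !mergeLineBreakWords) {
                        // Keep the hyphen only when words are not merged.
                        String hyphen = reader.getAttributeValue(null, "CONTENT");
                        text.append(hyphen != null ? hyphen : "-");
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT
                        && "TextLine".equals(reader.getLocalName())) {
                    text.append('\n');
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            return null; // mirrors the documented "null on error" contract
        }
        return text.toString().trim();
    }

    public static void main(String[] args) {
        String alto = "<alto><TextLine><String CONTENT=\"Hello\"/><SP/>"
                + "<String CONTENT=\"world\"/></TextLine></alto>";
        System.out.println(extractText(alto, true));
    }
}
```

With merging enabled, the sketch writes the merged word on the line where it starts and drops its second half from the following line.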
-
getNERTags
Extracts NER tags from an ALTO document.
Parameters:
alto - ALTO XML document as a string
inCharset - character encoding of the ALTO document
type - NER tag type to filter by; null returns all types
Returns:
a list of NER TagCount objects extracted from the given ALTO document
-
alto2Txt
protected static String alto2Txt(String alto, String charset, boolean mergeLineBreakWords) throws IOException, org.jdom2.JDOMException
Converts an ALTO XML document to plain text.
Parameters:
alto - ALTO XML document as a string
charset - character encoding of the ALTO document
mergeLineBreakWords - if true, words split across line breaks are merged into one
Returns:
the plain text extracted from the ALTO XML document
Throws:
IOException - if any.
XMLStreamException - if any.
org.jdom2.JDOMException
-
createXmlParser
public static XMLStreamReader createXmlParser(InputStream is) throws FactoryConfigurationError, XMLStreamException
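The signature of createXmlParser matches the standard StAX way of obtaining a stream reader. The following is a minimal sketch of that pattern using only the JDK. It is an assumption about typical usage, not the actual method body. The class XmlParserSketch, its methods, and the decision to disable DTDs and external entities are all hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Hypothetical illustration; not part of ALTOTools. */
final class XmlParserSketch {

    /** Creates a StAX reader over the given stream with DTDs and external entities disabled. */
    static XMLStreamReader createParser(InputStream is) throws XMLStreamException {
        // newInstance() may throw FactoryConfigurationError if no StAX provider is available.
        XMLInputFactory factory = XMLInputFactory.newInstance();
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, Boolean.FALSE);
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, Boolean.FALSE);
        return factory.createXMLStreamReader(is);
    }

    /** Returns the local name of the root element, or null if the document has none. */
    static String rootElementName(String xml) {
        try {
            XMLStreamReader reader =
                    createParser(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            try {
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                        return reader.getLocalName();
                    }
                }
                return null;
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(rootElementName("<alto><Layout/></alto>"));
    }
}
```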
getWordCoords
public static List<String> getWordCoords(String altoString, String charset, Set<String> searchTerms)
Finds the coordinates of words matching the given search terms.
Parameters:
altoString - ALTO XML document as a string
charset - character encoding of the ALTO document
searchTerms - set of terms whose coordinates to locate
Returns:
a list of coordinate strings for words matching any of the given search terms in the ALTO document
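The lookup described above can be sketched without the library. The following is a minimal, self-contained sketch and not the ALTOTools implementation. The class WordCoordsSketch and its methods are hypothetical. It assumes the coordinate string is built from the ALTO attributes HPOS, VPOS, WIDTH and HEIGHT, that matching ignores case and surrounding punctuation, and that the input is an already decoded string, so charset is omitted.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Hypothetical illustration; not part of ALTOTools. */
final class WordCoordsSketch {

    /** Lower-cases a word and strips leading and trailing non-alphanumeric characters. */
    static String normalize(String word) {
        return word.toLowerCase(Locale.ROOT)
                .replaceAll("^[^\\p{L}\\p{N}]+|[^\\p{L}\\p{N}]+$", "");
    }

    /** Returns "HPOS,VPOS,WIDTH,HEIGHT" for every String element matching a search term. */
    static List<String> findWordCoords(String alto, Set<String> searchTerms) {
        Set<String> normalizedTerms = new HashSet<>();
        for (String term : searchTerms) {
            normalizedTerms.add(normalize(term));
        }
        List<String> coords = new ArrayList<>();
        try {
            XMLStreamReader reader =
                    XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(alto));
            while (reader.hasNext()) {
                if (reader.next() != XMLStreamConstants.START_ELEMENT
                        || !"String".equals(reader.getLocalName())) {
                    continue;
                }
                String content = reader.getAttributeValue(null, "CONTENT");
                String x = reader.getAttributeValue(null, "HPOS");
                String y = reader.getAttributeValue(null, "VPOS");
                String w = reader.getAttributeValue(null, "WIDTH");
                String h = reader.getAttributeValue(null, "HEIGHT");
                // Skip words without text or without a complete bounding box.
                if (content == null || x == null || y == null || w == null || h == null) {
                    continue;
                }
                if (normalizedTerms.contains(normalize(content))) {
                    coords.add(String.join(",", x, y, w, h));
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Not well-formed ALTO XML", e);
        }
        return coords;
    }

    public static void main(String[] args) {
        String alto = "<alto><TextLine>"
                + "<String CONTENT=\"Hello\" HPOS=\"10\" VPOS=\"20\" WIDTH=\"30\" HEIGHT=\"40\"/>"
                + "</TextLine></alto>";
        System.out.println(findWordCoords(alto, Set.of("hello")));
    }
}
```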
-
getRotatedCoordinates
Rotates a coordinate string to match a rotated page image.
Parameters:
inCoords - comma-separated coordinate string to rotate
rotation - rotation angle in degrees (0, 90, 180, 270)
pageSize - dimensions of the page image
Returns:
the rotated bounding box as a comma-separated coordinate string
-
getWordCoords
public static List<String> getWordCoords(String altoString, String charset, Set<String> searchTerms, int rotation)
Parameters:
altoString - String containing the ALTO XML document
charset - character encoding of the ALTO document
searchTerms - Set of search terms
rotation - Image rotation in degrees
Returns:
a list of coordinate strings for words matching any of the given search terms, rotated to match the image orientation
-
getWordCoords
-
rotate
Rotates a rectangle around the image center.
Parameters:
rect - rectangle to rotate around the image center
rotation - rotation angle in degrees (90, 180, 270)
imageSize - dimensions of the image for computing the rotation
Returns:
the rotated bounding rectangle
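The geometry of such a rotation can be shown with a minimal sketch. It is not the ALTOTools implementation: the class RotateSketch is hypothetical, the rotation is assumed to be clockwise, imageSize is assumed to describe the unrotated image, and the result is expressed in the coordinate system of the rotated image.

```java
import java.awt.Dimension;
import java.awt.Rectangle;

/** Hypothetical illustration; not part of ALTOTools. */
final class RotateSketch {

    /**
     * Rotates rect clockwise by 0, 90, 180 or 270 degrees.
     * imageSize is the size of the unrotated image.
     */
    static Rectangle rotate(Rectangle rect, int rotation, Dimension imageSize) {
        switch (rotation) {
            case 0:
                return new Rectangle(rect);
            case 90:
                // The old bottom edge becomes the new left edge; width and height swap.
                return new Rectangle(imageSize.height - (rect.y + rect.height), rect.x,
                        rect.height, rect.width);
            case 180:
                // Both axes are mirrored; width and height stay the same.
                return new Rectangle(imageSize.width - (rect.x + rect.width),
                        imageSize.height - (rect.y + rect.height), rect.width, rect.height);
            case 270:
                // The old right edge becomes the new top edge; width and height swap.
                return new Rectangle(rect.y, imageSize.width - (rect.x + rect.width),
                        rect.height, rect.width);
            default:
                throw new IllegalArgumentException("Unsupported rotation: " + rotation);
        }
    }

    public static void main(String[] args) {
        Rectangle word = new Rectangle(10, 20, 30, 40);
        System.out.println(rotate(word, 90, new Dimension(100, 200)));
    }
}
```

Rotating by 90 degrees and then by 270 degrees, with width and height of the image size swapped for the second call, returns the original rectangle.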
-
getMatchALTOWord
public static int getMatchALTOWord(de.intranda.digiverso.ocr.alto.model.structureclasses.lineelements.Word eleWord, String[] words)
Checks whether an ALTO word element matches any of the given search words.
Parameters:
eleWord - ALTO word element whose content to match against
words - array of search words to match against the element
Returns:
1 if there is a match; 0 otherwise
-
getALTOCoords
public static String getALTOCoords(de.intranda.digiverso.ocr.alto.model.superclasses.GeometricData element)
Returns the bounding box of an ALTO element as a coordinate string.
Parameters:
element - ALTO geometric element whose bounding box to extract
Returns:
the bounding box of the given element as a comma-separated coordinate string (x, y, width, height)
-