Package io.goobi.viewer.controller
Class ALTOTools
java.lang.Object
io.goobi.viewer.controller.ALTOTools
ALTOTools class.
-
Field Summary
Fields: TAG_LABEL_IGNORE_REGEX, ALTO_PROBLEMATIC_CHARS
Method Summary
Modifier and Type | Method | Description
protected static String | alto2Txt(String alto, String charset, boolean mergeLineBreakWords) | alto2Txt.
static XMLStreamReader | createXmlParser(InputStream is) |
static String | getALTOCoords(de.intranda.digiverso.ocr.alto.model.superclasses.GeometricData element) | getALTOCoords.
static String | getFulltext(String alto, String charset, boolean mergeLineBreakWords) | getFullText.
static String | getFulltext(Path path, String encoding) | Read the plain full text from an ALTO file.
static int | getMatchALTOWord(de.intranda.digiverso.ocr.alto.model.structureclasses.lineelements.Word eleWord, String[] words) | getMatchALTOWord.
 | getNERTags(String alto, String inCharset, NERTag.Type type) | getNERTags.
static String | getRotatedCoordinates(String inCoords, int rotation, Dimension pageSize) | getRotatedCoordinates.
static List<String> | getWordCoords(String altoString, String charset, Set<String> searchTerms) | getWordCoords.
static List<String> | getWordCoords(String altoString, String charset, Set<String> searchTerms, int rotation) |
 | getWordCoords(String altoString, String charset, Set<String> searchTerms, int proximitySearchDistance, int rotation) |
protected static Rectangle | rotate | rotate.
-
Field Details
-
TAG_LABEL_IGNORE_REGEX
Constant TAG_LABEL_IGNORE_REGEX.
-
ALTO_PROBLEMATIC_CHARS
Characters that can cause an "Invalid UTF-8 middle byte" error in the parser.
-
-
Method Details
-
getFulltext
Read the plain full text from an ALTO file. Words split across line breaks are not merged.
- Parameters:
path -
encoding -
- Returns:
String containing plain text from ALTO at the given path
- Throws:
IOException
-
getFulltext
getFullText.
-
getNERTags
getNERTags.
- Parameters:
alto - a String object.
inCharset -
type - a NERTag.Type object.
- Returns:
a List object.
-
alto2Txt
protected static String alto2Txt(String alto, String charset, boolean mergeLineBreakWords) throws IOException, XMLStreamException, org.jdom2.JDOMException
alto2Txt.
- Parameters:
alto - a String object.
charset - ALTO charset
mergeLineBreakWords - a boolean.
- Returns:
a String object.
- Throws:
IOException - if any.
XMLStreamException - if any.
org.jdom2.JDOMException
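The effect of a flag like mergeLineBreakWords can be illustrated with a minimal StAX sketch. It is not the implementation of alto2Txt: the class AltoTextSketch and the method altoToText are hypothetical, and the sketch assumes the standard ALTO elements String, SP, HYP and TextLine, with hyphenated words marked by SUBS_TYPE="HypPart1"/"HypPart2" and the full word given in SUBS_CONTENT.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper class; not part of io.goobi.viewer.controller.
final class AltoTextSketch {

    /** Collects the CONTENT of all String elements, one output line per TextLine. */
    static String altoToText(String alto, boolean mergeLineBreakWords) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, Boolean.FALSE);
        XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(alto));
        StringBuilder text = new StringBuilder();
        boolean mergedFirstPart = false; // a HypPart1 word was replaced by its full form
        boolean skipNextSpace = false;   // the space after a dropped HypPart2 is dropped too
        try {
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    String name = reader.getLocalName();
                    if ("String".equals(name)) {
                        String subsType = reader.getAttributeValue(null, "SUBS_TYPE");
                        String full = reader.getAttributeValue(null, "SUBS_CONTENT");
                        skipNextSpace = false;
                        if (mergeLineBreakWords && full != null && "HypPart1".equals(subsType)) {
                            // Write the whole word in place of its first half.
                            text.append(full);
                            mergedFirstPart = true;
                        } else if (mergedFirstPart && "HypPart2".equals(subsType)) {
                            // Second half is already contained in the merged word.
                            mergedFirstPart = false;
                            skipNextSpace = true;
                        } else {
                            String content = reader.getAttributeValue(null, "CONTENT");
                            text.append(content == null ? "" : content);
                            mergedFirstPart = false;
                        }
                    } else if ("SP".equals(name)) {
                        if (!skipNextSpace) {
                            text.append(' ');
                        }
                        skipNextSpace = false;
                    } else if ("HYP".equals(name) && !mergedFirstPart) {
                        // Keep the hyphen only when the word was not merged.
                        String content = reader.getAttributeValue(null, "CONTENT");
                        text.append(content == null ? "-" : content);
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT
                        && "TextLine".equals(reader.getLocalName())) {
                    text.append('\n');
                }
            }
        } finally {
            reader.close();
        }
        return text.toString();
    }

    public static void main(String[] args) throws XMLStreamException {
        String alto = "<alto><TextLine><String CONTENT=\"An\"/><SP/>"
                + "<String CONTENT=\"exam\" SUBS_TYPE=\"HypPart1\" SUBS_CONTENT=\"example\"/>"
                + "<HYP CONTENT=\"-\"/></TextLine>"
                + "<TextLine><String CONTENT=\"ple\" SUBS_TYPE=\"HypPart2\" SUBS_CONTENT=\"example\"/>"
                + "<SP/><String CONTENT=\"text\"/></TextLine></alto>";
        System.out.print(altoToText(alto, true));
        System.out.print(altoToText(alto, false));
    }
}
```

With merging enabled the hyphenated word is written once, at the end of the line where it starts; with merging disabled both halves and the hyphen are kept as they appear in the ALTO file.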
-
createXmlParser
public static XMLStreamReader createXmlParser(InputStream is) throws FactoryConfigurationError, XMLStreamException
-
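A reader matching the createXmlParser signature can be obtained from the JDK's StAX factory. The following is a minimal sketch under that assumption, not the confirmed implementation: the class StaxParserSketch, the helper countElements, and the decision to disable DTDs and external entities are illustrative choices.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper class; not part of io.goobi.viewer.controller.
final class StaxParserSketch {

    /** Creates a StAX reader with DTDs and external entities switched off (an assumption). */
    static XMLStreamReader createXmlParser(InputStream is) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, Boolean.FALSE);
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, Boolean.FALSE);
        return factory.createXMLStreamReader(is);
    }

    /** Counts the start tags with the given local name, consuming the reader. */
    static int countElements(XMLStreamReader reader, String localName) throws XMLStreamException {
        int count = 0;
        while (reader.hasNext()) {
            if (reader.next() == XMLStreamConstants.START_ELEMENT
                    && localName.equals(reader.getLocalName())) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws XMLStreamException {
        String xml = "<alto><String CONTENT=\"Hello\"/><String CONTENT=\"world\"/></alto>";
        InputStream is = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        XMLStreamReader reader = createXmlParser(is);
        System.out.println(countElements(reader, "String"));
        reader.close();
    }
}
```

XMLStreamReader.close() releases parser resources but does not close the underlying InputStream, so the caller stays responsible for the stream.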
getWordCoords
public static List<String> getWordCoords(String altoString, String charset, Set<String> searchTerms)
getWordCoords.
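How word coordinates might be collected for a set of search terms can be sketched with StAX alone. Everything here is an assumption for illustration: the class WordCoordsSketch, the method findWordCoords, the case-insensitive matching with punctuation stripped, and the "x1,y1,x2,y2" output format are not taken from the library. Only the ALTO attributes CONTENT, HPOS, VPOS, WIDTH and HEIGHT are standard.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper class; not part of io.goobi.viewer.controller.
final class WordCoordsSketch {

    /** Returns one "x1,y1,x2,y2" entry per String element that matches a search term. */
    static List<String> findWordCoords(String alto, Set<String> searchTerms) throws XMLStreamException {
        Set<String> terms = new HashSet<>();
        for (String term : searchTerms) {
            String normalized = normalize(term);
            if (!normalized.isEmpty()) {
                terms.add(normalized);
            }
        }
        XMLInputFactory factory = XMLInputFactory.newInstance();
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, Boolean.FALSE);
        XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(alto));
        List<String> coords = new ArrayList<>();
        try {
            while (reader.hasNext()) {
                if (reader.next() != XMLStreamConstants.START_ELEMENT
                        || !"String".equals(reader.getLocalName())) {
                    continue;
                }
                String content = reader.getAttributeValue(null, "CONTENT");
                if (content == null || !terms.contains(normalize(content))) {
                    continue;
                }
                long x = number(reader, "HPOS");
                long y = number(reader, "VPOS");
                long w = number(reader, "WIDTH");
                long h = number(reader, "HEIGHT");
                coords.add(x + "," + y + "," + (x + w) + "," + (y + h));
            }
        } finally {
            reader.close();
        }
        return coords;
    }

    /** Lower-cases the word and strips leading and trailing non-alphanumeric characters. */
    private static String normalize(String word) {
        return word.toLowerCase(Locale.ROOT)
                .replaceAll("^[^\\p{L}\\p{N}]+|[^\\p{L}\\p{N}]+$", "");
    }

    /** Reads a numeric attribute (ALTO allows decimals) and rounds it; missing values count as 0. */
    private static long number(XMLStreamReader reader, String attribute) {
        String value = reader.getAttributeValue(null, attribute);
        return value == null ? 0 : Math.round(Double.parseDouble(value));
    }

    public static void main(String[] args) throws XMLStreamException {
        String alto = "<alto><TextLine>"
                + "<String CONTENT=\"Hello,\" HPOS=\"10\" VPOS=\"20\" WIDTH=\"30\" HEIGHT=\"8\"/>"
                + "<String CONTENT=\"world\" HPOS=\"45.0\" VPOS=\"20.0\" WIDTH=\"28.0\" HEIGHT=\"8.0\"/>"
                + "</TextLine></alto>";
        Set<String> terms = new HashSet<>();
        terms.add("hello");
        System.out.println(findWordCoords(alto, terms));
    }
}
```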
-
getRotatedCoordinates
getRotatedCoordinates.
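Rotating word coordinates together with the page image can be sketched as follows. This is an illustration under stated assumptions, not the library's implementation: the class RotationSketch is hypothetical, rectangles are plain {x, y, width, height} arrays instead of a coordinate string, rotation is taken to be clockwise, and only multiples of 90 degrees are handled.

```java
import java.util.Arrays;

// Hypothetical helper class; not part of io.goobi.viewer.controller.
final class RotationSketch {

    /**
     * Rotates an axis-aligned rectangle {x, y, width, height} clockwise with its page.
     * pageWidth and pageHeight are the page dimensions before rotation.
     */
    static int[] rotate(int[] rect, int rotation, int pageWidth, int pageHeight) {
        int x = rect[0];
        int y = rect[1];
        int w = rect[2];
        int h = rect[3];
        // Map any angle, including negative ones, into the range 0..359.
        int normalized = ((rotation % 360) + 360) % 360;
        switch (normalized) {
            case 0:
                return new int[] {x, y, w, h};
            case 90:
                // The old left edge becomes the new top edge; width and height swap.
                return new int[] {pageHeight - y - h, x, h, w};
            case 180:
                return new int[] {pageWidth - x - w, pageHeight - y - h, w, h};
            case 270:
                // The old top edge becomes the new left edge; width and height swap.
                return new int[] {y, pageWidth - x - w, h, w};
            default:
                throw new IllegalArgumentException("Unsupported rotation: " + rotation);
        }
    }

    public static void main(String[] args) {
        int[] word = {10, 20, 30, 40};
        System.out.println(Arrays.toString(rotate(word, 90, 100, 200)));
    }
}
```

Applying the 90 degree case twice, with the page dimensions swapped for the second call, gives the same rectangle as the 180 degree case, which is a quick consistency check for the formulas.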
-
getWordCoords
public static List<String> getWordCoords(String altoString, String charset, Set<String> searchTerms, int rotation)
- Parameters:
altoString - String containing the ALTO XML document
charset -
searchTerms - Set of search terms
rotation - Image rotation in degrees
- Returns:
a List object.
-
getWordCoords
-
rotate
rotate.
-
getMatchALTOWord
public static int getMatchALTOWord(de.intranda.digiverso.ocr.alto.model.structureclasses.lineelements.Word eleWord, String[] words)
getMatchALTOWord.
- Parameters:
eleWord - a Word object.
words - an array of String objects.
- Returns:
1 if there is a match; 0 otherwise
-
getALTOCoords
public static String getALTOCoords(de.intranda.digiverso.ocr.alto.model.superclasses.GeometricData element)
getALTOCoords.
- Parameters:
element - a GeometricData object.
- Returns:
a String object.
-