Class AltoSearchParser
java.lang.Object
io.goobi.viewer.model.iiif.search.parser.AbstractSearchParser
io.goobi.viewer.model.iiif.search.parser.AltoSearchParser
IIIF Search API parser that searches for matches within ALTO full-text documents.
- Author:
- Florian Alpers
-
Constructor Summary
Constructors
AltoSearchParser()
Method Summary
Modifier and Type | Method | Description
Map<org.apache.commons.lang3.Range<Integer>, List<de.intranda.digiverso.ocr.alto.model.structureclasses.Line>> | findLineMatches(List<de.intranda.digiverso.ocr.alto.model.structureclasses.Line> lines, String regex) | Maps each regex match in the concatenated line text to the ALTO lines containing it.
List<List<de.intranda.digiverso.ocr.alto.model.structureclasses.lineelements.Word>> | findWordMatches(List<de.intranda.digiverso.ocr.alto.model.structureclasses.lineelements.Word> words, String regex) | Finds groups of consecutive ALTO words that match the regex.
List<de.intranda.digiverso.ocr.alto.model.structureclasses.Line> | getContainingLines(List<de.intranda.digiverso.ocr.alto.model.structureclasses.Line> allLines, int indexStart, int indexEnd) | Finds the ALTO lines whose character range overlaps the given match positions.
int | getLineEndIndex(List<de.intranda.digiverso.ocr.alto.model.structureclasses.Line> allLines, de.intranda.digiverso.ocr.alto.model.structureclasses.Line line) | Determines the end character index of the given line.
List<de.intranda.digiverso.ocr.alto.model.structureclasses.Line> | getLines(de.intranda.digiverso.ocr.alto.model.structureclasses.logical.AltoDocument doc) | Lists all ALTO text lines of the given document.
int | getLineStartIndex(List<de.intranda.digiverso.ocr.alto.model.structureclasses.Line> allLines, de.intranda.digiverso.ocr.alto.model.structureclasses.Line line) | Determines the start character index of the given line.
String | getPrecedingText(de.intranda.digiverso.ocr.alto.model.structureclasses.lineelements.Word w, int maxLength) | Collects the text of the sibling words preceding the given word.
String | getSucceedingText(de.intranda.digiverso.ocr.alto.model.structureclasses.lineelements.Word w, int maxLength) | Collects the text of the sibling words following the given word.
 | getText | Concatenates the text content of the given ALTO lines, joined by spaces.
List<de.intranda.digiverso.ocr.alto.model.structureclasses.lineelements.Word> | getWords(de.intranda.digiverso.ocr.alto.model.structureclasses.logical.AltoDocument doc) | Lists all ALTO word elements of the given document.
Methods inherited from class io.goobi.viewer.model.iiif.search.parser.AbstractSearchParser
getAutoSuggestRegex, getContainedWordRegex, getPrecedingText, getQueryRegex, getSingleWordRegex, getSucceedingText
-
Constructor Details
-
AltoSearchParser
public AltoSearchParser()
-
-
Method Details
-
findWordMatches
public List<List<de.intranda.digiverso.ocr.alto.model.structureclasses.lineelements.Word>> findWordMatches(List<de.intranda.digiverso.ocr.alto.model.structureclasses.lineelements.Word> words, String regex)
Finds groups of consecutive ALTO words whose content matches the given regular expression.
- Parameters:
- words - candidate words to match against
- regex - regular expression to test each word's content
- Returns:
- a list of matched word groups, where each group is a consecutive sequence of matching ALTO words
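The grouping described above can be illustrated with a minimal sketch. It is not the class's actual implementation: plain strings stand in for ALTO `Word` objects, the names `WordMatchSketch` and `groupMatches` are hypothetical, and it assumes the regular expression must match a word's entire content.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

final class WordMatchSketch {

    /**
     * Groups consecutive matching words. Assumption: a word "matches" when the
     * regex matches its whole content; the real parser may test differently.
     */
    static List<List<String>> groupMatches(List<String> words, String regex) {
        Pattern pattern = Pattern.compile(regex);
        List<List<String>> groups = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (String word : words) {
            if (pattern.matcher(word).matches()) {
                // Extend the group that is currently being built.
                current.add(word);
            } else if (!current.isEmpty()) {
                // A non-matching word closes the current group.
                groups.add(current);
                current = new ArrayList<>();
            }
        }
        if (!current.isEmpty()) {
            groups.add(current); // close a group that runs to the end of the list
        }
        return groups;
    }
}
```

For the words `the`, `cat`, `cats`, `sat`, `cat` and the regex `cats?`, this sketch yields two groups: `[cat, cats]` and `[cat]`.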
-
findLineMatches
public Map<org.apache.commons.lang3.Range<Integer>,List<de.intranda.digiverso.ocr.alto.model.structureclasses.Line>> findLineMatches(List<de.intranda.digiverso.ocr.alto.model.structureclasses.Line> lines, String regex)
Applies the given regular expression to the concatenated text of the given ALTO lines and maps each match to the lines containing it.
- Parameters:
- lines - ALTO lines to search through
- regex - regular expression applied to concatenated line text
- Returns:
- a map of character-index ranges to the ALTO lines containing the match
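The following is a minimal sketch of matching across line boundaries, not the class's actual code. It assumes the line texts are joined by single spaces, uses plain strings in place of ALTO `Line` objects, and replaces `Range<Integer>` with a hypothetical `Span` record (Java 16 or later) holding a half-open character range. The names `LineMatchSketch` and `matchLines` are illustrative.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LineMatchSketch {

    /** Half-open character range [start, end) in the joined text; stand-in for Range<Integer>. */
    record Span(int start, int end) {}

    static Map<Span, List<String>> matchLines(List<String> lines, String regex) {
        // Assumption: lines are concatenated with one space between them.
        String text = String.join(" ", lines);
        Map<Span, List<String>> result = new LinkedHashMap<>();
        Matcher matcher = Pattern.compile(regex).matcher(text);
        while (matcher.find()) {
            List<String> containing = new ArrayList<>();
            int offset = 0; // start index of the current line in the joined text
            for (String line : lines) {
                int lineEnd = offset + line.length();
                // Keep the line if its range overlaps the match range.
                if (offset < matcher.end() && lineEnd > matcher.start()) {
                    containing.add(line);
                }
                offset = lineEnd + 1; // skip the joining space
            }
            result.put(new Span(matcher.start(), matcher.end()), containing);
        }
        return result;
    }
}
```

With the lines `the quick`, `brown fox`, `jumps` and the regex `quick brown`, the sketch produces one entry: the range 4 to 15 mapped to the first two lines, because the match spans the line break.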
-
getText
Concatenates the text content of the given ALTO lines.
- Parameters:
- lines - ALTO lines whose content to concatenate
- Returns:
- the concatenated text content of the given ALTO lines, joined by spaces
-
getLines
public List<de.intranda.digiverso.ocr.alto.model.structureclasses.Line> getLines(de.intranda.digiverso.ocr.alto.model.structureclasses.logical.AltoDocument doc)
Lists all ALTO text lines of the given document.
- Parameters:
- doc - ALTO document to extract lines from
- Returns:
- a list of all ALTO text lines contained in the given document
-
getWords
public List<de.intranda.digiverso.ocr.alto.model.structureclasses.lineelements.Word> getWords(de.intranda.digiverso.ocr.alto.model.structureclasses.logical.AltoDocument doc)
Lists all ALTO word elements of the given document.
- Parameters:
- doc - ALTO document to extract words from
- Returns:
- a list of all ALTO word elements contained in the given document
-
getContainingLines
public List<de.intranda.digiverso.ocr.alto.model.structureclasses.Line> getContainingLines(List<de.intranda.digiverso.ocr.alto.model.structureclasses.Line> allLines, int indexStart, int indexEnd)
Finds the ALTO lines that contain the match located between indexStart and indexEnd.
- Parameters:
- allLines - all ALTO lines with their content
- indexStart - start character index of the match
- indexEnd - end character index of the match
- Returns:
- a list of ALTO lines whose character range overlaps with the given match positions
-
getLineStartIndex
public int getLineStartIndex(List<de.intranda.digiverso.ocr.alto.model.structureclasses.Line> allLines, de.intranda.digiverso.ocr.alto.model.structureclasses.Line line)
Determines the start character index of the given line.
- Parameters:
- allLines - all ALTO lines providing character offsets
- line - target line whose start index is sought
- Returns:
- the start character index of the given line within the text of all lines
-
getLineEndIndex
public int getLineEndIndex(List<de.intranda.digiverso.ocr.alto.model.structureclasses.Line> allLines, de.intranda.digiverso.ocr.alto.model.structureclasses.Line line)
Determines the end character index of the given line.
- Parameters:
- allLines - all ALTO lines providing character offsets
- line - target line whose end index is sought
- Returns:
- the end character index of the given line within the text of all lines
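As an illustration of how such line indices could relate to the space-joined text, here is a minimal sketch under stated assumptions: lines are plain strings, the target line is identified by its position in the list rather than by a `Line` object, lines are joined by one space, and the end index is exclusive. The page does not specify whether the real end index is inclusive or exclusive. The names `LineIndexSketch`, `lineStartIndex` and `lineEndIndex` are hypothetical.

```java
import java.util.List;

final class LineIndexSketch {

    /** Start index of line number lineNumber in the space-joined text of allLines. */
    static int lineStartIndex(List<String> allLines, int lineNumber) {
        int index = 0;
        for (int i = 0; i < lineNumber; i++) {
            // Each earlier line contributes its length plus one joining space.
            index += allLines.get(i).length() + 1;
        }
        return index;
    }

    /** End index (exclusive, by assumption) of the same line. */
    static int lineEndIndex(List<String> allLines, int lineNumber) {
        return lineStartIndex(allLines, lineNumber) + allLines.get(lineNumber).length();
    }
}
```

For the lines `the quick`, `brown fox`, `jumps`, the second line starts at index 10 and ends at index 19 in the joined text `the quick brown fox jumps`.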
-
getPrecedingText
public String getPrecedingText(de.intranda.digiverso.ocr.alto.model.structureclasses.lineelements.Word w, int maxLength)
Collects the text of the sibling words preceding the given word.
- Parameters:
- w - word whose preceding siblings to collect
- maxLength - maximum character count of returned text
- Returns:
- the text content of sibling words preceding the given word, up to maxLength characters
-
getSucceedingText
public String getSucceedingText(de.intranda.digiverso.ocr.alto.model.structureclasses.lineelements.Word w, int maxLength)
Collects the text of the sibling words following the given word.
- Parameters:
- w - word whose following siblings to collect
- maxLength - maximum character count of returned text
- Returns:
- the text content of sibling words following the given word, up to maxLength characters
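Context snippets around a hit can be sketched as follows. This is an assumption-laden illustration rather than the class's implementation: words are plain strings in a list, the matched word is identified by its index, only whole words are added, and words are joined by single spaces. Whether the real methods cut words at the length limit is not stated here. The names `ContextSketch`, `precedingText` and `succeedingText` are hypothetical.

```java
import java.util.List;

final class ContextSketch {

    /** Text of the words before words[index], nearest first, at most maxLength characters. */
    static String precedingText(List<String> words, int index, int maxLength) {
        StringBuilder sb = new StringBuilder();
        for (int i = index - 1; i >= 0; i--) {
            String word = words.get(i);
            // Account for the separating space once the buffer is non-empty.
            int extra = sb.length() == 0 ? word.length() : word.length() + 1;
            if (sb.length() + extra > maxLength) {
                break; // assumption: stop at the first word that no longer fits
            }
            if (sb.length() > 0) {
                sb.insert(0, ' ');
            }
            sb.insert(0, word);
        }
        return sb.toString();
    }

    /** Text of the words after words[index], nearest first, at most maxLength characters. */
    static String succeedingText(List<String> words, int index, int maxLength) {
        StringBuilder sb = new StringBuilder();
        for (int i = index + 1; i < words.size(); i++) {
            String word = words.get(i);
            int extra = sb.length() == 0 ? word.length() : word.length() + 1;
            if (sb.length() + extra > maxLength) {
                break;
            }
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(word);
        }
        return sb.toString();
    }
}
```

For the words `the`, `quick`, `brown`, `fox`, `jumps` and the hit `brown` at index 2, a limit of 9 gives the preceding text `the quick`, a limit of 8 gives `quick`, and a limit of 5 gives the succeeding text `fox`.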
-