Class AltoSearchParser

java.lang.Object
io.goobi.viewer.model.iiif.search.parser.AbstractSearchParser
io.goobi.viewer.model.iiif.search.parser.AltoSearchParser

public class AltoSearchParser extends AbstractSearchParser
IIIF Search API parser that searches for matches within ALTO full-text documents.
Author:
Florian Alpers
  • Constructor Summary

    Constructors
    Constructor
    Description
    AltoSearchParser()
     
  • Method Summary

    Modifier and Type
    Method
    Description
    Map<org.apache.commons.lang3.Range<Integer>,List<de.intranda.digiverso.ocr.alto.model.structureclasses.Line>>
    findLineMatches(List<de.intranda.digiverso.ocr.alto.model.structureclasses.Line> lines, String regex)
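    Finds regex matches in the concatenated text of the given lines and maps each match's character-index range to the lines containing it.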
    List<List<de.intranda.digiverso.ocr.alto.model.structureclasses.lineelements.Word>>
    findWordMatches(List<de.intranda.digiverso.ocr.alto.model.structureclasses.lineelements.Word> words, String regex)
    Finds groups of consecutive words whose content matches the given regular expression.
    List<de.intranda.digiverso.ocr.alto.model.structureclasses.Line>
    getContainingLines(List<de.intranda.digiverso.ocr.alto.model.structureclasses.Line> allLines, int indexStart, int indexEnd)
    Returns the lines whose character range overlaps the given match positions.
    int
    getLineEndIndex(List<de.intranda.digiverso.ocr.alto.model.structureclasses.Line> allLines, de.intranda.digiverso.ocr.alto.model.structureclasses.Line line)
    Returns the end character index of the given line, based on the offsets of all lines.
    List<de.intranda.digiverso.ocr.alto.model.structureclasses.Line>
    getLines(de.intranda.digiverso.ocr.alto.model.structureclasses.logical.AltoDocument doc)
    Returns all text lines contained in the given ALTO document.
    int
    getLineStartIndex(List<de.intranda.digiverso.ocr.alto.model.structureclasses.Line> allLines, de.intranda.digiverso.ocr.alto.model.structureclasses.Line line)
    Returns the start character index of the given line, based on the offsets of all lines.
    String
    getPrecedingText(de.intranda.digiverso.ocr.alto.model.structureclasses.lineelements.Word w, int maxLength)
    Returns the text of the sibling words preceding the given word, up to maxLength characters.
    String
    getSucceedingText(de.intranda.digiverso.ocr.alto.model.structureclasses.lineelements.Word w, int maxLength)
    Returns the text of the sibling words following the given word, up to maxLength characters.
    String
    getText(List<de.intranda.digiverso.ocr.alto.model.structureclasses.Line> lines)
    Concatenates the text content of the given lines, joined by spaces.
    List<de.intranda.digiverso.ocr.alto.model.structureclasses.lineelements.Word>
    getWords(de.intranda.digiverso.ocr.alto.model.structureclasses.logical.AltoDocument doc)
    Returns all word elements contained in the given ALTO document.

    Methods inherited from class io.goobi.viewer.model.iiif.search.parser.AbstractSearchParser

    getAutoSuggestRegex, getContainedWordRegex, getPrecedingText, getQueryRegex, getSingleWordRegex, getSucceedingText

    Methods inherited from class java.lang.Object

    clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
  • Constructor Details

    • AltoSearchParser

      public AltoSearchParser()
  • Method Details

    • findWordMatches

      public List<List<de.intranda.digiverso.ocr.alto.model.structureclasses.lineelements.Word>> findWordMatches(List<de.intranda.digiverso.ocr.alto.model.structureclasses.lineelements.Word> words, String regex)
      Finds groups of consecutive words whose content matches the given regular expression.
      Parameters:
      words - candidate words to match against
      regex - regular expression to test each word's content
      Returns:
      a list of matched word groups, where each group is a consecutive sequence of matching ALTO words
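The grouping described above can be illustrated with a minimal sketch. It is not the library's implementation: plain strings stand in for ALTO Word objects, the class name WordMatchSketch and the method groupConsecutiveMatches are hypothetical, and it assumes that a word counts as a match when its whole content matches the regex.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class WordMatchSketch {

    // Groups consecutive entries of `words` that fully match `regex`.
    // Strings stand in for ALTO Word objects (an assumption of this sketch).
    public static List<List<String>> groupConsecutiveMatches(List<String> words, String regex) {
        Pattern pattern = Pattern.compile(regex);
        List<List<String>> groups = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (String word : words) {
            if (pattern.matcher(word).matches()) {
                current.add(word);
            } else if (!current.isEmpty()) {
                // A non-matching word closes the current group.
                groups.add(current);
                current = new ArrayList<>();
            }
        }
        if (!current.isEmpty()) {
            groups.add(current);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<String> words = List.of("the", "cat", "catalog", "sat", "cats");
        // prints [[cat, catalog], [cats]]
        System.out.println(groupConsecutiveMatches(words, "cat.*"));
    }
}
```

Each inner list corresponds to one hit; a non-matching word between two matches splits them into separate groups.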
    • findLineMatches

      public Map<org.apache.commons.lang3.Range<Integer>,List<de.intranda.digiverso.ocr.alto.model.structureclasses.Line>> findLineMatches(List<de.intranda.digiverso.ocr.alto.model.structureclasses.Line> lines, String regex)
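      Finds regex matches in the concatenated text of the given lines and maps each match's character-index range to the lines containing it.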
      Parameters:
      lines - ALTO lines to search through
      regex - regular expression applied to concatenated line text
      Returns:
      a map of character-index ranges to the ALTO lines containing the match
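As a rough sketch of how a match in concatenated text can be mapped back to lines, the example below joins the lines with single spaces, runs the regex over the joined text, and collects every line that overlaps each match. The assumptions are loud ones: strings stand in for ALTO Line objects, a two-element List.of(start, end) stands in for Range<Integer>, the end index is treated as exclusive, and the class LineMatchSketch is hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LineMatchSketch {

    // Maps each match range [start, end) in the space-joined text to the lines it touches.
    // Strings stand in for ALTO Line objects; List.of(start, end) stands in for Range<Integer>.
    public static Map<List<Integer>, List<String>> findLineMatches(List<String> lines, String regex) {
        String text = String.join(" ", lines);
        Map<List<Integer>, List<String>> result = new LinkedHashMap<>();
        Matcher matcher = Pattern.compile(regex).matcher(text);
        while (matcher.find()) {
            int start = matcher.start();
            int end = matcher.end(); // exclusive
            List<String> hit = new ArrayList<>();
            int offset = 0; // start of the current line in the joined text
            for (String line : lines) {
                int lineEnd = offset + line.length(); // exclusive
                if (start < lineEnd && end > offset) {
                    hit.add(line);
                }
                offset = lineEnd + 1; // skip the joining space
            }
            result.put(List.of(start, end), hit);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = List.of("the quick brown", "fox jumps over", "the lazy dog");
        // prints {[10, 19]=[the quick brown, fox jumps over]}
        System.out.println(findLineMatches(lines, "brown fox"));
    }
}
```

Matching against the joined text rather than line by line is what lets a phrase that wraps across a line break be found as a single hit.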
    • getText

      public String getText(List<de.intranda.digiverso.ocr.alto.model.structureclasses.Line> lines)
      Concatenates the text content of the given lines, joined by spaces.
      Parameters:
      lines - ALTO lines whose content to concatenate
      Returns:
      the concatenated text content of the given ALTO lines, joined by spaces
    • getLines

      public List<de.intranda.digiverso.ocr.alto.model.structureclasses.Line> getLines(de.intranda.digiverso.ocr.alto.model.structureclasses.logical.AltoDocument doc)
      Returns all text lines contained in the given ALTO document.
      Parameters:
      doc - ALTO document to extract lines from
      Returns:
      a list of all ALTO text lines contained in the given document
    • getWords

      public List<de.intranda.digiverso.ocr.alto.model.structureclasses.lineelements.Word> getWords(de.intranda.digiverso.ocr.alto.model.structureclasses.logical.AltoDocument doc)
      Returns all word elements contained in the given ALTO document.
      Parameters:
      doc - ALTO document to extract words from
      Returns:
      a list of all ALTO word elements contained in the given document
    • getContainingLines

      public List<de.intranda.digiverso.ocr.alto.model.structureclasses.Line> getContainingLines(List<de.intranda.digiverso.ocr.alto.model.structureclasses.Line> allLines, int indexStart, int indexEnd)
      Returns the lines whose character range overlaps the given match positions.
      Parameters:
      allLines - all ALTO lines with their content
      indexStart - start character index of the match
      indexEnd - end character index of the match
      Returns:
      a list of ALTO lines whose character range overlaps with the given match positions
    • getLineStartIndex

      public int getLineStartIndex(List<de.intranda.digiverso.ocr.alto.model.structureclasses.Line> allLines, de.intranda.digiverso.ocr.alto.model.structureclasses.Line line)
      Returns the start character index of the given line, based on the offsets of all lines.
      Parameters:
      allLines - all ALTO lines providing character offsets
      line - target line whose start index is sought
      Returns:
      the start character index of the given line
    • getLineEndIndex

      public int getLineEndIndex(List<de.intranda.digiverso.ocr.alto.model.structureclasses.Line> allLines, de.intranda.digiverso.ocr.alto.model.structureclasses.Line line)
      Returns the end character index of the given line, based on the offsets of all lines.
      Parameters:
      allLines - all ALTO lines providing character offsets
      line - target line whose end index is sought
      Returns:
      the end character index of the given line
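To make the notion of a line's start and end index concrete, here is a small sketch of one possible offset scheme. It assumes the lines are joined by single spaces (as getText describes) and that the end index is exclusive; the source states neither convention for these two methods. The class LineIndexSketch and its methods are hypothetical, and strings stand in for ALTO Line objects.

```java
import java.util.List;

public class LineIndexSketch {

    // Start offset of lines.get(target) within the space-joined text of all lines.
    public static int startIndex(List<String> lines, int target) {
        int offset = 0;
        for (int i = 0; i < target; i++) {
            offset += lines.get(i).length() + 1; // +1 for the joining space
        }
        return offset;
    }

    // End offset (exclusive, by assumption) of lines.get(target) within the joined text.
    public static int endIndex(List<String> lines, int target) {
        return startIndex(lines, target) + lines.get(target).length();
    }

    public static void main(String[] args) {
        List<String> lines = List.of("the quick brown", "fox jumps over", "the lazy dog");
        String text = String.join(" ", lines);
        int start = startIndex(lines, 1); // 16
        int end = endIndex(lines, 1);     // 30
        // prints fox jumps over
        System.out.println(text.substring(start, end));
    }
}
```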
    • getPrecedingText

      public String getPrecedingText(de.intranda.digiverso.ocr.alto.model.structureclasses.lineelements.Word w, int maxLength)
      Returns the text of the sibling words preceding the given word, up to maxLength characters.
      Parameters:
      w - word whose preceding siblings to collect
      maxLength - maximum character count of returned text
      Returns:
      the text content of sibling words preceding the given word, up to maxLength characters
    • getSucceedingText

      public String getSucceedingText(de.intranda.digiverso.ocr.alto.model.structureclasses.lineelements.Word w, int maxLength)
      Returns the text of the sibling words following the given word, up to maxLength characters.
      Parameters:
      w - word whose following siblings to collect
      maxLength - maximum character count of returned text
      Returns:
      the text content of sibling words following the given word, up to maxLength characters
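The context collection performed by getPrecedingText and getSucceedingText can be sketched as follows. This is an illustration under assumptions, not the library's code: words are plain strings addressed by list position rather than ALTO Word siblings, only whole words are added (the source does not say whether a word may be cut mid-way to fit maxLength), and the class ContextTextSketch is hypothetical.

```java
import java.util.List;

public class ContextTextSketch {

    // Collects whole words before position `pos`, nearest first,
    // while the space-joined result stays within maxLength characters.
    public static String precedingText(List<String> words, int pos, int maxLength) {
        StringBuilder sb = new StringBuilder();
        for (int i = pos - 1; i >= 0; i--) {
            String word = words.get(i);
            int extra = word.length() + (sb.length() > 0 ? 1 : 0);
            if (sb.length() + extra > maxLength) {
                break;
            }
            if (sb.length() > 0) {
                sb.insert(0, ' ');
            }
            sb.insert(0, word);
        }
        return sb.toString();
    }

    // Collects whole words after position `pos` under the same length limit.
    public static String succeedingText(List<String> words, int pos, int maxLength) {
        StringBuilder sb = new StringBuilder();
        for (int i = pos + 1; i < words.size(); i++) {
            String word = words.get(i);
            int extra = word.length() + (sb.length() > 0 ? 1 : 0);
            if (sb.length() + extra > maxLength) {
                break;
            }
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(word);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String> words = List.of("the", "quick", "brown", "fox", "jumps", "over");
        System.out.println(precedingText(words, 3, 11));  // prints quick brown
        System.out.println(succeedingText(words, 3, 11)); // prints jumps over
    }
}
```

Text of this kind typically serves as the before and after context shown around a search hit.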