wiki:ApertureExtractor

Version 13 (modified by anonymous, 20 years ago)
--

Extractor

Java Interface

Probably equivalent to the existing interfaces in:

source:trunk/gnowsis/src/org/gnowsis/data/extractor/ExtractorPlaintext.java and
source:/trunk/gnoDesktopSearch/src/java/org/gnowsis/desktopsearch/extractor/DocumentExtractor.java

/**
 * 
 * Extractors extract metadata and full text from InputStreams.
 * The format of the input stream is indicated by its MIME type.
 * Extractors can also produce metadata maps.
 */
public interface Extractor {

    /**
     * Returns a plain-text representation of the file.
     * @param source the file to extract text from
     * @param mimetype the MIME type that gnowsis has identified for this file
     * @return the extracted text, or null if the file contains no plain text.
     * If the text could not be extracted, an exception is thrown instead.
     * @throws ExtractionException when something goes wrong during extraction
     * @throws FileNotFoundException when the file does not exist
     */
    public String getPlaintext(File source, String mimetype) throws FileNotFoundException, ExtractionException;

    /**
     * Creates a Lucene document.
     * To see which fields are needed, look at the top of this class.
     * @param file the file to create the document from
     * @param uri the URI identifying the passed file. You may need it when you add sophisticated RDF information.
     * @param mimetype the MIME type of the passed file/stream. This is handy if your extractor handles multiple MIME types.
     * @param options optional settings that may help the extractor
     * @return a Lucene document
     */
    public Document createLuceneDocument(File file, String uri, String mimetype, Object options) throws IOException, DocumentExtractorException;

    /** Same as above, but reads the content from a stream instead of a file. */
    public Document createLuceneDocument(InputStream stream, String uri, String mimetype, Object options) throws IOException, DocumentExtractorException;
}
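
The null-versus-exception contract of getPlaintext can be illustrated with a minimal sketch for text/plain files. This is not the actual ExtractorPlaintext class: the class name PlaintextSketch, the stand-in ExtractionException, the UTF-8 default, and the choice to reject other MIME types with an exception are all assumptions made for illustration.

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

// Hypothetical stand-in for the gnowsis ExtractionException, whose real definition is not shown on this page.
class ExtractionException extends Exception {
    ExtractionException(String message, Throwable cause) {
        super(message, cause);
    }
}

// Sketch of getPlaintext for text/plain only (hypothetical class, not part of gnowsis).
class PlaintextSketch {

    public String getPlaintext(File source, String mimetype)
            throws FileNotFoundException, ExtractionException {
        if (!source.isFile()) {
            throw new FileNotFoundException(source.getPath());
        }
        if (!"text/plain".equals(mimetype)) {
            // Assumption: an unsupported type counts as "could not be extracted".
            throw new ExtractionException("unsupported MIME type: " + mimetype, null);
        }
        try {
            // Assumption: UTF-8, because this interface has no charset parameter.
            String text = new String(Files.readAllBytes(source.toPath()), StandardCharsets.UTF_8);
            // Per the contract: null means the file holds no plain text.
            return text.isEmpty() ? null : text;
        } catch (IOException e) {
            throw new ExtractionException("could not read " + source, e);
        }
    }
}
```

A caller therefore has to distinguish three outcomes: a string, null for an empty file, and an exception when extraction itself fails.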

A better alternative may be:

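public interface Extractor {

    /**
     * Extracts the contents of the stream and stores them in the given repository.
     * See the notes below for the mimetype and repository parameters.
     */
    public void extract(URI id, InputStream stream, Charset charset, String mimetype, Repository repository);
}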

Notes:

The mimetype is passed in because the same Extractor can serve several MIME types (e.g. OpenOfficeExtractor or OpenDocumentExtractor), and the handling may differ slightly between them.
A Sesame Repository is passed to the Extractor as the place to store what it extracts. The application can then decide whether this is an in-memory repository used only by this Extractor, whether the statements go directly into persistent storage, whether an optimized Repository implementation is used, etc.
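
The two notes above can be sketched roughly as follows. To keep the example self-contained, it does not use the real Sesame API: StatementSink and InMemorySink are hypothetical stand-ins for a Repository, the predicate names "mimeType" and "fullText" are invented, and the method declares IOException, which the proposed signature does not.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URI;
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a Sesame Repository: it only accepts statements.
interface StatementSink {
    void add(URI subject, String predicate, String object);
}

// In-memory variant; the application could pass a persistent implementation instead.
class InMemorySink implements StatementSink {
    final List<String> statements = new ArrayList<>();

    public void add(URI subject, String predicate, String object) {
        statements.add(subject + " " + predicate + " " + object);
    }
}

// One extractor serving two MIME types, with a slight difference between them.
class TextExtractorSketch {

    public void extract(URI id, InputStream stream, Charset charset,
                        String mimetype, StatementSink sink) throws IOException {
        // Decode the stream with the charset supplied by the caller.
        StringBuilder text = new StringBuilder();
        Reader reader = new InputStreamReader(stream, charset);
        char[] buffer = new char[4096];
        int count;
        while ((count = reader.read(buffer)) != -1) {
            text.append(buffer, 0, count);
        }
        String content = text.toString();
        if ("text/html".equals(mimetype)) {
            // Crude tag removal, for illustration only; not a real HTML parser.
            content = content.replaceAll("<[^>]*>", "");
        }
        // The extractor does not know or care where the statements end up.
        sink.add(id, "mimeType", mimetype);
        sink.add(id, "fullText", content.trim());
    }
}
```

Because the extractor only writes to whatever sink it is handed, swapping the in-memory sink for a persistent one would not require changes to the extractor itself.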
