wiki:ApertureExtractor

Version 18 (modified by anonymous, 21 years ago) (diff)

Extractor

Notes:

The mimetype is passed because the same Extractor can serve several mimetypes (e.g. OpenOfficeExtractor or OpenDocumentExtractor), and the handling may differ slightly between them.
The Extractor is given a Sesame Repository, wrapped as an RDFMap, to write its results into. The application can then decide whether this is an in-memory repository used only by this Extractor, whether the statements go directly into persistent storage, whether an application-specific optimized Repository implementation is used, etc.
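
The two notes above can be illustrated with a small sketch. It is not the Aperture or gnowsis code: the RDFMap stand-in, the simplified extract signature (no stream or charset), the OfficeLikeExtractor class and the "kind"/"source" property names are all assumptions made only so that the example is self-contained. It shows one extractor instance registered under two mimetypes and branching on the mimetype it receives, and the application, not the extractor, choosing which RDFMap implementation gets filled.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ExtractorNotesSketch {

    /** Hypothetical stand-in for RDFMap: collects property/value pairs. */
    interface RDFMap {
        void put(String property, String value);
    }

    /** In-memory variant; a persistent variant would implement the same interface. */
    static class InMemoryRDFMap implements RDFMap {
        final List<String> statements = new ArrayList<>();

        public void put(String property, String value) {
            statements.add(property + "=" + value);
        }
    }

    /** Simplified extractor signature: stream and charset are left out of this sketch. */
    interface Extractor {
        void extract(String id, String mimetype, RDFMap result);
    }

    /** One class serving two mimetypes, with a small per-type difference. */
    static class OfficeLikeExtractor implements Extractor {
        public void extract(String id, String mimetype, RDFMap result) {
            // The property names "kind" and "source" are made up for illustration.
            if (mimetype.equals("application/vnd.oasis.opendocument.spreadsheet")) {
                result.put("kind", "spreadsheet");
            } else {
                result.put("kind", "text");
            }
            result.put("source", id);
        }
    }

    public static void main(String[] args) {
        // The same instance is registered for both mimetypes.
        Extractor shared = new OfficeLikeExtractor();
        Map<String, Extractor> registry = new HashMap<>();
        registry.put("application/vnd.oasis.opendocument.text", shared);
        registry.put("application/vnd.oasis.opendocument.spreadsheet", shared);

        String mimetype = "application/vnd.oasis.opendocument.spreadsheet";
        // The application decides which kind of RDFMap is used.
        InMemoryRDFMap result = new InMemoryRDFMap();
        registry.get(mimetype).extract("file:/tmp/budget.ods", mimetype, result);
        System.out.println(result.statements);
    }
}
```

Swapping InMemoryRDFMap for another implementation of the same stand-in interface would not require any change to the extractor, which is the point of the second note.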

TODO: Leo questions whether the RDFMap should be passed in at all. Chris: passing it as a parameter relieves the implementor of the burden of instantiating one, which may not be trivial, depending on the chosen RDF interface.

Java Interface

Probably equal to:

source:trunk/gnowsis/src/org/gnowsis/data/extractor/ExtractorPlaintext.java and
source:/trunk/gnoDesktopSearch/src/java/org/gnowsis/desktopsearch/extractor/DocumentExtractor.java

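import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;

/**
 * Extractors are used to extract metadata and full text from InputStreams.
 * The format of the InputStream is indicated by the passed MIME type.
 * Extractors write what they extract into RDFMaps.
 */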
public interface Extractor {


    /**
     * Writes the extracted information into the passed RDFMap called "result".
     * To see which fields are needed and which must be added, see the
     * comments above.
     * @param id the URI identifying the passed object. You may need it when you add sophisticated RDF information. It is also the topResource in the passed result.
     * @param stream an opened InputStream which you can read exclusively. You must call
     *        stream.close() when you are finished extracting.
     * @param charset the charset in which the InputStream is encoded
     * @param mimetype the MIME type of the passed file/stream. If your extractor can handle multiple MIME types, this can be handy.
     * @param result the place where the extracted data is to be written to
     * @throws IOException when problems arise reading the stream.
     * @throws DocumentExtractorException when the metadata of the stream cannot be extracted,
     * or when the stream does not conform to the MIME type's norms.
     */
    public void extract(URI id, InputStream stream, Charset charset, String mimetype, RDFMap result)  throws IOException, DocumentExtractorException;


 /*
  inferior ALTERNATIVE:

    public RDFMap extract(URI id, InputStream stream, Charset charset, String mimetype)  throws IOException, DocumentExtractorException;

  Inferior because with the first variant implementors only need to know the interface,
  whereas with this alternative they have to know how to instantiate an RDFMap. Performance
  of the first is also better if the RDF store itself is handed in through the method. */

 
}
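
As an illustration of the contract described in the Javadoc above, here is a minimal sketch of a plain-text implementation. It is not the ExtractorPlaintext class referenced earlier. The RDFMap and DocumentExtractorException classes are hypothetical stand-ins, java.net.URI is assumed for the id parameter, the property names are invented, and rejecting non-text mimetypes is only one possible way to use the exception.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URI;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

public class PlainTextExtractorSketch {

    /** Hypothetical stand-in for RDFMap; the real class is not shown on this page. */
    static class RDFMap {
        final Map<String, String> values = new LinkedHashMap<>();

        void put(String property, String value) {
            values.put(property, value);
        }
    }

    /** Stand-in for the exception named in the interface. */
    static class DocumentExtractorException extends Exception {
        DocumentExtractorException(String message) {
            super(message);
        }
    }

    /** Same shape as the interface on this page, with java.net.URI assumed. */
    interface Extractor {
        void extract(URI id, InputStream stream, Charset charset, String mimetype, RDFMap result)
                throws IOException, DocumentExtractorException;
    }

    static class PlainTextExtractor implements Extractor {
        public void extract(URI id, InputStream stream, Charset charset, String mimetype, RDFMap result)
                throws IOException, DocumentExtractorException {
            // try-with-resources closes the reader and, through it, the stream,
            // as the contract asks the extractor to do.
            try (BufferedReader reader = new BufferedReader(new InputStreamReader(stream, charset))) {
                if (!mimetype.startsWith("text/")) {
                    throw new DocumentExtractorException("not a text type: " + mimetype);
                }
                StringBuilder text = new StringBuilder();
                char[] buffer = new char[4096];
                int n;
                while ((n = reader.read(buffer)) != -1) {
                    text.append(buffer, 0, n);
                }
                // Property names are made up for this sketch.
                result.put("id", id.toString());
                result.put("mimetype", mimetype);
                result.put("fulltext", text.toString());
            }
        }
    }

    public static void main(String[] args) throws Exception {
        InputStream in = new ByteArrayInputStream("hello extractor".getBytes(StandardCharsets.UTF_8));
        RDFMap result = new RDFMap();   // created by the caller, not by the extractor
        new PlainTextExtractor().extract(URI.create("file:/tmp/note.txt"), in,
                StandardCharsets.UTF_8, "text/plain", result);
        System.out.println(result.values.get("fulltext"));
    }
}
```

Because the caller creates the RDFMap, the extractor only depends on the interface, which matches the argument for passing the result in as a parameter.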
