Changes between Version 14 and Version 15 of ApertureExtractor


Timestamp:
10/17/05 15:57:34 (19 years ago)
Author:
sauermann
Comment:

--

  • ApertureExtractor

    v14 v15  
    11= Extractor = 
     2 
     3Notes: 
     4 
     5 * The mimetype is specified because the same Extractor can be used for several mimetypes (e.g. !OpenOfficeExtractor or !OpenDocumentExtractor) while there may be slight differences for different mimetypes. 
     6 * A Sesame Repository is specified to the Extractor to put its contents in. The repository 
     7is neatly wrapped as RDFMap. The application can then decide whether this is an in-memory repository on which only this Extractor is operating, whether the statements go directly into a persistent storage, whether an application-specific optimized Repository implementation is used, etc. 
     8 
     9TODO: Leo questions if the rdfmap should be passed or not. 
     10 
    211 
    312== Java Interface == 
     
    615 * source:/trunk/gnoDesktopSearch/src/java/org/gnowsis/desktopsearch/extractor/DocumentExtractor.java 
    716 
     17 
    818{{{ 
    919#!java 
     20 
    1021/** 
    1122 *  
    1223 * Extractors are used to extract metadata and fulltext from InputStreams, 
    1324 * the inputstream is in a format passed by Mime-Type. 
    14  * These extractors can produce metadatamaps. 
     25 * These extractors can produce RDFMaps. 
    1526 */ 
    1627public interface Extractor { 
    1728 
    18     /** 
    19      * return a plaintext representation of the file 
    20      * @param source the file to look into 
    21      * @param mimetype the mimetype that has been identified by gnowsis that this file is 
    22      * @return null or a string. Null is returned, if no plaintext is in the file. If it could not be  
    23      * extracted, an exception is thrown. 
    24      * @throws ExtractionException when something goes wrong with extraction 
    25      * @throws FileNotFoundException when the file does not exist 
    26      */ 
    27     public String getPlaintext(File source, String mimetype) throws FileNotFoundException, ExtractionException ; 
    2829 
    2930    /** 
    30      * create a lucene document. 
    31      * To see what fields would be needed, look at the top of this class. 
    32      * @param file 
    33      * @param uri the uri identifying the passed file. You may need it when you add sophisticated rdf information 
     31     * write the extracted information into the passed RDFMap called "result" 
     32     * To see which fields should be added and which must be added, look at the  
     33     * comments above 
     34     * @param id the uri identifying the passed object. You may need it when you add sophisticated rdf information. It is also the topResource in the passed result 
     35     * @param stream an opened inputstream which you can exclusively read. You must call the 
     36stream.close() operation when you are finished extracting. 
     37     * @param charset the charset in which the inputstream is encoded 
    3438     * @param mimetype the mimetype of the passed file/stream. If your extractor can handle multiple mime-types, this can be handy. 
    35      * @param options optional options that may help you. 
    36      * @return a lucene document 
     39     * @param result - the place where the extracted data is to be written to  
     40     * @throws IOException when problems arise reading the stream. 
     41     * @throws DocumentExtractorException when the metadata of the stream cannot be extracted, 
     42     * when the stream does not conform to the MimeType's norms. 
    3743     */ 
    38     public Document createLuceneDocument(File file, String uri, String mimetype, Object options) throws IOException, DocumentExtractorException; 
    39      
    40     public Document createLuceneDocument(InputStream stream, String uri, String mimetype, Object options) throws IOException, DocumentExtractorException; 
    41 } 
    42  
    43  
    44 }}} 
    45  
    46 A better alternative may be: 
    47  
    48 {{{ 
    49 #!java 
    50 public interface Extractor { 
    51  
    52     public void extract(URI id, InputStream stream, Charset charset, String mimetype, Repository repository); 
     44    public void extract(URI id, InputStream stream, Charset charset, String mimetype, RDFMap result)  throws IOException, DocumentExtractorException; 
    5345} 
    5446}}} 
    5547 
    56 Notes: 
    57  
    58  * The mimetype is specified because the same Extractor can be used for several mimetypes (e.g. !OpenOfficeExtractor or !OpenDocumentExtractor) while there may be slight differences for different mimetypes. 
    59  * A Sesame Repository is specified to the Extractor to put its contents in. The application can then decide whether this is an in-memory repository on which only this Extractor is operating, whether the statements go directly into a persistent storage, whether an application-specific optimized Repository implementation is used, etc.
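
To make the v15 signature concrete, here is a minimal sketch of how an implementation might look. It is not taken from the Aperture or gnowsis sources. The `RDFMap` interface shown here is a hypothetical two-method stand-in (the real RDFMap API does not appear on this page), and `InMemoryRDFMap`, `PlainTextExtractor`, `ExtractorDemo`, the `DocumentExtractorException` body, and the property names `mimeType` and `fullText` are likewise assumptions made for illustration. Only the `extract(...)` signature follows the interface above.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URI;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for the RDFMap wrapper; the real API is not shown on this page.
interface RDFMap {
    void put(URI subject, String property, String value);
    String get(URI subject, String property);
}

// Assumed in-memory variant: the application decides where statements end up.
class InMemoryRDFMap implements RDFMap {
    private final Map<String, String> statements = new LinkedHashMap<>();

    public void put(URI subject, String property, String value) {
        statements.put(subject + " " + property, value);
    }

    public String get(URI subject, String property) {
        return statements.get(subject + " " + property);
    }
}

// Assumed shape of the checked exception named in the interface.
class DocumentExtractorException extends Exception {
    DocumentExtractorException(String message) {
        super(message);
    }
}

// Mirrors the v15 signature from the page.
interface Extractor {
    void extract(URI id, InputStream stream, Charset charset, String mimetype, RDFMap result)
            throws IOException, DocumentExtractorException;
}

// Illustrative extractor that accepts any text/* mimetype.
class PlainTextExtractor implements Extractor {
    public void extract(URI id, InputStream stream, Charset charset, String mimetype, RDFMap result)
            throws IOException, DocumentExtractorException {
        // try-with-resources closes the reader and the underlying stream,
        // which the interface contract requires of the extractor.
        try (Reader reader = new InputStreamReader(stream, charset)) {
            if (!mimetype.startsWith("text/")) {
                throw new DocumentExtractorException("unsupported mimetype: " + mimetype);
            }
            StringBuilder text = new StringBuilder();
            char[] buffer = new char[4096];
            int read;
            while ((read = reader.read(buffer)) != -1) {
                text.append(buffer, 0, read);
            }
            // The id doubles as the top resource in the result.
            result.put(id, "mimeType", mimetype);
            result.put(id, "fullText", text.toString());
        }
    }
}

class ExtractorDemo {
    public static void main(String[] args) throws Exception {
        URI id = URI.create("file:///tmp/example.txt");
        RDFMap result = new InMemoryRDFMap();
        InputStream in = new ByteArrayInputStream("hello".getBytes(StandardCharsets.UTF_8));
        new PlainTextExtractor().extract(id, in, StandardCharsets.UTF_8, "text/plain", result);
        System.out.println(result.get(id, "fullText"));
    }
}
```

Because the caller supplies `result`, the same extractor could write into an in-memory map as in the demo or into a wrapper around a persistent store, without any change to the extractor itself.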