Changes between Version 3 and Version 4 of ApertureOverview


Timestamp:
10/12/05 11:08:42
Author:
anonymous
Comment:

--

  • ApertureOverview

    v3 v4  
    77The source will contain all relevant information about semantic data extraction: everything needed to get started with a fulltext and metadata extraction framework. Our intent is that developers can download a single distribution file with a fully working environment that also includes adapter and extractor implementations. Developers can use this package to populate their Lucene-based applications or other data stores.
    88 
    9 The features of the framework will be: 
     9== Features == 
    1010 
    11  * easy to use: easy to learn, easy to code, easy to deploy in industrial projects 
     11 * Easy to use: easy to learn, easy to code, easy to deploy in industrial projects 
    1212 * Extract fulltext from many common file formats and information systems like IMAP email servers 
    13  * Extract metadata like author, date, subject and more from the data sources 
    14  * open the data objects for viewing 
    15  * Fully configurable framework, storing and editing config files is done through a Swing GUI.
    16  * Pluggable architecture: can be easily extended, can be easily integrated into other projects.
     13 * Extract metadata like author, date, subject and more from the data sources and file formats 
     14 * Open data objects for viewing 
     15 * Fully configurable framework, storing and editing config files is done through a Swing GUI
     16 * Pluggable architecture: can be easily extended, can be easily integrated into other projects
    1717 * Architecture based on the OSGi industry standard
    1818 * Compatible with RDF, but not solely based on it 
    1919 
    20 Components in the framework are: 
     20== Components == 
    2121 
    22  * DataSource Interface 
    23  * TextExtractor Interface 
    24  * DataSource implementation for Filesystem 
    25  * DataSource implementation for IMAP mail servers 
    26  * TextExtractor implementation for everything we know: PDF, Word, Fulltext, Excel
    27  * OSGi bindings and connector code
    28  * Configuration GUI
    29  * Sample application showing how to use it, with GUI (=either Autofocus or Sesame or Gnowsis)
    30  * Metadata format description (RDFS schema) and example file for the metadata 
     22 * !DataSource interface 
     23 * !DataSource implementations for file systems, websites (or rather hypertextual sources in general) and IMAP servers 
     24 * Near future work: !OutlookSource, !MozillaSource/ThunderbirdSource 
    3125 
    32 Right from the beginning we will support the following file types: 
     26 * !DataAccessor interface 
     27 * !DataAccessor implementations for file, http(s) and imap schemes 
     28 
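The split between !DataAccessor implementations "for file, http(s) and imap schemes" suggests a lookup keyed on the URI scheme. The following is only a minimal sketch of that idea: the `DataAccessor` method shown here and the `AccessorRegistry` class are hypothetical names invented for illustration, since this page does not define the actual signatures.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical interface: the real DataAccessor signature is not given on this page.
interface DataAccessor {
    String describe(URI uri);
}

// Hypothetical helper that maps a URI scheme (file, http, https, imap) to an accessor.
final class AccessorRegistry {
    private final Map<String, DataAccessor> byScheme = new HashMap<>();

    void register(String scheme, DataAccessor accessor) {
        // URI schemes are case-insensitive, so normalize before storing.
        byScheme.put(scheme.toLowerCase(Locale.ROOT), accessor);
    }

    /** Returns the accessor registered for the URI's scheme, or null if there is none. */
    DataAccessor lookup(URI uri) {
        String scheme = uri.getScheme();
        return scheme == null ? null : byScheme.get(scheme.toLowerCase(Locale.ROOT));
    }
}
```

Under this assumption, a crawler would only need to hand each URI to the registry and would not have to know which protocol is involved.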
     29 * !DataCrawler interface 
     30 * One basic !DataCrawler implementation for every !DataSource type 
     31 * Later maybe more specialized !DataCrawler implementations, e.g. a !WindowsFileSystemCrawler with OS-specific optimizations 
     32 
     33 * Extractor interface 
     34 * Extractor implementation for everything we can easily support: PDF, Word, Excel, HTML, plain text, ... 
     35 * New domain for us but also probably very doable: PNG, JPG, AVI, ... 
     36 
     37 * !ArchiveExtractor interface 
     38 * !ArchiveExtractor implementations for Zip and Gzip 
     39 
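As a rough illustration of what a Zip-based !ArchiveExtractor might do internally, the sketch below lists the file entries of a Zip stream using the standard `java.util.zip` classes. The class name `ZipEntryLister` and its method are assumptions made for this example; the !ArchiveExtractor interface itself is not specified on this page.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper, not part of any published Aperture API.
final class ZipEntryLister {
    /** Returns the names of all non-directory entries in a Zip stream. */
    static List<String> listEntries(InputStream in) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            // getNextEntry() returns null once the archive is exhausted.
            while ((entry = zip.getNextEntry()) != null) {
                if (!entry.isDirectory()) {
                    names.add(entry.getName());
                }
            }
        }
        return names;
    }
}
```

Each returned entry could then be treated as a data object in its own right and passed on for type identification and extraction.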
     40 * !LinkExtractor interface 
     41 * !LinkExtractor implementation for HTML and XHTML 
     42 * Later maybe PDF, Flash, ... 
     43 
     44 * !MimetypeIdentifier interface 
     45 * Basic !MimeTypeIdentifier implementation based on magic numbers; an absolute necessity for choosing the right Extractor, !LinkExtractor or !ArchiveExtractor implementation for a given file
     46 
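Identification based on magic numbers means comparing the first bytes of a file against known signatures. The sketch below shows the idea for a handful of well-known signatures (PDF, Zip, Gzip, PNG, JPEG). The class name `MagicMimeTypeIdentifier` and its `identify` method are hypothetical and chosen for illustration; they are not taken from the framework.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical class for illustration; the real interface is not defined on this page.
final class MagicMimeTypeIdentifier {
    // MIME type -> leading byte signature ("magic number") of that format.
    private static final Map<String, byte[]> MAGIC = new LinkedHashMap<>();
    static {
        MAGIC.put("application/pdf", "%PDF-".getBytes(StandardCharsets.US_ASCII));
        MAGIC.put("application/zip", new byte[] {0x50, 0x4B, 0x03, 0x04});
        MAGIC.put("application/gzip", new byte[] {0x1F, (byte) 0x8B});
        MAGIC.put("image/png",
                new byte[] {(byte) 0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A});
        MAGIC.put("image/jpeg", new byte[] {(byte) 0xFF, (byte) 0xD8, (byte) 0xFF});
    }

    /** Returns the MIME type whose signature prefixes the given bytes, or null if none matches. */
    static String identify(byte[] head) {
        for (Map.Entry<String, byte[]> candidate : MAGIC.entrySet()) {
            if (startsWith(head, candidate.getValue())) {
                return candidate.getKey();
            }
        }
        return null;
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

Magic numbers alone cannot separate every format: !OpenOffice and !OpenDocument files are Zip packages, so they would be reported as `application/zip` by this sketch and would need a further look inside the archive.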
     47 * [http://www.osgi.org/ OSGi] bindings and connector code (can be realized so that code is also usable outside an OSGi-based application) 
     48 * Configuration GUI (What needs to be configured? Isn't this very application-specific?)
     49 * Sample GUI application showing how to use it. Can also be used as a test application, e.g. when you are developing new Extractor implementations.
     50 * Metadata format descriptions (RDFS schema) and example metadata files 
     51 
     52== Supported File Formats == 
     53 
     54Right from the beginning we will support these file formats: 
    3355 
    3456 * Plain text 
     
    4163 * Microsoft Powerpoint 97+ 
    4264 * Microsoft Works 
    43  * OpenOffice 1.0+: Writer, Calc, Impress, Draw 
    44  * StarOffice 6.0+: Writer, Calc, Impress, Draw 
    45  * WordPerfect 5.x 
    46  * Emails 
    47  * IMAP Servers 
     65 * !OpenOffice 1.0+: Writer, Calc, Impress, Draw 
     66 * !StarOffice 6.0+: Writer, Calc, Impress, Draw 
     67 * !OpenDocument (!OpenOffice 2.0+) 
     68 * !WordPerfect 5.x 
     69 * Emails (.eml files)