
Changes between Version 3 and Version 4 of ApertureOverview

Timestamp: 10/12/05 11:08:42
Author: anonymous
Comment: --

ApertureOverview

-                      v3
+                      v4
 The source will contain all relevant information about semantic data extraction, everything that is needed to get started with a fulltext and metadata extraction framework. Our intent is that developers can download a single distribution file containing a fully working environment, including adapter and extractor implementations. Developers can use this package to populate their Lucene-based applications or other data stores.
+The features of the framework will be:
+== Features ==
  * easy to use: easy to learn, easy to code, easy to deploy in industrial projects
+ * Easy to use: easy to learn, easy to code, easy to deploy in industrial projects
  * Extract fulltext from many common file formats and information systems like IMAP email servers
  * Extract metadata like author, date, subject and more from the data sources
  * open the data objects for viewing
  * Fully configurable framework; storing and editing config files is done through a Swing GUI.
  * Pluggable architecture: can be easily extended and easily integrated into other projects.
+ * Extract metadata like author, date, subject and more from the data sources and file formats
+ * Open data objects for viewing
+ * Fully configurable framework; storing and editing config files is done through a Swing GUI
+ * Pluggable architecture: can be easily extended and easily integrated into other projects
  * Architecture based on the industry-standard OSGi
  * Compatible with RDF, but not solely based on it
+Components in the framework are:
+== Components ==
+ * DataSource Interface
+ * TextExtractor Interface
+ * DataSource implementation for Filesystem
+ * DataSource implementation for IMAP mail servers
+ * TextExtractor implementation for everything we know: PDF, Word, Fulltext, Excel
+ * OSGi bindings and connector code
+ * Configuration GUI
+ * Sample application showing how to use it, with GUI (either Autofocus or Sesame or Gnowsis)
+ * Metadata format description (RDFS schema) and example file for the metadata
+ * !DataSource interface
+ * !DataSource implementations for file systems, websites (or rather hypertextual sources in general) and IMAP servers
+ * Near future work: !OutlookSource, !MozillaSource/ThunderbirdSource
+Right from the beginning we will support the following file types:
+ * !DataAccessor interface
+ * !DataAccessor implementations for file, http(s) and imap schemes
+ * !DataCrawler interface
+ * One basic !DataCrawler implementation for every !DataSource type
+ * Later maybe more specialized !DataCrawler implementations, e.g. a !WindowsFileSystemCrawler with OS-specific optimizations
+ * Extractor interface
+ * Extractor implementation for everything we can easily support: PDF, Word, Excel, HTML, plain text, ...
+ * New domain for us but also probably very doable: PNG, JPG, AVI, ...
+ * !ArchiveExtractor interface
+ * !ArchiveExtractor implementations for Zip and Gzip
+ * !LinkExtractor interface
+ * !LinkExtractor implementation for HTML and XHTML
+ * Later maybe PDF, Flash, ...
+ * !MimeTypeIdentifier interface
+ * Basic !MimeTypeIdentifier implementation based on magic numbers; absolute necessity for choosing the right Extractor, !LinkExtractor or !ArchiveExtractor implementation for a given file
+ * [http://www.osgi.org/ OSGi] bindings and connector code (can be realized so that code is also usable outside an OSGi-based application)
+ * Configuration GUI (what needs to be configured? Isn't this very application-specific?)
+ * Sample GUI application showing how to use it. Can also be used as test application, e.g. when you are developing new Extractor implementations.
+ * Metadata format descriptions (RDFS schema) and example metadata files
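The component list says a magic-number-based MIME type identifier is needed to choose the right Extractor or !ArchiveExtractor for a given file. The following is a minimal sketch of how that could look, not the framework's actual API: the interface name mirrors the list above, but the method signature, the `MagicNumberIdentifier` class, the `chooseHandler` helper and the handful of signatures covered are all assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch only: names and signatures are assumptions, not the Aperture API.
class MimeTypeSketch {

    /** Hypothetical interface: maps the leading bytes of a file to a MIME type. */
    interface MimeTypeIdentifier {
        String identify(byte[] head); // returns null when no signature matches
    }

    /** Identifies a few formats by their well-known magic numbers. */
    static class MagicNumberIdentifier implements MimeTypeIdentifier {
        private final Map<String, byte[]> signatures = new LinkedHashMap<>();

        MagicNumberIdentifier() {
            signatures.put("application/pdf", "%PDF-".getBytes(StandardCharsets.US_ASCII));
            signatures.put("application/zip", new byte[] {'P', 'K', 3, 4});
            signatures.put("application/gzip", new byte[] {(byte) 0x1f, (byte) 0x8b});
            signatures.put("image/png",
                    new byte[] {(byte) 0x89, 'P', 'N', 'G', '\r', '\n', 0x1a, '\n'});
        }

        @Override
        public String identify(byte[] head) {
            // Compare the start of the file against each known signature in turn.
            for (Map.Entry<String, byte[]> entry : signatures.entrySet()) {
                byte[] sig = entry.getValue();
                if (head.length >= sig.length
                        && Arrays.equals(Arrays.copyOf(head, sig.length), sig)) {
                    return entry.getKey();
                }
            }
            return null;
        }
    }

    /** Picks the kind of component that should handle a MIME type (hypothetical mapping). */
    static String chooseHandler(String mimeType) {
        if (mimeType == null) {
            return "none";
        }
        switch (mimeType) {
            case "application/zip":
            case "application/gzip":
                return "ArchiveExtractor";
            default:
                return "Extractor";
        }
    }

    public static void main(String[] args) {
        MimeTypeIdentifier identifier = new MagicNumberIdentifier();
        byte[] head = "%PDF-1.4\n".getBytes(StandardCharsets.US_ASCII);
        String mimeType = identifier.identify(head);
        System.out.println(mimeType + " -> " + chooseHandler(mimeType));
    }
}
```

Text formats such as HTML or plain text have no fixed magic number, so a real identifier would likely need additional heuristics (file extension, content sniffing) that this sketch leaves out.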
+== Supported File Formats ==
+Right from the beginning we will support these file formats:
  * Plain text
 …
  * Microsoft Powerpoint 97+
  * Microsoft Works
  * OpenOffice 1.0+: Writer, Calc, Impress, Draw
  * StarOffice 6.0+: Writer, Calc, Impress, Draw
  * WordPerfect 5.x
  * Emails
  * IMAP Servers
+ * !OpenOffice 1.0+: Writer, Calc, Impress, Draw
+ * !StarOffice 6.0+: Writer, Calc, Impress, Draw
+ * !OpenDocument (!OpenOffice 2.0+)
+ * !WordPerfect 5.x
+ * Emails (.eml files)