Aperture Overview
Administrators: Christiaan Fluit & Leo Sauermann
Source Code: Interfaces and standard implementations of Aperture
The source will contain all relevant information about semantic data extraction: everything needed to get started with a full-text and metadata extraction framework. Our intent is that developers can download a single distribution file with a fully working environment that also includes adapter and extractor implementations. Developers can use this package to fill their Lucene-based applications or other data stores.
The features of the framework will be:
- Easy to use: easy to learn, easy to code, easy to deploy in industrial projects
- Extract full text from many common file formats and information systems such as IMAP email servers
- Extract metadata such as author, date, and subject from the data sources
- Open the data objects for viewing
- Fully configurable framework: config files are stored and edited through a Swing GUI
- Pluggable architecture: easy to extend and easy to integrate into other projects
- Architecture based on the industry-standard OSGi
- Compatible with RDF, but not solely based on it
Components in the framework are:
- DataSource Interface
- TextExtractor Interface
- DataSource implementation for Filesystem
- DataSource implementation for IMAP mail servers
- TextExtractor implementations for every format we can handle: PDF, Word, plain text, Excel
- OSGi bindings and connector code
- Configuration GUI
- Sample application with a GUI showing how to use the framework (either Autofocus, Sesame, or Gnowsis)
- Metadata format description (RDFS schema) and example file for the metadata
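To make the division of labor between the components above more concrete, here is a minimal Java sketch of how a DataSource and a TextExtractor could fit together. It is not the actual Aperture API: the method signatures, the `DataObject` class, the `InMemoryDataSource` stand-in, and the `characterCount` metadata key are all assumptions made for illustration. It needs Java 9 or later (`List.of`, `InputStream.readAllBytes`).

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ApertureSketch {

    /** Hypothetical container for one item found in a data source. */
    static final class DataObject {
        final String uri;
        final byte[] content;
        final Map<String, String> metadata = new LinkedHashMap<>();

        DataObject(String uri, byte[] content) {
            this.uri = uri;
            this.content = content;
        }
    }

    /** Hypothetical: enumerates the objects held by a file system, mail server, etc. */
    interface DataSource {
        List<DataObject> listObjects();
    }

    /** Hypothetical: turns a binary stream into full text and may record metadata. */
    interface TextExtractor {
        String extract(InputStream in, Map<String, String> metadata);
    }

    /** Stand-in data source backed by a fixed list instead of a real file system. */
    static final class InMemoryDataSource implements DataSource {
        private final List<DataObject> objects;

        InMemoryDataSource(List<DataObject> objects) {
            this.objects = objects;
        }

        @Override
        public List<DataObject> listObjects() {
            return objects;
        }
    }

    /** Decodes the stream as UTF-8 and records the character count as metadata. */
    static final class PlainTextExtractor implements TextExtractor {
        @Override
        public String extract(InputStream in, Map<String, String> metadata) {
            try {
                String text = new String(in.readAllBytes(), StandardCharsets.UTF_8);
                metadata.put("characterCount", String.valueOf(text.length()));
                return text;
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
    }

    /** Runs one extractor over every object of a source, one line per object. */
    static String extractAll(DataSource source, TextExtractor extractor) {
        StringBuilder fullText = new StringBuilder();
        for (DataObject object : source.listObjects()) {
            InputStream in = new ByteArrayInputStream(object.content);
            fullText.append(extractor.extract(in, object.metadata)).append('\n');
        }
        return fullText.toString();
    }

    public static void main(String[] args) {
        DataObject note = new DataObject("mem:note-1",
                "hello aperture".getBytes(StandardCharsets.UTF_8));
        DataSource source = new InMemoryDataSource(List.of(note));
        System.out.print(extractAll(source, new PlainTextExtractor()));
        System.out.println(note.metadata);
    }
}
```

Keeping the two interfaces separate means a new source (for example IMAP) and a new format (for example PDF) can be added independently of each other, which is the point of the pluggable design described above.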
Right from the beginning we will support the following file types:
- Plain text
- HTML
- XML
- PDF (Portable Document Format)
- RTF (Rich Text Format)
- Microsoft Word 97+
- Microsoft Excel 97+
- Microsoft PowerPoint 97+
- Microsoft Works
- OpenOffice 1.0+: Writer, Calc, Impress, Draw
- StarOffice 6.0+: Writer, Calc, Impress, Draw
- WordPerfect 5.x
- Emails
- IMAP Servers
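One way a framework could choose an extractor for the file types listed above is a registry keyed by MIME type. The following Java sketch illustrates the idea only; the class `ExtractorRegistrySketch`, its methods, and the use of `Function<byte[], String>` as an extractor stand-in are hypothetical and not taken from Aperture. The HTML handler is a deliberately naive tag stripper, not a real HTML parser.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

public class ExtractorRegistrySketch {

    // Maps a lower-cased MIME type to a function that produces full text.
    private final Map<String, Function<byte[], String>> extractors = new HashMap<>();

    /** Registers (or replaces) the extractor for one MIME type. */
    void register(String mimeType, Function<byte[], String> extractor) {
        extractors.put(mimeType.toLowerCase(Locale.ROOT), extractor);
    }

    /** Returns the extracted text, or empty if no extractor is registered. */
    Optional<String> extract(String mimeType, byte[] content) {
        Function<byte[], String> extractor =
                extractors.get(mimeType.toLowerCase(Locale.ROOT));
        if (extractor == null) {
            return Optional.empty();
        }
        return Optional.of(extractor.apply(content));
    }

    /** Builds a registry with two illustrative handlers. */
    static ExtractorRegistrySketch withDefaults() {
        ExtractorRegistrySketch registry = new ExtractorRegistrySketch();
        registry.register("text/plain",
                bytes -> new String(bytes, StandardCharsets.UTF_8));
        // Naive assumption: dropping anything between '<' and '>' is good enough here.
        registry.register("text/html",
                bytes -> new String(bytes, StandardCharsets.UTF_8)
                        .replaceAll("<[^>]*>", " ")
                        .replaceAll("\\s+", " ")
                        .trim());
        return registry;
    }

    public static void main(String[] args) {
        ExtractorRegistrySketch registry = withDefaults();
        byte[] html = "<p>Hello <b>world</b></p>".getBytes(StandardCharsets.UTF_8);
        System.out.println(registry.extract("text/html", html));
        System.out.println(registry.extract("application/pdf", new byte[0]));
    }
}
```

Returning an empty `Optional` for an unregistered type lets the caller decide whether to skip the object, index its metadata only, or report the gap.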