| 1 | = Aperture Overview = |
| 2 | |
| 3 | Administrators: Christiaan Fluit & Leo Sauermann |
| 4 | Source Code: Interfaces and standard implementations of the SeDAF |
| 5 | |
| 6 | The source will contain all relevant information about semantic data extraction: everything needed to get started with a full-text and metadata extraction framework. Our intent is that developers can download a single distribution file containing a fully working environment, including adapter and extractor implementations. Developers can use this package to populate their Lucene-based applications or other data stores. |
| 7 | |
| 8 | The features of the framework will be: |
| 9 | |
| 10 | * Easy to use: easy to learn, easy to code, easy to deploy in industrial projects |
| 11 | * Extract full text from many common file formats and information systems, such as IMAP email servers |
| 12 | * Extract metadata such as author, date, subject and more from the data sources |
| 13 | * Open the data objects for viewing |
| 14 | * Fully configurable framework: config files are stored and edited through a Swing GUI |
| 15 | * Pluggable architecture: easy to extend and easy to integrate into other projects |
| 16 | * Architecture based on the industry-standard OSGi |
| 17 | * Compatible with RDF, but not solely based on it |
| 18 | |
| 19 | Components in the framework are: |
| 20 | |
| 21 | * DataSource Interface |
| 22 | * TextExtractor Interface |
| 23 | * DataSource implementation for Filesystem |
| 24 | * DataSource implementation for IMAP mail servers |
| 25 | * TextExtractor implementations for every format we know: PDF, Word, plain text, Excel |
| 26 | * OSGi bindings and connector code |
| 27 | * Configuration GUI |
| 28 | * Sample application with a GUI showing how to use the framework (one of Autofocus, Sesame or Gnowsis) |
| 29 | * Metadata format description (RDFS schema) and example file for the metadata |
| 30 | |
| 31 | From the start, we will support the following file types and data sources: |
| 32 | |
| 33 | * Plain text |
| 34 | * HTML |
| 35 | * XML |
| 36 | * PDF (Portable Document Format) |
| 37 | * RTF (Rich Text Format) |
| 38 | * Microsoft Word 97+ |
| 39 | * Microsoft Excel 97+ |
| 40 | * Microsoft PowerPoint 97+ |
| 41 | * Microsoft Works |
| 42 | * OpenOffice 1.0+: Writer, Calc, Impress, Draw |
| 43 | * StarOffice 6.0+: Writer, Calc, Impress, Draw |
| 44 | * WordPerfect 5.x |
| 45 | * Emails |
| 46 | * IMAP Servers |
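
To make the planned TextExtractor component and the per-format support a little more concrete, here is a minimal Java sketch of how an extractor could be selected by MIME type. It is only an illustration under stated assumptions: the method signatures, the PlainTextExtractor and ExtractorRegistry classes, and the use of MIME types as lookup keys are all hypothetical and are not taken from the actual framework interfaces.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical interface (assumption): turns a raw byte stream into plain text.
interface TextExtractor {
    String extract(InputStream in) throws IOException;
}

// Hypothetical extractor for plain text; assumes the input is UTF-8 encoded.
class PlainTextExtractor implements TextExtractor {
    @Override
    public String extract(InputStream in) throws IOException {
        return new String(in.readAllBytes(), StandardCharsets.UTF_8);
    }
}

// Hypothetical registry (assumption): picks an extractor by MIME type,
// so new formats can be plugged in without changing the calling code.
class ExtractorRegistry {
    private final Map<String, TextExtractor> byMimeType = new HashMap<>();

    void register(String mimeType, TextExtractor extractor) {
        byMimeType.put(mimeType, extractor);
    }

    Optional<TextExtractor> lookup(String mimeType) {
        return Optional.ofNullable(byMimeType.get(mimeType));
    }
}

class ExtractorSketch {
    public static void main(String[] args) throws IOException {
        ExtractorRegistry registry = new ExtractorRegistry();
        registry.register("text/plain", new PlainTextExtractor());

        byte[] document = "Hello, Aperture".getBytes(StandardCharsets.UTF_8);
        Optional<TextExtractor> extractor = registry.lookup("text/plain");
        if (extractor.isPresent()) {
            // Extract the full text of the in-memory sample document.
            System.out.println(extractor.get().extract(new ByteArrayInputStream(document)));
        }
        // No PDF extractor is registered in this sketch, so the lookup is empty.
        System.out.println(registry.lookup("application/pdf").isPresent());
    }
}
```

In a design along these lines, supporting another file type from the list above would mean adding one more TextExtractor implementation and registering it, which is one way to read the "pluggable architecture" goal.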