| 10 | | |
| 11 | | |
| 12 | | == SourceForge Project == |
| 13 | | |
| 14 | | Administrators: Christiaan Fluit & Leo Sauermann |
| 15 | | Source Code: Interfaces and standard implementations of the SeDAF |
| 16 | | |
| 17 | | The source will contain all relevant information about semantic data extraction: everything needed to get started with a full-text and metadata extraction framework. Our intent is that developers can download a single distribution file containing a fully working environment, including adapter and extractor implementations. Developers can use this package to populate their Lucene-based applications or other data stores. |
| 18 | | |
| 19 | | The features of the framework will be: |
| 20 | | |
| 21 | | * Easy to use: easy to learn, easy to code, easy to deploy in industrial projects |
| 22 | | * Extract fulltext from many common file formats and information systems like IMAP email servers |
| 23 | | * Extract metadata like author, date, subject and more from the data sources |
| 24 | | * Open the data objects for viewing |
| 25 | | * Fully configurable framework: config files are stored and edited through a Swing GUI |
| 26 | | * Pluggable architecture: can be easily extended and easily integrated into other projects |
| 27 | | * Architecture based on the industry standard OSGi |
| 28 | | * Compatible with RDF, but not solely based on it |
| 29 | | |
| 30 | | Components in the framework are: |
| 31 | | |
| 32 | | * DataSource Interface |
| 33 | | * TextExtractor Interface |
| 34 | | * DataSource implementation for Filesystem |
| 35 | | * DataSource implementation for IMAP mail servers |
| 36 | | * TextExtractor implementations for every format we know: PDF, Word, plain text, Excel |
| 37 | | * OSGi bindings and connector code |
| 38 | | * Configuration GUI |
| 39 | | * Sample application with a GUI showing how to use the framework (either Autofocus, Sesame, or Gnowsis) |
| 40 | | * Metadata format description (RDFS schema) and example file for the metadata |
| 41 | | |
| 42 | | Right from the beginning we will support the following file types: |
| 43 | | |
| 44 | | * Plain text |
| 45 | | * HTML |
| 46 | | * XML |
| 47 | | * PDF (Portable Document Format) |
| 48 | | * RTF (Rich Text Format) |
| 49 | | * Microsoft Word 97+ |
| 50 | | * Microsoft Excel 97+ |
| 51 | | * Microsoft PowerPoint 97+ |
| 52 | | * Microsoft Works |
| 53 | | * OpenOffice 1.0+: Writer, Calc, Impress, Draw |
| 54 | | * StarOffice 6.0+: Writer, Calc, Impress, Draw |
| 55 | | * WordPerfect 5.x |
| 56 | | * Emails |
| 57 | | * IMAP Servers |
| 58 | | |
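To make the component list above more concrete, here is a minimal Java sketch of how a DataSource, a TextExtractor, and a registry keyed by MIME type might fit together. Only the names `DataSource` and `TextExtractor` come from the list above; every method signature, the `DataObject`, `ExtractorRegistry`, `PlainTextExtractor`, and `InMemoryDataSource` classes, and the `mem:` URIs are assumptions made for illustration and are not the actual SeDAF interfaces.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical data object: an identifier, a MIME type, and the raw content.
final class DataObject {
    final String uri;
    final String mimeType;
    final byte[] content;

    DataObject(String uri, String mimeType, byte[] content) {
        this.uri = uri;
        this.mimeType = mimeType;
        this.content = content;
    }

    InputStream open() {
        return new ByteArrayInputStream(content);
    }
}

// Assumed shape of a DataSource: it enumerates the objects of one source
// (a file system folder, an IMAP mailbox, ...).
interface DataSource {
    Iterator<DataObject> objects();
}

// Assumed shape of a TextExtractor: it turns a byte stream into full text.
interface TextExtractor {
    String extract(InputStream in) throws IOException;
}

// Simplest possible extractor: decodes the stream as UTF-8 text.
final class PlainTextExtractor implements TextExtractor {
    @Override
    public String extract(InputStream in) throws IOException {
        return new String(in.readAllBytes(), StandardCharsets.UTF_8);
    }
}

// Hypothetical registry that picks an extractor by MIME type.
final class ExtractorRegistry {
    private final Map<String, TextExtractor> byMimeType = new HashMap<>();

    void register(String mimeType, TextExtractor extractor) {
        byMimeType.put(mimeType, extractor);
    }

    // Returns the extracted text, or null when no extractor is registered
    // for the object's MIME type.
    String extract(DataObject object) {
        TextExtractor extractor = byMimeType.get(object.mimeType);
        if (extractor == null) {
            return null;
        }
        try (InputStream in = object.open()) {
            return extractor.extract(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}

// In-memory DataSource used only to exercise the sketch.
final class InMemoryDataSource implements DataSource {
    private final List<DataObject> items;

    InMemoryDataSource(List<DataObject> items) {
        this.items = items;
    }

    @Override
    public Iterator<DataObject> objects() {
        return items.iterator();
    }
}

final class ExtractionDemo {
    public static void main(String[] args) {
        ExtractorRegistry registry = new ExtractorRegistry();
        registry.register("text/plain", new PlainTextExtractor());

        DataSource source = new InMemoryDataSource(List.of(
            new DataObject("mem:note", "text/plain",
                "hello".getBytes(StandardCharsets.UTF_8)),
            new DataObject("mem:report", "application/pdf", new byte[0])));

        // Walk the source and print the text of every object we can handle.
        for (Iterator<DataObject> it = source.objects(); it.hasNext();) {
            DataObject object = it.next();
            String text = registry.extract(object);
            System.out.println(object.uri + " -> "
                + (text == null ? "(no extractor registered)" : text));
        }
    }
}
```

Keeping extractors behind a lookup by MIME type is one way to get the pluggable behaviour described in the feature list: supporting a new file format would then mean registering one more `TextExtractor`, without touching any `DataSource`.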