10 | | |
11 | | |
12 | | == Sourceforge Project == |
13 | | |
14 | | Administrators: Christiaan Fluit & Leo Sauermann |
15 | | Source Code: Interfaces and standard implementations of the SeDAF |
16 | | |
17 | | The source will contain all relevant information about semantic data extraction: everything needed to get started with a full-text and metadata extraction framework. Our intent is that developers can download a single distribution file containing a fully working environment, including adapter and extractor implementations. Developers can use this package to populate their Lucene-based applications or other data stores. |
18 | | |
19 | | The features of the framework will be: |
20 | | |
21 | | * Easy to use: easy to learn, easy to code, easy to deploy in industrial projects |
22 | | * Extract full text from many common file formats and information systems, such as IMAP email servers |
23 | | * Extract metadata such as author, date, subject and more from the data sources |
24 | | * Open the data objects for viewing |
25 | | * Fully configurable framework: config files are stored and edited through a Swing GUI |
26 | | * Pluggable architecture: easy to extend and easy to integrate into other projects |
27 | | * Architecture based on the industry-standard OSGi |
28 | | * Compatible with RDF, but not solely based on it |
29 | | |
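The metadata and RDF bullets above can be illustrated with a small sketch. It is not the framework's API: the class name `MetadataRecord`, its methods, and the choice of the Dublin Core element namespace are assumptions made for illustration. The idea shown is that metadata can live in a plain key-value structure and still be exported as RDF (here as N-Triples) when needed, which is one reading of "compatible with RDF, but not solely based on it".

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical metadata holder: a plain map that can optionally be exported as RDF.
class MetadataRecord {
    // Dublin Core element namespace, used here only as an illustrative vocabulary.
    static final String DC = "http://purl.org/dc/elements/1.1/";

    private final String resourceUri;
    private final Map<String, String> properties = new LinkedHashMap<>();

    MetadataRecord(String resourceUri) {
        this.resourceUri = resourceUri;
    }

    // Stores one property, e.g. "creator", "date" or "subject"; returns this for chaining.
    MetadataRecord put(String property, String value) {
        properties.put(property, value);
        return this;
    }

    // Plain access that needs no RDF tooling.
    String get(String property) {
        return properties.get(property);
    }

    // Serializes every property as one N-Triples statement with a plain literal object.
    String toNTriples() {
        StringBuilder out = new StringBuilder();
        for (Map.Entry<String, String> e : properties.entrySet()) {
            out.append('<').append(resourceUri).append("> <")
               .append(DC).append(e.getKey()).append("> \"")
               .append(escape(e.getValue())).append("\" .\n");
        }
        return out.toString();
    }

    // Escapes backslashes, quotes and line breaks inside a literal.
    private static String escape(String s) {
        return s.replace("\\", "\\\\")
                .replace("\"", "\\\"")
                .replace("\n", "\\n")
                .replace("\r", "\\r");
    }
}
```

An application that does not use RDF would call only `put` and `get`; one that does could feed the output of `toNTriples()` to an RDF store.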
30 | | Components in the framework are: |
31 | | |
32 | | * DataSource Interface |
33 | | * TextExtractor Interface |
34 | | * DataSource implementation for Filesystem |
35 | | * DataSource implementation for IMAP mail servers |
36 | | * TextExtractor implementations for every format we know: PDF, Word, plain text, Excel |
37 | | * OSGi bindings and connector code |
38 | | * Configuration GUI |
39 | | * Sample application showing how to use it, with a GUI (either Autofocus, Sesame, or Gnowsis) |
40 | | * Metadata format description (RDFS schema) and example file for the metadata |
41 | | |
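To make the component list more concrete, here is a minimal sketch of how a DataSource and a TextExtractor might fit together. Only the two interface names come from the list above. Every method signature, the classes `FileSystemDataSource`, `PlainTextExtractor` and `ExtractionSketch`, and the use of string identifiers are assumptions for illustration, not the project's actual design.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical: a data source enumerates data objects and opens their raw content.
interface DataSource {
    List<String> listObjects() throws IOException;
    InputStream open(String id) throws IOException;
}

// Hypothetical: a text extractor turns one raw stream into full text.
interface TextExtractor {
    String extractText(InputStream in) throws IOException;
}

// Filesystem data source: every regular file under a root directory is one object.
class FileSystemDataSource implements DataSource {
    private final Path root;

    FileSystemDataSource(Path root) {
        this.root = root;
    }

    @Override
    public List<String> listObjects() throws IOException {
        // Files.walk holds directory handles open, so close the stream when done.
        try (Stream<Path> paths = Files.walk(root)) {
            return paths.filter(Files::isRegularFile)
                        .map(Path::toString)
                        .sorted()
                        .collect(Collectors.toList());
        }
    }

    @Override
    public InputStream open(String id) throws IOException {
        return Files.newInputStream(Paths.get(id));
    }
}

// Simplest possible extractor: treats the bytes as UTF-8 text.
class PlainTextExtractor implements TextExtractor {
    @Override
    public String extractText(InputStream in) throws IOException {
        return new String(in.readAllBytes(), StandardCharsets.UTF_8);
    }
}

class ExtractionSketch {
    // Pulls every object from the source through the extractor, in listing order.
    static List<String> extractAll(DataSource source, TextExtractor extractor) throws IOException {
        List<String> texts = new ArrayList<>();
        for (String id : source.listObjects()) {
            try (InputStream in = source.open(id)) {
                texts.add(extractor.extractText(in));
            }
        }
        return texts;
    }
}
```

Keeping the two interfaces separate means an IMAP data source could reuse the same extractors, since an extractor sees only a stream and never needs to know where it came from.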
42 | | Right from the beginning we will support the following file types: |
43 | | |
44 | | * Plain text |
45 | | * HTML |
46 | | * XML |
47 | | * PDF (Portable Document Format) |
48 | | * RTF (Rich Text Format) |
49 | | * Microsoft Word 97+ |
50 | | * Microsoft Excel 97+ |
51 | | * Microsoft PowerPoint 97+ |
52 | | * Microsoft Works |
53 | | * OpenOffice 1.0+: Writer, Calc, Impress, Draw |
54 | | * StarOffice 6.0+: Writer, Calc, Impress, Draw |
55 | | * WordPerfect 5.x |
56 | | * Emails |
57 | | * IMAP Servers |
58 | | |
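Supporting this many file types implies some way of deciding which extractor handles a given object. The sketch below shows one simple approach, a lookup by file extension. The class `FileTypeDetector` and its methods are hypothetical, and the extension table is an illustrative subset. A real implementation would probably also consult MIME types or file signatures, since extensions can be missing or wrong.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper: maps a file name to a format label via its extension.
class FileTypeDetector {
    private final Map<String, String> formatByExtension = new HashMap<>();

    // Registers one format label for one or more extensions (case-insensitive).
    void register(String format, String... extensions) {
        for (String ext : extensions) {
            formatByExtension.put(ext.toLowerCase(Locale.ROOT), format);
        }
    }

    // Returns the format for a file name, or empty if there is no known extension.
    Optional<String> detect(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return Optional.empty();
        }
        String ext = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return Optional.ofNullable(formatByExtension.get(ext));
    }

    // Illustrative subset of the formats listed above.
    static FileTypeDetector withDefaults() {
        FileTypeDetector d = new FileTypeDetector();
        d.register("Plain text", "txt");
        d.register("HTML", "html", "htm");
        d.register("XML", "xml");
        d.register("PDF", "pdf");
        d.register("RTF", "rtf");
        d.register("Microsoft Word", "doc");
        d.register("Microsoft Excel", "xls");
        d.register("Microsoft PowerPoint", "ppt");
        d.register("OpenOffice Writer", "sxw");
        d.register("OpenOffice Calc", "sxc");
        return d;
    }
}
```

For example, `FileTypeDetector.withDefaults().detect("Report.PDF")` yields `Optional.of("PDF")`, while a name without an extension such as `README` yields an empty result that the caller can route to a fallback extractor.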