13 | | * Extract metadata like author, date, subject and more from the data sources |
14 | | * open the data objects for viewing |
15 | | * Fully configurable framework, storing and editing config files is done through a SWING gui. |
16 | | * Pluggable architecture: can be easily extended, can be easily integrated to other projects. |
| 13 | * Extract metadata like author, date, subject and more from the data sources and file formats |
| 14 | * Open data objects for viewing |
| 15 | * Fully configurable framework; config files are stored and edited through a Swing GUI
| 16 | * Pluggable architecture: easy to extend and easy to integrate into other projects
22 | | * DataSource Interface |
23 | | * TextExtractor Interface |
24 | | * DataSource implementation for Filesystem |
25 | | * DataSource implementation for IMAP mail servers |
26 | | * TextExtractor implementation for everything we know: PDF, Word, Fulltext, excel |
27 | | * OSGI bindings and connector code |
28 | | * Configuration gui |
29 | | * Sample application showing how to use it, with gui (=either Autofocus or Sesame or Gnowsis)
30 | | * Metadata format description (RDFS schema) and example file for the metadata |
| 22 | * !DataSource interface |
| 23 | * !DataSource implementations for file systems, websites (or rather hypertextual sources in general) and IMAP servers |
| 24 | * Near future work: !OutlookSource, !MozillaSource/ThunderbirdSource |
32 | | Right from the beginning we will support the following file types: |
| 26 | * !DataAccessor interface |
| 27 | * !DataAccessor implementations for file, http(s) and imap schemes |
| 28 | |
| 29 | * !DataCrawler interface |
| 30 | * One basic !DataCrawler implementation for every !DataSource type |
| 31 | * Later maybe more specialized !DataCrawler implementations, e.g. a !WindowsFileSystemCrawler with OS-specific optimizations |
| 32 | |
| 33 | * Extractor interface |
| 34 | * Extractor implementation for everything we can easily support: PDF, Word, Excel, HTML, plain text, ... |
| 35 | * New domain for us but also probably very doable: PNG, JPG, AVI, ... |
| 36 | |
| 37 | * !ArchiveExtractor interface |
| 38 | * !ArchiveExtractor implementations for Zip and Gzip |
| 39 | |
| 40 | * !LinkExtractor interface |
| 41 | * !LinkExtractor implementation for HTML and XHTML |
| 42 | * Later maybe PDF, Flash, ... |
| 43 | |
| 44 | * !MimetypeIdentifier interface |
| 45 | * Basic !MimetypeIdentifier implementation based on magic numbers; this is essential for choosing the right Extractor, !LinkExtractor or !ArchiveExtractor implementation for a given file
| 46 | |
| 47 | * [http://www.osgi.org/ OSGi] bindings and connector code (can be realized so that code is also usable outside an OSGi-based application) |
| 48 | * Configuration GUI (what needs to be configured? isn't this very application-specific?)
| 49 | * Sample GUI application showing how to use it. Can also be used as a test application, e.g. when developing new Extractor implementations.
| 50 | * Metadata format descriptions (RDFS schema) and example metadata files |
| 51 | |
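To illustrate the magic-number item in the list above, here is a minimal sketch of how a file's leading bytes could be mapped to a MIME type and then to the family of component that should process it. This is not the project's actual API: the class `MagicNumberSketch`, the `Signature` type, the `HandlerKind` enum and the `identify` method are all hypothetical names chosen for this example. Only the byte signatures themselves (PDF, PNG, Zip, Gzip) are standard.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

public class MagicNumberSketch {

    /** Hypothetical: which family of component should process the file. */
    enum HandlerKind { EXTRACTOR, ARCHIVE_EXTRACTOR }

    /** Hypothetical: leading bytes, the MIME type they indicate, and the handler family. */
    static final class Signature {
        final byte[] magic;
        final String mimeType;
        final HandlerKind handler;

        Signature(int[] magic, String mimeType, HandlerKind handler) {
            // int[] is used so that values above 0x7F can be written without casts.
            this.magic = new byte[magic.length];
            for (int i = 0; i < magic.length; i++) {
                this.magic[i] = (byte) magic[i];
            }
            this.mimeType = mimeType;
            this.handler = handler;
        }

        /** True if the given file head starts with this signature's bytes. */
        boolean matches(byte[] head) {
            if (head.length < magic.length) {
                return false;
            }
            for (int i = 0; i < magic.length; i++) {
                if (head[i] != magic[i]) {
                    return false;
                }
            }
            return true;
        }
    }

    private static final List<Signature> SIGNATURES = new ArrayList<>();

    static {
        // "%PDF"
        SIGNATURES.add(new Signature(new int[] {0x25, 0x50, 0x44, 0x46},
                "application/pdf", HandlerKind.EXTRACTOR));
        // PNG eight-byte file signature
        SIGNATURES.add(new Signature(new int[] {0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A},
                "image/png", HandlerKind.EXTRACTOR));
        // "PK\003\004" (Zip local file header)
        SIGNATURES.add(new Signature(new int[] {0x50, 0x4B, 0x03, 0x04},
                "application/zip", HandlerKind.ARCHIVE_EXTRACTOR));
        // Gzip member header
        SIGNATURES.add(new Signature(new int[] {0x1F, 0x8B},
                "application/gzip", HandlerKind.ARCHIVE_EXTRACTOR));
    }

    /** Returns the first signature matching the start of the file, if any. */
    static Optional<Signature> identify(byte[] head) {
        for (Signature s : SIGNATURES) {
            if (s.matches(head)) {
                return Optional.of(s);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        byte[] head = "%PDF-1.4\n".getBytes(StandardCharsets.US_ASCII);
        Optional<Signature> hit = identify(head);
        if (hit.isPresent()) {
            System.out.println(hit.get().mimeType + " -> " + hit.get().handler);
        } else {
            System.out.println("unknown type");
        }
    }
}
```

Formats without a fixed signature, such as HTML or plain text, would need a different strategy (for example content sniffing or the file extension), which this sketch leaves out.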
| 52 | == Supported File Formats == |
| 53 | |
| 54 | Right from the beginning we will support these file formats: |
43 | | * OpenOffice 1.0+: Writer, Calc, Impress, Draw |
44 | | * StarOffice 6.0+: Writer, Calc, Impress, Draw |
45 | | * WordPerfect 5.x |
46 | | * Emails |
47 | | * IMAP Servers |
| 65 | * !OpenOffice 1.0+: Writer, Calc, Impress, Draw |
| 66 | * !StarOffice 6.0+: Writer, Calc, Impress, Draw |
| 67 | * !OpenDocument (!OpenOffice 2.0+) |
| 68 | * !WordPerfect 5.x |
| 69 | * Emails (.eml files) |