Version 2 (modified by Leo Sauermann <leo.sauermann@…>, 19 years ago)
Semantic Data Access by Aduna & DFKI
The framework extracts data and fulltext from various data sources and stores them in systems like Gnowsis or Aduna Metadata Server.
SourceForge Project
Administrators: Christiaan Fluit & Leo Sauermann
Source Code: Interfaces and standard implementations of the SeDAF
The source will contain all relevant information about semantic data extraction: everything needed to get started with a fulltext and metadata extraction framework. Our intent is that developers can download a single distribution file with a fully working environment that also includes adapter and extractor implementations. Developers can use this package to fill their Lucene-based applications or other data stores.
The features of the framework will be:
- Easy to use: easy to learn, easy to code, easy to deploy in industrial projects
- Extract fulltext from many common file formats and information systems like IMAP email servers
- Extract metadata like author, date, subject and more from the data sources
- Open the data objects for viewing
- Fully configurable framework: storing and editing config files is done through a Swing GUI
- Pluggable architecture: can be easily extended and easily integrated into other projects
- Architecture based on the industry standard OSGi
- Compatible with RDF, but not solely based on it
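The last point, metadata that is RDF-compatible without requiring RDF, can be illustrated with a minimal sketch. It assumes metadata is held as a plain key/value map and is only serialized to RDF (here N-Triples) on demand. The class `MetadataSketch`, its method names, and the use of Dublin Core element names as keys are assumptions made for illustration; they are not taken from the SeDAF source or its RDFS schema.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: not part of the SeDAF API.
final class MetadataSketch {

    // Dublin Core element namespace, used here only as an example vocabulary.
    static final String DC = "http://purl.org/dc/elements/1.1/";

    /** Escapes the characters that must not appear raw in an N-Triples literal. */
    static String escape(String value) {
        return value.replace("\\", "\\\\")
                    .replace("\"", "\\\"")
                    .replace("\n", "\\n")
                    .replace("\r", "\\r");
    }

    /**
     * Serializes a plain metadata map as N-Triples, one statement per entry.
     * Keys are assumed to be Dublin Core element names (creator, date, subject);
     * subjectUri is assumed to be a valid absolute URI.
     */
    static String toNTriples(String subjectUri, Map<String, String> metadata) {
        StringBuilder out = new StringBuilder();
        for (Map.Entry<String, String> entry : metadata.entrySet()) {
            out.append('<').append(subjectUri).append("> <")
               .append(DC).append(entry.getKey()).append("> \"")
               .append(escape(entry.getValue())).append("\" .\n");
        }
        return out.toString();
    }

    /** Builds a small example map; a non-RDF consumer could use it directly. */
    static Map<String, String> exampleMetadata() {
        Map<String, String> metadata = new LinkedHashMap<>();
        metadata.put("creator", "Jane Doe");
        metadata.put("subject", "Quarterly \"draft\" report");
        return metadata;
    }
}
```

An application that does not use RDF can read the map as-is, while an RDF store can be fed the output of `toNTriples`.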
Components in the framework are:
- DataSource Interface
- TextExtractor Interface
- DataSource implementation for Filesystem
- DataSource implementation for IMAP mail servers
- TextExtractor implementation for everything we know: PDF, Word, Fulltext, Excel
- OSGi bindings and connector code
- Configuration GUI
- Sample application showing how to use it, with GUI (either Autofocus, Sesame or Gnowsis)
- Metadata format description (RDFS schema) and example file for the metadata
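To make the split between the DataSource and TextExtractor components more concrete, here is a minimal sketch of how the two could interact. Only the names `DataSource` and `TextExtractor` come from the list above; every method signature, the `DataObject` class, the `PlainTextExtractor`, and the in-memory source are assumptions for illustration and do not reflect the actual SeDAF interfaces.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: none of these signatures are the real SeDAF interfaces.
final class ComponentSketch {

    /** One item delivered by a data source: identifier, MIME type, raw bytes. */
    static final class DataObject {
        final String uri;
        final String mimeType;
        private final byte[] content;

        DataObject(String uri, String mimeType, byte[] content) {
            this.uri = uri;
            this.mimeType = mimeType;
            this.content = content;
        }

        InputStream open() {
            return new ByteArrayInputStream(content);
        }
    }

    /** A source of data objects, e.g. a file system folder or an IMAP mailbox. */
    interface DataSource {
        List<DataObject> listObjects();
    }

    /** Turns the raw bytes of one supported format into plain text. */
    interface TextExtractor {
        boolean supports(String mimeType);
        String extractText(InputStream in) throws IOException;
    }

    /** Simplest possible extractor: decodes the whole stream as UTF-8. */
    static final class PlainTextExtractor implements TextExtractor {
        @Override
        public boolean supports(String mimeType) {
            return "text/plain".equals(mimeType);
        }

        @Override
        public String extractText(InputStream in) throws IOException {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[4096];
            int read;
            while ((read = in.read(buffer)) != -1) {
                out.write(buffer, 0, read);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        }
    }

    /** Runs the first matching extractor on each object; unsupported objects are skipped. */
    static Map<String, String> extractAll(DataSource source, List<TextExtractor> extractors) {
        Map<String, String> textByUri = new LinkedHashMap<>();
        for (DataObject object : source.listObjects()) {
            for (TextExtractor extractor : extractors) {
                if (!extractor.supports(object.mimeType)) {
                    continue;
                }
                try (InputStream in = object.open()) {
                    textByUri.put(object.uri, extractor.extractText(in));
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
                break;
            }
        }
        return textByUri;
    }

    /** In-memory demo standing in for a real file system or IMAP source. */
    static Map<String, String> demo() {
        DataSource source = () -> Arrays.asList(
            new DataObject("mem:/notes.txt", "text/plain",
                "hello world".getBytes(StandardCharsets.UTF_8)),
            new DataObject("mem:/scan.bin", "application/octet-stream",
                new byte[] {1, 2, 3}));
        return extractAll(source, Arrays.<TextExtractor>asList(new PlainTextExtractor()));
    }
}
```

The resulting map of URI to extracted text is the kind of output that could then be handed to an index or a metadata store. Keeping sources and extractors behind separate interfaces is what would let new formats or new systems be plugged in independently.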
Right from the beginning we will support the following file types:
- Plain text
- HTML
- XML
- PDF (Portable Document Format)
- RTF (Rich Text Format)
- Microsoft Word 97+
- Microsoft Excel 97+
- Microsoft PowerPoint 97+
- Microsoft Works
- OpenOffice 1.0+: Writer, Calc, Impress, Draw
- StarOffice 6.0+: Writer, Calc, Impress, Draw
- WordPerfect 5.x
- Emails
- IMAP Servers
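Supporting a list of formats like this implies some way of deciding which extractor handles a given object. The page does not say how SeDAF detects file types, so the following is only a sketch of one common approach: a lookup from file extension to MIME type. The class `FileTypeSketch` and its method are hypothetical, and only a subset of the formats above is covered.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch: extension-based detection is an assumption, not SeDAF behaviour.
final class FileTypeSketch {

    private static final Map<String, String> MIME_BY_EXTENSION = new LinkedHashMap<>();

    static {
        MIME_BY_EXTENSION.put("txt", "text/plain");
        MIME_BY_EXTENSION.put("html", "text/html");
        MIME_BY_EXTENSION.put("xml", "application/xml");
        MIME_BY_EXTENSION.put("pdf", "application/pdf");
        MIME_BY_EXTENSION.put("rtf", "application/rtf");
        MIME_BY_EXTENSION.put("doc", "application/msword");
        MIME_BY_EXTENSION.put("xls", "application/vnd.ms-excel");
        MIME_BY_EXTENSION.put("ppt", "application/vnd.ms-powerpoint");
    }

    /** Returns the MIME type for a file name, or empty if the extension is unknown or missing. */
    static Optional<String> guessMimeType(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return Optional.empty();
        }
        String extension = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return Optional.ofNullable(MIME_BY_EXTENSION.get(extension));
    }
}
```

Extension lookup is cheap but easy to fool. A more robust detector would also inspect the first bytes of the content, which this sketch leaves out.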
Credits
The following third-party libraries have helped make the metadata framework the success that it is. These freely available libraries deserve a lot of credit for that, and we highly recommend them to others as well!
- Gnowsis: http://www.gnowsis.org/
- HtmlParser: http://htmlparser.sourceforge.net/
- Idmeta: http://www.geocities.com/marcoschmidt.geo/
- Jakarta Commons FileUpload: http://jakarta.apache.org/commons/fileupload/
- Jakarta Lucene: http://jakarta.apache.org/lucene/
- Jakarta POI: http://jakarta.apache.org/poi/
- Java Look and Feel Graphics Repository: http://java.sun.com/developer/techDocs/hi/repository/
- JavaBeans Activation Framework: http://java.sun.com/products/javabeans/glasgow/jaf.html
- JavaMail API: http://java.sun.com/products/javamail/
- JGoodies Looks: http://www.jgoodies.com/freeware/looks/
- NGramJ: http://ngramj.sourceforge.net/
- PDFBox: http://www.pdfbox.org/
- Sesame: http://www.openrdf.org/
- WinLAF: https://winlaf.dev.java.net/
- Xpdf: http://www.foolabs.com/xpdf/
License
The SeDAF is published under a BSD or CPL compatible license.
Attachments (2)
- aperture_overview.ppt (35.0 KB) - added by sauermann 19 years ago. Rough-cut overview of the framework
- API changes (20051114).txt (9.5 KB) - added by chris 19 years ago.