Semantic Data Access by Aduna & DFKI
The framework extracts data and full text from various data sources and
stores them in systems such as Gnowsis or the Aduna Metadata Server.
SourceForge Project
Administrators: Christiaan Fluit & Leo Sauermann
Source Code: Interfaces and standard implementations of the SeDAF
The source will contain everything relevant to semantic data extraction
and everything needed to get started with a full-text and metadata
extraction framework. Our intent is that developers can download a
single distribution file containing a fully working environment,
including adapter and extractor implementations. Developers can use
this package to populate their Lucene-based applications or other
data stores.
The features of the framework will be:
- Easy to use: easy to learn, easy to code, and easy to deploy in
industrial projects
- Extract fulltext from many common file formats and information
systems like IMAP email servers
- Extract metadata like author, date, subject and more from the
data sources
- Open the data objects for viewing
- Fully configurable framework: config files are stored and edited
through a Swing GUI.
- Pluggable architecture: easy to extend and easy to integrate into
other projects.
- Architecture based on the industry standard OSGi
- Compatible with RDF, but not solely based on it
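As an illustration of the metadata and RDF points above, here is a minimal sketch that writes extracted metadata as N-Triples. It assumes Dublin Core element properties (creator, date, subject) as the vocabulary; the class MetadataToNTriples and its methods are hypothetical and not part of the SeDAF, whose actual metadata schema may differ.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper, not part of the SeDAF. */
final class MetadataToNTriples {
    // Dublin Core element namespace; using it here is an assumption.
    static final String DC = "http://purl.org/dc/elements/1.1/";

    /** Escapes backslashes, quotes and line breaks in a literal value. */
    static String escape(String value) {
        return value.replace("\\", "\\\\")
                    .replace("\"", "\\\"")
                    .replace("\n", "\\n")
                    .replace("\r", "\\r");
    }

    /** Emits one triple per metadata entry: subject, dc:key, literal value. */
    static String toNTriples(String subjectUri, Map<String, String> metadata) {
        StringBuilder out = new StringBuilder();
        for (Map.Entry<String, String> entry : metadata.entrySet()) {
            out.append('<').append(subjectUri).append("> ")
               .append('<').append(DC).append(entry.getKey()).append("> ")
               .append('"').append(escape(entry.getValue())).append("\" .\n");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Sample values are made up for the example.
        Map<String, String> metadata = new LinkedHashMap<>();
        metadata.put("creator", "Jane Doe");
        metadata.put("date", "2005-01-31");
        metadata.put("subject", "Quarterly report");
        System.out.print(toNTriples("file:///docs/report.pdf", metadata));
    }
}
```

Because the output is plain text, a store that does not speak RDF can ignore it and read the same key/value map directly, which is one way to stay compatible with RDF without depending on it.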
Components in the framework are:
- DataSource Interface
- TextExtractor Interface
- DataSource implementation for Filesystem
- DataSource implementation for IMAP mail servers
- TextExtractor implementations for the formats we know: PDF, Word,
plain text, Excel
- OSGi bindings and connector code
- Configuration GUI
- Sample application with a GUI showing how to use it (either
Autofocus, Sesame or Gnowsis)
- Metadata format description (RDFS schema) and example file for
the metadata
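To make the split between the two interfaces concrete, here is a minimal sketch of how a DataSource and a TextExtractor could cooperate. Only the names DataSource and TextExtractor come from the component list; every method signature, the DataObject class, and the in-memory data source are assumptions made for illustration, not the actual SeDAF interfaces.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical: one item (a file, an email, ...) delivered by a data source. */
final class DataObject {
    final String uri;
    final String mimeType;
    final byte[] content;

    DataObject(String uri, String mimeType, byte[] content) {
        this.uri = uri;
        this.mimeType = mimeType;
        this.content = content;
    }
}

/** Hypothetical shape: enumerates the objects of a file system, IMAP server, ... */
interface DataSource {
    List<DataObject> listObjects();
}

/** Hypothetical shape: turns the raw bytes of one format into plain text. */
interface TextExtractor {
    boolean supports(String mimeType);
    String extractText(InputStream in) throws IOException;
}

/** Trivial extractor: plain text is decoded as UTF-8 (an assumption). */
final class PlainTextExtractor implements TextExtractor {
    @Override public boolean supports(String mimeType) {
        return "text/plain".equals(mimeType);
    }
    @Override public String extractText(InputStream in) throws IOException {
        return new String(in.readAllBytes(), StandardCharsets.UTF_8);
    }
}

/** Stand-in for the file system or IMAP implementations. */
final class InMemoryDataSource implements DataSource {
    private final List<DataObject> objects = new ArrayList<>();
    void add(DataObject object) { objects.add(object); }
    @Override public List<DataObject> listObjects() { return new ArrayList<>(objects); }
}

final class ExtractionSketch {
    /** Returns URI -> extracted text for every object that some extractor supports. */
    static Map<String, String> extractAll(DataSource source, List<TextExtractor> extractors) {
        Map<String, String> textByUri = new LinkedHashMap<>();
        for (DataObject object : source.listObjects()) {
            for (TextExtractor extractor : extractors) {
                if (!extractor.supports(object.mimeType)) continue;
                try (InputStream in = new ByteArrayInputStream(object.content)) {
                    textByUri.put(object.uri, extractor.extractText(in));
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
                break; // the first matching extractor wins
            }
        }
        return textByUri;
    }

    public static void main(String[] args) {
        InMemoryDataSource source = new InMemoryDataSource();
        source.add(new DataObject("mem:/notes.txt", "text/plain",
                "hello".getBytes(StandardCharsets.UTF_8)));
        System.out.println(extractAll(source, List.of(new PlainTextExtractor())));
    }
}
```

Keeping the data source unaware of file formats and the extractor unaware of where the bytes came from is what lets new sources and new formats be added independently.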
From the start we will support the following file types and sources:
- Plain text
- HTML
- XML
- PDF (Portable Document Format)
- RTF (Rich Text Format)
- Microsoft Word 97+
- Microsoft Excel 97+
- Microsoft PowerPoint 97+
- Microsoft Works
- OpenOffice 1.0+: Writer, Calc, Impress, Draw
- StarOffice 6.0+: Writer, Calc, Impress, Draw
- WordPerfect 5.x
- Emails
- IMAP Servers
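Supporting many formats means deciding which format a given file has before an extractor is chosen. The following is a minimal sketch of one simple approach, a lookup by file extension, covering a few of the types listed above. The class FileTypeGuesser is hypothetical, and extension-based detection is an assumption; the framework may well inspect file contents instead.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

/** Hypothetical helper, not part of the SeDAF. */
final class FileTypeGuesser {
    private static final Map<String, String> MIME_BY_EXTENSION = new LinkedHashMap<>();
    static {
        MIME_BY_EXTENSION.put("txt", "text/plain");
        MIME_BY_EXTENSION.put("html", "text/html");
        MIME_BY_EXTENSION.put("htm", "text/html");
        MIME_BY_EXTENSION.put("xml", "application/xml");
        MIME_BY_EXTENSION.put("pdf", "application/pdf");
        MIME_BY_EXTENSION.put("rtf", "application/rtf");
        MIME_BY_EXTENSION.put("doc", "application/msword");
        MIME_BY_EXTENSION.put("xls", "application/vnd.ms-excel");
        MIME_BY_EXTENSION.put("ppt", "application/vnd.ms-powerpoint");
    }

    /** Returns the MIME type for a known extension, or empty if unknown. */
    static Optional<String> guessMimeType(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return Optional.empty(); // no extension at all
        }
        String extension = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return Optional.ofNullable(MIME_BY_EXTENSION.get(extension));
    }

    public static void main(String[] args) {
        System.out.println(guessMimeType("Report.PDF"));
        System.out.println(guessMimeType("README"));
    }
}
```

Returning an empty Optional for unknown types lets the caller decide whether to skip the file or fall back to treating it as plain text.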
Credits
The following third-party libraries have helped make the metadata
framework the success that it is. These freely available libraries
deserve a lot of credit for that, and we highly recommend them to
others as well!
- Gnowsis: http://www.gnowsis.org/
- HtmlParser: http://htmlparser.sourceforge.net/
- Idmeta: http://www.geocities.com/marcoschmidt.geo/
- Jakarta Commons FileUpload:
http://jakarta.apache.org/commons/fileupload/
- Jakarta Lucene: http://jakarta.apache.org/lucene/
- Jakarta POI: http://jakarta.apache.org/poi/
- Java Look and Feel Graphics Repository:
http://java.sun.com/developer/techDocs/hi/repository/
- JavaBeans Activation Framework:
http://java.sun.com/products/javabeans/glasgow/jaf.html
- JavaMail API: http://java.sun.com/products/javamail/
- JGoodies Looks: http://www.jgoodies.com/freeware/looks/
- NGramJ: http://ngramj.sourceforge.net/
- PDFBox: http://www.pdfbox.org/
- Sesame: http://www.openrdf.org/
- WinLAF: https://winlaf.dev.java.net/
- Xpdf: http://www.foolabs.com/xpdf/
License
The SeDAF is published under a BSD- or CPL-compatible license.