
Version 1 (modified by Leo Sauermann <leo.sauermann@…>, 16 years ago)


<h1>Semantic Data Access by Aduna &amp; DFKI</h1> A framework to extract data and fulltext from various data sources and store them in systems like gnowsis or Aduna Metadata Server.<br> <h2>Sourceforge Project</h2> Administrators: Christiaan Fluit &amp; Leo Sauermann<br> Source Code: Interfaces and standard implementations of the SeDAF<br> <br> The source will contain all relevant information about semantic data extraction: everything needed to get started with a fulltext and metadata extraction framework. Our intent is that developers can download a single distribution file with a fully working environment that also includes adapter and extractor implementations. Developers can use this package to fill their Lucene-based applications or other data stores.<br> <br> The features of the framework will be:<br> <ul>

<li>Easy to use: easy to learn, easy to code, easy to deploy in industrial projects</li>
<li>Extract fulltext from many common file formats and information systems like IMAP email servers</li>
<li>Extract metadata like author, date, subject and more from the data sources</li>
<li>Open the data objects for viewing</li>
<li>Fully configurable framework: storing and editing config files is done through a Swing GUI.</li>
<li>Pluggable architecture: can be easily extended and easily integrated into other projects.</li>
<li>Architecture based on the industry standard OSGi</li>
<li>Compatible with RDF, but not solely based on it</li>

</ul> Components in the framework are:<br> <ul>

<li>DataSource Interface</li>
<li>TextExtractor Interface</li>
<li>DataSource implementation for Filesystem</li>
<li>DataSource implementation for IMAP mail servers</li>
<li>TextExtractor implementation for everything we know: PDF, Word, Fulltext, Excel</li>
<li>OSGi bindings and connector code</li>
<li>Configuration GUI</li>
<li>Sample application showing how to use it, with GUI (either Autofocus, Sesame or Gnowsis)</li>
<li>Metadata format description (RDFS schema) and example file for the metadata</li>

</ul> Right from the beginning we will support the following file types:<br> <ul>

<li>Plain text</li>
<li>HTML</li>
<li>XML</li>
<li>PDF (Portable Document Format)</li>
<li>RTF (Rich Text Format)</li>
<li>Microsoft Word 97+</li>
<li>Microsoft Excel 97+</li>
<li>Microsoft PowerPoint 97+</li>
<li>Microsoft Works</li>
<li>OpenOffice 1.0+: Writer, Calc, Impress, Draw</li>
<li>StarOffice 6.0+: Writer, Calc, Impress, Draw</li>
<li>WordPerfect 5.x</li>
<li>Emails</li>
<li>IMAP Servers</li>

</ul> <h2>Credits</h2> The following third-party libraries have helped make the metadata framework the success that it is. These freely available libraries deserve a lot of credit for that, and we highly recommend them to others as well!<br> <ul>

<li>Gnowsis: http://www.gnowsis.org/</li>
<li>HtmlParser: http://htmlparser.sourceforge.net/</li>
<li>Idmeta: http://www.geocities.com/marcoschmidt.geo/</li>
<li>Jakarta Commons FileUpload: http://jakarta.apache.org/commons/fileupload/</li>
<li>Jakarta Lucene: http://jakarta.apache.org/lucene/</li>
<li>Jakarta POI: http://jakarta.apache.org/poi/</li>
<li>Java Look and Feel Graphics Repository: http://java.sun.com/developer/techDocs/hi/repository/</li>
<li>JavaBeans Activation Framework: http://java.sun.com/products/javabeans/glasgow/jaf.html</li>
<li>JavaMail API: http://java.sun.com/products/javamail/</li>
<li>JGoodies Looks: http://www.jgoodies.com/freeware/looks/</li>
<li>NGramJ: http://ngramj.sourceforge.net/</li>
<li>PDFBox: http://www.pdfbox.org/</li>
<li>Sesame: http://www.openrdf.org/</li>
<li>WinLAF: https://winlaf.dev.java.net/</li>
<li>Xpdf: http://www.foolabs.com/xpdf/</li>

</ul> <h2>License</h2> The SeDAF is published under a BSD- or CPL-compatible license.<br>
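As an illustration of the component list above, the following is a minimal sketch of how a DataSource interface and a TextExtractor interface might fit together. It is not the SeDAF API: every name and signature here (SedafSketch, DataObject, listObjects, supports, extractText, extractAll, InMemoryDataSource, PlainTextExtractor) is hypothetical, and an in-memory source stands in for the filesystem and IMAP implementations.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SedafSketch {

    /** Hypothetical: one item delivered by a data source (a file, an email, ...). */
    static final class DataObject {
        final String uri;
        final String mimeType;
        final byte[] content;

        DataObject(String uri, String mimeType, byte[] content) {
            this.uri = uri;
            this.mimeType = mimeType;
            this.content = content;
        }
    }

    /** Hypothetical stand-in for the "DataSource Interface" component. */
    interface DataSource {
        List<DataObject> listObjects();
    }

    /** Hypothetical stand-in for the "TextExtractor Interface" component. */
    interface TextExtractor {
        boolean supports(String mimeType);

        String extractText(DataObject object);
    }

    /** Extractor for plain text: decodes the raw bytes as UTF-8. */
    static final class PlainTextExtractor implements TextExtractor {
        @Override
        public boolean supports(String mimeType) {
            return "text/plain".equals(mimeType);
        }

        @Override
        public String extractText(DataObject object) {
            return new String(object.content, StandardCharsets.UTF_8);
        }
    }

    /** In-memory source, used here in place of a real filesystem or IMAP source. */
    static final class InMemoryDataSource implements DataSource {
        private final List<DataObject> objects;

        InMemoryDataSource(List<DataObject> objects) {
            this.objects = objects;
        }

        @Override
        public List<DataObject> listObjects() {
            return objects;
        }
    }

    /**
     * Passes each object to the first extractor that supports its MIME type.
     * Objects without a matching extractor are skipped.
     */
    static Map<String, String> extractAll(DataSource source, List<TextExtractor> extractors) {
        Map<String, String> fulltextByUri = new LinkedHashMap<>();
        for (DataObject object : source.listObjects()) {
            for (TextExtractor extractor : extractors) {
                if (extractor.supports(object.mimeType)) {
                    fulltextByUri.put(object.uri, extractor.extractText(object));
                    break;
                }
            }
        }
        return fulltextByUri;
    }

    public static void main(String[] args) {
        DataSource source = new InMemoryDataSource(List.of(
                new DataObject("mem://note.txt", "text/plain",
                        "hello SeDAF".getBytes(StandardCharsets.UTF_8)),
                new DataObject("mem://image.png", "image/png", new byte[] {1, 2, 3})));

        Map<String, String> fulltext = extractAll(source, List.of(new PlainTextExtractor()));
        fulltext.forEach((uri, text) -> System.out.println(uri + " -> " + text));
    }
}
```

Keeping the source and the extractor behind separate interfaces is what would let new formats or new information systems be plugged in independently, in the spirit of the pluggable architecture described in the feature list. The resulting map of URI to fulltext is the kind of data an application could then feed into a Lucene index or another data store.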
