| 1 | <h1>Semantic Data Access by Aduna & DFKI <br> |
| 2 | </h1> |
| 3 | A framework for extracting data and fulltext from various data sources and storing them in |
| 4 | systems like Gnowsis or Aduna Metadata Server.<br> |
| 5 | <h2>SourceForge Project</h2> |
| 6 | Administrators: Christiaan Fluit & Leo Sauermann<br> |
| 7 | Source Code: Interfaces and standard implementations of the SeDAF<br> |
| 8 | <br> |
| 9 | The source will contain all relevant information about semantic data |
| 10 | extraction: everything that is needed to get started with a fulltext |
| 11 | and metadata extraction framework. Our intent is that developers can |
| 12 | download a single distribution file with a fully working environment |
| 13 | that also includes adapter and extractor implementations. Developers |
| 14 | can use this package to fill their Lucene-based applications or other |
| 15 | data stores.<br> |
| 16 | <br> |
| 17 | The features of the framework will be:<br> |
| 18 | <ul> |
| 19 | <li>Easy to use: easy to learn, easy to code, easy to deploy in |
| 20 | industrial projects<br> |
| 21 | </li> |
| 22 | <li>Extract fulltext from many common file formats and information |
| 23 | systems like IMAP email servers</li> |
| 24 | <li>Extract metadata like author, date, subject and more from the |
| 25 | data sources</li> |
| 26 | <li>Open the data objects for viewing<br> |
| 27 | </li> |
| 28 | <li>Fully configurable framework: storing and editing config files is |
| 29 | done through a Swing GUI.</li> |
| 30 | <li>Pluggable architecture: can be easily extended and easily |
| 31 | integrated into other projects. <br> |
| 32 | </li> |
| 33 | <li>Architecture based on the industry standard OSGi</li> |
| 34 | <li>Compatible with RDF, but not solely based on it</li> |
| 35 | </ul> |
| 36 | Components in the framework are:<br> |
| 37 | <ul> |
| 38 | <li>DataSource Interface</li> |
| 39 | <li>TextExtractor Interface</li> |
| 40 | <li>DataSource implementation for Filesystem</li> |
| 41 | <li>DataSource implementation for IMAP mail servers</li> |
| 42 | <li>TextExtractor implementations for every format we know: PDF, Word, |
| 43 | Fulltext, Excel</li> |
| 44 | <li>OSGi bindings and connector code<br> |
| 45 | </li> |
| 46 | <li>Configuration GUI</li> |
| 47 | <li>Sample application showing how to use it, with a GUI (either |
| 48 | Autofocus, Sesame or Gnowsis)</li> |
| 49 | <li>Metadata format description (RDFS schema) and example file for |
| 50 | the metadata<br> |
| 51 | </li> |
| 52 | </ul> |
| 53 | Right from the beginning we will support the following file types:<br> |
| 54 | <ul> |
| 55 | <li>Plain text</li> |
| 56 | <li>HTML</li> |
| 57 | <li>XML</li> |
| 58 | <li>PDF (Portable Document Format)</li> |
| 59 | <li>RTF (Rich Text Format)</li> |
| 60 | <li>Microsoft Word 97+</li> |
| 61 | <li>Microsoft Excel 97+</li> |
| 62 | <li>Microsoft PowerPoint 97+</li> |
| 63 | <li>Microsoft Works</li> |
| 64 | <li>OpenOffice 1.0+: Writer, Calc, Impress, Draw</li> |
| 65 | <li>StarOffice 6.0+: Writer, Calc, Impress, Draw</li> |
| 66 | <li>WordPerfect 5.x</li> |
| 67 | <li>Emails</li> |
| 68 | <li>IMAP Servers</li> |
| 69 | </ul> |
| 70 | <h2>Credits<br> |
| 71 | </h2> |
| 72 | The following third-party libraries have helped make the metadata |
| 73 | framework |
| 74 | the success that it is. These freely available libraries deserve |
| 75 | a lot of credit for that, and we highly recommend them to others |
| 76 | as well!<br> |
| 77 | <ul> |
| 78 | <li>Gnowsis: http://www.gnowsis.org/</li> |
| 79 | <li>HtmlParser: http://htmlparser.sourceforge.net/</li> |
| 80 | <li>Idmeta: http://www.geocities.com/marcoschmidt.geo/</li> |
| 81 | <li>Jakarta Commons FileUpload: |
| 82 | http://jakarta.apache.org/commons/fileupload/</li> |
| 83 | <li>Jakarta Lucene: http://jakarta.apache.org/lucene/</li> |
| 84 | <li>Jakarta POI: http://jakarta.apache.org/poi/</li> |
| 85 | <li>Java Look and Feel Graphics Repository: |
| 86 | http://java.sun.com/developer/techDocs/hi/repository/</li> |
| 87 | <li>JavaBeans Activation Framework: |
| 88 | http://java.sun.com/products/javabeans/glasgow/jaf.html</li> |
| 89 | <li>JavaMail API: http://java.sun.com/products/javamail/</li> |
| 90 | <li>JGoodies Looks: http://www.jgoodies.com/freeware/looks/</li> |
| 91 | <li>NGramJ: http://ngramj.sourceforge.net/</li> |
| 92 | <li>PDFBox: http://www.pdfbox.org/</li> |
| 93 | <li>Sesame: http://www.openrdf.org/</li> |
| 94 | <li>WinLAF: https://winlaf.dev.java.net/</li> |
| 95 | <li>Xpdf: http://www.foolabs.com/xpdf/</li> |
| 96 | </ul> |
| 97 | <h2>License</h2> |
| 98 | SeDAF is published under a BSD- or CPL-compatible license.<br> |