= Aperture Architecture =

== DataSources and Friends ==

The central parts in the architecture are currently DataSource, DataCrawler, DataAccessor and DataObject. Together they are used to access the contents of an information system, such as a file system or web site.

A DataSource contains all information necessary to locate the information items in a source. For example, a FileSystemDataSource has a set of one or more directories on a file system, a set of patterns that describe what files to include or exclude, etc.

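To make this concrete, a minimal sketch of what such a configuration object could hold is shown below; the class shape and method names are illustrative assumptions based on the description above, not the actual Aperture API:

    import java.io.File;
    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical sketch only: field and method names are assumptions
    // derived from the description above, not the real FileSystemDataSource.
    public class FileSystemDataSource {
        private final List<File> rootFolders = new ArrayList<File>();
        private final List<String> includePatterns = new ArrayList<String>();
        private final List<String> excludePatterns = new ArrayList<String>();

        public void addRootFolder(File folder) { rootFolders.add(folder); }
        public void addIncludePattern(String pattern) { includePatterns.add(pattern); }
        public void addExcludePattern(String pattern) { excludePatterns.add(pattern); }

        public List<File> getRootFolders() { return rootFolders; }
        public List<String> getIncludePatterns() { return includePatterns; }
        public List<String> getExcludePatterns() { return excludePatterns; }
    }
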
A DataCrawler is responsible for actually accessing the physical source and reporting the individual information items as DataObjects. Each DataObject contains all metadata provided by the data source, such as file names, modification dates, etc., as well as the InputStream providing access to the physical resource.

We have chosen to distinguish between a DataSource and a DataCrawler as there may be several alternative crawling strategies for a single DataSource type. Consider for example a generic FileSystemCrawler that handles any kind of file system accessible through java.io.File versus a WindowsFileSystemCrawler using OS-native functionality to get notified about file additions, deletions and changes. Another possibility is various DataCrawler implementations that have different trade-offs in speed and accuracy.

Currently, a DataSource also contains support for writing its configuration to or initializing it from an XML file. We might consider putting this in a separate utility class, because the best way to store such information is often application dependent.

A DataCrawler creates DataObjects for the individual information items it encounters in the data source. These DataObjects are reported to DataCrawlerListeners registered at the DataCrawler. An abstract base class (DataCrawlerBase) is provided that contains base functionality for maintaining information about which files have been reported in the past, allowing for incremental scanning.

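A rough sketch of how this crawler/listener interplay could look is given below; the interface names follow the text, but the method signatures are guesses rather than the actual API:

    // Hypothetical sketch; method names and signatures are assumptions.
    interface DataCrawlerListener {
        void objectNew(DataObject object);      // resource reported for the first time
        void objectChanged(DataObject object);  // resource changed since the last crawl
        void objectRemoved(String uri);         // previously reported resource is gone
    }

    interface DataCrawler {
        void addListener(DataCrawlerListener listener);
        void crawl();  // reports every encountered information item to the listeners
    }
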
In order to create a DataObject for a single resource encountered by the DataCrawler, a DataAccessor is used. This functionality is kept out of the DataCrawler implementations on purpose because there may be several crawlers that can make good use of the same data accessing functionality. Good examples are the FileSystemCrawler and the HypertextCrawler, which both make use of the FileDataAccessor. Although they arrive at the physical resource in different ways (by traversing folder trees vs. following links from other documents), they can use the same functionality to turn a java.io.File into a FileDataObject.

It should be clear now that a DataCrawler is specific to the kind of DataSource it supports, whereas a DataAccessor is specific to the URL scheme(s) it supports.

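One way to express that scheme-specificity, purely as an illustrative sketch (this registry class is made up for the example, not part of the design described here), is a small lookup table from URL scheme to DataAccessor:

    import java.util.HashMap;
    import java.util.Map;

    // Illustrative sketch only: a simple lookup from URL scheme to DataAccessor,
    // so that e.g. the "file" accessor can be shared by several crawlers.
    public class DataAccessorRegistry {
        private final Map<String, DataAccessor> accessorsByScheme =
                new HashMap<String, DataAccessor>();

        public void register(String scheme, DataAccessor accessor) {
            accessorsByScheme.put(scheme.toLowerCase(), accessor);
        }

        public DataAccessor get(String scheme) {
            return accessorsByScheme.get(scheme.toLowerCase());
        }
    }
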
The AccessData instance used in DataCrawlerBase maintains the information about which objects have been scanned before. This instance is passed to the DataAccessor, as that is the best place to do this change detection. For example, this allows the HttpDataAccessor to use HTTP-specific functionality to let the web server decide whether the resource has changed since the last scan, preventing an unchanged file from being transported to the crawling side in the first place.

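As an illustration of this HTTP-specific change detection, the sketch below issues a conditional GET with java.net.HttpURLConnection; how the last crawl date is obtained from AccessData is an assumption on my part:

    import java.io.IOException;
    import java.io.InputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;

    // Sketch of conditional fetching: the web server is asked to return the
    // content only if it changed after 'lastCrawled'. How that timestamp is
    // stored in AccessData is not part of this sketch.
    public class ConditionalFetch {
        public InputStream openIfModified(URL url, long lastCrawled) throws IOException {
            HttpURLConnection connection = (HttpURLConnection) url.openConnection();
            connection.setIfModifiedSince(lastCrawled);

            if (connection.getResponseCode() == HttpURLConnection.HTTP_NOT_MODIFIED) {
                return null;  // 304: unchanged, nothing gets transported
            }
            return connection.getInputStream();  // changed or unknown: fetch the content
        }
    }
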
== HypertextCrawler ==

The HypertextCrawler makes use of two external components: a MIME type identifier and a hypertext link extractor. The latter component is required to know which resources are linked from a specific resource and should be crawled next. This functionality is realized as a separate component/service as there are many document types that support links (PDF might be a nice one to support next); a specific link extractor is thus mimetype-specific. However, in order to know which link extractor to use, one first needs to know the MIME type of the starting resource, which is handled by the first component.

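A possible shape for these two components, sketched only to illustrate the division of work (names and signatures are assumptions, not a finalized API):

    import java.io.InputStream;
    import java.util.List;

    // Hypothetical sketches of the two components described above.
    interface MimeTypeIdentifier {
        // determines the MIME type of the resource, e.g. "text/html"
        String identify(InputStream stream, String resourceName);
    }

    interface LinkExtractor {
        // returns the URLs linked from a document of a known MIME type
        List<String> extractLinks(InputStream stream);
    }
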
== Email interpretation ==

The ImapDataAccessor is a fairly complex class that puts a lot of effort into interpreting a MIME message. Rather than just delivering the raw InputStream of the Message, it produces a DataObject with possible child DataObjects that reflects as closely as possible the way in which mail readers display the mail.

For example, what may seem to be a simple mail with a few headers and a body may in fact be a multipart mail with two alternative bodies, one in plain text and one in HTML. What conceptually is a single "information object" is spread over 4 different JavaMail objects (a MimeMessage with a Multipart containing two BodyParts, if I remember correctly). The ImapDataAccessor tries to hide this complexity of multiparts and just creates a single DataObject with headers and content.

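The sketch below shows the kind of flattening meant here, using plain JavaMail calls: for a multipart/alternative part only one body variant is kept (preferring plain text). It merely illustrates the idea and is not the actual ImapDataAccessor code:

    import java.io.IOException;
    import javax.mail.BodyPart;
    import javax.mail.MessagingException;
    import javax.mail.Multipart;
    import javax.mail.Part;

    public class MailBodyExtractor {
        // returns a single body text for the part, hiding the multipart structure
        public String extractBodyText(Part part) throws MessagingException, IOException {
            if (part.isMimeType("text/plain") || part.isMimeType("text/html")) {
                return (String) part.getContent();
            }
            if (part.isMimeType("multipart/alternative")) {
                Multipart alternatives = (Multipart) part.getContent();
                String fallback = null;
                for (int i = 0; i < alternatives.getCount(); i++) {
                    BodyPart alternative = alternatives.getBodyPart(i);
                    if (alternative.isMimeType("text/plain")) {
                        return (String) alternative.getContent();  // preferred variant
                    }
                    fallback = extractBodyText(alternative);       // e.g. the HTML variant
                }
                return fallback;
            }
            if (part.isMimeType("multipart/*")) {
                // for other multiparts, assume the first body part carries the main content
                Multipart parts = (Multipart) part.getContent();
                return extractBodyText(parts.getBodyPart(0));
            }
            return null;  // attachments etc. would become child DataObjects instead
        }
    }
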
It may be a good idea to adapt the other mail crawlers, such as the existing Outlook and Mozilla mail crawlers, so that they produce javax.mail.Message objects. We can then refactor the ImapDataAccessor so that this Message-interpretation code lives elsewhere, making it possible to also apply it to the Messages created by these other mail crawlers. This allows us to reuse the mail interpretation code across these mail formats.

If these other mail crawlers have access to the raw mail content (i.e. the message as transported through SMTP), this may be rather easy to realize, as the functionality to parse these lines and convert them into a Message data structure is part of JavaMail. We should see if this functionality is publicly available in the library.

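For what it is worth, JavaMail does expose this publicly: MimeMessage has a constructor that parses a raw RFC 822 message stream. A minimal sketch:

    import java.io.InputStream;
    import java.util.Properties;
    import javax.mail.MessagingException;
    import javax.mail.Session;
    import javax.mail.internet.MimeMessage;

    public class RawMessageParser {
        // parses the raw message bytes (as transported through SMTP) into a Message
        public MimeMessage parse(InputStream rawMessage) throws MessagingException {
            Session session = Session.getDefaultInstance(new Properties());
            return new MimeMessage(session, rawMessage);
        }
    }
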
== Extractors ==

This API is still under discussion; that's why I shipped the older TextExtractor implementations to DFKI.

The purpose of Extractor is to extract all information (full text and other) from an InputStream of a specific document. Extractors are therefore mimetype-specific.

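Purely as a strawman while the API is still under discussion, one possible shape (names and parameter types are assumptions):

    import java.io.InputStream;
    import java.util.Map;

    // Hypothetical sketch, not the final API: extracts the full text and other
    // metadata from the stream of a document with a known MIME type.
    interface Extractor {
        void extract(String uri, InputStream stream, String mimeType,
                     StringBuffer fullText, Map<String, Object> metadata);
    }
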
Todo: describe and discuss final API

== OSGi ==

Both Aduna and DFKI are in favour of using OSGi as a way to bundle these components. At Aduna we have followed a specific way of modelling a service, using a factory for every implementation of a service, and a separate registry that registers all implementations of a specific service. It is the responsibility of the bundle activator of a service to register an instance of a service implementation's factory with the service registry. This allows for a very light-weight initialization of the system, provided that creation of a factory instance is very light-weight.

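A sketch of that pattern in plain OSGi terms is shown below; the factory and service names are invented for the example:

    import org.osgi.framework.BundleActivator;
    import org.osgi.framework.BundleContext;

    // Sketch: the activator only registers a cheap factory object; the actual
    // service implementation is created later, on demand. DataCrawlerFactory and
    // FileSystemCrawlerFactory are made-up names for this example.
    public class FileSystemCrawlerActivator implements BundleActivator {

        public void start(BundleContext context) {
            context.registerService(
                    DataCrawlerFactory.class.getName(),
                    new FileSystemCrawlerFactory(),
                    null);
        }

        public void stop(BundleContext context) {
            // services registered by this bundle are unregistered automatically
        }
    }
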
Currently, Leo and Chris think that we should base our code only on pure OSGi code (i.e. org.osgi.*) and not use any other utilities such as the dependency manager that's currently used in the Aduna code. Perhaps Herko can tell us more about what we're in for, because we both have hardly any experience with OSGi yet.

== Archives ==

Some functionality that is still missing but that we at Aduna would really like to have is support for handling archives such as zip and rar files.

The interface for doing archive extraction will probably be a mixture of Extractor and DataSource/DataCrawler. On the one hand they will be mimetype-specific and will operate on an InputStream (perhaps a DataObject), just like Extractor; on the other hand they deliver a stream of new DataObjects.

A URI scheme also has to be developed for such nested objects, so that you can identify a stream packed inside an archive.

Support for zip and gzip is probably trivial, as these formats are already accessible through java.util.zip. Rar is another format we encounter sometimes. As far as I know there is no Java library available for it, but it is an open format, i.e. the specs are available.

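For the zip case, a minimal sketch with java.util.zip is shown below; how each entry is turned into a child DataObject, and which URI scheme is used for it, is exactly the open issue mentioned above:

    import java.io.IOException;
    import java.io.InputStream;
    import java.util.zip.ZipEntry;
    import java.util.zip.ZipInputStream;

    public class ZipWalker {
        // walks the entries of a zip archive delivered as an InputStream
        public void walk(InputStream archiveStream) throws IOException {
            ZipInputStream zip = new ZipInputStream(archiveStream);
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (!entry.isDirectory()) {
                    // 'zip' is now positioned at the uncompressed bytes of this entry;
                    // a child DataObject could be created here, named after entry.getName()
                }
                zip.closeEntry();
            }
            zip.close();
        }
    }
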
== Opening resources ==

Besides crawling resources, we should also be able to open them.

At first this may look like a job for the DataAccessor, which after all has knowledge about the details of the physical source.

On second thought, I believe that for opening files you need some other service, parallel to DataAccessor, that is also scheme-specific and that takes care of opening the files. Reasons:

- DataAccessors actually retrieve the files, which is not necessary for some file openers. For example, for opening a local file you can instruct Windows to do just that. Similarly, a web page can be retrieved and shown by a web browser; there is no need for us to retrieve the contents and feed it to the browser.

- There may be several alternative ways of opening a resource. For example, the java.net JDIC project contains functionality for opening files and web pages, whereas we have our own classes to do that.

This may be a good reason to decouple this functionality from the DataAccessor and run it in parallel.

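A hypothetical sketch of such a scheme-specific opener service, parallel to DataAccessor (the name and signature are illustrative only):

    import java.io.IOException;
    import java.net.URI;

    // Hypothetical sketch: an opener does not retrieve the resource itself, it
    // only hands it to the appropriate application (file association, browser, ...).
    interface ResourceOpener {
        // true if this opener handles the given URL scheme, e.g. "file" or "http"
        boolean canOpen(String scheme);

        // opens the resource identified by the URI in an external application
        void open(URI uri) throws IOException;
    }
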
== The use of RDF ==

We should discuss where and how RDF is used in this framework. In previous email discussions we already thought about using RDF as a way to let an Extractor output its extracted information, because of the flexibility it provides:

- no assumption on what the metadata looks like, can be very simple or very complex
- easy to store in RDF stores, no transformation necessary (provided that you have named graphs support)

The same technique could also be used in the DataObjects, which now use a Map with dedicated keys, defined per DataObject type. I would be in favour of changing this to "something RDF", as it considerably eases development.

Leo came up with an idea that allows delivering RDF while at the same time providing a simpler interface to programmers not knowledgeable in RDF. The idea is to create a class that implements both the org.openrdf.model.Graph interface as well as the java.util.Map interface. The effect of

    result.put(authorURI, "chris");

with the authorURI being equal to the URI of the author predicate, would then be equal to

    result.add(documentURI, authorURI, "chris");

I.e., you can use the Map methods to insert simple resource-predicate-literal statements (the majority), which is simple to document and understand, whereas people who know what they are doing can also add arbitrary RDF statements.

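A very rough sketch of the idea, using plain Java types instead of the real org.openrdf.model interfaces (the real class would implement both java.util.Map and org.openrdf.model.Graph):

    import java.net.URI;
    import java.util.ArrayList;
    import java.util.List;

    // Illustrative sketch: put() about one fixed subject becomes a full statement,
    // while add() accepts arbitrary statements, as the Graph interface would.
    public class ResultGraph {
        public static class Statement {
            public final URI subject;
            public final URI predicate;
            public final Object object;

            public Statement(URI subject, URI predicate, Object object) {
                this.subject = subject;
                this.predicate = predicate;
                this.object = object;
            }
        }

        private final URI describedResource;  // e.g. the document URI
        private final List<Statement> statements = new ArrayList<Statement>();

        public ResultGraph(URI describedResource) {
            this.describedResource = describedResource;
        }

        // Map-style convenience: result.put(authorURI, "chris")
        public void put(URI predicate, String literal) {
            add(describedResource, predicate, literal);
        }

        // Graph-style method: result.add(documentURI, authorURI, "chris")
        public void add(URI subject, URI predicate, Object object) {
            statements.add(new Statement(subject, predicate, object));
        }
    }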