SemanticDataIntegrationFramework: API changes (20051114).txt

File API changes (20051114).txt, 9.5 KB (added by chris, 5 years ago)
* DataSource.ID is now a URI rather than a String.

Basically a good idea, especially in the context of context ;)

* Removed DataSource.getConfiguration, .initConfiguration and
.checkConfiguration for now.

I'm still not sure whether these methods are a good idea, as they restrict
configuration data to simple key-value pairs. I'm thinking, for example, about
giving FileSystemDataSource a *set* of root directories, with for each
directory *separate* *sets* of include and exclude dirs (we've seen multiple
use cases where this would have been very handy). That's three things that are
hard to configure in a key-value based API.

Right now I'm going for dedicated configuration methods in each DataSource. We
can always add generic configuration methods later that internally parse the
configuration data and invoke the specialized methods, although I suspect
they will not offer the full configuration possibilities provided by the
specialized methods. I would be fine with that though, as we always build
dedicated UIs, so these additional methods would do us no harm. I'd like to
postpone this decision until we know more about the pros and cons of this
approach.
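
To make the FileSystemDataSource example concrete, here is a minimal sketch of
what dedicated configuration methods could look like. The method names
(addRoot, addInclude, addExclude, accepts) and the path-prefix matching are
assumptions made for illustration only, not the actual Aperture API.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch, not the real Aperture class. */
class FileSystemDataSource {

    /** Include and exclude directories that apply to one root only. */
    private static final class RootConfig {
        final Set<String> includes = new LinkedHashSet<>();
        final Set<String> excludes = new LinkedHashSet<>();
    }

    // Root directory -> its own include/exclude sets, in insertion order.
    private final Map<String, RootConfig> roots = new LinkedHashMap<>();

    void addRoot(String root) {
        roots.computeIfAbsent(root, r -> new RootConfig());
    }

    void addInclude(String root, String dir) {
        roots.computeIfAbsent(root, r -> new RootConfig()).includes.add(dir);
    }

    void addExclude(String root, String dir) {
        roots.computeIfAbsent(root, r -> new RootConfig()).excludes.add(dir);
    }

    Set<String> getRoots() {
        return roots.keySet();
    }

    /**
     * Assumed semantics: a path is accepted if it lies under a root, is not
     * under one of that root's excludes and, when the root has includes,
     * lies under at least one of them.
     */
    boolean accepts(String path) {
        for (Map.Entry<String, RootConfig> entry : roots.entrySet()) {
            if (!isUnder(path, entry.getKey())) {
                continue;
            }
            RootConfig config = entry.getValue();
            if (config.excludes.stream().anyMatch(dir -> isUnder(path, dir))) {
                return false;
            }
            return config.includes.isEmpty()
                    || config.includes.stream().anyMatch(dir -> isUnder(path, dir));
        }
        return false;
    }

    private static boolean isUnder(String path, String dir) {
        String prefix = dir.endsWith("/") ? dir : dir + "/";
        return path.equals(dir) || path.startsWith(prefix);
    }
}

class FileSystemDataSourceDemo {
    public static void main(String[] args) {
        FileSystemDataSource source = new FileSystemDataSource();
        source.addRoot("/home/chris");
        source.addExclude("/home/chris", "/home/chris/tmp");
        source.addInclude("/data", "/data/docs");

        System.out.println(source.accepts("/home/chris/notes.txt")); // true
        System.out.println(source.accepts("/home/chris/tmp/x.txt")); // false
        System.out.println(source.accepts("/data/other/a.pdf"));     // false
    }
}
```

Each root carries its own include and exclude sets, which is exactly the
nesting that a flat key-value configuration would have to encode by hand.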

* Created an ...aperture.model package.

This package holds the DataSource-related interfaces and the DataObject
hierarchy. I've decided to keep them apart from the rest, as they are used in
all other parts of the framework except the Extractor part and, to my mind,
do not "belong" to the scope of any one of them more than to the others.

DataObject has two subtypes: BinaryObject and Folder. The first adds a
getContent method delivering the InputStream, the second has no extra methods.
All other information has become part of the metadata, i.e. there is no
getContentType, getChildren, etc.
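
The hierarchy described above could be sketched roughly as follows. Only the
type names and getContent come from the text; getID, getMetadata, the use of a
plain Map for metadata and the in-memory implementation are assumptions made
to keep the example self-contained.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

/** Common supertype: an identifier plus metadata, nothing else (assumed shape). */
interface DataObject {
    URI getID();
    Map<String, Object> getMetadata();
}

/** Adds access to the raw content. */
interface BinaryObject extends DataObject {
    InputStream getContent() throws IOException;
}

/** No extra methods: children etc. are expressed in the metadata. */
interface Folder extends DataObject {
}

/** Hypothetical in-memory implementation, only here to make the sketch runnable. */
final class InMemoryBinaryObject implements BinaryObject {
    private final URI id;
    private final Map<String, Object> metadata;
    private final byte[] bytes;

    InMemoryBinaryObject(URI id, Map<String, Object> metadata, byte[] bytes) {
        this.id = id;
        this.metadata = metadata;
        this.bytes = bytes;
    }

    public URI getID() {
        return id;
    }

    public Map<String, Object> getMetadata() {
        return metadata;
    }

    public InputStream getContent() {
        return new ByteArrayInputStream(bytes);
    }
}

class DataObjectDemo {
    /** Callers distinguish the subtypes by type, everything else by metadata. */
    static String describe(DataObject object) {
        String kind = object instanceof BinaryObject ? "binary"
                : object instanceof Folder ? "folder" : "other";
        return kind + " " + object.getID();
    }

    public static void main(String[] args) throws IOException {
        Map<String, Object> metadata = new HashMap<>();
        metadata.put("mimeType", "text/plain"); // assumed metadata key
        BinaryObject file = new InMemoryBinaryObject(
                URI.create("file:///tmp/a.txt"), metadata,
                "hi".getBytes(StandardCharsets.UTF_8));
        System.out.println(describe(file));
        try (InputStream in = file.getContent()) {
            System.out.println((char) in.read());
        }
    }
}
```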

FYI: the MIME type is now also part of the metadata. However, the metadata
only holds a MIME type if the data source reported one, i.e. it does NOT hold
the result of a MimeTypeIdentifier. IMO this should always be an application
design decision. For example, if you implement a wget-like crawler, you don't
want all this processing stuff to take place and you might even specifically
be interested in what the source returns, rather than what some smart (cl)ass
makes of it ;) It may well be that the MIME type determined by the
MimeTypeIdentifier is *stored* in the DataObject, but this is up to the
containing system to decide. FYI2: the InfoSource I intend to make will do
this, so people using the entire Aperture framework need not worry about it.
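
A rough sketch of how an application could keep these two concerns apart. The
metadata key, the Map-based metadata, the identify signature and the
MimeTypePolicy helper are all hypothetical; the only point illustrated is
that running the identifier and storing its result is an explicit step taken
by the containing system.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Stand-in for the MimeTypeIdentifier interface; the signature is assumed. */
interface MimeTypeIdentifier {
    String identify(byte[] firstBytes, String fileName);
}

/** Hypothetical helper living in the application, not in the data source. */
final class MimeTypePolicy {
    static final String MIME_TYPE = "mimeType"; // assumed metadata key

    private MimeTypePolicy() {
    }

    /** What the source itself reported, if anything. Never runs an identifier. */
    static Optional<String> reported(Map<String, Object> metadata) {
        Object value = metadata.get(MIME_TYPE);
        return value instanceof String ? Optional.of((String) value) : Optional.empty();
    }

    /** Explicit application decision: run the identifier and store its result. */
    static String identifyAndStore(Map<String, Object> metadata,
                                   MimeTypeIdentifier identifier,
                                   byte[] firstBytes, String fileName) {
        String identified = identifier.identify(firstBytes, fileName);
        metadata.put(MIME_TYPE, identified);
        return identified;
    }
}

class MimeTypePolicyDemo {
    public static void main(String[] args) {
        Map<String, Object> metadata = new HashMap<>();
        metadata.put(MimeTypePolicy.MIME_TYPE, "text/html"); // as reported by the source

        // A wget-like crawler stops here and uses the reported type only.
        System.out.println(MimeTypePolicy.reported(metadata).orElse("unknown"));

        // A full system may choose to overrule it with an identifier's verdict.
        MimeTypeIdentifier toyIdentifier = (bytes, name) ->
                name.endsWith(".pdf") ? "application/pdf" : "application/octet-stream";
        MimeTypePolicy.identifyAndStore(metadata, toyIdentifier, new byte[0], "report.pdf");
        System.out.println(MimeTypePolicy.reported(metadata).orElse("unknown"));
    }
}
```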

* Naming Changes

E.g. Crawler instead of DataCrawler, CrawlerListener instead of
DataCrawlerListener. This already makes my code easier to read.

Also, some classes may have changed package, but I've lost track of that.

* DataAccessor now gets a Date as parameter rather than an
AccessData/CrawlData.

In all our use cases a Date is sufficient. Furthermore, this simplifies a lot
of things:
- DataAccessor implementors need not learn another API (CrawlData).
- This also eases documentation considerably.
- CrawlData can be hidden inside the abstract CrawlerBase class; it does not
  even need to live at the Crawler level.
- Only this class handles what's stored in and retrieved from this object. In
  the old setup both the crawler and the accessor read and changed data in it,
  making it possible for one to screw things up for the other.

In cases where the DataAccessor needs more information, there is probably
already a strong connection between the crawler and the accessor, and you can
specify the additional params in the Map or even combine the Crawler and
DataAccessor implementation in a single class that implements both interfaces.
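
The division of labour described above might look like this. The method name
getIfModifiedSince, the Optional return value, the "empty means unchanged"
convention and the toy classes are assumptions; the text only establishes
that the accessor receives a Date and a Map, and that the crawl bookkeeping
stays on the crawler side.

```java
import java.util.Collections;
import java.util.Date;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Assumed shape: the accessor sees only a Date, never the crawler's bookkeeping. */
interface DataAccessor {
    /** Returns the content, or empty if unknown or unchanged since previousAccess. */
    Optional<String> getIfModifiedSince(String url, Date previousAccess,
                                        Map<String, Object> params);
}

/** Toy accessor over an in-memory "source". */
final class InMemoryAccessor implements DataAccessor {
    private final Map<String, Date> lastModified;
    private final Map<String, String> content;

    InMemoryAccessor(Map<String, Date> lastModified, Map<String, String> content) {
        this.lastModified = lastModified;
        this.content = content;
    }

    public Optional<String> getIfModifiedSince(String url, Date previousAccess,
                                               Map<String, Object> params) {
        Date modified = lastModified.get(url);
        if (modified == null) {
            return Optional.empty(); // unknown resource
        }
        if (previousAccess != null && !modified.after(previousAccess)) {
            return Optional.empty(); // unchanged since the previous access
        }
        return Optional.ofNullable(content.get(url));
    }
}

/** The crawl data is private to the crawler; only this class reads and writes it. */
final class SimpleCrawler {
    private final DataAccessor accessor;
    private final Map<String, Date> crawlData = new HashMap<>();

    SimpleCrawler(DataAccessor accessor) {
        this.accessor = accessor;
    }

    /** Returns true if the url was fetched (new or modified), false otherwise. */
    boolean crawl(String url, Date now) {
        Date previous = crawlData.get(url);
        boolean fetched = accessor
                .getIfModifiedSince(url, previous, Collections.emptyMap())
                .isPresent();
        crawlData.put(url, now);
        return fetched;
    }
}

class DataAccessorDemo {
    public static void main(String[] args) {
        Map<String, Date> modified = new HashMap<>();
        Map<String, String> content = new HashMap<>();
        modified.put("file:a", new Date(1000L));
        content.put("file:a", "hello");

        SimpleCrawler crawler = new SimpleCrawler(new InMemoryAccessor(modified, content));
        System.out.println(crawler.crawl("file:a", new Date(2000L))); // true: first visit
        System.out.println(crawler.crawl("file:a", new Date(3000L))); // false: unchanged
    }
}
```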

* Removed HierarchicalAccess

I believe it is redundant. Redundancy is not a problem if it makes life
easier, but I don't even think it does that ;) For example, the root folders
can be obtained from the DataSource wrapped by the HierarchicalAccess
(although it does not define generic methods for that at the DataSource
level). The DataObjects can be retrieved directly from the DataAccessor; the
HierarchicalAccess would only be delegating calls to it. Finally, information
about super- and subfolders will be part of the metadata of the DataObject,
e.g. a BinaryObject's metadata will have some partOf/containedIn/whatever
property, and a Folder's metadata will hold its sub- and superfolder metadata.

* Removed DataFactory.

Its design (one factory for DataSource, DataCrawler, etc.) makes the incorrect
assumption that for a given DataSource implementation the DataCrawler
implementation, as well as the implementations of all other interfaces
mentioned here, are fixed. In our use cases this is typically not the case.

I propose the following factory approach, with which we have had very good
experiences in another OSGi-based system. Warning: long explanation ahead ;)

Each XYZ API interface comes with its own XYZFactory interface whose get()
method embeds the knowledge of how an instance of this type is best
instantiated. For example, it may always return the same statically held
instance, return a new instance on each get() call, return a temporarily
cached shared instance held through a WeakReference, etc.

Examples: a PlainTextExtractorFactory always returns the same
PlainTextExtractor instance, as it is stateless. A DataCrawlerFactory will
usually create a new instance, except when the implementation is stateless.
The MagicMimeTypeIdentifierFactory returns a shared instance that is cached
using a WeakReference, as (1) its constructor does some costly initialization,
(2) the instance consumes a significant amount of memory and (3) the identify
method defined in the MimeTypeIdentifier interface does not alter its state.
In other words: you want to keep the instance around as long as it's used but
also get rid of it when you're done.
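
The three instantiation strategies can be sketched as below. The class names
follow the examples in the text, but every method signature and the toy
implementations are assumptions; in particular the real identify method and
Extractor interface may look quite different.

```java
import java.lang.ref.WeakReference;

// Assumed minimal shapes of the API interfaces.
interface Extractor { String extract(String text); }
interface ExtractorFactory { Extractor get(); }
interface Crawler { }
interface CrawlerFactory { Crawler get(); }
interface MimeTypeIdentifier { String identify(byte[] firstBytes, String fileName); }
interface MimeTypeIdentifierFactory { MimeTypeIdentifier get(); }

/** Strategy 1: stateless implementation, one statically held instance. */
final class PlainTextExtractorFactory implements ExtractorFactory {
    private static final Extractor INSTANCE = text -> text.trim();

    public Extractor get() {
        return INSTANCE;
    }
}

/** Strategy 2: stateful implementation, a fresh instance on every call. */
final class FileSystemCrawlerFactory implements CrawlerFactory {
    public Crawler get() {
        return new Crawler() { };
    }
}

/** Toy identifier; imagine a costly, memory-hungry constructor here. */
final class MagicMimeTypeIdentifier implements MimeTypeIdentifier {
    public String identify(byte[] firstBytes, String fileName) {
        boolean pdf = firstBytes.length >= 4
                && firstBytes[0] == '%' && firstBytes[1] == 'P'
                && firstBytes[2] == 'D' && firstBytes[3] == 'F';
        return pdf ? "application/pdf" : "application/octet-stream";
    }
}

/** Strategy 3: shared instance that the garbage collector may reclaim when unused. */
final class MagicMimeTypeIdentifierFactory implements MimeTypeIdentifierFactory {
    private WeakReference<MimeTypeIdentifier> cache;

    public synchronized MimeTypeIdentifier get() {
        MimeTypeIdentifier instance = (cache == null) ? null : cache.get();
        if (instance == null) {
            // Either the first call, or the previous instance was collected.
            instance = new MagicMimeTypeIdentifier();
            cache = new WeakReference<>(instance);
        }
        return instance;
    }
}

class FactoryStrategiesDemo {
    public static void main(String[] args) {
        ExtractorFactory extractors = new PlainTextExtractorFactory();
        System.out.println(extractors.get() == extractors.get()); // true

        CrawlerFactory crawlers = new FileSystemCrawlerFactory();
        System.out.println(crawlers.get() == crawlers.get());     // false

        MimeTypeIdentifierFactory identifiers = new MagicMimeTypeIdentifierFactory();
        MimeTypeIdentifier first = identifiers.get();
        // Shared for as long as some caller still holds a strong reference.
        System.out.println(first == identifiers.get());           // true
    }
}
```

Callers only ever see get(), so a factory can change its strategy without any
change to the code that uses it.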

In some cases the get() method will instead be called newInstance(), namely
when from an architectural perspective it is vital that a new instance is
returned. This is typically the case for objects that will be configured after
being returned by the factory, e.g. DataSources. In other cases (e.g.
DataCrawlers) it does not matter whether you get a unique instance or not, and
the decision is then best left to the XYZFactory implementation. This is
expressed by the more neutral get() method, which makes no assumptions on this
matter. If there is ever a case where it is vital that the instance is shared
(I haven't encountered one yet), I would propose a sharedInstance() method.

XYZ implementations are provided as separate OSGi bundles (i.e. separate from
the bundle that provides the XYZ interface itself). An implementation bundle
contains an implementation of both the XYZ and the XYZFactory interface. The
BundleActivator of this bundle should announce that the factory implementation
is an implementation of XYZFactory.

As said above, the XYZ interface is part of a separate bundle that only
provides this API to the system. This bundle has no BundleActivator, as it
does not register a service; it only provides a service API.

Besides the XYZ interface itself, this bundle also contains an XYZRegistry.
The job of a registry is to keep track of all the XYZFactory implementations
that are announced to the OSGi platform. Every time the BundleActivator of an
XYZFactory implementation announces the implementation's existence, the
implementation of the XYZRegistry gets notified about this. More specifically,
the BundleActivator of the XYZRegistryImpl makes sure it gets notifications
from the OSGi platform about new XYZFactory implementations and passes this
information to the XYZRegistryImpl, so that the registry itself is still
completely non-OSGi-specific.

When you need an XYZ instance, you approach the XYZRegistry instance (of which
there is only one in the system) with the necessary details (e.g. a MIME type,
a scheme, a DataSource type, etc.) and it will provide you with an appropriate
XYZFactory implementation, if one is available. This factory will then
provide you with the instance.
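
As an illustration of the lookup path (registry, then factory, then
instance), here is a minimal OSGi-free sketch for the Extractor case. The
add/remove/get methods, getSupportedMimeTypes and the "first registered wins"
rule are assumptions, not the actual interfaces; in an OSGi setup the add and
remove calls would be made by the registry bundle's BundleActivator.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

interface Extractor { String extract(String text); }

interface ExtractorFactory {
    Set<String> getSupportedMimeTypes();
    Extractor get();
}

interface ExtractorRegistry {
    void add(ExtractorFactory factory);
    void remove(ExtractorFactory factory);
    Optional<ExtractorFactory> get(String mimeType);
}

/** Plain Java, no OSGi types: something else feeds it factories. */
final class ExtractorRegistryImpl implements ExtractorRegistry {
    private final Map<String, List<ExtractorFactory>> byMimeType = new HashMap<>();

    public synchronized void add(ExtractorFactory factory) {
        for (String type : factory.getSupportedMimeTypes()) {
            byMimeType.computeIfAbsent(type, t -> new ArrayList<>()).add(factory);
        }
    }

    public synchronized void remove(ExtractorFactory factory) {
        for (List<ExtractorFactory> factories : byMimeType.values()) {
            factories.remove(factory);
        }
    }

    /** Assumed selection rule: the first factory registered for the type. */
    public synchronized Optional<ExtractorFactory> get(String mimeType) {
        List<ExtractorFactory> factories =
                byMimeType.getOrDefault(mimeType, Collections.emptyList());
        return factories.isEmpty() ? Optional.empty() : Optional.of(factories.get(0));
    }
}

final class PlainTextExtractorFactory implements ExtractorFactory {
    private static final Extractor INSTANCE = text -> text.trim();

    public Set<String> getSupportedMimeTypes() {
        return Collections.singleton("text/plain");
    }

    public Extractor get() {
        return INSTANCE;
    }
}

class ExtractorRegistryDemo {
    public static void main(String[] args) {
        ExtractorRegistry registry = new ExtractorRegistryImpl();
        registry.add(new PlainTextExtractorFactory());

        // Registry -> factory -> instance.
        String text = registry.get("text/plain")
                .map(factory -> factory.get().extract("  hello  "))
                .orElse("<no extractor available>");
        System.out.println(text); // hello
    }
}
```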

The XYZRegistryImpl is part of a separate bundle; it should not be provided
with the bundle containing XYZ and XYZRegistry, as there may be some
application-dependent decisions to make in its implementation. For example, a
DataCrawlerRegistryImpl could check which OS platform the application is
running on and prefer an OS-specific DataCrawler (or actually: its
corresponding DataCrawlerFactory) over an OS-independent implementation,
assuming that OS-specific implementations provide better optimizations. In
different domains there may be different strategies for choosing a factory, so
this should not be part of the bundle that defines the XYZ and XYZRegistry
interfaces.
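
One possible selection strategy of this kind, sketched under heavy
assumptions: the getTargetOs method, the substring match on the OS name and
all class names are invented for illustration, and the lookup key (e.g. a
DataSource type) is left out for brevity.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Optional;

interface Crawler { String describe(); }

interface CrawlerFactory {
    /** OS this factory is specialised for, or null if OS-independent (assumed). */
    String getTargetOs();
    Crawler get();
}

/** Simple factory used to populate the registry in this sketch. */
final class SimpleCrawlerFactory implements CrawlerFactory {
    private final String targetOs;
    private final String description;

    SimpleCrawlerFactory(String targetOs, String description) {
        this.targetOs = targetOs;
        this.description = description;
    }

    public String getTargetOs() {
        return targetOs;
    }

    public Crawler get() {
        return () -> description;
    }
}

/** Application-specific policy: prefer a factory written for the current OS. */
final class OsAwareCrawlerRegistryImpl {
    private final List<CrawlerFactory> factories = new ArrayList<>();
    private final String currentOs;

    /** In real use one would pass System.getProperty("os.name"). */
    OsAwareCrawlerRegistryImpl(String currentOs) {
        this.currentOs = currentOs.toLowerCase(Locale.ROOT);
    }

    void add(CrawlerFactory factory) {
        factories.add(factory);
    }

    Optional<CrawlerFactory> get() {
        CrawlerFactory fallback = null;
        for (CrawlerFactory factory : factories) {
            String os = factory.getTargetOs();
            if (os == null) {
                if (fallback == null) {
                    fallback = factory; // first OS-independent factory
                }
            } else if (currentOs.contains(os.toLowerCase(Locale.ROOT))) {
                return Optional.of(factory); // OS-specific match wins
            }
        }
        return Optional.ofNullable(fallback);
    }
}

class OsAwareRegistryDemo {
    public static void main(String[] args) {
        OsAwareCrawlerRegistryImpl registry =
                new OsAwareCrawlerRegistryImpl(System.getProperty("os.name"));
        registry.add(new SimpleCrawlerFactory(null, "portable crawler"));
        registry.add(new SimpleCrawlerFactory("windows", "Windows-optimised crawler"));
        registry.get().ifPresent(f -> System.out.println(f.get().describe()));
    }
}
```

Another application could plug in a different registry implementation with a
different policy, without touching the factories or the API bundle.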

As you can see in the implementations of our factories and registries, there
is actually no OSGi-specific code in them. The code that gets informed about
new factories and passes them on to the registries is part of the
OSGi-specific BundleActivators. This is the only location where use of OSGi is
assumed. It is always possible to directly instantiate an XYZRegistryImpl and
pass it a set of XYZFactory implementations, as you can see in the code
examples. However, you then have to make assumptions about which registry and
factory implementations are available. The BundleActivators automate this
process so that you don't have to embed this knowledge in your application.
E.g., add a new Extractor implementation bundle (a jar file!) to your system
and you can automatically handle the MIME types it supports. No line of
existing code then needs to be changed.