* DataSource.ID is now a URI rather than a String.

Basically a good idea, especially in the context of context ;)

* Removed DataSource.getConfiguration, .initConfiguration and
.checkConfiguration for now.

I'm still not sure whether these methods are a good idea, as they restrict
configuration data to simple key-value pairs. I'm thinking, for example, about
giving FileSystemDataSource a *set* of root directories, each with its own
*separate* *sets* of include and exclude dirs (we've seen multiple use cases
where this would have been very handy). That's three things that are hard to
configure in a key-value based API.

Right now I'm going for dedicated configuration methods in each DataSource. We
can always add such generic configuration methods later that internally parse
the configuration data and invoke the specialized methods, although I suspect
they will not offer the full configuration possibilities of the specialized
methods. I would be fine with that though: we always build dedicated UIs, so
these additional methods would do us no harm. I'd like to postpone this
decision until we know more about the pros and cons of this approach.

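To make the FileSystemDataSource example concrete, here is a minimal sketch of
what dedicated configuration methods could look like. The method names
(addRoot, addInclude, addExclude and the getters) are assumptions made up for
illustration, not the actual Aperture API.

```java
import java.io.File;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: all method names below are assumptions for illustration.
class FileSystemDataSource {

    /** Include and exclude directories belonging to one root directory. */
    private static final class Scope {
        final Set<File> includes = new LinkedHashSet<>();
        final Set<File> excludes = new LinkedHashSet<>();
    }

    // One entry per root directory, each with its own include/exclude sets.
    private final Map<File, Scope> roots = new LinkedHashMap<>();

    void addRoot(File root) {
        scopeOf(root);
    }

    void addInclude(File root, File dir) {
        scopeOf(root).includes.add(dir);
    }

    void addExclude(File root, File dir) {
        scopeOf(root).excludes.add(dir);
    }

    Set<File> getRoots() {
        return Collections.unmodifiableSet(roots.keySet());
    }

    Set<File> getIncludes(File root) {
        Scope scope = roots.get(root);
        return scope == null
                ? Collections.<File>emptySet()
                : Collections.unmodifiableSet(scope.includes);
    }

    Set<File> getExcludes(File root) {
        Scope scope = roots.get(root);
        return scope == null
                ? Collections.<File>emptySet()
                : Collections.unmodifiableSet(scope.excludes);
    }

    // Registers the root on first use, so includes/excludes can be added directly.
    private Scope scopeOf(File root) {
        return roots.computeIfAbsent(root, r -> new Scope());
    }
}
```

A flat key-value API would have to encode this nesting in key names or in
string values that need parsing, which is exactly the awkwardness described
above.
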
* Created an ...aperture.model package.

This package holds the DataSource-related interfaces and the DataObject
hierarchy. I've decided to keep them apart from the rest, as they are used in
all other parts of the framework except the Extractor part and, to my mind, do
not "belong" to the scope of any one of them more than to the others.

DataObject has two subtypes: BinaryObject and Folder. The first adds a
getContent method delivering the InputStream, the second has no extra methods.
All other information has become part of the metadata, i.e. there is no
getContentType, getChildren, etc.

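As a rough sketch, the hierarchy described above could look like this. Only
the type names and getContent come from the description. The getID and
getMetadata accessors, the Map-based metadata, the "mimeType" key mentioned
in the comments and the InMemoryBinaryObject class are assumptions added to
make the example self-contained.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.util.Map;

interface DataObject {
    URI getID();                        // assumed accessor
    Map<String, Object> getMetadata();  // assumed metadata representation
}

// Adds access to the raw content; everything else lives in the metadata.
interface BinaryObject extends DataObject {
    InputStream getContent() throws IOException;
}

// A pure marker subtype: children, parent etc. would be metadata properties.
interface Folder extends DataObject {
}

// Hypothetical in-memory implementation, for demonstration only.
class InMemoryBinaryObject implements BinaryObject {
    private final URI id;
    private final Map<String, Object> metadata;
    private final byte[] bytes;

    InMemoryBinaryObject(URI id, Map<String, Object> metadata, byte[] bytes) {
        this.id = id;
        this.metadata = metadata;
        this.bytes = bytes.clone();
    }

    public URI getID() {
        return id;
    }

    public Map<String, Object> getMetadata() {
        return metadata;
    }

    // A fresh stream per call, so the content can be read more than once.
    public InputStream getContent() {
        return new ByteArrayInputStream(bytes);
    }
}
```

A caller that wants the content type would look it up in the metadata (for
instance under a key such as "mimeType") rather than call a getContentType
method.
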
FYI: the MIME type is now also part of the metadata. However, the metadata only
holds a MIME type if the data source reported one, i.e. it does NOT contain the
result of a MimeTypeIdentifier. IMO whether to apply one should always be an
application design decision. For example, if you implement a wget-like crawler,
you don't want all this processing to take place and you might even
specifically be interested in what the source returns, rather than what some
smart (cl)ass makes of it ;) The MIME type determined by the MimeTypeIdentifier
may well be *stored* in the DataObject, but this is up to the containing system
to decide. FYI2: the InfoSource I intend to make will do this, so people using
the entire Aperture framework need not worry about it.

* Naming Changes

E.g. Crawler instead of DataCrawler, CrawlerListener instead of
DataCrawlerListener. This already makes my code easier to read.

Also, some classes may have changed package, but I've lost track of that.

* DataAccessor now gets a Date as parameter rather than an
AccessData/CrawlData.

In all our use cases a Date is sufficient. Furthermore, this simplifies a lot
of things:
- DataAccessor implementors need not learn another API (CrawlData).
- This also eases documentation considerably.
- CrawlData can be hidden inside the abstract CrawlerBase class; it does not
  even need to live at the Crawler level.
- Only this class handles what's stored in and retrieved from this object. In
  the old setup both the crawler and the accessor read and changed data in it,
  making it possible for one to break things for the other.

In cases where the DataAccessor needs more information, there is probably
already a strong connection between the crawler and the accessor, and you can
specify the additional params in the Map or even combine the Crawler and
DataAccessor implementations in a single class that implements both interfaces.

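A minimal sketch of a Date-based accessor follows. The notes above only state
that the accessor receives a Date (plus a Map for additional params). The
method name getIfModified, the String return type and the "null means
unchanged" convention are assumptions chosen to keep the example small.

```java
import java.util.Date;
import java.util.HashMap;
import java.util.Map;

// Hypothetical signature: only the Date and Map parameters come from the notes.
interface DataAccessor {
    /**
     * Returns the content of the resource, or null if the resource is unknown
     * or has not been modified after previousAccess (assumed convention).
     * A null previousAccess means "never accessed before".
     */
    String getIfModified(String url, Date previousAccess, Map<String, Object> params);
}

// In-memory implementation, for demonstration only.
class InMemoryAccessor implements DataAccessor {
    private final Map<String, Date> lastModified = new HashMap<>();
    private final Map<String, String> content = new HashMap<>();

    void put(String url, String text, Date modified) {
        content.put(url, text);
        lastModified.put(url, modified);
    }

    public String getIfModified(String url, Date previousAccess, Map<String, Object> params) {
        Date modified = lastModified.get(url);
        if (modified == null) {
            return null; // unknown resource
        }
        if (previousAccess != null && !modified.after(previousAccess)) {
            return null; // unchanged since the previous access
        }
        return content.get(url);
    }
}
```

The accessor never sees where the Date comes from. In the setup described
above, the crawler would look it up in its own, privately held CrawlData.
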
* Removed HierarchicalAccess

I believe it is redundant. Redundancy is not a problem if it makes life
easier, but I don't even think it does that ;) For example, the root folders
can be obtained from the DataSource wrapped by the HierarchicalAccess (although
it does not define generic methods for that at the DataSource level). The
DataObjects can be retrieved directly from the DataAccessor; the
HierarchicalAccess would only be delegating calls to it. Finally, information
about super- and subfolders will be part of the metadata of the DataObject,
e.g. a BinaryObject's metadata will have some partOf/containedIn/whatever
property, and a Folder's metadata will hold sub- and superfolder metadata.

* Removed DataFactory.

Its design (one factory for DataSource, DataCrawler, etc.) makes the incorrect
assumption that for a given DataSource implementation the DataCrawler
implementation, as well as the implementations of all other interfaces
mentioned here, are fixed. In our use cases this is typically not the case.

I propose the following factory approach, with which we have had very good
experiences in another OSGi-based system. Warning: long explanation ahead ;)

Each XYZ API interface comes with its own XYZFactory interface whose get()
method embeds the knowledge of how an instance of this type is best
instantiated. For example, it may always return the same statically held
instance, return a new instance on each get() call, return a temporarily
cached shared instance held through a WeakReference, etc.

Examples: a PlainTextExtractorFactory always returns the same
PlainTextExtractor instance, as it is stateless. A DataCrawlerFactory will
usually create a new instance, except when the implementation is stateless.
The MagicMimeTypeIdentifierFactory returns a shared instance that is cached
using a WeakReference, as (1) its constructor does some costly initialization,
(2) the instance consumes a significant amount of memory and (3) the identify
method defined in the MimeTypeIdentifier interface does not alter its state.
In other words: you want to keep the instance around as long as it's used but
also get rid of it when you're done.

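Two of these strategies can be sketched as follows. The type names come from
the examples above, but the method signatures (extract, identify) and the toy
"%PDF" check are assumptions made purely for illustration.

```java
import java.lang.ref.WeakReference;

interface Extractor {
    String extract(String raw); // assumed signature
}

interface ExtractorFactory {
    Extractor get();
}

class PlainTextExtractor implements Extractor {
    public String extract(String raw) {
        return raw.trim();
    }
}

// Strategy 1: the extractor is stateless, so one shared instance suffices.
class PlainTextExtractorFactory implements ExtractorFactory {
    private static final Extractor INSTANCE = new PlainTextExtractor();

    public Extractor get() {
        return INSTANCE;
    }
}

interface MimeTypeIdentifier {
    String identify(byte[] firstBytes); // assumed signature
}

interface MimeTypeIdentifierFactory {
    MimeTypeIdentifier get();
}

class MagicMimeTypeIdentifier implements MimeTypeIdentifier {
    MagicMimeTypeIdentifier() {
        // Imagine costly loading of large magic-number tables here.
    }

    // Toy check that only recognizes the PDF magic number; returns null otherwise.
    public String identify(byte[] b) {
        boolean pdf = b.length >= 4
                && b[0] == '%' && b[1] == 'P' && b[2] == 'D' && b[3] == 'F';
        return pdf ? "application/pdf" : null;
    }
}

// Strategy 2: share the instance while somebody still uses it, but let the
// garbage collector reclaim it once all strong references are gone.
class MagicMimeTypeIdentifierFactory implements MimeTypeIdentifierFactory {
    private WeakReference<MimeTypeIdentifier> cache = new WeakReference<>(null);

    public synchronized MimeTypeIdentifier get() {
        MimeTypeIdentifier identifier = cache.get();
        if (identifier == null) {
            identifier = new MagicMimeTypeIdentifier();
            cache = new WeakReference<>(identifier);
        }
        return identifier;
    }
}
```

Callers cannot tell the strategies apart: both factories are used through a
plain get() call, which is the point of hiding the decision in the factory.
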
In some cases the get() method will be called newInstance() instead, namely
when from an architectural perspective it is vital that a new instance is
returned. This is typically the case for objects that will be configured after
being returned by the factory, e.g. DataSources. In other cases (e.g.
DataCrawlers) it does not matter whether you get a unique instance or not, and
the decision is then best left to the XYZFactory implementation. This is
expressed by the more neutral get() method, which makes no assumptions on this
matter. If there is ever a case where it is vital that the instance is shared
(I haven't encountered one yet), I would propose a sharedInstance() method.

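The newInstance() contract matters as soon as the caller configures the
result. A small sketch, assuming a hypothetical setID/getID pair on
DataSource and made-up Simple* implementation classes:

```java
import java.net.URI;

interface DataSource {
    URI getID();
    void setID(URI id); // assumed setter, for illustration only
}

interface DataSourceFactory {
    // The name promises a fresh object, so callers can configure it safely.
    DataSource newInstance();
}

class SimpleDataSource implements DataSource {
    private URI id;

    public URI getID() {
        return id;
    }

    public void setID(URI id) {
        this.id = id;
    }
}

class SimpleDataSourceFactory implements DataSourceFactory {
    public DataSource newInstance() {
        return new SimpleDataSource();
    }
}
```

If the factory handed out a shared instance here, configuring one data source
would silently reconfigure every other user of that instance.
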
XYZ implementations are provided as separate OSGi bundles (i.e. separate from
the bundle that provides the XYZ interface itself). An implementation bundle
contains an implementation of both the XYZ and the XYZFactory interface. The
BundleActivator of this bundle should announce that the factory implementation
is an implementation of XYZFactory.

As said above, the XYZ interface is part of a separate bundle that only
provides this API to the system. This bundle has no BundleActivator, as it does
not register a service; it only provides a service API.

Besides the XYZ interface itself, this bundle also contains an XYZRegistry.
The job of a registry is to keep track of all the XYZFactory implementations
that are announced to the OSGi platform. Every time the BundleActivator of an
XYZFactory implementation announces the implementation's existence, the
implementation of the XYZRegistry gets notified about this. More specifically,
the BundleActivator of the XYZRegistryImpl makes sure it gets notifications
from the OSGi platform about new XYZFactory implementations and passes this
information to the XYZRegistryImpl, so that the registry itself remains
completely non-OSGi-specific.

When you need an XYZ instance, you approach the XYZRegistry instance (of which
there is only one in the system) with the necessary details (e.g. a MIME type,
a scheme, a DataSource type, etc.) and it will provide you with an appropriate
XYZFactory implementation, if one is available. This factory will then
provide you with the instance.

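For Extractors, the lookup could be sketched as below: a registry keyed by
MIME type that contains no OSGi code at all. The add/remove/get method names,
getSupportedMimeTypes and the "first registered wins" rule are assumptions
for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

interface Extractor {
    String extract(String raw); // assumed signature
}

interface ExtractorFactory {
    Set<String> getSupportedMimeTypes(); // assumed: lets the registry index it
    Extractor get();
}

interface ExtractorRegistry {
    void add(ExtractorFactory factory);
    void remove(ExtractorFactory factory);
    /** Returns a factory for the MIME type, or null if none is available. */
    ExtractorFactory get(String mimeType);
}

class ExtractorRegistryImpl implements ExtractorRegistry {
    private final Map<String, List<ExtractorFactory>> byMimeType = new HashMap<>();

    public synchronized void add(ExtractorFactory factory) {
        for (String type : factory.getSupportedMimeTypes()) {
            byMimeType.computeIfAbsent(type, t -> new ArrayList<>()).add(factory);
        }
    }

    public synchronized void remove(ExtractorFactory factory) {
        for (String type : factory.getSupportedMimeTypes()) {
            List<ExtractorFactory> factories = byMimeType.get(type);
            if (factories != null) {
                factories.remove(factory);
            }
        }
    }

    // Simplest possible strategy: the first registered factory wins.
    public synchronized ExtractorFactory get(String mimeType) {
        List<ExtractorFactory> factories = byMimeType.get(mimeType);
        return (factories == null || factories.isEmpty()) ? null : factories.get(0);
    }
}

class PlainTextExtractorFactory implements ExtractorFactory {
    private static final Extractor INSTANCE = raw -> raw.trim();

    public Set<String> getSupportedMimeTypes() {
        return Collections.singleton("text/plain");
    }

    public Extractor get() {
        return INSTANCE;
    }
}
```

In an OSGi deployment a BundleActivator would call add and remove as factory
services come and go. Without OSGi, the application calls them itself.
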
The XYZRegistryImpl is part of a separate bundle. It should not be provided
with the bundle containing XYZ and XYZRegistry, as there may be some
application-dependent decisions to make in its implementation. For example, a
DataCrawlerRegistryImpl could look at which OS platform the application is
running on and prefer an OS-specific DataCrawler (or actually: its
corresponding DataCrawlerFactory) over an OS-independent implementation,
assuming that OS-specific implementations provide better optimizations. In
different domains there may be different strategies for choosing a factory, so
this should not be part of the bundle that defines the XYZ and XYZRegistry
interfaces.

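The OS-preference strategy could be sketched like this. How a factory
declares its target platform is not specified in these notes, so getTargetOs
(an "os.name" prefix, or null for OS-independent factories) and the
SimpleCrawlerFactory helper are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

interface DataCrawler {
    // Crawling methods omitted; irrelevant for the selection strategy.
}

interface DataCrawlerFactory {
    /** Assumed: prefix of the "os.name" value this factory targets, or null. */
    String getTargetOs();
    DataCrawler get();
}

class DataCrawlerRegistryImpl {
    private final List<DataCrawlerFactory> factories = new ArrayList<>();
    private final String currentOs;

    DataCrawlerRegistryImpl() {
        this(System.getProperty("os.name"));
    }

    // The OS name is injectable so that the strategy can be tested anywhere.
    DataCrawlerRegistryImpl(String currentOs) {
        this.currentOs = currentOs;
    }

    synchronized void add(DataCrawlerFactory factory) {
        factories.add(factory);
    }

    // Prefers a factory targeting the current OS; otherwise falls back to the
    // first OS-independent factory; returns null if neither exists.
    synchronized DataCrawlerFactory get() {
        DataCrawlerFactory fallback = null;
        for (DataCrawlerFactory factory : factories) {
            String target = factory.getTargetOs();
            if (target == null) {
                if (fallback == null) {
                    fallback = factory;
                }
            } else if (currentOs != null && currentOs.startsWith(target)) {
                return factory;
            }
        }
        return fallback;
    }
}

// Hypothetical factory used only to exercise the registry.
class SimpleCrawlerFactory implements DataCrawlerFactory {
    private final String targetOs;

    SimpleCrawlerFactory(String targetOs) {
        this.targetOs = targetOs;
    }

    public String getTargetOs() {
        return targetOs;
    }

    public DataCrawler get() {
        return new DataCrawler() { };
    }
}
```

Another application could ship a registry implementation with a completely
different preference rule while reusing the same factories unchanged.
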
As you can see in the implementations of our factories and registries, there
is actually no OSGi-specific code in them. The code that gets informed about
new factories and passes them on to the registries is part of the
OSGi-specific BundleActivators. This is the only location where use of OSGi is
assumed. It is always possible to directly instantiate an XYZRegistryImpl and
pass it a set of XYZFactory implementations, as you can see in the code
examples. However, you then have to make assumptions about which registry and
factory implementations are available. The BundleActivators automate this
process so that you don't have to embed this knowledge in your application.
E.g., add a new Extractor implementation bundle (a jar file!) to your system
and you can automatically handle the MIME types it supports. Not a single line
of existing code needs to be changed.