ApertureSimpleDataCrawler – gnowsis

Only for testing ideas; this class will not be part of Aperture.

The ApertureSimpleDataCrawler provides simple access to structured data sources.

Implementations of this interface would be classes like FileDataSource, ImapDataSource, OutlookDataSource, ...

import java.util.Iterator;
import java.util.Map;

public interface SimpleDataCrawler {

 /**
  * init the datasource passing variables like name, base path, passwords, server hostname, etc
  */
 public void init(Map parameters);

 /**
  * open the passed data object so that it can be viewed / edited by the user. for a file,
  * this would mean that the operating system opens the file, for an address book entry
  * the address book application would have to start
  */
 public void openObject(String uri);

 /**
  * get the root uri of this datasource
  */
 public String getRootUri();

 /**
  * Get the detailed data of one object, including plaintext and metadata.
  * This is costly.
  * This may (internally) make heavy reuse of Extractors.
  */
 public Map getDataOfObject(String uri);

 /**
  * List sub-folders; the Iterator contains folder uris as Strings.
  * This may also return the uris of objects, if the objects can contain sub-objects
  * (e.g. IMAP attachments), but this is bad as detection of sub-objects of emails is costly.
  * The first call of this method would be with the getRootUri().
  */
 public Iterator listSubFolders(String uri);

 /**
  * List objects inside the passed folder; the Iterator contains the uris of the objects as Strings.
  * The first call of this method would be with the getRootUri().
  */
 public Iterator listSubObjects(String uri);

 /**
  * Get a map of metadata about the passed object,
  * enough so that changes can be detected.
  * If one value in this map has changed compared to the previously returned map (in the last scan)
  * then getDataOfObject is called to get the current data.
  */
 public Map getChangeDataOfObject(String uri);

}
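To make the intended usage concrete, here is a rough sketch of the external code that would drive such a crawler: it walks the folders from getRootUri(), compares getChangeDataOfObject() with the map remembered from the last scan, and only then pays for getDataOfObject(). ChangeScanner and InMemoryCrawler are made-up names for illustration, the interface is repeated in trimmed form so the example compiles on its own, and using a content checksum as change data is just an assumption.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Trimmed copy of the interface above (init and openObject left out).
interface SimpleDataCrawler {
    String getRootUri();
    Map getDataOfObject(String uri);
    Iterator listSubFolders(String uri);
    Iterator listSubObjects(String uri);
    Map getChangeDataOfObject(String uri);
}

// Hypothetical driver: walks the hierarchy and refetches only changed objects.
class ChangeScanner {
    // Change data remembered from the previous scan, keyed by object uri.
    private final Map<String, Map> lastScan = new HashMap<>();

    /** Returns the uris whose full data was fetched during this scan. */
    List<String> scan(SimpleDataCrawler crawler) {
        List<String> fetched = new ArrayList<>();
        Deque<String> folders = new ArrayDeque<>();
        folders.push(crawler.getRootUri());
        while (!folders.isEmpty()) {
            String folder = folders.pop();
            for (Iterator it = crawler.listSubFolders(folder); it.hasNext();) {
                folders.push((String) it.next());
            }
            for (Iterator it = crawler.listSubObjects(folder); it.hasNext();) {
                String uri = (String) it.next();
                Map changeData = crawler.getChangeDataOfObject(uri);
                // put() returns the previous map, or null for a new object.
                if (!changeData.equals(lastScan.put(uri, changeData))) {
                    crawler.getDataOfObject(uri); // costly, so only for new or changed objects
                    fetched.add(uri);
                }
            }
        }
        return fetched;
    }
}

// Hypothetical in-memory source with a single root folder, for trying this out.
class InMemoryCrawler implements SimpleDataCrawler {
    final Map<String, String> files = new HashMap<>(); // uri -> content

    public String getRootUri() { return "mem:/"; }
    public Iterator listSubFolders(String uri) { return Collections.emptyIterator(); }
    public Iterator listSubObjects(String uri) { return new ArrayList<>(files.keySet()).iterator(); }

    public Map getChangeDataOfObject(String uri) {
        Map<String, Object> change = new HashMap<>();
        change.put("checksum", files.get(uri).hashCode()); // assumption: checksum as change data
        return change;
    }

    public Map getDataOfObject(String uri) {
        Map<String, Object> data = new HashMap<>();
        data.put("plaintext", files.get(uri));
        return data;
    }
}
```

Calling scan() twice on an unchanged source fetches everything the first time and nothing the second time; note that the scanner, not the crawler, decides what counts as a change.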

Remarks Chris:

The concept of a hierarchy, including a root, is strongly present in this interface. I don't really like this as some data sources have no intrinsic hierarchy, e.g. graph- or table-oriented data sources, so that crawling such sources becomes awkward or even impossible. For example, what kind of folders would a WebCrawler return? How would it be able to know all folders and all items in those folders a priori?

In our code we avoid this problem by keeping the traversal of the hierarchy, graph, table or whatever kind of structure the data source has internal to the crawler implementation. Information regarding the folder/row/whatever becomes just another part of the metadata of the returned DataObjects.
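As an illustration only (this is not the actual Aduna code), the approach could look roughly like the following for a table-oriented source. The names Crawler, CrawlerHandler and TableCrawler, the simplified DataObject, and the metadata keys are all assumptions made for this sketch.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a returned DataObject: a uri plus metadata. */
class DataObject {
    final String uri;
    final Map<String, String> metadata;

    DataObject(String uri, Map<String, String> metadata) {
        this.uri = uri;
        this.metadata = metadata;
    }
}

/** Hypothetical callback through which a crawler reports what it finds. */
interface CrawlerHandler {
    void objectFound(DataObject object);
}

/** The crawler exposes no root and no folders, only "crawl the source". */
interface Crawler {
    void crawl(CrawlerHandler handler);
}

/** Table-oriented source: the traversal is a loop over rows, hidden in here. */
class TableCrawler implements Crawler {
    private final String tableName;
    private final List<String> rows;

    TableCrawler(String tableName, List<String> rows) {
        this.tableName = tableName;
        this.rows = rows;
    }

    public void crawl(CrawlerHandler handler) {
        for (int i = 0; i < rows.size(); i++) {
            Map<String, String> metadata = new LinkedHashMap<>();
            metadata.put("table", tableName);       // structural information is
            metadata.put("row", String.valueOf(i)); // just more metadata
            metadata.put("value", rows.get(i));
            handler.objectFound(new DataObject("table:" + tableName + "#" + i, metadata));
        }
    }
}
```

The calling code never asks for folders or rows; it only receives objects and can read the "table" and "row" entries if it cares about them.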

Another complaint is that change detection is apparently performed outside the crawler. I assume the idea here is that there will be a single piece of code to detect changes that will work with any kind of SimpleDataCrawler? Otherwise there would be no reason to have this getChangeDataOfObject method. In my opinion change detection will ideally be highly crawler-dependent. You might be able to generalize it this way, but at the cost of not being able to do source-specific optimizations. Consider for example HTTP-specific improvements (the If-Modified-Since header, which lets the web server tell you that a resource has not changed). Maybe we can also improve IMAP in a similar way, I don't know yet.

"Crawler" is probably not a good term to use in the name of this interface as there is some other code using an instance of this interface that retrieves the folder and object uris and decides to retrieve them (i.e. the actual crawling).

The Javadoc comments suggest that Extractors are applied internally in the crawler implementations. I think they should be applied somewhere outside the crawlers: it is not up to the crawlers to decide how the encountered objects are processed. For example, a wget-like utility using this framework will need all metadata obtained from the data source but will have no interest in the extracted text.

What I do like is that folders become more prominent. This may seem to contradict what I said before. What I mean is that it is possible to retrieve information about the folders themselves. Using a simple extension (e.g., a getDataOfFolder method) it becomes possible to retrieve all metadata of a folder. This is something we have not considered before in our own architecture.

Here's a new idea that in my opinion merges this idea with our own architecture. Create a super interface of DataObject (Resource? - has a strong RDF association. Entity? - has other associations here at Aduna). DataObject then gets a sibling named Folder. Crawlers do not only produce DataObject instances, they produce instances of its supertype. This way, crawlers that crawl data sources with an intrinsic hierarchy can return Folder instances, which contain all metadata of the Folder, similar to how DataObjects contain metadata of that object. Similarly, we can introduce other DataObject siblings for capturing table- or graph-related metadata that is not specific to a single DataObject. Crawler-using applications that have no interest in this information can simply ignore these events. Also, the crawler interface itself does not need to specify folder-/graph-/table-specific information.
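A minimal sketch of the supertype-plus-siblings idea, in Java. All names are placeholders: CrawlItem stands in for the still unnamed supertype, CrawlerHandler and ContentOnlyHandler are invented for the example, and the content is held in a byte array only to keep the sketch self-contained.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Placeholder name for the shared supertype: everything has a uri and metadata. */
abstract class CrawlItem {
    private final String uri;
    private final Map<String, String> metadata;

    CrawlItem(String uri, Map<String, String> metadata) {
        this.uri = uri;
        this.metadata = metadata;
    }

    String getUri() { return uri; }
    Map<String, String> getMetadata() { return metadata; }
}

/** Sibling 1: an object that additionally carries content. */
class DataObject extends CrawlItem {
    private final byte[] content;

    DataObject(String uri, Map<String, String> metadata, byte[] content) {
        super(uri, metadata);
        this.content = content;
    }

    InputStream getContent() { return new ByteArrayInputStream(content); }
    long getByteSize() { return content.length; }
}

/** Sibling 2: a folder, which has metadata but no content of its own. */
class Folder extends CrawlItem {
    Folder(String uri, Map<String, String> metadata) {
        super(uri, metadata);
    }
}

/** Hypothetical handler: crawlers report instances of the supertype. */
interface CrawlerHandler {
    void itemFound(CrawlItem item);
}

/** An application with no interest in folders simply ignores them. */
class ContentOnlyHandler implements CrawlerHandler {
    final List<String> seen = new ArrayList<>();

    public void itemFound(CrawlItem item) {
        if (item instanceof DataObject) {
            seen.add(item.getUri());
        }
    }
}
```

Further siblings for table- or graph-related metadata would be added next to Folder in the same way, without touching the crawler interface.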

Leo> I like this very much, the superclass and sibling idea. I will create objects accordingly. We had exactly the same problem of the "graph" structure that was hidden somewhere inside files or folder objects. If we use something like ApertureDataObjectFile and ApertureDataObjectFolder objects to capture and divide this semantic information, perfect. I would argue for ApertureDataObjectFile for things that are 'like a file', so attachments, web pages, web files and local files would all fall into this category. ApertureDataObjectFolder, I would suggest, should be restricted to real folders such as file folders, Outlook folders or IMAP folders. For things like "attachments inside an email" I would still use the getChildren() idea, and not the Folder thing. Although it may be nice if an email with attachments is both a Folder and a File.

Leo> Another thing that is solved by this solution: purely metadata objects. In MS Outlook, or any address book or calendaring application, you normally don't have files but have many metadata objects like appointments and persons. These would then be only ApertureDataObject instances and return their data on getMetadata() but would have no content.

In our use case this also facilitates metadata indexing, because currently our MetadataFetcher (the class transforming the information inside a DataObject to RDF statements) interprets the document URIs and "reinvents" the folder hierarchy, modeling it as Resources with a partOf relation. This would then no longer be necessary, as the Folder instance would already contain all necessary information.

Leo: ok, the idea of having a kind of "Sub-Class" sounds good. I would make ApertureDataObject the parent of ApertureDataFolderObject, so that ApertureDataFolderObject has all properties of DataObject and more.

Chris: I don't think so, as a DataObject also has properties that a Folder does not have. For example, a Folder has no InputStream, no byte size, etc. I think DataObjects and Folders are really something different. They surely do share characteristics (they both have a URI and metadata) but this should be expressed by their supertype. One should not inherit from the other.

Leo: ok, siblings are the right thing. Yes, if it has the getContent() method, it should return something.

Leo: Ok, still an important use case of gnowsis is not covered: structured crawling. We often need the user to select which folders are to be crawled, or the user wants to see what there really is in the datasource without having to crawl it completely first. So I suggest making another interface, independent of the existing DataCrawler, that ALLOWS the caller to "peek" into the datasource, IF it is hierarchical. I called this interface ApertureHierachicalAccess and it is described there.

Last modified on 10/24/05 17:18:26
