
METAFOR QUERY TOOL

DESIGN DOCUMENT

This document provides an overview for one of the main components of WP5, the query tool. It is also a deliverable for the Met Office:

From the description of MO Deliverable 3.1:

The tools will support a basic “keyword” search of all holdings in the METAFOR archive. This is likely to ignore the structure of the metadata and treat all elements and attributes of CIM instances as equally searchable. A more advanced search of the archive that relies on the structure of the metadata, allowing users to browse through different categories of holdings, successively narrowing down their results is also a desirable feature for METAFOR. However, this “faceted” search is not listed as a deliverable in the DoW.

By M6 [M18 of the project], [I] will have provided the MO with a design document showing how to implement basic search... This document will also include advice on how to implement a more advanced “faceted” search functionality. By M12 [M24 of the project], [I] will have delivered search functionality against a set of CIM instance documents1, though it may be a stand-alone (ie: not online) implementation. This deliverable will include keyword search and may include faceted search.

1. Overview

Previous climate data archives (for example, AR4/CMIP5) allowed users to locate data but called for a lot of a priori knowledge about the archived file. FTP protocols required that users either knew the URI of the file in question or else could extract meaning from the filename itself. Failing that, users would have to seek help from an appropriate “expert” (most likely a member of the team that produced the data in the first place). Furthermore, once a dataset was located there was little information linking it back to the models that generated it or to the experiments for which they were run; again, that sort of information would likely come from the aforementioned expert.

METAFOR aims to improve this process by using a standard metadata language, the “Common Information Model” (CIM), to describe the archived data, the software models that generated that data, and the experiments for which those models were run. The CIM has been designed at a high level using UML which is then implemented as an XML Schema. 1 The CIM has a rich structure that can be searched. Individual CIM instances 2 can also contain links to other CIM instances, enabling users to browse through a network of related metadata descriptions. These features of the CIM should be taken into account when designing the approach and underlying technologies that the query tool will use.

This document first describes the scope of the query tool: what it is meant to do and for whom. It then describes, at the level of the user interface, how the tool is meant to work. Finally, it describes specific implementation details at a lower level.

2. Scope

The METAFOR Query Tool will allow METAFOR users to perform queries on CIM instances. CIM instances are delivered to a central portal. The query tool will allow users to browse and search the portal for the instances they are interested in.

There are two aspects of any query: the query request and the query result. The request is the actual search being performed on the portal holdings. The goal of a request may be to find one particular CIM instance or to simply narrow down the search space of instances a bit. The result is what the request returns. This may be one or many CIM documents, or parts of CIM documents (snippets), or descriptions of CIM instances.

As implied above, different types of requests and results may be appropriate. That is, the query tool should provide multiple ways of issuing a query and viewing the results based on the user type and/or use case being supported. This does not imply that these different functionalities need to be implemented as a single piece of technology. However, mixing query methods with a significantly different look and feel should be avoided as it may confuse METAFOR users.

2.1 User Types and Use Cases

This tool, as with the other components of METAFOR, will have to support a wide range of users ranging from the very naïve to climate data experts. A representative sample, along with the sorts of requests each user type will make and the results each will expect, are listed below:

user type | sample queries
climate modeller - may be a computer scientist / climate scientist who writes model code, or couples together existing models, or configures potential models (as with UMUI, for instance); may be quite familiar with the CIM software package | "Find me an ocean model that I can couple with my atmosphere model in order to run a simulation in support of a CMIP5 experiment."
climate scientist - overlaps somewhat with the climate modeller above, but is not necessarily knowledgeable about software models; understands climate data and knows what they are looking for | "Find me all the models that were used for a particular experiment."
other scientifically-literate users - environmental scientists, biologists, geologists, representatives of the impacts community; these users may be familiar with searching for datasets but not know what the concepts and relationships modeled by the CIM represent | "Find me climate data to look at crop species impact in the Sahel region."
naive users - may be their first time at the CIM portal; just learning what the CIM is and what the query tool does; may be an educator or a policymaker or a novice climate scientist / modeller | "Just browse through the metadata holdings to get an idea of what the archive provides."
metadata experts - understand the structure of metadata systems; ought to know how to run queries but not necessarily what to look for | "Find me some simulations that used double CO2 forcing."
data experts - familiar with the structure of climate datasets; familiar with data archival methods | "Find me all papers that refer to the dataset I am interested in."
metadata administrators - these are system experts; they are likely to know some query "shortcuts" | "Show me if the holdings have changed since my last visit."

Some high-level use cases have already been defined for METAFOR (those that don't apply to the query tool have been crossed out):

Finally, the Curator project  http://www.earthsystemcurator.org/, which shares many of the same goals as METAFOR and is already working collaboratively with METAFOR, has developed its own prototype query application  http://curator.ucar.edu/query/advanced.htm to support its user base, the Earth System Grid (ESG) community. Their stated goal is "to address the formidable challenges associated with enabling analysis of and knowledge development from global Earth System models." Their prototype allows users to search for and view climate data records. Curator does not currently use the CIM, but they are aiming to modify their own metadata format to make it easy to import from and export to the CIM.

2.2 Query Types

These different users and use cases warrant different query approaches. The query tool design will consider three distinct types of query requests:

  • unrestricted search
  • advanced search
  • faceted search

and two distinct classes of query results:

  • primary results
  • secondary results

There are different ways to constrain a query. Some users will benefit from an unrestricted search where every element within a CIM instance is given equal weight and searched simultaneously. Previously, the phrases “keyword search” or “free-text search” have been used to describe this functionality. This is similar to the type of search performed by Google: a simple text search into which users can enter whatever they want. An unrestricted search might be useful to the casual user of METAFOR - someone who is just experimenting with the portal. Or it may appeal to an expert user who is able to use their knowledge of the CIM and the CIM Archives to craft a useful query 3. However, it is unlikely to be focused enough for most users. It is also not obvious how to return the query results when there may be nothing in the query to indicate what type of document the user was searching for.
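
As an illustration of the unrestricted search just described, here is a minimal Python sketch, assuming CIM instances are available as XML strings. The sample documents and their element names are invented stand-ins and are not taken from the CIM. Every element's text and every attribute value is treated as equally searchable, which also shows why the result can only say "this document matched" and nothing about what kind of document the user wanted.

```python
import xml.etree.ElementTree as ET


def unrestricted_search(documents, term):
    """Return the names of documents in which `term` occurs anywhere.

    The structure of the metadata is ignored: element text and attribute
    values are all searched in the same way (case-insensitively).
    """
    needle = term.lower()
    hits = []
    for name, xml_text in documents.items():
        root = ET.fromstring(xml_text)
        for element in root.iter():
            values = [element.text or ""] + list(element.attrib.values())
            if any(needle in value.lower() for value in values):
                hits.append(name)
                break  # one match is enough to report this document
    return hits


# Hypothetical, heavily simplified stand-ins for CIM instances.
SAMPLE_DOCUMENTS = {
    "simulation-1": '<simulation><forcing type="CO2">double CO2</forcing></simulation>',
    "model-1": '<model domain="ocean"><name>Example ocean model</name></model>',
}

print(unrestricted_search(SAMPLE_DOCUMENTS, "ocean"))
```

A scan like this touches every document on every request, so it is only suitable for a small prototype archive.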

Other users may benefit from a more structured query. In order to support anything more complicated than an unrestricted search, a subset of elements or attributes within the CIM must be identified as “facets,” 4 or dimensions to search on. Advanced search uses a set of pre-constrained queries into which a user can plug search terms. For example, in the use case above, "Find me some simulations that used double CO2 forcing," the fact that the query should search the forcing conditions of all simulations for the text “double CO2” is built into the query. It is obvious what the user expects the query to return: simulation documents. An advanced query could in principle be arbitrarily complex. In practice, however, searching more than a few dimensions at once is likely to be confusing for anybody other than “hard-core” users, who will be using the query tool on a regular basis and are therefore willing to invest more time in understanding a complicated interface. It would be useful to use advanced search to process the sorts of common searches that are made by regular users.
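
A pre-constrained query can be pictured as a stored template with a single slot for the user's search term. The sketch below is a hedged illustration in Python: the template name and the `simulation` and `forcing` element names are hypothetical, and the real CIM paths would have to be substituted. The template is written as an XQuery/XPath 2.0 expression, and the helper quotes the user's term so that it cannot break out of the string literal.

```python
def xquery_string_literal(text):
    """Quote `text` as an XQuery string literal.

    Inside an XQuery string literal '&' introduces a character or entity
    reference, and a doubled quote stands for a literal quote character.
    """
    escaped = text.replace("&", "&amp;").replace('"', '""')
    return '"' + escaped + '"'


# Hypothetical pre-constrained queries; {term} is the only slot the user fills.
ADVANCED_QUERIES = {
    "simulations_by_forcing": (
        "//simulation[forcing[contains(lower-case(.), lower-case({term}))]]"
    ),
}


def build_advanced_query(name, term):
    """Fill the named template with a safely quoted search term."""
    return ADVANCED_QUERIES[name].format(term=xquery_string_literal(term))


print(build_advanced_query("simulations_by_forcing", "double CO2"))
```

Because the document type and the searched dimension are fixed in the template, the query tool knows in advance that the result is a list of simulation documents.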

Faceted search allows users to browse the full breadth of CIM instances. It successively narrows down the search space along multiple dimensions. At any stage a facet can be removed from the search and the previous set of query results will be active. This is different from advanced search where a single (albeit a potentially very complex) query is performed all at once.
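
The narrowing and un-narrowing behaviour can be sketched in a few lines of Python. This is an in-memory illustration only, assuming each instance has already been reduced to a flat record of facet values; the facet names and records are invented for the example.

```python
from collections import Counter


class FacetedSearch:
    """Successively narrow a set of records by facet value."""

    def __init__(self, records):
        self.records = list(records)
        self.selected = {}  # facet name -> required value

    def results(self):
        """Records matching every currently selected facet."""
        return [
            record
            for record in self.records
            if all(record.get(f) == v for f, v in self.selected.items())
        ]

    def add_facet(self, facet, value):
        self.selected[facet] = value
        return self.results()

    def remove_facet(self, facet):
        # Dropping a facet widens the results to whatever the remaining
        # facets allow.
        self.selected.pop(facet, None)
        return self.results()

    def value_counts(self, facet):
        """How many current results carry each value of `facet`."""
        return Counter(r[facet] for r in self.results() if facet in r)


# Hypothetical flattened records, one per CIM instance.
RECORDS = [
    {"id": "sim-1", "experiment": "exp-A", "domain": "ocean"},
    {"id": "sim-2", "experiment": "exp-A", "domain": "atmosphere"},
    {"id": "sim-3", "experiment": "exp-B", "domain": "ocean"},
]

search = FacetedSearch(RECORDS)
search.add_facet("experiment", "exp-A")
narrowed = search.add_facet("domain", "ocean")
widened = search.remove_facet("experiment")
```

The `value_counts` helper reflects a common interface choice: showing, next to each facet value, how many results would remain if it were selected.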

Faceted search is most useful when there are three or more dimensions of a classification. Otherwise a simpler hierarchical or tree classification system, where each new group is a sub-type of its parent, is preferred. Using facets also has the advantage that users are not required to have complete knowledge of the entities being classified nor their relationships. This is precisely one of the issues that METAFOR aims to address. However, the CIM may not map well onto faceted search because the dimensions that users will want to search on (based on the use-cases above) are not necessarily restricted to a “closed” CV 5. Thus, the facets are hard to define. Not only are the facet values not well-understood, but there is an overwhelming set of elements and attributes in the CIM to consider searching as facets. Identifying the most effective subset to support is difficult. This makes focusing early effort on faceted search risky.  And, regardless of whether the query tool focuses more on faceted search or advanced search, the facet set must be large enough to allow for useful queries, but not so large as to be overwhelming.

Note that unrestricted search, advanced search, and faceted search are not mutually exclusive. I would expect the delivered solution, though not necessarily the prototype, to include aspects of all three query request types. I would also expect any search on a domain as complex as climate modeling to have a very well-documented interface.

[Figure: designDocument_fig2.jpg (image not available)]

The figure above shows how Curator combines the three request types: Users can apply an unrestricted search (the bottom text-box) onto the result of a faceted search (the middle widgets) constrained by an advanced search (the top radio-button selection).

Whatever method or combination of methods is used, the query has to return something useful to the user. The primary results need to provide enough information for a user to decide if any one warrants further investigation. Upon running a query, full CIM documents do not need to be returned. A straightforward list will suffice. Again, Google provides a good example of this by initially returning high-level descriptions and small snippets of (cached) content for webpages matching queries and not retrieving the entire page content. In addition to being more intuitive, this is also more efficient.

However, this does raise some interesting questions: How does the system decide what subset of metadata should be returned as a query result? Obviously the document name is important, but what else? And how does it retrieve that metadata content without resorting to accessing the full CIM instance? It would be useful to have those relevant subsets of metadata stored separately from complete CIM instances. This additional store may or may not be the same as the subset of metadata that can be searched. That is, there is an argument for extracting the "searchable" bits of a CIM instance from the full XML document as well as the "displayable" bits of a CIM instance. This would be simpler if those two bits were the same, but this won't be known until prototype development begins. Nevertheless, there is a logical distinction between these three sets of metadata. It is worth considering whether they need to be distinguished physically (ie: stored separately) as well.

The primary query results should be able to be analysed and compared before proceeding to secondary results. A useful feature would be the ability to sort the results along different dimensions. A simple list divided into columns where each column heading corresponds to another primary metadata facet and where each column can be sorted would go a long way towards allowing users to quickly analyse the results.
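
A sortable list of primary results needs very little machinery. The following Python sketch assumes each primary result is a flat row keyed by column heading; the column names and rows are hypothetical.

```python
def sort_primary_results(rows, column, descending=False):
    """Return the rows ordered by one column; missing values sort as ''."""
    return sorted(rows, key=lambda row: row.get(column, ""), reverse=descending)


# Hypothetical primary results: one row per matching CIM document.
PRIMARY_RESULTS = [
    {"name": "sim-b", "type": "simulation", "experiment": "exp-2"},
    {"name": "model-a", "type": "model", "experiment": "exp-1"},
    {"name": "sim-a", "type": "simulation", "experiment": "exp-1"},
]

by_name = sort_primary_results(PRIMARY_RESULTS, "name")
# Python's sort is stable, so sorting by a second column afterwards keeps
# the name order within each type.
by_type_then_name = sort_primary_results(by_name, "type")
```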

Secondary results are displayed when a user selects one of the primary results. A secondary result is a full CIM instance displayed on its own. So clearly a mechanism to connect the primary results with the secondary results is needed. Ideally, a secondary result should be returned in such a way that users can browse the internal links between CIM documents and progress onto other secondary results. This should occur without users “losing their place.” A breadcrumb trail showing the route they traveled to arrive at the current secondary result could help with this.
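
One possible shape for such a breadcrumb trail is sketched below in Python. The class and the document identifiers are hypothetical. The design choice illustrated is that following a link appends to the trail, while returning to a document already on the trail truncates it back to that point.

```python
class BreadcrumbTrail:
    """Track the route a user takes through linked CIM documents."""

    def __init__(self):
        self._trail = []

    def visit(self, document_id):
        if document_id in self._trail:
            # Going back to an earlier document: drop everything after it.
            position = self._trail.index(document_id)
            self._trail = self._trail[: position + 1]
        else:
            self._trail.append(document_id)
        return list(self._trail)


trail = BreadcrumbTrail()
trail.visit("simulation-1")
trail.visit("model-1")
trail.visit("experiment-1")
print(" > ".join(trail.visit("model-1")))
```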

The next section will outline the design for the CIM Query Tool.

3. High Level Design (what functionality should the query tool support)

[Figure: designDocument_fig1.jpg (image not available)]

The pieces that need to fit together in order to build the CIM Query Tool are numbered below.

As mentioned earlier, Curator already has a working prototype and is committed to working closely with METAFOR. In fact, they have been modifying their user interface in order to support more "CIM-like" metadata instances. This is already being demonstrated to users and is receiving positive user feedback. METAFOR can learn many lessons from their experiences. However, the underlying technology stack that they have implemented is not what WP4 and WP5 have expected to use; it is not what is most compatible with the design of the METAFOR portal. 6 WP5 needs to weigh the risks of starting with a blank canvas and thus delaying interacting with real users versus building upon an existing solution even if that means starting with technologies that were previously dismissed as inappropriate for METAFOR. Whichever route is chosen, the basic components described below are very similar.

First and foremost, CIM instances have to exist and be archived into a repository. Building CIM instances is not the query tool's responsibility. But choosing the back-end storage medium is.

  1. Database of CIM instances

I am assuming that CIM instances will enter the repository as XML documents which validate against the APPCIM and a given Controlled Vocabulary (currently they are one-and-the-same, eventually they will be separate artifacts). This does not necessarily mean that they will be stored as XML documents, though. Regardless of persistence format, the database will have to support issuing query requests and getting query results.

  2. unrestricted search interface
  3. advanced search interface; this involves creating one or more pre-constrained queries
  4. faceted search interface

All three of these should issue requests via a REST API so that other portals can easily make use of the CIM repository. That API needs to be agreed between WP4, WP5, and WP6.

Next, the facets need to be identified and stored. Recall that these are the subset of CIM elements and attributes which we want to be able to query during advanced and faceted search. Once identified the values for the facets need to be extracted from the CIM and/or CVs. The ultimate format of the CVs is unknown at this point, so the tools will concentrate on extracting these from the CIM.

  5. identify the facets; store them - this may be in a separate store from the aforementioned instance database, or it may simply be a cleverly designed index on the database used to store instances.  A methodology for encoding the facets directly into the CONCIM would be preferred (to avoid hard-coding it into the query tool).

Next, a way to view query results is needed. This has two parts to it:

  6. a “viewer” for primary results
  7. a “viewer” for secondary results

This may be something as simple as a stylesheet applied to incoming XML representations of results. Again, the results should be able to be fetched via a RESTful API. For example, facets and facetValues can exist as parameters embedded within a URL.
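
To illustrate facets and facet values travelling as URL parameters, here is a small Python sketch using only the standard library. The base URL and the facet names are assumptions made for the example; the actual API still has to be agreed between the work packages.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit

BASE_URL = "http://portal.example.org/query"  # hypothetical endpoint


def build_query_url(facets, base_url=BASE_URL):
    """Encode facet name/value pairs as URL query parameters."""
    # Sorting gives the same URL for the same facets, which helps caching
    # and bookmarking.
    return base_url + "?" + urlencode(sorted(facets.items()))


def parse_query_url(url):
    """Recover the facet name/value pairs from a query URL."""
    return dict(parse_qsl(urlsplit(url).query))


url = build_query_url({"type": "simulation", "forcing": "double CO2"})
print(url)
```

Keeping the whole query in the URL means that a set of results can be bookmarked, shared, or requested by another portal without any session state.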

Regarding the list of primary results, as with the facets it may not come from the same store as the CIM instances themselves so some technique of identifying, storing, and accessing the subset of CIM documents which need to be displayed in the primary viewer is needed:

  8. identify the subset of elements and attributes within CIM documents to be presented as primary results
  9. store that subset; this can either be a separate artifact or an index onto the main database used to store CIM instances.
  10. a way of mapping from primary to secondary results.
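
The last two items can be sketched together in Python: a small summary row is extracted from each instance for the primary viewer, and each row carries a key from which the full instance can be fetched as the secondary result. The field names and sample instances are hypothetical, and a plain dictionary stands in for the instance database.

```python
import xml.etree.ElementTree as ET

# Hypothetical choice of elements to show in the primary results.
PRIMARY_FIELDS = ("name", "experiment")


def build_primary_index(instances):
    """Extract one summary row per instance, each carrying its key."""
    index = []
    for key, xml_text in instances.items():
        root = ET.fromstring(xml_text)
        row = {"key": key, "type": root.tag}
        for field in PRIMARY_FIELDS:
            row[field] = root.findtext(field, default="")
        index.append(row)
    return index


def fetch_secondary(instances, row):
    """Map a primary result back to its full instance via the key."""
    return instances[row["key"]]


# A dictionary stands in for the database of CIM instances.
INSTANCES = {
    "doc-1": "<simulation><name>Example run</name><experiment>exp-A</experiment></simulation>",
    "doc-2": "<model><name>Example ocean model</name></model>",
}

primary = build_primary_index(INSTANCES)
full_document = fetch_secondary(INSTANCES, primary[0])
```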

At this point in the project, before the structure of the CIM is finalised and while use cases are still emerging, a flexible approach to the query tool is what's needed. To that end, I will develop a set of tools that incrementally adds functionality beginning with one-dimensional search, or the simplest type of advanced search.

4. Low Level Design (what technologies should be used)

4.1 Chosen Technologies (for prototype)

As WP5 is a work in progress, the following list may change.  Most notably, I will concentrate on getting the backend working (technology to store CIM instances and retrieve CIM documents and snippets, and code to issue query requests) before the frontend (user interface code to craft queries and view results).  But also, METAFOR may benefit from building on top of the existing Curator codebase.  The Python-centric technology stack described below may start off as a Java-centric Curator one.  And Curator focuses on RDF technologies (OWL, Sesame) to perform faceted search, whereas I planned on implementing advanced search first.

The CIM is already very XML-centric, and the emerging code from other WPs is already very Python-centric. This will influence my choice of technologies in the first instance, since I am interested in rapid prototyping. I will be assuming that the portal is built on Pylons. I will also assume a JavaScript front-end for the query interface.

  1. Database of CIM instances - a native XML database, eXist  http://www.exist-db.org/, will be used; queries into eXist can use the XQuery  http://www.w3.org/TR/xquery/ language and can be submitted via a RESTful API.
  2. unrestricted search interface - a simple, though inefficient, XQuery can be built to search all CIM instances; this can be driven through simple JavaScript form elements.
  3. advanced search interface; this involves creating one or more pre-constrained queries - This can be a webform which builds a query. In the short-term, I am only concerned with searching on a single dimension and so a simplistic GUI will do.
  4. faceted search interface - TBD. Previous discussions have suggested that using SPARQL queries against an RDF implementation of the CIM is a good way to do faceted search. This is very similar to the approach taken by Curator, who used Sesame as their RDF triple store.  However, this obviously requires generating RDF versions of CIM documents.  Before committing myself to that added complexity, I would like to investigate whether faceted search can be done using standard database techniques (assuming that eXist can support them).
  5. identify the facets; store them (may be a separate store from the aforementioned instance database) - initially elements and attributes within the CIM could be identified as <<facet>> stereotypes, this information could then be used as the instance is harvested into the portal.
  6. a “viewer” for primary results - Some flavour of JavaScript to turn an XML result into an ordered list
  7. a “viewer” for secondary results - XSLT applied to a CIM instance retrieved from eXist (a stylesheet can be specified as part of the POST request to eXist).
  8. identify the subset of elements and attributes within CIM documents to be presented as primary results - the same technique used to identify the facets can be used to identify the "primary result elements."
  9. a storage mechanism for that subset, though not necessarily a separate physical store
  10. a way of mapping from primary to secondary results - this relies on one of the subset of CIM document content identified in item 8 above to include an identifying "key" from which a full record from the database identified in item 1 above can be retrieved.
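
As a rough illustration of how a query might be submitted to eXist over HTTP, the Python sketch below only builds the request URL and does not contact a server. It assumes eXist's REST interface accepts the `_query`, `_start`, `_howmany` and `_xsl` request parameters, which should be checked against the documentation for the deployed version. The server root and the `/db/cim` collection are assumptions made for the example.

```python
from urllib.parse import urlencode

# Assumed location of a local eXist install and a hypothetical collection.
EXIST_REST_ROOT = "http://localhost:8080/exist/rest"
CIM_COLLECTION = "/db/cim"


def exist_query_url(xquery, start=1, how_many=10, stylesheet=None):
    """Build a GET URL that asks eXist to run `xquery` on the collection.

    `start` and `how_many` page through the result sequence; `stylesheet`
    optionally names an XSLT stylesheet to apply to the response.
    """
    params = [("_query", xquery), ("_start", start), ("_howmany", how_many)]
    if stylesheet is not None:
        params.append(("_xsl", stylesheet))
    return EXIST_REST_ROOT + CIM_COLLECTION + "?" + urlencode(params)


print(exist_query_url("//simulation"))
```

Paging parameters matter here because the primary viewer only needs one screenful of results at a time.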

4.2 Considered Technologies (for deliverable)

The above section describes what I plan on implementing for the prototype query tool (bearing in mind that I may choose to adopt certain technologies and/or techniques from Curator as I become more familiar with their prototype).  This section describes other technologies (including those "endorsed" by Curator) that may prove useful for the deliverable query tool.

An Object Relational Mapper (ORM) could be used to serialise to a database (for persistence) or to an XML document (for transfer). The Python-based Django  http://djangoproject.com or SQLAlchemy/Elixir  http://elixir.ematia.de/trac/wiki are examples of these. This has the advantage of creating the structure of our database tables for us (though it remains to be seen how good an auto-generated RDBMS schema is).

It is worth noting that ORM functionality is planned in a future release of FullMoon  http://projects.arcs.org.au/trac/fullmoon/wiki/FullMoon. If we were to migrate the CIM to be GML-compliant, we may get this (and more) functionality for free.

The controlled vocabularies will eventually be divorced from the CIM and moved to a Controlled Vocabulary (CV) server.  BODC has an existing server implementation for this.  A deliverable version of the query tool may need to populate possible facet values from those servers. 

Faceted Search using OWL/RDF requires, obviously, converting from UML/XSD. There are existing tools (such as GRDDL  http://www.w3.org/TR/grddl-primer/) which generate RDF from HTML/XML (but not UML).

Faceted Search using an XML-specific technology such as XFML?

Mako  http://www.makotemplates.org/ or Genshi  http://genshi.edgewall.org/ can provide templating capabilities to the front-end if XSL is not up to the task.  They both have the advantage of being Python-compatible.

As the front-end develops it may prove useful to build up queries in JavaScript.  Therefore, the jQuery  http://jquery.com/ JavaScript library is worth investigating.

YUI  http://developer.yahoo.com/yui/ is what drives the Curator front-end.  Another possibility is AJAX.  All of these are also JavaScript-based.  However, assuming that the backend database retains a RESTful API (like the one supported by eXist), any advantages that these technologies have regarding accessing a server become less obvious.

5. References


 1 In the long-term a separate controlled vocabulary (CV) constraining the terms that can be used within CIM instances for particular user communities will also exist. Currently, this is built directly into the UML but it should not really be considered part of the CIM.

 2 This document attempts to distinguish between the terms “instance,” “document,” “record,” and “snippet.” Document refers to a complete coherent unit of metadata. Only those CIM elements with the stereotype <<document>> can be treated as a document. The term suggests XML documents and that is currently the transfer format of choice for the CIM. A record, in contrast, suggests a relational database entry. A snippet is an extracted piece of metadata – part of a document or record. Instance is the most generic of these terms. It is used when I do not want to make any implications about the underlying format of the metadata. An instance could be all or part of an XML document, a database entry, an RDF triple, or a text or binary file.

 3 It may be that the release version of unrestricted search has keyword operators like Google (“type:,” “date:,” etc.) to help with this. At this early stage, I am not planning on implementing something so complicated.

 4 Note that I use the term "facet" in the generic sense meaning "aspect." It does not imply I am talking about faceted search. Both faceted search and advanced search require certain aspects of the CIM to be tagged as dimensions upon which users can run queries.

 5 A closed controlled vocabulary is one whose permitted values are completely restricted to the set of enumerated values stored in the vocabulary; the user is not permitted to extend or override that set.

 6 Although, in some sense, if the concerns are properly separated then it is only the APIs that matter; the underlying code should be transparent.
