wiki:tickets/193

Extracting CIM metadata from code and verifying that CIM metadata and code are consistent

This work forms part of the CIM Automatic Capture Tools part of Workpackage 6.

One way to generate CIM metadata that describes software is to extract it from the software itself.

The three ways (that I can think of) to do this are:

1: provide an API that software must conform to.

As a simple example, the API might require each software component to provide a subroutine called GET_CIM_COMPONENT_NAME() that returns the name the CIM should record for that component.

2: provide a commenting convention that the software must conform to.

Continuing the component name example, one might require a model to contain a comment such as !CIM COMPONENT_NAME <MY_COMPONENT_NAME>. These comments could then be extracted either with a modified version of a code documentation system (such as DOXYGEN) or with bespoke pattern-matching software (as FCM does for dependency analysis, for example).
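The bespoke pattern-matching route could look something like the Python sketch below. It is only an illustration: it assumes each comment sits on a line of its own in the form "!CIM KEY value", and both the function name extract_cim_comments and the TIMESTEP key are made up for this example rather than being part of any agreed convention.

```python
import re

# Assumed convention (hypothetical): "!CIM <KEY> <value>" on a line of its own.
CIM_COMMENT = re.compile(r"^\s*!CIM\s+(\w+)\s+(.+?)\s*$")


def extract_cim_comments(source):
    """Return a {key: value} dict built from every !CIM comment line."""
    metadata = {}
    for line in source.splitlines():
        match = CIM_COMMENT.match(line)
        if match:
            key, value = match.groups()
            metadata[key] = value
    return metadata


example = """\
program ocean
!CIM COMPONENT_NAME my_ocean_model
!CIM TIMESTEP 1800
  implicit none
end program ocean
"""

print(extract_cim_comments(example))
```

All values come back as strings; converting them (a timestep to a number, say) would be a separate step once the convention fixes what each key means.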

3: parse the code and extract the required information (where possible).

This third approach has the benefit that code developers do not need to modify their code. However, the information that can be extracted may be limited. For example, one might have to assume that the CIM component name is the name of the main program, the module, etc. (depending on the way the component is written).
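To make the "assume the component name is the program or module name" idea concrete, here is a minimal Python sketch. It uses a regular expression as a stand-in for a real Fortran parser, so it only recognises simple PROGRAM and MODULE statements; the function name guess_component_name is hypothetical.

```python
import re

# Matches "program <name>" or "module <name>" with nothing else on the line.
UNIT_STATEMENT = re.compile(r"^\s*(program|module)\s+(\w+)\s*$", re.IGNORECASE)


def guess_component_name(source):
    """Return the name of the first PROGRAM or MODULE found, or None."""
    for line in source.splitlines():
        code = line.split("!", 1)[0]  # drop any trailing comment
        match = UNIT_STATEMENT.match(code)
        if match:
            return match.group(2)
    return None


print(guess_component_name("PROGRAM ocean ! main driver\nend program ocean"))
```

A component spread over several modules would need a rule for choosing between them, which is exactly the kind of assumption this approach forces.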

An obvious first question is "What CIM information needs to be captured for a software component?" The natural follow-up question is then "Can this information be better captured in other ways?" The final question (if the answer to the second question is no) is "Can this information be captured directly from the code?"

The CIM UML currently contains the following softwareComponent attributes:

  • a component type (processor or model)
  • a description
  • whether it is embedded or not (I think this means whether it is visible to an external coupler or not).
  • a component name (or at least the concept that it can be referenced)
  • a set of properties which (I think but might be wrong) will eventually describe names (such as Sea Surface Temperature) and how these names get/provide their values (i.e. the values might be embedded, read in/written out via a read/write statement, or passed to/from a coupler, including via arguments).

In addition a processor component type can have

  • a type of transformation
  • a conservation flag
  • regridding information if required
  • other things!!!

A model component type can have

  • a timestep
  • an equation (whatever that means)
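For orientation, the attributes listed above could be held in a small record like the Python sketch below. The field names are my own guesses for illustration and are not taken from the CIM schema; items I am unsure how to represent (regridding information, the equation) are left out.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SoftwareComponent:
    """Illustrative record only; field names are hypothetical, not CIM names."""
    name: str
    component_type: str                    # "processor" or "model"
    description: str = ""
    embedded: bool = False
    properties: dict = field(default_factory=dict)
    # Only meaningful for a processor component
    transformation_type: Optional[str] = None
    conservative: Optional[bool] = None
    # Only meaningful for a model component
    timestep_seconds: Optional[float] = None

    def __post_init__(self):
        if self.component_type not in ("processor", "model"):
            raise ValueError("component_type must be 'processor' or 'model'")


ocean = SoftwareComponent(name="ocean", component_type="model",
                          timestep_seconds=1800.0)
print(ocean)
```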

Of course, a softwareComponent may contain other software components and therefore have coupling information. I suspect that most of the required information in this case can be extracted automatically from the coupler being used: for example, ESMF plans to be able to write out metadata, Oasis3 has the namcouple file, Oasis4 has the smioc and scc, and TDT has coupling configuration files. Properties that are read from or written to a file directly, and are therefore not visible to existing couplers, are more of an issue.

So most of the above information that is internal to a component (i.e. not part of coupling) cannot be picked up directly from code, because it is not embedded in the code (e.g. the timestep, the component type, the description).

What can be picked up automatically are variables that are read from or written to a file, variables that are given internal values, and variables that are passed by argument. However, this list will probably be much larger than the modeller generally wants or requires, although it might be useful as a reference.
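As a rough illustration of the file I/O case, the Python sketch below lists plain variable names that appear in simple READ and WRITE statements. It is a regular-expression approximation, not a parser: it assumes one statement per line, no nested parentheses in the control list, and it silently skips array elements and literals. The function name list_io_variables is hypothetical.

```python
import re

# Simple "read(<control list>) <items>" or "write(<control list>) <items>" lines.
IO_STATEMENT = re.compile(r"^\s*(read|write)\s*\(([^()]*)\)\s*(.+)$", re.IGNORECASE)
PLAIN_NAME = re.compile(r"[A-Za-z]\w*")


def list_io_variables(source):
    """Return sorted names of plain variables that are read and written."""
    found = {"read": set(), "write": set()}
    for line in source.splitlines():
        code = line.split("!", 1)[0]  # drop any trailing comment
        match = IO_STATEMENT.match(code)
        if not match:
            continue
        direction = match.group(1).lower()
        for item in match.group(3).split(","):
            name = item.strip()
            if PLAIN_NAME.fullmatch(name):  # skip literals and array elements
                found[direction].add(name)
    return {key: sorted(names) for key, names in found.items()}


fortran = """\
read(10,*) sst, salinity   ! initial state
sst = sst + 1.0
write(20,*) sst
"""
print(list_io_variables(fortran))
```

Even on real code this kind of list would need filtering by hand, which fits the point that it is more a reference than the metadata the modeller wants.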

My first thought, therefore, is that some form of code commenting convention might be the best way to extract information, with full code parsing being of some potential use in providing a list of all data that is input and output.

The other potential benefit of parsing code in some form is being able to check that the metadata produced is consistent with the code. This is particularly useful if the metadata has been produced by hand. Again, as mentioned above, only some checks can be made against the code, as code does not embed some (most?) of the required metadata.
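A consistency check of this kind might be sketched in Python as follows. It assumes both the hand-written metadata and whatever was extracted from the code have already been reduced to simple key/value dictionaries; the function name find_inconsistencies and the keys in the example are hypothetical. Items that the code does not embed are skipped rather than reported, since nothing can be checked for them.

```python
def find_inconsistencies(hand_written, extracted):
    """Compare hand-written metadata with values extracted from the code.

    Only keys present on both sides are compared; a key missing from
    `extracted` means the code does not embed that item.
    """
    problems = []
    for key, expected in sorted(hand_written.items()):
        if key not in extracted:
            continue
        if extracted[key] != expected:
            problems.append(
                f"{key}: metadata says {expected!r}, code says {extracted[key]!r}"
            )
    return problems


hand_written = {"COMPONENT_NAME": "ocean", "DESCRIPTION": "An ocean model"}
extracted = {"COMPONENT_NAME": "ocean_v2"}
for problem in find_inconsistencies(hand_written, extracted):
    print(problem)
```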

Technology

As mentioned before, the obvious way to extract comments is either a modified version of DOXYGEN (or a similar tool) or bespoke pattern-matching code.

For parsing code, a modified version of OFP seems to be a reasonable way forward.