The Answer is Meta-Data

Position paper, submitted to the Eighth (1999) High Performance Transaction Systems Workshop

Philip A. Bernstein

Microsoft Research

Now that so much of the world’s data and transaction processing services are available on the Internet, there is a growing need for better ways to integrate those data and services for development, management, and end-user access. The heart of such integration is meta-data that describes the things to be integrated. Unfortunately, vendors and researchers aren’t giving meta-data the level of attention it needs for this integration to happen.

In the traditional IT world, data warehouse construction tools are currently the main driver of meta-data requirements. These tools need schemas for the source and target of transformations for data scrubbing and integration. They also need detailed semantic descriptions of those transformations, both for data lineage analysis and for generating code that performs the transformations. Vendors of these tools have proprietary meta-data repositories, and the large database vendors are competing vigorously to integrate these third-party tools into their own repository environments. Progress is being made, both in tool capability and in meta-data integration. But even with the best tools, it’s time-consuming and therefore expensive to scrub data for a warehouse. In effect, this is the static, batch-oriented version of the semantic data integration problem that has interested database researchers for decades. Researchers could help here, for example, by developing more powerful meta-data-driven functionality for data transformations.
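To make the idea of meta-data-driven transformation concrete, here is a minimal sketch in which the mapping between source and target schemas is itself data, so one declarative spec can both drive the transformation and answer lineage queries. All field names and the `MAPPING` structure are invented for illustration; no product works exactly this way.

```python
# The mapping spec is meta-data: each target field records its source
# field(s) and a conversion, declaratively. The same spec serves two
# purposes: executing the transformation and answering lineage queries.
MAPPING = {
    "customer_name": {"source": ["first", "last"],
                      "transform": lambda f, l: f.strip() + " " + l.strip()},
    "zip":           {"source": ["postal_code"],
                      "transform": lambda z: z.zfill(5)},
}

def transform(record):
    """Apply the declarative mapping to one source record."""
    return {tgt: spec["transform"](*(record[s] for s in spec["source"]))
            for tgt, spec in MAPPING.items()}

def lineage(target_field):
    """Data-lineage analysis answered from the same meta-data."""
    return MAPPING[target_field]["source"]

row = transform({"first": " Ada ", "last": "Lovelace", "postal_code": "7"})
# row["customer_name"] is "Ada Lovelace"; lineage("zip") is ["postal_code"]
```

A code generator could just as easily compile `MAPPING` into a standalone scrubbing program; the point is that the transformation’s semantics live in inspectable meta-data rather than in opaque code.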

Despite its great promise, integrated CASE based on shared meta-data continues to be a slow-growth field. Although some enterprise-oriented tools are becoming more repository-centric, most desktop development tools are still strongly file-oriented. Even many high-end configuration management tools use files, not databases. The benefits of using a database-based repository for fine-grained, versioned data sharing between development tools remain compelling: better reuse, better management of versioned configurations, impact analysis, automatic change propagation, inter-tool navigation (test, bug tracking, design models, code), etc. Developers of electronic commerce applications would be major beneficiaries, given the many types of objects that need to be assembled in such applications and the different tools that must be used for these various object types. But products are still far from realizing all these benefits.
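The repository benefits listed above reduce, at bottom, to storing artifacts as versioned, typed, linked objects rather than flat files. The toy sketch below (all names hypothetical) shows why impact analysis becomes a simple graph traversal once inter-artifact dependencies are recorded as meta-data.

```python
from collections import defaultdict

class Repository:
    """Toy repository: versioned artifacts plus typed dependency links."""
    def __init__(self):
        self.versions = defaultdict(list)    # artifact id -> version history
        self.depends_on = defaultdict(set)   # edges: dependent -> dependencies

    def check_in(self, artifact_id, content):
        self.versions[artifact_id].append(content)
        return len(self.versions[artifact_id])   # new version number

    def link(self, dependent, dependency):
        self.depends_on[dependent].add(dependency)

    def impacted_by(self, artifact_id):
        """Impact analysis: everything transitively depending on artifact_id."""
        hit, frontier = set(), {artifact_id}
        while frontier:
            nxt = {d for d, deps in self.depends_on.items()
                   if deps & frontier} - hit
            hit |= nxt
            frontier = nxt
        return hit

repo = Repository()
repo.check_in("design_model", "v1")
repo.link("code", "design_model")   # code derives from the design model
repo.link("test", "code")           # tests exercise the code
# impacted_by("design_model") finds both "code" and "test"
```

With files, the same question requires ad hoc parsing of each tool’s formats; with a shared repository, change propagation and inter-tool navigation use the same dependency graph.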

In the past, poor performance and low functionality were deterrents to using commercial database products for the meta-data supporting design applications, such as environments for electronic commerce or data warehousing. These weaknesses are largely fixed by the latest generation of repository products and by the power of today’s desktop systems, fixed well enough that they are no longer the main impediments to progress. It’s time for another run at using database technology to improve the productivity of designers and developers, and to improve the integration of the artifacts they produce.

In a better world, data and transaction servers would be more self-describing, making meta-data a more central aspect of the development of transaction processing applications and the databases they produce. This better world would have the following characteristics:

Many of these scenarios require that meta-data expand beyond its traditional design-time usage into run-time scenarios. Some of the large application suites use meta-data repositories in this way today. Easy application integration requires that this usage style be expanded to all of the world’s on-line databases and applications.
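One way to picture run-time use of meta-data is a service that publishes its own interface description alongside its operations, so an integration tool can discover and validate calls at run time rather than relying on design-time documentation. The sketch below is purely illustrative; the service, operation, and parameter names are invented.

```python
class OrderService:
    """A hypothetical self-describing on-line service."""
    meta = {
        "service": "OrderService",
        "operations": {
            "place_order": {"params": {"sku": "string", "qty": "int"},
                            "returns": "order_id:string"},
        },
    }

    def describe(self):
        """Run-time meta-data: the service's own interface schema."""
        return self.meta

    def place_order(self, sku, qty):
        return f"ord-{sku}-{qty}"

# A generic integration client can check a call against the published
# meta-data before invoking it, with no out-of-band documentation:
svc = OrderService()
op = svc.describe()["operations"]["place_order"]
assert set(op["params"]) == {"sku", "qty"}
order_id = svc.place_order("A1", 2)
```

Today this role is played piecemeal by interface repositories and type libraries; the scenarios above ask for it to be uniform across all on-line databases and applications.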

Many of the above scenarios could be attained with increased vendor investment using state-of-the-art technology. However, many of them require research. For example, automating the integration of applications will require richer semantic models that can be processed with predictable and acceptable performance. This could be done using algorithms for merging, mapping, and transforming heterogeneous and evolving transaction and database services. Like transaction processing, query processing, or access methods, meta-data management is a field, not just a problem. Database researchers should focus on it, in a systematic way, with the goal of converging on a standard meta-data architecture, powerful general-purpose meta-data management tools, and an orderly approach to applying these to the integration of Internet services and databases.
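As a taste of one building block in that research agenda, consider matching two heterogeneous schemas. The fragment below uses only normalized attribute names; real schema-matching algorithms would also exploit types, structure, and instance data, so this is an illustrative toy, not a proposed technique.

```python
def normalize(name):
    """Crude name normalization: case- and separator-insensitive."""
    return name.lower().replace("_", "").replace("-", "")

def match_schemas(schema_a, schema_b):
    """Return pairs of attributes whose normalized names coincide."""
    index = {normalize(b): b for b in schema_b}
    return {a: index[normalize(a)]
            for a in schema_a if normalize(a) in index}

# Two independently designed order schemas, names invented:
m = match_schemas(["Cust_ID", "OrderDate", "Total"],
                  ["custid", "order_date", "shipped"])
# m pairs Cust_ID with custid and OrderDate with order_date;
# "Total" and "shipped" remain for a human (or a smarter matcher).
```

Scaling from this toy to merging and mapping evolving services, with predictable performance, is exactly the kind of open problem the paragraph above calls out.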