Tuesday, September 05, 2006

Digital Library Issues

Improved Access to Systematics Publications: Releasing the Power of Legacy Data

Systematics publications are unique among scientific publications. Besides their quasi-legal status as a conditional part of the description of a new taxon, as recommended by the International Code of Zoological Nomenclature (ICZN, for animals; similar codes exist for plants, bacteria, viruses and fungi), they are highly structured and standardized. All the descriptive content is linked to a particular taxon, and they are very rich in descriptive data: the original (and subsequent) descriptions of the taxon. These descriptions are not just taxon hypotheses but include varying amounts of morphological and, more recently, molecular characters, materials examined, notes on behavior and distribution (an interpretation of materials examined), nomenclatorial sections, phylogenies, bibliographic references and visual art. Finally, most of the content is factual knowledge, that is, a description of a piece of nature. Recent publications additionally share some of the structural elements of standard scientific publications (e.g. abstract, introduction, acknowledgments).

Over the centuries (since 1758, the year of the 10th edition of Linnaeus’ Systema Naturae), the basic structure of descriptions has not changed substantially. In their most basic layout, publications include a title (title and author) and a list of treatments. Each treatment includes a nomenclatorial section containing, at a minimum, the name of the taxon, a brief description and a mention of its distribution.

Each element in the descriptive part of a systematics publication can thus be related to a particular taxon, at a particular position in a particular publication.

“Red head” of taxon X on page Y in publication Z is enough to locate it in the entire body of our legacy data. To make this machine readable, the entire relationship can be standardized using ontologies (or controlled vocabularies), with DOIs and LSIDs identifying each element. Rod Page shows how to retrieve it. The page number is given by the original designation of the position within the hard copy, which is no longer available from electronic publications. The taxon LSID can be automatically retrieved from the Hymenoptera Name Server (HNS). A controlled vocabulary or ontology is being developed by a group within the International Society of Hymenopterists, with antbase and HNS participation.
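The "taxon X, page Y, publication Z" relationship can be pictured as a small record. The following is only a minimal sketch in Python: the class name, the field names and the identifier values are assumptions made up for illustration, not identifiers issued by the Hymenoptera Name Server or by any DOI registrar.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CharacterStatement:
    """One descriptive statement anchored to a taxon, a publication and a page.

    All names here are hypothetical; they only illustrate the relationship.
    """
    text: str            # the descriptive element, e.g. "red head"
    taxon_id: str        # identifier of the taxon (an LSID in the text above)
    publication_id: str  # identifier of the publication (e.g. a DOI)
    page: int            # page number as designated in the original hard copy

    def locator(self):
        """Return the (publication, page, taxon) triple that locates the statement."""
        return (self.publication_id, self.page, self.taxon_id)


# Placeholder identifiers, invented for this example only.
statement = CharacterStatement(
    text="red head",
    taxon_id="urn:lsid:example.org:taxon:X",
    publication_id="doi:10.0000/example.Z",
    page=42,
)

# A plain dictionary keyed by the locator is enough to look the statement up again.
index = {statement.locator(): statement}
```

Keeping the page number as its own field reflects the point made above: it records the position in the hard copy, which an electronic version may no longer carry.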

This nested structure makes systematics publications prime candidates for automated data extraction, which we are currently trying to develop using, among other taxa, the ant literature as a pilot group (see TWiki on ants). TaxonX is the XML schema we developed and are now applying to an increasing body of publications to test its strength and value.
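To give a rough idea of what extraction from marked-up treatments could look like, here is a small sketch using Python's standard xml.etree.ElementTree module. The element and attribute names (publication, treatment, description, distribution, taxon, page) and the sample content are invented for this example; they are assumptions and do not reproduce the actual TaxonX schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup with invented content; NOT the real TaxonX element names.
SAMPLE = """
<publication id="Z">
  <treatment taxon="Taxon X" page="12">
    <nomenclature>Taxon X, new species</nomenclature>
    <description>Head red; body black.</description>
    <distribution>Madagascar</distribution>
  </treatment>
  <treatment taxon="Taxon W" page="15">
    <nomenclature>Taxon W, new species</nomenclature>
    <description>Head black.</description>
  </treatment>
</publication>
"""


def extract_treatments(xml_text):
    """Return one flat record per treatment, each tied to its taxon, page and publication."""
    root = ET.fromstring(xml_text.strip())
    records = []
    for treatment in root.iter("treatment"):
        records.append({
            "publication": root.get("id"),
            "taxon": treatment.get("taxon"),
            "page": int(treatment.get("page")),
            # findtext returns the default when the child element is missing
            "description": treatment.findtext("description", default=""),
            "distribution": treatment.findtext("distribution", default=""),
        })
    return records


records = extract_treatments(SAMPLE)
for record in records:
    print(record["taxon"], record["page"], record["description"])
```

Because every treatment is nested inside its publication, each extracted record inherits the publication identifier without further bookkeeping, which is the property the nested structure provides.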

In fact, the enormous amount of data collated over the last 250 years is the prime reason to make an effort to find ways to extract this information. For ants alone, systematics publications comprise over 90,000 pages; for all the world’s currently described species, the total is most likely several tens of millions of pages. If we are lucky, some of it will be made accessible through the Biodiversity Heritage Library, the Biologia Centrali Americana, Animal Base and other digitization efforts.

At antbase we are currently looking into marking up all of the ca. 120 publications covering the Malagasy ant fauna, the ant publications in the American Museum of Natural History Novitates, and incoming new publications.

Since descriptions are factual knowledge, they cannot be copyrighted and can thus be made accessible over the Web. Scientific practice demands an acknowledgement of authorship, which at the same time serves as proof of quality (i.e. that the work has been published).