29 May 2008
An Indicator Experiment
Background
The use of macroinvertebrates as biological indicators of water quality has a long history [1], and variants of the biotic index developed by William Beck in the ’50s [2] are currently in wide use in stream and river monitoring efforts. In [3], for example, macroinvertebrates are divided into four classes: those that are sensitive to pollutants (e.g. Mayflies); semi-sensitive to pollutants (e.g. Dragonflies); semi-tolerant of pollutants (e.g. right-side opening snails); and tolerant of pollutants (e.g. left-side opening snails, the filthy beasts). The EPA has an excellent collection of pages describing the nature of the pollution sensitivity of each taxon [4]. The surveyor counts the number of taxa from each class and plugs the results into a simple formula to come up with the index.
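To make the "simple formula" concrete, here is a minimal sketch of one common form of the calculation: a weighted average over the four classes. The weights (4, 3, 2, 1) and the averaging step are assumptions for illustration; the exact formula in [3] may differ, and the names biotic_index and WEIGHTS are hypothetical.

```python
# Assumed weights per sensitivity class; these are illustrative and are
# not taken from the post or verified against reference [3].
WEIGHTS = {
    "sensitive": 4,
    "semi_sensitive": 3,
    "semi_tolerant": 2,
    "tolerant": 1,
}


def biotic_index(taxa_counts):
    """Return the weighted average of class weights.

    taxa_counts maps a class name to the number of distinct taxa
    found in that class at the survey site.
    """
    total = sum(taxa_counts.values())
    if total == 0:
        raise ValueError("no taxa recorded")
    weighted = sum(WEIGHTS[cls] * n for cls, n in taxa_counts.items())
    return weighted / total


# Hypothetical survey: 2 sensitive, 3 semi-sensitive, 1 semi-tolerant,
# and 2 tolerant taxa.
sample = {"sensitive": 2, "semi_sensitive": 3, "semi_tolerant": 1, "tolerant": 2}
score = biotic_index(sample)
print(score)
```

Under these assumed weights, a higher score means the site is dominated by pollution-sensitive taxa.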
Ontology
This seemed like a useful collection of concepts to represent in OWL, and so we created an indicator ontology [5], which defines the following class structure:
<BiologicalIndicator>
<AquaticBiologicalIndicator> <rdfs:subClassOf> <BiologicalIndicator>
<SensitiveAquaticThing> <rdfs:subClassOf> <AquaticBiologicalIndicator>
<SemiSensitiveAquaticThing> <rdfs:subClassOf> <AquaticBiologicalIndicator>
<SemiTolerantAquaticThing> <rdfs:subClassOf> <AquaticBiologicalIndicator>
<TolerantAquaticThing> <rdfs:subClassOf> <AquaticBiologicalIndicator>
Then we asserted the assorted indicator taxa to be subClasses of one of SensitiveAquaticThing, SemiSensitiveAquaticThing, SemiTolerantAquaticThing, or TolerantAquaticThing.
Queries
Here are a couple of queries that we’ve run which combine the indicator ontology, our invasives ontology [6], our tree of life ontology [7], our food web data [8], and the rdf representation of some EPA spreadsheet data on macroinvertebrate counts from North Carolina [9]. (The spreadsheet data is converted on-the-fly via the rdf123 web service [10].)
i. FIND OBSERVATIONS OF BIOLOGICAL INDICATORS.
Of course, this should retrieve almost all the EPA spreadsheet data. Interestingly, when we run it without the tree of life data, it only results in a small number of hits. This is because the ontology designates taxa to the various sensitivity classes at (typically) the family or order level, whereas the reporting is done (typically) at the genus or family level. When we do include the tree of life in the dataset, we get thousands of hits (as expected).
ii. FIND INVASIVE PREDATORS OF THE MACROINVERTEBRATES OF NORTH CAROLINA, WITH LOCATION.
This results in a number of hits [11]. But it’s not really scientifically interesting, since most fish will eat most insects, provided that the fish and insect are co-located. This query is designed more to show off the integration possibilities provided by our approach. Useful queries on the data remain a goal (see below).
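The effect described under query i, where the tree of life data is needed to connect genus-level observations to indicator classes asserted at the order or family level, can be sketched in a few lines. This is a toy model, not the actual query machinery: the subclass links are stored in plain dictionaries with a single parent per class, the placement of Ephemeroptera (mayflies) directly under SensitiveAquaticThing is illustrative, and the function names are hypothetical.

```python
# Subclass assertions from the indicator ontology (child -> parent).
# The Ephemeroptera link is an illustrative order-level assertion.
INDICATOR_EDGES = {
    "AquaticBiologicalIndicator": "BiologicalIndicator",
    "SensitiveAquaticThing": "AquaticBiologicalIndicator",
    "Ephemeroptera": "SensitiveAquaticThing",
}

# Subclass assertions of the kind a tree of life ontology supplies.
TREE_OF_LIFE_EDGES = {
    "Baetidae": "Ephemeroptera",  # family -> order
    "Baetis": "Baetidae",         # genus -> family
}


def ancestors(cls, edges):
    """Walk child -> parent links and return every ancestor of cls."""
    found = []
    while cls in edges and edges[cls] not in found:
        cls = edges[cls]
        found.append(cls)
    return found


def is_indicator(taxon, edges):
    """True if taxon is a (transitive) subclass of BiologicalIndicator."""
    return "BiologicalIndicator" in ancestors(taxon, edges)


# Observations reported at genus level and at order level.
observed = ["Baetis", "Ephemeroptera"]

without_tree = [t for t in observed if is_indicator(t, INDICATOR_EDGES)]
combined = {**INDICATOR_EDGES, **TREE_OF_LIFE_EDGES}
with_tree = [t for t in observed if is_indicator(t, combined)]

print(without_tree)
print(with_tree)
```

Without the tree of life links, only the order-level observation is recognized as an indicator; with them, the genus-level observation is recognized too.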
Near-term Future Work
i. Calculate biotic indices for a variety of un-assessed water bodies. Often, if macroinvertebrate data is collected, it is for the express purpose of calculating a biotic index, and so our approach adds nothing. We do have some Sierra Nevada food web data, soon to be published in RDF, where the macroinvertebrate community data falls out of the food web. So this is probably where we’ll start.
ii. Get data on chemical pollutants. This will enable some interesting correlations to be done with not only the macroinvertebrate data, but also with presence/absence data on invasive fish.
iii. Expand the indicator ontology. If you look at the ontology [5], you’ll see that it has plenty of room to grow. We’d like to add concepts like NonBiologicalAquaticIndicator, AirQualityIndicator, etc. But we won’t add these concepts until seeing the actual instance data behind them.
iv. Fix some mistakes in the ontology. For example, bloodworm midges are currently equated with Chironomidae, which is not accurate.
Comments, Suggestions, Better Ideas
Please.
References
1. http://www.uwsp.edu/cnr/research/gshepard/History/History.htm
2. http://www.washjeff.edu/Chartiers/Chartier/BIOTIC.html
3. http://watermonitoring.uwex.edu/pdf/level1/data-Biotic.pdf
4. http://www.epa.gov/bioindicators/html/invertebrate.html
also, e.g. http://www.epa.gov/bioindicators/html/stoneflies.html
5. http://spire.umbc.edu/ontologies/IndicatorOntology.owl
6. http://spire.umbc.edu/ontologies/InvasivesOntology.owl
also, http://spire.umbc.edu/ontologies/lists/ISSG-GISD.owl
7. http://spire.umbc.edu/ont/ethan.php
8. http://spire.umbc.edu/ont/allFoodWebStudies.owl
9. http://rdf123.umbc.edu/server/?src=http://www.csee.umbc.edu/~jsachs/water_bugs_big.csv
10. http://rdf123.umbc.edu/
11. http://cs.umbc.edu/~jsachs/InvasivePredators.html (These are distinct predator/prey combinations, without locations.)
1 November 2007
Filling A Niche
There has never been a widely adopted RDF vocabulary for representing geographic shapes. Four years ago the W3C came up with a Basic Geo Vocabulary which was restricted to representing points via their latitude and longitude, but gave no specification on how to represent line and polygon features.
The W3C Geospatial Incubator Group has just published as their final report a pair of documents on Geospatial Vocabularies and Geospatial Ontologies. In so doing they have come up with a GeoOWL ontology that includes classes for points, lines, polygons, and boxes. It is based largely on the GeoRSS specification for encoding geographical information in RSS feeds. The ontology does not have a model for spatial relationships, e.g. being able to say a feature is contained within another. Nevertheless, being able to associate geographic shapes with any other entity solves many of the semantic modelling problems associated with biodiversity data.
28 September 2007
Report from TDWG: RDF is not dead
Ontologies and RDF were the big buzzwords at the TDWG 2007 meeting this year in Bratislava, Slovakia. This is surprising because two years ago at the St. Petersburg meeting I felt something like an outside semantic web agitator, conspiring in a smoky bar with a few other like-minded colleagues such as Kathi Schleidt. Also, I had just flown to Slovakia from a wedding at which two people from the IT world independently said, “RDF, isn’t that dead?”
During the last year, TDWG has been exploring the potential of richer semantics, led by Roger Hyam and others. They have outlined a technical architecture implying that all TDWG standards should be enabled for W3C semantic web technologies.
Some background for non-TDWGians. The Taxonomic Databases Working Group (pronounced “TAD wig”), recently renamed Biodiversity Information Standards, is the international body that has been working for several decades to define standards for data exchange among natural history museums. They develop and approve standard schemas (e.g. ABCD) and protocols for exchanging biological specimen data (e.g. TAPIR) and other kinds of related taxonomic information (e.g. descriptive data and literature). Applications that use these standards can then be implemented. TDWG standards make portals like GBIF possible; they are also supposed to underpin the Encyclopedia of Life. There is discussion of enlarging the scope of TDWG to go beyond the museum-oriented information it has focused on in the past. This makes sense to many of us because of the primary importance of biological taxa in many related fields such as ecology.
Many in TDWG are still trying to understand why it may be important to provide RDF-based solutions, especially when years have been spent developing XML schemas. The most compelling argument to me is that over and over again, individual projects end up solving their particular problems (a need for flexible schemas, an inability to effectively map XML schemas) by independently deriving RDF-like solutions. Specific examples of which I am aware are the ALTERNet project in Austria and the Spider Assembling the Tree of Life project in the United States.
Over the next few weeks I’ll flesh out my thoughts coming away from TDWG 2007. At the moment, they fall into two categories of ideas.
- How do the Spire project’s ETHAN (our Evolutionary Trees and Natural History Ontology) and related products (InvasivesOntology, SpireEcoConcepts, etc.) relate to existing and proposed TDWG standards? What has Spire learned from its modeling and tool-building that might be helpful to the overall effort? We have experience to share as interest groups with complex schemas like SDD (Structured Data Description) and TaXMLit (Taxonomic Literature) contemplate how to proceed.
- What architecture really makes sense for a semantic TDWG? Spire models its architecture on a highly distributed web document push model, indexed by global semantic search engines like Swoogle. In contrast, TDWG tends to assume that content providers form an organized network where consumers pull data directly from nodes using mutually agreed upon protocols.
For now, I’ll just note that a move to the semantic web is hindered by two significant barriers. First, we don’t have really impressive examples of the power of the semantic approach. Computer scientists talk about wines and FOAF. Inspiring but not entirely convincing to biologists. Spire has done some proof-of-concept work using real data to answer fake questions. We really need to tackle something where the result is publishable in a scientific venue. Until we get this even I have to remain skeptical that all this headache-inducing work is worth it. The SEEK project is working on semantic data flows in ecology; maybe they can point to a simple success story that will resonate with the taxonomic community.
Second, we don’t have user-friendly tools that biologists, much less non-ontologist developers, can use to jump into the semantic fray. Bob Morris spoke about this and I agree. We’ve worked a little on this in Spire with Spotter and RDF123, and I’ve done a little on Leptree.net. There is going to be a SWUI workshop at CHI (Computer Human Interfaces) next year, so I like to think the community is working to remedy this.
17 September 2007
Spotter 1.0
We’ve released a new version of Spotter, our Firefox plug-in for semantic eco-blogging. Launching Spotter brings up a form that prompts for values such as reporter, observer, taxon, common name, lat, long, etc. Upon form submission, the user is given the URI of the RDF record that was created from her input. This URI can then be used to link to the RDF from a blog post. (This is what we’ve been doing on the Fieldmarking blog – click on the owls to view RDF.) The plug-in can also be used to provide RDF annotation to someone else’s image, either on a blog or on a photo-sharing site.
More background on our eco-blogging effort is here
You can download Spotter here
The new version supports default values, and includes a map-based lat/long finder. Comments and suggestions are very welcome.
14 September 2007
TOAD Modelling
A basic query to ask in biogeography is ‘what species live here?’. There are a lot of data resources to answer that question. I can come up with at least four different types of data resources. First, there are direct records of observations of species, one online example being the citizen science effort eBird, which allows birdwatchers to record on the Web what birds they’ve seen. Second, there are coarse-scale range maps, an example being the map in this species account on the mountain lion. Third, there are species lists collected over a substantial period of time, such as lists from parks and nature reserves. Finally, there are probabilistic distribution models generated by tools such as openModeller.
We have just started work on a Semantic Web application that will return information in RDF on species status and distribution for a selected geographic area. The aim of the application is to provide a framework for amalgamating the four types of data sources above into some sort of uniform species list. (I’m calling this TOAD data, for Taxon Observation And Distribution. A TOAD ontology may be in the not-too-distant future.)
I think the basic data granule here comes down to who-where-what-when? That is, a combination of data source by geographic region by taxonomic entity by time period. For now the idea is to concatenate the who-where-what-when parameters into a single URI, thus designating it as a resource over which one can return an RDF description. Variants of this URI pattern will return information such as species lists for a particular region (either according to one data source or across all data sources handled by the system), sets of observation data for a particular species, or metadata for a data source.
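As a rough sketch of the URI idea, the who-where-what-when parameters can be joined into path segments, with omitted parameters yielding the broader variants (for example, a species list for a region across all data sources). The base URL, the segment layout, and the function name toad_uri are all hypothetical; the application's real URI scheme is not specified here.

```python
from urllib.parse import quote

# Hypothetical base URL; the real service address is not given in the post.
BASE = "http://example.org/toad"


def toad_uri(who=None, where=None, what=None, when=None):
    """Concatenate the supplied who-where-what-when values into one URI.

    Parameters left as None are omitted, which gives the broader
    variants of the pattern.
    """
    parts = [("who", who), ("where", where), ("what", what), ("when", when)]
    segments = [
        f"{key}/{quote(str(value), safe='')}"
        for key, value in parts
        if value is not None
    ]
    return "/".join([BASE] + segments)


# One fully specified data granule.
full = toad_uri(who="eBird", where="Yolo County", what="Puma concolor", when="2007")

# A species list for a region across all data sources.
region_list = toad_uri(where="Yolo County")

print(full)
print(region_list)
```

Percent-encoding each value keeps spaces and slashes in names from breaking the path structure.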
10 September 2007
Naming The World
Last winter I gave a talk about gazetteers to a geography seminar here. I mentioned some of the important online gazetteer resources (e.g. the Alexandria Digital Library gazetteer or the Getty Thesaurus of Geographic Names), but somehow hadn’t yet run across perhaps the most interesting web-oriented gazetteer project to date, GeoNames.
The GeoNames gazetteer project has been around for a couple of years and currently contains more than eight million geographic names representing 6.4 million unique geographic entities. The project started off by assembling some of the major public domain gazetteer resources such as the USGS Geographic Names Information System for place names in the US and the National Geospatial-Intelligence Agency database for non-US names, but has expanded its content to include many other sources as well. GeoNames provides an elaborate RESTful API to its data and in a wiki-like fashion allows people to enter or correct placenames on their own.
A year ago, Bernard Vatant set about the task of integrating GeoNames into the Semantic Web, and with a little help from Harry Chen put together an ontology for placenames in GeoNames together with a URI scheme for the placenames backed by Semantic Web-friendly content negotiation.
From the point of view of somebody who wants to refer to placenames in a Semantic Web document, this set of URIs is a great resource. As a case in point, we are interested in providing species lists in RDF for various geographic regions. Often such species lists refer to named regions rather than geographic coordinates (for instance “Yolo County, California”). It is now straightforward to come up with a URI for such a region — Yolo County is represented by http://sws.geonames.org/5410882/. Even better, the fact that users can add their own placenames allows the creation of good URIs for locations with species lists that aren’t yet in GeoNames.
GeoNames is getting ever more comprehensive, though. It’s fun to see that the building where I’m writing this now, Wickson Hall, already has a GeoNames URI.
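To illustrate how such a placename URI might be used, the sketch below writes a single RDF statement in N-Triples syntax that ties a species list to the GeoNames URI for Yolo County. Only the GeoNames URI comes from the text above; the species list URI and the coversRegion predicate are hypothetical placeholders, not part of any published vocabulary.

```python
# GeoNames URI for Yolo County, as quoted in the post.
YOLO_COUNTY = "http://sws.geonames.org/5410882/"

# Hypothetical identifiers, invented for this example only.
SPECIES_LIST = "http://example.org/lists/yolo-birds"
COVERS_REGION = "http://example.org/toad#coversRegion"


def ntriple(subject, predicate, obj):
    """Format one statement whose three terms are all URIs as an N-Triples line."""
    return f"<{subject}> <{predicate}> <{obj}> ."


line = ntriple(SPECIES_LIST, COVERS_REGION, YOLO_COUNTY)
print(line)
```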
7 September 2007
Avoiding Islands
I’ve been inspired to start this blog by learning of the Linking Open Data project. This is a Semantic Web project to create an interlinked data commons on the web using RDF to link across open datasets. The project is still young, but has grown impressively. The figure at right is their diagram of the currently linked datasets. The whole network has well over 2 billion RDF triples in it, and the datasets are interlinked by close to a million RDF links.
Though this network is rich, as of now it contains little in the way of scientific datasets. In the course of the Spire project, we would like to begin extending this network to biodiversity and natural history information sets. There is already a great deal of such content on the web; this catalog from TDWG lists 556 different biodiversity informatics projects to date.
The trouble with this set of biodiversity informatics projects is that the vast majority of these are islands, with little means to network these data across projects. History is partly to blame here — many of these projects were started before the rise of the Programmable Web, and the notion of supplying open web APIs for data access was simply not part of the developers’ thinking.
In the posts that follow, we will be exploring tools, projects, and other advances that may help to lead to a well-developed semantic network for natural history information.

