Metadata and tagging

Delivering on the expectations in WIPII will require making much better use of metadata and tags than is currently the case. Only if content is described accurately can we surface it in the right places, identify other related content and enable users to easily find what they are looking for. This page captures only a few aspects that might prove tricky; it is not the complete requirements set for metadata. I know that Squiz already offers much functionality in this area, with which Andrew or other developers will become familiar. I am sure many of the common elements (date created, date expires (or maybe a content review date), author, asset type, etc.) will serve our needs well.

Although all metadata is data about other data, in this context "metadata" refers to the more rigid, predefined approach known as a taxonomy, and "tagging" to the more flexible, user-defined approach. While we will often use only a metadata approach to describe our digital assets, there are times when allowing users to select their own terms is beneficial, either because we have not provided the correct values or so that we can learn what the most common values are.

In a previous life I was responsible for a metadata working group that created and incentivised the adoption of a new approach to metadata for the NZ compulsory education sector. Of note was the blend of "structured learning resource metadata" (based on IEEE LOM and Dublin Core) and "user annotated" (or "peer enhanced") metadata. Feel free to read over the final report and the vocabulary and numbering schema. For more on this user-peer approach see this child page.

The two approaches do not need to be mutually exclusive. Rather, they can play complementary and supporting roles. Both could be useful for an area like research expertise, where whatever value list we provide will be insufficient for some researchers. But in most areas there will be little or no need for the user annotated or peer enhanced approach.

Topic-subject taxonomy

The single most complex area will be the topic/subject taxonomy. This is not only because of the large number of values (including the possibility that lower-level subjects will appear in more than one branch and that entire topic branches might repeat under different mega-topics), but also because I hope that what was designed for navigating to the subjects we teach will also work for research areas (for both the Media guide and Research expertise). If this works, the taxonomy can also function as a keyword tag set for many other asset types (e.g. news, events, features, etc.).
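
To make the "subject in more than one branch" point concrete, here is a minimal sketch of one way to hold the taxonomy so that a subject can sit under several topics. The topic and subject values, and the function names, are illustrative assumptions only, not our published taxonomy or anything Squiz provides.

```python
from collections import defaultdict

# Hypothetical (topic, subject) pairs; the real values would come from the
# published taxonomy. A subject may sit under more than one topic.
TOPIC_SUBJECTS = [
    ("business", "accounting"),
    ("business", "economics"),
    ("business", "finance"),
    ("social sciences", "economics"),
    ("social sciences", "psychology"),
]


def build_indexes(pairs):
    """Return (subjects_by_topic, topics_by_subject) lookup tables."""
    subjects_by_topic = defaultdict(set)
    topics_by_subject = defaultdict(set)
    for topic, subject in pairs:
        subjects_by_topic[topic].add(subject)
        topics_by_subject[subject].add(topic)
    return subjects_by_topic, topics_by_subject


def sibling_subjects(subject, subjects_by_topic, topics_by_subject):
    """Subjects that share at least one topic with the given subject."""
    siblings = set()
    for topic in topics_by_subject.get(subject, ()):
        siblings |= subjects_by_topic[topic]
    siblings.discard(subject)
    return siblings


subjects_by_topic, topics_by_subject = build_indexes(TOPIC_SUBJECTS)
```

Storing the relationship as pairs rather than as a strict tree means a subject is recorded once per branch it belongs to, and the same subject values can be attached to news, events or features as keyword tags.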

Some challenges with this approach are:

  • Alternative terms: How do we discover the alternative terms by which a user might know one of our subjects? This ranges from the easy-to-predict area of abbreviations (taxation is also referred to as tax, mathematics as maths or (in America) math) and acronyms (e.g. English for speakers of other languages as ESOL) to the more challenging area of thesaurus terms (teaching and teacher training could both map to education).
  • Alternative spelling: We do not want new tags that are only an alternative spelling of the official tag value, whether the spelling was a mistake or an intentional alternative. Suggesting close matches while people are typing is useful, homing in on likely values. See http://www.google.com or http://www.trademe.co.nz for good examples of this in action.
  • Growing our taxonomy: We ought to review the user annotations periodically, seeking to learn from the terms our users prefer and reflecting on how we evolve our official tag set.
  • For humans or machines: We need to decide if the primary user of the metadata/tags is humans or machines. If the latter, we might want a numbering system/schema, as numeric identifiers are more precise and more efficient to store and move. This was an essential element of the approach that I previously sponsored, as we had multiple repositories of learning resources and yet harvested all the metadata into a central repository for search/discovery (DigitalNZ). We may not have this complexity and might therefore have less need for a numbering system. However, we still have the issue described in the next point.
  • Same string at different levels: For example, accounting is both a topic and a subject, as is economics or psychology. It will most likely be necessary to know which level has been tagged, as the relationship between subjects would be different from that between topics. For example, economics as a subject might be related to other subjects like econometrics (as they are in the same topic) and finance (in another topic), and as a topic be related to other topics such as government or social sciences. We could possibly work around this by a smart combination of UI (only store tags at the subject level even if the interface appears to allow topic-level tagging) and/or storing a string that captures both the value and its position (e.g. the string might be "topic:economics").
  • Relative importance of taxonomy values and tags: We need to determine if we give a higher weight to the values selected from the structured/official/published set than to the user annotated values. How do we want to treat them? Is this possible technically?
  • Intended audience: Would we gain benefit from an element that targets an asset to an audience or a specific part of the site directly? For example, we might have two descriptions of the same subject, one for undergraduates and one for postgraduates.
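
Three of the challenges above (alternative terms, alternative spellings, and the same string at different levels) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the value list, the alias table and the function names are hypothetical, and the close-match step uses the standard library's difflib rather than whatever Squiz or a search engine would actually provide.

```python
import difflib

# Hypothetical official subject values; the real list would come from the taxonomy.
OFFICIAL_SUBJECTS = ["taxation", "mathematics", "education", "economics", "accounting"]

# Hypothetical alias table mapping alternative terms to an official value.
ALIASES = {
    "tax": "taxation",
    "maths": "mathematics",
    "math": "mathematics",
    "teaching": "education",
    "teacher training": "education",
}


def resolve(term):
    """Return the official value for a term, or None if it is not known."""
    key = term.strip().lower()
    if key in OFFICIAL_SUBJECTS:
        return key
    return ALIASES.get(key)


def suggest(term, limit=3):
    """Suggest official values whose spelling is close to the typed term."""
    key = term.strip().lower()
    return difflib.get_close_matches(key, OFFICIAL_SUBJECTS, n=limit, cutoff=0.6)


def qualified_tag(level, value):
    """Build a string such as "topic:economics" that records value and level."""
    if level not in ("topic", "subject"):
        raise ValueError(f"unknown level: {level}")
    return f"{level}:{value.strip().lower()}"


def parse_tag(tag):
    """Split a qualified tag back into (level, value)."""
    level, _, value = tag.partition(":")
    return level, value
```

A tagging interface could call resolve first, fall back to suggest when nothing matches, and only then accept the term as a new user annotation. Terms that reach that last step are the ones worth reviewing when we grow the taxonomy.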

Tagging people

Many assets will relate to people, either explicitly (as in a staff profile or the author of an article/paper/etc.) or less obviously (as in an asset that is mostly about someone). If we treat this element as "primary person" and allow multiple values, then we can build up a relationship between any asset and one or more people. UX/UI design will have to figure out how we identify somebody (e.g. by URL to profile, user name, email, etc.).

Combined with the topic-subject tags, this primary person element becomes very powerful. We can easily allow a reader to discover more about the primary person (courses taught, articles published, events where they will speak). We could develop/deploy some visualisation tool that shows the nodal relationships between our assets, like this DigitalNZ example.
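
As a rough sketch of how the primary person and topic-subject tags could combine, the Python below groups assets by person and derives the links a visualisation tool would draw. The asset records, field names and identifiers are all assumptions made for illustration; they do not reflect how assets are actually stored.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical assets; ids, field names and values are illustrative only.
ASSETS = [
    {"id": "course-1", "type": "course", "people": ["jane.doe"], "subjects": ["economics"]},
    {"id": "news-7", "type": "news", "people": ["jane.doe", "sam.lee"], "subjects": ["finance"]},
    {"id": "event-3", "type": "event", "people": ["sam.lee"], "subjects": ["economics"]},
    {"id": "feature-9", "type": "feature", "people": [], "subjects": ["psychology"]},
]


def assets_by_person(assets):
    """Map each primary person to their asset ids, grouped by asset type."""
    index = defaultdict(lambda: defaultdict(list))
    for asset in assets:
        for person in asset["people"]:
            index[person][asset["type"]].append(asset["id"])
    return index


def related_pairs(assets):
    """Pairs of asset ids that share a primary person or a subject."""
    edges = []
    for a, b in combinations(assets, 2):
        shared_people = set(a["people"]) & set(b["people"])
        shared_subjects = set(a["subjects"]) & set(b["subjects"])
        if shared_people or shared_subjects:
            edges.append((a["id"], b["id"]))
    return edges
```

The first function supports a "more about this person" panel (courses, news, events), while the pairs from the second are the edges of the kind of nodal graph mentioned above.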