Kim H. Veltman
Challenges for a Semantic Web: From Transactions to Relations and Meanings
Submitted to: The Fourteenth International World Wide Web Conference, Chiba, Japan, 10-14 May 2005.
The traditions of markup languages (e.g. SGML, XML) have focussed on adding semantic meaning through their markup of individual words within the text. This in-text meaning can be coupled with more detailed meaning found in classification systems, dictionaries, encyclopaedias and other reference works available through virtual reference rooms and complement other efforts towards multimodal meaning.
Contemporary discussions of a semantic web focus on entities and properties in terms of five semantic primitives. Dahlberg has distinguished between entities, properties, activities and dimensions in knowledge organization. This essay suggests that all of these should be included and explores the use of Perreault’s distinctions between subsumptive, ordinal and determinative relations to move towards a larger view of a semantic web. It claims that the idea of hyperlinks can be enriched through the concepts of illustrated links, omni-links and dynamic links. When coupled systematically with the enduring knowledge of memory institutions (libraries, museums and archives), this approach can greatly expand the types of relations among subjects/objects and take us from simple, logical transactions to complex meanings. If applied seriously, this approach will enable us to transform a largely static web into an evolving, dynamic web that reflects different spatial (local, regional, national, international) and temporal (historical) dimensions. This will complement work done on constraint-based contextual relations and transform a web of trust aimed mainly at business transactions into a semantic web in a richer sense, one that serves wider dimensions of human expression and culture.
In the study of literature, the first half of the 20th century brought two major trends in textual analysis. One focussed on underlying structures (Moscow, Prague, Paris, Oxford, Yale schools). The other focussed on close reading of individual words (Cambridge) and new criticism (Oxford). The second half of the 20th century saw the evolution of Standard Generalized Markup Language (SGML). In the context of literature this focussed on the markup of individual words in projects such as the Records of Early English Drama (REED), the Dictionary of Old English (DOE), and the Oxford English Dictionary (OED). The thrust of such efforts was to separate the content of such texts from the form in which they might eventually be displayed. A limitation was that SGML was very difficult for non-technical users.
The advent of HyperText Markup Language (HTML, 1990) introduced a wonderfully effective interim solution that removed the distinction between underlying content and various forms of expression. Extensible Markup Language (XML) set out to create a subset of SGML, which would re-instate the separation between content and form and create a web of trust based on logical principles. This latter feature is essential for reliable online transactions and scheduling and is seen by some as synonymous with the quest for a semantic web. While obviously necessary for the world of business, this view is too narrow for the needs of culture and humanism. This essay explores those needs and outlines steps that can lead to a more comprehensive approach, building on distinctions introduced by Perreault.
1.i. The Legacy of Boole
When Shannon and Weaver were initially confronted by the challenge of electronic communication in the 1940s they chose a binary system of 0s and 1s based on Boole, which itself was a simplification of the ideas of Euler. This focussed attention on the logical states of identity (is =) and negation (is not ≠). This repertoire was extended to include five basic semantic primitives: existence, coreference, relation, conjunction and negation. Even so, the legacy of Boole led to a continued emphasis on identity and negation.
This paper makes two points. First, it suggests that the semantic web will become even more effective if it integrates more insights from the realm of memory institutions and particularly from the field of Knowledge Organization. Instead of speaking generally at a “high level” of properties as any “relationships between two kinds of entities,” it would be very useful to distinguish more specifically between different kinds of relations, integrate these into markup languages and search and retrieval strategies, and use these structures for sense-making in future methods for the presentation of knowledge and information. Attention is focused on three kinds of relations identified by Perreault: subsumptive, ordinal and determinative relations. Not considered here are allegorical relations or metaphorical relations.
Secondly, this paper proposes that the advent of automated omni-linking and new methods for hyper-illustration and dynamic links, introduce new possibilities for the semantic web. Instead of looking to a) markup within a text, or b) metadata tags in headers to files and/or documents, one can link words in texts directly with dictionaries, encyclopaedias and other reference materials of virtual reference rooms.
In the past, top-down methods constrained users to adopt a) a standardized field name and b) an authority file which accepted only one spelling of words (names, terms, places) within that field. If, in future, we expand the concept of authority files to include all possible variants, then local, regional, national and international versions can continue and cultural diversity can be fostered.
2. Subsumptive Relations
A first dramatic step beyond the narrow confines of Boolean logic lies in subsumptive relations in terms of partition (part/whole) and abstraction (type/kind).
Perhaps the most obvious of these is partition (partitio). Such relations are well established in parts catalogues and in specialized subject catalogues in fields such as medicine. Hyper-linking a term such as (the human) body to such catalogues would greatly expand an average user’s ability to explore topics related to the human body. The semantic web lies not just in identifying and tagging the logical relations of a term on a page, but lies in linking that term with logical relations which have been established elsewhere by others. Hereby, notions of identity of an object are expanded to include all its parts. Such whole-part relations can be classed as internal, subsumptive relations.
Another aspect of partitive, subsumptive relations pertains to the accidents of a given substance, notably, distinguishing characteristics, quantity, quality, etc. Again, although these aspects are typically not linked with a term in a given text, they are usually described in libraries and specialised knowledge databases. Knowing the accidents of a given substance allows us to search for it with much greater precision: i.e. by being able to specify the colour, texture, weight and other characteristics of a given object.
A second area, abstraction (divisio), can be classed as external subsumptive relations insomuch as they reveal how a given being, entity or substance is in turn part of a larger type/kind or genus/species category. For instance, a man is part of the species Homo sapiens, which is part of the order of Primates, which is part of the phylum of Chordata, which is in turn part of the kingdom of Animalia. Such information is typically found in biological taxonomies which were made famous by Linnaeus and his followers. Knowing the categories under which a being or object is classed again increases greatly our ability to search for related subjects and objects (figure 1).
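These external subsumptive relations can be sketched as a simple parent-pointer table. The following minimal example, which uses the taxonomy just given, shows how knowing a term's broader categories allows a search to be expanded automatically; the function names are illustrative, not from any standard library.

```python
# External subsumptive (type/kind) relations as a parent-pointer table.
# Each term points to its immediately broader category.
BROADER = {
    "man": "Homo sapiens",
    "Homo sapiens": "Primates",
    "Primates": "Chordata",
    "Chordata": "Animalia",
}

def subsumption_chain(term):
    """Return the term followed by every broader category above it."""
    chain = [term]
    while term in BROADER:
        term = BROADER[term]
        chain.append(term)
    return chain

def expand_query(term):
    """Expand a search term with its broader categories, so a search for
    'man' can also reach material classed under 'Primates' or 'Animalia'."""
    return set(subsumption_chain(term))

print(subsumption_chain("man"))
```

A search engine aware of such chains could offer the user not only documents tagged with the original term but also those classed under any of its broader categories.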
Such systems for ordering the world change over time. The system of Linnaeus was replaced by the later systems of Jussieu, Cuvier and Haeckel, each of which for a time became the standard system for ordering the field. In the case of classification systems, multiple competing systems frequently exist simultaneously. This is also the case with personal lists of terms which contain hierarchies as tree structures that do not lay claim to absolute truth and yet prove very useful as ordering and sense-making systems. Such systems vary both spatially (local, regional, national, international) and temporally (in different historical periods).
The architects of the Internet have already moved from a single ontological model to a framework that recognizes multiple ontologies. As we explore the potentials of linking marked-up terms with multiple ordering systems and reference modes, it is increasingly important to distinguish among different levels of claims to truth. In business, either/or logic is usually sufficient (a transaction either occurs or does not occur). In the realms of scholarship and culture more subtle distinctions are necessary: a claim made or supported at the international level by an official organisation in a field typically carries more weight than that of a local organization or of an isolated individual. The semantic web needs to reflect not only true statements but also the level of truth in statements and claims.
3. Ordinal Relations
Viewed in terms of questions, entities and properties are primarily concerned with What? and Who? By contrast, Perreault’s ordinal relations (Dahlberg’s dimensions) concern other questions. Positional (spatial) ordinal relations entail the question Where?. Positional (temporal) ordinal relations entail the question When?. Conditional (state, attitude, energy) and Comparative (degree, size, duration, identical, similar, analogous, dissimilar) ordinal relations entail the question How?
Figure 2. Perreault’s Subsumptive Relations with respect to Universals and Particulars.
A systematic approach to subsumptive relations will lead to new insights concerning the differing role of ordinal relations with respect to universals and particulars. The (universal) concept of an eagle will include a range of sizes and ages whereas the (particular) measurements and life span of a given eagle will be in terms of precise figures. In this sense, universals are above the space/time horizon whereas particulars are below that horizon. In future our knowledge organization needs to reflect these distinctions.
4. Determinative Relations
Aristotle made a distinction between action and passion (suffering). This became a distinction between actions and processes (Dahlberg) or between active and passive determinative relations (Perreault). These have further divisions and historically have corresponded to the questions Why? and more recently How? The semantic web should also address such determinative relations (cf. figure 1). It might build on the work in linguistics on constraint-based grammar and specifically on purpose infinitives.
Most search engines today focus only on searching for Who? and What? A semantic web needs to provide searches for all six basic questions: Who?, What?, Where?, When?, How?, and Why? in various combinations. A quest to achieve this will transform the practice of links.
The notion of hypertext and hyperlinks as envisaged by Doug Engelbart and developed by Ted Nelson remains largely in the tradition of the footnote/reference whereby a given word in a text is linked with another set of words either at the end of the text or to some other site. The semantic web as here envisaged would introduce at least three novel kinds of links, namely, omni-links, hyper-illustrations and dynamic links.
5.i. Omni-Links
The SUMS Corporation has developed a prototype for an omni-linked version where every word in a text is hyper-linked without a need to highlight individual words with the customary blue font. This means, for example, that every word in a book on Leonardo can be linked with a database recording Leonardo’s uses of that word in his manuscripts. While this is admittedly of limited consequence in the case of prepositions (e.g. in, out, by) or copulas (is, was), it is extremely useful in the case of significant terms such as his four powers of nature (force, motion, percussion, weight). It is also extremely efficient in that a simple algorithm allows one to make such links automatically rather than needing to make each link manually.
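The simplicity of such an algorithm can be suggested in a few lines. The sketch below, which is a hypothetical illustration and not the SUMS implementation, wraps every word of a text in an HTML anchor pointing at a lookup service; the URL scheme is a placeholder.

```python
import re

def omni_link(text, base_url="https://example.org/lookup?term="):
    """Wrap every word in an HTML anchor pointing at a lookup service,
    so that no word needs to be highlighted or linked by hand."""
    def link(match):
        word = match.group(0)
        return f'<a href="{base_url}{word.lower()}">{word}</a>'
    # Replace each alphabetic run with its linked form.
    return re.sub(r"[A-Za-z]+", link, text)

print(omni_link("force and motion"))
```

Because the substitution is purely mechanical, a full book can be omni-linked in one pass, with the lookup service deciding at query time whether a given word (a preposition, a copula, a significant term) has anything useful behind it.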
Hyperlinks have typically been one-to-one correspondences between a word in a text and another note or site. Omni-links can function at different levels of knowledge: i.e. the same omni-linked word in a text can be connected with: 1) a term in a classification; 2) a definition in a dictionary; 3) an explanation in an encyclopaedia; 4) a title in a catalogue or bibliography; 5) partial contents in the form of an abstract or a review; or 6) the full contents of an article or book. Hereby omni-links introduce access to meaning at multiple layers.
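These six levels can be pictured as a word resolving against a different reference source depending on the depth the reader requests. The sketch below uses toy stand-in entries, not real reference works.

```python
# Toy stand-ins for the six levels at which an omni-linked word can resolve.
LEVELS = {
    "classification": {"eagle": "Animalia > Chordata > Aves"},
    "dictionary":     {"eagle": "a large, keen-sighted bird of prey"},
    "encyclopaedia":  {"eagle": "Eagles are large raptors comprising some 60 species."},
    "catalogue":      {"eagle": "Brown, L., Eagles of the World (title record)"},
    "abstract":       {"eagle": "This book surveys the biology and distribution of eagles."},
    "full_text":      {"eagle": "<full contents of the article or book>"},
}

def resolve(word, level):
    """Return the entry for a word at the requested level of knowledge,
    or None if that reference source has nothing for it."""
    return LEVELS[level].get(word)

print(resolve("eagle", "dictionary"))
```

The same word in the same text thus carries six different depths of meaning, chosen by the reader rather than fixed by the author.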
Initially such links with definitions will simply be made with standard dictionaries (e.g. Oxford in English, Larousse in French). This can be extended to etymological dictionaries (e.g. Godefroy in French and Grimm in German). Eventually this approach points to kinds of dictionaries that distinguish between ostensive, nominal and real definitions. As such the semantic web will bring access to meanings that vary geographically and historically.
5.ii. Hyper-Illustrations
Notwithstanding many advances in the field of printing, coloured images remain extremely expensive and are therefore kept to a minimum in most books. Hence the scarcity of such coloured images remains one of the serious limitations of the printed book. An electronic version of the same book can provide a simple replica of the printed book, but it can also go much further. By using hyper-illustration an author can potentially attach different series of images to their book, whereby an amateur might have a few simple illustrations, while an expert is provided with a series of more complex illustrations. These series can in turn be coupled with further series. Hence, while the printed version of a book discusses virtual reality and offers perhaps one illustration, a hyper-illustrated book might provide five examples and allow an interested reader to access an entire lecture with 100 illustrations of virtual reality. This does considerably more than overcome the limitations of printing. It introduces the possibility of layers of images which become more diverse and complex to meet the needs of more advanced readers.
5.iii. Dynamic Links: Past and Future
Marshall McLuhan made us aware of the paradoxes of print media. They had the enormous advantage of “fixing” a text in the sense of establishing an authoritative version which did not change with every scribe as had been the case with manuscripts. At the same time this fixed version of text meant that one could not simply erase and rewrite a passage. Any changes meant a new printing and usually an entirely new edition.
The association of computers with dynamic ideas goes back at least to 1968, the year the Internet began in Britain and when Alan Kay conceived the idea of a Dynabook. Since then there has been much hype about dynamic links and dynamic link libraries. Today even Microsoft Word has a “Fields” feature that allows one to trace dates as they change.
Just as the world of industry speaks of self-healing products, the world of scholarship envisages self-updating publications. In this scenario, rapidly changing statistics such as what is the fastest computer or how many persons are on the Internet (200 million in 2000 and over 800 million in 2004) would be updated automatically every time that standard websites devoted to these themes are brought up to date.
While intuitively easy to imagine, serious attempts to create self-updating dynamic books will require a considerable adjustment in writing practice. In the past, authors typically focussed on precise information which, ironically, is the most likely to become quickly dated. In future authors may typically focus on more general claims in their texts which are then substantiated by links to standard sites which update “volatile” statistics and information.
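One way to picture this shift in writing practice: the author writes the general claim with a named placeholder, and the volatile figure is filled in from a standard source each time the text is rendered. In the sketch below a local dictionary stands in for such a live website; the placeholder name is hypothetical.

```python
# A stand-in for a standard website that maintains a volatile statistic.
SOURCES = {"internet_users": "over 800 million"}

def render(template, sources):
    """Substitute each {placeholder} in the author's text with the
    current value from the standard sources."""
    return template.format(**sources)

# The author's text makes a general claim; the precise figure is external.
text = "The Internet now has {internet_users} users."
print(render(text, SOURCES))
```

When the standard source updates its figure, every publication linked to it is brought up to date without a new edition.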
6. Coreference and the Role of Standards
Coreference or equivalence is an area where great changes are occurring with respect to the semantic web. In books on logic the problem of identity and coreference is typically a binary either/or question. In the world of culture, the problem of coreference is much more elusive. Names of the same persons and places change both geographically (especially in different languages) and historically. For instance, the name Leonardo da Vinci is alternatively spelled Lionardo da Vinci, Léonard de Vinci, etc. Similarly, the names Liège, Luik and Lüttich refer to the same city in eastern Belgium.
To deal with such problems the 19th century developed the ideal of standards and specifically the notion of authority files. It seemed that the challenge was to persuade everyone to use the official spelling of a name. The good news was that everyone who followed this approach could exchange information efficiently. The bad news was that this approach blocked access to all variant spellings of the same name.
The 20th century revealed that in many ways the 19th century ideals were too optimistic: that complete consensus even on the fields of MARC (Machine-Readable Cataloguing) records was not possible. But surely there were a few fields about which everyone could agree? The Dublin Core efforts revealed that even this limited application of the top-down approach was not universally acceptable. Gradually those in technology have begun to realize that the way ahead is not to force everyone to change their practices to fit the technology but rather to adjust the technology to fit persons’ different practices.
In this new view standards which bring authority files remain important but need to be complemented by tables that record all the variants of those names. As a result, local, regional, national and international variants of names of persons and places can be used and one can still arrive at the proper co-reference. Philosophically this is of profound importance because it means that standards no longer threaten (cultural) diversity. Indeed standards can help diversity to prosper: evolution is embracing not replacing.
This simple philosophical breakthrough has major technological and political consequences. There are over 6,500 languages in the world and no-one can be expected to learn them all. The top-down model of the 19th century made it seem that the triumph of one standard language such as English was inevitable: a question of sheer numbers, said the pundits at ICANN and elsewhere. A decade ago this seemed a reasonable premise. English constituted almost 90% of the web. In 2004, English represented about 35% of the web. By 2006, it is claimed that there will be more Internet users in China than in the United States and there are serious claims that Chinese will be used more than English.
The philosophy that combines authority files for names with variant names can readily be extended to languages. For example, the original language of a text can be the authority file and translations in different languages function as variant names. To take a concrete example: this means that a Hungarian can type in the title A festészetrõl and arrive at a “standard” name of De pictura (On Painting) without needing to know how to spell the title in Latin or in English. Conversely, this means that an English speaking person can find Hungarian titles without knowing how to spell in that language.
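The mechanism behind this concrete example can be sketched as two small tables: one mapping every variant to the authority form, and one rendering the authority form back in the user's preferred language. The tables below use the titles from the text; they are toy stand-ins for full authority files.

```python
# Variant table: every known form of a title maps to the authority form.
VARIANTS = {
    "A festészetrõl": "De pictura",
    "On Painting":    "De pictura",
    "De pictura":     "De pictura",   # the authority form maps to itself
}

# Renderings of the authority form in different languages.
RENDERINGS = {
    ("De pictura", "en"): "On Painting",
    ("De pictura", "hu"): "A festészetrõl",
    ("De pictura", "la"): "De pictura",
}

def standard_form(title):
    """Resolve any variant to the standard (authority) form, if known."""
    return VARIANTS.get(title)

def localized(title, lang):
    """Resolve a variant to the authority form, then render it in the
    user's preferred language."""
    return RENDERINGS.get((standard_form(title), lang))

print(localized("On Painting", "hu"))
```

A Hungarian reader and an English reader thus reach the same record while each continues to type and see the title in their own language, which is precisely how standards can foster rather than threaten diversity.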
Technologically this implies that we can create a semantic web which reflects the diversity of the world’s languages and cultures. To a certain extent the trends towards internationalization even in the realm of domain names are already moving in that direction. Politically this means that instead of threatening linguistic and cultural diversity, a global internet can foster their growth and development. It also means that standards for mapping between various distributed databases become ever more significant.
7. Virtual Reference Rooms
Implicit in the above outline are new activities for the semantic web and indeed some major changes in the organization of knowledge. Trends towards authority files whereby problems of identity and co-reference (equivalence) can more readily be resolved remain important. But these collections of standard names and titles need to be complemented by collections of variants which will enable users to continue using their preferred versions rather than forcing them to conform to an externally imposed version. Such collections of variants are a key to success in efforts at bridging and mapping between/among different ontologies.
In the world of printed books the information concerning such variant names was typically found in the reference rooms of libraries. For instance, in art history Thieme-Becker’s standard lexicon of artists (Künstlerlexikon) included, alongside the familiar version of artists’ names, typical variant spellings. Such reference rooms also contained different classification systems (ontologies); dictionaries (alternative definitions or semantic meanings); encyclopaedias (more extensive explanations or detailed semantic meanings); catalogues and bibliographies (bridges to more detailed semantic information). We need an electronic equivalent of such reference rooms: we need virtual reference rooms to complement existing trends towards a WWW Virtual Library.
Today there are already collections such as www.onelook.com, which links 6,146,163 words in 971 different dictionaries. The World Wide Web Virtual Library already has a link to the Dewey Classification System. Needed are bridges among different classification systems, dictionaries and encyclopaedias for virtual reference rooms. G7 pilot projects such as Bibliotheca Universalis, European projects such as RENARDUS and national efforts for union catalogues (e.g. CCFr in France, GBV in Germany, ICCU in Italy and CURL in the UK) have made initial efforts in this direction.
Such hitherto scattered efforts need to be integrated to form an integral part of the Research Infrastructures linked with the Grid initiatives of the European Union. These distributed repositories of reference materials will require input from a series of communities: not just the World Wide Web and Internet communities, but also memory institutions (libraries, museums and archives) and linguists (e.g., ELSNET).
The outputs of such infrastructure work will be almost invisible to the untrained eye, just as the enormous efforts that go into making electronic library catalogues and electronic reference works are nearly invisible. Even so their implications will be enormous. In a European context it will mean that citizens from the newly acceded states will be able to gain access to records from all the member states without needing to master perfectly the orthographies of 25 languages. It will mean also that local and regional variants need not in future be seen as competing with or opposed to national and international versions associated with standards.
There is a growing trend to scan in the full texts and images of memory institutions. The French Gallica project is scanning in 70,000 texts which will be freely accessible. The British Early English Books Online (EEBO) Project has scanned in the full texts of 125,000 books published between 1475 and 1650. In China, all the Classics are available in eight dialects through Unicode (800 million characters). Other countries have analogous projects at the national, regional and local levels.
Such efforts towards 1) digital libraries will be complemented by 2) a distributed virtual reference room as part of research infrastructures and could be integrated with 3) a virtual agora for collaborative research and creativity. Together these three elements would constitute a Distributed European Electronic Resource (DEER) and serve as the basis for a larger World Distributed Electronic Resource (WONDER). Such a project will take the semantic web far beyond its contemporary focus on transactions and scheduling, which limits it to machine-readable meaning, and open its vision to the full range of human meanings.
With respect to the elusive concept of a semantic web, many efforts in the WWW and in computer science (especially information systems and computational linguistics) focus on strategies and markup languages that will enable new algorithms to extract meaning from texts automatically. The advent of new linking methods introduces complementary possibilities whereby words in texts are given contexts by linking them, via virtual reference rooms, to established repositories of meaning such as dictionaries, encyclopaedias and gazetteers.
The extent to which this approach is fruitful will depend on the field and the kind of knowledge. For enduring knowledge this will be vital. For new, personal or collaborative knowledge there will be emergent meanings which are yet to be established. Increasingly we need to recognize that personal, collaborative and enduring knowledge have their own rules and yet there is an ongoing need to create bridges and mapping systems across these different realities. The work on new knowledge visualisation and knowledge management tools in recognizing new connections and meanings is important. So too is past work on codifying accepted meanings via lexicology and lexicography, which were once linked to semantics along with semiotics and semasiology. One of the real challenges is to relate this past and present work more systematically.
Related to this challenge is a need to recognize more consciously that a language is more than a collection of words which can simply be translated into other languages. The Italian dictum traduttore traditore (a translator is a traitor) points to a deeper and more subtle dilemma. Each language is really a way of knowing, linked with a culture, often with belief systems and cosmologies. Words can be translated: ultimately languages cannot. In an age where none of us can learn all of the world’s 6,500 languages, how can we create a WWW that allows us to share while teaching us to respect the invisible chasms which our cultures entail? How can we create interfaces for a semantic web which help us to become more conscious of these invisible dimensions of cyberspace which correspond somehow to what Hall called the Hidden Dimension of encounters in physical space?
Tim Berners-Lee has a vision of a web of trust whereby the truths of logic will help us to distinguish between reliable knowledge and unreliable information/data. This is a noble goal. His quest for a semantic web in this sense, first formulated publicly at WWW7 (Brisbane, 1998), deserves to be seen as a networked vision of a much older goal that goes back to the philosophers, logicians and grammarians of antiquity and which led via the trends of new criticism (Oxford) and close reading (Cambridge) in the first half of the 20th century to the quest for systematic markup languages in the latter half of the 20th century.
Initially it seemed as if the semantic web would largely be a question of such markup languages within texts. A generation ago there was still an assumption that one could create a universally applicable Standard Generalized Markup Language (SGML). The past decade has seen a shift towards a simpler Extensible Markup Language (XML), which is generally applicable and can be complemented by specialized markup languages in various fields. At the heart of these activities was a quest to create standardized authority files through metadata tags both in the text and attached to the text.
This made perfect sense in the context of new (personal) knowledge in the form of born digital materials. But as the World Wide Web continues to grow at 7 million pages a day and increasingly reflects also the cumulative memory of enduring knowledge, there is a need to couple such internal efforts with standard reference materials from memory institutions. This points to new, distributed bottom-up databases which collect variants in addition to top-down authority files; to new interplay between personal, collaborative and enduring knowledge; to new, multilingual knowledge structures and ultimately to new forms of publication, of sharing, collaboration, and creativity.
Dr Ingetraut Dahlberg’s pioneering work in knowledge organization remains a constant source of inspiration for which I am ever thankful. I am deeply grateful to Alexander Churanov who has developed the omnilink algorithm and to Vasily Churanov and Andrey Kotov who have helped to develop a prototype for a new System for Universal Media Searching (SUMS). I am very grateful to Frederic Andres for his encouragement to present these ideas as a tutorial at the WWW2005, for kindly reading the draft and providing helpful comments and references.
 Ingetraut Dahlberg developed these ideas in many of her publications of which we cite only a few here: Grundlagen universaler Wissensordnung: Probleme und Möglichkeiten eines universalen Klassifikationssystems des Wissens, Hrsg. von der Deutschen Gesellschaft für Dokumentation e. V. (DGD), Frankfurt/Main. Pullach bei München: Verlag Dokumentation, 1974, especially pp. 100-167. These ideas are further developed in: “Concept and Definition Theory,” Classification Theory in the Computer Age, Albany New York, November 1988, pp. 12-24 and in “Conceptual Structures and Systematization,” International Forum on Information and Documentation, vol. 20, no. 3, July 1995, pp. 9-24.
J. Perreault, "Categories and Relators", International Classification, Frankfurt, vol. 21, no. 4, 1994, pp. 189-198, especially p. 195. Perreault’s relators were meant purely syntactically. In the Mediaeval period these would have been called synkategoremata.
See: http://www.dfki.de/~wahlster/Dagstuhl_Multi_Modality/WG_4_Multimodal_Meaning_Representation/. This group interacts with SIGSEM.
S. S. Goncharov, V. S. Harizanov, J. F. Knight, C. F. D. McCoy, Algebra and Logic, 2004, vol. 43. See: http://home.gwu.edu/~harizanv/RelativelyHyperimmune.pdf.
Timo Honkela, et al., "Self-Organizing Maps and …," Proc. of ICEUT 2000, Beijing, August 21-25, 2000. See: http://citeseer.ist.psu.edu/honkela00selforganizing.html
 Cf the Council on Library and Information Resources (CLIR) paper on the Societal Role of Archives. See: http://www.clir.org/pubs/reports/pub89/role.html. For specific projects see: MarcOnt initiative http://www.marcont.org/ and JeromeDL http://www.jeromedl.org/
E.g. T. Ludden, “Allegories of Cultural Relations: An Examination of Anne Duden's Mode of Reading Representations of St George and the Dragon,” German Life and Letters, January 2004, vol. 57, iss. 1, pp. 69-89(21).
Some of this is common knowledge. We all know, for instance, that a human body has parts such as a head, arms, hands, fingers, legs, feet and toes. Even so only a trained medical doctor is typically able to recognize and recall the thousands of “parts” of the body. For a recent discussion of this complex field cf. Olivier Bodenreider and Carol A. Bean, “Relationships Among Knowledge Structures: Vocabulary Integration Within a Subject Domain,” Relationships in the Organization of Knowledge, Dordrecht: Kluwer, 2001, pp. 81-98.
Some would associate this with monumental relations, namely relations between/among monuments in physical space and in geographical terms.
Active (Productive, Causing, Originating/Source) and Passive (Produced, Limited, Destroyed).
For a more detailed discussion see the author’s “Towards a Semantic Web for Culture,” JoDI (Journal of Digital Information), Oxford, vol. 4, issue 4, article no. 255, 2004-03-15, p. 19 (Special issue on New Applications of Knowledge Organization Systems).
 On Alan Kay and the Dynabook see: http://www.artmuseum.net/w2vr/archives/Kay/01_Dynabook.html
 For a fuller discussion of these topics see the author’s Augmented Knowledge and Culture, Calgary: University of Calgary Press, 2005 (in press), c. 600pp.
Cf. CSC 581 Computer Support for Knowledge Management List of Knowledge Management Tools. See: http://www.csc.calpoly.edu/~fkurfess/Courses/CSC-581/S03/Assignments/KM-Tools-List.shtml
Cf. Danny Sullivan, “Death of a Metatag,” Search Engine Watch, October 1 2002. See: http://searchenginewatch.com/sereport/article.php/2165061