Kim H. Veltman
Access, Claims and Quality on the Internet
Keynote at: Open Culture, Accessing and Sharing Knowledge. Scholarly Production in the Digital Age, Università di Milano, Milan, 27-29 June 2005.
In 1934, Otlet outlined a vision of comprehensive electronic access to knowledge. Progress towards this vision entailed initial visions of hypertext, markup languages, the semantic web, Wikipedia and more recently a series of developments with respect to Open Source. If everything is accessible then how do we separate the chaff from the grain and how do we identify quality? This essay suggests that five dimensions need to be included in a future web: 1) variants and multiple claims; 2) levels of certainty in making a claim; 3) levels of authority in defending a claim; 4) levels of significance in assessing a claim; 5) levels of thoroughness in dealing with a claim. These dimensions offer future criteria for scholarship.
The efforts towards hypertext, markup languages, the semantic web, and Open Source allow us to consider five new kinds of challenges that must be met to ensure quality: 1) methods for integrating variants; 2) levels of certainty in making a claim; 3) levels of authority in defending a claim; 4) levels of significance in assessing a claim; and 5) levels of thoroughness in supporting claims about extant knowledge in a field. All five are important ingredients that can serve as future criteria for scholarship.
1. Methods for Integrating Variants
In an ideal world, scholarship is limited to eternal truths. In everyday life, many items are straightforward questions of true or false. In many cases, however, the situation is not so straightforward. We need to incorporate variant names, associations, attributions and claims.
1.1 Variant Names
The most obvious of these entails different spellings of a given name. For much of the 20th century there was a conviction that if one could establish a standard version, this could serve as an authority file and be adopted by, or simply imposed on, others. Libraries have complex systems for Machine Readable Cataloging (MARC), which duly reflect standard and variant names. Ironically, the potential of this information is often not exploited fully even by the libraries themselves, and such systems are not available to everyday users.
These alternative names are effectively access points to earlier documents which were unaware of the current accepted spelling. So there is a new challenge to create online authority files with all possible variants built in. These lists can be online and freely available to users. If users have additional variants to add they could do so on a simple proviso: that they provide at least one historical document that uses the variant in question. This variant and its source would then become a regular part of the system. In using this method, non-expert users would be spared the deliberations of which variant to use. The system provides it for them. The variants remain accessible at the database level. Hence, even if the user forgets the official spelling next time round, the variants bring him/her back to the currently accepted version. Persons working in specific fields can work with subsets of the master lists to ensure that their tools match the complexity required, while saving them from unnecessary complexity.
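The mechanism described above, an online authority file in which anyone may add a variant spelling provided they document it with at least one historical source, can be sketched in a few lines. The following is a minimal illustration, not an existing library system; the class name, fields and example source are all invented for the purpose.

```python
# Sketch of an authority file that maps spelling variants to a
# currently accepted form. A user-submitted variant is accepted
# only when accompanied by at least one historical source document.

class AuthorityFile:
    def __init__(self):
        # canonical name -> {variant spelling: documenting source}
        self.records = {}

    def add_canonical(self, name):
        """Register a currently accepted spelling."""
        self.records.setdefault(name, {name: "accepted form"})

    def add_variant(self, canonical, variant, source_document):
        """Accept a variant only with a documented historical source."""
        if not source_document:
            raise ValueError("a variant requires at least one historical source")
        self.records[canonical][variant] = source_document

    def resolve(self, spelling):
        """Return the accepted form for any known variant, or None."""
        for canonical, variants in self.records.items():
            if spelling in variants:
                return canonical
        return None

af = AuthorityFile()
af.add_canonical("Leonardo da Vinci")
af.add_variant("Leonardo da Vinci", "Lionardo da Vinci",
               "Vasari, Vite (1550)")
print(af.resolve("Lionardo da Vinci"))  # Leonardo da Vinci
```

Because the variants live at the database level, a user who types an obsolete spelling is simply led back to the accepted version, exactly as the text proposes.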
1.2 Classifications and Associations
Connected with this theme of variant names are the variant classifications, thesauri and associations provided by previous attempts at systematic organization. The challenge is to move systematically between subsumptive, determinative and ordinal relations. One can imagine a system that allows users to choose whether they wish to deal only with physical instances (particulars) or also to include various kinds of metaphysical entities (universals): e.g. belief, fantasy, play, scenario and fiction at different levels of subsumption. Some of these distinctions can be rendered spatially.
Some of these variant associations are not spatial. For instance, the Virgin Mary is universally known in the West. Some call her Star of the Sea. There are over seventy such alternative names, many of which are potentially useful in increasing the range of our search. Such an approach becomes essential when we are searching in other cultures. For instance, the Indian mother goddess Durga has 108 names. Such lists are effectively mini-specialized thesauri applied to a given deity, person, idea or concept.
Genealogical lists are another example of such contextualizing instruments. Here a single name provides access to a range of related persons. Needed are frameworks whereby these existing lists are made available to us as we embark on more serious searches. Such lists are again like classification systems and thesauri. They need to be linked with definitions (dictionaries), explanations (encyclopaedias), and titles (catalogues, bibliographies) elsewhere. All this belongs to the domain of virtual reference rooms (levels 1-5 in figure 2).
Present systems such as Google assume that we know the detailed words necessary for the search, and indeed if we happen to know these then Google works surprisingly well. The problem with real research is that we usually do not know the important terms when we embark on our study. Being able to call on existing associations of earlier experts offers a way to go further. Implicit here is a notion that interfaces should include the mental screens of earlier and existing experts. Their ways of organizing knowledge can serve as orientation tools in our own voyages of discovery.
1.3 Attributions and Claims
In the exact sciences only the latest version of an attribution or claim is usually important. By contrast, in the humanities the cumulative history of attributions is potentially important. The latest claim is not always the best and is usually not definitive. For instance, in the case of a painting, one scholar may claim a) that the painting is by Leonardo, another may claim b) that it is by his pupil, while a third claims c) that it belongs to his workshop. An either/or mentality from computer science, which creates a single creator/author category in the Dublin Core framework, provides space for only one of these claims and in the process obscures the fact that the precise attribution of this painting remains a matter of debate. This almost banal example illustrates how an overzealous quest for precision can be as misleading as it is meant to be helpful. Needed are tools to distinguish between such competing attributions and to aggregate automatically the cumulative claims of the research literature.
Virtual Reference Room
1. Terms: Classifications, Thesauri, Associations
2. Definitions: Dictionaries
3. Explanations: Encyclopaedias
4. Titles: Bibliographies, Catalogues
5. Partial Contents: Abstracts, Reviews
Primary Literature in Digital Library
6. Full Contents
Secondary Literature in Digital Library
7. Texts, Objects in Isolation: Analyses, Close Reading, Criticism, Interpretation
8. Comparisons: Comparative Studies, Parallels, Similarities
9. Interventions in Extant Objects: Conservation, Restorations
10. Studies of Non-Extant Objects: Reconstructions
Future Secondary Literature (Virtual Agora)
11. Collaborative Discussions of Contents, Texts, Comparisons, Interventions, Studies
12. E-Preprints of Primary and Secondary Literature in Collaborative Contexts
Figure 2. Virtual Reference Room, Distributed Digital Libraries and Virtual Agoras with different levels of reference and secondary literature.
As the 19th century drew an increasingly sharp distinction between primary and secondary literature, the initial emphasis was on studying texts and objects in isolation (das Ding an sich). Meanwhile, three further levels of analysis slowly came into focus: comparisons (comparative studies, parallels, similarities); interventions in extant objects (conservation, restorations); and studies of non-extant objects (reconstructions) (levels 7-10 in figure 2). Needed are systems that allow us to distinguish between these different kinds of reference and secondary literature.
With the rise of new collaborative environments, virtual agoras can serve as a drafting ground for future secondary literature (levels 11-12 in figure 2). Digital libraries thus entail much more than scanning printed texts: they require virtual reference rooms, distributed digital libraries and virtual agoras for collaborative research and creativity. The different levels (1-12 in figure 2) can also be seen as a knowledge life cycle: reference works point to primary literature, which inspires secondary literature, which prompts collaborative discussions (virtual agoras including discussion groups, blogs, Really Simple Syndication (RSS) etc.), which in turn lead to new primary and secondary literature.
2. Levels of Certainty in Making a Claim
Ever since the advent of hypertext with Douglas Engelbart and Ted Nelson, the emphasis has been on linking. Like all important ideas, this built on earlier traditions: footnotes and references were also concerned with linking. Electronic hypertext links introduced two fundamental steps forward: a) the link was only a click away; b) that click could potentially lead to a source outside the document being used at the moment. This is important because, traditionally, a footnote in a scholarly book might conscientiously cite another article, book or manuscript in some remote library, a copy of which could take weeks or even months to obtain. With electronic hypertext such a source is potentially only a click away. Google has filed patents in this domain, aims “to develop technologies that factor in the amount of important coverage produced by a source, the amount of traffic it attracts, circulation statistics, staff size, breadth of coverage and number of global operations,” and is searching for methods to determine the truth value of claims. It is important to recall that many aspects of this quest are already reflected in our memory institutions. Instead of spending billions on creating entirely new models, it would be advisable to invest in linking the new instruments with existing frameworks.
2.1 Direct and Indirect Links
Not all links are equally effective. A link from a reference concerning the Mona Lisa in the Louvre to any of the dozens of sites containing a poor replica of the painting is less effective than a direct link to the Louvre website. One might distinguish between materials that are a) shown live, b) from the original location, c) via an agency, or d) via an official publication. In future, the extent to which scholarly books and articles link directly to original sources rather than to vague sites can become a new criterion for the quality of scholarship.
2.2 Degree of Identity
Today, when we type in a word or term, search engines such as Google assume that we are looking for something identical to that word. They may also offer materials that are similar to that word, but there are no tools in place to define the parameters of a match. Hence, typing in Last Supper (on 15 April 2005) produced 16,200 hits, but there are no functions in place to search for cases that are identical in size, shape or colour. Over two millennia ago Aristotle discussed the importance of attributes in defining objects. Adding attributes to our search parameters will mean that we can find things with the same name and then find subsets which are the same size, shape, colour etc. Eventually this could be extended to include attributes entailing all five senses, making it possible to discover surfaces that look the same but literally feel different.
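The idea of narrowing a name search by such attributes can be sketched as a simple filter over records. The records and attribute fields below (width, shape, dominant colour) are invented purely for illustration; a real system would draw them from museum metadata.

```python
# Hypothetical records for works titled "Last Supper": a name search
# finds all of them; attribute filters then narrow the hits to those
# identical in size, shape or colour.
works = [
    {"name": "Last Supper", "width_cm": 880, "shape": "rectangle", "colour": "ochre"},
    {"name": "Last Supper", "width_cm": 368, "shape": "rectangle", "colour": "blue"},
    {"name": "Last Supper", "width_cm": 880, "shape": "rectangle", "colour": "ochre"},
]

def search(records, name, **attributes):
    """Match by name first, then apply any attribute constraints supplied."""
    hits = [r for r in records if r["name"] == name]
    for key, value in attributes.items():
        hits = [r for r in hits if r.get(key) == value]
    return hits

print(len(search(works, "Last Supper")))                # 3
print(len(search(works, "Last Supper", width_cm=880)))  # 2
```

The same pattern extends naturally to further senses: each new attribute (texture, for instance) is simply another keyword filter.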
2.3. Levels of Certainty
Needed also are new tools that allow authors to indicate the level of certainty behind their claims. Such levels of certainty can be built in to cover claims about who, what, where and when (Appendix 1). The precision with which one frames claims, including the detail with which one indicates the extent to which certainty is possible, then becomes a further criterion for defining scholarship.
For the moment we shall focus on the problem of degree of certainty with respect to the question how. Suppose, for instance, that we are studying a painting of a woman’s face. We encounter a related image on the web that suggests the painting is in fact a portrait of Madame X. On one occasion we may find additional evidence which is conclusive. On other occasions the link to be made might be very certain, quite certain, very probable, quite probable or only possible. Ideally an editing tool makes available a small popup list of choices ranging from Authoritative to Possible.
We then choose the level of commitment. Levels 3-6 would require us to commit our name to the claim and invite documentation. Level 1, a claim that something is authoritative, requires documentation. Sceptics will rightly object that such a system will never be universally accepted. Many persons will prefer the easy way out and simply dump their unsubstantiated claims on the web. In the interests of freedom of the spirit, persons must be free to do so and to state whatever they wish, or not. Failure to permit these options takes one down a path where what a person writes, speaks or even thinks could be seen as a threat to decision makers and the state. Science fiction movies such as Minority Report have warned us of the dilemmas of seeking mind and thought control.
Our approach rejects such basic censorship as a dead end. At the same time, by including rules and frameworks for levels of certainty, we gain new possibilities for introducing search parameters which can sometimes choose to ignore unsubstantiated claims. Five centuries of experience with printing have led to similar solutions. We allow sensationalist newspapers such as the Daily Mirror or the Bild Zeitung to publish many amazing, undocumented claims, but when we are writing a scholarly piece we usually ignore them as evidence. In future, learning how to use sources critically and being required to use sources with a given level of certainty can become new domains for learning in schools and universities.
Authorities, decision makers and sceptics generally will fear that all this assumes honesty in the system and will remain worried about the danger of the system being subverted by dishonest imposters. Fortunately, the simple rules of the game have some built-in safety mechanisms. Anyone is free to state something, but anyone who claims to provide levels of certainty must also provide the supporting evidence. Hence, those who wish to use the cover of anonymity are free to do so, but thereby eliminate themselves from the certainty process. Those who add a source must provide a link to that source. If the link is false or does not confirm the claim, the system can reject it. If they refer to themselves, they implicate their own reputation. If they include their organization, then their organization implicitly becomes liable for defending the claim. For this reason, new authoring tools for indicating levels of certainty in making a claim need to be linked with levels of authority in defending a claim.
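The rules sketched in this section (free assertion, but obligations attached to any claimed level of certainty) could be modelled roughly as follows. The level names, numbering and checks are illustrative assumptions drawn from the text, not an existing system.

```python
# Illustrative certainty scale, from the popup list described above.
# Index 0 corresponds to level 1 (Authoritative), index 5 to Possible.
LEVELS = ["authoritative", "very certain", "quite certain",
          "very probable", "quite probable", "possible"]

def validate_claim(text, level=None, author=None, source_link=None):
    """Anyone may state anything; claiming a certainty level
    carries obligations that the system can enforce."""
    if level is None:
        return True                   # unsubstantiated claims remain free
    if level not in LEVELS:
        raise ValueError("unknown certainty level")
    if author is None:
        return False                  # anonymity excludes one from the certainty process
    if level == "authoritative" and source_link is None:
        return False                  # an authoritative claim requires documentation
    return True

assert validate_claim("The sitter is Madame X.")                          # free speech
assert not validate_claim("...", level="authoritative", author="K.V.")    # missing source
assert validate_claim("...", level="quite probable", author="K.V.")
```

A search engine built on such records could then honour a parameter like "ignore claims without a named author", exactly the filtering possibility the text describes.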
3. Levels of Authority in Defending a Claim
This could at first sound like overkill. On reflection, this approach simply formalizes an approach that has been in place informally for centuries. Whenever we meet someone, we expect a business card to tell us their affiliation. If they come from a world-famous university or company, we implicitly give them more respect and trust than if they come from an unknown organization. The purpose of a more systematic approach is not to check all the details of each source at every turn, but rather to have in place a framework that permits us to check these sources if necessary or desired. Hence, scholarly authors wishing to document a claim that something is authoritative might be prompted to indicate its source in a further list, i.e. whether it originates in: 1) a memory institution; 2) an organization, usually a professional body; or 3) an individual.
Searchers will thus in future be able to use these parameters in their search criteria. For instance, within a library one might search for everything under a given name or subject, or limit the search to specific forms of documentation (figure 3). The complexity of these lists will depend on the situation at hand. Sometimes a simple distinction between the scholarly and the popular press might suffice. At other times a more detailed set of distinctions will be appropriate.
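Such a provenance-restricted search amounts to filtering claims by the three origin categories listed above. The sketch below uses invented records; the category names follow the text.

```python
# Source categories from the text: memory institution, organization, individual.
claims = [
    {"text": "Painting attributed to Leonardo", "origin": "memory institution"},
    {"text": "Painting attributed to a pupil",  "origin": "individual"},
    {"text": "Workshop piece",                  "origin": "organization"},
]

def search_by_authority(records, allowed_origins):
    """Limit results to claims defended by the chosen kinds of authority."""
    return [r for r in records if r["origin"] in allowed_origins]

# A "scholarly" search might exclude claims defended only by individuals.
scholarly = search_by_authority(claims, {"memory institution", "organization"})
print(len(scholarly))  # 2
```

A more detailed situation would simply replace the three-way list with a finer taxonomy, without changing the filtering logic.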
From such examples, we begin to see how the modules and lists for inputting knowledge and the lists to search for knowledge can gradually converge. Again we see that while anyone can make links, only those links which take us back to their sources are truly helpful. The need to cite sources was recognized by Renaissance humanists, who called for a return ad fontes. But whereas the Renaissance quest was limited to pointing to sources beyond the manuscript or book at hand, the new media allow a direct link with such sources. Hence, proper use of electronic equivalents of such sources can improve our success in accessing true and meaningful knowledge and at the same time provide new criteria for judging the quality of humanists and scholars in future.
4. Levels of Significance in Assessing a Claim
History has taught us that significance is one of the most elusive characteristics to assess. Some assure us that given the phrase “publish or perish,” quantity of publications is the prime criterion for significance. Here caution is advised. Andrew Lang (1844-1912) was undoubtedly a significant writer. Wikipedia records more than 140 books that he published. Of Lao Tse only 81 paragraphs are extant. Yet many would rightly insist that those 81 paragraphs, in a slender book called the Tao te Ching that inspired Taoism, had considerably greater significance than the writings of one of the most productive scholars and journalists of 19th-century Britain. Meanwhile, peer review, citation indexes and the emerging field of automated citation indexes, also termed dynamic contextualization, offer further ways of assessing significance.
4.1 Peer Review
Paul Ginsparg, Cornell University, has argued for a two-tiered approach whereby more articles are accepted almost automatically in the short term and the full peer review process is applied to a considerably smaller subset in the longer term. Here, once again, the science community is suggesting new models that could potentially be used by the entire scholarly community. In terms of our model, the first tier would make personal and collaborative knowledge available at the level of e-preprints (level 12 in figure 2) and the second tier would act as a filter in deciding what subset of this flux enters into the category of enduring knowledge (levels 1-10).
4.2 Automatic Citation Indexes
In the 1970s, Derek de Solla Price developed the fields of bibliometrics and scientometrics to address the problem of significance. Over the past decades these fields have blossomed into a fashion for citation indexes. A major breakthrough of the past few years is a trend whereby the process of citation indexing is becoming automated, such that it can be integrated seamlessly into scholarly works and potentially reflect all citations rather than the sample hitherto provided in American citation indexes. Michele Barbera and Nicolo D'Ercole (Pisa) and their team have developed Hyperjournal, which includes Dynamic Contextualization, a P2P tool that: “allows readers to visualize, while reading an article, all the articles quoted by and all those quoting the one they are reading. Dynamic Contextualization also enables you to easily carry out bibliometrical calculations such as: the number of quotations received by an article or by an author, citation source groupings by journal, by topic, by period.”
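The behaviour described in the quotation, seeing for any article both what it cites and what cites it, amounts to traversing a citation graph in both directions. A minimal sketch, with an invented toy graph rather than any real journal data:

```python
from collections import defaultdict

# Directed citation graph: article -> articles it quotes.
cites = {
    "A": ["B", "C"],
    "B": ["C"],
    "D": ["A", "C"],
}

# Invert the graph once to answer "who quotes this article?"
cited_by = defaultdict(list)
for article, targets in cites.items():
    for target in targets:
        cited_by[target].append(article)

def contextualize(article):
    """Return what the article quotes and what quotes it."""
    return {"quotes": cites.get(article, []),
            "quoted_by": sorted(cited_by.get(article, []))}

print(contextualize("C"))  # {'quotes': [], 'quoted_by': ['A', 'B', 'D']}
```

The bibliometrical calculations mentioned (quotation counts per article or author, groupings by journal or period) are then simple aggregations over the same inverted index.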
If this approach were combined with our knowledge concerning kinds of journals (e.g. official journals in a field, journals published by key societies or Special Interest Groups (SIGs) of experts) and/or linked with standard collections of reviews, this could lead to new insights concerning the influence of a given scholar. Meanwhile, this approach to dynamic contextualization is the more significant because Paolo D’Iorio, the author of Nietzsche Open Source and Open Source Models in the Humanities: From Hyper Nietzsche to Hyper Learning (April 2004), has integrated it into his project on Hyper Learning (Hypermedia Platform for Electronic Research and Learning), which aims to create an advanced e-learning system in the humanities with 1) complex interactive web sites; 2) a distributed web platform; 3) virtual collaborative learning communities; and 4) an appropriate pedagogical and legal framework.
Ultimately we need some combination of quantity of output, quantity of citations and preferably also an indication of the extent to which authors are cited by experts in their own fields. Some authors establish fields, some authors contribute to accepted fields and some distinguish themselves by demonstrating the boundaries of strictly defined fields are too narrow to address the larger questions of scholarship. We need tools that will help us to recognize the contributions of all three of these types.
5. Levels of Thoroughness in Supporting a Claim
The above cautionary examples concerning significance may seem more evasive than incisive, but their combined thrust is that no single method offers a magic solution. Implicitly this suggests that thoroughness is the only way we can hope to achieve a balanced view. While attractive in theory this poses deep philosophical problems and challenges.
When a world expert gives a brilliant speech, attentive members of the audience are able to judge the points made in the speech. It would take another world expert of equal standing to have some sense of how much the brilliant speech omitted. The problems of brilliant speeches are also the problems of scholarship, which all too often is viewed as a series of brilliant books and articles or as a catalogue of those areas which are known and settled. Knowledge is presented as if it were a map of land conquered. All too often, however, we have no equivalent of a world map for knowledge; we have no clue as to how much has been covered so far. Roadmaps, a buzzword from the political arena, have become a fashionable term within the knowledge landscape. Alas, they typically show us a few (possible information) highways and provide little indication of everyday roads, streets, paths and trails.
We know from history that such knowledge maps have proved essential in the advancement of science and knowledge. In the late 19th century, once there was a periodic table, once one understood the scope and limits of chemical compounds, one could start a process of looking for them systematically and filling in the missing gaps. It took a century, and even then there were a few bits to add, but it worked because there was a clear outline of what was not yet known (a map of ignorance in the true sense), which helped to guide explorers of new knowledge.
In spite of all the billions of printed and online pages today, we have remarkably little by way of serious tools to map our ignorance, to provide some indication of the level of thoroughness in dealing with a claim. In terms of Leonardo da Vinci, for instance, bibliographies exist, but an updated list of all drawings and paintings of Leonardo and his school, a catalogue raisonné in the traditional sense, does not yet exist. Hence our quest to make knowledge accessible needs to be complemented by new kinds of cartography that map both our knowledge and our ignorance: the territory covered and the areas left uncharted. If we make maps of accomplishments and dead ends, there will be more hope of finding live ends and especially live non-ends. Needed are virtual reference rooms where systematic connections between these resources can be created and virtual agoras where shortcomings can be discussed.
The open source movement, impulses from science, and more recently initiatives from governments have re-introduced the feasibility of universal access to human knowledge. The quest for (distributed) digital libraries needs to be complemented by virtual reference rooms and virtual agoras. In this way, the ideal of a collective notebook can become an extension of existing systems for cataloguing and searching the cumulative knowledge of collective memory institutions.
The quest for full freedom of expression and open access in terms of quantity needs to be complemented by criteria that highlight the central importance of quality. To this end, we have suggested the need for five new features: 1) variants and multiple claims; 2) levels of certainty in making a claim; 3) levels of authority in defending a claim; 4) levels of significance in assessing a claim; and 5) levels of thoroughness in supporting a claim. In future, these features can serve as new criteria for scholarship. The vision of open source knowledge on a fully semantic web may well take another century to achieve, but this only confirms that if patience is a virtue, endurance and energy are a necessity.
See the author’s Towards a Semantic Web for Culture, Journal of Digital Information (JoDI), Volume 4, Issue 4, Article No. 255, 2004-03-15.