The Semantic Web in Ten Passages

 

Harold Boley

 

Copyright 2002-10-20, 2004-09-23

 

 

Institute for Information Technology e-Business

National Research Council of Canada

46 Dineen Drive, Fredericton, New Brunswick, E3B 9W4

Canada

 

 

 

The substance of the Semantic Web and the technology of semantic search engines are discussed. The article explains the notions of semantic search, crawlers, precision and recall, standard concepts, as well as standard predicates and knowledge derivation. It discusses the issues of agreeing on standard concepts/predicates, how to assign them to Web pages, and where to store these assignments as metadata. It further shows how refined standard concepts inherit standard predicates and why maintenance is a hard problem in library catalogues and metadata ontologies.

 

 

 

PASSAGE 1: Meaningful Search in the Billion-Fold Planetary Network

 

The World Wide Web currently includes more than four billion (often ‘scrollable’) pages.

 

When you search it for a particular page, you have to ‘sieve’ through pages more thoroughly than if you were searching for one specific cubic grain of sand with 1 mm side length, tightly packed into four boxes of such grains, each box a cube with 1000 mm = 1 m side length and thus holding a billion grains.

 

Indeed the ‘grains’ of these 1, 2, ... cubic meters of Web ‘sand’ have been dispersed as the nodes of a net that spans the whole planet. You can click from one sand-grain page to the next via the ‘threads’ of this net, the URLs (Uniform Resource Locators). But using this manual ‘navigation’ along the URL links you are not likely to find the pages of interest. That's why so-called ‘search engines’ have been developed.

 

These 10 passages explain how search engines work in principle and how they are currently being improved towards ‘semantics’, ‘sense’ or ‘meaning’.

 

Actually, the conventional Web is being expanded at present – mainly in North America and Europe – into a so-called “Semantic Web”:

 

Future search engines should be able to ‘understand’ the ‘semantics’ – the meaning – of Web pages well enough to enable ‘sensible’ queries. But at the moment ‘semantic search engines’ exist only for specialized areas of knowledge.

 

For this reason, techniques of "Knowledge Representation", which have long been studied in Artificial Intelligence (AI), are moving into focus for the Web. Semantic search engines are to become ‘intelligent’ insofar as they are to be equipped with a conceptual representation of Web pages. This will help people as direct users. It will also help the so-called ‘agent systems’ of AI, which, based on this core technology of the Semantic Web, try to offer higher-level Web services such as information comparison, integration, abstraction, or trading.

 

In the following passages model-type simplifications are used and unnecessary jargon is avoided. Instead of the often-found emphasis on XML-based languages for the Semantic Web, we stress its substance here.

 

The passages are developed around a uniform example of searching for medicine, which serves for demonstration purposes only and can easily be generalized.

 

Many principles will become clear when you imagine a “Semantic Web” in analogy to a (Web-based) "specialized library".

 

 

 

PASSAGE 2: The Search Engine and its Crawler

 

Most search engines use a so-called "crawler", i.e. a program that periodically and automatically navigates across as many of the currently existing Web pages as possible.

 

For every page the crawler mainly analyses the text components. Basically, it enters the central and frequent words of a page into a huge ‘address book’.

 

Every word in this ‘address book’ thus refers to a list of all the pages *in which this word was discovered by the crawler*. More specifically, this list contains a summary of each page together with its URL *address* with which you, as the inquirer, can click through to the complete result page if desired.

 

You receive this ‘hit list’ of pages, piece by piece, after you type in that word.
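Such an ‘address book’ is known in information retrieval as an "inverted index". Here is a minimal sketch in Python (the URLs and page texts are invented; a real crawler would also rank pages and store summaries); its multi-word search anticipates the word combinations of PASSAGE 3:

# Minimal sketch of a crawler's 'address book' (an inverted index).
# URLs and page texts are invented for illustration.
pages = {
    "http://example.org/a": "aspirin helps against head pain",
    "http://example.org/b": "wonder drug for dogs",
}

address_book = {}                            # word -> set of page URLs
for url, text in pages.items():
    for word in text.split():
        address_book.setdefault(word, set()).add(url)

def search(*words):
    """Return the 'hit list': pages in which ALL given words were found."""
    hits = [address_book.get(w, set()) for w in words]
    return set.intersection(*hits) if hits else set()

print(search("head", "pain"))                # {'http://example.org/a'}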

 

Imagine Mr X is looking for pages that include the (composite) word "wonder drug" to see if there is one that helps against head pain.

 

============== Google Search: "wonder drug" ===========

 

A hard-to-penetrate silt of 24,000 pages comes to the fore. The search result scores too low on its so-called “precision”: in particular, the word “drug” is ambiguous in this compound – it can mean medicine or narcotics, or perhaps both at the same time. Mr X only wanted a type of medicine – all the other found pages are worthless to him.

 

 

 

PASSAGE 3: Precision and Recall – Conflicting Measures for Search Results

 

Imagine you are looking for pages that contain the word “Aspirin” in order to check it as a remedy for head pain.

 

============== Google Search: Aspirin ===========

 

For prevalent isolated words like this you receive far too many pages, in this case 640,000. Despite the unambiguousness of “Aspirin”, the search result is not precise enough either:

Besides the grains of sand that are ‘precious’ for your search, you still receive too many ‘irritating’ grains, here, e.g., pages about Aspirin for dogs.

 

However, because a crawler enters *all the important words of an analysed page* into the ‘address book’, you can now narrow down the search by typing a whole combination of words into the search line. A page is then returned only if the crawler has discovered *all of these search words* in it.

 

============ Google Search: Aspirin “head pain” ==========

 

The precision has noticeably improved: we now only get the 82,800 pages in which the words “Aspirin” and “head pain” appear together.

 

But wait: have we perhaps cut out pages because we only wrote “head pain” but not “head hurt”, which means the same thing?

 

Indeed: in improving the *precision measure* we have seriously forfeited the so-called *recall measure*.

 

So as not to exclude any interesting pages, you would have to connect the words meaning the same thing with an OR combination.

 

============= Google Search: Aspirin  “head hurt” OR “head pain”  =============

 

We now again have a search result with better recall: 87,800 pages. But what about “migraine” and other related words? Unfortunately, in omitting “migraine” we have still excluded a whole ‘subweb’ of interesting pages.
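The two conflicting measures can be stated exactly. A sketch in Python with invented page sets: precision is the fraction of retrieved pages that are relevant, recall the fraction of relevant pages that were retrieved.

# Precision and recall over invented page sets.
relevant  = {"p1", "p2", "p3", "p4"}    # pages that really interest the inquirer
retrieved = {"p1", "p2", "p5"}          # pages the search engine returned

hits = relevant & retrieved             # the 'precious grains' among the results
precision = len(hits) / len(retrieved)  # 2/3: one 'irritating' page retrieved
recall    = len(hits) / len(relevant)   # 2/4: two interesting pages missed

print(precision, recall)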

 

 

 

PASSAGE 4: Semantics – From Common Words to Standard Concepts

 

Since about 1999 such problems have been attacked via the Semantic Web, a vision of Tim Berners-Lee, the inventor of the conventional Web.

 

“Semantically”, i.e. ‘with respect to the meaning’, what we are looking for is the *concept* that can be named in the pages by “head pain” OR “head hurt” OR “migraine” OR another related *word*.

 

A ‘Semantic Search Engine’ could use, for example, *one* semantic standard concept for the whole group of related words, named, e.g., by a Latin term, or by the capitalized English term “Headache”.

 

For this purpose the ‘address book’ would internally only use “Headache”. However, this standard concept would refer to all pages in which the crawler found “head pain” OR “head hurt” OR “migraine” OR another related common word.

 

Conversely, an ambiguous common word such as “wonder drug” would have to be represented internally by several standard concepts.

 

A coinage such as “Aspirin” could be used as its own standard concept.
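A minimal sketch of this standardization in Python (the word-to-concept table is invented and tiny; a real one would come from a shared medical vocabulary):

# Sketch: mapping common words to standard concepts (table invented).
standardizes = {
    "head pain":   {"Headache"},
    "head hurt":   {"Headache"},
    "migraine":    {"Headache"},
    "wonder drug": {"Medicine", "Narcotic"},  # ambiguous: several concepts
    "aspirin":     {"Aspirin"},               # a coinage as its own concept
}

def concepts_of(common_word):
    """Normalize a query word to its standard concept(s)."""
    return standardizes.get(common_word.lower(), set())

print(concepts_of("migraine"))       # {'Headache'}
print(concepts_of("wonder drug"))    # {'Medicine', 'Narcotic'}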

 

For your query you could now directly use the standard concept “Headache” or any one of the common words it standardizes – the Semantic Search Engine would always find all the pages ‘meant’.

 

================ Semantic Search: Aspirin Headache ============

=== does not work this way as yet in universal search engines such as Google ====

 

The recall would now be complete.

But is the precision perfect as well?

 

 

 

PASSAGE 5:  Semantic Relationships Between Standard Concepts and Knowledge Derivation

 

Up to now the standard concepts Aspirin and Headache stand beside one another in an unconnected manner.

 

However, you only wanted pages claiming that Aspirin *cures* head pain – not the (scarcer) pages claiming that Aspirin *causes* head pain.

 

A Semantic Search Engine should thus be able to also express the semantic relationships between standard concepts.

 

Hence we find ourselves in the middle of the AI field of (Web-based) knowledge representation, for which languages such as RDF and RuleML have been developed.

 

In other words, the ‘address book’ now becomes a “knowledge base”:

it contains so-called ‘facts’ such as “Aspirin CURES Headache”

(here simply a triple of the form “Subject PREDICATE Object”).

 

This fact now only points to URL addresses of pages which claim that Aspirin remedies head pain, where the all-capitalized English term “CURES” serves as a ‘standard predicate’ standing for common words used in the pages such as “remedies”, “heals”, etc.

 

The opposite fact “Aspirin CAUSES Headache” is treated in an analogical manner.
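A sketch of such a fact-storing ‘knowledge base’ in Python (URLs invented): each Subject PREDICATE Object triple points to the pages claiming it.

# Sketch: the 'address book' as a knowledge base of triples (URLs invented).
knowledge_base = {
    ("Aspirin", "CURES",  "Headache"): {"http://example.org/a"},
    ("Aspirin", "CAUSES", "Headache"): {"http://example.org/b"},
}

def pages_claiming(subject, predicate, obj):
    """Return the pages on which this fact was found."""
    return knowledge_base.get((subject, predicate, obj), set())

print(pages_claiming("Aspirin", "CURES", "Headache"))   # {'http://example.org/a'}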

 

This would permit the final version of your query.

 

============= Semantic Search: Aspirin CURES Headache ========

=== does not work this way as yet in universal search engines such as Google ====

 

Now we would also be happy with the precision.

 

Oddly, some pages claim both semantic relationships at the same time, the curing *and* the causing one. The following query would find exactly those pages.

 

============ Semantic Search: Aspirin CURES Headache AND Aspirin CAUSES Headache ==

=== does not work this way as yet in universal search engines such as Google ====

 

In order to be able to compactly label this circumstance and to query it easily, it is possible to describe such pages with a further standard predicate “AMB”, even if they do not contain a corresponding common word such as “ambivalent”, “conflicting” etc.

 

============ Semantic Search: Aspirin AMB Headache ==============

=== does not work this way as yet in universal search engines such as Google ====

 

Instead of storing “Aspirin AMB Headache” as a *fact* in the ‘address book’, a representation language such as RuleML would even allow this triple to be derived from the two stored facts with a so-called *rule*.

 

A special ‘If-then’ derivation such as

IF Aspirin CURES Headache AND Aspirin CAUSES Headache THEN Aspirin AMB Headache

is performed with the general ‘IF-THEN’ rule

IF Pharm CURES Sick           AND Pharm CAUSES Sick           THEN Pharm AMB Sick

via ‘variable bindings‘ such as ‘Pharm = Aspirin’ and ‘Sick = Headache’.

 

Such a rule thus explicitly deduces knowledge (in this case about an ‘ambivalence’) that was already implicitly hidden in the facts (here in ‘cures’ plus ‘causes’); in parallel, as a Semantic Web rule it would find every page that fulfils the ‘IF’ part and hence also the ‘THEN’ part (here each “AMB” page).
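As a sketch in Python (the fact base is invented), the general rule can be run over the stored facts by matching the ‘IF’ part with variable bindings and adding the ‘THEN’ part:

# Sketch: IF Pharm CURES Sick AND Pharm CAUSES Sick THEN Pharm AMB Sick.
facts = {
    ("Aspirin",  "CURES",  "Headache"),
    ("Aspirin",  "CAUSES", "Headache"),
    ("VitaminC", "CURES",  "Cold"),      # invented extra fact
}

def derive_amb(facts):
    """Derive AMB triples from matching CURES/CAUSES fact pairs."""
    derived = set()
    for (pharm, pred, sick) in facts:    # bindings: Pharm=pharm, Sick=sick
        if pred == "CURES" and (pharm, "CAUSES", sick) in facts:
            derived.add((pharm, "AMB", sick))
    return derived

print(derive_amb(facts))    # {('Aspirin', 'AMB', 'Headache')}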

 

The central requirement for all these possibilities in the Semantic Web is that the crawler can correctly manage the interplay between common words and standard concepts.

 

This leads us in the next three passages to important research questions about the Semantic Web:

 

PASSAGE 6) Where do the standard concepts and standard predicates come from?

 

PASSAGE 7) How does one assign the standard concepts/predicates to common words?

 

PASSAGE 8) Where will the assignments be stored as metadata?

 

 

 

 

PASSAGE 6:  Where Do the Standard Concepts and Standard Predicates Come from?

 

Standard concepts such as Headache in our example are usually developed as part of a system of interdependent concepts.

 

In order to do this, experts of the specialized field addressed, in this case medicine, have to agree on shared, normative definitions of their concepts and predicates.

 

These can then be published as a reference catalogue of connected standard concepts and standard predicates, e.g. again on a Web page.

 

For this the hierarchical superconcept-subconcept connection is the most important one.

 

Example (will be expanded in PASSAGE 9):

A Pain-Headache connection puts Headache below Pain:

 

                                 Pain

                                   |

                                   |

                                   |

                               Headache

 

For the machine processing of such concept catalogues, special languages such as RDF Schema, DAML+OIL, and OWL have been developed in efforts towards the Semantic Web.
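As a sketch in Python (the hierarchy is invented beyond the Pain-Headache link above), such a catalogue can be represented by superconcept links, much as rdfs:subClassOf links concepts in RDF Schema; subsumption is then checked by walking up the links:

# Sketch: a tiny concept hierarchy via superconcept links (partly invented).
superconcept = {
    "Headache": "Pain",
    "Pain":     "Symptom",   # invented continuation of the hierarchy
}

def is_subconcept(concept, candidate_super):
    """Walk up the superconcept links."""
    while concept is not None:
        if concept == candidate_super:
            return True
        concept = superconcept.get(concept)
    return False

print(is_subconcept("Headache", "Pain"))   # True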

 

For such shared explicit concept catalogues AI has borrowed the expression “ontologies” from philosophy.

 

So-called ‘category-based search engines’ such as Yahoo! and dmoz use hierarchical directories of Web pages.

 

Such a category hierarchy is comparable to the concept hierarchy of an ontology.

However, it was usually developed not jointly by the domain’s experts but by the respective search-engine provider (exception: dmoz.org).

 

Category-based search engines are thus precursors of the Semantic Search Engines striven for.

However, they usually require complex navigation through the category hierarchy (not yet competitive with the Google search line).

 

 

 

PASSAGE 7: How Does One Assign the Standard Concepts/Predicates to Common Words?

 

 

Ideally, the crawler would navigate through the pages for the important common words and assign the right standard concepts and standard predicates to them fully automatically.

 

But such full automation is very difficult, because:

 

-    The assignment can often only be established correctly from the meaning context.

-    Due to the limited number of standard concepts, a common word sometimes has to be circumscribed with a formula made up of *several* standard concepts (e.g. an *OR combination of standard concepts* for the – unspecific – “stomach ache”).

-    The assignment of standard predicates necessitates a sentence-level analysis (parsing), which depends on successful assignments of standard concepts in the subject and object positions of semantic relationships (compare PASSAGE 5).

-    Many pages mainly contain audio and video material, from which standard concepts can only be extracted through sound/image analysis.

-    Sometimes, in order to classify new pages, it is even necessary to extend the ontology, which only domain experts should be allowed to do.

 

For this reason the classification of pages should always be done interactively together with experts (a sketch of this workflow follows the list):

1)   For a given page, the crawler proposes standard concepts, some mutually connected by semantic relationships via standard predicates.

2)   At least for unclear cases, these proposals are then corrected and, if necessary, completed by experts.
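A minimal sketch of this two-step workflow in Python (the concept table and the expert step are invented stand-ins):

# Sketch of interactive classification: crawler proposes, expert reviews.
standardizes = {"head pain": {"Headache"}, "aspirin": {"Aspirin"}}

def propose_concepts(page_text):
    """Step 1: the crawler proposes standard concepts for a page."""
    proposals = set()
    for word, concepts in standardizes.items():
        if word in page_text.lower():
            proposals |= concepts
    return proposals

def expert_review(proposals):
    """Step 2: a domain expert corrects/completes the proposals
    (here simulated by accepting them unchanged)."""
    return proposals

page = "Aspirin helps against head pain."
print(expert_review(propose_concepts(page)))   # {'Headache', 'Aspirin'}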

 

Thus, in the medium term, the costs of semantic classification can probably be covered only for the relevant parts of the expanding ‘sandstorm’ of Web pages. Interestingly, dmoz, e.g., has currently captured about 3.8 million entry pages with about 52,000 volunteer (unpaid) experts.

 

In the print-media sector, a similar assignment has traditionally been carried out more or less manually by specialized librarians. (Many of them are increasingly moving towards the area of “specialized digital libraries”, which, with ‘vertical’ search engines, can become a core piece of the Semantic Web.)

 

 

 

PASSAGE 8: Where Will the Assignments be Stored as Metadata?

 

 

A group of standard concepts – possibly interconnected via semantic relationships through standard predicates – is useful for describing a page containing the corresponding common words: the group constitutes “metadata” for this page.

 

There are two principal possibilities for storing these metadata:

 

“EXTERNAL”: The ‘address book’ described earlier can store a standard concept or a semantic relationship together with its assignment to all pages with the corresponding common words. Standard concepts or semantic relationships then act as so-called “external metadata” for the pages they refer to.

 

“INTERNAL”: The pages themselves – if they also have text parts – can store their own descriptive standard concepts or semantic relationships. They then act as so-called “annotations”, i.e. as internally added metadata for the pages in which they appear.

 

Advantage of “EXTERNAL” and disadvantage of “INTERNAL”:

Only by separating the metadata from the pages themselves is it possible to describe pages that one does not own or in which there is no ‘place’ (text part) for annotations (e.g., audio and video pages).

 

Advantage of “INTERNAL” and disadvantage of “EXTERNAL”:

If metadata are stored as annotations directly in their pages, then for every change of a page the affected annotations can be immediately updated as well without first having to search for external metadata of the page – e.g. via an inverse use of the ‘address book’.

 

A compromise would be to refer to the metadata of a page via a URL that is stored internally in the page or can be found directly with it in a special ‘place’ – e.g. in a page header.
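A sketch of the two storage options in Python (the URLs and the ‘META:’ annotation format are invented):

# "EXTERNAL": metadata kept apart from the pages, keyed by their URLs.
external_metadata = {
    "http://example.org/a": {("Aspirin", "CURES", "Headache")},
}

# "INTERNAL": an annotation embedded in the page itself,
# here as an invented 'META:' line in its text part.
page_text = """META: Aspirin CURES Headache
Aspirin quickly remedies head pain ..."""

def internal_metadata(text):
    """Collect the triples annotated inside a page."""
    triples = set()
    for line in text.splitlines():
        if line.startswith("META:"):
            s, p, o = line.split()[1:]
            triples.add((s, p, o))
    return triples

print(external_metadata["http://example.org/a"])
print(internal_metadata(page_text))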

 

This finally leads us to the important problem of change/maintenance in the Semantic Web. One source of this problem is that, unlike books, many Web pages have the characteristic, unpleasant for the crawler, of frequently changing ‘on the quiet’.

 

 

 

 

 

 

PASSAGE 9: Refined Standard Concepts Inherit Refined Semantic Relationships

 

 

We have just seen:

When page contents with their common words change, often the corresponding standard concepts and semantic relationships are affected as well – they have to be readjusted.

 

But there is also another maintenance problem:

What happens when the standard concepts or semantic relationships themselves change over the years, e.g. through concept refinements following new scientific discoveries or simply due to a new ‘Zeitgeist’?

 

In this way, e.g., our sample standard concept Headache could be split into subconcepts such as Sporadic-Headache and Chronic-Headache, so that for the time being it would be possible to agree on this tiny concept hierarchy:

 

                                 Pain

                                   |

                                   |

                                   |

                               Headache

                                 /   \

                                /     \

                               /       \

                 Sporadic-Headache    Chronic-Headache

 

Using this, our earlier semantic relationship “Aspirin CURES Headache” could then, e.g., be refined by experts within the corresponding ontology

 

 

 

                                 Pain

                                   |

                                   |

                                   |

Aspirin---------CURES--------->Headache

                                 /   \

                                /     \

                               /       \

                 Sporadic-Headache    Chronic-Headache

 

 

to express one of the following assertions:

 

Aspirin cures

 

… sporadic headache:

 

                                 Pain

                                   |

                                   |

                                   |

                               Headache

                                 /   \

                                /     \

                               /       \

Aspirin--CURES-->Sporadic-Headache    Chronic-Headache

 

 

 

… chronic headache:

 

                                 Pain

                                   |

                                   |

                                   |

                               Headache

                                 /   \

                                /     \

                               /       \

Aspirin          Sporadic-Headache    Chronic-Headache

   |                                     ^

   |                                     |

    -----CURES---------------------------

 

 

 

… or both subtypes of headache:

 

                                 Pain

                                   |

                                   |

                                   |

                               Headache

                                 /   \

                                /     \

                               /       \

Aspirin--CURES-->Sporadic-Headache    Chronic-Headache

   |                                     ^

   |                                     |

    -----CURES---------------------------

 

 

However, if, as in the last example, a semantic relationship is meant for *all* subconcepts (here: Sporadic-Headache, Chronic-Headache), it can also more ‘economically’ be left at the superconcept (here: Headache), from where it is then automatically ‘inherited’ by the subconcepts on demand only (similarly to the class hierarchies of object-oriented programs).
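A sketch of this inheritance in Python (the hierarchy and fact base are invented): the CURES fact is stored once at Headache and found for the subconcepts by walking up the hierarchy on demand.

# Sketch: a relationship left at the superconcept is 'inherited' on demand.
superconcept = {
    "Sporadic-Headache": "Headache",
    "Chronic-Headache":  "Headache",
    "Headache":          "Pain",
}
facts = {("Aspirin", "CURES", "Headache")}   # stored once, at the superconcept

def cures(pharm, sick):
    """Look for the fact at the concept itself, then up the hierarchy."""
    while sick is not None:
        if (pharm, "CURES", sick) in facts:
            return True
        sick = superconcept.get(sick)
    return False

print(cures("Aspirin", "Chronic-Headache"))  # True, inherited from Headache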

 

As a result of such concept refinements two principal possibilities arise for the pages classified by them:

 

“UPDATE”: We can attempt corresponding retroactive updates to the metadata of all the affected ‘old’ pages.

In this case domain experts should decide whether one or more subconcepts such as Sporadic-Headache and Chronic-Headache were ‘meant’ or whether their old common superconcept Headache remains correct.

 

“SWITCH”: We can switch the metadata ontology at certain points in time, continue to access the ‘old’ pages via the ‘old’ metadata, and only for the ‘new’ pages use the ‘new’ metadata.

In this case Headache would stay unrefined as a standard concept for an old page, even if domain experts would immediately notice that the page is, e.g., actually only about Sporadic-Headache.

 

Since after each “SWITCH” a further generation of the ontology versions would be needed for the corresponding generation of (normally further changing!) Web pages, this option entails a high permanent administration overhead for the crawler.

 

Therefore “UPDATE” seems to be the better option, even if in each case it entails a substantial amount of work to be done once.

 

 

 

PASSAGE 10: Library Catalogues as Metadata Ontologies

 

 

The so-called “Pinakes” of Kallimachos of Kyrene (about 250 B.C.) are thought to be the first written library catalogue: they classified a selection of scrolls from the Library of Alexandria. Since then, all libraries have faced a maintenance problem analogous to the one of PASSAGE 9, as their catalogues have, since Gutenberg, become something like ‘metadata for print media’. (While the HTML Web brought back ‘scrollable pages’, digital PDF libraries again favour ‘pages in pieces’: the game “scrolling down versus turning pages” is a draw.)

 

Although “UPDATE” would be the ‘nicer’ solution, many libraries have chosen the “SWITCH” solution, i.e. put up with the fact that users sometimes have to search in two or more catalogues.

 

The problem is rooted in the drifts and faults of concepts across times or cultures.

 

The Semantic Web will not be able to solve *this* problem either, but both solution possibilities, “UPDATE” and “SWITCH”, will be supported by software tools of the Semantic Web.

 

In particular, initial tools – such as Chimaera, PROMPT, and RDFT – have been developed for the interactive concept bridging between ontologies.

 

These could also later help with maintaining library catalogues.

 

Conversely, the Semantic Web can learn a lot from Library Science. Initiatives – e.g. within Math-Net and CISTI – attempt to bring the two together.

 

A special (quality) need of Web-based documents arises from their low ‘entry barrier’ compared to documents that make their way into the multi-copy distribution system of traditional libraries: efficient RATING of Data, Metadata, and Raters is ESSENTIAL for Semantic Subwebs that want to compete with the good old paper-based libraries.

 

The Semantic Web, on the basis of AI, is a new subfield of computer science with various further interdisciplinary relations, e.g. to logic, linguistics, and cognitive science.