Copyright
2002-10-20, 2004-09-23
Institute for Information Technology e-Business
National Research Council of
The
substance of the Semantic Web and the technology of semantic search engines are
discussed. The article explains the notions of semantic search, crawlers,
precision and recall, standard concepts, as well as standard predicates and
knowledge derivation. It discusses the issues of agreeing on standard
concepts/predicates, how to assign them to Web pages, and where to store these
assignments as metadata. It further shows how refined standard concepts inherit
standard predicates and why maintenance is a hard problem in library catalogues
and metadata ontologies.
The
World Wide Web currently includes more than four billion (often ‘scrollable’)
pages.
When
you search it for a particular page, you have to ‘sieve’ through pages more
thoroughly than if you were searching for one specific cubic grain of sand of
1 mm side length among the grains tightly packed into two boxes, each a cube of
1000 mm = 1 m side length and thus holding 1000 × 1000 × 1000 = one billion
such grains.
Indeed
the ‘grains’ of these 1, 2, ... cubic meters of Web ‘sand’ have been dispersed
as the nodes of a net that spans the whole planet. You can click from one
sand-grain page to the next via the ‘threads’ of this net, the URLs (Uniform
Resource Locators). But using this manual ‘navigation’ along the URL links you
are not likely to find the pages of interest. That's why so-called ‘search
engines’ have been developed.
These
10 passages explain how search engines work in principle and how they are
currently being improved towards ‘semantics’, ‘sense’ or ‘meaning’.
Actually,
the conventional Web is being expanded at present – mainly in North America and
Search
engines in future should be able to ‘understand’ the ‘semantics’ – the meaning
– of Web pages far enough to enable ‘sensible’ queries. But at the moment
‘semantic search engines’ only exist for specialized areas of knowledge.
For
this reason, techniques of "Knowledge Representation", which have long been
studied in Artificial Intelligence (AI), are moving into focus for the Web.
Semantic search engines are to become ‘intelligent’ insofar as
they are to be equipped with a conceptual representation of Web pages. This
will help people as direct users. It will also help the so-called ‘agent
systems’ of AI, which, based on this core technology of the Semantic Web, try
to offer higher-level Web services such as information comparison, integration,
abstraction, or trading.
In
the following passages model-type simplifications are used and unnecessary
jargon is avoided. Instead of the often-found emphasis on XML-based languages
for the Semantic Web, we stress its substance here.
The
passages are developed using a uniform example in the search for medicine,
which, however, serves for demonstration purposes only and can be easily
generalized.
Many
principles will become clear when you imagine a “Semantic Web” in analogy to a
(Web-based) "specialized library".
Most
search engines use a so-called "crawler", i.e. a program that
periodically and automatically navigates across as many of the currently
existing Web pages as possible.
For
every page the crawler mainly analyses the text components. Basically, it
enters the central and frequent words of a page into a huge ‘address book’.
Every
word in this ‘address book’ thus refers to a list of all the pages *in which
this word was discovered by the crawler*. More specifically, this list contains
a summary of each page together with its URL *address* with which you, as the
inquirer, can click through to the complete result page if desired.
After
you type in that word, you get this ‘hit list’ of pages in portions.
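The ‘address book’ described above corresponds to what is usually called an inverted index. The following minimal Python sketch illustrates the idea under simplifying assumptions: the three pages and their example.org URLs are invented for illustration, every word is entered (not only the central and frequent ones), and the page summaries are omitted.

```python
import re
from collections import defaultdict


def build_address_book(pages):
    """Map each lower-cased word to the set of URLs of the pages containing it."""
    address_book = defaultdict(set)
    for url, text in pages.items():
        for word in re.findall(r"[a-z]+", text.lower()):
            address_book[word].add(url)
    return address_book


# Hypothetical miniature 'Web' of three pages (invented for this sketch).
PAGES = {
    "http://example.org/a": "Aspirin helps against head pain",
    "http://example.org/b": "Aspirin for dogs",
    "http://example.org/c": "A wonder drug for head pain",
}

address_book = build_address_book(PAGES)
print(sorted(address_book["aspirin"]))
```

Looking up a word such as "aspirin" in this structure yields the ‘hit list’ of pages for that word.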
Imagine
Mr X is looking for pages that include the (composite) word "wonder
drug" to see if there is one that helps against head pain.
============== Google Search: "wonder drug" ===========
A
hard-to-penetrate silt of 24,000 pages comes to the fore. The search result
rates too low in its so-called “precision”: in particular, the word “drug” is
ambiguous in this composition – it can mean medicine or narcotics or perhaps
both at the same time. Mr X only wanted a type of medicine – all the other
found pages are worthless to him.
Imagine
you are looking for pages that contain the word “Aspirin” in order to check it
as a remedy for head pain.
============== Google Search: Aspirin ===========
For
common isolated words like this you receive far too many pages, in this
case 640,000. Although “Aspirin” is unambiguous, the search result is
not precise enough either:
Besides
the grains of sand that are ‘precious’ for your search, you still receive too
many ‘irritating’ grains, here, e.g., pages about Aspirin for dogs.
However, because a crawler enters *all the important words of an analysed page* into the ‘address book’, you can now narrow down the search by typing a whole combination of words in the search line. You then receive a page only if the crawler has discovered *all of these search words* in it.
============ Google Search: Aspirin “head pain” ==========
The
precision has noticeably improved: we now only get the 82,800 pages in which
the words “Aspirin” and “head pain” appear together.
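Narrowing down by a combination of words amounts to intersecting the hit lists of the individual words. The sketch below assumes a hand-built ‘address book’ with invented page names; it also treats “head pain” as two separate words, whereas a real search engine would handle the quoted phrase as a unit.

```python
def search_all(address_book, words):
    """Return the pages that contain every one of the given search words."""
    hit_lists = [address_book.get(word, set()) for word in words]
    if not hit_lists:
        return set()
    return set.intersection(*hit_lists)


# Hypothetical 'address book': word -> set of page names (invented data).
ADDRESS_BOOK = {
    "aspirin": {"page1", "page2", "page3"},
    "head": {"page1", "page3", "page4"},
    "pain": {"page1", "page4"},
}

hits = search_all(ADDRESS_BOOK, ["aspirin", "head", "pain"])
print(hits)
```

Each additional search word can only shrink the result, which is why precision tends to improve with longer queries.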
But
wait: Have we perhaps cut out pages because we only wrote “head pain” but not
the word “head hurt” which means the same thing?
Indeed:
In improving the *precision measure* (the share of found pages that are
actually relevant) we have seriously sacrificed the so-called *recall measure*
(the share of all relevant pages that are actually found).
So as not to exclude any interesting pages, you would have to connect the words
meaning the same thing with an OR combination.
============= Google Search: Aspirin “head hurt” OR “head pain” =============
We
now again have a search result with a better recall: 87,800 pages. But what
about “migraine” and other related words? Unfortunately, by omitting “migraine”
we have still excluded a whole ‘subweb’ of interesting pages.
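The trade-off between the two measures can be made concrete with their usual definitions. The following sketch uses invented toy data: four pages are assumed to be relevant, "p5" stands for an ‘irritating’ page, and "p4" for a page that only uses the word “migraine”.

```python
def precision(found, relevant):
    """Share of the found pages that are relevant."""
    return len(found & relevant) / len(found) if found else 0.0


def recall(found, relevant):
    """Share of the relevant pages that were found."""
    return len(found & relevant) / len(relevant) if relevant else 0.0


# Invented toy data for illustration only.
RELEVANT = {"p1", "p2", "p3", "p4"}
narrow = {"p1", "p2", "p5"}        # Aspirin "head pain"
broad = {"p1", "p2", "p3", "p5"}   # Aspirin "head hurt" OR "head pain"

for name, found in [("narrow", narrow), ("broad", broad)]:
    print(name, round(precision(found, RELEVANT), 2), round(recall(found, RELEVANT), 2))
```

In this toy case the OR query raises the recall from 0.5 to 0.75, but the “migraine” page is still missed.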
Since
about 1999 such problems have been attacked via the Semantic Web, a vision of
Tim Berners-Lee, who had already invented the conventional Web.
“Semantically”,
i.e. ‘with respect to the meaning’, what we are looking for is the *concept*
that can be named in the pages by “head pain” OR “head hurt” OR “migraine” OR
another related *word*.
A
‘Semantic Search Engine’ could use, for example, *one* semantic standard
concept for the whole group of related words, named, e.g., by a Latin term, or
by the capitalized English term “Headache”.
For
this purpose the ‘address book’ would internally only use “Headache”. However,
this standard concept would refer to all pages in which the crawler found “head
pain” OR “head hurt” OR “migraine” OR another related common word.
Conversely,
an ambiguous common word such as “wonder drug” would have to be represented
internally by several standard concepts.
A
coinage such as “Aspirin” could be used as its own standard concept.
For
your query you could now directly use the standard concept “Headache” or any
one of the common words it standardizes – the Semantic Search Engine would
always find all the pages ‘meant’.
================ Semantic Search: Aspirin Headache ============
=== does not work this way as yet in universal search engines such as Google ====
The
recall would now be complete.
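How a standard concept could stand in for a whole group of common words can be sketched as follows. All data are invented for illustration; in particular, the concept names "Medicine" and "Narcotic" for the ambiguous “wonder drug” are hypothetical and do not come from any real ontology.

```python
# Hypothetical mapping from common words to standard concepts.
WORD_TO_CONCEPTS = {
    "head pain": {"Headache"},
    "head hurt": {"Headache"},
    "migraine": {"Headache"},
    "headache": {"Headache"},
    "aspirin": {"Aspirin"},
    "wonder drug": {"Medicine", "Narcotic"},  # ambiguous: several concepts
}

# Hypothetical 'address book' keyed by standard concept instead of by word.
CONCEPT_INDEX = {
    "Headache": {"page1", "page2", "page3"},
    "Aspirin": {"page1", "page3", "page4"},
}


def semantic_search(terms):
    """Pages that match, for every query term, at least one of its concepts."""
    result = None
    for term in terms:
        pages = set()
        for concept in WORD_TO_CONCEPTS.get(term.lower(), set()):
            pages |= CONCEPT_INDEX.get(concept, set())
        result = pages if result is None else result & pages
    return result or set()


print(semantic_search(["Aspirin", "Headache"]))
print(semantic_search(["aspirin", "migraine"]))
```

Both queries return the same pages, because “migraine” and “Headache” are resolved to the same standard concept before the ‘address book’ is consulted.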
But
is the precision perfect as well?
Up to
now the standard concepts Aspirin and Headache stand beside one another,
unconnected.
However,
you only wanted pages claiming that Aspirin *cures* head pain – not the
(rarer) pages claiming that Aspirin *causes* head pain.
A
Semantic Search Engine should thus be able to also express the semantic
relationships between standard concepts.
Hence
we find ourselves in the middle of the AI field of (Web-based) knowledge
representation, for which languages such as RDF and RuleML have been developed.
In
other words, the ‘address book’ now becomes a “knowledge base”:
it
contains so-called ‘facts’ such as “Aspirin CURES Headache”
(here
simply a triple of the form “Subject PREDICATE Object”).
This
fact now only points to URL addresses of pages which claim that Aspirin
remedies head pain, where the all-capitalized English term “CURES” serves as a
‘standard predicate’ standing for common words used in the pages such as
“remedies”, “heals”, etc.
The
opposite fact “Aspirin CAUSES Headache” is treated in an analogous manner.
This
would permit the final version of your query.
============= Semantic Search: Aspirin CURES Headache ========
=== does not work this way as yet in universal search engines such as Google ====
Now
we would also be happy with the precision.
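A ‘knowledge base’ of such facts can be sketched as a table from triples to page addresses. This is only an illustration in plain Python, not RDF or RuleML; the page names are invented, and the small word-to-predicate table is an assumption built from the examples given above.

```python
from collections import defaultdict

# Hypothetical mapping from common words to standard predicates.
WORD_TO_PREDICATE = {
    "cures": "CURES",
    "remedies": "CURES",
    "heals": "CURES",
    "causes": "CAUSES",
}

# Knowledge base: (Subject, PREDICATE, Object) -> pages making that claim.
knowledge_base = defaultdict(set)


def add_claim(subject, common_word, obj, page):
    """Record that a page claims 'subject <common_word> obj'."""
    predicate = WORD_TO_PREDICATE[common_word]
    knowledge_base[(subject, predicate, obj)].add(page)


def query(subject, predicate, obj):
    """Return the pages on which the given fact is claimed."""
    return knowledge_base.get((subject, predicate, obj), set())


add_claim("Aspirin", "remedies", "Headache", "page1")
add_claim("Aspirin", "heals", "Headache", "page2")
add_claim("Aspirin", "causes", "Headache", "page3")

print(query("Aspirin", "CURES", "Headache"))
```

The query for “Aspirin CURES Headache” then returns only the pages with the ‘curing’ claim and leaves out the page with the ‘causing’ one.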
Oddly,
some pages claim both semantic relationships at the same time, the curing *and*
the causing one. The following query would find exactly those pages.
============ Semantic Search: Aspirin CURES Headache AND Aspirin CAUSES Headache ==
=== does not work this way as yet in universal search engines such as Google ====
In
order to be able to compactly label this circumstance and to query it easily,
it is possible to describe such pages with a further standard predicate “AMB”,
even if they do not contain a corresponding common word such as “ambivalent”,
“conflicting” etc.
============ Semantic Search: Aspirin AMB Headache ==============
=== does not work this way as yet in universal search engines such as Google ====
Instead
of storing “Aspirin AMB Headache” as a *fact* in the ‘address book’, a
representation language such as RuleML would even allow this triple to be
derived from the two stored facts with a so-called *rule*.
A
special ‘IF-THEN’ derivation such as
IF Aspirin CURES Headache AND Aspirin CAUSES Headache THEN Aspirin AMB Headache
is performed with the general ‘IF-THEN’ rule
IF Pharm CURES Sick AND Pharm CAUSES Sick THEN Pharm AMB Sick
via ‘variable bindings’ such as ‘Pharm = Aspirin’ and ‘Sick = Headache’.
Such
a rule thus explicitly deduces knowledge (in this case about an ‘ambivalence’)
that was already implicitly hidden in the facts (here in ‘cures’ plus
‘causes’); in parallel to this, as a Semantic Web rule it would find every page
that fulfils the ‘IF’ part, hence also the ‘THEN’ part (here each “AMB” page).
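The effect of this rule can be sketched in a few lines of Python. The sketch hard-codes the single rule from above instead of using a rule language such as RuleML; the page names and the extra fact about "Fever" are invented for illustration.

```python
def derive_amb(knowledge_base):
    """Apply: IF Pharm CURES Sick AND Pharm CAUSES Sick THEN Pharm AMB Sick.

    knowledge_base maps (Subject, PREDICATE, Object) triples to sets of pages.
    A derived AMB fact points to the pages that claim both stored facts.
    """
    derived = {}
    for (pharm, predicate, sick), cures_pages in knowledge_base.items():
        if predicate != "CURES":
            continue
        # Variable bindings: 'pharm' and 'sick' are now fixed by the CURES fact.
        causes_pages = knowledge_base.get((pharm, "CAUSES", sick), set())
        both = cures_pages & causes_pages
        if both:
            derived[(pharm, "AMB", sick)] = both
    return derived


# Invented facts for illustration.
FACTS = {
    ("Aspirin", "CURES", "Headache"): {"p1", "p2", "p3"},
    ("Aspirin", "CAUSES", "Headache"): {"p3", "p4"},
    ("Aspirin", "CURES", "Fever"): {"p5"},
}

print(derive_amb(FACTS))
```

Here only the page "p3" claims both relationships, so it is the only page reached by the derived fact “Aspirin AMB Headache”.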
The
central requirement for all these possibilities in the Semantic Web is that the
crawler can correctly manage the interplay between common words and standard
concepts.
This
leads us in the next three passages to important research questions about the
Semantic Web:
PASSAGE 6) Where do the standard concepts and standard predicates come from?
PASSAGE 7) How does one assign the standard concepts/predicates to common words?
PASSAGE 8) Where will the assignments be stored as metadata?
Standard
concepts such as Headache in our example are usually developed as part of a
system of interdependent concepts.
In
order to do this, experts of the specialized field addressed, in this case
medicine, have to agree on shared, normative definitions of their concepts and
predicates.
These
can then be published as a reference catalogue of connected standard concepts
and standard predicates, e.g. again on a Web page.
For
this the hierarchical superconcept-subconcept connection is the most important
one.
Example
(will be expanded in PASSAGE 9):
A
Pain-Headache connection puts Headache below Pain:
  Pain
   |
   |
   |
Headache
For
the machine processing of such concept catalogues, special languages such as
RDF Schema, DAML+OIL, and OWL have been developed in efforts towards the
Semantic Web.
For
such shared explicit concept catalogues AI has borrowed the expression
“ontologies” from philosophy.
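The superconcept-subconcept connection of such a concept catalogue can be sketched as a simple table. This is a plain-Python illustration of the idea, not RDF Schema, DAML+OIL or OWL; the function names are hypothetical.

```python
# Tiny concept hierarchy from the example: subconcept -> superconcept.
SUPERCONCEPT = {"Headache": "Pain"}


def superconcepts(concept):
    """List all concepts above the given one, nearest first."""
    chain = []
    while concept in SUPERCONCEPT:
        concept = SUPERCONCEPT[concept]
        chain.append(concept)
    return chain


def is_a(concept, candidate):
    """True if 'concept' equals 'candidate' or lies below it in the hierarchy."""
    return concept == candidate or candidate in superconcepts(concept)


print(superconcepts("Headache"))
print(is_a("Headache", "Pain"))
```

With this table, Headache counts as a kind of Pain, but not the other way round.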
So-called
‘category-based search engines’ such as Yahoo! and dmoz use hierarchical
directories of Web pages.
Such
a category hierarchy is comparable to the concept hierarchy of an ontology.
However,
it was usually not developed jointly by the experts of a domain, but by the
respective search-engine provider (exception: dmoz.org).
Category-based
search engines thus are precursors of the Semantic Search Engines sought.
However,
they usually require complex navigation through the category hierarchy (not yet
competitive with the Google search line).
Ideally,
the crawler would navigate through the pages for the important common words and
assign the right standard concepts and standard predicates to them fully
automatically.
But
such full automation is very difficult, because:
- The assignment can often only be
established correctly from the meaning context.
- Due to the limited number of standard concepts,
sometimes a common word has to be circumscribed with a formula made up of
*several* standard concepts (e.g. *OR combination of standard concepts* for –
unspecific – “stomach ache”).
- The assignment of standard predicates necessitates a
sentence-level analysis (parsing), which is dependent on successful assignments
of standard concepts in the subject and object positions of semantic
relationships (compare PASSAGE 5).
- Many pages mainly contain audio and video material,
from which standard concepts can only be extracted through sound/image
analysis.
- Sometimes in order to classify new pages it is even
necessary to extend the ontology, which only domain experts should be allowed
to do.
For
this reason the classification of pages should always be done interactively
together with experts:
1) The crawler for a given page proposes standard concepts, some mutually
connected by semantic relationships via standard predicates.
2) At least for unclear cases these will then be corrected and if necessary
completed by experts.
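The two steps above can be sketched as follows. The sketch makes a strong simplification that is not part of the original text: a case counts as ‘unclear’ exactly when the crawler has no proposal or more than one proposal for a common word. The table of proposals and the concept names "Medicine" and "Narcotic" are invented, and the expert is modelled as a hypothetical callback function.

```python
# Hypothetical crawler knowledge: common word -> proposed standard concepts.
PROPOSALS = {
    "migraine": ["Headache"],
    "aspirin": ["Aspirin"],
    "wonder drug": ["Medicine", "Narcotic"],  # ambiguous, hence unclear
}


def classify(common_words, ask_expert):
    """Step 1: propose concepts; step 2: let an expert settle unclear cases.

    ask_expert(word, proposals) must return the concepts the expert accepts.
    """
    concepts = set()
    for word in common_words:
        proposals = PROPOSALS.get(word, [])
        if len(proposals) == 1:
            concepts.add(proposals[0])                      # clear case
        else:
            concepts.update(ask_expert(word, proposals))    # unclear case
    return concepts


# A stand-in 'expert' who always reads "wonder drug" as a medicine.
def demo_expert(word, proposals):
    return ["Medicine"] if word == "wonder drug" else []


print(classify(["aspirin", "wonder drug"], demo_expert))
```

Only the ambiguous word reaches the expert here, which reflects the cost argument: expert time is spent on the unclear cases.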
Thus,
in the medium term, the costs of semantic classification can probably be
covered only for relevant parts of the expanding ‘sandstorm’ of Web pages.
Interestingly, e.g., dmoz has currently captured about 3.8 million entry pages
with about 52,000 volunteer (unpaid) experts.
In
the print-media sector, a similar assignment has been traditionally carried out
by specialized librarians more or less manually. (Many of them are increasingly
moving towards the area of “specialized digital libraries”, which with
‘vertical’ search engines can become a core piece of the Semantic Web.)
A
group of standard concepts – possibly ‘interconnected via semantic
relationships’ through standard predicates – is useful for describing a page
containing the corresponding common words: the group constitutes “metadata” for
this page.
There
are two principal possibilities for storing these metadata:
“EXTERNAL”:
The ‘address book’ described earlier can store a standard concept or a semantic
relationship together with its assignment to all pages with the corresponding
common words. Standard concepts or semantic relationships then act as so-called
“external metadata” for the pages they refer to.
“INTERNAL”:
The pages themselves – if they also have text parts – can store their own
descriptive standard concepts or semantic relationships. They then act as
so-called “annotations”, i.e. as internally added metadata for the pages in
which they appear.
Advantage
of “EXTERNAL” and disadvantage of “INTERNAL”:
Only
by separating the metadata from the pages themselves is it possible to describe
pages that one does not own or in which there is no ‘place’ (text part) for
annotations (e.g., audio and video pages).
Advantage
of “INTERNAL” and disadvantage of “EXTERNAL”:
If
metadata are stored as annotations directly in their pages, then for every
change of a page the affected annotations can be immediately updated as well
without first having to search for external metadata of the page – e.g. via an
inverse use of the ‘address book’.
A
compromise would be to refer to the metadata of a page via a URL that is stored
internally in the page or can be found directly with it in a special ‘place’ –
e.g. in a page header.
This
finally leads us to the important problem of change/maintenance in the Semantic
Web. One source of this problem is that, unlike books, many Web pages have the
characteristic – unpleasant for the crawler – of often changing ‘on the
quiet’.
We
have just seen:
When
page contents with their common words change, often the corresponding standard
concepts and semantic relationships are affected as well – they have to be
readjusted.
But
there is also another maintenance problem:
What
happens when the standard concepts or semantic relationships themselves change
over the years, e.g. through concept refinements following new scientific
discoveries or simply due to a new ‘Zeitgeist’?
In
this way, e.g., our sample standard concept Headache could be split into
subconcepts such as Sporadic-Headache and Chronic-Headache, so that for the
time being it would be possible to agree on this tiny concept hierarchy:
                      Pain
                       |
                       |
                       |
                    Headache
                   /        \
                  /          \
                 /            \
Sporadic-Headache              Chronic-Headache
Using
this, also our earlier semantic relationship “Aspirin CURES Headache” could,
e.g., be refined by experts within the corresponding ontology
                                 Pain
                                  |
                                  |
                                  |
Aspirin---------CURES--------->Headache
                              /        \
                             /          \
                            /            \
           Sporadic-Headache              Chronic-Headache
to
express one of the following assertions:
Aspirin cures …
… sporadic headache:
                                       Pain
                                        |
                                        |
                                        |
                                     Headache
                                    /        \
                                   /          \
                                  /            \
Aspirin--CURES-->Sporadic-Headache              Chronic-Headache
… chronic headache:
                                       Pain
                                        |
                                        |
                                        |
                                     Headache
                                    /        \
                                   /          \
                                  /            \
Aspirin          Sporadic-Headache              Chronic-Headache
   |                                                   ^
   |                                                   |
   -----CURES-------------------------------------------
… or both subtypes of headache:
                                       Pain
                                        |
                                        |
                                        |
                                     Headache
                                    /        \
                                   /          \
                                  /            \
Aspirin--CURES-->Sporadic-Headache              Chronic-Headache
   |                                                   ^
   |                                                   |
   -----CURES-------------------------------------------
However,
if, as in the last example, a semantic relationship is meant for *all*
subconcepts (here: Sporadic-Headache, Chronic-Headache), it can also more
‘economically’ be left at the superconcept (here: Headache), from where it is
then automatically ‘inherited’ by the subconcepts only on demand (similar to
the class hierarchies of object-oriented programs).
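Such inheritance on demand can be sketched by storing the fact once at the superconcept and walking up the hierarchy at query time. The function name `holds` is hypothetical, and the sketch only inherits along the object position of a triple.

```python
# Refined concept hierarchy from the example: subconcept -> superconcept.
SUPERCONCEPT = {
    "Headache": "Pain",
    "Sporadic-Headache": "Headache",
    "Chronic-Headache": "Headache",
}

# The relationship is stored once, at the superconcept Headache.
STORED_FACTS = {("Aspirin", "CURES", "Headache")}


def holds(subject, predicate, obj):
    """True if the fact is stored for obj or for one of its superconcepts."""
    concept = obj
    while concept is not None:
        if (subject, predicate, concept) in STORED_FACTS:
            return True
        concept = SUPERCONCEPT.get(concept)  # move one level up, or stop
    return False


print(holds("Aspirin", "CURES", "Chronic-Headache"))
print(holds("Aspirin", "CURES", "Pain"))
```

The fact is found for both subconcepts of Headache without being stored for them, while it is not passed upwards to Pain.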
As a
result of such concept refinements two principal possibilities arise for the
pages classified by them:
“UPDATE”:
We can try corresponding retroactive updates to the metadata of all the
affected ‘old’ pages.
In
this case domain experts should decide whether one or more subconcepts such as
Sporadic-Headache and Chronic-Headache were ‘meant’ or whether their old common
superconcept Headache remains correct.
“SWITCH”:
We can switch the metadata ontology at certain points in time, continue to
access the ‘old’ pages via the ‘old’ metadata, and only for the ‘new’ pages use
the ‘new’ metadata.
In
this case Headache would stay unrefined as a standard concept for an old page,
even if domain experts immediately noticed that the page was, e.g., only about
Sporadic-Headache.
Since
after each “SWITCH” a further generation of the ontology versions would be
needed for the corresponding generation of (normally further changing!) Web
pages, this option entails a high permanent administration overhead for the
crawler.
Therefore
“UPDATE” seems to be the better option, even if in each case it entails a
substantial amount of work to be done once.
The
so-called “Pinakes of Kallimachos of Kyrene” (about 250 B.C.) is thought
to be the first written catalogue of a library: they classified a selection of
scrolls from the library of
Although
“UPDATE” would be the ‘nicer’ solution, many libraries have chosen the solution
“SWITCH”, i.e. put up with the fact that users have to search in two or more
catalogues sometimes.
The
problem is rooted in the concept drifts and faults of times or cultures.
The
Semantic Web will not be able to solve *this* problem either, but both solution
possibilities, “UPDATE” and “SWITCH”, will be supported by software tools of
the Semantic Web.
In
particular, initial tools – such as Chimaera, PROMPT, and RDFT – have been
developed for the interactive concept bridging between ontologies.
These
could also later help with maintaining library catalogues.
Conversely,
the Semantic Web can learn a lot from Library Sciences. Initiatives – e.g.
within Math-Net and CISTI – attempt to bring both together.
A
special (quality) need of Web-based documents arises because of their low
‘entry barrier’ compared to documents that make their way into the multi-copy
distribution system of traditional libraries: Efficient RATING of Data,
Metadata, and Raters is ESSENTIAL for Semantic Subwebs that want to compete
with good, old paper-based libraries.
The
Semantic Web, on the basis of AI, is a new subfield of computer science with
various further interdisciplinary relations, e.g. to logic, linguistics, and
cognitive science.