Empirical Evaluations of Organizational Memory Information Systems:
A Literature Overview

Felix-Robinson Aschoff & Ludger van Elst
DFKI GmbH, Kaiserslautern
FRODO Discussion Paper
December 2001
Abstract

Noting a considerable lack of systematic empirical evaluation in the field of Knowledge Management, we give an overview of evaluative approaches in different research areas to date. We especially cover those areas which are relevant for the development of Organizational Memory Information Systems (OMIS): Knowledge Engineering (including Knowledge Acquisition and Ontologies), Human Computer Interaction, Information Retrieval and Software Engineering. We report on (experimental) studies and general guidelines for evaluation from the different research fields. Finally, we show implications for the evaluation of OMIS, propose rules of thumb for the realization of a systematic evaluative study and sketch first ideas for the evaluation of FRODO.
Table of Contents

1 Introduction
2 Contributions from Related Fields
  2.1 Knowledge Engineering
    2.1.1 General Methods and Guidelines
    2.1.2 Knowledge Acquisition
    2.1.3 Ontologies
  2.2 Human Computer Interaction
  2.3 Information Retrieval
  2.4 Software Engineering (Goal-Question-Metric Technique)
3 Implications for Organizational Memory Information Systems
  3.1 Implications for the evaluation of OMIS
  3.2 Relevant aspects of OMs for evaluations and rules of thumb for conducting evaluative research
  3.3 Preliminary sketch of an evaluation of FRODO
References
Appendix A: Technical evaluation of Ontologies taken from Gómez-Pérez (1999)
The aim of this document is to discuss important aspects of systematic evaluation in the field of knowledge management (KM). Many agree that systematic evaluations are becoming more and more important in this area, but so far general methods and guidelines still need to be developed (Tallis, Kim & Gill, 1999; Nick, Althoff & Tautz, 1999). Shadbolt (1999) states: "If our field is approaching maturity, our set of evaluation tools is in its infancy. This is not a healthy state of affairs."
We are especially interested in methods to empirically evaluate frameworks for Organizational Memory Information Systems (OMIS). Since frameworks for organizational memories rely on a broad range of approaches and methods, we cover the following research fields: Knowledge Engineering (including Knowledge Acquisition and Ontologies), Human Computer Interaction, Information Retrieval and Software Engineering.
By ‘empirical evaluation’ we understand “the appraisal of a theory by observation in experiments” (Chin, 2001)[1]. In the literature we found only few well-controlled experiments revealing the interaction between OMIS and users. We believe that this is partly due to a shifted scope in the construction of knowledge management systems. The classical expert systems (like MYCIN, an expert system for the diagnosis and treatment of bacterial infections in medicine) were developed so that domain experts could store their knowledge in a computer system. The goal was to elicit knowledge, formalize and implement it, in order to process and apply this knowledge in circumstances when the expert is not available.
A typical approach to technically supporting knowledge management are frameworks for organizational memories (e.g., FRODO; Abecker, Bernardi, van Elst, Lauer, Maus, Schwarz & Sintek, 2001), which rely more on a successful interaction between a heterogeneous group of users and a broader range of domains. The goal is no longer to make the system independent of the expert; instead, a constant interaction between users who enter knowledge, the system, and users who retrieve knowledge is intended. In our recommendations for evaluation we turn our attention to this aspect of system-user interaction, since we believe it to be a core aspect of today’s knowledge management which is not covered sufficiently in evaluative research.
In chapter 2 we give an overview of systematic evaluation in the KM field so far. We describe approaches for general methods and guidelines and cover evaluation studies which contain important aspects and hints for an evaluation of OMIS. We will not report the results of the evaluation studies in detail; we are rather interested in the general methods of evaluation the authors use. We will concentrate on the (experimental) research designs, the formulated hypotheses and the quantitative and qualitative metrics that are recorded for evaluation.
Chapter 2.1 is dedicated to the field of Knowledge Engineering. 2.1.1 covers general approaches for evaluation in this field, like the Critical Success Metric (CSM), the Sisyphus Initiative and High Performance Knowledge Bases. Chapter 2.1.2 deals with the process of knowledge acquisition and 2.1.3 with ontologies. Chapter 2.2 surveys evaluation in the field of Human Computer Interaction and 2.3 in the field of Information Retrieval. In chapter 2.4 we deal with the field of Software Engineering, especially with the Goal-Question-Metric Technique.
In chapter 3 we show relations between the research efforts reported in chapter 2 and the evaluation of OMIS. We propose a number of general steps that can be understood as rules of thumb for the experimental evaluation of OMIS. We finally sketch first ideas for the evaluation of FRODO (a framework for distributed organizational memories; Abecker et al., 2001).
Tim Menzies and Frank van Harmelen (1999) give the introduction to a special issue of the International Journal of Human-Computer Studies dedicated to the evaluation of knowledge engineering techniques. They see one of the core problems concerning evaluation in the fact that many KE researchers do not recognize the general purpose of an experimental study. Results seem to be limited to the concrete technology, tools and circumstances at hand and can hardly be generalized to the entire research field. Menzies and van Harmelen encourage researchers to take a more general view and to evaluate broader concepts. They ask: “Can we build better knowledge-based systems (KBS) faster now than in the 1980s?” With their essential theory approach they provide a broader conceptual base for comparing different schools of knowledge engineering. They identified a number of general theories (T0...T5) in building KBS and suggest benchmarking them against each other. These six theories differ in the extent to which they rely on the following concepts: libraries of procedures, a general inference engine, axioms, ontologies and libraries of Problem Solving Methods (see Fig. 1). T1, for example, relies on axioms and inference engines: “Crudely expressed, in T1, KE is just a matter of stuffing axioms into an inference engine and letting the inference engine work it all out”. Menzies and van Harmelen claim that most KE researchers work in one of these six niches. They propose a comparative evaluation across these essential theories in the following steps:
1) Identify a process of interest.
2) Create an essential theory T for that process.
3) Identify some competing process description, ¬T.
4) Design a study that explores core pathways in both ¬T and T.
5) Acknowledge that your study may not be definitive.
Fig. 1: Different schools of knowledge engineering (Menzies & van Harmelen, 1999)
With this essential theory approach Menzies and van Harmelen take a rather broad view of KE research, and they see evaluating two or more of these theories or technologies against each other in multiple experiments as the most promising approach for future evaluations in the KE field. The entire journal issue on evaluation can be recommended as preparatory literature for the development of an evaluation study. Especially Shadbolt, O'Hara & Crow (1999) give a very good overview of the history and problems in the field of evaluating knowledge acquisition techniques and methods.
Tim Menzies also maintains a website on Evaluation Methods for Knowledge Engineering[2]. He formulates questions on vital issues in the field of KE, for example:

How good is KE technique X?
Given KE techniques X and Y, which one should be used for some problem Z?
What makes for a good ontology/PSM?
Menzies also stresses the importance of “good” controlled experiments in evaluative research. Such experiments must have certain features, such as addressing some explicit, refutable hypothesis, being repeatable, and precisely defining the measurement techniques. Menzies states that “most current KE evaluation are not "good" controlled experiments.” He briefly formulates requirements for good measurement, referring to statistical requirements, measurement requirements and hypothesis requirements. In this document we will not cover questions of experimental design, hypothesis formulation or statistical theory in detail. Paul Cohen (1995) gives a comprehensive insight into the empirical methods necessary for the evaluation of AI programs. He covers the design of experimental settings and statistical methods, with a slight focus on the latter. For the construction of experiments we also recommend Martin (1995) and Chin (2001). In addition to Cohen, Hays (1994) can be recommended as a widely recognised reference for the field of statistical methods. We will cover literature about software metrics theory under 2.4. On his website Menzies divides evaluations in the field of knowledge engineering into the following areas: Knowledge-Level (KL) studies, Panel-based evaluation, Software Engineering (SE) studies, Repair studies, Human-Computer Interaction (HCI) studies, Mutation studies and Simulated experts, and cites typical research studies for each area.
With his Critical Success Metrics (CSM), Menzies (1999a) proposes the formulation of a critical question that can decide conclusively whether an expert system is a success or not. This question, or CSM, should reflect the business concern that prompted the development of the expert system. In Menzies (1999a), for example, he reports the evaluation of PIGE, a farm management expert system. He formulates the CSM: Can PIGE improve farm profitability as well as a pig nutrition expert does? Menzies could demonstrate that, measured in purely economic terms, PIGE outperformed its human author, an expert in pig nutrition.
One big advantage of the CSM method lies in the fact that evaluation can take place while the expert system is fully operational. This is achieved by defining pre-disaster points, which refer to states of the system that are less than optimal, but not yet critically under-performing. Given these pre-disaster points, trials can be performed by human experts and the expert system (or by two or more expert systems) that are terminated each time a pre-disaster point is reached. These trials can be compared using performance scores derived from the CSM. The disadvantage lies in the fact that CSM is explicitly a method for a yes-no assessment: either the system reaches the critical success margins or it does not. If it fails, the reasons for this failure can hardly be inferred using the CSM method. An "assess and repair" approach is not supported.
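The pre-disaster mechanism can be illustrated with a small sketch. Everything below — the profitability state, the threshold value and the decision effects — is an invented toy example, not taken from the PIGE evaluation:

```python
PRE_DISASTER_THRESHOLD = 0.4  # below this, a pre-disaster point is reached

def run_trial(decision_effects, state=1.0):
    """Apply a sequence of decision effects to a profitability state,
    terminating the trial as soon as a pre-disaster point is reached."""
    history = [state]
    for effect in decision_effects:
        state *= effect
        history.append(state)
        if state < PRE_DISASTER_THRESHOLD:
            break  # trial terminated: less than optimal, not yet a disaster
    return history

def performance_score(history):
    """CSM-derived score: the final state reached before termination."""
    return history[-1]

# Compare a trial driven by the expert system with one driven by the human.
system_history = run_trial([1.1, 0.9, 1.2])
human_history = run_trial([1.0, 0.5, 0.7])  # hits the pre-disaster point

system_wins = performance_score(system_history) > performance_score(human_history)
```

The point of the sketch is that the comparison needs no catastrophic failure: each trial is cut off at the pre-disaster point, so the evaluation can run while the system stays operational.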
One important aspect in evaluating knowledge-based systems is of course the comparison of different techniques or tools. Which tools are superior to others? Which tools are most effective for which tasks? These are interesting questions from a theoretical point of view which are difficult to address in the applied field. Tools are developed by more or less independent research groups, and evaluation is often not the core interest of these groups. Comparing tools on a fair basis is difficult, since one would need a neutral instance, which normally does not exist. One research group could evaluate different tools but would normally be biased toward its own tool, both in user competence and in personal interest. Instead of letting research groups evaluate different tools, the Sisyphus Initiative takes a different approach. Focusing on Problem Solving Methods (PSM), a number of common abstract problems were formulated that could be used for evaluation by different research groups. In the hope that a fair comparison would be possible, researchers could demonstrate how their techniques and tools were able to solve these Sisyphus problems. For an overview of Sisyphus I to IV see http://ksi.cpsc.ucalgary.ca/KAW/Sisyphus/.
Sisyphus I was a Room Allocation Problem, in which a number of persons with different requirements have to be allocated to a number of rooms or offices with different characteristics. Sisyphus I proved to be relatively easy for the different tools, and so Sisyphus II was created to provide a more realistic and more demanding knowledge engineering problem. The Sisyphus II Elevator Configuration Problem was taken from a real problem: the task was to configure an elevator in a building, given a large body of knowledge about building specifications, elevator components and safety constraints.
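To give a flavour of the kind of task Sisyphus I posed, a toy version can be solved by brute force. The persons, room names and constraints below are invented placeholders; the real problem statement is considerably richer:

```python
from itertools import permutations

# Toy instance of a room allocation problem (invented names and constraints).
people = ["head", "secretary", "researcher-1", "researcher-2"]
rooms = ["large-113", "small-114", "small-115", "small-116"]

def satisfies(assignment):
    """Hypothetical constraints: the head gets the large room and the
    secretary sits next door (room numbers differ by one)."""
    if not assignment["head"].startswith("large"):
        return False
    head_no = int(assignment["head"].split("-")[1])
    sec_no = int(assignment["secretary"].split("-")[1])
    return abs(head_no - sec_no) == 1

def solve():
    # Brute-force search over all one-to-one person-to-room assignments.
    for perm in permutations(rooms):
        assignment = dict(zip(people, perm))
        if satisfies(assignment):
            return assignment
    return None

solution = solve()
```

Participating tools differed precisely in how they modelled such requirements and searched the assignment space, which is why the problem was easy enough for all of them to solve.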
The Sisyphus I and II problems were a good way to bring the Knowledge Engineering community closer together. Researchers were working on the same problem and got hints on how the different tools behaved. Sisyphus I and II, though, cannot be seen as a systematic evaluation, for the following reasons (see Shadbolt et al. (1999) for drawbacks of the Sisyphus Initiative):
1) There were no "higher referees" who judged whether the tool of a certain research group was better than that of another.
2) No common metrics to compare tools on a fair basis were defined.
3) The Room Allocation and the Elevator Configuration Problem focused on the modelling of knowledge. In the process of solving these problems, the effort to build a model of the domain knowledge was usually not recorded.
4) Other significant aspects of knowledge engineering, like the accumulation of knowledge and cost-effectiveness calculations, were not paid any attention.
In an attempt to counter the weaknesses mentioned above, Shadbolt (1996) initiated Sisyphus III: The Igneous Rock Classification Problem. The task was to design an expert system that could assist astronauts, who are normally not specialists in geology, in classifying igneous rocks on their missions to the Moon or Mars. Sisyphus III takes a more systematic approach by
1) defining quantitative achievement measures to allow a controlled comparison of different approaches.
2) releasing information in staged series to create more realistic circumstances, since knowledge engineers usually do not get information as a whole at one time, but need to evolve it in a number of steps.
3) asking researchers to record their actions so that process variables could be captured (knowledge engineering meta-protocols).
One of the biggest problems with Sisyphus III seemed to be that the willingness of researchers to participate in the initiative dropped significantly with the above-mentioned requirements (partly because of funding problems), and that many of those who were participating did not follow the requirements very accurately (Shadbolt et al. 1999). In 1999 Shadbolt writes sceptically about Sisyphus: "Thus far none of the Sisyphus experiments have yielded much evaluation information (though at the time of writing Sisyphus III is not yet complete)". Nevertheless he suggests the continuation of a coordinated series of benchmark studies in the line of Sisyphus as most promising for further evaluation of frameworks.
The aim of the Sisyphus IV initiative is the
collaboration and integration of knowledge techniques or tools over the
Internet and the World Wide Web in order to increase the effectiveness of tools
at different sites. It seems that in Sisyphus IV the scope of the Initiative
has shifted towards collaboration over the Internet and that systematic
evaluation or benchmarking of approaches was not the main interest anymore.
By initiating Sisyphus V (1999b), Menzies follows the tradition of the first three Sisyphus initiatives and develops it further. With his High Quality Knowledge Base Initiative (hQkb), Menzies addresses a number of problems of the other Sisyphus initiatives. His approach is at least as systematic as Sisyphus III was. He explicitly wants to benchmark a wide range of systems by evaluating their quality using a Design & Learn approach as a reference frame. A great step forward seems to be the centralized independent assessment planned in Sisyphus V: all hQkb products are planned to be assessed at NASA's independent verification and validation facility. Menzies applied to NASA for funding of the hQkb evaluation, which in case of approval would diminish the funding problem (whereas independent research groups would still have to secure the funding of their hQkb products). It can be hoped that these improved circumstances will lead to higher participation in Sisyphus V than in Sisyphus III. We have to keep in mind, though, that an inferior judgement of an hQkb product will probably have greater negative effects for a research group participating in Sisyphus V than in earlier Sisyphus Initiatives.
High Performance Knowledge Bases (HPKB)[3] is a research project that is run by the Defense Advanced Research Projects Agency (DARPA) in the United States. Its goal is the development and evaluation of very large, flexible and reusable knowledge bases. One core interest of the programme is the rate at which knowledge can be modified in KBS. We will describe the setting of the programme and the first project phase with its products and evaluations, with a focus on research design and performance measures.
Three groups of researchers participated in the programme: 1) challenge problem developers, 2) technology developers and 3) integration teams. Challenge problem developers had the task of developing realistic scenarios which were of interest to the Defense Department and which could serve as challenging problems for the technology developers. Technology developers came from a number of mostly US universities and from industrial research groups and worked on solutions for the challenge problems. The integration teams were formed to put all the technology together into an integrated system and, if necessary, to develop products which could tie technology together into an integrated solution.
Cohen et al. (1998) report on the development and evaluation of three challenge problems. One problem is taken from the field of international crisis management and the other two concern battlespace problems. The international crisis scenario takes place in the Persian Gulf and involves hostilities between Saudi Arabia and Iran that culminate in Iran closing the Strait of Hormuz to international shipping. HPKB researchers made it their objective to construct a system that could answer natural language questions about the crisis and the options for the two sides. Questions the system should be able to answer could be, for example: Is Iran capable of firing upon tankers in the Strait of Hormuz? With what weapons? What risk would Iran face in closing the strait to shipping? The guiding philosophy during knowledge base development for this problem was to reuse knowledge whenever it made sense. The integrator team for the crisis management scenario used three existing knowledge bases: 1) the HPKB upper-level ontology developed by Cycorp, 2) the World Fact Book knowledge base from the Central Intelligence Agency (CIA) and 3) the Units and Measures Ontology from Stanford. Performance metrics for the evaluation of the crisis management problem were based on the answers the system gave to questions like those cited above. Overall competence was a function of the number of questions answered correctly. Since the system was also required to justify its answers by explaining the reasoning process and citing relevant sources, this additional information was also evaluated. The answer key to the question about the risks Iran faces when closing the strait, for example, contains: economic sanctions from {Saudi Arabia, GCC, U.S., U.N.}, because Iran violates an international norm promoting freedom of the seas. To substantiate its answer the system should name the Convention on the Law of the Sea as a reference. Each of the following four official evaluation criteria was rated on a scale between 0 and 3 by challenge problem developers and subject matter experts:
1) the correctness of the answer.
2) the quality of the explanation of the answer.
3) the completeness and quality of the cited sources.
4) the quality of the representation of the question.
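How such ratings might be combined into a per-question score can be sketched as follows. The averaging over raters and the equal weighting of the four criteria are our assumptions; the source only states that each criterion was rated from 0 to 3:

```python
# Each rater scores four criteria from 0 to 3 for one system answer.
CRITERIA = ["correctness", "explanation", "sources", "representation"]

def question_score(ratings_by_rater):
    """Average each criterion over all raters, then sum over the four
    criteria, giving a per-question score between 0 and 12."""
    n = len(ratings_by_rater)
    per_criterion = {c: sum(r[c] for r in ratings_by_rater) / n for c in CRITERIA}
    return sum(per_criterion.values())

ratings = [
    {"correctness": 3, "explanation": 2, "sources": 2, "representation": 3},
    {"correctness": 2, "explanation": 2, "sources": 1, "representation": 3},
]
score = question_score(ratings)  # 2.5 + 2.0 + 1.5 + 3.0 = 9.0
```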
The other two challenge problems had to do with strategic decision-making during military operations. The movement analysis problem was a scenario with military and non-military traffic occurring in a certain region. The task of the system was
1) to distinguish between military and non-military traffic.
2) to identify the sites between which military convoys travel and determine their military significance and their type.
3) to identify which enemy units are participating in each military convoy.
4) to determine the purpose of each convoy movement.
5) to infer the exact types of the vehicles that make up each convoy.
Performance metrics for the evaluation of the movement analysis problem were related to recall and precision. Performance was a function of how many entities (sites, convoys, vehicles, ...) were identified correctly by the system and how many incorrect identifications were made.
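Recall and precision over identified entities can be computed directly; the entity names below are invented for illustration:

```python
def recall_precision(identified, ground_truth):
    """Recall: fraction of true entities found. Precision: fraction of
    the system's identifications that are correct."""
    identified, ground_truth = set(identified), set(ground_truth)
    true_positives = identified & ground_truth
    recall = len(true_positives) / len(ground_truth)
    precision = len(true_positives) / len(identified)
    return recall, precision

truth = {"convoy-1", "convoy-2", "site-A", "site-B"}  # ground truth entities
system = {"convoy-1", "site-A", "site-C"}             # one wrong identification

recall, precision = recall_precision(system, truth)
# recall = 2/4 = 0.5, precision = 2/3
```

The two numbers capture exactly the trade-off described above: missed entities lower recall, incorrect identifications lower precision.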
The third challenge problem is also a battlefield scenario, called the workaround problem. Interesting military targets can be infrastructure like bridges or tunnels, which in case of destruction disable the movement of enemy troops. When a crucial facility is destroyed, an army will try to “work around” the blocked way to reach its target, e.g. by building a temporary bridge. By analysing the enemy’s possibilities to circumvent damaged infrastructure, one is able to locate the facilities with the highest effect on enemy troop movement. The task of the workaround challenge problem is to automatically assess how rapidly and by what method an enemy can reconstitute or bypass damage to a target. Performance measures for the evaluation included:
- coverage (whether all possible workarounds are generated)
- appropriateness (whether the generated workarounds are appropriate given the action)
- specificity (the exact implementation of the workaround)
- accuracy of timing inferences (the length of time each step in the workaround takes to implement).
The authors’ claim for evaluation was that HPKB technology facilitates rapid modification of knowledge-based systems. All three challenge problems were evaluated in a study that followed a two-phase, test-retest schedule. In the first phase the system was confronted with a problem quite similar to the problems that were used to design the knowledge base, whereas in the second phase a significant modification to the knowledge base was required. Within each phase the system was tested and retested on the same problem. The first test served as a baseline which was compared to the retest after improvements to the knowledge bases had taken place. The results of the evaluation studies met the expectations in many respects. The scores between tests and retests increased, especially in the second phase, where the system had to be modified significantly because of new problem structures. Many research studies also showed the performance differences between tools of the participating research groups, which developed their technology in a friendly competition.
Cohen et al. state that performance evaluations like the ones reported are essential but tell us little about the reasons why a system works successfully or not. Questions of whether a certain strategy or tool is important for a good technology, and why, cannot be answered this way. One would need a concrete theory or hypothesis that can be put to the test in an experimental study. Cohen et al. claim that HPKB facilitates rapid construction of knowledge-based systems because ontologies and knowledge bases can be reused. It is yet unclear which kind of challenge problem most favors the reuse claim and why. Cohen et al. are working on analytic models of reuse and plan to test the predictions of these models in future evaluation studies.
In addition to this, we would suggest defining critical success margins whenever possible. If reasonable predictions can be made not only that a system works successfully but also to what extent, the evaluation study can yield stronger results. With his Critical Success Metrics (CSM), Tim Menzies (1999a) proposes the formulation of a critical question which can definitely be answered with yes or no. It might be interesting to relate the improvement of the HPKB knowledge bases between the test and the retest to some standard derived from other knowledge bases or from the performance of human experts.
Knowledge Acquisition (KA) – the process of obtaining knowledge from humans or other sources for use in an expert system – is a difficult and complex task in the field of KB development. Eliciting knowledge from human experts proves especially problematic, and within the development cycle of a KB researchers speak of a knowledge elicitation bottleneck.
Shadbolt, O'Hara & Crow (1999) give a very good overview of the history and problems in the field of evaluating knowledge acquisition techniques and methods. They structure the difficulties in evaluating the KA process into five problem areas:
1) the availability of human experts
2) the need for a “gold standard” of knowledge
3) the question of how many different domains and tasks should be included in the evaluation
4) the difficulty of isolating the value added by a single technique or tool and
5) how to quantify knowledge and knowledge engineering effort.
In the following sections we will describe these problems and point out solutions.
One of the main problems when conducting an evaluation study to compare the effects of different KA techniques is the limited number of human experts available. To assemble a number of experts large enough to allow statistical significance testing in an experimental design (say >20) will in most cases not be possible. A compromise is to work with few experts and give up the possibility of statistical inference testing. Shadbolt et al. (1999) report on a study where only a single expert was examined in two experiments; in the second experiment he judged his own performance in the first. Of course the possibility to generalize the results diminishes when using only few subjects. A different solution is not to use domain experts but expert models, like students. Students have reached a certain level of expertise in their field and are usually available in greater numbers. They can be used as substitutes in evaluation studies of KA techniques and have the additional advantage that real experts can be taken as a “gold standard” to evaluate the results of the experiments. It can be called into question, however, whether knowledge elicitation with students can be compared to knowledge elicitation with experts. Experts might use different strategies and have a different representation of domain knowledge which a student has not yet developed. A final approach to this problem lies in the possibility to use a domain of day-to-day life, like reading or the identification of fruits. Since most people are “experts” in these capabilities, it is easy to assemble a sufficient number of subjects for an experiment. It is unclear, however, whether expertise in a complex scientific field can be compared to the usual abilities necessary in everyday life.
The second problem relates to the nature of the acquired knowledge. If knowledge is elicited from leading experts in a knowledge domain, there obviously can be no “gold standard” as a reference mark for comparison. It cannot be evaluated whether the resulting knowledge base covers the domain sufficiently. The two approaches to this problem were already mentioned: if students are used as expert models, a “gold standard” can be defined by real experts, and domains of everyday life also allow the formulation of an optimal knowledge coverage. In addition, the calculation of inferential power can yield information about the quality of the acquired knowledge. The inferential power of knowledge can for example be measured by representing it as production rules and using metrics from formal grammar theory. Further ways of measuring inferential power can be found in Berger et al. (1989).
The third problem raises the question whether a certain KA technique is independent of different domains and tasks, or favors certain areas or forms of use. The ideal would be to evaluate a technique using as many different domains and tasks as possible. This would of course lead to a scaling up of any experimental programme and will usually not be viable. It is important, though, to reflect to what extent the domain and the task influence the result of an evaluation.
The fourth aspect addresses the difficulty of designing experiments in which the resulting effect can clearly be linked to the KA technique. With only one experiment it is not possible to decide whether a positive or negative result is due to the technique or due to the implementation, the user interface or the platform used. In addition, KA tools are usually not used stand-alone but in combination with other tools. This makes the isolation of the value added by a tool or technique even more difficult. Shadbolt et al. name the following approaches to gain better experimental control over the different factors:
1) To disentangle confounded influences one can conduct a series of experiments.
2) Different implementations of the same technique can be tested against each other or against a paper-and-pencil version.
3) Groups of tools in complementary pairings can be tested, as well as different orderings of the same set of tools.
4) The value of single sessions can be tested against multiple sessions, and the effect of feedback in multiple sessions can be tested.
5) Finally, one should exploit techniques from the evaluation of standard software to control for effects from interface, implementation etc.
All these approaches, however, lead to a scale-up of the experimental programme. The rapid pace of software development will often make a thorough evaluation difficult, since the tool would probably be obsolete by the time it is evaluated with high scientific standards. Software developers will have to compromise between necessary evaluation and the speed of their development cycles.
The final topic relates to the quantification of knowledge and knowledge engineering effort. The
quantification of knowledge is obviously not a trivial task, and a number of
possible metrics can be proposed. One is to use production rules of the form “IF condition AND condition ... THEN
action” as a basis for quantification. The number of IF and AND clauses acquired in
a session can, for example, be counted as one measure of the amount of
knowledge. Another way would be to use emerging standards, like Ontolingua
(Gruber, 1993), for quantification. An interesting parameter for the efficiency
of a KA technique is of course the number of acquired rules per time period
(e.g. rules/minute). Here the time for preparation of the session as well as
the coding time after the session must also be taken into account. There seems to be
a link between certain psychometric test scores of experts and the number of
rules they can produce during an elicitation session. Shadbolt reports on a
study by Burton et al. who found a positive correlation between subjects’
embedded figures test (EFT) scores and both the total amount of effort and the
effort required to code transcripts of laddering sessions. One of the
parameters of the EFT is called “field-dependence”, which indicates to what
degree persons are overwhelmed by context. Burton et al. deduced from their
results that persons with a high “field-dependence” would have difficulty with
a spatial technique such as laddering. So it can be useful to apply
psychometric tests to find the optimal combination of expert and KA technique.
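The rule-counting quantification and the rules-per-minute efficiency measure described above can be sketched in a few lines. This is a minimal illustration with hypothetical rules and session times, not the metric of any particular KA tool:

```python
import re

def clause_count(rule: str) -> int:
    """Count IF and AND clauses in a production rule of the form
    'IF condition AND condition ... THEN action' (a crude proxy
    for the amount of knowledge contained in the rule)."""
    return len(re.findall(r'\b(?:IF|AND)\b', rule))

def ka_efficiency(rules, session_min, prep_min, coding_min):
    """Rules acquired per minute, charging preparation time and
    post-session coding time as well, as suggested in the text."""
    total = session_min + prep_min + coding_min
    return len(rules) / total

rules = ["IF fever AND rash THEN suspect measles",
         "IF cough THEN suspect cold"]
# 2 rules over 40 + 10 + 30 = 80 minutes of total effort
eff = ka_efficiency(rules, session_min=40, prep_min=10, coding_min=30)
```

Counting clauses rather than whole rules weights complex rules more heavily; both measures are, of course, only rough proxies for knowledge.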
Shadbolt et al. (1999) also throw light on the
enormous difficulties of systematically evaluating an entire framework. Since frameworks are much more general in scope and are designed to
cover a wide range of tasks and problems, if not the entire problem space, the
systematic control of influencing variables becomes even more difficult. Controlling
the path from the specific result to the general concept is the
challenge in evaluating frameworks. Shadbolt et al. state: "Only a whole series of experiments across a number of
different tasks and a number of different domains could control for all the
factors that would be essential to take into account." (p. 732) Shadbolt et al.
propose a continuation of the Sisyphus programme, or a Sisyphus-like programme, as the
most promising way to evaluate frameworks. Recall that Menzies and
van Harmelen (1999) explicitly take a different view on this matter and prefer
their essential theory approach (see
2.1.1) for KE evaluation in general. Even though they do not cover frameworks
explicitly, they would probably argue that their proposed comparison of KE
schools is a more adequate approach because of its broad conceptual coverage of
the entire KE field.
Tallis, Kim & Gil (1999) report that user studies are still uncommon in AI research; most evaluations concern run-time behavior with no human in the loop. They report on an experimental user study of knowledge acquisition tools. We will cite the steps they propose for designing experiments and report the lessons learned from their study.
The following steps for experimental studies are listed by the authors:
1. State general claims and specific hypotheses – what is to be tested
2. Determine the set of experiments to be carried out – which experiments will test which hypotheses
3. Design the experimental setup
a) Choose type of users that will be involved – what kind of background and skills
b) Determine the knowledge base used and KA task to be performed – what kinds of modifications or additions to what kinds of knowledge
c) Design the experiment procedure – what will the subject be told to do at each point
4. Determine data collection needs – what will be recorded
5. Perform experiment
6. Analyze results – what results are worth reporting
7. Assess evidence for the hypotheses and claims – what did we learn from the experiment
Tallis et al. point out that these steps are not to be understood as strictly sequential. Pre-tests, for example, can be very helpful to refine and improve the research study in an iterative process. The authors report the following lessons learned from conducting their experiment:
- Use within-subjects experiments. Participants with different skill levels turned out to be a problem, and comparison between different groups was difficult. A within-subjects design solves this problem. Another approach we would like to add here is the specification of skill level as a covariate variable (see Chin 2001 for further details on covariates).
- Minimize the variables unrelated to the claims to be proven. In the experiment users could use different tools (text editor or menu-based interface) to accomplish a task. These possibilities did not add any value to the experiment but increased the variability of the outcome unnecessarily.
- Minimize the chances that subjects make mistakes unrelated to the claims. Participants of the experiment made a number of mistakes (syntax errors, misunderstanding of domain and task) which made the interpretation of the results difficult. We would suggest keeping the experimental procedure as easy as possible and conducting pre-tests to find out where participants’ problems lie.
- Ensure that subjects understand the domain and the assigned KA task (see above).
- Avoid the use of text editors. Participants can make syntax errors when using text editors, and different skills in using text editors make it difficult to compare differences between subjects.
- Isolate as much as possible the KA activities and the data that are relevant to the hypothesis. We recommend being as precise as possible and planning an experimental design which conclusively relates the data to the formulated hypotheses.
An ontology is a formal, explicit specification of a shared conceptualization (Gruber 1993). As highly structured representations of a knowledge domain, ontologies serve a number of purposes in KM. By defining and interrelating the concepts of a knowledge domain, ontologies enable communication about a field of interest among humans and software agents. They make the reuse of knowledge and its combination with other domain knowledge possible, and they make knowledge more accessible by explicating domain assumptions. Ontologies can be used to analyze domain knowledge and to separate domain knowledge from operational knowledge. Ontologies are also important elements of problem-solving methods, allowing inference tools to solve task problems (Noy & McGuinness, 2001). We separate the process of evaluating ontologies into three parts:
1) the process of constructing the ontology
2) the technical evaluation
3) end user assessment and system-user interaction.
The evaluation of constructing an ontology
is closely connected to the field of Knowledge Acquisition; approaches and
problems are dealt with in section 2.1.2. Tennison, O’Hara & Shadbolt
(1999) report on their experimental evaluation of APECKS. APECKS (Adaptive Presentation
Environment for Collaborative Knowledge Structuring) is a system for the
collaborative construction, comparison and discussion of ontologies. The
evaluation had two main aims: 1) the identification of features
of the tool that need improvement, and 2) observation of how the tool was used
during the evaluation in order to better understand the user process. The specific hypotheses were:
1) that the reported usability of all tasks involving APECKS would increase over time, as subjects gained experience,
2) that subjects would expand all aspects of their ontologies over time, and
3) that the pattern of use would change over time, reflecting an increasing interest in and use of other people’s roles.
The third hypothesis relates to the general concept of APECKS: it supports the creation of personal ontologies (roles) and the comparison and discussion of these ontologies.
For reasons discussed in
section 2.1.2, Tennison et al. used undergraduate students for their study, who had
to construct ontologies in the domain of ‘mammals’. They recorded a number of
metrics to evaluate the ontology construction process with APECKS: subjects
attended four sessions constructing ontologies and completed a usability
questionnaire at the end of each session. Subjects had to rate the usability
of each of these six activities: finding, adding, changing and
removing information, comparing roles, and discussion. In addition to these
usability metrics, APECKS logged the pages the subjects visited and recorded the
length of time spent on each. After each session the following three
parameters concerning the state of each subject’s ontology were recorded: 1) the number
of each type of object, 2) the number of hierarchies present within the
ontology and 3) the number of subclass partitions that had been created. At the
end of the experiment the ontologies were judged subjectively by a knowledge
engineer.
The
system usability was evaluated by comparing the usability ratings after the
four sessions in a time series analysis. Tennison et al. used a one-way
within-subjects analysis of variance to compare the four points in time. The
ANOVA showed a significant difference for four of the six activities, and
subsequently applied t-tests showed a significant increase in usability between the
first and the final session for the following activities: finding information,
adding information and comparing roles. The following quantitative measures
were recorded to evaluate the quality of the subjects’ personal ontologies after
each of the four sessions: number of individuals, classes, slots, distinct
hierarchies, subclass partitions, annotations and criteria. Again a one-way
within-subjects ANOVA followed by t-tests was applied. The ontologies had
significantly more individuals, classes, hierarchies and annotations in the
final session than they did in the first. The protocol analysis showing the
time spent on each page by the subjects yielded, among other results, a
significant increase in the proportion of page requests that were visits to
pages owned by others; this result supports the third hypothesis that people
would take an increasing interest in other people’s ontologies during the course
of the study. Finally, Tennison et al. had the subjects comment on the
presentation, the navigation, the discussion, and the ontology construction and
comparison of APECKS, and obtained valuable hints concerning the advantages and
weaknesses of their system.
Tennison
et al. report a lack of evaluations of other ontology servers that could
serve as a baseline against which APECKS could be evaluated. Without such a
comparison the authors cannot judge whether APECKS is better or worse than
other systems. Because of the small number of evaluation studies there is no
generally accepted KA tool evaluation methodology available which would enable
researchers to routinely evaluate a series of useful aspects. Against this
background it is understandable why Tennison et al. use a broad range of
quantitative and qualitative, objective and subjective measures. In a phase where
the systematic evaluation of KA techniques is just evolving, this approach can yield
important hints for further research. Although a more explorative evaluation
appears sensible at this stage, one has to be aware that the possibility
to draw conclusions is limited. When many parameters are recorded without
stating an active hypothesis, Menzies[4]
calls this a “shotgun experiment”. Here the likelihood of finding
relationships merely by chance is high. In other words, if I predict a big
bundle of parameters to rise during my evaluation study, the chance that a
share of them actually do increase is high. We are not saying that Tennison et
al. conducted such an experiment; we just want to point out the problem that arises when many
parameters are recorded in an unspecific manner. In addition, we would
always suggest being as concrete as possible in the prediction of parameters. Tennison et al. stated the lack
of baselines or other evaluation studies that could serve as a comparison.
Wherever such a comparison or baseline can be found or inferred, we would
suggest applying it to increase the possibility of drawing important conclusions.
After its
construction there are a number of technical requirements an ontology has to
meet. According to Gómez-Pérez (1999), “the evaluation of ontologies refers to the correct
building of the content of the ontology, that is, ensuring that its definitions
(…) correctly implement ontology requirements and competency questions or
perform correctly in the real world. The goal is to prove compliance of the
world model (if it exists and is known) with the world modeled formally.”
Gómez-Pérez identifies the following five criteria for the technical
evaluation of ontologies:
1) Consistency refers to whether it is possible to obtain contradictory conclusions from valid input definitions.
2) Completeness of definitions, class hierarchy, domains, classes etc.
3) Conciseness refers to whether all the information in the ontology is precise.
4) Expandability refers to the effort required to add more knowledge to the ontology.
5) Sensitiveness refers to how far small changes in a definition alter the set of well-defined properties that are already guaranteed.
In addition
to these criteria the author lists the following errors
that can occur when taxonomic knowledge is built into an ontology: Circularity
errors, Partition errors, Redundancy errors, Grammatical errors, Semantic
errors and Incompleteness errors. For a comprehensive description and definition of these evaluation
criteria and errors see appendix A.
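As a minimal illustration of one of these error types, the following sketch detects circularity errors in a taxonomy given as subclass-of links. This is a simplified single-parent check for the principle, not Gómez-Pérez's actual procedure:

```python
def circularity_errors(subclass_of):
    """Return the classes involved in circularity errors in a taxonomy
    given as {class: parent_class} links: a class that can be reached
    again by following its own chain of parents is part of a cycle."""
    errors = set()
    for start in subclass_of:
        seen, node = set(), start
        while node in subclass_of:
            if node in seen:          # came back to an already visited class
                errors.add(start)
                break
            seen.add(node)
            node = subclass_of[node]  # climb one level up the hierarchy
    return errors

taxonomy = {"dog": "mammal", "mammal": "animal",
            "a": "b", "b": "a"}      # a <-> b forms a circularity error
errs = circularity_errors(taxonomy)  # -> {"a", "b"}
```

Analogous structural checks can be written for redundancy (duplicate links) and some incompleteness errors; semantic errors, by contrast, require human judgement.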
Gómez-Pérez reports on her evaluation of
the Standard-Units Ontology, an ontology with a taxonomy of standard
measurement units used in physics and chemistry (such as second, meter, ampere
etc.). The ontology was to be included in a chemistry element ontology. After
experts had drawn up an inspection document setting out the properties to be
checked, Gómez-Pérez evaluated the ontology, finding a number of problematic
aspects (e.g. violations of standard naming conventions, definitions with poor
informal naming descriptions etc.). In a synthesis process Gómez-Pérez
reimplemented the ontology. She then evaluated the ontology a second time to
make sure that all necessary changes had been made.
Grüninger & Fox (1995) propose a framework for the evaluation of ontologies which is based on the requirements the ontology has to meet. Informal competency questions are derived from a motivating scenario. These informal questions are then transformed into formal competency questions in the language of the ontology. The competence of the ontology can be evaluated by investigating whether the ontology is able to answer the competency questions. On the basis of the formal competency questions, the completeness of the ontology’s solutions to these questions can be proven. Figure 2 shows the procedure of ontology design and evaluation developed by Grüninger & Fox.
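The idea of testing an ontology against its competency questions can be caricatured in a few lines: a question is (very roughly) answerable if every concept and slot it mentions is defined in the ontology. Grüninger & Fox work with formal first-order theories and proofs; this toy check, with an invented mini-ontology, is only meant to convey the principle:

```python
# A toy ontology: classes mapped to their attribute slots (hypothetical).
ontology = {
    "Activity": {"duration", "resource"},
    "Resource": {"capacity"},
}

def answers(ontology, question_terms):
    """A competency question is (crudely) answerable if every class or
    slot it mentions is defined somewhere in the ontology."""
    defined = set(ontology) | {s for slots in ontology.values() for s in slots}
    return question_terms <= defined

# "How long does an activity take, and with what resource?"
ok1 = answers(ontology, {"Activity", "duration", "resource"})  # answerable
# "What does an activity cost?" -- 'cost' is not modeled
ok2 = answers(ontology, {"Activity", "cost"})                  # not answerable
```

A failed check signals that the ontology's vocabulary cannot even express the question; the real framework additionally requires that the ontology's axioms entail an answer.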
Gómez-Pérez (1999) differentiates between the
technical evaluation of an ontology and the assessment of an ontology. One
advantage of ontologies lies in the possibility of reusing knowledge contained in
existing ontologies. With the growing number of available ontologies, the
process of deciding which ontologies are appropriate for a knowledge engineering
project becomes more interesting and more difficult. “Assessment is focused on
judging the understanding, usability, usefulness, abstraction, quality and
portability of the definitions from the user’s point of view.” (Gómez-Pérez, 1999) Knowledge
engineers should consider questions like: Does the ontology development
environment provide methods and tools that help design the new knowledge base?
By how much does the ontology reduce the bottleneck in the knowledge
acquisition phase? Is it possible to integrate the definitions into the KB
without making significant modifications to the KB?
We would
like to stress a further area of ontology evaluation. Assessment according
to Gómez-Pérez refers to the suitability of the system for further knowledge
engineering; it does not deal with the people who actually use the ontologies
after their completion. Gómez-Pérez reports a lack
of application-dependent and end-user methods to judge the usability and
utility of an ontology to be used in an application and names this as a problem
for further research. One reason for the lack of evaluation in this field may
be the limited number of end users so far. Up to now, ontologies were primarily
constructed for a circumscribed number of experts with either domain knowledge or
knowledge engineering experience. As we
will lay out in chapter three, frameworks for distributed organizational
memories like FRODO are designed for people with heterogeneous backgrounds and
different tasks. System-user interaction is therefore more important.
“Human Computer Interaction (HCI) is the study of how people design, implement, and use interactive computer systems, and how computers affect individuals, organizations, and society.” (Myers, Hollan & Cruz, 1996) One aim of the HCI approach is to facilitate interaction between users and computer systems and to make computers useful to a wider population. We include a summary of evaluation in the field of HCI in this report because the above-mentioned aspect is more central to frameworks for organizational memories than to traditional expert systems. A continuous interaction takes place between the organizational memory and users from different backgrounds and with different capabilities in the handling of computer systems. The integration of different needs and grades of expertise becomes more important than in expert systems, where only a comparatively small group of experts or specialized users needs to interact with the system. Myers et al. point out the immense decrease in financial costs when thorough usability engineering has taken place. In critical places like airport towers and planes, problems with the human-computer interface can have disastrous consequences. The importance and impact of usability and interfaces reported by Myers and others should be taken as a hint by the knowledge engineering community: once knowledge bases are used by a broad population, usability studies and systematic evaluation will be indispensable.
Chin (2001) demands more empirical evaluation in the field of user modeling and user-adapted interaction: “Empirical evaluations are needed to determine which users are helped or hindered by user-adapted interaction in user modeling systems.” He defines empirical evaluation as the “appraisal of a theory by observation in experiments”. He reports that only one third of the articles in the first nine years of User Modeling and User-Adapted Interaction included any kind of evaluation, many of a preliminary character and with methodological weaknesses. He considers this insufficient and formulates rules of thumb for designing controlled experiments. Chin names the uneven influence of nuisance variables as one big problem for experimental research and proposes the following steps to counter this problem:
- Randomly assign enough participants to groups.
- Randomly assign time slots to participants.
- The test room should not have windows or other distractions (e.g. posters) and should be quiet. Participants should be isolated as much as possible.
- The computer area should be prepared ergonomically in anticipation of different-sized participants.
- If a network is used, avoid high load times.
- Prepare uniform instructions to participants, preferably in a written or taped (audio or video) form. Check the instructions for clarity with sample participants in a pilot study. Computer playback of instructions is also helpful.
- Experimenters should not know whether or not the experimental condition has a user model. Each experimenter should run equal numbers of each treatment condition (independent variable values) to avoid inadvertent bias from different experimenters. Experimenters should plan to minimize interactions with participants. However, the experimenters should be familiar with the user modeling system and be able to answer questions.
- Be prepared to discard participant data if the participant requires interaction with the experimenter during the experiment.
- Follow all local rules and laws about human experimentation. For example, in the USA all institutions receiving federal funds must have a local committee on human subjects (CHS) that approves experiments. Typically, participants should at least sign a consent form.
- Allow enough time. Experiments typically take months to run.
- Do run a pilot study before the main study.
- Brainstorm about possible nuisance variables.
Chin explains the meaning and significance of the effect size of an experimental result, the power of an experimental setting and the role of covariate variables for experimental research. He proposes the following standards for reporting results from experiments. These reports should include:
1) the number, source, and relevant background of the participants
2) the independent, dependent, and covariant variables
3) the analysis method
4) the post-hoc probabilities
5) the raw data (in a table or appendix) if not too voluminous
6) the effect size (treatment magnitude) and the power (inverse sensitivity), which should be at least 0.8.
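The effect size mentioned in point 6 can, for instance, be reported as Cohen's d with a pooled standard deviation. This is one common convention, sketched here with invented scores; power would then be computed from d, the sample sizes and the significance level:

```python
from math import sqrt
from statistics import mean, stdev

def cohens_d(group1, group2):
    """Effect size (treatment magnitude) as Cohen's d: the difference
    of the group means in units of the pooled standard deviation."""
    n1, n2 = len(group1), len(group2)
    s1, s2 = stdev(group1), stdev(group2)
    pooled = sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
    return (mean(group1) - mean(group2)) / pooled

# hypothetical usability ratings with and without the user model
treatment = [8.0, 9.0, 7.5, 8.5]
control   = [6.0, 7.0, 6.5, 6.5]
d = cohens_d(treatment, control)
```

Unlike a p-value, d does not grow with the sample size, which is why Chin asks for it in addition to the significance test.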
Reiterer, Mußler & Mann (2001) evaluate
the added value of different visualisations supporting the information seeking
process in the WWW, such as scatterplots, bar charts or tile bars. As measurement
criteria and dependent variables they use effectiveness, efficiency and
subjective satisfaction. Effectiveness is defined as the degree to which the
test task is fulfilled, measured as the percentage of solved test tasks. Efficiency
is the effectiveness divided by the time the person needed to fulfill the test
task. As independent variables, i.e. factors that influence the dependent
measurements, Reiterer et al. vary the target user group, the type and number of data,
and the task to be done. Fig. 3 shows the design of their research plan. The
information seeking task could be a specific or an extended fact finding, users
could either be beginners or experts, the number of results could be 20 or 500,
and the number of keywords of each query could be 1, 3, or 8.
Fig. 3: Test combinations (Reiterer, Mußler & Mann, 2001)
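The effectiveness and efficiency measures used by Reiterer et al. reduce to two one-line formulas, sketched here with hypothetical numbers:

```python
def effectiveness(solved, total):
    """Share of test tasks fulfilled, in percent."""
    return 100.0 * solved / total

def efficiency(effectiveness_pct, minutes):
    """Effectiveness divided by the time the person needed,
    as defined by Reiterer et al."""
    return effectiveness_pct / minutes

# a participant who solves 4 of 5 tasks in 16 minutes
e = effectiveness(solved=4, total=5)   # 80.0 %
eff = efficiency(e, minutes=16)        # 5.0 percentage points per minute
```

Note that efficiency confounds speed and success: a fast participant who solves few tasks can score the same as a slow one who solves many.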
The results show that effectiveness and efficiency do not really increase when using visualisations, but motivation and subjective satisfaction do. Reiterer et al. assume that training effects could play a crucial role and that effectiveness and efficiency might increase once persons are more accustomed to the visualisations. Training effects are a general problem when new tools are compared to tools which the participants are used to. It is hard to decide how much training is necessary before the technical improvements of new developments show an effect on user effectiveness and efficiency. Reiterer et al. used a comprehensive experimental research design with three dependent variables and four independent factors with two or more levels. However, they did not formulate specific hypotheses predicting the results of their experiment.
Without specific hypotheses it is hardly possible to interpret the data of such a complex experimental design in a meaningful manner. Reiterer et al. do not report the influences of their factors but only state that the factors “have shown to influence the efficiency of the visualizations.” We would always recommend developing an experimental design on the basis of testable and refutable hypotheses, whenever it is possible to formulate them (see also Menzies’[5] argumentation concerning “shotgun experiments”). When conducting a more exploratory study, we would suggest a simpler design, which will probably yield clearer results.
Due to the growing amount of knowledge
available through the World Wide Web and other electronic archives, the
retrieval of information becomes increasingly important. WWW search engines are
used by millions every day, and a knowledge-based system needs an efficient
information retrieval tool to work successfully. Traditionally the evaluation of
IR tools is based on two measures: Recall
is calculated as the number of relevant documents retrieved divided by the
total number of relevant documents in the collection. Precision is calculated as the number of relevant documents
retrieved divided by the total number of documents retrieved. Problems with
these two measures arise from the concept of “relevance”. Kagolovsky & Moehr (2000) point out that precision and
recall are not absolute terms but are subjective and depend on many different
factors. They report that IR research became more user-centered over the years,
recognizing the holistic and dynamic character of the process. Cognitive and
behavioral aspects were considered, as well as multiple user interactions with a
search engine during the same session. They plan further investigation with the
substitution of precision and recall by “methods of search engine evaluation,
based on 1) formal representation of text semantics and 2) evaluation of
“conceptual” overlap between 2a) different sets of retrieved documents and 2b)
retrieved documents and users’ information needs.”
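The two classical measures follow directly from their definitions and can be sketched with invented document sets:

```python
def recall(retrieved, relevant):
    """Relevant documents retrieved / all relevant documents
    in the collection."""
    return len(retrieved & relevant) / len(relevant)

def precision(retrieved, relevant):
    """Relevant documents retrieved / all documents retrieved."""
    return len(retrieved & relevant) / len(retrieved)

# hypothetical example: 4 relevant documents, 3 retrieved, 2 of them relevant
relevant  = {"d1", "d2", "d3", "d4"}
retrieved = {"d1", "d2", "d5"}
r = recall(retrieved, relevant)     # 2/4 = 0.5
p = precision(retrieved, relevant)  # 2/3
```

Both numbers presuppose that the relevance of every document has already been judged, which is exactly the subjective, labor-intensive step Kagolovsky & Moehr criticize.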
With the growing number of available ontologies, conventional keyword-based retrieval can today be enhanced by ontology-based retrieval. Aitken & Reid (2000) gained experimental data comparing ontology-enhanced retrieval with keyword retrieval. They used the CB-IR information retrieval tool, which was developed for a UK engineering company and offers ontology-enhanced retrieval as well as keyword retrieval. They defined five different queries beforehand for the automated test equipment (ATE) systems, which store information about technical devices used to test high-integrity radar and missile systems. They applied these queries, comparing the performance of ontology-enhanced retrieval with keyword retrieval. To test the robustness of the system they used the original database on which the system was developed as well as new, previously unseen datasets. As measurements they recorded recall and precision. Their study was influenced by the Goal-Question-Metric technique described in section 2.4. As specific hypotheses they formulated:
H1. recall and precision are greater for ontology-based matching than for keyword-based matching on the original data set
for adequacy:
H2. recall and precision are greater than 90% for ontology-based matching on the original data set
for robustness:
H3. recall and precision are greater for ontology-based matching than for keyword-based matching on the new data sets
H4. recall and precision are greater than 80% for ontology-based matching on the new data sets
Speaking in very general terms, the results broadly supported the hypotheses about the absolute and relative performance of the system and about the adequacy and robustness of the ontology. Some hypotheses, however, had to be rejected (e.g. H3 concerning precision).
The problems concerning the limited availability of human experts, which we discussed in section 2.1.2 on knowledge acquisition, also gain relevance in the study of Aitken & Reid. As we already pointed out, the metrics recall and precision are based on the concept of relevance, which needs to be assessed by humans in a time-consuming process. For this reason Aitken & Reid were not able to conduct an experiment with results that could plausibly be tested for statistical significance. Solution approaches for this problem could be developed based on Shadbolt et al. (1999) (see section 2.1.2) or the evaluation approaches formulated by Kagolovsky & Moehr (2000).
Another problem we would like to point out lies in the fact that the recall and precision ratings Aitken & Reid recorded were quite high on average: out of 48 recall and precision ratings, 31 had the value of 100%. Of course this is not easy to predict beforehand, but whenever possible we would suggest formulating test queries with a degree of difficulty that yields sufficient variance in the results to distinguish reliably between the experimental groups. Finally, Aitken & Reid reported the Goal-Question-Metric approach to be a useful organizing framework for evaluation. We will describe this technique in the following section.
We will focus on the Goal-Question-Metric technique in this chapter since we found it especially helpful for the evaluation of OMIS. The
Goal-Question-Metric Technique is an
industrial-strength technique for goal oriented measurement and evaluation from
the field of software engineering (Nick, Althoff, Tautz, 1999). It helps to
systematically carry out evaluations by explicitly pointing out the importance
of formulating goals of the evaluation with respect to business needs. Basili, Caldiera & Rombach
(1994) describe the basic concepts of GQM. They
differentiate between a conceptual level
(goals), an operational level
(questions) and a quantitative level
(metrics). On the conceptual level, goals are defined for objects of
measurement. These objects can be products
(e.g. artifacts, programmes, documents), processes (software-related activities like designing or testing)
or resources (items used by processes,
like personnel, hardware or office space). Goals can be defined for a variety
of reasons, with respect to various models of quality, from various points of
view and relative to a particular environment. Basili et al. formulate the
following goal as an example: “Improve the timeliness of change request
processing from the project manager’s point of view.” GQM Goals need to specify
a purpose, a quality issue, an object (product, process or resource) and a
viewpoint. In the example the quality issue is timeliness, the object is a
process, namely the change request process and the viewpoint is the manager’s
viewpoint. The purpose is to improve the process. After the goal is formulated
the next step consists in asking meaningful questions that characterize the
goal in a quantifiable way. Basili et al. propose at least three groups of
questions.
1) How can we characterize the object (product, process, or resource) with respect to the overall goal of the specific GQM model? For our example a question could be: What is the current change request processing speed?
2) How can we characterize the attributes of the object that are relevant with respect to the issue of the specific GQM model? E.g.: Is the performance of the process improving?
3) How do we evaluate the characteristics of the object that are relevant with respect to the issue of the specific GQM model? E.g.: Is the performance satisfactory from the viewpoint of the project manager?
The next step after formulating the questions consists in finding appropriate metrics. Aspects to be considered are the amount and quality of the existing data, and it has to be decided whether objective or subjective measures are recorded: “Informal or unstable objects should rather be measured with subjective metrics, whereas more mature objects are better measured with objective measures.” Since GQM models need constant refinement and adaptation, the reliability of the models also needs to be of interest for the evaluator. So we finally end up with a number of questions and corresponding metrics. The question “What is the current change request processing speed?”, for example, can be answered with the metrics: average cycle time, standard deviation, % of cases outside the upper limit. In summary, the GQM method is a way to systematically derive metrics from evaluation goals and to cover the scope of an evaluation in a precise and comprehensive manner.
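The goal-question-metric chain of the change-request example can be sketched as follows. The cycle-time data and the upper limit are hypothetical; the metric names follow the example in the text:

```python
from statistics import mean, stdev

# GQM example after Basili et al.: a goal characterized by a question,
# answered by metrics.
goal = {"purpose": "improve", "issue": "timeliness",
        "object": "change request processing", "viewpoint": "project manager"}
question = "What is the current change request processing speed?"

def cycle_time_metrics(days, upper_limit):
    """Metrics answering the question above: average cycle time,
    standard deviation, and % of cases outside the upper limit."""
    outside = sum(1 for d in days if d > upper_limit)
    return {"average": mean(days),
            "std_dev": stdev(days),
            "pct_outside": 100.0 * outside / len(days)}

# five hypothetical change requests, processed in 3..12 days
m = cycle_time_metrics([3, 5, 4, 12, 6], upper_limit=10)
# average is 6 days; one of five cases (20%) exceeds the limit
```

Collecting the metrics over successive GQM cycles then shows whether the goal ("improve timeliness") is actually being met.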
Nick, Althoff & Tautz (1999) report on the evaluation of CBR-PEB using the Goal-Question-Metric approach. CBR-PEB is an experience base for the development of case-based reasoning systems, and this was the first time that GQM was applied to an organizational memory. Fig. 3 shows the standard GQM cycle for the evaluation of CBR-PEB. During the prestudy phase, relevant information for the GQM programme is collected. This includes a description of the environment, the “overall project goals”, and the “task of the system”. In the next step the GQM goals are defined by interviewing experts. After an informal statement, the goals are formalized using the specification requirements for GQM goals described above. The three goals for CBR-PEB refer to the “Technical Utility”, the “Economic Utility” and the “User Friendliness” of the system. The formal goal for “User Friendliness” was formulated as: Analyze the organizational memory for the purpose of characterization with respect to user friendliness from the viewpoint of the CBR system developers in the context of decision support for CBR system development. After formal definition, the goals are ranked and the ones to be used in the measurement programme are selected.
Fig. 3: The standard GQM cycle and its instantiation for CBR-PEB
A GQM plan is developed by formulating questions derived from the goals and by defining measures and analysis models by which the questions can be answered. For this purpose the group of people specified in the formal GQM goal is interviewed. Abstraction sheets are filled out which divide the relevant information into four quadrants:
- the “quality factors” which refer to the properties of the goal to be measured
- the “variation factors” which define variables that could have an impact on the “quality factors”
- the “impact of the variation factors” which specify the kind and direction of this impact (e.g. variation factor: background knowledge, impact: higher background knowledge -> better retrieval results)
- the “baseline hypotheses” which refer to the current state of the properties to be measured.
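The four quadrants of an abstraction sheet can be captured in a small data structure. The sketch below is not part of any GQM tooling; the field names follow the list above, and the example values are taken from the retrieval example in the text:

```python
from dataclasses import dataclass, field

@dataclass
class AbstractionSheet:
    """The four quadrants of a GQM abstraction sheet."""
    quality_factors: list = field(default_factory=list)
    variation_factors: list = field(default_factory=list)
    impact_of_variation: list = field(default_factory=list)  # (factor, expected effect)
    baseline_hypotheses: list = field(default_factory=list)

# Example values based on the retrieval illustration in the text.
sheet = AbstractionSheet(
    quality_factors=["retrieval result quality"],
    variation_factors=["background knowledge of the user"],
    impact_of_variation=[("background knowledge",
                          "higher background knowledge -> better retrieval results")],
    baseline_hypotheses=["current retrieval quality unknown"],
)
print(sheet.quality_factors[0])
```

Filling in all four quadrants before choosing measures forces the interviewees to make their expectations explicit.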
The measures have to be chosen carefully to correspond to the questions, and it has to be specified how measurement results will be interpreted. Data collection takes place with questionnaires, which can be either paper-based or on-line; the latter was the collection method Nick et al. used. After collection, the data are interpreted in feedback sessions with the experts. The evaluated system is assessed as well as the GQM measurement plan. Results of the feedback sessions are taken into account for the next measurement cycle, since GQM is an iterative approach which refines the measurement and the system continuously. It is also advisable to formulate explicit lessons-learned statements from each GQM cycle, which can serve as guidelines for future measurement programmes.
For the evolution of GQM-based measurement programs, Nick et al. recommend taking the following principles into account:
1) Start with small items that are well understood and easily measurable. Based on these well-understood items the measurement programme can be improved in each cycle. This also takes account of the cost/benefit aspect of the programme: in the beginning it is important to demonstrate the benefits of the programme.
2) The evaluation should guide the development and improvement of the system.
3) The evaluation must not interfere with the evolution and improvement of the system. It is not acceptable to hamper the operating system for the sake of measurement (e.g. to delay updates of information).
Nick et al. define three phases of OMs with different focuses for evaluation: (a) prototypical use, (b) use on a regular basis and (c) widespread use. During prototypical use, evaluation should mainly be concerned with the general acceptance of the system, measured in terms of system use and informal user feedback. During regular use the system should be improved on the basis of more formal user feedback. Once widespread use takes place, cost/benefit calculations and economic aspects become important.
Tautz (2000) reports on a comprehensive experimental evaluation of a repository-based system (which can be compared with an OM). We will only sketch the main points here. Tautz compared the use of the repository-based system with a human-based approach where the information seeker talks to his colleagues to obtain the experience he needs for his task. Tautz formulated an effectiveness hypothesis and an efficiency hypothesis. In the first he predicted that the repository-based approach complements the human-based approach by providing additional useful observations and guidelines. In the second he predicted that the repository-based approach is more efficient than the human-based approach (efficiency was measured as the time needed to find a useful guideline or observation). Tautz conducted an experiment with a within-subject design where subjects used the system and talked to “simulated colleagues”. Because of the problematic availability of the experts and for reasons of the experimental design, the experts gave their answers concerning guidelines and observations once, during the preparation phase of the experiment. After subjects had chosen an expert, the prerecorded answers were presented. Many subjects judged this “simulation” to be realistic. Subjects rated the obtained guidelines or observations as “useful”, “not useful” or “don’t know”. Both hypotheses were confirmed. The repository-based approach was more efficient and improved the human-based approach by at least 50% on average (with an error probability of 0.4%).
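In a within-subject efficiency comparison of the kind Tautz describes, each participant is their own control, so per-participant differences are the unit of analysis. A minimal sketch; the search times below are invented for illustration and are not Tautz's data:

```python
import statistics

# Hypothetical search times (minutes) per participant, within-subject:
# each participant used both the repository and the "simulated colleague".
repository_times = [4.0, 3.5, 5.0, 4.5, 3.0]
human_times      = [9.0, 8.0, 10.0, 7.0, 8.5]

# Per-participant differences: time saved by the repository-based approach.
diffs = [h - r for h, r in zip(human_times, repository_times)]
mean_saving = statistics.mean(diffs)
relative_improvement = mean_saving / statistics.mean(human_times)

print(f"mean time saved: {mean_saving:.1f} min "
      f"({relative_improvement:.0%} faster on average)")
```

A full analysis would then run a paired significance test on the differences rather than comparing the two raw means.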
We have reported literature with general guidelines and exemplary (experimental) studies from research fields relevant for Organisational Memory Information Systems. Our aim was to give hints for the realization of evaluative research and to show where problems and solution approaches lie.
We already mentioned in the introduction that we need to make a distinction between traditional expert system development and the Organisational Memory Information System approach (e.g. FRODO by Abecker et al., 2001). Whereas expert systems (like the systems in the line of Sisyphus or the HPKB program) are designed to assist or even replace domain experts, the success of an OMIS depends to a great extent on the interaction between system and user. Instead of imitating the human mind, organisational memory assistant systems foster a hybrid approach where the cooperation between man and machine is the focus of attention (Abecker et al., 1998).
This different approach brings new processes into focus and also changes the focus of evaluation. Employees from different parts of an organisation will enter information into the organisational memory. This information can be stored in a highly formalized structure but can also take the form of text, audio or video files or other multimedia applications. The documents have to be administered by one or more ontologies which suit the demands of the organisation. People within the organisation, possibly from different departments and knowledge domains, have to be able to retrieve the information that enables them to fulfil their tasks. When working with OMs, the group of people which delivers the information input can either be different from the group that retrieves the information or be identical with it. The expertise levels of users of an OM are much more heterogeneous than those of expert system users, and the knowledge of the domain(s) will probably be more shallow and informal. In this scenario the interaction between users and system plays a more crucial role than in conventional expert systems. The usability of the system is an important aspect of its success. The system relies on being accepted by the members of an organisation, since the knowledge cooperation among the users and between the users and the system depends on continuous and frequent use of the system. Especially in times where information and knowledge tend to become obsolete in shorter and shorter cycles, the smooth use of the system has to be guaranteed. We would now like to point out how OMIS relate to the research covered in chapter 2.
Menzies and van Harmelen (1999) take a broad view of the field of knowledge engineering with their essential theory approach. They propose to compare different KE schools to answer the question whether we can build better KBS faster now than in the 1980s. The strength of this approach lies in the demand for general results which are relevant for the entire research field. Because of the broad range of different domains and tasks in the KE field, we doubt that a single one of the six essential theories (T0-T5) will turn out to be superior, but we believe an approach that takes a more general view than the evaluation of a concrete technique or tool at hand to be fruitful and necessary for scientific progress. For the reasons described in the paragraph above, we found it difficult to link the OMIS approach to one of the six essential theories. Ontologies are obviously an important aspect of organizational memories, but the other elements, like libraries of procedures, a general inference engine, axioms and a library of PSMs, are not used in the way Menzies & van Harmelen propose in their six possible KE schools. It would be interesting to extend the essential theory approach so that it also includes OMIS, and to compare the performance of expert systems with hybrid man-machine solutions for different domains and tasks.
With his CSM approach, Menzies (1999a) demands that critical success metrics for evaluation be formulated explicitly. The main questions are: Is the system’s output useful and correct? Can it compete with a certain standard, such as human experts? We believe this to be a useful approach for OMIS, too. The formulated success margins, however, would have a different scope. Whether the system outperforms a human expert is not of interest, since OMIS are designed to cooperate with the experts. The OMIS approach would rather state that man and machine in cooperation can outperform an expert system working by itself (Abecker et al., 1998).
The Sisyphus initiative and the HPKB program also evaluate technical aspects of expert systems. Sisyphus concentrates on PSMs and HPKB on the rapid modification of KBS. These two programmes show that central organization of an evaluation of a number of competing systems can be a crucial issue. Because of DARPA's central organization, the HPKB program could work much more systematically and in a more structured way than Sisyphus in its beginnings. For the research in the line of Sisyphus, Menzies' hQkb initiative (Sisyphus V) could bring equal possibilities, since Menzies applied at NASA for funding and central assessment. From the viewpoint of OMIS research, these two research programmes focus too much on the construction and the run-time behavior of knowledge bases. For the evaluation of OMIS, human-in-the-loop experiments with users entering and retrieving information with different degrees of formality are of crucial interest (see Tallis et al., 1999).
Concerning knowledge acquisition, some of the problems covered in Shadbolt, O’Hara & Crow (1999) are equally relevant for OMIS, others are not. There is no knowledge elicitation bottleneck as in the development of an expert system. In OMIS, knowledge is in many cases not formalized to such a high degree, and a wider range of people with different levels and domains of expertise enter and retrieve information. So the problematic availability of human experts and the need for a “gold standard” of knowledge are not as relevant. Other aspects, like the difficulty of isolating the added value of a single technique or tool or the question of how many different domains and tasks should be considered, remain important. From the viewpoint of OMIS we would add the problematic aspect of evaluating the usability of the system when people with different backgrounds enter information.
In the section about ontologies we reported on technical evaluation sensu Gómez-Pérez and on the evaluation of the ontology construction process. We already pointed out that the interaction between the end-user who seeks information and the ontology is a highly relevant aspect for OMIS. Future evaluations of OMIS have to concentrate on this matter, since an OMIS can only work successfully if it is accepted by its intended users and is used continuously.
Human computer interaction research focuses on the system-user interaction mentioned above, and experience in conducting experimental user studies from this field can be very valuable for the evaluation of OMIS. We recommend Chin (2001) as a good starting point for experimentation in the HCI field. Information retrieval should also be evaluated regarding usability and user friendliness, since user acceptance is crucial for the success of OMIS. Studies that compare the added value of ontology-based information retrieval with keyword-based information retrieval (Aitken & Reid, 2000) are important for OMIS, since ontologies form the heart of the information system of an OM. The advantages and weaknesses of the ontology approach have to be investigated comprehensively. Finally, we found the Goal-Question-Metric technique very helpful for the development of an evaluation study and will later sketch a first idea for the evaluation of FRODO based on the GQM approach.
In the section about knowledge acquisition we already pointed out what kind of problems researchers face when they design the evaluation of an entire framework. Since a framework is a theoretical meta-concept, it is very difficult to isolate the influence of a single factor. If one would like to test the benefits of different elements of a framework in an empirical study, one has to implement a certain tool and use a certain interface. With just one experiment one will hardly be able to trace the influence of a single conceptual element. We cited Shadbolt et al. (1999), who state that a whole series of studies would be necessary to evaluate a framework.
If we look at an entire framework for organisational memories, there are of course many starting points that would be worth evaluating: whether the ontology adequately covers the domain; whether the input and retrieval processes work properly; whether the knowledge can be kept up to date; and whether the evolution of the system in the company takes place successfully, to name just a few.
Focusing on usability, there are a number of further questions that could serve as possible starting points for evaluation:
- How must the OM be designed to ensure high usability (interfaces, ontologies, tools, …)?
- How much knowledge about the ontology must a person have to efficiently enter information into the OM? How does the person acquire this knowledge? How much time does she/he need to acquire the necessary knowledge about the ontology?
- How much knowledge about the ontology must a person have to efficiently retrieve information from the OM? How does the person acquire this knowledge? How much time does she/he need to acquire the necessary knowledge about the ontology?
- How much effort is it for a person to learn how to deal with the interface and the different tools available?
- What aspects of the system (ontology structure, tools etc.) are used often and which are used scarcely? Why?
- Does the system offer information which supports people's actions? Does it offer relevant information for the activities it was designed for?
- Do people feel content using the system? Do they have the impression that the system serves their needs?
As already pointed out, most of these exemplary questions cannot be answered with just one experiment. To cover them sufficiently, a whole series of experiments will be necessary. We would now like to point out some aspects we believe to be important when conducting a research study. We do not claim this list to be complete, and it takes a very broad view of evaluation, but we find these aspects helpful for orientation when conducting a research study. Please consult the cited literature for important details.
Formulate the main purposes of your framework or application (GQM). What was it designed for? What does it have to accomplish in later use? What does it have to accomplish with respect to the user?
Define clear performance metrics. What are good indicators for the success or the failure of a system? For what purpose was the system designed and what are important characteristics for later use?
Formulate precise hypotheses. If possible, one should predict exactly what to expect as the result of the evaluation (see Menzies' website for an explanation of the “shotgun effect”). At best there is a model or a line of reasoning which makes the formulation of a hypothesis possible and which can also answer the question why a system is a success or not. If one is only in the position to assess whether a system meets a certain level of performance, we suggest formulating a critical success metric (Menzies, 1999a). Usually a standard of comparison is required. This can be another knowledge-based system or the competence of a human expert. Tallis et al. (1999) propose ablation experiments: a tool is evaluated by comparing it with a version of the tool where certain capabilities are disabled. This allows the evaluation of the added value of the tool in a controlled manner. Improvement can also be measured without explicit prediction, in a more explorative way. In this case, however, it is important to take into consideration that the possibilities for interpreting the results are limited.
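A critical success metric in Menzies' sense ultimately reduces to an explicit pass/fail test against a standard of comparison fixed in advance. A minimal sketch; the function name, scores and margin are invented for illustration:

```python
def meets_csm(system_score: float, baseline_score: float,
              required_margin: float) -> bool:
    """Critical success metric: the system passes only if it beats the
    comparison standard (e.g. a human expert or another knowledge-based
    system) by at least a margin fixed *before* the evaluation."""
    return system_score >= baseline_score + required_margin

# Example: the system must answer at least 5 percentage points more
# queries correctly than the baseline (all numbers invented).
print(meets_csm(system_score=0.82, baseline_score=0.75, required_margin=0.05))
```

Fixing the margin beforehand is what distinguishes a CSM from post-hoc interpretation of whatever results happen to come out.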
Standardize the measurement of your performance metrics. For later comparison it is crucial to be precise about the way measurement has to take place. Especially when working with a number of research teams, differing measurement procedures can jeopardize the research programme.
(Experimental) research design. Be thorough when designing your research. Reflect on what conclusions you can draw from a field study, from a quasi-experimental design and from an experimental design, and on what conclusions you are not allowed to draw. For an introduction to this field read Martin (1995), and for a comprehensive coverage Cohen (1995). Note that a pre-test can be very helpful to debug and refine your design. Be aware that with one experiment you can only study a limited number of variables. Reflect on other factors which might have an important influence on your results (e.g. domain, task, user skill etc.).
Use inferential statistics to decide about your experimental hypotheses. If you have an experimental research design, it is in most cases inaccurate just to compare absolute values without considering statistical theory (for literature see Cohen (1995) and Hays (1994)).
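For a simple two-group comparison, the pooled two-sample t statistic illustrates why comparing raw means is not enough: the difference has to be judged relative to the variability in the data. A sketch using only the standard library (in practice one would use a statistics package and proper t tables; the sample data below are invented):

```python
import statistics

def pooled_t(sample_a, sample_b):
    """Two-sample t statistic with pooled variance
    (assumes roughly equal variances in both groups)."""
    na, nb = len(sample_a), len(sample_b)
    mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
    var_a, var_b = statistics.variance(sample_a), statistics.variance(sample_b)
    pooled_var = ((na - 1) * var_a + (nb - 1) * var_b) / (na + nb - 2)
    return (mean_a - mean_b) / (pooled_var * (1 / na + 1 / nb)) ** 0.5

# Invented task-completion scores for two experimental groups.
group_a = [10.0, 12.0, 11.0, 13.0]
group_b = [8.0, 9.0, 8.0, 9.0]
print(f"t = {pooled_t(group_a, group_b):.3f}")
```

The resulting t value is then compared against the critical value from a t table for the appropriate degrees of freedom before any claim about the groups is made.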
Report results. Results should be reported based on the standards proposed by Chin (2001).
We found the Goal-Question-Metric technique to be helpful for defining relevant starting points for the evaluation of OMIS. In the next sections we would like to sketch a preliminary evaluation of FRODO (Abecker et al., 2001). Please consider this only to be a first draft to demonstrate the methods described in chapter 2. Further refinement and validation of the research plan have to take place.
Based on the GQM technique, in a first step informal goals have to be formulated concerning overall project goals and the task of the system. For FRODO such goals could be (taken from Abecker et al., 2001 and the project’s website[6]):
1) Since OMs are usually not implemented centrally for all departments of an organisation at one time, the concept of distributed OMs which can cooperate and share their knowledge is more appropriate. This also demands decentralized and possibly heterogeneous ontologies, which likewise need to be able to communicate and cooperate. Thus, FRODO will provide a flexible, scalable OM framework for evolutionary growth.
2) These distributed ontologies have to incorporate new knowledge automatically or semi-automatically as far as possible. Thus, FRODO will provide a comprehensive toolkit for the construction and maintenance of domain ontologies.
3) One big challenge for OMs in times of imminent information overload is to bridge the gap between documents and users, who describe their information needs with personal profiles, by employing document analysis and understanding (DAU) techniques. Thus, FRODO will improve information delivery by the OM by developing more integrated and more easily adaptable DAU techniques.
4) Knowledge-intensive tasks (KiTs) are not sufficiently supported by a-priori strictly formalized workflows but are better represented with weaker dependencies and sequence constraints. Thus, FRODO will develop a methodology and tools for business-process oriented knowledge management relying on the notion of weakly-structured workflows.
These informal goals now have to be specified as formal GQM goals concerning at least a purpose, a process, a viewpoint and a quality issue. We will show this specification exemplarily for the fourth goal, concerning weakly-structured flexible workflows.
One could formulate: Analyze a knowledge-intensive task with the purpose of comparing the issue of efficiency of task completion with weakly-structured workflows and strictly-structured workflows (objects) from the viewpoint of the end-user.
One could formulate GQM goals for all informal goals, rank those goals and decide which ones are to be used in the measurement programme. We proceed with the issue of workflows and could come up with the following abstraction sheet:
Quality factors: efficiency of task completion
Variation factors: task types as described in Abecker et al. 2001 (dimension: negotiation, co-decision making, projects, workflow processes); FRODO KiTs lie between co-decision making and projects
Baseline hypothesis: No current knowledge concerning the properties to be measured can be entered here beforehand. The experimental design will provide a control group for comparison.
Impact of variation factors: FRODO KiTs are more successfully supported by weakly-structured flexible workflows than by strictly-structured workflows. Classical workflow processes are better supported by a-priori strictly structured workflows.
From this abstraction sheet a comprehensive GQM plan is to be developed; it is only shown in part here. Formulated questions could be:
What is the efficiency of task completion using strictly-structured workflows for KiTs?
What is the efficiency of task completion using weakly-structured flexible workflows for KiTs?
It has to be
clearly defined how relevant parameters are to be measured. ‘Efficiency of task
completion’ for example could be defined as number of errors made by
participants divided by the time needed for completion of the task.
Our specific hypotheses could be:
H1: For knowledge-intensive tasks (KiTs), weakly-structured flexible workflows as proposed by FRODO will yield higher efficiency of task completion than strictly-structured workflows.
H2: For classical workflow processes, strictly-structured workflows will yield higher efficiency of task completion than weakly-structured workflows.
With this preparation using the GQM method, we could now design an experiment to answer the raised questions. We could plan a 2 x 2 factorial experiment with the two factors workflow and task type as independent variables and efficiency of task completion as dependent variable:
Task Type x Workflow (four experimental conditions):
- weakly-structured flexible workflow / KiT
- strictly-structured workflow / KiT
- weakly-structured flexible workflow / classical workflow process
- strictly-structured workflow / classical workflow process
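For a between-subject variant of this 2 x 2 design, balanced random assignment of participants to the four cells could be sketched as follows (participant IDs and the group size of 60 are illustrative assumptions):

```python
import random

# The four cells of the 2 x 2 design: workflow type x task type.
conditions = [(wf, task)
              for wf in ("weakly-structured flexible", "strictly-structured")
              for task in ("KiT", "classical workflow process")]

participants = [f"P{i:02d}" for i in range(1, 61)]  # 60 subjects -> 15 per cell
random.seed(42)                                     # reproducible assignment
random.shuffle(participants)

# Balanced round-robin assignment over the shuffled list.
groups = {cond: [] for cond in conditions}
for i, person in enumerate(participants):
    groups[conditions[i % 4]].append(person)

for cond, members in groups.items():
    print(cond, len(members))
```

Shuffling before the round-robin keeps the assignment random while guaranteeing equal cell sizes, which simplifies the later statistical analysis.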
We could now form groups of subjects considering the rules of thumb by Chin (2001) (see section 2.2) and based on Martin (1995). Participants would have to complete a knowledge-intensive task and/or a classical workflow task using either strictly-structured or weakly-structured dynamic workflows. To yield results which can reasonably be tested for statistical significance we would need four groups with about 15-20 participants each. One would have to decide whether a between-subject or a within-subject design should be carried out. A between-subject design requires more participants (60-80), whereas a within-subject design would need fewer subjects (probably 30 to 40) but has to deal with practice effects. Recall that Tallis et al. (1999) recommended a within-subject design because it is better suited for participants with a large variance in skill level.
After completion, the experiment has to be analyzed statistically (see Chin, 2001; Cohen, 1995; Hays, 1994) and should be reported considering the standards formulated by Chin (2001) (see 2.2).
Abecker, A., Bernardi,
A., Hinkelmann, K., Kühn, O. & Sintek, M. (1998). Towards a technology for organizational
memories. IEEE Intelligent Systems.
13(3):40-48
Abecker, A., Bernardi,
A., van Elst, L., Lauer, A., Maus, H., Schwarz, S. & Sintek, M. (2001). FRODO: A framework for distributed
organizational memories. Milestone M1: Requirements Analysis and System
Architecture. DFKI Document D-01-01. DFKI GmbH, August 2001
Aitken, S.
& Reid, S. (2000). Evaluation of an ontology-based information retrieval
tool. Proceedings of 14th European
Conference on Artificial Intelligence. http://delicias.dia.fi.upm.es/WORKSHOP/ECAI00/accepted-papers.html
Basili, V.R., Caldiera, G. & Rombach, H.D. (1994). Goal question metric paradigm. In John J. Marciniak, editor, Encyclopedia of Software Engineering, volume 1, pages 528-532. John Wiley & Sons
Berger, B., Burton, A.M., Christiansen, T., Corbridge, C., Reichelt, H. & Shadbolt, N.R.(1989) Evaluation criteria for knowledge acquisition, ACKnowledge project deliverable ACK-UoN-T4.1-DL-001B. University of Nottingham, Nottingham
Chin, D. N. (2001). Empirical evaluation of user models and user-adapted systems. User Modeling and User-Adapted Interaction, 11: 181-194
Cohen, P.
(1995). Empirical Methods for Artificial
Intelligence. Cambridge: MIT Press.
Cohen, P.R., Schrag,R., Jones E., Pease,
A., Lin, A., Starr, B., Easter, D.,
Gunning D., & Burke, M. (1998). The DARPA
high performance knowledge bases project. Artificial
Intelligence Magazine. Vol. 19, No. 4, pp.25-49.
Gaines, B. R. & Shaw, M. L. G. (1993). Knowledge acquisition tools based on personal construct psychology. The Knowledge Engineering Review. Vol. 8:1. 49-85.
Gómez-Pérez,
A. (1999). Evaluation
of taxonomic knowledge in ontologies and knowledge bases. Proceedings of KAW'99. http://sern.ucalgary.ca/KSI/KAW/KAW99/papers.html
Gruber, T. R. (1993). A translation approach to portable
ontology specifications, Knowledge
Acquisition, 5:199-220.
Grüninger,
M. & Fox, M.S. (1995) Methodology for the design and evaluation of
ontologies, Workshop on Basic Ontological
Issues in Knowledge Sharing, IJCAI-95, Montreal.
Hays, W. L.
(1994). Statistics. Orlando: Harcourt
Brace.
Kagolovsky, Y., Moehr, J.R. (2000).
Evaluation of Information Retrieval: Old
problems and new perspectives. Proceedings of 8th International
Congress
on Medical Librarianship. http://www.icml.org/tuesday/ir/kagalovosy.htm
Martin,
D.W. (1995). Doing Psychological
Experiments. Pacific Grove: Brooks/Cole.
Menzies, T. (1999a). Critical success metrics: evaluation at the business level. International Journal of Human-Computer Studies, 51, 783-799.
Menzies,
T. (1999b). hQkb - The high quality knowledge base initiative (Sisyphus V:
learning design assessment knowledge). Proceedings
of KAW'99. http://sern.ucalgary.ca/KSI/KAW/KAW99/papers.html
Menzies, T.
& van Harmelen, F. (1999). Editorial: Evaluating knowledge engineering techniques. International
Journal of Human-Computer Studies, 51, 715-727.
Myers, B., Hollan, J.
& Cruz, I. (Ed.) (1996). Strategic directions in human computer
interaction. ACM Computing Surveys,
28, 4
Nick, M., Althoff, K.,
& Tautz, C. (1999). Facilitating
the practical evaluation of knowledge-based systems and organizational memories
using the goal-question-metric technique. Proceedings of KAW ´99. http://sern.ucalgary.ca/KSI/KAW/KAW99/papers.html
Noy, N.F. & McGuinness, D.L. (2001). Ontology
development 101: A guide to creating your first ontology. Stanford Knowledge
Systems Laboratory Technical Report KSL-01-05 and Stanford Medical Informatics
Technical Report SMI-2001-0880
Reiterer, H., Mußler, G.
& Mann, T.M. (2001). A
visual information seeking system for web search. Computer and Information
Science, University of Konstanz.
http://kniebach.fmi.uni-konstanz.de/pub/german.cgi/0/337957/reiterermusslermannMC2001.pdf
Shadbolt,
N. R. (1996). Sisyphus III. Problem statement available at http://psyc.nott.ac.uk/research/ai/sisyphus
Shadbolt, N., O'Hara, K. & Crow, L. (1999). The experimental evaluation of knowledge acquisition techniques and methods: history, problems and new directions. International Journal of Human-Computer Studies, 51, 729-755.
Tallis, M.,
Kim, J., & Gil, Y. (1999). User studies of knowledge acquisition tools:
methodology and lessons learned. Proceedings
of KAW ´99 http://sern.ucalgary.ca/KSI/KAW/KAW99/papers.html
Tautz,
C. (2000). Customizing software engineering experience management systems to
organizational needs. Dissertation, Fachbereich Informatik, Universität Kaiserslautern
Tennison,
J., O’Hara, K., Shadbolt, N. (1999) Evaluating KA tools: Lessons from an
experimental evaluation of APECKS. Proceedings
of KAW’99
http://sern.ucalgary.ca/KSI/KAW/KAW99/papers/Tennison1/
Uschold, M. &
Grüninger, M. (1996). Ontologies:
Principles, methods and applications, Knowledge
Engineering Review, Vol. 11, Nr. 2.
Ontology evaluation includes:
The goal of the evaluation process is to determine what the ontology defines correctly, does not define or even defines incorrectly. We also have to look at the scope of the definitions and axioms by figuring out what can be inferred, cannot be inferred or can be inferred incorrectly. To evaluate a given ontology, the following criteria were identified: consistency, completeness, conciseness, expandability and sensitiveness.
In order to provide a mechanism to evaluate completeness, the following activities can be of assistance in finding incomplete definitions.
Errors in developing taxonomies
This section presents a set of possible errors that can be made by ontologists when building taxonomic knowledge into an ontology or by Knowledge Engineers when building KBs under a frame-based approach. They are classed as circularity errors, partition errors, redundancy errors, grammatical errors, semantic errors, and incompleteness errors.
A) Circularity errors
They occur when a class is defined as a specialization or generalization of itself. Depending on the number of relations involved, circularity errors can be classed as: circularity errors at distance zero (a class with itself), circularity errors at distance 1 and circularity errors at distance n.
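Circularity errors of any distance can be detected mechanically as cycles in the subclass graph. A sketch (the function name and the toy taxonomy are invented for illustration):

```python
def find_circularities(subclass_of):
    """Return the classes involved in a subclass cycle (circularity
    error at distance 0, 1 or n), given a mapping class -> parents."""
    circular = set()
    for start in subclass_of:
        seen, frontier = set(), [start]
        while frontier:
            cls = frontier.pop()
            for parent in subclass_of.get(cls, []):
                if parent == start:          # reached the start again: a cycle
                    circular.add(start)
                elif parent not in seen:
                    seen.add(parent)
                    frontier.append(parent)
    return circular

taxonomy = {
    "dog": ["mammal"],
    "mammal": ["animal"],
    "animal": ["dog"],       # circularity at distance 2: dog -> mammal -> animal -> dog
    "plant": ["organism"],
}
print(find_circularities(taxonomy))
```

The same traversal finds distance-zero errors (a class listed as its own parent) and longer cycles alike.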
B) Partition Errors
Partitions can define concept classifications in a disjoint and/or complete manner. Errors could appear when:
As exhaustive subclass partitions merely add the completeness constraint to the established subsets, they have been distinguished as: non-exhaustive subclass partition errors and exhaustive subclass partition errors.
B.1) There are three manifestations of non-exhaustive subclass partition errors:
B.2) The errors associated with exhaustive subclass partitions can be considered as a subclass of non-exhaustive subclass partition errors with added constraints. This type of error is characterized by not respecting the completeness of the classes that form the exhaustive subclass partitions. The following two errors would have to be added to those identified above:
C) Redundancy Errors
Redundancy is a type of error that occurs when redefining expressions that were already explicitly defined or that can be inferred using other definitions. These errors occur in taxonomies when there is more than one explicit definition of any of the hierarchical relations.
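Redundant subclass-of assertions, i.e. explicit links that remain inferable from the other links by transitivity, can be flagged by temporarily removing each link and checking whether it can still be derived. A sketch (function names and the toy taxonomy are invented):

```python
def redundant_links(subclass_of):
    """Flag explicit subclass-of links that are also inferable through
    a chain of the remaining links (redundancy error)."""
    def reachable(src, dst, links):
        seen, frontier = set(), [src]
        while frontier:
            cls = frontier.pop()
            for parent in links.get(cls, set()):
                if parent == dst:
                    return True
                if parent not in seen:
                    seen.add(parent)
                    frontier.append(parent)
        return False

    redundant = []
    for cls, parents in subclass_of.items():
        for p in parents:
            remaining = {c: set(ps) for c, ps in subclass_of.items()}
            remaining[cls].discard(p)          # drop the link under test
            if reachable(cls, p, remaining):   # still inferable -> redundant
                redundant.append((cls, p))
    return redundant

taxonomy = {"dog": {"mammal", "animal"},       # dog -> animal is redundant:
            "mammal": {"animal"}}              # dog -> mammal -> animal already holds
print(redundant_links(taxonomy))
```

The check mirrors the definition in the text: a definition is redundant exactly when it can be inferred from the remaining definitions.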
D) Grammatical errors
A grammatical error occurs when the taxonomic relations are used incorrectly from the syntactical viewpoint. Examples would be to define: the class dog as an instance of the class mammal, the instance Pluto as a subclass of the class cartoon-dogs, the class cartoon-ducks as an instance of the instance Donald, etc.
E) Semantic errors
They usually occur because the developer makes an incorrect semantic classification, that is, classes a concept as a subclass of a concept to which it does not really belong; for example, classing the concept dog as a subclass of the concept house.
F) Incompleteness errors
Generally, an error of this type is made whenever concepts are classed without accounting for all of them, that is, concepts existing in the domain are overlooked. An error of this type occurs if a classification of musical instruments is defined considering only the classes of string instruments and wind instruments, overlooking, for example, the percussion instruments.
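An incompleteness check of this kind can be mechanized by testing whether the subclasses of a classification account for all known instances of the parent concept. A sketch using the musical-instruments example from the text (the instance data are invented):

```python
def classification_is_complete(instances_of_parent, subclasses, instance_of):
    """A classification is incomplete if some instance of the parent
    concept belongs to none of the listed subclasses."""
    uncovered = [i for i in instances_of_parent
                 if not any(i in instance_of.get(c, []) for c in subclasses)]
    return len(uncovered) == 0, uncovered

# Invented instance data: the classification considers only string
# and wind instruments, overlooking the percussion instruments.
instance_of = {
    "string instruments": ["violin", "guitar"],
    "wind instruments": ["flute"],
}
complete, missing = classification_is_complete(
    instances_of_parent=["violin", "guitar", "flute", "drum"],
    subclasses=["string instruments", "wind instruments"],
    instance_of=instance_of,
)
print(complete, missing)
```

Such a check can only detect incompleteness relative to the instances that are already known; concepts missing from the domain model entirely still require a human reviewer.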
[1] We know this to be a quite narrow definition of ‘empirical evaluation’ and that experiments are not appropriate for all circumstances. We would like to reach a level of control, however, which can probably only be achieved with experiments.
[3] www.teknowledge.com/HPKB/
[4] http://www.cse.unsw.edu.au/~timm/pub/eval/
[5] http://www.cse.unsw.edu.au/~timm/pub/eval/
[6] www.dfki.uni-kl.de/frodo/Proposal/index.html