
From Paper to a Corporate Memory -
A First Step

Stephan Baumann, Michael Malburg, Harald Meyer auf'm Hofe, and Claudia Wenzel
German Research Center for Artificial Intelligence DFKI GmbH
P. O. Box 2080, D-67608 Kaiserslautern
E-mail: {Stephan.Baumann, Michael.Malburg, Harald.Meyer, Claudia.Wenzel}


Computer-based corporate memories aim to enable an efficient use of corporate knowledge. Therefore, such systems demand a formal representation of the knowledge itself and of its meta-aspects, i. e., how to use corporate knowledge effectively. Since typical information sources are continuous text and paper-based documents, these initially need to be translated into a formal representation.

We argue that techniques of document analysis and understanding can be applied for this transformation. Furthermore, an integration of document and document analysis knowledge into the corporate knowledge leads to a significant improvement of analysis results.

Keywords: Knowledge Management, Knowledge-based Systems, Document Analysis, Document Understanding, Workflow Management

1 Introduction

Computer-based corporate memories aim to enable an efficient use of corporate knowledge. Knowledge-based systems can play a main role in this application scenario [1] because they contribute main ideas to realize organizational memory information systems (OMIS). Such systems obviously demand access to application and company specific knowledge concerning two aspects:

* The system uses the available knowledge in a goal-directed way, e. g., when answering queries. Therefore, corporate knowledge is either completely formalized in some knowledge representation language or at least main issues of unstructured knowledge are described formally.

* The system acts intelligently with respect to currently open tasks. A representation of tasks and processes occurring in the company is necessary to achieve such behaviour.

Unfortunately, the acquisition and maintenance of such knowledge requires considerable effort. Acquisition can be improved by re-use of previously formalized knowledge, e. g., workflow models. In the past, these models have often been specified to deploy workflow management systems (WFMS). Additionally, acquisition and use of corporate knowledge can be improved by extracting significant information from informal knowledge sources such as paper documents and electronic texts. This point is of special interest since the paperless office is still far away: even in modern enterprises, only 20 % of relevant information is available electronically, while 80 % is paper-based.

This paper claims that an integration of document analysis and understanding (DAU) techniques with workflow management systems can provide main contributions to realize such information extraction tasks [2]. DAU aims at extracting information from paper sources and translating it into a structured representation. Hence, this integration can be considered as a first step towards a corporate memory.

The analysis process is conducted in five steps [3]: First, image analysis algorithms normalize the bitmap of the scanned document by deskewing, noise reduction, etc. Segmentation then isolates textual and non-textual document portions and generates an initial hierarchy of blocks, lines, words, characters, and connected components. The attachment of logical labels to these segments is part of the structure analysis. Subsequent text recognition converts the character and word segments to ASCII code. For the processing of electronic documents (e. g., plain text, e-mail, HTML), text recognition is superfluous, but techniques for the attachment of logical labels may still be applied. At this stage, the final information extraction takes place independently of the previously processed media type (paper or electronic). Domain-dependent approaches are applied in order to classify the entire document and to extract the desired information portions. Considering a business letter, e. g., in a purchasing domain, document classification typically distinguishes document types such as offer, delivery note, invoice, etc. Furthermore, sender and recipient may be recognized, and additional information about articles, prices, and discounts may be extracted.

Classification and information extraction results are appropriate to assign the analysed document to the corresponding open workflow instance of the company's WFMS. Such assignments can be considered as representing the meta-aspects of the documents' contents, because recognizing the most relevant open workflow instance determines the way the document's contents are further processed and stored.

Additionally, this procedure determines the semantics of the previously extracted information and, thus, turns it into a knowledge representation to which formal methods can be applied. For instance, recognizing an extracted digit sequence as an (urgent) invoice date probably leads to immediate money transactions. Contrarily, the digit sequence may denote a date referring to previous letters that are of interest in searching for helpful information relevant to the current workflow instance; this is a completely different interpretation of the same digit sequence. Consequently, the result of document analysis can be considered as a partial translation into a formal representation providing information about title, author, sender, recipient, etc. Informal knowledge is projected into formal representations that are useful to implement intelligent archives, to identify relevant information concerning the current task, and to assist in finding the next steps to take, hence, to implement an organizational memory.

However, providing these services reliably requires the availability of knowledge on open processes (context information), expected document types, and data such as address data bases from the corporate memory. Hence, there is a bidirectional producer-consumer relation between document analysis and the corporate memory.

This paper is structured as follows. The first main section introduces the services of document analysis and their implementation. The following section answers questions arising from the integration of document analysis and workflow management into an OMIS. Referring back to the first section, it points out that optimal performance of document analysis can only be achieved by exploiting previously archived knowledge coming from the organizational memory.

2 Document Analysis Tasks

In this section, we explain how to get from paper documents to an electronic representation of the document's contents.

Document analysis begins as soon as an incoming document is scanned: Layout extraction comprises all low-level processing routines like skew angle adjustment and segmentation to compute a document's layout structure, i. e., the detection of boundaries for characters, words, lines, and blocks. Afterwards, logical labeling assigns so-called logical labels like sender, recipient, body, signature, etc., to layout regions of the document. Text recognition explores the captured text of logical objects. This is done by performing OCR for each character segment and generating word hypotheses afterwards.
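The chain of steps above can be sketched as a simple pipeline. All class and function names below are hypothetical illustrations chosen for this sketch, not the actual OfficeMAID interfaces; the bodies merely stand in for the real image processing, labeling, and OCR routines.

```python
from dataclasses import dataclass, field

@dataclass
class Document:
    image: str                                    # stands in for the scanned bitmap
    blocks: list = field(default_factory=list)    # layout structure
    labels: dict = field(default_factory=dict)    # logical labels per block
    text: dict = field(default_factory=dict)      # recognized text per block

def layout_extraction(doc):
    # skew correction and segmentation into blocks, lines, words, characters
    doc.blocks = ["block-0", "block-1"]
    return doc

def logical_labeling(doc):
    # assign labels such as sender, recipient, body to layout regions
    doc.labels = {"block-0": "sender", "block-1": "body"}
    return doc

def text_recognition(doc):
    # OCR per character segment, then word hypothesis generation
    doc.text = {b: f"<text of {b}>" for b in doc.blocks}
    return doc

def analyse(image):
    doc = Document(image=image)
    for step in (layout_extraction, logical_labeling, text_recognition):
        doc = step(doc)
    return doc
```

The pipeline shape makes explicit that each stage only adds information to a shared document object, which mirrors the blackboard idea described below.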

Figure 1: Information Extraction within OfficeMAID

The subsequent components for document understanding are shown in Figure 1. Each component gets three kinds of input: The first input source is our blackboard structure where actual analysis results such as OCR results or segmented blocks are stored. The second input source is used by each component exclusively and contains the knowledge specific for the domain and the task in a declarative form (drawn as clouds). The last kind of input is lexical data (drawn as folders). Note that the results of these components - shown at the right hand side - are also stored in the document analysis blackboard. The three document understanding components are used for different tasks as follows.

Firstly, the document is classified according to pre-defined message types. This information is used to start a more in-depth analysis. We apply a rule-based component [4] to determine the appropriate message type. To this end, it propagates hit scores for text phrases and OCR scores for words and word stems through a network of rules. It employs a pattern matcher [5] for the detection of phrases. In addition, this pattern matcher is used for the extraction of information from fragmentary sentences. Recognition of sender and recipient is done by applying a standard island-driven parser for unification-based context-free grammars [6]. Up to now, we are not able to extract complete product information since it is mostly kept in tables, which require specific treatment.

At this point, the task of document analysis and understanding is completed and all extracted information is represented electronically. Therefore, conversion to a desired representation by means of standardized document descriptions (SGML, ODA, ODMA, DMA, etc.) is straightforward.

In the following, the three components for document understanding are explained in more detail in order to answer the following questions:

* What knowledge is required about incoming documents? All the above-mentioned DAU components require knowledge on document types and typical document contents to work correctly.

* Where does this knowledge come from? In the existing system OfficeMAID, nearly all the required knowledge has been acquired during training phases conducted initially at system installation. Of course, this scenario contradicts the intention of an OMIS to provide services even with a minimum of initially given application-specific knowledge. Hence, a detailed analysis should clarify to what extent application-class-specific knowledge and available data can be used instead.

* What is the internal representation of this knowledge? Document analysis procedures are computationally intensive. Hence, the data they use has to be given in a format enabling efficient use. This aspect is essential to understand the problems of maintaining corporate knowledge distributed over the OMIS and the document analysis system.

2.1 Information Extraction by Pattern Matching

The pattern matcher is used for the extraction of information in stereotypical or fragmentary phrases. It matches all occurrences of so-called text patterns. These are templates specifying syntactic and semantic features of word contexts. In contrast to a parsing component, the pattern matcher carries out a more shallow analysis but reaches a higher performance.

The Pattern Language

Our pattern language aims at combining words with special features into more complex phrases. Such combinations involve nestings of conjunctions, disjunctions, skips (up to n words), and optionality operators.

The two atomic units of our language are

* words attributed with features:

- The word is mandatory, indicated by the word itself or the word written within quotation marks.

- The word is optional, indicated by parentheses ( ).

- The word specified is a substring of the word to be located in the text, indicated by :sub.

- The word specified is a morphological word stem, indicated by :morph_stem.

* information to be extracted (= information unit) which is specified by {F= type_of_extraction} where the data to be extracted currently are restricted to numbers, prices, proper names of people or cities, telephone numbers, and dates.

These atomic units may be concatenated to patterns according to the following rules:

* If two words follow each other (horizontally) in a text phrase, they are written down consecutively. If they follow each other vertically, this is indicated by the keyword :down. If they may follow either horizontally or vertically, the keyword :both_dir is used.

* A disjunction of words is specified by a vertical bar | between the different alternatives.

* Word skips are defined by giving a maximal word distance <n> between two words.

If a pattern or a group of patterns is given a proper name (indicated by the keyword :sub_pattern followed by this name), this group of patterns can be used as a subpattern within other patterns (indicated by square brackets and the name of the subpattern). This enables the definition of lexical categories and synonyms. In order to meet document analysis requirements, a Levenshtein distance [7] may be defined for each pattern. It controls the degree of tolerance in word matching. To enable a better understanding of the syntax, we now give some English examples out of our domain:

(in) reply | answer to your letter dated of {F= get_date}

confirmation of :both_dir order :both_dir

Thank you (very much) for your order {F= get_number} <2> Mr. | Mrs. | Ms. {F=get_name}

A pattern can be linked to a template-like concept to capture the semantic content of a pattern. It is defined by assigning a set of slots to be filled by the pattern matcher. For each pattern, one can specify which slots should be filled by which kind of information.
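As an illustration, the first example pattern above can be approximated by an ordinary regular expression. This is only a rough sketch: the pattern language additionally supports vertical adjacency, word skips, and Levenshtein-tolerant matching, which plain regular expressions do not cover, and the date format assumed here is an arbitrary choice.

```python
import re

# (in) reply | answer to your letter dated of {F= get_date}
pattern = re.compile(
    r"(?:in\s+)?"                                # optional word: (in)
    r"(?:reply|answer)\s+"                       # disjunction: reply | answer
    r"to\s+your\s+letter\s+dated\s+of\s+"        # mandatory word sequence
    r"(?P<get_date>\d{1,2}\.\d{1,2}\.\d{2,4})"   # information unit {F= get_date}
)

m = pattern.search("In reply to your letter dated of 12.03.1997 we confirm".lower())
```

The named group plays the role of the information unit: after a successful match, its content is handed over to the concept slot to be filled.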

Pattern Acquisition

When establishing the DAU system within a new domain, off-line learning results in the definition of generic patterns to be searched for. To this end, several conditions have to be fulfilled: New information units which are characterized by a common intra-word structure (e. g., a date) have to be defined manually as regular expressions. Furthermore, training documents have to be provided. Images of these documents are shown to a human trainer. She or he then marks the locations of new pattern instances by dragging a frame and typing in the correct ASCII text of the pattern.

A generic pattern is generated in the following way: At least one common word must appear in each pattern instance. Then, the word positions of all pattern instances are aligned according to the position of the common word. This word is inserted as the first part of the generic pattern. Now, the generic pattern is completed by moving a pointer stepwise to the left until the start of the shortest instance is reached. At each position, the algorithm computes inter-word similarities between two words in different instances occurring at similar positions. These similarities are based on the existence of different relationships between two words, e. g., identity, matching morphologic word stem, matching hyperonym, matching regular expression, matching part-of-speech. Inter-word similarities between two words are summed for all possible combinations, i. e., for each instance, all similarities for exactly one word are incorporated. The word combination with the maximal sum is chosen as the most similar word pattern. The corresponding common word features are inserted into the generic pattern on the left-hand side, including word skips. Having reached the start of the shortest instance, the procedure is repeated for a pointer moving from the first common word to the right until the end of the shortest instance is reached. Finally, a description of the generic pattern is obtained.
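A heavily simplified sketch of the alignment idea is given below: instances are token lists, and the generic pattern keeps a token where all instances agree, inserting a skip marker otherwise. The real algorithm additionally scores word stems, hyperonyms, regular expressions, and part-of-speech, and also extends the pattern to the left of the common word; both refinements are omitted here.

```python
def common_word(instances):
    # at least one word must appear in every instance
    shared = set(instances[0])
    for inst in instances[1:]:
        shared &= set(inst)
    # deterministic choice: first shared word in reading order of instance 0
    return next(w for w in instances[0] if w in shared)

def generalize(instances):
    anchor = common_word(instances)
    # align all instances on the position of the common word
    aligned = [inst[inst.index(anchor):] for inst in instances]
    length = min(len(a) for a in aligned)  # stop at the shortest instance
    pattern = []
    for i in range(length):
        words = {a[i] for a in aligned}
        # identity is the only similarity relation considered in this sketch
        pattern.append(words.pop() if len(words) == 1 else "<skip>")
    return pattern

instances = [
    "thank you for your order 4711".split(),
    "thank you very much for your order 4712".split(),
]
```

Even this crude version shows the essential effect: the induced pattern retains the invariant prefix and replaces the variable parts by skips.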

Pattern Compilation

Before pattern search is invoked, the original pattern file is split into different representations. First, it is transformed into a more efficient representation. Second, all of the mandatory words in the patterns are selected, split into trigrams, and compiled into a finite state automaton for characters. This splitting permits parallel approximate string matching. The compilation is based on the Aho-Corasick algorithm for efficient string search. Third, all words used in the definitions of patterns are picked up and stored in a wordlist for faster access.
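The trigram compilation step can be sketched as follows. A full Aho-Corasick automaton would add failure links so that the text is scanned in a single pass; this simplified trie-based version restarts the search at every text position instead, which keeps the sketch short while preserving the idea of matching all trigrams of all mandatory words at once.

```python
def trigrams(word):
    # split a word into overlapping character trigrams
    return [word[i:i + 3] for i in range(len(word) - 2)] or [word]

def build_trie(words):
    trie = {}
    for w in words:
        for tri in trigrams(w):
            node = trie
            for ch in tri:
                node = node.setdefault(ch, {})
            node["$"] = tri   # terminal state: a complete trigram
    return trie

def match_trigrams(trie, text):
    hits = set()
    for start in range(len(text)):
        node = trie
        for ch in text[start:]:
            if ch not in node:
                break
            node = node[ch]
            if "$" in node:
                hits.add(node["$"])
    return hits

trie = build_trie(["order", "invoice"])
hits = match_trigrams(trie, "please check the lnvoice for ordre 4711")
```

Note that the OCR-garbled words "lnvoice" and "ordre" still yield most of their trigrams, which is exactly why trigram anchoring tolerates recognition errors.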

Pattern Processing

Pattern processing with the pattern matcher can be divided into several steps as shown in Figure 2. The left-hand side shows at which step the different pattern representations are utilized by the pattern matcher while the right-hand side shows how analysis results are transferred to and from the OfficeMAID's global blackboard.

Figure 2: Pattern Processing

Because all of the words occurring in the text patterns were extracted during the compilation of patterns, lexical pre-processing [8] is now able to verify the results of OCR with respect to a list of valid words. Thus, dictionary look-up returns valid word hypotheses based on the weighted edit distance including the consideration of splits or merges of word segments. Moreover, lexical pre-processing checks OCR results for the appearance of distinct types of information units.

Now the pattern matching process begins. All mandatory words appearing in patterns and present in the actual document are located in one pass. First, all word hypotheses in the document are split into trigrams. Then, the finite state automaton which contains all possible trigram sequences of mandatory words, successively checks the initial letters of all trigram hypotheses of the actual document. If a character matches, the next one is tested and so on until either a terminal state is reached, i. e. a trigram is matched, or the relative Levenshtein distance is exceeded. After all trigrams are located, the system tries to match the entire word within a maximum Levenshtein distance constraint.
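The final word-level verification described above rests on the Levenshtein distance with a relative bound. A minimal sketch, with an illustrative tolerance of 25 % of the pattern word's length (the actual threshold used in the system is not stated in the text):

```python
def levenshtein(a, b):
    # standard dynamic-programming edit distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def matches(pattern_word, ocr_word, rel_tolerance=0.25):
    # accept an OCR hypothesis if its edit distance stays within a
    # bound relative to the length of the pattern word
    return levenshtein(pattern_word, ocr_word) <= rel_tolerance * len(pattern_word)
```

A long word such as "confirmation" thus absorbs several OCR errors, while a short word such as "order" is matched strictly.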

Then, all patterns are tested successively for completion: Patterns containing subpatterns are expanded and split into several patterns. All mandatory words and substrings in a pattern are checked from left to right. This takes into account horizontal and vertical directions, disjunctions of words, and possible word skips. If all of these words are matched at an appropriate position, the pattern is checked further to locate optional keywords and to match information units. Finally, the system generates an instantiated set of detected patterns. Redundant patterns are removed and pattern probabilities are calculated bottom-up by multiplying word probabilities.

Finally, the information units of an instantiated pattern are used to instantiate the appropriate concept, i. e., they are transformed into values for the different concept slots. Note that the slots need not conform to exactly one information unit; the information they capture may as well be implicitly given by the pattern. The instantiated concepts are then stored in the blackboard structure and build up a part of the so-called message of the letter.

2.2 Rule Interpreter

This component takes text features of a document as input and propagates their certainty factors through a rule network to determine the correct class.

Rule Acquisition

A classifier rule consists of either a text pattern or a single word on the left-hand side, and a class name on the right-hand side. Moreover, a certainty factor is attached to the rule. Two different procedures are invoked for rule acquisition:

* For the generation of rules where single words are important text features, acquisition can be done automatically. A training set is required which contains the assignment of correct classes to training documents. First, conditional probabilities for all the words in the training set are computed with respect to the distinct classes. Afterwards, a threshold is applied: Whenever a word occurs significantly often in a certain class, a rule is created for this word. The probability of the rule is calculated on the basis of the conditional probability.

* The above-mentioned off-line learning for patterns must take place before the generation of pattern rules. Then, patterns are treated as text features in the same way as described above to enable their integration into a rule.
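The automatic acquisition of single-word rules can be sketched as below. The threshold and the lack of a minimum-frequency requirement are illustrative simplifications; a real system would also demand that a word occurs often enough before emitting a rule.

```python
from collections import Counter, defaultdict

def acquire_rules(training_set, threshold=0.8):
    # training_set: list of (word_list, class_name) pairs
    word_class = defaultdict(Counter)
    for words, cls in training_set:
        for w in set(words):           # count document frequency, not tokens
            word_class[w][cls] += 1
    rules = []
    for word, counts in word_class.items():
        total = sum(counts.values())
        for cls, n in counts.items():
            p = n / total              # conditional probability of the class
            if p >= threshold:
                rules.append((word, cls, p))   # rule: word => class, certainty p
    return rules

training = [
    (["we", "order", "the", "following", "items"], "order"),
    (["please", "send", "your", "order"], "order"),
    (["enclosed", "our", "invoice"], "invoice"),
]
```

Each emitted triple corresponds to a classifier rule with the word as its left-hand side, the class as its right-hand side, and the conditional probability as the attached certainty factor.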


Our text classifier is based on the rule interpreter FuzzyCLIPS [9] for representing and manipulating fuzzy or uncertain facts and rules. FuzzyCLIPS is freely available for educational and research purposes.

At the beginning of the classification process, the rule interpreter is provided with the necessary information: Words, together with recognition scores, are entered as facts, and the rulebase is loaded and checked for identifiers of text patterns which appear in rule conditions. These identifiers are collected and the pattern matcher is invoked to locate appearances of these text patterns in the original text. Afterwards, pattern instances, along with their recognition scores, are entered as facts in the factbase. At this point, all rules and facts have been inserted and the rule interpreter is started. Now, the recognition scores of single words and patterns are propagated through the rule network to provide and refine an estimate of the certainty of the different classes. The system computes the certainty of a rule action based on the certainty of the rule condition and the rule certainty itself. Moreover, the certainty of a fact for which there are several pieces of evidence must be computed. In both cases, several pieces of evidence for each hypothesis are combined. For this, we use an approach similar to that of the expert system MYCIN. At the end of the inference process, the fact base contains new facts which consist of classes in conjunction with certainty factors expressing how well the respective class corresponds to the actual document.
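The two computations mentioned above correspond to the standard MYCIN-style certainty-factor calculus for positive evidence, sketched here as an illustration (the exact formulas used in the system may differ):

```python
def rule_certainty(condition_cf, rule_cf):
    # certainty of the rule action = certainty of condition * rule certainty
    return condition_cf * rule_cf

def combine(cf1, cf2):
    # parallel combination of two positive evidences for the same fact
    return cf1 + cf2 * (1 - cf1)

# one rule fires with condition certainty 0.9 and rule certainty 0.8,
# a second rule (condition 0.6, rule 0.5) supports the same class
cf = rule_certainty(0.9, 0.8)
cf = combine(cf, rule_certainty(0.6, 0.5))
```

The combination formula is commutative and monotone: additional supporting evidence can only raise the certainty of a class, but never above 1.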

2.3 Information Extraction by Parsing

Just as the pattern matcher is used especially for information extraction from stereotyped phrases, the analysis of highly structured parts (i. e., parts showing regular linguistic structures in some sense) can be performed by a parsing component. For the application domain of printed business letters, we exemplarily use this technique for the identification of correspondents, i. e., the sender and recipient of a message. To this end, we have built a component upon a standard parsing system for feature grammars which starts with the text block expected to hold the relevant information, i. e., the address, and performs the following steps: lexical pre-processing, address parsing, and data base matching. Before we describe these analysis steps, we say a few words on preparatory steps, i. e., creating the lexicon, grammar, and address data base. Such preparations show a special relation to corporate knowledge.

2.3.1 Preparatory Steps

In document analysis, recording of the so-called Ground Truth (GT) data is performed in order to enable the evaluation and benchmarking of the system. This is comparable to the related procedure in speech recognition, where acoustic speech data is transcribed into electronic text. This text transcription corresponds to GT for OCR in document analysis.

For information extraction, we additionally record GT data at the conceptual level. In particular, this GT comprises, besides the ASCII text, the logical arrangement of a letter's parts, such as recipient, sender, body, and relational data. Furthermore, message-level information is given to describe the content, e. g., "this is an order sent by company x for the items y to sell".

Starting from this GT data, which comprises both message-level and person-specific (address) data, a lexicon and an address data base are compiled. In order to reduce manual effort, GT data is recorded at some aggregational level between words and phrases. For addresses, this means that the slots to be filled are the organization's name and type, the person's name, etc. Because GT is not always given at word level and is not part-of-speech tagged, several heuristics have to be used within the lexicon generation step. A typical example are cases where a company's name consists of its owner's name plus the company type, e. g., "Peter Stuyvesant & Sons, Ltd.". Here, the occurrence of the keywords "Sons" and "Ltd." is sufficient for the hypothesis that "Peter" is the forename and "Stuyvesant" is the surname of the company's owner. However, such heuristics may fail since there are several possible deep structures for a single form; e. g., compare "Peter Stuyvesant Corporation" and "General Electric Company".

In addition to this semi-automatically generated lexicon, a small hand-crafted lexicon of function words is drawn up manually during the construction of the grammar. In detail, each word having a special role within the address grammar is attached as an entry in this lexicon. Such function words are the German equivalents to "P. O. Box", "Mister", "Miss", etc.

Within the given context of an OMIS, previously analyzed documents and the respectively recorded results are used in the same way as GT data is used for system preparation.

2.3.2 Lexical Pre-Processing

The first step in the analysis of a given text block is the recognition of its text. After OCR, the dictionary generated from the GT data is used with the above-mentioned word recognition component. This leads to a word hypotheses lattice comparable to those used in speech processing.

Since not all words are known in advance, a heuristic word description is additionally assigned. The respective default categories depend on the character contents of the recognized words; e. g., 5-digit numbers are assumed to be ZIP codes, words with an upper-case initial letter are assumed to be names, and so on. Sometimes, more interesting word categories can be assumed; e. g., street names typically end in "street" or "avenue", and town names often have typical endings like the English "town" or "ton".
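These default-category heuristics amount to a small cascade of surface-form tests. The following sketch uses illustrative rules and endings; the real system works on German addresses and a richer rule set.

```python
import re

def default_category(word):
    # 5-digit numbers are assumed to be ZIP codes
    if re.fullmatch(r"\d{5}", word):
        return "zip_code"
    # typical street-name endings (illustrative list)
    if word.lower().endswith(("street", "avenue", "strasse")):
        return "street_name"
    # typical town-name endings like the English "town" or "ton"
    if word.lower().endswith(("town", "ton")):
        return "town_name"
    # upper-case initial letter: assume a proper name
    if word[:1].isupper():
        return "name"
    return "unknown"
```

The order of the tests matters: "Georgetown" must be caught by the town-name ending before the generic upper-case rule would label it a plain name.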

2.3.3 Address Parsing

Subsequently, the parser is invoked to compute valid parses. The parsing itself is standard and therefore needs no detailed description. It works on a feature-based grammar formalism related to PATR-II and consists of an island-driven chart parser, i. e., a parser driven by recognition probabilities. We have enhanced this standard parsing technique by adding the possibility of partial parsing which means that input to the parser may be fragmentary. This is important for document analysis where OCR errors or deficiencies in the lexical data base occur frequently.

Partial parsing is performed in the following way. Since the components of an address are easy to delimit, a kind of case frame is defined consisting of the parts town, street/P. O. Box, name, department, and institution. This frame is mapped to the resulting feature structure of a complete parsing result and thus a uniform post-processing can be done for partial and complete parses. For partial parses, strict rules are given on how to fit a partial parse into this frame. Thus, partial parses can be combined to more complete ones by applying these rules.
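The combination of partial parses into the case frame can be sketched as a conflict-free merge over the frame's slots. The slot names follow the text; the single conflict rule used here (identical slots must agree) is a simplification of the stricter rules in the real system.

```python
SLOTS = ("town", "street", "name", "department", "institution")

def combine_parses(frames):
    merged = {}
    for frame in frames:
        for slot, value in frame.items():
            if slot not in SLOTS:
                raise ValueError(f"unknown slot: {slot}")
            if slot in merged and merged[slot] != value:
                return None   # conflicting partial parses cannot be combined
            merged[slot] = value
    return merged

partials = [
    {"name": "H. Meyer"},
    {"town": "Kaiserslautern", "street": "P. O. Box 2080"},
]
```

Mapping complete parses into the same frame means that the post-processing (data base matching) never needs to distinguish partial from complete results.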

2.3.4 Data Base Matching

Data base matching is carried out as follows. For each pair of parse and a particular data base entry in question, a formula is generated; this formula computes the total matching weight of the parse/data-base pair mainly by summing the matching weights of the parse's components. Some primitive matching weights are defined for names and numbers; the more complex weights are aggregated from a few primitives.

The principle of these formulas will be described by an example. A partial parse yields a case frame with department, name, and city; in particular, the corresponding organization and the street or P. O. Box address are unknown. For each data base entry whose person's name matches, the respective formula is determined. The formula's denominator is determined only by the parse. It ensures that the resulting weight is not influenced by the amount of GT data available for one entry (which is, in general, a rather accidental property of a GT entry's content). Thus, the best possible weight occurs for those data base entries containing the same frame slots as the parse. In our example, this happens for entries with department, name, and city given. A loss of weight occurs for those cases where one of those slots is empty, e. g., if the department is not given.

This way, a list of weighted pairs of parses and data base entries is computed. Actually, the total weight of such a pair is calculated by simple multiplication of the parse weight and the match weight. In a final step, the most probable parse-to-data-base match is determined by looking at some of the best weighted matches, taking into account their differences in content (e. g., persons with the same name in one company but with different departments), and the difference of their weights.
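The weighting scheme can be sketched as below: slot match weights are summed and normalized by the number of slots present in the parse, so the denominator depends on the parse only, as described above. The uniform slot weight of 1.0 is an illustrative simplification; the real system uses different primitive weights for names and numbers.

```python
def match_weight(parse, entry):
    total = 0.0
    for slot, value in parse.items():
        if entry.get(slot) == value:
            total += 1.0      # exact slot match
        # slot missing or differing in the entry: no credit, i. e., a
        # loss of weight relative to a full match
    return total / len(parse)   # denominator depends on the parse only

parse = {"department": "DAU group", "name": "C. Wenzel", "town": "Kaiserslautern"}
entry_full = {"department": "DAU group", "name": "C. Wenzel", "town": "Kaiserslautern"}
entry_partial = {"name": "C. Wenzel", "town": "Kaiserslautern"}

total_weight = 0.9 * match_weight(parse, entry_full)   # parse weight * match weight
```

The final line shows the total weight of a parse/entry pair as the product of the parse weight (here an assumed 0.9) and the match weight.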

For the final result, we also consider the relative differences between the best parsing results. Assume that a well weighted parse is computed but no data base match is found for it. Instead of returning some possible match for the second best parse, it may be better to reject this parse, i. e., to say "no result found". In such cases, the lexical heuristics described in Section 2.3.2 may have major influence on the resulting weights. Finally, the component outputs exactly one result or a reject.

3 Integration into an Organizational Memory

As mentioned in the introduction, the relation between DAU and an OMIS is twofold. On the one hand, DAU helps to keep the formalized knowledge in organizational memories up to date by providing the services described in Section 2. On the other hand, DAU techniques require knowledge from the organizational memory in order to work reliably. Consequently, the DAU system has to be linked to the embedding OMIS in both directions: Information retrieved from incoming documents has to be related to the data and knowledge bases in which it is to be stored. Furthermore, the system needs to know how to retrieve useful information from the OMIS. Unfortunately, the different kinds of knowledge will suitably be represented in different formalisms. For instance, generic document types and knowledge on how to perform DAU tasks are parts of the DAU system, whereas corporate knowledge is typically given in terms specific to that corporation. Hence, integrating DAU systems into a corporate memory is a non-trivial task (cf. Figure 3) and is very similar to the problem of knowledge base integration in general [10].

Figure 3: DAU within an Organizational Memory

The workflow model plays a key role in this scenario. On the one hand, the main services of DAU systems assign analysed documents to open processes in the corresponding workflow. For instance, after finishing an "order some goods" task, a workflow management system will usually generate a "handle invoice" workflow instance referring to exactly this order. If DAU is able to assign the incoming invoice to this open workflow instance, all further processing can be left to the workflow management system. Hence, available workflow models can help to build a bridge from DAU to the organizational memory. On the other hand, open workflow instances concerning DAU tasks provide valuable task and context information for the improvement of DAU reliability. Task information prevents the system from performing unnecessary analysis steps: if, e. g., only a classification of documents is required, there is no need to extract further information from the document. Another property of tasks states whether to optimize the accuracy or the response time of DAU steps. Context information addresses knowledge about what is expected rather than what is to be done. Returning to the example mentioned above, the DAU system needs to know whether an open workflow instance is waiting for an invoice or not. If the workflow management system does not expect any invoices but the DAU system believes it has found one, something must be wrong.

The following sections cover these problems and opportunities of a tight interaction between DAU and the organizational memory.

3.1 Different Knowledge Types

Taking a more detailed view of DAU in the context of a corporate memory, three kinds of knowledge play an important role. These knowledge portions differ in content, terminology, and life time. However, providing reliable document analysis services requires extensive use of all of these knowledge sources.

Corporate Knowledge comprises the contents of the OMIS: workflow models, databases, organizational diagrams, ER diagrams, etc. This knowledge is represented in terms specific to the corporation. Thus, knowledge of corporation-specific conventions is necessary to understand the meaning of proper names and identifiers occurring in the representation of corporate knowledge.

Document Knowledge is composed of structural knowledge about documents, i.e., layout objects, logical labels, relevant message types, etc. Document knowledge is represented in the more generic terms of document analysis domains, such as business letters. Representations of document knowledge are therefore understandable beyond the context of a specific corporation; proper names and identifiers are those commonly used in the application domain, e.g., business letters.

Figure 4 sketches our idea of how to represent generic knowledge about documents (within the dashed box). Basically, the representation consists of an aggregational and taxonomical hierarchy enriched by inherited attribute/value pairs and constraints between attribute values. Attributes are listed in the box representing a term, followed by constraints on attribute values. Additionally, aggregations represent part/whole relations; they are denoted by thin lines labeled with an identifier of the relation and a range of cardinal numbers stating how many components can be part of a whole via the named relation.

As an example, consider the class TIFF_Document. Relevant attributes of global interest concern the questions: Are colours represented (attribute colours)? Has the pixel graphic been normalized by standard procedures (attribute preprocessed)? Instances of another concept, Word_Hypotheses, represent word hypotheses; lexicons that have been used to prove hypotheses can be parts of these instances. However, the most important concepts concern document types. Attributes of subconcepts of Document_Type typically are possible goals of information extraction tasks. Parts of a document type refer to additional data useful for accomplishing document analysis tasks, such as pattern descriptions and grammars. In fact, such data is compatible with the intention of a taxonomy because patterns and grammars can be considered as constraints on texts. Hence, demanding consistency with a pattern or a grammar can be viewed as concept specialization.

Figure 4: Concepts of document knowledge - an excerpt

Viewing documents as instances of document concepts enables application-specific specializations of generic document types to serve as gateways to available corporate knowledge. An example is given in Figure 4: three document types specific to the company XY Ltd. have been added to the knowledge base, inheriting most of their properties from generic document types. Either attributes are added, or inherited attributes are constrained to more concrete expectations. The document type XY_Ltd_Product_Req, for instance, adds an additional product_name attribute that is constrained by the available products. An example of providing more detailed information on documents through a more specific concept is XY_Ltd_Support_Req: here, senders of such a request are assumed to have bought a product and, hence, to have an entry in the customer database of XY Ltd. Consequently, the sender attribute is constrained to a grammar derived from the customer database. Such information is of obvious interest for enhanced document classification and information extraction operations.
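The constraint-based specialization described above can be sketched in Python. This is a hypothetical encoding: only the concept name XY_Ltd_Support_Req and the sender attribute come from the paper; the generic parent concept Business_Letter, the predicate representation of constraints, and the faked customer database are our assumptions.

```python
class Concept:
    """A node of the taxonomy; constraints map attribute names to predicates."""
    def __init__(self, name, parent=None, constraints=None):
        self.name, self.parent = name, parent
        self.constraints = constraints or {}

    def all_constraints(self):
        """Merge inherited constraints; a subconcept overrides with stricter ones."""
        merged = dict(self.parent.all_constraints()) if self.parent else {}
        merged.update(self.constraints)
        return merged

def satisfies(values, concept):
    """A document (attribute/value dict) is an instance if all constraints hold."""
    return all(check(values[attr])
               for attr, check in concept.all_constraints().items()
               if attr in values)

customer_db = {"Miller GmbH", "ACME Corp"}   # stand-in for the XY Ltd. customer DB

# Generic concept: any string is an acceptable sender.
business_letter = Concept("Business_Letter",
                          constraints={"sender": lambda s: isinstance(s, str)})
# Application-specific subconcept: sender must be a known customer.
support_req = Concept("XY_Ltd_Support_Req", parent=business_letter,
                      constraints={"sender": lambda s: s in customer_db})
```

A sender unknown to the customer database still satisfies the generic concept but not the specialization, which is exactly the extra evidence a classifier can exploit.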

The sketched formalism can be considered an ontology for sharing knowledge about documents [10], appropriate for the following two purposes: on the one hand, the ontology is a specification of data structures for the storage and communication of document analysis inputs and results. On the other hand, these concepts are required for specifying document analysis tasks and for characterizing the program modules accomplishing them, describing their input requirements and their deliverables. This use of ontologies is very similar to their use, for example, in systems for knowledge-based design [11].

Strategic Knowledge refers to the latter point. As pointed out in Section 2, document analysis is typically accomplished by an application-specific configuration of program modules. In contrast to state-of-the-art applications, document analysis within a corporate memory can be expected to cover wide application domains (such as the whole correspondence of a company by snail mail and e-mail) with many document types. As mentioned before, access to a large variety of useful corporate knowledge is required, and the ability to flexibly choose document analysis modules and their parameters with reference to the current application context is desired. An abstract notation can be used to link generic knowledge on the performance of document analysis modules to the corporate memory and, as we think, to automatically configure document analysis processes according to dynamically changing requirements.

Our ideas on how to specify document analysis processes are sketched in Figure 5. Again, a concept language is employed, with the same elements as in the document knowledge. The concept DAU_Op is the most general concept of document analysis program modules and tasks. Some attributes of global interest, such as a qualitative characterization of precision or recall, can be used to represent either properties of, or requirements on, document analysis program modules. Parts of this concept represent provided or required input (part input) and provided or desired output (part output).

Figure 5: Concepts of strategic knowledge - an excerpt

The subconcept DAU_Plan describes possible compositions of document analysis operations. Instances of DAU_Interface are used to represent something like a consistent resource-oriented configuration [12][13], in which all inputs required by the analysis programs (the parts of op) are either inputs of the whole plan or provided by another program.
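The resource-oriented consistency criterion can be illustrated with a small Python sketch, assuming a plan is a list of operations with named inputs and outputs. The operation and resource names are invented for illustration and are not part of the formalism.

```python
def plan_is_consistent(plan_inputs, operations):
    """Check the resource-oriented criterion: every input required by an
    operation is either an input of the whole plan or produced by some
    operation in the plan.

    operations: list of (name, required_inputs, provided_outputs) tuples.
    """
    available = set(plan_inputs)
    for _name, _required, outputs in operations:
        available.update(outputs)           # outputs become available resources
    return all(set(required) <= available
               for _name, required, _outputs in operations)

# A hypothetical two-step plan: OCR feeds a rule-based classifier.
ocr_plan = [
    ("OCR",            ["tiff_document"],   ["word_hypotheses"]),
    ("RuleClassifier", ["word_hypotheses"], ["document_type"]),
]
```

With tiff_document as the plan's input, every requirement is covered; without it, the OCR step lacks its input and the configuration is rejected.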

Subconcepts of Specialist_Op describe the available program modules; hence, a reference to the database of available programs is maintained. Optionally, additional constraints on the input parts can be used to represent the applicability of a program module, whereas constraints on the output parts represent its deliverables. Note that the program modules gain access to relevant data through the properties of their input and output parts: e.g., the rule classifier can retrieve relevant rules and patterns from the concept description of its output (cf. Figure 4), and the island parser reads the relevant grammar from its output concept representation. Hence, representing concepts of document analysis operations enables application-specific tuning with corporate knowledge, without requiring detailed knowledge of document analysis techniques.

Application-specific extensions of this ontology can be used to specify application-specific document analysis tasks. Subconcepts of Specialist_Op introduce application-specific analysis modules, or modified usages of the original modules, e.g., with specific parameters. Additionally, subconcepts of DAU_Op may be specified in order to create new application-specific analysis tasks. As an example, consider the concept Analyse_Invoice_Op. This new task assumes a TIFF_Document as input and is supposed to classify an XY_Ltd_Invoice (as defined in Figure 4) as output and to extract the attribute order_no. The VirtualOffice project aims at developing methods founded on the configuration of technical systems [12][14][15] for finding realizations of such application-specific tasks, either automatically or in close dialog with an expert. More precisely, building a configuration of an abstract concept like Analyse_Invoice_Op means finding an instance of the composed concept Analyse_Invoice_Op and DAU_Plan.

An intelligent DAU system (with the ability to configure analysis processes automatically from abstract task concepts) enables the use of task and context information as described above. Exploiting task information is relatively easy: in the concept Analyse_Invoice_Op, for instance, the only required information extraction task is to find the number of the order preceding the current invoice. Neither the sender nor the date attribute has to be extracted in order to complete the task. Hence, task specifications are appropriate for avoiding unnecessary computations and optimizing the performance of the DAU system.
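Task-driven pruning of extraction steps might look as follows. The extractor registry, the attribute names, and the flat document representation are simplifying assumptions made only for this sketch.

```python
# Hypothetical registry of extraction specialists; each entry is a costly
# analysis step in the real system, modeled here as a simple lookup.
EXTRACTORS = {
    "order_no": lambda doc: doc.get("order_no"),
    "sender":   lambda doc: doc.get("sender"),
    "date":     lambda doc: doc.get("date"),
}

def run_task(doc, required_attributes):
    """Execute only the extraction steps the task concept actually requires;
    all other extractors are never invoked."""
    return {attr: EXTRACTORS[attr](doc) for attr in required_attributes}

invoice = {"order_no": "25", "sender": "ACME Corp", "date": "1997-08-18"}
```

For an Analyse_Invoice_Op-style task only order_no is requested, so the sender and date extractors are skipped entirely.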

Moreover, context information aims at improving the reliability of DAU processes by providing assumptions about incoming documents. Consider an order that has been submitted to a supplier. Consequently, besides the always-expected product requests, the OMIS is waiting for an invoice referring to the submitted order. Hence, the DAU system is told to conduct an application-specific procedure whose output is constrained to the document types XY_Ltd_Product_Req and XY_Ltd_Invoice. Automatic configuration now chooses appropriate patterns and rules to be processed and uses the expected data for internal consistency checks. After the invoice has been received, the corresponding document type is removed from the goal concept, and the configuration of the DAU system changes again because there is no open order left.

Besides the usual problems of knowledge-based configuration systems [16], the automatic and dynamic configuration of DAU tasks requires coping with unexpected behaviour of the employed DAU algorithms. Figure 6 gives an example: plan1 was expected to provide an XY_Ltd_Invoice result with order_no constrained to 25. Unfortunately, the result of plan1 violates this constraint, perhaps because of an OCR failure. The system recognizes this inconsistency and starts a replanning operation with an additional constraint on the new plan: it is required to differ from plan1.
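The replanning cycle of Figure 6 can be sketched as follows. Here configure_plan is a trivial stand-in for the knowledge-based configuration component; the plan names follow the figure, and everything else is our illustrative assumption.

```python
def configure_plan(excluded, candidates=("plan1", "plan2")):
    """Stand-in for knowledge-based configuration: return a plan that has
    not been excluded by earlier failures, or None if no plan is left."""
    for plan in candidates:
        if plan not in excluded:
            return plan
    return None

def analyse_with_replanning(run_plan, expected_order_no):
    """Run plans until one satisfies the integrity constraint
    (result.order_no == expected_order_no); otherwise replan with the
    additional constraint that the next plan differs from the failed ones."""
    failed = set()
    while (plan := configure_plan(failed)) is not None:
        result = run_plan(plan)
        if result.get("order_no") == expected_order_no:
            return plan, result
        failed.add(plan)   # constraint violated, e.g. by an OCR failure
    return None, None
```

In the scenario of Figure 6, plan1 delivers a wrong order number, so the loop excludes it and a differing plan2 is configured and executed.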

Figure 6: Replanning because of an unsatisfied integrity constraint

3.2 Interaction between DAU and the process model

The different knowledge types and their relevance for conducting DAU in a corporate memory have been discussed in Section 3.1, and we have sketched their representation. For the context-oriented instantiation and specialization of these knowledge types in an application domain (e.g., purchasing), we want to make use of given workflows specifying the daily work processes. When talking about workflows, however, one has to state clearly what our interest in this paper is.

There exists a large variety of workflow models, representation techniques, and implementations through WFMSs, both in research and in industrial use. In this paper, we are interested in the common concepts of workflow models with respect to their applicability as a valuable knowledge source for document analysis and understanding. Therefore, this section starts with a short overview of common modeling elements and shows how to use them for gaining application-specific document knowledge as well as for providing input for the handling of DAU tasks and contexts. The latter will be explained both for the specification of a generic workflow with integrated DAU tasks at build time and for the benefits of using the context of workflow instances at run time.

Workflow models may be classified by their design centers, i.e., the main concepts used for implementing a workflow. Most commercial systems are centered around the process, but agent- and data-centered systems are also known [17]. Consequently, the modeling elements for the specification of workflows differ slightly. In the following, we concentrate on common elements. First of all, a workflow model should contain the following perspectives [18] on a process-centric view of work:

* Organization: this perspective describes the organizational structure of a company and the different individuals (actors) involved. Links between specific organizational units and tasks specify responsibility and performing actors (role concept). In the case of automated tasks, the required applications are triggered by the workflow engine without user interaction, and the role indicates an organizational unit representing a system (automated actions, agents).

* Data: this perspective describes the work objects, i.e., data and documents, which are required to perform tasks or which tasks generate or modify. For this purpose, the notion of input/output containers or of source and sink relations may be used.

* Tasks: tasks represent a collection of actions, or one elementary action, describing the work content. The tools or applications required for working on, creating, or modifying objects have to be specified. Trigger conditions are necessary to define the beginning and completion of a task; here, logical expressions on the contents of the sink and source containers are used to determine these trigger conditions or events.

* Process: using the above modeling concepts allows a specification of the entire process. The individual tasks are linked to each other by triggers and the source/sink relations, and the role concept specifies the embedding into the company's organizational structure.
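Trigger conditions over source containers, as described under Tasks above, might be evaluated as in the following sketch. The container contents, the document fields, and the concrete condition are illustrative assumptions.

```python
def trigger_fires(condition, source_container):
    """A trigger condition is a logical expression (here: a predicate) over
    the documents currently present in the task's source container."""
    return condition(source_container)

def expects_invoice_for(order_no):
    """Build the trigger for a hypothetical 'handle invoice' task: it fires
    once an invoice referring to the given order has arrived."""
    return lambda container: any(
        doc["type"] == "invoice" and doc["order_no"] == order_no
        for doc in container)
```

A workflow engine would evaluate such predicates whenever the container contents change and start the task as soon as the condition becomes true.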

Figure 7: Perspectives on a process-centric view of work

These concepts are supported by standard WFMSs and are used to specify the work process of interest. In our scenario of an integrated DAU and corporate memory system, document knowledge and strategic knowledge can be located in the following way:

Document knowledge

* Data / documents: if data is represented in input or output documents, these may contain additional document knowledge (e.g., the appearance of a form of supplier XY and relevant data field descriptions). Furthermore, the description of application-specific document types (e.g., an invoice consisting of the products delivered, the extension, and a reference to the corresponding order) represents document knowledge. On the other hand, links to data descriptions contained in corporate databases (e.g., product specifications can be accessed via the Oracle product DB) represent corporate knowledge.

* Role / Actor: the human actors specified in the organizational structure (a part included in almost every WFMS) often appear as recipient or sender in incoming and outgoing documents and thus, at first glance, represent corporate knowledge. But document and specialist knowledge can also be derived from this source (e.g., the recipient_grammar).

DAU tasks

* Tasks: if a task consists of actions for extracting information from documents, or for generating or modifying documents, it may contain DAU-relevant tasks, i.e., strategic knowledge (e.g., extract the extension of an invoice; generate an inquiry via MS-WORD containing the demanded products and the running inquiry number).

The concepts of strategic knowledge support different kinds of granularity for representing DAU tasks in the workflow description. Workflow designers without knowledge about DAU design their workflow in a task-driven manner: actions concerning information extraction or classification tasks are specified as automated actions, and the performing actor is the entire DAU system (which performs automatic configuration on demand using a standard DAU_Plan). By denoting a specific Specialist_Op (see Figure 5) as the performing actor, the experienced workflow designer has the possibility to specify concrete DAU techniques.

DAU context

As mentioned earlier, an efficient way to support DAU is the use of workflow context at run time. Instances of the generic workflow are handled by a workflow engine which implements the underlying execution semantics. If the trigger condition for a DAU action fires, the workflow engine switches control to the DAU system, and all available context stemming from the set of running workflow instances is given to the consuming DAU operation. This context may include the following information and expectations (actually the contents of the internal workflow database):

* actors having performed previous working steps of this workflow instance

* acquired data

* expected document types

* expected sender, recipient

* references (date, documents)
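The context items listed above could be collected into a single structure handed to the DAU operation. The field names and the plain-dictionary representation are our assumptions for illustration, not a fixed interface of the system.

```python
def build_dau_context(instance):
    """Collect expectations from an open workflow instance (modeled here as
    a plain dict, a stand-in for the internal workflow database)."""
    return {
        "previous_actors": instance.get("actors", []),
        "acquired_data":   instance.get("data", {}),
        "expected_types":  instance.get("expected_types", []),
        "expected_sender": instance.get("expected_sender"),
        "references":      instance.get("references", {}),
    }

# A hypothetical open order awaiting an invoice from a known supplier.
open_order = {
    "actors": ["purchasing clerk"],
    "data": {"order_no": 25},
    "expected_types": ["XY_Ltd_Invoice", "XY_Ltd_Product_Req"],
    "expected_sender": "ACME Corp",
    "references": {"date": "1997-06-01"},
}
```

A DAU operation receiving this context can restrict classification to the expected document types and verify extracted attributes against the acquired data.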

As a final step, the extracted results are passed (optionally after verification by a human expert) to the data fields of the corporate database or document container as specified in the workflow. In this way, DAU can be embedded as an information supplier into a WFMS or corporate memory.

3.3 Changes in Knowledge Sources

An organizational memory can be expected to change dynamically during use [1]: it permanently incorporates new knowledge and may forget old and now irrelevant knowledge chunks. However, such behaviour, although desired, causes problems for all software components, such as the DAU system, that exploit the mutating knowledge represented in the OMIS. These problems concern the consistency between the organizational memory and the results of the component: possibly, some relevant knowledge changed while the latest results were being computed. In principle, there are two extreme ways to cope with this situation:

1. Forget all intermediate results of the current computations after possibly relevant changes in the organizational memory.

2. Ignore changes in the organizational memory and accept a somehow fuzzy validity of results.

In the VirtualOffice system, we will try to represent options in between these extremes as a property of DAU tasks.
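One conceivable option between the two extremes, expressed as such a task property, is to invalidate intermediate results only when a change actually touches a knowledge chunk the task declared as a dependency. This is purely our illustrative sketch, not the VirtualOffice design.

```python
def result_still_valid(depends_on, changed_chunks):
    """A task's result stays valid unless one of its declared knowledge
    dependencies was among the chunks changed during the computation."""
    return not (set(depends_on) & set(changed_chunks))
```

A task depending only on the customer database would, under this policy, survive a change to the product catalogue, whereas the forget-all extreme would discard it and the ignore-changes extreme would even survive a customer database update.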

4 Conclusion

Summing up, this paper presented fundamental arguments for employing document analysis methods to keep an organizational memory up to date with newly received unstructured documents. This is necessary to answer questions concerning, for instance, a company's correspondence by snail mail and e-mail. However, the corresponding analysis tasks require much more flexibility than state-of-the-art document analysis applications, as well as the ability of the DAU system to exploit corporate knowledge. Our suggestion for coping with these requirements is to apply methods from knowledge-based systems.

In this paper, we sketched an ontology of documents, document analysis tasks, and operations in order to bridge the terminological gap between more generic knowledge about typical documents and the application-specific representation of organizational memories. This method enables document analysis modules to access corporate knowledge when needed, while avoiding excessive effort for integrating the DAU system into an existing OMIS.

The applicability of the DAU system is increased by a modular software architecture that is configured according to the dynamically changing requirements and expectations of the OMIS. Information on the current context of document analysis can be retrieved from a workflow model. The abstract specification of DAU tasks enables the application of a knowledge-based configuration system to determine the configuration of the system that best complies with the current context.

In fact, both problems are probably not specific to the application of document analysis systems. Kühn and Abecker proposed a modular architecture for generic OMISs [1]. All these modules require and deliver knowledge chunks; they have to communicate and must be kept consistent. Hence, this paper may be viewed as a case study on the integration of such modules, using ontologies enhanced by application-specific concepts for knowledge interchange between the modules, and knowledge-based configuration systems to implement context-sensitive behaviour of the modules.


This work has been supported by the German federal ministry of education, science, research and technology (BMBF) under contract FKZ-ITW-9702 (Project VirtualOffice).


[1] Otto Kühn and Andreas Abecker. Corporate memories for knowledge management in industrial practice: Prospects and challenges. Journal of Universal Computer Science, 1997. submitted.

[2] S. Baumann, M.H. Malburg, and C. Wenzel. May document analysis tools bridge the gap between paper and workflows? In Proceedings IFCIS: International Conference on Cooperative Information Systems (CoopIS-96), Brussels, Belgium, pages 135-142, June 1996.

[3] A. Dengel, R. Bleisinger, F. Fein, R. Hoch, F. Hönes, M. Malburg. OfficeMAID -- A System for Office Mail Analysis, Interpretation and Delivery. In: L. Spitz and A. Dengel (eds.), Document Analysis Systems, World Scientific Publishing Co. Inc., Singapore (1995), pp. 52-75.

[4] C. Wenzel, R. Hoch. Text Categorization of Scanned Documents Applying a Rule-based Approach. Fourth Annual Symposium on Document Analysis and Information Retrieval, Las Vegas, Nevada, April 24-26, 1995, pp. 333-346.

[5] C. Wenzel. Supporting Information Extraction from Printed Documents by Lexico-Semantic Pattern Matching. Fourth International Conference on Document Analysis and Recognition, Ulm, Germany, August 18-20, 1997. To appear.

[6] M. H. Malburg. Address Recognition with Robust NLU Technology. Florida Artificial Intelligence Research Symposium, Special Track on Real-World Natural Language Understanding, Key West, Florida, May 1996.

[7] V. I. Levenshtein. Binary Codes Capable of Correcting Deletions, Insertions, and Reversals. Soviet Phys. Dokl., vol.10, no.8, 1966, pp. 707-710.

[8] A. Weigel, S. Baumann, J. Rohrschneider. Lexical Postprocessing by Heuristic Search and Automatic Determination of the Edit Costs. Third International Conference on Document Analysis and Recognition, Montreal, Canada, August 1995, pp. 857-860.

[9] Knowledge Systems Laboratory, Institute for Information Technology, National Research Council Canada. FuzzyCLIPS Version 6.02A User's Guide. September 1994.

[10] Thomas R. Gruber. Toward principles for the design of ontologies used for knowledge sharing. In Nicola Guarino and Roberto Poli, editors, Formal Ontology in Conceptual Analysis and Knowledge Representation. Kluwer Academic Publishers, 1993.

[11] Thomas R. Gruber, Jay M. Tenenbaum, and Jay C. Weber. Towards a knowledge medium for collaborative product development. In John S. Gero, editor, Proceedings on the Second International Conference on Artificial Intelligence in Design, pages 413--432. Kluwer Academic Publishers, 1992.

[12] M. Heinrich and E.-W. Jüngst. A resource-based paradigm for the configuring of technical systems from modular components. In CAIA-91: Proc. of the 7th IEEE Conference on AI Applications, 1991.

[13] O. Najmann and B. Stein. A theoretical framework for configuration. In Proceedings of the 5th IEAAIE, 1992.

[14] A. Günter. Wissensbasiertes Konfigurieren - Ergebnisse aus dem Projekt PROKON. Infix Verlag, Sankt Augustin, 1995.

[15] F. Baader, H.-J. Burckert, A. Günter, and W. Nutt, editors. WRKP-96: Proceedings of the Workshop on Knowledge Representation and Configuration, number D-96-04 in DFKI Document, August 1996.

[16] Harald Meyer auf'm Hofe. What is still to do in order to solve configuration problems in practice? In [15], pages 25-32, 1996.

[17] K. Schwab. Conception, Development and Implementation of a Computer Based Office System for the Modeling of Activity Classes and Execution of and Control of Activities, Diss., University of Bamberg, Germany 1993.

[18] S. Jablonski, C. Bussler. Workflow Management - Modeling Concepts, Architecture and Implementation. International Thomson Computer Press, 1996.