KnoWoGen

Abstract

Current publicly available knowledge work data collections lack diversity, extensive annotations, and contextual information about the users and their documents. These issues hinder objective and comparable data-driven evaluations and optimizations of knowledge work assistance systems. Due to the considerable resources needed to collect such data in real-life settings and the necessity of data censorship, collecting such a dataset appears nearly impossible. For this reason, we propose a configurable, multi-agent knowledge work dataset generator. This system simulates collaborative knowledge work among agents, producing Large Language Model-generated documents and accompanying data traces. Additionally, the generator captures all background information from its configuration or created during the simulation process in a knowledge graph. Finally, the resulting dataset can be used and shared without privacy or confidentiality concerns.

Authors

Examples (Prompts and Generated Documents)

A) Seed Document (Without Predecessor)

Prompt:

<s>[INST] You are a helpful, creative tool that generates documents that seem like real documents. Be creative in how to structure the document. You are encouraged to generate artificial information and add additional content not stated in the prompt to enrich the content. Please always use HTML to encode your answer but do not use CSS styling. Never output gaps or input fields but just put in imaginary information without stating that it is generated. [...]
Do not include links, thus do not use <a> tags. Do not include images or figures. Do not give any additional comments or notes besides the generated document as it has highly undesirable effects. Output a long document. [...]
Please generate a document as described below:
Description:
Generate an detailed, innovative project proposal for an industry project with a background in the Automotive Engineering domain. [/INST]

Generated document:

B) E-mail Threads

Summarization-based Reply Generation:

Prompt (Initial e-mail):

<s>[INST] You are a helpful, creative tool that generates documents that seem like real documents. Be creative in how to structure the document. You are encouraged to generate artificial information and add additional content not stated in the prompt to enrich the content. Please always use HTML to encode your answer but do not use CSS styling. Never output gaps or input fields but just put in imaginary information without stating that it is generated. [...]
Do not include links, thus do not use <a> tags. Do not include images or figures. Do not give any additional comments or notes besides the generated document as it has highly undesirable effects. Output a long document. [...]
Only write the described email and no additional responses. Do not include the subject or other email header fields such as email addresses and just output the email text. Do not be too enthusiastic. Put the sender's name at the end of the email.
Please generate a document as described below:
Description:
Write an email from Emma Stoneheart to Arthur Winterborne, Olivia Moonshadow, Theodore Sundown inviting them to a planning meeting of the gothic architecture course [/INST]

Generated document (Initial e-mail):

Prompt (Reply):

<s>[INST] You are a helpful, creative tool that generates documents that seem like real documents. Be creative in how to structure the document. You are encouraged to generate artificial information and add additional content not stated in the prompt to enrich the content. Please always use HTML to encode your answer but do not use CSS styling. Never output gaps or input fields but just put in imaginary information without stating that it is generated. [...]
Do not include links, thus do not use <a> tags. Do not include images or figures. Do not give any additional comments or notes besides the generated document as it has highly undesirable effects. Output a long document. [...]
Please generate a document as described below:
Description:
Write an email from Olivia Moonshadow answering to the given summarized email. Do only address some aspects of the previous email.
Previous Email: [Summary of the first e-mail] [/INST]

Generated document (Reply):
Question-based Reply Generation:
- Variant 1: Implicit Question Generation:
  
  Prompt (Initial e-mail):
  
  <s>[INST] You are a helpful, creative tool that generates documents that seem like real documents. Be creative in how to structure the document. You are encouraged to generate artificial information and add additional content not stated in the prompt to enrich the content. Please always use HTML to encode your answer but do not use CSS styling. Never output gaps or input fields but just put in imaginary information without stating that it is generated. [...]
  Do not include links, thus do not use <a> tags. Do not include images or figures. Do not give any additional comments or notes besides the generated document as it has highly undesirable effects. Output a long document. [...]
  Include exactly one specific question to one of the receivers.
  Only write the described email and no additional responses. Do not include the subject or other email header fields such as email addresses and just output the email text. Do not be too enthusiastic. Put the sender's name at the end of the email.
  Please generate a document as described below:
  Description:
  Write an email from Alice Johnson to Benjamin Brown, Carolina Perez, Daniel Lee inviting them to a planning meeting of the Machine Learning course and in particular the different topics that should be addressed. Include an agenda. [/INST]
  
  Generated document (Initial e-mail):
  
  Prompt (Reply):
  
  <s>[INST] You are a helpful, creative tool that generates documents that seem like real documents. Be creative in how to structure the document. You are encouraged to generate artificial information and add additional content not stated in the prompt to enrich the content. Please always use HTML to encode your answer but do not use CSS styling. Never output gaps or input fields but just put in imaginary information without stating that it is generated. [...]
  Do not include links, thus do not use <a> tags. Do not include images or figures. Do not give any additional comments or notes besides the generated document as it has highly undesirable effects. Output a long document. [...]
  Please generate a document as described below:
  Description:
  Write an email from Carolina Perez answering to the following question.
  Question: Could you please bring some research papers or articles on the current state of deep learning and its applications to the meeting?
  Previous Email: [See e-mail above] [/INST]
  
  Generated document (Reply):
  
  Knowledge graph:
- Variant 2: Explicit Question Generation:
  
  Prompt (Initial e-mail):
  
  <s>[INST] You are a helpful, creative tool that generates documents that seem like real documents. Be creative in how to structure the document. You are encouraged to generate artificial information and add additional content not stated in the prompt to enrich the content. Please always use HTML to encode your answer but do not use CSS styling. Never output gaps or input fields but just put in imaginary information without stating that it is generated. [...]
  Do not include links, thus do not use <a> tags. Do not include images or figures. Do not give any additional comments or notes besides the generated document as it has highly undesirable effects. Output a long document. [...]
  Only write the described email and no additional responses. Do not include the subject or other email header fields such as email addresses and just output the email text. Do not be too enthusiastic. Put the sender's name at the end of the email.
  Please generate a document as described below:
  Description:
  Write an email from Alexander James to Victoria Anne, Mark David, Emily Rose inviting them to a planning meeting of the intermediate English as a foreign language course. [/INST]
  
  Generated document (Initial e-mail):
  
  The following questions were generated: Questions
  
  Prompt (Second Reply):
  
  <s>[INST] You are a helpful, creative tool that generates documents that seem like real documents. Be creative in how to structure the document. You are encouraged to generate artificial information and add additional content not stated in the prompt to enrich the content. Please always use HTML to encode your answer but do not use CSS styling. Never output gaps or input fields but just put in imaginary information without stating that it is generated. [...]
  Do not include links, thus do not use <a> tags. Do not include images or figures. Do not give any additional comments or notes besides the generated document as it has highly undesirable effects. Output a long document. [...]
  Please generate a document as described below:
  Description:
  Write an email from Victoria Anne answering to the following email and posing the following question to the receiver: What new materials could be incorporated into the intermediate English as a foreign language course?
  Previous Email: [See e-mail above] [/INST]
  
  Generated document (Reply):
  
  Prompt (Second Reply):
  
  <s>[INST] You are a helpful, creative tool that generates documents that seem like real documents. Be creative in how to structure the document. You are encouraged to generate artificial information and add additional content not stated in the prompt to enrich the content. Please always use HTML to encode your answer but do not use CSS styling. Never output gaps or input fields but just put in imaginary information without stating that it is generated. [...]
  Do not include links, thus do not use <a> tags. Do not include images or figures. Do not give any additional comments or notes besides the generated document as it has highly undesirable effects. Output a long document. [...]
  Please generate a document as described below:
  Description:
  Write an email from Alexander answering to the given email.
  Previous Email: [See e-mail above] [/INST]
  
  Generated document (Reply):
  
  Knowledge graph:

Experiments

Experiment 1: Comparing the authenticity between real and generated documents

In this experiments, we let participants rate the authenticity of real and generated documents on a 7-point Likert-scale. Higher scores mean that the participants perceived the document as being highly authentic.
The following chart visualizes the score distribution for real and generated documents:

Experiment 2: Assessing the quality of generated e-mail threads

In this experiments, we let participants rate the quality of generated e-mail threads on a 5-point Likert-scale.
We asked the following questions:

Single-turn threads:
- How natural is the social interaction between sender and receiver(s)?
  E.g. does the sender address the receivers in a consistent manner (always uses first name or last name)
- How natural is the conversation overall disregarding social aspects?
  E.g. many repetitions of the same content decreases the naturalness
- How well does the reply respect the content of the previous email?
  Is the content of the reply and the previous email consistent or are there any content discrepancies? Does the reply respect the information mentioned in the previous email?
- How well does the reply address and answer existing questions posed to the respective receiver?
  Are all questions posed to the respective receiver answered? And are the answers complete?
Multi-turn threads:
- How good are the contentual advances of the communication?
  Are new contents produced during the conversation or is it "getting nowhere"?
- How well do the replies respect the preceding conversation (not only the last email)?
  Are the replies' contents consistent with respect to the previous communication? Or are there discrepancies between a reply and one of the previous emails?

Below are the links to the generated e-mail threads used in the study:

Single-turn threads:
- Thread 1: E-mail 1 - E-mail 2
- Thread 2: E-mail 1 - E-mail 2
Multi-turn threads:
- Thread 3: E-mail 1 - E-mail 2 - E-mail 3 - E-mail 4
- Thread 4: E-mail 1 - E-mail 2 - E-mail 3 - E-mail 4

Papers

Desiree Heim, Christian Jilek, Adrian Ulges, Andreas Dengel:
Towards Synthesizing E-Mail Conversations as Part of Knowledge Work Datasets with Large Language Models
24th International Conference on Knowledge Engineering and Knowledge Management (EKAW 2024), Amsterdam, Netherlands, Nov 26-28, 2024 (accepted for publication)
Desiree Heim, Christian Jilek, Adrian Ulges, Andreas Dengel:
Using Large Language Models to Generate Authentic Multi-agent Knowledge Work Datasets
AI@WORK Workshop at the Informatik Festival 2024, Wiesbaden, Germany, Sep 24-26, 2024

KnoWoGen – The Knowledge Work Dataset Generator