KnoWoGen – The Knowledge Work Dataset Generator

Abstract

Current publicly available knowledge work data collections lack diversity, extensive annotations, and contextual information about the users and their documents. These issues hinder objective and comparable data-driven evaluations and optimizations of knowledge work assistance systems. Due to the considerable resources needed to collect such data in real-life settings and the necessity of data censorship, collecting such a dataset appears nearly impossible. For this reason, we propose a configurable, multi-agent knowledge work dataset generator. This system simulates collaborative knowledge work among agents, producing Large Language Model-generated documents and accompanying data traces. Additionally, the generator captures all background information, whether defined in its configuration or created during the simulation, in a knowledge graph. Finally, the resulting dataset can be used and shared without privacy or confidentiality concerns.
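
For illustration, the following minimal Python sketch outlines the kind of simulation loop described above: configured agents work on tasks, an LLM call turns each task into a document, and the provenance is recorded as triples in a knowledge graph. All class, function, and parameter names here are illustrative assumptions, not the actual KnoWoGen API.

from dataclasses import dataclass, field

@dataclass
class Agent:
    name: str
    role: str    # e.g. "project manager"
    domain: str  # e.g. "Automotive Engineering"

@dataclass
class KnowledgeGraph:
    triples: list = field(default_factory=list)

    def add(self, subject: str, predicate: str, obj: str) -> None:
        # Record one background fact as a (subject, predicate, object) triple.
        self.triples.append((subject, predicate, obj))

def generate_document(task_description: str) -> str:
    # Placeholder for the LLM call that turns a task prompt into an HTML document.
    return f"<html><body>Document for: {task_description}</body></html>"

def simulate(agents: list, tasks: list, graph: KnowledgeGraph) -> list:
    # Assign tasks to agents round-robin, generate one document per task,
    # and log the provenance of each document in the knowledge graph.
    documents = []
    for i, task in enumerate(tasks):
        author = agents[i % len(agents)]
        document = generate_document(f"{task}, written by a {author.role} in {author.domain}")
        documents.append(document)
        graph.add(author.name, "worksOn", task)
        graph.add(task, "producedDocument", f"doc-{i}")
    return documents

agents = [Agent("Alice", "project manager", "Automotive Engineering"),
          Agent("Bob", "engineer", "Automotive Engineering")]
graph = KnowledgeGraph()
documents = simulate(agents, ["project proposal", "status report e-mail"], graph)
print(graph.triples)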

Authors


Examples (Prompts and Generated Documents)

A) Seed Document (Without Predecessor)

Prompt:

<s>[INST] You are a helpful, creative tool that generates documents that seem like real documents. Be creative in how to structure the document. You are encouraged to generate artificial information and add additional content not stated in the prompt to enrich the content. Please always use HTML to encode your answer but do not use CSS styling. Never output gaps or input fields but just put in imaginary information without stating that it is generated. [...]
Do not include links, thus do not use <a> tags. Do not include images or figures. Do not give any additional comments or notes besides the generated document as it has highly undesirable effects. Output a long document. [...]
Please generate a document as described below:
Description:
Generate a detailed, innovative project proposal for an industry project with a background in the Automotive Engineering domain. [/INST]
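
The sketch below shows, purely as an assumed setup, how a prompt in this [INST] format could be sent to an instruction-tuned model via the Hugging Face transformers text-generation pipeline; the model name, decoding parameters, and shortened instruction text are placeholders, not necessarily those used by KnoWoGen.

from transformers import pipeline

INSTRUCTIONS = (
    "You are a helpful, creative tool that generates documents that seem like real documents. "
    "Please always use HTML to encode your answer but do not use CSS styling."
)

def build_prompt(description: str) -> str:
    # Wrap the task description in a Mistral-style [INST] template;
    # the tokenizer adds the beginning-of-sequence token automatically.
    return (f"[INST] {INSTRUCTIONS}\n"
            f"Please generate a document as described below:\n"
            f"Description:\n{description} [/INST]")

# Model name and decoding parameters are placeholders for illustration.
generator = pipeline("text-generation", model="mistralai/Mistral-7B-Instruct-v0.2")

prompt = build_prompt(
    "Generate a detailed, innovative project proposal for an industry project "
    "with a background in the Automotive Engineering domain."
)

result = generator(prompt, max_new_tokens=2048, do_sample=True, temperature=0.8)
print(result[0]["generated_text"])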

Generated document:

B) E-mail Threads


Experiments

Experiment 1: Comparing the authenticity of real and generated documents

In this experiment, we asked participants to rate the authenticity of real and generated documents on a 7-point Likert scale. Higher scores indicate that participants perceived the document as more authentic.
The following chart visualizes the score distribution for real and generated documents:
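
As a hypothetical sketch, such a distribution chart could be produced with matplotlib as follows; the rating lists are placeholder values only and do not reflect the actual study results.

from collections import Counter
import matplotlib.pyplot as plt

# Placeholder ratings on the 7-point scale; NOT the actual study data.
real_ratings = [7, 6, 6, 5, 7, 4, 6]
generated_ratings = [6, 5, 6, 7, 5, 4, 5]

scale = list(range(1, 8))
real_counts = Counter(real_ratings)
generated_counts = Counter(generated_ratings)

width = 0.4
plt.bar([s - width / 2 for s in scale], [real_counts[s] for s in scale],
        width=width, label="real documents")
plt.bar([s + width / 2 for s in scale], [generated_counts[s] for s in scale],
        width=width, label="generated documents")
plt.xticks(scale)
plt.xlabel("authenticity rating (1 = not authentic, 7 = highly authentic)")
plt.ylabel("number of ratings")
plt.legend()
plt.show()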

Experiment 2: Assessing the quality of generated e-mail threads

In this experiment, we asked participants to rate the quality of generated e-mail threads on a 5-point Likert scale.
We asked the following questions:


Below are the links to the generated e-mail threads used in the study:


Papers