ptpDG - A Purchase-To-Pay Dataset Generator for Evaluating Knowledge-Graph-Based Services

Welcome to the project site of ptpDG, a purchase-to-pay dataset generator for evaluating knowledge-graph-based services. This is work in progress and part of the research projects SensAI and Innoprom (EFRE Rheinland-Pfalz).

Authors: Michael Schulze, Markus Schröder, Christian Jilek, Andreas Dengel

Abstract: ptpDG is a labeled-dataset generator that produces various data assets for evaluating knowledge graph construction approaches and downstream knowledge services in the purchase-to-pay domain. While organizations sell, purchase and complain about products in a multi-agent-system simulation, a ground truth knowledge graph emerges with different kinds of purchase-to-pay processes. Based on this knowledge graph, heterogeneous electronic purchase-to-pay documents such as e-invoices, credit notes and orders are generated. Noise patterns that we have frequently encountered in real industrial data are then added to those documents. Finally, a provenance graph is generated which contains provenance information between document elements and ground truth triples. In this way, for such privacy-sensitive scenarios, ptpDG enables data-driven evaluation and its publication.

The codebase is available here.

Dataset Generation Tutorial

This short tutorial describes how a dataset with 105k triples was generated with ptpDG. The codebase can be found here.

The generated dataset from this tutorial is available here.

Configuration

First, we adjusted the properties in the file config/config.properties:

Simulation related

Line1: The simulation is round-based, which means that time and time distances are represented in rounds. For this dataset, the number of rounds is set to 60. Only integers are allowed.
Line2: In a simulation round, organizations place orders. The number of tries of placing an order per round is set to 4.
Line3: Whether a try is successful or not depends on the probability specified here. For this dataset, it is set to 0.8.
Line4: Orders as well as all other documents are sent from organizations to organizations. In a round, organizations process their postboxes. The mean number of processed documents per round is set to 6 here.
Line5: Complaints are created when organizations process invoices in their postboxes and decide to complain about them. The probability that an organization complains about an invoice is set to 0.7 here. This number affects the creation of credit notes because credit notes are created when complaints are processed in the postboxes.
Line6: Invoices are created when organizations process purchase orders in their postboxes. In some cases, the result is a partial invoice. The probability of this is set to 0.7 here.
Line7: The mean quantity per item is set to 6. This only affects the total amounts of the synthetic documents.
Line8: The standard deviation of the quantity per item is set to 3 (which also only affects the total amounts of the synthetic documents).
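
To get a feeling for how the order-related settings (Line1 to Line3) interact, the following minimal sketch computes the expected number of successfully placed orders per organization. It assumes that every try succeeds independently with the configured probability and that the number of tries per round is fixed; how ptpDG actually samples these values may differ. All function names are hypothetical and not part of the ptpDG codebase.

```python
import random

ROUNDS_OF_SIMULATION = 60            # Line1 of the tutorial configuration
ORDER_TRIES_PER_ROUND = 4            # Line2
PROBABILITY_OF_PLACING_ORDER = 0.8   # Line3


def expected_orders_per_round(tries: int, p_success: float) -> float:
    """Expected successful orders of one organization in one round,
    assuming independent tries (an assumption of this sketch)."""
    return tries * p_success


def simulate_orders(rounds: int, tries: int, p_success: float, seed: int = 0) -> int:
    """Count successful order placements of one organization over all rounds."""
    rng = random.Random(seed)
    placed = 0
    for _ in range(rounds):
        for _ in range(tries):
            # A try succeeds with probability p_success.
            if rng.random() < p_success:
                placed += 1
    return placed


if __name__ == "__main__":
    per_round = expected_orders_per_round(ORDER_TRIES_PER_ROUND, PROBABILITY_OF_PLACING_ORDER)
    print("expected orders per round:", per_round)
    print("simulated orders over all rounds:",
          simulate_orders(ROUNDS_OF_SIMULATION, ORDER_TRIES_PER_ROUND, PROBABILITY_OF_PLACING_ORDER))
```

Under these assumptions, one organization places about 3.2 orders per round on average with the tutorial settings.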

Patterns related

Line9: The following probabilities are set in order to control how often particular noise patterns are applied in the generated documents. In Line 9, the probability that there is no document reference in a document is set to 0.2.
Line10: The probability that a document reference consists only of the last digits is set to 0.5.
Line11: The probability that a person's name or its abbreviation (using artificially generated names) is entered in the document reference field is set to 0.1.
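
The effect of the three pattern probabilities (Line9 to Line11) on a single document reference can be sketched as follows. This is only an illustration: the order in which the patterns are tried, the number of digits kept for a partial reference, and the function name are assumptions of this sketch, and the example reference and name are made up.

```python
import random
from typing import Optional


def apply_reference_noise(
    reference: str,
    person_name: str,
    rng: random.Random,
    p_no_reference: float = 0.2,       # Line9
    p_partial_reference: float = 0.5,  # Line10
    p_person_name: float = 0.1,        # Line11
    keep_digits: int = 4,              # assumed length of "the last digits"
) -> Optional[str]:
    """Return the noisy content of a document reference field.

    None means that the document carries no reference at all.
    The order of the checks is an assumption of this sketch.
    """
    if rng.random() < p_no_reference:
        return None
    if rng.random() < p_person_name:
        return person_name
    if rng.random() < p_partial_reference:
        return reference[-keep_digits:]
    return reference


if __name__ == "__main__":
    rng = random.Random(42)
    for _ in range(5):
        print(apply_reference_noise("PO-2021-004711", "J. Doe", rng))
```

Each call returns either the full reference, its last digits, the person name, or None.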

Second, we made configurations in the file config/config.ttl:

The sources that labels come from (e.g., for product and company names) are specified with ConditionalLiteralGeneration instances. Hierarchies, such as those of states, countries, and cities, are configured with hierarchy instances. Relationships, such as those between organizations and cities and between organizations and items, are configured with relationship instances. Here, the cardinality can be set to OneToOne, OneToMany, ManyToOne or ManyToMany.

Execution

When ptpDG is executed, a folder called output-dataset is created in the root folder. It contains one folder and three files:

synthetic-documents: This folder contains all generated synthetic purchase-to-pay documents where noise patterns are added.
ground-truth-kg.ttl: the ground truth knowledge graph
provenance-kg.ttl: the provenance knowledge graph
metadata.json: configuration parameters as well as some statistics about the dataset (more statistics beyond knowledge graph size are coming soon).
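
After a run, the layout described above can be verified with a few lines of Python. This is a minimal sketch using only the standard library; the helper name is hypothetical, and the path to output-dataset depends on where ptpDG was executed.

```python
from pathlib import Path
from typing import List

# Entries expected inside output-dataset, as listed above.
EXPECTED_ENTRIES = {
    "synthetic-documents": "dir",
    "ground-truth-kg.ttl": "file",
    "provenance-kg.ttl": "file",
    "metadata.json": "file",
}


def missing_output_entries(output_dir: Path) -> List[str]:
    """Return the names of expected entries that are absent or of the wrong kind."""
    missing = []
    for name, kind in EXPECTED_ENTRIES.items():
        path = output_dir / name
        present = path.is_dir() if kind == "dir" else path.is_file()
        if not present:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Assumes the script is started from the ptpDG root folder.
    print(missing_output_entries(Path("output-dataset")))
```

An empty list means that the folder and all three files are present.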

Statistics of the generated tutorial dataset:

Triples in ground truth knowledge graph:	105 314
Triples in provenance knowledge graph:	672 161
Synthetic documents generated:	2 277
Commercial (normal) invoices generated:	243
Partial invoices generated:	706
Final invoices generated:	50
Credit notes generated:	146
Purchase orders generated:	1 132
Invoice lines generated:	5 725
Sum of all processes (order processes count once):	1 328
Incidental purchase order processes (same for in advance):	1 132
Corrective invoicing processes:	146
Partial and final invoicing processes:	50
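
As a quick sanity check, the per-type counts listed above add up to the reported totals. The sketch below simply redoes this arithmetic with the tutorial numbers; the dictionary keys are illustrative names, not identifiers used by ptpDG.

```python
# Document counts per type, taken from the statistics above.
documents = {
    "commercial_invoices": 243,
    "partial_invoices": 706,
    "final_invoices": 50,
    "credit_notes": 146,
    "purchase_orders": 1132,
}

# Process counts per type, taken from the statistics above.
processes = {
    "incidental_purchase_order": 1132,
    "corrective_invoicing": 146,
    "partial_and_final_invoicing": 50,
}

total_documents = sum(documents.values())
total_processes = sum(processes.values())

if __name__ == "__main__":
    print("documents:", total_documents)  # reported: 2 277
    print("processes:", total_processes)  # reported: 1 328
```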

Configuration Overview

Simulation

Parameter	Description
ROUNDS_OF_SIMULATION	Number of rounds the simulation will last. Note that the current version stops hard after the last round, so purchase-to-pay processes that are still running are not simulated to their end. The dataset therefore also contains incomplete processes.
MEAN_OF_ORDER_TRIES_PER_ROUND	Number of tries an organization will make to place an order in a round.
PROBABILITY_OF_PLACING_ORDER	The probability of success of placing an order.
MEAN_OF_PROCESSED_DOCUMENTS_IN_POSTBOX	Number of documents an organization can process within a round.
PROBABILITY_OF_COMPLAINING_ABOUT_AN_INVOICE	Probability that an organization will complain about a received invoice.
PROBABILITY_OF_PARTIAL_INVOICE	Probability of generating a partial invoice when organizations process purchase orders.
MEAN_QUANTITY_PER_ITEM	The mean quantity of an item in an order/invoice/credit note line (position).
SD_QUANTITY_PER_ITEM	Standard deviation of the quantity per item.

Noise Patterns

Parameter	Description
PROBABILITY_OF_PROCESS_PATTERN_NO_DOC_REFERENCE	Probability that a document has no reference to another document.
PROBABILITY_OF_PROCESS_PATTERN_PARTIAL_DOC_REFERENCE	Probability that a document reference in a document has only the last digits.
PROBABILITY_OF_PROCESS_PATTERN_PERSON_NAME	Probability that a person name is entered in the field for document references.
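
Put together, the tutorial settings might look like the properties text in the sketch below, which also parses and validates it. Two things are assumptions here: that config/config.properties uses plain key=value syntax, and that the parameters above correspond to Line1 to Line11 of the tutorial in table order. The helper functions are hypothetical and not part of ptpDG.

```python
from typing import Dict, List

# Tutorial values, assuming key=value syntax and table order (see lead-in).
TUTORIAL_CONFIG = """\
ROUNDS_OF_SIMULATION=60
MEAN_OF_ORDER_TRIES_PER_ROUND=4
PROBABILITY_OF_PLACING_ORDER=0.8
MEAN_OF_PROCESSED_DOCUMENTS_IN_POSTBOX=6
PROBABILITY_OF_COMPLAINING_ABOUT_AN_INVOICE=0.7
PROBABILITY_OF_PARTIAL_INVOICE=0.7
MEAN_QUANTITY_PER_ITEM=6
SD_QUANTITY_PER_ITEM=3
PROBABILITY_OF_PROCESS_PATTERN_NO_DOC_REFERENCE=0.2
PROBABILITY_OF_PROCESS_PATTERN_PARTIAL_DOC_REFERENCE=0.5
PROBABILITY_OF_PROCESS_PATTERN_PERSON_NAME=0.1
"""


def parse_properties(text: str) -> Dict[str, str]:
    """Parse simple key=value lines, skipping blanks and comment lines."""
    props: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def validate(props: Dict[str, str]) -> List[str]:
    """Return a list of human-readable problems; empty if none were found."""
    problems = []
    for key, value in props.items():
        if key.startswith("PROBABILITY_"):
            try:
                ok = 0.0 <= float(value) <= 1.0
            except ValueError:
                ok = False
            if not ok:
                problems.append(f"{key} must be a number between 0 and 1, got {value!r}")
    # The tutorial states that only integers are allowed for the number of rounds.
    if not props.get("ROUNDS_OF_SIMULATION", "").isdigit():
        problems.append("ROUNDS_OF_SIMULATION must be a non-negative integer")
    return problems


if __name__ == "__main__":
    print(validate(parse_properties(TUTORIAL_CONFIG)))
```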

Plausibility Check	Expected Results	Results from Tutorial Dataset	Resources
Comparison between final invoices and final-partial invoicing processes	Number of final invoices equals number of final-partial invoicing processes	Final invoices: 50; Final-partial invoicing processes: 50	SPARQL Queries/Results
Comparison between orders and invoices	Number of purchase orders is larger than the sum of commercial (normal) invoices, partial invoices and final invoices (this is because partial processes exist)	Purchase orders: 1 132; Commercial invoices: 243; Partial invoices: 706; Final invoices: 50; Sum of invoices: 999	SPARQL Queries/Results
Comparison between credit notes and corrective invoicing processes	Number of credit notes and number of corrective invoicing processes are equal	Credit notes: 146; Corrective invoicing processes: 146	SPARQL Queries/Results
Comparison between partial invoices and final invoices	Number of partial invoices is at least the number of final invoices	Partial invoices: 706; Final invoices: 50	SPARQL Queries/Results
Comparison between purchase orders and incidental order processes	Number of purchase orders equals number of incidental order processes	Purchase orders: 1 132; Incidental order processes: 1 132	SPARQL Queries/Results
Comparison between the parameter for partial invoices and number of partial invoices	Parameter set for the probability of partial invoices is approximately equal to the share of partial invoices among all invoices	Set probability of partial invoices: 0.7; Partial invoices: 706; Commercial invoices: 243; Final invoices: 50; Sum of all invoices: 999; 706/999 = 0.7067	SPARQL Queries/Results
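
The plausibility checks above can also be written as plain assertions over the dataset statistics. The sketch below uses the numbers of the tutorial dataset; the variable names and the tolerance of 0.05 for the last check are assumptions of this sketch, and the SPARQL queries linked in the table remain the authoritative source for the counts.

```python
# Counts from the tutorial dataset.
purchase_orders = 1132
commercial_invoices = 243
partial_invoices = 706
final_invoices = 50
credit_notes = 146
incidental_order_processes = 1132
corrective_invoicing_processes = 146
final_partial_invoicing_processes = 50
probability_of_partial_invoice = 0.7  # configured parameter

all_invoices = commercial_invoices + partial_invoices + final_invoices
partial_share = partial_invoices / all_invoices

checks = {
    "final invoices == final-partial invoicing processes":
        final_invoices == final_partial_invoicing_processes,
    "purchase orders > sum of invoices":
        purchase_orders > all_invoices,
    "credit notes == corrective invoicing processes":
        credit_notes == corrective_invoicing_processes,
    "partial invoices >= final invoices":
        partial_invoices >= final_invoices,
    "purchase orders == incidental order processes":
        purchase_orders == incidental_order_processes,
    # Tolerance of 0.05 is an assumption of this sketch.
    "share of partial invoices close to configured probability":
        abs(partial_share - probability_of_partial_invoice) < 0.05,
}

if __name__ == "__main__":
    for name, passed in checks.items():
        print("OK  " if passed else "FAIL", name)
```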
