ptpDG - A Purchase-To-Pay Dataset Generator


Welcome to the project site of ptpDG, a purchase-to-pay dataset generator for evaluating knowledge-graph-based services. This is work in progress and part of the research projects SensAI and Innoprom (EFRE Rheinland-Pfalz).

Authors: Michael Schulze, Markus Schröder, Christian Jilek, Andreas Dengel

Abstract: ptpDG is a labeled-dataset generator that produces various data assets for evaluating knowledge graph construction approaches and downstream knowledge services in the purchase-to-pay domain. While organizations sell, purchase and complain about products in a multi-agent-system simulation, a ground truth knowledge graph emerges with different kinds of purchase-to-pay processes. Based on this knowledge graph, heterogeneous electronic purchase-to-pay documents such as e-invoices, credit notes and orders are generated. Noise patterns that we have frequently encountered in real industrial data are then added to those documents. Finally, a provenance graph is generated which contains provenance information between document elements and ground truth triples. In this way, ptpDG enables data-driven evaluation, and the publication of its data, for such privacy-sensitive scenarios.

The codebase is available here.

Dataset Generation Tutorial

This short tutorial describes how a dataset with 105k triples was generated with ptpDG. The codebase can be found here.

The generated dataset from this tutorial is available here.


First, we have adjusted properties in the file config/

Second, we have made configurations in the file config/config.ttl:

Sources of labels (e.g., for product and company names) are specified with ConditionalLiteralGeneration instances. Hierarchies, such as for states, countries, and cities, are configured with hierarchy instances. Relationships, such as between organizations and cities and between organizations and items, are configured with relationship instances. Here, the cardinality can be set to OneToOne, OneToMany, ManyToOne or ManyToMany.


When ptpDG is executed, a folder called output-dataset is created in the root folder with one folder and three files:

Statistics of the generated tutorial dataset:

Triples in ground truth knowledge graph: 105 314
Triples in provenance knowledge graph: 672 161
Synthetic documents generated: 2 277
Commercial (normal) invoices generated: 243
Partial invoices generated: 706
Final invoices generated: 50
Credit notes generated: 146
Purchase orders generated: 1 132
Invoice lines generated: 5 725
Sum of all processes (order processes count once): 1 328
Incidental purchase order processes (same for in advance): 1 132
Corrective invoicing processes: 146
Partial and final invoicing processes: 50
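
The totals in this list can be reproduced from the individual counts. The following minimal sketch only redoes that arithmetic; the dictionary keys are illustrative names and not identifiers from ptpDG, and the assumption that the three process counts are the summands of the process total is inferred from the numbers adding up.

```python
# Document counts copied from the statistics listed above.
documents = {
    "commercial_invoices": 243,
    "partial_invoices": 706,
    "final_invoices": 50,
    "credit_notes": 146,
    "purchase_orders": 1132,
}

# Process counts copied from the statistics listed above.
processes = {
    "incidental_purchase_order": 1132,
    "corrective_invoicing": 146,
    "partial_and_final_invoicing": 50,
}

total_documents = sum(documents.values())
total_processes = sum(processes.values())

print("Synthetic documents:", total_documents)
print("Sum of all processes:", total_processes)
```

Both sums match the reported totals of 2 277 documents and 1 328 processes.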

Configuration Overview


Parameter: Description
ROUNDS_OF_SIMULATION: Number of rounds the simulation lasts. Note that the current version stops hard after the last round, so purchase-to-pay processes that are still running are not simulated to their end. As a result, incomplete processes are also created.
MEAN_OF_ORDER_TRIES_PER_ROUND: Mean number of attempts an organization makes to place an order in a round.
PROBABILITY_OF_PLACING_ORDER: Probability that an attempt to place an order succeeds.
MEAN_OF_PROCESSED_DOCUMENTS_IN_POSTBOX: Mean number of documents an organization can process within a round.
PROBABILITY_OF_COMPLAINING_ABOUT_AN_INVOICE: Probability that an organization complains about a received invoice.
PROBABILITY_OF_PARTIAL_INVOICE: Probability of generating a partial invoice when organizations process purchase orders.
MEAN_QUANTITY_PER_ITEM: Mean quantity of an item in an order, invoice or credit note line (position).
SD_QUANTITY_PER_ITEM: Standard deviation of the quantity per item.
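
To illustrate how such parameters could drive a simulation step, here is a minimal Python sketch. It is not taken from the ptpDG codebase: the function names, the parameter values, the normal distribution for quantities, the clamping to at least 1 and the fixed number of tries are all assumptions made for illustration.

```python
import random

# Hypothetical values; the values used for the tutorial dataset are not listed here.
MEAN_QUANTITY_PER_ITEM = 10.0
SD_QUANTITY_PER_ITEM = 3.0
PROBABILITY_OF_PLACING_ORDER = 0.5


def sample_quantity(rng, mean, sd):
    """Draw a line quantity from a normal distribution (assumption),
    rounded to an integer and clamped to at least 1."""
    return max(1, round(rng.gauss(mean, sd)))


def count_placed_orders(rng, tries, probability):
    """Count how many of `tries` attempts succeed, each with `probability`."""
    return sum(1 for _ in range(tries) if rng.random() < probability)


rng = random.Random(42)  # fixed seed so that runs are repeatable
quantities = [
    sample_quantity(rng, MEAN_QUANTITY_PER_ITEM, SD_QUANTITY_PER_ITEM)
    for _ in range(5)
]
# The number of tries is fixed at 4 here purely for illustration.
placed = count_placed_orders(rng, tries=4, probability=PROBABILITY_OF_PLACING_ORDER)

print("Sampled quantities:", quantities)
print("Orders placed in this round:", placed)
```

Passing the random number generator explicitly keeps the sketch reproducible, which is convenient when a generated dataset has to be regenerated for comparison.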

Noise Patterns

Parameter: Description
PROBABILITY_OF_PROCESS_PATTERN_NO_DOC_REFERENCE: Probability that a document has no reference to another document.
PROBABILITY_OF_PROCESS_PATTERN_PARTIAL_DOC_REFERENCE: Probability that a document reference in a document has only the last digits.
PROBABILITY_OF_PROCESS_PATTERN_PERSON_NAME: Probability that a person name is entered in the field for document references.
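
The three noise patterns can be pictured as transformations of a document reference field. The sketch below is a hypothetical illustration and not ptpDG's implementation: the function `apply_reference_noise`, the reference format `PO-2021-000123`, the placeholder person name, the number of kept digits and the assumption that the patterns are mutually exclusive are all made up for this example.

```python
import random


def apply_reference_noise(reference, rng, p_no_ref, p_partial, p_person,
                          person_name="Jane Doe", kept_digits=4):
    """Return a possibly noisy version of a document reference.

    Assumption: at most one pattern applies per reference, chosen by a
    single random draw against the cumulative probabilities.
    """
    roll = rng.random()
    if roll < p_no_ref:
        return None                        # no reference to another document
    if roll < p_no_ref + p_partial:
        return reference[-kept_digits:]    # only the last digits remain
    if roll < p_no_ref + p_partial + p_person:
        return person_name                 # person name entered instead
    return reference                       # reference left untouched


rng = random.Random(7)
for _ in range(5):
    print(apply_reference_noise("PO-2021-000123", rng, 0.1, 0.2, 0.1))
```

Setting one probability to 1.0 and the others to 0.0 forces a single pattern, which is a simple way to inspect each pattern in isolation.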

Plausibility Checks

Plausibility check: Comparison between final invoices and final-partial invoicing processes
Expected result: The number of final invoices equals the number of final-partial invoicing processes.
Results from tutorial dataset: Final invoices: 50; final-partial invoicing processes: 50
Resources: SPARQL Queries/Results

Plausibility check: Comparison between orders and invoices
Expected result: The number of purchase orders is slightly larger than the sum of commercial (normal) invoices, partial invoices and final invoices (this is because partial processes exist).
Results from tutorial dataset: Purchase orders: 1 132; commercial invoices: 243; partial invoices: 706; final invoices: 50; sum of invoices: 999
Resources: SPARQL Queries/Results

Plausibility check: Comparison between credit notes and corrective invoicing processes
Expected result: The number of credit notes equals the number of corrective invoicing processes.
Results from tutorial dataset: Credit notes: 146; corrective invoicing processes: 146
Resources: SPARQL Queries/Results

Plausibility check: Comparison between partial invoices and final invoices
Expected result: The number of partial invoices is at least the number of final invoices.
Results from tutorial dataset: Partial invoices: 706; final invoices: 50
Resources: SPARQL Queries/Results

Plausibility check: Comparison between purchase orders and incidental order processes
Expected result: The number of purchase orders equals the number of incidental order processes.
Results from tutorial dataset: Purchase orders: 1 132; incidental order processes: 1 132
Resources: SPARQL Queries/Results

Plausibility check: Comparison between the parameter for partial invoices and the number of partial invoices
Expected result: The configured probability of partial invoices is approximately equal to the share of partial invoices among all invoices.
Results from tutorial dataset: Set probability of partial invoices: 0.7; partial invoices: 706; commercial invoices: 243; final invoices: 50; sum of all invoices: 999; 706/999 = 0.7067
Resources: SPARQL Queries/Results
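
The plausibility checks above can also be replayed on the reported counts without querying the graphs. The following minimal sketch does this in plain Python; the dictionary keys are illustrative names, and the tolerance of 0.05 for the probability comparison is an assumption, since no threshold is stated for "approximately".

```python
# Counts copied from the tutorial dataset results listed above.
counts = {
    "purchase_orders": 1132,
    "commercial_invoices": 243,
    "partial_invoices": 706,
    "final_invoices": 50,
    "credit_notes": 146,
    "final_partial_processes": 50,
    "corrective_processes": 146,
    "incidental_order_processes": 1132,
}
SET_PROBABILITY_OF_PARTIAL_INVOICE = 0.7
TOLERANCE = 0.05  # assumption: no threshold is given in the text

invoices_total = (
    counts["commercial_invoices"]
    + counts["partial_invoices"]
    + counts["final_invoices"]
)
partial_share = counts["partial_invoices"] / invoices_total

checks = {
    "final invoices == final-partial processes":
        counts["final_invoices"] == counts["final_partial_processes"],
    "purchase orders > sum of invoices":
        counts["purchase_orders"] > invoices_total,
    "credit notes == corrective processes":
        counts["credit_notes"] == counts["corrective_processes"],
    "partial invoices >= final invoices":
        counts["partial_invoices"] >= counts["final_invoices"],
    "purchase orders == incidental order processes":
        counts["purchase_orders"] == counts["incidental_order_processes"],
    "partial share close to configured probability":
        abs(partial_share - SET_PROBABILITY_OF_PARTIAL_INVOICE) < TOLERANCE,
}

for name, passed in checks.items():
    print(f"{'OK  ' if passed else 'FAIL'} {name}")
```

For a newly generated dataset, the counts would come from the SPARQL query results instead of being typed in by hand.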