Authors: Michael Schulze, Markus Schröder, Christian Jilek, Andreas Dengel
Abstract: ptpDG is a labeled-dataset generator that generates various data assets for evaluating knowledge graph construction approaches and downstream knowledge services in the purchase-to-pay domain: While organizations sell, purchase and complain about products in a multi-agent-system simulation, a ground truth knowledge graph emerges with different kinds of purchase-to-pay processes. Based on this knowledge graph, heterogeneous electronic purchase-to-pay documents such as e-invoices, credit notes and orders are generated. Noise patterns that we have frequently encountered in real industrial data are then added to those documents. Finally, a provenance graph is generated which contains provenance information between document elements and ground truth triples. In this way, ptpDG enables data-driven evaluation and its publication for such privacy-sensitive scenarios.
The codebase is available here.
This short tutorial describes how a dataset with 105k triples was generated with ptpDG.
The generated dataset from this tutorial is available here.
First, we have adjusted properties in the file config/config.properties:
Second, we have made configurations in the file config/config.ttl:
Sources of labels (e.g., product and company names) are specified with ConditionalLiteralGeneration instances. Hierarchies, such as those of states, countries, and cities, are configured with hierarchy instances. Relationships, such as those between organizations and cities and between organizations and items, are configured with relationship instances. For relationships, the cardinality can be set to OneToOne, OneToMany, ManyToOne or ManyToMany.
Statistic | Value |
---|---|
Triples in ground truth knowledge graph | 105 314 |
Triples in provenance knowledge graph | 672 161 |
Synthetic documents generated | 2 277 |
Commercial (normal) invoices generated | 243 |
Partial invoices generated | 706 |
Final invoices generated | 50 |
Credit notes generated | 146 |
Purchase orders generated | 1 132 |
Invoice lines generated | 5 725 |
Sum of all processes (order processes count once) | 1 328 |
Incidental purchase order processes (same for in advance) | 1 132 |
Corrective invoicing processes | 146 |
Partial and final invoicing processes | 50 |
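
The totals in the statistics above can be cross-checked with a few lines of Python. This is only a small sanity check on the published numbers; the dictionary keys are names chosen here for illustration and are not identifiers from ptpDG.

```python
# Counts copied from the dataset statistics above.
documents = {
    "commercial_invoices": 243,
    "partial_invoices": 706,
    "final_invoices": 50,
    "credit_notes": 146,
    "purchase_orders": 1132,
}
processes = {
    "incidental_purchase_order": 1132,
    "corrective_invoicing": 146,
    "partial_and_final_invoicing": 50,
}

# The per-type counts should add up to the reported totals.
total_documents = sum(documents.values())
total_processes = sum(processes.values())

print(total_documents)  # 2277, the reported number of synthetic documents
print(total_processes)  # 1328, the reported sum of all processes
```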
Parameter | Description |
---|---|
ROUNDS_OF_SIMULATION | Number of rounds the simulation runs. Note that the current version stops hard after the last round, so purchase-to-pay processes that are still in progress are not simulated to their end. As a result, the dataset also contains incomplete processes. |
MEAN_OF_ORDER_TRIES_PER_ROUND | Mean number of tries an organization will make to place an order in a round. |
PROBABILITY_OF_PLACING_ORDER | The probability of success of placing an order. |
MEAN_OF_PROCESSED_DOCUMENTS_IN_POSTBOX | Mean number of documents an organization can process within a round. |
PROBABILITY_OF_COMPLAINING_ABOUT_AN_INVOICE | Probability that an organization will complain about a received invoice. |
PROBABILITY_OF_PARTIAL_INVOICE | Probability of generating a partial invoice when organizations process purchase orders. |
MEAN_QUANTITY_PER_ITEM | The mean quantity of an item in an order/invoice/credit note line (position). |
SD_QUANTITY_PER_ITEM | Standard deviation of the quantity per item. |
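
To inspect such settings programmatically, a properties file can be read with a few lines of Python. The sketch below assumes that the parameters are stored as plain `KEY=VALUE` lines; `parse_properties` is a hypothetical helper, not part of ptpDG. Apart from `PROBABILITY_OF_PARTIAL_INVOICE=0.7`, which is the value used for the tutorial dataset, the sample values are placeholders.

```python
def parse_properties(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and '#' or '!' comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line[0] in "#!":
            continue
        key, sep, value = line.partition("=")
        if sep:  # ignore lines without an '=' separator
            props[key.strip()] = value.strip()
    return props


# Illustrative content only: ROUNDS_OF_SIMULATION is a placeholder value,
# not the setting used to generate the tutorial dataset.
SAMPLE = """
# simulation parameters
ROUNDS_OF_SIMULATION=10
PROBABILITY_OF_PARTIAL_INVOICE=0.7
"""

props = parse_properties(SAMPLE)
rounds = int(props["ROUNDS_OF_SIMULATION"])
p_partial = float(props["PROBABILITY_OF_PARTIAL_INVOICE"])

# Probabilities must lie in [0, 1]; fail early on a mistyped value.
assert 0.0 <= p_partial <= 1.0
print(rounds, p_partial)
```

To check a real configuration, replace `SAMPLE` with the contents of the file, for example `open("config/config.properties", encoding="utf-8").read()`.
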
Parameter | Description |
---|---|
PROBABILITY_OF_PROCESS_PATTERN_NO_DOC_REFERENCE | Probability that a document has no reference to another document. |
PROBABILITY_OF_PROCESS_PATTERN_PARTIAL_DOC_REFERENCE | Probability that a document reference in a document has only the last digits. |
PROBABILITY_OF_PROCESS_PATTERN_PERSON_NAME | Probability that a person name is entered in the field for document references. |
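
The effect of these three noise patterns on a document-reference field can be pictured with the following sketch. It is not ptpDG's implementation: the function `apply_reference_noise`, the assumption that the three patterns are mutually exclusive and selected with a single random draw, and the number of digits kept are all assumptions made here for illustration.

```python
import random


def apply_reference_noise(doc_ref, person_name, p_none, p_partial, p_person,
                          rng, keep_digits=4):
    """Return a possibly noisy value for a document-reference field.

    Assumption: at most one pattern applies per field, chosen by one draw.
    """
    u = rng.random()  # uniform in [0, 1)
    if u < p_none:
        return None                    # no reference to another document
    if u < p_none + p_partial:
        return doc_ref[-keep_digits:]  # only the last digits remain
    if u < p_none + p_partial + p_person:
        return person_name             # a person name in the reference field
    return doc_ref                     # reference left unchanged


rng = random.Random(42)  # fixed seed for reproducibility
# Hypothetical identifier, name and probabilities.
for _ in range(5):
    print(apply_reference_noise("PO-2021-004711", "Jane Doe",
                                p_none=0.1, p_partial=0.2, p_person=0.1, rng=rng))
```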
Plausibility Check | Expected Results | Results from Tutorial Dataset | Resources |
---|---|---|---|
Comparison between final invoices and final-partial invoicing processes | Number of final invoices equals number of final-partial invoicing processes | Final invoices: 50; final-partial invoicing processes: 50 | SPARQL Queries/Results |
Comparison between orders and invoices | Number of purchase orders is slightly larger than the sum of commercial (normal) invoices, partial invoices and final invoices (because partial processes exist) | Purchase orders: 1 132; commercial invoices: 243; partial invoices: 706; final invoices: 50; sum of invoices: 999 | SPARQL Queries/Results |
Comparison between credit notes and corrective invoicing processes | Number of credit notes equals number of corrective invoicing processes | Credit notes: 146; corrective invoicing processes: 146 | SPARQL Queries/Results |
Comparison between partial invoices and final invoices | Number of partial invoices is at least the number of final invoices | Partial invoices: 706; final invoices: 50 | SPARQL Queries/Results |
Comparison between purchase orders and incidental order processes | Number of purchase orders equals number of incidental order processes | Purchase orders: 1 132; incidental order processes: 1 132 | SPARQL Queries/Results |
Comparison between the parameter for partial invoices and number of partial invoices | The probability set for partial invoices is approximately equal to the share of partial invoices among all invoices | Set probability of partial invoices: 0.7; partial invoices: 706; commercial invoices: 243; final invoices: 50; sum of all invoices: 999; 706/999 = 0.7067 | SPARQL Queries/Results |
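
Once the counts have been retrieved (in the tutorial via the linked SPARQL queries), the plausibility checks themselves are simple comparisons. The following sketch restates them in Python using the numbers reported above. The dictionary keys are names chosen here, and the tolerance of 0.02 for the last check is an assumption, not a threshold defined by ptpDG.

```python
# Counts reported for the tutorial dataset.
counts = {
    "purchase_orders": 1132,
    "commercial_invoices": 243,
    "partial_invoices": 706,
    "final_invoices": 50,
    "credit_notes": 146,
    "incidental_order_processes": 1132,
    "corrective_invoicing_processes": 146,
    "final_partial_invoicing_processes": 50,
}
configured_p_partial = 0.7  # probability of partial invoices set for the tutorial

invoices_total = (counts["commercial_invoices"]
                  + counts["partial_invoices"]
                  + counts["final_invoices"])
partial_share = counts["partial_invoices"] / invoices_total

checks = {
    "final invoices == final-partial invoicing processes":
        counts["final_invoices"] == counts["final_partial_invoicing_processes"],
    "purchase orders > sum of invoices":
        counts["purchase_orders"] > invoices_total,
    "credit notes == corrective invoicing processes":
        counts["credit_notes"] == counts["corrective_invoicing_processes"],
    "partial invoices >= final invoices":
        counts["partial_invoices"] >= counts["final_invoices"],
    "purchase orders == incidental order processes":
        counts["purchase_orders"] == counts["incidental_order_processes"],
    # Tolerance of 0.02 is an assumption made for this sketch.
    "partial share close to configured probability":
        abs(partial_share - configured_p_partial) < 0.02,
}

for name, passed in checks.items():
    print(f"{'OK  ' if passed else 'FAIL'} {name}")
```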