This post is by Alessandro Berti, Software Engineer in the Process And Data Science group at RWTH Aachen University. Contact him via email for further inquiries.
Process Mining is a branch of Data Science that aims to extract process-related information from the event data contained in information systems. In recent years, many algorithms, and a general-purpose open source framework (ProM 6), have been developed for process discovery, conformance checking, and machine learning on event data.
The amount of event data stored by modern information systems is steadily increasing, and this makes it progressively more difficult to apply process mining to real-life event data on mainstream workstations with any open source process mining framework. Hence, exploring more scalable storage techniques, in-memory data structures, and more performant algorithms is a pressing need.
In the last few years, columnar and key-value storage techniques have been evaluated in a process mining context.
Column-based storage systems
Column-based storage systems are optimized to read event logs “by columns”, making it possible to choose which attributes are needed before loading the file. In this way, only the values of those attributes are parsed, which makes the load operation faster. Since each column of the file contains data of the same format (integer, string, …), more effective compression techniques can be applied.
The Apache Parquet format provides an implementation of columnar storage. Apache Parquet is supported by a large number of big data frameworks (Apache Hive, Apache Drill, Apache Impala, Apache Pig, Apache Spark, Cascading, …). Among the compression algorithms supported by Parquet are Snappy and Gzip. Snappy is faster at compressing/decompressing, while Gzip obtains better compression ratios at the expense of performance. With Snappy compression, Parquet remains very efficient while avoiding the compression/decompression overhead of Gzip.
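As an illustration, the following minimal sketch (assuming pandas with the pyarrow engine installed; the XES-style column names are an assumption made for illustration, not part of the Parquet format) writes a small event table with Snappy compression and reads back only the columns that are needed:

```python
import pandas as pd

# A small event table; XES-style column names are an assumption for illustration.
events = pd.DataFrame({
    "case:concept:name": ["1", "1", "2"],
    "concept:name": ["register request", "check ticket", "register request"],
    "time:timestamp": pd.to_datetime(
        ["2019-01-01 09:00", "2019-01-01 10:00", "2019-01-02 09:30"]
    ),
})

# Snappy is the default codec in pyarrow; stated explicitly for clarity.
events.to_parquet("events.parquet", compression="snappy")

# Only the two requested columns are parsed and loaded from the file.
activities = pd.read_parquet(
    "events.parquet", columns=["case:concept:name", "concept:name"]
)
print(activities)
```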
Column-based storages map naturally onto a dataframe memory structure. The Pandas Python package offers a convenient implementation of a dataframe, and Apache Spark offers its own dataframe concepts. A dataframe is composed of (a small illustration follows the list):
• A set of indices (corresponding to the different rows of the file).
• A set of typed columns.
• A function that, taking an index and a column, returns a value.
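A minimal pandas sketch of these three ingredients (again with XES-style column names assumed for illustration) could look as follows:

```python
import pandas as pd

df = pd.DataFrame({
    "case:concept:name": ["1", "1", "2"],
    "concept:name": ["register request", "check ticket", "register request"],
    "time:timestamp": pd.to_datetime(
        ["2019-01-01 09:00", "2019-01-01 10:00", "2019-01-02 09:30"]
    ),
})

print(df.index)                    # the set of indices (one per row)
print(df.dtypes)                   # the set of typed columns
print(df.at[1, "concept:name"])    # the function (index, column) -> value: "check ticket"
```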
In Berti, Alessandro. “Increasing scalability of process mining using event dataframes: How data structure matters.” arXiv preprint arXiv:1907.12817 (2019), it is shown that dataframes support the following operations (a sketch follows the list):
• Projection on a given expression
• Grouping function
• Shifting rows
• Concatenation
• Sorting
• Merging of columns
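The following minimal sketch shows how these operations could be expressed on a pandas event dataframe (the column names follow the XES convention and are an assumption made for illustration):

```python
import pandas as pd

# A small event dataframe; XES-style column names are an assumption for illustration.
df = pd.DataFrame({
    "case:concept:name": ["1", "1", "2"],
    "concept:name": ["register request", "check ticket", "register request"],
    "time:timestamp": pd.to_datetime(
        ["2019-01-01 09:00", "2019-01-01 10:00", "2019-01-02 09:30"]
    ),
})

# Projection on a given expression (keep only the events of a given activity)
projected = df[df["concept:name"] == "register request"]

# Grouping function (group the events by case identifier)
grouped = df.groupby("case:concept:name")

# Shifting rows (pair each row with the following row of the dataframe)
shifted = df.shift(-1)

# Concatenation (stack two event dataframes)
concatenated = pd.concat([df, df], ignore_index=True)

# Sorting (by case identifier and timestamp)
sorted_df = df.sort_values(["case:concept:name", "time:timestamp"])

# Merging of columns (join per-case attributes onto the events)
case_attributes = pd.DataFrame({"case:concept:name": ["1", "2"], "channel": ["web", "desk"]})
merged = df.merge(case_attributes, on="case:concept:name")
```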
The same paper shows that process mining operations, such as the calculation of the directly-follows graph, are also possible on dataframes, with two different paradigms (see the sketch after the list):
• Map-Reduce approaches
• Shifting and counting
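As an illustration of the second paradigm, the following sketch computes a directly-follows graph by sorting, shifting and counting on a pandas dataframe (this is a minimal illustration, not the implementation of the paper or of any specific library; the column names are an assumption following the XES convention):

```python
import pandas as pd

# A small event dataframe; XES-style column names are an assumption for illustration.
df = pd.DataFrame({
    "case:concept:name": ["1", "1", "1", "2", "2"],
    "concept:name": ["a", "b", "c", "a", "c"],
    "time:timestamp": pd.to_datetime([
        "2019-01-01 09:00", "2019-01-01 10:00", "2019-01-01 11:00",
        "2019-01-02 09:00", "2019-01-02 10:00",
    ]),
})

# Sort by case and timestamp, then shift the activity and case columns by one row.
df = df.sort_values(["case:concept:name", "time:timestamp"])
df["next_activity"] = df["concept:name"].shift(-1)
df["next_case"] = df["case:concept:name"].shift(-1)

# Keep only pairs of successive events that belong to the same case, and count them.
same_case = df[df["case:concept:name"] == df["next_case"]]
dfg = same_case.groupby(["concept:name", "next_activity"]).size().to_dict()
print(dfg)   # {('a', 'b'): 1, ('a', 'c'): 1, ('b', 'c'): 1}
```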
It is interesting to compare classic event log structures and dataframes on typical process mining operations, such as filtering on attribute values and the computation of the directly-follows graph.
Filtering on attribute values: the average complexity is linear in both cases. The worst-case complexity is quadratic on a classic event log structure, and remains linear on dataframes.
Computation of the directly-follows graph: for both structures, the complexity is linear on average and quadratic in the worst case.
Key-value Stores
Key-value stores are very simple databases that associate a (generally binary) key with a (generally binary) value. Key-value stores have been used in the ProM framework (MapDB) to efficiently retrieve cases of event logs whose size is bigger than the amount of RAM.
The most convenient way to host process mining event logs in key-value stores is to use the case identifier, or the index of the case in the log, as key, and the content of the case as value.
A popular key-value store nowadays is Redis. Redis provides a network interface that can be queried to retrieve the value associated with a key or an interval of keys.
While getting the entire event log from Redis is much more expensive than reading an XES log, it is convenient when only some cases need to be sampled. This is the case for many process discovery algorithms (as a model can be discovered using only a subset of the behavior).
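As a sketch (assuming a Redis server running locally and the redis Python package; the key layout and the JSON serialization are assumptions made for illustration), each case can be stored under its case identifier and only a sample of cases retrieved:

```python
import json
import redis

r = redis.Redis(host="localhost", port=6379)

# Two toy cases, each a list of events (content and structure assumed for illustration).
cases = {
    "1": [{"concept:name": "register request"}, {"concept:name": "check ticket"}],
    "2": [{"concept:name": "register request"}, {"concept:name": "reject request"}],
}

# Store: one key per case, with the serialized list of events as value.
for case_id, events in cases.items():
    r.set("case:" + case_id, json.dumps(events))

# Sample: retrieve only the cases needed by the discovery algorithm.
sampled_ids = ["1"]
sample = {cid: json.loads(r.get("case:" + cid)) for cid in sampled_ids}
print(sample)
```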