Selecting Behavior from Uncertainty: Process Discovery on Uncertain Event Logs
This post has been authored by Marco Pegoraro.
Introduction
Process mining bridges the gap between data science and business process management by extracting insights from event logs—records of activities captured by modern information systems. Traditional discovery techniques assume event data is precise and accurately recorded, but in many real-world settings, logs contain explicit uncertainty, such as ambiguous timestamps or multiple possible activity labels. In [1] we introduce the concept of uncertain event logs, aiming to extend conformance and discovery algorithms to handle data imprecision without discarding valuable information.
Why Uncertainty Matters
In practice, data imperfections arise from manual entries, system delays, or coarse timestamp granularity. For example, two activities may share the same recorded time unit, making their order unclear, or a sensor might register one of several possible activity types. Ignoring such uncertainties can lead to misleading models or force analysts to prune important cases. By explicitly modeling uncertainty, process mining can produce more faithful representations of actual behavior, highlighting both certain and ambiguous aspects of the process.
A Taxonomy of Uncertain Event Logs
Uncertain event data is classified into two main categories:
- Strong uncertainty, where the log lists all possible values for an attribute without probabilities (e.g., an event’s activity label is either “Approve” or “Reject”). Table 1 shows an example of a strongly uncertain trace.
- Weak uncertainty, where a probability distribution over possible values is provided.
Logs displaying, respectively, weak and strong uncertainty on activity labels are also known in the literature as stochastically-known and stochastically-unknown logs [2].
Our focus is on a simplified subset of uncertain behavior that encompasses strong uncertainty on control-flow attributes: activity labels, timestamps (expressed as intervals), and indeterminate events, which may have been recorded without having occurred. This clear taxonomy guides the design of algorithms that handle varying levels of data confidence.

Table 1: An example of strongly uncertain trace. Possible activity labels for events are enclosed in curly braces. Uncertain timestamps are represented by time intervals. Event e3 is indeterminate: it might have been recorded without occurring.
Capturing Uncertain Behavior with Graphs
In [3], we describe an extension of the Inductive Miner family of algorithms able to ingest strongly uncertain event logs.
At the core of the proposed approach is the uncertain directly-follows graph (UDFG)—an extension of the classic directly-follows graph that retains information about ambiguity. Instead of a single directed edge from activity labels A to B representing that B directly follows A, the UDFG records:
- Certain edges, where all traces support the relation A→B.
- Possible edges, where some traces may support A→B under certain resolutions of uncertainty.
The nodes are enriched analogously, with counts for the certain and for the possible executions of each single activity. As a result, the UDFG succinctly encodes where the process behavior (as reflected in the data) is definitive, and where alternative real-life scenarios exist.
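To make the idea tangible, here is a minimal sketch that tallies certain and possible directly-follows relations from a toy strongly uncertain log. The event format (a set of candidate labels per event) and the counting rules are simplifications for this post, not the exact UDFG definition from [3].

```python
from collections import Counter
from itertools import product

# A toy strongly uncertain log: each event is a set of possible activity labels.
# (Uncertain timestamps and indeterminate events are left out to keep the sketch short.)
uncertain_log = [
    [{"a"}, {"b", "c"}, {"d"}],   # the second event is either "b" or "c"
    [{"a"}, {"b"}, {"d"}],
    [{"a"}, {"c"}, {"d"}],
]

certain_df = Counter()   # directly-follows relations realized in every interpretation
possible_df = Counter()  # directly-follows relations realized in at least one interpretation

for trace in uncertain_log:
    for src, tgt in zip(trace, trace[1:]):
        if len(src) == 1 and len(tgt) == 1:
            certain_df[(next(iter(src)), next(iter(tgt)))] += 1
        for s, t in product(src, tgt):   # every resolution of the uncertainty yields one of these
            possible_df[(s, t)] += 1

print("certain edges: ", dict(certain_df))
print("possible edges:", dict(possible_df))
```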
Discovering Models from Uncertain Data
To transform the UDFG into an interpretable process model, we apply inductive mining—a robust technique that produces block-structured models free of spurious behavior. The workflow is:
- Construct the UDFG from the uncertain log, marking edges as certain or possible.
- Filter edges using a configurable set of parameters, which induces inclusion/exclusion criteria for the uncertain aspects of the input log.
- Apply inductive mining: we obtain a process tree from the filtered UDFG, through the Inductive Miner directly-follows algorithm [4].
- Merge results to highlight which parts of the model are supported by all possible interpretations and which depend on resolving uncertainty.
This dual-mining strategy yields two related models: one conservative and one inclusive, giving analysts a spectrum of process variants to consider.
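Continuing the toy example above, the filtering and dual-mining idea could be sketched as follows; the thresholds are illustrative and do not correspond to the exact parameters defined in [3].

```python
def filter_udfg(certain_df, possible_df, min_certain=1, min_possible=1):
    """Derive a conservative and an inclusive edge set from the UDFG counters (illustrative only)."""
    conservative = {edge for edge, n in certain_df.items() if n >= min_certain}
    inclusive = {edge for edge, n in possible_df.items() if n >= min_possible}
    return conservative, inclusive

conservative_edges, inclusive_edges = filter_udfg(certain_df, possible_df)
# Each edge set can now be fed to a DFG-based discovery algorithm, e.g., the Inductive
# Miner directly-follows variant [4], yielding the conservative and the inclusive model.
print(conservative_edges)
print(inclusive_edges)
```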
Experimental Insights
In our experiments on both synthetic and real-world logs, we show that:
- The UDFG can be easily defined and obtained even for large logs with complex uncertainty patterns.
- Models derived from a “traditional” (certain) DFG avoid overfitting to noise but may miss legitimate behavior expressed through uncertainty.
- Inclusive models reveal potential flows that warrant further data cleaning or validation.
Overall, the approach offers filtering mechanisms that can balance precision and fitness, allowing process mining specialists to control how conservatively or aggressively they treat uncertain data.
Conclusion and Future Directions
By embracing rather than discarding uncertainty, this work advances process discovery to better reflect real-life data quality issues. The proposed UDFG and dual inductive mining deliver models that clearly distinguish between guaranteed and hypothetical behavior. We highlight several avenues for future research, including:
- Defining quantitative metrics to compare uncertain models.
- Extending the approach to weak uncertainty with probability distributions.
- Incorporating uncertainty in case identifiers and other perspectives beyond control flow.
For practitioners, this paper offers practical guidance on modeling and visualizing ambiguous traces, ensuring that insights remain grounded in the realities of data collection.
References
- Pegoraro, Marco, and Wil M.P. van der Aalst. “Mining uncertain event data in process mining.” In 2019 International Conference on Process Mining (ICPM), pp. 89-96. IEEE, 2019.
- Bogdanov, Eli, Izack Cohen, and Avigdor Gal. “Conformance checking over stochastically known logs.” In International Conference on Business Process Management, pp. 105-119. Cham: Springer International Publishing, 2022.
- Pegoraro, Marco, Merih Seran Uysal, and Wil M.P. van der Aalst. “Discovering process models from uncertain event data.” In International Conference on Business Process Management, pp. 238-249. Cham: Springer International Publishing, 2019.
- Leemans, Sander J.J., Dirk Fahland, and Wil M.P. van der Aalst. “Scalable process discovery and conformance checking.” Software & Systems Modeling 17 (2018): 599-631.
Object-Centric Process Mining: A New Perspective for Sustainability Analysis
This post has been authored by Nina Graves.
Current approaches to organizational sustainability analysis face significant methodological challenges. Life Cycle Assessment (LCA) and similar frameworks require time-consuming manual data collection, rely on static models, and struggle to connect environmental impacts to their process-level causes. This often results in sustainability analysis becoming a reporting exercise rather than an integrated management approach.
Object-Centric Process Mining (OCPM) represents a methodological advancement that may address these limitations. The approach leverages Object-Centric Event Logs (OCEL), which capture relationships between events and multiple objects in business processes. The OCEL data structure contains timestamps, activities, objects, and their attributes—allowing for multi-dimensional analysis.
When enhanced with sustainability metrics, these logs provide a structural foundation for more granular environmental impact assessment. The methodology integrates inventory data, impact factors, and allocation mechanisms directly with process execution data.
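As a rough illustration, a sustainability-enriched object-centric event could look like the sketch below; the field names loosely follow the OCEL JSON format, while the environmental attributes (energy_kwh, co2_kg) are invented for this post.

```python
# A single object-centric event, enriched with invented environmental attributes.
event = {
    "ocel:eid": "e17",
    "ocel:activity": "heat treatment",
    "ocel:timestamp": "2024-03-01T10:15:00",
    "ocel:omap": ["batch-42", "furnace-3"],               # objects related to the event
    "ocel:vmap": {"energy_kwh": 120.0, "co2_kg": 48.0},   # inventory data attached to the event
}

# The related objects with their types and attributes.
objects = {
    "batch-42": {"ocel:type": "ProductionBatch", "ocel:ovmap": {"weight_kg": 350}},
    "furnace-3": {"ocel:type": "Resource", "ocel:ovmap": {}},
}
```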
Recent Work
In our recent explorative paper, we used an exemplary OCEL to explore and discuss the use of OCPM for sustainability assessment. We showcased an approach in which a sustainability-enriched OCEL is used for (1) impact detection and (2) impact allocation, to enable (3) system analysis.
We demonstrated the analytical capabilities to track environmental impacts across the process, supporting the
- determination of sustainability-related data from an OCEL,
- storage of sustainability data using an OCEL,
- automated modelling of complex process landscapes,
- flexible impact allocation, and
- potential automation for impact detection using impact databases.
Furthermore, we showed that the OCEL can support more accurate and flexible impact assessment and analysis by combining the same sustainability data used for traditional sustainability assessment with event data.
Figure 1: Example of more differentiated and accurate impact considerations.
The key differences lie in:
- Multi-level analysis: Environmental impacts are calculated for individual instances (events and objects), which can be aggregated and differentiated, e.g., to activities or object types or by specific attributes.
- Multi-perspective analysis: The environmental impact can be considered with regard to different organizational elements, such as products, resources, the total system, or individual (sub-)processes.
- Combining different reference units: The OCEL allows for the association of relevant primary data with events, (sets of) objects, and event-object combinations. This requires less allocation effort in the pre-processing of the data, enabling a stronger decoupling of impact assessment and impact allocation. This decoupling allows for the previously mentioned increased flexibility; a minimal allocation sketch follows this list.
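Continuing the toy event above, a minimal allocation sketch might distribute an event-level impact over its related objects; equal splitting is just one of many possible allocation rules.

```python
def allocate_event_impact(event, impact_key="co2_kg"):
    """Distribute an event-level impact equally over the objects related to the event."""
    impact = event["ocel:vmap"].get(impact_key, 0.0)
    related = event["ocel:omap"]
    share = impact / len(related) if related else 0.0
    return {obj_id: share for obj_id in related}

print(allocate_event_impact(event))  # {'batch-42': 24.0, 'furnace-3': 24.0}
```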
Naturally, the integration of sustainability data also allows for the application of OCPM techniques for causal investigations and may potentially even support impact management and compliance checking.
PoC Web Application: OCEAn – Object-Centric Environmental Analysis
As a proof of concept, we provide OCEAn—a software tool that links company data with sustainability information. It enables the definition of environmental impact rules, supports semi-automatic data processing, and provides various visualizations of results.
OCEAn supports:
- Integration of environmental data with process event logs
- Definition of impact rules at activity and attribute levels
- Multiple allocation algorithms based on object relationships
- Various visualizations of environmental impacts
Discussion
The research presents both methodological advantages and challenges.
Advantages:
- Leverages existing digital process traces
- Aligns process management with sustainability objectives
- Supports more accurate impact allocation through object relationships
- Enables root cause analysis of environmental hotspots
- Provides a data-driven foundation for ongoing assessment
Limitations:
- Limited by data availability and quality
- Requires identifiable process elements
- Depends on comprehensive domain expertise
- Allocation methodologies require further development
- Extracting OCELs and enhancing them with sustainability data remains challenging
This exploratory work establishes a foundation for further investigation into data-driven sustainability assessment. Future research directions include developing standardized frameworks for sustainability-enhanced OCELs, more sophisticated allocation methodologies, and improved visualization techniques for complex impact relationships.
The work contributes to bridging conceptual gaps between process science and sustainability science, potentially enabling more dynamic and comprehensive environmental performance assessment in organizational contexts.
References:
Find the Paper on Research Gate: https://www.researchgate.net/publication/391736048_Object-Centric_Process_Mining_for_Semi-Automated_and_Multi-Perspective_Sustainability_Analyses
Repository: https://github.com/rwth-pads/ocel4lca
Developing Tooling for Models-in-the-Middle
This post has been authored by Leah Tacke genannt Unterberg.
In the Cluster of Excellence “Internet of Production” (IoP) of RWTH Aachen University, the concept of Models-in-the-Middle (MitM) has been proposed to avoid the recurring development of custom data pipelines from source to analytics.
The project and its predecessor have been running for over ten years, with over 100 financed researcher positions from over 30 institutes/independent organizational units, each with their own goals. Over time, the isolated development of “data infrastructures” – which has historically not been a research activity in the field of mechanical engineering – has led to much duplicated effort.
As data is the basis for basically all the promises of the project and any shiny AI application, every researcher needs to have some kind of pipeline from data as it is produced by sensors, machines, etc., to algorithms running on their laptop, their institute’s server, or the cloud. As coordination on that front has not been part of the project’s realization, barely any reusable artifacts have been created, shared, and curated within the IoP. This includes IoP-accessible data sets themselves, but that is another story…
To be able to perform data-based research more efficiently, shared data models that can serve as the basis for implementation interfaces across a whole domain may be quite useful. In the domain of process mining, this has been shown to be quite the enabler. See XES, ProM, and now the OCED Model.
Essentially, as any standardization argument goes, an agreed-upon model and representation drives collaboration, reuse/adaptation of existing tools, and thus, broadens the horizon for newcomers into the project.
The following graphic exemplifies this situation.
As the most prominent type of data in the IoP is that coming from sensors and machines during operation, we have introduced a specialized model for it.
It focuses on the aspect of time and specifically aims to support rich time series analysis via the inclusion of time intervals/segments which play a major role in the data pre-processing in this domain, as data is often continuously recorded and needs to be cut into individual operations, regimes, etc.
Consider the following model and schematic example of the proposed data model for Measurement and Event Data (MAED).
To start, we developed a tool for finding, mapping, and exporting MAED-conforming data from relational databases – which typically lie at the heart of a research institute that documents its experiments.
Well, in the best case that is. It can also lie on a thumb drive of a researcher who’s no longer employed there.
Projects like Apache Hive extend the reach of SQL queries to JSON and object-oriented data stores, so assuming some tabular interface is not too far-fetched for most use-cases.
MAED (MitM) Exporter
Enter the MAED Exporter WebApp. It can connect to most relevant DBMSs (via SQLAlchemy) and generate MAED data sets in the proposed text-based representation.
The project is available here and hosted at https://maed-exporter.cluster.iop.rwth-aachen.de/.
The user can iteratively explore, transform, and map the tables of the connected DB (or uploaded sqlite file).
Finally, the resulting virtual table definitions, mappings, and queried and transformed data sets can be exported.
Once the mapping work is done and saved as a preset, data sets can be exported on demand, and, in future extensions, be queried directly by visualization tooling.
Generalization
When developing the exporter, we envisioned supporting not just this specific data model in its current iteration, but making the implementation configurable to any data model expressed in a JSON configuration file.
The result is that the exporter app itself lives on the level of “connecting relational data to meta-modeled concepts” instead of “connecting relational data to MAED”. Hence, it can be considered a MitM exporter, not just a MAED exporter.
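Purely as an illustration of what such a configuration could express (the actual schema used by the exporter is not reproduced here), a meta-modeled data model might be declared along these lines:

```python
# Hypothetical meta-model configuration: concepts, their attributes, and the relations the
# exporter should map source tables onto. The real MAED/MitM schema may look different.
mitm_model = {
    "concepts": {
        "Measurement": {"attributes": ["sensor", "unit", "value", "timestamp"]},
        "Segment": {"attributes": ["label", "start", "end"]},
        "Event": {"attributes": ["type", "timestamp"]},
    },
    "relations": [
        {"from": "Measurement", "to": "Segment", "kind": "contained_in"},
        {"from": "Event", "to": "Segment", "kind": "marks_boundary_of"},
    ],
}
```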
The next logical step after getting data in the specified representation is of course to display it.
MitM Superset
To make this tooling robust, scalable, reusable, and extensible to the extent that single-person research projects typically are not, we decided to customize Apache Superset, an open-source “dashboarding” tool.
While the upfront effort required is immense compared to starting from scratch, there are great benefits to adapting software that has 1200 contributors. For one, the project is developed to a high standard with regard to security and scalability. Further, some features make it very extensible, e.g., the ability to create new visualizations entirely in the frontend by providing them as a plugin that reuses code from the Superset core.
To connect MitM data to such a tool, we first developed a canonical relational representation of the “family of MitMs” that can be described like MAED.
Based on this, we can create interactive dashboards that give a comprehensive overview of specific data sets.
In the beginning, the process of importing the data into a Superset instance would be manual, but we have started to implement a Superset fork that has “native MitM support”, i.e., that knows about the concept of MitMDataset in addition to the usual Database, Table, Chart and Dashboard.
Our fork includes an architectural extension to the regular Superset docker-compose network: independent MitM services can be added to the network, and their APIs be reached from the Superset front- and backend.
You can follow the development here.
The first hosted instance should be available in mid-April.
Can AI Really Model Your Business Processes? A Deep Dive into LLMs and BPM
This post has been authored by Humam Kourani.
Business process modeling (BPM) is crucial for understanding, analyzing, and improving how a company operates. Traditionally, this has involved painstaking manual work. But what if Artificial Intelligence could lend a hand? Large Language Models (LLMs) are showing promise in this area, offering the potential to automate and enhance the BPM process. Let’s dive into how LLMs are being used, how effective they can be, and what the research shows.
What is Process Modeling (and Why Does it Matter)?
Before we get to the AI, let’s quickly recap the basics. Process modeling is all about representing the steps, actions, and interactions within a business process. The goal is to:
- Understand: Make complex operations clear and visible.
- Analyze: Identify bottlenecks, inefficiencies, and areas for improvement.
- Improve: Optimize workflows for better performance, reduced costs, and increased customer satisfaction.
Enter the LLMs: AI-Powered Process Modeling
The core idea is to leverage the power of LLMs to automatically generate and refine process models based on natural language descriptions. Imagine simply describing your process in plain text, and having an AI create a BPMN diagram for you! The general framework involves:
- Natural language input: You describe the process in words.
- POWL as an intermediate representation: The LLM translates the description into the Partially Ordered Workflow Language (POWL).
- Generation of standard models: The POWL representation is then converted into standard notations like BPMN or Petri nets.
Figure 1: AI-Powered process modeling using POWL for intermediate representation.
Why POWL as an Intermediate Representation?
The Partially Ordered Workflow Language (POWL) [1] serves as a crucial bridge in our AI-powered process modeling framework. Unlike some traditional modeling notations, POWL is designed with a hierarchical, semi-block structure that inherently guarantees soundness. This means we can avoid common modeling errors like deadlocks. Furthermore, POWL has a higher expressive power compared to hierarchical modeling languages that provide similar quality guarantees. The resulting POWL models can be seamlessly converted into standard notations like BPMN and Petri nets for wider use.
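To give a flavor of the intermediate representation, the sketch below encodes a partially ordered model with a conceptual stand-in for POWL’s building blocks; it is not the actual POWL class hierarchy from [1].

```python
from dataclasses import dataclass, field

# Conceptual stand-ins for POWL building blocks: leaf activities, operator nodes
# (e.g., exclusive choice), and a partial order over child nodes.
@dataclass(frozen=True)
class Activity:
    label: str

@dataclass(frozen=True)
class Choice:          # exclusive choice between sub-models
    options: tuple

@dataclass
class PartialOrder:
    nodes: list
    edges: set = field(default_factory=set)  # (i, j) means nodes[i] must precede nodes[j]

# "register" precedes both checks (which are unordered with respect to each other),
# and both checks precede an exclusive choice between "approve" and "reject".
model = PartialOrder(
    nodes=[Activity("register"), Activity("check credit"), Activity("check stock"),
           Choice((Activity("approve"), Activity("reject")))],
    edges={(0, 1), (0, 2), (1, 3), (2, 3)},
)
print(model)
```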
Fine-Tuning vs. Prompt Engineering: A Key Choice
A fundamental question in working with LLMs is how to best tailor them to a specific task:
- Fine-Tuning: Retraining the LLM on a specific dataset. This is highly tailored but expensive and requires significant data.
- Prompt Engineering: Crafting clever prompts to guide the LLM. This is more adaptable and versatile but requires skill in prompt design.
The LLM-Based Process Modeling Framework: A Closer Look
Our framework [2] is iterative and involves several key steps:
1. Prompt Engineering: This includes the following strategies:
- Knowledge Injection: Providing the LLM with specific information about POWL and how to generate POWL models.
- Few-Shot Learning: Giving the LLM examples of process descriptions and their corresponding models.
- Negative Prompting: Telling the LLM what not to do, avoiding common errors.
2. Model Generation: The LLM generates executable code (in this case, representing the POWL model). This code is then validated for correctness and compliance with the coding guidelines and the POWL specifications.
3. Error Handling: The system detects errors (both critical functional errors and less critical qualitative issues) and prompts the LLM to fix them.
4. Model Refinement: Users can provide feedback in natural language, and the LLM uses this feedback to improve the model.
Figure 2: LLM-Based process modeling framework.
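The generation and error-handling loop can be summarized in the following sketch; call_llm, validate, and the prompt wording are placeholders for this post rather than the actual ProMoAI implementation.

```python
def build_prompt(description, previous_code=None, errors=None):
    """Assemble a prompt with instructions, the description, and (optionally) error feedback."""
    parts = ["You generate POWL model code. Avoid deadlocks and undefined activities.",
             f"Process description: {description}"]
    if previous_code is not None:
        parts += [f"Your previous attempt:\n{previous_code}",
                  "It has these problems, please fix them:\n" + "\n".join(errors)]
    return "\n\n".join(parts)

def generate_model(description, call_llm, validate, max_rounds=5):
    """Iteratively ask an LLM for model code, validate it, and feed errors back."""
    code, errors = None, None
    for _ in range(max_rounds):
        code = call_llm(build_prompt(description, code, errors))
        errors = validate(code)          # functional + qualitative checks
        if not errors:
            return code                  # valid model code, ready for conversion to BPMN / Petri nets
    raise RuntimeError("no valid model produced within the allowed attempts")
```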
ProMoAI: A Practical Tool
ProMoAI [3] is a tool that implements this framework. Key features of ProMoAI include:
- Support for Multiple LLMs: It can work with LLMs from various AI providers, including Google, OpenAI, DeepSeek, Anthropic, and DeepInfra.
- Flexible Input: Users can input text descriptions, existing models, or data.
- Multiple Output Views: It can generate models in BPMN, POWL, and Petri net formats.
- Interactive Feedback: Users can provide feedback and see the model updated in real-time.
Figure 3: ProMoAI (https://promoai.streamlit.app/).
Benchmarking the Best: Which LLMs Perform Best?
We’ve benchmarked various state-of-the-art LLMs on their process modeling capabilities [4]. This involves testing quality (how well the generated models match the ground truth, using conformance checking with simulated event logs) and time performance (how long it took to generate the models).
Our extensive testing, using a diverse set of business processes, revealed significant performance variations across different LLMs. Some models consistently achieved higher quality scores, closely approaching the ideal, while others demonstrated faster processing times.
Figure 4: Benchmarking results: average quality score, total time, time per iteration, and number of iterations for different LLMs.
Key Findings:
- A crucial finding was a positive correlation between efficient error handling and overall model quality. LLMs that required fewer attempts to generate a valid, error-free model tended to produce higher-quality results overall.
- Despite the variability in individual runs, we observed consistent quality trends within similar groups of LLMs. This implies that while specific outputs might differ, the overall performance level of a particular LLM type tends to be relatively stable.
- Some speed-optimized models maintained quality comparable to their base counterparts, while others showed a noticeable drop in quality. This highlights the trade-offs involved in optimizing for speed.
Can LLMs Improve Themselves? Self-Improvement Strategies
We’re exploring whether LLMs can improve their own performance through self-evaluation and optimization. Several strategies are being investigated:
- LLM Self-Evaluation: The LLM evaluates and selects the best model from a set of candidates it generates. We found the effectiveness of this strategy to be highly dependent on the specific LLM. Some models showed improvement, while others performed worse after self-evaluation.
- LLM Self-Optimization of Input: The LLM improves the natural language description before generating the model. We found this approach to be generally not effective and could even be counterproductive. Our findings suggest LLMs may lack the specific domain knowledge needed to reliably improve process descriptions.
- LLM Self-Optimization of Output: The LLM refines the generated model itself. This strategy showed the most promise, particularly for models that initially produced lower-quality outputs. While average improvements were sometimes modest, we observed significant gains in specific instances. However, there was also a risk of quality degradation, emphasizing the need for careful prompt design to avoid unintended changes (hallucinations).
Conclusion:
LLMs hold significant potential for transforming business process modeling, moving it from a traditionally manual and expert-driven task towards a more automated and accessible one. The framework we’ve developed, leveraging prompt engineering, a robust error-handling mechanism, and the sound intermediate representation of POWL, provides a viable pathway for translating natural language process descriptions into executable models in standard notations like BPMN. Our evaluation revealed not only variations in performance across different LLMs, but also consistent patterns. We found a notable correlation between efficient error handling and overall model quality and observed consistent performance trends within similar LLMs. We believe that the ability to translate natural language into accurate and useful process models, including executable BPMN diagrams, could revolutionize business operations.
References:
[1] Kourani, H., van Zelst, S.J. (2023). POWL: Partially Ordered Workflow Language. In: Di Francescomarino, C., Burattin, A., Janiesch, C., Sadiq, S. (eds) Business Process Management. BPM 2023. Lecture Notes in Computer Science, vol 14159. Springer, Cham. https://doi.org/10.1007/978-3-031-41620-0_6.
[2] Kourani, H., Berti, A., Schuster, D., van der Aalst, W.M.P. (2024). Process Modeling with Large Language Models. In: van der Aa, H., Bork, D., Schmidt, R., Sturm, A. (eds) Enterprise, Business-Process and Information Systems Modeling. BPMDS EMMSAD 2024. Lecture Notes in Business Information Processing, vol 511. Springer, Cham. https://doi.org/10.1007/978-3-031-61007-3_18.
[3] Kourani, H., Berti, A., Schuster, D., & van der Aalst, W. M. P. (2024). ProMoAI: Process Modeling with Generative AI. Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence. https://doi.org/10.24963/ijcai.2024/1014.
[4] Kourani, H., Berti, A., Schuster, D., & van der Aalst, W. M. P. (2024). Evaluating Large Language Models on Business Process Modeling: Framework, Benchmark, and Self-Improvement Analysis. arXiv preprint. https://doi.org/10.48550/arXiv.2412.00023.
Quantum Computing in Process Mining: A New Frontier
This post has been authored by Alessandro Berti.
Introduction
Process mining is a crucial field for understanding and optimizing business processes by extracting knowledge from event logs. Traditional process mining techniques may encounter limitations as data volume and complexity increase. Quantum computing offers a potential solution by tackling these challenges in a fundamentally different way.
What is Quantum Computing?
Quantum computing utilizes quantum mechanics to solve complex problems that are intractable for classical computers. It employs quantum bits, or qubits, which can represent 0, 1, or a combination of both, enabling parallel computations.
How Can Quantum Computing Assist Process Mining?
Quantum computing can potentially revolutionize process mining by:
- Solving Complex Optimization Problems: Process discovery often involves finding the process model that best fits the event log. Quantum approaches to Quadratic Unconstrained Binary Optimization (QUBO) formulations can efficiently tackle such optimization problems, leading to more accurate and efficient process discovery (a toy QUBO sketch follows this list).
- Enhancing Anomaly Detection: Quantum kernel methods can map process data into a high-dimensional feature space, enabling better anomaly detection. This can help identify unusual or unexpected behavior in processes, leading to quicker interventions and improvements.
- Improving Process Simulation: Quantum Generative Adversarial Networks (QGANs) can generate synthetic event logs that capture complex correlations in data. This can be used for anonymizing sensitive data, augmenting small datasets, and improving the accuracy of process simulation models.
- Developing Advanced Process Models: Quantum Markov Models can potentially express concurrency and complex rules in a way that is not possible with current models. This can lead to more accurate and realistic representations of business processes.
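As a toy illustration of the QUBO idea mentioned above (the encoding of discovery decisions into the Q matrix below is invented for this post), binary variables can indicate which candidate directly-follows edges to include in a model, and the minimum of x^T Q x selects the model structure:

```python
import itertools
import numpy as np

# Toy QUBO: x_i = 1 means "include candidate edge i in the model".
# Diagonal terms reward frequently observed edges; the off-diagonal term penalizes
# including two edges that are assumed to be mutually inconsistent.
Q = np.array([
    [-5.0,  0.0,  3.0],
    [ 0.0, -2.0,  0.0],
    [ 3.0,  0.0, -4.0],
])

best_x, best_val = None, float("inf")
for bits in itertools.product([0, 1], repeat=Q.shape[0]):   # brute force; a quantum annealer
    x = np.array(bits)                                      # would search this space natively
    val = x @ Q @ x
    if val < best_val:
        best_x, best_val = x, val

print(best_x, best_val)   # includes edges 0 and 1, excludes edge 2
```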
Challenges and Opportunities
While quantum computing offers significant potential for process mining, it is still in its early stages of development. The current generation of quantum computers, known as Noisy Intermediate-Scale Quantum (NISQ) devices, has a limited number of qubits and is prone to errors. However, advancements in quantum hardware and software are progressing rapidly.
Conclusion
Quantum computing holds immense promise for revolutionizing process mining by enabling faster, more accurate, and more efficient analysis of complex business processes. It allows for a deeper understanding of intricate relationships within process data. As quantum technologies mature, we can expect to see even more innovative applications of quantum computing in process mining, leading to significant improvements in business process management.
Call to Action
We encourage researchers and practitioners in process mining to explore the potential of quantum computing and contribute to the development of new quantum-enhanced process mining techniques.
The Quest For Efficient Stochastic Conformance Checking
This post has been authored by Eduardo Goulart Rocha.
Conformance Checking is a key field of process mining. At its heart, conformance checking deals with two questions:
- How is a process deviating from its ideal flow?
- How severe are these deviations?
To exemplify that, we consider a simplified hiring process inside a company and two event logs depicting its executions in two distinct business units (in the Netherlands and in Germany):
Event logs for a hiring process in a fictitious company’s Dutch (left) and German (right) business units
The logs contain a few violations. In this case, some applications are reviewed multiple times, and interviews are sometimes conducted before checking an applicant’s background. These can be detected using state-of-the-art conformance checking techniques [1]. A process owner may decide that these violations are acceptable and update the reference model to allow for that, leading to the following model:
Now, both event logs have the same set of variants, and both achieve an alignment-based fitness and precision of 1. However, the two logs are not the same, and we intuitively know which event log is preferable. Repeated CV screenings drain manual resources and should be minimized. Additionally, interviews are more effectively conducted after the background check (as more information on the candidate can be collected).
Why Stochastic Conformance Checking
The dilemma above serves as a starting point for stochastic conformance checking. While all flows are permitted, some are less desirable. Therefore, we would like to capture the preferred behavior of a model and leverage this information when evaluating an event log. In the literature, Stochastic Labeled Petri Nets are used for that. These add weights on top of traditional labeled Petri nets, to be interpreted as “whenever a set of transitions is enabled in a marking, each enabled transition fires with probability proportional to its weight”. Suppose we assign weights as follows:
This makes it clearer that while repeated reviews are possible, they should be the exception, and that the interview should preferably be conducted after checking for references. This assigns ideal relative frequencies (probabilities) to each trace variant as follows:
Now, it is clear that the Dutch business unit is more conforming.
State of the Art in Stochastic Conformance Checking
In its simplest form, stochastic conformance checking aims at quantifying deviations while considering a process model’s stochastic perspective. An ideal stochastic conformance measure should satisfy three properties:
- It is robust to partial mismatches
- It can be efficiently and exactly computed for a broad class of stochastic languages
- It considers the log and model’s stochastic perspective
In recent years, multiple stochastic conformance measures have been proposed [2-6]. Unfortunately, state-of-the-art measures fall short on one or more of these properties. The table below summarizes their shortcomings:
Latest Development
In a recent work presented at ICPM 2024 [7], we made a small step towards improving on that. The main idea is to abstract the model’s and the log’s stochastic languages into an N-gram-like model (called the K-th order Markovian abstraction) that represents the relative frequency of each subtrace in the language. In our running example, when k = 2 we obtain:
Model and Logs abstractions: The relative frequency of each subtrace in their respective languages
RA = Review Application, CR = Check References, I = Interview
This abstraction can then be compared using any existing stochastic conformance measure as illustrated in the framework below:
By using the language’s subtraces (instead of full traces), measures based on this abstraction are naturally more robust to partial mismatches in the data. Furthermore, in [7] we also show that this abstraction can be efficiently computed for bounded, livelock-free stochastic labeled Petri nets. Last, the model’s abstraction does not depend on sampling and considers the model’s full behavior.
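A minimal sketch of the abstraction, ignoring the artificial start/end handling and other details of [7], could compute weighted subtrace frequencies and compare two abstractions with a simple distance as a stand-in for a full stochastic conformance measure (the variant probabilities below are toy numbers, not the ones from the running example):

```python
from collections import Counter

def markovian_abstraction(variant_probs, k=2):
    """Relative frequency of each length-k subtrace, weighted by variant probability."""
    freq = Counter()
    for trace, p in variant_probs.items():
        for i in range(len(trace) - k + 1):
            freq[trace[i:i + k]] += p
    total = sum(freq.values())
    return {gram: w / total for gram, w in freq.items()}

log_nl = {("RA", "CR", "I"): 0.9, ("RA", "RA", "CR", "I"): 0.1}   # Dutch unit (toy numbers)
log_de = {("RA", "I", "CR"): 0.6, ("RA", "CR", "I"): 0.4}         # German unit (toy numbers)

a, b = markovian_abstraction(log_nl), markovian_abstraction(log_de)
# Total-variation-style comparison as a placeholder for a proper stochastic measure.
distance = 0.5 * sum(abs(a.get(g, 0) - b.get(g, 0)) for g in set(a) | set(b))
print(round(distance, 3))
```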
Outlook
While this was some progress, there is still much work to be done in the field. First, the proposed abstraction cannot handle long-term dependencies. Second, we would like to provide diagnostics beyond a single number as feedback to the end-user. Efficient and easy to use conformance methods are imperative for the development of stochastic process mining.
References
- Arya Adriansyah, Boudewijn F. van Dongen, Wil M. P. van der Aalst: Conformance Checking Using Cost-Based Fitness Analysis. EDOC 2011: 55-64
- Sander J. J. Leemans, Wil M. P. van der Aalst, Tobias Brockhoff, Artem Polyvyanyy: Stochastic process mining: Earth movers’ stochastic conformance. Inf. Syst. 102: 101724 (2021)
- Sander J. J. Leemans, Fabrizio Maria Maggi, Marco Montali: Enjoy the silence: Analysis of stochastic Petri nets with silent transitions. Inf. Syst. 124: 102383 (2024)
- Sander J. J. Leemans, Artem Polyvyanyy: Stochastic-Aware Conformance Checking: An Entropy-Based Approach. CAiSE 2020: 217-233
- Artem Polyvyanyy, Alistair Moffat, Luciano García-Bañuelos: An Entropic Relevance Measure for Stochastic Conformance Checking in Process Mining. ICPM 2020: 97-104
- Tian Li, Sander J. J. Leemans, Artem Polyvyanyy: The Jensen-Shannon distance metric for stochastic conformance checking. ICPM Workshops 2024
- Eduardo Goulart Rocha, Sander J. J. Leemans, Wil M. P. van der Aalst: Stochastic Conformance Checking Based on Expected Subtrace Frequency. ICPM 2024: 73-80
Fast & Sound: Improving the Scalability of Synthesis-Rules-Based Process Discovery
This post has been authored by Tsung-Hao Huang.
Process discovery is a cornerstone of process mining, enabling organizations to uncover the behaviors hidden in their event logs and transform them into actionable process models. While many algorithms exist, few balance scalability with the guarantee of sound, free-choice workflow nets. The Synthesis Miner [1] is one of the algorithms that provide these desirable properties while also supporting non-block structures. However, scalability issues have posed challenges for its widespread adoption in real-world applications.
In our recent work [2], we introduced two extensions to address the bottlenecks in the Synthesis Miner’s computation. By leveraging log heuristics and isolating minimal subnets, these extensions reduce the search space and break down generation and evaluation tasks into smaller, more manageable components. The results speak for themselves: our experiments show an average 82.85% reduction in computation time without compromising model quality.
Log heuristics help pinpoint the most likely positions for modifications, reducing the number of nodes and transitions considered for connection. Meanwhile, minimal subnet extraction isolates only the relevant parts of the process model, enabling faster candidate generation and conformance checking. Together, these improvements streamline the process discovery workflow, making it more feasible to apply the Synthesis Miner to larger, real-life event logs.
This work highlights how targeted optimizations can unlock the potential of advanced algorithms in process mining. By addressing scalability challenges, we hope to make tools like the Synthesis Miner more accessible for practical use cases, bridging the gap between process theory and business applications.
[1] Huang, TH., van der Aalst, W.M.P. (2022). Discovering Sound Free-Choice Workflow Nets with Non-block Structures. In: Almeida, J.P.A., Karastoyanova, D., Guizzardi, G., Montali, M., Maggi, F.M., Fonseca, C.M. (eds) Enterprise Design, Operations, and Computing. EDOC 2022. Lecture Notes in Computer Science, vol 13585. Springer, Cham. https://doi.org/10.1007/978-3-031-17604-3_12
[2] Huang, TH., Schneider, E., Pegoraro, M., van der Aalst, W.M.P. (2024). Fast & Sound: Accelerating Synthesis-Rules-Based Process Discovery. In: van der Aa, H., Bork, D., Schmidt, R., Sturm, A. (eds) Enterprise, Business-Process and Information Systems Modeling. BPMDS EMMSAD 2024. Lecture Notes in Business Information Processing, vol 511. Springer, Cham. https://doi.org/10.1007/978-3-031-61007-3_20
Improving Student Success: Helping Study Planners by Evaluating Study Plans with Partial Orders
This post has been authored by Christian Rennert.
In recent work, we’re tackling a critical challenge in higher education: how to help students complete their studies on time. For example, across all study programs offered in 2022 in Germany, nearly 247,000 students received their bachelor’s degree [1]. However, first-time graduates received their degree the same year on average after around 4 years of studies [2]. Understanding why these delays occur and what can be done to address them is vital for improving education systems.
Our recent research focuses on how study plans — blueprints for the sequence of courses that students must take — align with actual student behavior. Using data analysis techniques and partial order alignments from the field of conformance checking, we’ve developed a method to uncover where students deviate from their study plans and how / how much they deviate.
A New Approach That Supports Understanding Study Behavior
We use process models to represent study plans and compare them to the actual traces students take, which are described by an educational event log. An educational event log contains the course enrollments and completions for each student. By modeling these traces as partial orders — an approach that avoids introducing a strict order when courses are taken in parallel during a semester — we can identify mismatches between the planned and actual course orders.
Figure 1: A proposed framework to obtain aggregated deviation information based on order-based and temporal-based deviations from the present study plan.
Our approach can be better explained using the framework shown in Figure 1. Here’s how it works:
1. Model the Study Plan and Translate the Event Log into Partial Orders:
Several study plans can be modeled and checked for the best fit, in case the study plans changed over time, which may well happen in a university setting. The process model of a study plan describes, for each course, the range of terms in which the exam can be taken. It does not allow any course to be skipped, since all courses are mandatory. Further, each student’s exam-taking behavior must be transformed into a partial order: each exam attempt of a course is mapped to a relative term for the student, and then a partial order is created. An actual study plan for which we obtained data is shown in Figure 2. In Figure 3, we show an example educational event log from which one can create a partial order; a small code sketch after Figure 3 illustrates this translation.
Figure 2: RWTH’s computer science study plan from 2018 being modeled as a process model.
Figure 3: Translation from an example educational event log to a partial order.
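The sketch below illustrates the translation of one student’s exam attempts into a partial order; the data format and the relative-term computation are simplified assumptions for this post.

```python
# Exam attempts of one student: (course, term). Field names and the relative-term
# computation are simplified for this sketch.
attempts = [("Programming", "2018w"), ("Calculus", "2018w"),
            ("Data Structures", "2019s"), ("Calculus", "2019s")]  # Calculus retaken in 2019s

terms = sorted({term for _, term in attempts})
relative_term = {term: i + 1 for i, term in enumerate(terms)}  # student's first term = 1

# One node per exam attempt, labeled with the course and the relative term.
nodes = [(course, relative_term[term]) for course, term in attempts]

# Partial order: an attempt precedes another iff its relative term is strictly smaller;
# attempts in the same term remain unordered (taken "in parallel" within a semester).
precedes = {(u, v) for u in nodes for v in nodes if u[1] < v[1]}

print(sorted(nodes))
print(sorted(precedes))
```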
2. Computing the Partial Order Alignments:
To determine deviations between the expected ordering in the study plan and the actual ordering of exams and courses for a student, a partial order alignment [3] is computed. Such an alignment is a sequence of synchronous moves (matching trace and model), log moves, and model moves. Here, the best-fitting alignment is chosen in case several study plans are equally possible for a student.
3. Aggregate Information Based on the Alignment and Term Distances Between the Actual and Expected Terms for a Course:
Based on the partial order alignment, we know that each course occurs on the model side either as a model move or as a synchronous move. Therefore, a course appearing on the log side can be in one of the following relative positions:
- A course occurs synchronously between the process model and the partial order: This means that a student is likely to agree to their exam-taking order with the expected order.
- At an earlier position in the partial order alignment a log move occurs with the course ID and later the mandatory model move: The student is likely to have taken the course earlier than expected.
- At a later position in the partial order alignment a log move occurs with the course ID and earlier the mandatory model move: The student is likely to have taken the course later than expected.
- There is only a model move for a course and no log move: This may be some data quality issue, where a course is missing in the educational event log for the student.
Figure 4: Example of the different ordering-based cases for a total order (right) and a partial order (left) and their partial order alignments. Obtaining such different alignments is also the reason why we use partial orders instead of total orders, which is beyond the scope of this blog post.
The information derived from the partial order alignments is then combined with the temporal distance between the actual and the expected term in which a course should be taken. The distance is calculated in years, and for a cohort of students, the combination of the order-based relation and the corresponding temporal distance for each course is counted, resulting in the data shown in Table 1 and Table 2.
Table 1: An excerpt from the aggregated result for the investigated students and the 2010 study plan.
Table 2: An excerpt from the aggregated result for the investigated students and the 2018 study plan.
Key Insights from the Research
Using data from RWTH Aachen University, we applied this approach to study plans from 2010 and 2018. Here’s what we found:
- Shifting courses between expected and actual position: We can detect whether courses are moved backward or forward relative to the order in which they should be taken. For example, courses C05 and C12 in Table 1 are moved forward by a small fraction of students, while most students comply with the expected position and time. Courses C16 and C18 are more often taken later in the studies and may lead to a longer study duration, since they are most often also delayed by at least one year.
- Well-conforming courses: We can check whether courses conform well between expectation and actual data. For example, course C17 in Table 2 is taken by most students in the right order, and only a small fraction of students take it late.
- Adaptations of study plans over time: Study plans can change over time, and this can affect the conformance between students and study plans. Here, we can compare courses C12 and C17 between the aggregated results in Table 1 and Table 2, which belong to the analysis of the 2010 and the 2018 computer science study plan, respectively. While the change improved conformance for course C17, the changes to course C12 reduced conformance.
Possible Things to Come
Our findings highlight the potential for universities to use this methodology to evaluate and refine their study plans systematically. The derived results may also be used to enrich the event logs directly, e.g., by annotating non-conforming activities with the type of non-conformance. However, since optimal alignments are not necessarily deterministic, there is room for improvement regarding the reproducibility of each run of the presented framework and its interpretability. Further, we could analyze the framework’s capability for educational event logs of other degree programs, and we can imagine using the framework to gain deeper insights into a cohort of students or to compute other metrics as well. The approach may also be applicable to other types of event data that contain relative timings and corresponding process models.
Further Reading
This post is based on the research paper [3] that was accepted for publication and was presented at the EduPM – ICPM 2024 Workshop. Please find the preprint in the references section.
References:
[1] The average study duration of first-degree university graduates in Germany from 2003 to 2023, https://www.statista.com/statistics/584454/bachelor-and-master-degrees-number-universities-germany/, 2024, last access 2025-01-21
[2] Number of Bachelor’s and Master’s degrees in universities in Germany from 2000 to 2023, https://www.statista.com/statistics/584454/bachelor-and-master-degrees-number-universities-germany/, 2023, last access 2025-01-21
[3] Rennert, Christian, Mahsa Pourbafrani, and Wil van der Aalst. “Evaluation of Study Plans using Partial Orders.” arXiv preprint arXiv:2410.03314 (2024).
Detecting and Explaining Process Variability Across Performance Dimensions
This post has been authored by Ali Norouzifar.
In the dynamic landscape of business processes, understanding variability is pivotal for organizations aiming to optimize their workflows and respond to inefficiencies. While much of the focus in process mining has been on detecting changes over time [1], such as concept drift, there is a less-explored yet equally critical dimension to consider: variability across performance metrics like case durations, risk scores, and other indicators relevant to business goals.
In this blog post, we summarize the process variant identification framework presented in [2], outlining the advancements made and potential future directions. The research introduces a novel framework that detects change points across performance metrics using a sliding window technique combined with the earth mover’s distance to evaluate significant control-flow changes. While the framework excels at identifying where variability occurs, the task of explaining these detected control-flow changes across performance dimensions remains an open challenge. This ongoing work, currently under review, aims to bridge that gap. The framework not only pinpoints variability but also provides actionable insights into the reasons and mechanisms behind process changes, empowering organizations to make informed, data-driven decisions.
A Motivating Example
To demonstrate how our algorithm works, we use a simple yet illustrative motivating example. In this example, the exact change points are known, allowing us to clearly show how our technique identifies and explains these changes. We encourage you to explore the implemented tool yourself by visiting our GitHub repository (https://github.com/aliNorouzifar/X-PVI). Using Docker, you can pull the image and follow along with this blog post to test the algorithm in action.
Processes are inherently complex, influenced by various dimensions beyond just time. For instance, consider the BPMN model illustrating a synthetic claim-handling process in Figure 1. In this process, the risk score of a case significantly impacts execution behavior. High-risk cases (risk score between 10 and 100) might be terminated early through cancellation after creating an application, whereas low-risk cases (risk score between 0 and 3) may bypass additional checks, creating distinct behavioral patterns. These variations, often hidden when processes are analyzed from a singular perspective like time, can lead to overlooked opportunities for targeted improvements. The event log corresponding to this example consisting of 10000 cases is available online (https://github.com/aliNorouzifar/X-PVI/blob/master/assets/test.xes). We use this event log in the following sections to show the capabilities of our framework.
Figure 1: BPMN representation of a claim handling process, highlighting variations based on risk score [1].
The Explainable Process Variant Identification Framework
Our framework combines robust detection of control-flow changes with enhanced explainability, focusing on the performance dimensions. Here is how it works:
Change Point Detection with the Earth Mover’s Distance:
First, we sort all the cases based on the selected process indicator. Once the cases are sorted, the user specifies the desired number of buckets, ensuring that each bucket contains an equal frequency of cases. Next, we apply a sliding window approach, where the window spans w buckets on both the left and right sides. This sliding window moves across the range of the performance indicator, from the beginning to the end. At each step, we calculate the earth mover’s distance to measure the difference between the distributions on the left and right sides of the window. Refer to [3] for a detailed explanation of the earth mover’s distance, its mathematical foundations, and its practical applications. The results are visualized in a heatmap, which highlights specific points where significant process changes occur. In Figure 2, we show a simple example considering 15 buckets and a window size of 3.
Figure 2: An example of change detection with the earth mover’s distance.
To determine the change points, we use a user-defined threshold that specifies the significance level for the earth mover’s distance, enabling the segmentation process.
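The sliding-window scheme can be sketched as follows; for brevity, a total-variation distance between trace-variant distributions stands in for the earth mover’s distance used in [2].

```python
from collections import Counter

def variant_distribution(cases):
    """Relative frequency of each trace variant (each case is a tuple of activities)."""
    counts = Counter(cases)
    total = sum(counts.values())
    return {v: c / total for v, c in counts.items()}

def window_distances(buckets, w):
    """For each position, compare the w buckets on the left with the w buckets on the right."""
    distances = []
    for i in range(w, len(buckets) - w + 1):
        left = variant_distribution([c for b in buckets[i - w:i] for c in b])
        right = variant_distribution([c for b in buckets[i:i + w] for c in b])
        # stand-in for the earth mover's distance between the two stochastic languages
        d = 0.5 * sum(abs(left.get(v, 0) - right.get(v, 0)) for v in set(left) | set(right))
        distances.append((i, d))
    return distances

# buckets: cases sorted by the performance indicator (e.g., risk score) and split into
# equal-frequency buckets; change points are positions where the distance exceeds a threshold.
```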
In our motivating example (cf. Figure 1), the risk score of the cases is selected as the performance indicator. Considering 100 buckets, each bucket contains 1% of the total cases. The first bucket includes the 1% of cases with the lowest risk scores, while the last bucket contains the 1% of cases with the highest risk scores. In Figure 3, visualizations for different window sizes (2, 5, 10, and 15) are provided. Using a window size of 2 and a significance threshold of 0.15, we can identify three distinct segments. These change points are utilized to define meaningful process segments that align with our initial understanding of the process dynamics. The identified change points are at risk score values of 3.0 and 10.0, accordingly the process is divided into three segments: (1) cases with risk scores between 0 and 3, (2) cases with risk scores between 3 and 10, and (3) cases with risk scores between 10 and 100.
Figure 3: Control-flow change detection using the earth mover’s distance framework with 100 buckets and different window sizes w∊{2, 5, 10, 15}. The color intensity indicates the magnitude of control-flow changes.
Explainability Extraction:
The explainability extraction framework begins with the feature space generation, where we derive all possible declarative constraints from the set of activities in the event log. This set can potentially be very large. For a detailed explanation of declarative constraints, refer to [4]. Below are some examples of declarative constraints derived from the motivating example event log:
* End(cancel application): cancel application is the last to occur.
* AtLeast1(check documents): check documents occurs at least once.
* Response(create application, cancel application): If create application occurs, then cancel application occurs after create application.
* CoExistence( in-person interview 1, check documents): If in-person interview 1 occurs, check documents occurs as well and vice versa.
For each sliding window, we calculate a specific evaluation metric for each declarative constraint, such as its confidence. For example, if the event create application occurs 100 times within a window, and only 10 of those instances are followed by cancel application, the confidence of the constraint Response(create application, cancel application) in that window is 10/100 or 10%. As the sliding window moves across the range of the process indicator, this evaluation metric is recalculated at each step. This process generates a behavioral signal for each constraint, providing insights into how the behavior evolves across different segments of the process. We apply some preprocessing and keep only the informative signals; for example, features whose signals are constant are removed.
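As an illustration of how one such behavioral signal could be computed (simplified with respect to full declarative semantics), the confidence of Response(a, b) within a window might be estimated as follows:

```python
def response_confidence(traces, a, b):
    """Fraction of occurrences of activity a that are eventually followed by b in the same trace."""
    activations, fulfilled = 0, 0
    for trace in traces:
        for i, act in enumerate(trace):
            if act == a:
                activations += 1
                if b in trace[i + 1:]:
                    fulfilled += 1
    return fulfilled / activations if activations else None

window = [("create application", "cancel application"),
          ("create application", "check documents", "interview")]
print(response_confidence(window, "create application", "cancel application"))  # 0.5
```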
The next step involves clustering the behavioral signals, grouping together signals that exhibit similar changes. This clustering serves as a visual aid, highlighting which signals change in tandem and how these clusters correspond to distinct segments identified during the earth mover’s distance-based change point detection step. By analyzing the correlation between the behavioral signals within these clusters and the identified segments, we gain valuable insights into the control-flow characteristics driving process variations as the range of the process indicator shifts from one segment to another.
Considering a window size of 2 and a significance threshold of 0.15, Figure 4 visualizes the different behavioral signals clustered into 7 groups. In Figure 5, the correlation between the clusters of behavioral signals and the identified segments is illustrated.
Figure 4: Control flow feature clusters derived from behavioral signals and change-point detection.
Figure 5: Correlation analysis between identified segments and behavioral feature clusters. The heatmap highlights positive and negative correlations, illustrating how specific clusters explain segment-level behaviors.
For instance, the strong negative correlation between Cluster 1 and Segment 2 indicates that the behavioral signals in this cluster have significantly higher values in other segments compared to Segment 2. To enhance readability, some examples of declarative constraints from Cluster 1, translated into natural language, are as follows:
* in-person appointment 2 must never occur.
* request documents must never occur.
The strong positive correlation between Cluster 2 and Segment 3 indicates that the behavioral signals in this cluster have significantly higher values in Segment 3 compared to other segments. Below are some examples of declarative constraints from Cluster 2:
* create application and cancel appointment occurs if and only if cancel appointment immediately follows create application.
* cancel appointment is the last to occur.
* check documents must never occur.
* If create application occurs, then cancel appointment occurs immediately after it.
* in-person appointment 1 must never occur.
* cancel appointment occurs only if create application occurs immediately before it.
A comparison of the extracted explainability with the ground truth, as illustrated in Figure 1, demonstrates that the results align closely with the actual process dynamics. This indicates that the designed framework is both effective at identifying changes and capable of providing meaningful explanations for them.
Why This Matters
Traditional process mining methods often overlook the rich variability that exists across performance dimensions. Our framework addresses this gap by not only detecting process changes but also integrating explainability into the analysis. This empowers process experts to better understand the detected changes and take informed actions.
The result? A powerful tool for uncovering hidden inefficiencies, adapting workflows to dynamic requirements, and driving continuous improvement. Additionally, our open-source implementation ensures accessibility for organizations across industries, enabling widespread adoption and collaboration. Please check our GitHub repository for more information https://github.com/aliNorouzifar/X-PVI.
We are committed to continuous improvement, regularly updating the framework to enhance its functionality and usability. Your feedback and insights are invaluable to us. We welcome your suggestions and encourage you to report any issues or potential enhancements to further refine this approach. Here is my email address: ali.norouzifar@pads.rwth-aachen.de
References:
[1] Sato, D.M.V., De Freitas, S.C., Barddal, J.P. and Scalabrin, E.E., 2021. A survey on concept drift in process mining. ACM Computing Surveys (CSUR), 54(9), pp.1-38.
[2] Norouzifar, A., Rafiei, M., Dees, M. and van der Aalst, W., 2024, May. Process Variant Analysis Across Continuous Features: A Novel Framework. In International Conference on Business Process Modeling, Development and Support (pp. 129-142). Cham: Springer Nature Switzerland.
[3] Leemans, S.J., van der Aalst, W.M., Brockhoff, T. and Polyvyanyy, A., 2021. Stochastic process mining: Earth movers' stochastic conformance. Information Systems, 102, p.101724.
[4] Di Ciccio, C. and Montali, M., 2022. Declarative Process Specifications: Reasoning, Discovery, Monitoring. Process mining handbook, 448, pp.108-152.
Introducing PM-LLM-Benchmark v2.0: Raising the Bar for Process-Mining-Specific Large Language Model Evaluation
This post has been authored by Alessandro Berti.
1. Introduction
In recent years, the synergy between process mining (PM) and large language models (LLMs) has grown at a remarkable pace. Process mining, which focuses on analyzing event logs to extract insights into real-world business processes, benefits significantly from the contextual understanding and domain knowledge provided by state-of-the-art LLMs. Despite these promising developments, until recently there was no benchmark specifically designed to evaluate LLM performance on process mining tasks.
To address this gap, we introduced PM-LLM-Benchmark v1.0 (see the paper)—the first attempt to systematically and qualitatively assess how effectively LLMs handle process mining questions. Now, we are excited to announce PM-LLM-Benchmark v2.0, a comprehensive update that features an expanded range of more challenging prompts and scenarios, along with the continued use of an expert LLM serving as a judge (i.e., LLM-as-a-Judge) to automate the grading process.
This post provides an overview of PM-LLM-Benchmark v2.0, highlighting its major features, improvements over v1.0, and the significance of LLM-as-a-Judge for robust evaluations.
2. PM-LLM-Benchmark v2.0 Highlights
2.1 A New and More Complex Prompt Set
PM-LLM-Benchmark v2.0 is a drop-in replacement for v1.0, designed to push the boundaries of what LLMs can handle in process mining. While the same categories of tasks have been preserved to allow continuity in evaluations, the prompts are more complex and detailed, spanning:
- Contextual understanding of event logs, including inference of case IDs, event context, and process structure.
- Conformance checking and the detection of anomalies in textual descriptions or logs.
- Generation and modification of declarative and procedural process models.
- Process querying and reading process models, both textual and visual (including diagrams).
- Hypothesis generation to test domain knowledge.
- Assessment of unfairness in processes and potential mitigations.
- Diagram reading and interpretation for advanced scenarios.
These new prompts ensure that high-performing models from v1.0 will face fresh challenges and demonstrate whether their reasoning capabilities continue to scale as tasks become more intricate.
2.2 LLM-as-a-Judge: Automated Expert Evaluation
A defining feature of PM-LLM-Benchmark is its use of an expert LLM to evaluate (grade) the responses of other LLMs. We refer to this approach as LLM-as-a-Judge. This setup enables:
1. Systematic Scoring: Each response is scored from 1.0 (minimum) to 10.0 (maximum) according to how well it addresses the question or prompt.
2. Reproducible Assessments: By relying on a consistent “judge” model, different LLMs can be fairly compared using the same grading logic.
3. Scalability: The automated evaluation pipeline makes it easy to add new models or updated versions, as their outputs can be quickly scored without the need for full manual review.
For example, textual answers are judged with a prompt of the form:
Given the following question: [QUESTION CONTENT], how would you grade the following answer from 1.0 (minimum) to 10.0 (maximum)? [MODEL ANSWER]
When an image-based question is supported, the judge LLM is asked to:
Given the attached image, how would you grade the following answer from 1.0 (minimum) to 10.0 (maximum)? [MODEL ANSWER]
The final score for a model on the benchmark is computed by summing all scores across the questions and dividing by 10.
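As an illustration of how such a judging call and the final aggregation might be wired up, here is a minimal sketch using the OpenAI Python SDK. The judge model name, the added instruction to reply with a number only, the score-parsing step, and the placeholder question/answer pair are assumptions for illustration, not necessarily how the benchmark's own tooling implements it.

```python
# Minimal sketch of an LLM-as-a-Judge grading step and the final aggregation
# (the benchmark's own evaluation code may differ in details).
import re
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def judge(question: str, answer: str, model: str = "gpt-4o") -> float:
    prompt = (f"Given the following question: {question}, how would you grade "
              f"the following answer from 1.0 (minimum) to 10.0 (maximum)? {answer}\n"
              "Reply with the numeric score only.")  # last sentence is an added assumption
    reply = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    ).choices[0].message.content
    match = re.search(r"\d+(\.\d+)?", reply)  # extract the numeric grade
    return float(match.group()) if match else 1.0

# Final benchmark score: sum of the per-question grades divided by 10.
questions = ["What does a directly-follows graph record?"]           # placeholder data
answers = ["Which activity directly follows which other activity."]  # placeholder data
scores = [judge(q, a) for q, a in zip(questions, answers)]
print(sum(scores) / 10.0)
```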
2.3 Scripts and Usage
We have included answer.py and evalscript.py to facilitate the benchmark procedure:
1. answer.py: Executes the prompts against a chosen LLM (e.g., an OpenAI model), collecting outputs.
2. evalscript.py: Takes the collected outputs and feeds them to the LLM-as-a-Judge for automated grading.
Users can customize the API keys within answering_api_key.txt and judge_api_key.txt, and configure which model or API endpoint to query for both the answering and judging phases.
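To give a rough picture of the answering phase, the following sketch shows the kind of loop answer.py might implement: read each prompt, query the configured model, and store the response for later judging. The folder layout, file extension, and model name below are assumptions for illustration, not the repository's exact code.

```python
# Minimal sketch of an answering loop (the repository's answer.py may differ):
# read each prompt from questions/, query the chosen LLM, store the response.
from pathlib import Path
from openai import OpenAI

api_key = Path("answering_api_key.txt").read_text().strip()
client = OpenAI(api_key=api_key)

out_dir = Path("answers")
out_dir.mkdir(exist_ok=True)

for prompt_file in sorted(Path("questions").glob("*.txt")):  # assumed layout
    prompt = prompt_file.read_text()
    response = client.chat.completions.create(
        model="gpt-4o",  # assumed model; configure as needed
        messages=[{"role": "user", "content": prompt}],
    )
    (out_dir / prompt_file.name).write_text(response.choices[0].message.content)
```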
3. Benchmark Categories
PM-LLM-Benchmark v2.0 continues to organize tasks into several categories, each reflecting real-world challenges in process mining:
1. Category 1: Contextual understanding—tasks like inferring case IDs and restructuring events.
2. Category 2: Conformance checking—identifying anomalies in textual descriptions or event logs.
3. Category 3: Model generation and modification—creating and editing both declarative and procedural process models.
4. Category 4: Process querying—answering questions that require deeper introspection of the process models or logs.
5. Category 5: Hypothesis generation—proposing insightful research or improvement questions based on the provided data.
6. Category 6: Fairness considerations—detecting and mitigating unfairness in processes and resources.
7. Category 7: Diagram interpretation—examining an LLM’s ability to read, understand, and reason about process mining diagrams.
Each category tests different aspects of an LLM’s capacity, from linguistic comprehension to domain-specific reasoning.
4. Leaderboards and New Baselines
Following the approach of v1.0, PM-LLM-Benchmark v2.0 includes leaderboards to track the performance of various LLMs (see leaderboard_gpt-4o-2024-11-20.md). Our latest results show that even the most capable current models achieve only around 7.5 out of 10, reflecting the increased difficulty of v2.0 relative to v1.0, where performance had largely plateaued in the 9–10 range.
Model Highlights
- OpenAI O1 & O1 Pro Mode: The new O1 model and the enhanced O1 Pro Mode deliver strong performance, with O1 Pro Mode showing about a 5% improvement over O1. Some initial concerns about the standard O1 model’s shorter reasoning depth have been largely mitigated by these results.
- Google Gemini-2.0-flash-exp and gemini-exp-1206: Gemini-2.0-flash-exp shows performance comparable to the established gemini-1.5-pro-002. However, the experimental gemini-exp-1206 variant, expected to inform Gemini 2.0 Pro, displays promising improvements over earlier Gemini releases. Overall, Gemini models fall slightly behind the O1 series on v2.0 tasks.
5. How to Get Started
1. Clone the Repository: Access the PM-LLM-Benchmark v2.0 repository, which contains the questions/ folder and scripts like answer.py and evalscript.py.
2. Install Dependencies: Make sure you have the necessary Python packages (e.g., requests and openai).
3. Configure API Keys: Place your API keys in answering_api_key.txt and judge_api_key.txt.
4. Run the Benchmark:
- Execute python answer.py to generate LLM responses to the v2.0 prompts.
- Run python evalscript.py to evaluate and obtain the final scores using an expert LLM.
5. Analyze the Results: Compare the results in the generated scoreboard to see where your chosen model excels and where it struggles.
6. Conclusion and Outlook
PM-LLM-Benchmark v2.0 raises the bar for process-mining-specific LLM evaluations, ensuring that continued improvements in model architectures and capabilities are tested against truly challenging and domain-specific tasks. Leveraging LLM-as-a-Judge also fosters a consistent, automated, and scalable evaluation paradigm.
Whether you are an LLM researcher exploring specialized domains like process mining, or a practitioner who wants to identify the best model for analyzing process logs and diagrams, we invite you to test your models on PM-LLM-Benchmark v2.0. The expanded prompts and systematic grading method provide a rigorous environment in which to measure and improve LLM performance.
References & Further Reading
- Original PM-LLM-Benchmark v1.0 Paper: https://arxiv.org/pdf/2407.13244
- Leaderboard (updated regularly): leaderboard_gpt-4o-2024-11-20.md