This post has been authored by Alessandro Berti.
1. Introduction
In recent years, the synergy between process mining (PM) and large language models (LLMs) has grown at a remarkable pace. Process mining, which focuses on analyzing event logs to extract insights into real-world business processes, benefits significantly from the contextual understanding and domain knowledge provided by state-of-the-art LLMs. Despite these promising developments, until recently there was no dedicated benchmark for evaluating LLM performance on process mining tasks.
To address this gap, we introduced PM-LLM-Benchmark v1.0 (see the paper)—the first attempt to systematically and qualitatively assess how effectively LLMs handle process mining questions. Now, we are excited to announce PM-LLM-Benchmark v2.0, a comprehensive update that features an expanded range of more challenging prompts and scenarios, along with the continued use of an expert LLM serving as a judge (i.e., LLM-as-a-Judge) to automate the grading process.
This post provides an overview of PM-LLM-Benchmark v2.0, highlighting its major features, improvements over v1.0, and the significance of LLM-as-a-Judge for robust evaluations.
2. PM-LLM-Benchmark v2.0 Highlights
2.1 A New and More Complex Prompt Set
PM-LLM-Benchmark v2.0 is a drop-in replacement for v1.0, designed to push the boundaries of what LLMs can handle in process mining. While the same categories of tasks have been preserved to allow continuity in evaluations, the prompts are more complex and detailed, spanning:
- Contextual understanding of event logs, including inference of case IDs, event context, and process structure.
- Conformance checking and the detection of anomalies in textual descriptions or logs.
- Generation and modification of declarative and procedural process models.
- Process querying and reading process models, both textual and visual (including diagrams).
- Hypothesis generation to test domain knowledge.
- Assessment of unfairness in processes and potential mitigations.
- Diagram reading and interpretation for advanced scenarios.
These new prompts ensure that high-performing models from v1.0 will face fresh challenges and demonstrate whether their reasoning capabilities continue to scale as tasks become more intricate.
2.2 LLM-as-a-Judge: Automated Expert Evaluation
A defining feature of PM-LLM-Benchmark is its use of an expert LLM to evaluate (grade) the responses of other LLMs. We refer to this approach as LLM-as-a-Judge. This setup enables:
1. Systematic Scoring: Each response is scored from 1.0 (minimum) to 10.0 (maximum) according to how well it addresses the question or prompt.
2. Reproducible Assessments: By relying on a consistent “judge” model, different LLMs can be fairly compared using the same grading logic.
3. Scalability: The automated evaluation pipeline makes it easy to add new models or updated versions, as their outputs can be quickly scored without the need for full manual review.
For example, textual answers are judged with a prompt of the form:
Given the following question: [QUESTION CONTENT], how would you grade the following answer from 1.0 (minimum) to 10.0 (maximum)? [MODEL ANSWER]
When an image-based question is supported, the judge LLM is asked to:
Given the attached image, how would you grade the following answer from 1.0 (minimum) to 10.0 (maximum)? [MODEL ANSWER]
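To make this concrete, the following sketch shows how such judge requests could be assembled with the openai Python client. The judge model name (gpt-4o), the key-file handling, and the helper names are illustrative assumptions rather than the actual code of evalscript.py.

```python
import base64
from openai import OpenAI  # assumes the official openai Python package

# Illustrative setup: the judge key file matches the one shipped with the benchmark.
judge = OpenAI(api_key=open("judge_api_key.txt").read().strip())
JUDGE_MODEL = "gpt-4o"  # hypothetical choice of judge model

def judge_text_answer(question: str, answer: str) -> str:
    # Textual questions: embed the question and the model's answer in the grading prompt.
    prompt = (
        f"Given the following question: {question}, how would you grade the "
        f"following answer from 1.0 (minimum) to 10.0 (maximum)? {answer}"
    )
    resp = judge.chat.completions.create(
        model=JUDGE_MODEL,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def judge_image_answer(image_path: str, answer: str) -> str:
    # Image-based questions: attach the diagram alongside the grading prompt.
    b64 = base64.b64encode(open(image_path, "rb").read()).decode()
    prompt = (
        "Given the attached image, how would you grade the following answer "
        f"from 1.0 (minimum) to 10.0 (maximum)? {answer}"
    )
    resp = judge.chat.completions.create(
        model=JUDGE_MODEL,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content
```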
The final score for a model on the benchmark is computed by summing all scores across the questions and dividing by 10.
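A minimal sketch of this aggregation, assuming the numeric grade still has to be extracted from the judge's free-text reply (the real evalscript.py may parse replies differently):

```python
import re

def extract_grade(judge_reply: str) -> float:
    # Take the first number in the judge's reply as the grade (a simplification).
    match = re.search(r"\d+(?:\.\d+)?", judge_reply)
    return float(match.group()) if match else 1.0

def final_score(per_question_grades: list[float]) -> float:
    # Sum of all per-question grades (each between 1.0 and 10.0), divided by 10.
    return sum(per_question_grades) / 10.0
```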
2.3 Scripts and Usage
We have included answer.py and evalscript.py to facilitate the benchmark procedure:
1. answer.py: Executes the prompts against a chosen LLM (e.g., an OpenAI model), collecting outputs.
2. evalscript.py: Takes the collected outputs and feeds them to the LLM-as-a-Judge for automated grading.
Users can set their API keys in answering_api_key.txt and judge_api_key.txt, and configure which model or API endpoint to query for both the answering and judging phases.
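As an illustration of how the answering phase might be wired up, the sketch below reads the answering key, points the client at a configurable model or endpoint, and loops over the prompts in the questions/ folder. The file layout, the base_url option, and the model identifier are assumptions for illustration; the repository scripts remain the reference implementation.

```python
import glob
from openai import OpenAI

# Answering phase (in the spirit of answer.py): one client for the model under test.
# base_url can point to any OpenAI-compatible endpoint if you are not using OpenAI itself.
answering_client = OpenAI(
    api_key=open("answering_api_key.txt").read().strip(),
    # base_url="https://your-endpoint/v1",  # optional: the endpoint is configurable
)
ANSWERING_MODEL = "gpt-4o"  # hypothetical model under evaluation

def run_answering_phase(questions_dir: str = "questions") -> dict[str, str]:
    # Send every textual prompt to the evaluated model and collect its outputs;
    # image-based prompts would need a vision-style request like the one shown earlier.
    answers = {}
    for path in sorted(glob.glob(f"{questions_dir}/*.txt")):  # assumed file naming
        prompt = open(path, encoding="utf-8").read()
        resp = answering_client.chat.completions.create(
            model=ANSWERING_MODEL,
            messages=[{"role": "user", "content": prompt}],
        )
        answers[path] = resp.choices[0].message.content
    return answers
```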
3. Benchmark Categories
PM-LLM-Benchmark v2.0 continues to organize tasks into several categories, each reflecting real-world challenges in process mining:
1. Category 1: Contextual understanding—tasks like inferring case IDs and restructuring events.
2. Category 2: Conformance checking—identifying anomalies in textual descriptions or event logs.
3. Category 3: Model generation and modification—creating and editing both declarative and procedural process models.
4. Category 4: Process querying—answering questions that require deeper introspection of the process models or logs.
5. Category 5: Hypothesis generation—proposing insightful research or improvement questions based on the provided data.
6. Category 6: Fairness considerations—detecting and mitigating unfairness in processes and resources.
7. Category 7: Diagram interpretation—examining an LLM’s ability to read, understand, and reason about process mining diagrams.
Each category tests different aspects of an LLM’s capacity, from linguistic comprehension to domain-specific reasoning.
4. Leaderboards and New Baselines
Following the approach of v1.0, PM-LLM-Benchmark v2.0 includes leaderboards to track the performance of various LLMs (see leaderboard_gpt-4o-2024-11-20.md). Our latest results show that even the most capable current models achieve only around 7.5 out of 10, indicating the increased difficulty of v2.0 relative to v1.0, where performance had largely plateaued in the 9–10 range.
Model Highlights
- OpenAI O1 & O1 Pro Mode: The new O1 model and the enhanced O1 Pro Mode deliver strong performance, with O1 Pro Mode showing about a 5% improvement over O1. Some initial concerns about the standard O1 model’s shorter reasoning depth have been largely mitigated by these results.
- Google Gemini-2.0-flash-exp and gemini-exp-1206: Gemini-2.0-flash-exp shows performance comparable to the established gemini-1.5-pro-002. However, the experimental gemini-exp-1206 variant, expected to inform Gemini 2.0 Pro, displays promising improvements over earlier Gemini releases. Overall, Gemini models fall slightly behind the O1 series on v2.0 tasks.
5. How to Get Started
1. Clone the Repository: Access the PM-LLM-Benchmark v2.0 repository, which contains the questions/ folder and scripts like answer.py and evalscript.py.
2. Install Dependencies: Make sure you have the necessary Python packages (e.g., requests and openai).
3. Configure API Keys: Place your API keys in answering_api_key.txt and judge_api_key.txt.
4. Run the Benchmark:
- Execute python answer.py to generate LLM responses to the v2.0 prompts.
- Run python evalscript.py to evaluate and obtain the final scores using an expert LLM.
5. Analyze the Results: Compare the results in the generated scoreboard to see where your chosen model excels and where it struggles.
6. Conclusion and Outlook
PM-LLM-Benchmark v2.0 raises the bar for process-mining-specific LLM evaluations, ensuring that continued improvements in model architectures and capabilities are tested against truly challenging and domain-specific tasks. Leveraging LLM-as-a-Judge also fosters a consistent, automated, and scalable evaluation paradigm.
Whether you are an LLM researcher exploring specialized domains like process mining, or a practitioner who wants to identify the best model for analyzing process logs and diagrams, we invite you to test your models on PM-LLM-Benchmark v2.0. The expanded prompts and systematic grading method provide a rigorous environment in which to measure and improve LLM performance.
References & Further Reading
- Original PM-LLM-Benchmark v1.0 Paper: https://arxiv.org/pdf/2407.13244
- Leaderboard (updated regularly): leaderboard_gpt-4o-2024-11-20.md