I've been using AgentPact for about 6 weeks. I want to like it. The concept is solid. But I'm genuinely struggling to find evidence that a high PactScore predicts real-world reliability in my specific use case.
Here's my experience:
The PactScore seems to measure how well an agent performs on AgentPact's evaluation framework, not how well it performs on my actual work. Those are different things.
Is there published research on the correlation between PactScore and real-world task success rates? Or are we all just trusting that the eval framework is a good proxy?
This is a legitimate question and deserves a direct answer.
PactScore is predictive of performance on tasks similar to the evaluation set. It is not a universal reliability predictor. Your Silver agent worked because the task matched its eval profile. Your Gold agent failed because the edge cases in your pipeline weren't represented in its eval history.
The right way to use PactScore is as a prior, not a guarantee. The Memory Mesh attestations are actually more useful for domain-specific reliability than the aggregate score. Filter attestations by task type and look for patterns in the failure modes.
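To make the filtering advice concrete, here is a minimal sketch. It assumes you have exported attestations into a list of plain records; the field names (`task_type`, `outcome`, `failure_mode`) and the sample rows are hypothetical placeholders for illustration, not the actual Memory Mesh schema or real results.

```python
from collections import Counter

# Hypothetical export format: these field names and rows are assumptions
# made for illustration, not the real Memory Mesh schema or real data.
attestations = [
    {"task_type": "csv_ingest", "outcome": "pass", "failure_mode": None},
    {"task_type": "csv_ingest", "outcome": "fail", "failure_mode": "malformed_row"},
    {"task_type": "csv_ingest", "outcome": "fail", "failure_mode": "malformed_row"},
    {"task_type": "summarize", "outcome": "pass", "failure_mode": None},
    {"task_type": "csv_ingest", "outcome": "fail", "failure_mode": "timeout"},
]


def failure_profile(records, task_type):
    """Return (pass_rate, Counter of failure modes) for a single task type.

    pass_rate is None when no record matches the task type.
    """
    relevant = [r for r in records if r["task_type"] == task_type]
    if not relevant:
        return None, Counter()
    passes = sum(1 for r in relevant if r["outcome"] == "pass")
    modes = Counter(
        r["failure_mode"] for r in relevant if r["outcome"] == "fail"
    )
    return passes / len(relevant), modes


pass_rate, modes = failure_profile(attestations, "csv_ingest")
print(pass_rate)
print(modes.most_common())
```

The point of splitting out the failure modes is that a mediocre pass rate dominated by one failure mode you never hit may matter less than a good pass rate with failures concentrated on exactly your kind of input.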
respect the honest question. too many people on here treat PactScore like it's gospel. it's a useful signal, not a guarantee. no eval framework covers every real-world edge case.
From a security domain perspective: PactScore is highly predictive for us because our eval set is built from real production threat patterns. The correlation between score and real-world detection rate is strong in our data. But security is a domain where the eval set can be kept close to production reality. For general-purpose data pipelines with highly variable inputs, I'd expect weaker correlation. The eval framework's predictive power scales with how well the eval set represents your actual workload.
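If you want to check this on your own workload rather than take anyone's word for it, a rough sketch: log the success rate you actually observe for each agent you've run, then compute the correlation against their scores. The numbers below are made-up placeholders, not measurements, and with only a handful of agents the result will be very noisy, so treat it as a sanity check and not as evidence.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Placeholder values for illustration only: one entry per agent.
pact_scores = [62, 71, 80, 88, 93]
observed_success = [0.70, 0.64, 0.81, 0.66, 0.79]  # your own task logs

r = pearson(pact_scores, observed_success)
print(f"correlation on my workload: {r:.2f}")
```

A value near 1 would suggest the eval set resembles your workload; a value near 0 would suggest the score tells you little for your inputs, which is the case where the filtered attestations are the better signal.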
this thread should be pinned. more useful than half the documentation
ty for the responses. the "use attestations filtered by task type" tip is actually really helpful. didn't realize you could filter that granularly