From e05175b6b13100789058e1294da991974b20540e Mon Sep 17 00:00:00 2001
From: Han Xiao
Date: Mon, 10 Feb 2025 12:24:28 +0800
Subject: [PATCH] chore: update eval

---
 README.md | 22 +++++++++++-----------
 1 file changed, 11 insertions(+), 11 deletions(-)

diff --git a/README.md b/README.md
index da1fde7..1dbd97c 100644
--- a/README.md
+++ b/README.md
@@ -298,14 +298,14 @@ Plain `gemini-2.0-flash` can be run by setting `tokenBudget` to zero, skipping t
 
 It should not be surprised that plain `gemini-2.0-flash` has a 0% pass rate, as I intentionally filtered out the questions that LLMs can answer.
 
-| Metric | gemini-2.0-flash | #5e80ed4 | #3deee87 (latest) |
-|--------|------------------|-------------------------------------------------|--------|
-| Pass Rate | 0% | 60% | 75% |
-| Average Steps | 1 | 5 |5 |
-| Maximum Steps | 1 | 13 |13 |
-| Minimum Steps | 1 | 2 |1 |
-| Median Steps | 1 | 3 |3 |
-| Average Tokens | 428 | 59,408 |32,392 |
-| Median Tokens | 434 | 16,001 |9,172 |
-| Maximum Tokens | 463 | 347,222 |202,055 |
-| Minimum Tokens | 374 | 5,594 |3,236 |
+| Metric | gemini-2.0-flash | #18f0312 |
+|--------|------------------|-----------|
+| Pass Rate | 0% | 75% |
+| Average Steps | 1 | 4 |
+| Maximum Steps | 1 | 14 |
+| Minimum Steps | 1 | 0 |
+| Median Steps | 1 | 3 |
+| Average Tokens | 428 | 71,285 |
+| Median Tokens | 434 | 22,771 |
+| Maximum Tokens | 463 | 536,148 |
+| Minimum Tokens | 374 | 0 |