From 19ee1f782c43057193acd8e5fd0eb6f7eb353579 Mon Sep 17 00:00:00 2001
From: Han Xiao
Date: Sun, 9 Feb 2025 17:16:29 +0800
Subject: [PATCH] chore: update eval

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 32e3e61..92341d4 100644
--- a/README.md
+++ b/README.md
@@ -294,7 +294,7 @@ Plain `gemini-2.0-flash` can be run by setting `tokenBudget` to zero, skipping t
 
 It should not be surprised that plain `gemini-2.0-flash` has a 0% pass rate, as I intentionally filtered out the questions that LLMs can answer.
 
-| Metric | gemini-2.0-flash | #5e80ed4 | #3deee87 |
+| Metric | gemini-2.0-flash | #5e80ed4 | #3deee87 (latest) |
 |--------|------------------|-------------------------------------------------|--------|
 | Pass Rate | 0% | 60% | 75% |
 | Average Steps | 1 | 5 |5 |