chore: update eval

Han Xiao
2025-02-09 17:16:29 +08:00
committed by GitHub
parent 7e4c3b0d8c
commit 19ee1f782c

@@ -294,7 +294,7 @@ Plain `gemini-2.0-flash` can be run by setting `tokenBudget` to zero, skipping t
 It should not be surprised that plain `gemini-2.0-flash` has a 0% pass rate, as I intentionally filtered out the questions that LLMs can answer.
-| Metric | gemini-2.0-flash | #5e80ed4 | #3deee87 |
+| Metric | gemini-2.0-flash | #5e80ed4 | #3deee87 (latest) |
 |--------|------------------|-------------------------------------------------|--------|
 | Pass Rate | 0% | 60% | 75% |
 | Average Steps | 1 | 5 |5 |