## **PoC**
METR has released [Vivaria](https://github.com/METR/vivaria?tab=readme-ov-file), its benchmarking tool (along with benchmarks), which is used for running evaluations and conducting agent elicitation research.
[The METR Task Standard](https://github.com/METR/task-standard/) from metr.org defines a common format for agent evaluation tasks.
[CyberSecEval 3](https://github.com/meta-llama/PurpleLlama/tree/main/CybersecurityBenchmarks) also contains benchmarks for a range of offensive capabilities.
## **Details**
The METR Task Standard is designed to assist in evaluating language model agents for autonomous capabilities.
"Many organizations and individuals are interested in writing tasks to evaluate language model agents for autonomous capabilities. The goal of the METR Task Standard is to define a common format for such tasks, so that everyone can evaluate their agents on tasks developed by other parties, rather than just their own. Making and validating informative tasks is a large amount of work, so de-duplicating efforts is important for the overall evals ecosystem." - METR Github.
[Paper](https://metr.org/blog/2023-08-01-new-report/)
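For concreteness, below is a minimal sketch of what a task family might look like under the Task Standard. The `TaskFamily` class, its method names (`get_tasks`, `get_instructions`, `score`), and the `standard_version` field follow the general shape described in the task-standard repository, but the exact interface can differ between versions, so treat this as illustrative rather than authoritative.

```python
# Illustrative sketch of a METR-style task family (assumed structure; see the
# task-standard repository for the authoritative interface and field names).
from typing import TypedDict


class Task(TypedDict):
    prompt: str
    expected_answer: str


class TaskFamily:
    # Version of the task standard this family targets (assumed field name).
    standard_version = "0.3.0"

    @staticmethod
    def get_tasks() -> dict[str, Task]:
        # Each key is a task name; each value holds that task's data.
        return {
            "easy": {"prompt": "What is 2 + 2?", "expected_answer": "4"},
            "hard": {"prompt": "What is 17 * 23?", "expected_answer": "391"},
        }

    @staticmethod
    def get_instructions(t: Task) -> str:
        # Instructions shown to the agent at the start of a run.
        return f"Solve the following and submit only the answer:\n{t['prompt']}"

    @staticmethod
    def score(t: Task, submission: str) -> float:
        # Score the agent's submission: 1.0 for a correct answer, 0.0 otherwise.
        return 1.0 if submission.strip() == t["expected_answer"] else 0.0
```

A harness such as Vivaria would typically enumerate `get_tasks()`, present `get_instructions(...)` to the agent, and call `score(...)` on whatever the agent submits.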