
The Beyond the Imitation Game Benchmark (BIG-bench) is a collaborative benchmark intended to probe large language models and extrapolate their future capabilities. The more than 200 tasks included in BIG-bench are summarized by keyword here, and by task name here. A paper introducing the benchmark, including evaluation results on large language models, is currently under review, and is available as a preprint.

Tasks will be reviewed and merged into the BIG-bench repository on a rolling basis. New tasks are no longer eligible for inclusion in the initial BIG-bench release and paper. However, they will be included in future BIG-bench releases, and the task authors will be included in the author list of future publications.

The benchmark organizers can be contacted at

## Table of contents

For more details about the benchmark, see our detailed instructions.


## BIG-bench Lite leaderboard

BIG-bench Lite (BBL) is a small subset of 24 diverse JSON tasks from BIG-bench. It is designed to provide a canonical measure of model performance, while being far cheaper to evaluate than the full set of more than 200 programmatic and JSON tasks in BIG-bench. A leaderboard of current model performance on BBL is shown below.

To add new model results to the full BIG-bench leaderboard, to the BBL leaderboard, and to individual task performance plots, open a PR which includes the score files generated when you evaluate your model on BIG-bench tasks.

## Creating a JSON task

We recommend starting with one of the JSON template tasks. As described in the previous section, simple_arithmetic_json is a good choice.

```bash
cp -r simple_arithmetic_json my_awesome_json_task
cd my_awesome_json_task
```

The task consists of two files: README.md describes the task, and task.json contains the task data. We will now edit both files with the new task information.

⚠️ All task files (including README.md and task.json) contain a "canary" string, which should not be edited. This is to prevent benchmark tasks from leaking into web-scraped training data. ⚠️

First, edit README.md to include the correct author and task information.

Second, edit task.json and update the following fields:

- `name`: A short, human-readable task name. In this example we use "My awesome task".
- `description`: A short, human-readable description for your task.
- `metrics`: The metrics to be used in evaluation, such as "exact_str_match". See here for a description of available metrics.
- `preferred_score`: Which metric from `metrics` to prefer when reporting the task evaluation results.
- `examples`: Add your task's input/output examples here!
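Putting these fields together, the edited task.json might look like the minimal sketch below, assuming the schema used by the simple_arithmetic_json template (a list of metrics, a single preferred score, and input/target example pairs). All values are illustrative placeholders, and the canary string that the template already contains is omitted from the sketch; leave the one in your copied file untouched.

```json
{
  "name": "My awesome task",
  "description": "Answer simple single-digit arithmetic questions.",
  "metrics": ["exact_str_match"],
  "preferred_score": "exact_str_match",
  "examples": [
    {"input": "1 + 1 = ", "target": "2"}
  ]
}
```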

Task examples can be specified in one of two ways:

- As a single input/target-output pair, where the target output will be compared against the model output. For example:
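A representative pair, assuming the input/target keys used by the JSON template tasks (the strings themselves are placeholders):

```json
{
  "input": "2 + 3 = ",
  "target": "5"
}
```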
