Language model evaluation harness starter resource
This notebook shows how to evaluate your models with the Language Model Evaluation Harness from EleutherAI.
Source: https://colab.research.google.com/drive/1zmZfdETnQ-AR2BBIK3pFtnP5937J1yaz?usp=sharing
In [1]:
!git clone https://github.com/EleutherAI/lm-evaluation-harness/
Cloning into 'lm-evaluation-harness'...
remote: Enumerating objects: 22462, done.
remote: Counting objects: 100% (5248/5248), done.
remote: Compressing objects: 100% (687/687), done.
remote: Total 22462 (delta 4760), reused 4854 (delta 4559), pack-reused 17214
Receiving objects: 100% (22462/22462), 20.70 MiB | 14.91 MiB/s, done.
Resolving deltas: 100% (15465/15465), done.
In [2]:
%cd lm-evaluation-harness
/content/lm-evaluation-harness
In [3]:
# Pin the harness to a known commit so results stay reproducible
!git checkout 'e47e01beea79cfe87421e2dac49e64d499c240b4'
Note: switching to 'e47e01beea79cfe87421e2dac49e64d499c240b4'.

You are in 'detached HEAD' state. You can look around, make experimental changes
and commit them, and you can discard any commits you make in this state without
impacting any branches by switching back to a branch.

If you want to create a new branch to retain commits you create, you may do so
(now or later) by using -c with the switch command. Example:

  git switch -c <new-branch-name>

Or undo this operation with:

  git switch -

Turn off this advice by setting config variable advice.detachedHead to false

HEAD is now at e47e01be Merge pull request #435 from EleutherAI/haileyschoelkopf-patch-1
In [4]:
!pip install -q -e .
Preparing metadata (setup.py) ... done
Installing build dependencies ... done
Getting requirements to build wheel ... done
Preparing metadata (pyproject.toml) ... done
Building wheel for antlr4-python3-runtime (setup.py) ... done
Building wheel for rouge-score (setup.py) ... done
Building wheel for pycountry (pyproject.toml) ... done
Building wheel for sqlitedict (setup.py) ... done
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
llmx 0.0.15a0 requires cohere, which is not installed.
llmx 0.0.15a0 requires tiktoken, which is not installed.
In [5]:
# Sanity check: confirm the core dependencies import cleanly
import transformers
import accelerate
In [6]:
# Optional: bleurt is only needed for tasks that score with the BLEURT metric
#!pip install bleurt@https://github.com/google-research/bleurt/archive/b610120347ef22b494b6d69b4316e303f5932516.zip#egg=bleurt
In [7]:
%%time
# ~3 minutes to run; accuracy of about 44%
!python main.py \
--model gpt2 \
--num_fewshot 0 \
--tasks arc_easy \
--device 0
2023-10-21 03:33:47.513338: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Selected Tasks: ['arc_easy']
Using device '0'
Running loglikelihood requests
100% 9496/9496 [02:48<00:00, 56.19it/s]
{
  "results": {
    "arc_easy": {
      "acc": 0.43813131313131315,
      "acc_stderr": 0.010180937100600062,
      "acc_norm": 0.3947811447811448,
      "acc_norm_stderr": 0.010030038935883556
    }
  },
  "versions": {
    "arc_easy": 0
  },
  "config": {
    "model": "gpt2",
    "model_args": "",
    "num_fewshot": 0,
    "batch_size": null,
    "device": "0",
    "no_cache": false,
    "limit": null,
    "bootstrap_iters": 100000,
    "description_dict": {}
  }
}
gpt2 (), limit: None, provide_description: False, num_fewshot: 0, batch_size: None
| Task |Version| Metric |Value | |Stderr|
|--------|------:|--------|-----:|---|-----:|
|arc_easy| 0|acc |0.4381|± |0.0102|
| | |acc_norm|0.3948|± |0.0100|
CPU times: user 3.22 s, sys: 365 ms, total: 3.59 s
Wall time: 4min 6s
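If you prefer to stay in Python rather than shelling out to main.py, the harness also exposes an evaluator module. Below is a minimal sketch, assuming the simple_evaluate entry point present in lm_eval/evaluator.py at this commit; check that file if the signature differs.

from lm_eval import evaluator

# Programmatic equivalent of the CLI run above (a sketch, not run in this notebook)
results = evaluator.simple_evaluate(
    model="gpt2",
    tasks=["arc_easy"],
    num_fewshot=0,
    device="0",
)
print(results["results"]["arc_easy"])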
In [8]:
%%time
# ~10 minutes to run; accuracy of about 56%
# `--model gpt2` selects the Hugging Face causal-LM backend; `pretrained=`
# points it at a different checkpoint, here GPT-Neo 1.3B
!python main.py \
--model gpt2 \
--model_args pretrained=EleutherAI/gpt-neo-1.3B \
--num_fewshot 0 \
--tasks arc_easy \
--device 0
2023-10-21 03:37:52.813056: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Selected Tasks: ['arc_easy']
Using device '0'
Running loglikelihood requests
100% 9496/9496 [08:25<00:00, 18.80it/s]
{
  "results": {
    "arc_easy": {
      "acc": 0.5618686868686869,
      "acc_stderr": 0.010180937100600074,
      "acc_norm": 0.502104377104377,
      "acc_norm_stderr": 0.010259692651537032
    }
  },
  "versions": {
    "arc_easy": 0
  },
  "config": {
    "model": "gpt2",
    "model_args": "pretrained=EleutherAI/gpt-neo-1.3B",
    "num_fewshot": 0,
    "batch_size": null,
    "device": "0",
    "no_cache": false,
    "limit": null,
    "bootstrap_iters": 100000,
    "description_dict": {}
  }
}
gpt2 (pretrained=EleutherAI/gpt-neo-1.3B), limit: None, provide_description: False, num_fewshot: 0, batch_size: None
| Task |Version| Metric |Value | |Stderr|
|--------|------:|--------|-----:|---|-----:|
|arc_easy| 0|acc |0.5619|± |0.0102|
| | |acc_norm|0.5021|± |0.0103|
CPU times: user 7.27 s, sys: 774 ms, total: 8.04 s
Wall time: 9min 49s
In [9]:
%%time
# ~18 minutes to run; accuracy of about 61% (this run was interrupted below)
!python main.py \
--model gpt2 \
--model_args pretrained=EleutherAI/gpt-neo-2.7B \
--num_fewshot 0 \
--tasks arc_easy \
--device 0
2023-10-21 03:47:48.449712: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Selected Tasks: ['arc_easy']
Using device '0'
Downloading model.safetensors: 100% 10.7G/10.7G [01:10<00:00, 152MB/s]
^C
CPU times: user 1.09 s, sys: 250 ms, total: 1.34 s
Wall time: 2min 4s
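The 2.7B run above was interrupted before finishing. When a full run is too slow, the harness's --limit flag evaluates only the first N examples per task, which makes for a quick smoke test; keep in mind that accuracy from a limited run is not comparable to a full benchmark score.

%%time
# Smoke test: evaluate only the first 100 examples per task
!python main.py \
--model gpt2 \
--model_args pretrained=EleutherAI/gpt-neo-2.7B \
--num_fewshot 0 \
--tasks arc_easy \
--limit 100 \
--device 0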
It's best practice to use getpass in these kinds of notebooks so the key never appears in the cell or its output. Your API key should look something like sk-...
In [10]:
import getpass
import os

# The harness at this commit reads the OpenAI key from OPENAI_API_SECRET_KEY
open_ai_key = getpass.getpass('Enter your OPENAI API Key')
os.environ['OPENAI_API_SECRET_KEY'] = open_ai_key
Enter your OPENAI API Key··········
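Before launching a paid run, a quick sanity check that the key actually landed in the environment can save a wasted job (a minimal sketch; the sk- prefix check follows the key format noted above):

# Fail fast if the key is missing or obviously malformed
key = os.environ.get('OPENAI_API_SECRET_KEY', '')
assert key.startswith('sk-'), 'OPENAI_API_SECRET_KEY is not set correctly'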
In [11]:
%%time
# ~2 minutes to run (88 requests); accuracy of about 86%
!python main.py \
--model gpt3 \
--model_args engine=davinci \
--num_fewshot 2 \
--tasks sst
2023-10-21 03:51:06.509333: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Selected Tasks: ['sst']
Running loglikelihood requests
100% 88/88 [01:34<00:00, 1.07s/it]
{
  "results": {
    "sst": {
      "acc": 0.8600917431192661,
      "acc_stderr": 0.011753981006588683
    }
  },
  "versions": {
    "sst": 0
  },
  "config": {
    "model": "gpt3",
    "model_args": "engine=davinci",
    "num_fewshot": 2,
    "batch_size": null,
    "device": null,
    "no_cache": false,
    "limit": null,
    "bootstrap_iters": 100000,
    "description_dict": {}
  }
}
gpt3 (engine=davinci), limit: None, provide_description: False, num_fewshot: 2, batch_size: None
|Task|Version|Metric|Value | |Stderr|
|----|------:|------|-----:|---|-----:|
|sst | 0|acc |0.8601|± |0.0118|
CPU times: user 821 ms, sys: 121 ms, total: 942 ms
Wall time: 2min 3s
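The --tasks flag accepts any name in the harness's task registry. Below is a minimal sketch of how to list the registered tasks, assuming the ALL_TASKS list exposed by lm_eval.tasks at this commit:

from lm_eval import tasks

# All registered task names; 'arc_easy' and 'sst' should both appear
print(len(tasks.ALL_TASKS))
print([t for t in tasks.ALL_TASKS if t.startswith('arc')])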
Making a new task for the harness
This part documents how to create a new task for the language model evaluation harness and is based on the harness's task guide (docs/task_guide.md, linked in the cell below).
In [12]:
# After forking the harness on GitHub...
!cd .. && git clone https://github.com/esbenkc/lm-evaluation-harness.git lm-evaluation-harness-new-task
# The clone lands in the parent directory, so change into it from there
%cd ../lm-evaluation-harness-new-task
!git checkout -b "cool-patrol"
!pip install -q -e ".[dev]"
Cloning into 'lm-evaluation-harness-new-task'...
remote: Enumerating objects: 7910, done.
remote: Counting objects: 100% (766/766), done.
remote: Compressing objects: 100% (65/65), done.
remote: Total 7910 (delta 730), reused 701 (delta 701), pack-reused 7144
Receiving objects: 100% (7910/7910), 9.49 MiB | 21.74 MiB/s, done.
Resolving deltas: 100% (5080/5080), done.
Switched to a new branch 'cool-patrol'
Preparing metadata (setup.py) ... done
In [13]:
# See https://github.com/EleutherAI/lm-evaluation-harness/blob/master/docs/task_guide.md#creating-your-task-file
# Use underscores in the filename: the task module must be importable from
# lm_eval/tasks/__init__.py, and hyphens are not valid in Python module names
!cp templates/new_task.py lm_eval/tasks/cool_patrol.py
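The copied template subclasses lm_eval.base.Task, and you fill in its hooks. Below is a minimal sketch of the pieces a typical loglikelihood-scored task implements; the class name, dataset path, and field names (question, answer) are placeholders, and the real template in templates/new_task.py documents each hook in more detail.

from lm_eval.base import Task, rf
from lm_eval.metrics import mean


class CoolPatrol(Task):
    VERSION = 0
    DATASET_PATH = "my_org/my_dataset"  # placeholder Hugging Face dataset path
    DATASET_NAME = None

    def has_training_docs(self):
        return True

    def has_validation_docs(self):
        return True

    def has_test_docs(self):
        return False

    def training_docs(self):
        return self.dataset["train"]

    def validation_docs(self):
        return self.dataset["validation"]

    def doc_to_text(self, doc):
        # Prompt shown to the model for each document
        return f"Question: {doc['question']}\nAnswer:"

    def doc_to_target(self, doc):
        # Gold continuation the model is scored against
        return " " + doc["answer"]

    def construct_requests(self, doc, ctx):
        # Ask the model for the log-likelihood of the gold answer
        return rf.loglikelihood(ctx, self.doc_to_target(doc))

    def process_results(self, doc, results):
        ll, is_greedy = results[0]
        return {"acc": int(is_greedy)}

    def aggregation(self):
        return {"acc": mean}

    def higher_is_better(self):
        return {"acc": True}

To make the task selectable with --tasks, register the class in TASK_REGISTRY inside lm_eval/tasks/__init__.py, which is why the filename needs to be a valid Python module name.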