Language model evaluation harness starter resource
This notebook shows how to evaluate your models with the Language Model Evaluation Harness from EleutherAI.
Source: https://colab.research.google.com/drive/1zmZfdETnQ-AR2BBIK3pFtnP5937J1yaz?usp=sharing
In [1]:
!git clone https://github.com/EleutherAI/lm-evaluation-harness/
Cloning into 'lm-evaluation-harness'...
remote: Enumerating objects: 22462, done.
remote: Counting objects: 100% (5248/5248), done.
remote: Compressing objects: 100% (687/687), done.
remote: Total 22462 (delta 4760), reused 4854 (delta 4559), pack-reused 17214
Receiving objects: 100% (22462/22462), 20.70 MiB | 14.91 MiB/s, done.
Resolving deltas: 100% (15465/15465), done.
In [2]:
%cd lm-evaluation-harness
/content/lm-evaluation-harness
In [3]:
# Pin the harness to a known commit so results stay reproducible
!git checkout 'e47e01beea79cfe87421e2dac49e64d499c240b4'
Note: switching to 'e47e01beea79cfe87421e2dac49e64d499c240b4'.

You are in 'detached HEAD' state. You can look around, make experimental changes
and commit them, and you can discard any commits you make in this state without
impacting any branches by switching back to a branch.

If you want to create a new branch to retain commits you create, you may do so
(now or later) by using -c with the switch command. Example:

  git switch -c <new-branch-name>

Or undo this operation with:

  git switch -

Turn off this advice by setting config variable advice.detachedHead to false

HEAD is now at e47e01be Merge pull request #435 from EleutherAI/haileyschoelkopf-patch-1
In [4]:
!pip install -q -e .
Preparing metadata (setup.py) ... done
Installing build dependencies ... done
Getting requirements to build wheel ... done
Preparing metadata (pyproject.toml) ... done
Building wheel for antlr4-python3-runtime (setup.py) ... done
Building wheel for rouge-score (setup.py) ... done
Building wheel for pycountry (pyproject.toml) ... done
Building wheel for sqlitedict (setup.py) ... done
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
llmx 0.0.15a0 requires cohere, which is not installed.
llmx 0.0.15a0 requires tiktoken, which is not installed.
In [5]:
# Sanity check: confirm the core dependencies import cleanly
import transformers
import accelerate
In [6]:
# Optional: bleurt is only needed for tasks that score with the BLEURT metric
#!pip install bleurt@https://github.com/google-research/bleurt/archive/b610120347ef22b494b6d69b4316e303f5932516.zip#egg=bleurt
In [7]:
%%time
# ~3 minutes to run; accuracy of about 44%
!python main.py \
--model gpt2 \
--num_fewshot 0 \
--tasks arc_easy \
--device 0
2023-10-21 03:33:47.513338: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Selected Tasks: ['arc_easy']
Using device '0'
Running loglikelihood requests
100% 9496/9496 [02:48<00:00, 56.19it/s]
{
  "results": {
    "arc_easy": {
      "acc": 0.43813131313131315,
      "acc_stderr": 0.010180937100600062,
      "acc_norm": 0.3947811447811448,
      "acc_norm_stderr": 0.010030038935883556
    }
  },
  "versions": {
    "arc_easy": 0
  },
  "config": {
    "model": "gpt2",
    "model_args": "",
    "num_fewshot": 0,
    "batch_size": null,
    "device": "0",
    "no_cache": false,
    "limit": null,
    "bootstrap_iters": 100000,
    "description_dict": {}
  }
}
gpt2 (), limit: None, provide_description: False, num_fewshot: 0, batch_size: None
| Task |Version| Metric |Value | |Stderr|
|--------|------:|--------|-----:|---|-----:|
|arc_easy| 0|acc |0.4381|± |0.0102|
| | |acc_norm|0.3948|± |0.0100|
CPU times: user 3.22 s, sys: 365 ms, total: 3.59 s
Wall time: 4min 6s
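If you prefer to stay in Python rather than shelling out to main.py, the harness also exposes an evaluator module. Below is a minimal sketch, assuming the simple_evaluate entry point present in lm_eval/evaluator.py at this commit; check that file if the signature differs.

from lm_eval import evaluator

# Programmatic equivalent of the CLI run above (a sketch, not run in this notebook)
results = evaluator.simple_evaluate(
    model="gpt2",
    tasks=["arc_easy"],
    num_fewshot=0,
    device="0",
)
print(results["results"]["arc_easy"])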
In [8]:
%%time
# ~10 minutes to run; accuracy of about 56%
# `--model gpt2` selects the Hugging Face causal-LM backend; `pretrained=`
# points it at a different checkpoint, here GPT-Neo 1.3B
!python main.py \
--model gpt2 \
--model_args pretrained=EleutherAI/gpt-neo-1.3B \
--num_fewshot 0 \
--tasks arc_easy \
--device 0
2023-10-21 03:37:52.813056: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Selected Tasks: ['arc_easy']
Using device '0'
Running loglikelihood requests
100% 9496/9496 [08:25<00:00, 18.80it/s]
{
  "results": {
    "arc_easy": {
      "acc": 0.5618686868686869,
      "acc_stderr": 0.010180937100600074,
      "acc_norm": 0.502104377104377,
      "acc_norm_stderr": 0.010259692651537032
    }
  },
  "versions": {
    "arc_easy": 0
  },
  "config": {
    "model": "gpt2",
    "model_args": "pretrained=EleutherAI/gpt-neo-1.3B",
    "num_fewshot": 0,
    "batch_size": null,
    "device": "0",
    "no_cache": false,
    "limit": null,
    "bootstrap_iters": 100000,
    "description_dict": {}
  }
}
gpt2 (pretrained=EleutherAI/gpt-neo-1.3B), limit: None, provide_description: False, num_fewshot: 0, batch_size: None
| Task |Version| Metric |Value | |Stderr|
|--------|------:|--------|-----:|---|-----:|
|arc_easy| 0|acc |0.5619|± |0.0102|
| | |acc_norm|0.5021|± |0.0103|
CPU times: user 7.27 s, sys: 774 ms, total: 8.04 s
Wall time: 9min 49s
In [9]:
%%time
# ~18 minutes to run; accuracy of about 61% (this run was interrupted below)
!python main.py \
--model gpt2 \
--model_args pretrained=EleutherAI/gpt-neo-2.7B \
--num_fewshot 0 \
--tasks arc_easy \
--device 0
2023-10-21 03:47:48.449712: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Selected Tasks: ['arc_easy']
Using device '0'
Downloading model.safetensors: 100% 10.7G/10.7G [01:10<00:00, 152MB/s]
^C
CPU times: user 1.09 s, sys: 250 ms, total: 1.34 s
Wall time: 2min 4s
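The 2.7B run above was interrupted before finishing. When a full run is too slow, the harness's --limit flag evaluates only the first N examples per task, which makes for a quick smoke test; keep in mind that accuracy from a limited run is not comparable to a full benchmark score.

%%time
# Smoke test: evaluate only the first 100 examples per task
!python main.py \
--model gpt2 \
--model_args pretrained=EleutherAI/gpt-neo-2.7B \
--num_fewshot 0 \
--tasks arc_easy \
--limit 100 \
--device 0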
It's best practice to use getpass in these kinds of notebooks so the key never appears in the cell or its output. Your API key should look something like sk-...
In [10]:
import getpass
import os

# The harness at this commit reads the OpenAI key from OPENAI_API_SECRET_KEY
open_ai_key = getpass.getpass('Enter your OPENAI API Key')
os.environ['OPENAI_API_SECRET_KEY'] = open_ai_key
Enter your OPENAI API Key··········
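Before launching a paid run, a quick sanity check that the key actually landed in the environment can save a wasted job (a minimal sketch; the sk- prefix check follows the key format noted above):

# Fail fast if the key is missing or obviously malformed
key = os.environ.get('OPENAI_API_SECRET_KEY', '')
assert key.startswith('sk-'), 'OPENAI_API_SECRET_KEY is not set correctly'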
In [11]:
%%time
# ~2 minutes to run (88 requests); accuracy of about 86%
!python main.py \
--model gpt3 \
--model_args engine=davinci \
--num_fewshot 2 \
--tasks sst
2023-10-21 03:51:06.509333: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Selected Tasks: ['sst']
Running loglikelihood requests
100% 88/88 [01:34<00:00, 1.07s/it]
{
  "results": {
    "sst": {
      "acc": 0.8600917431192661,
      "acc_stderr": 0.011753981006588683
    }
  },
  "versions": {
    "sst": 0
  },
  "config": {
    "model": "gpt3",
    "model_args": "engine=davinci",
    "num_fewshot": 2,
    "batch_size": null,
    "device": null,
    "no_cache": false,
    "limit": null,
    "bootstrap_iters": 100000,
    "description_dict": {}
  }
}
gpt3 (engine=davinci), limit: None, provide_description: False, num_fewshot: 2, batch_size: None
|Task|Version|Metric|Value | |Stderr|
|----|------:|------|-----:|---|-----:|
|sst | 0|acc |0.8601|± |0.0118|
CPU times: user 821 ms, sys: 121 ms, total: 942 ms
Wall time: 2min 3s
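The --tasks flag accepts any name in the harness's task registry. Below is a minimal sketch of how to list the registered tasks, assuming the ALL_TASKS list exposed by lm_eval.tasks at this commit:

from lm_eval import tasks

# All registered task names; 'arc_easy' and 'sst' should both appear
print(len(tasks.ALL_TASKS))
print([t for t in tasks.ALL_TASKS if t.startswith('arc')])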
Making a new task for the harness
This part documents how to create a new task for the language model evaluation harness and is based on the harness's task guide (docs/task_guide.md, linked in the cell below).
In [12]:
# After forking the harness on GitHub...
!cd .. && git clone https://github.com/esbenkc/lm-evaluation-harness.git lm-evaluation-harness-new-task
# The clone lands in the parent directory, so change into it from there
%cd ../lm-evaluation-harness-new-task
!git checkout -b "cool-patrol"
!pip install -q -e ".[dev]"
Cloning into 'lm-evaluation-harness-new-task'...
remote: Enumerating objects: 7910, done.
remote: Counting objects: 100% (766/766), done.
remote: Compressing objects: 100% (65/65), done.
remote: Total 7910 (delta 730), reused 701 (delta 701), pack-reused 7144
Receiving objects: 100% (7910/7910), 9.49 MiB | 21.74 MiB/s, done.
Resolving deltas: 100% (5080/5080), done.
Switched to a new branch 'cool-patrol'
Preparing metadata (setup.py) ... done
In [13]:
# See https://github.com/EleutherAI/lm-evaluation-harness/blob/master/docs/task_guide.md#creating-your-task-file
# Use underscores in the filename: the task module must be importable from
# lm_eval/tasks/__init__.py, and hyphens are not valid in Python module names
!cp templates/new_task.py lm_eval/tasks/cool_patrol.py
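The copied template subclasses lm_eval.base.Task, and you fill in its hooks. Below is a minimal sketch of the pieces a typical loglikelihood-scored task implements; the class name, dataset path, and field names (question, answer) are placeholders, and the real template in templates/new_task.py documents each hook in more detail.

from lm_eval.base import Task, rf
from lm_eval.metrics import mean


class CoolPatrol(Task):
    VERSION = 0
    DATASET_PATH = "my_org/my_dataset"  # placeholder Hugging Face dataset path
    DATASET_NAME = None

    def has_training_docs(self):
        return True

    def has_validation_docs(self):
        return True

    def has_test_docs(self):
        return False

    def training_docs(self):
        return self.dataset["train"]

    def validation_docs(self):
        return self.dataset["validation"]

    def doc_to_text(self, doc):
        # Prompt shown to the model for each document
        return f"Question: {doc['question']}\nAnswer:"

    def doc_to_target(self, doc):
        # Gold continuation the model is scored against
        return " " + doc["answer"]

    def construct_requests(self, doc, ctx):
        # Ask the model for the log-likelihood of the gold answer
        return rf.loglikelihood(ctx, self.doc_to_target(doc))

    def process_results(self, doc, results):
        ll, is_greedy = results[0]
        return {"acc": int(is_greedy)}

    def aggregation(self):
        return {"acc": mean}

    def higher_is_better(self):
        return {"acc": True}

To make the task selectable with --tasks, register the class in TASK_REGISTRY inside lm_eval/tasks/__init__.py, which is why the filename needs to be a valid Python module name.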