Language Models
bocoel.GenerativeModel
Bases: Protocol
generate abstractmethod
generate(prompts: Sequence[str]) -> Sequence[str]
TODO
Add logits.
Generate a sequence of responses given prompts. The number of responses is the same as the number of prompts. Each response is a continuation of its prompt, so the prompt is a prefix of the response.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
prompts | Sequence[str] | The prompts to generate from. | required |
Returns:
Type | Description |
---|---|
Sequence[str] | The generated responses. The number of responses must equal the number of prompts. |
Source code in src/bocoel/models/lms/interfaces/generative.py
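The contract described above (one response per prompt, each response starting with its prompt) can be sketched as follows. This is a minimal illustration, not the library's code: the `GenerativeModel` protocol here is a local stand-in that mirrors the documented signature, and `EchoModel` and `check_contract` are hypothetical names introduced only for this example.

```python
from collections.abc import Sequence
from typing import Protocol


class GenerativeModel(Protocol):
    """Local stand-in mirroring the documented interface (not imported from bocoel)."""

    def generate(self, prompts: Sequence[str]) -> Sequence[str]: ...


class EchoModel:
    """Hypothetical toy model that appends a fixed suffix to every prompt."""

    def __init__(self, suffix: str = " ...") -> None:
        self.suffix = suffix

    def generate(self, prompts: Sequence[str]) -> Sequence[str]:
        # One response per prompt; each response starts with its prompt.
        return [prompt + self.suffix for prompt in prompts]


def check_contract(model: GenerativeModel, prompts: Sequence[str]) -> Sequence[str]:
    """Hypothetical helper: call generate and verify the documented contract."""
    responses = model.generate(prompts)
    if len(responses) != len(prompts):
        raise ValueError("expected one response per prompt")
    for prompt, response in zip(prompts, responses):
        if not response.startswith(prompt):
            raise ValueError("each response should continue its prompt")
    return responses


responses = check_contract(EchoModel(), ["Hello", "The sky is"])
```

Because the interface is a protocol, `EchoModel` satisfies it structurally without inheriting from it.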
bocoel.ClassifierModel
Bases: Protocol
choices abstractmethod property
choices: Sequence[str]
The choices for this language model.
Returns:
Type | Description |
---|---|
Sequence[str] | The choices for this language model. |
classify
classify(prompts: Sequence[str]) -> NDArray
Classify the given prompts.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
prompts | Sequence[str] | The prompts to classify. | required |
Returns:
Type | Description |
---|---|
NDArray | The logits for each prompt and choice. |
Raises:
Type | Description |
---|---|
ValueError | If the shape of the logits is not [len(prompts), len(choices)]. |
Source code in src/bocoel/models/lms/interfaces/classifiers.py
_classify abstractmethod
_classify(prompts: Sequence[str]) -> NDArray
Generate logits given prompts.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
prompts | Sequence[str] | The prompts to classify. | required |
Returns:
Type | Description |
---|---|
NDArray | The logits for each prompt and choice. |
Source code in src/bocoel/models/lms/interfaces/classifiers.py
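The relationship between `classify` and `_classify` described above can be sketched as a public method that validates the shape of the logits returned by the abstract hook. This is an assumption-laden sketch rather than the library's source: it uses an abstract base class instead of a `Protocol` for brevity, and `ClassifierSketch`, `UniformClassifier`, and `BrokenClassifier` are hypothetical names.

```python
from abc import ABC, abstractmethod
from collections.abc import Sequence

import numpy as np
from numpy.typing import NDArray


class ClassifierSketch(ABC):
    """Hypothetical stand-in for the documented classifier interface."""

    @property
    @abstractmethod
    def choices(self) -> Sequence[str]: ...

    def classify(self, prompts: Sequence[str]) -> NDArray:
        # Delegate to the subclass hook, then check the documented shape.
        logits = self._classify(prompts)
        expected = (len(prompts), len(self.choices))
        if tuple(logits.shape) != expected:
            raise ValueError(f"expected logits of shape {expected}, got {logits.shape}")
        return logits

    @abstractmethod
    def _classify(self, prompts: Sequence[str]) -> NDArray: ...


class UniformClassifier(ClassifierSketch):
    """Toy classifier that returns all-zero logits of the right shape."""

    @property
    def choices(self) -> Sequence[str]:
        return ["yes", "no"]

    def _classify(self, prompts: Sequence[str]) -> NDArray:
        return np.zeros((len(prompts), len(self.choices)))


class BrokenClassifier(UniformClassifier):
    """Toy classifier that returns one column too few, to trigger the check."""

    def _classify(self, prompts: Sequence[str]) -> NDArray:
        return np.zeros((len(prompts), 1))


logits = UniformClassifier().classify(["a", "b", "c"])
```

With this split, implementers only write `_classify`, and every caller of `classify` gets the shape check for free.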
bocoel.HuggingfaceCausalLM
HuggingfaceCausalLM(
model_path: str, batch_size: int, device: str, add_sep_token: bool = False
)
The Huggingface implementation of a language model. This is a wrapper around the Huggingface library, which tries to pull the model from the Huggingface hub.
FIXME
add_sep_token might cause Huggingface to fail with an index-out-of-range error. It is still unclear how this occurs, as [SEP] is a special token.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
model_path | str | The path to the model. | required |
batch_size | int | The batch size to use. | required |
device | str | The device to use. | required |
add_sep_token | bool | Whether to add the sep token. | False |
Source code in src/bocoel/models/lms/huggingface/causal.py
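As an illustration of what a `batch_size` parameter typically controls, the sketch below splits a list of prompts into consecutive batches. It is an assumption about the general technique, not a description of how this class batches internally; `batched` is a hypothetical helper.

```python
from collections.abc import Iterator, Sequence


def batched(prompts: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most batch_size prompts (hypothetical helper)."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(prompts), batch_size):
        yield prompts[start : start + batch_size]


batches = list(batched(["a", "b", "c", "d", "e"], batch_size=2))
```

The last batch may be shorter than `batch_size` when the number of prompts is not a multiple of it.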
bocoel.HuggingfaceGenerativeLM
HuggingfaceGenerativeLM(
model_path: str, batch_size: int, device: str, add_sep_token: bool = False
)
Bases: HuggingfaceCausalLM, GenerativeModel
The generative model backed by huggingface's transformers library.
Since huggingface's tokenizer needs left padding to work, batching does not guarantee the same positional embeddings, and therefore the same results, as generating prompts one by one. If identical results to one-by-one generation are required, the batch size should be 1.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
model_path | str | The path to the model. | required |
batch_size | int | The batch size to use. | required |
device | str | The device to use. | required |
add_sep_token | bool | Whether to add the sep token. | False |
Source code in src/bocoel/models/lms/huggingface/generative.py
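The note on left padding can be made concrete with a small, library-free sketch. It pads hypothetical token-id lists on the left and shows that the first real token of a short prompt sits at a different absolute index when batched with a longer prompt than when processed alone. The token ids, the pad id, and the `left_pad` helper are all assumptions for illustration; whether a particular model compensates for this shift depends on the model.

```python
def left_pad(sequences: list[list[int]], pad_id: int = 0) -> tuple[list[list[int]], list[list[int]]]:
    """Pad every sequence on the left to the longest length; also return an attention mask."""
    width = max(len(seq) for seq in sequences)
    padded = [[pad_id] * (width - len(seq)) + list(seq) for seq in sequences]
    mask = [[0] * (width - len(seq)) + [1] * len(seq) for seq in sequences]
    return padded, mask


short = [11, 12]          # hypothetical token ids of a short prompt
long = [21, 22, 23, 24]   # hypothetical token ids of a longer prompt

# Batched together: the short prompt is shifted right by two pad tokens.
batch_ids, batch_mask = left_pad([short, long])
first_real_in_batch = batch_mask[0].index(1)

# Alone (batch size 1): no padding is needed, so the prompt starts at index 0.
single_ids, single_mask = left_pad([short])
first_real_alone = single_mask[0].index(1)
```

Here the short prompt starts at index 2 in the batch and at index 0 alone, which is the kind of shift the note above warns about.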
bocoel.HuggingfaceLogitsLM
HuggingfaceLogitsLM(
model_path: str,
batch_size: int,
device: str,
choices: Sequence[str],
add_sep_token: bool = False,
)
Bases: HuggingfaceCausalLM, ClassifierModel
Logits classification model backed by huggingface's transformers library. The model uses the logits of the choice tokens as its output for the current batch of inputs. For example, with five choices, choices = ['1', '2', '3', '4', '5'], the output is the logits of those five tokens.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
model_path | str | The path to the model. | required |
batch_size | int | The batch size to use. | required |
device | str | The device to use. | required |
choices | Sequence[str] | The choices to classify. | required |
add_sep_token | bool | Whether to add the sep token. | False |
Source code in src/bocoel/models/lms/huggingface/logits.py
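The idea of using the logits of the choice tokens as the classification output can be sketched with plain NumPy indexing. The vocabulary, the token ids, and the logit values below are made up for illustration, and the assumption that each choice maps to a single token id is a simplification, not a statement about how the library tokenizes choices.

```python
import numpy as np

# Hypothetical toy vocabulary mapping tokens to ids.
vocab = {"1": 0, "2": 1, "3": 2, "4": 3, "5": 4, "the": 5, "cat": 6}
choices = ["1", "2", "3", "4", "5"]
choice_ids = [vocab[choice] for choice in choices]

# Hypothetical next-token logits for a batch of 2 prompts over the 7-token vocabulary.
next_token_logits = np.arange(14, dtype=float).reshape(2, 7)

# Keep only the columns that belong to the choice tokens: shape [len(prompts), len(choices)].
choice_logits = next_token_logits[:, choice_ids]

# The predicted choice for each prompt is the one with the largest logit.
predicted = [choices[index] for index in choice_logits.argmax(axis=1)]
```

The resulting array has one row per prompt and one column per choice, which matches the shape that `classify` is documented to enforce.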
classify
classify(prompts: Sequence[str]) -> NDArray
Classify the given prompts.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
prompts | Sequence[str] | The prompts to classify. | required |
Returns:
Type | Description |
---|---|
NDArray | The logits for each prompt and choice. |
Raises:
Type | Description |
---|---|
ValueError | If the shape of the logits is not [len(prompts), len(choices)]. |
Source code in src/bocoel/models/lms/interfaces/classifiers.py
bocoel.HuggingfaceSequenceLM
HuggingfaceSequenceLM(
model_path: str,
device: str,
choices: Sequence[str],
add_sep_token: bool = False,
)
Bases: ClassifierModel
The sequence classification model backed by huggingface's transformers library.
Source code in src/bocoel/models/lms/huggingface/sequences.py
classify
classify(prompts: Sequence[str]) -> NDArray
Classify the given prompts.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
prompts | Sequence[str] | The prompts to classify. | required |
Returns:
Type | Description |
---|---|
NDArray | The logits for each prompt and choice. |
Raises:
Type | Description |
---|---|
ValueError | If the shape of the logits is not [len(prompts), len(choices)]. |
Source code in src/bocoel/models/lms/interfaces/classifiers.py
bocoel.HuggingfaceTokenizer
HuggingfaceTokenizer(model_path: str, device: str, add_sep_token: bool)
A tokenizer for Huggingface models.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
model_path | str | The path to the model. | required |
device | str | The device to use. | required |
add_sep_token | bool | Whether to add the sep token. | required |
Raises:
Type | Description |
---|---|
ImportError | If the transformers library is not installed. |
Source code in src/bocoel/models/lms/huggingface/tokenizers.py
to
to(device: str) -> Self
Move the tokenizer to the given device.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
device | str | The device to move to. | required |
Source code in src/bocoel/models/lms/huggingface/tokenizers.py
tokenize
tokenize(prompts: Sequence[str], /, max_length: int | None = None)
Tokenize, pad, truncate, and move the prompts to the device, then return the encoded results. The return value is a BatchEncoding, but this is not marked in the type hint because transformers is an optional dependency.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
prompts | Sequence[str] | The prompts to tokenize. | required |
Returns:
Type | Description |
---|---|
BatchEncoding | The tokenized prompts. |
Source code in src/bocoel/models/lms/huggingface/tokenizers.py
encode
encode(
prompts: Sequence[str],
/,
return_tensors: str | None = None,
add_special_tokens: bool = True,
)
Encode the given prompts.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
prompts | Sequence[str] | The prompts to encode. | required |
return_tensors | str \| None | Whether to return tensors. | None |
add_special_tokens | bool | Whether to add special tokens. | True |
Returns:
Type | Description |
---|---|
Any | The encoded prompts. |
Source code in src/bocoel/models/lms/huggingface/tokenizers.py
decode
decode(outputs: Any, /, skip_special_tokens: bool = True) -> str
Decode the given outputs.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
outputs | Any | The outputs to decode. | required |
skip_special_tokens | bool | Whether to skip special tokens. | True |
Returns:
Type | Description |
---|---|
str | The decoded outputs. |
Source code in src/bocoel/models/lms/huggingface/tokenizers.py
batch_decode
batch_decode(outputs: Any, /, skip_special_tokens: bool = True) -> list[str]
Batch decode the given outputs.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
outputs | Any | The outputs to decode. | required |
skip_special_tokens | bool | Whether to skip special tokens. | True |
Returns:
Type | Description |
---|---|
list[str] | The batch decoded outputs. |
Source code in src/bocoel/models/lms/huggingface/tokenizers.py