
Language Models

bocoel.GenerativeModel

Bases: Protocol

generate abstractmethod

generate(prompts: Sequence[str]) -> Sequence[str]
TODO

Add logits.

Generate a sequence of responses given prompts. The number of responses is the same as the number of prompts. Each response is a continuation of its corresponding prompt, so the prompt is a prefix of the response.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `prompts` | `Sequence[str]` | The prompts to generate. | *required* |

Returns:

| Type | Description |
| --- | --- |
| `Sequence[str]` | The generated responses. The length must be the same as that of the prompts. |

Source code in src/bocoel/models/lms/interfaces/generative.py
@abc.abstractmethod
def generate(self, prompts: Sequence[str], /) -> Sequence[str]:
    """
    TODO:
        Add logits.

    Generate a sequence of responses given prompts.
    The number of responses is the same as the number of prompts.
    Each response is a continuation of its corresponding prompt,
    so the prompt is a prefix of the response.

    Parameters:
        prompts: The prompts to generate.

    Returns:
        The generated responses. The length must be the same as the prompts.
    """

    ...
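
Because `GenerativeModel` is a `Protocol`, any class with a matching `generate` method conforms; no subclassing is required. A minimal sketch (the `EchoModel` class and its fixed suffix are hypothetical, for illustration only):

```python
from collections.abc import Sequence


class EchoModel:
    """A toy GenerativeModel: continues every prompt with a fixed suffix."""

    def generate(self, prompts: Sequence[str], /) -> Sequence[str]:
        # One response per prompt; each prompt is a prefix of its response.
        return [prompt + " and so on." for prompt in prompts]


model = EchoModel()
assert model.generate(["To be continued"]) == ["To be continued and so on."]
```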

bocoel.ClassifierModel

Bases: Protocol

choices abstractmethod property

choices: Sequence[str]

The choices for this language model.

Returns:

| Type | Description |
| --- | --- |
| `Sequence[str]` | The choices for this language model. |

classify

classify(prompts: Sequence[str]) -> NDArray

Classify the given prompts.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `prompts` | `Sequence[str]` | The prompts to classify. | *required* |

Returns:

| Type | Description |
| --- | --- |
| `NDArray` | The logits for each prompt and choice. |

Raises:

| Type | Description |
| --- | --- |
| `ValueError` | If the shape of the logits is not `[len(prompts), len(choices)]`. |

Source code in src/bocoel/models/lms/interfaces/classifiers.py
def classify(self, prompts: Sequence[str], /) -> NDArray:
    """
    Classify the given prompts.

    Parameters:
        prompts: The prompts to classify.

    Returns:
        The logits for each prompt and choice.

    Raises:
        ValueError: If the shape of the logits is not [len(prompts), len(choices)].
    """

    classified = self._classify(prompts)

    if list(classified.shape) != [len(prompts), len(self.choices)]:
        raise ValueError(
            f"Expected logits to have shape {[len(prompts), len(self.choices)]}, "
            f"but got {classified.shape}"
        )

    return classified

_classify abstractmethod

_classify(prompts: Sequence[str]) -> NDArray

Generate logits given prompts.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `prompts` | `Sequence[str]` | The prompts to classify. | *required* |

Returns:

| Type | Description |
| --- | --- |
| `NDArray` | The logits for each prompt and choice. |

Source code in src/bocoel/models/lms/interfaces/classifiers.py
@abc.abstractmethod
def _classify(self, prompts: Sequence[str], /) -> NDArray:
    """
    Generate logits given prompts.

    Parameters:
        prompts: The prompts to classify.

    Returns:
        The logits for each prompt and choice.
    """

    ...
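
Implementers only supply `choices` and `_classify`; the public `classify` wrapper above enforces the `[len(prompts), len(choices)]` shape contract. A minimal sketch that subclasses the protocol directly (the `KeywordClassifier` class is hypothetical):

```python
from collections.abc import Sequence

import numpy as np
from numpy.typing import NDArray

from bocoel import ClassifierModel


class KeywordClassifier(ClassifierModel):
    """A toy classifier: scores each choice by its presence in the prompt."""

    def __init__(self, choices: Sequence[str]) -> None:
        self._choices = choices

    @property
    def choices(self) -> Sequence[str]:
        return self._choices

    def _classify(self, prompts: Sequence[str], /) -> NDArray:
        # Must have shape [len(prompts), len(choices)], or classify() raises.
        return np.array(
            [[float(choice in prompt) for choice in self.choices] for prompt in prompts]
        )


clf = KeywordClassifier(choices=["yes", "no"])
print(clf.classify(["yes or no?"]))  # [[1. 1.]], shape (1, 2)
```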

bocoel.HuggingfaceCausalLM

HuggingfaceCausalLM(
    model_path: str, batch_size: int, device: str, add_sep_token: bool = False
)

The Huggingface implementation of a language model. This is a wrapper around the Huggingface transformers library, which pulls the model from the Huggingface hub if it is not cached locally.

FIXME

add_sep_token might cause Huggingface to fail with an index out of range error. It is still unclear how this can occur, since [SEP] is a special token.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `model_path` | `str` | The path to the model. | *required* |
| `batch_size` | `int` | The batch size to use. | *required* |
| `device` | `str` | The device to use. | *required* |
| `add_sep_token` | `bool` | Whether to add the sep token. | `False` |
Source code in src/bocoel/models/lms/huggingface/causal.py
def __init__(
    self, model_path: str, batch_size: int, device: str, add_sep_token: bool = False
) -> None:
    """
    Parameters:
        model_path: The path to the model.
        batch_size: The batch size to use.
        device: The device to use.
        add_sep_token: Whether to add the sep token.
    """

    # Optional dependency.
    from transformers import AutoModelForCausalLM

    self._model_path = model_path
    self._tokenizer = HuggingfaceTokenizer(
        model_path=model_path, device=device, add_sep_token=add_sep_token
    )

    # Model used for generation
    self._model = AutoModelForCausalLM.from_pretrained(model_path)
    self._model.pad_token = self._tokenizer.pad_token

    self._batch_size = batch_size

    self.to(device)
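
A minimal construction sketch, assuming the optional `transformers` dependency is installed; `gpt2` is an example hub checkpoint, not a requirement:

```python
from bocoel import HuggingfaceCausalLM

lm = HuggingfaceCausalLM(
    model_path="gpt2",  # pulled from the Huggingface hub if not cached locally
    batch_size=4,
    device="cpu",       # or e.g. "cuda" when a GPU is available
)
```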

bocoel.HuggingfaceGenerativeLM

HuggingfaceGenerativeLM(
    model_path: str, batch_size: int, device: str, add_sep_token: bool = False
)

Bases: HuggingfaceCausalLM, GenerativeModel

The generative model backed by Huggingface's transformers library.

Since Huggingface's tokenizer pads on the left for generation, padded batches are not guaranteed to produce the same positional embeddings, and therefore the same results, as unpadded inputs. If results identical to generating one prompt at a time are desired, use a batch size of 1.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `model_path` | `str` | The path to the model. | *required* |
| `batch_size` | `int` | The batch size to use. | *required* |
| `device` | `str` | The device to use. | *required* |
| `add_sep_token` | `bool` | Whether to add the sep token. | `False` |
Source code in src/bocoel/models/lms/huggingface/generative.py
def __init__(
    self, model_path: str, batch_size: int, device: str, add_sep_token: bool = False
) -> None:
    """
    Parameters:
        model_path: The path to the model.
        batch_size: The batch size to use.
        device: The device to use.
        add_sep_token: Whether to add the sep token.
    """

    super().__init__(
        model_path=model_path,
        batch_size=batch_size,
        device=device,
        add_sep_token=add_sep_token,
    )
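
A usage sketch, again with `gpt2` as an example checkpoint; `batch_size=1` sidesteps the left-padding caveat above:

```python
from bocoel import HuggingfaceGenerativeLM

lm = HuggingfaceGenerativeLM(model_path="gpt2", batch_size=1, device="cpu")

completions = lm.generate(["The capital of France is"])
# One response per prompt; each response starts with its prompt.
print(completions[0])
```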

bocoel.HuggingfaceLogitsLM

HuggingfaceLogitsLM(
    model_path: str,
    batch_size: int,
    device: str,
    choices: Sequence[str],
    add_sep_token: bool = False,
)

Bases: HuggingfaceCausalLM, ClassifierModel

Logits classification model backed by Huggingface's transformers library. The model uses the logits of the choice tokens as the output for the current batch of inputs; for example, with choices = ['1', '2', '3', '4', '5'], the output is the logits of the tokens '1' through '5'.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `model_path` | `str` | The path to the model. | *required* |
| `batch_size` | `int` | The batch size to use. | *required* |
| `device` | `str` | The device to use. | *required* |
| `choices` | `Sequence[str]` | The choices to classify. | *required* |
| `add_sep_token` | `bool` | Whether to add the sep token. | `False` |
Source code in src/bocoel/models/lms/huggingface/logits.py
def __init__(
    self,
    model_path: str,
    batch_size: int,
    device: str,
    choices: Sequence[str],
    add_sep_token: bool = False,
) -> None:
    """
    Parameters:
        model_path: The path to the model.
        batch_size: The batch size to use.
        device: The device to use.
        choices: The choices to classify.
        add_sep_token: Whether to add the sep token.
    """

    super().__init__(
        model_path=model_path,
        batch_size=batch_size,
        device=device,
        add_sep_token=add_sep_token,
    )

    self._choices = choices
    self._encoded_choices = self._encode_tokens(self._choices)
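
A usage sketch with five numeric choices, mirroring the example in the class description (the `gpt2` checkpoint is illustrative):

```python
from bocoel import HuggingfaceLogitsLM

lm = HuggingfaceLogitsLM(
    model_path="gpt2",
    batch_size=4,
    device="cpu",
    choices=["1", "2", "3", "4", "5"],
)

logits = lm.classify(["Rate this review from 1 to 5: great movie!"])
print(logits.shape)  # (1, 5): one row per prompt, one column per choice
```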

classify

classify(prompts: Sequence[str]) -> NDArray

Inherited from bocoel.ClassifierModel; see the classify documentation and source above.

bocoel.HuggingfaceSequenceLM

HuggingfaceSequenceLM(
    model_path: str,
    device: str,
    choices: Sequence[str],
    add_sep_token: bool = False,
)

Bases: ClassifierModel

The sequence classification model backed by Huggingface's transformers library.

Source code in src/bocoel/models/lms/huggingface/sequences.py
def __init__(
    self,
    model_path: str,
    device: str,
    choices: Sequence[str],
    add_sep_token: bool = False,
) -> None:
    # Optional dependency
    from transformers import AutoModelForSequenceClassification

    self._model_path = model_path
    self._tokenizer = HuggingfaceTokenizer(
        model_path=model_path, device=device, add_sep_token=add_sep_token
    )

    self._choices = choices

    classifier = AutoModelForSequenceClassification.from_pretrained(model_path)
    self._classifier = classifier.to(device)
    self._classifier.config.pad_token_id = self._tokenizer.pad_token_id
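
A usage sketch, assuming a sequence-classification checkpoint whose labels line up with the given choices (the checkpoint name below is illustrative):

```python
from bocoel import HuggingfaceSequenceLM

lm = HuggingfaceSequenceLM(
    model_path="distilbert-base-uncased-finetuned-sst-2-english",
    device="cpu",
    choices=["negative", "positive"],
)

logits = lm.classify(["A thoroughly enjoyable film."])
print(logits.shape)  # (1, 2): one column per choice
```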

classify

classify(prompts: Sequence[str]) -> NDArray

Inherited from bocoel.ClassifierModel; see the classify documentation and source above.

bocoel.HuggingfaceTokenizer

HuggingfaceTokenizer(model_path: str, device: str, add_sep_token: bool)

A tokenizer for Huggingface models.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `model_path` | `str` | The path to the model. | *required* |
| `device` | `str` | The device to use. | *required* |
| `add_sep_token` | `bool` | Whether to add the sep token. | *required* |

Raises:

| Type | Description |
| --- | --- |
| `ImportError` | If the transformers library is not installed. |

Source code in src/bocoel/models/lms/huggingface/tokenizers.py
def __init__(self, model_path: str, device: str, add_sep_token: bool) -> None:
    """
    Parameters:
        model_path: The path to the model.
        device: The device to use.
        add_sep_token: Whether to add the sep token.

    Raises:
        ImportError: If the transformers library is not installed.
    """

    # Optional dependency.
    from transformers import AutoTokenizer

    # Initializes the tokenizer and pad to the left for sequence generation.
    self._tokenizer = AutoTokenizer.from_pretrained(
        model_path, padding_side="left", truncation_side="left"
    )

    # Always add the pad token.
    if (eos := self._tokenizer.eos_token) is not None:
        self._tokenizer.pad_token = eos
    else:
        self._tokenizer.add_special_tokens({"pad_token": "[PAD]"})

    if add_sep_token:
        if self._tokenizer.sep_token is None:
            self._tokenizer.add_special_tokens({"sep_token": "[SEP]"})

    self._device = device
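
A construction sketch with `gpt2` as an example checkpoint; since GPT-2 defines an eos token, it is reused as the pad token per the logic above:

```python
from bocoel import HuggingfaceTokenizer

tokenizer = HuggingfaceTokenizer(model_path="gpt2", device="cpu", add_sep_token=False)
```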

to

to(device: str) -> Self

Move the tokenizer to the given device.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `device` | `str` | The device to move to. | *required* |
Source code in src/bocoel/models/lms/huggingface/tokenizers.py
def to(self, device: str, /) -> Self:
    """
    Move the tokenizer to the given device.

    Parameters:
        device: The device to move to.
    """
    self._device = device
    return self

tokenize

tokenize(prompts: Sequence[str], /, max_length: int | None = None)

Tokenize, pad, truncate, cast to device, and return the encoded results. The return type is BatchEncoding, but it is not marked in the type hint because transformers is an optional dependency.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `prompts` | `Sequence[str]` | The prompts to tokenize. | *required* |
| `max_length` | `int \| None` | The maximum length to truncate to. | `None` |

Returns:

| Type | Description |
| --- | --- |
| `BatchEncoding` | The tokenized prompts. |

Source code in src/bocoel/models/lms/huggingface/tokenizers.py
def tokenize(self, prompts: Sequence[str], /, max_length: int | None = None):
    """
    Tokenize, pad, truncate, cast to device, and yield the encoded results.
    Returning `BatchEncoding` but not marked in the type hint
    due to optional dependency.

    Parameters:
        prompts: The prompts to tokenize.

    Returns:
        (BatchEncoding): The tokenized prompts.
    """
    if not isinstance(prompts, list):
        prompts = list(prompts)

    inputs = self._tokenizer(
        prompts,
        return_tensors="pt",
        padding=True,
        truncation=True,
        max_length=max_length,
    )
    return inputs.to(self.device)
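
Reusing the tokenizer constructed above, a sketch of batched tokenization; prompts of different lengths are left-padded to a common length:

```python
batch = tokenizer.tokenize(["Hello world", "A somewhat longer prompt"], max_length=32)
# A transformers BatchEncoding; tensors already live on the tokenizer's device.
print(batch["input_ids"].shape)  # (2, padded_length)
```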

encode

encode(
    prompts: Sequence[str],
    /,
    return_tensors: str | None = None,
    add_special_tokens: bool = True,
)

Encode the given prompts.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `prompts` | `Sequence[str]` | The prompts to encode. | *required* |
| `return_tensors` | `str \| None` | The tensor format to return (e.g. `"pt"`); `None` returns Python lists. | `None` |
| `add_special_tokens` | `bool` | Whether to add special tokens. | `True` |

Returns:

| Type | Description |
| --- | --- |
| `Any` | The encoded prompts. |

Source code in src/bocoel/models/lms/huggingface/tokenizers.py
def encode(
    self,
    prompts: Sequence[str],
    /,
    return_tensors: str | None = None,
    add_special_tokens: bool = True,
):
    """
    Encode the given prompts.

    Parameters:
        prompts: The prompts to encode.
        return_tensors: The tensor format to return (e.g. "pt"); None returns lists.
        add_special_tokens: Whether to add special tokens.

    Returns:
        (Any): The encoded prompts.
    """

    return self._tokenizer.encode(
        prompts,
        return_tensors=return_tensors,
        add_special_tokens=add_special_tokens,
    )

decode

decode(outputs: Any, /, skip_special_tokens: bool = True) -> str

Decode the given outputs.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `outputs` | `Any` | The outputs to decode. | *required* |
| `skip_special_tokens` | `bool` | Whether to skip special tokens. | `True` |

Returns:

| Type | Description |
| --- | --- |
| `str` | The decoded outputs. |

Source code in src/bocoel/models/lms/huggingface/tokenizers.py
def decode(self, outputs: Any, /, skip_special_tokens: bool = True) -> str:
    """
    Decode the given outputs.

    Parameters:
        outputs: The outputs to decode.
        skip_special_tokens: Whether to skip special tokens.

    Returns:
        The decoded outputs.
    """

    return self._tokenizer.decode(outputs, skip_special_tokens=skip_special_tokens)

batch_decode

batch_decode(outputs: Any, /, skip_special_tokens: bool = True) -> list[str]

Batch decode the given outputs.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `outputs` | `Any` | The outputs to decode. | *required* |
| `skip_special_tokens` | `bool` | Whether to skip special tokens. | `True` |

Returns:

| Type | Description |
| --- | --- |
| `list[str]` | The batch decoded outputs. |

Source code in src/bocoel/models/lms/huggingface/tokenizers.py
def batch_decode(
    self, outputs: Any, /, skip_special_tokens: bool = True
) -> list[str]:
    """
    Batch decode the given outputs.

    Parameters:
        outputs: The outputs to decode.
        skip_special_tokens: Whether to skip special tokens.

    Returns:
        The batch decoded outputs.
    """

    return self._tokenizer.batch_decode(
        outputs, skip_special_tokens=skip_special_tokens
    )
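
A round-trip sketch tying `encode`, `decode`, and `batch_decode` together, reusing the tokenizer from the construction example above (exact outputs depend on the checkpoint's vocabulary):

```python
ids = tokenizer.encode("Hello world")  # token ids for a single prompt
print(tokenizer.decode(ids))           # -> "Hello world"

batch = tokenizer.tokenize(["Hello", "world"])
print(tokenizer.batch_decode(batch["input_ids"]))  # -> ["Hello", "world"]
```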