
Language Models

bocoel.GenerativeModel

Bases: Protocol

generate abstractmethod

generate(prompts: Sequence[str]) -> Sequence[str]
TODO

Add logits.

Generate a sequence of responses given prompts. The number of responses is the same as the number of prompts. Each response is a continuation of its corresponding prompt, so the prompt is a prefix of the response.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `prompts` | `Sequence[str]` | The prompts to generate. | *required* |

Returns:

| Type | Description |
| --- | --- |
| `Sequence[str]` | The generated responses. The length must be the same as that of the prompts. |

Source code in src/bocoel/models/lms/interfaces/generative.py
@abc.abstractmethod
def generate(self, prompts: Sequence[str], /) -> Sequence[str]:
    """
    TODO:
        Add logits.

    Generate a sequence of responses given prompts.
    The number of responses is the same as the number of prompts.
    Each response is a continuation of its corresponding prompt,
    so the prompt is a prefix of the response.

    Parameters:
        prompts: The prompts to generate.

    Returns:
        The generated responses. The length must be the same as the prompts.
    """

    ...
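
Because `GenerativeModel` is a `Protocol`, any class with a matching `generate` method conforms; no subclassing is required. A minimal sketch (the `EchoModel` class and its fixed suffix are hypothetical, for illustration only):

```python
from collections.abc import Sequence


class EchoModel:
    """A toy GenerativeModel: continues every prompt with a fixed suffix."""

    def generate(self, prompts: Sequence[str], /) -> Sequence[str]:
        # One response per prompt; each prompt is a prefix of its response.
        return [prompt + " and so on." for prompt in prompts]


model = EchoModel()
assert model.generate(["To be continued"]) == ["To be continued and so on."]
```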

bocoel.ClassifierModel

Bases: Protocol

choices abstractmethod property

choices: Sequence[str]

The choices for this language model.

Returns:

| Type | Description |
| --- | --- |
| `Sequence[str]` | The choices for this language model. |

classify

classify(prompts: Sequence[str]) -> NDArray

Classify the given prompts.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `prompts` | `Sequence[str]` | The prompts to classify. | *required* |

Returns:

| Type | Description |
| --- | --- |
| `NDArray` | The logits for each prompt and choice. |

Raises:

| Type | Description |
| --- | --- |
| `ValueError` | If the shape of the logits is not `[len(prompts), len(choices)]`. |

Source code in src/bocoel/models/lms/interfaces/classifiers.py
def classify(self, prompts: Sequence[str], /) -> NDArray:
    """
    Classify the given prompts.

    Parameters:
        prompts: The prompts to classify.

    Returns:
        The logits for each prompt and choice.

    Raises:
        ValueError: If the shape of the logits is not [len(prompts), len(choices)].
    """

    classified = self._classify(prompts)

    if list(classified.shape) != [len(prompts), len(self.choices)]:
        raise ValueError(
            f"Expected logits to have shape {[len(prompts), len(self.choices)]}, "
            f"but got {classified.shape}"
        )

    return classified

_classify abstractmethod

_classify(prompts: Sequence[str]) -> NDArray

Generate logits given prompts.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `prompts` | `Sequence[str]` | The prompts to classify. | *required* |

Returns:

| Type | Description |
| --- | --- |
| `NDArray` | The logits for each prompt and choice. |

Source code in src/bocoel/models/lms/interfaces/classifiers.py
@abc.abstractmethod
def _classify(self, prompts: Sequence[str], /) -> NDArray:
    """
    Generate logits given prompts.

    Parameters:
        prompts: The prompts to classify.

    Returns:
        The logits for each prompt and choice.
    """

    ...
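
Implementers only supply `choices` and `_classify`; the public `classify` wrapper above enforces the `[len(prompts), len(choices)]` shape contract. A minimal sketch that subclasses the protocol directly (the `KeywordClassifier` class is hypothetical):

```python
from collections.abc import Sequence

import numpy as np
from numpy.typing import NDArray

from bocoel import ClassifierModel


class KeywordClassifier(ClassifierModel):
    """A toy classifier: scores each choice by its presence in the prompt."""

    def __init__(self, choices: Sequence[str]) -> None:
        self._choices = choices

    @property
    def choices(self) -> Sequence[str]:
        return self._choices

    def _classify(self, prompts: Sequence[str], /) -> NDArray:
        # Must have shape [len(prompts), len(choices)], or classify() raises.
        return np.array(
            [[float(choice in prompt) for choice in self.choices] for prompt in prompts]
        )


clf = KeywordClassifier(choices=["yes", "no"])
print(clf.classify(["yes or no?"]))  # [[1. 1.]], shape (1, 2)
```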

bocoel.HuggingfaceCausalLM

HuggingfaceCausalLM(
    model_path: str, batch_size: int, device: str, add_sep_token: bool = False
)

The Huggingface implementation of a language model. This is a wrapper around the Huggingface transformers library, which pulls the model from the Huggingface hub if it is not cached locally.

FIXME

add_sep_token might cause Huggingface to fail with an index out of range error. It is still unclear how this can occur, since [SEP] is a special token.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `model_path` | `str` | The path to the model. | *required* |
| `batch_size` | `int` | The batch size to use. | *required* |
| `device` | `str` | The device to use. | *required* |
| `add_sep_token` | `bool` | Whether to add the sep token. | `False` |
Source code in src/bocoel/models/lms/huggingface/causal.py
def __init__(
    self, model_path: str, batch_size: int, device: str, add_sep_token: bool = False
) -> None:
    """
    Parameters:
        model_path: The path to the model.
        batch_size: The batch size to use.
        device: The device to use.
        add_sep_token: Whether to add the sep token.
    """

    # Optional dependency.
    from transformers import AutoModelForCausalLM

    self._model_path = model_path
    self._tokenizer = HuggingfaceTokenizer(
        model_path=model_path, device=device, add_sep_token=add_sep_token
    )

    # Model used for generation
    self._model = AutoModelForCausalLM.from_pretrained(model_path)
    self._model.pad_token = self._tokenizer.pad_token

    self._batch_size = batch_size

    self.to(device)
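
A minimal construction sketch, assuming the optional `transformers` dependency is installed; `gpt2` is an example hub checkpoint, not a requirement:

```python
from bocoel import HuggingfaceCausalLM

lm = HuggingfaceCausalLM(
    model_path="gpt2",  # pulled from the Huggingface hub if not cached locally
    batch_size=4,
    device="cpu",       # or e.g. "cuda" when a GPU is available
)
```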

bocoel.HuggingfaceGenerativeLM

HuggingfaceGenerativeLM(
    model_path: str, batch_size: int, device: str, add_sep_token: bool = False
)

Bases: HuggingfaceCausalLM, GenerativeModel

The generative model backed by Huggingface's transformers library.

Since Huggingface's tokenizer pads on the left for generation, padded batches are not guaranteed to produce the same positional embeddings, and therefore the same results, as unpadded inputs. If results identical to generating one prompt at a time are desired, use a batch size of 1.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `model_path` | `str` | The path to the model. | *required* |
| `batch_size` | `int` | The batch size to use. | *required* |
| `device` | `str` | The device to use. | *required* |
| `add_sep_token` | `bool` | Whether to add the sep token. | `False` |
Source code in src/bocoel/models/lms/huggingface/generative.py
def __init__(
    self, model_path: str, batch_size: int, device: str, add_sep_token: bool = False
) -> None:
    """
    Parameters:
        model_path: The path to the model.
        batch_size: The batch size to use.
        device: The device to use.
        add_sep_token: Whether to add the sep token.
    """

    super().__init__(
        model_path=model_path,
        batch_size=batch_size,
        device=device,
        add_sep_token=add_sep_token,
    )
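
A usage sketch, again with `gpt2` as an example checkpoint; `batch_size=1` sidesteps the left-padding caveat above:

```python
from bocoel import HuggingfaceGenerativeLM

lm = HuggingfaceGenerativeLM(model_path="gpt2", batch_size=1, device="cpu")

completions = lm.generate(["The capital of France is"])
# One response per prompt; each response starts with its prompt.
print(completions[0])
```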

bocoel.HuggingfaceLogitsLM

HuggingfaceLogitsLM(
    model_path: str,
    batch_size: int,
    device: str,
    choices: Sequence[str],
    add_sep_token: bool = False,
)

Bases: HuggingfaceCausalLM, ClassifierModel

Logits classification model backed by Huggingface's transformers library. The model uses the logits of the choice tokens as the output for the current batch of inputs; for example, with choices = ['1', '2', '3', '4', '5'], the output is the logits of the tokens '1' through '5'.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `model_path` | `str` | The path to the model. | *required* |
| `batch_size` | `int` | The batch size to use. | *required* |
| `device` | `str` | The device to use. | *required* |
| `choices` | `Sequence[str]` | The choices to classify. | *required* |
| `add_sep_token` | `bool` | Whether to add the sep token. | `False` |
Source code in src/bocoel/models/lms/huggingface/logits.py
def __init__(
    self,
    model_path: str,
    batch_size: int,
    device: str,
    choices: Sequence[str],
    add_sep_token: bool = False,
) -> None:
    """
    Parameters:
        model_path: The path to the model.
        batch_size: The batch size to use.
        device: The device to use.
        choices: The choices to classify.
        add_sep_token: Whether to add the sep token.
    """

    super().__init__(
        model_path=model_path,
        batch_size=batch_size,
        device=device,
        add_sep_token=add_sep_token,
    )

    self._choices = choices
    self._encoded_choices = self._encode_tokens(self._choices)
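
A usage sketch with five numeric choices, mirroring the example in the class description (the `gpt2` checkpoint is illustrative):

```python
from bocoel import HuggingfaceLogitsLM

lm = HuggingfaceLogitsLM(
    model_path="gpt2",
    batch_size=4,
    device="cpu",
    choices=["1", "2", "3", "4", "5"],
)

logits = lm.classify(["Rate this review from 1 to 5: great movie!"])
print(logits.shape)  # (1, 5): one row per prompt, one column per choice
```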

classify

classify(prompts: Sequence[str]) -> NDArray

Inherited from bocoel.ClassifierModel; see the classify documentation and source above.

bocoel.HuggingfaceSequenceLM

HuggingfaceSequenceLM(
    model_path: str,
    device: str,
    choices: Sequence[str],
    add_sep_token: bool = False,
)

Bases: ClassifierModel

The sequence classification model backed by Huggingface's transformers library.

Source code in src/bocoel/models/lms/huggingface/sequences.py
def __init__(
    self,
    model_path: str,
    device: str,
    choices: Sequence[str],
    add_sep_token: bool = False,
) -> None:
    # Optional dependency
    from transformers import AutoModelForSequenceClassification

    self._model_path = model_path
    self._tokenizer = HuggingfaceTokenizer(
        model_path=model_path, device=device, add_sep_token=add_sep_token
    )

    self._choices = choices

    classifier = AutoModelForSequenceClassification.from_pretrained(model_path)
    self._classifier = classifier.to(device)
    self._classifier.config.pad_token_id = self._tokenizer.pad_token_id
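
A usage sketch, assuming a sequence-classification checkpoint whose labels line up with the given choices (the checkpoint name below is illustrative):

```python
from bocoel import HuggingfaceSequenceLM

lm = HuggingfaceSequenceLM(
    model_path="distilbert-base-uncased-finetuned-sst-2-english",
    device="cpu",
    choices=["negative", "positive"],
)

logits = lm.classify(["A thoroughly enjoyable film."])
print(logits.shape)  # (1, 2): one column per choice
```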

classify

classify(prompts: Sequence[str]) -> NDArray

Inherited from bocoel.ClassifierModel; see the classify documentation and source above.

bocoel.HuggingfaceTokenizer

HuggingfaceTokenizer(model_path: str, device: str, add_sep_token: bool)

A tokenizer for Huggingface models.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `model_path` | `str` | The path to the model. | *required* |
| `device` | `str` | The device to use. | *required* |
| `add_sep_token` | `bool` | Whether to add the sep token. | *required* |

Raises:

| Type | Description |
| --- | --- |
| `ImportError` | If the transformers library is not installed. |

Source code in src/bocoel/models/lms/huggingface/tokenizers.py
def __init__(self, model_path: str, device: str, add_sep_token: bool) -> None:
    """
    Parameters:
        model_path: The path to the model.
        device: The device to use.
        add_sep_token: Whether to add the sep token.

    Raises:
        ImportError: If the transformers library is not installed.
    """

    # Optional dependency.
    from transformers import AutoTokenizer

    # Initializes the tokenizer and pad to the left for sequence generation.
    self._tokenizer = AutoTokenizer.from_pretrained(
        model_path, padding_side="left", truncation_side="left"
    )

    # Always add the pad token.
    if (eos := self._tokenizer.eos_token) is not None:
        self._tokenizer.pad_token = eos
    else:
        self._tokenizer.add_special_tokens({"pad_token": "[PAD]"})

    if add_sep_token:
        if self._tokenizer.sep_token is None:
            self._tokenizer.add_special_tokens({"sep_token": "[SEP]"})

    self._device = device
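
A construction sketch with `gpt2` as an example checkpoint; since GPT-2 defines an eos token, it is reused as the pad token per the logic above:

```python
from bocoel import HuggingfaceTokenizer

tokenizer = HuggingfaceTokenizer(model_path="gpt2", device="cpu", add_sep_token=False)
```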

to

to(device: str) -> Self

Move the tokenizer to the given device.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `device` | `str` | The device to move to. | *required* |
Source code in src/bocoel/models/lms/huggingface/tokenizers.py
def to(self, device: str, /) -> Self:
    """
    Move the tokenizer to the given device.

    Parameters:
        device: The device to move to.
    """
    self._device = device
    return self

tokenize

tokenize(prompts: Sequence[str], /, max_length: int | None = None)

Tokenize, pad, truncate, cast to device, and return the encoded results. The return type is BatchEncoding, but it is not marked in the type hint because transformers is an optional dependency.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `prompts` | `Sequence[str]` | The prompts to tokenize. | *required* |
| `max_length` | `int \| None` | The maximum length to truncate to. | `None` |

Returns:

| Type | Description |
| --- | --- |
| `BatchEncoding` | The tokenized prompts. |

Source code in src/bocoel/models/lms/huggingface/tokenizers.py
def tokenize(self, prompts: Sequence[str], /, max_length: int | None = None):
    """
    Tokenize, pad, truncate, cast to device, and yield the encoded results.
    Returning `BatchEncoding` but not marked in the type hint
    due to optional dependency.

    Parameters:
        prompts: The prompts to tokenize.

    Returns:
        (BatchEncoding): The tokenized prompts.
    """
    if not isinstance(prompts, list):
        prompts = list(prompts)

    inputs = self._tokenizer(
        prompts,
        return_tensors="pt",
        padding=True,
        truncation=True,
        max_length=max_length,
    )
    return inputs.to(self.device)
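
Reusing the tokenizer constructed above, a sketch of batched tokenization; prompts of different lengths are left-padded to a common length:

```python
batch = tokenizer.tokenize(["Hello world", "A somewhat longer prompt"], max_length=32)
# A transformers BatchEncoding; tensors already live on the tokenizer's device.
print(batch["input_ids"].shape)  # (2, padded_length)
```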

encode

encode(
    prompts: Sequence[str],
    /,
    return_tensors: str | None = None,
    add_special_tokens: bool = True,
)

Encode the given prompts.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `prompts` | `Sequence[str]` | The prompts to encode. | *required* |
| `return_tensors` | `str \| None` | The tensor format to return (e.g. `"pt"`); `None` returns Python lists. | `None` |
| `add_special_tokens` | `bool` | Whether to add special tokens. | `True` |

Returns:

| Type | Description |
| --- | --- |
| `Any` | The encoded prompts. |

Source code in src/bocoel/models/lms/huggingface/tokenizers.py
def encode(
    self,
    prompts: Sequence[str],
    /,
    return_tensors: str | None = None,
    add_special_tokens: bool = True,
):
    """
    Encode the given prompts.

    Parameters:
        prompts: The prompts to encode.
        return_tensors: The tensor format to return (e.g. "pt"); None returns lists.
        add_special_tokens: Whether to add special tokens.

    Returns:
        (Any): The encoded prompts.
    """

    return self._tokenizer.encode(
        prompts,
        return_tensors=return_tensors,
        add_special_tokens=add_special_tokens,
    )

decode

decode(outputs: Any, /, skip_special_tokens: bool = True) -> str

Decode the given outputs.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `outputs` | `Any` | The outputs to decode. | *required* |
| `skip_special_tokens` | `bool` | Whether to skip special tokens. | `True` |

Returns:

| Type | Description |
| --- | --- |
| `str` | The decoded outputs. |

Source code in src/bocoel/models/lms/huggingface/tokenizers.py
def decode(self, outputs: Any, /, skip_special_tokens: bool = True) -> str:
    """
    Decode the given outputs.

    Parameters:
        outputs: The outputs to decode.
        skip_special_tokens: Whether to skip special tokens.

    Returns:
        The decoded outputs.
    """

    return self._tokenizer.decode(outputs, skip_special_tokens=skip_special_tokens)

batch_decode

batch_decode(outputs: Any, /, skip_special_tokens: bool = True) -> list[str]

Batch decode the given outputs.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `outputs` | `Any` | The outputs to decode. | *required* |
| `skip_special_tokens` | `bool` | Whether to skip special tokens. | `True` |

Returns:

| Type | Description |
| --- | --- |
| `list[str]` | The batch decoded outputs. |

Source code in src/bocoel/models/lms/huggingface/tokenizers.py
def batch_decode(
    self, outputs: Any, /, skip_special_tokens: bool = True
) -> list[str]:
    """
    Batch decode the given outputs.

    Parameters:
        outputs: The outputs to decode.
        skip_special_tokens: Whether to skip special tokens.

    Returns:
        The batch decoded outputs.
    """

    return self._tokenizer.batch_decode(
        outputs, skip_special_tokens=skip_special_tokens
    )
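
A round-trip sketch tying `encode`, `decode`, and `batch_decode` together, reusing the tokenizer from the construction example above (exact outputs depend on the checkpoint's vocabulary):

```python
ids = tokenizer.encode("Hello world")  # token ids for a single prompt
print(tokenizer.decode(ids))           # -> "Hello world"

batch = tokenizer.tokenize(["Hello", "world"])
print(tokenizer.batch_decode(batch["input_ids"]))  # -> ["Hello", "world"]
```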