Corpus

bocoel.Corpus

Bases: Protocol

Corpus is the entry point to handling the data in this library.

A corpus has 3 main components:

- Index: Searches one particular column in the storage. Provides fast retrieval.
- Storage: Used to store the questions / answers / texts.
- Embedder: Embeds the text into vectors for faster access.

An index only corresponds to one key. If search over multiple keys is desired, a new column or a new corpus (with shared storage) should be created.

storage instance-attribute

storage: Storage

Storage is used to store the questions / answers / etc. Can be viewed as a dataframe of texts.

index instance-attribute

index: Index

Index searches one particular column of the storage, which has been embedded into vectors.
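
Because Corpus is a Protocol, any object exposing storage and index attributes conforms to it structurally. The following is a minimal sketch of that idea and of the "new corpus with shared storage" pattern mentioned above. CorpusLike, ToyCorpus, and the string placeholders used as indices are hypothetical stand-ins for illustration, not part of bocoel.

```python
from dataclasses import dataclass
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class CorpusLike(Protocol):
    """Hypothetical stand-in for the Corpus protocol: two attributes."""

    storage: Any
    index: Any


@dataclass(frozen=True)
class ToyCorpus:
    """Conforms to CorpusLike structurally; it does not inherit from it."""

    storage: Any
    index: Any


# One shared, column-oriented storage and one (placeholder) index per column.
shared_storage = {
    "question": ["What is 2 + 2?", "Capital of France?"],
    "answer": ["4", "Paris"],
}
question_corpus = ToyCorpus(storage=shared_storage, index="index over 'question'")
answer_corpus = ToyCorpus(storage=shared_storage, index="index over 'answer'")

# Both corpora search different columns but point at the same storage object.
same_storage = question_corpus.storage is answer_corpus.storage
```

Searching over a second key therefore needs a second index, but not a second copy of the data.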

bocoel.ComposedCorpus dataclass

Bases: Corpus

Simply a collection of components.

index_storage classmethod

index_storage(
    storage: Storage,
    embedder: Embedder,
    keys: Sequence[str],
    index_backend: type[Index],
    concat: Callable[[Iterable[Any]], str] = " [SEP] ".join,
    **index_kwargs: Any
) -> Self

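Creates a corpus from the given storage, embedder, keys, and index class. For each storage entry, the values stored under the given keys are joined into a single string with concat, and that string is what gets embedded and indexed.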

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| storage | Storage | The storage to index. | required |
| embedder | Embedder | The embedder to use. | required |
| keys | Sequence[str] | The keys to use for the index. | required |
| index_backend | type[Index] | The index class to use. | required |
| concat | Callable[[Iterable[Any]], str] | The function used to concatenate the values stored under the keys. | " [SEP] ".join |
| **index_kwargs | Any | Additional arguments to pass to the index class. | {} |

Returns:

| Type | Description |
| --- | --- |
| Self | The created corpus. |
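
To illustrate how keys and concat interact, here is a small sketch that reproduces the inner transform of index_storage outside the library. The make_transform helper and the column names are hypothetical. The batch is assumed to be a mapping from column name to a sequence of values, matching the transform's type signature.

```python
from typing import Any, Callable, Iterable, Mapping, Sequence


def make_transform(
    keys: Sequence[str],
    concat: Callable[[Iterable[Any]], str] = " [SEP] ".join,
) -> Callable[[Mapping[str, Sequence[Any]]], Sequence[str]]:
    """Hypothetical helper mirroring the inner transform of index_storage."""

    def transform(mapping: Mapping[str, Sequence[Any]]) -> Sequence[str]:
        # Select the requested columns, then walk them row by row.
        data = [mapping[k] for k in keys]
        return [concat(datum) for datum in zip(*data)]

    return transform


batch = {
    "question": ["What is 2 + 2?", "Capital of France?"],
    "answer": ["4", "Paris"],
    "source": ["math", "geography"],  # not listed in keys, so ignored
}

texts = make_transform(keys=["question", "answer"])(batch)
# texts == ["What is 2 + 2? [SEP] 4", "Capital of France? [SEP] Paris"]

# A custom concat can convert non-string values before joining.
pipe_texts = make_transform(
    keys=["question", "source"],
    concat=lambda row: " | ".join(str(value) for value in row),
)(batch)
```

The default concat is a plain str.join, so it only accepts string values. Columns holding other types need a custom concat like the one above.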

Source code in src/bocoel/corpora/corpora/composed.py
@classmethod
def index_storage(
    cls,
    storage: Storage,
    embedder: Embedder,
    keys: Sequence[str],
    index_backend: type[Index],
    concat: Callable[[Iterable[Any]], str] = " [SEP] ".join,
    **index_kwargs: Any,
) -> Self:
    """
    Creates a corpus from the given storage, embedder, keys and index class,
    where the values under the given keys are concatenated into one string per entry.

    Parameters:
        storage: The storage to index.
        embedder: The embedder to use.
        keys: The keys to use for the index.
        index_backend: The index class to use.
        concat: The function to use to concatenate the keys.
        **index_kwargs: Additional arguments to pass to the index class.

    Returns:
        The created corpus.
    """

    def transform(mapping: Mapping[str, Sequence[Any]]) -> Sequence[str]:
        data = [mapping[k] for k in keys]
        return [concat(datum) for datum in zip(*data)]

    return cls.index_mapped(
        storage=storage,
        embedder=embedder,
        transform=transform,
        index_backend=index_backend,
        **index_kwargs,
    )

index_mapped classmethod

index_mapped(
    storage: Storage,
    embedder: Embedder,
    transform: Callable[[Mapping[str, Sequence[Any]]], Sequence[str]],
    index_backend: type[Index],
    **index_kwargs: Any
) -> Self

Creates a corpus from the given storage, embedder, and index class. Storage entries are mapped to strings by the specified batched transform function, and the resulting strings are embedded and indexed.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| storage | Storage | The storage to index. | required |
| embedder | Embedder | The embedder to use. | required |
| transform | Callable[[Mapping[str, Sequence[Any]]], Sequence[str]] | The function to use to transform the storage entries. | required |
| index_backend | type[Index] | The index class to use. | required |
| **index_kwargs | Any | Additional arguments to pass to the index class. | {} |

Returns:

| Type | Description |
| --- | --- |
| Self | The created corpus. |
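
A batched transform receives a mapping from column name to a sequence of values and returns one string per row. The sketch below shows one possible transform. The function name, the "question" and "context" columns, and the output template are all hypothetical and chosen only for illustration.

```python
from typing import Any, Mapping, Sequence


def question_with_context(mapping: Mapping[str, Sequence[Any]]) -> Sequence[str]:
    """Hypothetical batched transform: render one string per row."""
    questions = mapping["question"]
    contexts = mapping["context"]
    # zip pairs the i-th question with the i-th context.
    return [f"Q: {q}\nContext: {c}" for q, c in zip(questions, contexts)]


batch = {
    "question": ["Who wrote Hamlet?", "What is H2O?"],
    "context": ["English literature", "Chemistry"],
}
rendered = question_with_context(batch)
```

A function of this shape matches the transform parameter in the signature above. It allows arbitrary templating, which the fixed join used by index_storage does not.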

Source code in src/bocoel/corpora/corpora/composed.py
@classmethod
def index_mapped(
    cls,
    storage: Storage,
    embedder: Embedder,
    transform: Callable[[Mapping[str, Sequence[Any]]], Sequence[str]],
    index_backend: type[Index],
    **index_kwargs: Any,
) -> Self:
    """
    Creates a corpus from the given storage, embedder and index class,
    where storage entries are mapped to strings
    using the specified batched transform function.

    Parameters:
        storage: The storage to index.
        embedder: The embedder to use.
        transform: The function to use to transform the storage entries.
        index_backend: The index class to use.
        **index_kwargs: Additional arguments to pass to the index class.

    Returns:
        The created corpus.
    """

    embeddings = embedder.encode_storage(storage, transform=transform)
    return cls.index_embeddings(
        embeddings=embeddings,
        storage=storage,
        index_backend=index_backend,
        **index_kwargs,
    )

index_embeddings classmethod

index_embeddings(
    storage: Storage,
    embeddings: NDArray,
    index_backend: type[Index],
    **index_kwargs: Any
) -> Self

Create the corpus with the given embeddings. This can be used to save time by encoding once and caching embeddings.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| storage | Storage | The storage to use. | required |
| embeddings | NDArray | The embeddings to use. | required |
| index_backend | type[Index] | The index class to use. | required |
| **index_kwargs | Any | Additional arguments to pass to the index class. | {} |

Returns:

| Type | Description |
| --- | --- |
| Self | The created corpus. |
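
The "encode once, reuse" pattern can be sketched with toy stand-ins. ToyIndex, ToyCorpus, expensive_encode, and the distance keyword are hypothetical. Nested Python lists stand in for the NDArray of embeddings so the sketch needs no third-party packages.

```python
from dataclasses import dataclass
from typing import Any, Sequence

Vector = Sequence[float]


class ToyIndex:
    """Hypothetical backend: takes embeddings plus keyword options."""

    def __init__(self, embeddings: Sequence[Vector], **kwargs: Any) -> None:
        self.embeddings = embeddings
        self.options = kwargs


@dataclass
class ToyCorpus:
    index: ToyIndex
    storage: Any

    @classmethod
    def index_embeddings(
        cls,
        storage: Any,
        embeddings: Sequence[Vector],
        index_backend: type,
        **index_kwargs: Any,
    ) -> "ToyCorpus":
        # Build the index from precomputed embeddings, then bundle it with storage.
        index = index_backend(embeddings, **index_kwargs)
        return cls(index=index, storage=storage)


encode_calls: list = []


def expensive_encode(texts: Sequence[str]) -> list:
    """Stand-in for an embedder; records each call to show it runs only once."""
    encode_calls.append(len(texts))
    return [[float(len(t)), float(t.count(" "))] for t in texts]


storage = {"text": ["hello world", "hi"]}
cached = expensive_encode(storage["text"])  # encode once, keep the result

# Two corpora with different index options reuse the same cached embeddings.
corpus_l2 = ToyCorpus.index_embeddings(storage, cached, ToyIndex, distance="l2")
corpus_cos = ToyCorpus.index_embeddings(storage, cached, ToyIndex, distance="cosine")
```

Separating encoding from indexing in this way lets index backends or their options be compared without re-running the embedder.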

Source code in src/bocoel/corpora/corpora/composed.py
@classmethod
def index_embeddings(
    cls,
    storage: Storage,
    embeddings: NDArray,
    index_backend: type[Index],
    **index_kwargs: Any,
) -> Self:
    """
    Create the corpus with the given embeddings.
    This can be used to save time by encoding once and caching embeddings.

    Parameters:
        storage: The storage to use.
        embeddings: The embeddings to use.
        index_backend: The index class to use.
        **index_kwargs: Additional arguments to pass to the index class.

    Returns:
        The created corpus.
    """

    index = index_backend(embeddings, **index_kwargs)
    return cls(index=index, storage=storage)