Corpus
bocoel.Corpus
Bases: Protocol
Corpus is the entry point to handling the data in this library.
A corpus has 3 main components:
- Index: Searches one particular column in the storage. Provides fast retrieval.
- Storage: Used to store the questions / answers / texts.
- Embedder: Embeds the text into vectors for faster access.
An index only corresponds to one key. If search over multiple keys is desired, a new column or a new corpus (with shared storage) should be created.
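To make the relationship between the three components concrete, here is a minimal toy sketch in plain Python. Every name in it (ToyCorpus, ToyIndex, toy_embed, from_key, lookup) is hypothetical and is not part of bocoel; it only illustrates how a storage, an embedder, and a single-key index could cooperate.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


def toy_embed(text: str) -> list[float]:
    # Hypothetical embedder: maps text to [length, vowel count].
    vowels = sum(ch in "aeiou" for ch in text.lower())
    return [float(len(text)), float(vowels)]


@dataclass
class ToyIndex:
    # Hypothetical index: brute-force nearest neighbour over stored vectors.
    vectors: list[list[float]]

    def search(self, query: Sequence[float]) -> int:
        def sq_dist(vec: Sequence[float]) -> float:
            return sum((a - b) ** 2 for a, b in zip(vec, query))

        return min(range(len(self.vectors)), key=lambda i: sq_dist(self.vectors[i]))


@dataclass
class ToyCorpus:
    storage: list[dict[str, str]]
    embedder: Callable[[str], list[float]]
    index: ToyIndex

    @classmethod
    def from_key(
        cls,
        storage: list[dict[str, str]],
        embedder: Callable[[str], list[float]],
        key: str,
    ) -> "ToyCorpus":
        # The index covers exactly one key (column) of the storage.
        vectors = [embedder(row[key]) for row in storage]
        return cls(storage, embedder, ToyIndex(vectors))

    def lookup(self, text: str) -> dict[str, str]:
        # Embed the query, find the nearest stored row, and return it.
        return self.storage[self.index.search(self.embedder(text))]
```

Searching over a second key in this toy would mean building a second ToyCorpus over the same storage list, which mirrors the "new corpus with shared storage" option described above.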
bocoel.ComposedCorpus dataclass
Bases: Corpus
Simply a collection of components.
index_storage classmethod
index_storage(
storage: Storage,
embedder: Embedder,
keys: Sequence[str],
index_backend: type[Index],
concat: Callable[[Iterable[Any]], str] = " [SEP] ".join,
**index_kwargs: Any
) -> Self
Creates a corpus from the given storage, embedder, keys and index class, where storage entries are mapped to strings by concatenating the values of the given keys with the concat function.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
storage | Storage | The storage to index. | required |
embedder | Embedder | The embedder to use. | required |
keys | Sequence[str] | The keys to use for the index. | required |
index_backend | type[Index] | The index class to use. | required |
concat | Callable[[Iterable[Any]], str] | The function to use to concatenate the values of the keys. | " [SEP] ".join |
**index_kwargs | Any | Additional arguments to pass to the index class. | {} |
Returns:
Type | Description |
---|---|
Self | The created corpus. |
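As a rough illustration of what the default concat does, the sketch below joins the values of the selected keys for a single row. The helper row_to_text and the column names are hypothetical and only mimic the idea; they are not bocoel's implementation.

```python
from typing import Any, Callable, Iterable, Mapping, Sequence


def row_to_text(
    row: Mapping[str, Any],
    keys: Sequence[str],
    concat: Callable[[Iterable[Any]], str] = " [SEP] ".join,
) -> str:
    # Pick the values for the requested keys, in order, and concatenate them.
    return concat(row[key] for key in keys)


def concat_as_str(values: Iterable[Any]) -> str:
    # A custom concat that tolerates non-string values by converting them first.
    return " [SEP] ".join(str(value) for value in values)
```

Note that " [SEP] ".join only accepts strings, so a column holding numbers would raise a TypeError with the default; a function like concat_as_str is one way around that.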
Source code in src/bocoel/corpora/corpora/composed.py
index_mapped classmethod
index_mapped(
storage: Storage,
embedder: Embedder,
transform: Callable[[Mapping[str, Sequence[Any]]], Sequence[str]],
index_backend: type[Index],
**index_kwargs: Any
) -> Self
Creates a corpus from the given storage, embedder and index class, where storage entries are mapped to strings using the specified batched transform function.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
storage | Storage | The storage to index. | required |
embedder | Embedder | The embedder to use. | required |
transform | Callable[[Mapping[str, Sequence[Any]]], Sequence[str]] | The function to use to transform the storage entries. | required |
index_backend | type[Index] | The index class to use. | required |
**index_kwargs | Any | Additional arguments to pass to the index class. | {} |
Returns:
Type | Description |
---|---|
Self | The created corpus. |
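A transform matching the signature above receives a batch as a mapping from column name to a sequence of values and returns one string per row. The sketch below shows one possible shape; the function name and the column names question and answer are assumptions made for illustration.

```python
from typing import Any, Mapping, Sequence


def question_with_answer(batch: Mapping[str, Sequence[Any]]) -> list[str]:
    # Walk the two columns in lockstep and build one string per row.
    return [
        f"Q: {question} A: {answer}"
        for question, answer in zip(batch["question"], batch["answer"])
    ]
```

Because the transform is batched, it returns exactly as many strings as there are rows in the batch, so each string can be embedded and indexed against its row.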
Source code in src/bocoel/corpora/corpora/composed.py
index_embeddings classmethod
index_embeddings(
storage: Storage,
embeddings: NDArray,
index_backend: type[Index],
**index_kwargs: Any
) -> Self
Create the corpus with the given embeddings. This can be used to save time by encoding once and caching embeddings.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
storage | Storage | The storage to use. | required |
embeddings | NDArray | The embeddings to use. | required |
index_backend | type[Index] | The index class to use. | required |
**index_kwargs | Any | Additional arguments to pass to the index class. | {} |
Returns:
Type | Description |
---|---|
Self | The created corpus. |
Source code in src/bocoel/corpora/corpora/composed.py