Indices
bocoel.Index
Index(embeddings: NDArray, distance: str | Distance, **kwargs: Any)
Bases: Protocol
Index is responsible for fast retrieval given a vector query.
Source code in src/bocoel/corpora/indices/interfaces/indices.py
data abstractmethod property
data: NDArray
The underlying data that the index searches over.
NOTE
This has the shape of [n, dims], where dims is the transformed space.
Returns:
Type | Description |
---|---|
NDArray | The data. |
batch abstractmethod property
batch: int
The batch size used for searching.
Returns:
Type | Description |
---|---|
int | The batch size. |
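A batch size like this is typically used to split a large set of queries into fixed-size chunks before searching. The sketch below shows that pattern with plain NumPy; `iter_batches` is a hypothetical helper written for illustration and is not part of bocoel.

```python
import numpy as np


def iter_batches(queries: np.ndarray, batch_size: int):
    """Yield consecutive slices of `queries` with at most `batch_size` rows.

    Hypothetical helper, shown only to illustrate chunked searching.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(queries), batch_size):
        yield queries[start : start + batch_size]


# 150 queries of 8 dimensions, searched 64 at a time.
queries = np.zeros((150, 8))
sizes = [len(chunk) for chunk in iter_batches(queries, batch_size=64)]
```

The last chunk is simply shorter than the others when the number of queries is not a multiple of the batch size.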
boundary property
boundary: Boundary
The boundary of the queries. This is used to check if the query is in range. By default, this is [-1, 1] for all dimensions, since embeddings are normalized.
Returns:
Type | Description |
---|---|
Boundary | The boundary of the input. |
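A range check of this kind can be sketched with plain NumPy. The helper name `in_bounds` is hypothetical, and the `[dims, 2]` layout (lower bound in the first column, upper bound in the second) follows the description of `Boundary.bounds`; this is an illustration, not bocoel's own check.

```python
import numpy as np


def in_bounds(query: np.ndarray, bounds: np.ndarray) -> bool:
    """Return True if every coordinate of `query` lies within `bounds`.

    `in_bounds` is a hypothetical helper; `bounds` has shape [dims, 2].
    """
    query = np.asarray(query, dtype=float)
    lower, upper = bounds[:, 0], bounds[:, 1]
    return bool(np.all((query >= lower) & (query <= upper)))


# Default boundary described above: [-1, 1] in every dimension.
dims = 3
bounds = np.tile(np.array([-1.0, 1.0]), (dims, 1))

inside = in_bounds(np.array([0.2, -0.9, 1.0]), bounds)
outside = in_bounds(np.array([0.2, -1.5, 0.0]), bounds)
```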
distance abstractmethod property
distance: Distance
dims property
dims: int
The number of dimensions that the query vector should be.
Returns:
Type | Description |
---|---|
int | The number of dimensions. |
__len__
__len__() -> int
The number of items in the index.
Returns:
Type | Description |
---|---|
int | The number of items. |
Source code in src/bocoel/corpora/indices/interfaces/indices.py
__getitem__
__getitem__(idx: int) -> NDArray
Get the item at the given index.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
idx | int | The index of the item. | required |
Returns:
Type | Description |
---|---|
NDArray | The item. |
Source code in src/bocoel/corpora/indices/interfaces/indices.py
search
search(query: ArrayLike, k: int = 1) -> SearchResultBatch
Calls the search function and performs some checks.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
query | ArrayLike | The query vector. Must be of shape | required |
k | int | The number of nearest neighbors to return. | 1 |
Returns:
Type | Description |
---|---|
SearchResultBatch | A |
Source code in src/bocoel/corpora/indices/interfaces/indices.py
_search abstractmethod
_search(query: NDArray, k: int = 1) -> InternalResult
Search the index with a given query.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
query | NDArray | The query vector. Must be of shape [query_dims]. | required |
k | int | The number of nearest neighbors to return. | 1 |
Returns:
Type | Description |
---|---|
InternalResult | A numpy array of shape [k]. This corresponds to the indices of the nearest neighbors. |
Source code in src/bocoel/corpora/indices/interfaces/indices.py
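To make the contract of `_search` concrete, here is a minimal brute-force sketch for a single query of shape [dims]. `InternalResultSketch` is a stand-in that mirrors the `distances` and `indices` fields documented for `InternalResult`, and `brute_force_search` is a hypothetical function; a real backend would use an approximate nearest-neighbor structure instead of an exhaustive scan.

```python
from typing import NamedTuple

import numpy as np


class InternalResultSketch(NamedTuple):
    """Stand-in for InternalResult: distances and indices of the neighbors."""

    distances: np.ndarray
    indices: np.ndarray


def brute_force_search(
    data: np.ndarray, query: np.ndarray, k: int = 1
) -> InternalResultSketch:
    """Exhaustive L2 search of one query [dims] against data [n, dims]."""
    if not 1 <= k <= len(data):
        raise ValueError("k must be between 1 and the number of items")
    # Euclidean distance from the query to every row of the data.
    dists = np.linalg.norm(data - query, axis=1)
    # Indices of the k smallest distances, nearest first.
    order = np.argsort(dists)[:k]
    return InternalResultSketch(distances=dists[order], indices=order)


data = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
result = brute_force_search(data, np.array([0.9, 0.1]), k=2)
```

Both returned arrays have shape [k], and `result.indices` refers to rows of the original `data`.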
bocoel.HnswlibIndex
HnswlibIndex(
embeddings: NDArray,
distance: str | Distance,
*,
normalize: bool = True,
threads: int = -1,
batch_size: int = 64
)
Bases: Index
HNSWLIB index. Uses the hnswlib library.
The score is calculated slightly differently in hnswlib; see https://github.com/nmslib/hnswlib#supported-distances
Initializes the HNSWLIB index.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
embeddings | NDArray | The embeddings to index. | required |
distance | str | Distance | The distance metric to use. | required |
normalize | bool | Whether to normalize the embeddings. | True |
threads | int | The number of threads to use. | -1 |
batch_size | int | The batch size to use for searching. | 64 |
Raises:
Type | Description |
---|---|
ValueError | If the distance is not supported. |
Source code in src/bocoel/corpora/indices/backend/hnswlib.py
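For reference, the distance formulas listed in the hnswlib README linked above can be written out in NumPy. The function names below are hypothetical and exist only for this illustration; note that the `l2` space is the squared Euclidean distance, and the `ip` space is one minus the inner product.

```python
import numpy as np


def hnswlib_l2(a: np.ndarray, b: np.ndarray) -> float:
    """Squared L2 distance: sum((a - b)^2)."""
    return float(np.sum((a - b) ** 2))


def hnswlib_ip(a: np.ndarray, b: np.ndarray) -> float:
    """Inner-product distance: 1 - sum(a * b)."""
    return float(1.0 - np.dot(a, b))


def hnswlib_cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine distance: 1 - sum(a * b) / sqrt(sum(a * a) * sum(b * b))."""
    return float(1.0 - np.dot(a, b) / np.sqrt(np.dot(a, a) * np.dot(b, b)))


a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])

l2 = hnswlib_l2(a, b)
ip = hnswlib_ip(a, b)
cos_same_direction = hnswlib_cosine(a, 2 * a)
```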
boundary property
boundary: Boundary
The boundary of the queries. This is used to check if the query is in range. By default, this is [-1, 1] for all dimensions, since embeddings are normalized.
Returns:
Type | Description |
---|---|
Boundary | The boundary of the input. |
dims property
dims: int
The number of dimensions that the query vector should be.
Returns:
Type | Description |
---|---|
int | The number of dimensions. |
__len__
__len__() -> int
The number of items in the index.
Returns:
Type | Description |
---|---|
int | The number of items. |
Source code in src/bocoel/corpora/indices/interfaces/indices.py
__getitem__
__getitem__(idx: int) -> NDArray
Get the item at the given index.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
idx | int | The index of the item. | required |
Returns:
Type | Description |
---|---|
NDArray | The item. |
Source code in src/bocoel/corpora/indices/interfaces/indices.py
search
search(query: ArrayLike, k: int = 1) -> SearchResultBatch
Calls the search function and performs some checks.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
query | ArrayLike | The query vector. Must be of shape | required |
k | int | The number of nearest neighbors to return. | 1 |
Returns:
Type | Description |
---|---|
SearchResultBatch | A |
Source code in src/bocoel/corpora/indices/interfaces/indices.py
bocoel.FaissIndex
FaissIndex(
embeddings: NDArray,
distance: str | Distance,
*,
normalize: bool = True,
index_string: str,
cuda: bool = False,
batch_size: int = 64
)
Bases: Index
Faiss index. Uses the faiss library.
Initializes the Faiss index.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
embeddings | NDArray | The embeddings to index. | required |
distance | str | Distance | The distance metric to use. | required |
index_string | str | The index string to use. | required |
cuda | bool | Whether to use CUDA. | False |
batch_size | int | The batch size to use for searching. | 64 |
Source code in src/bocoel/corpora/indices/backend/faiss.py
boundary property
boundary: Boundary
The boundary of the queries. This is used to check if the query is in range. By default, this is [-1, 1] for all dimensions, since embeddings are normalized.
Returns:
Type | Description |
---|---|
Boundary | The boundary of the input. |
__len__
__len__() -> int
The number of items in the index.
Returns:
Type | Description |
---|---|
int | The number of items. |
Source code in src/bocoel/corpora/indices/interfaces/indices.py
__getitem__
__getitem__(idx: int) -> NDArray
Get the item at the given index.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
idx | int | The index of the item. | required |
Returns:
Type | Description |
---|---|
NDArray | The item. |
Source code in src/bocoel/corpora/indices/interfaces/indices.py
search
search(query: ArrayLike, k: int = 1) -> SearchResultBatch
Calls the search function and performs some checks.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
query | ArrayLike | The query vector. Must be of shape | required |
k | int | The number of nearest neighbors to return. | 1 |
Returns:
Type | Description |
---|---|
SearchResultBatch | A |
Source code in src/bocoel/corpora/indices/interfaces/indices.py
bocoel.WhiteningIndex
WhiteningIndex(
embeddings: NDArray,
distance: str | Distance,
*,
reduced: int,
whitening_backend: type[Index],
**backend_kwargs: Any
)
Bases: Index
Whitening index. Whitens the data before indexing. See https://arxiv.org/abs/2103.15316 for more info.
Initializes the whitening index.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
embeddings | NDArray | The embeddings to index. | required |
distance | str | Distance | The distance metric to use. | required |
reduced | int | The reduced dimensionality. NOP if larger than embeddings shape. | required |
whitening_backend | type[Index] | The backend to use for indexing. | required |
**backend_kwargs | Any | The backend specific keyword arguments. | {} |
Source code in src/bocoel/corpora/indices/whitening.py
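The whitening transform from the referenced paper can be sketched in a few lines of NumPy: center the embeddings, take the SVD of their covariance, and scale the singular vectors so the transformed data has identity covariance. The names `fit_whitening` and `whiten` are hypothetical, and this sketch is not necessarily how `WhiteningIndex` is implemented internally.

```python
import numpy as np


def fit_whitening(embeddings: np.ndarray, reduced: int):
    """Return (mean, kernel) so that (x - mean) @ kernel is whitened.

    The kernel keeps at most `reduced` columns; if `reduced` exceeds the
    embedding dimensionality, no reduction takes place.
    """
    mu = embeddings.mean(axis=0, keepdims=True)
    cov = np.cov((embeddings - mu).T)
    # For a symmetric positive-definite covariance, cov = U diag(S) U^T.
    u, s, _ = np.linalg.svd(cov)
    kernel = u @ np.diag(1.0 / np.sqrt(s))
    k = min(reduced, embeddings.shape[1])
    return mu, kernel[:, :k]


def whiten(x: np.ndarray, mu: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Apply the whitening transform to data or to a query."""
    return (x - mu) @ kernel


rng = np.random.default_rng(0)
# Correlated synthetic embeddings: 500 items, 8 dimensions.
embeddings = rng.normal(size=(500, 8)) @ rng.normal(size=(8, 8))

mu, kernel = fit_whitening(embeddings, reduced=4)
whitened = whiten(embeddings, mu, kernel)
```

Queries need the same transform as the indexed data, which is why the sketch returns the mean and kernel instead of only the whitened array.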
dims property
dims: int
The number of dimensions that the query vector should be.
Returns:
Type | Description |
---|---|
int | The number of dimensions. |
data property
data: NDArray
Returns the data. This does not necessarily have the same dimensionality as the original transformed embeddings.
Returns:
Type | Description |
---|---|
NDArray | The data. |
__len__
__len__() -> int
The number of items in the index.
Returns:
Type | Description |
---|---|
int | The number of items. |
Source code in src/bocoel/corpora/indices/interfaces/indices.py
__getitem__
__getitem__(idx: int) -> NDArray
Get the item at the given index.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
idx | int | The index of the item. | required |
Returns:
Type | Description |
---|---|
NDArray | The item. |
Source code in src/bocoel/corpora/indices/interfaces/indices.py
search
search(query: ArrayLike, k: int = 1) -> SearchResultBatch
Calls the search function and performs some checks.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
query | ArrayLike | The query vector. Must be of shape | required |
k | int | The number of nearest neighbors to return. | 1 |
Returns:
Type | Description |
---|---|
SearchResultBatch | A |
Source code in src/bocoel/corpora/indices/interfaces/indices.py
bocoel.PolarIndex
PolarIndex(
embeddings: NDArray,
distance: str | Distance,
*,
polar_backend: type[Index],
**backend_kwargs: Any
)
Bases: Index
Index that uses N-sphere coordinates as its interface. See Wikipedia linked below for details.
Converting the spatial indices into spherical coordinates has the following benefits:
- Since the coordinates are normalized, the radius is always 1.
- The search region is rectangular in spherical coordinates, which is ideal for Bayesian optimization.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
embeddings | NDArray | The embeddings to index. | required |
distance | str | Distance | The distance metric to use. | required |
polar_backend | type[Index] | The backend to use for indexing. | required |
**backend_kwargs | Any | The backend specific keyword arguments. | {} |
Source code in src/bocoel/corpora/indices/polar.py
dims property
dims: int
The number of dimensions that the query vector should be.
Returns:
Type | Description |
---|---|
int | The number of dimensions. |
__len__
__len__() -> int
The number of items in the index.
Returns:
Type | Description |
---|---|
int | The number of items. |
Source code in src/bocoel/corpora/indices/interfaces/indices.py
__getitem__
__getitem__(idx: int) -> NDArray
Get the item at the given index.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
idx | int | The index of the item. | required |
Returns:
Type | Description |
---|---|
NDArray | The item. |
Source code in src/bocoel/corpora/indices/interfaces/indices.py
search
search(query: ArrayLike, k: int = 1) -> SearchResultBatch
Calls the search function and performs some checks.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
query | ArrayLike | The query vector. Must be of shape | required |
k | int | The number of nearest neighbors to return. | 1 |
Returns:
Type | Description |
---|---|
SearchResultBatch | A |
Source code in src/bocoel/corpora/indices/interfaces/indices.py
_polar_boundary
_polar_boundary(dims: int) -> Boundary
The boundary of the queries. For polar coordinates, it is [0, pi] for all dimensions except the last one, which is [0, 2 * pi].
Returns:
Type | Description |
---|---|
Boundary | The boundary of the input. |
Source code in src/bocoel/corpora/indices/polar.py
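The bounds described here can be built as a `[dims, 2]` array, matching the layout documented for `Boundary.bounds`. The function `polar_bounds` below is a hypothetical NumPy sketch of that array, not the library's `_polar_boundary`.

```python
import numpy as np


def polar_bounds(dims: int) -> np.ndarray:
    """Bounds of shape [dims, 2]: [0, pi] everywhere, [0, 2*pi] for the last angle."""
    if dims < 1:
        raise ValueError("dims must be at least 1")
    bounds = np.zeros((dims, 2))
    bounds[:, 1] = np.pi        # upper bound for every angle
    bounds[-1, 1] = 2 * np.pi   # the last angle covers the full circle
    return bounds


bounds = polar_bounds(3)
```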
polar_to_spatial staticmethod
polar_to_spatial(r: ArrayLike, theta: ArrayLike) -> NDArray
Convert N-sphere coordinates to Cartesian coordinates. See Wikipedia linked in the class documentation for details.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
r | ArrayLike | The radius of the N-sphere. Has the shape [N]. | required |
theta | ArrayLike | The angles of the N-sphere. Has the shape [N, D]. | required |
Returns:
Type | Description |
---|---|
NDArray | The cartesian coordinates of the N-sphere. |
Source code in src/bocoel/corpora/indices/polar.py
spatial_to_polar staticmethod
spatial_to_polar(x: ArrayLike) -> tuple[NDArray, NDArray]
Convert Cartesian coordinates to N-sphere coordinates. See Wikipedia linked in the class documentation for details.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
x | ArrayLike | The cartesian coordinates. Has the shape [N, D]. | required |
Returns:
Type | Description |
---|---|
tuple[NDArray, NDArray] | A tuple. The radius and the angles of the N-sphere. |
Source code in src/bocoel/corpora/indices/polar.py
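The two conversions can be sketched with the standard N-sphere formulas, using the same angle ranges as the polar boundary: every angle lies in [0, pi] except the last, which lies in [0, 2 * pi). The functions below reuse the documented names for readability, but they are an independent NumPy sketch and assume that `theta` of shape [N, D] maps to Cartesian points of shape [N, D + 1]. Points at the origin are not handled.

```python
import numpy as np


def polar_to_spatial(r, theta) -> np.ndarray:
    """Map radius [N] and angles [N, D] to Cartesian points [N, D + 1]."""
    r = np.asarray(r, dtype=float)
    theta = np.asarray(theta, dtype=float)
    n = theta.shape[0]
    ones = np.ones((n, 1))
    # sin_prefix[:, i] is the product of sin(theta_j) for all j < i.
    sin_prefix = np.concatenate([ones, np.cumprod(np.sin(theta), axis=1)], axis=1)
    # Every coordinate but the last is multiplied by cos(theta_i).
    cos_ext = np.concatenate([np.cos(theta), ones], axis=1)
    return r[:, None] * sin_prefix * cos_ext


def spatial_to_polar(x):
    """Map Cartesian points [N, D + 1] to radius [N] and angles [N, D]."""
    x = np.asarray(x, dtype=float)
    r = np.linalg.norm(x, axis=1)
    # tail[:, i] is the norm of the coordinates from position i onward.
    tail = np.sqrt(np.cumsum(x[:, ::-1] ** 2, axis=1))[:, ::-1]
    theta = np.arccos(np.clip(x[:, :-1] / tail[:, :-1], -1.0, 1.0))
    # The last angle needs the sign of the final coordinate, so use arctan2.
    theta[:, -1] = np.arctan2(x[:, -1], x[:, -2]) % (2 * np.pi)
    return r, theta


rng = np.random.default_rng(0)
x = rng.normal(size=(5, 4))
r, theta = spatial_to_polar(x)
x_back = polar_to_spatial(r, theta)
```

Round-tripping `x` through both functions recovers the original points up to floating-point error.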
bocoel.Boundary dataclass
The boundary of embeddings in a corpus. The boundary is defined as a hyperrectangle in the embedding space.
bounds instance-attribute
bounds: NDArray
The boundary arrays of the corpus. Must be of shape [dims, 2], where dims is the number of dimensions. The first column is the lower bound, the second column is the upper bound.
dims property
dims: int
The number of dimensions.
lower property
lower: NDArray
The lower bounds. Must be of shape [dims].
upper property
upper: NDArray
The upper bounds. Must be of shape [dims].
fixed classmethod
fixed(lower: float, upper: float, dims: int) -> Self
Create a fixed boundary for all dimensions.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
lower | float | The lower bound. | required |
upper | float | The upper bound. | required |
dims | int | The number of dimensions. | required |
Returns:
Type | Description |
---|---|
Self | A |
Raises:
Type | Description |
---|---|
ValueError | If lower > upper. |
Source code in src/bocoel/corpora/indices/interfaces/boundaries.py
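The attributes and the `fixed` constructor described above fit together roughly as follows. `BoundarySketch` is a stand-in dataclass written for this illustration; it mirrors the documented names but is not the library's `Boundary` class.

```python
from dataclasses import dataclass

import numpy as np


@dataclass(frozen=True)
class BoundarySketch:
    """Stand-in for Boundary: a hyperrectangle stored as a [dims, 2] array."""

    bounds: np.ndarray

    @property
    def dims(self) -> int:
        return self.bounds.shape[0]

    @property
    def lower(self) -> np.ndarray:
        return self.bounds[:, 0]

    @property
    def upper(self) -> np.ndarray:
        return self.bounds[:, 1]

    @classmethod
    def fixed(cls, lower: float, upper: float, dims: int) -> "BoundarySketch":
        """Use the same [lower, upper] interval for every dimension."""
        if lower > upper:
            raise ValueError("lower must not exceed upper")
        return cls(bounds=np.array([[lower, upper]] * dims, dtype=float))


unit = BoundarySketch.fixed(-1.0, 1.0, dims=4)
```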
bocoel.Distance
Bases: StrEnum
Distance metrics.
L2 class-attribute instance-attribute
L2 = 'L2'
L2 distance. Also known as Euclidean distance.
INNER_PRODUCT class-attribute instance-attribute
INNER_PRODUCT = 'IP'
Inner product distance. When normalized, this is equivalent to cosine similarity.
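The equivalence with cosine similarity is easy to verify numerically. The sketch below uses plain NumPy (the helper `normalize_rows` is hypothetical) and also checks the related identity that, for unit vectors, the squared L2 distance equals 2 - 2 * cosine.

```python
import numpy as np


def normalize_rows(x: np.ndarray) -> np.ndarray:
    """Scale every row of `x` to unit L2 norm."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


rng = np.random.default_rng(0)
a, b = rng.normal(size=(2, 5))

# Cosine similarity of the raw vectors.
cosine = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Inner product and squared L2 distance of the normalized vectors.
a_unit, b_unit = normalize_rows(np.stack([a, b]))
inner = float(a_unit @ b_unit)
squared_l2 = float(np.sum((a_unit - b_unit) ** 2))
```

On normalized embeddings, ranking by inner product and ranking by L2 distance therefore give the same neighbors, in opposite sort order.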
bocoel.corpora.indices.interfaces.results._SearchResult dataclass
query instance-attribute
query: NDArray
Query vector. Shape [batch, dims] if batched, otherwise [dims].
vectors instance-attribute
vectors: NDArray
Nearest neighbors. Shape [batch, k, dims] if batched, otherwise [k, dims].
distances instance-attribute
distances: NDArray
Calculated distance. Shape [batch, k] if batched, otherwise [k].
indices instance-attribute
indices: NDArray
Index in the original embeddings. Must be integers. Shape [batch, k] if batched, otherwise [k].
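The shapes of the four fields can be illustrated with a small exhaustive search over a batch of queries. The function `batched_search` is hypothetical and returns a plain dictionary keyed by the field names above; it does not construct bocoel's result classes.

```python
import numpy as np


def batched_search(data: np.ndarray, queries: np.ndarray, k: int) -> dict:
    """Exhaustive L2 search of queries [batch, dims] against data [n, dims]."""
    # Pairwise distances between every query and every item: [batch, n].
    dists = np.linalg.norm(queries[:, None, :] - data[None, :, :], axis=2)
    # Indices of the k nearest items for every query: [batch, k].
    indices = np.argsort(dists, axis=1)[:, :k]
    distances = np.take_along_axis(dists, indices, axis=1)
    # Fancy indexing gathers the neighbor vectors: [batch, k, dims].
    vectors = data[indices]
    return {
        "query": queries,
        "vectors": vectors,
        "distances": distances,
        "indices": indices,
    }


rng = np.random.default_rng(0)
data = rng.normal(size=(20, 4))
queries = rng.normal(size=(3, 4))
result = batched_search(data, queries, k=5)

# Taking one row of every field gives the non-batched shapes.
single = {name: value[0] for name, value in result.items()}
```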
bocoel.corpora.SearchResultBatch dataclass
Bases: _SearchResult
A batched version of search result.
query instance-attribute
query: NDArray
Query vector. Shape [batch, dims] if batched, otherwise [dims].
vectors instance-attribute
vectors: NDArray
Nearest neighbors. Shape [batch, k, dims] if batched, otherwise [k, dims].
distances instance-attribute
distances: NDArray
Calculated distance. Shape [batch, k] if batched, otherwise [k].
indices instance-attribute
indices: NDArray
Index in the original embeddings. Must be integers. Shape [batch, k] if batched, otherwise [k].
bocoel.corpora.SearchResult dataclass
Bases: _SearchResult
A non-batched version of search result.
query instance-attribute
query: NDArray
Query vector. Shape [batch, dims] if batched, otherwise [dims].
vectors instance-attribute
vectors: NDArray
Nearest neighbors. Shape [batch, k, dims] if batched, otherwise [k, dims].
distances instance-attribute
distances: NDArray
Calculated distance. Shape [batch, k] if batched, otherwise [k].
indices instance-attribute
indices: NDArray
Index in the original embeddings. Must be integers. Shape [batch, k] if batched, otherwise [k].
bocoel.corpora.indices.interfaces.InternalResult
Bases: NamedTuple
distances instance-attribute
distances: NDArray
Calculated distance.
indices instance-attribute
indices: NDArray
Index in the original embeddings. Must be integers.