Indices

bocoel.Index

Index(embeddings: NDArray, distance: str | Distance, **kwargs: Any)

Bases: Protocol

Index is responsible for fast retrieval given a vector query.

Source code in src/bocoel/corpora/indices/interfaces/indices.py
def __init__(
    self, embeddings: NDArray, distance: str | Distance, **kwargs: Any
) -> None:
    # Included s.t. constructors of Index can be used.
    ...

data abstractmethod property

data: NDArray

The underlying data that the index searches over.

NOTE

This has shape [n, dims], where dims is the dimensionality of the transformed space.

Returns:

Type Description
NDArray

The data.

batch abstractmethod property

batch: int

The batch size used for searching.

Returns:

Type Description
int

The batch size.

boundary property

boundary: Boundary

The boundary of the queries. This is used to check if the query is in range. By default, this is [-1, 1] for all dimensions, since embeddings are normalized.

Returns:

Type Description
Boundary

The boundary of the input.

distance abstractmethod property

distance: Distance

The distance metric used by the index.

Returns:

Type Description
Distance

The distance metric.

dims property

dims: int

The number of dimensions that the query vector should have.

Returns:

Type Description
int

The number of dimensions.

__len__

__len__() -> int

The number of items in the index.

Returns:

Type Description
int

The number of items.

Source code in src/bocoel/corpora/indices/interfaces/indices.py
def __len__(self) -> int:
    """
    The number of items in the index.

    Returns:
        The number of items.
    """

    return len(self.data)

__getitem__

__getitem__(idx: int) -> NDArray

Get the item at the given index.

Parameters:

Name Type Description Default
idx int

The index of the item.

required

Returns:

Type Description
NDArray

The item.

Source code in src/bocoel/corpora/indices/interfaces/indices.py
def __getitem__(self, idx: int) -> NDArray:
    """
    Get the item at the given index.

    Parameters:
        idx: The index of the item.

    Returns:
        The item.
    """

    return self.data[idx]

search

search(query: ArrayLike, k: int = 1) -> SearchResultBatch

Validates the query, splits it into batches of size batch, calls _search on each batch, and gathers the results into a SearchResultBatch.

Parameters:

Name Type Description Default
query ArrayLike

The query vector. Must be of shape [batch, query_dims].

required
k int

The number of nearest neighbors to return.

1

Returns:

Type Description
SearchResultBatch

A SearchResultBatch instance. See SearchResultBatch for details.

Source code in src/bocoel/corpora/indices/interfaces/indices.py
def search(self, query: ArrayLike, k: int = 1) -> SearchResultBatch:
    """
    Calls the search function and performs some checks.

    Parameters:
        query: The query vector. Must be of shape `[batch, query_dims]`.
        k: The number of nearest neighbors to return.

    Returns:
        A `SearchResultBatch` instance. See `SearchResultBatch` for details.
    """

    query = np.array(query)

    if (ndim := query.ndim) != 2:
        raise ValueError(
            f"Expected query to be a 2D vector, got a vector of dim {ndim}."
        )

    if (dim := query.shape[1]) != self.dims:
        raise ValueError(f"Expected query to have dimension {self.dims}, got {dim}")

    if k < 1:
        raise ValueError(f"Expected k to be at least 1, got {k}")

    results: list[InternalResult] = []
    for idx in range(0, len(query), self.batch):
        query_batch = query[idx : idx + self.batch]
        result = self._search(query_batch, k=k)
        results.append(result)

    indices = np.concatenate([res.indices for res in results], axis=0)
    distances = np.concatenate([res.distances for res in results], axis=0)
    vectors = self.data[indices]

    return SearchResultBatch(
        query=query, vectors=vectors, distances=distances, indices=indices
    )
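
The validation and batching performed by search can be sketched without any backend. The following is a minimal, dependency-free illustration, not the library's implementation: `brute_force_search` and `batched_search` are hypothetical names, plain lists stand in for NumPy arrays, and an exhaustive L2 scan stands in for `_search`.

```python
import math


def brute_force_search(data, query_batch, k):
    """Stand-in for `_search`: exhaustive L2 scan over one batch of queries."""
    distances, indices = [], []
    for q in query_batch:
        nearest = sorted((math.dist(q, row), i) for i, row in enumerate(data))[:k]
        distances.append([d for d, _ in nearest])
        indices.append([i for _, i in nearest])
    return distances, indices


def batched_search(data, query, k=1, batch=2):
    """Validate the query, search it batch by batch, and gather the results."""
    dims = len(data[0])
    if any(len(q) != dims for q in query):
        raise ValueError(f"Expected query to have dimension {dims}")
    if k < 1:
        raise ValueError(f"Expected k to be at least 1, got {k}")

    distances, indices = [], []
    for start in range(0, len(query), batch):
        dist, idx = brute_force_search(data, query[start : start + batch], k)
        distances.extend(dist)
        indices.extend(idx)

    # Look up the matched vectors, giving a [batch, k, dims] nested list.
    vectors = [[data[i] for i in row] for row in indices]
    return {"indices": indices, "distances": distances, "vectors": vectors}


data = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
result = batched_search(data, [[0.9, 0.1], [0.1, 0.8], [0.0, 0.0]], k=2)
print(result["indices"])
```

Three queries with a batch size of 2 are processed as two chunks, yet the caller sees a single result whose first axis matches the number of queries.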

_search abstractmethod

_search(query: NDArray, k: int = 1) -> InternalResult

Search the index with a given query.

Parameters:

Name Type Description Default
query NDArray

The query batch. As called from search, this has shape [batch, query_dims].

required
k int

The number of nearest neighbors to return.

1

Returns:

Type Description
InternalResult

An InternalResult holding the distances and indices of the k nearest neighbors for each query.

Source code in src/bocoel/corpora/indices/interfaces/indices.py
@abc.abstractmethod
def _search(self, query: NDArray, k: int = 1) -> InternalResult:
    """
    Search the index with a given query.

    Parameters:
        query: The query vector. Must be of shape [query_dims].
        k: The number of nearest neighbors to return.

    Returns:
        A numpy array of shape [k].
            This corresponds to the indices of the nearest neighbors.
    """

    ...

bocoel.HnswlibIndex

HnswlibIndex(
    embeddings: NDArray,
    distance: str | Distance,
    *,
    normalize: bool = True,
    threads: int = -1,
    batch_size: int = 64
)

Bases: Index

HNSWLIB index. Uses the hnswlib library.

Note that hnswlib calculates scores slightly differently; see https://github.com/nmslib/hnswlib#supported-distances

Initializes the HNSWLIB index.

Parameters:

Name Type Description Default
embeddings NDArray

The embeddings to index.

required
distance str | Distance

The distance metric to use.

required
normalize bool

Whether to normalize the embeddings.

True
threads int

The number of threads to use.

-1
batch_size int

The batch size to use for searching.

64

Raises:

Type Description
ValueError

If the distance is not supported.

Source code in src/bocoel/corpora/indices/backend/hnswlib.py
def __init__(
    self,
    embeddings: NDArray,
    distance: str | Distance,
    *,
    normalize: bool = True,
    threads: int = -1,
    batch_size: int = 64,
) -> None:
    """
    Initializes the HNSWLIB index.

    Parameters:
        embeddings: The embeddings to index.
        distance: The distance metric to use.
        normalize: Whether to normalize the embeddings.
        threads: The number of threads to use.
        batch_size: The batch size to use for searching.

    Raises:
        ValueError: If the distance is not supported.
    """

    if normalize:
        embeddings = utils.normalize(embeddings)

    self.__embeddings = embeddings

    # Would raise ValueError if not a valid distance.
    self._dist = Distance.lookup(distance)
    self._batch_size = batch_size

    # A public attribute because this can be changed at any time.
    self.threads = threads

    self._init_index()

boundary property

boundary: Boundary

The boundary of the queries. This is used to check if the query is in range. By default, this is [-1, 1] for all dimensions, since embeddings are normalized.

Returns:

Type Description
Boundary

The boundary of the input.

dims property

dims: int

The number of dimensions that the query vector should have.

Returns:

Type Description
int

The number of dimensions.

__len__

__len__() -> int

The number of items in the index.

Returns:

Type Description
int

The number of items.

Source code in src/bocoel/corpora/indices/interfaces/indices.py
def __len__(self) -> int:
    """
    The number of items in the index.

    Returns:
        The number of items.
    """

    return len(self.data)

__getitem__

__getitem__(idx: int) -> NDArray

Get the item at the given index.

Parameters:

Name Type Description Default
idx int

The index of the item.

required

Returns:

Type Description
NDArray

The item.

Source code in src/bocoel/corpora/indices/interfaces/indices.py
def __getitem__(self, idx: int) -> NDArray:
    """
    Get the item at the given index.

    Parameters:
        idx: The index of the item.

    Returns:
        The item.
    """

    return self.data[idx]

search

search(query: ArrayLike, k: int = 1) -> SearchResultBatch

Calls the search function and performs some checks.

Parameters:

Name Type Description Default
query ArrayLike

The query vector. Must be of shape [batch, query_dims].

required
k int

The number of nearest neighbors to return.

1

Returns:

Type Description
SearchResultBatch

A SearchResultBatch instance. See SearchResultBatch for details.

Source code in src/bocoel/corpora/indices/interfaces/indices.py
def search(self, query: ArrayLike, k: int = 1) -> SearchResultBatch:
    """
    Calls the search function and performs some checks.

    Parameters:
        query: The query vector. Must be of shape `[batch, query_dims]`.
        k: The number of nearest neighbors to return.

    Returns:
        A `SearchResultBatch` instance. See `SearchResultBatch` for details.
    """

    query = np.array(query)

    if (ndim := query.ndim) != 2:
        raise ValueError(
            f"Expected query to be a 2D vector, got a vector of dim {ndim}."
        )

    if (dim := query.shape[1]) != self.dims:
        raise ValueError(f"Expected query to have dimension {self.dims}, got {dim}")

    if k < 1:
        raise ValueError(f"Expected k to be at least 1, got {k}")

    results: list[InternalResult] = []
    for idx in range(0, len(query), self.batch):
        query_batch = query[idx : idx + self.batch]
        result = self._search(query_batch, k=k)
        results.append(result)

    indices = np.concatenate([res.indices for res in results], axis=0)
    distances = np.concatenate([res.distances for res in results], axis=0)
    vectors = self.data[indices]

    return SearchResultBatch(
        query=query, vectors=vectors, distances=distances, indices=indices
    )

bocoel.FaissIndex

FaissIndex(
    embeddings: NDArray,
    distance: str | Distance,
    *,
    normalize: bool = True,
    index_string: str,
    cuda: bool = False,
    batch_size: int = 64
)

Bases: Index

Faiss index. Uses the faiss library.

Initializes the Faiss index.

Parameters:

Name Type Description Default
embeddings NDArray

The embeddings to index.

required
distance str | Distance

The distance metric to use.

required
normalize bool

Whether to normalize the embeddings.

True
index_string str

The index string to use.

required
cuda bool

Whether to use CUDA.

False
batch_size int

The batch size to use for searching.

64

Source code in src/bocoel/corpora/indices/backend/faiss.py
def __init__(
    self,
    embeddings: NDArray,
    distance: str | Distance,
    *,
    normalize: bool = True,
    index_string: str,
    cuda: bool = False,
    batch_size: int = 64,
) -> None:
    """
    Initializes the Faiss index.

    Parameters:
        embeddings: The embeddings to index.
        distance: The distance metric to use.
        index_string: The index string to use.
        cuda: Whether to use CUDA.
        batch_size: The batch size to use for searching.
    """

    if normalize:
        embeddings = utils.normalize(embeddings)

    self.__embeddings = embeddings

    self._batch_size = batch_size
    self._dist = Distance.lookup(distance)

    self._index_string = index_string
    self._init_index(index_string=index_string, cuda=cuda)

boundary property

boundary: Boundary

The boundary of the queries. This is used to check if the query is in range. By default, this is [-1, 1] for all dimensions, since embeddings are normalized.

Returns:

Type Description
Boundary

The boundary of the input.

__len__

__len__() -> int

The number of items in the index.

Returns:

Type Description
int

The number of items.

Source code in src/bocoel/corpora/indices/interfaces/indices.py
def __len__(self) -> int:
    """
    The number of items in the index.

    Returns:
        The number of items.
    """

    return len(self.data)

__getitem__

__getitem__(idx: int) -> NDArray

Get the item at the given index.

Parameters:

Name Type Description Default
idx int

The index of the item.

required

Returns:

Type Description
NDArray

The item.

Source code in src/bocoel/corpora/indices/interfaces/indices.py
def __getitem__(self, idx: int) -> NDArray:
    """
    Get the item at the given index.

    Parameters:
        idx: The index of the item.

    Returns:
        The item.
    """

    return self.data[idx]

search

search(query: ArrayLike, k: int = 1) -> SearchResultBatch

Calls the search function and performs some checks.

Parameters:

Name Type Description Default
query ArrayLike

The query vector. Must be of shape [batch, query_dims].

required
k int

The number of nearest neighbors to return.

1

Returns:

Type Description
SearchResultBatch

A SearchResultBatch instance. See SearchResultBatch for details.

Source code in src/bocoel/corpora/indices/interfaces/indices.py
def search(self, query: ArrayLike, k: int = 1) -> SearchResultBatch:
    """
    Calls the search function and performs some checks.

    Parameters:
        query: The query vector. Must be of shape `[batch, query_dims]`.
        k: The number of nearest neighbors to return.

    Returns:
        A `SearchResultBatch` instance. See `SearchResultBatch` for details.
    """

    query = np.array(query)

    if (ndim := query.ndim) != 2:
        raise ValueError(
            f"Expected query to be a 2D vector, got a vector of dim {ndim}."
        )

    if (dim := query.shape[1]) != self.dims:
        raise ValueError(f"Expected query to have dimension {self.dims}, got {dim}")

    if k < 1:
        raise ValueError(f"Expected k to be at least 1, got {k}")

    results: list[InternalResult] = []
    for idx in range(0, len(query), self.batch):
        query_batch = query[idx : idx + self.batch]
        result = self._search(query_batch, k=k)
        results.append(result)

    indices = np.concatenate([res.indices for res in results], axis=0)
    distances = np.concatenate([res.distances for res in results], axis=0)
    vectors = self.data[indices]

    return SearchResultBatch(
        query=query, vectors=vectors, distances=distances, indices=indices
    )

bocoel.WhiteningIndex

WhiteningIndex(
    embeddings: NDArray,
    distance: str | Distance,
    *,
    reduced: int,
    whitening_backend: type[Index],
    **backend_kwargs: Any
)

Bases: Index

Whitening index. Whitens the data before indexing. See https://arxiv.org/abs/2103.15316 for more info.

Initializes the whitening index.

Parameters:

Name Type Description Default
embeddings NDArray

The embeddings to index.

required
distance str | Distance

The distance metric to use.

required
reduced int

The target dimensionality after whitening. If this is larger than the embedding dimensionality, no reduction is performed.

required
whitening_backend type[Index]

The backend to use for indexing.

required
**backend_kwargs Any

The backend specific keyword arguments.

{}

Source code in src/bocoel/corpora/indices/whitening.py
def __init__(
    self,
    embeddings: NDArray,
    distance: str | Distance,
    *,
    reduced: int,
    whitening_backend: type[Index],
    **backend_kwargs: Any,
) -> None:
    """
    Initializes the whitening index.

    Parameters:
        embeddings: The embeddings to index.
        distance: The distance metric to use.
        reduced: The reduced dimensionality. NOP if larger than embeddings shape.
        whitening_backend: The backend to use for indexing.
        **backend_kwargs: The backend specific keyword arguments.
    """

    # Reduced might be larger than the embedding dimensionality.
    # In that case, no dimensionality reduction is performed.
    if reduced > embeddings.shape[1]:
        reduced = embeddings.shape[1]
        LOGGER.info(
            "Reduced dimensionality is larger than embeddings. Using full dimensionality",
            reduced=reduced,
            embeddings=embeddings.shape,
        )

    white = self.whiten(embeddings, reduced)
    assert white.shape[1] == reduced, {
        "whitened": white.shape,
        "reduced": reduced,
    }
    self._index = whitening_backend(
        embeddings=white, distance=distance, **backend_kwargs
    )
    assert reduced == self._index.dims
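
As a rough illustration of the whitening step, here is a minimal NumPy sketch assuming the linear whitening transform described in the paper linked above (center the data, then rotate and rescale by the SVD of the covariance). The standalone `whiten` function is a hypothetical stand-in; the library's own `whiten` method may differ in detail.

```python
import numpy as np


def whiten(embeddings: np.ndarray, reduced: int) -> np.ndarray:
    """Center, decorrelate, and rescale embeddings, keeping `reduced` dims."""
    # Clamp, mirroring the constructor: never ask for more dims than exist.
    reduced = min(reduced, embeddings.shape[1])

    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    cov = np.cov(centered, rowvar=False)

    # The covariance is symmetric, so its SVD is cov = u @ diag(s) @ u.T.
    u, s, _ = np.linalg.svd(cov)
    kernel = u @ np.diag(1.0 / np.sqrt(s))

    # Keeping the leading columns retains the highest-variance directions.
    return centered @ kernel[:, :reduced]


rng = np.random.default_rng(0)
raw = rng.normal(size=(500, 4)) @ rng.normal(size=(4, 4))  # correlated features
white = whiten(raw, reduced=2)
print(white.shape)
print(np.round(np.cov(white, rowvar=False), 6))
```

After the transform the retained dimensions are uncorrelated with unit variance, so the printed covariance is numerically the identity matrix.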

dims property

dims: int

The number of dimensions that the query vector should have.

Returns:

Type Description
int

The number of dimensions.

data property

data: NDArray

Returns the whitened data. Its dimensionality may be lower than that of the original embeddings.

Returns:

Type Description
NDArray

The data.

__len__

__len__() -> int

The number of items in the index.

Returns:

Type Description
int

The number of items.

Source code in src/bocoel/corpora/indices/interfaces/indices.py
def __len__(self) -> int:
    """
    The number of items in the index.

    Returns:
        The number of items.
    """

    return len(self.data)

__getitem__

__getitem__(idx: int) -> NDArray

Get the item at the given index.

Parameters:

Name Type Description Default
idx int

The index of the item.

required

Returns:

Type Description
NDArray

The item.

Source code in src/bocoel/corpora/indices/interfaces/indices.py
def __getitem__(self, idx: int) -> NDArray:
    """
    Get the item at the given index.

    Parameters:
        idx: The index of the item.

    Returns:
        The item.
    """

    return self.data[idx]

search

search(query: ArrayLike, k: int = 1) -> SearchResultBatch

Calls the search function and performs some checks.

Parameters:

Name Type Description Default
query ArrayLike

The query vector. Must be of shape [batch, query_dims].

required
k int

The number of nearest neighbors to return.

1

Returns:

Type Description
SearchResultBatch

A SearchResultBatch instance. See SearchResultBatch for details.

Source code in src/bocoel/corpora/indices/interfaces/indices.py
def search(self, query: ArrayLike, k: int = 1) -> SearchResultBatch:
    """
    Calls the search function and performs some checks.

    Parameters:
        query: The query vector. Must be of shape `[batch, query_dims]`.
        k: The number of nearest neighbors to return.

    Returns:
        A `SearchResultBatch` instance. See `SearchResultBatch` for details.
    """

    query = np.array(query)

    if (ndim := query.ndim) != 2:
        raise ValueError(
            f"Expected query to be a 2D vector, got a vector of dim {ndim}."
        )

    if (dim := query.shape[1]) != self.dims:
        raise ValueError(f"Expected query to have dimension {self.dims}, got {dim}")

    if k < 1:
        raise ValueError(f"Expected k to be at least 1, got {k}")

    results: list[InternalResult] = []
    for idx in range(0, len(query), self.batch):
        query_batch = query[idx : idx + self.batch]
        result = self._search(query_batch, k=k)
        results.append(result)

    indices = np.concatenate([res.indices for res in results], axis=0)
    distances = np.concatenate([res.distances for res in results], axis=0)
    vectors = self.data[indices]

    return SearchResultBatch(
        query=query, vectors=vectors, distances=distances, indices=indices
    )

bocoel.PolarIndex

PolarIndex(
    embeddings: NDArray,
    distance: str | Distance,
    *,
    polar_backend: type[Index],
    **backend_kwargs: Any
)

Bases: Index

Index that uses N-sphere coordinates as its interface. See the Wikipedia article linked below for details.

Converting the spatial indices into spherical coordinates has the following benefits:

  • Since the coordinates are normalized, the radius is always 1.
  • The search region is rectangular in spherical coordinates, ideal for Bayesian optimization.

Wikipedia link on N-sphere

Parameters:

Name Type Description Default
embeddings NDArray

The embeddings to index.

required
distance str | Distance

The distance metric to use.

required
polar_backend type[Index]

The backend to use for indexing.

required
**backend_kwargs Any

The backend specific keyword arguments.

{}

Source code in src/bocoel/corpora/indices/polar.py
def __init__(
    self,
    embeddings: NDArray,
    distance: str | Distance,
    *,
    polar_backend: type[Index],
    **backend_kwargs: Any,
) -> None:
    """
    Parameters:
        embeddings: The embeddings to index.
        distance: The distance metric to use.
        polar_backend: The backend to use for indexing.
        **backend_kwargs: The backend specific keyword arguments.
    """

    embeddings = utils.normalize(embeddings)
    self._index = polar_backend(
        embeddings=embeddings,
        distance=distance,
        **backend_kwargs,
    )

    dims = self._index.dims - 1

    self._boundary = self._polar_boundary(dims)
    self._data = self._polar_coordinates()

dims property

dims: int

The number of dimensions that the query vector should have.

Returns:

Type Description
int

The number of dimensions.

__len__

__len__() -> int

The number of items in the index.

Returns:

Type Description
int

The number of items.

Source code in src/bocoel/corpora/indices/interfaces/indices.py
def __len__(self) -> int:
    """
    The number of items in the index.

    Returns:
        The number of items.
    """

    return len(self.data)

__getitem__

__getitem__(idx: int) -> NDArray

Get the item at the given index.

Parameters:

Name Type Description Default
idx int

The index of the item.

required

Returns:

Type Description
NDArray

The item.

Source code in src/bocoel/corpora/indices/interfaces/indices.py
def __getitem__(self, idx: int) -> NDArray:
    """
    Get the item at the given index.

    Parameters:
        idx: The index of the item.

    Returns:
        The item.
    """

    return self.data[idx]

search

search(query: ArrayLike, k: int = 1) -> SearchResultBatch

Calls the search function and performs some checks.

Parameters:

Name Type Description Default
query ArrayLike

The query vector. Must be of shape [batch, query_dims].

required
k int

The number of nearest neighbors to return.

1

Returns:

Type Description
SearchResultBatch

A SearchResultBatch instance. See SearchResultBatch for details.

Source code in src/bocoel/corpora/indices/interfaces/indices.py
def search(self, query: ArrayLike, k: int = 1) -> SearchResultBatch:
    """
    Calls the search function and performs some checks.

    Parameters:
        query: The query vector. Must be of shape `[batch, query_dims]`.
        k: The number of nearest neighbors to return.

    Returns:
        A `SearchResultBatch` instance. See `SearchResultBatch` for details.
    """

    query = np.array(query)

    if (ndim := query.ndim) != 2:
        raise ValueError(
            f"Expected query to be a 2D vector, got a vector of dim {ndim}."
        )

    if (dim := query.shape[1]) != self.dims:
        raise ValueError(f"Expected query to have dimension {self.dims}, got {dim}")

    if k < 1:
        raise ValueError(f"Expected k to be at least 1, got {k}")

    results: list[InternalResult] = []
    for idx in range(0, len(query), self.batch):
        query_batch = query[idx : idx + self.batch]
        result = self._search(query_batch, k=k)
        results.append(result)

    indices = np.concatenate([res.indices for res in results], axis=0)
    distances = np.concatenate([res.distances for res in results], axis=0)
    vectors = self.data[indices]

    return SearchResultBatch(
        query=query, vectors=vectors, distances=distances, indices=indices
    )

_polar_boundary

_polar_boundary(dims: int) -> Boundary

The boundary of the queries. For polar coordinates, it is [0, pi] for all dimensions except the last one, which is [0, 2 * pi].

Returns:

Type Description
Boundary

The boundary of the input.

Source code in src/bocoel/corpora/indices/polar.py
def _polar_boundary(self, dims: int) -> Boundary:
    """
    The boundary of the queries.
    For polar coordinate it is [0, pi] for all dimensions
    except the last one which is [0, 2 * pi].

    Returns:
        The boundary of the input.
    """

    # See wikipedia linked in the class documentation for details.
    upper = np.concatenate([[np.pi] * (dims - 1), [2 * np.pi]])
    lower = np.zeros_like(upper)
    return Boundary(np.stack([lower, upper], axis=-1))
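
To make the resulting layout concrete, the same bounds can be built with plain Python lists. This is only a sketch: `polar_bounds` is a hypothetical helper name, and each row is a (lower, upper) pair matching the [dims, 2] shape that Boundary expects.

```python
import math


def polar_bounds(dims):
    """Return `dims` rows of (lower, upper) bounds for N-sphere angles."""
    # Every angle spans [0, pi] except the last, which spans [0, 2 * pi].
    upper = [math.pi] * (dims - 1) + [2 * math.pi]
    return [(0.0, hi) for hi in upper]


bounds = polar_bounds(3)
for lower, upper in bounds:
    print(f"[{lower:.4f}, {upper:.4f}]")
```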

polar_to_spatial staticmethod

polar_to_spatial(r: ArrayLike, theta: ArrayLike) -> NDArray

Convert N-sphere coordinates to Cartesian coordinates. See the Wikipedia article linked in the class documentation for details.

Parameters:

Name Type Description Default
r ArrayLike

The radius of the N-sphere. Has the shape [N].

required
theta ArrayLike

The angles of the N-sphere. Has the shape [N, D].

required

Returns:

Type Description
NDArray

The cartesian coordinates of the N-sphere.

Source code in src/bocoel/corpora/indices/polar.py
@staticmethod
def polar_to_spatial(r: ArrayLike, theta: ArrayLike) -> NDArray:
    """
    Convert an N-sphere coordinates to cartesian coordinates.
    See wikipedia linked in the class documentation for details.

    Parameters:
        r: The radius of the N-sphere. Has the shape [N].
        theta: The angles of the N-sphere. Has the shape [N, D].

    Returns:
        The cartesian coordinates of the N-sphere.
    """

    r = np.array(r)
    theta = np.array(theta)

    if r.ndim != 1:
        raise ValueError(f"Expected r to be 1D, got {r.ndim}")

    if theta.ndim != 2:
        raise ValueError(f"Expected theta to be 2D, got {theta.ndim}")

    if r.shape[0] != theta.shape[0]:
        raise ValueError(
            f"Expected r and theta to have the same length, got {r.shape[0]} and {theta.shape[0]}"
        )

    # Add 1 dimension to the front because spherical coordinate's first dimension is r.
    sin = np.concatenate([np.ones([len(r), 1]), np.sin(theta)], axis=1)
    sin = np.cumprod(sin, axis=1)
    cos = np.concatenate([np.cos(theta), np.ones([len(r), 1])], axis=1)
    return sin * cos * r[:, None]

spatial_to_polar staticmethod

spatial_to_polar(x: ArrayLike) -> tuple[NDArray, NDArray]

Convert Cartesian coordinates to N-sphere coordinates. See the Wikipedia article linked in the class documentation for details.

Parameters:

Name Type Description Default
x ArrayLike

The cartesian coordinates. Has the shape [N, D].

required

Returns:

Type Description
tuple[NDArray, NDArray]

A tuple. The radius and the angles of the N-sphere.

Source code in src/bocoel/corpora/indices/polar.py
@staticmethod
def spatial_to_polar(x: ArrayLike) -> tuple[NDArray, NDArray]:
    """
    Convert cartesian coordinates to N-sphere coordinates.
    See wikipedia linked in the class documentation for details.

    Parameters:
        x: The cartesian coordinates. Has the shape [N, D].

    Returns:
        A tuple. The radius and the angles of the N-sphere.
    """

    x = np.array(x)

    if x.ndim != 2:
        raise ValueError(f"Expected x to be 2D, got {x.ndim}")

    # Since the function requires a lot of sum of squares, cache it.
    x_2 = x[:, 1:] ** 2

    r = np.sqrt(x_2.sum(axis=1))
    cumsum_back = np.cumsum(x_2[:, ::-1], axis=1)[:, ::-1]

    theta = np.arctan2(np.sqrt(cumsum_back), x[:, 1:])
    return r, theta
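
For intuition about the two conversions, here is an independent sketch of the textbook N-sphere mapping for a single point, written in plain Python rather than NumPy. It is not the library's code: the function names merely mirror the static methods above, and the handling of the last angle (wrapped into [0, 2 * pi)) is an assumption taken from the standard formula.

```python
import math


def polar_to_spatial(r, theta):
    """Map a radius and D angles to D + 1 Cartesian coordinates."""
    coords, sin_prod = [], 1.0
    for angle in theta:
        coords.append(r * sin_prod * math.cos(angle))
        sin_prod *= math.sin(angle)
    coords.append(r * sin_prod)
    return coords


def spatial_to_polar(x):
    """Inverse map: D + 1 Cartesian coordinates to a radius and D angles."""
    r = math.sqrt(sum(v * v for v in x))
    theta = []
    for i in range(len(x) - 2):
        # Angle between axis i and the remaining coordinates, in [0, pi].
        tail = math.sqrt(sum(v * v for v in x[i + 1 :]))
        theta.append(math.atan2(tail, x[i]))
    # The last angle keeps the sign of the final coordinate: [0, 2 * pi).
    theta.append(math.atan2(x[-1], x[-2]) % (2 * math.pi))
    return r, theta


angles = [0.5, 1.2, 4.0]
point = polar_to_spatial(1.0, angles)
radius, recovered = spatial_to_polar(point)
print(point)
print(radius, recovered)
```

Because the angles were chosen inside the documented boundary, the round trip recovers them up to floating-point error.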

bocoel.Boundary dataclass

The boundary of embeddings in a corpus. The boundary is defined as a hyperrectangle in the embedding space.

bounds instance-attribute

bounds: NDArray

The boundary arrays of the corpus. Must be of shape [dims, 2], where dims is the number of dimensions. The first column is the lower bound, the second column is the upper bound.

dims property

dims: int

The number of dimensions.

lower property

lower: NDArray

The lower bounds. Must be of shape [dims].

upper property

upper: NDArray

The upper bounds. Must be of shape [dims].

fixed classmethod

fixed(lower: float, upper: float, dims: int) -> Self

Create a fixed boundary for all dimensions.

Parameters:

Name Type Description Default
lower float

The lower bound.

required
upper float

The upper bound.

required
dims int

The number of dimensions.

required

Returns:

Type Description
Self

A Boundary instance.

Raises:

Type Description
ValueError

If lower > upper.

Source code in src/bocoel/corpora/indices/interfaces/boundaries.py
@classmethod
def fixed(cls, lower: float, upper: float, dims: int) -> Self:
    """
    Create a fixed boundary for all dimensions.

    Parameters:
        lower: The lower bound.
        upper: The upper bound.
        dims: The number of dimensions.

    Returns:
        A `Boundary` instance.

    Raises:
        ValueError: If lower > upper.
    """

    if lower > upper:
        raise ValueError("Expected lower <= upper")

    return cls(bounds=np.array([[lower, upper]] * dims))
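
The idea of a hyperrectangular boundary can be illustrated with a small stand-in class. This is a sketch under stated assumptions, not the library's Boundary: `BoxBoundary` stores its bounds as a list of (lower, upper) pairs instead of an NDArray, and `contains` is a hypothetical helper added here to show how a range check might look.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoxBoundary:
    """Hyperrectangle stored as `dims` rows of (lower, upper)."""

    bounds: list

    @classmethod
    def fixed(cls, lower, upper, dims):
        """Use the same [lower, upper] interval on every dimension."""
        if lower > upper:
            raise ValueError("Expected lower <= upper")
        return cls(bounds=[(lower, upper)] * dims)

    @property
    def dims(self):
        return len(self.bounds)

    @property
    def lower(self):
        return [lo for lo, _ in self.bounds]

    @property
    def upper(self):
        return [hi for _, hi in self.bounds]

    def contains(self, point):
        """Hypothetical helper: is `point` inside the box on every axis?"""
        return len(point) == self.dims and all(
            lo <= value <= hi for value, (lo, hi) in zip(point, self.bounds)
        )


box = BoxBoundary.fixed(-1.0, 1.0, dims=3)
print(box.lower, box.upper)
print(box.contains([0.2, -1.0, 0.9]), box.contains([0.2, 1.5, 0.0]))
```

The [-1, 1] box used here corresponds to the default boundary documented for indices over normalized embeddings.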

bocoel.Distance

Bases: StrEnum

Distance metrics.

L2 class-attribute instance-attribute

L2 = 'L2'

L2 distance. Also known as Euclidean distance.

INNER_PRODUCT class-attribute instance-attribute

INNER_PRODUCT = 'IP'

Inner product distance. When the vectors are normalized, this is equivalent to cosine similarity.
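
The relationship between the two metrics on normalized vectors can be checked numerically. The sketch below uses plain Python; `normalize` and `inner_product` are hypothetical helper names for illustration, not functions from the library.

```python
import math


def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))


a, b = [3.0, 4.0], [4.0, 3.0]

# Cosine similarity of the raw vectors.
cosine = inner_product(a, b) / (math.hypot(*a) * math.hypot(*b))

# Inner product and squared L2 distance of the normalized vectors.
ua, ub = normalize(a), normalize(b)
ip = inner_product(ua, ub)
l2_squared = math.dist(ua, ub) ** 2

print(cosine, ip, l2_squared)
```

For unit vectors the squared L2 distance equals 2 - 2 * (inner product), so both metrics produce the same neighbor ranking once the embeddings are normalized.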

bocoel.corpora.indices.interfaces.results._SearchResult dataclass

query instance-attribute

query: NDArray

Query vector. If batched, should have shape [batch, dims]. Or else, should have shape [dims].

vectors instance-attribute

vectors: NDArray

Nearest neighbors. If batched, should have shape [batch, k, dims]. Or else, should have shape [k, dims].

distances instance-attribute

distances: NDArray

Calculated distance. If batched, should have shape [batch, k]. Or else, should have shape [k].

indices instance-attribute

indices: NDArray

Index in the original embeddings. Must be integers. If batched, should have shape [batch, k]. Or else, should have shape [k].

bocoel.corpora.SearchResultBatch dataclass

Bases: _SearchResult

A batched version of search result.

query instance-attribute

query: NDArray

Query vector. If batched, should have shape [batch, dims]. Or else, should have shape [dims].

vectors instance-attribute

vectors: NDArray

Nearest neighbors. If batched, should have shape [batch, k, dims]. Or else, should have shape [k, dims].

distances instance-attribute

distances: NDArray

Calculated distance. If batched, should have shape [batch, k]. Or else, should have shape [k].

indices instance-attribute

indices: NDArray

Index in the original embeddings. Must be integers. If batched, should have shape [batch, k]. Or else, should have shape [k].

bocoel.corpora.SearchResult dataclass

Bases: _SearchResult

A non-batched version of search result.

query instance-attribute

query: NDArray

Query vector. If batched, should have shape [batch, dims]. Or else, should have shape [dims].

vectors instance-attribute

vectors: NDArray

Nearest neighbors. If batched, should have shape [batch, k, dims]. Or else, should have shape [k, dims].

distances instance-attribute

distances: NDArray

Calculated distance. If batched, should have shape [batch, k]. Or else, should have shape [k].

indices instance-attribute

indices: NDArray

Index in the original embeddings. Must be integers. If batched, should have shape [batch, k]. Or else, should have shape [k].

bocoel.corpora.indices.interfaces.InternalResult

Bases: NamedTuple

distances instance-attribute

distances: NDArray

Calculated distance.

indices instance-attribute

indices: NDArray

Index in the original embeddings. Must be integers.