relevanceai.vector_tools.cluster

Module Contents

Classes

ClusterBase

Using verbose loguru as base logger for now

CentroidCluster

Using verbose loguru as base logger for now

DensityCluster

Using verbose loguru as base logger for now

MiniBatchKMeans

Using verbose loguru as base logger for now

KMeans

Using verbose loguru as base logger for now

HDBSCANClusterer

Using verbose loguru as base logger for now

Cluster

Batch API client

class relevanceai.vector_tools.cluster.ClusterBase

Bases: relevanceai.logger.LoguruLogger, doc_utils.DocUtils

Using verbose loguru as base logger for now

__call__(self, *args, **kwargs)
abstract fit_transform(self, vectors)
_concat_vectors_from_list(self, list_of_vectors: list)

Concatenate 2 vectors together in a pairwise fashion
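The one-line description leaves the input layout open. A minimal sketch, assuming `list_of_vectors` holds one list of vectors per vector field (all of equal length) and that "pairwise" means the i-th vectors of every field are joined end to end. `concat_vectors_pairwise` is a hypothetical stand-in, not the library's implementation:

```python
from itertools import chain
from typing import List


def concat_vectors_pairwise(list_of_vectors: List[List[List[float]]]) -> List[List[float]]:
    """Join the i-th vector of every field into one longer vector.

    Assumption: list_of_vectors[f][i] is the vector of document i for field f.
    """
    lengths = {len(field_vectors) for field_vectors in list_of_vectors}
    if len(lengths) > 1:
        raise ValueError("every field must supply the same number of vectors")
    # zip(*...) groups the vectors that belong to the same document.
    return [list(chain.from_iterable(per_doc)) for per_doc in zip(*list_of_vectors)]


blue = [[0.1, 0.2], [0.3, 0.4]]
red = [[1.0], [2.0]]
combined = concat_vectors_pairwise([blue, red])
```

Here `combined` holds one three-dimensional vector per document, built from the two-dimensional `blue` vector followed by the one-dimensional `red` vector.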

fit_documents(self, vector_fields: list, docs: list, alias: str = 'default', cluster_field: str = '_cluster_', return_only_clusters: bool = True, inplace: bool = True)

Train clustering algorithm on documents and then store the labels inside the documents.

Parameters
  • vector_fields (list) – The vector fields of the documents

  • docs (list) – List of documents to run clustering on

  • alias (str) – What the clusters can be called

  • cluster_field (str) – What the cluster fields should be called

  • return_only_clusters (bool) – If True, return only clusters, otherwise returns the original document

  • inplace (bool) – If True, the documents are edited inplace otherwise, a copy is made first

  • kwargs (dict) – Any other keyword argument will go directly into the clustering algorithm
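How the parameters above interact can be sketched without the library. The function below is a hypothetical stand-in, not the actual `fit_documents`: `cluster_fn` plays the role of `fit_transform`, and the nesting of labels under `cluster_field`, the joined field names and `alias` is an assumption made for illustration.

```python
import copy
from typing import Callable, Dict, List


def fit_documents_sketch(
    vector_fields: List[str],
    docs: List[Dict],
    cluster_fn: Callable[[List[List[float]]], List[int]],
    alias: str = "default",
    cluster_field: str = "_cluster_",
    return_only_clusters: bool = True,
    inplace: bool = True,
) -> List[Dict]:
    """Hypothetical stand-in for fit_documents; cluster_fn replaces fit_transform."""
    if not inplace:
        docs = copy.deepcopy(docs)  # leave the caller's documents untouched
    # Assumption: vectors from several fields are concatenated per document.
    vectors = [[x for field in vector_fields for x in doc[field]] for doc in docs]
    labels = cluster_fn(vectors)
    # Assumption: labels are nested as <cluster_field>.<joined field names>.<alias>.
    field_key = ".".join(vector_fields)
    for doc, label in zip(docs, labels):
        doc.setdefault(cluster_field, {}).setdefault(field_key, {})[alias] = label
    if return_only_clusters:
        return [{"_id": d["_id"], cluster_field: d[cluster_field]} for d in docs]
    return docs


docs = [
    {"_id": "a", "sample_vector_": [0.0, 0.1]},
    {"_id": "b", "sample_vector_": [0.9, 1.0]},
]
# Toy clustering rule used only for this example: split on the first component.
result = fit_documents_sketch(
    ["sample_vector_"], docs, lambda vs: [int(v[0] > 0.5) for v in vs], inplace=False
)
```

With `inplace=False` the original `docs` list stays unmodified, and with the default `return_only_clusters=True` each returned item carries only `_id` and the cluster field.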

abstract to_metadata(self)

You can also store the metadata of this clustering algorithm

property metadata(self)
_label_cluster(self, label: Union[int, str])
_label_clusters(self, labels)
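Since `fit_transform` and `to_metadata` are abstract, a custom algorithm has to supply both. The sketch below mirrors that contract with a hypothetical `ClusterBaseSketch` class so that it runs without the library. The behaviour of `__call__` (forwarding to `fit_transform`) and the toy `ThresholdCluster` are assumptions for illustration only:

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class ClusterBaseSketch(ABC):
    """Hypothetical stand-in that mirrors the two abstract methods listed above."""

    @abstractmethod
    def fit_transform(self, vectors: List[List[float]]) -> List[int]:
        """Return one cluster label per input vector."""

    @abstractmethod
    def to_metadata(self) -> Dict:
        """Return a description of how the clusters were produced."""

    def __call__(self, *args, **kwargs):
        # Assumption: calling the object is shorthand for fit_transform.
        return self.fit_transform(*args, **kwargs)


class ThresholdCluster(ClusterBaseSketch):
    """Toy algorithm: label 1 if the first component exceeds a threshold, else 0."""

    def __init__(self, threshold: float = 0.5):
        self.threshold = threshold

    def fit_transform(self, vectors):
        return [int(v[0] > self.threshold) for v in vectors]

    def to_metadata(self):
        return {"algorithm": "threshold", "threshold": self.threshold}


clusterer = ThresholdCluster(threshold=0.5)
labels = clusterer([[0.2, 0.0], [0.8, 0.0]])
```

Instantiating `ClusterBaseSketch` directly raises `TypeError`, which is the usual way an abstract base enforces that subclasses implement both methods.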
class relevanceai.vector_tools.cluster.CentroidCluster

Bases: ClusterBase

Using verbose loguru as base logger for now

__call__(self, *args, **kwargs)
abstract fit_transform(self, vectors)
abstract get_centers(self) -> Union[numpy.ndarray, List[list]]

Get centers for the centroid-based clusters

get_centroid_docs(self, centroid_vector_field_name='centroid_vector_') -> List

Get the centroid documents to store. If there is a single vector field, each document looks like this:

{
    "_id": "document-id-1",
    "centroid_vector_": [0.23, 0.24, 0.23]
}

If there are multiple vector fields, each document looks like this:

{
    "_id": "document-id-1",
    "blue_vector_": [0.12, 0.312, 0.42],
    "red_vector_": [0.23, 0.41, 0.3]
}
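For the single-vector-field case, the shape of the returned documents can be sketched as follows. `centroid_docs_sketch` is a hypothetical helper, and deriving `_id` from the cluster index is an assumption; the library may name centroid documents differently.

```python
from typing import Dict, List


def centroid_docs_sketch(
    centers: List[List[float]],
    centroid_vector_field_name: str = "centroid_vector_",
) -> List[Dict]:
    """Hypothetical stand-in: wrap each cluster center in a storable document."""
    return [
        # Assumption: ids are derived from the cluster index.
        {"_id": f"cluster-{i}", centroid_vector_field_name: list(center)}
        for i, center in enumerate(centers)
    ]


centroid_docs = centroid_docs_sketch([[0.23, 0.24, 0.23], [0.9, 0.8, 0.7]])
```

In this sketch `centers` stands for whatever `get_centers` returns, one vector per cluster.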

class relevanceai.vector_tools.cluster.DensityCluster

Bases: ClusterBase

Using verbose loguru as base logger for now

__call__(self, *args, **kwargs)
abstract fit_transform(self, vectors)
class relevanceai.vector_tools.cluster.MiniBatchKMeans(k: Union[None, int] = 10, init: str = 'k-means++', verbose: bool = False, compute_labels: bool = True, max_no_improvement: int = 2)

Bases: CentroidCluster

Using verbose loguru as base logger for now

_init_model(self)
fit_transform(self, vectors: Union[numpy.ndarray, List])

Fit the model and transform the vectors

get_centers(self)

Returns centroids of clusters

to_metadata(self)

Editing the metadata of the function

class relevanceai.vector_tools.cluster.KMeans(k=10, init='k-means++', n_init=10, max_iter=300, tol=0.0001, verbose=0, random_state=None, copy_x=True, algorithm='auto')

Bases: MiniBatchKMeans

Using verbose loguru as base logger for now

_init_model(self)
to_metadata(self)

Editing the metadata of the function

class relevanceai.vector_tools.cluster.HDBSCANClusterer(algorithm: str = 'best', alpha: float = 1.0, approx_min_span_tree: bool = True, gen_min_span_tree: bool = False, leaf_size: int = 40, memory=Memory(cachedir=None), metric: str = 'euclidean', min_samples: int = None, p: float = None, min_cluster_size: Union[None, int] = 10)

Bases: DensityCluster

Using verbose loguru as base logger for now

fit_transform(self, vectors: numpy.ndarray) -> numpy.ndarray
class relevanceai.vector_tools.cluster.Cluster(project, api_key)

Bases: relevanceai.vector_tools.cluster_evaluate.ClusterEvaluate, relevanceai.api.client.BatchAPIClient, ClusterBase

Batch API client

static _choose_k(vectors: numpy.ndarray)

Choose k clusters

static cluster(vectors: numpy.ndarray, cluster: Union[relevanceai.vector_tools.constants.CLUSTER, ClusterBase], cluster_args: Dict = {}, k: Union[None, int] = None) -> numpy.ndarray

Cluster vectors

kmeans_cluster(self, dataset_id: str, vector_fields: list, filters: List = [], k: Union[None, int] = 10, init: str = 'k-means++', n_init: int = 10, max_iter: int = 300, tol: float = 0.0001, verbose: bool = True, random_state: Optional[int] = None, copy_x: bool = True, algorithm: str = 'auto', alias: str = None, cluster_field: str = '_cluster_', update_documents_chunksize: int = 50, overwrite: bool = False, page_size: int = 1)

This function performs all the steps required for k-means clustering:

  1. Loads the data
  2. Clusters the data
  3. Updates the data with clustering info
  4. Adds the centroid to the hidden centroid collection

Parameters
  • dataset_id (string) – name of the dataset

  • vector_fields (list) – a list containing the vector field to be used for clustering

  • filters (list) – a list to filter documents of the dataset

  • k (int) – K in Kmeans

  • init (string) – “k-means++” -> Kmeans algorithm parameter

  • n_init (int) – number of reinitializations for the kmeans algorithm

  • max_iter (int) – max iteration in the kmeans algorithm

  • tol (float) – convergence tolerance of the kmeans algorithm

  • verbose (bool) – True by default

  • random_state (int, optional) – None by default -> Kmeans algorithm parameter

  • copy_x (bool) – True by default

  • algorithm (string) – “auto” by default

  • alias (string) – “kmeans”, string to be used in naming of the field showing the clustering results

  • cluster_field (string) – “_cluster_”, string to name the main cluster field

  • overwrite (bool) – False by default; set to True to overwrite an existing clustering result

Example

>>> client.vector_tools.cluster.kmeans_cluster(
    dataset_id="sample_dataset",
    vector_fields=["sample_1_vector_"]
)
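The four steps that `kmeans_cluster` performs can be sketched end to end in plain Python. This is an illustration under stated assumptions, not the library's implementation: documents are already in memory instead of being loaded from a dataset, the clustering is plain Lloyd's algorithm with random initial centers (not `k-means++`), and the `cluster-<n>` label format and the names `kmeans_sketch` and `kmeans_cluster_sketch` are hypothetical.

```python
import random
from typing import Dict, List, Tuple


def squared_distance(a: List[float], b: List[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_sketch(
    vectors: List[List[float]], k: int, max_iter: int = 300, random_state: int = 0
) -> Tuple[List[int], List[List[float]]]:
    """Plain Lloyd's algorithm with random initial centers (not k-means++)."""
    rng = random.Random(random_state)
    centers = [list(v) for v in rng.sample(vectors, k)]
    labels: List[int] = []
    for _ in range(max_iter):
        # Assignment step: nearest center for every vector.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(v, centers[c])) for v in vectors
        ]
        if new_labels == labels:
            break  # converged: assignments stopped changing
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                centers[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centers


def kmeans_cluster_sketch(
    docs: List[Dict], vector_field: str, k: int,
    alias: str = "kmeans", cluster_field: str = "_cluster_",
) -> Tuple[List[Dict], List[Dict]]:
    # 1. Load the data (here: documents already in memory).
    vectors = [doc[vector_field] for doc in docs]
    # 2. Cluster the data.
    labels, centers = kmeans_sketch(vectors, k)
    # 3. Update the documents with the clustering info (assumed label format).
    for doc, label in zip(docs, labels):
        doc[cluster_field] = {vector_field: {alias: f"cluster-{label}"}}
    # 4. Build the centroid documents that would go to the centroid collection.
    centroids = [{"_id": f"cluster-{i}", vector_field: c} for i, c in enumerate(centers)]
    return docs, centroids


docs = [
    {"_id": "a", "sample_1_vector_": [0.0, 0.0]},
    {"_id": "b", "sample_1_vector_": [0.1, 0.0]},
    {"_id": "c", "sample_1_vector_": [5.0, 5.0]},
    {"_id": "d", "sample_1_vector_": [5.1, 5.0]},
]
updated_docs, centroids = kmeans_cluster_sketch(docs, "sample_1_vector_", k=2)
```

Which of the two groups receives `cluster-0` depends on the random initial centers, so downstream code should compare labels for equality rather than rely on a specific number.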
hdbscan_cluster(self, dataset_id: str, vector_fields: list, filters: List = [], algorithm: str = 'best', alpha: float = 1.0, approx_min_span_tree: bool = True, gen_min_span_tree: bool = False, leaf_size: int = 40, memory=Memory(cachedir=None), metric: str = 'euclidean', min_samples=None, p=None, min_cluster_size: Union[None, int] = 10, alias: str = 'hdbscan', cluster_field: str = '_cluster_', update_documents_chunksize: int = 50, overwrite: bool = False)

This function performs all the steps required for hdbscan clustering:

  1. Loads the data
  2. Clusters the data
  3. Updates the data with clustering info
  4. Adds the centroid to the hidden centroid collection

Parameters
  • dataset_id (string) – name of the dataset

  • vector_fields (list) – a list containing the vector field to be used for clustering

  • filters (list) – a list to filter documents of the dataset

  • algorithm (str) – hdbscan configuration parameter, defaults to “best”

  • alpha (float) – hdbscan configuration parameter, defaults to 1.0

  • approx_min_span_tree (bool) – hdbscan configuration parameter, defaults to True

  • gen_min_span_tree (bool) – hdbscan configuration parameter, defaults to False

  • leaf_size (int) – hdbscan configuration parameter, defaults to 40

  • memory (Memory) – hdbscan configuration parameter on memory management, defaults to Memory(cachedir=None)

  • metric (str) – hdbscan configuration parameter, defaults to “euclidean”

  • p (float) – hdbscan configuration parameter, defaults to None

  • min_samples (int) – hdbscan configuration parameter, defaults to None

  • min_cluster_size (int) – minimum cluster size, 10 by default

  • alias (string) – “hdbscan”, string to be used in naming of the field showing the clustering results

  • cluster_field (string) – “_cluster_”, string to name the main cluster field

  • overwrite (bool) – False by default; set to True to overwrite an existing clustering result

Example

>>> client.vector_tools.cluster.hdbscan_cluster(
    dataset_id="sample_dataset",
    vector_fields=["sample_1_vector_"] # Only 1 vector field is supported for now
)