`relevanceai.vector_tools.cluster`

Module Contents

Classes

`ClusterBase`	Using verbose loguru as base logger for now
`CentroidCluster`	Using verbose loguru as base logger for now
`DensityCluster`	Using verbose loguru as base logger for now
`MiniBatchKMeans`	Using verbose loguru as base logger for now
`KMeans`	Using verbose loguru as base logger for now
`HDBSCANClusterer`	Using verbose loguru as base logger for now
`Cluster`	Batch API client

class relevanceai.vector_tools.cluster.ClusterBase

Bases: relevanceai.logger.LoguruLogger, doc_utils.DocUtils

Using verbose loguru as base logger for now

__call__(self, *args, **kwargs)

abstract fit_transform(self, vectors)

_concat_vectors_from_list(self, list_of_vectors: list): Concatenate 2 vectors together in a pairwise fashion
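One way to read "pairwise" here is that the i-th vector of each vector field is joined end to end, giving one longer vector per document. The following is a minimal sketch under that assumption; `concat_pairwise` is a hypothetical stand-in written for illustration, not the library's private method.

```python
from typing import List


def concat_pairwise(list_of_vectors: List[List[List[float]]]) -> List[List[float]]:
    """Join the i-th vector of every field into one longer vector.

    `list_of_vectors` holds one list of vectors per vector field. All
    lists are assumed to have the same length (one vector per document).
    """
    combined = []
    for per_document in zip(*list_of_vectors):
        joined: List[float] = []
        for vector in per_document:
            joined.extend(vector)
        combined.append(joined)
    return combined


blue = [[0.1, 0.2], [0.3, 0.4]]  # two documents, 2-d field
red = [[1.0], [2.0]]             # the same two documents, 1-d field
print(concat_pairwise([blue, red]))
```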

fit_documents(self, vector_fields: list, docs: list, alias: str = 'default', cluster_field: str = '_cluster_', return_only_clusters: bool = True, inplace: bool = True)

Train clustering algorithm on documents and then store the labels inside the documents.

Parameters

vector_fields (list) – The vector fields of the documents
docs (list) – List of documents to run clustering on
alias (str) – What the clusters can be called
cluster_field (str) – What the cluster fields should be called
return_only_clusters (bool) – If True, return only the clusters; otherwise return the original documents
inplace (bool) – If True, the documents are edited in place; otherwise, a copy is made first
kwargs (dict) – Any other keyword argument will go directly into the clustering algorithm
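The interaction between `inplace` and `return_only_clusters` can be sketched with a standalone function. This is an illustration only: `fit_documents_sketch` and `sign_cluster` are hypothetical names, only a single vector field is handled, and the nesting of labels as cluster field, then vector field, then alias is an assumption rather than something the reference above states.

```python
import copy
from typing import Callable, Dict, List, Sequence


def fit_documents_sketch(
    cluster_fn: Callable[[List[List[float]]], Sequence[int]],
    vector_fields: List[str],
    docs: List[Dict],
    alias: str = "default",
    cluster_field: str = "_cluster_",
    return_only_clusters: bool = True,
    inplace: bool = True,
) -> List[Dict]:
    # Work on a deep copy when the caller does not want its input modified.
    if not inplace:
        docs = copy.deepcopy(docs)

    # Assumption: a single vector field, read directly from each document.
    field = vector_fields[0]
    labels = cluster_fn([doc[field] for doc in docs])

    # Assumption: labels are nested as cluster_field -> vector field -> alias.
    for doc, label in zip(docs, labels):
        doc.setdefault(cluster_field, {}).setdefault(field, {})[alias] = label

    if return_only_clusters:
        return [{"_id": d["_id"], cluster_field: d[cluster_field]} for d in docs]
    return docs


def sign_cluster(vectors):
    # Toy stand-in for a real algorithm: label by the sign of the first component.
    return [0 if v[0] < 0 else 1 for v in vectors]


docs = [
    {"_id": "a", "sample_vector_": [-1.0, 0.2]},
    {"_id": "b", "sample_vector_": [0.9, 0.1]},
]
result = fit_documents_sketch(
    sign_cluster, ["sample_vector_"], docs, alias="toy", inplace=False
)
print(result)
```

With `inplace=False` the input list is left untouched, and with `return_only_clusters=True` each returned document carries only its `_id` and the cluster field.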

abstract to_metadata(self): You can also store the metadata of this clustering algorithm

property metadata(self)

_label_cluster(self, label: Union[int, str])

_label_clusters(self, labels)

class relevanceai.vector_tools.cluster.CentroidCluster

Bases: ClusterBase

Using verbose loguru as base logger for now

__call__(self, *args, **kwargs)

abstract fit_transform(self, vectors)

abstract get_centers(self) → Union[numpy.ndarray, List[list]]: Get centers for the centroid-based clusters

get_centroid_docs(self, centroid_vector_field_name='centroid_vector_') → List

Get the centroid documents to store. If there is a single vector field, returns this:

{
    "_id": "document-id-1",
    "centroid_vector_": [0.23, 0.24, 0.23]
}

If there are multiple vector fields, returns this:

{
    "_id": "document-id-1",
    "blue_vector_": [0.12, 0.312, 0.42],
    "red_vector_": [0.23, 0.41, 0.3]
}
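For the single-vector-field case, building such documents from a list of cluster centers might look like the sketch below. `centroid_docs_sketch` is a hypothetical helper, and the `cluster-<index>` naming of `_id` is an assumption made for the example.

```python
from typing import Dict, List


def centroid_docs_sketch(
    centers: List[List[float]],
    centroid_vector_field_name: str = "centroid_vector_",
) -> List[Dict]:
    # One document per cluster center; the "_id" naming is an assumption.
    return [
        {"_id": f"cluster-{i}", centroid_vector_field_name: list(center)}
        for i, center in enumerate(centers)
    ]


centers = [[0.23, 0.24, 0.23], [0.9, 0.1, 0.4]]
centroid_docs = centroid_docs_sketch(centers)
print(centroid_docs[0])
```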

class relevanceai.vector_tools.cluster.DensityCluster

Bases: ClusterBase

Using verbose loguru as base logger for now

__call__(self, *args, **kwargs)

abstract fit_transform(self, vectors)

class relevanceai.vector_tools.cluster.MiniBatchKMeans(k: Union[None, int] = 10, init: str = 'k-means++', verbose: bool = False, compute_labels: bool = True, max_no_improvement: int = 2)

Bases: CentroidCluster

Using verbose loguru as base logger for now

_init_model(self)

fit_transform(self, vectors: Union[numpy.ndarray, List]): Fit and transform the vectors

get_centers(self): Returns centroids of clusters

to_metadata(self): Editing the metadata of the function

class relevanceai.vector_tools.cluster.KMeans(k=10, init='k-means++', n_init=10, max_iter=300, tol=0.0001, verbose=0, random_state=None, copy_x=True, algorithm='auto')

Bases: MiniBatchKMeans

Using verbose loguru as base logger for now

_init_model(self)

to_metadata(self): Editing the metadata of the function

class relevanceai.vector_tools.cluster.HDBSCANClusterer(algorithm: str = 'best', alpha: float = 1.0, approx_min_span_tree: bool = True, gen_min_span_tree: bool = False, leaf_size: int = 40, memory=Memory(cachedir=None), metric: str = 'euclidean', min_samples: int = None, p: float = None, min_cluster_size: Union[None, int] = 10)

Bases: DensityCluster

Using verbose loguru as base logger for now

fit_transform(self, vectors: numpy.ndarray) → numpy.ndarray

class relevanceai.vector_tools.cluster.Cluster(project, api_key)

Bases: relevanceai.vector_tools.cluster_evaluate.ClusterEvaluate, relevanceai.api.client.BatchAPIClient, ClusterBase

Batch API client

static _choose_k(vectors: numpy.ndarray): Choose k clusters

static cluster(vectors: numpy.ndarray, cluster: Union[relevanceai.vector_tools.constants.CLUSTER, ClusterBase], cluster_args: Dict = {}, k: Union[None, int] = None) → numpy.ndarray: Cluster vectors

kmeans_cluster(self, dataset_id: str, vector_fields: list, filters: List = [], k: Union[None, int] = 10, init: str = 'k-means++', n_init: int = 10, max_iter: int = 300, tol: float = 0.0001, verbose: bool = True, random_state: Optional[int] = None, copy_x: bool = True, algorithm: str = 'auto', alias: str = None, cluster_field: str = '_cluster_', update_documents_chunksize: int = 50, overwrite: bool = False, page_size: int = 1)

This function performs all the steps required for KMeans clustering:

1. Loads the data
2. Clusters the data
3. Updates the data with clustering info
4. Adds the centroid to the hidden centroid collection

Parameters

dataset_id (string) – name of the dataset
vector_fields (list) – a list containing the vector field to be used for clustering
filters (list) – a list of filters applied to the documents of the dataset
k (int) – K in KMeans, 10 by default
init (string) – “k-means++” by default -> KMeans algorithm parameter
n_init (int) – number of reinitializations for the KMeans algorithm
max_iter (int) – maximum number of iterations in the KMeans algorithm
tol (float) – tol in the KMeans algorithm
verbose (bool) – True by default
random_state (int, optional) – None by default -> KMeans algorithm parameter
copy_x (bool) – True by default
algorithm (string) – “auto” by default
alias (string) – “kmeans”, string to be used in naming of the field showing the clustering results
cluster_field (string) – “_cluster_”, string to name the main cluster field
overwrite (bool) – False by default. Set to True to overwrite an existing clustering result

Example

>>> client.vector_tools.cluster.kmeans_cluster(
    dataset_id="sample_dataset",
    vector_fields=["sample_1_vector_"]
)
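The four steps that `kmeans_cluster` performs can be sketched end to end without a server. This is a rough illustration, not the library's implementation: an in-memory list stands in for the dataset, `kmeans_sketch` is a hypothetical pure-Python Lloyd's algorithm with a deterministic farthest-first initialisation (not k-means++), and the `cluster-<index>` labels and the nesting of the cluster field are assumptions.

```python
from typing import Dict, List, Tuple


def squared_distance(a: List[float], b: List[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_sketch(
    vectors: List[List[float]], k: int, max_iter: int = 300
) -> Tuple[List[int], List[List[float]]]:
    """Plain Lloyd's algorithm with farthest-first initial centers."""
    # Deterministic init: start at the first vector, then repeatedly add
    # the vector farthest from all centers chosen so far.
    centers = [list(vectors[0])]
    while len(centers) < k:
        farthest = max(
            vectors, key=lambda v: min(squared_distance(v, c) for c in centers)
        )
        centers.append(list(farthest))

    labels: List[int] = []
    for _ in range(max_iter):
        # Assignment step: nearest center for every vector.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(v, centers[c]))
            for v in vectors
        ]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [v for v, l in zip(vectors, new_labels) if l == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
    return labels, centers


# 1. Load the data (an in-memory list stands in for the dataset).
docs: List[Dict] = [
    {"_id": "a", "sample_1_vector_": [0.0, 0.1]},
    {"_id": "b", "sample_1_vector_": [0.1, 0.0]},
    {"_id": "c", "sample_1_vector_": [5.0, 5.1]},
    {"_id": "d", "sample_1_vector_": [5.1, 5.0]},
]
field, alias = "sample_1_vector_", "kmeans"

# 2. Cluster the data.
labels, centers = kmeans_sketch([d[field] for d in docs], k=2)

# 3. Update the documents with clustering info (the nesting is an assumption).
for doc, label in zip(docs, labels):
    doc["_cluster_"] = {field: {alias: f"cluster-{label}"}}

# 4. Build the centroid documents that would go to the centroid collection.
centroid_docs = [
    {"_id": f"cluster-{i}", field: center} for i, center in enumerate(centers)
]
print(labels, centroid_docs)
```

In the real method, steps 1, 3 and 4 go through the API rather than local lists, and `update_documents_chunksize` presumably controls how many documents are written per request.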

hdbscan_cluster(self, dataset_id: str, vector_fields: list, filters: List = [], algorithm: str = 'best', alpha: float = 1.0, approx_min_span_tree: bool = True, gen_min_span_tree: bool = False, leaf_size: int = 40, memory=Memory(cachedir=None), metric: str = 'euclidean', min_samples=None, p=None, min_cluster_size: Union[None, int] = 10, alias: str = 'hdbscan', cluster_field: str = '_cluster_', update_documents_chunksize: int = 50, overwrite: bool = False)

This function performs all the steps required for HDBSCAN clustering:

1. Loads the data
2. Clusters the data
3. Updates the data with clustering info
4. Adds the centroid to the hidden centroid collection

Parameters

dataset_id (string) – name of the dataset
vector_fields (list) – a list containing the vector field to be used for clustering
filters (list) – a list of filters applied to the documents of the dataset
algorithm (str) – hdbscan configuration parameter, defaults to “best”
alpha (float) – hdbscan configuration parameter, defaults to 1.0
approx_min_span_tree (bool) – hdbscan configuration parameter, defaults to True
gen_min_span_tree (bool) – hdbscan configuration parameter, defaults to False
leaf_size (int) – hdbscan configuration parameter, defaults to 40
memory – hdbscan configuration parameter on memory management, defaults to Memory(cachedir=None)
metric (str) – hdbscan configuration parameter, defaults to “euclidean”
min_samples – hdbscan configuration parameter, defaults to None
p – hdbscan configuration parameter, defaults to None
min_cluster_size (int) – minimum cluster size, 10 by default
alias (string) – “hdbscan”, string to be used in naming of the field showing the clustering results
cluster_field (string) – “_cluster_”, string to name the main cluster field
overwrite (bool) – False by default. Set to True to overwrite an existing clustering result

Example

>>> client.vector_tools.cluster.hdbscan_cluster(
    dataset_id="sample_dataset",
    vector_fields=["sample_1_vector_"] # Only 1 vector field is supported for now
)
