`relevanceai.vector_tools.cluster_evaluate`

Module Contents

Classes

ClusterEvaluate

Batch API client

Functions

sort_dict(dict, reverse: bool = True, cut_off=0)

Attributes

`SILHOUETTE_INFO`
`RAND_INFO`
`HOMOGENEITY_INFO`
`COMPLETENESS_INFO`
`METRIC_DESCRIPTION`

relevanceai.vector_tools.cluster_evaluate.SILHOUETTE_INFO = Multiline-String

Good clusters are highly separated, and the elements within them are highly cohesive. <br/>
<b>Silhouette Score</b> is a metric from <b>-1 to 1</b> that calculates the average cohesion and separation of each element, with <b>1</b> being clustered perfectly, <b>0</b> being indifferent and <b>-1</b> being clustered the wrong way.
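The description above can be made concrete with a small standard-library sketch. It is not the module's implementation: the function name `silhouette_sketch` is hypothetical, Euclidean distance is an assumption, and it follows the textbook definition, in which each element scores (b - a) / max(a, b), where a is its mean distance to its own cluster and b is its mean distance to the nearest other cluster.

```python
import math
from collections import defaultdict


def silhouette_sketch(vectors, cluster_labels):
    """Mean silhouette coefficient over all elements (hypothetical helper)."""
    members = defaultdict(list)
    for index, label in enumerate(cluster_labels):
        members[label].append(index)
    if len(members) < 2:
        raise ValueError("at least two clusters are required")

    def mean_distance(i, indices):
        # Average Euclidean distance from element i to the given elements.
        return sum(math.dist(vectors[i], vectors[j]) for j in indices) / len(indices)

    total = 0.0
    for i, label in enumerate(cluster_labels):
        own = [j for j in members[label] if j != i]
        if not own:
            continue  # a singleton cluster contributes 0 by convention
        cohesion = mean_distance(i, own)  # a: distance within the own cluster
        separation = min(  # b: distance to the nearest other cluster
            mean_distance(i, indices)
            for other, indices in members.items()
            if other != label
        )
        widest = max(cohesion, separation)
        if widest > 0:
            total += (separation - cohesion) / widest
    return total / len(cluster_labels)


vectors = [[0, 0], [0, 1], [10, 0], [10, 1]]
good = silhouette_sketch(vectors, [0, 0, 1, 1])  # tight, well-separated clusters
bad = silhouette_sketch(vectors, [0, 1, 0, 1])   # clusters cut across the groups
```

In this toy data the first labeling scores close to 1 and the second scores below 0, which matches the "clustered the wrong way" end of the scale.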

relevanceai.vector_tools.cluster_evaluate.RAND_INFO = Multiline-String

Good clusters have elements, which, when paired, belong to the same cluster label and same ground truth label. <br/>
<b>Rand Index</b> is a metric from <b>0 to 1</b> that represents the percentage of element pairs that have matching cluster and ground truth labels, with <b>1</b> matching perfectly and <b>0</b> matching randomly. <br/> <i>Note: This measure is adjusted for randomness so does not equal the exact numerical percentage.</i>
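To illustrate the pair-counting idea and the adjustment for randomness, here is a minimal sketch of the standard adjusted Rand index. The function name `adjusted_rand_sketch` is hypothetical, and whether the module computes the score exactly this way is an assumption. Note that the adjusted form can drop below 0 when the agreement is worse than chance.

```python
from collections import Counter
from math import comb


def adjusted_rand_sketch(ground_truth, cluster_labels):
    """Adjusted Rand index from pair counts (hypothetical helper)."""
    pairs_total = comb(len(ground_truth), 2)
    if pairs_total == 0:
        return 1.0  # fewer than two elements: nothing to disagree on

    def same_label_pairs(labels):
        # Number of element pairs that share a label.
        return sum(comb(count, 2) for count in Counter(labels).values())

    # Pairs that agree in both labelings at once.
    pairs_both = same_label_pairs(list(zip(ground_truth, cluster_labels)))
    pairs_truth = same_label_pairs(ground_truth)
    pairs_cluster = same_label_pairs(cluster_labels)

    expected = pairs_truth * pairs_cluster / pairs_total  # chance-level agreement
    maximum = (pairs_truth + pairs_cluster) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both labelings use a single label
    return (pairs_both - expected) / (maximum - expected)


truth = ["a", "a", "b", "b"]
perfect = adjusted_rand_sketch(truth, [1, 1, 0, 0])  # label names do not matter
crossed = adjusted_rand_sketch(truth, [0, 1, 0, 1])  # worse than chance
```

The score depends only on which elements are grouped together, so renaming the cluster labels leaves it unchanged.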

relevanceai.vector_tools.cluster_evaluate.HOMOGENEITY_INFO = Multiline-String

Good clusters only have elements from the same ground truth within the same cluster<br/>
<b>Homogeneity</b> is a metric from <b>0 to 1</b> that represents whether clusters contain only elements in the same ground truth with <b>1</b> being perfect and <b>0</b> being absolutely incorrect.

relevanceai.vector_tools.cluster_evaluate.COMPLETENESS_INFO = Multiline-String

Good clusters have all elements from the same ground truth within the same cluster <br/>
<b>Completeness</b> is a metric from <b>0 to 1</b> that represents whether clusters contain all elements in the same ground truth with <b>1</b> being perfect and <b>0</b> being absolutely incorrect.
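Homogeneity and completeness are mirror images of each other, which a short sketch makes visible. This uses the usual conditional-entropy definitions; the function names are hypothetical and the code is not taken from the module.

```python
from collections import Counter
from math import log


def _entropy(labels):
    # Shannon entropy of a label list.
    n = len(labels)
    return -sum(c / n * log(c / n) for c in Counter(labels).values())


def _conditional_entropy(target, given):
    # H(target | given): uncertainty left in `target` once `given` is known.
    n = len(target)
    given_counts = Counter(given)
    joint_counts = Counter(zip(target, given))
    return -sum(
        c / n * log(c / given_counts[g]) for (_, g), c in joint_counts.items()
    )


def homogeneity_sketch(ground_truth, cluster_labels):
    """1 when every cluster holds elements of a single ground truth label."""
    h_truth = _entropy(ground_truth)
    if h_truth == 0:
        return 1.0  # only one ground truth label, so every cluster is pure
    return 1.0 - _conditional_entropy(ground_truth, cluster_labels) / h_truth


def completeness_sketch(ground_truth, cluster_labels):
    """1 when every ground truth label sits inside a single cluster."""
    # Completeness is homogeneity with the two labelings swapped.
    return homogeneity_sketch(cluster_labels, ground_truth)


truth = ["a", "a", "b", "b"]
over_split = [0, 1, 2, 3]  # every element in its own cluster
one_cluster = [0, 0, 0, 0]  # everything lumped together
```

Over-splitting keeps homogeneity at 1 but lowers completeness, while lumping everything into one cluster does the opposite, so the two scores are usually read together.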

relevanceai.vector_tools.cluster_evaluate.METRIC_DESCRIPTION

relevanceai.vector_tools.cluster_evaluate.sort_dict(dict, reverse: bool = True, cut_off=0)

class relevanceai.vector_tools.cluster_evaluate.ClusterEvaluate(project, api_key)

Bases: relevanceai.api.client.BatchAPIClient, relevanceai.base._Base, doc_utils.DocUtils

Batch API client

plot(self, dataset_id: str, vector_field: str, cluster_alias: str, ground_truth_field: str = None, description_fields: list = [], marker_size: int = 5)

Plot the vectors in a collection to compare the performance of cluster labels, optionally against ground truth labels

Parameters

dataset_id (string) – Unique name of dataset
vector_field (string) – The vector field that was clustered upon
cluster_alias (string) – The alias of the clustered labels
ground_truth_field (string) – The field to use as ground truth
description_fields (list) – List of fields to use as additional labels on plot
marker_size (int) – Size of scatterplot marker

metrics(self, dataset_id: str, vector_field: str, cluster_alias: str, ground_truth_field: str = None)

Determine the performance of clusters through the Silhouette Score, and optionally against ground truth labels through Rand Index, Homogeneity and Completeness

Parameters

dataset_id (string) – Unique name of dataset
vector_field (string) – The vector field that was clustered upon
cluster_alias (string) – The alias of the clustered labels
ground_truth_field (string) – The field to use as ground truth

distribution(self, dataset_id: str, vector_field: str, cluster_alias: str, ground_truth_field: str = None, transpose=False)

Determine the distribution of clusters, optionally against the ground truth

Parameters

dataset_id (string) – Unique name of dataset
vector_field (string) – The vector field that was clustered upon
cluster_alias (string) – The alias of the clustered labels
ground_truth_field (string) – The field to use as ground truth
transpose (bool) – Whether to transpose cluster and ground truth perspectives

_get_cluster_documents(self, dataset_id: str, vector_field: str, cluster_alias: str, ground_truth_field: str = None, description_fields: list = [], get_vectors=True)

Return vectors, cluster labels, ground truth labels and other fields

static plot_from_docs(vectors: list, cluster_labels: list, ground_truth: list = None, vector_description: dict = None, marker_size: int = 5)

Plot the vectors in a collection to compare the performance of cluster labels, optionally against ground truth labels

Parameters

vectors (list) – List of vectors which were clustered upon
cluster_labels (list) – List of cluster labels corresponding to the vectors
ground_truth (list) – List of ground truth labels for the vectors
vector_description (dict) – Dictionary of fields and their values to describe the vectors
marker_size (int) – Size of scatterplot marker

static metrics_from_docs(vectors, cluster_labels, ground_truth=None)

Determine the performance of clusters through the Silhouette Score, and optionally against ground truth labels through Rand Index, Homogeneity and Completeness

Parameters

vectors (list) – List of vectors which were clustered upon
cluster_labels (list) – List of cluster labels corresponding to the vectors
ground_truth (list) – List of ground truth labels for the vectors

static label_distribution_from_docs(label)

Determine the distribution of a label

Parameters

label (list) – List of labels
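As an illustration of what a label distribution is, the sketch below maps each label to its share of the list. The return format of `label_distribution_from_docs` is not documented here, so the function name `label_distribution_sketch` and the choice of proportions rather than raw counts are assumptions.

```python
from collections import Counter


def label_distribution_sketch(labels):
    """Map each label to the fraction of elements carrying it (hypothetical helper)."""
    total = len(labels)
    return {label: count / total for label, count in Counter(labels).items()}


distribution = label_distribution_sketch(["cat", "cat", "dog", "bird"])
```

Swapping `count / total` for `count` would give raw counts instead, if that is the form needed.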

static label_joint_distribution_from_docs(label_1, label_2)

Determine the distribution of a label against another label

Parameters

label_1 (list) – List of labels
label_2 (list) – List of labels
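A joint distribution of two label lists can be sketched as a nested mapping: for each value of the first label, the share of each value of the second label among the matching elements. The function name `joint_distribution_sketch` and the nested-dictionary output are assumptions, not the documented return format.

```python
from collections import Counter, defaultdict


def joint_distribution_sketch(label_1, label_2):
    """For each value in label_1, the distribution of label_2 (hypothetical helper)."""
    if len(label_1) != len(label_2):
        raise ValueError("both label lists must have the same length")
    grouped = defaultdict(Counter)
    for first, second in zip(label_1, label_2):
        grouped[first][second] += 1
    return {
        first: {
            second: count / sum(counter.values())
            for second, count in counter.items()
        }
        for first, counter in grouped.items()
    }


clusters = [0, 0, 0, 1]
truth = ["cat", "cat", "dog", "dog"]
by_cluster = joint_distribution_sketch(clusters, truth)  # what each cluster contains
by_truth = joint_distribution_sketch(truth, clusters)    # where each true label went
```

Swapping the two arguments switches between the cluster perspective and the ground truth perspective. This may be what the `transpose` flag of `distribution` toggles, but that is an assumption.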

static silhouette_score(vectors, cluster_labels)

static adjusted_rand_score(ground_truth, cluster_labels)

static completeness_score(ground_truth, cluster_labels)

static homogeneity_score(ground_truth, cluster_labels)

static _generate_layout()

static _generate_plot(df, hover_label, marker_size)
