relevanceai.vector_tools.cluster_evaluate

Module Contents

Classes

ClusterEvaluate

Batch API client

Functions

sort_dict(dict, reverse: bool = True, cut_off=0)

Attributes

SILHOUETTE_INFO

RAND_INFO

HOMOGENEITY_INFO

COMPLETENESS_INFO

METRIC_DESCRIPTION

relevanceai.vector_tools.cluster_evaluate.SILHOUETTE_INFO = Multiline-String
Good clusters are highly separated from one another, and the elements within each cluster are highly cohesive. <br/>
<b>Silhouette Score</b> is a metric from <b>-1 to 1</b> that calculates the average cohesion and separation of each element, with <b>1</b> being clustered perfectly, <b>0</b> being indifferent and <b>-1</b> being clustered the wrong way.
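The score described in this string can be reproduced from its definition. The following is a minimal sketch in plain Python, not the code this module runs: the helper name silhouette and the sample points are illustrative assumptions, and Euclidean distance is assumed as the distance measure.

```python
import math
from collections import defaultdict


def silhouette(vectors, cluster_labels):
    """Mean silhouette coefficient (hypothetical helper, Euclidean distance assumed)."""
    clusters = defaultdict(list)
    for vec, label in zip(vectors, cluster_labels):
        clusters[label].append(vec)
    if len(clusters) < 2:
        raise ValueError("the silhouette score needs at least two clusters")

    scores = []
    for vec, label in zip(vectors, cluster_labels):
        own = clusters[label]
        if len(own) == 1:
            # By convention a singleton cluster contributes a score of 0.
            scores.append(0.0)
            continue
        # Cohesion: mean distance to the other members of the same cluster.
        a = sum(math.dist(vec, other) for other in own) / (len(own) - 1)
        # Separation: mean distance to the nearest other cluster.
        b = min(
            sum(math.dist(vec, other) for other in members) / len(members)
            for other_label, members in clusters.items()
            if other_label != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Two tight groups of points that are far apart (illustrative data).
vectors = [(0, 0), (0, 1), (10, 0), (10, 1)]
well_clustered = silhouette(vectors, [0, 0, 1, 1])   # roughly 0.9
badly_clustered = silhouette(vectors, [0, 1, 0, 1])  # negative
```

Labels that follow the two groups score close to 1, while labels that cut across them score below 0.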
relevanceai.vector_tools.cluster_evaluate.RAND_INFO = Multiline-String
Good clusters have elements, which, when paired, belong to the same cluster label and same ground truth label. <br/>
<b>Rand Index</b> is a metric from <b>0 to 1</b> that represents the percentage of element pairs that have matching cluster and ground truth labels, with <b>1</b> matching perfectly and <b>0</b> matching randomly. <br/> <i>Note: This measure is adjusted for randomness, so it does not equal the exact numerical percentage.</i>
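The pair counting behind this metric, together with the adjustment for chance mentioned in the note, can be sketched as follows. This is an illustration written from the standard definition of the adjusted Rand index, not this module's implementation; the helper name adjusted_rand and the sample labels are assumptions. One consequence of the adjustment is that labelings which agree less often than chance would predict score below 0.

```python
from collections import Counter
from math import comb


def adjusted_rand(ground_truth, cluster_labels):
    """Adjusted Rand index from pair counts (hypothetical helper)."""
    total_pairs = comb(len(ground_truth), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two elements: nothing to disagree on

    # Pairs that share both a ground truth label and a cluster label.
    pairs_both = sum(
        comb(c, 2) for c in Counter(zip(ground_truth, cluster_labels)).values()
    )
    # Pairs that share a ground truth label, and pairs that share a cluster label.
    pairs_truth = sum(comb(c, 2) for c in Counter(ground_truth).values())
    pairs_cluster = sum(comb(c, 2) for c in Counter(cluster_labels).values())

    expected = pairs_truth * pairs_cluster / total_pairs  # agreement expected by chance
    max_index = (pairs_truth + pairs_cluster) / 2
    if max_index == expected:
        return 1.0
    return (pairs_both - expected) / (max_index - expected)


truth = ["a", "a", "b", "b"]
perfect = adjusted_rand(truth, [1, 1, 0, 0])     # cluster names differ, grouping matches
discordant = adjusted_rand(truth, [0, 1, 0, 1])  # every same-truth pair is split
```

The cluster label values themselves do not matter, only which elements are grouped together.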
relevanceai.vector_tools.cluster_evaluate.HOMOGENEITY_INFO = Multiline-String
Good clusters only have elements from the same ground truth within the same cluster. <br/>
<b>Homogeneity</b> is a metric from <b>0 to 1</b> that represents whether clusters contain only elements in the same ground truth with <b>1</b> being perfect and <b>0</b> being absolutely incorrect.
relevanceai.vector_tools.cluster_evaluate.COMPLETENESS_INFO = Multiline-String
Good clusters have all elements from the same ground truth within the same cluster. <br/>
<b>Completeness</b> is a metric from <b>0 to 1</b> that represents whether clusters contain all elements in the same ground truth with <b>1</b> being perfect and <b>0</b> being absolutely incorrect.
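Homogeneity and completeness are mirror images of each other, which is easiest to see on a small example. The sketch below uses the usual conditional-entropy definitions; it is an illustration rather than this module's code, and the helper names and sample labels are assumptions.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy of a list of labels."""
    n = len(labels)
    return -sum(c / n * log(c / n) for c in Counter(labels).values())


def conditional_entropy(labels, given):
    """Entropy of `labels` that remains once `given` is known."""
    n = len(labels)
    joint = Counter(zip(given, labels))
    given_counts = Counter(given)
    return -sum(c / n * log(c / given_counts[g]) for (g, _), c in joint.items())


def homogeneity(ground_truth, cluster_labels):
    """1 when every cluster holds a single ground truth label (hypothetical helper)."""
    h = entropy(ground_truth)
    if h == 0:
        return 1.0
    return 1.0 - conditional_entropy(ground_truth, cluster_labels) / h


def completeness(ground_truth, cluster_labels):
    """1 when every ground truth label sits in a single cluster (hypothetical helper)."""
    return homogeneity(cluster_labels, ground_truth)


truth = ["a", "a", "b", "b"]

# Splitting class "a" across two clusters keeps each cluster pure but scatters "a".
over_split_h = homogeneity(truth, [0, 1, 2, 2])
over_split_c = completeness(truth, [0, 1, 2, 2])

# Putting everything in one cluster keeps classes together but mixes them.
merged_h = homogeneity(truth, [0, 0, 0, 0])
merged_c = completeness(truth, [0, 0, 0, 0])
```

Over-splitting leaves homogeneity at 1 and lowers completeness, while merging everything does the opposite, so the two scores are best read together.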
relevanceai.vector_tools.cluster_evaluate.METRIC_DESCRIPTION
relevanceai.vector_tools.cluster_evaluate.sort_dict(dict, reverse: bool = True, cut_off=0)
class relevanceai.vector_tools.cluster_evaluate.ClusterEvaluate(project, api_key)

Bases: relevanceai.api.client.BatchAPIClient, relevanceai.base._Base, doc_utils.DocUtils

Batch API client

plot(self, dataset_id: str, vector_field: str, cluster_alias: str, ground_truth_field: str = None, description_fields: list = [], marker_size: int = 5)

Plot the vectors in a collection to compare the performance of cluster labels, optionally against ground truth labels

Parameters
  • dataset_id (string) – Unique name of dataset

  • vector_field (string) – The vector field that was clustered upon

  • cluster_alias (string) – The alias of the clustered labels

  • ground_truth_field (string) – The field to use as ground truth

  • description_fields (list) – List of fields to use as additional labels on plot

  • marker_size (int) – Size of scatterplot marker

metrics(self, dataset_id: str, vector_field: str, cluster_alias: str, ground_truth_field: str = None)

Determine the performance of clusters through the Silhouette Score, and optionally against ground truth labels through Rand Index, Homogeneity and Completeness

Parameters
  • dataset_id (string) – Unique name of dataset

  • vector_field (string) – The vector field that was clustered upon

  • cluster_alias (string) – The alias of the clustered labels

  • ground_truth_field (string) – The field to use as ground truth

distribution(self, dataset_id: str, vector_field: str, cluster_alias: str, ground_truth_field: str = None, transpose=False)

Determine the distribution of clusters, optionally against the ground truth

Parameters
  • dataset_id (string) – Unique name of dataset

  • vector_field (string) – The vector field that was clustered upon

  • cluster_alias (string) – The alias of the clustered labels

  • ground_truth_field (string) – The field to use as ground truth

  • transpose (bool) – Whether to transpose cluster and ground truth perspectives
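Because metrics and distribution take the same leading arguments, they can be called together. The wrapper below is a hedged sketch: evaluate_clustering is a hypothetical helper, the evaluator argument is assumed to be an already constructed ClusterEvaluate instance, and the return values are passed through unchanged because their shape is not documented here.

```python
def evaluate_clustering(evaluator, dataset_id, vector_field, cluster_alias,
                        ground_truth_field=None):
    """Call metrics() and distribution() with one shared set of arguments.

    `evaluator` is assumed to expose the metrics and distribution methods
    with the signatures listed in this reference.
    """
    shared = dict(
        dataset_id=dataset_id,
        vector_field=vector_field,
        cluster_alias=cluster_alias,
        ground_truth_field=ground_truth_field,
    )
    return {
        "metrics": evaluator.metrics(**shared),
        "distribution": evaluator.distribution(**shared),
    }


# Assumed usage; the project, key, dataset and field names are placeholders:
#   from relevanceai.vector_tools.cluster_evaluate import ClusterEvaluate
#   evaluator = ClusterEvaluate(project="<project>", api_key="<api_key>")
#   report = evaluate_clustering(evaluator, "<dataset_id>", "<vector_field>", "<cluster_alias>")
```

Leaving ground_truth_field as None restricts the evaluation to what can be computed from the cluster labels alone.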

_get_cluster_documents(self, dataset_id: str, vector_field: str, cluster_alias: str, ground_truth_field: str = None, description_fields: list = [], get_vectors=True)

Return vectors, cluster labels, ground truth labels and other fields

static plot_from_docs(vectors: list, cluster_labels: list, ground_truth: list = None, vector_description: dict = None, marker_size: int = 5)

Plot the vectors in a collection to compare the performance of cluster labels, optionally against ground truth labels

Parameters
  • vectors (list) – List of vectors which were clustered upon

  • cluster_labels (list) – List of cluster labels corresponding to the vectors

  • ground_truth (list) – List of ground truth labels for the vectors

  • vector_description (dict) – Dictionary of fields and their values to describe the vectors

  • marker_size (int) – Size of scatterplot marker

static metrics_from_docs(vectors, cluster_labels, ground_truth=None)

Determine the performance of clusters through the Silhouette Score, and optionally against ground truth labels through Rand Index, Homogeneity and Completeness

Parameters
  • vectors (list) – List of vectors which were clustered upon

  • cluster_labels (list) – List of cluster labels corresponding to the vectors

  • ground_truth (list) – List of ground truth labels for the vectors

static label_distribution_from_docs(label)

Determine the distribution of a label

Parameters
  • label (list) – List of labels
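As a rough illustration of what a label distribution is, the sketch below reports the share of elements carrying each label value. The helper name label_distribution and the returned dictionary of proportions are assumptions; the actual return format of label_distribution_from_docs is not documented here.

```python
from collections import Counter


def label_distribution(labels):
    """Map each label value to the fraction of elements that carry it."""
    total = len(labels)
    return {value: count / total for value, count in Counter(labels).items()}


shares = label_distribution(["a", "a", "b", "c"])
```

Here half of the elements carry "a" and a quarter each carry "b" and "c".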

static label_joint_distribution_from_docs(label_1, label_2)

Determine the distribution of a label against another label

Parameters
  • label_1 (list) – List of labels

  • label_2 (list) – List of labels
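A joint distribution of two label lists can be read as: for each value of the first label, how the second label is distributed among those elements. The sketch below shows that reading; the helper name joint_distribution and the nested dictionary output are assumptions, not the documented return format of label_joint_distribution_from_docs.

```python
from collections import Counter, defaultdict


def joint_distribution(label_1, label_2):
    """For each value of label_1, the share of each label_2 value within it."""
    grouped = defaultdict(Counter)
    for first, second in zip(label_1, label_2):
        grouped[first][second] += 1
    return {
        first: {
            second: count / sum(counts.values())
            for second, count in counts.items()
        }
        for first, counts in grouped.items()
    }


clusters = [0, 0, 0, 1]
truth = ["a", "a", "b", "b"]
by_cluster = joint_distribution(clusters, truth)  # ground truth mix inside each cluster
by_truth = joint_distribution(truth, clusters)    # cluster mix inside each ground truth label
```

Swapping the two arguments switches between the cluster perspective and the ground truth perspective.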

static silhouette_score(vectors, cluster_labels)
static adjusted_rand_score(ground_truth, cluster_labels)
static completeness_score(ground_truth, cluster_labels)
static homogeneity_score(ground_truth, cluster_labels)
static _generate_layout()
static _generate_plot(df, hover_label, marker_size)