relevanceai.vector_tools.cluster
Module Contents
Classes
ClusterBase | Using verbose loguru as base logger for now
CentroidCluster | Using verbose loguru as base logger for now
DensityCluster | Using verbose loguru as base logger for now
MiniBatchKMeans | Using verbose loguru as base logger for now
KMeans | Using verbose loguru as base logger for now
HDBSCANClusterer | Using verbose loguru as base logger for now
Cluster | Batch API client
- class relevanceai.vector_tools.cluster.ClusterBase
Bases:
relevanceai.logger.LoguruLogger, doc_utils.DocUtils
Using verbose loguru as base logger for now
- __call__(self, *args, **kwargs)
- abstract fit_transform(self, vectors)
- _concat_vectors_from_list(self, list_of_vectors: list)
Concatenate 2 vectors together in a pairwise fashion
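One reading of "pairwise" is that the i-th vector of each vector field is joined into a single longer vector per document. The sketch below illustrates that reading only; `concat_vectors_pairwise` is a hypothetical stand-alone helper, not the library's implementation.

```python
from typing import List


def concat_vectors_pairwise(list_of_vectors: List[List[List[float]]]) -> List[List[float]]:
    """Join the i-th vector of every field into one longer vector (assumed behaviour)."""
    # zip(*...) walks all fields in lockstep, yielding one tuple of vectors per document
    return [
        [value for vector in vectors_for_doc for value in vector]
        for vectors_for_doc in zip(*list_of_vectors)
    ]


# Two hypothetical vector fields, each holding one vector per document
blue = [[0.1, 0.2], [0.3, 0.4]]
red = [[0.5], [0.6]]
combined = concat_vectors_pairwise([blue, red])
print(combined)  # [[0.1, 0.2, 0.5], [0.3, 0.4, 0.6]]
```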
- fit_documents(self, vector_fields: list, docs: list, alias: str = 'default', cluster_field: str = '_cluster_', return_only_clusters: bool = True, inplace: bool = True)
Train clustering algorithm on documents and then store the labels inside the documents.
- Parameters
vector_fields (list) – The vector fields of the documents
docs (list) – List of documents to run clustering on
alias (str) – What the clusters can be called
cluster_field (str) – What the cluster fields should be called
return_only_clusters (bool) – If True, return only clusters, otherwise returns the original document
inplace (bool) – If True, the documents are edited inplace otherwise, a copy is made first
kwargs (dict) – Any other keyword argument will go directly into the clustering algorithm
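The flow described by these parameters can be sketched without the library. Everything below is an illustration under stated assumptions: `fit_documents_sketch` and `sign_clusterer` are hypothetical names, only the first vector field is used, and the nested layout `doc[cluster_field][vector_field][alias]` is a guess at how labels might be stored rather than documented behaviour.

```python
import copy
from typing import Callable, List


def fit_documents_sketch(
    fit_transform: Callable[[List[List[float]]], List[int]],
    vector_fields: List[str],
    docs: List[dict],
    alias: str = "default",
    cluster_field: str = "_cluster_",
    return_only_clusters: bool = True,
    inplace: bool = True,
) -> List[dict]:
    if not inplace:
        # Work on a copy so the caller's documents stay untouched
        docs = copy.deepcopy(docs)
    # Simplification: only the first vector field is clustered
    field = vector_fields[0]
    labels = fit_transform([doc[field] for doc in docs])
    for doc, label in zip(docs, labels):
        # Assumed layout: doc["_cluster_"][<vector field>][<alias>] = label
        doc.setdefault(cluster_field, {}).setdefault(field, {})[alias] = label
    if return_only_clusters:
        return [{"_id": doc["_id"], cluster_field: doc[cluster_field]} for doc in docs]
    return docs


def sign_clusterer(vectors: List[List[float]]) -> List[int]:
    # Toy stand-in for a real fit_transform: label by the sign of the first component
    return [0 if vector[0] < 0 else 1 for vector in vectors]


docs = [
    {"_id": "a", "sample_vector_": [-1.0, 0.2]},
    {"_id": "b", "sample_vector_": [0.9, 0.1]},
]
result = fit_documents_sketch(sign_clusterer, ["sample_vector_"], docs, inplace=False)
print(result)
```

With `inplace=False` the input list is left unchanged and only the returned documents carry the cluster labels.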
- abstract to_metadata(self)
You can also store the metadata of this clustering algorithm
- property metadata(self)
- _label_cluster(self, label: Union[int, str])
- _label_clusters(self, labels)
- class relevanceai.vector_tools.cluster.CentroidCluster
Bases:
ClusterBase
Using verbose loguru as base logger for now
- __call__(self, *args, **kwargs)
- abstract fit_transform(self, vectors)
- abstract get_centers(self) → Union[numpy.ndarray, List[list]]
Get centers for the centroid-based clusters
- get_centroid_docs(self, centroid_vector_field_name='centroid_vector_') → List
Get the centroid documents to store. If there is a single vector field, returns this:
{
"_id": "document-id-1", "centroid_vector_": [0.23, 0.24, 0.23]
}
If there are multiple vector fields, returns this:
{
"_id": "document-id-1", "blue_vector_": [0.12, 0.312, 0.42], "red_vector_": [0.23, 0.41, 0.3]
}
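For the single-vector-field case, the shape of the centroid documents can be sketched as follows. `build_centroid_docs` is a hypothetical helper, and the `cluster-<index>` naming of `_id` is an assumption made for illustration; the source only shows a placeholder id.

```python
from typing import Dict, List


def build_centroid_docs(
    centers: List[List[float]],
    centroid_vector_field_name: str = "centroid_vector_",
) -> List[Dict]:
    """Wrap each cluster center in a document with an id and a vector field."""
    return [
        # The "_id" scheme here is hypothetical, chosen only so each centroid is addressable
        {"_id": f"cluster-{index}", centroid_vector_field_name: list(center)}
        for index, center in enumerate(centers)
    ]


centroid_docs = build_centroid_docs([[0.23, 0.24, 0.23], [0.5, 0.5, 0.5]])
print(centroid_docs[0])  # {'_id': 'cluster-0', 'centroid_vector_': [0.23, 0.24, 0.23]}
```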
- class relevanceai.vector_tools.cluster.DensityCluster
Bases:
ClusterBase
Using verbose loguru as base logger for now
- __call__(self, *args, **kwargs)
- abstract fit_transform(self, vectors)
- class relevanceai.vector_tools.cluster.MiniBatchKMeans(k: Union[None, int] = 10, init: str = 'k-means++', verbose: bool = False, compute_labels: bool = True, max_no_improvement: int = 2)
Bases:
CentroidCluster
Using verbose loguru as base logger for now
- _init_model(self)
- fit_transform(self, vectors: Union[numpy.ndarray, List])
Fit and transform the vectors
- get_centers(self)
Returns centroids of clusters
- to_metadata(self)
Editing the metadata of the function
- class relevanceai.vector_tools.cluster.KMeans(k=10, init='k-means++', n_init=10, max_iter=300, tol=0.0001, verbose=0, random_state=None, copy_x=True, algorithm='auto')
Bases:
MiniBatchKMeans
Using verbose loguru as base logger for now
- _init_model(self)
- to_metadata(self)
Editing the metadata of the function
- class relevanceai.vector_tools.cluster.HDBSCANClusterer(algorithm: str = 'best', alpha: float = 1.0, approx_min_span_tree: bool = True, gen_min_span_tree: bool = False, leaf_size: int = 40, memory=Memory(cachedir=None), metric: str = 'euclidean', min_samples: int = None, p: float = None, min_cluster_size: Union[None, int] = 10)
Bases:
DensityCluster
Using verbose loguru as base logger for now
- fit_transform(self, vectors: numpy.ndarray) → numpy.ndarray
- class relevanceai.vector_tools.cluster.Cluster(project, api_key)
Bases:
relevanceai.vector_tools.cluster_evaluate.ClusterEvaluate, relevanceai.api.client.BatchAPIClient, ClusterBase
Batch API client
- static _choose_k(vectors: numpy.ndarray)
Choose k clusters
- static cluster(vectors: numpy.ndarray, cluster: Union[relevanceai.vector_tools.constants.CLUSTER, ClusterBase], cluster_args: Dict = {}, k: Union[None, int] = None) → numpy.ndarray
Cluster vectors
- kmeans_cluster(self, dataset_id: str, vector_fields: list, filters: List = [], k: Union[None, int] = 10, init: str = 'k-means++', n_init: int = 10, max_iter: int = 300, tol: float = 0.0001, verbose: bool = True, random_state: Optional[int] = None, copy_x: bool = True, algorithm: str = 'auto', alias: str = None, cluster_field: str = '_cluster_', update_documents_chunksize: int = 50, overwrite: bool = False, page_size: int = 1)
This function performs all the steps required for KMeans clustering:
1. Loads the data
2. Clusters the data
3. Updates the data with clustering info
4. Adds the centroids to the hidden centroid collection
- Parameters
dataset_id (string) – name of the dataset
vector_fields (list) – a list containing the vector field to be used for clustering
filters (list) – a list to filter documents of the dataset
k (int) – K in Kmeans
init (string) – “k-means++” -> Kmeans algorithm parameter
n_init (int) – number of reinitializations for the kmeans algorithm
max_iter (int) – max iterations in the kmeans algorithm
tol (float) – tol in the kmeans algorithm
verbose (bool) – True by default
random_state (int, optional) – None by default -> Kmeans algorithm parameter
copy_x (bool) – True by default
algorithm (string) – “auto” by default
alias (string) – “kmeans”, string to be used in naming of the field showing the clustering results
cluster_field (string) – “_cluster_”, string to name the main cluster field
overwrite (bool) – False by default, set to True to overwrite an existing clustering result
Example
>>> client.vector_tools.cluster.kmeans_cluster(
...     dataset_id="sample_dataset",
...     vector_fields=vector_fields
... )
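The four steps listed for `kmeans_cluster` can be walked through with in-memory stand-ins. This is a rough sketch, not the library's code: the dataset is a plain list of dicts instead of an API call, `tiny_kmeans` is a hypothetical pure-Python Lloyd iteration seeded with the first `k` vectors (not `k-means++`), and the label layout and centroid `_id` scheme are assumptions.

```python
from typing import Dict, List, Tuple


def squared_distance(a: List[float], b: List[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def tiny_kmeans(vectors: List[List[float]], k: int, max_iter: int = 300) -> Tuple[List[int], List[List[float]]]:
    # Simplification: seed with the first k vectors instead of k-means++
    centers = [list(vector) for vector in vectors[:k]]
    labels: List[int] = []
    for _ in range(max_iter):
        # Assign every vector to its nearest center
        new_labels = [
            min(range(k), key=lambda c: squared_distance(vector, centers[c]))
            for vector in vectors
        ]
        # Move each center to the mean of its members
        for c in range(k):
            members = [v for v, label in zip(vectors, new_labels) if label == c]
            if members:  # keep the old center if a cluster ends up empty
                centers[c] = [sum(column) / len(members) for column in zip(*members)]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
    return labels, centers


def kmeans_cluster_sketch(
    dataset: List[Dict],
    vector_fields: List[str],
    k: int = 10,
    alias: str = "kmeans",
    cluster_field: str = "_cluster_",
) -> List[Dict]:
    field = vector_fields[0]
    # 1. Load the data (here: read vectors from an in-memory list)
    vectors = [doc[field] for doc in dataset]
    # 2. Cluster the data
    labels, centers = tiny_kmeans(vectors, k)
    # 3. Update the data with clustering info (assumed nested layout)
    for doc, label in zip(dataset, labels):
        doc.setdefault(cluster_field, {}).setdefault(field, {})[alias] = label
    # 4. Build the centroid documents that would go to the centroid collection
    return [{"_id": f"cluster-{i}", field: center} for i, center in enumerate(centers)]


dataset = [
    {"_id": "a", "sample_vector_": [0.0, 0.0]},
    {"_id": "b", "sample_vector_": [10.0, 10.0]},
    {"_id": "c", "sample_vector_": [0.0, 1.0]},
    {"_id": "d", "sample_vector_": [10.0, 11.0]},
]
centroids = kmeans_cluster_sketch(dataset, ["sample_vector_"], k=2)
print(centroids)
```

In the real method, steps 1, 3 and 4 go through the API client in chunks (see `update_documents_chunksize`); the sketch keeps everything local so the control flow is easy to follow.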
- hdbscan_cluster(self, dataset_id: str, vector_fields: list, filters: List = [], algorithm: str = 'best', alpha: float = 1.0, approx_min_span_tree: bool = True, gen_min_span_tree: bool = False, leaf_size: int = 40, memory=Memory(cachedir=None), metric: str = 'euclidean', min_samples=None, p=None, min_cluster_size: Union[None, int] = 10, alias: str = 'hdbscan', cluster_field: str = '_cluster_', update_documents_chunksize: int = 50, overwrite: bool = False)
This function performs all the steps required for HDBSCAN clustering:
1. Loads the data
2. Clusters the data
3. Updates the data with clustering info
4. Adds the centroid to the hidden centroid collection
- Parameters
dataset_id (string) – name of the dataset
vector_fields (list) – a list containing the vector field to be used for clustering
filters (list) – a list to filter documents of the dataset
algorithm (str) – hdbscan configuration parameter default to “best”
alpha (float) – hdbscan configuration parameter default to 1.0
approx_min_span_tree (bool) – hdbscan configuration parameter default to True
gen_min_span_tree (bool) – hdbscan configuration parameter default to False
leaf_size (int) – hdbscan configuration parameter default to 40
memory (Memory) – hdbscan configuration parameter on memory management, Memory(cachedir=None) by default
metric (str) – hdbscan configuration parameter default to “euclidean”
min_samples – hdbscan configuration parameter default to None
p – hdbscan configuration parameter default to None
min_cluster_size – minimum cluster size, 10 by default
alias (string) – “hdbscan”, string to be used in naming of the field showing the clustering results
cluster_field (string) – “_cluster_”, string to name the main cluster field
overwrite (bool) – False by default, set to True to overwrite an existing clustering result
Example
>>> client.vector_tools.cluster.hdbscan_cluster(
...     dataset_id="sample_dataset",
...     vector_fields=["sample_1_vector_"]  # Only 1 vector field is supported for now
... )