Core Endpoints

All admin-related tasks.

class relevanceai.api.endpoints.admin.AdminClient(project, api_key)

Bases: relevanceai.base._Base

api_key: str
config: relevanceai.config.Config
copy_foreign_dataset(dataset_id, source_dataset_id, source_project, source_api_key, project=None, api_key=None)

Copy a dataset from another user’s projects into your project.

Example

>>> client = Client()
>>> client.admin.copy_foreign_dataset(
    dataset_id="research",
    source_dataset_id="...",
    source_project="...",
    source_api_key="..."
)
Parameters
  • dataset_id (str) – The dataset to copy

  • source_dataset_id (str) – The original dataset

  • source_project (str) – The original project to copy from

  • source_api_key (str) – The original API key of the project

  • project (Optional[str]) – The original project

  • api_key (Optional[str]) – The original API key

project: str
request_read_api_key(read_username)

Creates a read-only API key for your project. Make sure to save the API key somewhere safe. When doing a search, the admin username should still be used.

Parameters

read_username (str) – Read-only project

send_dataset(dataset_id, receiver_project, receiver_api_key)

Send an individual a dataset.

Example

>>> client = Client()
>>> client.admin.send_dataset(
    dataset_id="research",
    receiver_project="...",
    receiver_api_key="..."
)
Parameters
  • dataset_id (str) – The name of the dataset

  • receiver_project (str) – The project name that will receive the dataset

  • receiver_api_key (str) – The project API key that will receive the dataset

class relevanceai.api.endpoints.aggregate.AggregateClient(project, api_key)

Bases: relevanceai.base._Base

Aggregate service

aggregate(dataset_id, metrics=[], groupby=[], filters=[], page_size=20, page=1, asc=False, flatten=True, alias='default')

Aggregation/Groupby of a collection using an aggregation query. The aggregation query is a json body that follows the schema of:

>>> {
>>>        "groupby" : [
>>>            {"name": <alias>, "field": <field in the collection>, "agg": "category"},
>>>            {"name": <alias>, "field": <another groupby field in the collection>, "agg": "numeric"}
>>>        ],
>>>        "metrics" : [
>>>            {"name": <alias>, "field": <numeric field in the collection>, "agg": "avg"},
>>>            {"name": <alias>, "field": <another numeric field in the collection>, "agg": "max"}
>>>        ]
>>>    }

For example, one can use the following aggregation to group scores by region and player name:

>>>    {
>>>        "groupby" : [
>>>            {"name": "region", "field": "player_region", "agg": "category"},
>>>            {"name": "player_name", "field": "name", "agg": "category"}
>>>        ],
>>>        "metrics" : [
>>>            {"name": "average_score", "field": "final_score", "agg": "avg"},
>>>            {"name": "max_score", "field": "final_score", "agg": "max"},
>>>            {'name':'total_score','field':"final_score", 'agg':'sum'},
>>>            {'name':'average_deaths','field':"final_deaths", 'agg':'avg'},
>>>            {'name':'highest_deaths','field':"final_deaths", 'agg':'max'},
>>>        ]
>>>    }

“groupby” lists the fields you want to split the data into. These are the available groupby types:

  • category: group by a categorical field

  • numeric: group by a numeric field

“metrics” lists the fields and metrics you want to calculate within each group. Every aggregation also includes a frequency metric. These are the available metric types:

  • “avg”, “max”, “min”, “sum”, “cardinality”
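The query structure described above can be assembled with two small helpers. This is a minimal sketch, not part of the library: `groupby_clause` and `metric_clause` are hypothetical names, and the validation covers only the groupby and metric types listed above (the "array" aggregation is not handled).

```python
from typing import Any, Dict

# Types taken from the lists above; "array" aggregations are not covered here.
GROUPBY_TYPES = {"category", "numeric"}
METRIC_TYPES = {"avg", "max", "min", "sum", "cardinality"}


def groupby_clause(name: str, field: str, agg: str = "category") -> Dict[str, Any]:
    """Build one groupby entry (hypothetical helper, not a library function)."""
    if agg not in GROUPBY_TYPES:
        raise ValueError(f"unsupported groupby type: {agg!r}")
    return {"name": name, "field": field, "agg": agg}


def metric_clause(name: str, field: str, agg: str) -> Dict[str, Any]:
    """Build one metric entry (hypothetical helper, not a library function)."""
    if agg not in METRIC_TYPES:
        raise ValueError(f"unsupported metric type: {agg!r}")
    return {"name": name, "field": field, "agg": agg}


# Same field names as the region/player example above.
groupby = [
    groupby_clause("region", "player_region"),
    groupby_clause("player_name", "name"),
]
metrics = [
    metric_clause("average_score", "final_score", "avg"),
    metric_clause("max_score", "final_score", "max"),
]
```

The resulting lists have the shape expected by the groupby and metrics arguments of aggregate.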

The response returned has the following in descending order.

If you want to return documents, specify a “group_size” parameter, and optionally a “select_fields” parameter to limit the fields returned. For example:

>>>    {
>>>    'groupby':[
>>>        {'name':'Manufacturer','field':'manufacturer','agg':'category',
>>>        'group_size': 10, 'select_fields': ["name"]},
>>>    ],
>>>    'metrics':[
>>>        {'name':'Price Average','field':'price','agg':'avg'},
>>>    ],
>>>    }
>>>
>>>    {"title": {"title": "books", "frequency": 200, "documents": [{...}, {...}]}, {"title": "books", "frequency": 100, "documents": [{...}, {...}]}}

For array-aggregations, you can add “agg”: “array” into the aggregation query.

Parameters
  • dataset_id (string) – Unique name of dataset

  • metrics (list) – Fields and metrics you want to calculate

  • groupby (list) – Fields you want to split the data into

  • filters (list) – Query for filtering the search results

  • page_size (int) – Size of each page of results

  • page (int) – Page of the results

  • asc (bool) – Whether to sort results by ascending or descending order

  • flatten (bool) – Whether to flatten

  • alias (string) – Alias used to name a vector field. Belongs in field_{alias}_vector_

api_key: str
config: relevanceai.config.Config
project: str
class relevanceai.api.endpoints.centroids.CentroidsClient(project, api_key)

Bases: relevanceai.base._Base

api_key: str
config: relevanceai.config.Config
delete(dataset_id, vector_fields, alias='default')

Delete centroids by dataset ID, vector field and alias

Parameters
  • dataset_id (string) – Unique name of dataset

  • vector_fields (list) – The vector field where a clustering task was run.

  • alias (string) – Alias is used to name a cluster

docs_closest_to_center(dataset_id, vector_fields, cluster_ids=[], alias='default', centroid_vector_fields=['centroid_vector_'], select_fields=[], approx=0, sum_fields=True, page_size=1, page=1, similarity_metric='cosine', filters=[], facets=[], min_score=0, include_vector=False, include_count=True, include_facets=False)

List of documents closest to the centre.

Parameters
  • dataset_id (string) – Unique name of dataset

  • vector_fields (list) – The vector field where a clustering task was run.

  • cluster_ids (list) – Any of the cluster ids

  • alias (string) – Alias is used to name a cluster

  • centroid_vector_fields (list) – Vector fields stored

  • select_fields (list) – Fields to include in the search results, empty array/list means all fields

  • approx (int) – Used for approximate search to speed up search. The higher the number, the faster the search but potentially the less accurate

  • sum_fields (bool) – Whether to sum the similarity scores of multiple vectors into one score or keep them separate

  • page_size (int) – Size of each page of results

  • page (int) – Page of the results

  • similarity_metric (string) – Similarity Metric, choose from [‘cosine’, ‘l1’, ‘l2’, ‘dp’]

  • filters (list) – Query for filtering the search results

  • facets (list) – Fields to include in the facets, if [] then all

  • min_score (int) – Minimum score for similarity metric

  • include_vector (bool) – Include vectors in the search results

  • include_count (bool) – Include the total count of results in the search results

  • include_facets (bool) – Include facets in the search results
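The similarity_metric options are listed without definitions. The sketch below assumes the usual meanings (cosine similarity, L1 and L2 distance, and 'dp' as dot product) and shows, on local data only, what "closest to the centre" means under cosine similarity. The function names and the `example_vector_` field are hypothetical, and the service's own ranking may differ.

```python
import math
from typing import Any, Dict, List, Sequence

Vector = Sequence[float]


def dp(a: Vector, b: Vector) -> float:
    """Dot product (assumed meaning of 'dp')."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; raises ZeroDivisionError for zero vectors."""
    return dp(a, b) / (math.sqrt(dp(a, a)) * math.sqrt(dp(b, b)))


def l1(a: Vector, b: Vector) -> float:
    """Manhattan distance (assumed meaning of 'l1')."""
    return sum(abs(x - y) for x, y in zip(a, b))


def l2(a: Vector, b: Vector) -> float:
    """Euclidean distance (assumed meaning of 'l2')."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def closest_to_center(
    docs: List[Dict[str, Any]],
    centroid: Vector,
    vector_field: str,
    page_size: int = 1,
) -> List[Dict[str, Any]]:
    """Return the page_size documents most cosine-similar to the centroid."""
    ranked = sorted(
        docs,
        key=lambda doc: cosine(doc[vector_field], centroid),
        reverse=True,  # higher cosine similarity means closer
    )
    return ranked[:page_size]


docs = [
    {"_id": "a", "example_vector_": [1.0, 0.0]},
    {"_id": "b", "example_vector_": [0.0, 1.0]},
    {"_id": "c", "example_vector_": [1.0, 1.0]},
]
nearest = closest_to_center(docs, [1.0, 0.1], "example_vector_")
```

Sorting in the opposite direction gives the documents furthest from the centre.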

docs_furthest_from_center(dataset_id, vector_fields, cluster_ids=[], alias='default', select_fields=[], approx=0, sum_fields=True, page_size=1, page=1, similarity_metric='cosine', filters=[], facets=[], min_score=0, include_vector=False, include_count=True, include_facets=False)

List of documents furthest from the centre.

Parameters
  • dataset_id (string) – Unique name of dataset

  • vector_fields (list) – The vector field where a clustering task was run.

  • cluster_ids (list) – Any of the cluster ids

  • alias (string) – Alias is used to name a cluster

  • select_fields (list) – Fields to include in the search results, empty array/list means all fields

  • approx (int) – Used for approximate search to speed up search. The higher the number, the faster the search but potentially the less accurate

  • sum_fields (bool) – Whether to sum the similarity scores of multiple vectors into one score or keep them separate

  • page_size (int) – Size of each page of results

  • page (int) – Page of the results

  • similarity_metric (string) – Similarity Metric, choose from [‘cosine’, ‘l1’, ‘l2’, ‘dp’]

  • filters (list) – Query for filtering the search results

  • facets (list) – Fields to include in the facets, if [] then all

  • min_score (int) – Minimum score for similarity metric

  • include_vector (bool) – Include vectors in the search results

  • include_count (bool) – Include the total count of results in the search results

  • include_facets (bool) – Include facets in the search results

documents(dataset_id, cluster_ids, vector_fields, alias='default', page_size=5, cursor=None, page=1, include_vector=False, similarity_metric='cosine')

Retrieve the cluster centroids by IDs

Parameters
  • dataset_id (string) – Unique name of dataset

  • cluster_ids (list) – List of cluster IDs

  • vector_fields (list) – The vector field where a clustering task was run.

  • alias (string) – Alias is used to name a cluster

  • page_size (int) – Size of each page of results

  • cursor (string) – Cursor to paginate the document retrieval

  • page (int) – Page of the results

  • include_vector (bool) – Include vectors in the search results

  • similarity_metric (string) – Similarity Metric, choose from [‘cosine’, ‘l1’, ‘l2’, ‘dp’]

get(dataset_id, cluster_ids, vector_fields, alias='default', page_size=5, cursor=None)

Retrieve the cluster centroids by IDs

Parameters
  • dataset_id (string) – Unique name of dataset

  • cluster_ids (list) – List of cluster IDs

  • vector_fields (list) – The vector field where a clustering task was run.

  • alias (string) – Alias is used to name a cluster

  • page_size (int) – Size of each page of results

  • cursor (string) – Cursor to paginate the document retrieval

insert(dataset_id, cluster_centers, vector_fields, alias='default')

Insert your own cluster centroids to be used in approximate search settings and cluster aggregations.

Parameters
  • dataset_id (string) – Unique name of dataset

  • cluster_centers (list) – Cluster centers with the key being the index number

  • vector_fields (list) – The vector field where a clustering task was run.

  • alias (string) – Alias is used to name a cluster

list(dataset_id, vector_fields, alias='default', page_size=5, cursor=None, include_vector=False, base_url='https://gateway-api-aueast.relevance.ai/latest')

Retrieve the cluster centroid

Parameters
  • dataset_id (string) – Unique name of dataset

  • vector_fields (list) – The vector field where a clustering task was run.

  • alias (string) – Alias is used to name a cluster

  • page_size (int) – Size of each page of results

  • cursor (string) – Cursor to paginate the document retrieval

  • include_vector (bool) – Include vectors in the search results

list_closest_to_center(dataset_id, vector_fields, cluster_ids=[], alias='default', centroid_vector_fields=['centroid_vector_'], select_fields=[], approx=0, sum_fields=True, page_size=1, page=1, similarity_metric='cosine', filters=[], facets=[], min_score=0, include_vector=False, include_count=True, include_facets=False)

List of documents closest to the centre.

Parameters
  • dataset_id (string) – Unique name of dataset

  • vector_fields (list) – The vector field where a clustering task was run.

  • cluster_ids (list) – Any of the cluster ids

  • alias (string) – Alias is used to name a cluster

  • centroid_vector_fields (list) – Vector fields stored

  • select_fields (list) – Fields to include in the search results, empty array/list means all fields

  • approx (int) – Used for approximate search to speed up search. The higher the number, the faster the search but potentially the less accurate

  • sum_fields (bool) – Whether to sum the similarity scores of multiple vectors into one score or keep them separate

  • page_size (int) – Size of each page of results

  • page (int) – Page of the results

  • similarity_metric (string) – Similarity Metric, choose from [‘cosine’, ‘l1’, ‘l2’, ‘dp’]

  • filters (list) – Query for filtering the search results

  • facets (list) – Fields to include in the facets, if [] then all

  • min_score (int) – Minimum score for similarity metric

  • include_vector (bool) – Include vectors in the search results

  • include_count (bool) – Include the total count of results in the search results

  • include_facets (bool) – Include facets in the search results

list_furthest_from_center(dataset_id, vector_fields, cluster_ids=[], alias='default', select_fields=[], approx=0, sum_fields=True, page_size=1, page=1, similarity_metric='cosine', filters=[], facets=[], min_score=0, include_vector=False, include_count=True, include_facets=False)

List of documents furthest from the centre.

Parameters
  • dataset_id (string) – Unique name of dataset

  • vector_fields (list) – The vector field where a clustering task was run.

  • cluster_ids (list) – Any of the cluster ids

  • alias (string) – Alias is used to name a cluster

  • select_fields (list) – Fields to include in the search results, empty array/list means all fields

  • approx (int) – Used for approximate search to speed up search. The higher the number, the faster the search but potentially the less accurate

  • sum_fields (bool) – Whether to sum the similarity scores of multiple vectors into one score or keep them separate

  • page_size (int) – Size of each page of results

  • page (int) – Page of the results

  • similarity_metric (string) – Similarity Metric, choose from [‘cosine’, ‘l1’, ‘l2’, ‘dp’]

  • filters (list) – Query for filtering the search results

  • facets (list) – Fields to include in the facets, if [] then all

  • min_score (int) – Minimum score for similarity metric

  • include_vector (bool) – Include vectors in the search results

  • include_count (bool) – Include the total count of results in the search results

  • include_facets (bool) – Include facets in the search results

metadata(dataset_id, vector_fields, alias='default', metadata=None)

If metadata is None, retrieves metadata about a dataset, notably the description, data source, etc. Otherwise, you can store the metadata about your cluster here.

Parameters
  • dataset_id (string) – Unique name of dataset

  • vector_fields (list) – The vector field where a clustering task was run.

  • alias (string) – Alias is used to name a cluster

  • metadata (Optional[dict]) – If None, it will retrieve the metadata, otherwise it will overwrite the metadata of the cluster

project: str
update(dataset_id, vector_fields, id, update={}, alias='default')

Update a centroid by dataset ID, vector field, alias and centroid ID

Parameters
  • dataset_id (string) – Unique name of dataset

  • vector_fields (list) – The vector field where a clustering task was run.

  • alias (string) – Alias is used to name a cluster

  • id (string) – The centroid ID

  • update (dict) – The update to be applied to the document

API Client

class relevanceai.api.endpoints.client.APIClient(project, api_key)

Bases: relevanceai.base._Base

API Client

api_key: str
config: relevanceai.config.Config
project: str
relevanceai.api.endpoints.client.str2bool(v)
class relevanceai.api.endpoints.cluster.ClusterClient(project, api_key)

Bases: relevanceai.base._Base

aggregate(dataset_id, vector_fields, metrics=[], groupby=[], filters=[], page_size=20, page=1, asc=False, flatten=True, alias='default')

Takes an aggregation query and gets the aggregate of each cluster in a collection. This helps you interpret each cluster and what is in them. It can only be used after a vector field has been clustered.

For more information about aggregations check out services.aggregate.aggregate.

Parameters
  • dataset_id (string) – Unique name of dataset

  • vector_fields (list) – The vector field that was clustered on

  • metrics (list) – Fields and metrics you want to calculate

  • groupby (list) – Fields you want to split the data into

  • filters (list) – Query for filtering the search results

  • page_size (int) – Size of each page of results

  • page (int) – Page of the results

  • asc (bool) – Whether to sort results by ascending or descending order

  • flatten (bool) – Whether to flatten

  • alias (string) – Alias used to name a vector field. Belongs in field_{alias}_vector_

api_key: str
config: relevanceai.config.Config
facets(dataset_id, facets_fields=[], page_size=20, page=1, asc=False, date_interval='monthly')

Takes a high level aggregation of every field and every cluster in a collection. This helps you interpret each cluster and what is in them.

It can only be used after a vector field has been clustered.

Parameters
  • dataset_id (string) – Unique name of dataset

  • facets_fields (list) – Fields to include in the facets, if [] then all

  • page_size (int) – Size of each page of results

  • page (int) – Page of the results

  • asc (bool) – Whether to sort results by ascending or descending order

  • date_interval (string) – Interval for date facets

project: str

All Dataset related functions

class relevanceai.api.endpoints.datasets.DatasetsClient(project, api_key)

Bases: relevanceai.base._Base

All dataset-related functions

api_key: str
bulk_insert(dataset_id, documents, insert_date=True, overwrite=True, update_schema=True, field_transformers=[], return_documents=False)

Documentation can be found here: https://ingest-api-dev-aueast.relevance.ai/latest/documentation#operation/InsertEncode

  • When inserting the document you can optionally specify your own id for a document by using the field name “_id”, if not specified a random id is assigned.

  • When inserting or specifying vectors in a document use the suffix (ends with) “_vector_” for the field name. e.g. “product_description_vector_”.

  • When inserting or specifying chunks in a document use the suffix (ends with) “_chunk_” for the field name. e.g. “products_chunk_”.

  • When inserting or specifying chunk vectors in a document’s chunks use the suffix (ends with) “_chunkvector_” for the field name. e.g. “products_chunk_.product_description_chunkvector_”.

  • Try to keep each batch of documents to insert under 200 MB to avoid the insert timing out.
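One way to follow the batch-size guideline above is to split the documents before inserting them. The sketch below is an assumption-laden illustration: `batches_under` is a hypothetical helper, and it uses the JSON-encoded size of each document as an approximation of what is sent over the wire.

```python
import json
from typing import Any, Dict, Iterator, List

MAX_BATCH_BYTES = 200 * 1024 * 1024  # the 200 MB guideline from the notes above


def batches_under(
    documents: List[Dict[str, Any]],
    max_bytes: int = MAX_BATCH_BYTES,
) -> Iterator[List[Dict[str, Any]]]:
    """Yield consecutive batches whose JSON-encoded size does not exceed max_bytes.

    A single document larger than max_bytes is still yielded, alone in its batch.
    """
    batch: List[Dict[str, Any]] = []
    size = 0
    for doc in documents:
        doc_bytes = len(json.dumps(doc).encode("utf-8"))
        # Close the current batch before it would grow past the limit.
        if batch and size + doc_bytes > max_bytes:
            yield batch
            batch, size = [], 0
        batch.append(doc)
        size += doc_bytes
    if batch:
        yield batch


# Tiny limit so the splitting is visible on a handful of small documents.
sample = [{"_id": str(i), "text": "x" * 10} for i in range(5)]
one_doc = len(json.dumps(sample[0]).encode("utf-8"))
sizes = [len(batch) for batch in batches_under(sample, max_bytes=2 * one_doc)]
```

Each yielded batch can then be passed as the documents argument of bulk_insert.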

Parameters
  • dataset_id (string) – Unique name of dataset

  • documents (list) – A list of documents. A document is JSON-like data that we store our metadata and vectors with. To specify the id of a document use the field ‘_id’; to specify a vector field use the suffix ‘_vector_’

  • insert_date (bool) – Whether to include insert date as a field ‘insert_date_’.

  • overwrite (bool) – Whether to overwrite document if it exists.

  • update_schema (bool) – Whether the api should check the documents for vector datatype to update the schema.

  • include_inserted_ids (bool) – Include the inserted IDs in the response

  • field_transformers (list) –

    An example field_transformers object:

    >>> {
    >>>    "field": "string",
    >>>    "output_field": "string",
    >>>    "remove_html": true,
    >>>    "split_sentences": true
    >>> }
    

check_missing_ids(dataset_id, ids)

Look up in bulk whether the IDs exist in the dataset, and return all the missing ones as a list.

Parameters
  • dataset_id (string) – Unique name of dataset

  • ids (list) – IDs of documents

clone(old_dataset, new_dataset, schema={}, rename_fields={}, remove_fields=[], filters=[])

Clone a dataset into a new dataset. You can use this to rename fields and change data schemas. This is considered a project job.

Parameters
  • old_dataset (string) – Unique name of old dataset to copy from

  • new_dataset (string) – Unique name of new dataset to copy to

  • schema (dict) – Schema for specifying the fields that are vectors and their lengths

  • rename_fields (dict) – Fields to rename {‘old_field’: ‘new_field’}. Defaults to no renames

  • remove_fields (list) – Fields to remove [‘random_field’, ‘another_random_field’]. Defaults to no removes

  • filters (list) – Query for filtering the search results

config: relevanceai.config.Config
create(dataset_id, schema={})

A dataset can store documents to be searched, retrieved, filtered and aggregated (similar to Collections in MongoDB, Tables in SQL, Indexes in ElasticSearch). A powerful and core feature of VecDB is that you can store both your metadata and vectors in the same document. When specifying the schema of a dataset and inserting your own vector use the suffix (ends with) “_vector_” for the field name, and specify the length of the vector in dataset_schema.

For example:

>>>    {
>>>        "product_image_vector_": 1024,
>>>        "product_text_description_vector_" : 128
>>>    }

These are the field types supported in our datasets: [“text”, “numeric”, “date”, “dict”, “chunks”, “vector”, “chunkvector”].

For example:

>>>    {
>>>        "product_text_description" : "text",
>>>        "price" : "numeric",
>>>        "created_date" : "date",
>>>        "product_texts_chunk_": "chunks",
>>>        "product_text_chunkvector_" : 1024
>>>    }

You don’t have to specify the schema of every single field when creating a dataset, as VecDB will automatically detect the appropriate data type for each field (vectors will be automatically identified by their “_vector_” suffix). In fact, you don’t always have to use this endpoint to create a dataset, as /datasets/bulk_insert will infer and create the dataset and schema as you insert new documents.

Note

  • A dataset name/id can only contain lowercase letters, dashes, underscores and numbers.

  • “_id” is reserved as the key and id of a document.

  • Once a schema is set for a dataset it cannot be altered. If it has to be altered, utilise the copy dataset endpoint.

For more information about vectors check out the ‘Vectorizing’ section, services.search.vector or our blog at https://relevance.ai/blog. For more information about chunks and chunk vectors check out services.search.chunk.
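The naming rule and the vector suffix convention described for create can be checked locally before calling the API. The following is a minimal sketch: `is_valid_dataset_id` and `check_vector_schema` are hypothetical helpers, and treating every integer schema value as a vector length is an assumption based on the examples above.

```python
import re
from typing import Dict, Union

# Lowercase letters, numbers, dash and underscore, per the note above.
DATASET_ID_PATTERN = re.compile(r"[a-z0-9_-]+")
VECTOR_SUFFIXES = ("_vector_", "_chunkvector_")

Schema = Dict[str, Union[str, int]]


def is_valid_dataset_id(dataset_id: str) -> bool:
    """Return True if the whole ID uses only the allowed characters."""
    return DATASET_ID_PATTERN.fullmatch(dataset_id) is not None


def check_vector_schema(schema: Schema) -> Schema:
    """Raise if a field with an integer length lacks a vector suffix.

    Assumption: an integer value is a vector length, as in the examples above.
    """
    for field, value in schema.items():
        if isinstance(value, int) and not field.endswith(VECTOR_SUFFIXES):
            raise ValueError(f"vector field {field!r} must end with one of {VECTOR_SUFFIXES}")
    return schema


schema = check_vector_schema(
    {
        "product_text_description": "text",
        "price": "numeric",
        "product_image_vector_": 1024,
        "product_text_chunkvector_": 1024,
    }
)
```

A schema that passes the check has the shape shown in the examples for create.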

Parameters
  • dataset_id (string) – Unique name of dataset

  • schema (dict) – Schema for specifying the fields that are vectors and their lengths

delete(dataset_id, confirm=False)

Delete a dataset

Parameters

dataset_id (string) – Unique name of dataset

facets(dataset_id, fields=[], date_interval='monthly', page_size=5, page=1, asc=False)

Takes a high level aggregation of every field, return their unique values and frequencies. This is used to help create the filter bar for search.

Parameters
  • dataset_id (string) – Unique name of dataset

  • fields (list) – Fields to include in the facets, if [] then all

  • date_interval (str) – Interval for date facets

  • page_size (int) – Size of facet page

  • page (int) – Page of the results

  • asc (bool) – Whether to sort results by ascending or descending order

insert(dataset_id, document, insert_date=True, overwrite=True, update_schema=True)

Insert a single document

  • When inserting the document you can optionally specify your own id for a document by using the field name “_id”, if not specified a random id is assigned.

  • When inserting or specifying vectors in a document use the suffix (ends with) “_vector_” for the field name. e.g. “product_description_vector_”.

  • When inserting or specifying chunks in a document use the suffix (ends with) “_chunk_” for the field name. e.g. “products_chunk_”.

  • When inserting or specifying chunk vectors in a document’s chunks use the suffix (ends with) “_chunkvector_” for the field name. e.g. “products_chunk_.product_description_chunkvector_”.

Documentation can be found here: https://ingest-api-dev-aueast.relevance.ai/latest/documentation#operation/InsertEncode

Try to keep each batch of documents to insert under 200 MB to avoid the insert timing out.

Parameters
  • dataset_id (string) – Unique name of dataset

  • document (dict) – A document. A document is JSON-like data that we store our metadata and vectors with. To specify the id of the document use the field ‘_id’; to specify a vector field use the suffix ‘_vector_’

  • insert_date (bool) – Whether to include insert date as a field ‘insert_date_’.

  • overwrite (bool) – Whether to overwrite document if it exists.

  • update_schema (bool) – Whether the api should check the documents for vector datatype to update the schema.

list()

List all datasets in a project that you are authorized to read/write.

list_all(include_schema=True, include_stats=True, include_metadata=True, include_schema_stats=False, include_vector_health=False, include_active_jobs=False, dataset_ids=[], sort_by_created_at_date=False, asc=False, page_size=20, page=1)

Returns a page of datasets and in detail the dataset’s associated information that you are authorized to read/write. The information includes:

  • Schema - Data schema of a dataset (same as dataset.schema).

  • Metadata - Metadata of a dataset (same as dataset.metadata).

  • Stats - Statistics of number of documents and size of a dataset (same as dataset.stats).

  • Vector_health - Number of zero vectors stored (same as dataset.health).

  • Schema_stats - Fields and number of documents missing/not missing for that field (same as dataset.stats).

  • Active_jobs - All active jobs/tasks on the dataset.

Parameters
  • include_schema (bool) – Whether to return schema

  • include_stats (bool) – Whether to return stats

  • include_metadata (bool) – Whether to return metadata

  • include_vector_health (bool) – Whether to return vector_health

  • include_schema_stats (bool) – Whether to return schema_stats

  • include_active_jobs (bool) – Whether to return active_jobs

  • dataset_ids (list) – List of dataset IDs

  • sort_by_created_at_date (bool) – Sort by created at date. By default shows the newest datasets. Set asc=True to get the oldest datasets.

  • asc (bool) – Whether to sort results by ascending or descending order

  • page_size (int) – Size of each page of results

  • page (int) – Page of the results

metadata(dataset_id)

Retrieves metadata about a dataset, notably the description, data source, etc.

Parameters

dataset_id (string) – Unique name of dataset

project: str
schema(dataset_id)

Returns the schema of a dataset. Refer to datasets.create for different field types available in a VecDB schema.

Parameters

dataset_id (string) – Unique name of dataset

search(query, sort_by_created_at_date=False, asc=False)

Search datasets by their names with a traditional keyword search.

Parameters
  • query (string) – Any string that is part of a dataset name.

  • sort_by_created_at_date (bool) – Sort by created at date. By default shows the newest datasets. Set asc=True to get the oldest datasets.

  • asc (bool) – Whether to sort results by ascending or descending order

task_status(dataset_id, task_id)

Check the status of an existing encoding task on the given dataset.

The required task_id was returned in the original encoding request such as datasets.vectorize.

Parameters
  • dataset_id (string) – Unique name of dataset

  • task_id (string) – The task ID of the earlier queued vectorize task

vectorize(dataset_id, model_id, fields=[], filters=[], refresh=False, alias='default', chunksize=20, chunk_field=None)

Queue the encoding of a dataset using the method given by model_id.

Parameters
  • dataset_id (string) – Unique name of dataset

  • model_id (string) – Model ID to use for vectorizing (encoding).

  • fields (list) – Fields to vectorize (encode)

  • filters (list) – Filters to run against

  • refresh (bool) – If True, re-runs encoding on whole dataset.

  • alias (string) – Alias used to name a vector field. Belongs in field_{alias}_vector_

  • chunksize (int) – Batch for each encoding. Change at your own risk.

  • chunk_field (string) – The chunk field. If the chunk field is specified, the field to be encoded should not include the chunk field.

class relevanceai.api.endpoints.documents.DocumentsClient(project, api_key)

Bases: relevanceai.base._Base

api_key: str
bulk_delete(dataset_id, ids=[])

Delete a list of documents by their IDs.

Parameters
  • dataset_id (string) – Unique name of dataset

  • ids (list) – IDs of documents to delete

bulk_get(dataset_id, ids, include_vector=True, select_fields=[])

Retrieve documents by their IDs (“_id” field). This will retrieve the documents faster than a filter applied on the “_id” field.

For the single-ID version of this request, use datasets.documents.get.

Parameters
  • dataset_id (string) – Unique name of dataset

  • ids (list) – IDs of documents in the dataset.

  • include_vector (bool) – Include vectors in the search results

  • select_fields (list) – Fields to include in the search results, empty array/list means all fields.

bulk_update(dataset_id, updates, insert_date=True, return_documents=False)

Edits documents by providing key-value pairs of the fields you are adding or changing. Make sure to include the “_id” in each document.

Parameters
  • dataset_id (string) – Unique name of dataset

  • updates (list) – Updates to make to the documents. It should be specified in a format of {“field_name”: “value”}. e.g. {“item.status” : “Sold Out”}

  • insert_date (bool) – Whether to include insert date as a field ‘insert_date_’.

  • include_updated_ids (bool) – Include the updated IDs in the response
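The updates format described for bulk_update can be sketched as follows. `build_update` and `updates_missing_id` are hypothetical helpers, and the document IDs and the "In Stock" value are made up for illustration. The sketch only shows the shape of each update and a local check that every update carries an “_id”.

```python
from typing import Any, Dict, List


def build_update(doc_id: str, changes: Dict[str, Any]) -> Dict[str, Any]:
    """Combine a document ID with the fields to add or change."""
    if "_id" in changes:
        raise ValueError("pass the document ID as doc_id, not inside changes")
    return {"_id": doc_id, **changes}


def updates_missing_id(updates: List[Dict[str, Any]]) -> List[int]:
    """Return the positions of updates that lack the required _id field."""
    return [index for index, update in enumerate(updates) if "_id" not in update]


updates = [
    build_update("doc-1", {"item.status": "Sold Out"}),
    build_update("doc-2", {"item.status": "In Stock"}),
]
```

Running `updates_missing_id` before the call catches updates that could not be matched to a document.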

config: relevanceai.config.Config
delete(dataset_id, id)

Delete a document by ID.

For deleting multiple documents refer to datasets.documents.bulk_delete

Parameters
  • dataset_id (string) – Unique name of dataset

  • id (string) – ID of document to delete

delete_fields(dataset_id, id, fields)

Delete fields in a document in a dataset by its id

Parameters
  • dataset_id (string) – Unique name of dataset

  • id (string) – ID of a document in a dataset

  • fields (list) – List of fields to delete in a document

delete_where(dataset_id, filters)

Delete the documents that match the given filters.

For more information about filters refer to datasets.documents.get_where.

Parameters
  • dataset_id (string) – Unique name of dataset

  • filters (list) – Query for filtering the search results

get(dataset_id, id, include_vector=True)

Retrieve a document by its ID (“_id” field). This will retrieve the document faster than a filter applied on the “_id” field.

Parameters
  • dataset_id (string) – Unique name of dataset

  • id (string) – ID of a document in a dataset.

  • include_vector (bool) – Include vectors in the search results

get_where(dataset_id, filters=[], cursor=None, page_size=20, sort=[], select_fields=[], include_vector=True, random_state=0, is_random=False)

Retrieve documents with filters. A cursor is provided to retrieve more documents; loop through it to retrieve all documents in the database. A filter retrieves only the documents that match the conditions set in a filter query. This is used in advanced search to filter the documents that are searched.

The filters query is a json body that follows the schema of:

>>> [
>>>    {'field' : <field to filter>, 'filter_type' : <type of filter>, "condition":"==", "condition_value":"america"},
>>>    {'field' : <field to filter>, 'filter_type' : <type of filter>, "condition":">=", "condition_value":90},
>>> ]

These are the available filter_type types: [“contains”, “category”, “categories”, “exists”, “date”, “numeric”, “ids”]

“contains”: for filtering documents that contain a string

>>> {'field' : 'item_brand', 'filter_type' : 'contains', "condition":"==", "condition_value": "samsu"}

“exact_match”/“category”: for filtering documents that match a string or list of strings exactly.

>>> {'field' : 'item_brand', 'filter_type' : 'category', "condition":"==", "condition_value": "samsung"}

“categories”: for filtering documents that contain any category from a list of categories.

>>> {'field' : 'item_category_tags', 'filter_type' : 'categories', "condition":"==", "condition_value": ["tv", "smart", "bluetooth_compatible"]}

“exists”: for filtering documents that contain a field.

>>> {'field' : 'purchased', 'filter_type' : 'exists', "condition":"==", "condition_value":" "}

If you are looking to filter for documents where a field doesn’t exist, run this:

>>> {'field' : 'purchased', 'filter_type' : 'exists', "condition":"!=", "condition_value":" "}

“date”: for filtering by date range.

>>> {'field' : 'insert_date_', 'filter_type' : 'date', "condition":">=", "condition_value":"2020-01-01"}

“numeric”: for filtering by numeric range.

>>> {'field' : 'price', 'filter_type' : 'numeric', "condition":">=", "condition_value":90}

“ids”: for filtering by document ids.

>>> {'field' : 'ids', 'filter_type' : 'ids', "condition":"==", "condition_value":["1", "10"]}

These are the available conditions:

>>> "==", "!=", ">=", ">", "<", "<="

If you are looking to combine your filters with multiple ORs, add the following inside the query: {"strict": "must_or"}.

Parameters
  • dataset_id (string) – Unique name of dataset

  • select_fields (list) – Fields to include in the search results, empty array/list means all fields.

  • cursor (string) – Cursor to paginate the document retrieval

  • page_size (int) – Size of each page of results

  • include_vector (bool) – Include vectors in the search results

  • sort (list) – Fields to sort by. For each field, sort by descending or ascending. If you are using descending by datetime, it will get the most recent ones.

  • filters (list) – Query for filtering the search results

  • is_random (bool) – If True, retrieves documents randomly. Cannot be used with cursor.

  • random_state (int) – Random Seed for retrieving random documents.
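
As a minimal sketch of the filter schema described for get_where, the filters list can be assembled from plain dictionaries. The helper name make_filter is hypothetical and not part of the client; only the dictionary keys and the condition strings come from the schema above.

```python
# Hypothetical helper (not part of relevanceai): builds one filter entry
# using the keys documented in the get_where filter schema.
ALLOWED_CONDITIONS = {"==", "!=", ">=", ">", "<", "<="}


def make_filter(field, filter_type, condition, condition_value):
    """Return a single filter dictionary, rejecting unknown conditions."""
    if condition not in ALLOWED_CONDITIONS:
        raise ValueError(f"unsupported condition: {condition!r}")
    return {
        "field": field,
        "filter_type": filter_type,
        "condition": condition,
        "condition_value": condition_value,
    }


filters = [
    make_filter("item_brand", "contains", "==", "samsu"),
    make_filter("price", "numeric", ">=", 90),
    make_filter("insert_date_", "date", ">=", "2020-01-01"),
]
# The resulting list would be passed as the `filters` argument of get_where.
```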

list(dataset_id, select_fields=[], cursor=None, page_size=20, include_vector=True, random_state=0)

Retrieve documents from a specified dataset. Cursor is provided to retrieve even more documents. Loop through it to retrieve all documents in the dataset.

Parameters
  • dataset_id (string) – Unique name of dataset

  • select_fields (list) – Fields to include in the search results, empty array/list means all fields.

  • page_size (int) – Size of each page of results

  • cursor (string) – Cursor to paginate the document retrieval

  • include_vector (bool) – Include vectors in the search results

  • random_state (int) – Random Seed for retrieving random documents.
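
The cursor loop mentioned for list and get_where might look like the sketch below. It assumes each response is a dictionary with a "documents" list and a "cursor" value; those key names are assumptions, not confirmed by this reference. fake_fetch is a stand-in for the real call so that the example runs without a server.

```python
def iterate_documents(fetch_page, page_size=20):
    """Yield every document by following the cursor until no page is left."""
    cursor = None
    while True:
        page = fetch_page(cursor=cursor, page_size=page_size)
        documents = page.get("documents", [])  # assumed response key
        if not documents:
            break
        yield from documents
        cursor = page.get("cursor")  # assumed response key
        if cursor is None:
            break


def fake_fetch(cursor=None, page_size=20):
    """Stand-in for a call such as list(dataset_id, cursor=..., page_size=...)."""
    data = [{"_id": str(i)} for i in range(5)]
    start = int(cursor or 0)
    end = start + page_size
    next_cursor = str(end) if end < len(data) else None
    return {"documents": data[start:end], "cursor": next_cursor}


all_docs = list(iterate_documents(fake_fetch, page_size=2))
```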

paginate(dataset_id, page=1, page_size=20, include_vector=True, select_fields=[])

Retrieve documents with filters and support for pagination.

For more information about filters check out datasets.documents.get_where.

Parameters
  • dataset_id (string) – Unique name of dataset

  • page (int) – Page of the results

  • page_size (int) – Size of each page of results

  • include_vector (bool) – Include vectors in the search results

  • select_fields (list) – Fields to include in the search results, empty array/list means all fields.

project: str
update(dataset_id, update, insert_date=True)

Edits documents by providing key-value pairs of the fields you are adding or changing. Make sure to include the “_id” in the documents.

To update multiple documents, refer to datasets.documents.bulk_update.

Parameters
  • dataset_id (string) – Unique name of dataset

  • update (list) – A dictionary to edit and add fields to a document. It should be specified in a format of {“field_name”: “value”}. e.g. {“item.status” : “Sold Out”}

  • insert_date (bool) – Whether to include insert date as a field ‘insert_date_’.

update_where(dataset_id, update, filters=[])

Updates documents by filters. The update is applied to every document returned by the filter.

For more information about filters refer to datasets.documents.get_where.

Parameters
  • dataset_id (string) – Unique name of dataset

  • update (list) – A dictionary to edit and add fields to a document. It should be specified in a format of {“field_name”: “value”}. e.g. {“item.status” : “Sold Out”}

  • filters (list) – Query for filtering the search results

class relevanceai.api.endpoints.encoders.EncodersClient(project, api_key)

Bases: relevanceai.base._Base

api_key: str
config: relevanceai.config.Config
image(image)

Encode an image

Parameters

image (string) – URL of image to encode

imagetext(image)

Encode an image to make searchable with text

Parameters

image (string) – URL of image to encode

multi_text(text)

Encode multilingual text

Parameters

text (string) – Text to encode

project: str
text(text)

Encode text

Parameters

text (string) – Text to encode

textimage(text)

Encode text to make searchable with images

Parameters

text (string) – Text to encode

All Dataset related functions

class relevanceai.api.endpoints.monitor.MonitorClient(project, api_key)

Bases: relevanceai.base._Base

api_key: str
config: relevanceai.config.Config
health(dataset_id)

Gives you a summary of the health of your vectors, e.g. how many documents are missing vectors and how many documents have zero vectors.

Parameters

dataset_id (string) – Unique name of dataset

project: str
stats(dataset_id)

All operations related to monitoring

Parameters

dataset_id (string) – Unique name of dataset

usage(dataset_id, filters=[], page_size=20, page=1, asc=False, flatten=True, log_ids=[])

Aggregate the logs for a dataset.

The response returned has the following fields:

>>> [{'frequency': 958, 'insert_date': 1630159200000},...]
Parameters
  • dataset_id (string) – Unique name of dataset

  • filters (list) – Query for filtering the search results

  • page_size (int) – Size of each page of results

  • page (int) – Page of the results

  • asc (bool) – Whether to sort results by ascending or descending order

  • flatten (bool) – Whether to flatten

  • log_ids (list) – The log dataset IDs to aggregate with - one or more of logs, logs-write, logs-search, logs-task or js-logs
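
The insert_date values in the sample response look like Unix timestamps in milliseconds. Under that assumption (the unit is not stated in this reference), a response row could be made readable as follows.

```python
from datetime import datetime, timezone

# Sample row copied from the response shown above.
usage_rows = [{"frequency": 958, "insert_date": 1630159200000}]


def to_utc(milliseconds):
    """Convert epoch milliseconds (an assumption about the unit) to UTC."""
    return datetime.fromtimestamp(milliseconds / 1000, tz=timezone.utc)


readable = [
    (to_utc(row["insert_date"]).isoformat(), row["frequency"])
    for row in usage_rows
]
```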

Prediction services

class relevanceai.api.endpoints.prediction.PredictionClient(project, api_key)

Bases: relevanceai.base._Base

KNN(dataset_id, vector, vector_field, target_field, k=5, weighting=True, impute_value=0, predict_operation='most_frequent', include_search_results=True)

Predict using KNN regression.

Parameters
  • dataset_id (string) – Unique name of dataset

  • vector (list) – Vector, a list/array of floats that represents a piece of data.

  • vector_field (string) – The vector field to search in. It can either be an array of strings (automatically equally weighted) (e.g. ['check_vector_', 'yellow_vector_']) or a dictionary mapping field to float where the weighting is explicitly specified (e.g. {'check_vector_': 0.2, 'yellow_vector_': 0.5})

  • target_field (string) – The field to perform regression on.

  • k (int) – The number of results for KNN.

  • weighting (bool/list) – The weighting for each prediction

  • impute_value (int) – The value used to fill in the document when the data is missing.

  • predict_operation (string) – How to predict using the vectors. One of most_frequent or sum_scores

  • include_search_results (bool) – Whether to include search results.

KNN_from_results(field, results, impute_value=0, predict_operation='most_frequent')

Predict using KNN regression from search results

Parameters
  • field (string) – Field in results to use for the prediction. Can be multiplied with weighting.

  • results (dict) – List of results in a dictionary

  • weighting (bool/list) – The weighting for each prediction

  • impute_value (int) – The value used to fill in the document when the data is missing.

  • predict_operation (string) – How to predict using the vectors. One of most_frequent or sum_scores
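
The two predict_operation values can be illustrated with a small local sketch. This is one reading of the option names, not the service's confirmed behaviour: most_frequent picks the most common target value among the neighbours, and sum_scores picks the value whose neighbours have the largest total score. The function name knn_predict and the "score" key are hypothetical.

```python
from collections import Counter, defaultdict


def knn_predict(results, field, impute_value=0,
                predict_operation="most_frequent", score_key="score"):
    """Predict a value for `field` from a list of neighbour documents."""
    if not results:
        return impute_value
    # Missing target values are replaced by impute_value.
    values = [doc.get(field, impute_value) for doc in results]
    if predict_operation == "most_frequent":
        return Counter(values).most_common(1)[0][0]
    if predict_operation == "sum_scores":
        totals = defaultdict(float)
        for doc, value in zip(results, values):
            totals[value] += doc.get(score_key, 0.0)
        return max(totals, key=totals.get)
    raise ValueError(f"unknown predict_operation: {predict_operation!r}")


neighbours = [
    {"label": "tv", "score": 0.2},
    {"label": "tv", "score": 0.2},
    {"label": "phone", "score": 0.9},
    {"score": 0.1},  # no label, so impute_value is used
]
by_count = knn_predict(neighbours, "label")
by_score = knn_predict(neighbours, "label", predict_operation="sum_scores")
```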

api_key: str
config: relevanceai.config.Config
project: str

Recommend services.

class relevanceai.api.endpoints.recommend.RecommendClient(project, api_key)

Bases: relevanceai.base._Base

api_key: str
config: relevanceai.config.Config
diversity(dataset_id, cluster_vector_field, n_clusters, positive_document_ids={}, negative_document_ids={}, vector_fields=[], approximation_depth=0, vector_operation='sum', sum_fields=True, page_size=20, page=1, similarity_metric='cosine', facets=[], filters=[], min_score=0, select_fields=[], include_vector=False, include_count=True, asc=False, keep_search_history=False, hundred_scale=False, search_history_id=None, n_init=5, n_iter=10, return_as_clusters=False)

Vector Search based recommendations are done by extracting the vectors of the specified document IDs, performing some vector operations, and then searching the dataset with the resulting vector. This allows not only recommendations but also personalized and weighted recommendations.

Diversity recommendation increases the variety within the recommendations via clustering. Search results are clustered and the top k items in each cluster are selected. The main clustering parameters are cluster_vector_field and n_clusters, the vector field on which to perform clustering and number of clusters respectively.

Here are a couple of different scenarios and what the queries would look like for those:

Recommendations Personalized by single liked product:

>>> positive_document_ids=['A']

-> Document ID A Vector = Search Query

Recommendations Personalized by multiple liked products:

>>> positive_document_ids=['A', 'B']

-> Document ID A Vector + Document ID B Vector = Search Query

Recommendations Personalized by multiple liked products and disliked products:

>>> positive_document_ids=['A', 'B'], negative_document_ids=['C', 'D']

-> (Document ID A Vector + Document ID B Vector) - (Document ID C Vector + Document ID D Vector) = Search Query

Recommendations Personalized by multiple liked products and disliked products with weights:

>>> positive_document_ids={'A':0.5, 'B':1}, negative_document_ids={'C':0.6, 'D':0.4}

-> (Document ID A Vector * 0.5 + Document ID B Vector * 1) - (Document ID C Vector * 0.6 + Document ID D Vector * 0.4) = Search Query

You can change the operator between vectors with vector_operation:

>>> positive_document_ids=['A', 'B'], negative_document_ids=['C', 'D'], vector_operation='multiply'

-> (Document ID A Vector * Document ID B Vector) - (Document ID C Vector * Document ID D Vector) = Search Query

Parameters
  • dataset_id (string) – Unique name of dataset

  • cluster_vector_field (str) – The field to cluster on.

  • n_clusters (int) – Number of clusters to be specified.

  • positive_document_ids (dict) – Positive document IDs to personalize the results with. This will retrieve the vectors from the document IDs and consider them in the operation.

  • negative_document_ids (dict) – Negative document IDs to personalize the results with. This will retrieve the vectors from the document IDs and consider them in the operation.

  • vector_fields (list) – The vector field to search in. It can either be an array of strings (automatically equally weighted) (e.g. ['check_vector_', 'yellow_vector_']) or a dictionary mapping field to float where the weighting is explicitly specified (e.g. {'check_vector_': 0.2, 'yellow_vector_': 0.5})

  • approximation_depth (int) – Used for approximate search to speed up search. The higher the number, the faster the search but potentially less accurate.

  • vector_operation (string) – Aggregation for the vectors when using positive and negative document IDs, choose from [‘mean’, ‘sum’, ‘min’, ‘max’, ‘divide’, ‘multiply’]

  • sum_fields (bool) – Whether to sum the similarity search scores of multiple vectors into one score or keep them separate

  • page_size (int) – Size of each page of results

  • page (int) – Page of the results

  • similarity_metric (string) – Similarity Metric, choose from [‘cosine’, ‘l1’, ‘l2’, ‘dp’]

  • facets (list) – Fields to include in the facets, if [] then all

  • filters (list) – Query for filtering the search results

  • min_score (int) – Minimum score for similarity metric

  • select_fields (list) – Fields to include in the search results, empty array/list means all fields.

  • include_vector (bool) – Include vectors in the search results

  • include_count (bool) – Include the total count of results in the search results

  • asc (bool) – Whether to sort results by ascending or descending order

  • keep_search_history (bool) – Whether to store the history into VecDB. This will increase the storage costs over time.

  • hundred_scale (bool) – Whether to scale up the metric by 100

  • search_history_id (str) – Search history ID, only used for storing search histories.

  • n_init (int) – Number of runs to run with different centroid seeds

  • n_iter (int) – Number of iterations in each run

  • return_as_clusters (bool) – If True, return as clusters as opposed to results list

project: str
vector(dataset_id, positive_document_ids={}, negative_document_ids={}, vector_fields=[], approximation_depth=0, vector_operation='sum', sum_fields=True, page_size=20, page=1, similarity_metric='cosine', facets=[], filters=[], min_score=0, select_fields=[], include_vector=False, include_count=True, asc=False, keep_search_history=False, hundred_scale=False)

Vector Search based recommendations are done by extracting the vectors of the specified document IDs, performing some vector operations, and then searching the dataset with the resulting vector. This allows not only recommendations but also personalized and weighted recommendations.

Here are a couple of different scenarios and what the queries would look like for those:

Recommendations Personalized by single liked product:

>>> positive_document_ids=['A']

-> Document ID A Vector = Search Query

Recommendations Personalized by multiple liked products:

>>> positive_document_ids=['A', 'B']

-> Document ID A Vector + Document ID B Vector = Search Query

Recommendations Personalized by multiple liked products and disliked products:

>>> positive_document_ids=['A', 'B'], negative_document_ids=['C', 'D']

-> (Document ID A Vector + Document ID B Vector) - (Document ID C Vector + Document ID D Vector) = Search Query

Recommendations Personalized by multiple liked products and disliked products with weights:

>>> positive_document_ids={'A':0.5, 'B':1}, negative_document_ids={'C':0.6, 'D':0.4}

-> (Document ID A Vector * 0.5 + Document ID B Vector * 1) - (Document ID C Vector * 0.6 + Document ID D Vector * 0.4) = Search Query

You can change the operator between vectors with vector_operation:

>>> positive_document_ids=['A', 'B'], negative_document_ids=['C', 'D'], vector_operation='multiply'

-> (Document ID A Vector * Document ID B Vector) - (Document ID C Vector * Document ID D Vector) = Search Query

Parameters
  • dataset_id (string) – Unique name of dataset

  • positive_document_ids (dict) – Positive document IDs to personalize the results with. This will retrieve the vectors from the document IDs and consider them in the operation.

  • negative_document_ids (dict) – Negative document IDs to personalize the results with. This will retrieve the vectors from the document IDs and consider them in the operation.

  • vector_fields (list) – The vector field to search in. It can either be an array of strings (automatically equally weighted) (e.g. ['check_vector_', 'yellow_vector_']) or a dictionary mapping field to float where the weighting is explicitly specified (e.g. {'check_vector_': 0.2, 'yellow_vector_': 0.5})

  • approximation_depth (int) – Used for approximate search to speed up search. The higher the number, the faster the search but potentially less accurate.

  • vector_operation (string) – Aggregation for the vectors when using positive and negative document IDs, choose from [‘mean’, ‘sum’, ‘min’, ‘max’, ‘divide’, ‘multiply’]

  • sum_fields (bool) – Whether to sum the similarity search scores of multiple vectors into one score or keep them separate

  • page_size (int) – Size of each page of results

  • page (int) – Page of the results

  • similarity_metric (string) – Similarity Metric, choose from [‘cosine’, ‘l1’, ‘l2’, ‘dp’]

  • facets (list) – Fields to include in the facets, if [] then all

  • filters (list) – Query for filtering the search results

  • min_score (int) – Minimum score for similarity metric

  • select_fields (list) – Fields to include in the search results, empty array/list means all fields.

  • include_vector (bool) – Include vectors in the search results

  • include_count (bool) – Include the total count of results in the search results

  • asc (bool) – Whether to sort results by ascending or descending order

  • keep_search_history (bool) – Whether to store the history into VecDB. This will increase the storage costs over time.

  • hundred_scale (bool) – Whether to scale up the metric by 100
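
The weighted "liked minus disliked" arithmetic described above can be sketched locally with plain lists. This only illustrates the default 'sum' operation; the function names and the in-memory vectors mapping are hypothetical, and the real service looks the vectors up from the dataset.

```python
def weighted_sum(vectors, ids):
    """Sum the vectors of `ids`; a list means every weight is 1."""
    weights = ids if isinstance(ids, dict) else {doc_id: 1.0 for doc_id in ids}
    dimension = len(next(iter(vectors.values())))
    total = [0.0] * dimension
    for doc_id, weight in weights.items():
        for i, component in enumerate(vectors[doc_id]):
            total[i] += weight * component
    return total


def build_query_vector(vectors, positive_document_ids, negative_document_ids=()):
    """Positive weighted sum minus negative weighted sum."""
    positive = weighted_sum(vectors, positive_document_ids)
    negative = weighted_sum(vectors, negative_document_ids)
    return [p - n for p, n in zip(positive, negative)]


# Toy document vectors, keyed by document ID.
vectors = {"A": [1.0, 0.0], "B": [0.0, 1.0], "C": [1.0, 1.0]}

unweighted = build_query_vector(vectors, ["A", "B"])
weighted = build_query_vector(vectors, {"A": 0.5, "B": 1}, {"C": 0.25})
```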

class relevanceai.api.endpoints.search.SearchClient(project, api_key)

Bases: relevanceai.base._Base

advanced_chunk(dataset_ids, chunk_search_query, min_score=None, page_size=20, include_vector=False, select_fields=[], query=None)

A more advanced chunk search to be able to combine vector search and chunk search in many different ways.

Example 1 (Hybrid chunk search):

>>> chunk_query = {
>>>     "chunk" : "some.test",
>>>     "queries" : [
>>>         {"vector" : vec1, "fields": {"some.test.some_chunkvector_":1},
>>>         "traditional_query" : {"text":"python", "fields" : ["some.test.test_words"], "traditional_weight": 0.3},
>>>         "metric" : "cosine"},
>>>         {"vector" : vec, "fields": ["some.test.tt.some_other_chunkvector_"],
>>>         "traditional_query" : {"text":"jumble", "fields" : ["some.test.test_words"], "traditional_weight": 0.3},
>>>         "metric" : "cosine"},
>>>     ]
>>> }

Example 2 (combines normal vector search with chunk search):

>>> chunk_query = {
>>>     "queries" : [
>>>         {
>>>             "queries": [
>>>                 {
>>>                     "vector": vec1,
>>>                     "fields": {
>>>                         "some.test.some_chunkvector_": 0.9
>>>                     },
>>>                     "traditional_query": {
>>>                         "text": "python",
>>>                         "fields": [
>>>                             "some.test.test_words"
>>>                         ],
>>>                         "traditional_weight": 0.3
>>>                     },
>>>                     "metric": "cosine"
>>>                 }
>>>             ],
>>>             "chunk": "some.test",
>>>         },
>>>         {
>>>             "vector" : vec,
>>>             "fields": {
>>>                 ".some_vector_" : 0.1},
>>>             "metric" : "cosine"
>>>         },
>>>     ]
>>> }

Parameters
  • dataset_id (string) – Unique name of dataset

  • chunk_search_query (list) – Advanced chunk query

  • min_score (int) – Minimum score for similarity metric

  • page_size (int) – Size of each page of results

  • include_vector (bool) – Include vectors in the search results

  • select_fields (list) – Fields to include in the search results, empty array/list means all fields.

  • query (string) – What to store as the query name in the dashboard

advanced_multistep_chunk(dataset_ids, first_step_query, first_step_text, first_step_fields, chunk_search_query, first_step_edit_distance=- 1, first_step_ignore_space=True, first_step_traditional_weight=0.075, first_step_approximation_depth=0, first_step_sum_fields=True, first_step_filters=[], first_step_page_size=50, include_count=True, min_score=0, page_size=20, include_vector=False, select_fields=[], query=None)

Performs a vector hybrid search and then an advanced chunk search. Chunk search allows one to search through chunks inside a document. The major difference between chunk search and normal search in Vector AI is that it relies on the chunkvector field. Chunk vector search searches with multiple chunkvectors for the most similar documents. Chunk search also supports filtering, to search only through filtered results, and facets, to get an overview of the products available when a minimum score is set.

Example 1 (Hybrid chunk search):

>>> chunk_query = {
>>>     "chunk" : "some.test",
>>>     "queries" : [
>>>         {"vector" : vec1, "fields": {"some.test.some_chunkvector_":1},
>>>         "traditional_query" : {"text":"python", "fields" : ["some.test.test_words"], "traditional_weight": 0.3},
>>>         "metric" : "cosine"},
>>>         {"vector" : vec, "fields": ["some.test.tt.some_other_chunkvector_"],
>>>         "traditional_query" : {"text":"jumble", "fields" : ["some.test.test_words"], "traditional_weight": 0.3},
>>>         "metric" : "cosine"},
>>>     ]
>>> }

Example 2 (combines normal vector search with chunk search):

>>> chunk_query = {
>>>     "queries" : [
>>>         {
>>>             "queries": [
>>>                 {
>>>                     "vector": vec1,
>>>                     "fields": {
>>>                         "some.test.some_chunkvector_": 0.9
>>>                     },
>>>                     "traditional_query": {
>>>                         "text": "python",
>>>                         "fields": [
>>>                             "some.test.test_words"
>>>                         ],
>>>                         "traditional_weight": 0.3
>>>                     },
>>>                     "metric": "cosine"
>>>                 }
>>>             ],
>>>             "chunk": "some.test",
>>>         },
>>>         {
>>>             "vector" : vec,
>>>             "fields": {
>>>                 ".some_vector_" : 0.1},
>>>             "metric" : "cosine"
>>>         },
>>>     ]
>>> }

Parameters
  • dataset_id (string) – Unique name of dataset

  • first_step_query (list) – First step query

  • first_step_text (string) – Text search query (not encoded as vector)

  • first_step_fields (list) – Text fields to search against

  • chunk_search_query (list) – Advanced chunk query

  • first_step_edit_distance (int) – The number of single-character edits needed to turn one string into another, e.g. band vs bant has an edit distance of 1. Use -1 if you would like this to be automated.

  • first_step_ignore_spaces (bool) – Whether to consider cases when there is a space in the word. E.g. Go Pro vs GoPro.

  • first_step_traditional_weight (int) – Multiplier of traditional search score. A value of 0.025~0.075 is the ideal range

  • first_step_approximation_depth (int) – Used for approximate search to speed up search. The higher the number, the faster the search but potentially less accurate.

  • first_step_sum_fields (bool) – Whether to sum the similarity search scores of multiple vectors into one score or keep them separate

  • first_step_filters (list) – Query for filtering the search results

  • first_step_page_size (int) – In the first search, you are more interested in the contents

  • include_count (bool) – Include the total count of results in the search results

  • min_score (int) – Minimum score for similarity metric

  • page_size (int) – Size of each page of results

  • include_vector (bool) – Include vectors in the search results

  • select_fields (list) – Fields to include in the search results, empty array/list means all fields.

  • query (string) – What to store as the query name in the dashboard

api_key: str
chunk(dataset_id, multivector_query, chunk_field, chunk_scoring='max', chunk_page_size=3, chunk_page=1, approximation_depth=0, sum_fields=True, page_size=20, page=1, similarity_metric='cosine', facets=[], filters=[], min_score=None, include_vector=False, include_count=True, asc=False, keep_search_history=False, hundred_scale=False, query=None)

Chunks are data that have been divided into different units, e.g. a paragraph is made of many sentence chunks, a sentence is made of many word chunks, and a video is made of many image-frame chunks. By searching through chunks you can pinpoint more specifically where a match is occurring. When creating a chunk in your document, use the suffixes “chunk” and “chunkvector”. An example of a document with chunks:

>>> {
>>>     "_id" : "123",
>>>     "title" : "Lorem Ipsum Article",
>>>     "description" : "Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged.",
>>>     "description_vector_" : [1.1, 1.2, 1.3],
>>>     "description_sentence_chunk_" : [
>>>         {"sentence_id" : 0, "sentence_chunkvector_" : [0.1, 0.2, 0.3], "sentence" : "Lorem Ipsum is simply dummy text of the printing and typesetting industry."},
>>>         {"sentence_id" : 1, "sentence_chunkvector_" : [0.4, 0.5, 0.6], "sentence" : "Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book."},
>>>         {"sentence_id" : 2, "sentence_chunkvector_" : [0.7, 0.8, 0.9], "sentence" : "It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged."},
>>>     ]
>>> }

For combining chunk search with other search check out services.search.advanced_chunk.

Parameters
  • dataset_id (string) – Unique name of dataset

  • multivector_query (list) – Query for advanced search that allows querying multiple vectors and fields.

  • chunk_field (string) – Field where the array of chunked documents are.

  • chunk_scoring (string) – Scoring method for ranking between document chunks.

  • chunk_page_size (int) – Size of each page of chunk results

  • chunk_page (int) – Page of the chunk results

  • approximation_depth (int) – Used for approximate search to speed up search. The higher the number, the faster the search but potentially less accurate.

  • sum_fields (bool) – Whether to sum the similarity search scores of multiple vectors into one score or keep them separate

  • page_size (int) – Size of each page of results

  • page (int) – Page of the results

  • similarity_metric (string) – Similarity Metric, choose from [‘cosine’, ‘l1’, ‘l2’, ‘dp’]

  • facets (list) – Fields to include in the facets, if [] then all

  • filters (list) – Query for filtering the search results

  • min_score (int) – Minimum score for similarity metric

  • include_vector (bool) – Include vectors in the search results

  • include_count (bool) – Include the total count of results in the search results

  • asc (bool) – Whether to sort results by ascending or descending order

  • keep_search_history (bool) – Whether to store the history into VecDB. This will increase the storage costs over time.

  • hundred_scale (bool) – Whether to scale up the metric by 100

  • query (string) – What to store as the query name in the dashboard
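
To make the idea of chunk scoring concrete, the sketch below scores the example document locally by taking the best cosine similarity over its sentence chunks. Treating chunk_scoring='max' as "best chunk wins" is an assumption based on the option name, and the helper names are hypothetical; the service performs this server-side.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def best_chunk(document, query_vector, chunk_field, chunkvector_field):
    """Return (score, chunk) for the chunk most similar to the query."""
    scored = [
        (cosine(query_vector, chunk[chunkvector_field]), chunk)
        for chunk in document[chunk_field]
    ]
    return max(scored, key=lambda pair: pair[0])


# Shortened version of the example document shown above.
document = {
    "_id": "123",
    "description_sentence_chunk_": [
        {"sentence_id": 0, "sentence_chunkvector_": [0.1, 0.2, 0.3]},
        {"sentence_id": 1, "sentence_chunkvector_": [0.4, 0.5, 0.6]},
        {"sentence_id": 2, "sentence_chunkvector_": [0.7, 0.8, 0.9]},
    ],
}

score, chunk = best_chunk(
    document, [0.7, 0.8, 0.9],
    "description_sentence_chunk_", "sentence_chunkvector_",
)
```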

config: relevanceai.config.Config
diversity(dataset_id, cluster_vector_field, n_clusters, multivector_query, positive_document_ids={}, negative_document_ids={}, vector_operation='sum', approximation_depth=0, sum_fields=True, page_size=20, page=1, similarity_metric='cosine', facets=[], filters=[], min_score=0, select_fields=[], include_vector=False, include_count=True, asc=False, keep_search_history=False, hundred_scale=False, search_history_id=None, n_init=5, n_iter=10, return_as_clusters=False, query=None)

This first performs an advanced search and then clusters the top X (page_size) search results. Results are returned as follows. Once you have the clusters:

>>> Cluster 0: [A, B, C]
>>> Cluster 1: [D, E]
>>> Cluster 2: [F, G]
>>> Cluster 3: [H, I]

(Note, each cluster is ordered by highest to lowest search score.)

This intermediately returns:

>>> results_batch_1: [A, H, F, D] (ordered by highest search score)
>>> results_batch_2: [G, E, B, I] (ordered by highest search score)
>>> results_batch_3: [C]

This then returns the final results:

>>> results: [A, H, F, D, G, E, B, I, C]
Parameters
  • dataset_id (string) – Unique name of dataset

  • cluster_vector_field (str) – The field to cluster on.

  • multivector_query (list) – Query for advanced search that allows querying multiple vectors and fields.

  • positive_document_ids (dict) – Positive document IDs to personalize the results with. This will retrieve the vectors from the document IDs and consider them in the operation.

  • negative_document_ids (dict) – Negative document IDs to personalize the results with. This will retrieve the vectors from the document IDs and consider them in the operation.

  • approximation_depth (int) – Used for approximate search to speed up search. The higher the number, the faster the search but potentially less accurate.

  • vector_operation (string) – Aggregation for the vectors when using positive and negative document IDs, choose from [‘mean’, ‘sum’, ‘min’, ‘max’, ‘divide’, ‘multiply’]

  • sum_fields (bool) – Whether to sum the similarity search scores of multiple vectors into one score or keep them separate

  • page_size (int) – Size of each page of results

  • page (int) – Page of the results

  • similarity_metric (string) – Similarity Metric, choose from [‘cosine’, ‘l1’, ‘l2’, ‘dp’]

  • facets (list) – Fields to include in the facets, if [] then all

  • filters (list) – Query for filtering the search results

  • min_score (int) – Minimum score for similarity metric

  • select_fields (list) – Fields to include in the search results, empty array/list means all fields.

  • include_vector (bool) – Include vectors in the search results

  • include_count (bool) – Include the total count of results in the search results

  • asc (bool) – Whether to sort results by ascending or descending order

  • keep_search_history (bool) – Whether to store the history into VecDB. This will increase the storage costs over time.

  • hundred_scale (bool) – Whether to scale up the metric by 100

  • search_history_id (str) – Search history ID, only used for storing search histories.

  • n_clusters (int) – Number of clusters to be specified.

  • n_init (int) – Number of runs to run with different centroid seeds

  • n_iter (int) – Number of iterations in each run

  • return_as_clusters (bool) – If True, return as clusters as opposed to results list

  • query (string) – What to store as the query name in the dashboard
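
The batching described above (take the top item of each cluster, order that batch by score, then repeat) can be reproduced with a short sketch. The scores are made up so that the ordering matches the A to I example; interleave_clusters is a hypothetical name, not a client method.

```python
from itertools import zip_longest


def interleave_clusters(clusters, scores):
    """Flatten clusters into one list, one score-ordered batch at a time."""
    # Order each cluster from highest to lowest search score.
    ordered = [sorted(c, key=scores.get, reverse=True) for c in clusters]
    results = []
    # The n-th batch holds the n-th best item of every cluster that has one.
    for batch in zip_longest(*ordered):
        members = [doc_id for doc_id in batch if doc_id is not None]
        results.extend(sorted(members, key=scores.get, reverse=True))
    return results


# Illustrative scores chosen to match the example ordering.
scores = {"A": 0.9, "H": 0.8, "F": 0.7, "D": 0.6, "G": 0.5,
          "E": 0.4, "B": 0.3, "I": 0.2, "C": 0.1}
clusters = [["A", "B", "C"], ["D", "E"], ["F", "G"], ["H", "I"]]

results = interleave_clusters(clusters, scores)
```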

hybrid(dataset_id, multivector_query, text, fields, edit_distance=- 1, ignore_spaces=True, traditional_weight=0.075, page_size=20, page=1, similarity_metric='cosine', facets=[], filters=[], min_score=0, select_fields=[], include_vector=False, include_count=True, asc=False, keep_search_history=False, hundred_scale=False, search_history_id=None)

Combines traditional keyword faceted search with semantic vector search.

For information on how to use vector search check out services.search.vector.

For information on how to use traditional keyword faceted search check out services.search.traditional.

Parameters
  • dataset_id (string) – Unique name of dataset

  • multivector_query (list) – Query for advanced search that allows querying multiple vectors and fields.

  • text (string) – Text Search Query (not encoded as vector)

  • fields (list) – Text fields to search against

  • positive_document_ids (dict) – Positive document IDs to personalize the results with. This will retrieve the vectors from the document IDs and consider them in the operation.

  • negative_document_ids (dict) – Negative document IDs to personalize the results with. This will retrieve the vectors from the document IDs and consider them in the operation.

  • approximation_depth (int) – Used for approximate search to speed up search. The higher the number, the faster the search but potentially less accurate.

  • vector_operation (string) – Aggregation for the vectors when using positive and negative document IDs, choose from [‘mean’, ‘sum’, ‘min’, ‘max’, ‘divide’, ‘multiply’]

  • sum_fields (bool) – Whether to sum the similarity search scores of multiple vectors into one score or keep them separate

  • page_size (int) – Size of each page of results

  • page (int) – Page of the results

  • similarity_metric (string) – Similarity Metric, choose from [‘cosine’, ‘l1’, ‘l2’, ‘dp’]

  • facets (list) – Fields to include in the facets, if [] then all

  • filters (list) – Query for filtering the search results

  • min_score (float) – Minimum score for similarity metric

  • select_fields (list) – Fields to include in the search results, empty array/list means all fields.

  • include_vector (bool) – Include vectors in the search results

  • include_count (bool) – Include the total count of results in the search results

  • asc (bool) – Whether to sort results by ascending or descending order

  • keep_search_history (bool) – Whether to store the history into VecDB. This will increase the storage costs over time.

  • hundred_scale (bool) – Whether to scale up the metric by 100

  • search_history_id (string) – Search history ID, only used for storing search histories.

  • edit_distance (int) – The number of single-character edits needed to turn one string into another, e.g. band vs bant has an edit distance of 1. Use -1 if you would like this to be automated.

  • ignore_spaces (bool) – Whether to consider cases when there is a space in the word. E.g. Go Pro vs GoPro.

  • traditional_weight (float) – Multiplier applied to the traditional search score. Values between 0.025 and 0.075 are the ideal range.
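
To give a feel for why traditional_weight is kept small, here is a minimal sketch. It assumes (the parameter list above does not confirm this) that the final score is the vector similarity plus traditional_weight times the keyword score. The function name blended_score and the sample scores are hypothetical.

```python
def blended_score(vector_score, traditional_score, traditional_weight=0.05):
    """Hypothetical blend: vector similarity plus a down-weighted keyword score."""
    return vector_score + traditional_weight * traditional_score


# Hypothetical candidates: (cosine similarity, raw keyword score).
candidates = {
    "doc_a": (0.82, 2.0),
    "doc_b": (0.74, 6.0),
}

# Rank document IDs by the blended score, best first.
ranked = sorted(
    candidates,
    key=lambda doc_id: blended_score(*candidates[doc_id]),
    reverse=True,
)
print(ranked)
```

Under this assumption a raw keyword score, which can be much larger than a cosine similarity, only nudges the ranking when the weight stays in the suggested range.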

make_suggestion()
multistep_chunk(dataset_id, multivector_query, first_step_multivector_query, chunk_field, chunk_scoring='max', chunk_page_size=3, chunk_page=1, approximation_depth=0, sum_fields=True, page_size=20, page=1, similarity_metric='cosine', facets=[], filters=[], min_score=None, include_vector=False, include_count=True, asc=False, keep_search_history=False, hundred_scale=False, first_step_page=1, first_step_page_size=20, query=None)

Multistep chunk search runs a vector search followed by a chunk search. Use it to accelerate chunk searches or to identify context before delving into the relevant chunks, e.g. search against the paragraph vector first, then against the sentence chunk vectors.

For more information about chunk search check out services.search.chunk.

For more information about vector search check out services.search.vector

Parameters
  • dataset_id (string) – Unique name of dataset

  • multivector_query (list) – Query for advanced search that allows for multiple vector and field querying.

  • chunk_field (string) – Field where the array of chunked documents are.

  • chunk_scoring (string) – Scoring method used to rank document chunks.

  • chunk_page_size (int) – Size of each page of chunk results

  • chunk_page (int) – Page of the chunk results

  • approximation_depth (int) – Used for approximate search to speed up search. The higher the number, the faster the search, but potentially the less accurate the results.

  • sum_fields (bool) – Whether to sum the similarity scores of the multiple vectors into one score or keep them separate

  • page_size (int) – Size of each page of results

  • page (int) – Page of the results

  • similarity_metric (string) – Similarity Metric, choose from [‘cosine’, ‘l1’, ‘l2’, ‘dp’]

  • facets (list) – Fields to include in the facets, if [] then all

  • filters (list) – Query for filtering the search results

  • min_score (int) – Minimum score for similarity metric

  • include_vector (bool) – Include vectors in the search results

  • include_count (bool) – Include the total count of results in the search results

  • asc (bool) – Whether to sort results by ascending or descending order

  • keep_search_history (bool) – Whether to store the history into VecDB. This will increase the storage costs over time.

  • hundred_scale (bool) – Whether to scale up the metric by 100

  • first_step_multivector_query (list) – Query for the first step, an advanced search that allows for multiple vector and field querying.

  • first_step_page (int) – Page of the results

  • first_step_page_size (int) – Size of each page of results

  • query (string) – What to store as the query name in the dashboard
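
The two-step flow described above can be sketched locally as follows. This is an illustration only, not the service's implementation: the document layout (doc_vector_, chunks, chunk_vector_) and the helper names are hypothetical, and cosine similarity with 'max' chunk scoring is assumed.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def multistep_chunk_search(docs, query, first_step_page_size=2, chunk_page_size=1):
    """Step 1: shortlist documents by their document-level vector.
    Step 2: rank chunks inside the shortlist and score each document
    by its best chunk (the 'max' chunk scoring)."""
    shortlist = sorted(
        docs, key=lambda d: cosine(d["doc_vector_"], query), reverse=True
    )[:first_step_page_size]

    results = []
    for doc in shortlist:
        chunks = sorted(
            doc["chunks"],
            key=lambda c: cosine(c["chunk_vector_"], query),
            reverse=True,
        )[:chunk_page_size]
        if not chunks:
            continue  # nothing to score in a document without chunks
        best = cosine(chunks[0]["chunk_vector_"], query)
        results.append({"_id": doc["_id"], "score": best, "chunks": chunks})
    return sorted(results, key=lambda r: r["score"], reverse=True)


# Hypothetical toy data: three documents with 2-dimensional vectors.
docs = [
    {"_id": "a", "doc_vector_": [1.0, 0.0], "chunks": [
        {"text": "a1", "chunk_vector_": [1.0, 0.1]},
        {"text": "a2", "chunk_vector_": [0.0, 1.0]},
    ]},
    {"_id": "b", "doc_vector_": [0.9, 0.1], "chunks": [
        {"text": "b1", "chunk_vector_": [1.0, 0.0]},
    ]},
    {"_id": "c", "doc_vector_": [0.0, 1.0], "chunks": [
        {"text": "c1", "chunk_vector_": [1.0, 0.0]},
    ]},
]
query = [1.0, 0.0]
results = multistep_chunk_search(docs, query)
print([r["_id"] for r in results])
```

Document c never reaches the chunk step even though its chunk matches the query exactly, which shows both the speed-up and the trade-off of filtering on the first step.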

project: str
semantic(dataset_id, multivector_query, fields, text, page_size=20, page=1, similarity_metric='cosine', facets=[], filters=[], min_score=0, select_fields=[], include_vector=False, include_count=True, asc=False, keep_search_history=False, hundred_scale=False)

A more automated hybrid search that adjusts some of the key parameters automatically to give good results out of the box.

For information on how to configure semantic search check out services.search.hybrid.

Parameters
  • dataset_id (string) – Unique name of dataset

  • multivector_query (list) – Query for advanced search that allows for multiple vector and field querying.

  • positive_document_ids (dict) – Positive document IDs to personalize the results with. The vectors are retrieved from these document IDs and included in the operation.

  • negative_document_ids (dict) – Negative document IDs to personalize the results with. The vectors are retrieved from these document IDs and included in the operation.

  • text (string) – Text Search Query (not encoded as vector)

  • fields (list) – Text fields to search against

  • approximation_depth (int) – Used for approximate search to speed up search. The higher the number, the faster the search, but potentially the less accurate the results.

  • sum_fields (bool) – Whether to sum the similarity scores of the multiple vectors into one score or keep them separate

  • page_size (int) – Size of each page of results

  • page (int) – Page of the results

  • similarity_metric (string) – Similarity Metric, choose from [‘cosine’, ‘l1’, ‘l2’, ‘dp’]

  • facets (list) – Fields to include in the facets, if [] then all

  • filters (list) – Query for filtering the search results

  • min_score (int) – Minimum score for similarity metric

  • select_fields (list) – Fields to include in the search results, empty array/list means all fields.

  • include_vector (bool) – Include vectors in the search results

  • include_count (bool) – Include the total count of results in the search results

  • asc (bool) – Whether to sort results by ascending or descending order

  • keep_search_history (bool) – Whether to store the history into VecDB. This will increase the storage costs over time.

  • hundred_scale (bool) – Whether to scale up the metric by 100

traditional(dataset_id, text, fields=[], edit_distance=-1, ignore_spaces=True, page_size=29, page=1, select_fields=[], include_vector=False, include_count=True, asc=False, keep_search_history=False, search_history_id=None)

Traditional Faceted Keyword Search with edit distance/fuzzy matching.

For information on how to apply facets/filters check out datasets.documents.get_where.

For information on how to construct the facets section for your search bar check out datasets.facets.

Parameters
  • dataset_id (string) – Unique name of dataset

  • multivector_query (list) – Query for advanced search that allows for multiple vector and field querying.

  • text (string) – Text Search Query (not encoded as vector)

  • fields (list) – Text fields to search against

  • edit_distance (int) – The number of single-character edits needed to turn one string into another, e.g. band vs bant is an edit distance of 1. Use -1 if you would like this to be automated.

  • ignore_spaces (bool) – Whether to consider cases when there is a space in the word. E.g. Go Pro vs GoPro.

  • page_size (int) – Size of each page of results

  • page (int) – Page of the results

  • select_fields (list) – Fields to include in the search results, empty array/list means all fields.

  • include_vector (bool) – Include vectors in the search results

  • include_count (bool) – Include the total count of results in the search results

  • asc (bool) – Whether to sort results by ascending or descending order

  • keep_search_history (bool) – Whether to store the history into VecDB. This will increase the storage costs over time.

  • search_history_id (string) – Search history ID, only used for storing search histories.
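
The edit_distance and ignore_spaces parameters can be illustrated with a small local sketch. It uses the standard Levenshtein distance; whether the service uses exactly this definition is an assumption, and fuzzy_match is a hypothetical helper, not part of the client.

```python
def edit_distance(a, b):
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, or substitutions needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,          # delete char_a
                current[j - 1] + 1,       # insert char_b
                previous[j - 1] + cost,   # substitute, or keep a match
            ))
        previous = current
    return previous[-1]


def fuzzy_match(query, candidate, max_distance=1, ignore_spaces=True):
    """Hypothetical helper: True when the two strings are within
    max_distance edits, optionally disregarding spaces."""
    if ignore_spaces:
        query = query.replace(" ", "")
        candidate = candidate.replace(" ", "")
    return edit_distance(query.lower(), candidate.lower()) <= max_distance


print(edit_distance("band", "bant"))   # 1
print(fuzzy_match("Go Pro", "GoPro"))  # True
```

With ignore_spaces disabled and max_distance=0, "Go Pro" no longer matches "GoPro", because the space counts as one edit.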

vector(dataset_id, multivector_query, positive_document_ids={}, negative_document_ids={}, vector_operation='sum', approximation_depth=0, sum_fields=True, page_size=20, page=1, similarity_metric='cosine', facets=[], filters=[], min_score=0, select_fields=[], include_vector=False, include_count=True, asc=False, keep_search_history=False, hundred_scale=False, search_history_id=None, query=None)

Allows you to leverage vector similarity search to create a semantic search engine. Powerful features of VecDB vector search:

  1. Multivector search that allows you to search with multiple vectors and give each vector a different weight, e.g. search with a product image vector and a text description vector to find the most similar products by what they look like and what they are described to do. You can also weight each vector field in the search, e.g. image_vector_ at 100% and description_vector_ at 50%.

    An example of a simple multivector query:

    >>> [
    >>>     {"vector": [0.12, 0.23, 0.34], "fields": ["name_vector_"], "alias":"text"},
    >>>     {"vector": [0.45, 0.56, 0.67], "fields": ["image_vector_"], "alias":"image"},
    >>> ]
    

    An example of a weighted multivector query:

    >>> [
    >>>     {"vector": [0.12, 0.23, 0.34], "fields": {"name_vector_":0.6}, "alias":"text"},
    >>>     {"vector": [0.45, 0.56, 0.67], "fields": {"image_vector_":0.4}, "alias":"image"},
    >>> ]
    

    An example of a weighted multivector query with multiple fields for each vector:

    >>> [
    >>>     {"vector": [0.12, 0.23, 0.34], "fields": {"name_vector_":0.6, "description_vector_":0.3}, "alias":"text"},
    >>>     {"vector": [0.45, 0.56, 0.67], "fields": {"image_vector_":0.4}, "alias":"image"},
    >>> ]
    
  2. Utilise faceted search with vector search. For information on how to apply facets/filters check out datasets.documents.get_where

  3. Sum Fields option to adjust whether you want multiple vectors to be combined in the scoring or compared in the scoring. e.g. image_vector_ + text_vector_ or image_vector_ vs text_vector_.

    When sum_fields=True:

    • Multi-vector search allows you to obtain search scores by taking the sum of these scores.

    • TextSearchScore + ImageSearchScore = SearchScore

    • We then rank by the new SearchScore, so for searching 1000 documents there will be 1000 search scores and results

    When sum_fields=False:

    • Multi-vector search without summing the scores; each score is instead included in the comparison.

    • TextSearchScore = SearchScore1

    • ImageSearchScore = SearchScore2

    • We then rank by the 2 new SearchScores, so for searching 1000 documents there should be 2000 search scores and results.

  4. Personalization with positive and negative document ids.

    • For more information about the positive and negative document ids to personalize check out services.recommend.vector

For even more advanced configuration and customisation of vector search, reach out to us at dev@relevance.ai and learn about our new advanced_vector_search.
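
The weighted multivector query and the sum_fields option described above can be sketched locally. This is a rough illustration, assuming cosine similarity and assuming that a field weight simply multiplies that field's score; score_document and the toy document are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def score_document(doc, multivector_query, sum_fields=True):
    """Score one document against a weighted multivector query.

    Returns a single combined score when sum_fields is True,
    otherwise one score per vector field."""
    per_field = {}
    for part in multivector_query:
        for field, weight in part["fields"].items():
            per_field[field] = weight * cosine(part["vector"], doc[field])
    if sum_fields:
        return {"combined": sum(per_field.values())}
    return per_field


multivector_query = [
    {"vector": [1.0, 0.0], "fields": {"name_vector_": 0.6}, "alias": "text"},
    {"vector": [0.0, 1.0], "fields": {"image_vector_": 0.4}, "alias": "image"},
]
# Hypothetical document: matches the text vector, not the image vector.
doc = {"name_vector_": [1.0, 0.0], "image_vector_": [1.0, 0.0]}

summed = score_document(doc, multivector_query, sum_fields=True)
separate = score_document(doc, multivector_query, sum_fields=False)
print(summed, separate)
```

With sum_fields=True the document gets one score to rank by; with sum_fields=False each vector field contributes its own score, so the same document can appear once per field in the ranking.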

Parameters
  • dataset_id (string) – Unique name of dataset

  • multivector_query (list) – Query for advanced search that allows for multiple vector and field querying.

  • positive_document_ids (dict) – Positive document IDs to personalize the results with. The vectors are retrieved from these document IDs and included in the operation.

  • negative_document_ids (dict) – Negative document IDs to personalize the results with. The vectors are retrieved from these document IDs and included in the operation.

  • approximation_depth (int) – Used for approximate search to speed up search. The higher the number, the faster the search, but potentially the less accurate the results.

  • vector_operation (string) – Aggregation for the vectors when using positive and negative document IDs, choose from [‘mean’, ‘sum’, ‘min’, ‘max’, ‘divide’, ‘mulitple’]

  • sum_fields (bool) – Whether to sum the similarity scores of the multiple vectors into one score or keep them separate

  • page_size (int) – Size of each page of results

  • page (int) – Page of the results

  • similarity_metric (string) – Similarity Metric, choose from [‘cosine’, ‘l1’, ‘l2’, ‘dp’]

  • facets (list) – Fields to include in the facets, if [] then all

  • filters (list) – Query for filtering the search results

  • min_score (int) – Minimum score for similarity metric

  • select_fields (list) – Fields to include in the search results, empty array/list means all fields.

  • include_vector (bool) – Include vectors in the search results

  • include_count (bool) – Include the total count of results in the search results

  • asc (bool) – Whether to sort results by ascending or descending order

  • keep_search_history (bool) – Whether to store the history into VecDB. This will increase the storage costs over time.

  • hundred_scale (bool) – Whether to scale up the metric by 100

  • search_history_id (string) – Search history ID, only used for storing search histories.

  • query (string) – What to store as the query name in the dashboard
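
For reference, the four similarity_metric options can be written out as plain functions. The definitions below are the textbook ones (cosine similarity, L1 and L2 distance, dot product); how the service turns the L1 and L2 distances into a ranking score is not stated above, so treat this as a sketch rather than the service's scoring code.

```python
import math


def cosine(a, b):
    """Cosine of the angle between a and b (higher means closer)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def l1(a, b):
    """Manhattan distance (lower means closer)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def l2(a, b):
    """Euclidean distance (lower means closer)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dp(a, b):
    """Dot product (higher means closer; sensitive to vector length)."""
    return sum(x * y for x, y in zip(a, b))


SIMILARITY_METRICS = {"cosine": cosine, "l1": l1, "l2": l2, "dp": dp}

a, b = [3.0, 4.0], [4.0, 3.0]
for name, metric in SIMILARITY_METRICS.items():
    print(name, metric(a, b))
```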

Services class

class relevanceai.api.endpoints.services.ServicesClient(project, api_key)

Bases: relevanceai.base._Base

api_key: str
config: relevanceai.config.Config
document_diff(doc, docs_to_compare, difference_fields=[])

Find differences between documents

Parameters
  • doc (dict) – Main document to compare other documents against.

  • docs_to_compare (list) – Other documents to compare against the main document.

  • difference_fields (list) – Fields to compare. Defaults to [], which compares all fields.
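
The idea behind document_diff can be sketched in a few lines of local Python. The response format of the real endpoint is not documented here, so the output shape below is an assumption, and local_document_diff is a hypothetical stand-in rather than the client method.

```python
def local_document_diff(doc, docs_to_compare, difference_fields=None):
    """For each compared document, report the fields whose values differ
    from the main document. An empty or missing difference_fields
    compares every field present in either document."""
    diffs = []
    for other in docs_to_compare:
        fields = difference_fields or sorted(set(doc) | set(other))
        changed = {
            field: {"main": doc.get(field), "other": other.get(field)}
            for field in fields
            if doc.get(field) != other.get(field)
        }
        diffs.append(changed)
    return diffs


main = {"_id": "1", "title": "Red shoe", "price": 50}
others = [{"_id": "2", "title": "Red shoe", "price": 45}]

price_and_title = local_document_diff(main, others, ["title", "price"])
all_fields = local_document_diff(main, others)
print(price_and_title)
print(all_fields)
```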

project: str

Tagger services

class relevanceai.api.endpoints.tagger.TaggerClient(project, api_key)

Bases: relevanceai.base._Base

api_key: str
config: relevanceai.config.Config
diversity(data, tag_dataset_id, encoder, cluster_vector_field, n_clusters, tag_field=None, approximation_depth=0, sum_fields=True, page_size=20, page=1, similarity_metric='cosine', filters=[], min_score=0, include_search_relevance=False, search_relevance_cutoff_aggressiveness=1, asc=False, include_score=False, n_init=5, n_iter=10)

Tags the data, clusters the resulting tags, and returns one tag from each cluster (starting from the closest tag).

Parameters
  • data (string) – Image Url or text or any data suited for the encoder

  • tag_dataset_id (string) – Name of the dataset you want to tag

  • encoder (string) – Which encoder to use.

  • cluster_vector_field (str) – The field to cluster on.

  • n_clusters (int) – Number of clusters to be specified.

  • tag_field (string) – The field used to tag in a dataset. If None, automatically uses the one stated in the encoder.

  • approximation_depth (int) – Used for approximate search to speed up search. The higher the number, the faster the search, but potentially the less accurate the results.

  • sum_fields (bool) – Whether to sum the similarity scores of the multiple vectors into one score or keep them separate

  • page_size (int) – Size of each page of results

  • page (int) – Page of the results

  • similarity_metric (string) – Similarity Metric, choose from [‘cosine’, ‘l1’, ‘l2’, ‘dp’]

  • filters (list) – Query for filtering the search results

  • min_score (int) – Minimum score for similarity metric

  • include_search_relevance (bool) – Whether to calculate a search_relevance cutoff score to flag relevant and less relevant results

  • search_relevance_cutoff_aggressiveness (int) – How aggressive the search_relevance cutoff score is (the higher the value, the fewer results are flagged as relevant)

  • asc (bool) – Whether to sort results by ascending or descending order

  • include_score (bool) – Whether to include score

  • n_init (int) – Number of runs to run with different centroid seeds

  • n_iter (int) – Number of iterations in each run
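
The selection step of diversity (one tag per cluster, starting from the closest tag) can be sketched as below. The sketch assumes the tags have already been scored and assigned to clusters, for example by a k-means run with n_clusters clusters; the field names tag, score and cluster are hypothetical.

```python
def diversify(scored_tags):
    """Keep the best-scoring tag from each cluster, ordered from the
    closest tag downwards."""
    picked = {}
    for tag in sorted(scored_tags, key=lambda t: t["score"], reverse=True):
        # The first tag seen for a cluster is its highest-scoring one.
        picked.setdefault(tag["cluster"], tag)
    return list(picked.values())


# Hypothetical tagging output with cluster labels already attached.
scored_tags = [
    {"tag": "sneaker", "score": 0.93, "cluster": 0},
    {"tag": "trainer", "score": 0.91, "cluster": 0},
    {"tag": "leather", "score": 0.80, "cluster": 1},
    {"tag": "running", "score": 0.85, "cluster": 2},
]

diverse = diversify(scored_tags)
print([t["tag"] for t in diverse])
```

Near-duplicate tags such as "sneaker" and "trainer" fall into the same cluster, so only the closer one survives.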

project: str
tag(data, tag_dataset_id, encoder, tag_field=None, approximation_depth=0, sum_fields=True, page_size=20, page=1, similarity_metric='cosine', filters=[], min_score=0, include_search_relevance=False, search_relevance_cutoff_aggressiveness=1, asc=False, include_score=False)

Tag documents or vectors

Parameters
  • data (string) – Image Url or text or any data suited for the encoder

  • tag_dataset_id (string) – Name of the dataset you want to tag

  • encoder (string) – Which encoder to use.

  • tag_field (string) – The field used to tag in a dataset. If None, automatically uses the one stated in the encoder.

  • approximation_depth (int) – Used for approximate search to speed up search. The higher the number, the faster the search, but potentially the less accurate the results.

  • sum_fields (bool) – Whether to sum the similarity scores of the multiple vectors into one score or keep them separate

  • page_size (int) – Size of each page of results

  • page (int) – Page of the results

  • similarity_metric (string) – Similarity Metric, choose from [‘cosine’, ‘l1’, ‘l2’, ‘dp’]

  • filters (list) – Query for filtering the search results

  • min_score (int) – Minimum score for similarity metric

  • include_search_relevance (bool) – Whether to calculate a search_relevance cutoff score to flag relevant and less relevant results

  • search_relevance_cutoff_aggressiveness (int) – How aggressive the search_relevance cutoff score is (the higher the value, the fewer results are flagged as relevant)

  • asc (bool) – Whether to sort results by ascending or descending order

  • include_score (bool) – Whether to include score

Tasks Module

class relevanceai.api.endpoints.tasks.TasksClient(project, api_key)

Bases: relevanceai.base._Base

api_key: str
config: relevanceai.config.Config
create(dataset_id, task_name, task_parameters)

Tasks unlock the power of VecDb AI by adding a lot more new functionality with a flexible way of searching.

Parameters
  • dataset_id (string) – Unique name of dataset

  • task_name (string) – Name of task to complete

  • task_parameters (dict) – Parameters of task to complete

create_cluster_task(dataset_id, vector_field, n_clusters, alias='default', refresh=False, n_iter=10, n_init=5, status_checker=True, verbose=True, time_between_ping=10)

Start a task which creates clusters for a dataset based on a vector field.

Parameters
  • dataset_id (string) – Unique name of dataset

  • vector_field (string) – The field to cluster on.

  • alias (string) – Alias is used to name a cluster

  • n_clusters (int) – Number of clusters to be specified.

  • n_iter (int) – Number of iterations in each run

  • n_init (int) – Number of runs to run with different centroid seeds

  • refresh (bool) – Whether to rerun task on the whole dataset or just the ones missing the output

create_encode_categories_task(dataset_id, fields, status_checker=True, verbose=True, time_between_ping=10)

Within a collection encode the specified array field in every document into vectors.

For example, an array that represents a movie’s categories:

>>> document 1 array field: {"category" : ["sci-fi", "thriller", "comedy"]}
>>> document 2 array field: {"category" : ["sci-fi", "romance", "drama"]}
>>> -> <Encode the arrays to vectors> ->
>>> | sci-fi | thriller | comedy | romance | drama |
>>> |--------|----------|--------|---------|-------|
>>> | 1      | 1        | 1      | 0       | 0     |
>>> | 1      | 0        | 0      | 1       | 1     |
>>> document 1 array vector: {"movie_categories_vector_": [1, 1, 1, 0, 0]}
>>> document 2 array vector: {"movie_categories_vector_": [1, 0, 0, 1, 1]}

Parameters
  • dataset_id (string) – Unique name of dataset

  • fields (list) – The array fields to encode into vectors.
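
The encoding illustrated above is a multi-hot encoding, and a minimal local version looks like this. It is a sketch of the idea, not the task's implementation: the column order (order of first appearance) and the function name encode_categories are assumptions.

```python
def encode_categories(documents, field):
    """Multi-hot encode an array field: one column per distinct category,
    1 where the document has the category and 0 otherwise."""
    vocabulary = []
    for doc in documents:
        for category in doc.get(field, []):
            if category not in vocabulary:
                vocabulary.append(category)  # columns in first-seen order
    vectors = [
        [1 if category in doc.get(field, []) else 0 for category in vocabulary]
        for doc in documents
    ]
    return vocabulary, vectors


documents = [
    {"category": ["sci-fi", "thriller", "comedy"]},
    {"category": ["sci-fi", "romance", "drama"]},
]
vocabulary, vectors = encode_categories(documents, "category")
print(vocabulary)
print(vectors)
```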

create_encode_imagetext_task(dataset_id, field, alias='default', refresh=False, status_checker=True, verbose=True, time_between_ping=10)

Start a task which encodes an image field for text representation.

Parameters
  • dataset_id (string) – Unique name of dataset

  • field (string) – The field to encode

  • alias (string) – Alias used to name a vector field. Belongs in field_{alias}_vector_

  • refresh (bool) – Whether to rerun task on the whole dataset or just the ones missing the output

create_encode_text_task(dataset_id, field, alias='default', refresh=False, status_checker=True, verbose=True, time_between_ping=10)

Start a task which encodes a text field.

Parameters
  • dataset_id (string) – Unique name of dataset

  • field (string) – The field to encode

  • alias (string) – Alias used to name a vector field. Belongs in field_{alias}_vector_

  • refresh (bool) – Whether to rerun task on the whole dataset or just the ones missing the output

create_encode_textimage_task(dataset_id, field, alias='default', refresh=False, status_checker=True, verbose=True, time_between_ping=10)

Start a task which encodes a text field for image representation.

Parameters
  • dataset_id (string) – Unique name of dataset

  • field (string) – The field to encode

  • alias (string) – Alias used to name a vector field. Belongs in field_{alias}_vector_

  • refresh (bool) – Whether to rerun task on the whole dataset or just the ones missing the output

create_numeric_encoder_task(dataset_id, fields, vector_name='_vector_', status_checker=True, verbose=True, time_between_ping=10)

Within a collection encode the specified dictionary field in every document into vectors.

For example, a dictionary that represents the characteristics of a person visiting a store:

>>> document 1 field: {"person_characteristics" : {"height":180, "age":40, "weight":70}}
>>> document 2 field: {"person_characteristics" : {"age":32, "purchases":10, "visits": 24}}
>>> -> <Encode the dictionaries to vectors> ->
>>> | height | age | weight | purchases | visits |
>>> |--------|-----|--------|-----------|--------|
>>> | 180    | 40  | 70     | 0         | 0      |
>>> | 0      | 32  | 0      | 10        | 24     |
>>> document 1 dictionary vector: {"person_characteristics_vector_": [180, 40, 70, 0, 0]}
>>> document 2 dictionary vector: {"person_characteristics_vector_": [0, 32, 0, 10, 24]}

Parameters
  • dataset_id (string) – Unique name of dataset

  • fields (list) – The numeric fields to encode into vectors.

  • vector_name (string) – The name of the vector field created
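
The dictionary encoding shown above can be reproduced locally with a short sketch. It assumes columns are ordered by first appearance and that missing keys are filled with 0, as in the example; encode_numeric_dict is a hypothetical name, not the task's implementation.

```python
def encode_numeric_dict(documents, field):
    """Encode a dictionary field into fixed-length vectors: one column
    per distinct key, with 0 where a document lacks the key."""
    keys = []
    for doc in documents:
        for key in doc.get(field, {}):
            if key not in keys:
                keys.append(key)  # columns in first-seen order
    vectors = [
        [doc.get(field, {}).get(key, 0) for key in keys]
        for doc in documents
    ]
    return keys, vectors


documents = [
    {"person_characteristics": {"height": 180, "age": 40, "weight": 70}},
    {"person_characteristics": {"age": 32, "purchases": 10, "visits": 24}},
]
keys, vectors = encode_numeric_dict(documents, "person_characteristics")
print(keys)
print(vectors)
```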

list(dataset_id, show_active_only=True)

List and get a history of all the jobs and their job_id, parameters, start time, etc.

Parameters
  • dataset_id (string) – Unique name of dataset

  • show_active_only (bool) – Whether to show active only

project: str
status(dataset_id, task_id)

Get the status of a collection-level job: whether it is starting, running, failed or finished.

Parameters
  • dataset_id (string) – Unique name of dataset

  • task_id (string) – Unique name of task
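
A polling loop around status, similar in spirit to the status_checker and time_between_ping options of the task methods above, might look like the sketch below. The status strings are taken from the description above, but the exact values and response shape returned by the service are assumptions; wait_for_task and its get_status callable are hypothetical.

```python
import time


def wait_for_task(get_status, time_between_ping=10, max_pings=60, sleep=time.sleep):
    """Poll get_status() until it reports a terminal state.

    get_status is any zero-argument callable returning a status string;
    sleep is injectable so the loop can be exercised without waiting."""
    for _ in range(max_pings):
        status = get_status()
        if status in ("finished", "failed"):
            return status
        sleep(time_between_ping)
    raise TimeoutError("task did not reach a terminal state in time")


# Simulated status sequence standing in for real status calls.
statuses = iter(["starting", "running", "running", "finished"])
final_status = wait_for_task(lambda: next(statuses), time_between_ping=0)
print(final_status)
```

In real use, get_status would wrap a call to status(dataset_id, task_id) and extract the state from its response.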

Wordclouds services

class relevanceai.api.endpoints.wordclouds.WordcloudsClient(project, api_key)

Bases: relevanceai.base._Base

api_key: str
config: relevanceai.config.Config
project: str
wordclouds(dataset_id, fields, n=2, most_common=5, page_size=20, select_fields=[], include_vector=False, filters=[], additional_stopwords=[])

Get the n-gram frequency counts for the wordcloud.

Parameters
  • dataset_id (string) – Unique name of dataset

  • fields (list) – The field on which to build NGrams

  • n (int) – The number of words to combine

  • most_common (int) – The number of most common n-gram terms to return

  • page_size (int) – Size of each page of results

  • select_fields (list) – Fields to include in the search results, empty array/list means all fields.

  • include_vector (bool) – Include vectors in the search results

  • filters (list) – Query for filtering the search results

  • additional_stopwords (list) – Additional stopwords to add
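
What an n-gram frequency counter computes can be shown with a small local sketch. The tokenisation (lowercasing and splitting on whitespace) and the point at which stopwords are removed are assumptions, and ngram_counts is a hypothetical helper rather than the service's implementation.

```python
from collections import Counter


def ngram_counts(texts, n=2, most_common=5, additional_stopwords=()):
    """Count n-grams (runs of n consecutive words) across texts and
    return the most_common most frequent ones with their counts."""
    stopwords = {word.lower() for word in additional_stopwords}
    counter = Counter()
    for text in texts:
        # Stopwords are dropped before the n-grams are formed.
        words = [w for w in text.lower().split() if w not in stopwords]
        counter.update(
            " ".join(words[i:i + n]) for i in range(len(words) - n + 1)
        )
    return counter.most_common(most_common)


texts = ["red running shoes", "blue running shoes", "red running socks"]
top_bigrams = ngram_counts(texts, n=2, most_common=2)
print(top_bigrams)
```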

Module contents