`relevanceai.api.endpoints.datasets`

All dataset-related functions

Module Contents

Classes

DatasetsClient

All dataset-related functions

class relevanceai.api.endpoints.datasets.DatasetsClient(project: str, api_key: str)

Bases: relevanceai.base._Base

All dataset-related functions

schema(self, dataset_id: str)

Returns the schema of a dataset. Refer to datasets.create for different field types available in a VecDB schema.

Parameters: dataset_id (string) – Unique name of dataset

metadata(self, dataset_id: str)

Retrieves metadata about a dataset, notably its description, data source, etc.

Parameters: dataset_id (string) – Unique name of dataset

create(self, dataset_id: str, schema: dict = {})

A dataset can store documents to be searched, retrieved, filtered and aggregated (similar to Collections in MongoDB, Tables in SQL, Indexes in ElasticSearch). A core feature of VecDB is that you can store both your metadata and vectors in the same document. When specifying the schema of a dataset and inserting your own vectors, end the field name with the suffix “_vector_” and specify the length of the vector in schema.

For example:

>>>    {
>>>        "product_image_vector_": 1024,
>>>        "product_text_description_vector_" : 128
>>>    }

These are the field types supported in our datasets: [“text”, “numeric”, “date”, “dict”, “chunks”, “vector”, “chunkvector”].

For example:

>>>    {
>>>        "product_text_description" : "text",
>>>        "price" : "numeric",
>>>        "created_date" : "date",
>>>        "product_texts_chunk_": "chunks",
>>>        "product_text_chunkvector_" : 1024
>>>    }

You don’t have to specify the schema of every single field when creating a dataset, as VecDB will automatically detect the appropriate data type for each field (vectors are identified automatically by their “_vector_” suffix). In fact, you don’t always have to use this endpoint to create a dataset, as /datasets/bulk_insert will infer and create the dataset and schema as you insert new documents.

Note

A dataset name/id can only contain lowercase letters, dashes, underscores and numbers.
“_id” is reserved as the key and id of a document.
Once a schema is set for a dataset it cannot be altered. If it has to be altered, utilise the copy dataset endpoint (see clone).

For more information about vectors check out the ‘Vectorizing’ section, services.search.vector or our blog at https://relevance.ai/blog. For more information about chunks and chunk vectors check out services.search.chunk.

Parameters

dataset_id (string) – Unique name of dataset
schema (dict) – Schema specifying the fields that are vectors and their lengths
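
The naming and schema rules above can be checked locally before calling create. The following is a minimal sketch and not part of the relevanceai package: is_valid_dataset_id and build_schema are hypothetical helpers, and the regular expression is an assumption derived from the note above.

```python
import re
from typing import Optional

# Assumption: mirrors the note above (lowercase letters, numbers, dash, underscore).
_DATASET_ID_PATTERN = re.compile(r"[a-z0-9_-]+")

# Field types listed in this section.
SUPPORTED_TYPES = {"text", "numeric", "date", "dict", "chunks", "vector", "chunkvector"}


def is_valid_dataset_id(dataset_id: str) -> bool:
    """Return True if dataset_id uses only the permitted characters."""
    return _DATASET_ID_PATTERN.fullmatch(dataset_id) is not None


def build_schema(vector_lengths: dict, field_types: Optional[dict] = None) -> dict:
    """Merge vector fields (name -> length) with typed fields (name -> type)."""
    schema = {}
    for name, length in vector_lengths.items():
        # Vector fields must carry one of the documented suffixes.
        if not (name.endswith("_vector_") or name.endswith("_chunkvector_")):
            raise ValueError(f"vector field {name!r} needs a '_vector_' or '_chunkvector_' suffix")
        if not isinstance(length, int) or length <= 0:
            raise ValueError(f"vector length for {name!r} must be a positive integer")
        schema[name] = length
    for name, field_type in (field_types or {}).items():
        if field_type not in SUPPORTED_TYPES:
            raise ValueError(f"unsupported field type {field_type!r} for {name!r}")
        schema[name] = field_type
    return schema


schema = build_schema(
    {"product_image_vector_": 1024, "product_text_chunkvector_": 1024},
    {"price": "numeric", "created_date": "date"},
)
```

The resulting dictionary has the same shape as the schema examples above and could be passed as the schema argument of create.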

list(self): List all datasets in a project that you are authorized to read/write.

list_all(self, include_schema: bool = True, include_stats: bool = True, include_metadata: bool = True, include_schema_stats: bool = False, include_vector_health: bool = False, include_active_jobs: bool = False, dataset_ids: list = [], sort_by_created_at_date: bool = False, asc: bool = False, page_size: int = 20, page: int = 1)

Returns a page of datasets that you are authorized to read/write, together with detailed information about each dataset. The information includes:

Schema - Data schema of a dataset (same as dataset.schema).
Metadata - Metadata of a dataset (same as dataset.metadata).
Stats - Statistics of number of documents and size of a dataset (same as dataset.stats).
Vector_health - Number of zero vectors stored (same as dataset.health).
Schema_stats - Fields and number of documents missing/not missing for that field (same as dataset.stats).
Active_jobs - All active jobs/tasks on the dataset.

Parameters

include_schema (bool) – Whether to return schema
include_stats (bool) – Whether to return stats
include_metadata (bool) – Whether to return metadata
include_vector_health (bool) – Whether to return vector_health
include_schema_stats (bool) – Whether to return schema_stats
include_active_jobs (bool) – Whether to return active_jobs
dataset_ids (list) – List of dataset IDs
sort_by_created_at_date (bool) – Sort by creation date. By default the newest datasets are shown first. Set asc=True to show the oldest datasets first.
asc (bool) – Whether to sort results by ascending or descending order
page_size (int) – Size of each page of results
page (int) – Page of the results
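
The page and page_size parameters suggest a simple pagination loop. The sketch below makes several assumptions: the response format of list_all is not described in this section, so fetch_page is a hypothetical callable that returns a plain list of items for a given page, and iterate_pages is a hypothetical helper rather than part of the client.

```python
from typing import Callable, Iterator, List


def iterate_pages(
    fetch_page: Callable[[int, int], List[dict]],
    page_size: int = 20,
) -> Iterator[dict]:
    """Yield items page by page until a page comes back short or empty."""
    page = 1  # pages start at 1, matching the default page=1
    while True:
        items = fetch_page(page, page_size)
        yield from items
        # A short (or empty) page means there is nothing left to fetch.
        if len(items) < page_size:
            break
        page += 1


# Stand-in data source used only for illustration.
fake_datasets = [{"dataset_id": f"ds-{i}"} for i in range(45)]


def fake_fetch(page: int, page_size: int) -> List[dict]:
    start = (page - 1) * page_size
    return fake_datasets[start:start + page_size]


all_items = list(iterate_pages(fake_fetch, page_size=20))
```

In practice, fetch_page would wrap a call to list_all and extract the list of datasets from whatever structure the response uses.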

facets(self, dataset_id, fields: list = [], date_interval: str = 'monthly', page_size: int = 5, page: int = 1, asc: bool = False)

Takes a high-level aggregation of every field and returns their unique values and frequencies. This is used to help create the filter bar for search.

Parameters

dataset_id (string) – Unique name of dataset
fields (list) – Fields to include in the facets, if [] then all
date_interval (str) – Interval for date facets
page_size (int) – Size of facet page
page (int) – Page of the results
asc (bool) – Whether to sort results by ascending or descending order

check_missing_ids(self, dataset_id, ids)

Looks up in bulk whether the given ids exist in the dataset and returns all the missing ones as a list.

Parameters

dataset_id (string) – Unique name of dataset
ids (list) – IDs of documents
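
The described behaviour can be illustrated locally. This sketch is not the client's implementation: find_missing_ids is a hypothetical helper that compares the requested ids against a collection of ids assumed to exist already.

```python
from typing import Iterable, List


def find_missing_ids(existing_ids: Iterable[str], ids: List[str]) -> List[str]:
    """Return the entries of ids that are absent from existing_ids, in order."""
    existing = set(existing_ids)  # set lookup keeps the check fast for large inputs
    return [doc_id for doc_id in ids if doc_id not in existing]


missing = find_missing_ids(["a", "b", "c"], ["b", "x", "c", "y"])
```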

insert(self, dataset_id: str, document: dict, insert_date: bool = True, overwrite: bool = True, update_schema: bool = True)

Insert a single document.

When inserting the document you can optionally specify your own id by using the field name “_id”. If it is not specified, a random id is assigned.
When inserting or specifying vectors in a document, end the field name with the suffix “_vector_”, e.g. “product_description_vector_”.
When inserting or specifying chunks in a document, end the field name with the suffix “_chunk_”, e.g. “products_chunk_”.
When inserting or specifying chunk vectors in a document’s chunks, end the field name with the suffix “_chunkvector_”, e.g. “products_chunk_.product_description_chunkvector_”.

Documentation can be found here: https://ingest-api-dev-aueast.relevance.ai/latest/documentation#operation/InsertEncode

Try to keep each batch of documents to insert under 200 MB to avoid the insert timing out.

Parameters

dataset_id (string) – Unique name of dataset
document (dict) – A document is JSON-like data in which metadata and vectors are stored. To specify the id of the document use the field ‘_id’. To specify a vector field use the suffix ‘_vector_’.
insert_date (bool) – Whether to include insert date as a field ‘insert_date_’.
overwrite (bool) – Whether to overwrite document if it exists.
update_schema (bool) – Whether the api should check the documents for vector datatype to update the schema.
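
The suffix conventions above can be made concrete with a small example. This is a sketch under stated assumptions: classify_field is a hypothetical helper, not part of the client, and representing a chunk field as a list of dictionaries is an assumption inferred from the dotted name “products_chunk_.product_description_chunkvector_”.

```python
def classify_field(name: str) -> str:
    """Classify a field name by the suffix conventions described above."""
    if name == "_id":
        return "id"
    if name.endswith("_chunkvector_"):
        return "chunkvector"
    if name.endswith("_vector_"):
        return "vector"
    if name.endswith("_chunk_"):
        return "chunk"
    return "other"


# Example document following the naming rules (values are made up).
document = {
    "_id": "product-001",
    "product_description": "A red ceramic mug",
    "product_description_vector_": [0.1, 0.2, 0.3],
    # Assumption: a chunk field holds a list of sub-documents.
    "products_chunk_": [
        {"product_description_chunkvector_": [0.4, 0.5, 0.6]},
    ],
}

field_kinds = {name: classify_field(name) for name in document}
```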

bulk_insert(self, dataset_id: str, documents: list, insert_date: bool = True, overwrite: bool = True, update_schema: bool = True, field_transformers=[], return_documents: bool = False)

Documentation can be found here: https://ingest-api-dev-aueast.relevance.ai/latest/documentation#operation/InsertEncode

When inserting a document you can optionally specify your own id by using the field name “_id”. If it is not specified, a random id is assigned.
When inserting or specifying vectors in a document, end the field name with the suffix “_vector_”, e.g. “product_description_vector_”.
When inserting or specifying chunks in a document, end the field name with the suffix “_chunk_”, e.g. “products_chunk_”.
When inserting or specifying chunk vectors in a document’s chunks, end the field name with the suffix “_chunkvector_”, e.g. “products_chunk_.product_description_chunkvector_”.
Try to keep each batch of documents to insert under 200 MB to avoid the insert timing out.

Parameters

dataset_id (string) – Unique name of dataset
documents (list) – A list of documents. A document is JSON-like data in which metadata and vectors are stored. To specify the id of a document use the field ‘_id’. To specify a vector field use the suffix ‘_vector_’.
insert_date (bool) – Whether to include insert date as a field ‘insert_date_’.
overwrite (bool) – Whether to overwrite document if it exists.
update_schema (bool) – Whether the api should check the documents for vector datatype to update the schema.
include_inserted_ids (bool) – Include the inserted IDs in the response

field_transformers (list) –

An example field_transformers object:

>>> {
>>>    "field": "string",
>>>    "output_field": "string",
>>>    "remove_html": true,
>>>    "split_sentences": true
>>> }
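
One way to follow the 200 MB guideline is to split documents into batches before inserting them. The sketch below is an illustration rather than the client's own batching logic: batch_documents is a hypothetical helper, and measuring a document by the length of its JSON serialization is an assumption about how payload size is best estimated.

```python
import json
from typing import Iterator, List

MAX_BATCH_BYTES = 200 * 1024 * 1024  # 200 MB guideline from the text above


def batch_documents(
    documents: List[dict],
    max_bytes: int = MAX_BATCH_BYTES,
) -> Iterator[List[dict]]:
    """Yield batches whose estimated JSON size does not exceed max_bytes.

    A single document larger than max_bytes is yielded in a batch of its own.
    """
    batch: List[dict] = []
    batch_bytes = 0
    for doc in documents:
        doc_bytes = len(json.dumps(doc).encode("utf-8"))
        # Close the current batch if this document would push it over the limit.
        if batch and batch_bytes + doc_bytes > max_bytes:
            yield batch
            batch, batch_bytes = [], 0
        batch.append(doc)
        batch_bytes += doc_bytes
    if batch:
        yield batch


docs = [{"_id": str(i), "text": "x" * 10} for i in range(10)]
batches = list(batch_documents(docs, max_bytes=100))
```

Each batch could then be passed as the documents argument of bulk_insert.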

delete(self, dataset_id: str, confirm: bool = False)

Delete a dataset

Parameters: dataset_id (string) – Unique name of dataset

clone(self, old_dataset: str, new_dataset: str, schema: dict = {}, rename_fields: dict = {}, remove_fields: list = [], filters: list = [])

Clone a dataset into a new dataset. You can use this to rename fields and change data schemas. This is considered a project job.

Parameters

old_dataset (string) – Unique name of old dataset to copy from
new_dataset (string) – Unique name of new dataset to copy to
schema (dict) – Schema specifying the fields that are vectors and their lengths
rename_fields (dict) – Fields to rename {‘old_field’: ‘new_field’}. Defaults to no renames
remove_fields (list) – Fields to remove [‘random_field’, ‘another_random_field’]. Defaults to no removes
filters (list) – Query for filtering the search results

search(self, query, sort_by_created_at_date: bool = False, asc: bool = False)

Search datasets by their names with a traditional keyword search.

Parameters

query (string) – Any string that is part of a dataset name.
sort_by_created_at_date (bool) – Sort by creation date. By default the newest datasets are shown first. Set asc=True to show the oldest datasets first.
asc (bool) – Whether to sort results by ascending or descending order

vectorize(self, dataset_id: str, model_id: str, fields: list = [], filters: list = [], refresh: bool = False, alias: str = 'default', chunksize: int = 20, chunk_field: str = None)

Queue the encoding of a dataset using the method given by model_id.

Parameters

dataset_id (string) – Unique name of dataset
model_id (string) – Model ID to use for vectorizing (encoding).
fields (list) – Fields to vectorize, e.g. [‘random_field’, ‘another_random_field’].
filters (list) – Filters to run against
refresh (bool) – If True, re-runs encoding on whole dataset.
alias (string) – Alias used to name a vector field. The resulting field name has the form field_{alias}_vector_
chunksize (int) – Batch for each encoding. Change at your own risk.
chunk_field (string) – The chunk field. If the chunk field is specified, the field to be encoded should not include the chunk field.

task_status(self, dataset_id: str, task_id: str)

Check the status of an existing encoding task on the given dataset.

The required task_id is returned by the original encoding request, such as datasets.vectorize.

Parameters

dataset_id (string) – Unique name of dataset
task_id (string) – The task ID of the earlier queued vectorize task
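
Because vectorize only queues the work, callers typically poll task_status until the task finishes. The following is a minimal polling sketch under stated assumptions: wait_for_task is a hypothetical helper, and since the response format of task_status is not described in this section, the caller supplies both get_status (for example a lambda wrapping task_status) and is_finished (a check on the returned status).

```python
import time
from typing import Any, Callable


def wait_for_task(
    get_status: Callable[[], Any],
    is_finished: Callable[[Any], bool],
    poll_interval: float = 5.0,
    timeout: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Poll get_status until is_finished(status) is true or timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        status = get_status()
        if is_finished(status):
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError("task did not finish before the timeout")
        sleep(poll_interval)


# Stand-in status source used only for illustration; the keys are made up.
_fake_responses = iter([{"state": "running"}, {"state": "running"}, {"state": "done"}])

final_status = wait_for_task(
    get_status=lambda: next(_fake_responses),
    is_finished=lambda status: status["state"] == "done",
    poll_interval=0.0,
    timeout=5.0,
    sleep=lambda seconds: None,  # skip real sleeping in the example
)
```

With a real client, get_status might be written as lambda: client.task_status(dataset_id, task_id), where client is a DatasetsClient instance.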
