relevanceai.api.endpoints.datasets

All dataset-related functions

Module Contents

Classes

DatasetsClient

All dataset-related functions

class relevanceai.api.endpoints.datasets.DatasetsClient(project: str, api_key: str)

Bases: relevanceai.base._Base

All dataset-related functions

schema(self, dataset_id: str)

Returns the schema of a dataset. Refer to datasets.create for different field types available in a VecDB schema.

Parameters

dataset_id (string) – Unique name of dataset

metadata(self, dataset_id: str)

Retrieves metadata about a dataset, notably its description, data source, etc.

Parameters

dataset_id (string) – Unique name of dataset

create(self, dataset_id: str, schema: dict = {})

A dataset can store documents to be searched, retrieved, filtered and aggregated (similar to Collections in MongoDB, Tables in SQL, Indexes in ElasticSearch). A core feature of VecDB is that you can store both your metadata and vectors in the same document. When specifying the schema of a dataset and inserting your own vectors, use the suffix (ends with) “_vector_” for the field name, and specify the length of the vector in dataset_schema.

For example:

>>>    {
>>>        "product_image_vector_": 1024,
>>>        "product_text_description_vector_" : 128
>>>    }

These are the field types supported in our datasets: [“text”, “numeric”, “date”, “dict”, “chunks”, “vector”, “chunkvector”].

For example:

>>>    {
>>>        "product_text_description" : "text",
>>>        "price" : "numeric",
>>>        "created_date" : "date",
>>>        "product_texts_chunk_": "chunks",
>>>        "product_text_chunkvector_" : 1024
>>>    }

You don’t have to specify the schema of every single field when creating a dataset, as VecDB will automatically detect the appropriate data type for each field (vectors are identified automatically by their “_vector_” suffix). In fact, you don’t always have to use this endpoint to create a dataset, as /datasets/bulk_insert will infer and create the dataset and schema as you insert new documents.

Note

  • A dataset name/id can only contain lowercase letters, dashes, underscores and numbers.

  • “_id” is reserved as the key and id of a document.

  • Once a schema is set for a dataset it cannot be altered. If it has to be altered, utilise the copy dataset endpoint.

For more information about vectors, check out the ‘Vectorizing’ section, services.search.vector or our blog at https://relevance.ai/blog. For more information about chunks and chunk vectors, check out services.search.chunk.

Parameters
  • dataset_id (string) – Unique name of dataset

  • schema (dict) – Schema specifying the fields that are vectors and their lengths
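As a rough illustration of the naming rules above, the sketch below checks a dataset id locally and picks the vector fields out of a schema dict before anything is sent to create. The helper names (is_valid_dataset_id, vector_fields) and the regular expression are assumptions derived from the note; the server's own validation may differ.

```python
import re

# Assumption: only lowercase letters, digits, dashes and underscores are allowed,
# as stated in the note. The server-side rule may be stricter.
_DATASET_ID_PATTERN = re.compile(r"[a-z0-9_-]+")


def is_valid_dataset_id(dataset_id: str) -> bool:
    """Return True if dataset_id uses only the characters the note allows."""
    return _DATASET_ID_PATTERN.fullmatch(dataset_id) is not None


def vector_fields(schema: dict) -> dict:
    """Pick out the fields whose names end with the "_vector_" suffix."""
    return {
        name: length
        for name, length in schema.items()
        if name.endswith("_vector_")
    }


schema = {
    "product_text_description": "text",
    "price": "numeric",
    "product_image_vector_": 1024,
}

print(is_valid_dataset_id("ecommerce-products_v2"))
print(is_valid_dataset_id("Ecommerce Products"))
print(vector_fields(schema))
```

With a configured client, the checked id and schema would then be passed on as create(dataset_id, schema).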

list(self)

List all datasets in a project that you are authorized to read/write.

list_all(self, include_schema: bool = True, include_stats: bool = True, include_metadata: bool = True, include_schema_stats: bool = False, include_vector_health: bool = False, include_active_jobs: bool = False, dataset_ids: list = [], sort_by_created_at_date: bool = False, asc: bool = False, page_size: int = 20, page: int = 1)

Returns a page of the datasets that you are authorized to read/write, together with each dataset’s associated information in detail. The information includes:

  • Schema - Data schema of a dataset (same as dataset.schema).

  • Metadata - Metadata of a dataset (same as dataset.metadata).

  • Stats - Statistics of number of documents and size of a dataset (same as dataset.stats).

  • Vector_health - Number of zero vectors stored (same as dataset.health).

  • Schema_stats - Fields and number of documents missing/not missing for that field (same as dataset.stats).

  • Active_jobs - All active jobs/tasks on the dataset.

Parameters
  • include_schema (bool) – Whether to return schema

  • include_stats (bool) – Whether to return stats

  • include_metadata (bool) – Whether to return metadata

  • include_vector_health (bool) – Whether to return vector_health

  • include_schema_stats (bool) – Whether to return schema_stats

  • include_active_jobs (bool) – Whether to return active_jobs

  • dataset_ids (list) – List of dataset IDs

  • sort_by_created_at_date (bool) – Sort by created at date. By default shows the newest datasets first. Set asc=True to get the oldest datasets first.

  • asc (bool) – Whether to sort results by ascending or descending order

  • page_size (int) – Size of each page of results

  • page (int) – Page of the results
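The page and page_size parameters suggest the usual paging loop. The sketch below is a generic helper, not part of the library: iter_pages and fake_fetch are hypothetical names, and it assumes that each page can be reduced to a plain list of items. How to extract that list from the real list_all response is not specified here.

```python
from typing import Callable, Iterator, List


def iter_pages(
    fetch_page: Callable[[int, int], List[dict]], page_size: int = 20
) -> Iterator[dict]:
    """Yield items page by page until a page comes back short or empty."""
    page = 1  # pages are 1-indexed, matching the default page=1
    while True:
        items = fetch_page(page, page_size)
        yield from items
        if len(items) < page_size:
            break
        page += 1


# Stand-in for a real call: 45 fake datasets served in slices.
_FAKE_DATASETS = [{"dataset_id": f"dataset-{i}"} for i in range(45)]


def fake_fetch(page: int, page_size: int) -> List[dict]:
    start = (page - 1) * page_size
    return _FAKE_DATASETS[start:start + page_size]


all_datasets = list(iter_pages(fake_fetch, page_size=20))
print(len(all_datasets))
```

In real use, fetch_page would wrap a call such as list_all(page=page, page_size=page_size) and return the list of datasets found in its response.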

facets(self, dataset_id, fields: list = [], date_interval: str = 'monthly', page_size: int = 5, page: int = 1, asc: bool = False)

Takes a high-level aggregation of every field and returns its unique values and their frequencies. This is used to help create the filter bar for search.

Parameters
  • dataset_id (string) – Unique name of dataset

  • fields (list) – Fields to include in the facets, if [] then all

  • date_interval (str) – Interval for date facets

  • page_size (int) – Size of facet page

  • page (int) – Page of the results

  • asc (bool) – Whether to sort results by ascending or descending order
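To make "unique values and frequencies" concrete, here is a small local model of a facet computation over in-memory documents. It is only an illustration: local_facets is a hypothetical helper, the output shape is an assumption, and sorting by frequency is a guess at what asc controls.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def local_facets(
    documents: Iterable[dict],
    fields: List[str],
    page_size: int = 5,
    asc: bool = False,
) -> Dict[str, List[Tuple[object, int]]]:
    """Count unique values per field and keep the first page_size of them."""
    docs = list(documents)
    result = {}
    for field in fields:
        # Documents that lack the field are skipped for that field.
        counts = Counter(doc[field] for doc in docs if field in doc)
        # Assumption: results are ordered by frequency, descending unless asc.
        ordered = sorted(counts.items(), key=lambda kv: kv[1], reverse=not asc)
        result[field] = ordered[:page_size]
    return result


docs = [
    {"color": "red", "size": "M"},
    {"color": "blue", "size": "M"},
    {"color": "red"},
    {"color": "green", "size": "L"},
    {"color": "red", "size": "M"},
    {"color": "blue", "size": "S"},
]

facets = local_facets(docs, ["color", "size"])
print(facets["color"])
```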

check_missing_ids(self, dataset_id, ids)

Looks up in bulk whether the given ids exist in the dataset, and returns all the missing ones as a list.

Parameters
  • dataset_id (string) – Unique name of dataset

  • ids (list) – IDs of documents
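The behaviour described above amounts to a set difference that preserves the order of the queried ids. The following is a local model of that idea, with a hypothetical missing_ids helper; it says nothing about how the server implements the lookup or whether it preserves order.

```python
from typing import Iterable, List


def missing_ids(existing_ids: Iterable[str], ids: Iterable[str]) -> List[str]:
    """Return the ids that are not present in existing_ids, in query order."""
    existing = set(existing_ids)  # set lookup keeps the check O(1) per id
    return [doc_id for doc_id in ids if doc_id not in existing]


stored = ["sku-001", "sku-002", "sku-003"]
wanted = ["sku-002", "sku-004", "sku-001", "sku-005"]

print(missing_ids(stored, wanted))
```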

insert(self, dataset_id: str, document: dict, insert_date: bool = True, overwrite: bool = True, update_schema: bool = True)

Insert a single document.

  • When inserting the document you can optionally specify your own id for it by using the field name “_id”; if not specified, a random id is assigned.

  • When inserting or specifying vectors in a document use the suffix (ends with) “_vector_” for the field name. e.g. “product_description_vector_”.

  • When inserting or specifying chunks in a document use the suffix (ends with) “_chunk_” for the field name. e.g. “products_chunk_”.

  • When inserting or specifying chunk vectors in a document’s chunks use the suffix (ends with) “_chunkvector_” for the field name. e.g. “products_chunk_.product_description_chunkvector_”.

Documentation can be found here: https://ingest-api-dev-aueast.relevance.ai/latest/documentation#operation/InsertEncode

Try to keep each batch of documents under 200 MB to avoid the insert timing out.

Parameters
  • dataset_id (string) – Unique name of dataset

  • document (dict) – A document, i.e. JSON-like data in which we store our metadata and vectors. To specify the id of the document, use the field ‘_id’; to specify a vector field, use the suffix ‘_vector_’

  • insert_date (bool) – Whether to include insert date as a field ‘insert_date_’.

  • overwrite (bool) – Whether to overwrite document if it exists.

  • update_schema (bool) – Whether the API should check the documents for vector datatype to update the schema.
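The field-naming conventions above can be checked locally before a document is inserted. The sketch below builds one document with an “_id” and a “_vector_” field and compares the vector length with a schema. check_vector_lengths is a hypothetical helper and the check itself is an assumption about what is worth validating; the API may apply different rules.

```python
from numbers import Real
from typing import List


def check_vector_lengths(document: dict, schema: dict) -> List[str]:
    """Return names of "_vector_" fields that are malformed or the wrong length."""
    problems = []
    for name, value in document.items():
        if not name.endswith("_vector_"):
            continue  # only fields with the vector suffix are checked
        expected = schema.get(name)  # None means the schema has no entry
        is_numeric_list = isinstance(value, list) and all(
            isinstance(x, Real) for x in value
        )
        if not is_numeric_list or (expected is not None and len(value) != expected):
            problems.append(name)
    return problems


document = {
    "_id": "sku-001",  # optional; a random id is assigned when omitted
    "product_description": "Blue cotton shirt",
    "product_description_vector_": [0.1, 0.2, 0.3, 0.4],
}
schema = {"product_description_vector_": 4}

print(check_vector_lengths(document, schema))
```

A document that passes the check would then be handed to insert(dataset_id, document).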

bulk_insert(self, dataset_id: str, documents: list, insert_date: bool = True, overwrite: bool = True, update_schema: bool = True, field_transformers=[], return_documents: bool = False)

Documentation can be found here: https://ingest-api-dev-aueast.relevance.ai/latest/documentation#operation/InsertEncode

  • When inserting a document you can optionally specify your own id for it by using the field name “_id”; if not specified, a random id is assigned.

  • When inserting or specifying vectors in a document use the suffix (ends with) “_vector_” for the field name. e.g. “product_description_vector_”.

  • When inserting or specifying chunks in a document use the suffix (ends with) “_chunk_” for the field name. e.g. “products_chunk_”.

  • When inserting or specifying chunk vectors in a document’s chunks use the suffix (ends with) “_chunkvector_” for the field name. e.g. “products_chunk_.product_description_chunkvector_”.

  • Try to keep each batch of documents under 200 MB to avoid the insert timing out.

Parameters
  • dataset_id (string) – Unique name of dataset

  • documents (list) – A list of documents. A document is JSON-like data in which we store our metadata and vectors. To specify the id of a document, use the field ‘_id’; to specify a vector field, use the suffix ‘_vector_’

  • insert_date (bool) – Whether to include insert date as a field ‘insert_date_’.

  • overwrite (bool) – Whether to overwrite document if it exists.

  • update_schema (bool) – Whether the API should check the documents for vector datatype to update the schema.

  • include_inserted_ids (bool) – Include the inserted IDs in the response

  • field_transformers (list) –

    An example field_transformers object:

    >>> {
    >>>    "field": "string",
    >>>    "output_field": "string",
    >>>    "remove_html": true,
    >>>    "split_sentences": true
    >>> }
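One way to follow the batch-size advice for bulk_insert is to split the documents by their serialized size before sending them. The sketch below is a hedged example: batch_by_size and MAX_BATCH_BYTES are hypothetical names, the size is estimated from the UTF-8 JSON encoding, and the limit is read as 200 MiB. The actual request size on the wire may differ.

```python
import json
from typing import Iterator, List

# Assumption: the size limit is interpreted as 200 MiB of JSON text.
MAX_BATCH_BYTES = 200 * 1024 * 1024


def batch_by_size(
    documents: List[dict], max_bytes: int = MAX_BATCH_BYTES
) -> Iterator[List[dict]]:
    """Group documents into batches whose estimated JSON size fits max_bytes."""
    batch, batch_bytes = [], 0
    for doc in documents:
        doc_bytes = len(json.dumps(doc).encode("utf-8"))
        # Start a new batch when adding this document would exceed the limit.
        if batch and batch_bytes + doc_bytes > max_bytes:
            yield batch
            batch, batch_bytes = [], 0
        batch.append(doc)
        batch_bytes += doc_bytes
    if batch:
        yield batch


docs = [{"_id": str(i), "text": "x" * 10} for i in range(10)]
batches = list(batch_by_size(docs, max_bytes=100))
print([len(b) for b in batches])
```

Each batch could then be passed to bulk_insert(dataset_id, batch). A single document larger than the limit still ends up in a batch of its own, so it would need separate handling.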
    

delete(self, dataset_id: str, confirm: bool = False)

Delete a dataset

Parameters

dataset_id (string) – Unique name of dataset

clone(self, old_dataset: str, new_dataset: str, schema: dict = {}, rename_fields: dict = {}, remove_fields: list = [], filters: list = [])

Clone a dataset into a new dataset. You can use this to rename fields and change data schemas. This is considered a project job.

Parameters
  • old_dataset (string) – Unique name of old dataset to copy from

  • new_dataset (string) – Unique name of new dataset to copy to

  • schema (dict) – Schema specifying the fields that are vectors and their lengths

  • rename_fields (dict) – Fields to rename {‘old_field’: ‘new_field’}. Defaults to no renames

  • remove_fields (list) – Fields to remove [‘random_field’, ‘another_random_field’]. Defaults to no removes

  • filters (list) – Query for filtering the search results

search(self, query, sort_by_created_at_date: bool = False, asc: bool = False)

Search datasets by their names with a traditional keyword search.

Parameters
  • query (string) – Any string that is part of a dataset name.

  • sort_by_created_at_date (bool) – Sort by created at date. By default shows the newest datasets first. Set asc=True to get the oldest datasets first.

  • asc (bool) – Whether to sort results by ascending or descending order

vectorize(self, dataset_id: str, model_id: str, fields: list = [], filters: list = [], refresh: bool = False, alias: str = 'default', chunksize: int = 20, chunk_field: str = None)

Queue the encoding of a dataset using the method given by model_id.

Parameters
  • dataset_id (string) – Unique name of dataset

  • model_id (string) – Model ID to use for vectorizing (encoding).

  • fields (list) – Fields to vectorize (encode), e.g. [‘random_field’, ‘another_random_field’].

  • filters (list) – Filters to run against

  • refresh (bool) – If True, re-runs encoding on whole dataset.

  • alias (string) – Alias used to name a vector field. Belongs in field_{alias}_vector_

  • chunksize (int) – Batch for each encoding. Change at your own risk.

  • chunk_field (string) – The chunk field. If the chunk field is specified, the field to be encoded should not include the chunk field.

task_status(self, dataset_id: str, task_id: str)

Check the status of an existing encoding task on the given dataset.

The required task_id is returned by the original encoding request, such as datasets.vectorize.

Parameters
  • dataset_id (string) – Unique name of dataset

  • task_id (string) – The task ID of the earlier queued vectorize task
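Because vectorize only queues the work, a caller typically polls task_status until the task finishes. The sketch below shows a generic polling loop. wait_for, is_done and the fake responses are hypothetical, and the fields of the real status response are not documented here, so the completion test is left to the caller.

```python
import time
from typing import Callable


def wait_for(
    check: Callable[[], dict],
    is_done: Callable[[dict], bool],
    interval: float = 5.0,
    max_attempts: int = 60,
) -> dict:
    """Call check() until is_done(status) is true or attempts run out."""
    for attempt in range(max_attempts):
        status = check()
        if is_done(status):
            return status
        if attempt < max_attempts - 1:
            time.sleep(interval)  # no sleep after the final attempt
    raise TimeoutError(f"task not finished after {max_attempts} checks")


# Stand-in for real status responses; the "status" key is an assumption.
_responses = iter([{"status": "running"}, {"status": "running"}, {"status": "done"}])

final = wait_for(
    check=lambda: next(_responses),
    is_done=lambda s: s["status"] == "done",
    interval=0.0,
)
print(final)
```

In real use, check would be something like lambda: client.task_status(dataset_id, task_id), where client is a configured DatasetsClient and task_id comes from the earlier vectorize call.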