relevanceai.api.batch.batch_insert

Batch Insert

Module Contents

Classes

BatchInsertClient

API Client

Attributes

BYTE_TO_MB

LIST_SIZE_MULTIPLIER

relevanceai.api.batch.batch_insert.BYTE_TO_MB
relevanceai.api.batch.batch_insert.LIST_SIZE_MULTIPLIER = 3
class relevanceai.api.batch.batch_insert.BatchInsertClient(project: str, api_key: str)

Bases: relevanceai.api.batch.batch_retrieve.BatchRetrieveClient, relevanceai.api.endpoints.client.APIClient, relevanceai.api.batch.chunk.Chunker, doc_utils.DocUtils

API Client

insert_documents(self, dataset_id: str, docs: list, bulk_fn: Callable = None, max_workers: int = 8, retry_chunk_mult: float = 0.5, show_progress_bar: bool = False, chunksize=0, *args, **kwargs)

Insert a list of documents with multi-threading automatically enabled.

  • When inserting a document, you can optionally specify your own id by using the field name “_id”. If it is not specified, a random id is assigned.

  • When inserting or specifying vectors in a document use the suffix (ends with) “_vector_” for the field name. e.g. “product_description_vector_”.

  • When inserting or specifying chunks in a document, use the suffix (ends with) “_chunk_” for the field name, e.g. “products_chunk_”.

  • When inserting or specifying chunk vectors in a document’s chunks use the suffix (ends with) “_chunkvector_” for the field name. e.g. “products_chunk_.product_description_chunkvector_”.

Documentation can be found here: https://ingest-api-dev-aueast.relevance.ai/latest/documentation#operation/InsertEncode
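
The field-naming conventions above can be illustrated with a small example document. This is only a sketch: the field values are made up, and classify_field is a hypothetical helper written for illustration, not part of the library.

```python
# Hypothetical document that follows the suffix conventions described above.
doc = {
    "_id": "product-001",                                  # custom document id
    "product_description": "A lightweight running shoe",   # plain metadata
    "product_description_vector_": [0.12, 0.48, 0.33],     # vector field
    "products_chunk_": [                                   # chunk field
        {
            "product_description": "Breathable mesh upper",
            "product_description_chunkvector_": [0.05, 0.91, 0.27],  # chunk vector
        },
    ],
}


def classify_field(name: str) -> str:
    """Classify a field name by its suffix (illustrative helper only)."""
    if name == "_id":
        return "id"
    if name.endswith("_chunkvector_"):
        return "chunk vector"
    if name.endswith("_vector_"):
        return "vector"
    if name.endswith("_chunk_"):
        return "chunk"
    return "metadata"


kinds = {name: classify_field(name) for name in doc}
print(kinds)
```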

Parameters
  • dataset_id (string) – Unique name of dataset

  • docs (list) – A list of documents. A document is JSON-like data in which metadata and vectors are stored. To specify the id of a document, use the field ‘_id’; to specify a vector field, use the suffix ‘_vector_’

  • bulk_fn (callable) – Function to apply to documents before uploading

  • max_workers (int) – Number of workers active for multi-threading

  • retry_chunk_mult (float) – Multiplier to apply to chunksize if upload fails

  • chunksize (int) – Number of documents to upload per worker. If left unset (the signature default is 0), it defaults to the size specified in config.upload.target_chunk_mb
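
The interaction between chunksize and retry_chunk_mult can be sketched as follows. This is a minimal, single-threaded illustration under the assumption that a failed chunk is retried with a smaller chunk size; upload_with_retry and fake_upload are hypothetical names, and the library's actual retry logic may differ.

```python
def upload_with_retry(docs, upload, chunksize, retry_chunk_mult=0.5):
    """Upload docs in chunks, shrinking the chunk size whenever an upload fails.

    Returns the documents that still failed at a chunk size of 1.
    Illustrative only; not the library's implementation.
    """
    failed = []
    start = 0
    size = chunksize
    while start < len(docs):
        chunk = docs[start:start + size]
        try:
            upload(chunk)
            start += len(chunk)
        except Exception:
            new_size = int(size * retry_chunk_mult)
            if new_size < 1:
                # Cannot shrink further: record the chunk as failed and move on.
                failed.extend(chunk)
                start += len(chunk)
            else:
                size = new_size
    return failed


# Stand-in uploader that rejects any chunk larger than two documents.
sent = []

def fake_upload(chunk):
    if len(chunk) > 2:
        raise RuntimeError("payload too large")
    sent.append(list(chunk))


docs = [{"_id": str(i)} for i in range(10)]
failed = upload_with_retry(docs, fake_upload, chunksize=8)
print(len(sent), failed)
```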

update_documents(self, dataset_id: str, docs: list, bulk_fn: Callable = None, max_workers: int = 8, retry_chunk_mult: float = 0.5, chunksize: int = 0, show_progress_bar=False, *args, **kwargs)

Update a list of documents with multi-threading automatically enabled. Edits documents by providing key-value pairs of the fields you are adding or changing. Make sure to include the “_id” in each document.

>>> from relevanceai import Client
>>> url = "https://api-aueast.relevance.ai/v1/"
>>> collection = ""
>>> project = ""
>>> api_key = ""
>>> client = Client(project, api_key)
>>> # `model` is assumed to be an encoder defined elsewhere that provides encode_documents_in_bulk
>>> docs = client.datasets.documents.get_where(collection, select_fields=['product_name'])
>>> while len(docs['documents']) > 0:
...     docs['documents'] = model.encode_documents_in_bulk(['product_name'], docs['documents'])
...     client.update_documents(collection, docs['documents'])
...     docs = client.datasets.documents.get_where(collection, select_fields=['product_name'], cursor=docs['cursor'])

Parameters
  • dataset_id (string) – Unique name of dataset

  • docs (list) – A list of documents. A document is JSON-like data in which metadata and vectors are stored. To specify the id of a document, use the field ‘_id’; to specify a vector field, use the suffix ‘_vector_’

  • bulk_fn (callable) – Function to apply to documents before uploading

  • max_workers (int) – Number of workers active for multi-threading

  • retry_chunk_mult (float) – Multiplier to apply to chunksize if upload fails

  • chunksize (int) – Number of documents to upload per worker. If left unset (the signature default is 0), it defaults to the size specified in config.upload.target_chunk_mb
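
Because an update only needs the “_id” plus the fields being added or changed, the payload can be trimmed before it is passed as docs. The helper below is a hypothetical sketch (build_updates is not a library function) showing one way to prepare such a list.

```python
def build_updates(docs, fields):
    """Keep only '_id' plus the listed fields from each document.

    Illustrative helper: raises if a document has no '_id', since an
    update cannot be matched to a stored document without it.
    """
    updates = []
    for doc in docs:
        if "_id" not in doc:
            raise ValueError("every document passed to an update needs an '_id'")
        update = {"_id": doc["_id"]}
        update.update({f: doc[f] for f in fields if f in doc})
        updates.append(update)
    return updates


stored = [
    {"_id": "a", "title": "Red shoe", "price": 30, "title_vector_": [0.1, 0.2]},
    {"_id": "b", "title": "Blue hat", "price": 12},
]
updates = build_updates(stored, ["title_vector_"])
print(updates)
```

The resulting list could then be passed as the docs argument of update_documents.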

pull_update_push(self, original_collection: str, update_function, updated_collection: str = None, log_file: str = None, updating_args: dict = {}, retrieve_chunk_size: int = 100, max_workers: int = 8, filters: list = [], select_fields: list = [], show_progress_bar: bool = True)

Loops through every document in your collection and applies a function that you supply. The resulting documents are then uploaded either into an updated collection or back into the original collection.

Parameters
  • original_collection (string) – The dataset_id of the collection where your original documents are

  • logging_collection (string) – The dataset_id of the collection which logs which documents have been updated. If ‘None’, then one will be created for you.

  • updated_collection (string) – The dataset_id of the collection where your updated documents are uploaded into. If ‘None’, then your original collection will be updated.

  • update_function (function) – A function you write that converts documents from your original collection into the updated documents. It must accept a list of documents from the original collection as a parameter and return a list of updated documents.

  • updating_args (dict) – Additional arguments to your update_function, if they exist. They must be in the format of {‘Argument’: Value}

  • retrieve_chunk_size (int) – The number of documents that are received from the original collection with each loop iteration.

  • max_workers (int) – The number of processors you want to parallelize with

  • max_error – Number of failed uploads allowed before the function stops
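
An update_function of the shape described above might look like the following sketch. The function name, the title field, and the assumption that updating_args is unpacked as keyword arguments are all illustrative; only the list-in, list-out contract comes from the description.

```python
def normalise_titles(documents, field="title"):
    """Take a list of documents and return a list of updated documents.

    Each returned document carries its '_id' and the cleaned-up field.
    Documents without the field are skipped.
    """
    updated = []
    for doc in documents:
        if field not in doc:
            continue
        updated.append({"_id": doc["_id"], field: doc[field].strip().lower()})
    return updated


# Assumed usage: extra arguments are supplied as {'Argument': Value}.
updating_args = {"field": "title"}
batch = [
    {"_id": "1", "title": "  Running SHOE "},
    {"_id": "2", "colour": "red"},
]
result = normalise_titles(batch, **updating_args)
print(result)
```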

pull_update_push_to_cloud(self, original_collection: str, update_function, updated_collection: str = None, logging_collection: str = None, updating_args: dict = {}, retrieve_chunk_size: int = 100, retrieve_chunk_size_failure_retry_multiplier: float = 0.5, number_of_retrieve_retries: int = 3, max_workers: int = 8, max_error: int = 1000, filters: list = [], select_fields: list = [], show_progress_bar: bool = True)

Loops through every document in your collection and applies a function that you supply. The resulting documents are then uploaded either into an updated collection or back into the original collection.

Parameters
  • original_collection (string) – The dataset_id of the collection where your original documents are

  • logging_collection (string) – The dataset_id of the collection which logs which documents have been updated. If ‘None’, then one will be created for you.

  • updated_collection (string) – The dataset_id of the collection where your updated documents are uploaded into. If ‘None’, then your original collection will be updated.

  • update_function (function) – A function you write that converts documents from your original collection into the updated documents. It must accept a list of documents from the original collection as a parameter and return a list of updated documents.

  • updating_args (dict) – Additional arguments to your update_function, if they exist. They must be in the format of {‘Argument’: Value}

  • retrieve_chunk_size (int) – The number of documents that are received from the original collection with each loop iteration.

  • retrieve_chunk_size_failure_retry_multiplier (float) – Multiplier applied to retrieve_chunk_size when retrieving a chunk fails and is retried

  • max_workers (int) – The number of processors you want to parallelize with

  • max_error (int) – Number of failed uploads allowed before the function stops
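
One plausible reading of retrieve_chunk_size, retrieve_chunk_size_failure_retry_multiplier and number_of_retrieve_retries is sketched below: each failed retrieval is retried with a smaller chunk size until the retries run out. The function retrieve_with_retries and the fetch callable are hypothetical, and the library may apply these settings differently.

```python
def retrieve_with_retries(fetch, chunk_size=100, multiplier=0.5, retries=3):
    """Call fetch(size), shrinking size by `multiplier` after each failure.

    Re-raises the last error once `retries` retries have been used.
    Illustrative only; not the library's implementation.
    """
    size = chunk_size
    for attempt in range(retries + 1):
        try:
            return fetch(size)
        except Exception:
            if attempt == retries:
                raise
            size = max(1, int(size * multiplier))


# Stand-in retrieval that fails for requests above 30 documents.
attempted = []

def fake_fetch(size):
    attempted.append(size)
    if size > 30:
        raise TimeoutError("request too large")
    return [{"_id": str(i)} for i in range(size)]


documents = retrieve_with_retries(fake_fetch)
print(attempted, len(documents))
```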

insert_df(self, dataset_id, dataframe, *args, **kwargs)

Insert a dataframe, with each row inserted as a document.
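
As a rough sketch of the row-to-document idea, the helper below turns a list of row records into documents and drops missing (NaN) values. It assumes each row becomes one document; records_to_docs is a hypothetical helper, and whether insert_df drops missing values is an assumption. With pandas, such a list of records can be obtained from dataframe.to_dict(orient="records").

```python
import math


def records_to_docs(records):
    """Convert row records into documents, omitting NaN values."""
    docs = []
    for row in records:
        doc = {
            key: value
            for key, value in row.items()
            if not (isinstance(value, float) and math.isnan(value))
        }
        docs.append(doc)
    return docs


rows = [
    {"_id": "1", "title": "Red shoe", "price": 30.0},
    {"_id": "2", "title": "Blue hat", "price": float("nan")},
]
docs_from_rows = records_to_docs(rows)
print(docs_from_rows)
```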

delete_pull_update_push_logs(self, dataset_id=False)

_write_documents(self, insert_function, docs: list, bulk_fn: Callable = None, max_workers: int = 8, retry_chunk_mult: float = 0.5, show_progress_bar: bool = False, chunksize: int = 0)

rename_fields(self, dataset_id: str, field_mappings: dict, retrieve_chunk_size: int = 100, max_workers: int = 8, show_progress_bar: bool = True)

Loops through every document in your collection and renames the specified fields by deleting the old field and creating a new one according to the provided mapping. The updated documents are then uploaded back into the same collection.

Example:

  rename_fields(dataset_id, field_mappings={'a.b.d': 'a.b.c'}): doc['a']['b']['d'] => doc['a']['b']['c']

  rename_fields(dataset_id, field_mappings={'a.b': 'a.c'}): doc['a']['b'] => doc['a']['c']

Parameters
  • dataset_id (string) – The dataset_id of the collection where your original documents are

  • field_mappings (dict) – A dictionary of the form {old_field_name1 : new_field_name1, …}

  • retrieve_chunk_size (int) – The number of documents that are received from the original collection with each loop iteration.

  • retrieve_chunk_size_failure_retry_multiplier (float) – Multiplier applied to retrieve_chunk_size when retrieving a chunk fails and is retried

  • max_workers (int) – The number of processors you want to parallelize with

  • show_progress_bar (bool) – Shows a progress bar if True
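
The effect of a dotted field mapping on a single document can be sketched in plain Python. The helper rename_field below is hypothetical and only mirrors the behaviour described for rename_fields (delete the old field, create the new one); it assumes intermediate values along the new path are dictionaries.

```python
def rename_field(doc, old, new):
    """Move the value at dotted path `old` to dotted path `new`, in place.

    Documents that do not contain the old path are returned unchanged.
    """
    old_keys = old.split(".")
    parent = doc
    for key in old_keys[:-1]:
        parent = parent.get(key) if isinstance(parent, dict) else None
        if parent is None:
            return doc
    if not isinstance(parent, dict) or old_keys[-1] not in parent:
        return doc

    value = parent.pop(old_keys[-1])  # delete the old field

    new_keys = new.split(".")
    target = doc
    for key in new_keys[:-1]:
        target = target.setdefault(key, {})  # create missing parents
    target[new_keys[-1]] = value  # create the new field
    return doc


field_mappings = {"a.b.d": "a.b.c"}
document = {"a": {"b": {"d": 1}}}
for old_name, new_name in field_mappings.items():
    rename_field(document, old_name, new_name)
print(document)
```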