`relevanceai.api.batch.batch_insert`

Batch Insert

Module Contents

Classes

BatchInsertClient

API Client

Attributes

`BYTE_TO_MB`
`LIST_SIZE_MULTIPLIER`

relevanceai.api.batch.batch_insert.BYTE_TO_MB

relevanceai.api.batch.batch_insert.LIST_SIZE_MULTIPLIER = 3

class relevanceai.api.batch.batch_insert.BatchInsertClient(project: str, api_key: str)

Bases: relevanceai.api.batch.batch_retrieve.BatchRetrieveClient, relevanceai.api.endpoints.client.APIClient, relevanceai.api.batch.chunk.Chunker, doc_utils.DocUtils

API Client

insert_documents(self, dataset_id: str, docs: list, bulk_fn: Callable = None, max_workers: int = 8, retry_chunk_mult: float = 0.5, show_progress_bar: bool = False, chunksize=0, *args, **kwargs)

Insert a list of documents with multi-threading automatically enabled.

When inserting a document, you can optionally specify your own id by using the field name “_id”. If no “_id” is specified, a random id is assigned.
When inserting or specifying vectors in a document, use the suffix “_vector_” for the field name (the name must end with it), e.g. “product_description_vector_”.
When inserting or specifying chunks in a document, use the suffix “_chunk_” for the field name, e.g. “products_chunk_”.
When inserting or specifying chunk vectors in a document’s chunks, use the suffix “_chunkvector_” for the field name, e.g. “products_chunk_.product_description_chunkvector_”.
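
The field-naming rules above can be illustrated with a small sketch. The helper `classify_field` and the label "metadata" are hypothetical and are not part of the relevanceai package; the sample document is made up for illustration only.

```python
def classify_field(field_name: str) -> str:
    """Classify a field name by the suffix conventions described above.

    Hypothetical helper for illustration; not part of relevanceai.
    """
    # For nested names such as "products_chunk_.x_chunkvector_",
    # only the last path segment decides the kind of field.
    leaf = field_name.split(".")[-1]
    if leaf == "_id":
        return "id"
    if leaf.endswith("_chunkvector_"):
        return "chunk vector"
    if leaf.endswith("_vector_"):
        return "vector"
    if leaf.endswith("_chunk_"):
        return "chunk"
    return "metadata"


# A made-up document that follows the conventions.
doc = {
    "_id": "product-1",
    "product_description": "A red cotton shirt",
    "product_description_vector_": [0.1, 0.2, 0.3],
    "products_chunk_": [
        {"product_description_chunkvector_": [0.4, 0.5, 0.6]},
    ],
}

for name in doc:
    print(name, "->", classify_field(name))
```

A document without an “_id” key would simply omit the first entry, in which case an id is assigned on insert.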

Documentation can be found here: https://ingest-api-dev-aueast.relevance.ai/latest/documentation#operation/InsertEncode

Parameters

dataset_id (string) – Unique name of dataset
docs (list) – A list of documents. A document is a JSON-like dictionary that stores metadata and vectors. To specify the id of a document, use the field ‘_id’; to specify a vector field, use the suffix ‘_vector_’
bulk_fn (callable) – Function to apply to documents before uploading
max_workers (int) – Number of workers active for multi-threading
retry_chunk_mult (float) – Multiplier to apply to chunksize if an upload fails
chunksize (int) – Number of documents to upload per worker. If not specified, it defaults to the size given by config.upload.target_chunk_mb
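
One way to picture how chunksize and retry_chunk_mult could interact is the single-threaded sketch below. It is an assumption-laden illustration, not the library's implementation: `upload_in_chunks` and `fake_upload` are hypothetical names, the retry happens only once, and the multi-threading controlled by max_workers is left out.

```python
from typing import Callable, List


def upload_in_chunks(
    docs: List[dict],
    upload: Callable[[List[dict]], None],
    chunksize: int,
    retry_chunk_mult: float = 0.5,
) -> List[dict]:
    """Upload docs chunk by chunk; retry a failed chunk once in smaller pieces.

    Returns the documents that still failed after the retry.
    """
    failed: List[dict] = []
    for start in range(0, len(docs), chunksize):
        chunk = docs[start:start + chunksize]
        try:
            upload(chunk)
        except Exception:
            # Shrink the chunk size by the multiplier, never below 1.
            smaller = max(1, int(chunksize * retry_chunk_mult))
            for sub_start in range(0, len(chunk), smaller):
                sub_chunk = chunk[sub_start:sub_start + smaller]
                try:
                    upload(sub_chunk)
                except Exception:
                    failed.extend(sub_chunk)
    return failed


# Stand-in for a real upload call: rejects any chunk larger than 2 documents.
uploaded: List[dict] = []


def fake_upload(chunk: List[dict]) -> None:
    if len(chunk) > 2:
        raise ValueError("chunk too large")
    uploaded.extend(chunk)


docs = [{"_id": str(i)} for i in range(5)]
failed = upload_in_chunks(docs, fake_upload, chunksize=4, retry_chunk_mult=0.5)
print(len(uploaded), len(failed))
```

Here the first chunk of four documents is rejected, then succeeds as two chunks of two, so all five documents are uploaded and nothing is reported as failed.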

update_documents(self, dataset_id: str, docs: list, bulk_fn: Callable = None, max_workers: int = 8, retry_chunk_mult: float = 0.5, chunksize: int = 0, show_progress_bar=False, *args, **kwargs)

Update a list of documents with multi-threading automatically enabled. Each document provides the key-value pairs of the fields you are adding or changing. Make sure to include the “_id” in each document.

>>> from relevanceai import Client
>>> url = "https://api-aueast.relevance.ai/v1/"
>>> collection = ""
>>> project = ""
>>> api_key = ""
>>> client = Client(project, api_key)
>>> # `model` is not defined in this example; it is assumed to be an encoder
>>> # object that provides encode_documents_in_bulk.
>>> docs = client.datasets.documents.get_where(collection, select_fields=['product_name'])
>>> while len(docs['documents']) > 0:
...     # Encode the current page, write it back, then fetch the next page.
...     docs['documents'] = model.encode_documents_in_bulk(['product_name'], docs['documents'])
...     client.update_documents(collection, docs['documents'])
...     docs = client.datasets.documents.get_where(collection, select_fields=['product_name'], cursor=docs['cursor'])

Parameters

dataset_id (string) – Unique name of dataset
docs (list) – A list of documents. A document is a JSON-like dictionary that stores metadata and vectors. To specify the id of a document, use the field ‘_id’; to specify a vector field, use the suffix ‘_vector_’
bulk_fn (callable) – Function to apply to documents before uploading
max_workers (int) – Number of workers active for multi-threading
retry_chunk_mult (float) – Multiplier to apply to chunksize if an upload fails
chunksize (int) – Number of documents to upload per worker. If not specified, it defaults to the size given by config.upload.target_chunk_mb
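
Because an update only needs the “_id” plus the fields being added or changed, the docs list can be quite small. The sketch below shows one way to build it; `build_updates` and the sample field names are hypothetical and not part of relevanceai.

```python
from typing import Dict, List


def build_updates(changes: Dict[str, dict]) -> List[dict]:
    """Turn {document id: changed fields} into a list of update documents."""
    updates = []
    for doc_id, fields in changes.items():
        update = {"_id": doc_id}  # every update document carries its "_id"
        update.update(fields)     # plus only the fields being added or changed
        updates.append(update)
    return updates


changes = {
    "product-1": {"price": 19.99},
    "product-2": {"in_stock": False, "price": 5.0},
}
update_docs = build_updates(changes)
print(update_docs)

# With a real client, this list would be passed as `docs`, for example:
# client.update_documents(dataset_id, update_docs)
```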

pull_update_push(self, original_collection: str, update_function, updated_collection: str = None, log_file: str = None, updating_args: dict = {}, retrieve_chunk_size: int = 100, max_workers: int = 8, filters: list = [], select_fields: list = [], show_progress_bar: bool = True)

Loops through every document in your collection and applies a function (that is specified by you) to the documents. These documents are then uploaded into either an updated collection, or back into the original collection.

Parameters

original_collection (string) – The dataset_id of the collection where your original documents are
logging_collection (string) – The dataset_id of the collection which logs which documents have been updated. If ‘None’, then one will be created for you.
updated_collection (string) – The dataset_id of the collection where your updated documents are uploaded into. If ‘None’, then your original collection will be updated.
update_function (function) – A function created by you that converts documents in your original collection into the updated documents. The function must accept a parameter that takes a list of documents from the original collection, and it must return a list of updated documents.
updating_args (dict) – Additional arguments to your update_function, if they exist. They must be in the format of {‘Argument’: Value}
retrieve_chunk_size (int) – The number of documents that are received from the original collection with each loop iteration.
max_workers (int) – The number of processors you want to parallelize with
max_error – The number of failed uploads allowed before the function stops
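
A minimal sketch of an update_function that fits the description above: it takes a list of documents and returns the updated list. The function name, the field names, and `tax_rate` are made up for illustration, and the sketch assumes that updating_args is forwarded to the function as keyword arguments.

```python
from typing import List


def add_price_with_tax(docs: List[dict], tax_rate: float = 0.1) -> List[dict]:
    """Example update_function: list of documents in, list of updated documents out."""
    for doc in docs:
        if "price" in doc:
            doc["price_with_tax"] = round(doc["price"] * (1 + tax_rate), 2)
    return docs


# Extra arguments in the {'Argument': Value} format used by updating_args.
updating_args = {"tax_rate": 0.2}

sample = [{"_id": "a", "price": 10.0}, {"_id": "b"}]
updated = add_price_with_tax(sample, **updating_args)
print(updated)

# With a real client this would be passed along the lines of:
# client.pull_update_push(original_collection, add_price_with_tax,
#                         updating_args=updating_args)
```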

pull_update_push_to_cloud(self, original_collection: str, update_function, updated_collection: str = None, logging_collection: str = None, updating_args: dict = {}, retrieve_chunk_size: int = 100, retrieve_chunk_size_failure_retry_multiplier: float = 0.5, number_of_retrieve_retries: int = 3, max_workers: int = 8, max_error: int = 1000, filters: list = [], select_fields: list = [], show_progress_bar: bool = True)

Loops through every document in your collection and applies a function (that is specified by you) to the documents. These documents are then uploaded into either an updated collection, or back into the original collection.

Parameters

original_collection (string) – The dataset_id of the collection where your original documents are
logging_collection (string) – The dataset_id of the collection which logs which documents have been updated. If ‘None’, then one will be created for you.
updated_collection (string) – The dataset_id of the collection where your updated documents are uploaded into. If ‘None’, then your original collection will be updated.
update_function (function) – A function created by you that converts documents in your original collection into the updated documents. The function must accept a parameter that takes a list of documents from the original collection, and it must return a list of updated documents.
updating_args (dict) – Additional arguments to your update_function, if they exist. They must be in the format of {‘Argument’: Value}
retrieve_chunk_size (int) – The number of documents that are received from the original collection with each loop iteration.
retrieve_chunk_size_failure_retry_multiplier (float) – Multiplier applied to retrieve_chunk_size when a retrieval fails and the chunk is retried
max_workers (int) – The number of processors you want to parallelize with
max_error – The number of failed uploads allowed before the function stops
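
The overall pull, update, push loop can be pictured with the in-memory sketch below. It is only an illustration under simplifying assumptions: collections are plain dictionaries keyed by “_id”, and logging, retries, max_error handling and multi-threading are all omitted. The names `pull_update_push_sketch` and `uppercase_title` are hypothetical.

```python
from typing import Callable, Dict, List, Optional


def pull_update_push_sketch(
    original: Dict[str, dict],
    update_function: Callable[..., List[dict]],
    updating_args: dict,
    retrieve_chunk_size: int = 100,
    updated: Optional[Dict[str, dict]] = None,
) -> Dict[str, dict]:
    """Apply update_function chunk by chunk.

    Results go into `updated`, or back into `original` when `updated` is None.
    """
    target = original if updated is None else updated
    ids = list(original)
    for start in range(0, len(ids), retrieve_chunk_size):
        # Pull: copy one chunk of documents out of the original collection.
        chunk = [dict(original[i]) for i in ids[start:start + retrieve_chunk_size]]
        # Update: run the user-supplied function on the chunk.
        new_docs = update_function(chunk, **updating_args)
        # Push: write the results to the target collection, keyed by "_id".
        for doc in new_docs:
            target[doc["_id"]] = doc
    return target


def uppercase_title(docs: List[dict], field: str = "title") -> List[dict]:
    for doc in docs:
        doc[field] = doc[field].upper()
    return docs


store = {str(i): {"_id": str(i), "title": f"item {i}"} for i in range(5)}
result = pull_update_push_sketch(
    store, uppercase_title, {"field": "title"}, retrieve_chunk_size=2
)
print(result["3"]["title"])
```

Passing a separate dictionary as `updated` leaves the original untouched, which mirrors the choice between an updated collection and the original collection.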

insert_df(self, dataset_id, dataframe, *args, **kwargs)

Insert a dataframe, with each row inserted as a document.

delete_pull_update_push_logs(self, dataset_id=False)

_write_documents(self, insert_function, docs: list, bulk_fn: Callable = None, max_workers: int = 8, retry_chunk_mult: float = 0.5, show_progress_bar: bool = False, chunksize: int = 0)

rename_fields(self, dataset_id: str, field_mappings: dict, retrieve_chunk_size: int = 100, max_workers: int = 8, show_progress_bar: bool = True)

Loops through every document in your collection and renames the specified fields by deleting the old field and creating a new one using the provided mapping. These documents are then uploaded into either an updated collection, or back into the original collection.

Example:

rename_fields(dataset_id, field_mappings={'a.b.d': 'a.b.c'}) => doc['a']['b']['d'] becomes doc['a']['b']['c']
rename_fields(dataset_id, field_mappings={'a.b': 'a.c'}) => doc['a']['b'] becomes doc['a']['c']

Parameters

dataset_id (string) – The dataset_id of the collection where your original documents are
field_mappings (dict) – A dictionary in the form of {old_field_name1 : new_field_name1, …}
retrieve_chunk_size (int) – The number of documents that are received from the original collection with each loop iteration.
retrieve_chunk_size_failure_retry_multiplier (float) – Multiplier applied to retrieve_chunk_size when a retrieval fails and the chunk is retried
max_workers (int) – The number of processors you want to parallelize with
show_progress_bar (bool) – Shows a progress bar if True
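
The effect of a single mapping on one document can be sketched as follows. This is a local illustration of the delete-then-create behaviour described above, not the library's code; `rename_field` is a hypothetical helper, and it raises KeyError if the old path does not exist.

```python
def rename_field(doc: dict, old: str, new: str) -> dict:
    """Move the value at dotted path `old` to dotted path `new`, in place."""
    # Walk down to the parent of the old field and remove the field.
    *old_parents, old_key = old.split(".")
    parent = doc
    for key in old_parents:
        parent = parent[key]
    value = parent.pop(old_key)

    # Walk down to the parent of the new field, creating levels as needed.
    *new_parents, new_key = new.split(".")
    target = doc
    for key in new_parents:
        target = target.setdefault(key, {})
    target[new_key] = value
    return doc


doc = {"a": {"b": {"d": 1}}}
rename_field(doc, "a.b.d", "a.b.c")
print(doc)  # {'a': {'b': {'c': 1}}}

doc2 = {"a": {"b": {"d": 1}}}
rename_field(doc2, "a.b", "a.c")
print(doc2)  # {'a': {'c': {'d': 1}}}
```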
