relevanceai.api.batch.batch_insert
Batch Insert
Module Contents
Classes
BatchInsertClient – API Client
Attributes
- relevanceai.api.batch.batch_insert.BYTE_TO_MB
- relevanceai.api.batch.batch_insert.LIST_SIZE_MULTIPLIER = 3
- class relevanceai.api.batch.batch_insert.BatchInsertClient(project: str, api_key: str)
Bases:
relevanceai.api.batch.batch_retrieve.BatchRetrieveClient, relevanceai.api.endpoints.client.APIClient, relevanceai.api.batch.chunk.Chunker, doc_utils.DocUtils
API Client
- insert_documents(self, dataset_id: str, docs: list, bulk_fn: Callable = None, max_workers: int = 8, retry_chunk_mult: float = 0.5, show_progress_bar: bool = False, chunksize=0, *args, **kwargs)
Insert a list of documents with multi-threading automatically enabled.
When inserting a document, you can optionally specify your own id by using the field name “_id”. If it is not specified, a random id is assigned.
When inserting or specifying vectors in a document use the suffix (ends with) “_vector_” for the field name. e.g. “product_description_vector_”.
When inserting or specifying chunks in a document use the suffix (ends with) “_chunk_” for the field name. e.g. “products_chunk_”.
When inserting or specifying chunk vectors in a document’s chunks use the suffix (ends with) “_chunkvector_” for the field name. e.g. “products_chunk_.product_description_chunkvector_”.
Documentation can be found here: https://ingest-api-dev-aueast.relevance.ai/latest/documentation#operation/InsertEncode
- Parameters
dataset_id (string) – Unique name of dataset
docs (list) – A list of documents. A document is a JSON-like object that stores metadata and vectors. To specify the id of a document, use the field ‘_id’; to specify a vector field, use the suffix ‘_vector_’
bulk_fn (callable) – Function to apply to documents before uploading
max_workers (int) – Number of workers active for multi-threading
retry_chunk_mult (float) – Multiplier to apply to chunksize if upload fails
chunksize (int) – Number of documents to upload per worker. If not specified, it defaults to the size given by config.upload.target_chunk_mb
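The field-naming conventions above can be illustrated with plain Python dictionaries. This is a minimal sketch: the field values, the dataset name, and the helper `vector_fields` are hypothetical and not part of the library. Only the “_id”, “_vector_”, “_chunk_”, and “_chunkvector_” conventions come from the description above.

```python
# Hypothetical documents that follow the naming conventions described above.
docs = [
    {
        "_id": "product-1",  # optional; a random id is assigned if omitted
        "product_description": "Red running shoes",
        "product_description_vector_": [0.1, 0.2, 0.3],
        "products_chunk_": [
            {
                "product_description": "Red",
                "product_description_chunkvector_": [0.4, 0.5, 0.6],
            }
        ],
    },
    {
        # No "_id" here, so one would be assigned at insert time.
        "product_description": "Blue sandals",
        "product_description_vector_": [0.7, 0.8, 0.9],
    },
]


def vector_fields(doc: dict) -> list:
    """Illustrative helper (not part of relevanceai): top-level vector field names."""
    return sorted(key for key in doc if key.endswith("_vector_"))


# Assumed usage, given a client constructed with your project and API key:
# client.insert_documents("ecommerce-demo", docs)
```

The commented call at the end shows where such a list would be passed; the dataset name is a placeholder.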
- update_documents(self, dataset_id: str, docs: list, bulk_fn: Callable = None, max_workers: int = 8, retry_chunk_mult: float = 0.5, chunksize: int = 0, show_progress_bar=False, *args, **kwargs)
Update a list of documents with multi-threading automatically enabled. Edits documents by providing key-value pairs of the fields you are adding or changing. Make sure to include the “_id” in each document.
>>> from relevanceai import Client
>>> url = "https://api-aueast.relevance.ai/v1/"
>>> collection = ""
>>> project = ""
>>> api_key = ""
>>> client = Client(project, api_key)
>>> # `model` is an encoder defined elsewhere; it is not created in this example
>>> docs = client.datasets.documents.get_where(collection, select_fields=['product_name'])
>>> while len(docs['documents']) > 0:
...     docs['documents'] = model.encode_documents_in_bulk(['product_name'], docs['documents'])
...     client.update_documents(collection, docs['documents'])
...     docs = client.datasets.documents.get_where(collection, select_fields=['product_name'], cursor=docs['cursor'])
- Parameters
dataset_id (string) – Unique name of dataset
docs (list) – A list of documents. A document is a JSON-like object that stores metadata and vectors. To specify the id of a document, use the field ‘_id’; to specify a vector field, use the suffix ‘_vector_’
bulk_fn (callable) – Function to apply to documents before uploading
max_workers (int) – Number of workers active for multi-threading
retry_chunk_mult (float) – Multiplier to apply to chunksize if upload fails
chunksize (int) – Number of documents to upload per worker. If not specified, it defaults to the size given by config.upload.target_chunk_mb
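How chunksize and retry_chunk_mult might interact can be sketched in plain Python. This is an assumption based only on the parameter descriptions above, not the library's implementation. The names `upload_in_chunks` and `flaky_upload` are hypothetical, and the sketch runs sequentially rather than with multiple workers.

```python
from typing import Callable, List


def upload_in_chunks(
    docs: List[dict],
    upload: Callable[[List[dict]], None],
    chunksize: int,
    retry_chunk_mult: float = 0.5,
) -> List[dict]:
    """Upload docs chunk by chunk; on failure, retry that chunk in smaller pieces.

    Returns the documents that still failed when uploaded one at a time.
    """
    failed: List[dict] = []
    for start in range(0, len(docs), chunksize):
        chunk = docs[start:start + chunksize]
        try:
            upload(chunk)
        except Exception:
            if len(chunk) == 1:
                failed.extend(chunk)  # cannot shrink a single document further
                continue
            # Shrink the chunk size by the multiplier, always by at least one document.
            smaller = max(1, min(len(chunk) - 1, int(len(chunk) * retry_chunk_mult)))
            failed.extend(upload_in_chunks(chunk, upload, smaller, retry_chunk_mult))
    return failed


uploaded: List[dict] = []


def flaky_upload(chunk: List[dict]) -> None:
    # Stand-in for a real upload call: rejects any chunk with more than 2 documents.
    if len(chunk) > 2:
        raise RuntimeError("payload too large")
    uploaded.extend(chunk)


failed = upload_in_chunks([{"_id": str(i)} for i in range(8)], flaky_upload, chunksize=4)
```

With a multiplier of 0.5, each rejected chunk of four documents is retried as two chunks of two, so every document is eventually uploaded in this example.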
- pull_update_push(self, original_collection: str, update_function, updated_collection: str = None, log_file: str = None, updating_args: dict = {}, retrieve_chunk_size: int = 100, max_workers: int = 8, filters: list = [], select_fields: list = [], show_progress_bar: bool = True)
Loops through every document in your collection and applies a function (that is specified by you) to the documents. These documents are then uploaded into either an updated collection, or back into the original collection.
- Parameters
original_collection (string) – The dataset_id of the collection where your original documents are
logging_collection (string) – The dataset_id of the collection which logs which documents have been updated. If ‘None’, then one will be created for you.
updated_collection (string) – The dataset_id of the collection where your updated documents are uploaded into. If ‘None’, then your original collection will be updated.
update_function (function) – A function created by you that converts documents in your original collection into the updated documents. The function must accept a list of documents from the original collection as an argument, and it must return a list of updated documents.
updating_args (dict) – Additional arguments to your update_function, if they exist. They must be in the format of {‘Argument’: Value}
retrieve_chunk_size (int) – The number of documents that are received from the original collection with each loop iteration.
max_workers (int) – The number of processors you want to parallelize with
max_error – How many failed uploads before the function breaks
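An update function of the shape described above can be sketched as follows. The function `add_full_name`, its `separator` argument, and the dataset name are hypothetical. Passing updating_args to the function as keyword arguments is an assumption drawn from the {‘Argument’: Value} format, not confirmed behaviour.

```python
from typing import List


def add_full_name(docs: List[dict], separator: str = " ") -> List[dict]:
    """Hypothetical update function: takes a list of documents, returns the updated list."""
    for doc in docs:
        parts = [doc.get("first_name", ""), doc.get("last_name", "")]
        doc["full_name"] = separator.join(parts).strip()
    return docs


# Extra arguments for the update function, in {"argument_name": value} form.
updating_args = {"separator": " "}

# Try the function locally on a small sample before running it over a collection.
sample = [{"_id": "1", "first_name": "Ada", "last_name": "Lovelace"}]
updated = add_full_name(sample, **updating_args)

# Assumed usage with a client instance and a hypothetical dataset id:
# client.pull_update_push("people", add_full_name, updating_args=updating_args)
```

Keeping “_id” untouched in the returned documents lets each update be matched back to its original document.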
- pull_update_push_to_cloud(self, original_collection: str, update_function, updated_collection: str = None, logging_collection: str = None, updating_args: dict = {}, retrieve_chunk_size: int = 100, retrieve_chunk_size_failure_retry_multiplier: float = 0.5, number_of_retrieve_retries: int = 3, max_workers: int = 8, max_error: int = 1000, filters: list = [], select_fields: list = [], show_progress_bar: bool = True)
Loops through every document in your collection and applies a function (that is specified by you) to the documents. These documents are then uploaded into either an updated collection, or back into the original collection.
- Parameters
original_collection (string) – The dataset_id of the collection where your original documents are
logging_collection (string) – The dataset_id of the collection which logs which documents have been updated. If ‘None’, then one will be created for you.
updated_collection (string) – The dataset_id of the collection where your updated documents are uploaded into. If ‘None’, then your original collection will be updated.
update_function (function) – A function created by you that converts documents in your original collection into the updated documents. The function must accept a list of documents from the original collection as an argument, and it must return a list of updated documents.
updating_args (dict) – Additional arguments to your update_function, if they exist. They must be in the format of {‘Argument’: Value}
retrieve_chunk_size (int) – The number of documents that are received from the original collection with each loop iteration.
retrieve_chunk_size_failure_retry_multiplier (float) – Multiplier applied to retrieve_chunk_size before retrying when retrieving a chunk fails
max_workers (int) – The number of processors you want to parallelize with
max_error – How many failed uploads before the function breaks
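The loop described above (pull a chunk, apply the update function, push the result, and log what was done) can be illustrated with an in-memory stand-in. This is a sketch under stated assumptions, not the library's implementation: `pull_update_push_sketch` and `uppercase_titles` are hypothetical, plain Python containers replace the collections, and max_error is interpreted here as a count of documents whose update failed.

```python
from typing import Callable, Dict, List, Optional


def pull_update_push_sketch(
    source: List[dict],
    update_function: Callable[..., List[dict]],
    updating_args: Optional[dict] = None,
    retrieve_chunk_size: int = 100,
    max_error: int = 1000,
) -> Dict[str, dict]:
    """In-memory illustration of the pull -> update -> push loop."""
    updating_args = updating_args or {}
    target: Dict[str, dict] = {}  # stands in for the updated collection
    logged_ids = set()            # stands in for the logging collection
    errors = 0
    for start in range(0, len(source), retrieve_chunk_size):
        # Pull: copy the next chunk, skipping documents already logged as done.
        chunk = [
            dict(doc)
            for doc in source[start:start + retrieve_chunk_size]
            if doc["_id"] not in logged_ids
        ]
        try:
            # Update: apply the user-supplied function to the chunk.
            updated = update_function(chunk, **updating_args)
        except Exception:
            errors += len(chunk)
            if errors > max_error:
                break  # too many failures; stop the loop
            continue
        # Push: write the results and record which ids are finished.
        for doc in updated:
            target[doc["_id"]] = doc
            logged_ids.add(doc["_id"])
    return target


def uppercase_titles(docs: List[dict], field: str = "title") -> List[dict]:
    # Hypothetical update function.
    return [{**doc, field: doc[field].upper()} for doc in docs]


source = [{"_id": str(i), "title": f"doc {i}"} for i in range(5)]
result = pull_update_push_sketch(source, uppercase_titles, retrieve_chunk_size=2)
```

Recording finished ids separately is what would allow an interrupted run to resume without reprocessing documents.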
- insert_df(self, dataset_id, dataframe, *args, **kwargs)
Insert a dataframe, with each row inserted as a document.
- delete_pull_update_push_logs(self, dataset_id=False)
- _write_documents(self, insert_function, docs: list, bulk_fn: Callable = None, max_workers: int = 8, retry_chunk_mult: float = 0.5, show_progress_bar: bool = False, chunksize: int = 0)
- rename_fields(self, dataset_id: str, field_mappings: dict, retrieve_chunk_size: int = 100, max_workers: int = 8, show_progress_bar: bool = True)
Loops through every document in your collection and renames the specified fields by deleting the old field and creating a new one using the provided mapping. These documents are then uploaded into either an updated collection, or back into the original collection.
Example:
rename_fields(dataset_id, field_mappings={'a.b.d': 'a.b.c'}) => doc['a']['b']['d'] => doc['a']['b']['c']
rename_fields(dataset_id, field_mappings={'a.b': 'a.c'}) => doc['a']['b'] => doc['a']['c']
- Parameters
dataset_id (string) – The dataset_id of the collection where your original documents are
field_mappings (dict) – A dictionary in the form of {old_field_name1 : new_field_name1, …}
retrieve_chunk_size (int) – The number of documents that are received from the original collection with each loop iteration.
retrieve_chunk_size_failure_retry_multiplier (int) – If fails, retry on each chunk
max_workers (int) – The number of processors you want to parallelize with
show_progress_bar (bool) – Shows a progress bar if True
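The effect of a dotted-path mapping on a single document can be sketched in plain Python. The helper `rename_field` is hypothetical and only illustrates the delete-then-create behaviour described above; it is not the library's code.

```python
def rename_field(doc: dict, old: str, new: str) -> dict:
    """Move the value at dotted path `old` to dotted path `new`, in place."""
    *old_parents, old_key = old.split(".")
    parent = doc
    for key in old_parents:
        parent = parent.get(key)
        if not isinstance(parent, dict):
            return doc  # old path is missing; nothing to rename
    if old_key not in parent:
        return doc
    value = parent.pop(old_key)  # delete the old field

    *new_parents, new_key = new.split(".")
    target = doc
    for key in new_parents:
        target = target.setdefault(key, {})  # create intermediate levels if needed
    target[new_key] = value  # create the new field
    return doc


field_mappings = {"a.b.d": "a.b.c"}
doc = {"_id": "1", "a": {"b": {"d": 42}}}
for old_name, new_name in field_mappings.items():
    rename_field(doc, old_name, new_name)

# A second mapping that renames a field one level higher.
other = rename_field({"a": {"b": 1}}, "a.b", "a.c")
```

Documents that lack the old field are returned unchanged, which keeps the helper safe to apply across a mixed collection.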