I’ve built dozens of applications where MongoDB was the system of record, and that’s unlikely to change. Old habits die hard, after all.
However, as AI capabilities and vector search engines become more available, satisfying complicated use cases such as semantic search becomes easier.
I’m going to walk you through how to build an application that uses MongoDB as the metadata and content store (i.e., the system of record) so that whenever changes are made (inserts, updates, deletes), the corresponding vector embeddings in Pinecone are synchronized in real time.
I wrote this tutorial because it’s how we at Mixpeek keep multimodal (image, video, audio, and text) embeddings up to date in real time as our users push files to the API.
NOTE: I am not affiliated with MongoDB or Pinecone in any capacity.
Setting up the environments
To start, we need a running MongoDB deployment and a Pinecone index, plus the Python clients for both. We’ll also use an off-the-shelf sentence-similarity transformer:
    # load the sentence-embedding model
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
Setting up the real-time pipeline
Now that our database and index are set up, we can use MongoDB’s change stream API to automatically replicate values from MongoDB to Pinecone. Change streams let us watch a collection for changes and act on each one as it happens. Note that they require a replica set or sharded cluster, so a standalone mongod won’t work.
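For reference, the change events this pipeline reacts to look roughly like the dictionaries below. This is a trimmed sketch of the event shape (real events carry more fields, such as `ns` and `clusterTime`); the sample values and the `summarize_change` helper are made up for illustration and are not part of any library.

```python
# Trimmed examples of MongoDB change events (all values are illustrative).
INSERT_EVENT = {
    "_id": {"_data": "token-1"},  # resume token for this event
    "operationType": "insert",
    "documentKey": {"_id": "doc-1"},
    "fullDocument": {"_id": "doc-1", "name": "red running shoes"},
}

UPDATE_EVENT = {
    "_id": {"_data": "token-2"},
    "operationType": "update",
    "documentKey": {"_id": "doc-1"},
    # by default only the changed fields are included, not the whole document
    "updateDescription": {
        "updatedFields": {"name": "blue running shoes"},
        "removedFields": [],
    },
}

DELETE_EVENT = {
    "_id": {"_data": "token-3"},
    "operationType": "delete",
    "documentKey": {"_id": "doc-1"},  # no fullDocument: the document is gone
}


def summarize_change(change):
    """Return (operation, document id, new name or None) for a change event."""
    op = change["operationType"]
    doc_id = str(change["documentKey"]["_id"])
    if op == "insert":
        name = change["fullDocument"].get("name")
    elif op == "update":
        name = change["updateDescription"]["updatedFields"].get("name")
    else:
        name = None
    return op, doc_id, name


for event in (INSERT_EVENT, UPDATE_EVENT, DELETE_EVENT):
    print(summarize_change(event))
```

The three branches map directly onto the three cases the pipeline has to handle: embed on insert, re-embed when `name` changes, and remove the vector on delete.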
By replicating values in real time between the two, the AI application can not only perform key-value lookups and semantic searches more efficiently but also keep the vector embeddings up-to-date as the system of record changes.
Our goal is to keep the name embedding up to date in our vector search engine so that when we run a sentence-similarity search, we’re never returning obsolete data.
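On the query side, one way to use the synchronized index is to ask Pinecone for the nearest vectors and then load the matching documents from MongoDB, so results always reflect the system of record. The sketch below is an assumption about how that could look, not code from our pipeline: `semantic_search` and its parameters are hypothetical names, `index` is assumed to be a Pinecone index handle whose `query` response exposes `matches` with `id` fields, and `embed` wraps something like `model.encode(text).tolist()`.

```python
def semantic_search(query, embed, index, collection, top_k=3, parse_id=str):
    """Find the closest vectors in Pinecone, then load the documents from MongoDB.

    embed      -- callable turning text into a list of floats
    index      -- Pinecone index handle (assumed)
    collection -- pymongo collection acting as the system of record
    parse_id   -- converts a Pinecone id string back to the MongoDB _id type
                  (pass bson.ObjectId if your _id values are ObjectIds)
    """
    response = index.query(vector=embed(query), top_k=top_k)
    ids = [match["id"] for match in response["matches"]]

    # Fetch the current documents; ids with no matching document are dropped.
    found = collection.find({"_id": {"$in": [parse_id(i) for i in ids]}})
    docs = {str(doc["_id"]): doc for doc in found}

    # Preserve Pinecone's ranking order.
    return [docs[i] for i in ids if i in docs]
```

Hydrating results from MongoDB means a vector that briefly outlives its document never surfaces stale content to the caller.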
To use the change stream API, we create a cursor that watches our MongoDB collection for changes:
    # open a change stream cursor on the collection
    # (`index` is assumed to be the handle to the "myindex" Pinecone index from the setup step)
    cursor = db.collection.watch()

    for change in cursor:
        # If a new document is inserted into the collection, replicate its vector in Pinecone
        if change['operationType'] == 'insert':
            document = change['fullDocument']
            # convert the document's name into an embedding
            vector = model.encode(document['name']).tolist()
            # insert into Pinecone
            index.upsert(vectors=[(str(document['_id']), vector)])

        # If a document is updated in the collection, update its vector in Pinecone
        elif change['operationType'] == 'update':
            document_id = change['documentKey']['_id']
            updated_fields = change['updateDescription']['updatedFields']
            # if the change is in the name field, regenerate the embedding and upsert it
            if updated_fields.get('name'):
                vector = model.encode(updated_fields['name']).tolist()
                index.upsert(vectors=[(str(document_id), vector)])

        # If a document is deleted from the collection, remove its vector from Pinecone
        elif change['operationType'] == 'delete':
            index.delete(ids=[str(change['documentKey']['_id'])])
With the code above, we have a real-time pipeline that automatically inserts, updates, or deletes Pinecone vector embeddings as the underlying database changes.
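The loop above is deliberately minimal. It ignores whole-document replacements, and if the process restarts, any changes made while it was down are missed. A slightly more defensive sketch might look like the following. `full_document="updateLookup"` and `resume_after` are existing options of pymongo's `watch()`; `sync_changes`, `embed`, and `save_token` are hypothetical names, and where the resume token gets stored is left to you.

```python
def sync_changes(collection, index, embed, resume_token=None,
                 save_token=lambda token: None):
    """Mirror changes to the `name` field from MongoDB into a Pinecone index."""
    # updateLookup asks MongoDB to attach the current full document to updates
    stream = collection.watch(full_document="updateLookup",
                              resume_after=resume_token)
    for change in stream:
        op = change["operationType"]
        doc_id = str(change["documentKey"]["_id"])

        if op == "delete":
            index.delete(ids=[doc_id])
        elif op in ("insert", "replace") or (
            op == "update"
            and "name" in change["updateDescription"]["updatedFields"]
        ):
            # fullDocument can be None if the document was deleted
            # between the update and the lookup
            document = change.get("fullDocument")
            if document and document.get("name"):
                index.upsert(vectors=[(doc_id, embed(document["name"]))])

        # persist the resume token so a restart continues after this event
        save_token(change["_id"])
```

Passing the last saved token back in as `resume_token` on startup lets the stream pick up where it left off, as long as the token is still within the oplog window.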
Document search and recommendation: Combining metadata and sentence embeddings to perform more accurate and meaningful searches is useful in industries such as legal, scientific, and academic research, where large volumes of documents need to be searched and analyzed whilst maintaining factual data.
Product recommendation and personalization: By analyzing customers’ previous purchases, search history, and interactions with the website or app, you can compute a personalized vector representation of each customer’s preferences and match it against similar product vectors. This can lead to higher customer satisfaction and sales, as well as improved customer loyalty.
Fraud detection and prevention: By analyzing the semantic similarity between the transaction details and the customer’s historical patterns of behavior, you can identify suspicious transactions or activities and alert the appropriate authorities. This can help prevent financial losses, identity theft, and other forms of online fraud.
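To make the product-recommendation case concrete, here is a toy sketch: average the embeddings of what a customer bought into a single preference vector, then rank products by cosine similarity. Averaging is just one simple assumption for building that vector; the function names and three-dimensional vectors are made up, and in the pipeline above the ranking step would be a Pinecone query rather than a local sort.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def preference_vector(purchase_vectors):
    """Average the embeddings of purchased items into one customer vector."""
    count = len(purchase_vectors)
    return [sum(dim) / count for dim in zip(*purchase_vectors)]


def recommend(purchase_vectors, catalog, top_k=2):
    """Rank catalog items (name -> vector) by similarity to the customer."""
    customer = preference_vector(purchase_vectors)
    ranked = sorted(catalog, key=lambda name: cosine(customer, catalog[name]),
                    reverse=True)
    return ranked[:top_k]


# Toy vectors standing in for real embeddings.
purchases = [[1.0, 0.0, 0.0], [0.8, 0.2, 0.0]]
catalog = {
    "trail shoes": [0.9, 0.1, 0.0],
    "rain jacket": [0.1, 0.9, 0.0],
    "desk lamp": [0.0, 0.0, 1.0],
}
print(recommend(purchases, catalog))
```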
Want all these bells and whistles included?
Mixpeek is an all-encompassing multimodal search API that powers content discovery, e-commerce, real estate, and more applications. File parsing, chunking, GPU inference, and more boiled down to two API calls.