Featured image of post Db2 vector for similarity search

Db2 vector for similarity search

Db2 features new vector capabilities that support use cases like in-database similarity search or RAG close to production data.

When I wrote about the availability of Db2 V12.1 Mod Pack2 (12.1.2), I stated the following:

To me, the most important new feature is the addition of a VECTOR type and related functionality. It allows to create, store, and query vector data close to the business data. Thereby, Db2 can be used for new AI-related use cases, but also supports the classic “what is related” queries.

Today, I am going to look deeper at the VECTOR type and share with you a Jupyter notebook I created based on my tests.

Db2 vector functionality

Starting with Db2 12.12, Db2 supports vector values. It means that you can insert, store, export vector data and query it with SQL. At its core is the VECTOR data type.

Vectors, or the vector space model, are used for representing text documents such that the distance between vectors represents the relevance between the documents. Thus, vectors are essential to use cases like information filtering, information retrieval, indexing and relevance rankings.

The following functions can be used to turn a vector string into a Db2 vector value and to retrieve it back as string:

With vector data stored in Db2, you can compute the distance between two vector values or retrieve a vector value’s dimension. The distance functions support multiple metric types.

  • VECTOR_DISTANCE: Calculates the distance between two vectors using the specified metric.
  • VECTOR_NORM: Calculates the distance between one vector and the zero vector using the specified metric.
  • VECTOR_DIMENSION_COUNT: Returns the dimension of the vector type definition, even if the value is null.

Jupyter notebook with vector data in action

Recently, I had some time to “get my hands dirty” with vector data in Db2. Based on Shaikh Quader’s blog post AI Vectors and Similarity Search - A Gentle Introduction and his demo on GitHub Semantic Search for Product Recommendation Using Db2 Vector AI, I created a demo in a Jupyter notebook showing Db2 vector functionality.

The demo runs against my Db2 12.1 Community Edition for Docker, but any Db2 12.1.2 (and up) will do. The embeddings are computed locally, based on ollama with an IBM Granite embedding model. Thus, I (can) run everything locally on my machine, without any cloud account. The following shows a section from the notebook where the vector embeddings are generated in ollama:

In a future blog post, I am going to discuss the setup in more detail. For now, you find the instructions in the GitHub repository with the Db2 vector demo.

Vectors and existing data

There are few parts of the notebook I want to point to. First, there is cell 15 with the following SQL statement. It is an ALTER TABLE to add a VECTOR-typed column to an existing table. An alternative would be to have a side table with the primary ID and the vector column. Both approaches are great to bring vector data and similarity search close to existing production data.

ALTER TABLE SHOES
ADD COLUMN EMBEDDING VECTOR({{vector_dimension}}, FLOAT32);

Similarity search for relevant data

Next, take a look at cells 22 to 24. There, you find SELECT statements with VECTOR_DISTANCE for similarity search. The first two searches are based on an already stored record as reference data. Thus, there is a sub-select to retrieve the embedding data for a specific row identified by SKU.

VECTOR_DISTANCE(
    (SELECT EMBEDDING FROM SHOES WHERE SKU = '{my_choice_sku}'), 
    EMBEDDING, 
    EUCLIDEAN
) AS DISTANCE

For the last search, I define values for product properties, encode them, generate embeddings, and finally use them as input for VECTOR_DISTANCE. Thus, relevant existing products are returned for user/customer preferences.

VECTOR_DISTANCE(
    VECTOR('{our_preferences_embeddings}', {vector_dimension}, FLOAT32),
    EMBEDDING, 
    EUCLIDEAN
) AS DISTANCE

The following screenshot shows the above SQL statement within the full context:

Conclusions

Db2 has a built-in vector data type since version 12.1.2. You can use my and other existing code and reading material to get started. The README for the demo has links for more background reading and demos.

Instead of using shoes as product, you could also think of financial products like mutual funds or ETFs. They have many properties (features) to consider, and a market with many similar financial products.

Two things before I finish 😉:

  1. My demo is similar to Shaikh’s
  2. Are we going to meet in not so distant future at IDUG EMEA in Düsseldorf…?

If you have feedback, suggestions, or questions about this post, please reach out to me on Mastodon (@data_henrik@mastodon.social) or LinkedIn.