Featured image of post Db2 experiments with vectors and similarity search

Db2 experiments with vectors and similarity search

Some experiments with Db2 vector capabilities that demonstrate the power of SQL, but also show that careful selection of embedding models is needed

Over the past days, I got back to using the vector feature in Db2 for geographic search. What would happen, so I asked myself, when instead of just latitute and longitude I would use population as a third dimension? How would the similarity search turn out? The experiments proofed to be quite interesting. And it turned the focus on the aspect that the way embeddings are generated, turning objects into vector data, has a significant impact on result quality and usefulness.

Not all cities are equal or similar

The screenshot above shows Markham, Ontaria, Canada and its surrounding cities as found by a basic similarity search in Db2. The vector consists of the latitude and longitude. All neighbors, regardless of city size, are returned. Here is the same query with the result shown for Munich in Germany:

Map with neighboring cities to Munich, simple approach

Munich is significantly larger than its direct neighbors, both in area and population size. So, could we add population as third dimension to the vector and use VECTOR_DISTANCE to find cities with a similar population as neighbors to Munich?

Naive approach

I started my experiments with the naive approach. I made up the vector of latitude, longitude, population. Later, I used 1/10 of the population, but the result did not change. Maybe (un-)suprising, using vector distance to find similar cities to Munich, ordered by distance, returned the following:

NAME                                                         COUNTRY                        POPULATION  DISTANCE                
------------------------------------------------------------ ------------------------------ ----------- ------------------------
Munich                                                       Germany                            2606021   +0.00000000000000E+000
Gulou                                                        China                              2606000   +1.09974954638774E+002
Attadappa                                                    India                              2615266   +9.26908497906217E+002
Birmingham                                                   United Kingdom                     2590363   +1.56606401538943E+003
Valencia                                                     Venezuela                          2585202   +2.08386580708759E+003
Damascus                                                     Syria                              2584771   +2.12519420984858E+003
Chattogram                                                   Bangladesh                         2581643   +2.43945711888750E+003
Zibo                                                         China                              2631647   +2.56423677079286E+003
Huai’an                                                      China                              2632788   +2.67819961677703E+003
Weifang                                                      China                              2636154   +3.01494188943090E+003
Longyan                                                      China                              2640000   +3.39971378910805E+003
Baotou                                                       China                              2650364   +4.43509760476645E+003
Brazzaville                                                  Congo (Brazzaville)                2557100   +4.89228206759222E+003
Yangjiang                                                    China                              2555600   +5.04306815654393E+003
Chaozhou                                                     China                              2656600   +5.05914994030683E+003
Maracaibo                                                    Venezuela                          2658355   +5.23379586785489E+003
Sanaa                                                        Yemen                              2545000   +6.10217534491475E+003
Shanwei                                                      China                              2672819   +6.67985465432196E+003
Gāzipura                                                     Bangladesh                         2674697   +6.86749503505122E+003
Barmhān Kalān                                                India                              2537000   +6.90233374714442E+003

  20 record(s) selected.

With latitude and longitude having rather small values compared to population, there is an overemphasis on the latter (“bias towards population equality”?). Results are close to the actual population of Munich, but no longer are geographically close.

Would we need to transform (or encode) population before creating the vectors? Let’s give it a try…

Population control, sort of

In my next steps, I first applied square root (SQRT) to the population data. Later, I utilized the natural logarithm (LN).

Below are the returned cities for SQRT(). They are geographically closer to Munich, but still “all over the world”. The number of residents is still close to Munich’s population:

NAME                                                         COUNTRY                        POPULATION  DISTANCE                
------------------------------------------------------------ ------------------------------ ----------- ------------------------
Munich                                                       Germany                            2606021   +0.00000000000000E+000
Birmingham                                                   United Kingdom                     2590363   +1.49696789337801E+001
Damascus                                                     Syria                              2584771   +2.94775581209598E+001
Hamburg                                                      Germany                            2496600   +3.47149959809833E+001
Rome                                                         Italy                              2748109   +4.38807439818582E+001
Sanaa                                                        Yemen                              2545000   +5.00138998365395E+001
Brazzaville                                                  Congo (Brazzaville)                2557100   +5.46955119796239E+001
Stuttgart                                                    Germany                            2787724   +5.53858301721772E+001
Bucharest                                                    Romania                            2412530   +6.28987131188135E+001
Beirut                                                       Lebanon                            2421354   +6.45616562868369E+001
Mecca                                                        Saudi Arabia                       2427924   +6.82881424663490E+001
Yaoundé                                                      Cameroon                           2440462   +6.83840306415624E+001
Barmhān Kalān                                                India                              2537000   +7.12061847735144E+001
Omdurman                                                     Sudan                              2805396   +7.18584713163960E+001
Attadappa                                                    India                              2615266   +7.34271652308136E+001
Algiers                                                      Algeria                            2364230   +7.80185284132648E+001
Mirzāpur                                                     India                              2496970   +8.20615487236226E+001
Chattogram                                                   Bangladesh                         2581643   +8.46422764020678E+001
Gāzipura                                                     Bangladesh                         2674697   +8.51213363304276E+001
Valencia                                                     Venezuela                          2585202   +8.83993711346483E+001

  20 record(s) selected.

Now, looking at the results when LN() is used instead, all returned cities are in Germany and “neighboring” countries. The population, however, deviates much more, but it still returns large cities only.

NAME                                                         COUNTRY                        POPULATION  DISTANCE                
------------------------------------------------------------ ------------------------------ ----------- ------------------------
Munich                                                       Germany                            2606021   +0.00000000000000E+000
Nuremberg                                                    Germany                             526091   +2.13090380450457E+000
Augsburg                                                     Germany                             303150   +2.26725912680422E+000
Stuttgart                                                    Germany                            2787724   +2.47995256588007E+000
Regensburg                                                   Germany                             159465   +2.97261354556007E+000
Ingolstadt                                                   Germany                             142308   +2.97790711645387E+000
Innsbruck                                                    Austria                             130585   +3.12248362476045E+000
Salzburg                                                     Austria                             155021   +3.19978164226177E+000
Lochau                                                       Austria                             200000   +3.21042901778803E+000
Fürth                                                        Germany                             132032   +3.31555761161626E+000
Ulm                                                          Germany                             129942   +3.40490381459675E+000
Erlangen                                                     Germany                             117806   +3.46274565636570E+000
Prague                                                       Czechia                            1384732   +3.50775200123016E+000
Plzeň                                                        Czechia                             185599   +3.58067411048600E+000
Zürich                                                       Switzerland                         448664   +3.58915440458200E+000
Leipzig                                                      Germany                             619879   +3.59975830782204E+000
Bolzano                                                      Italy                               106107   +3.60324391785282E+000
Verona                                                       Italy                               255588   +3.60760002624422E+000
Landshut                                                     Germany                              75272   +3.61339028640625E+000
Milan                                                        Italy                              1354196   +3.63989821744654E+000

  20 record(s) selected.

Map with neighboring cities to Munich, logarithm applied

I tried to adjust query results by applying the predicate “country=‘Germany’” in the WHERE clause. On first sight, the result looks ok. However, it ignores the fact that Munich is close to countries like Austria and Czechia and both have qualifying cities.

NAME                                                         COUNTRY                        POPULATION  DISTANCE                
------------------------------------------------------------ ------------------------------ ----------- ------------------------
Munich                                                       Germany                            2606021   +0.00000000000000E+000
Nuremberg                                                    Germany                             526091   +2.13090380450457E+000
Augsburg                                                     Germany                             303150   +2.26725912680422E+000
Stuttgart                                                    Germany                            2787724   +2.47995256588007E+000
Regensburg                                                   Germany                             159465   +2.97261354556007E+000
Ingolstadt                                                   Germany                             142308   +2.97790711645387E+000
Fürth                                                        Germany                             132032   +3.31555761161626E+000
Ulm                                                          Germany                             129942   +3.40490381459675E+000
Erlangen                                                     Germany                             117806   +3.46274565636570E+000
Leipzig                                                      Germany                             619879   +3.59975830782204E+000
Landshut                                                     Germany                              75272   +3.61339028640625E+000
Frankfurt                                                    Germany                             775790   +3.70535022195604E+000
Rosenheim                                                    Germany                              65192   +3.74132320060203E+000
Würzburg                                                     Germany                             128246   +3.80443740624146E+000
Chemnitz                                                     Germany                             250681   +3.81440066178152E+000
Kempten                                                      Germany                              70056   +3.85022296238472E+000
Reutlingen                                                   Germany                             118528   +3.90280944423094E+000
Karlsruhe                                                    Germany                             309964   +3.91768644328843E+000
Dresden                                                      Germany                             566222   +3.93704990214691E+000
Heilbronn                                                    Germany                             130093   +3.94597627841490E+000

  20 record(s) selected.

The power of SQL? Use BETWEEN with percentage

Instead of encoding population in the vector, maybe it could be considered in a more traditional predicate? How could I express the search criteria to only consider neighboring cities of similar size? One way could be to apply the BETWEEN predicate with a percentage range for the population. Here is the SQL statement for a range within +/- 20% the size of Munich’s population.

select name, 
       country, 
       population,
       vector_distance((select coord from cities where name='Munich'),
                       coord,
                       euclidean) as distance 
from cities 
where (select population from cities where name='Munich') between population*0.8 and population*1.2
order by distance
limit 20;

The query returns the following, maybe surprising result:

NAME                                                         COUNTRY                        POPULATION  DISTANCE                
------------------------------------------------------------ ------------------------------ ----------- ------------------------
Munich                                                       Germany                            2606021   +0.00000000000000E+000
Stuttgart                                                    Germany                            2787724   +2.47903647478988E+000
Vienna                                                       Austria                            2223236   +4.79802206189633E+000
Hamburg                                                      Germany                            2496600   +5.63699956816997E+000
Rome                                                         Italy                              2748109   +6.31004215500355E+000
Birmingham                                                   United Kingdom                     2590363   +1.41598127691243E+001
Algiers                                                      Algeria                            2364230   +1.42167799104080E+001
Bucharest                                                    Romania                            2412530   +1.49938656038498E+001
Athens                                                       Greece                             3059764   +1.58362674261946E+001
İzmir                                                        Turkey                             2965900   +1.83493615311709E+001
Kyiv                                                         Ukraine                            2952301   +1.90888905608648E+001
Bursa                                                        Turkey                             3101833   +1.92001279810137E+001
Konya                                                        Turkey                             2320241   +2.32947708743434E+001
Casablanca                                                   Morocco                            3215935   +2.40898978805504E+001
Beirut                                                       Lebanon                            2421354   +2.78467879825984E+001
Damascus                                                     Syria                              2584771   +2.87302901819551E+001
Ouagadougou                                                  Burkina Faso                       3063271   +3.80931722447174E+001
Omdurman                                                     Sudan                              2805396   +3.85949018415254E+001
Mecca                                                        Saudi Arabia                       2427924   +3.88820763462854E+001
Baku                                                         Azerbaijan                         2300500   +3.90413674447026E+001

  20 record(s) selected.

Conclusions

My experiments clearly show that an approach of “just encode, then query” do not work well (Hello, vibe coders!). When designing the vector or choosing an embedding model, it requires careful selection and quality control. What also became clear is that storing vectors and traditional relational data side by side is quite powerful. By combining similarity search with regular SQL predicates on other columns, I was able to built more tailored queries and to tune results.

If you have feedback, suggestions, or questions about this post, please reach out to me on Mastodon (@data_henrik@mastodon.social) or LinkedIn.