Vector Databases: From Embeddings to Applications

Cahit Barkin Ozer

12 min read · Mar 14, 2024

A summary of Deeplearning.ai's "Vector Databases: from Embeddings to Applications" course.

LLMs fall short when a new or external data source is required. Retrieval-augmented generation (RAG) was created to supply this retrieval step, and RAG relies on vector databases. Vector databases let you search by semantics (meaning) rather than by syntax (letter combinations).

[Figure: comparison of database types]

Weaviate is the sponsor of this section. Weaviate is an open-source, AI-native vector database that helps developers build intuitive and reliable AI-powered applications.

[Figure: comparison of several vector databases]

Course contents

Embeddings
Distance metrics
ANN: trading recall for speed
HNSW
CRUD operations
Objects + vectors
Inverted index: filtered search
ANN search over dense embeddings
Sparse search
Hybrid search
Vector database applications in industry

How are vector representations of data obtained?

Typically, embedding models such as large language models convert images, videos, text, or any other data source into arrays of floating-point numbers called vectors. This process is called embedding. Embeddings are useful because they place each item in a semantic (meaning) space, so the model can tell where the item belongs. For example, an image, a video, and a text about football lie very close to one another, basketball content is somewhat farther away, and swimming content is farther still.

How can we measure the distance between these image and sentence embeddings?

There are many ways to calculate the distance between two vectors. Here we cover four distance metrics that you are likely to encounter in the context of vector databases:

Euclidean Distance (L2): The length of the shortest path between two points or vectors.

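# Euclidean Distance
# zero_A, zero_B, and one are embedding vectors (NumPy arrays) assumed to be defined earlier
import numpy as np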
L2 = [(zero_A[i] - zero_B[i])**2 for i in range(len(zero_A))]
L2 = np.sqrt(np.array(L2).sum())
print(L2)

#An alternative way of doing this
np.linalg.norm((zero_A - zero_B), ord=2)

#Calculate L2 distances
print("Distance zeroA-zeroB:", np.linalg.norm((zero_A - zero_B), ord=2))
print("Distance zeroA-one:  ", np.linalg.norm((zero_A - one), ord=2))
print("Distance zeroB-one:  ", np.linalg.norm((zero_B - one), ord=2))

Manhattan Distance (L1): The distance between two points when you are restricted to moving along one axis at a time.

# Manhattan Distance
L1 = [zero_A[i] - zero_B[i] for i in range(len(zero_A))]
L1 = np.abs(L1).sum()

print(L1)

#an alternative way of doing this is
np.linalg.norm((zero_A - zero_B), ord=1)

#Calculate L1 distances
print("Distance zeroA-zeroB:", np.linalg.norm((zero_A - zero_B), ord=1))
print("Distance zeroA-one:  ", np.linalg.norm((zero_A - one), ord=1))
print("Distance zeroB-one:  ", np.linalg.norm((zero_B - one), ord=1))

Dot Product: Measures the magnitude of the projection of one vector onto the other.

# Dot Product
np.dot(zero_A,zero_B)

#Calculate Dot products
print("Dot product zeroA-zeroB:", np.dot(zero_A, zero_B))
print("Dot product zeroA-one:  ", np.dot(zero_A, one))
print("Dot product zeroB-one:  ", np.dot(zero_B, one))

Cosine Distance: Measures the difference in direction between vectors.

# Cosine Distance
cosine = 1 - np.dot(zero_A,zero_B)/(np.linalg.norm(zero_A)*np.linalg.norm(zero_B))
print(f"{cosine:.6f}")

zero_A/zero_B

# Cosine Distance function
def cosine_distance(vec1,vec2):
  cosine = 1 - (np.dot(vec1, vec2)/(np.linalg.norm(vec1)*np.linalg.norm(vec2)))
  return cosine

#Cosine Distance
print(f"Distance zeroA-zeroB: {cosine_distance(zero_A, zero_B): .6f}")
print(f"Distance zeroA-one:   {cosine_distance(zero_A, one): .6f}")
print(f"Distance zeroB-one:   {cosine_distance(zero_B, one): .6f}")
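
The snippets above rely on embeddings (zero_A, zero_B, one) that come from the course notebook. As a self-contained sketch, the same four metrics can be tried on small toy vectors; the values of a, b, and c below are made up purely so the arithmetic is easy to check by hand.

```python
import numpy as np

# Toy 3-dimensional vectors (hypothetical values, not course data)
a = np.array([1.0, 2.0, 2.0])
b = np.array([2.0, 4.0, 4.0])    # same direction as a, twice as long
c = np.array([2.0, -1.0, 0.0])   # orthogonal to a


def euclidean(u, v):
    """L2 distance: straight-line distance between u and v."""
    return float(np.linalg.norm(u - v, ord=2))


def manhattan(u, v):
    """L1 distance: sum of absolute per-axis differences."""
    return float(np.linalg.norm(u - v, ord=1))


def dot(u, v):
    """Dot product: larger means more aligned (and/or longer) vectors."""
    return float(np.dot(u, v))


def cosine_distance(u, v):
    """1 - cosine similarity: depends only on direction, not on length."""
    return 1.0 - float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


print("L2(a, b):    ", euclidean(a, b))        # 3.0
print("L1(a, b):    ", manhattan(a, b))        # 5.0
print("dot(a, b):   ", dot(a, b))              # 18.0
print("cosine(a, b):", cosine_distance(a, b))  # 0.0, same direction
print("cosine(a, c):", cosine_distance(a, c))  # 1.0, orthogonal
```

Note how a and b are far apart under L1 and L2 but have zero cosine distance: cosine distance ignores vector length and only compares direction.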

Searching for similar vectors

Brute-force k-nearest neighbors (KNN) is the baseline search algorithm for vector databases.

KNN simply looks around the query vector and returns its k nearest neighbors. Here, closeness is computed with the Euclidean distance described above.

Brute-force search algorithm:

Measure the L2 distance between the query and every vector.
Sort all of these distances.
Return the top k matches. These are the semantically most similar points.

Brute-force search has O(dN) runtime complexity per query, where N is the number of stored vectors and d is their dimensionality.
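
The three steps above can be sketched directly in NumPy. This is a minimal illustration with made-up 2-D points, not the course's code; the function name brute_force_knn is hypothetical.

```python
import numpy as np


def brute_force_knn(vectors, query, k):
    """Return (indices, distances) of the k stored vectors closest to the query."""
    # Step 1: L2 distance between the query and every stored vector, O(d * N)
    distances = np.linalg.norm(vectors - query, axis=1)
    # Step 2: sort all distances (ascending)
    order = np.argsort(distances)
    # Step 3: keep the k best matches
    top = order[:k]
    return top, distances[top]


# Four toy vectors and a query (hypothetical values)
vectors = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]])
query = np.array([0.9, 0.1])

indices, dists = brute_force_knn(vectors, query, k=2)
print(indices)  # [1 0]: the point (1, 0) is closest, then (0, 0)
print(dists)
```

Every query touches every stored vector, which is why the cost grows linearly with the size of the collection.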

K-nearest neighbors implementation:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors
import time
np.random.seed(42)

# Generate 20 data points with 2 dimensions
X = np.random.rand(20,2)

# Display Embeddings
n = range(len(X))

fig, ax = plt.subplots()
ax.scatter(X[:,0], X[:,1], label='Embeddings')
ax.legend()

for i, txt in enumerate(n):
    ax.annotate(txt, (X[i,0], X[i,1]))

k = 4

neigh = NearestNeighbors(n_neighbors=k, algorithm='brute', metric='euclidean')
neigh.fit(X)

# Display Query with data
n = range(len(X))

fig, ax = plt.subplots()
ax.scatter(X[:,0], X[:,1])
ax.scatter(0.45,0.2, c='red',label='Query')
ax.legend()

for i, txt in enumerate(n):
    ax.annotate(txt, (X[i,0], X[i,1]))

neighbours = neigh.kneighbors([[0.45,0.2]], k, return_distance=True)
print(neighbours)

t0 = time.time()
neighbours = neigh.kneighbors([[0.45,0.2]], k, return_distance=True)
t1 = time.time()

query_time = t1-t0
print(f"Runtime: {query_time: .4f} seconds")

def speed_test(count):
    # generate random objects
    data = np.random.rand(count,2)
    
    # prepare brute force index
    k=4
    neigh = NearestNeighbors(n_neighbors=k, algorithm='brute', metric='euclidean')
    neigh.fit(data)

    # measure time for a brute force query
    t0 = time.time()
    neighbours = neigh.kneighbors([[0.45,0.2]], k, return_distance=True)
    t1 = time.time()

    total_time = t1-t0
    print (f"Runtime: {total_time: .4f}")

    return total_time

Results:

# Brute force examples (seconds)
time20k = speed_test(20_000) # 0.0030
time200k = speed_test(200_000) # 0.0095
time2m = speed_test(2_000_000) # 0.0144
time20m = speed_test(20_000_000) # 0.1171
time200m = speed_test(200_000_000) # 12.7856

Brute-force KNN over 768-dimensional embeddings:

documents = 1000
dimensions = 768

embeddings = np.random.randn(documents, dimensions) # 1000 documents, 768-dimensional embeddings
embeddings = embeddings / np.sqrt((embeddings**2).sum(1, keepdims=True)) # L2 normalize the rows, as is common

query = np.random.randn(768) # the query vector
query = query / np.sqrt((query**2).sum()) # normalize query

# kNN
t0 = time.time()
# Calculate Dot Product between the query and all data items
similarities = embeddings.dot(query)
# Sort results
sorted_ix = np.argsort(-similarities)
t1 = time.time()

total = t1-t0
print(f"Runtime for dim={dimensions}, documents_n={documents}: {np.round(total,3)} seconds")

print("Top 5 results:")
for k in sorted_ix[:5]:
    print(f"Point: {k}, Similarity: {similarities[k]}")

n_runs = [1_000, 10_000, 100_000, 500_000]

for n in n_runs:
    embeddings = np.random.randn(n, dimensions) #768-dimensional embeddings
    query = np.random.randn(768) # the query vector
    
    t0 = time.time()
    similarities = embeddings.dot(query)
    sorted_ix = np.argsort(-similarities)
    t1 = time.time()

    total = t1-t0
    print(f"Runtime for 1 query with dim={dimensions}, documents_n={n}: {np.round(total,3)} seconds")

Result:

print (f"To run 1,000 queries: {total * 1_000/60 : .2f} minutes") 
# To run 1,000 queries: 31.38 minutes

This search time is unacceptable.

Approximate Nearest Neighbors (ANN)

ANN algorithms give up a small amount of accuracy compared with KNN in exchange for a large gain in performance.
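
To make the trade-off concrete, here is a minimal sketch that compares exact search with a deliberately simple approximation: the data is split into cells around randomly chosen centroids, and only the query's cell is searched. This cell-based scheme is an assumption chosen for brevity, not the HNSW index discussed in this post, and all names and sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.random((2000, 2))      # toy dataset of 2-D vectors
query = np.array([0.45, 0.2])
k = 4


def exact_knn(query, k):
    """Brute force: compare the query with every vector."""
    dists = np.linalg.norm(data - query, axis=1)
    return np.argsort(dists)[:k]


# Index build: pick random data points as centroids and assign every
# vector to the cell of its nearest centroid.
n_cells = 16
centroids = data[rng.choice(len(data), n_cells, replace=False)]
to_centroids = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
cell_of = np.argmin(to_centroids, axis=1)


def approx_knn(query, k):
    """Approximate search: only look inside the cell nearest to the query."""
    cell = np.argmin(np.linalg.norm(centroids - query, axis=1))
    candidates = np.flatnonzero(cell_of == cell)
    dists = np.linalg.norm(data[candidates] - query, axis=1)
    return candidates[np.argsort(dists)[:k]]


exact = exact_knn(query, k)
approx = approx_knn(query, k)

# Recall: the fraction of the true top-k that the approximate search found
recall = len(set(exact.tolist()) & set(approx.tolist())) / k
print("exact :", exact)
print("approx:", approx)
print("recall:", recall)
```

The approximate search compares the query with roughly one sixteenth of the data; the price is that a true neighbor sitting just across a cell boundary can be missed, which is what recall measures.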

ANN implementation

from random import random, randint
from math import floor, log
import networkx as nx
import numpy as np
import matplotlib as mtplt
from matplotlib import pyplot as plt
from utils import *  # helper module from the course (provides nearest_neigbor, construct_HNSW, search_HNSW)

vec_num = 40 # Number of vectors (nodes)
dim = 2 # Dimension. All the graph plots assume dim 2; if you change it, comment out the plots.
m_nearest_neighbor = 2 # M nearest neighbors used in the construction of the Navigable Small World (NSW)

vec_pos = np.random.uniform(size=(vec_num, dim))

Query vector:

## Query
query_vec = [0.5, 0.5]

nodes = []
nodes.append(("Q",{"pos": query_vec}))

G_query = nx.Graph()
G_query.add_nodes_from(nodes)

print("nodes = ", nodes, flush=True)

pos_query=nx.get_node_attributes(G_query,'pos')

Brute-force part:

(G_lin, G_best) = nearest_neigbor(vec_pos,query_vec)

pos_lin=nx.get_node_attributes(G_lin,'pos')
pos_best=nx.get_node_attributes(G_best,'pos')

fig, axs = plt.subplots()

nx.draw(G_lin, pos_lin, with_labels=True, node_size=150, node_color=[[0.8,0.8,1]], width=0.0, font_size=7, ax = axs)
nx.draw(G_query, pos_query, with_labels=True, node_size=200, node_color=[[0.5,0,0]], font_color='white', width=0.5, font_size=7, font_weight='bold', ax = axs)
nx.draw(G_best, pos_best, with_labels=True, node_size=200, node_color=[[0.85,0.7,0.2]], width=0.5, font_size=7, font_weight='bold', ax = axs)

Hierarchical Navigable Small World (HNSW)

HNSW aims to solve this problem (slow brute-force search) and builds on the small-world phenomenon seen in social networks, where everyone is closely connected. The idea is that, on average, any two people are linked through about six degrees of separation. In other words, you probably know someone who knows someone who knows the person you are looking for.
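
The small-world idea translates into a greedy walk over a graph: start at some entry node and keep hopping to whichever neighbor is closer to the query, until no neighbor improves. The sketch below is a simplified, single-layer illustration under stated assumptions: it links each point to its m nearest points instead of building the graph incrementally as NSW does, and every name and parameter is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)
points = rng.random((40, 2))   # 40 toy 2-D vectors
m = 3                          # links created per node (hypothetical value)

# Build an undirected proximity graph: link every node to its m nearest nodes.
pairwise = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
neighbors = {i: set() for i in range(len(points))}
for i in range(len(points)):
    for j in np.argsort(pairwise[i])[1:m + 1]:   # position 0 is the node itself
        neighbors[i].add(int(j))
        neighbors[int(j)].add(i)


def greedy_search(query, entry):
    """Walk the graph, always moving to the neighbor closest to the query."""
    current = entry
    current_dist = np.linalg.norm(points[current] - query)
    path = [current]
    while True:
        best, best_dist = current, current_dist
        for n in neighbors[current]:
            d = np.linalg.norm(points[n] - query)
            if d < best_dist:
                best, best_dist = n, d
        if best == current:            # no neighbor is closer: local minimum
            return current, path
        current, current_dist = best, best_dist
        path.append(current)


query = np.array([0.5, 0.5])
result, path = greedy_search(query, entry=0)
print("path through the graph:", path)
print("node returned:", result)
```

The walk only evaluates the neighbors of the nodes it visits, not the whole dataset. It can stop in a local minimum that is not the true nearest neighbor, which is the accuracy that ANN trades away; HNSW adds a hierarchy of layers on top of this idea.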

HNSW implementation

HNSW construction:

GraphArray = construct_HNSW(vec_pos,m_nearest_neighbor)

for layer_i in range(len(GraphArray)-1,-1,-1):
    fig, axs = plt.subplots()

    print("layer_i = ", layer_i)
        
    if layer_i>0:
        pos_layer_0 = nx.get_node_attributes(GraphArray[0],'pos')
        nx.draw(GraphArray[0], pos_layer_0, with_labels=True, node_size=120, node_color=[[0.9,0.9,1]], width=0.0, font_size=6, font_color=(0.65,0.65,0.65), ax = axs)

    pos_layer_i = nx.get_node_attributes(GraphArray[layer_i],'pos')
    nx.draw(GraphArray[layer_i], pos_layer_i, with_labels=True, node_size=150, node_color=[[0.7,0.7,1]], width=0.5, font_size=7, ax = axs)
    nx.draw(G_query, pos_query, with_labels=True, node_size=200, node_color=[[0.8,0,0]], width=0.5, font_size=7, font_weight='bold', ax = axs)
    nx.draw(G_best, pos_best, with_labels=True, node_size=200, node_color=[[0.85,0.7,0.2]], width=0.5, font_size=7, font_weight='bold', ax = axs)
    plt.show()

HNSW Search

(SearchPathGraphArray, EntryGraphArray) = search_HNSW(GraphArray,G_query)

for layer_i in range(len(GraphArray)-1,-1,-1):
    fig, axs = plt.subplots()

    print("layer_i = ", layer_i)
    G_path_layer = SearchPathGraphArray[layer_i]
    pos_path = nx.get_node_attributes(G_path_layer,'pos')
    G_entry = EntryGraphArray[layer_i]
    pos_entry = nx.get_node_attributes(G_entry,'pos')

    if layer_i>0:
        pos_layer_0 = nx.get_node_attributes(GraphArray[0],'pos')
        nx.draw(GraphArray[0], pos_layer_0, with_labels=True, node_size=120, node_color=[[0.9,0.9,1]], width=0.0, font_size=6, font_color=(0.65,0.65,0.65), ax = axs)

    pos_layer_i = nx.get_node_attributes(GraphArray[layer_i],'pos')
    nx.draw(GraphArray[layer_i], pos_layer_i, with_labels=True, node_size=100, node_color=[[0.7,0.7,1]], width=0.5, font_size=6, ax = axs)
    nx.draw(G_path_layer, pos_path, with_labels=True, node_size=110, node_color=[[0.8,1,0.8]], width=0.5, font_size=6, ax = axs)
    nx.draw(G_query, pos_query, with_labels=True, node_size=80, node_color=[[0.8,0,0]], width=0.5, font_size=7, ax = axs)
    nx.draw(G_best, pos_best, with_labels=True, node_size=70, node_color=[[0.85,0.7,0.2]], width=0.5, font_size=7, ax = axs)
    nx.draw(G_entry, pos_entry, with_labels=True, node_size=80, node_color=[[0.1,0.9,0.1]], width=0.5, font_size=7, ax = axs)
    plt.show()

Pure vector search with the Weaviate vector database

import weaviate, json
from weaviate import EmbeddedOptions

client = weaviate.Client(
    embedded_options=EmbeddedOptions(),
)

client.is_ready()

# resetting the schema. CAUTION: This will delete your collection 
# if client.schema.exists("MyCollection"):
#     client.schema.delete_class("MyCollection")

schema = {
    "class": "MyCollection",
    "vectorizer": "none",
    "vectorIndexConfig": {
        "distance": "cosine" # let's use cosine distance
    },
}

client.schema.create_class(schema)

print("Successfully created the schema.")

Data to import:

data = [
   {
      "title": "First Object",
      "foo": 99, 
      "vector": [0.1, 0.1, 0.1, 0.1, 0.1, 0.1]
   },
   {
      "title": "Second Object",
      "foo": 77, 
      "vector": [0.2, 0.3, 0.4, 0.5, 0.6, 0.7]
   },
   {
      "title": "Third Object",
      "foo": 55, 
      "vector": [0.3, 0.1, -0.1, -0.3, -0.5, -0.7]
   },
   {
      "title": "Fourth Object",
      "foo": 33, 
      "vector": [0.4, 0.41, 0.42, 0.43, 0.44, 0.45]
   },
   {
      "title": "Fifth Object",
      "foo": 11,
      "vector": [0.5, 0.5, 0, 0, 0, 0]
   },
]

Batch-importing the data and verifying the insert:

client.batch.configure(batch_size=10)  # Configure batch

# Batch import all objects
# yes batch is an overkill for 5 objects, but it is recommended for large volumes of data
with client.batch as batch:
  for item in data:

      properties = {
         "title": item["title"],
         "foo": item["foo"],
      }

      # the call that performs data insert (use the batch object from the with-statement)
      batch.add_data_object(
         class_name="MyCollection",
         data_object=properties,
         vector=item["vector"] # your vector embeddings go here
      )

# Check number of objects
response = (
    client.query
    .aggregate("MyCollection")
    .with_meta_count()
    .do()
)

print(response)

Querying with Weaviate: vector search (vector embeddings):

response = (
    client.query
    .get("MyCollection", ["title"])
    .with_near_vector({
        "vector": [-0.012, 0.021, -0.23, -0.42, 0.5, 0.5]
    })
    .with_limit(2) # limit the output to only 2
    .do()
)

result = response["data"]["Get"]["MyCollection"]
print(json.dumps(result, indent=2))

response = (
    client.query
    .get("MyCollection", ["title"])
    .with_near_vector({
        "vector": [-0.012, 0.021, -0.23, -0.42, 0.5, 0.5]
    })
    .with_limit(2) # limit the output to only 2
    .with_additional(["distance", "vector", "id"])
    .do()
)

result = response["data"]["Get"]["MyCollection"]
print(json.dumps(result, indent=2))

Vector search with filters:

response = (
    client.query
    .get("MyCollection", ["title", "foo"])
    .with_near_vector({
        "vector": [-0.012, 0.021, -0.23, -0.42, 0.5, 0.5]
    })
    .with_additional(["distance", "id"]) # output the distance of the query vector to the objects in the database
    .with_where({
        "path": ["foo"],
        "operator": "GreaterThan",
        "valueNumber": 44
    })
    .with_limit(2) # limit the output to only 2
    .do()
)

result = response["data"]["Get"]["MyCollection"]
print(json.dumps(result, indent=2))

nearObject example:

response = (
    client.query
    .get("MyCollection", ["title"])
    .with_near_object({ # the id of the search object
        "id": result[0]['_additional']['id']
    })
    .with_limit(3)
    .with_additional(["distance"])
    .do()
)

result = response["data"]["Get"]["MyCollection"]
print(json.dumps(result, indent=2))

HNSW runtime

The probability that a vector appears in a higher layer decreases exponentially with the layer number. As a result, as the number of data points grows, the number of comparisons needed to perform a vector search grows only logarithmically.

The HNSW algorithm has O(log(N)) runtime complexity.
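
The exponential decay can be seen in how HNSW assigns a top layer to each inserted vector: the layer is drawn as floor(-ln(u) * mL) with u uniform in (0, 1] and mL = 1 / ln(M). The sketch below simulates only that assignment; M = 16 and the sample size are assumed values for illustration, and the helper name random_level is hypothetical.

```python
import math
import random
from collections import Counter


def random_level(m, rng):
    """Draw the top layer of a new vector: floor(-ln(u) * mL), mL = 1 / ln(m)."""
    m_l = 1.0 / math.log(m)
    u = 1.0 - rng.random()          # uniform in (0, 1], avoids log(0)
    return int(-math.log(u) * m_l)


rng = random.Random(0)
M = 16                              # assumed number of links per node
levels = Counter(random_level(M, rng) for _ in range(100_000))

for level in sorted(levels):
    print(f"layer {level}: {levels[level]} vectors")
```

With this rule the chance of reaching layer l or above is M^(-l), so each layer holds roughly 1/M as many vectors as the one below it. A search that starts at the sparse top layer and descends therefore needs only a logarithmic number of hops.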

How to use the Weaviate vector database

Download the sample data

import requests
import json

# Download the data
resp = requests.get('https://raw.githubusercontent.com/weaviate-tutorials/quickstart/main/data/jeopardy_tiny.json')
data = json.loads(resp.text)  # Load data

# Parse the JSON and preview it
print(type(data), len(data))

def json_print(data):
    print(json.dumps(data, indent=2))

json_print(data[0])

Create an embedded instance of the Weaviate vector database

import weaviate, os
from weaviate import EmbeddedOptions
import openai

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file
openai.api_key = os.environ['OPENAI_API_KEY']

client = weaviate.Client(
    embedded_options=EmbeddedOptions(),
    additional_headers={
        "X-OpenAI-BaseURL": os.environ['OPENAI_API_BASE'],
        "X-OpenAI-Api-Key": openai.api_key  # Replace this with your actual key
    }
)
print(f"Client created? {client.is_ready()}")
json_print(client.get_meta())

Create the Question collection

# resetting the schema. CAUTION: This will delete your collection 
if client.schema.exists("Question"):
    client.schema.delete_class("Question")
class_obj = {
    "class": "Question",
    "vectorizer": "text2vec-openai",  # Use OpenAI as the vectorizer
    "moduleConfig": {
        "text2vec-openai": {
            "model": "ada",
            "modelVersion": "002",
            "type": "text",
            "baseURL": os.environ["OPENAI_API_BASE"]
        }
    }
}

client.schema.create_class(class_obj)

Load the sample data and generate vector embeddings

# reminder for the data structure
json_print(data[0])

with client.batch.configure(batch_size=5) as batch:
    for i, d in enumerate(data):  # Batch import data
        
        print(f"importing question: {i+1}")
        
        properties = {
            "answer": d["Answer"],
            "question": d["Question"],
            "category": d["Category"],
        }
        
        batch.add_data_object(
            data_object=properties,
            class_name="Question"
        )

count = client.query.aggregate("Question").with_meta_count().do()
json_print(count)

Extract the vector that represents each question

# write a query to extract the vector for a question
result = (client.query
          .get("Question", ["category", "question", "answer"])
          .with_additional("vector")
          .with_limit(1)
          .do())

json_print(result)

Query time

What is the distance between the query "biology" and the returned objects?

response = (
    client.query
    .get("Question",["question","answer","category"])
    .with_near_text({"concepts": "biology"})
    .with_additional('distance')
    .with_limit(2)
    .do()
)

json_print(response)

response = (
    client.query
    .get("Question", ["question", "answer"])
    .with_near_text({"concepts": ["animals"]})
    .with_limit(10)
    .with_additional(["distance"])
    .do()
)

json_print(response)

We can tell the vector database to discard results beyond a distance threshold:

response = (
    client.query
    .get("Question", ["question", "answer"])
    .with_near_text({"concepts": ["animals"], "distance": 0.24})
    .with_limit(10)
    .with_additional(["distance"])
    .do()
)

json_print(response)

CRUD operations with Weaviate

Create

#Create an object
object_uuid = client.data_object.create(
    data_object={
        'question':"Leonardo da Vinci was born in this country.",
        'answer': "Italy",
        'category': "Culture"
    },
    class_name="Question"
 )

print(object_uuid)

Read

data_object = client.data_object.get_by_id(object_uuid, class_name="Question")
json_print(data_object)

data_object = client.data_object.get_by_id(
    object_uuid,
    class_name='Question',
    with_vector=True
)

json_print(data_object)

Update

client.data_object.update(
    uuid=object_uuid,
    class_name="Question",
    data_object={
        'answer':"Florence, Italy"
    })


data_object = client.data_object.get_by_id(
    object_uuid,
    class_name='Question',
)
json_print(data_object)

Delete

json_print(client.query.aggregate("Question").with_meta_count().do())
client.data_object.delete(uuid=object_uuid, class_name="Question")
json_print(client.query.aggregate("Question").with_meta_count().do())

Sparse, dense, and hybrid search

Query types

Dense search / semantic search

Uses vector embedding representations of the data to perform the search.

Example query:

response = (
    client.query
    .get("Question", ["question", "answer"])
    .with_near_text({"concepts":["animal"]})
    .with_limit(3)
    .do()
)

json_print(response)

Shortcomings of dense search:

Out-of-domain data leads to low accuracy.
Searching for seemingly random data, such as serial numbers, reduces accuracy. In that case keyword (sparse) search gives better results.

The simplest way to do keyword matching is bag of words: count how many times each word occurs in the query and in the stored text, then return the objects with the highest frequency of matching words.

This is known as sparse search because the text is embedded into vectors by counting how many times every unique word in your vocabulary occurs in the query and in the stored sentences.

Mostly-zero sparse embeddings

Since any given sentence is very unlikely to contain every word in your vocabulary, the embeddings are mostly zeros, which is why they are called sparse embeddings.
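
A minimal bag-of-words sketch makes the sparsity visible. The documents, the query, and the scoring by a plain dot product of count vectors are all assumptions for illustration; real systems tokenize more carefully and store only the non-zero entries.

```python
from collections import Counter

# Toy corpus and query (made-up sentences)
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "vector databases store embeddings",
]
query = "cat on the mat"

# Vocabulary: every unique word seen in the documents and the query
vocab = sorted({word for text in docs + [query] for word in text.split()})


def bow_vector(text):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(text.split())
    return [counts.get(word, 0) for word in vocab]


doc_vecs = [bow_vector(d) for d in docs]
query_vec = bow_vector(query)

# Score each document by the dot product of its counts with the query counts
scores = [sum(q * d for q, d in zip(query_vec, vec)) for vec in doc_vecs]
best = max(range(len(docs)), key=scores.__getitem__)

print("vocabulary size:", len(vocab))
print("vector for doc 2:", doc_vecs[2])   # mostly zeros
print("scores:", scores)                  # [5, 3, 0]
print("best match:", docs[best])
```

Even with an 11-word vocabulary most entries are zero; with a realistic vocabulary of tens of thousands of words, almost all of them are.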

BM25 (Best Matching 25)

In practice, keyword search uses a modification of simple word frequencies called Best Matching 25 (BM25). BM25 counts the words in the phrase you pass in and then weights them so that words appearing more frequently across the data count for less when they match, while a match on a rare word scores much higher.
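
As a rough sketch of that weighting, the function below implements the commonly cited BM25 formula with a smoothed inverse document frequency. The parameters k1 = 1.2 and b = 0.75 are widely used defaults assumed here, not values taken from the course, and the tokenizer is a plain whitespace split.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every document against the query with the standard BM25 formula."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs

    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))

    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Rare terms get a high IDF, common terms a low one
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Term frequency saturates and is normalized by document length
            denom = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores


docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the axolotl is a rare animal",
]
print(bm25_scores("the axolotl", docs))
```

The word "the" occurs in every document, so matching it adds very little; "axolotl" occurs in only one, so the third document gets by far the highest score.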

Sparse search: BM25 example:

response = (
    client.query
    .get("Question",["question","answer"])
    .with_bm25(query="animal")
    .with_limit(3)
    .do()
)

json_print(response)

Hybrid search

Hybrid search runs both vector (dense) search and keyword (sparse) search and then combines the results. The combination can be based on a scoring system that measures how well each object matches the query under both the dense and the sparse search.
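
One simple way to combine the two result lists is to normalize each set of scores to the range 0 to 1 and take a weighted sum. The sketch below assumes that approach for illustration; it is not necessarily the fusion algorithm Weaviate uses internally, and the document ids and scores are made up.

```python
def min_max(scores):
    """Rescale a dict of scores to the range [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {key: 1.0 for key in scores}
    return {key: (value - lo) / (hi - lo) for key, value in scores.items()}


def hybrid_fuse(dense, sparse, alpha=0.5):
    """Weighted sum of normalized scores: alpha weights dense, 1 - alpha sparse."""
    dense_n, sparse_n = min_max(dense), min_max(sparse)
    ids = set(dense) | set(sparse)
    fused = {
        i: alpha * dense_n.get(i, 0.0) + (1 - alpha) * sparse_n.get(i, 0.0)
        for i in ids
    }
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


# Hypothetical results: higher is better in both lists
dense = {"doc_a": 0.92, "doc_b": 0.80, "doc_c": 0.55}   # vector similarity
sparse = {"doc_b": 7.1, "doc_c": 3.4, "doc_d": 1.0}     # BM25 scores

for doc_id, score in hybrid_fuse(dense, sparse, alpha=0.5):
    print(doc_id, round(score, 3))
```

With alpha = 0.5, doc_b wins because it ranks well in both lists, even though doc_a is the best pure vector match. Setting alpha to 1 or 0 reduces the result to the dense or the sparse ranking alone.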

Hybrid search example:

response = (
    client.query
    .get("Question",["question","answer"])
    .with_hybrid(query="animal", alpha=0.5) # alpha=0 is pure keyword search, alpha=1 is pure vector search; try 0 and 1 too
    .with_limit(3)
    .do()
)
json_print(response)

Applications that use vector databases

Multilingual or semantic search

Because embeddings produce vectors that convey meaning, the vectors for the same phrase in different languages are similar, so queries in different languages return similar results.

Retrieval-Augmented Generation (RAG)

You can use vector databases as external knowledge bases. Let the LLM draw on a vector database as an external knowledge base of factual, up-to-date information.

With RAG, LLMs can cite sources, reduce hallucinations, and solve knowledge-intensive tasks.
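
Conceptually, RAG is two steps: retrieve the passages closest to the question, then place them in the prompt sent to the LLM. The sketch below shows only those two steps with a tiny in-memory knowledge base; the 3-dimensional embeddings are made-up numbers standing in for a real embedding model, the function names are hypothetical, and no LLM is actually called.

```python
import numpy as np

# Hypothetical knowledge base: (passage, made-up embedding)
knowledge_base = [
    ("Weaviate is an open-source vector database.", np.array([0.9, 0.1, 0.0])),
    ("HNSW is a graph-based ANN index.", np.array([0.1, 0.9, 0.1])),
    ("BM25 is a keyword ranking function.", np.array([0.0, 0.2, 0.9])),
]


def cosine_similarity(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def retrieve(query_vec, k=2):
    """Return the k passages whose embeddings are most similar to the query."""
    ranked = sorted(
        knowledge_base,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


def build_prompt(question, passages):
    """Put the retrieved passages into the prompt as numbered sources."""
    sources = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Answer using only the numbered sources below and cite them.\n\n"
        f"Sources:\n{sources}\n\n"
        f"Question: {question}"
    )


query_vec = np.array([0.2, 0.8, 0.1])   # pretend embedding of the question
passages = retrieve(query_vec, k=2)
prompt = build_prompt("What is HNSW?", passages)
print(prompt)
```

Because the answer is grounded in retrieved passages, the model can point to numbered sources instead of relying only on what it memorized during training.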

To do RAG in Weaviate:

# instruction for the generative module
generatePrompt = "Describe the following as a Facebook Ad: {summary}"

result = (
  client.query
  .get("Article", ["title", "summary"])
  .with_generate(single_prompt=generatePrompt)
  .with_near_text({"concepts": ["Italian food"]})
  .with_limit(5)
).do()

Check the number of vectors stored in the database:

print(json.dumps(client.query.aggregate("Wikipedia").with_meta_count().do(), indent=2))

Search over them to find the concepts you are interested in:

response = (client.query
            .get("Wikipedia",['text','title','url','views','lang'])
            .with_near_text({"concepts": "Vacation spots in california"})
            .with_limit(5)
            .do()
           )

json_print(response)

Queries restricted to English-language results (the same question asked in English, Polish, and Arabic):

response = (client.query
            .get("Wikipedia",['text','title','url','views','lang'])
            .with_near_text({"concepts": "Vacation spots in california"})
            .with_where({
                "path" : ['lang'],
                "operator" : "Equal",
                "valueString":'en'
            })
            .with_limit(3)
            .do()
           )

json_print(response)


response = (client.query
            .get("Wikipedia",['text','title','url','views','lang'])
            .with_near_text({"concepts": "Miejsca na wakacje w Kalifornii"})
            .with_where({
                "path" : ['lang'],
                "operator" : "Equal",
                "valueString":'en'
            })
            .with_limit(3)
            .do()
           )

json_print(response)

response = (client.query
            .get("Wikipedia",['text','title','url','views','lang'])
            .with_near_text({"concepts": "أماكن العطلات في كاليفورنيا"})
            .with_where({
                "path" : ['lang'],
                "operator" : "Equal",
                "valueString":'en'
            })
            .with_limit(3)
            .do()
           )

json_print(response)

Single prompt:

prompt = "Write me a facebook ad about {title} using information inside {text}"
result = (
  client.query
  .get("Wikipedia", ["title","text"])
  .with_generate(single_prompt=prompt)
  .with_near_text({
    "concepts": ["Vacation spots in california"]
  })
  .with_limit(3)
).do()

json_print(result)

Grouped task:

generate_prompt = "Summarize what these posts are about in two paragraphs."

result = (
  client.query
  .get("Wikipedia", ["title","text"])
  .with_generate(grouped_task=generate_prompt) # Pass in all objects at once
  .with_near_text({
    "concepts": ["Vacation spots in california"]
  })
  .with_limit(3)
).do()

json_print(result)

References

[1] Deeplearning.ai (2024). Vector Databases: from Embeddings to Applications. https://learn.deeplearning.ai/courses/vector-databases-embeddings-applications/