Improving the Accuracy of LLM Applications

Cahit Barkin Ozer

27 min read · Sep 18, 2024

This is a write-up of the Deeplearning.ai short course “Improving Accuracy of LLM Applications”, originally published in Turkish.

The author's English version: Improving Accuracy of LLM Applications (cbarkinozer.blogspot.com)

Introduction

AI applications can now perform tasks that used to be very hard for computers, such as providing a natural-language interface to a database. However, these applications perform well in some areas and struggle in others. You will learn a framework of development steps for systematically improving your application's performance. Specifically, you will create an evaluation dataset to measure performance, do prompt engineering, and finally fine-tune your model.

An important aspect of open-source large language models (LLMs) is that they let users fine-tune them for their own specific tasks. With smaller models in particular, we have seen users fine-tune for tasks such as text-to-SQL, classification, question answering, recommendation, and summarization. These models have also been adapted to understand proprietary datasets such as financial, customer, and legal information.

In this course you will first build your LLM application, wire everything together, and do some prompt engineering with self-reflection. After that, it is important to evaluate the model's performance rigorously, so that you know whether it is ready for production use and where it needs to improve. When prompt engineering falls short, the next step is to use LLMs to create a dataset for fine-tuning your model. A common misconception about this step is that there is not enough data. In practice there is enough, and there are ways to significantly augment the data you have with the help of LLMs. Fine-tuning used to be slow and expensive, but with parameter-efficient techniques such as LoRA (Low-Rank Adaptation), time and cost have dropped dramatically.

Lamini memory tuning is a technique that lets you reach a new level of accuracy while reducing the time and developer effort needed to reach the same or a higher level of accuracy. Using optimized fine-tuning techniques, you can teach your LLM thousands of new facts in just a few minutes on a single A100 or MI250 GPU.

As an example, we will fine-tune an LLM to generate SQL queries for a specific schema. You will see how fine-tuning raises accuracy from 30% to about 95% with a small dataset of only 128 examples, in about 30 minutes and for a few dollars. With memory tuning, this process can be optimized down to 6 to 7 seconds for even better performance.

Overview

LLMs are probabilistic programs, so we need to work iteratively.

from dotenv import load_dotenv
import lamini
_ = load_dotenv()   #load environmental variable LAMINI_API_KEY with key from .env file
llm = lamini.Lamini(model_name="meta-llama/Meta-Llama-3-8B-Instruct")
prompt = """\
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a helpful assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>

Please write a birthday card for my good friend Andrew\
<|eot_id|><|start_header_id|>assistant<|end_header_id|>

"""
result = llm.generate(prompt, max_new_tokens=200)
print(result)

Python Enhancement Proposal 8 (PEP 8) is the style guide for Python. It suggests that long lines can be broken across multiple lines by wrapping the expression in parentheses.

prompt2 = ( 
    "<|begin_of_text|>"  # Start of prompt
    "<|start_header_id|>system<|end_header_id|>\n\n"  #  header - system
    "You are a helpful assistant."  # system prompt
    "<|eot_id|>" # end of turn
    "<|start_header_id|>user<|end_header_id|>\n\n" # header - user
    "Please write a birthday card for my good friend Andrew" 
    "<|eot_id|>" # end of turn
    "<|start_header_id|>assistant<|end_header_id|>\n\n" # header - assistant
    )
print(prompt2)
print(prompt == prompt2) # True

Let's write a function that builds a prompt from the user and system messages.

def make_llama_3_prompt(user, system=""):
    system_prompt = ""
    if system != "":
        system_prompt = (
            f"<|start_header_id|>system<|end_header_id|>\n\n{system}"
            f"<|eot_id|>"
        )
    prompt = (f"<|begin_of_text|>{system_prompt}"
              f"<|start_header_id|>user<|end_header_id|>\n\n"
              f"{user}"
              f"<|eot_id|>"
              f"<|start_header_id|>assistant<|end_header_id|>\n\n"
         )
    return prompt    

system_prompt = "You are a helpful assistant."
user_prompt = "Please write a birthday card for my good friend Andrew"
prompt3 = make_llama_3_prompt(user_prompt, system_prompt)
print(prompt3)
print(prompt == prompt3) # True

The new function can be used as follows:

user_prompt = "Tell me a joke about birthday cake"
prompt = make_llama_3_prompt(user_prompt)
print(prompt)
result = llm.generate(prompt, max_new_tokens=200)
print(result)

SQL Generation with Llama 3

question = (
    "Given an arbitrary table named `sql_table`, "
    "write a query to return how many rows are in the table." 
    )
prompt = make_llama_3_prompt(question)
print(llm.generate(prompt, max_new_tokens=200))

question = """Given an arbitrary table named `sql_table`, 
help me calculate the average `height` where `age` is above 20."""
prompt = make_llama_3_prompt(question)
print(llm.generate(prompt, max_new_tokens=200))

question = """Given an arbitrary table named `sql_table`, 
Can you calculate the p95 `height` where the `age` is above 20?"""
prompt = make_llama_3_prompt(question)
print(llm.generate(prompt, max_new_tokens=200))

question = ("Given an arbitrary table named `sql_table`, "
            "Can you calculate the p95 `height` "
            "where the `age` is above 20? Use sqlite.")
prompt = make_llama_3_prompt(question)

print(llm.generate(prompt, max_new_tokens=200))

For an LLM, an approximately right answer is the same as a right answer. Approximately right answers can be fine for creative tasks, such as saying "hi" instead of "hello", but they can be harmful for factual answers where you need precision, such as APIs, IDs, phone numbers, and so on.

"I ordered a latte, not a cappuccino." "Close enough, it's the same thing."

In the SQL generation case, PERCENTILE_CONT is not available in SQLite.

To address this:

Prompt engineering → 26%
Self-reflection → 26–40%
Retrieval-Augmented Generation (RAG) → 50%
Instruction fine-tuning → 40–60%
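
For reference, the p95 question from the prompts above can be answered in SQLite without PERCENTILE_CONT. The following is a minimal sketch, not the course's solution: the table contents are invented for illustration, and the query uses a nearest-rank approximation (sort, then skip 95% of the rows) rather than the interpolation that PERCENTILE_CONT performs in other databases.

```python
import sqlite3

# Hypothetical in-memory stand-in for `sql_table`; the rows are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sql_table (age INTEGER, height REAL)")
rows = [(18, 150.0)] + [(21 + i, 160.0 + i) for i in range(20)]
conn.executemany("INSERT INTO sql_table VALUES (?, ?)", rows)

# Nearest-rank style p95: order the filtered rows by height and skip
# the first 95% of them. Integer arithmetic keeps the offset exact.
P95_QUERY = """
SELECT height
FROM sql_table
WHERE age > 20
ORDER BY height
LIMIT 1
OFFSET (SELECT COUNT(*) * 95 / 100 FROM sql_table WHERE age > 20)
"""

def p95_height(connection):
    """Return the approximate 95th-percentile height, or None if no rows match."""
    row = connection.execute(P95_QUERY).fetchone()
    return row[0] if row else None

print(p95_height(conn))
```

This is the same ORDER BY / LIMIT / OFFSET pattern that the reference median query later in this post relies on.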

How can fine-tuning help?

One way to fine-tune a model is to embed facts into it, but the most common form of fine-tuning, instruction fine-tuning, is not the right tool for removing these hallucinations, and it can be costly. A technique invented by Lamini called memory tuning lets the model recall many facts exactly, by embedding the facts directly into the model's weights and adding some determinism to this highly probabilistic process. Memory tuning aims to reduce the probability of error on facts, and only on questions that call for factual knowledge.
These are the hallucinations seen in SQL agents:

Invalid SQL: Getting column names, IDs, formats, or functions wrong.
Malformed SQL: SQL queries that are valid but semantically wrong.
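
The two hallucination types can be told apart mechanically. The sketch below is an illustration under assumptions, not code from the course: `classify_query` is a hypothetical helper, the table is a toy version of nba_roster, and "malformed" is approximated as "runs, but returns different rows than a trusted reference query".

```python
import sqlite3

def classify_query(connection, generated_sql, reference_sql):
    """Return 'invalid', 'malformed', or 'ok' for a generated query."""
    try:
        generated_rows = connection.execute(generated_sql).fetchall()
    except sqlite3.Error:
        # The database rejected the query: wrong column, function, syntax, ...
        return "invalid"
    reference_rows = connection.execute(reference_sql).fetchall()
    # The query ran; it is only correct if it returns what the reference returns.
    return "ok" if generated_rows == reference_rows else "malformed"

# Toy table with made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nba_roster (NAME TEXT, AGE INT)")
conn.executemany(
    "INSERT INTO nba_roster VALUES (?, ?)",
    [("A", 22), ("B", 30), ("C", 35)],
)

reference = "SELECT COUNT(*) FROM nba_roster WHERE AGE > 25"
print(classify_query(conn, "SELECT COUNT(*) FROM nba_roster WHERE PLAYER_AGE > 25", reference))
print(classify_query(conn, "SELECT COUNT(*) FROM nba_roster WHERE AGE < 25", reference))
print(classify_query(conn, "SELECT COUNT(*) FROM nba_roster WHERE AGE > 25;", reference))
```

The first query names a column that does not exist, the second runs but answers a different question, and the third matches the reference.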

Keep in mind that with Memory Tuning you need to iterate through evaluation, data generation, and fine-tuning 3 times.

Creating a SQL Agent

Let's build a SQL agent and observe where the model hallucinates. First, do some prompt engineering: put the table schema and a clear instruction in the system prompt.

Then use structured output to make sure you get back only SQL.

Finally, diagnose hallucinations such as a wrong salary format or a wrong query.

Building the SQL agent:

from dotenv import load_dotenv
_ = load_dotenv()   #load environmental variable LAMINI_API_KEY with key from .env file
import lamini 
import logging
import sqlite3
import pandas as pd
from util.get_schema import get_schema
from util.make_llama_3_prompt import make_llama_3_prompt
from util.setup_logging import setup_logging

logger = logging.getLogger(__name__)
engine = sqlite3.connect("./nba_roster.db")
setup_logging()

llm = lamini.Lamini(model_name="meta-llama/Meta-Llama-3-8B-Instruct")

# Meta Llama 3 Instruct uses a prompt template, with special tags used to indicate the user query and system prompt. 
# You can find the documentation on this [model card](https://llama.meta.com/docs/model-cards-and-prompt-formats/meta-llama-3/#meta-llama-3-instruct).
def make_llama_3_prompt(user, system=""):
    system_prompt = ""
    if system != "":
        system_prompt = (
            f"<|start_header_id|>system<|end_header_id|>\n\n{system}<|eot_id|>"
        )
    return f"<|begin_of_text|>{system_prompt}<|start_header_id|>user<|end_header_id|>\n\n{user}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"

def get_schema():
    return """\
0|Team|TEXT 
1|NAME|TEXT  
2|Jersey|TEXT 
3|POS|TEXT
4|AGE|INT 
5|HT|TEXT 
6|WT|TEXT 
7|COLLEGE|TEXT 
8|SALARY|TEXT eg. 
"""

user = """Who is the highest paid NBA player?"""

system = f"""You are an NBA analyst with 15 years of experience writing complex SQL queries. Consider the nba_roster table with the following schema:
{get_schema()}

Write a sqlite query to answer the following question. Follow instructions exactly"""
print(system)
prompt = make_llama_3_prompt(user, system)
print(llm.generate(prompt, max_new_tokens=200))

def get_updated_schema():
    return """\
0|Team|TEXT eg. "Toronto Raptors"
1|NAME|TEXT eg. "Otto Porter Jr."
2|Jersey|TEXT eg. "0" and when null has a value "NA"
3|POS|TEXT eg. "PF"
4|AGE|INT eg. "22" in years
5|HT|TEXT eg. `6' 7"` or `6' 10"`
6|WT|TEXT eg. "232 lbs" 
7|COLLEGE|TEXT eg. "Michigan" and when null has a value "--"
8|SALARY|TEXT eg. "$9,945,830" and when null has a value "--"
"""
system = f"""You are an NBA analyst with 15 years of experience writing complex SQL queries. Consider the nba_roster table with the following schema:
{get_updated_schema()}

Write a sqlite query to answer the following question. Follow instructions exactly"""
prompt = make_llama_3_prompt(user, system)
print(prompt)
print(llm.generate(prompt, max_new_tokens=200))

Getting structured output:

result = llm.generate(prompt, output_type={"sqlite_query": "str"}, max_new_tokens=200)
print(result)

df = pd.read_sql(result['sqlite_query'], con=engine)
print(df)

Diagnosing hallucinations:

query="""SELECT salary, name 
FROM nba_roster 
WHERE salary != '--' 
ORDER BY CAST(REPLACE(REPLACE(salary, '$', ''), ',','') AS INTEGER) DESC 
LIMIT 1;"""
df = pd.read_sql(query, con=engine)
print(df)
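
To see why the cast in the query above matters, here is a small self-contained sketch. It assumes a toy nba_roster table: "$9,945,830" and the "--" null marker come from the schema shown earlier, while "$51,915,615" and the player names are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nba_roster (NAME TEXT, SALARY TEXT)")
conn.executemany(
    "INSERT INTO nba_roster VALUES (?, ?)",
    [
        ("Player A", "$9,945,830"),
        ("Player B", "$51,915,615"),  # invented value
        ("Player C", "--"),           # null marker from the schema
    ],
)

# Sorting the TEXT column directly compares strings character by character,
# so "$9,..." sorts above "$51,...".
naive = """
SELECT NAME FROM nba_roster
WHERE SALARY != '--'
ORDER BY SALARY DESC
LIMIT 1
"""

# Stripping '$' and ',' and casting to INTEGER compares the numeric values.
cleaned = """
SELECT NAME FROM nba_roster
WHERE SALARY != '--'
ORDER BY CAST(REPLACE(REPLACE(SALARY, '$', ''), ',', '') AS INTEGER) DESC
LIMIT 1
"""

naive_top = conn.execute(naive).fetchone()[0]
cleaned_top = conn.execute(cleaned).fetchone()[0]
print(naive_top, cleaned_top)
```

The naive ordering picks Player A even though Player B earns more, which is exactly the kind of wrong-format hallucination the richer schema description is meant to prevent.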

Create an Evaluation

In this lesson you will build an evaluation framework to measure performance systematically. We evaluate LLMs to find out where they hallucinate and to tell whether accuracy is improving. A good evaluation is quantitative, shows improvement, and can be run automatically and at scale. To build an evaluation dataset, start with 20 to 100 examples and focus on quality.

Use an LLM to evaluate your output:

Have the LLM produce a numeric score.
Give the LLM the question, the generated answer, and the scoring method through its evaluation prompt.
Structured output returns the score as an int, float, List[int], etc.

Alternatively, we can compute traditional metrics such as exact match, precision, F1, etc.
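
As a rough illustration of the traditional metrics, the sketch below computes exact match and a token-level F1 between a generated and a reference SQL string. The normalization (lowercasing, dropping semicolons, collapsing whitespace) is a simplifying assumption of this example; it would also lowercase string literals, so treat it as a crude baseline rather than a SQL-aware comparison.

```python
from collections import Counter

def normalize(sql):
    """Crude normalization: lowercase, drop semicolons, collapse whitespace."""
    return " ".join(sql.lower().replace(";", " ").split())

def exact_match(pred, ref):
    """True if the two queries are identical after normalization."""
    return normalize(pred) == normalize(ref)

def token_f1(pred, ref):
    """Harmonic mean of token precision and recall (whitespace tokens)."""
    pred_tokens = normalize(pred).split()
    ref_tokens = normalize(ref).split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("SELECT COUNT(*) FROM nba_roster;", "select count(*) from nba_roster"))
print(token_f1("select name from nba_roster", "select name, age from nba_roster"))
```

String metrics like these penalize queries that are worded differently but return the same result, which is why the course also compares query results.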

Scoring the similarity of the generated SQL against the reference SQL with an LLM:

The generated SQL can be a correct semantic match without being an exact match.

Whenever you can, it is still worth using deterministic systems for evaluation: run the generated SQL against the database and compare for exact matches.
Bonus points with memory tuning: you can get exact matches by training the LLM to return exact matches!
Evaluation is iterative. As you improve the model, keep expanding your evaluation dataset, be rigorous about catching errors, add harder hallucination examples, and track evaluation results across iterations so that you know which models produced which results.

from dotenv import load_dotenv
_ = load_dotenv()   #load environmental variable LAMINI_API_KEY with key from .env file
!cat data/gold-test-set.jsonl
question = "What is the median weight in the NBA?"

import lamini 
from util.get_schema import get_schema
from util.make_llama_3_prompt import make_llama_3_prompt
llm = lamini.Lamini(model_name="meta-llama/Meta-Llama-3-8B-Instruct")

system = f"""You are an NBA analyst with 15 years of experience writing complex SQL queries. Consider the nba_roster table with the following schema:
{get_schema()}
Write a sqlite query to answer the following question. Follow instructions exactly"""
prompt = make_llama_3_prompt(question, system)

generated_query = llm.generate(prompt, output_type={"sqlite_query": "str"}, max_new_tokens=200)
print(generated_query)
# {'sqlite_query': "SELECT AVG(CAST(SUBSTR(WT, INSTR(WT,'') + 1) AS INTEGER) FROM nba_roster WHERE WT IS NOT NULL"}

import pandas as pd
import sqlite3
engine = sqlite3.connect("./nba_roster.db")
# This creates an error: df = pd.read_sql(generated_query['sqlite_query'], con=engine)

try:
    df = pd.read_sql(generated_query['sqlite_query'], con=engine)
    print(df)
except Exception as e:
    print(e)
# Execution failed on sql 'SELECT AVG(CAST(SUBSTR(WT, INSTR(WT,'') + 1) AS INTEGER) FROM nba_roster WHERE WT IS NOT NULL': near "FROM": syntax error

# TRYING AGENT REFLECTION
reflection = f"Question: {question}. Query: {generated_query['sqlite_query']}. This query is invalid (gets the error Execution failed on sql 'SELECT AVG(CAST(SUBSTR(WT, INSTR(WT,'') + 1) AS INTEGER) FROM nba_roster WHERE WT IS NOT NULL': near \"FROM\": syntax error), so it cannot answer the question. Write a corrected sqlite query."
reflection_prompt = make_llama_3_prompt(reflection, system)
print(reflection_prompt)
"""
'<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are an NBA analyst with 15 years of experience writing complex SQL queries. Consider the nba_roster table with the following schema:\n0|Team|TEXT eg. "Toronto Raptors"\n1|NAME|TEXT eg. "Otto Porter Jr."\n2|Jersey|TEXT eg. "0" and when null has a value "NA"\n3|POS|TEXT eg. "PF"\n4|AGE|INT eg. "22" in years\n5|HT|TEXT eg. `6\' 7"` or `6\' 10"`\n6|WT|TEXT eg. "232 lbs" \n7|COLLEGE|TEXT eg. "Michigan" and when null has a value "--"\n8|SALARY|TEXT eg. "$9,945,830" and when null has a value "--"\n\n\nWrite a sqlite query to answer the following question. Follow instructions exactly<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nQuestion: What is the median weight in the NBA?. Query: SELECT AVG(CAST(SUBSTR(WT, INSTR(WT,\'\') + 1) AS INTEGER) FROM nba_roster WHERE WT IS NOT NULL. This query is invalid (gets the error Execution failed on sql \'SELECT AVG(CAST(SUBSTR(WT, INSTR(WT,\'\') + 1) AS INTEGER) FROM nba_roster WHERE WT IS NOT NULL\': near "FROM": syntax error), so it cannot answer the question. Write a corrected sqlite query.<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n'
"""

reflection_query = llm.generate(reflection_prompt, output_type={"sqlite_query": "str"}, max_new_tokens=200)
print(reflection_query)
"""
{'sqlite_query': "SELECT AVG(CAST(SUBSTR(WT, INSTR(WT,'') + 1) AS INTEGER) FROM nba_roster WHERE WT IS NOT NULL"}
"""

try:
    df = pd.read_sql(reflection_query['sqlite_query'], con=engine)
    print(df)
except Exception as e:
    print(e)
"""
Execution failed on sql 'SELECT AVG(CAST(SUBSTR(WT, INSTR(WT,'') + 1) AS INTEGER) FROM nba_roster WHERE WT IS NOT NULL': near "FROM": syntax error
"""

correct_sql = "select CAST(SUBSTR(WT, 1, INSTR(WT,' ')) as INTEGER) as percentile from nba_roster order by percentile limit 1 offset (select count(*) from nba_roster)/2;"
df_corrected = pd.read_sql(correct_sql, con=engine)
print(df_corrected)
"""
   percentile
0         215
"""


# EVALUATE OVER A LARGE DATASET
import logging
import os
from datetime import datetime
from pprint import pprint
from typing import AsyncIterator, Iterator, Union
import sqlite3
from tqdm import tqdm
import pandas as pd
import jsonlines
from lamini.generation.base_prompt_object import PromptObject
from lamini.generation.generation_node import GenerationNode
from lamini.generation.generation_pipeline import GenerationPipeline
from util.get_schema import get_schema
from util.make_llama_3_prompt import make_llama_3_prompt
from util.setup_logging import setup_logging

logger = logging.getLogger(__name__)
engine = sqlite3.connect("./nba_roster.db")
setup_logging()

class Args:
    def __init__(self, 
                 max_examples=100, 
                 sql_model_name="meta-llama/Meta-Llama-3-8B-Instruct", 
                 gold_file_name="gold-test-set.jsonl",
                 training_file_name="archive/generated_queries.jsonl",
                 num_to_generate=10):
        self.sql_model_name = sql_model_name
        self.max_examples = max_examples
        self.gold_file_name = gold_file_name
        self.training_file_name = training_file_name
        self.num_to_generate = num_to_generate

def load_gold_dataset(args):
    path = f"data/{args.gold_file_name}"
    with jsonlines.open(path) as reader:
        for index, obj in enumerate(reversed(list(reader))):
            if index >= args.max_examples:
                break
            yield PromptObject(prompt="", data=obj)

path = "data/gold-test-set.jsonl"
with jsonlines.open(path) as reader:
    data = [obj for obj in reader]

datapoint = data[4]
print(datapoint)
"""
{'question': 'What is the average weight in the NBA?',
 'answer': '214.98',
 'sql': "SELECT AVG(CAST(SUBSTR(WT, 1, INSTR(WT,' ')) as INTEGER)) FROM nba_roster;"}
"""

datapoint = data[7]
print(datapoint)
"""
{'question': 'Can you tell me how many players are in the NBA?',
 'answer': '600',
 'sql': 'select count(*) from nba_roster;'}
"""

system = "You are an NBA analyst with 15 years of experience writing complex SQL queries.\n"
system += "Consider the nba_roster table with the following schema:\n"
system += get_schema() + "\n"
system += (
    "Write a sqlite SQL query that would help you answer the following question:\n"
)
user = datapoint["question"]
prompt = make_llama_3_prompt(user, system)
generated_sql = llm.generate(prompt, output_type={"sqlite_query": "str"}, max_new_tokens=200)
print(generated_sql)
df = pd.read_sql(generated_sql['sqlite_query'], con=engine)
print(df)

"""
   COUNT(*)
0       476
"""

query_succeeded = False
try:
    df = pd.read_sql(generated_sql['sqlite_query'], con=engine)
    query_succeeded = True
    print("Query is valid")
except Exception as e:
    print(
        f"Failed to run SQL query: {generated_sql}"
    )

reference_sql = datapoint["sql"]
ref_df = pd.read_sql(reference_sql, con=engine)
print(ref_df)

"""
   count(*)
0       600
"""


# Let's wrap the query generation and execution logic in a class

class QueryStage(GenerationNode):
    def __init__(self, model_name):
        super().__init__(
            model_name=model_name,
            max_new_tokens=200,
        )

    def generate(
        self,
        prompt: Union[Iterator[PromptObject], AsyncIterator[PromptObject]],
        *args,
        **kwargs,
    ):
        results = super().generate(
            prompt,
            output_type={"sqlite_query": "str"},
            *args,
            **kwargs,
        )
        return results


    def postprocess(self, obj: PromptObject):
        # Run both the generated and reference (Gold Dataset) SQL queries
        # Assessing whether the SQL queries succeeded in hitting the database (not correctness yet!)
        
        query_succeeded = False

        try:
            logger.error(f"Running SQL query '{obj.response['sqlite_query']}'")
            obj.data["generated_query"] = obj.response["sqlite_query"]
            df = pd.read_sql(obj.response["sqlite_query"], con=engine)
            obj.data['df'] = df
            logger.error(f"Got data: {df}")
            query_succeeded = True

        except Exception as e:
            logger.error(
                f"Failed to run SQL query: {obj.response['sqlite_query']}"
            )

        logger.info(f"Running reference SQL query '{obj.data['sql']}'")
        df = pd.read_sql(obj.data["sql"], con=engine)
        logger.info(f"Got data: {df}")
        obj.data['reference_df'] = df

        logger.info(f"For question: {obj.data['question']}")
        logger.info(f"For query: {obj.response['sqlite_query']}")

        obj.data["query_succeeded"] = query_succeeded

    def preprocess(self, obj: PromptObject):
        new_prompt = make_llama_3_prompt(**self.make_prompt(obj.data))
        obj.prompt = new_prompt

    def make_prompt(self, data: dict):
        system = "You are an NBA analyst with 15 years of experience writing complex SQL queries.\n"
        system += "Consider the nba_roster table with the following schema:\n"
        system += get_schema() + "\n"
        system += (
            "Write a sqlite SQL query that would help you answer the following question:\n"
        )
        user = data["question"]
        return {
            "user": user,
            "system": system,
        }
# Compare strings
str(df).lower() == str(ref_df).lower() # False

# Using a LLM to compare
system_prompt = "Compare the following two dataframes. They are similar if they are almost identical, or if they convey the same information about the nba_roster dataset. "
system_prompt += "Respond with valid JSON {'explanation' : str, 'similar' : bool}"
print(system_prompt)
"""
Compare the following two dataframes. They are similar if they are almost identical, or if they convey the same information about the nba_roster dataset. Respond with valid JSON {'explanation' : str, 'similar' : bool}
"""

user_prompt = (
    f"========== Dataframe 1 =========\n{str(df).lower()}\n\n"
)
user_prompt += (
    f"========== Dataframe 2 =========\n{str(ref_df).lower()}\n\n"
)
user_prompt += f"Can you tell me if these dataframes are similar?"

llm_similarity_prompt = make_llama_3_prompt(user_prompt, system_prompt)
llm_similarity = llm.generate(llm_similarity_prompt, output_type={"explanation": "str", "similar": "bool"}, max_new_tokens=200)
print(llm_similarity)
"""
{'explanation': 'The dataframes are not similar because they have different counts. The first dataframe has a count of 476, while the second dataframe has a count of 600',
 'similar': False}
"""

str(df).lower() == str(ref_df).lower() or llm_similarity["similar"] # False


# How to wrap it up in a class

class ScoreStage(GenerationNode):
    def __init__(self):
        super().__init__(
            model_name="meta-llama/Meta-Llama-3-8B-Instruct",
            max_new_tokens=150,
        )

    def generate(
        self,
        prompt: Union[Iterator[PromptObject], AsyncIterator[PromptObject]],
        *args,
        **kwargs,
    ):
        logger.debug("ScoreStage Generate")
        results = super().generate(
            prompt,
            output_type={"explanation": "str", "similar": ["true", "false"]},
            *args,
            **kwargs,
        )        
        logger.debug(f"ScoreStage Results {results}")

        return results

    def preprocess(self, obj: PromptObject):
        obj.prompt = make_llama_3_prompt(**self.make_prompt(obj))
        logger.info(f"Scoring Stage Prompt:\n{obj.prompt}")

    def postprocess(self, obj: PromptObject):
        logger.info(f"Postprocess")
        obj.data['is_matching'] = self.is_matching(obj.data, obj.response)
        obj.data['explanation'] = obj.response["explanation"]
        obj.data['similar'] = obj.response["similar"] == "true"


    def is_matching(self, data, response):
        return (str(data.get('df',"None")).lower() == str(data['reference_df']).lower() 
                or response['similar'] == "true")

    def make_prompt(self, obj: PromptObject):
        # Your evaluation model compares SQL output from the generated and reference SQL queries, using another LLM in the pipeline
        system_prompt = "Compare the following two dataframes. They are similar if they are almost identical, or if they convey the same information about the nba_roster dataset. "
        system_prompt += "Respond with valid JSON {'explanation' : str, 'similar' : bool}"
        user_prompt = (
            f"========== Dataframe 1 =========\n{str(obj.data.get('df','None')).lower()}\n\n"
        )
        user_prompt += (
            f"========== Dataframe 2 =========\n{str(obj.data['reference_df']).lower()}\n\n"
        )
        user_prompt += f"Can you tell me if these dataframes are similar?"
        return {
            "system": system_prompt,
            "user": user_prompt
        }

class EvaluationPipeline(GenerationPipeline):
    def __init__(self, args):
        super().__init__()
        self.query_stage = QueryStage(args.sql_model_name)
        self.score_stage = ScoreStage()

    def forward(self, x):
        x = self.query_stage(x)
        x = self.score_stage(x)
        return x

async def run_eval(dataset, args):
    results = await run_evaluation_pipeline(dataset, args)
    print("Total results:", len(results))
    return results

async def run_evaluation_pipeline(dataset, args):
    results = EvaluationPipeline(args).call(dataset)
    result_list = []

    pbar = tqdm(desc="Saving results", unit=" results")
    async for result in results:
        result_list.append(result)
        pbar.update()
    return result_list

def save_eval_results(results, args):
    base_path = "./data/results"
    now = datetime.now().strftime("%Y_%m_%d_%H_%M_%S")
    experiment_name = f"nba_sql_pipeline_{now}"
    experiment_dir = os.path.join(base_path, experiment_name)
    os.makedirs(os.path.join(base_path, experiment_name))

    # Write args to file
    args_file_name = f"{experiment_dir}/args.txt"
    with open(args_file_name, "w") as writer:
        pprint(args.__dict__, writer)


    def is_correct(r):
        if (
            (r.data["query_succeeded"] and r.data['is_matching']) or 
            r.data["generated_query"] == r.data['sql']
        ):
            return True
        return False

    # Write sql results and errors to file
    results_file_name = f"{experiment_dir}/sql_results.jsonl"
    with jsonlines.open(results_file_name, "w") as writer:
        for result in results:
            if not is_correct(result):
                continue
            writer.write(
                {
                    "question": result.data['question'],
                    "query": result.data["generated_query"],
                    "query_succeeded": result.data["query_succeeded"],
                    "reference_sql": result.data['sql'],
                    "df": str(result.data.get('df', 'None')),
                    "reference_df": str(result.data['reference_df']),
                    'is_matching': result.data['is_matching'],
                    'similar': result.data['similar'],
                }
            )

    results_file_name = f"{experiment_dir}/sql_errors.jsonl"
    with jsonlines.open(results_file_name, "w") as writer:
        for result in results:
            if is_correct(result):
                continue
            writer.write(
                {
                    "question": result.data['question'],
                    "query": result.data["generated_query"],
                    "query_succeeded": result.data["query_succeeded"],
                    "df": str(result.data.get('df', 'None')),
                    "reference_df": str(result.data['reference_df']),
                    'is_matching': result.data['is_matching'],
                    'similar': result.data['similar'],
                }
            )

    # Write statistics to file
    average_sql_succeeded = sum(
        [result.data["query_succeeded"] for result in results]
    ) / len(results)
    average_correct = sum(
        [result.data["query_succeeded"] and result.data['is_matching'] for result in results]
    ) / len(results)

    file_name = f"{experiment_dir}/summary.txt"
    with open(file_name, "w") as writer:
        print(f"Total size of eval dataset: {len(results)}", file=writer)
        print(f"Total size of eval dataset: {len(results)}")
        print(f"Percent Valid SQL Syntax: {average_sql_succeeded*100}", file=writer)
        print(f"Percent Valid SQL Syntax: {average_sql_succeeded*100}")
        print(f"Percent Correct SQL Query: {average_correct*100}", file=writer)
        print(f"Percent Correct SQL Query: {average_correct*100}")

args = Args()
dataset = load_gold_dataset(args)
results = await run_eval(dataset, args)
save_eval_results(results, args)

Fine-tuning, PEFT, and Memory Tuning

Let's look at PEFT fine-tuning and at removing hallucinations with memory tuning. We fine-tune to handle more data, to learn from that data, and to gain deeper control over the LLM, with no ceiling on accuracy.

We will look at two kinds of fine-tuning:

Instruction fine-tuning and memory tuning. Instruction fine-tuning is when you get a pre-trained LLM to follow instructions.
Memory tuning is when you get the LLM to stop hallucinating.

Instruction fine-tuning can be used to obtain chat behavior, function calling, and a desired output format. Pre-training teaches an LLM from enormous amounts of data, one token at a time; it reduces the average error across examples (generalization error) and produces powerful base models that have learned a great deal. However, these base LLMs cannot follow instructions, and they hallucinate on facts. Because of their probabilistic nature, LLMs treat "approximately right" and "right" as the same thing. Memory tuning, by contrast, drives the error on facts to zero, so the model becomes nearly perfect on facts while still doing well on everything else.

Remember that prompting and RAG (Retrieval-Augmented Generation) cannot solve every problem, and that fine-tuning is no longer very expensive. Fine-tuning has become cheaper than running large prompts in RAG, PEFT (Parameter-Efficient Fine-Tuning) has reduced cost by a factor of 10,000, and MoME (Mixture of Memory Experts) has reduced time by a factor of 240 by turning any LLM into a mixture of millions of expert adapters. To get these efficiency gains, however, you need to apply fine-tuning correctly.

One of the most popular PEFT methods is LoRA (Low-Rank Adaptation). In LoRA we train an adapter that works alongside the weights. This leaves the main weights untouched, and the adapter can be merged back into the main model, so inference latency stays the same.
MoME also has adapters, but in addition it has a set of memory expert weights that are tuned and sampled alongside those adapters: you sample the subset that holds the information learned from your data and merge it into the adapters. In this way you can grow your model through memory experts and get the intelligence of a large model at the cost and latency of a smaller (sparsely activated) model.
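
The LoRA idea described above can be sketched with tiny matrices. This is an illustrative toy in pure Python, not how any particular library implements it: the frozen weight W stays untouched, a low-rank pair B and A is the only thing that would be trained, and merging W + B·A gives a single matrix with identical outputs. All values are invented, and the scaling factor that real LoRA implementations apply is omitted for brevity.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b):
    """Element-wise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

# Frozen base weight W (3x3) and a rank-1 adapter: B is 3x1, A is 1x3.
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
B = [[0.5], [0.0], [-0.5]]
A = [[1.0, 2.0, 0.0]]

x = [[1.0], [1.0], [1.0]]  # input as a column vector

# Adapter kept separate: base path plus low-rank path, W is never modified.
y_adapter = add(matmul(W, x), matmul(B, matmul(A, x)))

# Adapter merged for inference: one matrix, so no extra latency.
W_merged = add(W, matmul(B, A))
y_merged = matmul(W_merged, x)

print(y_adapter, y_merged)
```

For a d×k weight and rank r, the adapter holds r·(d + k) values instead of d·k, which is where the savings come from once d and k are large.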

Fully managed fine-tuning is available, but building your own fine-tuning stack is hard:

Efficiency: It is not efficient. It can require 10,000 to 1,000,000 times more compute to reach the same accuracy. It does not parallelize efficiently across multiple GPUs. It breaks down in real use cases, where continuously fine-tuning and serving inference in production is not feasible. The LLM does not keep improving, and fine-tuning for every use case, model, and dataset is difficult.
Inference: It is not easy to use or scale because of GPU and memory issues.
Bugs: Integrating fine-tuning with inference introduces many bugs. Moving model weights between different formats is error-prone.
Using the wrong tools: Instruction fine-tuning does not fix hallucinations.

On platforms such as Lamini, you can run PEFT instruction fine-tuning and Memory Tuning with a one-line command, and it is free.

When fine-tuning, focus on what is specific to your use case: more and better data, and evaluation.

Generate Data and Fine-Tune

You usually have more data than you think. The problem is typically that the data is not in the right format. Manual data labeling used to be heavy work, but today, as long as you specify the right format, LLMs make it far less of a problem. What the right format is depends heavily on the application. Examples include mapping a prompt to a SQL query, or a prompt to an ID or a response.

Practical Tips

Add examples: Use few-shot examples, also known as in-context learning. Corrected hallucinated examples that resemble what the LLM needs to learn are especially useful.
Generate variations: This gives you breadth. For example, instead of using only NBA players, you can add generations that try out alternative entities.
Filter generations: Being able to filter, and to trust your automated filters, is essential for reducing the generated examples to a high-quality dataset in a scalable way.
Inspect what worked and what did not: In general, more complex queries are harder. Reviewing them helps you tune the prompts that implement all of the tips above.

Minimum requirements for submitting data for fine-tuning:

Memory Tuning: 1 data point. Instruction fine-tuning: 1,000 data points. Instruction fine-tuning requires prompt-response pairs. Memory Tuning requires the facts you want the LLM to learn, expressed in the responses.
Ease of fine-tuning: A single API/Python call when using a library such as Lamini. Alternatively, you can build your own system, tune the hyperparameters, and run the model's forward and backward passes yourself.
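
As a rough illustration of the prompt-response format, the sketch below writes two pairs to a JSONL file using only the standard library. The `question` and `sql` field names and both example pairs mirror the ones used elsewhere in this article; the file name `example_pairs.jsonl` and the helpers `write_jsonl` and `read_jsonl` are hypothetical.

```python
import json
import os
import tempfile

# Each training example is one prompt-response pair: a question and its SQL.
pairs = [
    {
        "question": "What is the median weight in the NBA?",
        "sql": "select CAST(SUBSTR(WT, 1, INSTR(WT,' ')) as INTEGER) as percentile "
               "from nba_roster order by percentile limit 1 "
               "offset (select count(*) from nba_roster)/2;",
    },
    {
        "question": "Who is the highest paid NBA player?",
        "sql": "SELECT salary, name FROM nba_roster WHERE salary != '--' "
               "ORDER BY CAST(REPLACE(REPLACE(salary, '$', ''), ',','') AS INTEGER) "
               "DESC LIMIT 1;",
    },
]


def write_jsonl(path, rows):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")


def read_jsonl(path):
    """Read the file back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


path = os.path.join(tempfile.gettempdir(), "example_pairs.jsonl")
write_jsonl(path, pairs)
loaded = read_jsonl(path)
print(len(loaded), "pairs written to", path)
```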

Before starting to fine-tune, estimate the time and compute requirements.

Fine-tuning on 1 NVIDIA A100 GPU at high performance (40% MFU)

Custom instruction fine-tuning with LoRA reaches roughly 50% accuracy but is very fast: it takes only a few minutes and needs only 19.2 petaFLOPs of compute. These figures assume 1 NVIDIA A100 GPU (or an equivalent AMD GPU) running at 40% MFU (model FLOPs utilization), and the accuracy is around 50% on average. Memory Tuning in its unoptimized form is considerably more demanding. It still uses LoRA, just without the optimizations, and its accuracy across thousands of facts can be much higher because hallucinations are greatly reduced. It takes much longer and needs far more compute, up to 1.9 exaFLOPs and about 4 hours, because you are driving the loss on specific facts to zero. Lamini Memory Tuning applies the optimizations described earlier and needs only about a minute and 24 petaFLOPs.
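
The time figures follow from dividing the required compute by the GPU's effective throughput. The sketch below redoes that arithmetic under an assumption that is not stated in the text: a peak of 312 TFLOPS for one A100 (its dense BF16 tensor-core rating). The function name `estimate_seconds` is hypothetical, and the results are rough estimates only.

```python
A100_PEAK_FLOPS = 312e12  # assumption: dense BF16 peak of one A100, in FLOP/s
MFU = 0.40                # model FLOPs utilization quoted in the text


def estimate_seconds(total_flops, peak=A100_PEAK_FLOPS, mfu=MFU):
    """Wall-clock estimate: total work divided by effective throughput."""
    return total_flops / (peak * mfu)


jobs = {
    "LoRA instruction fine-tuning": 19.2e15,  # 19.2 petaFLOPs
    "Unoptimized Memory Tuning": 1.9e18,      # 1.9 exaFLOPs
    "Lamini Memory Tuning": 24e15,            # 24 petaFLOPs
}

for name, flops in jobs.items():
    seconds = estimate_seconds(flops)
    print(f"{name}: {seconds / 60:.1f} min ({seconds / 3600:.2f} h)")
```

Under this assumption, 19.2 petaFLOPs works out to a few minutes and 1.9 exaFLOPs to a little over four hours, in line with the figures quoted above.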

Expectations:

Fine-tuning is an iterative, experimental process. Reaching 95% accuracy takes 10 to 30 iterations over different data pipelines.
Fine-tuning several variants in parallel speeds up experimentation. That is when you can start talking about accuracies of 90% and above.

What can improve your accuracy:

If you have all the facts somewhere, you have enough data. LLMs can handle the rest: run your data through an LLM to get it into the form you need.
Your LLM data pipelines can become quite complex and unique, tailored to your goals, data, and problems (if you are an AI startup, this can become a differentiator). Get good at detecting LLM errors and hallucinations so that you can teach your LLM what is correct. The applications that get built fastest are those where the data evaluators sit next to the application developers.

from dotenv import load_dotenv
_ = load_dotenv()   #load environmental variable LAMINI_API_KEY with key from .env file
import lamini
import logging
import random
from typing import AsyncIterator, Iterator, Union
import sqlite3
import copy
from tqdm import tqdm

import pandas as pd
import jsonlines
from lamini.generation.base_prompt_object import PromptObject
from lamini.generation.generation_node import GenerationNode
from lamini.generation.generation_pipeline import GenerationPipeline
from util.get_schema import get_schema, get_schema_s
from util.make_llama_3_prompt import make_llama_3_prompt
from util.setup_logging import setup_logging

logger = logging.getLogger(__name__)
engine = sqlite3.connect("./nba_roster.db")
setup_logging()

class Args:
    def __init__(self, 
                 max_examples=100, 
                 sql_model_name="meta-llama/Meta-Llama-3-8B-Instruct", 
                 gold_file_name="gold-test-set.jsonl",
                 training_file_name="generated_queries.jsonl",
                 num_to_generate=10):
        self.sql_model_name = sql_model_name
        self.max_examples = max_examples
        self.gold_file_name = gold_file_name
        self.training_file_name = training_file_name
        self.num_to_generate = num_to_generate

Working Backwards from What You Have

1. Generate new SQL queries from the schema and an example:

system = "You are an NBA analyst with 15 years of experience writing complex SQL queries.\n"
system += (
    "Consider a table called 'nba_roster' with the following schema (columns)\n"
)
system += get_schema_s()
system += "Consider the following questions, and queries used to answer them:\n"
print(system)

question = """What is the median weight in the NBA?"""
sql = "select CAST(SUBSTR(WT, 1, INSTR(WT,' ')) as INTEGER) as percentile from nba_roster order by percentile limit 1 offset (select count(*) from nba_roster)/2;"

system += "Question: " + question + "\n"
system += "Query: " + sql + "\n"
print(system)

user = "Write two queries that are similar but different to those above.\n"
user += "Format the queries as a JSON object, i.e.\n"
user += '{ "explanation": str, "sql_query_1" : str, "sql_query_2": str }.\n'
print(user)

user += "First write an explanation of why you decided to write these new queries in about 3-5 sentences, then write valid sqlite SQL queries for each of the 2 new queries. Make sure each query is complete and ends with a ;\n"
print(user)

prompt = make_llama_3_prompt(user, system)

llm = lamini.Lamini(model_name="meta-llama/Meta-Llama-3-8B-Instruct")
result = llm.generate(prompt, output_type={ "explanation": "str", "sql_query_1" : "str", "sql_query_2": "str" }, max_new_tokens=200)
print(result)

def check_sql_query(query):
    try:
        pd.read_sql(query, con=engine)
    except Exception as e:
        logger.debug(f"Error in SQL query: {e}")
        return False
    logger.info(f"SQL query {query} is valid")
    return True

check_sql_query(result["sql_query_1"])

check_sql_query(result["sql_query_2"])
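
To try the same validity check without the course database, here is a self-contained sketch that uses an in-memory SQLite database and only the standard library. The two-row table and its values are made up for illustration (only the column names `NAME` and `WT` follow the queries in this article), and `is_valid_sql` is a hypothetical stand-in for `check_sql_query`.

```python
import sqlite3

# In-memory stand-in for nba_roster.db with made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nba_roster (NAME TEXT, WT TEXT)")
conn.executemany(
    "INSERT INTO nba_roster VALUES (?, ?)",
    [("Player A", "200 lbs"), ("Player B", "220 lbs")],
)


def is_valid_sql(query, connection=conn):
    """Return True if SQLite can execute the query, False otherwise."""
    try:
        connection.execute(query).fetchall()
    except (sqlite3.Error, sqlite3.Warning):
        return False
    return True


print(is_valid_sql("SELECT NAME FROM nba_roster;"))    # True
print(is_valid_sql("SELECT HEIGHT FROM nba_roster;"))  # False: no such column
```

Executing a query only shows that it is syntactically valid and refers to existing columns. It says nothing about whether the query answers the intended question, which is why a separate scoring stage is used later.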

# Let's wrap it all up in a single class

class ModelStage(GenerationNode):
    def __init__(self):
        super().__init__(
            model_name="meta-llama/Meta-Llama-3-8B-Instruct",
            max_new_tokens=300,
        )

    def generate(
        self,
        prompt: Union[Iterator[PromptObject], AsyncIterator[PromptObject]],
        *args,
        **kwargs,
    ):
        prompt = self.add_template(prompt)

        results = super().generate(
            prompt,
            output_type={
                "explanation": "str",
                "sql_query_1": "str",
                "sql_query_2": "str",
            },
            *args,
            **kwargs,
        )

        return results

    async def add_template(self, prompts):
        async for prompt in prompts:
            new_prompt = make_llama_3_prompt(**self.make_prompt(prompt.data))
            yield PromptObject(prompt=new_prompt, data=prompt.data)

    async def process_results(self, results):
        async for result in results:
            if result is None:
                continue

            if result.response is None:
                continue

            logger.info("=====================================")
            logger.info(f"Generated query 1: {result.response['sql_query_1']}")
            logger.info(f"Generated query 2: {result.response['sql_query_2']}")
            logger.info("=====================================")

            if self.check_sql_query(result.response["sql_query_1"]):
                new_result = PromptObject(prompt="", data=copy.deepcopy(result.data))
                new_result.data.generated_sql_query = result.response["sql_query_1"]
                yield new_result

            if self.check_sql_query(result.response["sql_query_2"]):
                new_result = PromptObject(prompt="", data=copy.deepcopy(result.data))
                new_result.data.generated_sql_query = result.response["sql_query_2"]
                yield new_result

    def make_prompt(self, data):
        system = "You are an NBA analyst with 15 years of experience writing complex SQL queries.\n"
        system += (
            "Consider a table called 'nba_roster' with the following schema (columns)\n"
        )
        system += get_schema()
        system += "Consider the following questions, and queries used to answer them:\n"
        for example in data.sample:
            system += "Question: " + example["question"] + "\n"
            system += "Query: " + example["sql"] + "\n"

        # Important: generate relevant queries to your reference data
        # Ideally, close to those that are failing so you can show the model examples of how to do it right!
        user = "Write two queries that are similar but different to those above.\n"
        user += "Format the queries as a JSON object, i.e.\n"
        user += '{ "explanation": str, "sql_query_1" : str, "sql_query_2": str }.\n'

        # Next, use Chain of Thought (CoT) and prompt-engineering to help with generating SQL queries
        user += "First write an explanation of why you decided to write these new queries in about 3-5 sentences, then write valid sqlite SQL queries for each of the 2 new queries. Make sure each query is complete and ends with a ;\n"

        return {"system": system, "user": user}

    def check_sql_query(self, query):
        try:
            pd.read_sql(query, con=engine)
        except Exception as e:
            logger.debug(f"Error in SQL query: {e}")
            return False

        logger.info(f"SQL query {query} is valid")

        return True

2. Now that you have queries, generate a question for each of them:

system = "You are an NBA analyst with 15 years of experience writing complex SQL queries.\n"
system += (
    "Consider a table called 'nba_roster' with the following schema (columns)\n"
)
system += get_schema() + "\n"
system += "Queries, and questions that they are used to answer:\n"

example_question = """What is the median weight in the NBA?"""
example_sql = "select CAST(SUBSTR(WT, 1, INSTR(WT,' ')) as INTEGER) as percentile from nba_roster order by percentile limit 1 offset (select count(*) from nba_roster)/2;"

system += "Question: " + example_question + "\n"
system += "Query: " + example_sql + "\n"

generated_sql = result["sql_query_2"]

user = "Now consider the following query.\n"
user += "Query: " + generated_sql + "\n"
user += "Write a question that this query could be used to answer.\n"

user += "Format your response as a JSON object, i.e.\n"
user += '{ "explanation": str, "question": str }.\n'

user += "First write an explanation in about 3-5 sentences, then write a one sentence question.\n"

prompt = make_llama_3_prompt(user, system)
result = llm.generate(prompt, output_type={ "explanation": "str", "question" : "str" }, max_new_tokens=200)
print(result)

# Wrap it all up together in a class which generates a question
# given a query

class QuestionStage(GenerationNode):
    def __init__(self):
        super().__init__(
            model_name="meta-llama/Meta-Llama-3-8B-Instruct",
            max_new_tokens=150,
        )

    def generate(
        self,
        prompt: Union[Iterator[PromptObject], AsyncIterator[PromptObject]],
        *args,
        **kwargs,
    ):
        results = super().generate(
            prompt,
            output_type={
                "explanation": "str",
                "question": "str",
            },
            *args,
            **kwargs,
        )
        return results

    def preprocess(self, obj: PromptObject):
        new_prompt = make_llama_3_prompt(**self.make_question_prompt(obj.data))
        obj.prompt = new_prompt

    def make_question_prompt(self, data):
        system = "You are an NBA analyst with 15 years of experience writing complex SQL queries.\n"
        system += (
            "Consider a table called 'nba_roster' with the following schema (columns)\n"
        )
        system += get_schema() + "\n"
        system += "Queries, and questions that they are used to answer:\n"
        for example in data.sample:
            system += "Query: " + example["sql"] + "\n"
            system += "Question: " + example["question"] + "\n"

        user = "Now consider the following query.\n"
        user += "Query: " + data.generated_sql_query + "\n"
        user += "Write a question that this query could be used to answer.\n"

        # Using Chain of Thought (CoT) again
        # This time you can do it programmatically with function calling, so you can easily extract a question out of the JSON object
        user += "Format your response as a JSON object, i.e.\n"
        user += '{ "explanation": str, "question": str }.\n'

        user += "First write an explanation in about 3-5 sentences, then write a one sentence question.\n"

        return {"system": system, "user": user}

class QueryGenPipeline(GenerationPipeline):
    def __init__(self):
        super().__init__()
        self.model_stage = ModelStage()
        self.question_stage = QuestionStage()

    def forward(self, x):
        x = self.model_stage(x)
        x = self.question_stage(x)
        return x

async def run_query_gen_pipeline(gold_queries):
    return QueryGenPipeline().call(gold_queries)
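
# The pipeline above streams items through two stages. Stripped of the Lamini classes, the underlying pattern is a chain of async generators, where one stage can drop items (as `process_results` does for invalid SQL) before the next stage sees them. The sketch below illustrates only that pattern with plain `asyncio`. It does not call an LLM or use the Lamini API, and the stage functions and their toy logic are hypothetical.

```python
import asyncio


async def source(items):
    """Turn a plain list into an async stream."""
    for item in items:
        yield item


async def query_stage(stream):
    """Stand-in for ModelStage: keep only items that look like SELECT queries."""
    async for item in stream:
        if item.strip().lower().startswith("select"):
            yield {"sql": item}


async def question_stage(stream):
    """Stand-in for QuestionStage: attach a placeholder question to each query."""
    async for item in stream:
        item["question"] = f"What does this query return: {item['sql']}"
        yield item


async def run_pipeline(items):
    # Stages are composed by nesting; items flow through one at a time.
    return [x async for x in question_stage(query_stage(source(items)))]


results = asyncio.run(run_pipeline(["select 1;", "not sql", "SELECT 2;"]))
print(len(results))  # 2: the non-SQL item was dropped by the first stage
```

# In a notebook, where an event loop is already running, use `results = await run_pipeline([...])` instead of `asyncio.run(...)`.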

# Generate N samples, for every example in the gold dataset

all_examples = []

async def load_gold_queries(args):
    path = f"data/{args.gold_file_name}"

    with jsonlines.open(path) as reader:
        global all_examples

        all_examples = [obj for obj in reader]

    sample_count = args.num_to_generate
    sample_size = 3

    random.seed(42)

    for i in range(sample_count):
        example_sample = ExampleSample(random.sample(all_examples, sample_size), i)
        yield PromptObject(prompt="", data=example_sample)


class ExampleSample:
    def __init__(self, sample, index):
        self.sample = sample
        self.index = index

async def save_generation_results(results, args):
    path = f"data/training_data/{args.training_file_name}"

    pbar = tqdm(desc="Saving results", unit=" results")
    with jsonlines.open(path, "w") as writer:

        async for result in results:
            writer.write(
                {
                    "question": result.response["question"],
                    "sql": result.data.generated_sql_query,
                }
            )
            pbar.update()

        for example in all_examples:
            writer.write(example)
            pbar.update()

args = Args()
gold_queries = load_gold_queries(args)
results = await run_query_gen_pipeline(gold_queries)
await save_generation_results(results, args)

# Display the queries generated above
#!cat "data/training_data/generated_queries.jsonl"

# Display the archived queries which match the course video
!cat "data/training_data/archive/generated_queries.jsonl"

# ROUND OF FINETUNING
# Now that you have data, even if it is not perfect, go through a round of finetuning!

import logging
import os
from datetime import datetime
from pprint import pprint
from typing import AsyncIterator, Iterator, Union
import sqlite3
from tqdm import tqdm

import pandas as pd
import jsonlines
import lamini
from lamini.generation.base_prompt_object import PromptObject
from lamini.generation.generation_node import GenerationNode
from lamini.generation.generation_pipeline import GenerationPipeline
from util.get_schema import get_schema
from util.make_llama_3_prompt import make_llama_3_prompt
from util.setup_logging import setup_logging
from util.load_dataset import get_dataset
from util.get_default_finetune_args import get_default_finetune_args

logger = logging.getLogger(__name__)
engine = sqlite3.connect("./nba_roster.db")
setup_logging()

class Args:
    def __init__(self, 
                 max_examples=100, 
                 sql_model_name="meta-llama/Meta-Llama-3-8B-Instruct", 
                 gold_file_name="gold-test-set.jsonl",
                 training_file_name="archive/generated_queries.jsonl",
                 num_to_generate=10):
        self.sql_model_name = sql_model_name
        self.max_examples = max_examples
        self.gold_file_name = gold_file_name
        self.training_file_name = training_file_name
        self.num_to_generate = num_to_generate

# make_question will take the questions and queries from the training_file and embed them in the prompt below to form the training data.
def make_question(obj):
    system = "You are an NBA analyst with 15 years of experience writing complex SQL queries.\n"
    system += "Consider the nba_roster table with the following schema:\n"
    system += get_schema() + "\n"
    system += (
        "Write a sqlite SQL query that would help you answer the following question:\n"
    )
    user = obj["question"]
    return {"system": system, "user": user}

args = Args()
llm = lamini.Lamini(model_name="meta-llama/Meta-Llama-3-8B-Instruct")
dataset = get_dataset(args, make_question)
finetune_args = get_default_finetune_args()

"""
This fine-tuning step takes about 30 minutes to complete. The dispatch to run on the Lamini services is commented out and the pre-computed final results of the run are provided below. You can uncomment and run it if you have modified the data on your own.
"""
# llm.train(
#     data_or_dataset_id=dataset,
#     finetune_args=finetune_args,
#     is_public=True,  # For sharing
# )

# Let's examine this pre-computed finetuning result
llm = lamini.Lamini(model_name="a5ebf1c4879569101f32444afae5adcafbfce9c5a6ed13035fd892147f7d59bc")

question = """Who is the highest paid NBA player?"""
system = f"""You are an NBA analyst with 15 years of experience writing complex SQL queries. Consider the nba_roster table with the following schema:
{get_schema()}

Write a sqlite query to answer the following question. Follow instructions exactly"""
prompt = make_llama_3_prompt(question, system)
print("Question:\n", question)

print("Answer:")
print(llm.generate(prompt, max_new_tokens=200))

query="SELECT salary, name FROM nba_roster WHERE salary != '--' ORDER BY CAST(REPLACE(REPLACE(salary, '$', ''), ',','') AS INTEGER) DESC LIMIT 1;"
df = pd.read_sql(query, con=engine)
print(df)


# Let's run an evaluation over the eval dataset.
# Collapsible or utils from Lesson 3 Lab for evaluation
class QueryStage(GenerationNode):
    def __init__(self, model_name):
        super().__init__(
            model_name=model_name,
            max_new_tokens=300,
        )

    def generate(
        self,
        prompt: Union[Iterator[PromptObject], AsyncIterator[PromptObject]],
        *args,
        **kwargs,
    ):
        results = super().generate(
            prompt,
            output_type={"sqlite_query": "str"},
            *args,
            **kwargs,
        )
        return results


    def postprocess(self, obj: PromptObject):
        # Run both the generated and reference (Gold Dataset) SQL queries
        # Assessing whether the SQL queries succeeded in hitting the database (not correctness yet!)
        
        query_succeeded = False

        try:
            logger.info(f"Running SQL query '{obj.response['sqlite_query']}'")
            obj.data["generated_query"] = obj.response["sqlite_query"]
            df = pd.read_sql(obj.response["sqlite_query"], con=engine)
            obj.data['df'] = df
            logger.info(f"Got data: {df}")
            query_succeeded = True

        except Exception as e:
            logger.error(
                f"Failed to run SQL query: {obj.response['sqlite_query']} ({e})"
            )

        logger.info(f"Running reference SQL query '{obj.data['sql']}'")
        df = pd.read_sql(obj.data["sql"], con=engine)
        logger.info(f"Got data: {df}")
        obj.data['reference_df'] = df

        logger.info(f"For question: {obj.data['question']}")
        logger.info(f"For query: {obj.response['sqlite_query']}")

        obj.data["query_succeeded"] = query_succeeded

    def preprocess(self, obj: PromptObject):
        new_prompt = make_llama_3_prompt(**self.make_prompt(obj.data))
        obj.prompt = new_prompt

    def make_prompt(self, data: dict):
        system = "You are an NBA analyst with 15 years of experience writing complex SQL queries.\n"
        system += "Consider the nba_roster table with the following schema:\n"
        system += get_schema() + "\n"
        system += (
            "Write a sqlite SQL query that would help you answer the following question. Make sure each query ends with a semicolon:\n"
        )
        user = data["question"]
        return {
            "user": user,
            "system": system,
        }
    
class ScoreStage(GenerationNode):
    def __init__(self):
        super().__init__(
            model_name="meta-llama/Meta-Llama-3-8B-Instruct",
            max_new_tokens=150,
        )

    def generate(
        self,
        prompt: Union[Iterator[PromptObject], AsyncIterator[PromptObject]],
        *args,
        **kwargs,
    ):
        results = super().generate(
            prompt,
            output_type={"explanation": "str", "similar": ["true", "false"]},
            *args,
            **kwargs,
        )
        return results

    def preprocess(self, obj: PromptObject):
        obj.prompt = make_llama_3_prompt(**self.make_prompt(obj))
        logger.info(f"Scoring Stage Prompt:\n{obj.prompt}")

    def postprocess(self, obj: PromptObject):
        obj.data['is_matching'] = self.is_matching(obj.data, obj.response)
        obj.data['explanation'] = obj.response["explanation"]
        obj.data['similar'] = obj.response["similar"] == "true"

    def is_matching(self, data, response):
        return (str(data.get('df',"None")).lower() == str(data['reference_df']).lower() 
                or response['similar'] == "true")

    def make_prompt(self, obj: PromptObject):
        # Your evaluation model compares SQL output from the generated and reference SQL queries, using another LLM in the pipeline
        '''
        Note:
        Prompt tuning is important! 
        A previous iteration of this scoring pipeline said `Compare the following two dataframes to see if they are identical`.
        That prompt turned out to be too stringent of criteria.
        '''
        system_prompt = "Compare the following two dataframes. They are similar if they are almost identical, or if they convey the same information about the nba_roster dataset. "
        system_prompt += "Respond with valid JSON {'explanation' : str, 'similar' : bool}"
        user_prompt = (
            f"========== Dataframe 1 =========\n{str(obj.data.get('df','None')).lower()}\n\n"
        )
        user_prompt += (
            f"========== Dataframe 2 =========\n{str(obj.data['reference_df']).lower()}\n\n"
        )
        user_prompt += f"Can you tell me if these dataframes are similar?"
        return {
            "system": system_prompt,
            "user": user_prompt
        }
    
async def run_eval(dataset, args):

    results = await run_evaluation_pipeline(dataset, args)

    print("Total results:", len(results))

    return results


async def run_evaluation_pipeline(dataset, args):
    results = EvaluationPipeline(args).call(dataset)

    result_list = []

    pbar = tqdm(desc="Saving results", unit=" results")
    async for result in results:
        result_list.append(result)
        pbar.update()
    return result_list


class EvaluationPipeline(GenerationPipeline):
    def __init__(self, args):
        super().__init__()
        self.query_stage = QueryStage(args.sql_model_name)
        self.score_stage = ScoreStage()


    def forward(self, x):
        x = self.query_stage(x)
        x = self.score_stage(x)
        return x
    
def load_gold_dataset(args):
    path = f"data/{args.gold_file_name}"

    with jsonlines.open(path) as reader:
        for index, obj in enumerate(reversed(list(reader))):
            if index >= args.max_examples:
                break
            yield PromptObject(prompt="", data=obj)

def save_eval_results(results, args):
    base_path = "./data/results"
    now = datetime.now().strftime("%Y_%m_%d_%H_%M_%S")
    experiment_name = f"nba_sql_pipeline_{now}"
    experiment_dir = os.path.join(base_path, experiment_name)
    os.makedirs(experiment_dir)

    # Write args to file
    args_file_name = f"{experiment_dir}/args.txt"
    with open(args_file_name, "w") as writer:
        pprint(args.__dict__, writer)


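    def is_correct(r):
        # Correct if the query ran and matched the reference output,
        # or if the generated SQL is identical to the reference SQL.
        if (
            (r.data["query_succeeded"] and r.data['is_matching']) or 
            r.data["generated_query"] == r.data['sql']
        ):
            return True
        return False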

    # Write sql results and errors to file
    results_file_name = f"{experiment_dir}/sql_results.jsonl"
    with jsonlines.open(results_file_name, "w") as writer:
        for result in results:
            if not is_correct(result):
                continue
            writer.write(
                {
                    "question": result.data['question'],
                    "query": result.data["generated_query"],
                    "query_succeeded": result.data["query_succeeded"],
                    "reference_sql": result.data['sql'],
                    "df": str(result.data.get('df', 'None')),
                    "reference_df": str(result.data['reference_df']),
                    'is_matching': result.data['is_matching'],
                    'similar': result.data['similar'],
                }
            )

    results_file_name = f"{experiment_dir}/sql_errors.jsonl"
    with jsonlines.open(results_file_name, "w") as writer:
        for result in results:
            if is_correct(result):
                continue
            writer.write(
                {
                    "question": result.data['question'],
                    "query": result.data["generated_query"],
                    "query_succeeded": result.data["query_succeeded"],
                    "df": str(result.data.get('df', 'None')),
                    "reference_df": str(result.data['reference_df']),
                    'is_matching': result.data['is_matching'],
                    'similar': result.data['similar'],
                }
            )

    # Write statistics to file
    average_sql_succeeded = sum(
        [result.data["query_succeeded"] for result in results]
    ) / len(results)
    average_correct = sum(
        [result.data["query_succeeded"] and result.data['is_matching'] for result in results]
    ) / len(results)

    file_name = f"{experiment_dir}/summary.txt"
    with open(file_name, "w") as writer:
        print(f"Total size of eval dataset: {len(results)}", file=writer)
        print(f"Total size of eval dataset: {len(results)}")
        print(f"Percent Valid SQL Syntax: {average_sql_succeeded*100}", file=writer)
        print(f"Percent Valid SQL Syntax: {average_sql_succeeded*100}")
        print(f"Percent Correct SQL Query: {average_correct*100}", file=writer)
        print(f"Percent Correct SQL Query: {average_correct*100}")

# Run the evaluation and you can see there is more valid SQL and correct queries.
args = Args(sql_model_name="a5ebf1c4879569101f32444afae5adcafbfce9c5a6ed13035fd892147f7d59bc")
dataset = load_gold_dataset(args)
results = await run_eval(dataset, args)
save_eval_results(results, args)

# Filtering the dataset
# Next step is filtering, manually create functions to filter the test set.

question_set = set()
sql_set = set()

def is_not_valid_sql(question, sql):
    try:
        df = pd.read_sql(sql, con=engine)
        return False
    except Exception as e:
        return True

def has_null_in_sql_or_question(question, sql):
    return "null" in sql.lower() or "null" in question.lower()

def returns_empty_dataframe(question, sql):
    try:
        df = pd.read_sql(sql, con=engine)
        return "Empty" in str(df) or "None" in str(df)
    except Exception as e:
        return False
    
def uses_avg_on_ht_column(question, sql):
    return "avg(ht)" in sql.lower() or "avg(salary" in sql.lower() 

filter_conditions = [is_not_valid_sql, has_null_in_sql_or_question, returns_empty_dataframe, uses_avg_on_ht_column]

def training_semicolon(sql):
    # Make sure the query ends with a semicolon (also safe for empty strings)
    sql = sql.strip()
    if not sql.endswith(";"):
        return sql + ";"
    return sql

with jsonlines.open("data/training_data/archive/generated_queries_large.jsonl", "r") as reader:
    with jsonlines.open("data/training_data/generated_queries_large_filtered.jsonl", "w") as writer:
        for r in reader:
            if r["question"] in question_set or r["sql"] in sql_set:
                continue
            question_set.add(r["question"])
            sql_set.add(r["sql"])
            
            if any(c(r['question'], r['sql']) for c in filter_conditions):
                continue

            sql = training_semicolon(r['sql'])
            writer.write(
                {
                    "question": r["question"],
                    "sql": sql,
                }
            )
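
# The same deduplicate-then-filter logic can be exercised on a small in-memory list. The rows below are made up for illustration, and the filters `has_null` and `is_empty_sql` are simplified, hypothetical versions of the ones above that do not need the database.

```python
def has_null(question, sql):
    """Reject pairs that mention NULL in either the question or the query."""
    return "null" in sql.lower() or "null" in question.lower()


def is_empty_sql(question, sql):
    """Reject pairs whose query is blank."""
    return not sql.strip()


def ensure_semicolon(sql):
    """Normalize the query so that it ends with a semicolon."""
    sql = sql.strip()
    return sql if sql.endswith(";") else sql + ";"


def filter_rows(rows, conditions):
    seen_questions, seen_sql = set(), set()
    kept = []
    for row in rows:
        q, s = row["question"], row["sql"]
        if q in seen_questions or s in seen_sql:
            continue  # duplicate question or duplicate query
        seen_questions.add(q)
        seen_sql.add(s)
        if any(cond(q, s) for cond in conditions):
            continue  # rejected by at least one filter
        kept.append({"question": q, "sql": ensure_semicolon(s)})
    return kept


rows = [  # made-up rows for illustration
    {"question": "How many players are there?", "sql": "SELECT COUNT(*) FROM nba_roster"},
    {"question": "How many players are there?", "sql": "SELECT COUNT(*) FROM nba_roster;"},
    {"question": "Which players have no college?", "sql": "SELECT name FROM nba_roster WHERE college IS NULL;"},
]

filtered = filter_rows(rows, [has_null, is_empty_sql])
print(filtered)
```

# Only the first row survives: the second repeats its question, and the third is rejected by the NULL filter.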

References

Deeplearning.ai (September 2024), Improving Accuracy of LLM Applications:

[https://learn.deeplearning.ai/courses/improving-accuracy-of-llm-applications/]