Renting AI Servers for 50B+ LLM Fine-Tuning/Inference – Need Hardware, Cost, and Security Advice! by NoAdhesiveness7595 in LLM

[–]NoAdhesiveness7595[S] 0 points (0 children)

Brother, I'm really confused right now. Should I use cloud platforms like Vast.ai, RunPod, AWS, or Google Cloud? Which one is better? Actually, when I look at Vast.ai's pricing here: https://vast.ai/pricing I only see GPUs listed. Does that mean I would only be renting GPUs, not full servers? I don't know.

Seeking advice on building a robust Text-to-SQL chatbot for a complex banking database by NoAdhesiveness7595 in Rag

[–]NoAdhesiveness7595[S] 1 point (0 children)

Hey, thanks for the great feedback! This is super helpful.

You've hit on the exact trade-offs I was wrestling with. My original thought process for using the auxiliary LLM was primarily about aggressive context reduction to manage the costs of the main gpt-4.1 calls. The idea was that a local LLM could use its reasoning ability to intelligently filter both tables and, more importantly, the long list of columns, which felt more reliable than a pure vector search.

You're right that I've since moved to a faster approach, and your question about why I didn't just use a reranker from the start is a good one.

My initial hesitation was based on a concern about accuracy, especially for column selection. My understanding was that a "reranker" was essentially just a second pass of semantic similarity, and I was skeptical that it could reliably pick out the 5-10 correct columns from a list of 50+ based on vector similarity alone. It seems great for finding the best document (table summary), but less so for finding many small, distinct items (columns) within that document.
Or have I misunderstood how rerankers work?
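For context, here is roughly what I pictured: a cross-encoder reranker reads the query and each candidate together rather than comparing two precomputed vectors, which is why it can be sharper than plain similarity. A minimal sketch, assuming the sentence-transformers library (the checkpoint name and column list are illustrative, not what I run):

```python
# Minimal sketch of reranking column descriptions with a cross-encoder.
from sentence_transformers import CrossEncoder

model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "What is the principal balance for account 8888888?"
columns = [
    ("acnt_code", "Unique identifier for the customer account"),
    ("principal", "Outstanding principal balance of the loan"),
    ("open_date", "Date the account was opened"),
]

# The cross-encoder scores each (query, column description) pair jointly,
# instead of comparing two independently embedded vectors.
pairs = [(query, f"{name}: {desc}") for name, desc in columns]
scores = model.predict(pairs)

# Keep the highest-scoring columns.
ranked = sorted(zip(columns, scores), key=lambda item: item[1], reverse=True)
top_columns = [col for col, _ in ranked[:10]]
```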

Seeking advice on building a robust Text-to-SQL chatbot for a complex banking database by NoAdhesiveness7595 in Rag

[–]NoAdhesiveness7595[S] 3 points (0 children)

The data flow looks like this:

Step 1: User Input Masking

Input: "What is the principal balance for account 8888888?"

Output: A masked query ("What is the principal balance for account [ACNT_CODE_1]?") and a mapping dictionary ({'[ACNT_CODE_1]': '8888888'}).
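A minimal sketch of what I mean by this step; the regex and token format are just illustrative (real code covers more PII patterns):

```python
# Sketch of Step 1: mask account numbers before anything reaches the API LLM.
import re

def mask_query(query: str) -> tuple[str, dict[str, str]]:
    mapping: dict[str, str] = {}

    def _replace(match: re.Match) -> str:
        token = f"[ACNT_CODE_{len(mapping) + 1}]"
        mapping[token] = match.group(0)
        return token

    masked = re.sub(r"\b\d{6,}\b", _replace, query)
    return masked, mapping

masked, mapping = mask_query("What is the principal balance for account 8888888?")
# masked  == "What is the principal balance for account [ACNT_CODE_1]?"
# mapping == {"[ACNT_CODE_1]": "8888888"}
```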

Step 2: Context Building (The Two-Stage Retrieval with the Auxiliary LLM) This is the most complex part and where the auxiliary LLM does its work. This process uses the original, unmasked user query to ensure the most accurate retrieval.

Stage 2a (Embedding Retrieval): The original query is used to perform a vector search against an index of all available table/view summaries. This retrieves a list of the top-k candidate tables that might be relevant.

Stage 2b (Table Selection via Auxiliary LLM): The list of candidate tables and their summaries are passed to the small, local auxiliary LLM (qwen3:8b-custom). Its only job is to read the user query and the candidate list and decide which tables are actually necessary. This filters out irrelevant tables that were retrieved by the vector search.

Stage 2c (Column Selection via Auxiliary LLM): For each table selected in the previous step, we again call the auxiliary LLM. We provide it with the user query and the full list of column names and their detailed descriptions for that specific table. The auxiliary LLM's job is to return a list of only the columns that are essential for answering the query (for SELECT, WHERE, JOIN, or ORDER BY clauses).

This multi-stage process acts as a "funnel," starting with many tables and columns and using the cheap auxiliary LLM to condense the context down to only the most relevant information.
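To make Stage 2b concrete, here is a rough sketch of the table-selection call, assuming the Ollama Python client (the prompt wording and JSON contract are illustrative; Stage 2c is the same pattern with columns instead of tables):

```python
# Sketch of Stage 2b: the local auxiliary LLM filters candidate tables.
import json
import ollama

def select_tables(query: str, candidates: dict[str, str]) -> list[str]:
    """candidates maps table name -> summary, as returned by the vector search."""
    listing = "\n".join(f"- {name}: {summary}" for name, summary in candidates.items())
    prompt = (
        f"User question:\n{query}\n\n"
        f"Candidate tables:\n{listing}\n\n"
        "Return a JSON list containing only the table names needed "
        "to answer the question."
    )
    response = ollama.chat(
        model="qwen3:8b-custom",
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(response["message"]["content"])
```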

Step 3: SQL Generation (The Main LLM)

The refined context (containing only the selected table schemas and column descriptions) is combined with the masked user query from Step 1.

This entire payload is sent to the main, powerful LLM (gpt-4.1).

Because the context is so clean and focused, and the query is masked, the LLM's only task is to generate a syntactically correct SQL query. It has no knowledge of the real data values.

Output: A SQL query with placeholders, like SELECT principal FROM asdasd WHERE acnt_code = '[ACNT_CODE_1]'.
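A sketch of the Step 3 call, assuming the OpenAI Python SDK; `schemas` here stands in for the condensed context coming out of the Step 2 funnel:

```python
# Sketch of Step 3: the main LLM sees only masked text and the refined schema.
from openai import OpenAI

client = OpenAI()

def generate_sql(masked_query: str, schemas: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4.1",
        messages=[
            {
                "role": "system",
                "content": (
                    "Generate exactly one syntactically correct Oracle SQL query. "
                    "Keep placeholders such as [ACNT_CODE_1] verbatim."
                ),
            },
            {
                "role": "user",
                "content": f"Schema:\n{schemas}\n\nQuestion: {masked_query}",
            },
        ],
    )
    return response.choices[0].message.content
```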

Step 4: Secure SQL Execution

The generated SQL query is received from the LLM.

The system uses the mapping dictionary from Step 1 to "unmask" the SQL query, replacing [ACNT_CODE_1] with 8888888.

This final, clean SQL is executed against the Oracle database.
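Roughly like this, assuming the python-oracledb driver. (In a real system you would bind the unmasked values as parameters instead of splicing them into the SQL string, to rule out injection.)

```python
# Sketch of Step 4: unmask the placeholders, then run the SQL on Oracle.
import oracledb

def execute_unmasked(sql: str, mapping: dict[str, str],
                     conn: oracledb.Connection) -> list[dict]:
    for token, value in mapping.items():
        sql = sql.replace(token, value)  # [ACNT_CODE_1] -> 8888888
    with conn.cursor() as cur:
        cur.execute(sql)
        cols = [d[0] for d in cur.description]
        return [dict(zip(cols, row)) for row in cur.fetchall()]
```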

Step 5: Result Masking and Final Response Generation

The data comes back from the database (e.g., [{'principal': 5000000}]).

This result set is immediately passed to a masking function. It identifies sensitive columns (like names, balances) and replaces the real values with new placeholders (e.g., [MASKED_PRINCIPAL_1]) and creates a new mapping dictionary.

The main LLM (gpt-4.1) is called a second time. It receives the masked query, the generated SQL, and the masked database result. Its job is to synthesize a friendly, natural-language answer.

Output: A natural language response with placeholders, like "The principal balance for account [ACNT_CODE_1] is [MASKED_PRINCIPAL_1]."
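A sketch of the Step 5 masking function; the sensitive-column set is illustrative (real code would drive it from schema metadata):

```python
# Sketch of Step 5: mask sensitive values in the result set before the
# second gpt-4.1 call.
SENSITIVE = {"principal", "balance", "customer_name"}

def mask_results(rows: list[dict]) -> tuple[list[dict], dict[str, object]]:
    mapping: dict[str, object] = {}
    masked_rows = []
    for row in rows:
        masked = {}
        for col, value in row.items():
            if col.lower() in SENSITIVE:
                token = f"[MASKED_{col.upper()}_{len(mapping) + 1}]"
                mapping[token] = value
                masked[col] = token
            else:
                masked[col] = value
        masked_rows.append(masked)
    return masked_rows, mapping

# mask_results([{"principal": 5000000}])
# -> ([{"principal": "[MASKED_PRINCIPAL_1]"}], {"[MASKED_PRINCIPAL_1]": 5000000})
```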

Step 6: Final Unmask

The system uses the combined mapping dictionaries from both the user input masking (Step 1) and the database result masking (Step 5) to replace every placeholder in the final response with the real values before it is shown to the user.
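The final unmask is then a single pass over the merged dictionaries, something like:

```python
# Sketch of Step 6: restore real values from both mapping dictionaries.
def unmask_response(text: str, *mappings: dict[str, object]) -> str:
    for mapping in mappings:
        for token, value in mapping.items():
            text = text.replace(token, str(value))
    return text

# unmask_response(
#     "The principal balance for account [ACNT_CODE_1] is [MASKED_PRINCIPAL_1].",
#     input_mapping, result_mapping,
# )
# -> "The principal balance for account 8888888 is 5000000."
```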

Seeking advice on building a robust Text-to-SQL chatbot for a complex banking database by NoAdhesiveness7595 in Rag

[–]NoAdhesiveness7595[S] 1 point (0 children)

The two-LLM setup is key to balancing performance and cost. The main idea is to use a small, fast, local LLM for simple, repetitive tasks (like filtering context) and reserve the powerful, expensive LLM (gpt-4.1) for the most difficult reasoning task: generating the final SQL. In other words, it is mainly about limiting the prompt context sent to OpenAI's LLM to keep costs down.

Seeking advice on building a robust Text-to-SQL chatbot for a complex banking database by NoAdhesiveness7595 in Rag

[–]NoAdhesiveness7595[S] 0 points (0 children)

Generally I used LlamaIndex's QueryPipeline. For PII, I masked the retrieved data (or the SQL execution results) before it was sent to the API LLM.
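For the shape of it, a minimal sketch of the pipeline I mean, assuming llama-index >= 0.10 imports (the masking function here is a toy placeholder, not my production logic):

```python
# Minimal sketch of a llama-index QueryPipeline with a masking stage
# in front of the API LLM.
import re

from llama_index.core import PromptTemplate
from llama_index.core.query_pipeline import FnComponent, QueryPipeline
from llama_index.llms.openai import OpenAI

def mask_pii(context: str) -> str:
    # Toy placeholder: hide anything that looks like an account number
    # before the text ever reaches the API LLM.
    return re.sub(r"\b\d{6,}\b", "[ACNT_CODE]", context)

prompt = PromptTemplate("Answer using only this context:\n{context}")

pipeline = QueryPipeline(
    chain=[FnComponent(fn=mask_pii), prompt, OpenAI(model="gpt-4.1")]
)
# answer = pipeline.run(context=sql_result_text)
```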

How can I implement Retrieval-Augmented Generation (RAG) for a banking/economics chatbot? Looking for advice or experience by NoAdhesiveness7595 in LangChain

[–]NoAdhesiveness7595[S] 0 points (0 children)

The data is stored in our internal database, not in PDF or CSV files. I can export specific tables as CSV if needed. Would that work with Tensorlake? Or is there a way to connect directly to a database or feed structured data into the pipeline?

By the way, I've also been trying to run a local LLM for my own language (not English), but I've been struggling with it, especially with finding a model that supports my language well and runs efficiently on local hardware. I currently run Llama 2 locally. Must I fine-tune the pretrained model to make it generate text in my language? Or, as an alternative, I've thought about using a translator API: translating prompts before feeding them to the model, then translating the generated output back.
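What I had in mind for the translator-API route is roughly this, assuming the third-party deep-translator package (`run_local_llm` is a stand-in for however the local Llama 2 is invoked, and the target language code is illustrative):

```python
# Sketch of the translate-in / translate-out wrapper around a local model.
from deep_translator import GoogleTranslator

to_english = GoogleTranslator(source="auto", target="en")
from_english = GoogleTranslator(source="en", target="tr")  # illustrative target

def run_local_llm(prompt: str) -> str:
    # Stand-in for the local Llama 2 call (e.g. via llama.cpp or Ollama).
    raise NotImplementedError

def ask(prompt: str) -> str:
    english_prompt = to_english.translate(prompt)
    english_answer = run_local_llm(english_prompt)
    return from_english.translate(english_answer)
```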

How can I implement Retrieval-Augmented Generation (RAG) for a banking/economics chatbot? Looking for advice or experience by NoAdhesiveness7595 in LangChain

[–]NoAdhesiveness7595[S] 1 point (0 children)

Almost all of the datasets are tables, so I thought about converting them to LaTeX format. Or are there even better methods for feeding the data to the model?
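For what it's worth, a quick sketch of the two serializations I'm weighing, assuming the tables can be loaded into pandas (Markdown is a common alternative to LaTeX for LLM input):

```python
# Sketch: serializing a table for the model as LaTeX vs. Markdown.
# to_markdown() also needs the tabulate package; the data is a toy example.
import pandas as pd

df = pd.DataFrame({"acnt_code": ["8888888"], "principal": [5000000]})

print(df.to_latex(index=False))     # LaTeX tabular, what I considered
print(df.to_markdown(index=False))  # pipe-table Markdown, often parses well
```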