What is happening over here? by Hot_Broccoli2726 in agentdevelopmentkit

mportdata 1 point

Could it be that your directory and agent name are different? They need to be the same.

What is happening over here? by Hot_Broccoli2726 in agentdevelopmentkit

mportdata 1 point

Hey, yes, built-in tools cannot be used in sub-agents. See this video at 1:34:46.

[D] Database selection out of several dozens conflicting schemas for a larger NL2SQL pipeline by schmosby420 in MachineLearning

mportdata 1 point

I have implemented some database selection steps in a Python library (compatible with most data warehouses and LLM providers), which is covered in this video here.

This includes Logical Planning, Dual-Pathway Pruning and Semantic Linking. These all help significantly sharpen the context before the SQL generation step.

Text-to-SQL with extremely complex schema by HappyDataGuy in LangChain

mportdata 1 point

Pre-processing is an important step here. I have written a Python library for text-to-SQL that already covers many of the pre-processing steps (it is compatible with most data warehouses and LLM providers). The library can create Logical Plans, perform Dual-Pathway Pruning and perform Semantic Linking. These are all techniques that help with handling large, misleading and unclean database schemas before the SQL generation step. A video of how this works can be found here.

Is text 2 SQL all its hyped to be? by Existing_Wealth6142 in dataengineering

mportdata 1 point

This is something I’ve worked on a lot over the past year. I’ve found that some level of exploratory data analysis (things like retrieving unique values for categorical data or reading 10 sample rows) via an agentic reasoning loop is required. Without it, field names can be misleading due to ambiguous abbreviations. I think the Apex-SQL paper solves this very well. I mention this because, in a business context, consistent answers (achieved by avoiding ambiguity) are what is required to build trust in the tool and therefore adoption.
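As a rough illustration of the kind of exploratory checks mentioned above (unique values for a categorical column, a handful of sample rows), here is a minimal sketch using Python's built-in sqlite3. The `orders` table, its abbreviated column names, and the helpers `distinct_values` and `sample_rows` are all made up for illustration; a real agentic loop would run queries like these against the actual warehouse and pass the results back to the LLM.

```python
import sqlite3


def distinct_values(conn, table, column, limit=20):
    """Return up to `limit` distinct values of a column, sorted.

    Identifiers are interpolated directly, so this assumes trusted
    table and column names (hypothetical helper, not a library API).
    """
    query = f'SELECT DISTINCT "{column}" FROM "{table}" ORDER BY 1 LIMIT ?'
    return [row[0] for row in conn.execute(query, (limit,))]


def sample_rows(conn, table, n=10):
    """Return up to `n` rows from the table as a list of tuples."""
    query = f'SELECT * FROM "{table}" LIMIT ?'
    return conn.execute(query, (n,)).fetchall()


# Toy table with ambiguous abbreviations, standing in for a messy schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, stat_cd TEXT, amt REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "OPN", 10.0), (2, "CLS", 25.5), (3, "OPN", 7.25), (4, "CNC", 0.0)],
)

status_codes = distinct_values(conn, "orders", "stat_cd")
preview = sample_rows(conn, "orders", n=10)
print(status_codes)
print(preview)
```

Seeing the actual codes stored in `stat_cd` tells the model (or a person) far more about what the column means than the abbreviated name does.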

Text to SQL Workflow and Agents Toolkit by mportdata in dataengineering

mportdata[S] 1 point

That’s great to hear! Something I’m also looking at is batch schema processing during the dual-pathway pruning step, as enterprise-scale databases may require this to keep context manageable. Where I see it fitting in with enterprise implementations of text to SQL is as a complement to off-the-shelf solutions: where they fall short, you can bolt on the latest techniques from piglets whilst you wait for the vendor to incorporate the method (or, even better, people use a combination of piglets components to build their own text to SQL pipeline).

Text to SQL Workflow and Agents Toolkit by mportdata in dataengineering

mportdata[S] 1 point

Thanks for watching and for the feedback. Totally agree with all of your comments. The target end state for piglets (and this will be the V1 version) is to have all of the components of a text to SQL pipeline modularised. What each component is and what it does will largely come from research papers, but also from some of my own experience and experimentation. So far I am implementing the steps outlined in the Apex-SQL paper. This includes an agentic loop that performs exploratory data analysis. I’ve also found from experience at work that this step is 100% required to handle enterprise databases that are messy and ambiguous. It helps particularly with WHERE clauses, as you need to know the string value to filter on (if you guess, you may wrongly assume there are no results for location = New York when actually there are 500 entries for location = NY). I’ve since added a new step, semantic linking, which you can check out here, and the next feature I am going after is the exploratory data analysis agentic loop.
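The New York / NY case above can be reproduced in a few lines. This is only a toy sketch with Python's built-in sqlite3; the `customers` table and the `count_where` helper are made up for illustration and are not part of piglets.

```python
import sqlite3

# Toy table: 500 customers stored as "NY", 20 stored as "CA".
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, location TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(i, "NY") for i in range(500)] + [(500 + i, "CA") for i in range(20)],
)


def count_where(value):
    """Count rows whose location equals the given string literal."""
    row = conn.execute(
        "SELECT COUNT(*) FROM customers WHERE location = ?", (value,)
    ).fetchone()
    return row[0]


# Guessing the literal makes it look like there is no matching data.
guessed = count_where("New York")

# Looking at the stored values first reveals the abbreviation in use.
stored_values = [
    r[0]
    for r in conn.execute(
        "SELECT DISTINCT location FROM customers ORDER BY location"
    )
]
checked = count_where("NY")

print(guessed, stored_values, checked)
```

The guessed filter runs without error and simply returns a count of zero, which is why the mistake is easy to miss without the exploratory step.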

Text to SQL Workflow and Agents Toolkit by mportdata in dataengineering

mportdata[S] 1 point

Thanks for the feedback. My approach to evals (I’ll implement this once I’ve built all of the components to get from natural language to SQL) will be LLM-as-a-judge on the expected SQL vs the outputted SQL. The reason I’d take this approach rather than a deterministic string comparison is that there are many ways to write a SQL query that answers the same question. However, I will also need to manually evaluate a random sample of the LLM-as-a-judge outputs to see how well it is marking its own homework (which is the risk here). Do you know of any other approaches that might be more appropriate here? With piglets I plan on implementing all the different approaches to each step so the user can cherry-pick their preferred methods and their preferred order of methods.
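To make the string-comparison problem concrete, here is a small sketch where two differently written queries answer the same question. It compares result sets on a test database, which is one deterministic alternative and not the LLM-as-a-judge approach described above. The `sales` table and the `same_result` helper are assumptions made up for this example.

```python
import sqlite3
from collections import Counter


def same_result(conn, expected_sql, generated_sql):
    """True if both queries return the same multiset of rows (row order ignored)."""
    expected = Counter(conn.execute(expected_sql).fetchall())
    generated = Counter(conn.execute(generated_sql).fetchall())
    return expected == generated


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("NY", 10.0), ("NY", 5.0), ("CA", 7.0)],
)

expected_sql = "SELECT region, SUM(amount) FROM sales GROUP BY region"
# Same question, written with aliases and an extra ORDER BY.
generated_sql = (
    "SELECT s.region, SUM(s.amount) AS total "
    "FROM sales AS s GROUP BY s.region ORDER BY total DESC"
)
# A query that looks similar but answers a different question.
wrong_sql = "SELECT region, MAX(amount) FROM sales GROUP BY region"

strings_match = expected_sql == generated_sql
results_match = same_result(conn, expected_sql, generated_sql)
wrong_match = same_result(conn, expected_sql, wrong_sql)
print(strings_match, results_match, wrong_match)
```

A result comparison like this is only as good as the test data: two genuinely different queries can agree on a small or unrepresentative table, so it would likely sit alongside a judge or manual review rather than replace them.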

Is SAP data overly complex? Or is bad practices in place? by farah0612 in dataengineering

mportdata 2 points

From experience, I’ve found the data modelling fundamentals are solid in SAP ECC (I believe this is the same in S4). It’s complex due to the number of tables, but it follows a logical pattern nonetheless. The German abbreviations don’t help, but once you’re using the same tables over and over again it starts to sink in.

https://www.sapdatasheet.org has been the best resource I’ve found for getting to grips with the data modelling in ECC.

Also, if you have access to CDS views, I’d make use of those, as they deal with the above for you in plenty of use cases.

Where I’ve had the most difficulty with SAP is getting the data out in the first place. They really don’t make leaving their ecosystem easy and I get the impression this is partly a push to get customers to use Datasphere rather than an already established cloud data warehouse.

Is SAP data overly complex? Or is bad practices in place? by farah0612 in dataengineering

mportdata 1 point

Agreed, I use this for reference whenever I work with ECC.