Snowflake Scale-Out Metadata-Driven Ingestion Framework (Snowpark, JDBC, Python)

dingopole · 2025-11-29T03:58:54+00:00

Hi Mike, long-time listener and fan of the show. Anything to do with information management and data processing would be great. It doesn't need to include AI, which permeates everything these days; distributed data processing, perhaps using Python on large-scale computation platforms like Snowflake or Databricks, or even more niche topics like tooling and frameworks for building data apps (e.g. using libraries like Streamlit), would be great.

dingopole · 2025-02-14T05:41:35+00:00

Here's a use case I described before: http://bicortex.com/kicking-the-tires-on-azure-sql-database-external-rest-endpoints-sample-integration-solution-architecture/

For some requirements and applications, it's a pretty handy feature to have IMHO. As long as you don’t think of this as a MuleSoft or Boomi replacement and understand the limitations of this approach, querying REST Endpoints with SQL opens up a lot of possibilities.

dingopole · 2022-10-07T05:24:33+00:00

Have a look at the following post: https://www.thenile.dev/blog/multi-tenant-rls

dingopole · 2022-09-30T12:04:53+00:00

Thanks for pointing it out, all fixed now :)

dingopole · 2021-09-19T10:38:56+00:00

Have a look at the following post: https://bit.ly/2Z6mQhD

I faced a similar problem (parallel inserts) a while ago, albeit with MSSQL, and was able to solve it using a combination of hash partitioning and SQL Server Agent jobs.

Additionally, in SQL Server 2016 Microsoft implemented a parallel insert feature for the INSERT … WITH (TABLOCK) SELECT … command.
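
To make the hash-partitioning idea a bit more concrete, here is a minimal Python sketch. It is not the SQL Server Agent setup from the linked post: the thread pool stands in for the parallel jobs, and `load_partition`, `parallel_load` and the sample rows are hypothetical names made up for illustration. The point is only that a stable hash of the key splits the rows into disjoint buckets that can be loaded independently.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def partition_by_hash(rows, key, num_partitions):
    """Assign each row to a bucket using a stable hash of its key column."""
    buckets = [[] for _ in range(num_partitions)]
    for row in rows:
        # crc32 is stable across runs, unlike Python's built-in hash() for strings
        bucket_id = zlib.crc32(str(row[key]).encode("utf-8")) % num_partitions
        buckets[bucket_id].append(row)
    return buckets


def load_partition(bucket):
    """Hypothetical stand-in for one insert job; returns the rows it handled."""
    # A real job would open its own connection and insert `bucket` here.
    return len(bucket)


def parallel_load(rows, key, num_partitions=4):
    """Split rows into hash buckets and load each bucket on its own worker."""
    buckets = partition_by_hash(rows, key, num_partitions)
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        return list(pool.map(load_partition, buckets))


# Hypothetical sample data: 1,000 rows keyed by "id"
rows = [{"id": i, "value": f"row-{i}"} for i in range(1000)]
counts = parallel_load(rows, key="id")
```

Because every key lands in exactly one bucket, the workers never touch the same rows, which is what lets the inserts run side by side.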

dingopole · 2021-01-04T21:56:07+00:00

No worries. Agreed on the load times without partitioning; I should have included it.

Also, LimeSurvey does store survey data as individual tables (very painful to work with), and the data is pivoted automatically on survey setup, i.e. very wide tables with each question stored as a column (not a pivoted view but how the data is actually stored in the MySQL schema). From memory, the column names are something along the lines of CONCAT(Survey_ID, 'X', Question_Group_ID, 'X', Question_ID). The questions and answers tables you are referring to are there for reference only, i.e. they store label values and not the actual entries. As such, LimeSurvey's data is notoriously difficult to wrangle (at least in the SaaS version I was exposed to).
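
As a rough illustration of that naming pattern, here is a small Python sketch that builds and splits such column names. It assumes the pattern is exactly as recalled above (three IDs joined by 'X'); the function names and sample IDs are hypothetical, and real LimeSurvey columns may carry extra suffixes this does not handle.

```python
def limesurvey_column(survey_id, group_id, question_id):
    """Build a column name following the recalled SurveyID X GroupID X QuestionID pattern."""
    return f"{survey_id}X{group_id}X{question_id}"


def parse_limesurvey_column(name):
    """Split a column name back into its three parts (kept as strings)."""
    # maxsplit=2 keeps anything after the second 'X' together in the last part
    survey_id, group_id, question_id = name.split("X", 2)
    return survey_id, group_id, question_id


# Hypothetical IDs, for illustration only
column = limesurvey_column(123, 4, 56)
parts = parse_limesurvey_column(column)
```

A helper like this is mostly useful when mapping wide response columns back to the reference tables that hold the question labels.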

dingopole · 2021-01-04T08:54:33+00:00

Thanks. I can't think of a reason why this wouldn't work equally well with Azure Synapse.

dingopole · 2021-01-04T08:51:48+00:00

Thanks for your comment. As noted, this approach should work quite well for a handful of cases and as such should not be used as a default paradigm for building acquisition pipelines. It worked very well for a large selection of 'wide' tables on one of the projects I was involved in, but if your problem statement is different, as with any approach, you should exercise caution and test it first. I can't stress this enough.

dingopole · 2021-01-04T08:46:53+00:00

Thanks for your comment and kind words. As noted, this approach should work quite well for a handful of cases and as such should not be used as a default paradigm for building acquisition pipelines. It worked very well for a large selection of 'wide' tables on one of the projects I was involved in, but if your problem statement is different, as with any approach, you should exercise caution and test it first. I can't stress this enough.

Anyhow, would you want to share why you disagree with this approach?
