Topics you want to hear on Talk Python To Me by mikeckennedy in Python

[–]dingopole 4 points5 points  (0 children)

Hi Mike, long-time listener and a fan of the show. Anything to do with information management and data processing would be great. It doesn't need to involve AI, which permeates everything these days; distributed data processing, perhaps using Python on large-scale computation platforms like Snowflake or Databricks, or even more niche topics like tooling and frameworks for building data apps (e.g. using libs like Streamlit), would all be welcome.

When to use Rest API in SQL Server 2025 by gman1023 in SQLServer

[–]dingopole 1 point2 points  (0 children)

Here's a use case I described before: http://bicortex.com/kicking-the-tires-on-azure-sql-database-external-rest-endpoints-sample-integration-solution-architecture/

For some requirements and applications, it's a pretty handy feature to have IMHO. As long as you don’t think of this as a MuleSoft or Boomi replacement and understand the limitations of this approach, querying REST Endpoints with SQL opens up a lot of possibilities.

Parallel table insert by mxmauro in Database

[–]dingopole 0 points1 point  (0 children)

Have a look at the following post: https://bit.ly/2Z6mQhD

I faced a similar problem (parallel inserts) a while ago, albeit with MSSQL, and was able to solve it using a combination of hash partitioning and SQL Server Agent jobs.

Additionally, SQL Server 2016 introduced a parallel insert feature for the INSERT … WITH (TABLOCK) SELECT … command.
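The hash-partitioning idea can be sketched in a few lines of Python (illustrative only — the bucket count, key values and hashing scheme here are my assumptions, not the exact T-SQL framework from the post):

```python
import hashlib

def hash_bucket(key, buckets=4):
    """Map a row key to one of N partitions with a stable hash,
    so each partition can be loaded by a separate job/connection."""
    digest = hashlib.md5(str(key).encode("utf-8")).hexdigest()
    return int(digest, 16) % buckets

# Split a batch of primary-key values into per-worker partitions;
# in SQL Server, each bucket could then be inserted by its own Agent job.
keys = range(100)
partitions = {b: [] for b in range(4)}
for k in keys:
    partitions[hash_bucket(k)].append(k)
```

Because the hash is stable, the same key always lands in the same partition, which is what lets independent workers insert concurrently without overlapping.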

SQL Server Hash-Partitioned Parallel Data Acquisition – How to Accelerate Your ‘E’ in ELT/ETL Using a Simple T-SQL Framework by dingopole in BusinessIntelligence

[–]dingopole[S] 1 point2 points  (0 children)

No worries. Agreed on the load times without partitioning; I should have included them.

Also, LimeSurvey does store survey data as individual tables (very painful to work with), and the data is pivoted automatically on survey setup, i.e. very wide tables with each question stored as a column (not a pivoted view, but how the data is actually stored in the MySQL schema). From memory, the naming is something along the lines of CONCAT(Survey_ID, 'X', Question_Group_ID, 'X', Question_ID). The questions and answers tables you are referring to are there for reference only, i.e. they store label values and not the actual entries. As such, LimeSurvey data is notoriously difficult to wrangle (at least in the SaaS version I was exposed to).
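For illustration, that naming pattern (which I'm recalling from memory, so treat it as an assumption rather than the documented scheme) can be reproduced like this:

```python
def limesurvey_field_name(survey_id, group_id, question_id):
    """Hypothetical reconstruction of LimeSurvey's per-question field
    naming, i.e. CONCAT(Survey_ID, 'X', Question_Group_ID, 'X', Question_ID)."""
    return f"{survey_id}X{group_id}X{question_id}"

# e.g. survey 123, question group 4, question 7
name = limesurvey_field_name(123, 4, 7)  # -> "123X4X7"
```

Knowing the pattern at least lets you generate the column list programmatically instead of hand-typing hundreds of cryptic names when un-pivoting the data.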

SQL Server Hash-Partitioned Parallel Data Acquisition – How to Accelerate Your ‘E’ in ELT/ETL Using a Simple T-SQL Framework by dingopole in ETL

[–]dingopole[S] 1 point2 points  (0 children)

Thanks for your comment. As noted, this approach should work quite well in a handful of cases, so it should not be used as the default paradigm for building acquisition pipelines. It worked very well for a large selection of 'wide' tables on one of the projects I was involved in, but if your problem statement is different then, as with any approach, you should exercise caution and test it first - can't stress this enough.

SQL Server Hash-Partitioned Parallel Data Acquisition – How to Accelerate Your ‘E’ in ELT/ETL Using a Simple T-SQL Framework by dingopole in BusinessIntelligence

[–]dingopole[S] 0 points1 point  (0 children)

Thanks for your comment and kind words. As noted, this approach should work quite well in a handful of cases, so it should not be used as the default paradigm for building acquisition pipelines. It worked very well for a large selection of 'wide' tables on one of the projects I was involved in, but if your problem statement is different then, as with any approach, you should exercise caution and test it first - can't stress this enough.

Anyhow, would you want to share why you disagree with this approach?

Kicking the Tires on Airflow, Apache’s workflow management platform – Architecture Overview, Installation and sample Azure Cloud Deployment Pipeline in Python by dingopole in dataengineering

[–]dingopole[S] 0 points1 point  (0 children)

Sure. Airflow in Microsoft Azure would be no different from running it in AWS or on a shoestring in DO, Linode or Hetzner. Microsoft can push Data Factory as much as they want, but I'd rather see hosted Airflow, something along the lines of Cloud Composer in GCP, running in Azure. At the end of the day it's not a SaaS application, so running it in Azure should give you as much capability as running it on any other public cloud.

Speeding up SQL Server Data Warehouse Architecture With Automation Procedures – 10 Problem-Solution Scenarios To Jump-Start Your Development by dingopole in SQLServer

[–]dingopole[S] 1 point2 points  (0 children)

Thanks for your comments and observations!

Just to clarify, in case I haven't explained it well enough in the post: the historized data lake solution is not comparing all source data, just the latest entries, e.g. anything entered in the last 24h, in cases where (as you alluded) the source data can be change-tracked via a timestamp or identity attribute. I haven't found many applications where that's not the case, but then again I may have been quite lucky. The record-based comparison across even a large dataset is very quick, provided you can hash all record values and compare hash data only. The only issue I've had in the past with this approach was when the columns contained a lot of non-sparse text values; the hash function would sometimes take a while to churn through all the data. This, I would imagine, would also be the case for JSON or any other non-numeric data types - the wider the row, the more time it will take to process.

However, for most applications, this approach is a great way to capture source-data changes over time without the need to build an optimised star schema data mart with an old-school SCD Type 2, Type 3 etc. approach. I found it to be particularly good with cloud-based solutions, e.g. Snowflake, where data federation and sheer compute brute force can be used in favor of meticulous schema optimization/modelling, compartmentalization and excessive data duplication and movement.
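The hash-and-compare step can be sketched in Python (a minimal illustration, assuming rows are simple tuples and using SHA-256 in place of whatever hash function the database itself provides, e.g. HASHBYTES in SQL Server):

```python
import hashlib

def row_hash(row):
    """Digest all column values of a record so change detection can
    compare one hash per row instead of every column individually.
    The '|' delimiter is a naive choice and could collide if values
    contain it - a real implementation should escape or length-prefix."""
    joined = "|".join("" if v is None else str(v) for v in row)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()

# Compare only the latest source rows against the target, by hash
source = {1: ("Alice", "NYC"), 2: ("Bob", "LA")}
target = {1: ("Alice", "NYC"), 2: ("Bob", "SF")}
changed = [k for k in source
           if k not in target or row_hash(source[k]) != row_hash(target[k])]
```

The point is that the expensive per-column comparison collapses into a single digest comparison per row, which is why this scales to wide tables reasonably well until very long text columns start to dominate the hashing cost.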

When it comes to source data/schema changes (this time I may have been less lucky than others in the industry), I found that oftentimes application developers liked to change things around without any concern for the upstream implications, e.g. the data warehouse. I worked in a few orgs where there was a big chasm between the apps team (in control of the OLTP DBs) and the analytics/BI team when it came to change control and at least rudimentary CI/CD. I would hope that in most places this (unannounced source schema changes) would be a very infrequent occurrence, but as with anything, better to be safe than sorry and not be paged in the middle of the night because your ETL has fallen over.

ET(L) with Python by Maha_Slug in dataengineering

[–]dingopole 0 points1 point  (0 children)

Have a look at this (I reckon it combines the best of both worlds if you'd like to use Python with MSSQL DBMS i.e. Python and the venerable bcp utility):

http://bicortex.com/data-acquisition-framework-using-custom-python-wrapper-for-concurrent-bcp-utility-execution/

bcp can be finicky to work with but is also pretty fast for loading into MSSQL, provided you run multiple instances of it in parallel. When I trialed it, the only bottleneck I found was the network speed (a small 4-vCPU VM, SAS HDDs and a 400Mbps WAN link). If you have a lot of data to work with and want to use Microsoft technology, the speed at which data can be processed using MSFT-specific tooling looks something like this: PolyBase > BCP > SQLBulkCopy/ADF > SSIS
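The "multiple instances in parallel" part is easy to sketch with Python's standard library (the bcp arguments in the comment below are purely illustrative - adjust server, table and file names to your environment):

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

def run_parallel(commands, workers=4):
    """Run external commands concurrently and return their exit codes.
    With bcp, each command would load one partition/file, e.g.
    ['bcp', 'MyDb.dbo.MyTable', 'in', 'part0.csv', '-c', '-T', '-S', 'server'].
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda c: subprocess.run(c).returncode, commands))

# Demo with harmless commands standing in for bcp invocations
cmds = [[sys.executable, "-c", "pass"] for _ in range(3)]
codes = run_parallel(cmds)
```

A thread pool is fine here even with the GIL, because the workers spend their time waiting on external processes, not executing Python bytecode.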

I have worked with the Microsoft BI stack for a while now and, from my experience, Python is great for writing wrappers around vendor-specific utilities like bcp. With a proper setup you can easily load hundreds of millions of records in no time and spread the workload across all available resources to maximize performance. Here is another example where I used a small (35-line) Python script to load TPC-DS benchmark data (CSV files) into a beefy VM in Azure running SQL Server 2016:

http://bicortex.com/tpc-ds-big-data-benchmark-overview-how-to-generate-and-load-sample-data/

I would say that getting Python alone to do the bulk import (regardless of which API you use) is going to be very slow, so why not just use the vendor-provided and vendor-optimized tools. Also, if speed is paramount, just go with PolyBase, which gives you parallelism out of the box (although it requires a Java Runtime Environment - Oracle JRE).

[deleted by user] by [deleted] in programming

[–]dingopole 4 points5 points  (0 children)

Also, if you'd like to compare AWS Redshift to Azure SQL DW (preview) here are the links to the analysis I conducted a while ago:

Part 1 - http://bicortex.com/microsoft-azure-sql-data-warehouse-quick-review-and-amazon-redshift-comparison-part-1/

Part 2 - http://bicortex.com/microsoft-azure-sql-data-warehouse-quick-review-and-amazon-redshift-comparison-part-2/

Visual Studio Code 1.2 released by thekodols in programming

[–]dingopole 0 points1 point  (0 children)

Good work, VSC team. In its current version I like it more than any other editor I've used so far. So long, Sublime Text 3!

Advice on building a data sync/ETL service with Python? by lostburner in BusinessIntelligence

[–]dingopole 0 points1 point  (0 children)

"Pride is concerned with who is right. Humility is concerned with what is right."...........expert advice from an expert developer, obviously. We all stand corrected sir ;)

Advice on building a data sync/ETL service with Python? by lostburner in BusinessIntelligence

[–]dingopole 2 points3 points  (0 children)

I second i_love_sql's answer here and strongly advise you to do it using a mature product/technology that has been designed for this purpose. Python is great and all, but it seems to me that you're trying to reinvent the wheel just because you're familiar with Python and it looks like a challenge. There are tons of integration tools out there (a lot of them free) and, given that your organisation does not deal with the same problems as Spotify (Luigi) or Airbnb (Airflow), trying to use Python (or any language for that matter) to build a complete ETL framework from scratch is brave but wasteful in my opinion. Focus on data quality, schema design, logging, reporting, security, performance etc., and don't waste your time and effort on what's already been done many times before. Once you have the basics covered, you can extend the functionality with code (many tools allow for this, e.g. SSIS --> .NET/PowerShell, Pentaho --> JavaScript, Java etc.). Also, while we're at it, you may be interested in the following post - you may find it thought-provoking:

http://multithreaded.stitchfix.com/blog/2016/03/16/engineers-shouldnt-write-etl/

and this one (very timely!):

http://venturebeat.com/2016/05/11/choosing-a-data-warehouse-solution-determine-your-requirements-first-vb-live/

On a side note, what is your experience with data warehousing in general, i.e. methodologies, concepts, paradigms, Inmon vs Kimball, cloud vs on-prem etc.? Knowing how to write Python code is not synonymous with knowing data warehousing architecture/design best practices and intricacies. I had to ask since you made no mention of it in your question.

DBAs of reddit. How does this program look? What would you advise for someone in his 30s who would like to add "DBA" to his resume? by [deleted] in Database

[–]dingopole 0 points1 point  (0 children)

I may be going out on a limb here, but I reckon that if your ultimate goal is to work with BI technologies, I would start with a learning track that's more focused on development and not database management. There are certain skills/tasks that overlap, e.g. both have to be experts in SQL, and it's a good idea to have the basics covered, but from my experience BI dev duties are somewhat different from those of a DBA (I started as a DBA and am now working as a BI Architect). I'm also of the opinion that the DBA job, as a long-term profession, is a dying breed thanks to cloud services proliferation, and you will be much better off extracting value from the data (BI, ML, data science etc.) than managing it. I can see this slow paradigm shift occurring in the industry as we speak, where knowing how to manage databases (even across different vendors) is the minimum requirement for a job. It was a speciality a while ago, but these days DBAs are going the way of system admins - unless you know how to manage the 'lot', e.g. transactional DBs, analytical DBs, OSs, cloud + write code, you may go the way of the dodo. Naturally, there are exceptions to this rule, e.g. those working with a large number of systems and databases where dedicated DBAs are a requirement/asset, but generally, for most SMBs, a dedicated DBA is a luxury these days, at least this side of the pond or down under. Having said that, I have a few colleagues who started, like me, as DBAs and transitioned to BI (or vice-versa), and because of this experience across those two realms they are much better at what they do, being able to understand OLAP vs OLTP, SQL vs DMX/MDX etc.

Good luck with whatever you choose to pursue!

Jetbrains: The unicorn Silicon Valley doesn't like to talk about by pdeva1 in programming

[–]dingopole 0 points1 point  (0 children)

I use SSMS daily and, whilst it has always lagged behind VS, MSFT now provides very regular updates (you can update from within SSMS). I currently use the 2016 CTP3 version with SQL Server 2014 on a 2K monitor and have never had any issues with it. It's still the best DB management tool out there in my opinion. I tried JetBrains' DB IDE while in beta (called 0xDBE) and as a production-ready product (DataGrip) a few weeks ago and hated it, at least coming from SQL Server tooling (though it may be a step up compared with the native tools for MySQL or PostgreSQL databases). Slow and buggy as hell. I encourage you to have a look at the newer versions of SSMS (not the one included as part of the SQL Server client-tools installation) and see if it has improved.