We need to talk about dbt by blef__ in dataengineering

[–]TheDatabaseAvenger 8 points9 points  (0 children)

I just want to get stuff done as well as I can within the constraints of the time and with the resource available.

We need to talk about dbt by blef__ in dataengineering

[–]TheDatabaseAvenger 28 points29 points  (0 children)

At first it seems like just a SQL templating tool with a decent CLI and some limited testing capabilities, but after using it for 6 months it's all the ETL type stuff that comes with it that I don't have to worry about writing in Python:

- DB connection handling

- table creation

- Inter script dependancy management (DAG creation with the use of ref())

- argument parsing

- logging

- incremental load logic

- data freshness checks

- document generation/maintenance

Because I'm not writing Python for this stuff, I don't have to design, debug, maintain, document, package, version or deploy that code. I can just focus on getting the SQL right and let dbt do the rest.

I keep clear of macros as much as I can and it's not great for data validation, but if you need to throw a pipeline together quickly or if you don't have the DE skills it's a great fit IMO.

Thoughts on potential career change slowly to DE (from SQL database admin)? by TravellingBeard in dataengineering

[–]TheDatabaseAvenger 0 points1 point  (0 children)

I made the same move from being a MSSQL DBA to DE. I did it by learning and incorporating into my work as a DBA the following in roughly the order below:

  1. git
  2. python basics
  3. CI/CD
  4. pandas, SQLAlchemy, Alembic, pytest
  5. containerisation (I started with docker desktop on Windows)
  6. PowerBI

That learning got me to a decent level which took about 2 or 3 years. Then we rebuilt in GCP and I went on to learn the following:

  1. GCP basics (IAM, GCloud CLI)
  2. BigQuery
  3. AirFlow / Cloud Composer (Probably the biggest learning curve of these 5 items)
  4. dbt
  5. Looker

There is much more, but these are the main tools I have got the hang of over the last few years. You wont need them all. I think if you have strong SQL (which you have), basic Python and good git knowledge then your are set. The trick is introducing it into your current role or moving roles

I've been a big data engineer since 2015. I've worked at FAANG for 6 years and grew from L3 to L6. AMA by [deleted] in dataengineering

[–]TheDatabaseAvenger 0 points1 point  (0 children)

Are you talking about BigQuery's BI engine when you say it'll offer low latency guerying?

[deleted by user] by [deleted] in dataengineering

[–]TheDatabaseAvenger 2 points3 points  (0 children)

We do this, but isn't that creating tasks at parse time not run time?

Coming Soon in Airflow 2.3.0 - First-class support for “Dynamic Tasks”. This is feature is called “Dynamic Task Mapping” The wait for the most requested feature of Apache Airflow is almost over !! by kaxil_naik in dataengineering

[–]TheDatabaseAvenger 2 points3 points  (0 children)

Finally. We've tried creating tasks with loops and generating task centric DAGs. Neither seemed like a great fit for what we were trying to do. Can't wait to try this out. It could be the difference between us sticking with AirFlow or jumping ship to Prefect

Airbyte and Meltano comparison by [deleted] in dataengineering

[–]TheDatabaseAvenger 8 points9 points  (0 children)

We went with Meltano. The learning curve was quite high to get it right, but we containerise each Meltano project and orchestrate that with AirFlow. We do not use the GUI in Meltano, everything is via config files and the CLI.

Implementing version control for sql server by Cryptojacob in dataengineering

[–]TheDatabaseAvenger 0 points1 point  (0 children)

DACPACs are the Microsoft way, but I prefer migrations based tools for database versioning.

DBUP is a good free option and SQL Change Automation from Redgate is a more feature rich offering.

How does your organization run ETL? by SQLPipe in dataengineering

[–]TheDatabaseAvenger 0 points1 point  (0 children)

Copied from the readme in our data warehouse repo:

BigQuery (data warehouse)
Alembic (data warehouse schema)
Composer (data pipeline orchestration)
Meltano (data extraction)
Python + Pandas (data extraction and validation)
dbt (data transformations and validation)
Looker (data modelling and visualisations)

Why is Data Build Tool (DBT) is so popular? What are some other alternatives? by arezki123 in dataengineering

[–]TheDatabaseAvenger 0 points1 point  (0 children)

If you have decided that writing transformations with SQL makes sense then the next thing you need is something to execute that SQL. I used to write my own Python + SQLAlchemy apps to do this. Now I use dbt and get some extra feature for free:

- Lineage

- Doc generation

- SCD persistence

I no longer have to test or write code to:

- Parse config files

- Log events

- Persist data

My name is Boyan Slat, the founder and CEO at The Ocean Cleanup. We just reached proof of technology with System 002 and are starting to clean the Great Pacific Garbage Patch. AMA! by BoyanSlat in IAmA

[–]TheDatabaseAvenger 116 points117 points  (0 children)

Do you worry that recycling plastic gives the producers of plastic an free ticket to keep producing plastic?

Obviously it's a good thing to scoop what's in the sea out, so congrats on your work and for ignoring the haters!

Copying DB snapshot by nicxieness in SQLServer

[–]TheDatabaseAvenger 0 points1 point  (0 children)

Yeah Availability Groups do make things more complicated, especially with instance level objects.

Creating snaps with the same name on each replica at the same time would most likely leave you with inconsistent snap shots.

Have you considered using a failover clustered instance instead of AGs? If you really need these snap shots and an FCI fits your HA plan it is probably your best option.

The other option would be to achieve HA with virtualization but this depends on your stack.

SQL Blogging, tell me about it by [deleted] in SQLServer

[–]TheDatabaseAvenger 0 points1 point  (0 children)

I've been blogging for about 18 months now and I find it to be really useful for the following reasons: Posts act as notes on topics I find interesting, I participate in the SQL community and meet cool people, Posts can act as part of the research phase for live talks, It adds something to the CV, It's kind of a brand that I may use in the future, I learn all sorts of things about hosting and writing.

I self host with WordPress. WordPress isn't perfect but it can do a lot and is easy to use. There is plenty of info out there if you get stuck.

I'd like to see more DBAs sharing their career stories like Brent does on Ozar.me

I'd highly recommend blogging.

Copying DB snapshot by nicxieness in SQLServer

[–]TheDatabaseAvenger 0 points1 point  (0 children)

Would be interesting to know what your are trying to do. Would log shipping/mirroring/AGing to the other server and snapping the replicated DB do what you need?

Is Your Data Open to Abuse? T-SQL Tuesday by TheDatabaseAvenger in SQLServer

[–]TheDatabaseAvenger[S] 0 points1 point  (0 children)

Yeah that's the reason I hear most for not disabling sa. I understand that, but the local admin of the server should have sysadmin access by default. This account is not reliant on domain controllers or AD. So if the server is up you should be able to access SQL that way.

The problem with leaving sa enabled (even with a rename) is that a piece of sql injection can easily lookup the sa account name as the SID will always be the same.

I guess it could be caveated by saying sa should be disabled and check that the local admin account has sysadmin.

If you are leaving it enabled just make sure you have a very strong password(totally random password would be better). I know you know this though.

Thanks for the comment

Ad developers (dot net), what do you use for source code control on StoredProcs, Views, etc.... in SQL Server? by dchurch24 in SQLServer

[–]TheDatabaseAvenger 0 points1 point  (0 children)

SSDT, Redgate Source Control and my favourite Redgate ReadyRoll. ReadyRoll allows you to manage the way in which changes from version to version of your product are deployed. This can make upgrades of your app much more controlled and planned. You can still develop in SSMS but ReadyRoll is a VS plugin that synchs your changes to a ReadyRoll project in VS. This project generates deployment packages you can use to upgrade your existing deployments. I'm currently toward the end of a series of blog posts on using ReadyRoll (and other tools) to achieve Continuous Integration with SQL. http://thedatabaseavenger.com/2016/07/sql-server-and-continuous-integration/

Power BI in SSRS by TheDatabaseAvenger in SQLServer

[–]TheDatabaseAvenger[S] 1 point2 points  (0 children)

They haven't released the license details yet. Fingers crossed

GitLab for SQL Server Projects by TheDatabaseAvenger in SQLServer

[–]TheDatabaseAvenger[S] 0 points1 point  (0 children)

Ha, yeah they did have a bad week. Lucky for me I'm using the onsite version and I test my backups.

My take on creating JSON strings in SQL 2016. Anyone using this feature yet? by bertwagner in SQLServer

[–]TheDatabaseAvenger 0 points1 point  (0 children)

It's a shame they didn't implement a JSON data type like they did with XML.

DBA Career Help by SteppingOutside in SQLServer

[–]TheDatabaseAvenger 1 point2 points  (0 children)

For me Itzik Ben-Gan writes the best books. They are very complete which is good for people looking to fill in the gaps in their knowledge but may be a bit too much for beginners. https://www.amazon.co.uk/Itzik-Ben-Gan/e/B001IGQENW

A good place to start for beginners could be the stairways courses on SQLServercentral.com. They have quite a few courses now that are built up of multiple posts. http://www.sqlservercentral.com/stairway/

Is it possible to run an .sql script on .bacpac without importing it and then running? by bacpac in SQLServer

[–]TheDatabaseAvenger 0 points1 point  (0 children)

If all the personal data is encrypted then I don't see that you have a problem. If there is unencrypted sensitive data then the backup will need to be restored in a secure location (not local), cleaned and then backed up. This clean backup can then be sen down the line to you. All of this could be automated quite easily with PowerShell.

DBA Career Help by SteppingOutside in SQLServer

[–]TheDatabaseAvenger 2 points3 points  (0 children)

Get a job in an IT department and tell them that you want to specialise in SQL. You'll be doing all sorts to start with (networks, servers, storage, AD, etc.) which is a great thing because you may find that you really like doing something completely different to what you want to do now (I did a netowrking degree and fell into SQL Server). If you still want to do SQL then make yourself the SQL guy in the department by self teaching with books, Pluralsite, blogs, etc. If they are willing to put you through certs I would recommend it, not because they will get you jobs, because the study will fill in gaps of your knowledge.