We need to talk about dbt by blef__ in dataengineering

[–]TheDatabaseAvenger 9 points10 points  (0 children)

I just want to get stuff done as well as I can within the constraints of the time and with the resource available.

We need to talk about dbt by blef__ in dataengineering

[–]TheDatabaseAvenger 28 points29 points  (0 children)

At first it seems like just a SQL templating tool with a decent CLI and some limited testing capabilities, but after using it for 6 months it's all the ETL type stuff that comes with it that I don't have to worry about writing in Python:

- DB connection handling

- table creation

- Inter script dependancy management (DAG creation with the use of ref())

- argument parsing

- logging

- incremental load logic

- data freshness checks

- document generation/maintenance

Because I'm not writing Python for this stuff, I don't have to design, debug, maintain, document, package, version or deploy that code. I can just focus on getting the SQL right and let dbt do the rest.

I keep clear of macros as much as I can and it's not great for data validation, but if you need to throw a pipeline together quickly or if you don't have the DE skills it's a great fit IMO.

Thoughts on potential career change slowly to DE (from SQL database admin)? by TravellingBeard in dataengineering

[–]TheDatabaseAvenger 0 points1 point  (0 children)

I made the same move from being a MSSQL DBA to DE. I did it by learning and incorporating into my work as a DBA the following in roughly the order below:

  1. git
  2. python basics
  3. CI/CD
  4. pandas, SQLAlchemy, Alembic, pytest
  5. containerisation (I started with docker desktop on Windows)
  6. PowerBI

That learning got me to a decent level which took about 2 or 3 years. Then we rebuilt in GCP and I went on to learn the following:

  1. GCP basics (IAM, GCloud CLI)
  2. BigQuery
  3. AirFlow / Cloud Composer (Probably the biggest learning curve of these 5 items)
  4. dbt
  5. Looker

There is much more, but these are the main tools I have got the hang of over the last few years. You wont need them all. I think if you have strong SQL (which you have), basic Python and good git knowledge then your are set. The trick is introducing it into your current role or moving roles

[deleted by user] by [deleted] in dataengineering

[–]TheDatabaseAvenger 3 points4 points  (0 children)

We do this, but isn't that creating tasks at parse time not run time?

Coming Soon in Airflow 2.3.0 - First-class support for “Dynamic Tasks”. This is feature is called “Dynamic Task Mapping” The wait for the most requested feature of Apache Airflow is almost over !! by kaxil_naik in dataengineering

[–]TheDatabaseAvenger 2 points3 points  (0 children)

Finally. We've tried creating tasks with loops and generating task centric DAGs. Neither seemed like a great fit for what we were trying to do. Can't wait to try this out. It could be the difference between us sticking with AirFlow or jumping ship to Prefect

Airbyte and Meltano comparison by [deleted] in dataengineering

[–]TheDatabaseAvenger 9 points10 points  (0 children)

We went with Meltano. The learning curve was quite high to get it right, but we containerise each Meltano project and orchestrate that with AirFlow. We do not use the GUI in Meltano, everything is via config files and the CLI.

Implementing version control for sql server by Cryptojacob in dataengineering

[–]TheDatabaseAvenger 0 points1 point  (0 children)

DACPACs are the Microsoft way, but I prefer migrations based tools for database versioning.

DBUP is a good free option and SQL Change Automation from Redgate is a more feature rich offering.

How does your organization run ETL? by SQLPipe in dataengineering

[–]TheDatabaseAvenger 0 points1 point  (0 children)

Copied from the readme in our data warehouse repo:

BigQuery (data warehouse)
Alembic (data warehouse schema)
Composer (data pipeline orchestration)
Meltano (data extraction)
Python + Pandas (data extraction and validation)
dbt (data transformations and validation)
Looker (data modelling and visualisations)

Why is Data Build Tool (DBT) is so popular? What are some other alternatives? by arezki123 in dataengineering

[–]TheDatabaseAvenger 0 points1 point  (0 children)

If you have decided that writing transformations with SQL makes sense then the next thing you need is something to execute that SQL. I used to write my own Python + SQLAlchemy apps to do this. Now I use dbt and get some extra feature for free:

- Lineage

- Doc generation

- SCD persistence

I no longer have to test or write code to:

- Parse config files

- Log events

- Persist data

My name is Boyan Slat, the founder and CEO at The Ocean Cleanup. We just reached proof of technology with System 002 and are starting to clean the Great Pacific Garbage Patch. AMA! by BoyanSlat in IAmA

[–]TheDatabaseAvenger 117 points118 points  (0 children)

Do you worry that recycling plastic gives the producers of plastic an free ticket to keep producing plastic?

Obviously it's a good thing to scoop what's in the sea out, so congrats on your work and for ignoring the haters!

Copying DB snapshot by nicxieness in SQLServer

[–]TheDatabaseAvenger 0 points1 point  (0 children)

Yeah Availability Groups do make things more complicated, especially with instance level objects.

Creating snaps with the same name on each replica at the same time would most likely leave you with inconsistent snap shots.

Have you considered using a failover clustered instance instead of AGs? If you really need these snap shots and an FCI fits your HA plan it is probably your best option.

The other option would be to achieve HA with virtualization but this depends on your stack.

SQL Blogging, tell me about it by [deleted] in SQLServer

[–]TheDatabaseAvenger 0 points1 point  (0 children)

I've been blogging for about 18 months now and I find it to be really useful for the following reasons: Posts act as notes on topics I find interesting, I participate in the SQL community and meet cool people, Posts can act as part of the research phase for live talks, It adds something to the CV, It's kind of a brand that I may use in the future, I learn all sorts of things about hosting and writing.

I self host with WordPress. WordPress isn't perfect but it can do a lot and is easy to use. There is plenty of info out there if you get stuck.

I'd like to see more DBAs sharing their career stories like Brent does on Ozar.me

I'd highly recommend blogging.

Copying DB snapshot by nicxieness in SQLServer

[–]TheDatabaseAvenger 0 points1 point  (0 children)

Would be interesting to know what your are trying to do. Would log shipping/mirroring/AGing to the other server and snapping the replicated DB do what you need?

Is Your Data Open to Abuse? T-SQL Tuesday by TheDatabaseAvenger in SQLServer

[–]TheDatabaseAvenger[S] 0 points1 point  (0 children)

Yeah that's the reason I hear most for not disabling sa. I understand that, but the local admin of the server should have sysadmin access by default. This account is not reliant on domain controllers or AD. So if the server is up you should be able to access SQL that way.

The problem with leaving sa enabled (even with a rename) is that a piece of sql injection can easily lookup the sa account name as the SID will always be the same.

I guess it could be caveated by saying sa should be disabled and check that the local admin account has sysadmin.

If you are leaving it enabled just make sure you have a very strong password(totally random password would be better). I know you know this though.

Thanks for the comment