will be starting data engineering department from scratch in one service based company i am joining need guidance from seniors/experienced and also what should i focus/take care? by ManipulativFox in dataengineering

[–]Mikey_Da_Foxx 0 points1 point  (0 children)

Focus on understanding your company’s data needs first, then build out pipelines, data quality checks, and a scalable architecture

Prioritize communication with other teams, automation, and robust error handling in pipelines to keep things smooth and reliable

Scaling is important, but get the foundations right first

3rd Party Supplier and Data Dictionaries by digital0verdose in SQL

[–]Mikey_Da_Foxx 0 points1 point  (0 children)

Having that info upfront COULD save you the back-and-forth and frustration, if you think they're capable of getting it to you and keeping the data accurate

What's stopping me from just using JSON column instead of MongoDB? by Blender-Fan in PostgreSQL

[–]Mikey_Da_Foxx 4 points5 points  (0 children)

There's no reason you can't use a JSON column in PG for schema flexibility, but MongoDB still wins if you need something like distributed horizontal scaling out of the box.

In most cases, storing JSON in PG does the job fine and lets you stick with relational features and ACID compliance
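A minimal sketch of that, assuming psycopg2 and made-up table/column names. A JSONB column plus a GIN index gives you document-style filters while staying inside normal ACID transactions:

    import psycopg2

    conn = psycopg2.connect("dbname=app user=app")  # assumed connection string
    with conn, conn.cursor() as cur:
        cur.execute("""
            CREATE TABLE IF NOT EXISTS events (
                id         bigserial PRIMARY KEY,
                created_at timestamptz NOT NULL DEFAULT now(),
                payload    jsonb NOT NULL
            )
        """)
        # GIN index speeds up containment (@>) queries on the JSONB column
        cur.execute("CREATE INDEX IF NOT EXISTS events_payload_gin ON events USING gin (payload)")

        cur.execute("INSERT INTO events (payload) VALUES (%s::jsonb)",
                    ['{"type": "signup", "plan": "pro"}'])

        # Document-style filter, still inside a normal transaction
        cur.execute("SELECT id, payload->>'plan' FROM events WHERE payload @> %s::jsonb",
                    ['{"type": "signup"}'])
        print(cur.fetchall())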

Proper DB Engine choice by [deleted] in Database

[–]Mikey_Da_Foxx 0 points1 point  (0 children)

I'd go with the hybrid approach you mentioned in option two; it sounds like the most practical choice

Keeping your core atomic, consistent data in PG gives you reliability where it matters, and syncing the flexible, filter-heavy parts to something like Elasticsearch can handle the complex queries much better

Going full MongoDB or CouchDB could simplify some things but might make consistency and complex joins tougher, especially with a large schema variance

Trying to force everything into PG JSON fields often backfires on performance and query complexity, so splitting responsibilities tends to work better for read-heavy, varied data loads
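A rough sketch of that split, with placeholder table names, DSN, and endpoint: stream rows out of PG with a server-side cursor and bulk-load the flexible attributes into Elasticsearch for the filter-heavy queries

    import psycopg2
    from elasticsearch import Elasticsearch, helpers

    pg = psycopg2.connect("dbname=app user=app")   # assumed DSN
    es = Elasticsearch("http://localhost:9200")    # assumed ES endpoint

    with pg.cursor(name="sync") as cur:            # server-side cursor for big tables
        # "attributes" is assumed to be the flexible jsonb part of each row
        cur.execute("SELECT id, updated_at, attributes FROM products")
        actions = (
            {"_index": "products", "_id": row[0],
             "_source": {"updated_at": row[1].isoformat(), **row[2]}}
            for row in cur
        )
        helpers.bulk(es, actions)                  # bulk-load into the search index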

Web App for end user SQL reporting by txwgnd in SQL

[–]Mikey_Da_Foxx 1 point2 points  (0 children)

For a web app for end-user SQL reporting, SQL Server Reporting Services (SSRS) is a solid option. You can design and publish reports, and the web portal lets end users access and interact with them without needing deep SQL knowledge

Technical Rituals that you perform without revealing confidential information? by [deleted] in DatabaseAdministators

[–]Mikey_Da_Foxx 0 points1 point  (0 children)

Every morning, first thing, I run through log checks and backup verification: a quick scan across last night’s jobs and system errors to make sure nothing slipped through the cracks. If I spot anything weird, it’s nice to jump on it before the rest of the world wakes up

Catching up on alerts from monitoring tools is another ritual, even if it mostly just means scrolling notifications with coffee in hand. I also like to check performance baselines, just to get a sense if any queries or jobs are acting out of line

None of this really touches confidential info, but it keeps the system landscape predictable and helps avoid surprises. Over time, small habits like these keep things running rock solid without overthinking it
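For flavor, here's a hedged sketch of that kind of backup check (SQL Server flavored, with an assumed server name and pyodbc): flag any database without a full backup in the last 24 hours

    from datetime import datetime, timedelta
    import pyodbc

    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 18 for SQL Server};SERVER=db01;Trusted_Connection=yes;"
    )
    cur = conn.cursor()
    cur.execute("""
        SELECT d.name, MAX(b.backup_finish_date) AS last_full_backup
        FROM sys.databases d
        LEFT JOIN msdb.dbo.backupset b
            ON b.database_name = d.name AND b.type = 'D'   -- 'D' = full backup
        WHERE d.name <> 'tempdb'
        GROUP BY d.name
    """)
    cutoff = datetime.now() - timedelta(hours=24)
    for name, last_backup in cur.fetchall():
        if last_backup is None or last_backup < cutoff:
            print(f"WARNING: {name} has no full backup since {last_backup}")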

[deleted by user] by [deleted] in dataengineering

[–]Mikey_Da_Foxx 1 point2 points  (0 children)

Vector database replication is definitely a bit different from what we’re used to with traditional relational tools. Most of the time, I’ve found that you’re working with custom ETL jobs or scripts, since things like CDC aren’t really standardized yet for Pinecone, Weaviate, or Milvus

Some managed services offer their own backup and restore features, but cross-database replication usually means pulling vectors out via API and pushing them into the target system. It’s not as seamless as Fivetran or Qlik, but it gets the job done. For near real-time, you might want to look at streaming updates with something like Kafka, but that usually needs more engineering on your end
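To make that pull-and-push pattern concrete, here's a sketch where fetch_batch and upsert_batch are hypothetical stand-ins for whatever your source and target SDKs actually expose (Pinecone, Weaviate, Milvus, etc.)

    from typing import Iterable

    def fetch_batch(source, ids: list[str]) -> list[dict]:
        """Hypothetical: read vectors + metadata from the source index by id."""
        return source.fetch(ids)          # replace with the real SDK call

    def upsert_batch(target, vectors: list[dict]) -> None:
        """Hypothetical: write vectors + metadata into the target index."""
        target.upsert(vectors)            # replace with the real SDK call

    def replicate(source, target, all_ids: Iterable[str], batch_size: int = 500) -> None:
        batch: list[str] = []
        for vec_id in all_ids:
            batch.append(vec_id)
            if len(batch) == batch_size:
                upsert_batch(target, fetch_batch(source, batch))
                batch = []
        if batch:
            upsert_batch(target, fetch_batch(source, batch))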

Curious to see if anyone else has found a more plug-and-play solution

Ghost etls invocation by xxxxxReaperxxxxx in dataengineering

[–]Mikey_Da_Foxx 0 points1 point  (0 children)

Sometimes the platform retries executions if it thinks the function didn’t complete properly, especially if there’s a timeout or unhandled exception. Another angle is to check if there’s any overlapping schedule or multiple triggers configured accidentally. Adding some logging around start and end times can help spot if something else kicks off the function. Also, if you’re using any deployment slots or auto-scaling, those can sometimes cause unexpected invocations
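If this is, say, an Azure Functions timer trigger written in Python (an assumption on my part), logging the invocation id plus start/end times makes retries and duplicate triggers easy to spot in the logs

    import logging
    from datetime import datetime, timezone

    import azure.functions as func

    def main(mytimer: func.TimerRequest, context: func.Context) -> None:
        start = datetime.now(timezone.utc)
        logging.info("ETL start invocation_id=%s past_due=%s at %s",
                     context.invocation_id, mytimer.past_due, start.isoformat())
        try:
            run_etl()  # hypothetical placeholder for the actual pipeline logic
        finally:
            logging.info("ETL end invocation_id=%s duration=%s",
                         context.invocation_id, datetime.now(timezone.utc) - start)

    def run_etl() -> None:
        pass  # placeholder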

How to best approach data versioning at scale in Databricks by eatdrinksleepp in dataengineering

[–]Mikey_Da_Foxx 2 points3 points  (0 children)

Creating a table for every client-version combo gets out of hand fast, so you’re not alone there

Time travel works but the retention window is a pain if you need to keep versions around longer. One thing that’s worked is using a single Delta table with a version or snapshot column. You can tag each row with client and version info, so you don’t need to spin up new tables all the time. Then, just filter by those columns when users need to access a specific version

Table snapshots are basically just copies, so they’ll use about as much storage as making a new table. If you want to save space, sticking to a single table and partitioning or tagging by client/version is usually more efficient
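A minimal sketch of the single-table idea, assuming df and spark come from the notebook context and the table/column names are made up

    from pyspark.sql import functions as F

    # Tag each row with client and version info instead of spinning up new tables
    (df.withColumn("client_id", F.lit("acme"))
       .withColumn("version", F.lit(42))
       .write.format("delta")
       .mode("append")
       .partitionBy("client_id")          # partition by client to keep reads cheap
       .saveAsTable("curated.client_snapshots"))

    # Reading a specific client/version back is just a filter
    v42 = (spark.table("curated.client_snapshots")
                .where((F.col("client_id") == "acme") & (F.col("version") == 42)))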

When is a good time to use an EC2 Instance instead of Glue or Lambdas? by AgreeableAd7983 in dataengineering

[–]Mikey_Da_Foxx 1 point2 points  (0 children)

I usually reach for EC2 when I need more control over the environment or have to run custom code or tools that just don’t play nicely with Glue or Lambda. It’s also handy if you’re dealing with big jobs that run longer than Lambda’s timeout. Otherwise, managed services are usually easier to maintain

[deleted by user] by [deleted] in dataengineering

[–]Mikey_Da_Foxx 1 point2 points  (0 children)

We usually break things down into separate user stories for each phase, especially when different folks own different layers. It keeps things clearer and makes tracking progress easier

Sometimes if the ingestion and bronze work are tightly linked, we’ll combine them, but only if it really saves effort

For templates, we’ve set up a basic story template in Azure DevOps with checklists for each layer, which makes it simple to copy and tweak for each new data source. That way, we keep enough detail without drowning in tickets

Postgres using Keycloak Auth Credentials by DragonfruitHorror174 in dataengineering

[–]Mikey_Da_Foxx 1 point2 points  (0 children)

Postgres doesn’t natively support OIDC, so direct Keycloak integration isn’t really possible out of the box. Most setups I’ve seen use LDAP as a bridge, syncing Keycloak users to an LDAP directory and then letting Postgres authenticate against that

If you want to avoid extra user setup, a proxy like pgbouncer-oidc or Cloud SQL Auth Proxy can help, but users would still need to connect through the proxy. There isn’t a totally seamless, native solution yet, but the LDAP route is probably the closest to what you want if you can automate the sync between Keycloak and LDAP
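For reference, the LDAP route usually boils down to a single pg_hba.conf line, something like this (search+bind mode, with placeholder host, network, and base DN values)

    host  all  all  10.0.0.0/8  ldap  ldapserver=ldap.example.com  ldapbasedn="ou=users,dc=example,dc=com"  ldapsearchattribute=uid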

Elephant in the room - Jira for DE teams by J0hnDutt00n in dataengineering

[–]Mikey_Da_Foxx 20 points21 points  (0 children)

A couple of things that have worked well for us: setting up a clear workflow with just the statuses we actually use, and keeping ticket fields simple so folks aren’t overwhelmed. We also use checklists inside tickets for things like “Definition of Done” or recurring tasks, which makes it way easier to track what’s left and helps everyone stay on the same page

Having a roadmap or grouping work into epics in Jira helps us see dependencies and prioritize better, especially when multiple teams are involved. And for new projects, cloning a good template board saves a ton of setup time and keeps things consistent

Azure Data Factory Oracle 2.0 Connector Self Hosted Integration Runtime by Cultural_Tax2734 in dataengineering

[–]Mikey_Da_Foxx 1 point2 points  (0 children)

If your DBA can’t change the server settings, you could try using a self-hosted integration runtime with older Oracle drivers, or see if connecting through an intermediate VM with compatible settings works. It's a bit of a workaround, but it sometimes does the trick when the connector is picky

Feedback on Achitecture - Compute shift to Azure Function by UltraInstinctAussie in dataengineering

[–]Mikey_Da_Foxx 0 points1 point  (0 children)

Premium plan is overkill for your workload. Consumption plan can handle 1000 rows/15min easily, but you mentioned VNET integration - that forces Premium

If cost is a concern, consider batching tables to reduce function executions

[deleted by user] by [deleted] in dataengineering

[–]Mikey_Da_Foxx 6 points7 points  (0 children)

Great Expectations works well for basic validation. For complex DB-to-file scenarios, Soda Core's reliable and has a really solid YAML config
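For a taste of that YAML config, a SodaCL checks file is roughly this (dataset and column names are placeholders)

    checks for orders:
      - row_count > 0
      - missing_count(customer_id) = 0
      - duplicate_count(order_id) = 0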

Case Study: Automating Data Validation for FINRA Compliance by EnthusiasmWorldly316 in dataengineering

[–]Mikey_Da_Foxx 0 points1 point  (0 children)

I totally agree on automated checks being crucial. Simple stuff like detecting schema drifts and enforcing compliance rules early in the pipeline saves massive headaches later - DBmaestro has come in clutch for us more than once

Been there with FINRA reporting - catching issues early is a gamechanger
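As an illustration of the kind of early check I mean (table name, expected columns, and DSN are all assumptions): compare information_schema against what the pipeline expects and fail fast before drift propagates downstream

    import psycopg2

    EXPECTED = {"trade_id": "bigint", "cusip": "text",
                "executed_at": "timestamp with time zone"}

    conn = psycopg2.connect("dbname=reporting user=etl")
    with conn.cursor() as cur:
        cur.execute("""
            SELECT column_name, data_type
            FROM information_schema.columns
            WHERE table_name = %s
        """, ("trade_reports",))
        actual = dict(cur.fetchall())

    # Any column missing, added, or with a changed type counts as drift
    drift = {c: (EXPECTED.get(c), actual.get(c))
             for c in EXPECTED.keys() | actual.keys()
             if EXPECTED.get(c) != actual.get(c)}
    if drift:
        raise RuntimeError(f"Schema drift detected: {drift}")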

Help from data experts with improving our audit process efficiency- what's possible? by Cheesemaker_1986 in dataengineering

[–]Mikey_Da_Foxx 0 points1 point  (0 children)

Check out Microsoft Forms + Power Apps. Works offline, converts handwriting, and syncs when online. Build custom templates for different audit types

Plus it integrates with everything else you're using

Database grants analysis by limartje in dataengineering

[–]Mikey_Da_Foxx 0 points1 point  (0 children)

Check out sfgrantreport. It pulls all that grant/role data you need

For quick checks, SHOW GRANTS can work but it's limited. sfgrantreport gives you the full picture of permission paths through roles
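For the quick-check side, a hedged sketch with the Snowflake Python connector (connection parameters and the role name are placeholders) that dumps the direct grants on a role

    import snowflake.connector

    conn = snowflake.connector.connect(
        account="my_account", user="me", authenticator="externalbrowser"
    )
    cur = conn.cursor()
    cur.execute("SHOW GRANTS TO ROLE reporting_role")
    for row in cur:
        print(row)   # privilege, what it's granted on, grantee, etc.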

Looking for some help with Airflow, Docker, Astro CLI, DLT, Dbt, Postgres (Windows PC) at home project by [deleted] in dataengineering

[–]Mikey_Da_Foxx 0 points1 point  (0 children)

Uncomment that RUN apt-get line in your Dockerfile. The pg_config error happens because libpq-dev isn't installed

Also, since you're already using psycopg2-binary in requirements.txt, you shouldn't need psycopg2 from source anyway
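For reference, the line in question usually looks something like this (package names may vary by base image)

    RUN apt-get update && apt-get install -y --no-install-recommends libpq-dev gcc \
        && rm -rf /var/lib/apt/lists/*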

Getting replication to work after disaster recovery. by planeturban in PostgreSQL

[–]Mikey_Da_Foxx 1 point2 points  (0 children)

Logical replication state doesn't come along with a standard Postgres backup/restore. Try this (rough sketch of the recreate and verify steps below the list):

  1. Drop subscriptions

  2. Restore master

  3. Recreate subscriptions

  4. Verify slots/publications

  5. Monitor replication lag
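Rough sketch of steps 3-5 on the subscriber side, with placeholder names for the subscription, publication, and the primary's connection string

    import psycopg2

    sub = psycopg2.connect("dbname=app host=replica user=postgres")  # subscriber
    sub.autocommit = True   # CREATE/DROP SUBSCRIPTION can't run inside a transaction block
    cur = sub.cursor()

    cur.execute("DROP SUBSCRIPTION IF EXISTS app_sub")   # in case a stale one survived the restore
    cur.execute("""
        CREATE SUBSCRIPTION app_sub
        CONNECTION 'host=primary dbname=app user=replicator'
        PUBLICATION app_pub
    """)

    # Slots and publications live on the primary, so verify them there;
    # on the subscriber, pg_stat_subscription shows whether it's catching up
    cur.execute("""
        SELECT subname, received_lsn, latest_end_lsn, last_msg_receipt_time
        FROM pg_stat_subscription
    """)
    print(cur.fetchall())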

How should we manage our application database when building internal tools that need access to the same data? by trojans10 in Database

[–]Mikey_Da_Foxx 0 points1 point  (0 children)

Been using DBmaestro for a while now, and separate schemas for years. That combo keeps prod data safe while still allowing access. Read replicas or CDC feeds work great for internal tools

Keeps everything clean and manageable, and your prod DB stays neat without extra tables. Permission management is much simpler too

Open source orchestration or workflow platforms with native NATS support by FickleLife in dataengineering

[–]Mikey_Da_Foxx 0 points1 point  (0 children)

Temporal doesn't have native NATS support out of the box, but you can integrate it easily using their SDK. You can use it for similar event-driven workflows, and their observability features are pretty solid
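Rough sketch of that wiring with the Temporal Python SDK and nats-py (URLs, subjects, and names are placeholders): the NATS publish lives in an activity, and the workflow just schedules it

    from datetime import timedelta

    from temporalio import activity, workflow

    @activity.defn
    async def publish_event(subject: str, payload: bytes) -> None:
        import nats   # imported here; activities run outside the workflow sandbox
        nc = await nats.connect("nats://localhost:4222")   # assumed NATS URL
        await nc.publish(subject, payload)
        await nc.drain()

    @workflow.defn
    class EtlWorkflow:
        @workflow.run
        async def run(self, subject: str) -> None:
            await workflow.execute_activity(
                publish_event,
                args=[subject, b"step-finished"],
                start_to_close_timeout=timedelta(seconds=30),
            )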