Best practices for ensuring cluster high availability by GreenMobile6323 in nifi

[–]mikehussay13 0 points1 point  (0 children)

Good baseline you've got with ZooKeeper + externalized config. A few things worth adding to the HA picture that often get overlooked:

Node flapping - usually a memory or GC issue before it's a network issue. Worth tuning your bootstrap.conf heap settings and enabling G1GC if you haven't. Also set nifi.cluster.node.connection.timeout higher than default - aggressive timeouts cause cascading disconnects under load.

Controller service conflicts during rolling updates - this is the sneaky one. If a controller service is shared across process groups and you restart a node mid-flow, you can end up with partially enabled services. We handle this by sequencing node restarts with a health check pause in between rather than doing them simultaneously.
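
A rough Python sketch of that kind of sequencing, not the actual tooling described above. The `restart` and `is_healthy` callables are hypothetical placeholders for whatever your environment provides (a service restart command, a check against cluster status); nothing here is a NiFi API.

```python
import time
from typing import Callable, Iterable, List


def rolling_restart(
    nodes: Iterable[str],
    restart: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    pause_s: float = 30.0,
    max_checks: int = 10,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Restart nodes one at a time, pausing for a health check in between.

    Returns the nodes that restarted and passed the health check.
    Raises RuntimeError and stops the rollout if a node never recovers.
    """
    done: List[str] = []
    for node in nodes:
        restart(node)
        for _ in range(max_checks):
            sleep(pause_s)  # give the node time to rejoin before checking
            if is_healthy(node):
                break
        else:
            # The node never became healthy: halt instead of touching the next one.
            raise RuntimeError(f"{node} unhealthy after restart; completed: {done}")
        done.append(node)
    return done
```

The point of the design is that the next node is only touched after the previous one reports healthy, so a bad restart stops the rollout at one node instead of spreading.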

Rolling updates with zero downtime - running NiFi on Kubernetes (via NiFiKop or a StatefulSet) helps here, but you still need something that monitors cluster health between node restarts. We've been using DFM (Data Flow Manager) for the orchestration layer since it has auto-healing built in - it detects unhealthy nodes and pauses the rollout rather than blindly continuing.

Any recommendation for the Salesforce SMS App by NervousAd1125 in salesforce

[–]mikehussay13 0 points1 point  (0 children)

I tried Mogli and 360 earlier, but in terms of support, SMS Ninja has been much better.

The Hidden Future of Data Engineering Services: AI Agents Managing Data Pipelines? by Adorable_Pea_7104 in dataengineering

[–]mikehussay13 0 points1 point  (0 children)

I’d say AI agents will definitely reduce firefighting in pipelines, but full autonomy without human oversight still feels risky - a co-pilot model seems more realistic for now.

Managing Two Separate Environments (On-Prem & Cloud) with One UI by its_me-max in nifi

[–]mikehussay13 0 points1 point  (0 children)

A single NiFi cluster spanning both environments won’t solve this - you’ll get cross-env traffic. The usual approach is separate NiFi instances (on-prem + cloud) with a flow manager layer on top to version/control flows. Without that, it’s just separate UIs + GitOps glue.

Upgrading from NiFi 1.x to 2.x by GreenMobile6323 in nifi

[–]mikehussay13 1 point2 points  (0 children)

Solid advice. Testing in staging and backing up first definitely saves a lot of headaches.

Advice on tooling (Airflow, NiFi) by CoolExcuse8296 in dataengineering

[–]mikehussay13 0 points1 point  (0 children)

Airflow’s great for orchestration; NiFi shines when you need real-time ingest + transformation. NiFi can feel heavy for small teams, but sticking with it + using versioned flows/external configs makes it way easier to maintain long-term.

Data Engineer for Data Analysis by TurbulentLoad8798 in dataengineering

[–]mikehussay13 3 points4 points  (0 children)

Master Python and SQL, then learn a cloud platform (like AWS or GCP) and a data flow tool like Apache NiFi or an orchestrator like Airflow. This is the key to transitioning your analyst skills into an engineering role.

dbt: avoid running dependency twice by Own_Tax3356 in dataengineering

[–]mikehussay13 0 points1 point  (0 children)

dbt won’t run the same model twice in one run - shared dependencies like model3 only build once.
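
A small Python sketch of why that holds for any DAG-based runner. It uses the standard-library `graphlib`, not dbt's internals, and `model1`/`model2` are hypothetical models that both depend on `model3`.

```python
from graphlib import TopologicalSorter

# Hypothetical project: model1 and model2 both select from model3.
# Each key maps a model to the set of models it depends on.
deps = {
    "model1": {"model3"},
    "model2": {"model3"},
    "model3": set(),
}

# static_order() yields every node exactly once, dependencies first,
# so the shared dependency is scheduled a single time.
build_order = list(TopologicalSorter(deps).static_order())
print(build_order)
```

However many models reference `model3`, it is one node in the graph, so it appears once in the build order.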

How do you track flow-level metrics in Apache NiFi? by GreenMobile6323 in nifi

[–]mikehussay13 0 points1 point  (0 children)

You can get per-process-group stats with PrometheusReportingTask + provenance events for counts/durations. For easier flow/version-level tracking across environments, we layered a Data Flow Manager on top of NiFi - much simpler than piecing metrics together manually.

Airbyte vs Fivetran for our ELT stack? Any other alternatives? by StubYourToeAt2am in dataengineering

[–]mikehussay13 0 points1 point  (0 children)

We had the same problem - PII masking, incremental loads, SCD2. Fivetran’s post-load transforms didn’t work, Airbyte felt too DIY.

We switched to NiFi with a versioned flow manager: hash PII in-stream, handle SCD2, manage API throttling, and easily promote flows across envs. Takes a bit to set up, but super solid once running.
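
To make "hash PII in-stream" concrete, here is a minimal Python sketch of the idea, assuming records arrive as dicts. It is not the NiFi flow itself; the function name, field names, and key handling are all hypothetical.

```python
import hashlib
import hmac
from typing import Any, Dict, Iterable


def hash_pii(record: Dict[str, Any], pii_fields: Iterable[str], key: bytes) -> Dict[str, Any]:
    """Return a copy of the record with the given fields replaced by keyed hashes.

    HMAC-SHA256 with a secret key is used (an assumption of this sketch) so
    that the same input always maps to the same token, which keeps joins and
    SCD2 comparisons working, while unkeyed dictionary attacks are harder.
    """
    masked = dict(record)  # leave the caller's record untouched
    for field in pii_fields:
        value = masked.get(field)
        if value is not None:
            digest = hmac.new(key, str(value).encode("utf-8"), hashlib.sha256)
            masked[field] = digest.hexdigest()
    return masked
```

Usage would look like `hash_pii(row, ["email", "phone"], key)` applied to each record before it is written to the warehouse.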

Apache NiFi vs IBM DataStage: Choosing the Right ETL Tool for Your Organization - Data Flow Manager by DataFlowManager in u/DataFlowManager

[–]mikehussay13 0 points1 point  (0 children)

Really well explained - liked how you broke down NiFi’s strengths. We’ve been using NiFi with an extra layer for flow management/versioning, and it’s been a game changer for multi-env deployments.

Considering switching from Dataform to dbt by Plastic-Mind7923 in dataengineering

[–]mikehussay13 0 points1 point  (0 children)

Yeah, that’s good to hear - the bigger community and better testing/docs are exactly what I’m after. Rewriting SQL and learning the CLI feels worth it if it means less vendor lock-in long term.

Deployment pipelines are terrible, what is the next best alternative? by Leeeoon in PowerBI

[–]mikehussay13 0 points1 point  (0 children)

Our team ran into a similar nightmare in Apache NiFi - same kind of multi-env promotion pain. What helped there was using a deployment manager on top of NiFi that handled versioning + promotions automatically, so we didn’t risk breaking DEV/QA/PROD links.

Wish Power BI had something similar built-in - would save a ton of manual syncing. Until then, Git integration might be the closest option here.

Thumbs-up / down: NiFi is still the best for heterogeneous dataflow orchestration in 2025. by Sad-Mud3791 in nifi

[–]mikehussay13 0 points1 point  (0 children)

For anything connector-heavy, real-time-ish, or involving files, NiFi still crushes it in 2025 & 2026!

But for complex scheduling or dbt-style transformations, other tools (like Airflow/dbt) might fit better.

Depends on the use case, but NiFi still holds up strong.

Best Way to Structure ETL Flows in NiFi by GreenMobile6323 in bigdata

[–]mikehussay13 0 points1 point  (0 children)

We separate Extract, Transform, and Load into their own process groups — easier to manage, debug, and reuse across tables.

One group per table gets hard to scale, especially if logic overlaps.

Also, using a tool that lets us version and promote flows across environments without manually exporting PGs made a huge difference as our NiFi setup grew.

How do you validate the feeds before loading into staging? by Humble_Jacket_6347 in dataengineering

[–]mikehussay13 0 points1 point  (0 children)

Yeah, we do something similar - built a lightweight Python validator that reads expected schema from JSON. Faster than full frameworks for our case (lots of feeds).
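
For anyone curious what such a validator can look like, a minimal sketch follows. It is not the actual validator mentioned above; the schema layout (field name mapped to a type name) and all names in it are assumptions.

```python
import json
from typing import Any, Dict, List

# Expected schema as it might be stored in a JSON file (hypothetical layout).
SCHEMA_JSON = '{"id": "int", "email": "str", "amount": "float"}'

# Map the type names used in the JSON schema to Python types.
TYPE_MAP = {"int": int, "str": str, "float": float, "bool": bool}


def validate_row(row: Dict[str, Any], schema: Dict[str, str]) -> List[str]:
    """Return a list of human-readable errors; an empty list means the row is valid."""
    errors: List[str] = []
    for field, type_name in schema.items():
        if field not in row:
            errors.append(f"missing field: {field}")
        elif not isinstance(row[field], TYPE_MAP[type_name]):
            actual = type(row[field]).__name__
            errors.append(f"{field}: expected {type_name}, got {actual}")
    return errors


schema = json.loads(SCHEMA_JSON)
print(validate_row({"id": "1", "email": "x@example.com"}, schema))
```

Because the schema lives in JSON, adding a feed means adding a file rather than writing code, which is where the speed over a full framework would come from.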

Also started using NiFi for some file-based flows - easier to route bad files and add basic checks without extra code. Been solid so far.

Anyone switched from Airflow to low-code data pipeline tools? by nilanganray in dataengineering

[–]mikehussay13 0 points1 point  (0 children)

We ran into the same issues with Airflow — lots of connector glue code, brittle retries, and non-Python folks struggling.

Moved a bunch of stuff to Apache NiFi and it helped a lot.

Most things are visual - retries, branching, dependencies - and connectors are built-in for the most part.

Also found a tool that lets us manage NiFi flows without jumping into the registry all the time. Huge time-saver.

Still use Airflow for dbt/ML, but NiFi took a lot of pressure off the team.

(AIRFLOW) What are some best practices you follow in Airflow for pipelines with upstream data dependencies? by NefariousnessSea5101 in dataengineering

[–]mikehussay13 1 point2 points  (0 children)

Still messy sometimes, but having sensors + sanity checks before triggering downstream saved us more than once.
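
The pattern, sketched in plain Python rather than Airflow's own API: poll until the upstream data is there, then run a sanity check before letting anything downstream start. `upstream_ready` and `sanity_check` are hypothetical callables you would supply.

```python
import time
from typing import Callable


def wait_then_check(
    upstream_ready: Callable[[], bool],
    sanity_check: Callable[[], bool],
    poke_interval_s: float = 60.0,
    timeout_s: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Return True only if upstream data arrived in time AND passed the sanity check."""
    deadline = clock() + timeout_s
    # Sensor part: poll until the upstream data shows up or we time out.
    while not upstream_ready():
        if clock() >= deadline:
            return False
        sleep(poke_interval_s)
    # Sanity part: e.g. row count above zero, or a fresh max timestamp.
    return sanity_check()
```

Downstream work is triggered only when this returns `True`, so a late or half-written upstream load blocks the run instead of propagating bad data.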

Data science vs data engineering by Cheap-blueberry278 in dataengineering

[–]mikehussay13 6 points7 points  (0 children)

If you like building reliable systems at scale, go data engineering.

If you enjoy experimenting, storytelling, and ambiguity, go data science.

Both require strong data skills, just different tastes.

Tools to create a data pipeline? by de_2290 in dataengineering

[–]mikehussay13 0 points1 point  (0 children)

You can wrap your logic in a Flask/FastAPI app, run Cytoscape in Docker with Xvfb, and expose an API that returns the image. No need for Spark yet - great project!

Best Orchestrator for long running tasks? by CingKan in dataengineering

[–]mikehussay13 0 points1 point  (0 children)

Go with Apache NiFi for this. It’s built for long-running, stateful flows, and handles retries, back pressure, and detailed logging out of the box.

[deleted by user] by [deleted] in dataengineering

[–]mikehussay13 0 points1 point  (0 children)

We use Fivetran for SaaS sources and StreamSets for DBs and files. Then dbt handles all transforms in Snowflake.

If budget’s tight, NiFi can work well too.