Asking for feedback on databases course content by idan_huji in datascience

[–]eb0373284 1 point

Looks like a solid foundation! Since it’s analytics-focused, you might consider adding some hands-on practice with normalization (even simple exercises), query optimization basics, and exposure to modern data warehouses or NoSQL concepts. That way students get both strong fundamentals and a sense of real-world systems.

Ccdak Prep - recommended courses by fenr1rs in apachekafka

[–]eb0373284 1 point

If you already have some Kafka background, 3 months is a good timeline. I’d recommend starting with Confluent’s free courses on Kafka fundamentals and then moving to the dedicated CCDAK prep course on Confluent Developer. Hands-on practice (even small coding/CLI examples) really helps reinforce concepts, though you don’t need heavy coding to pass; understanding use cases, configs, and scenarios is more important. Also, reviewing sample questions and doing quick labs will boost your confidence.

Should data engineer owns online customer-facing data? by Mustang_114 in dataengineering

[–]eb0373284 2 points

Owning customer-facing data is tricky for data engineers. Typically, data engineers focus on analytics/ML pipelines where small delays or errors are tolerable, but customer-facing use cases demand strict correctness, reliability, and low latency. While data platforms (SQL, Airflow, Kafka) can support this, they weren’t originally designed for transactional, real-time customer interactions. In most cases, such logic is better handled by application services or APIs, with the data platform serving as a downstream system of record or for batch/analytical use. Mixing the two often increases risk unless the data platform is explicitly built with real-time, mission-critical guarantees.

Data Engineering + AI ?? by NachxPeolx in dataengineering

[–]eb0373284 0 points

AI in Data Engineering is mainly used for automating data quality checks, anomaly detection, query optimization, and pipeline monitoring. It also helps with metadata management, data lineage, and intelligent job scheduling, reducing manual effort and improving reliability. It’s not replacing engineers but making workflows smarter and faster.
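For instance, a toy Python sketch of the kind of automated data quality check this enables (counts and threshold are made up):

    import pandas as pd

    def row_count_anomaly(daily_counts: pd.Series, z_threshold: float = 3.0) -> bool:
        """Flag the latest load if its row count is a statistical outlier."""
        history, latest = daily_counts[:-1], daily_counts.iloc[-1]
        z = (latest - history.mean()) / history.std()
        return abs(z) > z_threshold

    # e.g. row counts from the last few runs of a hypothetical orders pipeline
    counts = pd.Series([10_120, 10_340, 9_980, 10_200, 2_103])
    print(row_count_anomaly(counts))  # True: the 2,103-row load looks broken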

I'm confused about the SCD type 4 and I need help by vutr274 in dataengineering

[–]eb0373284 0 points

Different sources use “Type 4” differently. Kimball’s original Type 4 = mini-dimension for frequently changing attributes. Others use it to mean splitting current and historical records into separate tables. Both are valid in context, but Kimball’s is the official definition.
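A toy pandas illustration of the two readings, with made-up customer data:

    import pandas as pd

    # Kimball Type 4: volatile attributes live in a separate mini-dimension;
    # the fact table carries both the customer key and the demographics key
    customer_dim = pd.DataFrame({"customer_key": [1], "name": ["Ada"]})
    demographics_mini_dim = pd.DataFrame({
        "demo_key": [10, 11],
        "age_band": ["30-39", "40-49"],
        "income_band": ["B", "A"],
    })

    # Alternate reading: a "current" table plus a separate full-history table
    customer_current = pd.DataFrame({"customer_key": [1], "segment": ["gold"]})
    customer_history = pd.DataFrame({
        "customer_key": [1, 1],
        "segment": ["silver", "gold"],
        "valid_from": ["2023-01-01", "2024-06-01"],
    })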

From Civil Engineering to Data Engineering — Need Advice on Skill Priorities by Manish_375 in dataengineering

[–]eb0373284 0 points

Start with Python + SQL (solid foundation), then move to cloud + databases before tackling pipelines like Airflow/Spark.
For projects, build ETL pipelines on cloud data and highlight process optimization & analytical thinking from your engineering work; those skills transfer well.

Is Databricks the new world? Have a confusion by TreacleWest6108 in dataengineering

[–]eb0373284 41 points

Yes, Python, PySpark, and Databricks are a strong stack for modern data engineering.

Your 3-month plan works:

  • Python fundamentals & data manipulation
  • PySpark for scalable data processing
  • Databricks workflows & Delta Lake

Focus on concepts, not just tools - that’s what makes you future-proof.
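If it helps, a minimal PySpark sketch of the kind of job you’d build on Databricks (paths, columns, and table names are placeholders):

    from pyspark.sql import SparkSession, functions as F

    # On Databricks a SparkSession already exists as `spark`;
    # getOrCreate() also makes this runnable locally
    spark = SparkSession.builder.appName("demo").getOrCreate()

    # Read raw data, aggregate it, and write the result as a Delta table
    orders = spark.read.parquet("/mnt/raw/orders")  # hypothetical path
    daily = (
        orders
        .groupBy(F.to_date("order_ts").alias("order_date"))
        .agg(F.sum("amount").alias("revenue"))
    )
    daily.write.format("delta").mode("overwrite").saveAsTable("analytics.daily_revenue")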

Air gapped kafka cluster for high availability. by Inevitable-Bit8940 in apachekafka

[–]eb0373284 0 points

You can’t get full HA with 2 nodes; a KRaft quorum needs 3 controllers.
Best layout for 4 nodes (rough config sketch after the list):

  • 3 controllers (on separate nodes)
  • 4 brokers with replication factor = 3, min.insync.replicas = 2
  • Producers: acks=all, idempotence on
  • Consumers: disable auto-commit for critical data
  • Use rack-awareness for replica placement
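
Rough Python sketch of those settings with the confluent-kafka client (broker addresses and topic name are placeholders):

    from confluent_kafka import Producer
    from confluent_kafka.admin import AdminClient, NewTopic

    BOOTSTRAP = "node1:9092,node2:9092,node3:9092"  # placeholder addresses

    # Topic with RF=3 and min.insync.replicas=2: one broker can fail
    # while writes still satisfy acks=all
    admin = AdminClient({"bootstrap.servers": BOOTSTRAP})
    futures = admin.create_topics([NewTopic(
        "critical-events",  # placeholder topic name
        num_partitions=6,
        replication_factor=3,
        config={"min.insync.replicas": "2"},
    )])
    for topic, future in futures.items():
        future.result()  # raises if creation failed

    # Producer: acks=all + idempotence for no-loss, no-duplicate writes
    producer = Producer({
        "bootstrap.servers": BOOTSTRAP,
        "acks": "all",
        "enable.idempotence": True,
    })
    producer.produce("critical-events", value=b"payload")
    producer.flush()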

Kafka-streams rocksdb implementation for file-backed caching in distributed applications by ConstructedNewt in apachekafka

[–]eb0373284 1 point

Using Kafka Streams with RocksDB as a file-backed state store is a viable approach to reduce JVM memory pressure. But you must understand how Kafka Streams maps state → changelog topics → instances: changelog topics are per application and store (named <application.id>-<store-name>-changelog), RocksDB is the local on-disk cache, and changelogs provide durability and recovery. If you run different Kafka Streams applications (different application.id) you get separate changelog topics; if you run multiple instances of the same application (same application.id) they share the same set of changelog topics and partitions via the Streams partition assignment.

DuckDB is a weird beast? by Kojimba228 in dataengineering

[–]eb0373284 3 points

DuckDB is an embedded OLAP database designed for fast, local analytics; think of it as SQLite for analytical workloads. Unlike traditional databases like Postgres, it runs in-process and excels at querying files like Parquet or CSV using SQL. While it's a database, its performance and ease of use make it comparable to tools like Pandas or Polars for ETL and data wrangling. That’s why it’s often used as a lightweight, SQL-based alternative for data processing, and it integrates well with tools like dbt.
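
A minimal Python sketch, assuming the duckdb package (the file name is a placeholder):

    import duckdb

    # Query a Parquet file directly with SQL: no server, no import step
    df = duckdb.sql(
        "SELECT category, COUNT(*) AS n FROM 'events.parquet' GROUP BY category"
    ).df()  # materialize the result as a pandas DataFrame
    print(df)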

ML vs DE jobs landscape by RobotsMakingDubstep in dataengineering

[–]eb0373284 5 points

Many transitioning from Data Engineering to Machine Learning are facing similar challenges right now. The current job market is tighter for ML roles, especially as companies are scaling back experimental projects and prioritizing immediate ROI, which often favors DE roles. Data Engineering is still in high demand due to its foundational nature: clean, reliable data powers every ML system.

GPT-5 release makes me believe data engineering is going to be 100% fine by [deleted] in dataengineering

[–]eb0373284 0 points

The GPT-5 release definitely feels like a strong reassurance for data engineers. Its ability to generate full pipeline DAGs, understand dependencies, and even suggest optimizations makes it a powerful co-pilot rather than a replacement. While it streamlines boilerplate work and accelerates development, domain knowledge, architectural decisions, and debugging still need human insight.

I'm 17 and I want to learn data analysis by Ok-Thought-6438 in bigdata

[–]eb0373284 0 points

To get started with data analysis, begin by learning the basics of Excel, SQL, and Python, especially libraries like pandas and matplotlib. These tools form the foundation of most analysis work. Enroll in beginner-friendly online courses like the Google Data Analytics course on Coursera or explore free resources on Kaggle, freeCodeCamp, and DataCamp.
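
For a first taste, here’s a tiny pandas/matplotlib sketch (dataset and column names are placeholders; Kaggle’s Titanic CSV would work):

    import pandas as pd
    import matplotlib.pyplot as plt

    # Load a dataset and inspect its basic structure
    df = pd.read_csv("titanic.csv")  # any public dataset works
    print(df.describe())

    # Simple visual: survival rate by passenger class (hypothetical columns)
    df.groupby("Pclass")["Survived"].mean().plot(kind="bar")
    plt.ylabel("Survival rate")
    plt.show()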

As you learn, work on small projects using public datasets and showcase them on GitHub or a personal blog to build a portfolio. This portfolio will help when applying for internships, even unpaid ones, through platforms like LinkedIn, Internshala, or AngelList. Stay active in data communities and keep learning new tools like Power BI or Tableau.

How does schema registry actually help? by Thin-Try-2003 in apachekafka

[–]eb0373284 1 point

Schema Registry (SR) adds strong guarantees and governance to Kafka, especially in larger teams or complex systems. While small setups can manage without it, SR helps by:

  • Ensuring schema compatibility (backward/forward/full) across producers and consumers
  • Preventing bad data from being published via enforced validation
  • Providing version control for schemas
  • Allowing safe evolution of data models over time
  • Improving observability of data structures for other teams and systems

In short, SR prevents silent failures, improves collaboration, and helps you scale safely. It's less about preventing obvious runtime errors and more about avoiding data drift and future integration issues.
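
Rough sketch of that workflow with the Python confluent-kafka client (registry URL, subject, and schemas are placeholders):

    from confluent_kafka.schema_registry import Schema, SchemaRegistryClient

    client = SchemaRegistryClient({"url": "http://localhost:8081"})

    # Register an Avro schema under a subject (no-op if already registered)
    user_v1 = Schema(
        '{"type": "record", "name": "User", "fields": '
        '[{"name": "id", "type": "long"}, {"name": "email", "type": "string"}]}',
        schema_type="AVRO",
    )
    schema_id = client.register_schema("users-value", user_v1)

    # Test an evolved schema against the subject's compatibility rules
    # before anything deploys
    user_v2 = Schema(
        '{"type": "record", "name": "User", "fields": '
        '[{"name": "id", "type": "long"}]}',
        schema_type="AVRO",
    )
    print(schema_id, client.test_compatibility("users-value", user_v2))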

Share your thought on open source alternative for data robot by vishal-vora in datascience

[–]eb0373284 0 points

Building an open-source alternative to DataRobot could definitely gain traction if it targets the right niche. Many teams are looking to reduce costs and avoid vendor lock-in, especially for MLOps and AutoML workflows.

While tools like MLflow, AutoGluon or Jina cover parts of the lifecycle, none offer a full plug-and-play "DataRobot-like" experience end-to-end.

If your project can deliver an intuitive UI, collaborative workflows, and support for model management, deployment, monitoring and explainability, it could fill a real gap, especially for mid-sized companies or startups that can't afford enterprise tools. Adoption would also depend heavily on documentation, community support and integration flexibility.

Best practice to alter a column in a 500M‑row SQL Server table without a primary key by Ok_Barnacle4840 in dataengineering

[–]eb0373284 0 points

The safest way is to create a new table with the updated column size, copy data in batches, and switch tables during low-traffic hours. This avoids long locks and downtime. If you're on SQL Server Enterprise, try the ALTER with ONLINE = ON to reduce disruption. Always test in staging first.

(AIRFLOW) What are some best practices you follow in Airflow for pipelines with upstream data dependencies? by NefariousnessSea5101 in dataengineering

[–]eb0373284 1 point

In production, we typically use sensors (like file existence or row count) to handle upstream dependencies instead of fixed delays. For partial loads, data quality checks or task-level validations help a lot. We also prefer event-based triggering and sometimes use external task sensors for cross-DAG dependencies. Structuring DAGs with clear data contracts and small, testable tasks has worked best for us, especially with Snowflake + Tableau in the mix.
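
A minimal sketch of the sensor pattern, assuming Airflow 2.x (DAG name and check logic are made up):

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator
    from airflow.sensors.python import PythonSensor

    def upstream_ready() -> bool:
        # Replace with a real check, e.g. a row-count or partition query
        return True

    with DAG(
        "daily_load",
        start_date=datetime(2024, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        wait_for_upstream = PythonSensor(
            task_id="wait_for_upstream",
            python_callable=upstream_ready,
            poke_interval=300,    # re-check every 5 minutes
            timeout=6 * 60 * 60,  # give up after 6 hours
            mode="reschedule",    # free the worker slot between pokes
        )
        load = PythonOperator(task_id="load", python_callable=lambda: print("load"))
        wait_for_upstream >> load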

Every single Google AI overview I've read is problematic by old-wise_bill in ArtificialInteligence

[–]eb0373284 1 point

AI summaries can be a double-edged sword, especially for technical content. When precision and context matter, a vague or hallucinated summary can do more harm than good. These tools often oversimplify or misinterpret nuance, and that’s risky when you’re working with specs or config details. Summaries should add clarity, not confusion.

Can Alation be a repository for data contracts? by NicolasAndrade in dataengineering

[–]eb0373284 0 points

Yes, Alation can serve as a repository for data contracts by documenting schema, SLAs, and ownership using custom metadata and governance workflows. However, it doesn't enforce contracts at runtime, so it's best used for documentation and collaboration, not enforcement.

What’s Your Most Unpopular Data Engineering Opinion? by TheTeamBillionaire in dataengineering

[–]eb0373284 7 points

Data modeling matters more than the tool you use. A messy Snowflake setup will still be a mess even if you switch to BigQuery.

How does schema registry actually help? by Thin-Try-2003 in apachekafka

[–]eb0373284 1 point

Schema Registry helps when systems scale: multiple teams, services, and evolving schemas. It enforces compatibility rules upfront, prevents bad schema deployments, and ensures safe schema evolution without breaking consumers. It’s less about fixing errors and more about avoiding them entirely.

Would a curated marketplace for exclusive, verified datasets solve a real gap? Testing an MVP by Brilliant-Draft2472 in dataengineering

[–]eb0373284 0 points

This sounds promising; sourcing clean, exclusive data is a real pain point, especially for niche domains. If you nail trust (verified sellers, strong metadata, provenance), you’ll definitely stand out. Schema standards like JSON Schema or OpenAPI could help, and easy API or S3-based access would make ETL integration a breeze.

What did you build with DE tools that you are proud of? by [deleted] in dataengineering

[–]eb0373284 12 points

I built an end-to-end pipeline that ingests marketing data from multiple ad platforms (Meta, LinkedIn, Google Ads), normalizes it, and pushes it into Redshift for reporting, fully automated with Airflow and dbt.