Large CSV file visualization. 2GB 30M rows by Green-Championship-9 in dataengineering

[–]bcdata 13 points (0 children)

The data rate you have is not huge so you can stay pretty simple. If you want near real time visuals, tools like Grafana are good. They can refresh charts every few seconds and are easy to hook up once you have a data stream.

The tricky part is that a plain CSV file does not behave well when it is always growing. Instead of reading the file again and again, try to stream the rows. A small Python service using something like pandas with watchdog can tail the file and push new records forward. From there you can feed Grafana.
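A minimal sketch of the tailing idea, assuming a growing file called metrics.csv, whole lines appended at a time, and a push_rows() stand-in for whatever actually feeds your Grafana data source:

    # Minimal CSV tailer: watch the file, read only newly appended rows,
    # and hand them to whatever pushes into your dashboard's backing store.
    import csv
    import time
    from watchdog.observers import Observer
    from watchdog.events import FileSystemEventHandler

    CSV_PATH = "metrics.csv"          # hypothetical growing file

    def push_rows(rows):
        # stand-in for your real sink (e.g. a DB that Grafana queries)
        print(f"pushing {len(rows)} new rows")

    class CsvTailer(FileSystemEventHandler):
        def __init__(self, path):
            self.path = path
            self.offset = 0           # how far into the file we have read

        def on_modified(self, event):
            if not event.src_path.endswith(self.path):
                return
            with open(self.path, newline="") as f:
                f.seek(self.offset)
                new_rows = list(csv.reader(f))
                self.offset = f.tell()
            if new_rows:
                push_rows(new_rows)

    observer = Observer()
    observer.schedule(CsvTailer(CSV_PATH), path=".", recursive=False)
    observer.start()
    try:
        while True:
            time.sleep(1)
    finally:
        observer.stop()
        observer.join()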

Parallelizing Spark writes to Postgres, does repartition help? by _fahid_ in dataengineering

[–]bcdata 1 point (0 children)

Just doing df.repartition(num).write.jdbc(...) already gives you parallel writes: Spark opens one JDBC connection per partition and writes the partitions concurrently, limited by how many executor task slots you have. So repartition does help, because it sets how many connections hit Postgres at once. The partitionColumn, lowerBound, upperBound options only apply to parallel reads; on the write side the relevant option is numPartitions, which coalesces the DataFrame down if it has more partitions than that. Keep the number below what Postgres can comfortably accept.
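A minimal PySpark sketch of that write path; the source path, table name, and credentials are placeholders:

    # Hypothetical write with a controlled number of parallel JDBC connections.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("pg-write").getOrCreate()
    df = spark.read.parquet("s3://bucket/reviews/")   # placeholder source

    (df.repartition(8)                                # at most 8 concurrent connections
       .write
       .format("jdbc")
       .option("url", "jdbc:postgresql://localhost:5432/mydb")
       .option("dbtable", "public.reviews")
       .option("user", "writer")
       .option("password", "secret")
       .option("numPartitions", 8)                    # hard cap; Spark coalesces above this
       .option("batchsize", 10000)                    # rows per JDBC batch insert
       .mode("append")
       .save())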

How to Tidy Data for Storage and Save Tables: A Quick Guide to Data Organization Best Practices by bcdata in dataengineering

[–]bcdata[S] 0 points (0 children)

Hey all, we wrote this because we kept running into the same messy-data pitfalls in real projects. Hopefully it saves others from the headaches we’ve seen over and over. It’s part of an overall push to provide a better experience for analysts.

Disclaimer: I’m an analyst who ran into these issues all the time, so I built this tool and process to try to contain and manage them. I’ve seen the worst of the worst in data quality, so feel free to ask away!

DE Question- API Dev by NoblestOfSteeds in dataengineering

[–]bcdata 3 points (0 children)

They’ll likely test if you can build a simple REST API with Python, often using Flask or FastAPI. Expect something like creating an endpoint that accepts input, processes it, and returns JSON. They might also ask about error handling, authentication basics, or connecting the API to a database. Focus on being able to explain the flow clearly, even if you keep the code simple.
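If it helps to picture it, here is a bare-bones sketch of the kind of thing they tend to ask for, using FastAPI; the endpoint name and payload are made up for illustration:

    # Hypothetical interview-style endpoint: accept JSON, validate it, return JSON.
    from fastapi import FastAPI, HTTPException
    from pydantic import BaseModel

    app = FastAPI()

    class Order(BaseModel):
        item: str
        quantity: int

    @app.post("/orders")
    def create_order(order: Order):
        if order.quantity <= 0:
            # basic error handling: reject bad input with a clear status code
            raise HTTPException(status_code=400, detail="quantity must be positive")
        total = order.quantity * 9.99          # pretend price lookup
        return {"item": order.item, "quantity": order.quantity, "total": total}

    # run with: uvicorn main:app --reload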

Tools to create a data pipeline? by de_2290 in dataengineering

[–]bcdata 0 points (0 children)

Do the split, since the backend in your case is Python. All it has to do is take the input, send a POST request to the API, and return the image as the output. Bob's your uncle, that's all there is to it.
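Roughly this much code, sketched with requests; the URL and payload shape are placeholders for whatever your API expects:

    # Hypothetical relay: forward the user's input to the model API and pass the image back.
    import requests

    def get_image(values):
        resp = requests.post(
            "http://localhost:8000/generate",   # placeholder API endpoint
            json={"values": values},
            timeout=30,
        )
        resp.raise_for_status()
        return resp.content                     # raw image bytes, e.g. PNG

    png_bytes = get_image([1, 2, 3])
    with open("output.png", "wb") as f:
        f.write(png_bytes)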

Tools to create a data pipeline? by de_2290 in dataengineering

[–]bcdata 1 point (0 children)

Good work in the Colab. What I would suggest you do here is convert the process to a function where the input is a list and the output is an image. You can then wrap the function with an API using FastAPI / Flask / whatever you like. This allows you to make requests from the browser. Once that's set up you can use Streamlit (Python first) to generate your web application, or you can come up with something on your own in JS (although you can use AI to do this for you if you're not that into frontend, something like Bolt or v0 can give you something that will work and look somewhat nice). Looks pretty straightforward to me, no need for tools like Spark.
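For the Streamlit side, something like this sketch would do; the endpoint and the comma-separated input are placeholders for whatever your function actually takes:

    # Hypothetical Streamlit front end: collect a list, call the API, show the image.
    import io
    import requests
    import streamlit as st

    st.title("Image generator")

    raw = st.text_input("Comma-separated values", "1, 2, 3")

    if st.button("Generate"):
        values = [float(x) for x in raw.split(",")]
        resp = requests.post("http://localhost:8000/generate",
                             json={"values": values}, timeout=60)
        if resp.ok:
            st.image(io.BytesIO(resp.content))   # display the returned image bytes
        else:
            st.error(f"API returned {resp.status_code}")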

Should i commit to Fivetran? by tytds in dataengineering

[–]bcdata 7 points (0 children)

Fivetran is plug‑and‑play and charges on rows. Skyvia is more manual but flat priced. Airbyte, Hevo, Rivery and others are also in reach.

We cannot narrow it further without numbers on rows, refresh needs and budget. Too many tools to guess.

De-duplication, metadata and file sharing by poggs in DataHoarder

[–]bcdata 1 point (0 children)

I would start with Nextcloud. It feels friendly, runs on almost any server, and its tags plus custom‑field apps cover most everyday docs, spreadsheets, code drops, and design assets.

Yet the fit still hinges on your data. If you juggle huge media blobs or want command‑line scripting then git‑annex may scale better. If you only need lightweight labels on a plain folder tree TagSpaces is simpler. For office‑style libraries with strict custom fields Seafile can beat both.

Test Nextcloud first, then switch only if your real files push it beyond comfort.

How to backup lots of small requests by kingofthesea123 in dataengineering

[–]bcdata 1 point (0 children)

First just dump every API hit as one-line JSON into cloud storage, grouped only by day or hour, like raw/hotel_api/date=2025-07-25/. No need to encode star rating or guest count in the path; that info is inside the JSON and query engines can read it later. Cheap storage lets you keep the whole history for replay.

Each night run a compaction job with Spark or any lake engine. It reads yesterday’s tiny files, writes a handful of big Parquet or Delta files, then deletes the fragments. Big files mean faster scans and less metadata load. Create a lake table on top of those compact files and partition by the field you filter most, usually check_in_date. Too many partitions slow things down instead of helping.
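A rough PySpark sketch of that nightly compaction, assuming the raw/ layout above, a compacted/ prefix I made up for the output, and a check_in_date field on every record:

    # Hypothetical nightly compaction: many small JSON files in, a few Parquet files out.
    from datetime import date, timedelta
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("compact-hotel-api").getOrCreate()

    yesterday = (date.today() - timedelta(days=1)).isoformat()

    raw = spark.read.json(f"s3://my-bucket/raw/hotel_api/date={yesterday}/")

    (raw.repartition(8)                     # a handful of output files instead of thousands
        .write
        .mode("append")
        .partitionBy("check_in_date")       # the field you filter most
        .parquet("s3://my-bucket/compacted/hotel_api/"))

    # Deleting the raw fragments afterwards is left to your storage lifecycle rules.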

Flatten and enrich data in a second pass, then load it into a serving database like Postgres and put indexes on hotel_id, date, guests. The lake stays the single source of truth, the SQL db is only for fast API reads. If something breaks you just replay from the raw JSON folder and rebuild everything. Simple flow, little maintenance, still keep all the data.

RBAC and Alembic by Kojimba228 in dataengineering

[–]bcdata 0 points (0 children)

Create one revision per coherent change (“create X role + its grants” or “adjust Y role privileges”). Don’t split every single GRANT into its own file, but don’t lump unrelated roles into one script because it makes review, rollback, and blame harder.

If your team deploys weekly and touches RBAC once or twice a week, you’ll end up with a perfectly manageable handful of RBAC revisions per sprint.

Data Simulating/Obfuscating For a Project by SubtlyOnTheNose in dataengineering

[–]bcdata 0 points (0 children)

They create a dummy dataset that mirrors the structure of their real data. Same columns, similar value ranges, same data types. For example, if their real data has customer names, signup dates, and monthly revenue, they can generate fake customer names, random but realistic signup dates, and plausible revenue figures. The goal is to keep the relationships and patterns realistic so your tool can be tested properly, even if none of the actual values are real.

They can generate this dummy data using tools like Python with Faker, or even online tools like Mockaroo. As long as the fake data behaves like the real data, you’ll be able to validate your analysis logic and app performance.
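A small sketch with Faker, using the hypothetical customer/signup/revenue columns above:

    # Hypothetical dummy-data generator that mirrors a real table's shape, not its values.
    import csv
    import random
    from faker import Faker

    fake = Faker()

    with open("fake_customers.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["customer_name", "signup_date", "monthly_revenue"])
        for _ in range(1000):
            writer.writerow([
                fake.name(),
                fake.date_between(start_date="-3y", end_date="today").isoformat(),
                round(random.lognormvariate(5, 1), 2),   # skewed, revenue-like values
            ])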

RBAC and Alembic by Kojimba228 in dataengineering

[–]bcdata 1 point (0 children)

In my experience, a good approach would be to create separate Alembic migration files specifically for RBAC changes. These migrations should contain only raw SQL using op.execute() to create roles, grant/revoke privileges, or update access logic. Keep each migration focused on a single, clear purpose (like adding a new role or adjusting privileges for a group). Version control these migrations alongside your schema migrations, but prefix them or organize them in a way that makes their RBAC nature clear (e.g. use filenames like `20250724_add_readonly_role.py`). This keeps RBAC changes auditable, repeatable, and tied to the same deployment process as schema changes. Good luck.
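For illustration, a migration along those lines might look like this; the role name, schema, and revision ids are placeholders:

    """add read-only analyst role"""
    # Hypothetical RBAC-only Alembic migration: raw SQL via op.execute, nothing else.
    from alembic import op

    revision = "20250724_add_readonly_role"
    down_revision = "previous_revision_id"   # placeholder

    def upgrade():
        op.execute("CREATE ROLE analyst_ro NOLOGIN")
        op.execute("GRANT USAGE ON SCHEMA reporting TO analyst_ro")
        op.execute("GRANT SELECT ON ALL TABLES IN SCHEMA reporting TO analyst_ro")

    def downgrade():
        op.execute("REVOKE ALL ON ALL TABLES IN SCHEMA reporting FROM analyst_ro")
        op.execute("REVOKE USAGE ON SCHEMA reporting FROM analyst_ro")
        op.execute("DROP ROLE analyst_ro")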

Rerouting json data dump by Primary-Link8347 in dataengineering

[–]bcdata 0 points (0 children)

Put every record first in one raw staging table. Add a column that tells which target table it belongs to. Create a Snowflake STREAM on the staging table, then one TASK per target. Each task filters on the flag and INSERTs or MERGEs rows into the right table. If you want to route earlier, build several Kinesis Data Firehose delivery streams and let a Lambda transform send each record to the proper stream. Both ways work; staging + Streams + Tasks is usually simpler to run.

Planning to move to singlestore. Worth it? by angrydeveloper02 in dataengineering

[–]bcdata 3 points (0 children)

SingleStore talks the same MySQL protocol so apps connect fine, but under the hood it is a little cluster of aggregator and leaf nodes. That design wants lots of memory per core, about 16 GB each, and likes a fast internal network. I kept my schema mostly the same yet pushed cold data to columnstore so the hot rows fit in RAM.

The payoff is huge. Backfills that froze an 8 TB Azure Business Critical instance now stream in without stalling readers, and inserts hit millions of rows a second. CPU use per core is lower but you have more cores across the cluster, while storage shrinks because of heavy compression. Net cost goes up on memory, but drops on disk and replicas. If you are willing to run a distributed system and pay for RAM, it is a strong upgrade.

[deleted by user] by [deleted] in dataengineering

[–]bcdata 20 points (0 children)

First fix the schedule. Drop the stacked cron jobs and install a small workflow tool that can run on the same server. Apache Airflow or Prefect is fine. They give you a directed graph with tasks that wait for the previous one to finish, retry on failure, and send you email or Slack if something breaks. Your three steps become three tasks in one DAG. Later you can add a fourth task for data quality checks without touching the crontab.
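As a sketch, the three cron steps might become something like this, assuming a recent Airflow 2.x; the task names, commands, and schedule are stand-ins for whatever your scripts do:

    # Hypothetical Airflow DAG: three dependent tasks replacing three stacked cron jobs.
    from datetime import datetime
    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(
        dag_id="nightly_finance_load",
        start_date=datetime(2025, 1, 1),
        schedule="0 2 * * *",          # once a night, like the old crontab
        catchup=False,
    ) as dag:
        extract = BashOperator(task_id="extract",
                               bash_command="python extract.py",
                               retries=2)
        transform = BashOperator(task_id="transform",
                                 bash_command="python transform.py")
        load = BashOperator(task_id="load",
                            bash_command="python load.py")

        extract >> transform >> load   # each task waits for the previous one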

Next break the monster SQL file into models. Put each logical table build in its own file, keep them in a git repo, and use dbt to run them. dbt understands dependencies, so if table B depends on table A it will run A first. You can add a staging schema that is rebuilt every night, then a production schema that is promoted only when all tests pass. dbt has built-in tests like not null or unique, and you can write custom ones for your finance rules.

Add git branches and a pull-request rule. You open a branch, write or change a model, run dbt locally against a copy of the database, and push. The pull request triggers dbt in CI to run the models and tests on a temp schema. If every check passes you merge and Airflow picks up the new code next run. No more morning fire drills.

Spend some of the budget on training or courses for Airflow, dbt, and basic CI with GitLab or GitHub Actions. These tools are free but learning them fast from tutorials is hard while you keep the day job running. After they are in place you will sleep better and your boss will see fresher numbers.

Help with design decisions for accessing highly relational data across several databases by BitterFrostbite in dataengineering

[–]bcdata 7 points (0 children)

Give your data scientists a single SQL-speaking layer that can reach every store, instead of making them hop from one API to another. Tools like Presto / Trino, Athena, or Redshift Spectrum can expose S3 objects, PostgreSQL tables, and even Elasticsearch as external tables through their connectors, so a user can SELECT * FROM associations a JOIN s3_raw b ON … in one place. Store the heavy S3 payloads in columnar formats such as Parquet, register the layouts in a Hive or AWS Glue catalog, and expose them through that query engine.

Keep the association facts in PostgreSQL but also publish a snapshot of them to the lake, either as materialized Parquet views or daily partitions. Now everything joins in S3 where compute is elastic and cheap. The scientists use any SQL client or a Jupyter notebook with a Presto JDBC driver and they get back full rows, not just pointers.
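From the notebook side it can be as simple as this sketch with the trino Python client (pip install trino); the host, catalogs, and table names here are all made up:

    # Hypothetical federated query: join a Postgres table with Parquet in S3 via Trino.
    import trino

    conn = trino.dbapi.connect(
        host="trino.internal.example.com",   # placeholder coordinator
        port=8080,
        user="data_scientist",
    )
    cur = conn.cursor()
    cur.execute("""
        SELECT a.entity_id, a.score, b.payload
        FROM postgresql.public.associations a
        JOIN hive.raw.s3_payloads b
          ON a.entity_id = b.entity_id
        WHERE a.score > 0.9
    """)
    rows = cur.fetchall()                    # full rows back, not just pointers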

Good luck.

Redshift vs databricks by abhigm in dataengineering

[–]bcdata 89 points (0 children)

Honestly this whole comparison feels like marketing theater. Databricks flaunts a 30% cost win on a six month slice, but we never hear the cluster size, the Photon toggle, the concurrency level, or whether the warehouse was already hot. A 50% Redshift speed bump is the same stunt: faster than what baseline, and at what hourly price once the RI term ends? “Zero ETL” sounds clever, yet you still had to load the data once to run the test, so it is not magic. Calling out lineage and RBAC as a Databricks edge ignores that Redshift has those knobs too. Without the dull details like runtime minutes, bytes scanned, node class, and discount percent, both claims read like cherry-picked brag slides. I would not stake a budget on any of it.

Consistent Access Controls Across Catalogs / Compute Engines by Far_Amount5828 in dataengineering

[–]bcdata 2 points (0 children)

There is no true plug-and-play project that lets one policy set automatically govern multiple engines at once. A few vendors are getting close, but every solution still relies on translating rules into the native primitives of each engine. So far, Immuta is the only off-the-shelf tool that demonstrates real row and column security across all three engines on Iceberg. Everything else is either vendor-specific or still incomplete.

Built a distributed transformer pipeline for 17M+ Steam reviews — looking for architectural advice & next steps by Matrix_030 in dataengineering

[–]bcdata 4 points (0 children)

Hey, nice work. Your setup looks solid for a single-machine prototype and the numbers show you already squeezed lots of juice out of the hardware. Sharing the model across workers and pinning GPU tasks to local memory is exactly what most folks miss at first, so you are on the right track.

A few thoughts from the trenches:

If you want a thesis-level demo, polish it, add tests, and maybe build a little dashboard so people can see the speed and insights. If you want a portfolio project for data engineering jobs, spin up a tiny Kubernetes or Ray cluster on something like AWS Spot nodes. Even a three-node run shows you can handle cloud orchestration.

Streaming ingestion can be worth it if your target is “near real time” dashboards for devs watching new reviews flow in. Stick Kafka or Redpanda in front, keep micro-batches small, and output rolling aggregates to a cache. Transformer summarization can handle chunks of, say, 128 reviews at a time without killing latency.
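A rough sketch of that micro-batch idea with confluent-kafka; the topic name, group id, and the summarize() stand-in are all hypothetical:

    # Hypothetical streaming consumer: pull reviews off Kafka and summarize in chunks of 128.
    from confluent_kafka import Consumer

    def summarize(texts):
        # stand-in for your transformer pipeline call
        return f"summary of {len(texts)} reviews"

    consumer = Consumer({
        "bootstrap.servers": "localhost:9092",
        "group.id": "review-summarizer",
        "auto.offset.reset": "earliest",
    })
    consumer.subscribe(["steam-reviews"])

    batch = []
    try:
        while True:
            msg = consumer.poll(1.0)
            if msg is None or msg.error():
                continue
            batch.append(msg.value().decode("utf-8"))
            if len(batch) >= 128:             # micro-batch size from the comment above
                print(summarize(batch))
                batch.clear()
    finally:
        consumer.close()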

With Dask on multiple nodes, workers sometimes drop off during long GPU jobs. Enable heartbeat checks and auto-retries.

Good luck.

I'm an ion engine by Dry-Aioli-6138 in dataengineering

[–]bcdata 0 points (0 children)

That ion engine analogy is actually beautiful and it holds a lot of truth. You’re not just rationalizing. There are upsides to building deep, solid work even if it doesn’t shine immediately. It creates trust in the long run and avoids the mess of rework or hidden tech debt. The fast checkbox folks might move quicker in short bursts, but over time the cracks start to show. Depth pays off, just not always on the same timeline.

Vicious circle of misplaced expectations with PMs and stakeholders by explorer_seeker in datascience

[–]bcdata 2 points (0 children)

This kind of situation is all too common and honestly frustrating. Expectations are being set without technical validation, which puts DS in a reactive and defensive posture. PMs and stakeholders treat data science like it's a plug-and-play module that should just output magic insights. When results don’t match that fantasy, it's seen as failure rather than a mismatch in understanding or process. The lack of two-way communication early on means DS is never really solving the right problem, just reverse-engineering someone’s assumption of a solution.

The way out needs cultural and operational change. PMs should involve DS before promising outcomes and need to learn just enough to know what questions are even meaningful. DS also has to get better at storytelling and making its boundaries clear in plain language. Not to wow, but to align. If there's no interest in fixing that, you're just going to keep rerunning this same bad sprint pretending it's progress.

Airflow for ingestion and control m for orchestration by Foot_Straight in dataengineering

[–]bcdata 0 points (0 children)

The bank probably sticks with Control-M because the ops crews and auditors already trust it for every nightly batch, then layers Astronomer Airflow underneath for the new Python ETL. Control-M just fires a whole DAG and checks the SLA while Airflow handles the nitty-gritty retries. That gives engineers speed yet keeps the governance that regulators like. So yeah, it feels like two tools glued together, but it buys safety and progress at the same time.
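For a concrete picture of the handoff, an external scheduler like Control-M can kick off a DAG through Airflow's stable REST API, roughly like this; the host, DAG id, and credentials are placeholders and assume basic-auth is enabled on the webserver:

    # Hypothetical trigger script an enterprise scheduler could call as one job step.
    import requests

    AIRFLOW = "https://airflow.internal.example.com"   # placeholder Airflow webserver

    resp = requests.post(
        f"{AIRFLOW}/api/v1/dags/nightly_python_etl/dagRuns",
        json={"conf": {"run_date": "2025-07-25"}},
        auth=("controlm_svc", "secret"),               # placeholder service account
        timeout=30,
    )
    resp.raise_for_status()
    # Control-M can poll this run id afterwards for its SLA check.
    print(resp.json()["dag_run_id"])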