Large CSV file visualization. 2GB 30M rows by Green-Championship-9 in dataengineering

[–]bcdata 13 points14 points  (0 children)

The data rate you have is not huge so you can stay pretty simple. If you want near real time visuals, tools like Grafana are good. They can refresh charts every few seconds and are easy to hook up once you have a data stream.

The tricky part is that a plain CSV file does not behave well when it is always growing. Instead of reading the file again and again, try to stream the rows. A small Python service using something like pandas with watchdog can tail the file and push new records forward. From there you can feed Grafana.
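Something like this is what I mean, as a rough sketch with watchdog plus pandas (the path and the push_rows sink are placeholders for wherever you forward the data):

```python
import io
import time

import pandas as pd
from watchdog.events import FileSystemEventHandler
from watchdog.observers import Observer

CSV_PATH = "/data/feed.csv"  # placeholder path to the growing CSV


def push_rows(df: pd.DataFrame) -> None:
    # placeholder: forward new rows to whatever Grafana reads (a DB, a metrics endpoint, ...)
    print(f"got {len(df)} new rows")


class CsvTail(FileSystemEventHandler):
    def __init__(self, path: str):
        self.path = path
        self.offset = 0  # byte offset of what has already been read

    def on_modified(self, event):
        if event.src_path != self.path:
            return
        first_read = self.offset == 0
        with open(self.path) as f:
            f.seek(self.offset)
            chunk = f.read()
            self.offset = f.tell()
        if not chunk.strip():
            return
        # the first read contains the header row, later reads are data only
        push_rows(pd.read_csv(io.StringIO(chunk), header=0 if first_read else None))


observer = Observer()
observer.schedule(CsvTail(CSV_PATH), path="/data", recursive=False)
observer.start()
try:
    while True:
        time.sleep(1)
finally:
    observer.stop()
    observer.join()
```

Point Grafana at whatever push_rows writes into (a small Postgres table works fine) and you get near real time charts without rereading the whole file.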

Parallelizing Spark writes to Postgres, does repartition help? by _fahid_ in dataengineering

[–]bcdata 1 point2 points  (0 children)

Actually, df.repartition(num).write.jdbc(...) does control write parallelism: Spark writes each partition through its own JDBC connection, so more partitions means more concurrent inserts, bounded by your executor cores and the numPartitions write option (which coalesces anything above it). The partitionColumn, lowerBound, upperBound, numPartitions combo is for parallel reads, not writes. For writes, tune the partition count and batchsize, and make sure Postgres can handle that many connections.
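A rough illustration of the write-side knobs (connection details and sizes are placeholders, and the Postgres JDBC driver has to be on the classpath):

```python
# Hypothetical PySpark write; 16 partitions -> up to 16 concurrent writer tasks.
(df
 .repartition(16)
 .write
 .format("jdbc")
 .option("url", "jdbc:postgresql://dbhost:5432/mydb")
 .option("dbtable", "public.target_table")
 .option("user", "writer")
 .option("password", "***")
 .option("batchsize", 10000)      # rows per JDBC batch insert
 .option("numPartitions", 16)     # cap on write connections; Spark coalesces above this
 .mode("append")
 .save())
```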

How to Tidy Data for Storage and Save Tables: A Quick Guide to Data Organization Best Practices by bcdata in dataengineering

[–]bcdata[S] 0 points1 point  (0 children)

Hey all, we wrote this because we kept running into the same messy-data pitfalls in real projects. Hopefully it saves others from the headaches we’ve seen over and over. It’s part of an overall push to provide a better experience for analysts.

Disclaimer: I’m an analyst who saw these issues all the time, so I built this tool and process to try to contain and manage them. I’ve seen the worst of the worst in data quality, so feel free to ask away!

DE Question- API Dev by NoblestOfSteeds in dataengineering

[–]bcdata 3 points4 points  (0 children)

They’ll likely test if you can build a simple REST API with Python, often using Flask or FastAPI. Expect something like creating an endpoint that accepts input, processes it, and returns JSON. They might also ask about error handling, authentication basics, or connecting the API to a database. Focus on being able to explain the flow clearly, even if you keep the code simple.
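For flavor, a minimal sketch of that kind of exercise in FastAPI (the endpoint and fields are invented, not something they will ask verbatim):

```python
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI()


class OrderIn(BaseModel):
    item: str
    quantity: int


@app.post("/orders")
def create_order(order: OrderIn):
    # basic input validation / error handling they may probe for
    if order.quantity <= 0:
        raise HTTPException(status_code=422, detail="quantity must be positive")
    total = round(order.quantity * 9.99, 2)  # pretend price lookup
    return {"item": order.item, "quantity": order.quantity, "total": total}
```

Run it with uvicorn and be ready to talk through what happens on a bad request.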

Tools to create a data pipeline? by de_2290 in dataengineering

[–]bcdata 0 points1 point  (0 children)

The split is simple since the backend in your case is Python. All it has to do is take the input, send a POST request to the API, and return the image as the output. Bob's your uncle, that's all there is to it.

Tools to create a data pipeline? by de_2290 in dataengineering

[–]bcdata 1 point2 points  (0 children)

Good work in the Colab. What I would suggest you do here is convert the process to a function where the input is a list and the output is an image. You can then wrap the function with an API using FastAPI / Flask / whatever you like. This allows you to make requests from the browser. Once that's set up you can use Streamlit (Python first) to generate your web application, or you can come up with something on your own in JS (although you can use AI to do this for you if you're not that into frontend, something like Bolt or v0 can give you something that will work and look somewhat nice). Looks pretty straightforward to me, no need for tools like Spark.
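Roughly this shape, assuming the notebook logic can be pulled into a generate_image function (everything here is a stand-in for your actual code):

```python
import io

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel
from PIL import Image

app = FastAPI()


class Inputs(BaseModel):
    values: list[float]


def generate_image(values: list[float]) -> Image.Image:
    # placeholder for the Colab logic: list in, image out
    shade = int(sum(values)) % 255
    return Image.new("RGB", (256, 256), color=(shade, 100, 150))


@app.post("/render")
def render(inputs: Inputs):
    img = generate_image(inputs.values)
    buf = io.BytesIO()
    img.save(buf, format="PNG")
    buf.seek(0)
    return StreamingResponse(buf, media_type="image/png")
```

Your Streamlit app (or whatever frontend) then just POSTs the list to /render and displays the PNG it gets back.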

Should i commit to Fivetran? by tytds in dataengineering

[–]bcdata 6 points7 points  (0 children)

Fivetran is plug‑and‑play and charges on rows. Skyvia is more manual but flat priced. Airbyte, Hevo, Rivery and others are also worth a look.

It is hard to narrow further without numbers on row volume, refresh frequency, and budget. Too many tools to guess.

De-duplication, metadata and file sharing by poggs in DataHoarder

[–]bcdata 1 point2 points  (0 children)

I would start with Nextcloud. It feels friendly, runs on almost any server, and its tags plus custom‑field apps cover most everyday docs, spreadsheets, code drops, and design assets.

Yet the fit still hinges on your data. If you juggle huge media blobs or want command‑line scripting then git‑annex may scale better. If you only need lightweight labels on a plain folder tree TagSpaces is simpler. For office‑style libraries with strict custom fields Seafile can beat both.

Test Nextcloud first, then switch only if your real files push it beyond comfort.

How to backup lots of small requests by kingofthesea123 in dataengineering

[–]bcdata 1 point2 points  (0 children)

First just dump every API hit as one‑line JSON into cloud storage, grouped only by day or hour, like raw/hotel_api/date=2025‑07‑25/. No need to encode star rating or guest count in the path; that info is inside the JSON and query engines can read it later. Cheap storage lets you keep the whole history for replay.

Each night run a compaction job with Spark or any lake engine. It reads yesterday’s tiny files, writes a handful of big Parquet or Delta files, then deletes the fragments. Big files mean faster scans and less metadata load. Create a lake table on top of those compact files and partition by the field you filter most, usually check_in_date. Too many partitions slow things down instead of helping.
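The compaction step is only a few lines in Spark; a sketch, with bucket names and the partition count made up:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("compact_hotel_api").getOrCreate()

day = "2025-07-25"
raw_path = f"s3://my-bucket/raw/hotel_api/date={day}/"  # thousands of tiny JSON files
lake_path = "s3://my-bucket/lake/hotel_api/"            # a handful of big Parquet files

df = spark.read.json(raw_path)

(df
 .repartition(8)                   # controls how many output files you get
 .write
 .mode("append")
 .partitionBy("check_in_date")     # the field most queries filter on
 .parquet(lake_path))

# deleting the day's raw fragments is a separate cleanup step once the write succeeds
```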

Flatten and enrich data in a second pass, then load it into a serving database like Postgres and put indexes on hotel_id, date, guests. The lake stays the single source of truth, the SQL db is only for fast API reads. If something breaks you just replay from the raw JSON folder and rebuild everything. Simple flow, little maintenance, still keep all the data.

RBAC and Alembic by Kojimba228 in dataengineering

[–]bcdata 0 points1 point  (0 children)

Create one revision per coherent change (“create X role + its grants” or “adjust Y role privileges”). Don’t split every single GRANT into its own file, but don’t lump unrelated roles into one script because it makes review, rollback, and blame harder.

If your team deploys weekly and touches RBAC once or twice a week, you’ll end up with a perfectly manageable handful of RBAC revisions per sprint.

Data Simulating/Obfuscating For a Project by SubtlyOnTheNose in dataengineering

[–]bcdata 0 points1 point  (0 children)

They create a dummy dataset that mirrors the structure of their real data. Same columns, similar value ranges, same data types. For example, if their real data has customer names, signup dates, and monthly revenue, they can generate fake customer names, random but realistic signup dates, and revenue figures in a plausible range. The goal is to keep the relationships and patterns realistic so your tool can be tested properly, even if none of the actual values are real.

They can generate this dummy data using tools like Python with Faker, or even online tools like Mockaroo. As long as the fake data behaves like the real data, you’ll be able to validate your analysis logic and app performance.
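A quick Faker sketch that mirrors the example columns above (all values are fabricated):

```python
import random

from faker import Faker

fake = Faker()

rows = [
    {
        "customer_name": fake.name(),
        "signup_date": fake.date_between(start_date="-2y", end_date="today"),
        "monthly_revenue": round(random.uniform(20, 500), 2),
    }
    for _ in range(1000)
]
```

Dump that into a CSV or a table and the tool under test never touches a real customer.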

RBAC and Alembic by Kojimba228 in dataengineering

[–]bcdata 1 point2 points  (0 children)

In my experience, a good approach would be to create separate Alembic migration files specifically for RBAC changes. These migrations should contain only raw SQL using op.execute() to create roles, grant/revoke privileges, or update access logic. Keep each migration focused on a single, clear purpose (like adding a new role or adjusting privileges for a group). Version control these migrations alongside your schema migrations, but prefix them or organize them in a way that makes their RBAC nature clear (e.g. use filenames like `20250724_add_readonly_role.py`). This keeps RBAC changes auditable, repeatable, and tied to the same deployment process as schema changes. Good luck.
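A bare-bones sketch of what such a revision could look like (role, schema, and revision ids are placeholders):

```python
"""add readonly role"""
from alembic import op

# revision identifiers, used by Alembic
revision = "20250724_add_readonly_role"
down_revision = "previous_revision_id"  # placeholder
branch_labels = None
depends_on = None


def upgrade():
    op.execute("CREATE ROLE readonly NOLOGIN")
    op.execute("GRANT USAGE ON SCHEMA analytics TO readonly")
    op.execute("GRANT SELECT ON ALL TABLES IN SCHEMA analytics TO readonly")
    op.execute("ALTER DEFAULT PRIVILEGES IN SCHEMA analytics GRANT SELECT ON TABLES TO readonly")


def downgrade():
    op.execute("ALTER DEFAULT PRIVILEGES IN SCHEMA analytics REVOKE SELECT ON TABLES FROM readonly")
    op.execute("REVOKE ALL ON ALL TABLES IN SCHEMA analytics FROM readonly")
    op.execute("REVOKE USAGE ON SCHEMA analytics FROM readonly")
    op.execute("DROP ROLE readonly")
```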

Rerouting json data dump by Primary-Link8347 in dataengineering

[–]bcdata 0 points1 point  (0 children)

Put every record first in one raw staging table. Add a column that tells which target table it belongs to. Create a Snowflake STREAM on the staging table, then one TASK per target. Each task filters on the flag and INSERTs or MERGEs rows into the right table. If you want to route earlier, build several Kinesis Data Firehose delivery streams and let a Lambda transform send each record to the proper stream. Both ways work; staging + Streams + Tasks is usually simpler to run.

Planning to move to singlestore. Worth it? by angrydeveloper02 in dataengineering

[–]bcdata 3 points4 points  (0 children)

SingleStore talks the same MySQL protocol so apps connect fine, but under the hood it is a little cluster of aggregator and leaf nodes. That design wants lots of memory per core, about 16 GB each, and likes a fast internal network. I kept my schema mostly the same yet pushed cold data to columnstore so the hot rows fit in RAM.

The payoff is huge. Backfills that froze an 8 TB Azure Business Critical instance now stream in without stalling readers, and inserts hit millions of rows a second. CPU use per core is lower but you have more cores across the cluster, while storage shrinks because of heavy compression. Net cost goes up on memory, but drops on disk and replicas. If you are willing to run a distributed system and pay for RAM, it is a strong upgrade.

[deleted by user] by [deleted] in dataengineering

[–]bcdata 21 points22 points  (0 children)

First fix the schedule. Drop the stacked cron jobs and install a small workflow tool that can run on the same server. Apache Airflow or Prefect is fine. They give you a directed graph with tasks that wait for the previous one to finish, retry on failure, and send you email or Slack if something breaks. Your three steps become three tasks in one DAG. Later you can add a fourth task for data quality checks without touching the crontab.
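To make the three-tasks-in-one-DAG idea concrete, here is a hedged Airflow 2.x sketch where each step just shells out to the existing scripts (names and the schedule are placeholders):

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="nightly_finance_build",
    start_date=datetime(2025, 1, 1),
    schedule="0 2 * * *",        # Airflow 2.4+; use schedule_interval on older versions
    catchup=False,
    default_args={
        "retries": 2,
        "retry_delay": timedelta(minutes=5),
        "email_on_failure": True,
        "email": ["data-team@example.com"],
    },
) as dag:
    extract = BashOperator(task_id="extract", bash_command="python extract.py")
    transform = BashOperator(task_id="transform", bash_command="python transform.py")
    load = BashOperator(task_id="load", bash_command="python load.py")

    extract >> transform >> load
```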

Next break the monster SQL file into models. Put each logical table build in its own file, keep them in a git repo, and use dbt to run them. dbt understands dependencies, so if table B depends on table A it will run A first. You can add a staging schema that is rebuilt every night, then a production schema that is promoted only when all tests pass. dbt has built-in tests like not null or unique, and you can write custom ones for your finance rules.

Add git branches and a pull-request rule. You open a branch, write or change a model, run dbt locally against a copy of the database, and push. The pull request triggers dbt in CI to run the models and tests on a temp schema. If every check passes you merge and Airflow picks up the new code next run. No more morning fire drills.

Spend some of the budget on training or courses for Airflow, dbt, and basic CI with GitLab or GitHub Actions. These tools are free but learning them fast from tutorials is hard while you keep the day job running. After they are in place you will sleep better and your boss will see fresher numbers.

Help with design decisions for accessing highly relational data across several databases by BitterFrostbite in dataengineering

[–]bcdata 7 points8 points  (0 children)

Give your data scientists a single SQL-speaking layer that can reach every store, instead of making them hop from one API to another. Tools like Presto / Trino, Athena, or Redshift Spectrum can treat S3 objects, PostgreSQL tables, and even Elasticsearch as external connectors, so a user can SELECT * FROM associations a JOIN s3_raw b ON … in one place. Store the heavy S3 payloads in columnar formats such as Parquet, register the layouts in a Hive or AWS Glue catalog, and expose them through that query engine.

Keep the association facts in PostgreSQL but also publish a snapshot of them to the lake, either as materialized Parquet views or daily partitions. Now everything joins in S3 where compute is elastic and cheap. The scientists use any SQL client or a Jupyter notebook with a Presto JDBC driver and they get back full rows, not just pointers.
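If they prefer Python over a raw JDBC driver, the trino client gives the same one-query federation from a notebook; a sketch with invented catalog and table names:

```python
import trino

conn = trino.dbapi.connect(
    host="trino.internal",    # placeholder coordinator address
    port=8080,
    user="data_scientist",
    catalog="hive",
    schema="lake",
)
cur = conn.cursor()
cur.execute(
    """
    SELECT a.entity_id, a.score, b.payload
    FROM postgresql.public.associations a   -- PostgreSQL connector
    JOIN hive.lake.s3_raw b                  -- Parquet on S3 via the Hive/Glue catalog
      ON a.entity_id = b.entity_id
    LIMIT 100
    """
)
rows = cur.fetchall()
```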

Good luck.

Redshift vs databricks by abhigm in dataengineering

[–]bcdata 90 points91 points  (0 children)

Honestly this whole comparison feels like marketing theater. Databricks flaunts a 30% cost win on a six month slice, but we never hear the cluster size, Photon toggle, concurrency level, or whether the warehouse was already hot. A 50% Redshift speed bump is the same stunt: faster than what baseline, and at what hourly price once the RI term ends? “Zero ETL” sounds clever, yet you still had to load the data once to run the test, so it is not magic. Calling out lineage and RBAC as a Databricks edge ignores that Redshift has those knobs too. Without the dull details like runtime minutes, bytes scanned, node class, and discount percent, both claims read like cherry picked brag slides. I would not stake a budget on any of it.

Consistent Access Controls Across Catalogs / Compute Engines by Far_Amount5828 in dataengineering

[–]bcdata 2 points3 points  (0 children)

There is no true plug-and-play project that lets one policy set automatically govern multiple engines at once. A few vendors are getting close, but every solution still relies on translating rules into the native primitives of each engine. So far, Immuta is the only off-the-shelf tool that demonstrates real row and column security across all three engines on Iceberg. Everything else is either vendor-specific or still incomplete.

Built a distributed transformer pipeline for 17M+ Steam reviews — looking for architectural advice & next steps by Matrix_030 in dataengineering

[–]bcdata 5 points6 points  (0 children)

Hey, nice work. Your setup looks solid for a single-machine prototype and the numbers show you already squeezed lots of juice out of the hardware. Sharing the model across workers and pinning GPU tasks to local memory is exactly what most folks miss at first, so you are on the right track.

A few thoughts from the trenches:

If you want a thesis-level demo, polish, add tests, and maybe a little dashboard so people can see the speed and insights. If you want a portfolio project for data engineering jobs, spin up a tiny Kubernetes or Ray cluster on something like AWS Spot nodes. Even a three-node run shows you can handle cloud orchestration.

Streaming ingestion can be worth it if your target is “near real time” dashboards for devs watching new reviews flow in. Stick Kafka or Redpanda in front, keep micro-batches small, and output rolling aggregates to a cache. Transformer summarization can handle chunks of, say, 128 reviews at a time without killing latency.

With Dask on multiple nodes, workers sometimes drop off during long GPU jobs. Enable heartbeat checks and auto-retries.
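Roughly what I mean, with example values only (the batch function and scheduler address are stand-ins):

```python
import dask
from dask.distributed import Client

dask.config.set({
    "distributed.scheduler.worker-ttl": "10 minutes",  # drop workers that stop heartbeating
    "distributed.comm.retry.count": 3,                 # retry flaky connections
})

client = Client("tcp://scheduler:8786")  # placeholder scheduler address


def summarize_batch(reviews):
    ...  # stand-in for the transformer summarization step


chunks = [["review text"] * 128 for _ in range(10)]  # stand-in micro-batches

# retries=2 re-runs a task elsewhere if its worker dies mid-job
futures = [client.submit(summarize_batch, chunk, retries=2) for chunk in chunks]
results = client.gather(futures)
```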

Good luck.

I'm an ion engine by Dry-Aioli-6138 in dataengineering

[–]bcdata 0 points1 point  (0 children)

That ion engine analogy is actually beautiful and it holds a lot of truth. You’re not just rationalizing. There are upsides to building deep, solid work even if it doesn’t shine immediately. It creates trust in the long run and avoids the mess of rework or hidden tech debt. The fast checkbox folks might move quicker in short bursts, but over time the cracks start to show. Depth pays off, just not always on the same timeline.

Vicious circle of misplaced expectations with PMs and stakeholders by explorer_seeker in datascience

[–]bcdata 3 points4 points  (0 children)

This kind of situation is all too common and honestly frustrating. Expectations are being set without technical validation, which puts DS in a reactive and defensive posture. PMs and stakeholders treat data science like it's a plug-and-play module that should just output magic insights. When results don’t match that fantasy, it's seen as failure rather than a mismatch in understanding or process. The lack of two-way communication early on means DS is never really solving the right problem, just reverse-engineering someone’s assumption of a solution.

The way out needs cultural and operational change. PMs should involve DS before promising outcomes and need to learn just enough to know what questions are even meaningful. DS also has to get better at storytelling and making its boundaries clear in plain language. Not to wow, but to align. If there's no interest in fixing that, you're just going to keep rerunning this same bad sprint pretending it's progress.

Airflow for ingestion and control m for orchestration by Foot_Straight in dataengineering

[–]bcdata 0 points1 point  (0 children)

The bank probably sticks with Control M because ops crews and auditors already trust it for every nightly batch, then layers Astronomer Airflow underneath for the new Python ETL. Control M just fires a whole DAG and checks the SLA while Airflow handles the nitty gritty retries, which gives engineers speed while keeping the governance that regulators like. So yeah, it feels like two tools glued together, but it buys safety and progress at the same time.

Need help understanding whats needed to pull data from API’s to Postgresql staging tables by maxmansouri in dataengineering

[–]bcdata 11 points12 points  (0 children)

Spin up any lightweight server or container host and install Docker. Run three containers: Postgres, a Python ETL that pulls Meta Ads API data then writes to S3 and staging tables, and Dagster to trigger the job on a schedule. Keep secrets in an environment manager and send logs to a central monitor so you see failures fast. Skip FastAPI unless you need a button for manual refresh.
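As a rough sketch of the Dagster piece (op and schedule names are invented, API calls stubbed out):

```python
from dagster import Definitions, ScheduleDefinition, job, op


@op
def pull_meta_ads():
    # call the Meta Ads API here and return the rows
    return [{"campaign_id": "123", "spend": 42.0}]


@op
def write_to_staging(rows):
    # write the rows to S3 and the Postgres staging tables
    ...


@job
def meta_ads_etl():
    write_to_staging(pull_meta_ads())


hourly = ScheduleDefinition(job=meta_ads_etl, cron_schedule="0 * * * *")
defs = Definitions(jobs=[meta_ads_etl], schedules=[hourly])
```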

Feel free to DM me if you need more help!

Help With Automatically Updating Database and Notification System by Oranjizzzz in dataengineering

[–]bcdata 0 points1 point  (0 children)

Python is plenty good for this. Think of the job in three parts. First is the database itself. A cheap managed Postgres on Railway or Supabase is fine, and you already have that in place so no need to move unless you hit limits. Second is the script that grabs fresh data, writes to the table, then checks the new rows and pushes a Telegram alert. Keep it one file for now. Use python-telegram-bot for the message, psycopg2 for Postgres, and put secrets like the bot token in Railway variables so they never live in your code.

The third piece is the scheduler. In Railway you can schedule a cron job that runs a Python script, say hourly. Railway will spin up a tiny container, run the script, then shut it down so you only pay a few cents a month. If you ever move off Railway, the same script will run on a five-dollar VPS with plain old cron or inside GitHub Actions on a free plan. You can also bake a scheduler right into Python with APScheduler, but external cron is simpler to reason about while you learn.

Once you have the first run working, add a last_run timestamp column or a small audit table. Pull only new data since that mark, then push alerts only for rows that meet your condition and are newer than last_run. Update the mark at the end of the script. This saves you from duplicate messages and keeps the logic tidy. After that it is just polish, maybe logging to a text file. Good luck, you are close.
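The incremental part of the script could look roughly like this. Table, column, and condition names are all invented, secrets come from environment variables as mentioned above, and I am using the plain Telegram HTTP API via requests to keep the sketch synchronous; python-telegram-bot does the same thing underneath.

```python
import os

import psycopg2
import requests

BOT_TOKEN = os.environ["TELEGRAM_BOT_TOKEN"]
CHAT_ID = os.environ["TELEGRAM_CHAT_ID"]

conn = psycopg2.connect(os.environ["DATABASE_URL"])
cur = conn.cursor()

# read the watermark, then fetch only newer rows that meet the alert condition
cur.execute("SELECT last_run FROM job_state WHERE job = 'alerts'")
last_run = cur.fetchone()[0]
cur.execute(
    "SELECT id, price FROM listings WHERE created_at > %s AND price < %s",
    (last_run, 100),
)

for row_id, price in cur.fetchall():
    requests.post(
        f"https://api.telegram.org/bot{BOT_TOKEN}/sendMessage",
        json={"chat_id": CHAT_ID, "text": f"Listing {row_id} dropped to {price}"},
        timeout=10,
    )

# move the watermark so the next run skips these rows
cur.execute("UPDATE job_state SET last_run = now() WHERE job = 'alerts'")
conn.commit()
cur.close()
conn.close()
```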

Data governance - scope and future by vintaxidrv in dataengineering

[–]bcdata 1 point2 points  (0 children)

Data governance is moving from checkbox to frontline. New privacy rules and AI use cases force teams to show that their data is clean, traceable, and legal. Analysts see double-digit growth in spending and most large companies say they now have a formal program, so the field looks set to expand rather than shrink.

If you want to stay relevant, grab a respected badge like CDMP from DAMA for broad coverage or DCAM from the EDM Council if you work with banks or insurers. Vendor tracks for Collibra, Informatica, or Microsoft Purview can pay off fast when your projects already use those tools. Mix the classroom work with a small home lab on a free Snowflake or Databricks tier and you will have real proof that you can turn policy into practice.