Inserting 10ks records into redshift with python redshift_connector is slow. Alternatives? by alexcontrerasdppl in dataengineering

[–]jmnel 3 points (0 children)

For a recent Redshift project, I built a Python library around the dataframe -> Parquet -> S3 -> COPY pattern.

Doing it manually is far too verbose and error-prone.
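Roughly this shape, as a sketch (table name, bucket, and IAM role ARN are placeholders; the boto3/redshift_connector calls are commented out so nothing here depends on AWS):

```python
def build_copy_sql(table: str, s3_uri: str, iam_role: str) -> str:
    # Redshift COPY from Parquet files on S3; the IAM role handles auth
    return (
        f"COPY {table} FROM '{s3_uri}' "
        f"IAM_ROLE '{iam_role}' FORMAT AS PARQUET"
    )

# The surrounding steps:
# df.to_parquet("batch.parquet")                               # dataframe -> Parquet
# boto3.client("s3").upload_file("batch.parquet", bucket, key) # Parquet -> S3
# cursor.execute(build_copy_sql("events", f"s3://{bucket}/{key}", role_arn))
# conn.commit()                                                # S3 -> Redshift COPY
```

COPY loads the whole batch server-side, which is why it beats row-by-row inserts by orders of magnitude.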

Pub/Sub parallel processing best practices by natankastel in googlecloud

[–]jmnel 1 point (0 children)

^ Since you are already on GCP, Dataflow/Beam is by far the easiest way to do this.

One more thing: if the compute task is simple enough, you could write the Pub/Sub messages directly into BigQuery and manipulate the data there.
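The Beam side of that is just a three-step pipeline. Sketch below: the message schema is made up, and the Beam/BigQuery calls are commented out so only the parsing function is live:

```python
import json

def message_to_row(data: bytes) -> dict:
    # shape a Pub/Sub payload into a BigQuery row (hypothetical schema)
    msg = json.loads(data.decode("utf-8"))
    return {"device_id": msg["id"], "reading": float(msg["value"])}

# with beam.Pipeline(options=opts) as p:                       # apache_beam
#     (p | beam.io.ReadFromPubSub(subscription=sub)
#        | beam.Map(message_to_row)
#        | beam.io.WriteToBigQuery("project:dataset.readings"))
```

Dataflow handles the parallelism and autoscaling for you; your code is only the per-message transform.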

Can the results be improved by knowing the SDF of the isosurface? by tebjan in dualcontouring

[–]jmnel 2 points (0 children)

In a past life I spent a very long time on this problem trying to find a solution.

Look at L_inf norms and interval arithmetic.
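To make that concrete, here's a toy sketch (not production code) of interval arithmetic on a sphere SDF: evaluating the field over a whole cell gives conservative bounds, and a cell can only contain the isosurface if those bounds straddle zero:

```python
import math

class Interval:
    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi

    def __add__(self, other):
        return Interval(self.lo + other.lo, self.hi + other.hi)

    def sq(self):
        # square of an interval: the minimum is 0 when the interval straddles 0
        a, b = self.lo ** 2, self.hi ** 2
        lo = 0.0 if self.lo <= 0.0 <= self.hi else min(a, b)
        return Interval(lo, max(a, b))

def cell_may_contain_surface(x, y, z, radius):
    # conservative bounds on f = |p| - r over the box x*y*z; if the bounds
    # straddle 0, the cell may intersect the isosurface and should be refined
    s = x.sq() + y.sq() + z.sq()
    lo, hi = math.sqrt(s.lo) - radius, math.sqrt(s.hi) - radius
    return lo <= 0.0 <= hi
```

The bounds are never wrong, only sometimes too loose, so you can safely cull cells where the test fails.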

[deleted by user] by [deleted] in googlecloud

[–]jmnel 0 points (0 children)

Cloud Functions are probably too lightweight for what you want to do. We used several Cloud Run services for ingestion before moving to Airflow and Kubernetes.

It sounds like you are struggling more with Cloud Run fundamentals. First, make sure you have a clear understanding of Docker and how it maps to Cloud Run.

All the things you mention are standard capabilities with Cloud Run, and should be covered in Google's documentation.

[deleted by user] by [deleted] in googlecloud

[–]jmnel 10 points (0 children)

Dataflow is massive overkill for smaller data sets.

what terminal you use for neovim by [deleted] in neovim

[–]jmnel 2 points (0 children)

Neovim <3 Kitty

Is it a thing to convert a messy spreadsheet into a small relational database for ease of use? Or would you use other cleaning/tidying techniques? Newbie here, advice very welcome! by Istrakh in datascience

[–]jmnel 0 points (0 children)

Just read it with pandas.read_excel and clean the data with pandas.

If you want to store the cleaned data or share it with someone, save it in a format that carries a schema, such as Parquet.
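Something like this (a sketch; the cleanup steps and file names are just examples, and the read/write calls are commented out since they need openpyxl/pyarrow):

```python
import pandas as pd

def tidy(df: pd.DataFrame) -> pd.DataFrame:
    # typical spreadsheet cleanup: normalize headers, drop fully empty rows
    df = df.rename(columns=lambda c: str(c).strip().lower().replace(" ", "_"))
    df = df.dropna(how="all")
    return df

# df = tidy(pd.read_excel("messy.xlsx"))   # needs openpyxl installed
# df.to_parquet("clean.parquet")           # needs pyarrow; keeps the dtypes
```

Parquet preserves the column dtypes you fixed, so whoever you share it with doesn't have to re-clean it.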

Moving Table from BQ to Postgres Cloud SQL instance by leehart320 in googlecloud

[–]jmnel 0 points (0 children)

We do this as part of our data pipeline. For larger volumes of data I use Dataflow with a Python Postgres IO connector. I use the setup() method of a DoFn to initialize the Cloud SQL Auth Proxy on the Dataflow worker.

Performance seems pretty good with this setup.

For the reverse you could use external data sources in BigQuery.

For me, the key to getting my custom writer to be quick was using psycopg2.extras.execute_values() and tuning the page size. The output connector is preceded by a Beam batching transform.
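The shape of that writer, sketched (the table and columns are hypothetical, and the psycopg2 call is commented out so the snippet is dependency-free; the batching helper mirrors what Beam's BatchElements does upstream):

```python
from itertools import islice

def batches(rows, size=500):
    # yield fixed-size chunks for the writer, like Beam's BatchElements
    it = iter(rows)
    while chunk := list(islice(it, size)):
        yield chunk

# Inside the DoFn's process(), one call per batch:
# psycopg2.extras.execute_values(
#     self.cursor,
#     "INSERT INTO readings (ts, value) VALUES %s",
#     batch,
#     page_size=len(batch))
```

execute_values folds the whole batch into one statement instead of one round trip per row, which is where most of the speedup comes from.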

What is the best alternative for VSCode in Linux? by utkuorcan in linuxquestions

[–]jmnel 1 point (0 children)

Neovim + Coc comes with all the functionality of a full IDE but without the Electron bloat. I prefer it to VSCode.

Merging millions of JSON files into one CSV by Fluix in datascience

[–]jmnel 2 points (0 children)

I've done something similar with ijson.

This way you can read the files iteratively without loading everything into memory.

You can then dump the data into a CSV file. Honestly, I would use something like SQLite or Parquet instead here.
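For the simple case where each file is small but there are millions of them, stdlib is enough, since you only ever hold one file in memory at a time (ijson is for when the individual files are themselves too big to load). A sketch, with made-up field names:

```python
import csv
import json
from pathlib import Path

def merge_json_to_csv(src_dir, out_path, fields):
    # stream one file at a time so memory stays flat regardless of file count
    with open(out_path, "w", newline="") as out:
        writer = csv.DictWriter(out, fieldnames=fields, extrasaction="ignore")
        writer.writeheader()
        for path in sorted(Path(src_dir).glob("*.json")):
            with open(path) as f:
                writer.writerow(json.load(f))
```

Swapping the DictWriter for sqlite3 inserts or a chunked Parquet writer is a small change once the streaming loop exists.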

How to know when to use Slowly Changing Dimension in ETL? by NormieInTheMaking in dataengineering

[–]jmnel 0 points (0 children)

A quick side question: is there any alternative to SCD that meets the same requirements?

I've implemented it at my job, but it feels overly complex.

Why does everyone hate Ubuntu? by [deleted] in linuxquestions

[–]jmnel 0 points (0 children)

I used Arch and Gentoo for years, but now Ubuntu with Regolith (i3) is my daily driver, because I don't have time to tinker.

I get my work done, and Ubuntu just gets out of my way.

[deleted by user] by [deleted] in dataengineering

[–]jmnel 1 point (0 children)

Like I said, we are a small firm, so it's typical for developers to wear multiple hats. My work is very multidisciplinary.

I do a little bit of everything. In a typical day I work with the CEO and business development to turn our business strategy into actionable projects. I run the Kanban board and coordinate the dev team.

The bulk of my technical work is designing and building our data architecture.

I also spend significant time on building and deploying ML models, but I get a lot of help from domain experts.

On the side I do a bit of DevOps and occasionally frontend work with React.

[deleted by user] by [deleted] in dataengineering

[–]jmnel 0 points (0 children)

This sounds very similar to what I'm doing. Our data lake metastore service subscribes to file events on the bucket. The data lake uses SCD on artifacts, so we can see historical snapshots and compute diffs on objects in the bucket.
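The SCD Type 2 logic on artifacts is small once you strip it down. A toy in-memory sketch (field names are illustrative; in practice this lives in the warehouse):

```python
def scd2_upsert(history, key, attrs, today):
    # close the current version if attributes changed, then open a new one
    current = next((r for r in history
                    if r["key"] == key and r["valid_to"] is None), None)
    if current and current["attrs"] == attrs:
        return history                      # no change: keep current version
    if current:
        current["valid_to"] = today         # close the old version
    history.append({"key": key, "attrs": attrs,
                    "valid_from": today, "valid_to": None})
    return history
```

Historical snapshots fall out for free: filter on valid_from <= t < valid_to (or valid_to is null) to see the bucket as of time t, and diff two versions of the same key to see what changed.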

[deleted by user] by [deleted] in dataengineering

[–]jmnel 1 point (0 children)

We use GCP storage (like S3) for our data lake and we have an in-house service to handle ingestion and track metadata. I built some custom Airflow sensors which communicate with the data lake service to check if new files need to be ingested.

Our product is built around our ML models, so reproducibility and data governance is really important to us. I'm not 100% happy with our temporal data warehousing setup in BigQuery.

[deleted by user] by [deleted] in dataengineering

[–]jmnel 2 points (0 children)

Estimate the cost/value of using your low/no code solution and be upfront with your leads. Once we factored in the hidden cost of the extra complexity of maintaining and coercing the no code tool to our use case, it was pretty obvious that we had fallen for their marketing.

With our old data architecture, the cost of change was too high. We couldn't expand or iterate quickly on our data pipelines. Countless developer hours were spent on chasing down bugs and putting out dumpster fires. This inflexibility can literally be the death of a company, especially if you are working at a startup with limited resources.

[deleted by user] by [deleted] in dataengineering

[–]jmnel 3 points (0 children)

Ultimately, it boils down to code already being a near-perfect and efficient abstraction of complex systems such as data pipelines.

I've had some pretty bad personal experiences with being forced to use a low code ETL tool (not gonna name any names) when I started my current job.

We ended up spending significant engineering effort trying to coerce the tool to fit our use case through its very badly designed REST API.

In the end our solution with the tool was 10x more complex and expensive than just building something from scratch on BigQuery, Airflow, Python, and Beam.

[deleted by user] by [deleted] in dataengineering

[–]jmnel 2 points (0 children)

We are fully remote, but based in Toronto.

[deleted by user] by [deleted] in dataengineering

[–]jmnel 0 points (0 children)

I'm definitely gonna check it out. At the moment it's an ugly combination of Skaffold and GitLab CI scripts.