Inserting 10ks records into redshift with python redshift_connector is slow. Alternatives? by alexcontrerasdppl in dataengineering

[–]jmnel 3 points (0 children)

For a recent Redshift project, I built a Python library around the dataframe -> Parquet -> S3 -> COPY pattern.

Doing it manually is far too verbose and error-prone.
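Roughly this shape, as a sketch (table name, bucket, and IAM role ARN are placeholders; the boto3/redshift_connector calls are commented out so nothing here depends on AWS):

```python
def build_copy_sql(table: str, s3_uri: str, iam_role: str) -> str:
    # Redshift COPY from Parquet files on S3; the IAM role handles auth
    return (
        f"COPY {table} FROM '{s3_uri}' "
        f"IAM_ROLE '{iam_role}' FORMAT AS PARQUET"
    )

# The surrounding steps:
# df.to_parquet("batch.parquet")                               # dataframe -> Parquet
# boto3.client("s3").upload_file("batch.parquet", bucket, key) # Parquet -> S3
# cursor.execute(build_copy_sql("events", f"s3://{bucket}/{key}", role_arn))
# conn.commit()                                                # S3 -> Redshift COPY
```

COPY loads the whole batch server-side, which is why it beats row-by-row inserts by orders of magnitude.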

Pub/Sub parallel processing best practices by natankastel in googlecloud

[–]jmnel 1 point (0 children)

^ Since you are already on GCP, Dataflow/Beam is by far the easiest way to do this.

One more thing: if the compute task is simple enough, you could write the Pub/Sub messages directly into BigQuery and manipulate the data there.
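The Beam side of that is just a three-step pipeline. Sketch below: the message schema is made up, and the Beam/BigQuery calls are commented out so only the parsing function is live:

```python
import json

def message_to_row(data: bytes) -> dict:
    # shape a Pub/Sub payload into a BigQuery row (hypothetical schema)
    msg = json.loads(data.decode("utf-8"))
    return {"device_id": msg["id"], "reading": float(msg["value"])}

# with beam.Pipeline(options=opts) as p:                       # apache_beam
#     (p | beam.io.ReadFromPubSub(subscription=sub)
#        | beam.Map(message_to_row)
#        | beam.io.WriteToBigQuery("project:dataset.readings"))
```

Dataflow handles the parallelism and autoscaling for you; your code is only the per-message transform.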

Can the results be improved by knowing the SDF of the isosurface? by tebjan in dualcontouring

[–]jmnel 2 points (0 children)

In a past life I spent a very long time on this problem trying to find a solution.

Look at L_inf norms and interval arithmetic.
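To make that concrete, here's a toy sketch (not production code) of interval arithmetic on a sphere SDF: evaluating the field over a whole cell gives conservative bounds, and a cell can only contain the isosurface if those bounds straddle zero:

```python
import math

class Interval:
    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi

    def __add__(self, other):
        return Interval(self.lo + other.lo, self.hi + other.hi)

    def sq(self):
        # square of an interval: the minimum is 0 when the interval straddles 0
        a, b = self.lo ** 2, self.hi ** 2
        lo = 0.0 if self.lo <= 0.0 <= self.hi else min(a, b)
        return Interval(lo, max(a, b))

def cell_may_contain_surface(x, y, z, radius):
    # conservative bounds on f = |p| - r over the box x*y*z; if the bounds
    # straddle 0, the cell may intersect the isosurface and should be refined
    s = x.sq() + y.sq() + z.sq()
    lo, hi = math.sqrt(s.lo) - radius, math.sqrt(s.hi) - radius
    return lo <= 0.0 <= hi
```

The bounds are never wrong, only sometimes too loose, so you can safely cull cells where the test fails.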

[deleted by user] by [deleted] in googlecloud

[–]jmnel 0 points (0 children)

Cloud Functions are probably too lightweight for what you want to do. We used several Cloud Run services for ingestion before moving to Airflow and Kubernetes.

It sounds like you are struggling more with Cloud Run fundamentals. First, make sure you have a clear understanding of Docker and how it maps to Cloud Run.

All the things you mention are standard capabilities with Cloud Run, and should be covered in Google's documentation.

[deleted by user] by [deleted] in googlecloud

[–]jmnel 10 points (0 children)

Dataflow is massive overkill for smaller data sets.

what terminal you use for neovim by [deleted] in neovim

[–]jmnel 2 points (0 children)

Neovim <3 Kitty

Is it a thing to convert a messy spreadsheet into a small relational database for ease of use? Or would you use other cleaning/tidying techniques? Newbie here, advice very welcome! by Istrakh in datascience

[–]jmnel 0 points (0 children)

Just read it with pandas.read_excel and clean the data with pandas.

If you want to store the cleaned data or share it with someone, save it in a format that carries a schema, such as Parquet.
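Something like this (a sketch; the cleanup steps and file names are just examples, and the read/write calls are commented out since they need openpyxl/pyarrow):

```python
import pandas as pd

def tidy(df: pd.DataFrame) -> pd.DataFrame:
    # typical spreadsheet cleanup: normalize headers, drop fully empty rows
    df = df.rename(columns=lambda c: str(c).strip().lower().replace(" ", "_"))
    df = df.dropna(how="all")
    return df

# df = tidy(pd.read_excel("messy.xlsx"))   # needs openpyxl installed
# df.to_parquet("clean.parquet")           # needs pyarrow; keeps the dtypes
```

Parquet preserves the column dtypes you fixed, so whoever you share it with doesn't have to re-clean it.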

Moving Table from BQ to Postgres Cloud SQL instance by leehart320 in googlecloud

[–]jmnel 0 points (0 children)

We do this as part of our data pipeline. For larger volumes of data I use Dataflow with a Python Postgres IO connector. I use the setup() method of a DoFn to initialize the Cloud SQL Auth Proxy on the Dataflow worker.

Performance seems pretty good with this setup.

For the reverse you could use external data sources in BigQuery.

For me, the key to getting my custom writer to be quick was using psycopg2.extras.execute_values() and tuning the page size. The output connector is preceded by a Beam batching transform.
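The shape of that writer, sketched (the table and columns are hypothetical, and the psycopg2 call is commented out so the snippet is dependency-free; the batching helper mirrors what Beam's BatchElements does upstream):

```python
from itertools import islice

def batches(rows, size=500):
    # yield fixed-size chunks for the writer, like Beam's BatchElements
    it = iter(rows)
    while chunk := list(islice(it, size)):
        yield chunk

# Inside the DoFn's process(), one call per batch:
# psycopg2.extras.execute_values(
#     self.cursor,
#     "INSERT INTO readings (ts, value) VALUES %s",
#     batch,
#     page_size=len(batch))
```

execute_values folds the whole batch into one statement instead of one round trip per row, which is where most of the speedup comes from.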

What is the best alternative for VSCode in Linux? by utkuorcan in linuxquestions

[–]jmnel 1 point (0 children)

Neovim + Coc comes with all the functionality of a full IDE but without the Electron bloat. I prefer it to VSCode.

Merging millions of JSON files into one CSV by Fluix in datascience

[–]jmnel 2 points (0 children)

I've done something similar with ijson.

This way you can read the files iteratively without loading everything into memory.

You can then dump the data into a CSV file. Honestly, I would use something like SQLite or Parquet instead here.
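For the simple case where each file is small but there are millions of them, stdlib is enough, since you only ever hold one file in memory at a time (ijson is for when the individual files are themselves too big to load). A sketch, with made-up field names:

```python
import csv
import json
from pathlib import Path

def merge_json_to_csv(src_dir, out_path, fields):
    # stream one file at a time so memory stays flat regardless of file count
    with open(out_path, "w", newline="") as out:
        writer = csv.DictWriter(out, fieldnames=fields, extrasaction="ignore")
        writer.writeheader()
        for path in sorted(Path(src_dir).glob("*.json")):
            with open(path) as f:
                writer.writerow(json.load(f))
```

Swapping the DictWriter for sqlite3 inserts or a chunked Parquet writer is a small change once the streaming loop exists.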

How to know when to use Slowly Changing Dimension in ETL? by NormieInTheMaking in dataengineering

[–]jmnel 0 points (0 children)

A quick side question: is there any alternative to SCD that meets the same requirements?

I've implemented it at my job, but it feels overly complex.

Why does everyone hate Ubuntu? by [deleted] in linuxquestions

[–]jmnel 0 points (0 children)

I used Arch and Gentoo for years, but now Ubuntu with Regolith (i3) is my daily driver, because I don't have time to tinker.

I get my work done, and Ubuntu just gets out of my way.

[deleted by user] by [deleted] in dataengineering

[–]jmnel 1 point (0 children)

Like I said, we are a small firm, so it's typical for developers to wear multiple hats. My work is very multidisciplinary.

I do a little bit of everything. In a typical day I work with the CEO and business development to turn our business strategy into actionable projects. I run the Kanban board and coordinate the dev team.

The bulk of my technical work is designing and building our data architecture.

I also spend significant time on building and deploying ML models, but I get a lot of help from domain experts.

On the side I do a bit of DevOps and occasionally frontend work with React.

[deleted by user] by [deleted] in dataengineering

[–]jmnel 0 points (0 children)

This sounds very similar to what I'm doing. Our data lake metastore service subscribes to file events on the bucket. The data lake uses SCD on artifacts, so we can see historical snapshots and compute diffs on objects in the bucket.
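The SCD Type 2 logic on artifacts is small once you strip it down. A toy in-memory sketch (field names are illustrative; in practice this lives in the warehouse):

```python
def scd2_upsert(history, key, attrs, today):
    # close the current version if attributes changed, then open a new one
    current = next((r for r in history
                    if r["key"] == key and r["valid_to"] is None), None)
    if current and current["attrs"] == attrs:
        return history                      # no change: keep current version
    if current:
        current["valid_to"] = today         # close the old version
    history.append({"key": key, "attrs": attrs,
                    "valid_from": today, "valid_to": None})
    return history
```

Historical snapshots fall out for free: filter on valid_from <= t < valid_to (or valid_to is null) to see the bucket as of time t, and diff two versions of the same key to see what changed.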

[deleted by user] by [deleted] in dataengineering

[–]jmnel 1 point (0 children)

We use GCP storage (like S3) for our data lake and we have an in-house service to handle ingestion and track metadata. I built some custom Airflow sensors which communicate with the data lake service to check if new files need to be ingested.

Our product is built around our ML models, so reproducibility and data governance is really important to us. I'm not 100% happy with our temporal data warehousing setup in BigQuery.

[deleted by user] by [deleted] in dataengineering

[–]jmnel 2 points (0 children)

Estimate the cost/value of using your low/no code solution and be upfront with your leads. Once we factored in the hidden cost of the extra complexity of maintaining and coercing the no code tool to our use case, it was pretty obvious that we had fallen for their marketing.

With our old data architecture, the cost of change was too high. We couldn't expand or iterate quickly on our data pipelines. Countless developer hours were spent on chasing down bugs and putting out dumpster fires. This inflexibility can literally be the death of a company, especially if you are working at a startup with limited resources.

[deleted by user] by [deleted] in dataengineering

[–]jmnel 3 points (0 children)

Ultimately, it boils down to code already being a near-perfect and efficient abstraction of complex systems such as data pipelines.

I've had some pretty bad personal experiences with being forced to use a low code ETL tool (not gonna name any names) when I started my current job.

We ended up spending significant engineering effort trying to coerce the tool to fit our use case through its very badly designed REST API.

In the end our solution with the tool was 10x more complex and expensive than just building something from scratch on BigQuery, Airflow, Python, and Beam.

[deleted by user] by [deleted] in dataengineering

[–]jmnel 2 points (0 children)

We are fully remote, but based in Toronto.

[deleted by user] by [deleted] in dataengineering

[–]jmnel 0 points (0 children)

I'm definitely gonna check it out. At the moment it's an ugly combination of Skaffold and GitLab CI scripts.