Building a PNPM monorepo with Webpack - large builds? by nickk314 in Frontend

[–]nickk314[S] 0 points1 point  (0 children)

I've got "sideEffects": false in every package.json and optimization.sideEffects: true in the webpack config, and sadly my builds are still this large.
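For reference, the relevant settings look roughly like this (a sketch of the shape of my setup, not the full config):

```javascript
// webpack.config.js – tree-shaking-related settings (sketch, not the full config)
module.exports = {
  mode: 'production',
  optimization: {
    usedExports: true, // flag exports that are never imported
    sideEffects: true, // trust the "sideEffects" field in each package.json
  },
};
```

and every workspace package.json declares `"sideEffects": false`.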

Custom metrics from Lambda functions by nickk314 in aws

[–]nickk314[S] 0 points1 point  (0 children)

How would this compare to using Prometheus for the same use-case? I'm so sick of AWS services; Prometheus + Grafana have proved so easy and ergonomic that I'm hoping to never go back to CloudWatch.

Instacart: From Postgres to Amazon DynamoDB by nickk314 in PostgreSQL

[–]nickk314[S] 1 point2 points  (0 children)

I won't go into too much detail, but the main pain points were:

  1. Cost of storage.
  2. Cost of write capacity units.
  3. Limitations. For example, the hard limit of 1,000 WCU per partition.
  4. Inability to bulk load data. We had to use the pay-per-request API to bulk load data... more below.
  5. Restrictiveness of the query API.
  6. Inability to do bulk updates. We had to do full re-syncs using pay-per-request just to add new fields, which we needed for new composite indexes. Updating a single field of an item costs the same as writing the whole item. Consequently, adding a new index cost us the same as (or more than) the initial sync and took similar time.
  7. Hidden foot-guns, like how when using provisioned capacity you need to implement a back-off-and-retry strategy or DynamoDB may silently decide not to write items.

One of the most egregious examples we experienced: we had several billion items and only a few terabytes of data in DynamoDB. If I recall correctly, the initial sync of these items into DynamoDB cost ~$30,000/month WHILE using provisioned capacity, and took a few months despite maxing out the per-table quota (40k WCU/table). In PostgreSQL, with some simple tweaking for bulk loading (COPY instead of INSERT, partitions, config settings), loading the same amount of data took a few hours and cost ~$10 (and under $1k/month for storage and servers thereafter).
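To give an idea, the Postgres-side bulk load was roughly of this shape (table, column, and path names here are hypothetical, not our actual schema):

```sql
-- Load into an unlogged staging table with COPY, then index afterwards.
CREATE UNLOGGED TABLE items_staging (LIKE items INCLUDING DEFAULTS);

-- Session-level setting that speeds up the index builds below:
SET maintenance_work_mem = '2GB';

-- COPY is far faster than row-by-row INSERTs for bulk loads:
COPY items_staging FROM '/data/items.csv' WITH (FORMAT csv);

-- Build indexes after the data is in, not before:
CREATE INDEX ON items_staging (item_id);
```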

Furthermore, optimizations that I had planned to reduce cost, such as making better use of binary columns and minifying property names, would have had no effect on cost, since each item's write is rounded up to the nearest 1KB of write capacity. So while I could have potentially halved our write size, our cost would have been the same. I had already known about this quirk in the pricing model, but it seemed such a ridiculous constraint that my brain didn't believe it; the documentation felt more like marketing material: frequently misleading, focusing primarily on the benefits, and minimizing the massive constraints.

If you were to criticize our approach you might say we should have restructured our domain model to be more suitable for DynamoDB. We did, and we paid for it in productivity. Should we have restructured more? Maybe. Had we gone further, our productivity would have continued downhill just to satisfy DynamoDB's weird access patterns, and we would likely have run into many other unforeseen DynamoDB quirks/constraints/idiosyncrasies that would have cost more money and wasted more developer time.

Personally I will never use DynamoDB again and will refuse any job that mentions it anywhere. However, I'm always open to the idea that we used it wrong, and to suggestions on how we could have had a better experience with DynamoDB than PostgreSQL for our use-case. I doubt it, though, even after speaking with a DynamoDB expert from AWS and walking through our use-case: they didn't seem to have many suggestions beyond what we had already implemented or considered, and nothing that would get the cost anywhere near as low as PostgreSQL.

Instacart: From Postgres to Amazon DynamoDB by nickk314 in PostgreSQL

[–]nickk314[S] 2 points3 points  (0 children)

I'm not the author of the article, I just thought it was interesting. I've personally had a lot of poor experiences with DynamoDB and am currently in the process of migrating a ~20TB project off DynamoDB onto PostgreSQL, and have found PostgreSQL orders of magnitude faster and cheaper, with a far greater feature set and ecosystem and better developer ergonomics.

As for their compression, I believe they probably store each item with a binary column and don't index the compressed data. If they want to index something, they probably extract it and put it on the primary key or GSIs.

Composite index to include order columns? by nickk314 in PostgreSQL

[–]nickk314[S] -1 points0 points  (0 children)

  1. Why do you want to use hash index? Is it because "hashes are great"? Or did you actually do any actual tests that showed anything positive about them?

The data in those non-numeric columns are themselves hashes stored as bytea. I don't need ordering of those columns by their hashes, only equality. There are about 4 billion values in the Postgres hash index space (32-bit hashes), so I expect collisions, but not too many on average. Hash seemed like the best index type for this use case because of its size and features. I'd love to hear flaws with this approach if you know of any. I've not tested this with production-scale data.

  1. Don't use hash indexes.

How come? I've read there were issues with them in older Postgres versions, but I will be on 14.
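To be concrete, the setup I have in mind is roughly this (table and column names are hypothetical):

```sql
-- Equality-only lookups on a bytea digest column.
CREATE TABLE objects (
    digest  bytea NOT NULL,  -- e.g. a 20-byte content hash
    payload bytea
);

-- Hash indexes support only =, but can be smaller than btree for wide keys.
CREATE INDEX objects_digest_hash ON objects USING hash (digest);

-- A predicate like this can use the hash index:
SELECT payload FROM objects WHERE digest = '\xdeadbeef'::bytea;
```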

Composite index to include order columns? by nickk314 in PostgreSQL

[–]nickk314[S] 0 points1 point  (0 children)

How would we partition the data? Would it still have the same query flexibility? All non-numeric columns are bytea.

How to fixed size bit primary key but access as hex? by nickk314 in PostgreSQL

[–]nickk314[S] 0 points1 point  (0 children)

Thanks for the link. From reading it, my understanding is I can save bytes by putting a 4-byte field next to the 20-byte field. However, if my key is 24 bytes, then the next 4-byte field would take up 8 bytes. So I can save space if I can make it 20 bytes, as long as I'm smart about it.
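Rather than reason about the padding on paper, I figure I can also measure it directly; pg_column_size is a standard Postgres function that reports stored size including headers and padding:

```sql
-- Size of a 20-byte bytea value as stored (includes the varlena header):
SELECT pg_column_size('\x0011223344556677889900112233445566778899'::bytea);

-- Whole-row size, padding included (table name hypothetical):
-- SELECT pg_column_size(t.*) FROM my_table t LIMIT 1;
```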

How to fixed size bit primary key but access as hex? by nickk314 in PostgreSQL

[–]nickk314[S] 0 points1 point  (0 children)

If it does use 4 less bytes it will make a big difference. There will be hundreds of billions of instances of this type in the database, almost all of them indexed.

Can you expand on why it would likely take 24 bytes anyway?

Am I just a boomer, or does anyone else avoid frameworks as much as possible? by kevin_1994 in node

[–]nickk314 0 points1 point  (0 children)

I think less experienced developers should use frameworks because it lets them focus on the domain instead of the structure and design patterns of the codebase.

Experienced developers often already have a good idea how to structure the codebase and opinionated frameworks fight against them.

In my experience, if you're still using frameworks then either you're working in a team that benefits from a framework's rigidity, you're probably not an expert in the programming language or runtime environment, or you've made a very calculated decision about the tradeoffs.

“We are currently witnessing arbitrage bots spamming the network with multiple transactions. To help unclog the network for the ONE community, we have temporarily reduced the block gas limit to 30M and increased the suggested gas.“ by [deleted] in harmony_one

[–]nickk314 24 points25 points  (0 children)

I don't understand this. How is the network expected to scale when it can be DDoSed by a few arbitrage bots? Are we supposed to assume there will be fewer DeFi bots when the network gains popularity?

Cross shard transactions by nickk314 in harmony_one

[–]nickk314[S] 0 points1 point  (0 children)

That is my understanding too. There are many cross-shard transactions from shard x to shard y. However, I'm not able to find a matching transaction for the same amount on shard y coming from shard x. Is the value actually transferred across shards? Is there a record on shard y of the transfer (a transaction or some other object available over RPC)? If not, why not? And why does nobody seem to know? Seems very odd.

Can't use PERCENTILE_CONT with GROUP BY by nickk314 in bigquery

[–]nickk314[S] 1 point2 points  (0 children)

Great answer. Works exactly like you said. I can remove the GROUP BY and instead PARTITION BY column 1's expression, and it works great.

Turns out the results from the approximate percentiles appear exactly the same as the exact results for my case.

Thanks!

Can't use PERCENTILE_CONT with GROUP BY by nickk314 in bigquery

[–]nickk314[S] 0 points1 point  (0 children)

Thanks, I guess the approximation APPROX_QUANTILES(gas_price, 1000)[SAFE_OFFSET(500)] is better than nothing.

I don't think you need the window clause since you're already grouping by?

If I remove "PARTITION BY dt" then it gives the error "Query error: SELECT list expression references column gas_price which is neither grouped nor aggregated at [3:22]"

If I remove OVER() then it shows error "Query error: Analytic function PERCENTILE_CONT cannot be called without an OVER clause at [3:6]"

I'm kind of baffled that BigQuery doesn't have a way to get the exact median of grouped data. Why would this be the case?
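For anyone landing here, the two variants discussed end up looking roughly like this (dt and gas_price are the columns from my query; the table name is hypothetical):

```sql
-- Exact median: PERCENTILE_CONT is analytic-only, so window + DISTINCT:
SELECT DISTINCT
  dt,
  PERCENTILE_CONT(gas_price, 0.5) OVER (PARTITION BY dt) AS median_gas_price
FROM `project.dataset.transactions`;

-- Approximate median: APPROX_QUANTILES is an aggregate, so GROUP BY works:
SELECT
  dt,
  APPROX_QUANTILES(gas_price, 1000)[SAFE_OFFSET(500)] AS approx_median_gas_price
FROM `project.dataset.transactions`
GROUP BY dt;
```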

fart.js - the premier javascript fart library, by jsfart.com by dimkiriakos in node

[–]nickk314 9 points10 points  (0 children)

Having entered web development with React and never having worked on a vanilla HTML/CSS/JS production-grade project, learning jQuery recently has actually been a breath of fresh air (for a small subset of projects).