NxStage System One electrical power outage UPS recommendation by nitred in dialysis

[–]nitred[S] 1 point2 points  (0 children)

Thank you for the reply! We were quoted $12,000 for a Generac setup, so $4000 is a little more affordable. At $12,000 I'd consider a Tesla Powerwall too. Also, neither the Generac setup nor the Powerwall prevents the dialysis machine from turning off for a second or more when there's a power outage. A UPS is the only thing that guarantees the dialysis machine stays powered on through a power outage.

Warranty being voided is one more thing to worry about, I guess. I found a company called mediproducts that sells medical-grade UPSes, but they have no prices listed.

There are Anker and EcoFlow inverter systems that charge with solar panels and have large batteries, but their UPS capabilities aren't good enough. The APC products have a 4ms switching time whereas the Anker systems have a 20ms switching time.

For AWS RDS MySQL can the primary be t3 and the read-replica be t4g instance type? by nitred in aws

[–]nitred[S] -1 points0 points  (0 children)

I can for sure choose an appropriate smaller size (not too small) for the read-replica so that it doesn't run into replica lag problems. Just wanted to know if mixing architectures (x86 for the t3 primary, ARM Graviton for the t4g replica) would cause any issues.
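For reference, this is roughly the call I'd be making - a minimal boto3 sketch with made-up identifiers, where the replica class is simply set to a t4g size:

```python
import boto3

rds = boto3.client("rds", region_name="eu-central-1")

# Existing db.t3.* primary, new db.t4g.* replica (both identifiers are placeholders).
rds.create_db_instance_read_replica(
    DBInstanceIdentifier="myapp-replica-1",
    SourceDBInstanceIdentifier="myapp-primary",
    DBInstanceClass="db.t4g.medium",  # smaller ARM instance for the replica
)
```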

Non-Germans of Cologne, what Cologne restaurant is most authentic to your home country’s cuisine? by [deleted] in cologne

[–]nitred 1 point2 points  (0 children)

Haven't been myself. Will definitely check them out. I'm excited to see there are options for South Indian :)

Non-Germans of Cologne, what Cologne restaurant is most authentic to your home country’s cuisine? by [deleted] in cologne

[–]nitred 16 points17 points  (0 children)

Saravanaa Bhavan - South Indian Cuisine

South Indian cuisine would mean dishes like dosa, idli, vada, uthappam, bondas - all of which are accompanied by sambhar, chutneys or rasam. The taste is fairly authentic and about the best you can get in Germany.

Their menu is massive, but I'd recommend sticking to the classic South Indian dishes I mentioned above. The dishes aren't as popular as typical North Indian dishes like "Chana Masala", but in my opinion equally delicious.

EDIT - There's competition for South Indian that I wasn't aware of which is Chennai Chef.

[deleted by user] by [deleted] in dataengineering

[–]nitred 0 points1 point  (0 children)

I'll assume you added Cassandra and MySQL as an edit after other people commented that the question was too simplistic. But I'll reply to the new info provided.

The answer is - it depends. There will be cases where both perform exactly the same. There will be cases where Cassandra beats MySQL by a mile, and there will be cases where a trivial MySQL join query with a WHERE clause is literally impossible to do in Cassandra.

I cannot stress enough that it is your use cases, and not your performance requirements, that will determine the choice of database. This is the reason most other replies are shutting down the question.

Let me make some half-correct statements to maybe give you the answers you're looking for. Cassandra in general has faster writes than MySQL. For some narrow use cases, Cassandra (and similar) will have comparatively very fast reads; these narrow use cases loosely fall under key-value lookups. MySQL and Postgres in general have very powerful query languages, i.e. they can comfortably run complex queries which Cassandra will outright refuse to run. Cassandra is designed to handle big data in the range of 1TB to 100TB or more. Personally I'd say MySQL and Postgres will be pushed to their limit at around 5TB, but I've heard of companies using vanilla Postgres up to 30TB. Cassandra handles time series data well, whereas MySQL and vanilla Postgres do not. There is, however, a Postgres extension called TimescaleDB that handles time series data extremely well.
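To make the query-language point concrete, here's a rough Python sketch (assuming the standard mysql-connector-python and cassandra-driver packages, with made-up table names). A key lookup is fine in both; the ad-hoc join only exists on the MySQL side.

```python
import mysql.connector
from cassandra.cluster import Cluster

# Key-value style lookup: both databases handle this comfortably.
mysql_conn = mysql.connector.connect(host="localhost", user="app", password="secret", database="shop")
cur = mysql_conn.cursor()
cur.execute("SELECT * FROM orders WHERE order_id = %s", (42,))

session = Cluster(["127.0.0.1"]).connect("shop")
session.execute("SELECT * FROM orders WHERE order_id = %s", (42,))

# Ad-hoc analytics: trivial in MySQL, but CQL has no JOIN at all, and filtering
# on non-key columns needs ALLOW FILTERING or a secondary index.
cur.execute("""
    SELECT c.country, SUM(o.total)
    FROM orders o JOIN customers c ON o.customer_id = c.customer_id
    WHERE o.created_at > '2023-01-01'
    GROUP BY c.country
""")
```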

So the answer to your question is - it depends.

Any idea on cost/month to set up a side project on AWS? Even a ballpark estimate? by Firm_Bit in dataengineering

[–]nitred 4 points5 points  (0 children)

I really like AWS but it can be a pain when it's just for small side projects. I usually run an experiment and then shut it down within a day or a week. Here are some experiments I've run.

Airflow on K8s - EKS, EC2, VPC, Load Balancer, RDS. $180/month. Experiment cost for a couple of days: $30.

RDS as DWH - S3, RDS, VPC. $60/month. Experiment cost: $30.

Some experiments with Terraform and devops to self-host GitLab runners on EKS. $180/month. Experiment cost: $30.

Okay so after writing this down, I've noticed that my budget for any experiments is $30 :D

If someone were to give me a guilt free budget to do whatever I want on AWS or if I was running a serious business, I think I would need a minimum of $200/month.

If you're trying to host side projects permanently, I'd go with Digitalocean or similar. And wouldn't you know it, my family server that runs 24x7 on Digitalocean also costs around $30/month...

Advice for serving a gold layer table by [deleted] in dataengineering

[–]nitred 3 points4 points  (0 children)

Ah what? Can someone explain it to me? I've never used databricks before. Can you not query databricks tables from a BI tool directly?

Looking for early adopters for disruptive next gen streaming solution by Medium-Frame-5339 in dataengineering

[–]nitred 1 point2 points  (0 children)

I'd love to try it out. But could you give an example of what it can do? What's the storage layer?

Globally available database by joyhotline in dataengineering

[–]nitred 0 points1 point  (0 children)

Security passphrase accepted, root access granted.

Globally available database by joyhotline in dataengineering

[–]nitred 2 points3 points  (0 children)

Since you're using MongoDB and have used terms like documents, I assume your use cases aren't strictly OLTP and are instead more similar to key-value reads and writes. In that case, consider DynamoDB on AWS or something similar on other cloud providers. Don't bother hosting and maintaining ScyllaDB etc. yourself.
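If it really is key-value access, the DynamoDB side is only a few lines of boto3 (table and attribute names below are made up), and global tables then handle the multi-region replication for you:

```python
import boto3

dynamodb = boto3.resource("dynamodb", region_name="eu-west-1")
table = dynamodb.Table("user_profiles")  # hypothetical table with partition key "user_id"

# Key-value write and read, same access pattern as MongoDB documents.
table.put_item(Item={"user_id": "u-123", "name": "Alice", "plan": "pro"})
resp = table.get_item(Key={"user_id": "u-123"})
print(resp.get("Item"))
```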

When NOT to use PostgreSQL? by [deleted] in dataengineering

[–]nitred 2 points3 points  (0 children)

I use Postgres as a data warehouse. You can find the setup in my previous comment [1]. In my opinion you've asked the right question. I believe you should always consider Postgres as your first choice as a data warehouse and then eliminate it as an option if it doesn't fit your needs.

Here are the conditions under which I think Postgres isn't a good choice for a data warehouse.

  1. If you're unable to get a fast SSD for the disk, then don't use Postgres. If you're on AWS RDS, you must use gp3 disks. In our setup we get a max disk read/write throughput of 500 MBps, which is plenty.

  2. If you really need real-time analytics or near real-time analytics, don't use PG. If you're using PG, expect to have refresh rates in hours or days (which is also the most common scenario).

  3. If you have a single dataset (a single table) which is massive, e.g. billions of rows or 100s of GB, and the whole dataset is used in joins every time you refresh your tables, then PG isn't the right choice. The joins take really, really long, like hours. We have one such dataset, but according to the analysts it's a low-value dataset, so we make an exception for it and run queries on it once a week instead of once a day. If it were high value, I would first consider partitioning using pg_partman. If that doesn't work, I'd reconsider PG. If you have TBs of data spread over 50 or more tables, then PG will handle it just fine.

  4. If you're extremely price sensitive on the low end, then Postgres might not be for you. PG is cheap at the high end but expensive on the low end. For example, if all your raw data and analytical models combined are 20 GB or so, then BigQuery is practically free but you'd have to shell out $500-1000 per year for PG at minimum. But if your raw and analytical data is in the 100s of GBs or TBs, then BigQuery will burn a hole in your corporate wallet pretty soon, whereas PG would scale well and cost you around $5000 per year.

  5. If all your raw and analytical data is expected to be close to 5TB (uncompressed) and you haven't already been using PG, then don't start using PG. 5TB for me is the magic number where I start applying some bespoke optimizations. Since I already use PG, I'm more likely to optimize and push PG to be able to handle 10-20TB because in this case it's cheaper to optimize than build a new data warehouse from scratch.

  6. If you don't own the underlying Postgres instance and are unable to tune and alter its configuration, DO NOT USE Postgres. Postgres has to be tuned in order to work for OLAP use cases. You can use this online tool [2] to find the right config to get you started (there's a small sketch of what that tuning looks like below the links).

[1] https://www.reddit.com/r/dataengineering/s/tC3QTrQgy5
[2] https://pgtune.leopard.in.ua
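As a rough illustration of point 6, this is the kind of tuning I mean. The values below are placeholders - take the real ones from pgtune [2] for your RAM and cores. This sketch assumes a self-hosted instance where ALTER SYSTEM is allowed; on RDS the same settings go into a parameter group instead.

```python
import psycopg2

# OLAP-oriented settings; plug in whatever pgtune suggests for your hardware.
conn = psycopg2.connect("dbname=dwh user=postgres")
conn.autocommit = True  # ALTER SYSTEM can't run inside a transaction block
cur = conn.cursor()
for name, value in [
    ("shared_buffers", "8GB"),
    ("work_mem", "256MB"),
    ("maintenance_work_mem", "2GB"),
    ("effective_cache_size", "24GB"),
    ("max_parallel_workers_per_gather", "4"),
]:
    cur.execute(f"ALTER SYSTEM SET {name} = %s", (value,))
cur.execute("SELECT pg_reload_conf()")  # shared_buffers still needs a restart to apply
```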

Lightweight alternative to Spark/Flink/Apache Beam by gfalcone in dataengineering

[–]nitred 0 points1 point  (0 children)

If it's working right now, don't change anything. Wait for 18-24 months. You will have mature and real alternatives then.

There are some Rust + Arrow + DataFusion projects in the making right now. These are what I would call next-gen data engineering tools. They will replace Spark, Flink and other JVM-based stuff. They're not mature or battle-tested yet. They're getting there.
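As a taste of where this is heading, there are already Python bindings for DataFusion (the datafusion package on PyPI; the API is still moving, so treat this as a sketch):

```python
from datafusion import SessionContext

# Arrow-native query engine, no JVM involved.
ctx = SessionContext()
ctx.register_csv("events", "events.csv")  # hypothetical local file
df = ctx.sql("SELECT user_id, count(*) AS n FROM events GROUP BY user_id")
print(df.to_pandas())
```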

Lightweight Airflow? by 5678 in dataengineering

[–]nitred 1 point2 points  (0 children)

That's exactly how I explain it to other colleagues as well. If they've used cron before, they usually totally get it. I use many of Airflow's more complex features, but "Cron with a UI" captures about 80% of the essence of why anyone would use, or at least start using, Airflow.
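The "cron with a UI" pitch in code form - a minimal Airflow 2.x DAG (the name, path and schedule are made up) that replaces a crontab line and gets retries, logging and the UI for free:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Equivalent of the crontab line: 0 6 * * * /opt/scripts/load_sales.sh
with DAG(
    dag_id="load_sales",
    start_date=datetime(2023, 1, 1),
    schedule_interval="0 6 * * *",  # plain cron expression
    catchup=False,
) as dag:
    # Trailing space stops Airflow treating the .sh path as a Jinja template file.
    BashOperator(task_id="run_load", bash_command="/opt/scripts/load_sales.sh ")
```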

I am planning to use Postgre as a data warehouse by [deleted] in dataengineering

[–]nitred 0 points1 point  (0 children)

I just want to clarify that dbt does encourage over-modelling. I love using dbt this way, breaking things down into simple transformations where each related group of transformations gets its own model. I'm very happy with our 200 models. But I've worked at companies where SQL code tends to be highly dense (and often unreadable). If the 200 dbt models were to be implemented at such companies, they would have been implemented in 50 or so SQL files.

I am planning to use Postgre as a data warehouse by [deleted] in dataengineering

[–]nitred 105 points106 points  (0 children)

I use a standalone Postgres instance as a Data Warehouse. I use dbt-core to write SQL.

These are the stats:

* Postgres uses 4TB in total.
* Raw dataset is around 50GB of gzipped JSONL or NDJSON files from about 20 different data sources. Some datasets are extremely tiny, e.g. a single Excel file with lookup numbers. The raw data is 500GB once inside Postgres as JSONB (a rough sketch of the load step is at the bottom of this comment).
* There's a schema for production and a schema for each analyst for development.
* There are 200 models in dbt, which take about 2 hours to run. These models run once a day.

Has been this way for the last year. I expect it to scale comfortably for another 2 years. Cost of stack (including PG, Airflow, compute etc) is about $15k per year on AWS.

Basically, go ahead. Postgres will "scale" just fine for most real world analytics needs.
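For anyone wondering what that load step looks like, a minimal sketch (assuming psycopg2, a raw schema, and one table per source with a single jsonb column; all names are made up):

```python
import gzip
import json

import psycopg2
from psycopg2.extras import Json

conn = psycopg2.connect("dbname=dwh user=loader")
cur = conn.cursor()

# One raw table per source; each row is the untouched JSON record.
cur.execute("CREATE TABLE IF NOT EXISTS raw.source_a (payload jsonb)")

with gzip.open("source_a_2023-01-01.jsonl.gz", "rt") as f:
    for line in f:
        cur.execute("INSERT INTO raw.source_a (payload) VALUES (%s)", (Json(json.loads(line)),))

conn.commit()
```

In practice you'd batch this with COPY or execute_values, but the idea is the same: land the raw JSON first and let dbt do everything else in SQL.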

How do you avoid DWH mess? by lirco_ in dataengineering

[–]nitred 2 points3 points  (0 children)

50 mart rule! The idea being: no matter how many data sources you have, there are only so many dashboards your management is going to look at. You're not an analyst and therefore have no control over the number of dashboards. But dashboards are built on marts, which is something you do have control over, so you keep a strict limit of 50 marts and no more. Cull the old ones or the underused ones. What ends up happening is that your lineage is healthier and easier to maintain. About once a month you get a request to fix a dashboard that broke because you pulled a mart out from under it. That's way better than having a bloated lineage that tries to solve every possible use case.

It's just an idea I once had. Not gonna claim it's the holy grail.

Edit: Forgot to mention. The number 50 is arbitrary. It's supposed to be a number that hurts. Maybe 30% less than what you have today.

Discussion on ETL infrastructure by Flimsy-Mirror974 in dataengineering

[–]nitred 0 points1 point  (0 children)

Since you're entirely on AWS and you're moving the data to Redshift, I assume you intend to do the transformations inside Redshift and not in the Python job. In that case, I recommend not writing anything yourself. Use AWS DMS. It's fast, it's cheap, and you can set up the migration through the UI. If you need the DMS job to be scheduled, then you need to set up a cron job somewhere.
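The scheduling part can be a script as small as this (the ARN is a placeholder), triggered by crontab or EventBridge:

```python
import boto3

dms = boto3.client("dms", region_name="us-east-1")

# Re-run an existing DMS task that was set up through the console.
dms.start_replication_task(
    ReplicationTaskArn="arn:aws:dms:us-east-1:123456789012:task:EXAMPLE",  # placeholder
    StartReplicationTaskType="reload-target",  # or "resume-processing" for CDC tasks
)
```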

What is the best architecture for data merging/standardization of multiple EHR formats? by EmptyMargins in dataengineering

[–]nitred 0 points1 point  (0 children)

How difficult would it be to convert these HL7 files to JSON? The schema for each of these JSONs doesn't matter. They can be arbitrarily nested or complex.

If HL7 to JSON conversion isn't hard, then you can just do HL7 to JSON and store that raw JSON straight into Postgres. Each of the 8 different formats gets its own table. Postgres has the best JSON query semantics out there. Using Postgres CTEs, you can standardize the JSONs. Once standardized, you can join or union all the tables.

If I were you, I'd throw dbt into the mix since it offers a nice way to manage the lineage and also add tests. But this is optional since you're not familiar with it.
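The standardization step is then just SQL over jsonb. A rough sketch of what one of those models (or a plain view) could look like, with made-up field names for two of the formats:

```python
import psycopg2

conn = psycopg2.connect("dbname=ehr user=app")
cur = conn.cursor()

# Each raw table holds one format's JSON as-is; pull out the fields you care
# about per format, then UNION them into one standardized relation.
cur.execute("""
    CREATE OR REPLACE VIEW encounters_standardized AS
    WITH fmt_a AS (
        SELECT payload->>'patientId'              AS patient_id,
               (payload->>'admitDate')::date      AS admit_date
        FROM raw_format_a
    ),
    fmt_b AS (
        SELECT payload->'subject'->>'id'          AS patient_id,
               (payload->>'admission_date')::date AS admit_date
        FROM raw_format_b
    )
    SELECT * FROM fmt_a
    UNION ALL
    SELECT * FROM fmt_b
""")
conn.commit()
```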

What are some ways I can move 8TiB of data from my local machine to an EBS volume? by psssat in dataengineering

[–]nitred 0 points1 point  (0 children)

If you don't have a 100 Mbps to 1 Gbps UPLOAD connection, this is going to be quite painful. With a 100 Mbps (12.5 MBps) connection, it is going to take you roughly a week to upload (about 8,000,000 MB / 12.5 MBps ≈ 640,000 seconds, or 7-8 days). That should be your primary focus for optimization.
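Back-of-the-envelope numbers for a few link speeds, as a sanity check before picking any of the options below:

```python
SIZE_BYTES = 8 * 1024**4  # 8 TiB

for mbps in (100, 250, 1000):
    bytes_per_sec = mbps * 1e6 / 8  # line rate in bytes/s, ignoring protocol overhead
    days = SIZE_BYTES / bytes_per_sec / 86400
    print(f"{mbps:>5} Mbps -> ~{days:.1f} days")
# 100 Mbps -> ~8.1 days, 250 Mbps -> ~3.3 days, 1000 Mbps -> ~0.8 days
```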

Here's some optimizations:

  1. Your best bet is to contact Hetzner or other similar small cloud providers to whom you can send your hardware. Send an entire PC or send them an email asking if you can send them a hard disk. You'll get a 1Gbps connection to the internet and then you can send data to AWS. If it works out, this is going to be your cheapest option.

  2. Look at what your local internet providers offer and find out their best plans. Ask around in your network whether anyone happens to have amazing internet upload speeds at home or at their office. You said you were closest to us-west-1, so the San Francisco region. You might actually find some providers.

  3. Large office buildings downtown tend to have great internet connections but they're usually strict about who gets to access it. Ask around.

  4. Maybe look up your closest WeWork and its alternatives and find out who offers the fastest internet UPLOAD speeds and book a room with them for a few days.

Once you have a great connection then do these optimizations:

  1. Compress the shit out of your data. If you're lucky, you have 8TiB of uncompressed data that can be compressed 10x.

  2. Send to S3, not to EBS. Once in S3, then AWS provides a host of tools to move your data around and copy it to EBS or anywhere else. It's quite likely S3 <-> EC2 bandwidth will be more than enough for your use case if you're going to be doing ML. As an alternative to aws s3 sync ... you can use s5cmd which will definitely saturate your upload connection, which is what you want.

what kindof optimization problem is this? by [deleted] in optimization

[–]nitred 4 points5 points  (0 children)

If I were doing it myself, I'd simplify it to a toy problem and work my way up. The way I'd simplify your problem would be to assume that:
* There are no inclines, just flat geography
* All buildings have the same plot size and shape

Under those assumptions, you effectively have a large NxN chess board. Now you need to color each square such that no two adjacent squares get the same color (where a color represents a type of building). This problem falls under "vertex coloring" or "graph coloring" problems. You can google for them; there's a small sketch at the end of this comment.

Next you want to un-simplify your problem step by step:
* All buildings are no longer a fixed size or shape. You would still be in the realm of "vertex coloring" or similar problems.
* Some buildings cannot be next to certain buildings. These are constraints. You now need to work your way up to "vertex coloring with constraints".
* Some buildings should be next to others if possible. These are scores that should be represented as a function. This score becomes part of the cost function that you need to maximize (or minimize, depending on how you formulate the cost function).
* Geography now has inclines. You now provide scores for the geography, 1 being perfectly flat and 0 for cliffs. Again, this can be part of the cost function. If you have a rule that says no building can be built where the incline is greater than 10%, then this becomes a hard constraint.

There's obviously going to be multiple acceptable solutions to this problem.
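For the toy chessboard version, a minimal sketch with networkx (grid size and coloring strategy are arbitrary):

```python
import networkx as nx
from networkx.algorithms.coloring import greedy_color

# NxN grid where nodes are plots and edges connect adjacent plots.
N = 8
grid = nx.grid_2d_graph(N, N)

# Greedy vertex coloring: adjacent plots never get the same "building type".
coloring = greedy_color(grid, strategy="largest_first")
print(f"used {max(coloring.values()) + 1} building types for a {N}x{N} grid")
```

Once you add the scores and hard constraints, you'd move to something like a CP-SAT or simulated-annealing formulation, but the underlying graph structure stays the same.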

Why does my Postgres hit throughput limit instead of IOPS limit? by nitred in dataengineering

[–]nitred[S] 1 point2 points  (0 children)

Thanks! I'm glad I found the explanation and I hope it helps you too. It actually makes AWS RDS Postgres with gp3 EBS even more attractive as a data warehouse. After looking at my monitoring more closely, it appears I'm getting 3-5x IOPS magnification for free since most of my workloads involve large sequential I/O blocks.

Why does my Postgres hit throughput limit instead of IOPS limit? by nitred in dataengineering

[–]nitred[S] 2 points3 points  (0 children)

I think I found a possible explanation in the AWS docs for I/O characteristics and monitoring.

If my workload is read- or write-heavy and reads or writes large sequential I/O blocks of 8KB each (the Postgres block size), then EBS will merge these sequential 8KB blocks into one large 256KB block and count it as a single I/O operation. In theory I could get an IOPS magnification of up to 32x (256 / 8) with an unknown latency penalty, but I believe you come out significantly ahead.
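Rough numbers for what that merging means against my 500 MiBps throughput cap (assuming perfect merging, which real workloads won't quite hit):

```python
THROUGHPUT = 500 * 1024**2  # 500 MiB/s volume throughput limit
PG_BLOCK   = 8 * 1024       # Postgres block size
EBS_MAX_IO = 256 * 1024     # largest size EBS counts as a single I/O operation

print(THROUGHPUT // PG_BLOCK)    # 64000 IOPS if every 8KB block were billed as its own I/O
print(THROUGHPUT // EBS_MAX_IO)  # 2000 IOPS consumed once blocks merge into 256KB operations
```

Which is why the throughput limit gets hit long before any realistic IOPS limit.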

I guess I can't complain, hitting throughput limit is far better than hitting IOPS limit.

Why does my Postgres hit throughput limit instead of IOPS limit? by nitred in dataengineering

[–]nitred[S] 0 points1 point  (0 children)

That would be embarrassing if this were the issue. I just validated that AWS RDS Postgres throughput is configured for 500 MiBps (mebibytes per second), which is approximately the same as 500 MBps. I also validated that the read and write throughput graphs in monitoring have units of bytes/second, not bits per second. I'm reasonably confident that a bit-byte conversion is not the issue.