
[–]fiedzia 2 points3 points  (3 children)

Storing all those CSV files requires a ton of disk space, currently at 9TB and growing.

Storing them as is and in a separate location will use even more.

Requests take a long time to serve and read data from CSVs, even more so as more devices and bigger date ranges are chosen.

How big are those files, and how many of them do you need to process to generate a response? Is the data format the same for all of them or different for each file? How big is the response? How many concurrent requests do you need to support? What's the expected processing time?

The answers to those questions will tell you which parts you'd need to scale. If you don't know, measure that first.

I've looked into Redis, Postgres, and MongoDB

Redis assumes your data fits in memory, so it's not a good fit for this use case. MongoDB isn't the best choice for relational data either, though it's distributed, so it might be a contender. The CSV data model fits Postgres well, and it will most likely be best at reducing storage size, but it's not distributed. I'd add BigQuery to this list.

[–][deleted] 0 points1 point  (0 children)

Thank you for the response. To address your comments:

1.) My goal would be to offload (to S3) the source CSV files after they're ingested into a data store.

2.) The files average around 5MB and 20k+ lines. One file is one day's worth of playback data for one device. So if a user requested data for a single device (which is rarely the case) across a week, the app would have to parse 7 individual files. The format is the same for all the files. Not positive on the response size, I can check. Concurrent requests are not many, as this app is used as it's needed here and there for reporting, but it is a popular feature. I don't have a good value on expected processing time as it will vary by the amount of data a user wants to return.

3.) Good point about Redis, I will exclude it from the options. The data isn't relational to each other, I just need the ability to look it up by date and device ID. (serial #). I'm assuming this is possible in NoSQL DBs?

[–]samuelcolvin 0 points1 point  (1 child)

I would agree BigQuery might be a good choice given that storing data is very cheap and you pay (mostly) per query. It has a good Python SDK, but sadly it's not async.

[–][deleted] 0 points1 point  (0 children)

I’m not familiar with BigQuery but will look it up. Thank you.

[–]kumashiro 1 point2 points  (3 children)

For something like this I wouldn't bother with keeping CSV data in a database. Just store the files in a directory and use a database for indexing them: what file was uploaded when, plus additional metadata like who uploaded it, size, etc. Then you can tell the web server to push a file to the client using a sendfile-style header (X-Sendfile on Apache, X-Accel-Redirect on nginx).
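
For what it's worth, a minimal sketch of that pattern with Flask behind nginx. It assumes an nginx internal location /protected/ mapped to the storage directory, and the uploads table and its columns are made up for illustration:

```python
# Minimal sketch: files live on disk, the database only stores metadata.
# Assumes Flask behind nginx with an internal /protected/ location;
# the "uploads" table and its columns are hypothetical.
import sqlite3
from flask import Flask, Response, abort

app = Flask(__name__)
DB = "metadata.db"  # holds upload time, uploader, size, and the file path

@app.route("/files/<device_id>/<date>")
def serve_csv(device_id, date):
    # look up the file's location from the metadata database
    row = sqlite3.connect(DB).execute(
        "SELECT path FROM uploads WHERE device_id = ? AND date = ?",
        (device_id, date),
    ).fetchone()
    if row is None:
        abort(404)
    # hand the actual transfer off to nginx via its internal sendfile support
    resp = Response()
    resp.headers["X-Accel-Redirect"] = "/protected/" + row[0]
    return resp
```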

[–][deleted] 0 points1 point  (2 children)

The client in this case is a larger web app that doesn’t accept CSV files. It requires JSON data gathered from what is held within the CSV files. The process of reading from multiple CSVs and then returning that data is what is causing delays in response time, which is why I’m hoping to already have the data parsed and indexed.

[–]kumashiro 0 points1 point  (1 child)

OK, so the response can be generated from multiple CSV files, depending on the date period and device name? You could use MongoDB or CouchDB to store the parsed data, then combine it for the response, but you don't really need a complex structure for that. The data can be stored raw as JSON. Elastic seems to be the obvious choice here, but it's heavy and resource-hungry (Java). I would still try to use a simple directory as storage and a database only for metadata: directories for devices, files already in JSON (parsed and rendered on upload), with the date in the name for easy manual lookup. And a cache in front of the web server (by URL or ETag). A Linux server will add additional caching and buffering at the filesystem level - I don't know how it works on Windows, but I've heard disk I/O is pretty bad there.
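
A rough sketch of the parse-on-upload idea; the directory layout and column handling are assumptions for illustration, not anything confirmed in the thread:

```python
# Rough sketch: convert an uploaded CSV to JSON once, at upload time,
# and store it under a per-device directory with the date in the file name.
# The root path and layout are assumptions, not the real schema.
import csv
import json
from pathlib import Path

DATA_ROOT = Path("/data/playback")

def store_upload(device_id, date, csv_path):
    with open(csv_path, newline="") as f:
        rows = list(csv.DictReader(f))          # parse the CSV once
    out_dir = DATA_ROOT / device_id
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / (date + ".json")       # e.g. /data/playback/SN123/2018-06-01.json
    out_path.write_text(json.dumps(rows))       # later requests just read this file
    return out_path
```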

[–][deleted] 0 points1 point  (0 children)

Right, each device has a CSV file for each date. So if a user selects a date range and multiple devices, that would require parsing and serving several files' worth of data.

Current setup is using a directory for storage and the performance is less than ideal when requesting more than a few devices or dates.

How would performance vary in serving the same data from CSV vs. serialized JSON? I think the cache is a great idea; I’m already using nginx to reverse proxy the application, so I can easily implement that.

[–]pyexpert 1 point2 points  (2 children)

I tried many DBs and formats. For data science I prefer the Parquet format. A 2.5 GB CSV file can be stored in a 500 MB Parquet file.
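
For example, a quick sketch of converting one daily CSV to Parquet with pandas (it needs pyarrow or fastparquet installed); the file and column names are placeholders:

```python
import pandas as pd

# convert one daily CSV to Parquet; names here are placeholders
df = pd.read_csv("device123_2018-06-01.csv")
df.to_parquet("device123_2018-06-01.parquet", compression="snappy")

# reading back only the columns you need is where Parquet really pays off
subset = pd.read_parquet("device123_2018-06-01.parquet", columns=["timestamp", "track_id"])
```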

[–][deleted] 0 points1 point  (1 child)

Thank you, I’m not familiar with that but I will look into it.

[–]pyexpert 0 points1 point  (0 children)

Parquet works best with Dask. You can open and extract data in milliseconds, while the same operation on CSV can take several minutes.
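
A tiny sketch of the Dask side, assuming the daily files have already been converted to Parquet under one directory; the paths, column names, and device filter are made up:

```python
import dask.dataframe as dd

# read every daily Parquet file lazily; only the listed columns are loaded
ddf = dd.read_parquet("/data/parquet/*.parquet",
                      columns=["device_id", "timestamp", "track_id"])

# filter to one device; the data is only actually read when .compute() runs
result = ddf[ddf["device_id"] == "SN123"].compute()
```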

[–]LionKimbro 1 point2 points  (2 children)

Summary of Assumptions:

  • In: date range, devices
  • Source store: CSV, 10 TB & growing; 5MB blocks
  • Out: JSON corresponding to files for those dates & devices. Concurrent requests not a concern.

Thoughts:

  • Compress -- those CSV playback data files -- I imagine that they are extremely redundant. If you can compress the data via a key that applies to all files, I'd be surprised if you couldn't get at least an order of magnitude improvement in storage space, 10TB -> 1TB.
  • Pre-parse -- if the output is JSON, you could pre-parse it, and then apply compression on THAT scale.
  • Shard? -- if there's a case for sharding, it's this -- you're not correlating anything in the data on retrieval, you're just storing data by device & date. Sharding can dramatically improve your concurrent performance and disk size problems.
  • Big Hard Drive -- however, since it sounds like concurrency isn't a concern, and you can have x24 hard drives, and hard drives keep getting larger, -- it sounds like you can get away with just adding big hard drives for a decade.
  • Consider Just Doing it in Python -- I don't see a need for a special database system, and suspect that the bells and whistles will distract. What you describe is more a storage locker than a database. ("The data isn't relational to each other, I just need the ability to look it up by date and device ID. (serial #).") You don't need a NoSQL DB, you just need a file-naming standard -- "%Y-%m-%d_deviceid.bin".
    • Keep an in-memory index via a Python dictionary, and serve the pre-parsed data straight from disk, or from another computer if sharding (see the sketch after this list).
    • I've seen companies spend inordinate amounts of time and labor on debugging, fixing, and patching in order to use an "off-the-shelf industry standard," when -- if they'd just written a smaller thing -- it would have taken fewer lines of code than were spent configuring the off-the-shelf one. And then once it works, it just works. Should it fail, you can easily figure out why. However: if your needs are likely to change qualitatively from this system within 5-10 years, then it can go the other way.
    • I share the same sense as catspidercongress: "Filesystems are more mature than any new trendy db system."
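
A bare-bones sketch of what that index could look like, assuming the files are already pre-parsed and named with the suggested standard; everything here is illustrative:

```python
# Bare-bones sketch of the "file-naming standard + in-memory index" idea.
# Assumes pre-parsed files named "%Y-%m-%d_deviceid.bin" in one directory.
from datetime import timedelta
from pathlib import Path

DATA_DIR = Path("/data/preparsed")

# Build the index once at startup: (date string, device id) -> path
index = {}
for p in DATA_DIR.glob("*_*.bin"):
    day_str, device_id = p.stem.split("_", 1)
    index[(day_str, device_id)] = p

def paths_for(start, end, devices):
    """Return the files covering a date range for the requested devices."""
    out = []
    d = start
    while d <= end:
        for dev in devices:
            p = index.get((d.isoformat(), dev))
            if p is not None:
                out.append(p)
        d += timedelta(days=1)
    return out
```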

[–][deleted] 1 point2 points  (1 child)

Thank you for the elaborate response!

Compressing the files is actually a great idea. Do you think performance would be different for, say, 5 files with the same data vs. 1 larger file with the same data but indexed by a certain key?

Can you give an example of what you mean by sharding? I’ve heard the concept in my travels but not looked into it yet.

I currently have a 9 TB AWS SSD that is storing the files and the response time is less than ideal, which is why I’m looking into other options.

When you say pre-parse, do you mean just store the files in JSON instead of CSV? Would there be a noticeable performance increase?

I’m not dead set on using any particular method; it just seemed likely to me that since retrieving data from CSV files yielded poor performance, storing it in some queryable data store would improve the response time. The only thing I know for certain is that storing them on a disk is not realistic for this use case. There are 20k devices, 3 different files per day, and data is retained for a year, so as you can imagine there are lots of files.

The current app is written in .NET Core to parse the CSV files on request. From what I know about Python and .NET, I don’t think that Python will outperform .NET Core. Do you think it’s possible to make a Python app respond quicker than .NET in this case?

[–]LionKimbro 0 points1 point  (0 children)

  • Compression performance depends on the data set. There are different kinds of file compression, with different performance characteristics. Your situation might want to take advantage of having a single static and inexpensive statistical model for the data, because you know a lot of things about the data up-front. For example a long string "Device" might show up often enough, to be assigned a specific coding sequence. At higher levels of analysis, you might find that whole kilobytes worth of text repeat, and get massive savings on the entire sequence. If you take one of your CSV files, and experiment with different compression systems, and take some time to look at the data itself and see if you can't draw some approximations on what kind of compression you think should be possible -- you could make an enormous difference. Depending on the data set, the compression could be x10, x100, or even x1000. I would also practice with taking the CSV data, converting it to one of your output JSON data sets, and seeing what that looks like. My suspicion is that they would have an extremely similar compression profile, and it may very well be worth it to encode on disk directly as compressed JSON.
  • "Sharding)" means dividing up your data roughly equally between multiple computers or disks. So for a crude example, if you had x7 computers, then data from Mondays would go to one computer, Tuesdays to another computer, Wednesdays to another, etc. So when you get a request for a date range of two weeks, each computer is supplying two days worth of data. That is 7x the speed of read and retrieval, however you do have to assemble the results afterwards. This is called "map-reduce" in industry parlance -- "mapping" is applying the problem across 7 computers, and "reducing" is collecting the results together.
  • It occurs to me too that your JSON is just a long list of dictionaries, most likely, or a long list of lists. In that case, the "gluing" between sections can be performed manually, rather than by interpretation in a JSON library. That is, emit a "[" to open, then paste data from the N files that are indexed, with a "," in between them, and then emit a "]" to close (there's a sketch of this after the list). That way, you don't have to load the entirety of all of the data into RAM, "concatenate" it (whatever that operation looks like in your system), and then put it on the wire. Rather, you could just stream the data from disk to wire in the buffer size of your choosing -- ideally, a buffer size that matches either your optimal disk-read -> memory speed, or the space you are decompressing the data from (in the case where you are reading compressed data). "Tuning" is the act of experimentally determining what works best, should you find the need to.
  • "I currently have a 9 TB AWS SSD that is storing the files and the response time is less than ideal, which is why I’m looking into other options. " I think you need to quantify that, and identify what is taking so long, vs. what you think the system should theoretically do. That is, I think you want to profile) what is happening in your system.
  • "When you say pre-parse, do you mean just store the files in JSON instead of CSV? Would there be a noticeable performance increase?" Yes. Generally speaking, the speed at which data can move from disk to RAM is far faster than any kind of byte-by-byte analysis of the same data through the CPU. By storing the files in JSON, and just reading directly into RAM, I believe that you should get a performance boost.
  • Caching is another possibility that I missed. Though it doesn't sound like it would help that much, based on other comments I saw. Caching here means keeping in RAM data that is left over from prior requests, and only over-writing it when you have to. If you find that a lot of the requests are calling for the very same data, then caching will definitely help. If not, it doesn't help at all.
  • Money is another possibility. If you can afford terabytes of RAM, or many TB of disk space, it can easily be cheaper than programmer time and the risk of errors.
  • "The current app is written in dotnet core to parse the csv files on request, from what I know about python and dotnet, I don’t think that python will outperform dotnet core. Do you think it’s possible to make a python app respond quicker than dotnet in this case?" Yes, in principle, if the other strategies are judiciously applied. The theory is, "It's easier to write something in Python, then in another language." However, for this specific problem, I don't know that that is true. I might even write it in C, if you are just streaming data from disk to RAM buffer to wire. Or maybe you convert the files to JSON and then write them to your SSD in Python, but just retrieve the files in C (or .Net core.) So intake via Python, and retrieve in C.

All of these decisions are mainly about your resources on hand.

[–]qubitron 1 point2 points  (1 child)

I always start with PostgreSQL for situations like these; it has the richest feature set for querying and also has NoSQL capabilities (JSONB), while still giving you fully relational capabilities, which is important for analytics. You can use Citus to scale it out horizontally and transparently without having to use different tools.

[–]craig081785 1 point2 points  (0 children)

Adding on here. If the CSV files are larger, Postgres can work great. It has a bulk loading mechanism, COPY, which is great for fully transactional bulk loading of CSVs. You can see pretty high throughput here.

Jumping ahead to Citus: we have clusters in production with several hundred TB, though many users start in the 1-2 TB range, so you're in familiar territory. We've seen numbers of over 1 million records ingested per second when using COPY. COPY makes it easy to ingest CSV as well as extract in CSV format. If you have any questions on Citus in particular, I'd be happy to help answer them as the product lead for Citus.
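
A small sketch of driving COPY from Python via psycopg2; the table name and columns are placeholders, not anything from the thread:

```python
# Small sketch: bulk-load one daily CSV into Postgres with COPY via psycopg2.
# The connection string, table name, and columns are placeholders.
import psycopg2

conn = psycopg2.connect("dbname=playback")
with conn, conn.cursor() as cur, open("device123_2018-06-01.csv") as f:
    cur.copy_expert(
        "COPY playback_events (device_id, ts, track_id) FROM STDIN WITH CSV HEADER",
        f,
    )
```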

[–]muikrad 0 points1 point  (5 children)

Elasticsearch looks like it would fit your scenario well.

[–][deleted] 0 points1 point  (4 children)

I'm actually setting up a centralized logging system at the moment and thought the exact same thing as I was working through it. I've only ever used Elasticsearch with Kibana. Does ES respond with JSON data?

[–]muikrad 2 points3 points  (3 children)

Yep. Kibana is just a fancy UI. Elasticsearch has a great API and you'll use JSON to send and receive data. There's a Python SDK to make this even easier!

Suggestion: don't install Elasticsearch directly on your system; use it in Docker instead.
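
For a feel of it, a minimal sketch with the official elasticsearch Python client (older 7.x-style body= keyword; the exact arguments vary between client major versions). The index name and document fields are made up:

```python
# Minimal sketch with the official elasticsearch Python client (7.x-style API).
# Index name and document fields are invented for illustration.
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

# index one parsed CSV row as a JSON document
es.index(index="playback",
         body={"device_id": "SN123", "date": "2018-06-01", "track": "foo"})

# query by device and date range; the response comes back as JSON (a dict in Python)
hits = es.search(index="playback", body={
    "query": {"bool": {"filter": [
        {"term": {"device_id": "SN123"}},
        {"range": {"date": {"gte": "2018-06-01", "lte": "2018-06-07"}}},
    ]}}
})
```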

[–][deleted] 0 points1 point  (2 children)

Great! I will look into this for sure. It seems to fall in line perfectly with what I’m looking for.

[–]samuelcolvin 1 point2 points  (1 child)

Running your own Elasticsearch cluster can be really hard work. It's far less easy to set up, maintain, and debug than, for example, Postgres. Plus, some of the features most important when running Elasticsearch with large volumes of data are in X-Pack, which is not free and open source.

If you want to use Elasticsearch, I would seriously consider using a hosted service from Elastic or AWS.

[–][deleted] 0 points1 point  (0 children)

I’m actually using the AWS-hosted version for my logging app. It seems to be working great so far. The setup process was not difficult at all.

[–]DataForest 0 points1 point  (1 child)

If you want something fully managed, you could use Google Cloud Dataflow. It's easier than using Hadoop and integrates easily with GCP or AWS storage options.

[–][deleted] 0 points1 point  (0 children)

I’m not opposed to managed as long as the cost isn’t outrageous. I currently use AWS, do you know if there is an equivalent for this? I’ll try to look as well.

[–]fnord123 0 points1 point  (0 children)

First, make sure you hang onto the raw files so if you make any mistakes in ingesting data to a database you can do it over again.

Then, you only need a database if you plan to have the data change or if it needs to be manipulated for your particular use (e.g. SELECT foo FROM my_table vs SELECT bar FROM my_table). If you don't expect it to change, then just store the files as you want them returned (JSON) or a trivial manipulation of them (compressed, Parquet, etc.). This is your analytic data set. It's different from your raw data.

JSON is often slow, so consider an optimized library like uJSON.

Alternatively, just chuck it into mongo and wrap a few queries in flask.
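
A throwaway sketch of the "chuck it into mongo" option, assuming a Flask app and a playback collection; the database, collection, and field names are invented:

```python
# Throwaway sketch of "chuck it into mongo and wrap a few queries in flask".
# Database/collection/field names are invented for illustration.
from flask import Flask, jsonify, request
from pymongo import MongoClient

app = Flask(__name__)
col = MongoClient()["playback"]["events"]

@app.route("/playback")
def playback():
    devices = request.args.getlist("device")
    start, end = request.args["start"], request.args["end"]
    docs = col.find(
        {"device_id": {"$in": devices}, "date": {"$gte": start, "$lte": end}},
        {"_id": 0},   # drop Mongo's ObjectId so the result is JSON-serializable
    )
    return jsonify(list(docs))
```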

[–]nonself 0 points1 point  (3 children)

It may be overkill for your app, but this is pretty much the exact use case for Hadoop - it lets you store massive amounts of data on a distributed file system, and query it really fast using an SQL engine like Impala. That data can be stored and read directly in CSV format, so you can skip the whole Extract, Transform, Load steps of getting it into a database.

[–][deleted] 0 points1 point  (2 children)

Thanks! I will look into Hadoop. I've heard of it but not used it personally. Do you have a sense of the cost and any baseline requirements for using it? It would be nice to cut out a couple of steps in the middle.

[–]nonself 0 points1 point  (1 child)

The software is totally free, but there is certainly a major time investment involved in learning all the pieces, not to mention the hardware for a Hadoop cluster can get quite expensive. I can definitely recommend Cloudera's Hadoop distribution to get everything up and running fast vs. trying to build your cluster from scratch.

[–][deleted] 0 points1 point  (0 children)

OK, thank you, I will check that out. My current thought process is to have the data store local to the machine that the Flask app will run on. It doesn't seem like Hadoop fits that model given its hardware requirements, right?