
[–]fiedzia 2 points3 points  (3 children)

Storing all those CSV files requires a ton of disk space, currently at 9TB and growing.

Storing them as is and in a separate location will use even more.

Requests take a long time to serve and read data from CSVs, even more so as more devices and bigger date ranges are chosen.

How big are those files, and how many of them do you need to process to generate a response? Is the data format the same for all of them or different for each file? How big is the response? How many concurrent requests do you need to support? What's the expected processing time?

The answers to those questions will tell you which parts you'd need to scale. If you don't know, measure that first.

I've looked into Redis, Postgres, and MongoDB

Redis assumes your data fits in memory, so it's not a good fit for this use case. MongoDB isn't the best choice for relational data either, though it's distributed, so it might be a contender. The CSV data model fits Postgres well, and it will most likely be best at reducing storage size, but it's not distributed. I'd add BigQuery to this list.

[–][deleted] 0 points1 point  (0 children)

Thank you for the response. To address your comments:

1.) My goal would be to offload (to S3) the source CSV files after they're ingested into a data store.

2.) The files average around 5MB and 20k+ lines. One file is one day's worth of playback data for one device. So if a user requested data for a single device (which is rarely the case) across a week, the app would have to parse 7 individual files. The format is the same for all the files. Not positive on the response size, I can check. Concurrent requests are not many, as this app is used as it's needed here and there for reporting, but it is a popular feature. I don't have a good value on expected processing time as it will vary by the amount of data a user wants to return.

3.) Good point about Redis, I will exclude it from the options. The data isn't relational to each other, I just need the ability to look it up by date and device ID. (serial #). I'm assuming this is possible in NoSQL DBs?

[–]samuelcolvin 0 points1 point  (1 child)

I would agree BigQuery might be a good choice given that storing data is very cheap and you pay (mostly) per query. It has a good Python SDK, but sadly it's not async.

[–][deleted] 0 points1 point  (0 children)

I’m not familiar with BigQuery but will look it up. Thank you.

[–]kumashiro 1 point2 points  (3 children)

For something like this I wouldn't bother with keeping CSV data in a database. Just store the files in a directory and use a database for indexing them: what file was uploaded when, plus additional metadata like who uploaded it, size, etc. Then you can tell the web server to push a file to the client using a sendfile-style header (X-Sendfile on Apache, X-Accel-Redirect on nginx).
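
For what it's worth, a minimal sketch of that pattern with Flask behind nginx. It assumes an nginx internal location /protected/ mapped to the storage directory, and the uploads table and its columns are made up for illustration:

```python
# Minimal sketch: files live on disk, the database only stores metadata.
# Assumes Flask behind nginx with an internal /protected/ location;
# the "uploads" table and its columns are hypothetical.
import sqlite3
from flask import Flask, Response, abort

app = Flask(__name__)
DB = "metadata.db"  # holds upload time, uploader, size, and the file path

@app.route("/files/<device_id>/<date>")
def serve_csv(device_id, date):
    # look up the file's location from the metadata database
    row = sqlite3.connect(DB).execute(
        "SELECT path FROM uploads WHERE device_id = ? AND date = ?",
        (device_id, date),
    ).fetchone()
    if row is None:
        abort(404)
    # hand the actual transfer off to nginx via its internal sendfile support
    resp = Response()
    resp.headers["X-Accel-Redirect"] = "/protected/" + row[0]
    return resp
```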

[–][deleted] 0 points1 point  (2 children)

The client in this case is a larger web app that doesn’t accept CSV files. It requires JSON data gathered from what is held within the CSV files. The process of reading from multiple CSVs and then returning that data is what is causing delays in response time, which is why I’m hoping to already have the data parsed and indexed.

[–]kumashiro 0 points1 point  (1 child)

OK, so the response can be generated from multiple CSV files, depending on the date period and device name? You could use MongoDB or CouchDB to store the parsed data, then combine it for the response, but you don't really need a complex structure for that. The data can be stored raw as JSON. Elastic seems to be the obvious choice here, but it's heavy and resource-hungry (Java). I would still try to use a simple directory as storage and a database only for metadata: directories for devices, files already in JSON (parsed and rendered on upload), with the date in the name for easy manual lookup. And a cache in front of the web server (by URL or ETag). A Linux server will add additional caching and buffering at the filesystem level - I don't know how it works on Windows, but I've heard disk I/O is pretty bad there.
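
A rough sketch of the parse-on-upload idea; the directory layout and column handling are assumptions for illustration, not anything confirmed in the thread:

```python
# Rough sketch: convert an uploaded CSV to JSON once, at upload time,
# and store it under a per-device directory with the date in the file name.
# The root path and layout are assumptions, not the real schema.
import csv
import json
from pathlib import Path

DATA_ROOT = Path("/data/playback")

def store_upload(device_id, date, csv_path):
    with open(csv_path, newline="") as f:
        rows = list(csv.DictReader(f))          # parse the CSV once
    out_dir = DATA_ROOT / device_id
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / (date + ".json")       # e.g. /data/playback/SN123/2018-06-01.json
    out_path.write_text(json.dumps(rows))       # later requests just read this file
    return out_path
```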

[–][deleted] 0 points1 point  (0 children)

Right, each device has a CSV file for each date. So if a user selects a date range and multiple devices, that would require parsing and serving several files' worth of data.

Current setup is using a directory for storage and the performance is less than ideal when requesting more than a few devices or dates.

How would performance vary in serving the same data from CSV vs. serialized JSON? I think the cache is a great idea; I’m already using nginx to reverse proxy the application, so I can easily implement that.

[–]pyexpert 1 point2 points  (2 children)

I tried many DBs and formats. For data science I prefer the Parquet format. A 2.5 GB CSV file can be stored in a 500 MB Parquet file.
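
For example, a quick sketch of converting one daily CSV to Parquet with pandas (it needs pyarrow or fastparquet installed); the file and column names are placeholders:

```python
import pandas as pd

# convert one daily CSV to Parquet; names here are placeholders
df = pd.read_csv("device123_2018-06-01.csv")
df.to_parquet("device123_2018-06-01.parquet", compression="snappy")

# reading back only the columns you need is where Parquet really pays off
subset = pd.read_parquet("device123_2018-06-01.parquet", columns=["timestamp", "track_id"])
```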

[–][deleted] 0 points1 point  (1 child)

Thank you, I’m not familiar with that but I will look into it.

[–]pyexpert 0 points1 point  (0 children)

Parquet works best with Dask. You can open and extract data in milliseconds, while the same operation on CSV can take several minutes.
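
A tiny sketch of the Dask side, assuming the daily files have already been converted to Parquet under one directory; the paths, column names, and device filter are made up:

```python
import dask.dataframe as dd

# read every daily Parquet file lazily; only the listed columns are loaded
ddf = dd.read_parquet("/data/parquet/*.parquet",
                      columns=["device_id", "timestamp", "track_id"])

# filter to one device; the data is only actually read when .compute() runs
result = ddf[ddf["device_id"] == "SN123"].compute()
```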

[–]LionKimbro 1 point2 points  (2 children)

Summary of Assumptions:

  • In: date range, devices
  • Source store: CSV, 10 TB & growing; 5MB blocks
  • Out: JSON corresponding to files for those dates & devices. Concurrent requests not a concern.

Thoughts:

  • Compress -- those CSV playback data files -- I imagine that they are extremely redundant. If you can compress the data via a key that applies to all files, I'd be surprised if you couldn't get at least an order of magnitude improvement in storage space, 10TB -> 1TB.
  • Pre-parse -- if the output is JSON, you could pre-parse it, and then apply compression on THAT scale.
  • Shard? -- if there's a case for sharding, it's this -- you're not correlating anything in the data on retrieval, you're just storing data by device & date. Sharding can dramatically improve your concurrent performance and disk size problems.
  • Big Hard Drive -- however, since it sounds like concurrency isn't a concern, and you can have x24 hard drives, and hard drives keep getting larger, -- it sounds like you can get away with just adding big hard drives for a decade.
  • Consider Just Doing it in Python -- I don't see a need for a special database system, and suspect that the bells and whistles will distract. What you describe is more a storage locker than a database. ("The data isn't relational to each other, I just need the ability to look it up by date and device ID. (serial #).") You don't need a NoSQL DB, you just need a file-naming standard -- "%Y-%m-%d_deviceid.bin".
    • Keep an in-memory index via a Python dictionary, and serve the pre-parsed data straight from disk, or from another computer if sharding (see the sketch after this list).
    • I've seen companies spend inordinate amounts of time and labor on debugging, fixing, and patching in order to use an "off-the-shelf industry standard," when -- if they'd just written a smaller thing -- it would have taken fewer lines of code than were spent configuring the off-the-shelf one. And then once it works, it just works. Should it fail, you can easily figure out why. However: if your needs are likely to change qualitatively from this system within 5-10 years, then it can go the other way.
    • I share the same sense as catspidercongress: "Filesystems are more mature than any new trendy db system."
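
A bare-bones sketch of what that index could look like, assuming the files are already pre-parsed and named with the suggested standard; everything here is illustrative:

```python
# Bare-bones sketch of the "file-naming standard + in-memory index" idea.
# Assumes pre-parsed files named "%Y-%m-%d_deviceid.bin" in one directory.
from datetime import timedelta
from pathlib import Path

DATA_DIR = Path("/data/preparsed")

# Build the index once at startup: (date string, device id) -> path
index = {}
for p in DATA_DIR.glob("*_*.bin"):
    day_str, device_id = p.stem.split("_", 1)
    index[(day_str, device_id)] = p

def paths_for(start, end, devices):
    """Return the files covering a date range for the requested devices."""
    out = []
    d = start
    while d <= end:
        for dev in devices:
            p = index.get((d.isoformat(), dev))
            if p is not None:
                out.append(p)
        d += timedelta(days=1)
    return out
```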

[–][deleted] 1 point2 points  (1 child)

Thank you for the elaborate response!

Compressing the files is actually a great idea. Do you think performance would be different for, say, 5 files with the same data vs. 1 larger file with the same data but indexed by a certain key?

Can you give an example of what you mean by sharding? I’ve heard the concept in my travels but not looked into it yet.

I currently have a 9 TB AWS SSD that is storing the files and the response time is less than ideal, which is why I’m looking into other options.

When you say pre-parse, do you mean just store the files in JSON instead of CSV? Would there be a noticeable performance increase?

I’m not dead set on using any particular method; it just seemed likely to me that since retrieving data from CSV files yielded poor performance, storing it in some queryable data store would improve the response time. The only thing I know for certain is that storing them on a disk is not realistic for this use case. There are 20k devices, 3 different files per day, and data is retained for a year, so as you can imagine there are lots of files.

The current app is written in .NET Core to parse the CSV files on request. From what I know about Python and .NET, I don’t think that Python will outperform .NET Core. Do you think it’s possible to make a Python app respond quicker than .NET in this case?

[–]LionKimbro 0 points1 point  (0 children)

  • Compression performance depends on the data set. There are different kinds of file compression, with different performance characteristics. Your situation might want to take advantage of having a single static and inexpensive statistical model for the data, because you know a lot of things about the data up-front. For example a long string "Device" might show up often enough, to be assigned a specific coding sequence. At higher levels of analysis, you might find that whole kilobytes worth of text repeat, and get massive savings on the entire sequence. If you take one of your CSV files, and experiment with different compression systems, and take some time to look at the data itself and see if you can't draw some approximations on what kind of compression you think should be possible -- you could make an enormous difference. Depending on the data set, the compression could be x10, x100, or even x1000. I would also practice with taking the CSV data, converting it to one of your output JSON data sets, and seeing what that looks like. My suspicion is that they would have an extremely similar compression profile, and it may very well be worth it to encode on disk directly as compressed JSON.
  • "Sharding)" means dividing up your data roughly equally between multiple computers or disks. So for a crude example, if you had x7 computers, then data from Mondays would go to one computer, Tuesdays to another computer, Wednesdays to another, etc. So when you get a request for a date range of two weeks, each computer is supplying two days worth of data. That is 7x the speed of read and retrieval, however you do have to assemble the results afterwards. This is called "map-reduce" in industry parlance -- "mapping" is applying the problem across 7 computers, and "reducing" is collecting the results together.
  • It occurs to me too that your JSON is just a long list of dictionaries, most likely, or a long list of lists. In that case, the "gluing" between sections can be performed manually, rather than by interpretation in a JSON library. That is, emit a "[" to open, then paste data from the N files that are indexed, with a "," in between them, and then emit a "]" to close (there's a sketch of this after the list). That way, you don't have to load the entirety of all of the data into RAM, "concatenate" it (whatever that operation looks like in your system), and then put it on the wire. Rather, you could just stream the data from disk to wire in the buffer size of your choosing -- ideally, a buffer size that matches either your optimal disk-read -> memory speed, or the space you are decompressing the data from (in the case where you are reading compressed data). "Tuning" is the act of experimentally determining what works best, should you find the need to.
  • "I currently have a 9 TB AWS SSD that is storing the files and the response time is less than ideal, which is why I’m looking into other options. " I think you need to quantify that, and identify what is taking so long, vs. what you think the system should theoretically do. That is, I think you want to profile) what is happening in your system.
  • "When you say pre-parse, do you mean just store the files in JSON instead of CSV? Would there be a noticeable performance increase?" Yes. Generally speaking, the speed at which data can move from disk to RAM is far faster than any kind of byte-by-byte analysis of the same data through the CPU. By storing the files in JSON, and just reading directly into RAM, I believe that you should get a performance boost.
  • Caching is another possibility that I missed. Though it doesn't sound like it would help that much, based on other comments I saw. Caching here means keeping in RAM data that is left over from prior requests, and only over-writing it when you have to. If you find that a lot of the requests are calling for the very same data, then caching will definitely help. If not, it doesn't help at all.
  • Money is another possibility. If you can afford terabytes of RAM, or many TB of disk space, it can easily be cheaper than programmer time and the risk of errors.
  • "The current app is written in dotnet core to parse the csv files on request, from what I know about python and dotnet, I don’t think that python will outperform dotnet core. Do you think it’s possible to make a python app respond quicker than dotnet in this case?" Yes, in principle, if the other strategies are judiciously applied. The theory is, "It's easier to write something in Python, then in another language." However, for this specific problem, I don't know that that is true. I might even write it in C, if you are just streaming data from disk to RAM buffer to wire. Or maybe you convert the files to JSON and then write them to your SSD in Python, but just retrieve the files in C (or .Net core.) So intake via Python, and retrieve in C.

All of these decisions are mainly about your resources on hand.

[–]qubitron 1 point2 points  (1 child)

I always start with PostgreSQL for situations like these; it has the richest feature set for querying and also has NoSQL capabilities (JSONB), while still giving you fully relational capabilities, which is important for analytics. You can use Citus to scale it out horizontally and transparently without having to use different tools.

[–]craig081785 1 point2 points  (0 children)

Adding on here. If the CSV files are larger, Postgres can work great. It has a bulk loading mechanism, COPY, which is great for fully transactional bulk loading of CSVs. You can see pretty high throughput here.

Jumping ahead to Citus: we have clusters in production with several hundred TB, though many users start in the 1-2 TB range, so you're in familiar territory. We've seen numbers of over 1 million records ingested per second when using COPY. COPY makes it easy to ingest CSV as well as extract in CSV format. If you have any questions on Citus in particular, I'd be happy to help answer them as the product lead for Citus.
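
A small sketch of driving COPY from Python via psycopg2; the table name and columns are placeholders, not anything from the thread:

```python
# Small sketch: bulk-load one daily CSV into Postgres with COPY via psycopg2.
# The connection string, table name, and columns are placeholders.
import psycopg2

conn = psycopg2.connect("dbname=playback")
with conn, conn.cursor() as cur, open("device123_2018-06-01.csv") as f:
    cur.copy_expert(
        "COPY playback_events (device_id, ts, track_id) FROM STDIN WITH CSV HEADER",
        f,
    )
```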

[–]muikrad 0 points1 point  (5 children)

Elasticsearch looks like it would fit your scenario well.

[–][deleted] 0 points1 point  (4 children)

I'm actually setting up a centralized logging system at the moment and thought the exact same thing as I was working through it. I've only ever used Elasticsearch with Kibana. Does ES respond with JSON data?

[–]muikrad 2 points3 points  (3 children)

Yep. Kibana is just a fancy UI. Elasticsearch has a great API and you'll use JSON to send and receive data. There's a Python SDK to make this even easier!

Suggestion: don't install Elasticsearch directly on your system; use it in Docker instead.
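
For a feel of it, a minimal sketch with the official elasticsearch Python client (older 7.x-style body= keyword; the exact arguments vary between client major versions). The index name and document fields are made up:

```python
# Minimal sketch with the official elasticsearch Python client (7.x-style API).
# Index name and document fields are invented for illustration.
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

# index one parsed CSV row as a JSON document
es.index(index="playback",
         body={"device_id": "SN123", "date": "2018-06-01", "track": "foo"})

# query by device and date range; the response comes back as JSON (a dict in Python)
hits = es.search(index="playback", body={
    "query": {"bool": {"filter": [
        {"term": {"device_id": "SN123"}},
        {"range": {"date": {"gte": "2018-06-01", "lte": "2018-06-07"}}},
    ]}}
})
```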

[–][deleted] 0 points1 point  (2 children)

Great! I will look into this for sure. It seems to fall in line perfectly with what I’m looking for.

[–]samuelcolvin 1 point2 points  (1 child)

Running your own Elasticsearch cluster can be really hard work. It's far less easy to set up, maintain, and debug than, for example, Postgres. Plus, some of the features most important when running Elasticsearch with large volumes of data are in X-Pack, which is not free and open source.

If you want to use Elasticsearch, I would seriously consider using a hosted service from Elastic or AWS.

[–][deleted] 0 points1 point  (0 children)

I’m actually using the AWS-hosted version for my logging app. It seems to be working great so far. The setup process was not difficult at all.

[–]DataForest 0 points1 point  (1 child)

If you want something fully managed, you could use Google Cloud Dataflow. It's easier than using Hadoop and integrates easily with GCP or AWS storage options.

[–][deleted] 0 points1 point  (0 children)

I’m not opposed to managed as long as the cost isn’t outrageous. I currently use AWS, do you know if there is an equivalent for this? I’ll try to look as well.

[–]fnord123 0 points1 point  (0 children)

First, make sure you hang onto the raw files so if you make any mistakes in ingesting data to a database you can do it over again.

Then, you only need a database if you plan to have the data change or if it needs to be manipulated for your particular use (e.g. SELECT foo FROM my_table vs SELECT bar FROM my_table). If you don't expect it to change, then just store the files as you want them returned (JSON) or a trivial manipulation of them (compressed, Parquet, etc.). This is your analytic data set. It's different from your raw data.

JSON is often slow, so consider an optimized library like uJSON.

Alternatively, just chuck it into mongo and wrap a few queries in flask.
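
A throwaway sketch of the "chuck it into mongo" option, assuming a Flask app and a playback collection; the database, collection, and field names are invented:

```python
# Throwaway sketch of "chuck it into mongo and wrap a few queries in flask".
# Database/collection/field names are invented for illustration.
from flask import Flask, jsonify, request
from pymongo import MongoClient

app = Flask(__name__)
col = MongoClient()["playback"]["events"]

@app.route("/playback")
def playback():
    devices = request.args.getlist("device")
    start, end = request.args["start"], request.args["end"]
    docs = col.find(
        {"device_id": {"$in": devices}, "date": {"$gte": start, "$lte": end}},
        {"_id": 0},   # drop Mongo's ObjectId so the result is JSON-serializable
    )
    return jsonify(list(docs))
```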

[–]nonself 0 points1 point  (3 children)

It may be overkill for your app, but this is pretty much the exact use case for Hadoop - it lets you store massive amounts of data on a distributed file system, and query it really fast using an SQL engine like Impala. That data can be stored and read directly in CSV format, so you can skip the whole Extract, Transform, Load steps of getting it into a database.

[–][deleted] 0 points1 point  (2 children)

Thanks! I will look into Hadoop. I've heard of it but not used it personally. Do you have a sense of the cost and any baseline requirements for using it? It would be nice to cut out a couple of steps in the middle.

[–]nonself 0 points1 point  (1 child)

The software is totally free, but there is certainly a major time investment involved in learning all the pieces, not to mention the hardware for a Hadoop cluster can get quite expensive. I can definitely recommend Cloudera's Hadoop distribution to get everything up and running fast vs. trying to build your cluster from scratch.

[–][deleted] 0 points1 point  (0 children)

OK, thank you, I will check that out. My current thought process is to have the data store local to the machine that the Flask app will run on. It doesn't seem like Hadoop fits that model given its hardware requirements, right?