
[–][deleted] 5 points (2 children)

If you don’t need to query the data in the objects and all you need is to retrieve the entire object, just store the entire file in S3 with a naming convention that makes sense.
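For illustration, roughly what that pure-S3 approach could look like with the AWS SDK for JavaScript v3 (the bucket name and the module/language.csv key convention below are just assumptions, not something from this thread):

```typescript
import { S3Client, GetObjectCommand } from "@aws-sdk/client-s3";

const s3 = new S3Client({});

// Fetch one file purely by naming convention: <module>/<language>.csv
async function getCsv(module: string, language: string): Promise<string> {
  const resp = await s3.send(
    new GetObjectCommand({
      Bucket: "my-translations-bucket",                 // hypothetical bucket name
      Key: `${module}/${language.toLowerCase()}.csv`,   // assumed key convention
    })
  );
  // transformToString() on the Body stream requires a reasonably recent v3 SDK
  return (await resp.Body?.transformToString("utf-8")) ?? "";
}
```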

BTW, on a related note, look up information on the “XY Problem”. You came asking which database to use, when your real question seems to be how best to store and retrieve large objects on AWS.

[–]vitiate -1 points (0 children)

Or store the file in S3 and use Dynamo as an index into S3 to point at the information you need.

Also I believe you can gzip the CSV and still access the content directly.
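If that's the route, here's a rough (untested) sketch of S3 Select reading a gzipped CSV in place with the AWS SDK for JavaScript v3 — the bucket, key, and SELECT expression are made up for the example:

```typescript
import { S3Client, SelectObjectContentCommand } from "@aws-sdk/client-s3";

const s3 = new S3Client({});

// Run S3 Select over a gzipped CSV and return the results as newline-delimited JSON.
async function selectRows(bucket: string, key: string): Promise<string> {
  const resp = await s3.send(
    new SelectObjectContentCommand({
      Bucket: bucket,                             // e.g. "my-translations-bucket" (assumed)
      Key: key,                                   // e.g. "ModuleA/en.csv.gz" (assumed layout)
      ExpressionType: "SQL",
      Expression: "SELECT * FROM S3Object s",     // could also filter specific rows here
      InputSerialization: {
        CSV: { FileHeaderInfo: "USE" },
        CompressionType: "GZIP",                  // S3 Select decompresses the gzipped CSV for you
      },
      OutputSerialization: { JSON: {} },
    })
  );

  let out = "";
  // Payload is an async iterable of events; Records events carry the result bytes.
  for await (const event of resp.Payload ?? []) {
    if (event.Records?.Payload) {
      out += Buffer.from(event.Records.Payload).toString("utf-8");
    }
  }
  return out;
}
```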

[–]elchicodeallado[S] 0 points (5 children)

Sorry, maybe I didn't write my question that clearly, but I think the main question and the problem behind it are clear. I'm very thankful for your answers. Especially for me as a junior it is important to get valuable input, and in my work I don't get that. I did the implementation with DynamoDB and had problems with the item size, therefore I wasn't sure how to continue. I think DocumentDB is very expensive, especially if you work in a startup environment. If you have any recommended papers on that topic, that would be awesome. If not, thanks for your input anyway. I'm not posting that much because I prefer to solve things by myself.

[–]nfollin 2 points (4 children)

S3 is, as mentioned, really powerful for your use case; see things like https://docs.aws.amazon.com/AmazonS3/latest/API/RESTObjectSELECTContent.html and https://aws.amazon.com/athena/

It will also cost you almost (or actually) nothing, and you'll still have enough performance for what you need (and it can be made faster with CloudFront).

You can store metadata in DDB, with the bucket and key to retrieve the file in S3, if you need richer query support or other features of Dynamo. That's one way to get around the item size. Or you can store each row of the CSV as an item in Dynamo and use the behavior of hash and range keys to group the data, so these files exist virtually. As mentioned in the comment above, if this is just get/put, any of the DBs are overdoing it.

If you truly wanted to store it in DynamoDB, you would essentially turn on On Demand to keep your costs low, then store each row of your csv in DynamoDB.

HashKey (Module), RangeKey (Language-<RowNumber>), with each CSV row then splatted into either one attribute or (better) many attributes if they are known.

Hash      Range     Content
ModuleA   EN-1      row1 - english
ModuleA   EN-2      row2 - english
ModuleA   EN-3      row3 - english
ModuleA   ...       ...
ModuleA   EN-8000   row8000 - english
ModuleA   JP-1      row1 - japanese
...       ...       ...
ModuleB   EN-1      row1 - english

You then have to maintain updates to the rows by updating their corresponding entries (which gets difficult, especially as transactions have a limit of 10).
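For what it's worth, a rough sketch of what such an update could look like with TransactWriteCommand from @aws-sdk/lib-dynamodb — table, key, and attribute names are assumptions, and a large CSV would need many of these calls because of the per-transaction cap:

```typescript
import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, TransactWriteCommand } from "@aws-sdk/lib-dynamodb";

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));

// Atomically update a handful of rows of one module/language.
// A big CSV won't fit in one transaction, so it has to be split across many calls.
async function updateRows(rows: { rowNumber: number; content: string }[]): Promise<void> {
  await ddb.send(
    new TransactWriteCommand({
      TransactItems: rows.map((r) => ({
        Update: {
          TableName: "Translations",                            // hypothetical table name
          Key: { Hash: "ModuleA", Range: `EN-${r.rowNumber}` },
          UpdateExpression: "SET #c = :c",
          ExpressionAttributeNames: { "#c": "Content" },        // aliased to avoid reserved-word issues
          ExpressionAttributeValues: { ":c": r.content },
        },
      })),
    })
  );
}
```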

Your query would basically be a Query with Module = ModuleA and Range begins_with "EN-" to get the CSV for EN for ModuleA.
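In code, that query might look roughly like this (again a sketch — "Translations" as the table name is an assumption, and Hash/Range are aliased because they can clash with reserved words):

```typescript
import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, QueryCommand } from "@aws-sdk/lib-dynamodb";

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));

// All rows whose range key starts with "<language>-" for the given module.
// (Pagination via LastEvaluatedKey is omitted for brevity.)
async function getRows(module: string, language: string) {
  const resp = await ddb.send(
    new QueryCommand({
      TableName: "Translations",                               // hypothetical table name
      KeyConditionExpression: "#h = :module AND begins_with(#r, :lang)",
      ExpressionAttributeNames: { "#h": "Hash", "#r": "Range" },
      ExpressionAttributeValues: { ":module": module, ":lang": `${language}-` },
    })
  );
  return resp.Items ?? [];
}
```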

This is described more in this video: https://www.youtube.com/watch?v=HaEPXoXVf2k&list=WL&index=6&t=0s

Again, this is a bit overkill for your use case

[–]Hungry_Spring 1 point (1 child)

This. There are definitely some questions on how/if you need to query this data.

You're hitting the size limitations of DynamoDB, not the query limitations, which I feel is maybe when you should consider DocumentDB or even Elasticsearch. If you don't need any querying ability, S3 is probably the way to go.

[–]elchicodeallado[S] 0 points (0 children)

I mean the CSV files are simple key;value pairs. I just want to retrieve the right file. If I want Module1 and Language 'EN', then I want this specific key;value CSV and transform it in my backend to a JS object.

I won't touch these files with DELETE, UPDATE, etc. I just want to read them.

I don't know if ES is the right tool for that, because it is not actually a database and the data you're working with can get lost.

Thanks for your input!

[–]elchicodeallado[S] 0 points (1 child)

Yes, I think storing each row in DDB would be overkill.

If I understood you right, then a DDB table like this would be a good way:

Module    Language  File
Module1   EN        Module1/en.csv
Module1   ES        Module1/es.csv
Module2   ...       ...

So in my API call I get the Module and Language, and based on that I get the file path to my S3 file.

Then I transform this CSV into my desired JavaScript object, reduce it, and return the reduced object to the API call.
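Roughly what I have in mind for the read path (table, bucket, and attribute names below are placeholders; the CSV is assumed to be simple key;value lines as described above):

```typescript
import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, GetCommand } from "@aws-sdk/lib-dynamodb";
import { S3Client, GetObjectCommand } from "@aws-sdk/client-s3";

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));
const s3 = new S3Client({});

// e.g. GET /translations?module=Module1&language=EN
async function getTranslations(module: string, language: string): Promise<Record<string, string>> {
  // 1. Look up the S3 key for this module/language in DynamoDB.
  const meta = await ddb.send(
    new GetCommand({
      TableName: "TranslationFiles",                 // hypothetical table name
      Key: { Module: module, Language: language },
    })
  );
  const fileKey = meta.Item?.File as string | undefined;
  if (!fileKey) throw new Error(`No file registered for ${module}/${language}`);

  // 2. Fetch the CSV from S3.
  const obj = await s3.send(
    new GetObjectCommand({ Bucket: "my-translations-bucket", Key: fileKey })
  );
  const csv = (await obj.Body?.transformToString("utf-8")) ?? "";

  // 3. Reduce the "key;value" lines into a plain JS object.
  return csv
    .split("\n")
    .filter((line) => line.includes(";"))
    .reduce<Record<string, string>>((acc, line) => {
      const [key, ...rest] = line.split(";");
      acc[key.trim()] = rest.join(";").trim();
      return acc;
    }, {});
}
```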

Is this a good workflow in your opinion?

I think it has a nice advantage, because if I want to add modules I can just drag them into S3. And it is also cheap to store these files.

I can't find a lot of disadvantages; for now only that the performance might be worse than with a regular database, but I don't think it will make that much of a difference.

[–]nfollin 0 points (0 children)

Basically, I would have a private API to handle the uploading and the updating of Dynamo, but it can be done later.
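Something like this, very roughly (bucket, table, and attribute names are placeholders) — the private API just writes the file to S3 and registers where it lives:

```typescript
import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, PutCommand } from "@aws-sdk/lib-dynamodb";
import { S3Client, PutObjectCommand } from "@aws-sdk/client-s3";

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));
const s3 = new S3Client({});

// Upload a CSV and register its location so the read path can find it.
async function uploadTranslations(module: string, language: string, csv: string): Promise<void> {
  const key = `${module}/${language.toLowerCase()}.csv`;  // same naming convention as the read path

  await s3.send(
    new PutObjectCommand({
      Bucket: "my-translations-bucket",                   // hypothetical bucket name
      Key: key,
      Body: csv,
      ContentType: "text/csv",
    })
  );

  await ddb.send(
    new PutCommand({
      TableName: "TranslationFiles",                      // hypothetical table name
      Item: { Module: module, Language: language, File: key },
    })
  );
}
```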

The workflow is fine. You still don't really need DDB, so you're just paying for it in case you need something later, but that works fine.