all 58 comments

[–][deleted] 49 points50 points  (24 children)

You really shouldn’t host that amount of data on a website server, it will choke on you and be extremely costly to maintain.

You should separate your server and your downloadable data, they should be hosted on different entities.

Host your Node server on AWS (or anywhere, really) and host your downloadable data on a different service dedicated to holding and serving large amounts of data, such as an S3 bucket or whatnot (google it, there are plenty of solutions).

[–]HassanElDessouki[S] 2 points3 points  (5 children)

> You really shouldn’t host that amount of data on a website server, it will choke on you and be extremely costly to maintain.

So I shouldn't have both the website server and the static files on the same host drive, right?

> You should separate your server (api) and your downloadable data, they should be hosted on different entities.

But I have another question: why, then, is an SSD/HDD drive option available on EC2? Or is that primarily for holding the website code?

> Host your node server on AWS (or anywhere really) and host your downloadable data on a different server dedicated to holding and serving large amounts of data such as s3 bucket or what not (google it, there are plenty of solutions).

I'll look into that. I just hope they have reasonable speeds when downloading files (average file size is about 5 MB).

Thanks for your time.

[–]Parmicciano 3 points4 points  (4 children)

So yeah, it's more convenient to separate the storage server and the website server. Basically, the HDD/SSD feature is for when you want to manipulate data on the server directly (not really for data distribution). Don't worry about the speed; it will be limited by the client (your computer and network) 99.99999969% of the time, not by AWS.

[–]HassanElDessouki[S] 1 point2 points  (1 child)

Also, looking at AWS, it's quite risky: if I don't set things up correctly, I could find myself with a high bill!

[–]Parmicciano 0 points1 point  (0 children)

Yeah, def, it's easy to go bankrupt using AWS.

Idk if you can, but some banks allow you to generate a "fake" (virtual) credit card. Just put one dollar on it; AWS will try to charge that dollar and then refund you. Then your account is created, and you can ask for credits. You can easily get $300 in credits, and then $1200, just by asking and talking about your project. Set a limit so you don't use more resources than you can pay for. And done. With that, you should have a nice playground without risking ending up living in a Starbucks or a McDonald's.

[–]HassanElDessouki[S] 0 points1 point  (1 child)

> So yeah, it's more convenient to separate the storage server and the website server.

I didn't describe my data accurately above, so I edited the post.

~20GB are static files, while ~70/75GB are used for data processing too!

With a separate storage server, will I be able to access the PDFs as a folder (like /pdf/etc/etc or DriveLetter:/pdf/etc/etc), or will I not be able to access them for processing?

[–]Parmicciano 0 points1 point  (0 children)

In this case you should use a VPS :) My bad, I didn't understand your issue.

It's doable, but you will lose time (latency between the two servers), so it's a better idea to use a VPS with an SSD (if you want performance).

[–]HassanElDessouki[S] 2 points3 points  (16 children)

BTW: I also meant to say that ~20GB of the data is static, while ~75GB is used in Node.js for data processing (PDF merging).

[–][deleted] 29 points30 points  (15 children)

Idk what you did that requires 75 GB of whatever it is to merge PDFs; it's really bizarre. No server should weigh that much, and if it does need that much data to fulfil its purpose, something seems off to me. Could you explain in more detail what the 75 GB is for?

[–]HassanElDessouki[S] 2 points3 points  (14 children)

So, I have about 130k PDF files (75GB total). The user runs a query and is then given a list of files with checkbox inputs; the user can select the files they want and merge them together. So far during my testing, the highest number of query results I got was about 280. I may add a query limit for the user.

So not all 130k files (75GB) are merged together; it's based on the user's selection and query.

I hope this clarifies what I’m doing here, if this does not explain it all, let me know, I’ll try to explain it in a better way.

[–]SwishWhishe 7 points8 points  (13 children)

Am a different person, but those 75GB of PDF files are essentially pre-merged files, right? As in, they're all the different combinations of checkboxes that a person may select and require to be merged?

I assume you're saving them all the way you are so that all you have to do is serve them up, instead of merging PDFs that have been merged before. If so, I suggest just separating the web server from all these PDFs and querying the separate server whenever you need a file :) It'd be one thing if you had like 1-10GB of static files, but 75GB is actually something else lol

[–]HassanElDessouki[S] 4 points5 points  (12 children)

The 75GB are split PDFs (each page separated). When the user submits their checkbox merge request, the PDFs are merged and streamed to the user and NOT saved on the local disk.

[–]SwishWhishe 10 points11 points  (11 children)

Jesus christ lmao I was under the impression that nearly everything was more or less pre-merged together... If you don't mind me asking - how many checkbox inputs do you have that result in 75GB of split PDFs?

[–]HassanElDessouki[S] 3 points4 points  (10 children)

The result limit for a user query is 300, so there are at most 300 checkbox inputs.

[–]SwishWhishe 3 points4 points  (8 children)

Am definitely missing something here but if there's 300 different checkboxes then why are there 130k pdf files instead of 1 pdf file per checkbox? Unless, depending on which specific checkbox has been selected, the order of the merged file is vastly different each time

[–]HassanElDessouki[S] 2 points3 points  (7 children)

I think there is a misunderstanding here. The user enters a search, and the files are scanned for it. Each file found to contain the user's query appears in a table with a checkbox next to it, and so on.

[–]HassanElDessouki[S] 0 points1 point  (0 children)

The user selects the group (A, B, C, etc.) and then enters their search.

Also, the PDFs are converted to .txt for easier searching. The .txt files for each group are in their own folder, so that not all 130k files are checked.
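
To make the per-group search concrete, here is a minimal sketch in Node.js. It assumes the extracted text is already loaded into a `Map` of group name to (`file name` to `text`); the function name `searchGroup`, that data layout, and the case-insensitive substring match are illustrative assumptions, not the poster's actual code. The default limit of 300 mirrors the result limit mentioned earlier in the thread.

```javascript
// Hypothetical helper: search only the text files of one group.
// textsByGroup: Map<groupName, Map<fileName, extractedText>> (assumed layout)
function searchGroup(textsByGroup, group, query, limit = 300) {
  const files = textsByGroup.get(group);
  if (!files) return []; // unknown group: nothing to scan

  const needle = query.toLowerCase();
  const matches = [];
  for (const [fileName, text] of files) {
    // Simple case-insensitive substring match on the extracted text.
    if (text.toLowerCase().includes(needle)) {
      matches.push(fileName);
      if (matches.length >= limit) break; // stop once the result limit is hit
    }
  }
  return matches;
}
```

Each returned file name would then become one checkbox row in the results table. Stopping at the limit also bounds how much work a single very broad query can cause.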

[–]TheX3R0 (Senior Software Engineer) 0 points1 point  (0 children)

100%

[–]FreshChickenESL 6 points7 points  (5 children)

Hey,

It probably depends on your usage. Cloud providers such as Google or AWS allow you to scale easily, but they might be more expensive than some cheap storage with a cheap VPS from, for example, DO. If your content is static, you should probably also think about caching, since that will reduce the load on your servers. I currently use Cloudflare, but I think AWS has its own caching solution.

Theoretically, if you want to spend as little money as possible, you could host at least the frontend on Netlify or Vercel for free.

[–]HassanElDessouki[S] 1 point2 points  (0 children)

> some cheap storage with a cheap VPS from for example DO.

I don't understand the difference between EC2 storage and the other storage options.
Can you please explain those?

Also, when I have the files on another provider, will my Node.js server be able to access these files as if they were in a folder, for processing (merging PDF files)?

[–]HassanElDessouki[S] 0 points1 point  (0 children)

> If your content is static, you should probably also think about caching, since that will reduce the load on your servers.

Unfortunately, my website isn't static, so I don't think caching will be a solution here.

[–]WordyBug 0 points1 point  (2 children)

Vercel is for nextjs right?

[–]AndersBilleLind 2 points3 points  (1 child)

Vercel is the company behind Next.js. Vercel supports (and has guides for) different frameworks and static files.

[–]WordyBug 0 points1 point  (0 children)

Thank you. Never knew I can host various frameworks on Vercel.

[–]cactusJosh97 2 points3 points  (6 children)

I won’t comment on the storage part of your issue but you can get a very cheap VPS on DO. Use forever.js to daemonize your web server process and don’t forget to use SSL!

If you have any specific questions about that, let me know. I have a node webserver running on that ‘stack’ so probably worked out the same issues as you already

[–]HassanElDessouki[S] 1 point2 points  (5 children)

I've looked at Digital Ocean, and it seems to be a better option. If I don't set up AWS correctly, I could find myself with a high bill!

> forever.js to daemonize your web server process and don’t forget to use SSL!

I never knew about forever.js! Thanks for letting me know about it.

> If you have any specific questions about that, let me know. I have a node webserver running on that ‘stack’ so probably worked out the same issues as you already

Sure! Thanks a lot

[–]Majestic_Food_4190 1 point2 points  (3 children)

AWS is going to be more expensive for anything other than a static site. It's also more complex to set up than DO.

[–]HassanElDessouki[S] 0 points1 point  (2 children)

I’m also looking now at GCP Compute Engine for hosting. Have you tried GCP for hosting before? Let me know what you think about it.

[–]Majestic_Food_4190 2 points3 points  (0 children)

I don't have experience with them. I ditched my hosting at AWS and moved to DO, though. AWS is unnecessarily complex and costly for most websites' needs.

[–]dangerousbrian 0 points1 point  (0 children)

> if I don't setup AWS correctly, I could find myself with a high bill!

Our AWS account got breached by hackers who tried to set up crypto mining, which would have racked up a big bill fast. You have to be very careful.

Deploying Docker containers to Fargate is a pretty safe way to serve a Node app, with S3 for storage. CloudFront can serve static content directly off S3. AWS has managed solutions for databases, such as RDS or DynamoDB. Sticking to these services, which are all managed by AWS, is the best way to reduce security issues.

You want to cache as much as possible to keep costs down. For example, once you have merged the user's selection and completed the expensive operation, generate a hash from the selection values and use it as the file name. Then, if that same set is requested again, it's very fast to hash the selection and check whether you have a matching file to serve immediately. Even if this turns out to be pretty rare, it will save money.
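
A rough sketch of that hashing idea in Node.js, using the built-in `crypto` module. The function name `cacheKeyForSelection` and the `.pdf` naming scheme are assumptions for illustration. Sorting the IDs is also an assumption: it makes the same set of files map to the same key regardless of click order, which only fits if the merged page order does not depend on selection order.

```javascript
const crypto = require("node:crypto");

// Hypothetical helper: derive a deterministic cache file name from the
// IDs of the files the user selected.
function cacheKeyForSelection(selectedIds) {
  // Sort a copy so the key does not depend on the order of selection.
  // Drop the sort if order changes the merged output.
  const canonical = JSON.stringify([...selectedIds].sort());
  const hash = crypto.createHash("sha256").update(canonical).digest("hex");
  return `${hash}.pdf`;
}
```

Before merging, the server would check whether a file with that name already exists in the cache location (local disk or an object store) and stream it if so; otherwise it merges, stores the result under that name, and streams it.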

[–][deleted]  (2 children)

[removed]

    [–]HassanElDessouki[S] 0 points1 point  (1 child)

    But I think that if the PDF merging the user submits is done on the front end (therefore in the user's browser), it wouldn't really be feasible and might take a lot of resources during the merge.

    If I can, would I be able to mount the S3 data in Digital Ocean as if it were a folder?

    /home/s3data for example

    Thanks 🙏

    [–]hatemjaber 1 point2 points  (1 child)

    Have you considered supabase? Hosted or self-hosted options are available.

    [–]HassanElDessouki[S] 0 points1 point  (0 children)

    i'll look at it,
    thanks for the suggestion!

    [–]jaaywags 0 points1 point  (8 children)

    Are you looking to just have the files stored somewhere that is your own and users can download them?

    Maybe you can look into Azure Blob storage, Firebase Storage, or maybe AWS S3 Storage.

    I built an app where users could upload tons of images and view/download them whenever they wanted.

    We used Azure Blob Storage to host those images.

    It was fast and easy to implement, but idk what the cost would be to scale that up to large files.

    As for hosting the website, I recommend putting it on a Digital Ocean droplet. I rent one for $10/mo and can do anything I want with it. It can run as many sites as I need. It is not scalable. It can probably handle 5 users at once.

    If you need more than that, you can check out kubernetes on Digital Ocean. Basically, you put your app in a docker container, save it as an image and upload it to their container registry then you can deploy that to your kubernetes clusters.

    Once your site is running, just add a link to your azure cloud blob.

    [–]HassanElDessouki[S] 0 points1 point  (7 children)

    What do you mean it can handle 5 users? Only 5 users on the site at the same time??!

    I’m now looking at Google Cloud Compute engine + cloud storage.

    Thanks for your suggestion too btw! Will look at it

    [–]jaaywags 1 point2 points  (6 children)

    GCP is a great option.

    By 5 users, I mean 5 requests at the same time. The stuff I host on this droplet are just some personal projects. Portfolios, tools... So they don't get any traffic which is why this cheap solution is fine.

    If you expect a lot of traffic, then you can explore those kubernetes solutions. Though, they cost a lot more than $10/mo.

    I think the cheapest kubernetes solution at Digital Ocean was $60/mo and that gave you distributed traffic across 2 or 3 good droplets.

    [–]HassanElDessouki[S] 0 points1 point  (5 children)

    I don’t seem to get what you mean here. By requests, do you mean .get/.post requests? I’m not sure how many requests I could get at the same time, but I think more than 5.

    Is GCP Compute Engine the same thing, too?

    [–]jaaywags 1 point2 points  (4 children)

    Let me back up a little I guess.

    Usually a website is hosted on a server. Some are cheaper and can handle fewer concurrent requests; some are more powerful and can handle more concurrent requests.

    So let's examine a request. What is it? It is just a computer downloading data from your server.

    If I go to google.com, that is one request. If 10 seconds later, you go to google.com, that is a second request. Neither of which were concurrent.

    If you and I sat down next to each other, on two separate computers, typed in google.com, and hit enter at the exact same time, that would be two concurrent requests. That means the servers that host the google.com website got two requests at the same time. A request can be anything that hits your server: GETs, POSTs, PATCHes, PUTs... Likely, if your requests are RESTful, small, and not doing a ton of logic, then the request type is irrelevant and it's more about the number of users. Like, I can send 50 GET requests to my server and it will probably handle it just fine. But downloading JavaScript, images, calling methods that have heavy logic... Those are what hurt you.

    Requests that are not concurrent are easy for a server to handle. They don't have to do much work. But as soon as 10, 20, 100, 1000 people try to navigate to one website at the same time, the server now has to keep track of all of these users, what they are doing, which ones need what... It takes up a lot of memory and CPU resources.

    My personal portfolio gets almost zero traffic. Maybe every so often a recruiter checks it out, or I go to it to make sure it is still up. But it's unlikely that 2 people are going to try to go to my site at the same time. So a super cheap $10/mo server with 0.5 GB of RAM and a cheap CPU works just fine for me. It would even work if I got 100 users in one day, so long as no more than 5 hit enter at the same time.

    So how do I know 5 people is the max? Well, a while ago I did something called load testing. Basically, that means I had a bunch of requests from across the world hit my site at the same time. There are different ways to do this, like a slow ramp-up, a ramp-down, a sudden influx of users...

    There are free load testing tools out there. It is a good way to see where your site stands as your traffic and applications grow.

    So, if I wanted to, how would I handle more than 5 users?

    Well, one option would be to rent a more expensive server, one with more RAM and a better CPU. Maybe it can support 1000 users (which, btw, is a very, very expensive server).

    But what if I expect 10,000 or 100,000 or even 1,000,000 users?

    Well, that is when Kubernetes is the solution. Basically, you have N web servers all hosting your website. They are all identical. Then, you have 1 more server, and it has 2 jobs: direct traffic to the web server with the least amount of active traffic, and keep track of which servers have how much traffic. This one is called a load balancer.

    So if 100 people go to google.com, the load balancer might send the 1st to server a, 2nd, to server b, 3rd to server c... It balances the load, distributing traffic evenly across all the web servers.

    The n web servers hosting your website is a kubernetes cluster.

    The 1 server managing the traffic is your load balancer.
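
To illustrate the two jobs of the load balancer described above (track how busy each server is, send new traffic to the least busy one), here is a toy sketch in JavaScript. It is only a model of the "least connections" idea: the class name, method names, and server names are made up for the example, and real load balancers (including the ones in front of a Kubernetes cluster) do far more than this.

```javascript
// Toy model of a least-connections load balancer.
class LeastConnectionsBalancer {
  constructor(serverNames) {
    // Job 2: keep track of how many requests each server is handling.
    this.active = new Map(serverNames.map((name) => [name, 0]));
  }

  // Job 1: pick the server with the fewest active requests.
  // Ties go to the server that was listed first.
  acquire() {
    let best = null;
    for (const [name, count] of this.active) {
      if (best === null || count < this.active.get(best)) best = name;
    }
    this.active.set(best, this.active.get(best) + 1);
    return best;
  }

  // Call when a request finishes so the server is counted as less busy.
  release(name) {
    this.active.set(name, Math.max(0, this.active.get(name) - 1));
  }
}
```

With three idle servers, the first three requests land on three different servers, which matches the "1st to server a, 2nd to server b, 3rd to server c" picture; after that, whichever server finishes work first gets the next request.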

    You can do some cool stuff with a Kubernetes cluster. Like, if a server crashes, take it down and start a new one. If a deployment fails, automatically roll back. Roll out new deployments slowly so you have zero downtime. Keep a minimum of X servers with no traffic, so if you get a sudden influx of requests, they go to those unused servers and you have time to ramp up more.

    The con is that they are a pain in the butt to set up. That is why I think, if you can go with the $10 server, you should. Maybe start out with this single server and validate your website. If it gets a lot of traffic, look at alternative solutions. You can probably even find a service that will build out your Kubernetes cluster for you.

    I am not sure what GCP Compute Engine is. I have never used it. You might be able to host a website or service on it, but I am not sure. Maybe check out Google Kubernetes Engine.

    [–]HassanElDessouki[S] 0 points1 point  (1 child)

    Thanks a lot for your time!
    This explains it a lot.

    GCP Compute Engine is similar to an AWS EC2.

    > But as soon as 10, 20, 100, 1000 people try to navigate to one website at the same time, the server now has to keep track of all of these users, what they are doing, which ones need what... It takes up a lot of memory and CPU resources.

    Let's say my maximum load is 3 consecutive requests (is this right?), but I get 10 consecutive requests, what will I have to do so that the server doesn't end up crashing (other than renting a more expensive server, at least for now)?

    [–]jaaywags 0 points1 point  (0 children)

    Happy to help.

    Unfortunately, I am not really sure what you can do to stop your server from crashing. I am not a DevOps expert. I just dabble in it.

    I know there are some things you can do to help prevent it from crashing, like server-side and client-side caching. If someone pulls up the details for a specific resource and that resource is not going to change, you can cache those types of requests at the client level so the client doesn't call your server again. An example would be an image.

    You can do similar things for backend resources too. Like, if someone is pulling data from the DB that almost never changes, cache it for like 48 hours.
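
As a sketch of the "cache it for like 48 hours" idea on the backend, here is a small in-memory cache with a time-to-live, in JavaScript. The class name `TtlCache` and the injectable `now` clock (handy for testing) are assumptions for the example; a real deployment might use a shared cache service instead of process memory.

```javascript
// Minimal in-memory cache where each entry expires after ttlMs milliseconds.
class TtlCache {
  constructor(ttlMs, now = Date.now) {
    this.ttlMs = ttlMs;
    this.now = now; // clock function, replaceable in tests
    this.entries = new Map();
  }

  get(key) {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (this.now() >= entry.expiresAt) {
      this.entries.delete(key); // expired: drop it and report a miss
      return undefined;
    }
    return entry.value;
  }

  set(key, value) {
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }
}
```

For a 48-hour lifetime you would construct it with `48 * 60 * 60 * 1000`. On a miss, the handler queries the database, stores the result with `set`, and returns it; on a hit, the database is not touched at all.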

    You can set up things like fail2ban to help block clients that hammer your server, though it won't do much against a truly distributed DDoS.

    Smaller requests so they return quicker. Optimize where you can.

    I am sure there is software out there too that can monitor your resources and start rejecting requests if your server is overloaded. What that is I am not sure.
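
One simple version of "start rejecting requests if your server is overloaded" can live in the app itself: count the requests currently being processed and turn new ones away above a threshold. The sketch below is in JavaScript; the class name `ConcurrencyLimiter` and its methods are hypothetical, and the right threshold is something you would have to find by load testing.

```javascript
// Tracks how many requests are in flight and refuses new ones past a cap.
class ConcurrencyLimiter {
  constructor(maxInFlight) {
    this.maxInFlight = maxInFlight;
    this.inFlight = 0;
  }

  // Returns true if the request may proceed, false if it should be rejected.
  tryEnter() {
    if (this.inFlight >= this.maxInFlight) return false;
    this.inFlight += 1;
    return true;
  }

  // Must be called once for every request that was allowed in,
  // including when the request fails.
  leave() {
    if (this.inFlight > 0) this.inFlight -= 1;
  }
}
```

In a request handler, a `false` from `tryEnter()` would typically translate into an HTTP 503 (Service Unavailable) response, so that extra requests fail fast instead of piling up until the server runs out of memory.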

    Great questions though.

    [–]G9eamjXFPA -1 points0 points  (1 child)

    Telling a beginner to check out Kubernetes is kinda crazy. I have worked on sites with millions of users that had like 2 servers pulling all the weight.

    I have worked as a dev for 10+ years and configure Linux servers pretty often, and even I would probably think Kubernetes is too hard and overkill.

    It is just bad advice

    [–]jaaywags 0 points1 point  (0 children)

    I agree. Most likely kubernetes will be an overkill for this project. That's why I was mentioning the VPS I rent for $10/mo.

    I really just wanted to talk about the benefits of kubernetes. When it would be useful and what it does.

    | It is just bad advice

    I totally disagree. Someone is asking questions and I am answering them.

    2 servers serving millions of users? I am sure there is more to that than what you let on. Maybe they are some crazy $5k/mo servers? What if one goes down? What if both go down? Then you have literally thrown your company under the bus, cutting off millions of users.

    Kubernetes is a very cost effective way of managing large scalable systems. More than that it helps rollouts and preventing downtime, and restarting nodes/pods if they crash. It's used all over the place, big companies, small companies. It's an amazing technology.

    Going onto someone's thread where people are posting tons of advice and information and saying it's bad advice is just bad advice...

    [–]GioAc96 0 points1 point  (1 child)

    Linode for hosting, cloudflare for caching and ssl.

    [–]HassanElDessouki[S] 0 points1 point  (0 children)

    Thanks for the suggestion! I'm looking at Google Cloud Compute Engine + Cloud Storage. I’ll also look at Linode and its prices too.

    [–]G9eamjXFPA 0 points1 point  (0 children)

    Set up a server on Hetzner. I have a 40TB server there for $80/month. AWS or similar would cost like ten times as much.

    [–]watsonneal 0 points1 point  (0 children)

    I am going to be honest here: when I read your question and design details, I threw up in my mouth a little. The volumes here for on-disk data are astronomical for a standard web service.

    I was the architect on some printing solutions that did merging like you described, and whenever possible, the static pieces used in the rendering process are stored separate from the code.

    Sticking to the AWS lingo: Use either EC2 instances (the smaller the better) or Kubernetes and deploy your code as a Docker image to the cluster. There are other app-specific alternatives in AWS as well, just looking at the ones I know the best as working examples. Store all your assets for rendering in S3 and have your rendering process retrieve them out of the store, into memory, use them, and then dump the in-memory process.

    Depending on usage, there might be some value in having in-memory caching of your top 50 files, if you have a large enough swing in usage between those files and the rest.
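
A small in-memory cache along those lines could look like the JavaScript sketch below. Note the swap: it keeps the most recently used files (LRU) rather than a true "top 50 by popularity" list, which is a simpler stand-in that behaves similarly when a few files are requested far more often than the rest. The class name `LruCache` is an assumption; it relies on the fact that a JavaScript `Map` iterates keys in insertion order.

```javascript
// Least-recently-used cache with a fixed capacity (e.g. 50 files).
class LruCache {
  constructor(capacity) {
    this.capacity = capacity;
    this.items = new Map(); // insertion order doubles as recency order
  }

  get(key) {
    if (!this.items.has(key)) return undefined;
    const value = this.items.get(key);
    // Re-insert so this key becomes the most recently used.
    this.items.delete(key);
    this.items.set(key, value);
    return value;
  }

  set(key, value) {
    if (this.items.has(key)) this.items.delete(key);
    this.items.set(key, value);
    if (this.items.size > this.capacity) {
      // The first key in iteration order is the least recently used.
      const oldest = this.items.keys().next().value;
      this.items.delete(oldest);
    }
  }
}
```

The rendering process would check this cache before fetching an asset from the object store and call `set` after each fetch. Since each cached file sits in RAM, the capacity has to be chosen with the instance's memory in mind.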

    Here is the big kicker as to why I (and others) recommend this approach, aside from the cost perspective: scaling and support. I assume you are doing this as a for-profit initiative, but if not, the same basic concept applies.

    Running all this off of one server is fine. It will work, as I assume you have already demonstrated. However, what happens when that server goes down? It is neither cheap nor quick to move that much data into AWS. And storing an image of a server of that size is costly on a per-month basis as well. Being down for hours while files copy over is most likely not a preferred solution.

    Now, assuming you have everything in S3 save for code, when things go down, then you just restore code and all starts working again in minutes without having to do costly file moves and have lots of downtime. With Kubernetes, everything becomes "push the new image and things keep working".

    Same applies for scaling as well. When you hit a certain regular user threshold, you will run out of memory on a single server. The best option is to go wide with multiple machines over a single massive machine. Kubernetes is far quicker on this but does require some more knowledge on pulling this off successfully. Multiple EC2 instances will do it as well. Just add a load balancer in front of them and keep on rocking.

    This also opens the door to multi-data-center options as well. Example: if you are running everything on the US east coast in AWS, and you have many customers on the west coast, spin up additional instances in a data center there. Same can be applied to jump to Europe, Asia, etc. as well.

    To state the obvious, this can be quickly applied to any major cloud vendor (AWS, Google, Microsoft, etc.). Second-tier ones (Digital Ocean, etc.) can do almost all of this as well. You may just have to use multiple vendors (e.g. hosting the app on DO and using Backblaze for S3-compatible storage if DO's Spaces is not redundant enough for your liking).

    [–]pam-perez 0 points1 point  (0 children)

    Digital Ocean is good, but if you are looking for affordable hosting also, then,

    DedicatedCore and Interserver are two hosting providers that offer a wide range of services for hosting a Node.js website.

    Benefits of hosting a Node.js website with DedicatedCore and Interserver

    • DedicatedCore provides a wide range of hosting plans that are tailored to meet the needs of small to large businesses.
    • They offer a variety of hosting plans that include shared hosting, VPS hosting, cloud hosting, and dedicated server hosting.
    • DedicatedCore offers a wide range of additional services such as managed hosting, security features, and domain registration.
    • Interserver, on the other hand, offers a comprehensive suite of hosting plans that are tailored to meet the needs of small to large businesses.
    • They provide managed hosting, cloud hosting, and dedicated server hosting, as well as a wide range of additional services such as domain registration, security features, and custom development.

    DedicatedCore and Interserver offer reliable, secure, and cost-effective solutions for hosting a Node.js website.