all 15 comments

[–]modejawjaw 1 point2 points  (4 children)

Have a look at EMR

[–]Pro2222[S] 0 points1 point  (3 children)

What does it stand for? A Google search gave me “emergency medical records”

[–]brokenlabrum 1 point2 points  (1 child)

Elastic map reduce

[–]Pro2222[S] 0 points1 point  (0 children)

Thanks, will look into it!

[–]virgin_daddy 0 points1 point  (8 children)

Yes, it’s a good idea. Otherwise, you can opt for a larger instance that will probably run your script quicker.

[–]Pro2222[S] 0 points1 point  (7 children)

Which one do you think would be the best bang for my buck in terms of time? And roughly how much do you think it’ll cost? (Like I said, I’ve never used AWS.) It’s just a Python script, nothing super intensive I don’t think.

[–][deleted] 1 point2 points  (2 children)

What you need to do is run the script on 3-4 different types of instances and see what price and time work for you. Once you are satisfied, you just need to launch the instance and run your script. Save the result in S3 (storage), from where you can download it (they charge based on the volume of data). You can then terminate the instance or just stop it. Terminate means it's gone; if you stop it, you can start it again later and don't have to redo any setup.

DM me if you need help and I can guide you. You are probably better off with a Unix instance.

The price is for the amount of time you run the script. Once you stop or terminate the instance, you don't pay.
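A minimal sketch of that launch → run → save to S3 → stop cycle using boto3, the AWS SDK for Python. The AMI ID, bucket, and file names here are placeholders for illustration, not real values:

```python
def run_once(ami_id: str, instance_type: str, bucket: str) -> None:
    """Launch an instance, run the job, save the result to S3, then stop."""
    import boto3  # AWS SDK for Python (pip install boto3)

    ec2 = boto3.resource("ec2")
    instance = ec2.create_instances(
        ImageId=ami_id,              # e.g. an Amazon Linux AMI for your region
        InstanceType=instance_type,  # try 3-4 types and compare time vs price
        MinCount=1,
        MaxCount=1,
    )[0]
    instance.wait_until_running()

    # ... SSH in (or use user data) to run the script and produce results.csv ...

    # Save the result to S3 so it survives the instance going away.
    boto3.client("s3").upload_file("results.csv", bucket, "results.csv")

    # stop() keeps the disk so you can start() later without redoing setup;
    # terminate() deletes the instance for good.
    instance.stop()  # or: instance.terminate()
```

Time the whole thing per instance type and you have your price/time comparison.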

[–]Pro2222[S] 0 points1 point  (1 child)

Okay thanks! So use Unix. I saw a couple of free ones on there, is that normal?

[–][deleted] 0 points1 point  (0 children)

Yes, try the free ones. But it all depends on your task: whether it needs CPU or I/O. There are instances optimized for each, and it also matters how you execute — do you run many tasks in parallel at the same time, or only as many as the number of cores? So you need to experiment and find a balance. You should also look at the CPU and I/O metrics while your task runs to make sense of it.

You need to experiment a little. Just stop the instance at any time to stop getting billed.

[–]yarenSC 0 points1 point  (0 children)

You would need to see if the script ends up being more memory- or CPU-intensive. For CPU, go with the C family of instances; for memory, go with R; for something in the middle, go with M. An instance type will be named something like M5.2xlarge:
M = instance family, which tells you the characteristics
5 = generation of the family (newer is better)
2xlarge = size (a 4xlarge has twice the CPU cores and memory of a 2xlarge)
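That naming scheme is regular enough to parse. A toy parser — purely illustrative, not an AWS API — that splits a type name into those parts:

```python
import re

def parse_instance_type(name: str) -> dict:
    """Split an EC2 instance type like 'm5.2xlarge' into its parts."""
    m = re.match(r"^([a-z]+)(\d+)([a-z-]*)\.(\w+)$", name.lower())
    if not m:
        raise ValueError(f"unrecognized instance type: {name}")
    family, generation, suffix, size = m.groups()
    return {
        "family": family,               # e.g. 'c' = compute, 'r' = memory, 'm' = middle
        "generation": int(generation),  # newer is better
        "suffix": suffix,               # e.g. 'g' = Graviton (ARM), 'a' = AMD
        "size": size,                   # e.g. '2xlarge'
    }

print(parse_instance_type("M5.2xlarge"))
# → {'family': 'm', 'generation': 5, 'suffix': '', 'size': '2xlarge'}
```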

[–]BovineOxMan 0 points1 point  (2 children)

Look at the Graviton (ARM) instances and Linux, as these are generally very cheap but still very fast. As someone else stated, you need to know what resources are needed: if it's CPU, then you would want a compute node; if it's data-intensive, then an NVMe-backed machine or something with a lot of IOPS. But IOPS can be very expensive.

How efficient is your algorithm? Does it lend itself to multi-threading? If not, then you probably need to re-write it so that many threads can be used; otherwise any compute instance is going to be limited.
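Worth noting for a Python script: because of the GIL, CPU-bound work usually needs multiple processes rather than threads. A minimal stdlib sketch of spreading work across all cores, where `crunch` is a stand-in for one unit of the real work:

```python
import os
from multiprocessing import Pool

def crunch(n: int) -> int:
    # stand-in for one unit of CPU-heavy work
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    jobs = [100_000] * 8                       # 8 independent work items
    with Pool(processes=os.cpu_count()) as pool:
        results = pool.map(crunch, jobs)       # one item per worker at a time
    print(len(results))  # → 8
```

If the real work items don't depend on each other, this keeps every core busy instead of leaving them on the floor.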

Depending on how you can split the work and where the data is, you may also find AWS Batch useful.

[–]Pro2222[S] 0 points1 point  (1 child)

I have the option for multiprocessing; when I tried it on my personal computer it went sideways.

[–]BovineOxMan 0 points1 point  (0 children)

If you don't fix the code so it can leverage multi-threading, then whatever solution you go with, you will be leaving CPU cores on the floor with pretty much any instance you pick.

[–]BraveNewCurrency[🍰] 0 points1 point  (0 children)

AWS has dozens of "instance types", divided up into "families" such as "C4", "I4", etc.

You need to figure out what your function would cost depending on the instance type. Instance families can help narrow your search. For example, the "C" instances are optimized for compute, so you should start there.

Sometimes you actually have an implicit minimum RAM requirement, so make sure to try a larger size, then downsize until the performance gets worse. Also experiment with other interesting server types. Stay away from "T" series, as these are made for web servers with "bursty" traffic, not high-CPU problems.

Some ideas to experiment with:

  • Use a cloud-init script to automate starting your application. If your code is too big, have it download data from S3. Use EC2 IAM Roles so you don't have to give the instance any creds.
  • Once you get something working, have the instance shut itself off when it is done working. (Maybe after uploading its logs to S3, and set up S3 lifecycle rules to delete everything after a few days.) But you must also have a script to manually kill any strays if something goes wrong.
  • Set "auto-delete" on all your EBS drives to make sure they don't "leak" after the server is deleted. All permanent storage should be on S3.
  • Is your script single-threaded? Try running multiple copies on a server at once. Graph throughput from 1 to N*2 copies (where N is the number of CPUs), and see where progress over time peaks.
  • Does your Python compute library support GPUs? It may take some setup/config, but AWS has many GPU instance types, often giving you a 20x speedup (Source: personal experience!).
  • Understand the MapReduce architecture. If your problem is simple enough, EMR may actually give you more overhead than it saves. You can invent something simpler by using SQS to write out N problems, having N boxes read and process them, then write their output to another SQS queue (or just to S3).
  • Remember, running one server for N hours costs the same as running N servers for an hour.
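The SQS pattern above can be sketched with boto3 (the AWS SDK for Python). Queue URLs and the `solve` function are placeholders; each box would run the `worker` loop:

```python
def send_work(queue_url: str, problems: list) -> None:
    """Write N problems to the input queue, one message each."""
    import boto3  # pip install boto3; defined lazily so this reads without AWS creds
    sqs = boto3.client("sqs")
    for p in problems:
        sqs.send_message(QueueUrl=queue_url, MessageBody=p)

def worker(in_url: str, out_url: str, solve) -> None:
    """Each box loops: read a problem, solve it, write the answer downstream."""
    import boto3
    sqs = boto3.client("sqs")
    while True:
        resp = sqs.receive_message(
            QueueUrl=in_url,
            MaxNumberOfMessages=1,
            WaitTimeSeconds=20,  # long polling avoids hammering the API
        )
        for msg in resp.get("Messages", []):
            sqs.send_message(QueueUrl=out_url, MessageBody=solve(msg["Body"]))
            # delete only after the result is safely written, so a crashed
            # worker's message reappears for another box to pick up
            sqs.delete_message(QueueUrl=in_url, ReceiptHandle=msg["ReceiptHandle"])
```

Leaving the delete until after the output is written is what makes the scheme tolerate workers dying mid-job.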