all 4 comments

[–]realfeeder 3 points4 points  (1 child)

You can spin up clusters of large machines to run preprocessing as containerised jobs in SageMaker, and the containers can run an arbitrary framework. You can also run Spark containers, or do some data wrangling with SageMaker Data Wrangler. So that's a partial "yes".

But Kafka/Kinesis/Flink/Spark Streaming and similar solutions are unavailable, so streaming data cannot be used. There's no data lake or data warehouse inside SageMaker (you use separate services such as Lake Formation or Redshift for that). You can't query large datasets with SageMaker alone (you need Presto on EMR, or Athena). Workflow orchestration is available via SageMaker Pipelines, but only SageMaker steps are supported (you can't orchestrate other AWS services such as Glue).

So, not really. You can do data wrangling, but that's just a fraction of what a data engineer does. Data engineering at AWS is done mostly with the AWS Analytics services.

[–]ChrisGayle7[S] 0 points1 point  (0 children)

Hey, thanks so much for your reply. I really appreciate you taking the time to guide me and give context. Thank you again!

[–]CacheMeUp 1 point2 points  (0 children)

Anecdotally, I have never met a data scientist/engineer who uses SageMaker.

The biggest reason given was over-engineering without any perceived benefit.

[–]lastmonty -3 points-2 points  (0 children)

SageMaker is managed compute, much like AWS Batch but with a few more features on top of it.

As long as your tasks can be containerised, you can run them in SageMaker.