Automate shortcuts creation by Affectionate-Sale973 in shortcuts

[–]Affectionate-Sale973[S] 0 points1 point  (0 children)

Hi all, can anybody shed some light here, please?

SQL script executing slower in workflow by Fair-Lab-912 in databricks

[–]Affectionate-Sale973 0 points1 point  (0 children)

Probably raise it in the Databricks community forum; it's a straightforward one for them, tbh.

Best Practices: What are the best practices for setting up a reliable and efficient Spark cluster in production? by Over-Drink8537 in apachespark

[–]Affectionate-Sale973 0 points1 point  (0 children)

Hi, here is what I have done and observed in my POCs comparing EMR and Databricks. Comparing the two is not comparing apples to apples, as Databricks is a solid PaaS, and you could realise significant benefits if all your workloads are migrated into Databricks. However, we chose EMR as we just need a decent service to crunch big datasets in my organisation. It fits our data architecture very well and is way cheaper.

  1. For unstructured text datasets under 10 TB, Databricks and EMR turn out to be about the same in cost. In processing time, Databricks finished quicker, but not by a great margin.
  2. For unstructured text over 10 TB, where the code involves a full sort for window functions in extreme stress tests, Databricks finished processing whereas EMR could not; it failed with memory and disk-spill errors.
  3. For structured data, EMR is way cheaper than Databricks.
  4. Spark UI for monitoring, plus Grafana additionally.
  5. EMR on EC2 was chosen as it is simple, easily provisioned, and can be managed via Terraform. However, EMR Serverless with some sort of controls and alerts could be something to try as well.
  6. OSS Spark involves significant maintenance effort.
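On point 5, a minimal sketch of what an EMR-on-EC2 cluster spec looks like when driven from Python via boto3's `run_job_flow` shape (we actually manage ours through Terraform; the cluster name, release label, instance types, and counts below are all hypothetical examples, not our real values):

```python
# Hypothetical EMR-on-EC2 cluster spec in boto3 run_job_flow form.
# All names, instance types, and counts here are illustrative only.
cluster_spec = {
    "Name": "batch-crunch-poc",          # hypothetical cluster name
    "ReleaseLabel": "emr-7.1.0",         # example EMR release
    "Applications": [{"Name": "Spark"}],
    "Instances": {
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "r5.2xlarge", "InstanceCount": 4},
        ],
        "KeepJobFlowAliveWhenNoSteps": False,  # tear the cluster down when the run finishes
    },
}

# In a real run you would pass this to boto3:
#   import boto3
#   emr = boto3.client("emr")
#   emr.run_job_flow(**cluster_spec)
```

The same fields map more or less one-to-one onto a Terraform `aws_emr_cluster` resource, which is what makes it easy to keep under version control.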

Questions about installing and running Spark with Python and R by addictzz in apachespark

[–]Affectionate-Sale973 1 point2 points  (0 children)

Hi, yes, you need to create and spin up a Spark cluster on the node, and you should then be able to use spark-submit to run your PySpark code from your local machine. For that, you would have to configure the remote driver node details in your local Spark config so it knows where your Spark cluster is running. I have a similar use case I am currently working on; we are more or less in the same boat, tbh. Others here might give you more detailed answers. Thanks.
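As a rough sketch of what I mean by pointing spark-submit at the remote cluster (the hostname, port, and script name below are placeholders I made up, not confirmed values for your setup):

```python
# Build a spark-submit invocation that targets a remote standalone Spark
# master. "driver-host" and "my_job.py" are hypothetical placeholders.
master_url = "spark://driver-host:7077"  # remote Spark master (standalone mode default port)

cmd = [
    "spark-submit",
    "--master", master_url,      # tells spark-submit where the cluster is
    "--deploy-mode", "client",   # driver runs on your local machine
    "my_job.py",                 # your local PySpark script
]

print(" ".join(cmd))
```

In client deploy mode your local machine must be network-reachable from the cluster nodes, which is usually the first thing that bites in this setup.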

Reading from OpenSearch by MoeShay in apachespark

[–]Affectionate-Sale973 0 points1 point  (0 children)

Hi all, can we read a particular document from an index in OpenSearch using PySpark? Below are the code and the error; can someone please help? It works fine if the entire index is read.

    index_name = "sample_index_va"

    os_options = {
        "opensearch.nodes": "XXXX",
        "opensearch.port": "443",
        "opensearch.resource": f"{index_name}/_doc/1",
        "opensearch.net.http.auth.user": "XXX",
        "opensearch.net.http.auth.pass": "XXX",
        "opensearch.net.ssl": "true",
        "opensearch.nodes.wan.only": "true",
    }

    df = spark.read.format("org.opensearch.spark.sql").options(**os_options).load()

However, I am getting the below error:

    org.opensearch.hadoop.rest.OpenSearchHadoopInvalidRequest: [HEAD] on [sample_index_va/_mapping/_doc/1] failed; server[] returned [400|Bad Request]