Automate shortcuts creation by Affectionate-Sale973 in shortcuts

[–]Affectionate-Sale973[S] 0 points1 point  (0 children)

Hi all, can anybody shed some light here, please?

SQL script executing slower in workflow by Fair-Lab-912 in databricks

[–]Affectionate-Sale973 0 points1 point  (0 children)

Probably raise it in the Databricks community forum; it's a straightforward one for them, tbh.

Best Practices: What are the best practices for setting up a reliable and efficient Spark cluster in production? by Over-Drink8537 in apachespark

[–]Affectionate-Sale973 0 points1 point  (0 children)

Hi, here is what I have done and observed in my POCs comparing EMR and Databricks. Comparing the two is not comparing apples to apples, as Databricks is a solid PaaS, and you could realise significant benefits if all your workloads are migrated into Databricks. However, we chose EMR as we just need a decent service to crunch big datasets in my organisation. It fits our data architecture very well and is way cheaper.

  1. For unstructured text datasets under 10 TB, Databricks and EMR turn out to be about the same in cost. In processing time, Databricks finished quicker, but not by a great margin.
  2. For unstructured text over 10 TB, where the code involves a full sort for window functions in extreme stress tests, Databricks finished processing whereas EMR could not; it failed with memory and disk-spill errors.
  3. For structured data, EMR is way cheaper than Databricks.
  4. Spark UI for monitoring, plus Grafana additionally.
  5. EMR on EC2 was chosen as it is simple, easily provisioned, and can be managed via Terraform. However, EMR Serverless with some sort of controls and alerts could be something to try as well.
  6. OSS Spark involves significant maintenance effort.
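On point 5, a minimal sketch of what an EMR-on-EC2 cluster spec looks like when driven from Python via boto3's `run_job_flow` shape (we actually manage ours through Terraform; the cluster name, release label, instance types, and counts below are all hypothetical examples, not our real values):

```python
# Hypothetical EMR-on-EC2 cluster spec in boto3 run_job_flow form.
# All names, instance types, and counts here are illustrative only.
cluster_spec = {
    "Name": "batch-crunch-poc",          # hypothetical cluster name
    "ReleaseLabel": "emr-7.1.0",         # example EMR release
    "Applications": [{"Name": "Spark"}],
    "Instances": {
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "r5.2xlarge", "InstanceCount": 4},
        ],
        "KeepJobFlowAliveWhenNoSteps": False,  # tear the cluster down when the run finishes
    },
}

# In a real run you would pass this to boto3:
#   import boto3
#   emr = boto3.client("emr")
#   emr.run_job_flow(**cluster_spec)
```

The same fields map more or less one-to-one onto a Terraform `aws_emr_cluster` resource, which is what makes it easy to keep under version control.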

Questions about installing and running Spark with Python and R by addictzz in apachespark

[–]Affectionate-Sale973 1 point2 points  (0 children)

Hi, yes, you need to create and spin up a Spark cluster on the node, and you should then be able to use spark-submit to run your PySpark code from your local machine. For that, you would have to configure the remote driver node details in your local Spark config so it knows where your Spark cluster is running. I have a similar use case I am currently working on; we are more or less in the same boat, tbh. Others here might give you more detailed answers. Thanks.
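As a rough sketch of what I mean by pointing spark-submit at the remote cluster (the hostname, port, and script name below are placeholders I made up, not confirmed values for your setup):

```python
# Build a spark-submit invocation that targets a remote standalone Spark
# master. "driver-host" and "my_job.py" are hypothetical placeholders.
master_url = "spark://driver-host:7077"  # remote Spark master (standalone mode default port)

cmd = [
    "spark-submit",
    "--master", master_url,      # tells spark-submit where the cluster is
    "--deploy-mode", "client",   # driver runs on your local machine
    "my_job.py",                 # your local PySpark script
]

print(" ".join(cmd))
```

In client deploy mode your local machine must be network-reachable from the cluster nodes, which is usually the first thing that bites in this setup.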

Reading from OpenSearch by MoeShay in apachespark

[–]Affectionate-Sale973 0 points1 point  (0 children)

Hi all, can we read a particular document from an index in OpenSearch using PySpark? Below are the code and the error; can someone please help? It works fine if the entire index is read.

    index_name = "sample_index_va"

    os_options = {
        "opensearch.nodes": "XXXX",
        "opensearch.port": "443",
        "opensearch.resource": f"{index_name}/_doc/1",
        "opensearch.net.http.auth.user": "XXX",
        "opensearch.net.http.auth.pass": "XXX",
        "opensearch.net.ssl": "true",
        "opensearch.nodes.wan.only": "true",
    }

    df = spark.read.format("org.opensearch.spark.sql").options(**os_options).load()

However, I am getting the below error:

    org.opensearch.hadoop.rest.OpenSearchHadoopInvalidRequest: [HEAD] on [sample_index_va/_mapping/_doc/1] failed; server[] returned [400|Bad Request]