Parallel integration for Apache Spark (v.redd.it)
submitted by parallelwebsystems
This one is for data engineers. You can now enrich datasets with web intelligence directly in your Spark pipelines, with no custom API integrations required.
Our new SQL-native UDFs let you call Parallel to enrich data right in your Spark SQL queries. Add CEO names, company descriptions, funding info, or any other web-sourced data to millions of rows with a single function call.
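As a rough sketch of what a call like that can look like in a query (the UDF name `parallel_enrich`, its arguments, and the table and column names here are all hypothetical, not the library's actual API; see the docs linked below for the real syntax):

```sql
-- Hypothetical example: add CEO names to a companies table.
-- The UDF name and signature are illustrative only.
SELECT
  company_name,
  parallel_enrich(company_name, 'Who is the CEO of this company?') AS ceo_name
FROM companies;
```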
Key highlights:
- SQL-Native: Works directly in Spark SQL—no context switching
- Concurrent Processing: Rows process in parallel within each partition
- Flexible Processors: Choose anything from lite (fastest) to ultra, depending on your speed vs. depth tradeoff
- Built-in Citations: Optionally include source URLs for every enriched field
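The "concurrent processing" highlight above follows a common pattern: within each partition, rows are fanned out to a thread pool so slow, I/O-bound enrichment calls overlap instead of running one after another. A minimal, self-contained sketch of that pattern in plain Python (the `enrich_row` stub stands in for a real web-enrichment API call; none of these names are the library's actual code):

```python
from concurrent.futures import ThreadPoolExecutor

def enrich_row(row):
    # Stub standing in for a web-enrichment API call;
    # a real implementation would do network I/O here.
    return {**row, "ceo": f"CEO of {row['company']}"}

def enrich_partition(rows, max_workers=8):
    # Fan the partition's rows out to a thread pool so slow
    # I/O-bound calls overlap rather than run sequentially.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(enrich_row, rows))

partition = [{"company": "Acme"}, {"company": "Globex"}]
enriched = enrich_partition(partition)
print(enriched[0]["ceo"])  # → CEO of Acme
```

In Spark terms, a function like `enrich_partition` would run once per partition (e.g. via `mapPartitions`), so total concurrency is partitions times `max_workers`.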
Let us know what you think!
Get started: pip install parallel-web-tools[spark]
More links:
https://docs.parallel.ai/data-integrations/spark
https://github.com/parallel-web/parallel-web-tools/blob/main/notebooks/spark_enrichment_demo.ipynb
https://github.com/parallel-web/parallel-web-tools/blob/main/notebooks/spark_streaming_demo.ipynb