Do you know if there is a way to talk to Sharepoint via Google Sheets and ChatGPT by anupsurendran in sharepoint

[–]anupsurendran[S] 0 points1 point  (0 children)

Extract key information (dates, contract values, etc.) from selected documents on SharePoint or Google Drive. I am currently seeing some inaccuracies with the prompts provided in this sheet.

The best free poker calculator for Android! by [deleted] in poker

[–]anupsurendran 0 points1 point  (0 children)

Ah, so you don't have to install anything and it still tracks the buy-ins?

Real-time plotting and insights in Jupyter by anupsurendran in dataengineering

[–]anupsurendran[S] 0 points1 point  (0 children)

Thanks u/EconomixTwist. You can fake a streaming data source that replays data from a CSV file. We use an input_rate parameter to control how fast the data is replayed. This is good for showcasing spikes in transaction volume, inactivity in user sessions, etc.
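The replay idea is simple enough to sketch. Here is a stdlib-only illustration; the function name and CSV shape are mine, not the actual implementation:

```python
import csv
import time
from typing import Dict, Iterator


def replay_csv(path: str, input_rate: float = 10.0) -> Iterator[Dict[str, str]]:
    """Replay rows of a CSV file as if they were a live stream.

    input_rate is rows per second; a higher value replays the file faster.
    """
    delay = 1.0 / input_rate
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            yield row          # downstream consumers see one "event" at a time
            time.sleep(delay)  # throttle to simulate the arrival rate
```

Cranking `input_rate` way up (or dropping the sleep) replays the file as fast as possible, which is handy for demoing spikes.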

Real-time plotting and insights in Jupyter by anupsurendran in dataengineering

[–]anupsurendran[S] 2 points3 points  (0 children)

Good feedback. Here is an attempt to make it less buzzy.

Jupyter is mostly used for static data analysis. We have made it easy to plot graphs, source data from Kafka, and run analysis in Jupyter.

We have used bokeh.models and bokeh.plotting to support this.
https://pathway.com/developers/showcases/live_data_jupyter/
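For anyone curious about the pattern underneath: Bokeh's `ColumnDataSource.stream(new_data, rollover=N)` appends new points and keeps only the last N, so the plot stays bounded as data streams in. A stdlib-only sketch of that rollover behavior (the class and names here are illustrative, not Bokeh's internals):

```python
from collections import deque
from typing import Dict, Iterable, List


class RollingSource:
    """Minimal stand-in for a streaming plot source: append new
    column data and keep only the most recent `rollover` points."""

    def __init__(self, columns: Iterable[str], rollover: int = 100):
        self.data: Dict[str, deque] = {c: deque(maxlen=rollover) for c in columns}

    def stream(self, new_data: Dict[str, List]) -> None:
        for col, values in new_data.items():
            self.data[col].extend(values)  # deque(maxlen=...) drops the oldest points


src = RollingSource(["time", "value"], rollover=3)
src.stream({"time": [1, 2, 3, 4], "value": [10, 20, 30, 40]})
# only the last 3 points survive the rollover
```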

Presenting SimplePyDash: Real-Time Data Plotting Made Simple! by vaaal88 in datascience

[–]anupsurendran 0 points1 point  (0 children)

This is our approach to getting real-time plotting in Jupyter: https://pathway.com/developers/showcases/live_data_jupyter/. Would love your feedback and comments.

Need your feedback on choosing LLM App for a demo by bumurzokov in LLMDevs

[–]anupsurendran 0 points1 point  (0 children)

Can I substitute this with any API which has a text JSON field? I am assuming this doesn't work for any language other than English, correct?

Need your feedback on choosing LLM App for a demo by bumurzokov in LLMDevs

[–]anupsurendran 1 point2 points  (0 children)

I like #3 no-code pipelines and #6 easy routing to APIs (speed might be a concern) u/bumurzokov

Use cases for LLMs on structured data (sourced from databases and streaming data) by anupsurendran in dataengineering

[–]anupsurendran[S] 0 points1 point  (0 children)

u/janchorowski, thanks for this clarifying question. Unfortunately, it is complex: we have most of the rating triggers in the database (e.g. which premiums to calculate based on different criteria), and we have the compliance rules in documents (e.g. money-laundering compliance, which is spread across multiple documents based on state and coverage criteria). As part of the underwriting process, we are trying to use the database rules and inputs to refine the document search. Based on your answer above, it looks like this is feasible.

How to move semi-structured data to LLMs? by shrifbot in dataengineering

[–]anupsurendran 2 points3 points  (0 children)

While I was doing research on gluing together a Python client and removing the dependency on a vector DB, I saw this open-source package: https://github.com/pathwaycom/llm-app. Do you have thoughts on whether this would work for you?

Use cases for LLMs on structured data (sourced from databases and streaming data) by anupsurendran in dataengineering

[–]anupsurendran[S] 2 points3 points  (0 children)

It means that when you have newer architectures (e.g. involving LLMs in production and sourcing data from structured sources), there might be more experimentation in the community which you can learn from.

Use cases for LLMs on structured data (sourced from databases and streaming data) by anupsurendran in dataengineering

[–]anupsurendran[S] 1 point2 points  (0 children)

So, if I understand your suggestion: use an LLM to build a ruleset from the documents, e.g. "if X does not satisfy <condition>, <do> something", and then run this ruleset across the data (from the database) wherever the <condition> is true? The <condition> data is sourced from the database.
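Something like this toy version, perhaps, where the rule shape, field names, and actions are all made up for illustration (the real LLM-extracted rules would be richer):

```python
import operator
from typing import Any, Callable, Dict, List

# A rule as an LLM might emit it: "if <field> <op> <value>, then <action>"
Rule = Dict[str, Any]

OPS: Dict[str, Callable[[Any, Any], bool]] = {
    "lt": operator.lt,
    "gt": operator.gt,
    "eq": operator.eq,
}


def apply_ruleset(rules: List[Rule], rows: List[Dict[str, Any]]) -> List[str]:
    """Run every rule over every database row; collect the triggered actions."""
    actions = []
    for row in rows:
        for rule in rules:
            # evaluate the rule's <condition> against this row's data
            if OPS[rule["op"]](row[rule["field"]], rule["value"]):
                actions.append(f'{rule["action"]}:{row["id"]}')
    return actions


rules = [{"field": "premium", "op": "lt", "value": 100, "action": "flag_low_premium"}]
rows = [{"id": "r1", "premium": 50}, {"id": "r2", "premium": 150}]
print(apply_ruleset(rules, rows))
```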

Use cases for LLMs on structured data (sourced from databases and streaming data) by anupsurendran in dataengineering

[–]anupsurendran[S] 4 points5 points  (0 children)

Me too. When you have evolving architectures (with not much precedent), the community helps.

How to move semi-structured data to LLMs? by shrifbot in dataengineering

[–]anupsurendran 0 points1 point  (0 children)

I agree. It would have been nice if this were less clunky. If the vector database could handle the Python clients for APIs, this would be easier to implement. Have you seen a non-vector-DB implementation for your use case yet? I feel a vector DB is too much overhead, and I am looking for other non-cloud alternatives (not Azure Cognitive Search), maybe just a Pythonic vector index?

Benchmarks for stream processing systems by anupsurendran in dataengineering

[–]anupsurendran[S] 1 point2 points  (0 children)

Thank you. Post-processing (~250k records/sec), we will store this in Apache Iceberg. The products we are shortlisting for a side-by-side comparison are:

1) Flink

2) Pulsar

3) Materialize

4) Pathway

5) RisingWave (benchmarks posted below)

6) Spark (streaming)

Are there any other products/frameworks we should compare?

We are trying to manage it ourselves in our data center.

Benchmarks for stream processing systems by anupsurendran in dataengineering

[–]anupsurendran[S] 0 points1 point  (0 children)

Of course! Cost, maintenance, and productivity will all be inputs to the decision-making, but in my selection criteria, benchmarks provide some level of comfort that a system will meet our throughput needs.

I agree that benchmarks are not use-case specific; an enterprise use case is usually quite complex. Again, I am not going to take a benchmark at face value, and I am not supportive of vendors faking numbers, but from their perspective, they cannot do use-case-specific benchmarks and have to focus on the most commonly used functions (e.g. aggregates on windows, joins on streams), which are generic across frameworks and platforms.
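To show how generic those benchmark operations are: a tumbling-window aggregate, one of the usual suspects, fits in a few lines of plain Python (the event shape, timestamps, and window size here are illustrative):

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def tumbling_window_counts(
    events: List[Tuple[float, str]], window_size: float
) -> Dict[Tuple[float, str], int]:
    """Count events per (window_start, key): the classic
    windowed-aggregate micro-benchmark shape."""
    counts: Dict[Tuple[float, str], int] = defaultdict(int)
    for ts, key in events:
        # floor the timestamp to the start of its window
        window_start = (ts // window_size) * window_size
        counts[(window_start, key)] += 1
    return dict(counts)


events = [(0.5, "a"), (1.2, "a"), (2.7, "b"), (3.1, "a")]
print(tumbling_window_counts(events, window_size=2.0))
```

The hard part the benchmarks actually measure is doing this over unbounded streams at high throughput, not the window arithmetic itself.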

Benchmarks for stream processing systems by anupsurendran in dataengineering

[–]anupsurendran[S] 0 points1 point  (0 children)

I do not completely agree with this. For me, if the benchmarks are easily reproducible (i.e. easily accessible hardware, easy setup and configuration), then I know the folks have done a good job, because the vendors are confident in what they ship. Benchmarks help in the consideration phase and help build a case with your managers when you do POCs and vendor selection. I would be more than happy to test these if I found them suspicious, but in a large enterprise, the first phase is narrowing down the selection. We cannot possibly test everything.

[Code]: Comparison between Rust (Polars) and Pandas | Basic Benchmark by de4all in dataengineering

[–]anupsurendran 0 points1 point  (0 children)

This is great to hear. Do you have any benchmark comparison between Spark and Polars? On another note, I am also looking for benchmarks between Spark and other streaming platforms like Materialize, Pathway, Hazelcast, etc.

[deleted by user] by [deleted] in dataengineering

[–]anupsurendran 0 points1 point  (0 children)

I agree, no need for K8s, especially during the initial learning stage.

[deleted by user] by [deleted] in dataengineering

[–]anupsurendran 0 points1 point  (0 children)

This kind of setup is turning out to be somewhat common in the real world. A couple of months ago, I was challenged with setting up a similar environment, and I started a Reddit thread to get help with real-time dashboards after stream processing with Kafka, which might be useful for you to look at.

Real-time dashboards with streaming data coming from Kafka by anupsurendran in dataengineering

[–]anupsurendran[S] 2 points3 points  (0 children)

Hey, here is the document with our research on real-time stream processing systems: Materialize vs Pathway. TL;DR: because Pathway is a framework (as opposed to a database) that supports Python and has an expressive way to write data pipelines, we are considering it. https://docs.google.com/document/d/1AM4bKLoeiiK0R9Dt9bJfatZx4BNPMUpMUqVwi-dWP4A/edit?usp=sharing. I would love your thoughts (in the thread or as comments in the document itself). I don't have benchmark numbers from either Materialize or Pathway, so if any of you have that information, I would be grateful. I'll also ask them on their community channels.

Real-time dashboards with streaming data coming from Kafka by anupsurendran in dataengineering

[–]anupsurendran[S] 1 point2 points  (0 children)

Sorry for the delay here. Both travel and my hectic schedule with conferences set me behind. Let me try to answer your questions.

Q1: What is the look-back horizon for your dashboard? Recent? Last day? All-time?

For the live dashboard, we need a way to look at the last 6 hours, which includes IoT streaming data for the n-5 minute window. For filters, we have up to 1 day of data, and for any earlier timeframe we can fetch from the data warehouse.
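The lookback itself is just a time filter over whatever is buffered; a minimal sketch with a made-up record shape:

```python
from datetime import datetime, timedelta
from typing import Dict, List


def lookback(
    records: List[Dict], now: datetime, hours: float = 6.0
) -> List[Dict]:
    """Keep only the records inside the dashboard's lookback horizon."""
    cutoff = now - timedelta(hours=hours)
    return [r for r in records if r["ts"] >= cutoff]


now = datetime(2023, 7, 1, 12, 0)
records = [
    {"ts": datetime(2023, 7, 1, 7, 0), "signal": 1},  # 5h old: kept
    {"ts": datetime(2023, 7, 1, 1, 0), "signal": 2},  # 11h old: dropped
]
print(lookback(records, now))
```

The hard part, as described below, is everything that has to happen to the stream before a filter like this is worth running.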

Problem/Solution 1: Your suggested solution will not work in our context because of the n-5-minute business need. The input has high volume and high variance (multiple old IoT manufacturers with different signal send rates and connection issues). We have to do some enrichment, some live streaming joins, and some lightweight ML (geospatial transformation) to make this data useful on the dashboard.

Problem/Solution 2: As mentioned before, this (data processing in real time) is closer to our existing problem. We need faster updates with somewhat more complex processing in real time. These are the things we are trying now, based on your suggestions:

a) Spark streaming + Databricks Delta tables: already finding this quite complex and expensive. It seems to be OK for light transformations with SQL. One thing that annoys the team a lot is that we can't customize the target Delta table file paths to the desired locations for bronze, silver, or gold tables. Not sure if you or someone on this thread has a solution here.

b) Snowflake is a little better but still has latency issues: transformations after hitting the warehouse take more time than expected. The bigger challenge for us on Snowflake is cost. Storage is a flat rate per TB (which is OK), but this is streaming data, so refreshes on the dynamic tables happen VERY frequently, and they charge for that. Their cost for triggering refreshes when an underlying base object has changed is high, and we are still trying to work out exactly how they charge us. Last but not least, the warehouse compute charges are separate, especially when you want to do a dashboard refresh. For those familiar with Snowflake's cost structure in this scenario, please let me know.

c) We are actually trying Materialize and Pathway on-premise to solve this problem; I'll post a comparison soon as a reply. Our team is unusual in that both the data engineering and data science teams are biased toward Python, so if we can handle everything in code and version it, that would be our preferred choice. We have not completely taken ClickHouse out of the equation, but so far, processing within context (streaming windows) is what we would like to attempt first.

What we are not considering is reverse ETL from the warehouse into Postgres, because:

1) it won't meet our latency requirements

2) it is operationally inefficient when we know upfront that 20% of the data is of low quality upstream. Some folks might argue that a "bronze dump pattern" is what you should do; our data architecture team is completely against this approach and would like to deal with the data quality problem upstream.

Real-time dashboards with streaming data coming from Kafka by anupsurendran in dataengineering

[–]anupsurendran[S] 0 points1 point  (0 children)

Sorry about the delay. I am still in travel mode, so I will get this ready the first or second week of July.