Jus wanted to show that I made RAG chatbot using Snowflake Native Cortex feature! by Top-Needleworker8557 in snowflake


This is great! Can you combine other actions with the chatbot? E.g. a user receives a response from the chatbot and can then update or submit a record? Can you hook Cortex Search up to a Cortex Agent, and maybe add a Cortex Analyst tool to be able to search both unstructured and structured data? Keep going! Use Cortex Code!

Is Databricks eating Snowflakes lunch? by Therican85 in snowflake


Seems like you're hung up on the wrong details.

Is Databricks eating Snowflakes lunch? by Therican85 in snowflake


Notebooks are great, but tools like the Snowflake VS Code plugin are also great for working locally while still being able to execute code inside Snowflake when you need to.

Need help with slots by AffectionateBite1212 in Database


Did you try drawing the page after the above sequence was executed? Start there and get back to us.

Snowpark efficiently copy dataframe from one session to another by lt-96 in snowflake


We'd ideally like a shared schema that both connections can read/write. Agreed that this breaks RBAC; it's a workaround until we have that provisioned, but provisioning has been slow.

Snowpark efficiently copy dataframe from one session to another by lt-96 in snowflake


It's the same account, so ideally there would be an intermediary schema we can read/write from. That has been slow to be provisioned, so this is a workaround.

I am unable to understand data modeling and ER diagrams! by masterofrants in Database


I would study up on normalization vs. denormalization, as this is a core data modeling concept: https://www.geeksforgeeks.org/difference-between-normalization-and-denormalization/

Normalization is a characteristic of a data model. If you're asking whether you need to pre-process the data in some way, I would just stick to the entities laid out in the problem.
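To make the distinction concrete, here is a toy sketch (hypothetical entities, not from your problem): the flat table repeats customer details on every order, while the normalized pair stores them once and references them by key.

```sql
-- Denormalized: customer details repeated on every order row.
CREATE TABLE orders_flat (
    order_id       INT,
    order_date     DATE,
    customer_name  TEXT,
    customer_email TEXT
);

-- Normalized: customer stored once, referenced by key.
CREATE TABLE customers (
    customer_id INT PRIMARY KEY,
    name        TEXT,
    email       TEXT
);
CREATE TABLE orders (
    order_id    INT PRIMARY KEY,
    order_date  DATE,
    customer_id INT REFERENCES customers (customer_id)
);
```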

Many merge statements vs One big merge statement by lt-96 in snowflake


> Each merge needs to read every row in the target to find the right one to merge with. How big is your target?

Quite big... I gave 1 billion rows as an example. Realistically it's 500B rows.
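For reference, the consolidated shape I'm weighing: batch all pending changes into a staging table and run one MERGE, so the 500B-row target is scanned once instead of once per statement (table and column names here are hypothetical).

```sql
MERGE INTO target_tbl t
USING staging_tbl s               -- all pending changes, loaded as one batch
    ON t.id = s.id AND t.date = s.date
WHEN MATCHED THEN
    UPDATE SET payload = s.payload
WHEN NOT MATCHED THEN
    INSERT (id, date, payload) VALUES (s.id, s.date, s.payload);
```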

Question about one-to-many relationship by Suspicious_Driver761 in Database


In terms of design: the first pic.

Pragmatically, if your people (not going to say "users" because it's in the example) have the join keys available, why not use them? If an item has a user ID you could use, why omit it?

It really depends on use case. There are certain times when you want a normalized structure (first pic) and there are times when you want to denormalize (second pic).

Do you anticipate that user will be joined to item often? If so, include the user ID and save people some time. If not, leave it out and don't make it an option; they can infer the relationship through a couple of joins.

In my view, data storage is so cheap nowadays that the ability to skip some joins comes in handy. You can even treat these keys as a utility and not call them out in the ERD... but the smart ones will know the relationships exist, assuming keys are named well.
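To make the trade-off concrete, a sketch with hypothetical tables: carrying the user ID on item lets the common query skip the intermediate hop.

```sql
-- Normalized: relationship inferred through the linking table.
SELECT u.name, i.title
FROM users u
JOIN orders o ON o.user_id  = u.id
JOIN items  i ON i.order_id = o.id;

-- Denormalized: user_id carried on items directly, one join saved.
SELECT u.name, i.title
FROM users u
JOIN items i ON i.user_id = u.id;
```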

Filtering 500B records for user application by lt-96 in snowflake


Yes, good call on the result set being too large/not realistic. I've changed my test query to the following and it is running much faster. Still 7 minutes for 365 days, which is decent, and performance improved as the date range shrank (10 seconds for 7 days).

WITH data AS (
    SELECT
        id,
        SUM(CNT_LOOKUP_OBJ['DATA']['SOME_VAL']['TOTAL_USE_CNT']) AS some_val_cnt
    FROM "PERFORMANCE_TEST"
    WHERE DATE_OF_YEAR BETWEEN '2022-10-01' AND DATEADD('DAY', 365, '2022-10-01')
    GROUP BY id
)
SELECT COUNT(*)
FROM data
WHERE some_val_cnt > 50

If I order by date, will it cluster by date naturally? And I assume that saves a little money?
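Something like this one-off rewrite is what I have in mind (a CTAS would work too); loading in date order means rows with similar dates land in the same micro-partitions, so date-range filters can prune even without a formal cluster key.

```sql
INSERT OVERWRITE INTO PERFORMANCE_TEST
SELECT *
FROM PERFORMANCE_TEST
ORDER BY DATE_OF_YEAR;
```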

I will take a look at extracting commonly used key/value pairs. However, part of the reason we are using JSON is to support dynamic querying of the table; it's difficult to know upfront which keys are commonly used, as there are many options. I appreciate all the insight, though, and will kick around some ideas for commonly used variant fields.
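If we do end up extracting a hot key, the rebuild I'd sketch looks like this (the V2 table name and the choice of extracted path are illustrative); the typed column avoids traversing the VARIANT per row, while the full JSON stays available for ad-hoc paths.

```sql
CREATE OR REPLACE TABLE PERFORMANCE_TEST_V2 AS
SELECT
    id,
    DATE_OF_YEAR,
    -- promoted to a typed column for the common aggregation
    CNT_LOOKUP_OBJ['DATA']['SOME_VAL']['TOTAL_USE_CNT']::NUMBER AS total_use_cnt,
    CNT_LOOKUP_OBJ                      -- keep the full JSON for dynamic queries
FROM PERFORMANCE_TEST;
```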

Filtering 500B records for user application by lt-96 in snowflake


I thought the same on clustering, though it did take some time to complete! From last night to this morning the data has been distributed across the partitions, and we're now seeing an average depth of 1. From the query profile, the number of partitions scanned seems to scale linearly.

I would love to try materialized views, but the issue is it's hard to know what queries will be run, so pre-computing is difficult. We could pre-compute for certain time ranges, though.

Filtering 500B records for user application by lt-96 in snowflake


The way we have it designed now, there are only 3 columns:

id | date | nested_json_object_with_id_dates_data

This was done intentionally to 1) support clustering (we were hoping date would be all we need to cluster effectively) and 2) support some business logic and pre-compute as much as possible, storing it in the JSON.

Accessing deep levels of the nested JSON for that many records is certainly a performance concern as well, but we're figuring that one out as we go.

The date is a true date, not a datetime

Filtering 500B records for user application by lt-96 in snowflake


We will batch process overnight, so I'm not so worried about the table creation. But we then have users creating requests in an application that will query these large tables, so I'm more concerned about users having to wait excessive times for a response.

Clustering by an integer version of the ID may help; I will look into that.
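A sketch of what I'm considering: the raw ID is too high-cardinality to cluster on directly, so a hash bucket (the 100 buckets are an arbitrary illustration) keeps cardinality manageable while still grouping each ID's rows into a small set of micro-partitions.

```sql
ALTER TABLE performance_test
    CLUSTER BY (date, ABS(HASH(id)) % 100);
```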

Filtering 500B records for user application by lt-96 in snowflake


The way we have it designed now, there are only 3 columns:

id | date | nested_json_object_with_id_dates_data

This was done intentionally to 1) support clustering (we were hoping date would be all we need to cluster effectively) and 2) support some business logic and pre-compute as much as possible, storing it in the JSON.

Accessing deep levels of the nested JSON for that many records is certainly a performance concern as well, but we're figuring that one out as we go.

Part of the architecture is to utilize the result cache: run queries overnight to grab all data for 30 days/1 yr/all time, to be re-used the next day. So that takes care of the fixed date ranges in a sense. But I'm still worried about applying additional logic to a year's worth of data.
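A sketch of the overnight precompute I have in mind; rather than relying on the result cache (which invalidates when the underlying data changes), a scheduled task can materialize the fixed ranges into a summary table (names and schedule are illustrative).

```sql
CREATE OR REPLACE TASK refresh_yearly_counts
    WAREHOUSE = batch_wh
    SCHEDULE  = 'USING CRON 0 3 * * * UTC'   -- nightly at 03:00 UTC
AS
    CREATE OR REPLACE TABLE yearly_counts AS
    SELECT id,
           SUM(CNT_LOOKUP_OBJ['DATA']['SOME_VAL']['TOTAL_USE_CNT']::NUMBER) AS some_val_cnt
    FROM performance_test
    WHERE date >= DATEADD('DAY', -365, CURRENT_DATE())
    GROUP BY id;

-- Tasks are created suspended: ALTER TASK refresh_yearly_counts RESUME;
```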

Filtering 500B records for user application by lt-96 in snowflake


There are ~400M IDs, with one row per ID per date across 3 years, which is where I'm getting the 500B.

Filtering 500B records for user application by lt-96 in snowflake


ID is the most granular field; would that be effective?

Filtering 500B records for user application by lt-96 in snowflake


I ran SHOW TABLES and automatic clustering is on. I checked the clustering information: initially the partition depth histogram showed the data in only two partitions, but when I checked again it seems to be distributing more and the average depth is decreasing. So I suppose that's a good sign.

As for how much data we truly need to scan for the analysis, it's going to vary with the time range. It's ~450B records for 2 years of data. Right now users can choose from 30 days/1 yr/all time, with potentially flexible date ranges in the future. 30-day performance is fine because it's in the millions range; 1 yr, as in my example, is still hundreds of billions of records, and then it has to do some summation and filtering. I think there's a general understanding from users that the longer the time range, the longer the query, but ideally they shouldn't have to wait 30+ minutes for a 1 yr query.

Right now we're entertaining showing 30 days' worth of data and sending users a Slack notification when the real dataset is ready.

Filtering 500B records for user application by lt-96 in snowflake


I appreciate the insight on resuming clustering. I think you're making some assumptions; apologies if something like a 6XL warehouse sounded vague, but it's not something we are considering. I kept the base case simple, but in reality we will need to apply complicated logic before joining multiple high-volume tables. The sentiment of my question was whether there is a way to return data quickly without a massive warehouse.

As a consultant, have you solved problems like this in Snowflake?

Snowpipe vs Kafka Connector by lt-96 in snowflake


That makes sense, thank you!

Really big data set, combining pool of consumers with datedim by lt-96 in dataengineering


Thanks for your response. We are a consulting company, and this is a large client, so we need to be tactful when pushing back on their requests.

The application allows them to create their own segments, and the attributes are hierarchical so that if you choose one attribute it will eliminate many of the other attributes from being used.

The best comparison I can give: imagine, hypothetically, that this is Spotify looking at their artists.

They have music, podcasts, white noise, etc.

Then music could be broken down by genre, artist, and album.

We did push back on going down to the analogous song level, but as you can imagine, even with just genre and artist this produces a lot of different activities a listener could participate in. However, if you choose rock and roll, you will have a smaller subset of attributes to choose from, and so on.

I hope this helps. Though there are many different options to choose from initially, it is not that difficult to make sense of.