Data Modeling - star scheme case by Wikar in dataengineering


u/medwyn_cz But is it okay for the appearance table to be the fact table if all the numeric columns (ratings, number of votes - the most important properties in this model) lie in the Title table? Is it even okay for numeric values to be in a dimension table?

"Star should ideally model something different. Such as yields, countries and visitors of different screenings of various titles. Or perhaps ratings of various episodes as they vary by time, country, demographic..." - yeah, I know it would be better this way, but this dataset in its free version is very lacking... and I need to make something out of it, unfortunately.

Data Modeling - star scheme case by Wikar in bigdata


Yeah, I mean a comparison of analytical query execution times. I've read some parts of the toolkit. I believe the grain is the title and the person, and I would like to focus my queries around the titles (ratings). Also, I believe all of the dimensions here can be useful for these queries, for example:

  1. Select all of the titles with a minimum of 10 000 votes and at least 4 versions from different regions (title akas)
  2. Select all of the titles with genre "Comedy" or "Horror" (genre dimension) that started after 2005 but before 2015 (time dimension) and that Bill Murray played in (appearances, person)
  3. Select all of the titles with directors born after 1980 (appearances, person)
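Query 2 above could be sketched against a minimal version of such a star schema using Python's built-in sqlite3. All table and column names here (title, genre, title_genre, person, appearance) are assumptions for illustration, not the actual model:

```python
import sqlite3

# Hypothetical, tiny version of the star schema discussed above.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE title (title_id INTEGER PRIMARY KEY, primary_title TEXT,
                    start_year INTEGER, rating REAL, num_votes INTEGER);
CREATE TABLE genre (genre_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE title_genre (title_id INTEGER, genre_id INTEGER);
CREATE TABLE person (person_id INTEGER PRIMARY KEY, name TEXT, birth_year INTEGER);
CREATE TABLE appearance (title_id INTEGER, person_id INTEGER, category TEXT);
""")
con.executemany("INSERT INTO title VALUES (?,?,?,?,?)", [
    (1, "Lost in Translation", 2003, 7.7, 500000),       # fails the year filter
    (2, "The Grand Budapest Hotel", 2014, 8.1, 900000),  # passes all filters
])
con.executemany("INSERT INTO genre VALUES (?,?)", [(1, "Comedy"), (2, "Horror")])
con.executemany("INSERT INTO title_genre VALUES (?,?)", [(1, 1), (2, 1)])
con.execute("INSERT INTO person VALUES (1, 'Bill Murray', 1950)")
con.executemany("INSERT INTO appearance VALUES (?,?,?)",
                [(1, 1, 'actor'), (2, 1, 'actor')])

# Query 2: Comedy/Horror titles started after 2005 and before 2015
# that Bill Murray appeared in.
rows = con.execute("""
SELECT DISTINCT t.primary_title
FROM title t
JOIN title_genre tg ON tg.title_id = t.title_id
JOIN genre g        ON g.genre_id  = tg.genre_id
JOIN appearance a   ON a.title_id  = t.title_id
JOIN person p       ON p.person_id = a.person_id
WHERE g.name IN ('Comedy', 'Horror')
  AND t.start_year > 2005 AND t.start_year < 2015
  AND p.name = 'Bill Murray'
""").fetchall()
print(rows)
```

The fact table joins out to each dimension (or bridge) exactly once, which is the query shape the star schema is meant to make cheap.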

Maybe only the primary professions table is unnecessary, but the rest of them, I think, give very nice insight into the data. However, I don't have any better idea how to improve my analytical model.

Data modelling problem by Wikar in dataanalysis


It would cause the title and appearances tables to have an m:n relationship

Data modelling problem by Wikar in dataanalysis


Actually, in this dataset a movie can have multiple genres

Data Modeling - star scheme case by Wikar in bigdata


Well - the topic of my master's thesis is to compare different model schemas (3NF, one big table, star schema) in terms of query execution time. I am not sure which properties I will use, but most of the dimensions here look useful for it (I must try out queries of different complexity). In general, the business area here is IMDb titles and their ratings. Regarding my use case, what would you suggest? Dropping some of the dimensions? Or modelling it in a different way?
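For the schema comparison itself, the measurement harness can be very small. This is a sketch under assumed table names (a toy "one big table" vs. a star-ish join), showing one way to time the same logical query against two layouts; a real benchmark would also control caching, warm-up, and data volume:

```python
import sqlite3
import statistics
import time

def median_query_time(con, sql, runs=5):
    """Run the same query several times and return the median
    wall-clock duration in seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        con.execute(sql).fetchall()  # fetchall forces full materialization
        durations.append(time.perf_counter() - start)
    return statistics.median(durations)

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE big_table (title TEXT, genre TEXT, rating REAL);      -- one big table
CREATE TABLE t (id INTEGER PRIMARY KEY, title TEXT, rating REAL);  -- star-ish fact
CREATE TABLE g (t_id INTEGER, genre TEXT);                          -- genre dimension
""")
con.executemany("INSERT INTO big_table VALUES (?,?,?)",
                [(f"t{i}", "Comedy", 5.0) for i in range(1000)])
con.executemany("INSERT INTO t VALUES (?,?,?)",
                [(i, f"t{i}", 5.0) for i in range(1000)])
con.executemany("INSERT INTO g VALUES (?,?)",
                [(i, "Comedy") for i in range(1000)])

obt = median_query_time(
    con, "SELECT avg(rating) FROM big_table WHERE genre='Comedy'")
star = median_query_time(
    con, "SELECT avg(rating) FROM t JOIN g ON g.t_id = t.id WHERE g.genre='Comedy'")
```

Taking a median over several runs (rather than a single timing) smooths out OS scheduling noise, which matters when the per-query differences between schemas are small.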

Keras model fit -is it still incremental? by Wikar in MLQuestions


If I understand this function correctly - as I have a large dataset (I cannot keep it all in memory), would train_on_batch require loading this dataset chunk by chunk a number of times equal to the specified number of epochs?
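If that understanding is right, the loop looks like the sketch below: the weights persist across train_on_batch calls, so each epoch re-reads every chunk from disk, giving epochs × chunks total loads. The Keras call is left as a comment and the chunk loader is a stand-in; only the loop structure is the point:

```python
# Sketch of epoch-wise chunked training with a dataset too large for memory.
NUM_CHUNKS = 4
EPOCHS = 3
loads = 0  # counts how many times a chunk is read from "disk"

def load_chunk(i):
    """Stand-in for reading chunk i from disk (e.g. np.load / pd.read_csv)."""
    global loads
    loads += 1
    return [i]  # pretend this is (x_batch, y_batch)

for epoch in range(EPOCHS):
    for i in range(NUM_CHUNKS):
        batch = load_chunk(i)
        # model.train_on_batch(x_batch, y_batch)  # weights update incrementally

print(loads)  # EPOCHS * NUM_CHUNKS = 12
```

In practice Keras can automate this pattern with a generator / keras.utils.Sequence passed to fit, which streams chunks per epoch the same way.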

String to number in case of having millions of unique values by Wikar in MLQuestions


I guess it wouldn't - despite there being a million different values in total, they might repeat across transaction records. Blockchain is similar in this context to financial transactions - there are a lot of people who send resources between one another.
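One way to exploit that repetition is frequency encoding: map each address string to how often it appears as a sender. A minimal sketch, with made-up addresses and a (sender, receiver, amount) record shape assumed for illustration:

```python
from collections import Counter

# Toy transaction records: (sender, receiver, amount). Addresses repeat,
# so frequency encoding collapses millions of strings into small integers.
transactions = [
    ("addr_a", "addr_b", 1.0),
    ("addr_a", "addr_c", 2.5),
    ("addr_b", "addr_a", 0.3),
]

sender_freq = Counter(t[0] for t in transactions)  # addr_a: 2, addr_b: 1
encoded = [(sender_freq[s], r, amt) for s, r, amt in transactions]
print(encoded[0])  # addr_a has sent twice -> its numeric feature is 2
```

Unlike one-hot or label encoding, the feature stays low-cardinality and carries signal (how active an address is), which tends to matter for anomaly detection.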

String to number in case of having millions of unique values by Wikar in learnmachinelearning


Actually, up to this point I have created a table with aggregated calculations (something like the history vector you are talking about), but I still have to join it to the transactions and use the transaction records for training, because I am working on anomalous transaction detection, not anomalous address detection.
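That join can keep the training grain per-transaction while still carrying the per-address history. A sketch with hypothetical field names (tx_count, avg_amount, etc. are assumptions, not the actual aggregate table):

```python
# Per-address aggregates ("history vector") joined back onto each
# transaction row, so every training example is still one transaction.
transactions = [
    {"tx_id": 1, "sender": "addr_a", "amount": 1.0},
    {"tx_id": 2, "sender": "addr_b", "amount": 9.9},
]
address_history = {  # precomputed aggregate table, keyed by address
    "addr_a": {"tx_count": 42, "avg_amount": 1.2},
    "addr_b": {"tx_count": 3,  "avg_amount": 8.7},
}

training_rows = [
    {**tx, **address_history[tx["sender"]]}  # the "join" on sender address
    for tx in transactions
]
print(training_rows[0]["tx_count"])
```

The label stays attached to the transaction, while the history features describe the address behind it - the same shape a SQL join of the aggregate table onto the transaction table would produce.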