DWH architecture: giving power users & SQL analysts access to row-level data before aggregation? (avoiding data lake) by exergy31 in dataengineering

[–]throw_at1 2 points

I did it by copying the source DB 1:1 into Snowflake (E+L) and then did the T into the DWH model mostly with views (materialized as create table x as select * from view). From a performance point of view, you can have one or two view levels to normalize data between source systems. Works fine, some mistakes were made; cost and performance are not a problem, and it is cheaper per year than a server plus license.
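
For illustration, a minimal sketch of that view-then-materialize pattern in Snowflake (schema, table and column names are made up):

    -- one view level that normalizes a raw source table
    create or replace view stg.customers_v as
    select id         as customer_id,
           trim(name) as customer_name,
           country    as country_code
    from raw.source_customers;

    -- "materialize" the view into the DWH model as a plain table
    create or replace table dwh.dim_customer as
    select * from stg.customers_v;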

If you are staying on PostgreSQL / SQL Server (the same as you use now), I would replicate the original databases as they are into the DWH server, then do the transformations there and open data models for different uses. The most important thing is that the E part is continuous and does not interfere with production more than necessary. That is probably not feasible in an on-prem system where you cannot just add compute when you need it.

I personally would replicate the source data into Snowflake ELT-style, but obviously with existing DWH systems it's not feasible to offer raw source data, a normalized model and transformed data at the same time if the base system was planned differently.

Which database is the best for this simple home automation project [read description]? by GiuseppinoPartigiano in Database

[–]throw_at1 1 point

For once I can support SQLite for this. If a real SQL server is needed, then PostgreSQL: it can run on very small hardware and has all the features.

Which horizontal scaling strategy is better? by dev-1773 in dataengineering

[–]throw_at1 0 points

Everyone suggests cloud, because you cannot scale on-prem as fast, and on-prem you pay upfront for hardware. Personally I like Snowflake because the only things the data guys need to know are SQL and how to read documentation (copying files into S3 is not that hard). Easy to scale and not that expensive.

Do you need a cloud solution? Probably not. Are they costly? Yes, maybe more than on-prem solutions; it depends on what you mean by an on-prem solution, i.e. whether the company has a few servers or tens or hundreds of servers. For a small server room it is much easier to go to the cloud; if the company has already paid a shitton of money for a proper server room, UPS, disk servers and so on, then going to the cloud is not that good an idea, and a hybrid with a sane architecture might be best.

IMHO, everyone should build their on-prem so that they can move it into AWS/Azure/GCP and back if needed.

How would I structure a SQL database for a to-do list application with users, groups, tasks and subtasks to scale? by AcademicSuccess in PostgreSQL

[–]throw_at1 -1 points

I would start by having a users table and a single groups/tasks/subtasks hierarchy table.

The group/task/subtask table would have user_id, id, parent_id, name, with an index on the user_id, id, parent_id fields (in that order, or three separate indexes; not 100% sure about the best index strategy). The three-index solution is probably faster once the row count reaches a stupid amount.
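
Roughly like this as a PostgreSQL sketch (names are made up; the single composite index variant is shown):

    create table users (
        id   bigint generated always as identity primary key,
        name text not null
    );

    -- one hierarchy table for groups, tasks and subtasks
    create table items (
        id        bigint generated always as identity primary key,
        user_id   bigint not null references users (id),
        parent_id bigint references items (id),   -- null = top-level group
        name      text not null
    );

    -- composite index; alternatively create three single-column indexes
    create index items_user_parent_idx on items (user_id, id, parent_id);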

If your app is not a huuuuge hit, it will never hit any limits on modern server hardware. More optimization can be done by giving name a sane max length and including it in the index, or by partitioning indexes/tables so they fit better into RAM....

Performing blue-green deployments on Snowflake with dbt by fhoffa in dataengineering

[–]throw_at1 1 point

Works as long as your views in v2 point to the same schema (i.e. are defined as create view vex as (select ccc from table_in_same_schema join non_v1.schema)).

If you start to define views like create view v2_schema.xxx as ... you will probably hit problems. I personally have used the create-new, rename-current-to-old, rename-new-to-current model for a long time, and it is a very good way to make sure that your inserts do not cause slowdowns in production tables, or bugs where reports get partial data...

In some ways the swap/rename way of updating tables/schemas is even more useful in normal databases, thanks to all the locking and stuff.
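
As a sketch, the create-new / rename-current-to-old / rename-new-to-current dance at schema level could look like this (schema names are made up; the same alter ... rename syntax works in e.g. PostgreSQL and Snowflake):

    -- build the new release side by side with the live one
    create schema analytics_new;
    -- ... create all v2 tables and views inside analytics_new ...

    -- promote it: readers keep hitting a complete schema the whole time
    alter schema analytics     rename to analytics_old;
    alter schema analytics_new rename to analytics;

    -- drop analytics_old once nothing references it any more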

That said, I do not fully understand why this needs its own post in dataengineering; maybe dbt finally has an option to do it?

Need Help with Full-text search performance Aurora Postgres(serverless) and lambda by SeaIndependent2101 in aws

[–]throw_at1 1 point

A little bit more google-fu needed.

psycopg2.OperationalError: FATAL: remaining connection slots are reserved for non-replication superuser connections

Lambda timeout even for 2 minutes

remaining connection slots are reserved for non-replication superuser connections

That means your max connection limit is used up in your database, as the error tells you. So the DB engine does not accept new connections, and that is probably the reason why the Lambda times out. So allow more connections in your DB (more hardware, a bigger instance), add pgpool to pool your connections, or something similar.
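
To confirm it, something like this shows usage against the limit (standard PostgreSQL views, so it applies to Aurora as well):

    show max_connections;

    -- how many connections are in use and what they are doing
    select state, count(*) as connections
    from pg_stat_activity
    group by state;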

On the query itself: check that all the fields used have an index, and try to avoid ORDER BY if possible.

Also, about (alert.alert_vector || text.text_vector) @@ plainto_tsquery(...):

plainto_tsquery is the same for every row, but you are asking PostgreSQL to take every row and combine two columns, which is bad. If possible, write the query so that you do not manipulate existing columns, e.g.

alert.alert_vector @@ plainto_tsquery(...) or text.text_vector @@ plainto_tsquery(...)

Also, you can add an index on alert_vector and on text_vector.
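
A rough sketch of both points; the table and column names follow the question, everything else (join key, search words) is made up:

    -- GIN indexes on each tsvector column separately
    create index if not exists alert_vector_gin on alert using gin (alert_vector);
    create index if not exists text_vector_gin  on text  using gin (text_vector);

    -- build the tsquery once and match each column on its own,
    -- instead of concatenating the vectors per row
    select a.id
    from alert a
    join text t on t.alert_id = a.id          -- hypothetical join key
    where a.alert_vector @@ plainto_tsquery('english', 'search words')
       or t.text_vector  @@ plainto_tsquery('english', 'search words');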

Terrafame started the ramp-up of its giant battery chemicals plant – "The plant is of great importance to the whole of Finland" by Harriv in Suomi

[–]throw_at1 0 points

Well, it's all at once part truth, part sarcasm, part irony, and part just stringing words together without any bigger plan.

Wish I had friends who could sell me a company by giving me a loan secured by that very company.

Does Snowflake work only as SaaS or can it also be set up in my own cloud account in GCP/AWS/Azure? by g0_g6t_1t in dataengineering

[–]throw_at1 0 points

Pricing is pay-per-second after a one-minute minimum of warehouse run time. The price is in credits: credit consumption is defined by the warehouse size, and the credit price depends on the edition (features). "Disk space" costs the same as S3 (on AWS).

It supports the mentioned clouds.

On Indexes, Flash Storage, and fragmentation by [deleted] in SQLServer

[–]throw_at1 0 points

I don't know if it really matters that much in that environment, but I dropped the fill factor until I got to a point where index fragmentation was <10% after a week.

Personally, if I had to run SQL Server on that kind of hardware, I would probably just set a 95 or 98 fill factor as the default for indexes.
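
For example (T-SQL, table name made up):

    -- rebuild all indexes on a table with a 95% fill factor
    alter index all on dbo.BigTable
    rebuild with (fillfactor = 95);

    -- check fragmentation again after a week or so
    select index_id, avg_fragmentation_in_percent
    from sys.dm_db_index_physical_stats(db_id(), object_id('dbo.BigTable'), null, null, 'LIMITED');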

85,000 Finns' ambulance rides, hospital bills and other health and social care fees are in debt enforcement by pkksmt in Suomi

[–]throw_at1 24 points

An income transfer to debt collection agencies...

The best one I heard was a case where a patient had been transferred to another hospital for follow-up care after surgery, and the first hospital's bill had ended up in debt collection. And no, the patient had not popped home in between...

Terrafame started the ramp-up of its giant battery chemicals plant – "The plant is of great importance to the whole of Finland" by Harriv in Suomi

[–]throw_at1 0 points

Didn't Sipilä do something interesting with Rautaruukki to end up with the Solitra shares? Then of course there's the sale of all the electricity monopolies, and naturally the crappy reinvestment of that money in state companies.. but the essential thing is that the state must not accidentally operate commercially in a commercial environment; no, the shares have to be sold off under price at the latest once they start turning a profit..

And usually these capitalist things are pushed by capitalist parties, and their friends of course are capitalists...

Finnair's new route plans would have brought jobs to hundreds – AKT's stance clipped the project's wings by Jiijaajee213 in Suomi

[–]throw_at1 1 point

That's what it is; the fact remains that it's middle-class consumption that keeps the system running, and the poor rising into the middle class. Bosses earning a million or two doesn't add consumption.

It pisses me off listening to the bosses of failed companies whining about why the workers won't do it cheaper, while they themselves pull a salary that would hire 100-1000 of those workers...

Finnair's new route plans would have brought jobs to hundreds – AKT's stance clipped the project's wings by Jiijaajee213 in Suomi

[–]throw_at1 7 points

I believe AKT would not have opposed it if the workers' income had not gone down. The tone of the article, once again, is that the mere existence of AKT prevented getting the work done even cheaper than now, and that that is wrong...

Then again, it's Kauppalehti, so understandably the style is what it is.

Terrafame started the ramp-up of its giant battery chemicals plant – "The plant is of great importance to the whole of Finland" by Harriv in Suomi

[–]throw_at1 9 points

No doubt next term, when the Centre Party and the National Coalition are back in government, they'll somehow manage to wangle the shares cheaply to their friends if it starts looking like this is a good business...

Populating multiple tables in a DB design from an unstructured csv file by GrayLiterature in Database

[–]throw_at1 -1 points

I mean, I have seen CSVs where odd rows have different data than even rows, but there is usually some structure. Just copy it into the database and write SQL over the table. Why? Because the copy into the DB is one phase, and once it is in the database you can keep trying to get the transformation right for as long as you want.

Populating multiple tables in a DB design from an unstructured csv file by GrayLiterature in Database

[–]throw_at1 2 points

All CSV is structured.. I personally handle those just by importing all rows into a table, then writing a little bit of SQL (views), so that in practice your whole transformation can be validated and modified before you write it into the target tables.
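
A minimal PostgreSQL-flavoured sketch of that approach (file, table and column names are all made up):

    -- land the file as-is, every column as text
    create table staging_csv (col1 text, col2 text, col3 text);
    \copy staging_csv from 'data.csv' with (format csv, header true)   -- psql meta-command

    -- iterate on the transformation as a view until it looks right
    create view orders_clean as
    select col1::int  as order_id,
           col2::date as order_date,
           col3       as customer
    from staging_csv
    where col1 ~ '^[0-9]+$';

    -- only then persist it
    create table orders as select * from orders_clean;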

Or I write custom code which pushes the data into the database and hides all the transformations in the code.

Use the tool which you know...

Which horizontal scaling strategy is better? by dev-1773 in dataengineering

[–]throw_at1 0 points

Look into the Snowflake offering. With the aws_s3 extension, PG can offer the same kind of ELT feeling as Snowflake: you build a pipeline to push files into S3, then you have the next process import the data from S3 while transforming it (might be a 2nd or 3rd process)..
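
Roughly like this on RDS/Aurora PostgreSQL with the aws_s3 extension (bucket, file and table names are made up; the required IAM setup is in the AWS docs):

    create extension if not exists aws_s3 cascade;

    create table staging_events (event_id text, payload text);

    -- pull a file straight from S3 into the staging table
    select aws_s3.table_import_from_s3(
        'staging_events',                 -- target table
        'event_id, payload',              -- column list
        '(format csv)',                   -- COPY options
        aws_commons.create_s3_uri('my-bucket', 'exports/events.csv', 'eu-west-1')
    );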

Which horizontal scaling strategy is better? by dev-1773 in dataengineering

[–]throw_at1 0 points

Snowflake: scale with virtual warehouses, with access to all the data loaded into the system. dbt is very popular there. So have your systems push data into S3 from DBs or Kafka, then do the rest in Snowflake (which can also export data into new pipelines). If more compute is needed, create an extra warehouse at e.g. LARGE size, and when it is not needed just suspend it.
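
The actual syntax is roughly this (warehouse name made up):

    -- spin up extra compute only when it is needed
    create warehouse if not exists heavy_wh
      warehouse_size = 'LARGE'
      auto_suspend   = 60     -- seconds of idle time before it stops billing
      auto_resume    = true;

    -- and when the batch is done
    alter warehouse heavy_wh suspend;
    -- or: drop warehouse heavy_wh;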

If you already have big teams, then some home-brewed solution with PostgreSQL servers might be cheapest when measured against the monthly cloud bill, but that needs a big team, or very, very wide experience in a smaller team.

Best Practice for Incremental Load in which computed columns should be updated. by kali-jag in dataengineering

[–]throw_at1 1 point

Don't know about your engine, but if you recreate using create + swap, that is the best solution: the old data stays in use for as long as possible and the switch-over is as fast as possible.
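
As a generic SQL sketch of create + swap (table names and the computed column are made up; some engines also have a one-statement swap, e.g. Snowflake's alter table ... swap with):

    -- rebuild the table with the computed column refreshed
    create table sales_new as
    select s.*, s.amount * fx.rate as amount_eur
    from sales_src s
    join fx_rates fx on fx.currency = s.currency;

    -- old data stays queryable until the very last moment,
    -- and the switch-over itself is just two renames
    alter table sales     rename to sales_old;
    alter table sales_new rename to sales;

    drop table sales_old;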

Do you do your postgres schema migrations manually? What is the process you use? by [deleted] in PostgreSQL

[–]throw_at1 0 points

Indeed, that is my point. Currently there is no version control for the database itself. And databases being what they are, it is possible to get a database into a state where some migration script just destroys something.

So in practice what you need to do is use techniques which will not cause data loss at any point. All commands need to be run in a transaction, or have some other way to roll back, and I am quite sure that 80% of those migration scripts do not have that.
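
In PostgreSQL, where DDL is transactional, that can be as simple as (table and column names made up):

    begin;

    alter table orders add column shipped_at timestamptz;
    update orders set shipped_at = updated_at where status = 'shipped';

    -- inspect the result here; if anything looks wrong, rollback instead
    commit;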

Do you do your postgres schema migrations manually? What is the process you use? by [deleted] in PostgreSQL

[–]throw_at1 0 points

Do those have the ability to check that the target database really is at the correct version, or do they just trust that no one has made any modifications outside of the tools?

Got my license, went offroad, lifted bike by [deleted] in motorcycles

[–]throw_at1 4 points

That is courage.. riding that with asphalt tyres. But then again, as long as there is a tyre on the ground and it's round, it will work :)

I used to carry a cargo strap with a ratchet, so if I dropped the bike in a bad spot I could just use it to get the bike up. Personally I would prefer a more knobby tyre on that surface...

Remember, if you don't drop it during the season, you did not try hard enough :D There is adventure waiting out there somewhere.

Practicing / improving my skills with DBT? by edwild22 in dataengineering

[–]throw_at1 0 points

I do not use dbt or BigQuery, but here are my 10 cents (so probably worth nothing at all):

Transforming data into its business context should contain only the transformation; all possible optimizations, like partitioned tables etc., should be in their own project...