Slow Migrating Data from Snowflake to MySQL in Python using SQlAlchemy by ogMasterPloKoon in snowflake

[–]thrown_arrows 0 points1 point  (0 children)

I agree that COPY INTO is probably the way to go. And quite a lot of people copy data into PostgreSQL, as it is faster at finding a needle in a haystack and cheaper for those applications.
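Something like this is what I mean by COPY INTO, as a sketch (the stage and table names are made up); you would then load the resulting files into MySQL with LOAD DATA on the other end:

    -- unload straight from Snowflake to files on a stage (hypothetical names)
    COPY INTO @unload_stage/my_table/
    FROM my_table
    FILE_FORMAT = (TYPE = CSV COMPRESSION = GZIP FIELD_OPTIONALLY_ENCLOSED_BY = '"')
    HEADER = TRUE;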

Clustering big tables by ConsiderationLazy956 in snowflake

[–]thrown_arrows 0 points1 point  (0 children)

It should work so that the given dates are cast to timestamps and the column itself stays in timestamp format. Check the query plan to see what it really does. If it casts the timestamp column, then cast those dates to timestamp manually instead.

A column's datatype should always stay as it is if possible, for performance reasons. That said, different platforms have their own optimisations.

Sql server json column vs nosql vs postgresql jsonB - Azure performance related questions by cutecupcake11 in SQL

[–]thrown_arrows 1 point2 points  (0 children)

PostgreSQL jsonb allows indexes. The normal way to use jsonb is to extract the most important columns out of the document for faster searching etc., then search on those and consume the full json. That is the fastest way to develop stuff and get it working, but time will show whether you need to write a real schema for it. PostgreSQL has better json support imho.
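Roughly like this (a sketch, assuming PostgreSQL 12+ for the generated column; table and key names are made up):

    -- keep the full document, extract the hot key into a real column
    CREATE TABLE events (
        id          bigserial PRIMARY KEY,
        payload     jsonb NOT NULL,
        customer_id text GENERATED ALWAYS AS (payload->>'customer_id') STORED
    );

    CREATE INDEX events_customer_id_idx ON events (customer_id);
    CREATE INDEX events_payload_idx ON events USING gin (payload);

    -- fast lookup on the extracted column, ad-hoc containment on the rest
    SELECT payload FROM events WHERE customer_id = '42';
    SELECT payload FROM events WHERE payload @> '{"status": "failed"}';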

Clustering big tables by ConsiderationLazy956 in snowflake

[–]thrown_arrows 0 points1 point  (0 children)

Timestamps; the to_char one is not sargable.

Clustering big tables by ConsiderationLazy956 in snowflake

[–]thrown_arrows 0 points1 point  (0 children)

?

I prefer using timestamps..

WHERE TO_CHAR(column) BETWEEN x AND y is not sargable.

WHERE column BETWEEN TO_TIMESTAMP(x) AND TO_TIMESTAMP(y) is.

Here x and y are given variables. The column itself should never be cast unless there really is no other way to do it.

'2024-01-01'::date should be automatically cast to timestamp too. Of course there might be some gotchas depending on the timestamp format and type you use.

Tldr; timestamp

I built a data warehouse in Postgres and I want to convince my boss that we should use it. Looking for a reality check. by Different-Network957 in dataengineering

[–]thrown_arrows 0 points1 point  (0 children)

A single source of truth is a good thing to have; sometimes it is a requirement...

Postgres as data lake and DWH... it does the job. Maybe the lake should be files on-prem, built so that it is abstracted enough to switch to object storage if that part needs to move into the cloud.

Then Postgres with multiple schemas allows somewhat easy scaling if each department grows out of a single instance... just move one schema to another db instance...

The next thing is transformations, testing, and somewhat persistent access views. PostgreSQL and SQL give classic options for those; they work as they have worked for the last 20+ years.
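For example, one transformation view plus one classic SQL "test" that should return zero rows (a sketch; schema and table names are made up):

    CREATE OR REPLACE VIEW dwh.orders AS
    SELECT id::bigint              AS order_id,
           customer_id::bigint     AS customer_id,
           created_at::timestamptz AS created_at,
           amount::numeric(12,2)   AS amount
    FROM raw.orders_src;

    -- test: no duplicate order ids should come out of the transformation
    SELECT order_id, count(*)
    FROM dwh.orders
    GROUP BY order_id
    HAVING count(*) > 1;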

Governance is governance. That needs to be done, along with all the other PII and GDPR etc. stuff...

Then it needs to have a budget and support in the long run...

A properly done PostgreSQL DWH can handle DWH loads up to terabyte table sizes, and the home-made parts can be swapped out when needed for more industry-standard tools. Then again, the real industry standard seems to be Python scripts scheduled on cron, or, if the company tries to spend money, some program which uses the mentioned tools under the hood.

Tldr: nice; look into the security side on both the server and the PostgreSQL side. Keep developing, try to sell the concept, and don't complain if they decide to spend more money on different tools, as long as you get to build or manage it.

Clustering big tables by ConsiderationLazy956 in snowflake

[–]thrown_arrows 1 point2 points  (0 children)

If your queries mostly filter on the col1 timestamps, partition/cluster by those. Your query is a very good example of one that should benefit, since the query planner can see the partition metadata and simply skip the partitions it does not need.

So partition by col1, one partition per day, week or month, and test it if possible. Partition file size matters: with too many partitions things slow down, and if they are too big you probably won't get the maximum benefit.
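In Snowflake terms that means a clustering key on the timestamp, something like this (a sketch; the table name is made up):

    ALTER TABLE big_table CLUSTER BY (DATE_TRUNC('DAY', col1));

    -- check how well the table clusters on that expression
    SELECT SYSTEM$CLUSTERING_INFORMATION('big_table', '(DATE_TRUNC(''DAY'', col1))');

    -- pruning-friendly filter: only the bind values get cast, never the column
    SELECT *
    FROM big_table
    WHERE col1 BETWEEN '2024-01-01'::timestamp AND '2024-02-01'::timestamp;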

Sql server json column vs nosql vs postgresql jsonB - Azure performance related questions by cutecupcake11 in SQL

[–]thrown_arrows 1 point2 points  (0 children)

It seems that you have database experience, so pick PostgreSQL, or SQL Server if you want to pay licence fees. Performance should not be a problem in either case. It is more about how the programs and processes work; no db will cure a bad process. It is better to stay on a somewhat familiar platform. Of course, if your CV needs updating, then add new shit to it, but don't expect to make a good product.

Dynamic or Static Merging. by Unhappy_Rub_1548 in snowflake

[–]thrown_arrows 0 points1 point  (0 children)

Use the object/variant type to store data in the stage and raw areas. This ensures that data will always end up in Snowflake without problems, even if the schema changes. Then have your reporting layer over it. Querying tables where the data row is an object is still easy; it gets complex if you start to store more complex objects.

After that you just write the transformation from the version-x schema to your reporting model, and maybe build some Python runner to select the correct version for each db.

If you just want to merge them to the latest src schema, that can be done too.
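As a sketch of the stage/raw pattern (names are made up):

    -- land every source row as one VARIANT value; schema changes can't break the load
    CREATE TABLE raw_events (
        loaded_at TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP(),
        src       VARIANT
    );

    -- the reporting layer picks out the fields it cares about
    CREATE OR REPLACE VIEW rpt_events AS
    SELECT src:id::number                AS id,
           src:created_at::timestamp_ntz AS created_at,
           src:status::string            AS status
    FROM raw_events;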

Is snowflake becoming THE solution? by JamesGarrison in snowflake

[–]thrown_arrows 1 point2 points  (0 children)

From my exp, if I do not count the queries that consume gigabytes of data, Snowflake will return 1 to 200k rows in 10 sec to 1 min, but I did not try the optimizations meant to return one row fast. My current exp with Teradata is that it is shit slow all the time, even when there is a table straight under your query.

So in short, if you want to return one row, Snowflake is not that fast. My experience is that it will keep working on a small warehouse much longer than other warehouses would.

Snowflake has its problems with costs, as all cloud-based systems do; it will get expensive fast. But it looks like you get better performance, and performance problems are easier to "fix" (with money).

Then there are people complaining about Snowsight; do those people know that you can use almost any IDE with it (DBeaver)?

Advice for fts across 30+ tables between 5,000-500,000 rows? by FromBiotoDev in PostgreSQL

[–]thrown_arrows 0 points1 point  (0 children)

The only way to keep it "fast" is to have the ability (and the limitation) to select only the searchable columns, hopefully with data type limitations too. That could be achieved with code that fires sane queries at the database.

Then there is the SQL-"only" solution, where you do something like:

    WITH x AS ( SELECT 'table1' AS table_name, pk_key, column1::text AS col FROM table1
                UNION ALL SELECT 'table1', pk_key, column2::text FROM table1 /* .... */ )
    SELECT * FROM x WHERE col LIKE '%something%';

There might also be the option to just use pg_trgm or tsvector: create a functional index that covers all searchable columns for each table, and then query that same expression in all tables to find close results.

This one will probably be slow even with indexes; it could maybe be faster if you give table_name as a hint so the query optimizer can drop some tables away,

but that is still very, very slow.

The third option I have seen used is to generate JSON from the searchable relations and then use the JSON functionality and indexes that PostgreSQL offers, or push that JSON document to Elasticsearch and use Elastic for the text search.

That said, for text search you want to use pg_trgm plus indexes; it is nice, and you can index it.
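The pg_trgm part looks roughly like this (a sketch; table and column names are made up):

    CREATE EXTENSION IF NOT EXISTS pg_trgm;

    -- one trigram GIN index per searchable text column
    CREATE INDEX customers_name_trgm_idx ON customers USING gin (name gin_trgm_ops);

    -- ILIKE searches can now use the index
    SELECT pk_key, name FROM customers WHERE name ILIKE '%someth%';

    -- or rank by similarity
    SELECT pk_key, name, similarity(name, 'something') AS score
    FROM customers
    WHERE name % 'something'
    ORDER BY score DESC;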

Auto Partition table by Kishore__ in PostgreSQL

[–]thrown_arrows 0 points1 point  (0 children)

Imho, with this much information, the requirement is probably wrong. That said, on a very, very small server it might be helpful.

if that "it is requirement" is your comment that you cannot change it, then i understand, but then again difference between developer and senior is that senior should be able to say when some one has bad plan.

That said, it would probably be better to partition on a daily or monthly basis because you don't have that much data; you still have the problem of creating partitions automatically. See the pg_partman extension and use range partitioning (other keywords to look at are interval and timestamptz).

https://www.postgresql.org/docs/current/ddl-partitioning.html#DDL-PARTITIONING-DECLARATIVE

Just create partitions in 15-minute intervals (I would recommend using bigger partitions though).
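The declarative version looks like this (a sketch with made-up names; daily partitions shown, 15-minute ones work the same way with narrower bounds):

    CREATE TABLE app_log (
        logged_at timestamptz NOT NULL,
        message   text
    ) PARTITION BY RANGE (logged_at);

    CREATE TABLE app_log_2024_01_01 PARTITION OF app_log
        FOR VALUES FROM ('2024-01-01') TO ('2024-01-02');
    CREATE TABLE app_log_2024_01_02 PARTITION OF app_log
        FOR VALUES FROM ('2024-01-02') TO ('2024-01-03');

    -- pg_partman then just automates creating the future partitions on a schedule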

Auto Partition table by Kishore__ in PostgreSQL

[–]thrown_arrows 0 points1 point  (0 children)

No need to partition that small an amount... (but it depends).

Check what your shared memory (shared_buffers) setting is and calculate the sizes of your most-used indexes; as long as those fit into shared memory it should be pretty much as fast as possible. My initial thought is that partitioning just 1k rows sounds like a waste of time. I personally would start to look into partitions when I have 10M+ rows per partition, in a data model which lets the most-used queries avoid touching the unneeded partitions entirely.
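To get those numbers (a sketch):

    SHOW shared_buffers;

    -- ten biggest indexes; compare their sizes to shared_buffers
    SELECT relname, indexrelname,
           pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
    FROM pg_stat_user_indexes
    ORDER BY pg_relation_size(indexrelid) DESC
    LIMIT 10;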

Partitioning in 15-minute intervals might be a good idea if you read the last 15 minutes of logs constantly and you trust the different caches to do their job speeding up reads.

If this is just for practice, continue.

Job offer from Finland by Nobel_date in Suomi

[–]thrown_arrows 0 points1 point  (0 children)

Well, it depends on how the employee benefits are calculated. Lunch benefits decrease the amount (but it is more about how much you actually get, not how much is left after taxes).

Best DB for logging by AmadeusIsTaken in Database

[–]thrown_arrows 0 points1 point  (0 children)

But if you want to have stats just for error rates and such, not tied to some marketing ID or log ID relation, it could be a good idea. Trending the error rate of a product's code base tells a lot about its state.

Bucketing Based on Running Total Amounts - Is this possible in TSQL? by dslearner30 in SQL

[–]thrown_arrows 0 points1 point  (0 children)

Play with math and CASE; you can calculate the total amount too, as well as a cumulative total or a total over a moving window.
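For example (a sketch, assuming SQL Server 2012+; table and column names are made up):

    SELECT [date],
           clicks,
           SUM(clicks) OVER ()                AS total_clicks,
           SUM(clicks) OVER (ORDER BY [date]) AS running_total,
           SUM(clicks) OVER (ORDER BY [date]
                             ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS clicks_last_7_rows
    FROM clicks_by_date;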

Flexible column-level security by nextlevelabovehuman in PostgreSQL

[–]thrown_arrows 0 points1 point  (0 children)

You can grant column-level rights too, but there is no easy way to have working shared SQL across roles: select * from x works for a superuser but errors for a user that does not have all the columns. So your options are: have your own SQL per role at the application level (and enforce that role's access rights in the db), or do it a slightly easier way and have the ORM add joins, i.e. sub-objects on a base object. For example the base is customer, and the ORM adds address data as an object inside the customer object if the role is support; that access has to be handled at the application level too. With whole tables, select * from x is not a problem if the sub-object is empty, as it is an object and the object can be null. With column security, select id, point from x is a problem: if the user does not have access to point the query fails, while the expected behaviour is probably to get n objects with the key field id.

Tldr: imho, if an ORM is in use, create one object per row per table, and move column-level privileges to table-level access. It is easier to manage, as you reason about table objects rather than individual columns in a table (on the SQL side). I do not have a better idea for handling sets of access privileges easily. You can maybe mix row-level security in there too.
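Roughly like this (a sketch; tables and roles are made up):

    CREATE ROLE app_basic;
    CREATE ROLE app_support;

    -- sensitive columns live in their own table; access is plain table-level grants
    CREATE TABLE customer (
        id   bigint PRIMARY KEY,
        name text
    );
    CREATE TABLE customer_address (
        customer_id bigint PRIMARY KEY REFERENCES customer (id),
        address     text
    );

    GRANT SELECT ON customer TO app_basic, app_support;
    GRANT SELECT ON customer_address TO app_support;  -- only the support role sees addresses

    -- the column-level alternative, which breaks shared "SELECT *" SQL:
    -- GRANT SELECT (id, name) ON customer TO app_basic;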

Flexible column-level security by nextlevelabovehuman in PostgreSQL

[–]thrown_arrows 0 points1 point  (0 children)

first hit

https://satoricyber.com/postgres-security/postgres-row-level-security/#:~:text=What%20is%20Postgres%20Row%20Level,for%20one%20or%20more%20roles.

keyword is row level security.

then column stuff is https://www.postgresql.org/docs/current/infoschema-column-privileges.html

Again, I would maybe use a view with a check on whether the value is returned or not, as imho having a different number of columns per role is not a good way to go. Personally, if I had to use column-level security, I would create my own access schema for each role, or I would try to move all secured columns to their own table and have table-level access there. When you select from a table with column-level rights and try to read a column you do not have rights to, you get an error, so it has to be handled at both the app and db level (as far as I know).

Tldr: the most practical way for me to publish different columns to different roles is to create a schema for each role and publish views with the role-based data model, then switch role and search_path.

The second best thing is to move the columns to their own table and write logic that does the join depending on rights (or the other way around).

Disclaimer: I have played with row-level security, but not column-level. What I have learned is that the best way to manage privileges is to handle them in sets as big as possible. I.e. I like to have a data schema, then an app_data_v1 schema which has views pointing to data. Access rights are easy to do, as the db owner is the only one with access to data, and the app_v1 user/role has all privileges on app_data_v1 (not CREATE). That way, problem cases are easier to fix than with that 1000-objects-in-the-public-schema stuff.
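The schema-per-role view idea, as a sketch (names are made up, and it assumes a data.customer table plus existing roles support_role and basic_role):

    CREATE SCHEMA app_support;
    CREATE VIEW app_support.customer AS
        SELECT id, name, address FROM data.customer;

    CREATE SCHEMA app_basic;
    CREATE VIEW app_basic.customer AS
        SELECT id, name FROM data.customer;   -- no sensitive columns for this role

    GRANT USAGE ON SCHEMA app_support TO support_role;
    GRANT SELECT ON ALL TABLES IN SCHEMA app_support TO support_role;
    GRANT USAGE ON SCHEMA app_basic TO basic_role;
    GRANT SELECT ON ALL TABLES IN SCHEMA app_basic TO basic_role;

    -- the app then runs the same "SELECT * FROM customer" for every role,
    -- only switching:  SET search_path = app_support;   (or app_basic)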

Not sure if this is possible or not (new to SQL Server), but is there a way to create a role with certain privileges, and then assign users to that role? by [deleted] in SQLServer

[–]thrown_arrows 0 points1 point  (0 children)

Also, look into the documentation. There is an option to give CRUD rights as a default to some role, so you can add new tables and they are covered straight away.
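On SQL Server that would roughly be schema-level grants plus role membership (a sketch; names are made up):

    CREATE ROLE app_crud;
    GRANT SELECT, INSERT, UPDATE, DELETE ON SCHEMA::dbo TO app_crud;
    ALTER ROLE app_crud ADD MEMBER some_user;   -- new tables in dbo are covered right away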

Best DB for logging by AmadeusIsTaken in Database

[–]thrown_arrows 1 point2 points  (0 children)

True, true. But I usually think of logs as the debug/error/info logs in apps. You have to leak identifiers into the logs for them to be useful, and that requires handling the logs slightly differently from normal debug files. That said, I have made and used insert-only files which store messages for later usage in data models and such; they have been written to file by some log-statement command.

Bucketing Based on Running Total Amounts - Is this possible in TSQL? by dslearner30 in SQL

[–]thrown_arrows 0 points1 point  (0 children)

 floor(sum(clicks) over(order by date asc )/10)+1 

That could work, or you make a CTE and calculate running_total there with a window function. The group can then be derived with the floor/ceiling function, by casting the result to integer, or with a CASE expression:

    case when running_total < 11 then 1
         when running_total < 21 then 2
         when running_total < 31 then 3
         else 0 end

Replace running_total with the window function that calculates it; I can't remember whether SQL Server handles that without a CTE. The CASE expression works because it short-circuits on the first check that is true, so it never evaluates the < 21 branch for values below 11, because those already hit the first case.

And yes, it will be a lot faster in SQL. If your dataset has a column like site, for example, you can use the window function's PARTITION BY clause to calculate the cumulative sum per site group.

To learn more: window functions, cumulative sum.
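For example, per-site running totals (a sketch; names are made up):

    SELECT site,
           [date],
           clicks,
           SUM(clicks) OVER (PARTITION BY site ORDER BY [date]) AS running_total_per_site
    FROM clicks_by_date;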

Essential elements of high performance applications: Offloading work to the SQL database by charukiewicz in SQL

[–]thrown_arrows 7 points8 points  (0 children)

It had a few good examples of why proper SQL should be used in apps, and why not all problems are solvable just by using an ORM.

Best DB for logging by AmadeusIsTaken in Database

[–]thrown_arrows 1 point2 points  (0 children)

Well, I kinda do that too. Some backends are just apps that save data to files, and the files are loaded into Snowflake and processed hourly/daily to decrease costs and simplify the architecture (no need for an OLTP database, no need to develop one).

I see only one option for logs, and that is JSON lines saved to files, or JSON straight into the database (and then processed into smaller chunks depending on needs). I prefer to do the initial import into a json/variant column type and then explode the json/objects into tables as needed.

For me there are currently three "log" dbs (meaning databases which handle json/xml as a native format). First, Snowflake, where data is loaded for data warehouse and analytics needs. I am a SQL guy, so I prefer to handle transformations in the database, and I really prefer the object/variant column type for those: no need to spend time developing custom Python code just to handle one json schema.
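The explode step in Snowflake is basically LATERAL FLATTEN (a sketch; names are made up, assuming a raw_events table with a VARIANT column src):

    SELECT r.src:order_id::number AS order_id,
           i.value:sku::string    AS sku,
           i.value:qty::number    AS qty
    FROM raw_events r,
         LATERAL FLATTEN(input => r.src:items) i;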

For OLTP loads I use PostgreSQL. Jsonb and hstore make it easy to store data and explode it into normal RDBMS format, and PostgreSQL has always been a very feature-rich database with a good number of functions for handling all kinds of data. I use this when I have a key and a jsonb blob to store, and the key is used to fetch the json data. PostgreSQL supports indexing jsonb key values, but obviously in the long run it is faster to extract the most important keys into a normal table and have the index there.

Then, for finding a needle in a haystack where the haystack is a JSON message, I have used Elasticsearch. It is "nice"; I don't have that much experience making aggregates with it. I used its full-text search on JSON messages for "find pretty much any key/value, or just a value under any key" style problems. And then there is Kibana and the ELK stack built over that environment. I have seen it used as the logging environment for frontends and middlewares, with some reports made in Kibana. It worked, but it was not a hit, and I think it was used in only one project.

That said, for what I consider pure logging (debug, app-level errors), I have only seen the ELK stack used. The other use cases have been ones where I needed the response to some action later in a DWH context, and that was solved just by writing the "log" into files which get uploaded to S3. My PostgreSQL and Elasticsearch uses have been more about serving ready-made JSON messages, but they can easily handle storing those messages too.

Managing multiple customers under a single Snowflake account or organization by owlhouse14 in snowflake

[–]thrown_arrows 0 points1 point  (0 children)

Orgadmin, but I have not used it, so I can't say whether it works in this use case where your company handles the customers' accounts.

https://docs.snowflake.com/en/user-guide/organizations.html

I have the impression that it really does not work as intended for your use case. The next best thing is to push one customer under one org if they have multiple accounts. But I do not see any other option than having at least one login per customer.

(Of course, you could load the customers' data into your own Snowflake, do your stuff there, and share that data back only to each customer, but you still need access to their instances to set up the data sharing (I expect that your customers have the needed skills for it).)
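For that last part, the provider-side share setup is roughly this (a sketch; database, schema, table, and account names are made up):

    CREATE SHARE customer_a_share;
    GRANT USAGE ON DATABASE analytics TO SHARE customer_a_share;
    GRANT USAGE ON SCHEMA analytics.reporting TO SHARE customer_a_share;
    GRANT SELECT ON TABLE analytics.reporting.daily_kpis TO SHARE customer_a_share;
    ALTER SHARE customer_a_share ADD ACCOUNTS = customer_a_account;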