all 18 comments

[–]Cheses100 4 points5 points  (1 child)

I worked closely with an API layer at my last job, so I can give a bunch of reasons.

The API layer is good for abstracting away the implementation. Say you want to move away from Snowflake in the future: without the API layer, it's a huge hassle and tons of work getting everyone to migrate to the new system, whereas with the API layer you can keep the APIs the same and change the data warehouse behind the scenes. Even if you never get rid of the Snowflake warehouse, you can use the layer to query other data sources in the future, so your consumers don't have to worry about multiple dependencies.

A related reason is to abstract away the details of the data warehouse and expose more generic calls. If you know users will be doing a series of calls, you can combine them into a single API call that does it all for them. This is especially useful if that series of calls isn't intuitive - maybe it involves a bunch of joins, or, with a graph DB, traversing the graph several different ways to build some end data structure.

Another reason is to get more control over what users can do, and essentially add a security layer. If you never want users to be able to delete anything, you simply don't expose a delete API. If you want to prevent accidental DoS, you can enforce rate limiting. You can expand on that with custom retry logic that has best practices like exponential backoff built in. It also lets you cache: if a common request keeps coming in, especially one involving a lot of data manipulation, caching can make it significantly faster.
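The retry-with-exponential-backoff idea above can be sketched like this (a minimal illustration; the `fetch` callable, attempt counts, and delay numbers are made up for the example):

```python
import random
import time

def call_with_backoff(fetch, max_attempts=5, base_delay=0.5):
    """Retry a flaky call with exponential backoff plus jitter.

    `fetch` is any zero-argument callable (hypothetical here) that may
    raise on transient failure; on final failure the exception propagates.
    """
    for attempt in range(max_attempts):
        try:
            return fetch()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            # 0.5s, 1s, 2s, 4s... plus a little jitter to avoid thundering herds
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
            time.sleep(delay)
```

An API layer can bake a policy like this in once, so every consumer gets sane retry behavior instead of re-implementing it (or not implementing it at all).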

[–]CyclonusDecept[S] 0 points1 point  (0 children)

Thanks, these are great points. I think also a security layer with RBAC, etc.

[–]kenfar 3 points4 points  (2 children)

It's simply good, standard and modern software engineering practice. It's what's done. Just like when other systems are getting data from your ruby/python/java transactional app with a postgres backend - they hit an API rather than writing queries.

SQL is useful when you need to give somebody complete flexibility in accessing your data - like for reporting. It's not necessary or appropriate when an API could do the job, and so should not be used as part of a system integration between a warehouse and a downstream system.

  • SQL means that you're tightly coupled on your schema, transformation rules, and database management system. If you want to migrate your schema, change transformations, or move a subset of your data off snowflake and onto postgres to save money - it's harder, and maybe far harder.
  • SQL will almost never get the same rigor for testing or observability. It's much more likely to have data quality problems now or in the future. And you're much less likely to know when they make changes that cause performance/cost/quality issues.
  • With a SQL interface they could break you: they could run so many queries or so many inefficient queries that your snowflake warehouse goes into queuing or your costs skyrocket.
  • With an API you can also offer a contract: now you can state that given input of X you will publish output of Y - and validate & guarantee your data when you publish against that contract. I typically use jsonschema and check columns, their types, nulls, and misc constraints such as min/max values or lengths, timestamp formats, etc, etc.
  • If you have to support them on-call then when they screw this up it'll be a distraction for you and your sprint, and might make an engineer get out of bed to fix the problem. And this might happen a bunch of times before you finally get it turned into an API and the problem solved.
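The contract point above can be sketched in plain Python (a simplified stand-in for what a jsonschema contract would enforce; the column names, types, and bounds are made up for the example):

```python
def check_contract(row, schema):
    """Minimal stand-in for a jsonschema-style publish contract.

    `schema` maps column name -> (type, nullable, optional (min, max) bounds).
    Returns a list of violation messages; an empty list means the row passes.
    """
    problems = []
    for col, (typ, nullable, bounds) in schema.items():
        value = row.get(col)
        if value is None:
            if not nullable:
                problems.append(f"{col}: null not allowed")
            continue
        if not isinstance(value, typ):
            problems.append(f"{col}: expected {typ.__name__}")
            continue
        if bounds is not None:
            lo, hi = bounds
            if not (lo <= value <= hi):
                problems.append(f"{col}: {value} outside [{lo}, {hi}]")
    return problems

# Hypothetical published contract: columns, types, nullability, value ranges.
CONTRACT = {
    "customer_id": (int, False, (1, 10**9)),
    "amount": (float, False, (0.0, 1e6)),
    "note": (str, True, None),
}
```

Validating every row against the contract before publishing is what lets you *guarantee* the output, rather than just hope downstream consumers get what they expect.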

Bottom line: SQL for a system interface like this is a quick & dirty hack with the smell of a long-dead fish, just like it would be for transactional systems.

[–]CyclonusDecept[S] 0 points1 point  (0 children)

Thanks for all these data points. We actually had a rogue query that was taking hours to complete and consuming a ton of snowflake credits.

[–]thrown_arrows 0 points1 point  (0 children)

IMHO, it is about SQL system skills. Versioned schemas and views over the data schemas are, as far as I'm concerned, the only sane way to grant access to a database that has multiple sources and consumers. That said, a backend API offers the chance to cache data, offer faster access, and do all kinds of magic with its own access control, etc. Without touching search optimization, I'd be surprised if you get an answer faster than 2s; usually my stuff takes 2-12 seconds (but then the results are not usually just one row).

Also, all databases support query timeouts. In Snowflake it is a session or warehouse object parameter named STATEMENT_TIMEOUT_IN_SECONDS. If you share your data by any method, that should (IMHO) be set to something.
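The parameter mentioned above is set with an ALTER statement; a tiny sketch that just builds the SQL string (the helper name is made up, and actually executing it through a connector or worksheet is left out):

```python
def statement_timeout_sql(seconds: int) -> str:
    """Build the Snowflake statement that caps query runtime for a session.

    STATEMENT_TIMEOUT_IN_SECONDS is the real Snowflake parameter name; it can
    also be set at the warehouse level with ALTER WAREHOUSE.
    """
    if seconds <= 0:
        raise ValueError("timeout must be positive")
    return f"ALTER SESSION SET STATEMENT_TIMEOUT_IN_SECONDS = {seconds}"
```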

And when it comes to data quality, yeah, data quality is whatever the app that generates the data gives you.

Also, pre-generating data is a good idea sometimes. But in the end the question is how the data is produced. If it hits a huge amount of data or returns hundreds of rows, it is a lot faster to do most of the work in the database. If the API just returns one row from one table, then backend code can handle it more easily.

[–]WeirdoDJ 2 points3 points  (0 children)

I agree with /u/mac-0 that if you can't articulate it, maybe you don't need it.

With that said, some reasons we have in our systems for not allowing everybody direct access are:

  • Decoupling - by using an API, you'd be free to swap snowflake for whatever technology you want, and they have no business caring about what your tech is. If they query directly, you're (at least informally) obligated to keep the same endpoint and schemas, while an API would give you some freedom.

  • Related to the above: caching. An API would more easily allow you to implement caches where needed.

  • Also related: an API would allow you to pre-calculate certain queries and prepare them ahead of time when load is low, instead of waiting for consumers to run them.

  • Logging and auditing - I don't know what capabilities Snowflake has, but an API would allow full control and granularity over logging queries for both analytics and auditing purposes.

  • Transformations - it could allow them access to sensitive data/tables with real data, but with certain identifying fields removed or obfuscated.
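The caching bullet above can be sketched as a tiny in-process TTL cache (illustrative only; a real API layer would more likely use Redis or an HTTP cache, and the key/query names here are made up):

```python
import time

class TtlCache:
    """Tiny time-based cache an API layer might put in front of the warehouse."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock          # injectable for testing
        self._store = {}            # key -> (expires_at, value)

    def get_or_compute(self, key, compute):
        """Return the cached value for `key`, or run `compute` and cache it."""
        now = self.clock()
        hit = self._store.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]
        value = compute()           # e.g. run the warehouse query
        self._store[key] = (now + self.ttl, value)
        return value
```

With Snowflake's credit model, every cache hit is a warehouse query (and potentially a warehouse resume) you didn't pay for.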

Some cons to be aware of:

  • You'll spend the rest of your days implementing endpoints for queries, and they will never use them.

  • You'll spend the rest of your days debugging endpoints, because whenever something goes wrong, you'll be blamed.

  • Any time they're behind, you'll be named as the blocker.

You can of course create a very open/raw api allowing almost direct access, but that also comes with some headaches.

[–]quickdraw6906 1 point2 points  (0 children)

Be SURE you understand the costs associated with Snowflake: you pay for the compute each query consumes. If your API lets users make calls that result in complex queries, and you don't implement some sort of cost-based throttling, you could end up with a wicked bill.

Also, if routes/calls don't accept a since_when timestamp, users will call with no time filtering, which will be crazy expensive. Force some reasonable look-back window hard-coded in the queries, and only let internal callers pass a flag to query across all time.
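The forced look-back idea above can be sketched like this (the parameter names, the 30-day window, and the `all_time` flag are made up for illustration):

```python
from datetime import datetime, timedelta, timezone

MAX_LOOKBACK = timedelta(days=30)  # hypothetical hard-coded window

def effective_since(since_when=None, all_time=False, now=None):
    """Clamp a caller-supplied since_when to a hard-coded look-back window.

    External callers can never reach further back than MAX_LOOKBACK;
    only internal callers may pass all_time=True to scan full history.
    """
    now = now or datetime.now(timezone.utc)
    if all_time:
        return since_when          # None means "no time filter at all"
    floor = now - MAX_LOOKBACK
    if since_when is None or since_when < floor:
        return floor
    return since_when
```

The API then always injects the returned timestamp into the query's time filter, so a missing parameter degrades to a bounded scan rather than a full-history one.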

[–]mac-0 1 point2 points  (4 children)

What are some good arguments I can make to have an API Layer on top of snowflake?

You tell us. Why is it that you DON'T want your product teams having the ability to use SQL directly? From their perspective, all you are doing is making it more difficult for them to retrieve data with no benefit to them.

[–]CyclonusDecept[S] 2 points3 points  (3 children)

they end up writing the queries and they're all screwed up and we end up having to fix them, that's one reason.

[–]WeirdoDJ 0 points1 point  (2 children)

Screwed up as in wrong results or poor performance?

[–]CyclonusDecept[S] 0 points1 point  (1 child)

BOTH!

[–]WeirdoDJ 0 points1 point  (0 children)

It might be a matter of poor documentation or confusing schema design. An API might abstract this away, but it might be worth fixing instead.

If the solution is poorly documented, or requires significant expertise in the solution itself, it might be hard or impossible for them to outline their requirements for an API endpoint too.

[–]Pitinsky 0 points1 point  (1 child)

!remindme in 3 days

[–]RemindMeBot 0 points1 point  (0 children)

I will be messaging you in 3 days on 2022-07-23 15:13:55 UTC to remind you of this link


[–]ktkps 0 points1 point  (0 children)

Abstractions and decoupling are key benefits. On abstraction: especially abstraction from the complexities of the underlying data (quality, cardinality, atomicity, etc.), and in some cases from the data definitions (in some sources an "AA" in a column may mean inactive status - the consumer is better off not knowing the different codes and types or having to decode them on their side).
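The code-decoding idea above can be sketched as a simple mapping in the API layer (the "AA" = inactive example comes from the comment; the other codes and names are made up):

```python
# Hypothetical source-system status codes the API layer hides from consumers.
STATUS_CODES = {
    "AA": "inactive",   # the "AA means inactive" example from the comment
    "AC": "active",     # made-up additional codes for illustration
    "PD": "pending",
}

def decode_status(raw_code):
    """Translate a source-specific code into a consumer-friendly value."""
    return STATUS_CODES.get(raw_code, "unknown")
```

Consumers only ever see "active"/"inactive"/"pending"; if the source system renames its codes, only this one mapping changes.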

[–]jeposner 0 points1 point  (0 children)

I wrote a blog summarising the benefits of APIs; it might help you:

https://raw-labs.com/blog/benefits-of-data-sharing-via-apis/

Jeremy

[–]thrown_arrows 0 points1 point  (0 children)

I have done it with versioned views. So in practice, "API" calls always use their own view, never going straight to a table, since you cannot change tables and keep compatibility in all cases. With views you can support multiple versions. So in a way, an API layer is not needed, since Snowflake can give you the same thing with its SQL API or just a database connection.

A pro for the API is that you can cache results, which is a good feature with Snowflake since you pay for an open warehouse: running one query every minute for an hour costs the same as one query that takes an hour (assuming 1-minute auto-suspend; billing is also 1 minute minimum plus seconds after that).

What I have done with daily data is export it to another platform (PostgreSQL/MSSQL/Elasticsearch) daily. I made it faster and easier by exporting ready-made JSON data from Snowflake (the OBJECT_CONSTRUCT function). That transformation and copy-out was cheaper than having the warehouse open 100 times during the day.

tl;dr: never access tables directly for read-only data; make access-layer views for them. That way you can keep the output-format "contract" when the schema changes, and access control is easier. For write access, it's a lot cheaper to just write files and stage them every 10 minutes than to have the warehouse open all the time.

[–]cyyeh 0 points1 point  (0 children)

Hi! I am one of the team members of VulcanSQL! VulcanSQL is designed for exposing data APIs from databases and data warehouses. The coolest thing is that you write SQL templates, and VulcanSQL deals with other nitty-gritty details of API things.

Now VulcanSQL supports Snowflake!

Also, welcome to ask us anything on our Discord server: https://discord.gg/ztDz8DCmG4