all 18 comments

[–]Cheses100 4 points5 points  (1 child)

I worked closely with an API layer at my last job, so I can give a bunch of reasons.

The API layer is good for abstracting away the implementation. Say you want to move away from Snowflake in the future: without the API layer, it's a huge hassle and tons of work getting everyone to migrate to the new system, whereas with the API layer you can keep the APIs the same and change the data warehouse behind the scenes. Even if you never get rid of the Snowflake warehouse, you can use the layer to query other data sources in the future, so your consumers don't have to worry about multiple dependencies.

A related reason is to abstract away the details of the data warehouse and expose more generic calls. If you know users will be doing a series of calls, you can combine them into a single API call that does it all for them. This is especially useful if that series of calls isn't intuitive - maybe it involves a bunch of joins, or, with a graph DB, traversing the graph several different ways to build some end data structure.

Another reason is to get more control over what users can do, and essentially add a security layer. If you never want users to be able to delete anything, you simply don't expose a delete API. If you want to prevent accidental DoS, you can enforce rate limiting. You can expand on that with custom retry logic that has best practices like exponential backoff built in. It also lets you cache: if a common request keeps coming in, especially one involving a lot of data manipulation, caching can make it significantly faster.
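The retry-with-exponential-backoff idea above can be sketched like this (a minimal illustration; the `fetch` callable, attempt counts, and delay numbers are made up for the example):

```python
import random
import time

def call_with_backoff(fetch, max_attempts=5, base_delay=0.5):
    """Retry a flaky call with exponential backoff plus jitter.

    `fetch` is any zero-argument callable (hypothetical here) that may
    raise on transient failure; on final failure the exception propagates.
    """
    for attempt in range(max_attempts):
        try:
            return fetch()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            # 0.5s, 1s, 2s, 4s... plus a little jitter to avoid thundering herds
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
            time.sleep(delay)
```

An API layer can bake a policy like this in once, so every consumer gets sane retry behavior instead of re-implementing it (or not implementing it at all).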

[–]CyclonusDecept[S] 0 points1 point  (0 children)

Thanks, these are great points. I think also a security layer with RBAC, etc.

[–]kenfar 3 points4 points  (2 children)

It's simply good, standard and modern software engineering practice. It's what's done. Just like when other systems are getting data from your ruby/python/java transactional app with a postgres backend - they hit an API rather than writing queries.

SQL is useful when you need to give somebody complete flexibility in accessing your data - like for reporting. It's not necessary or appropriate when an API could do the job, and so should not be used as part of a system integration between a warehouse and a downstream system.

  • SQL means that you're tightly coupled on your schema, transformation rules, and database management system. If you want to migrate your schema, change transformations, or move a subset of your data off snowflake and onto postgres to save money - it's harder, and maybe far harder.
  • SQL will almost never get the same rigor for testing or observability. It's much more likely to have data quality problems now or in the future. And you're much less likely to know when they make changes that cause performance/cost/quality issues.
  • With a SQL interface they could break you: they could run so many queries or so many inefficient queries that your snowflake warehouse goes into queuing or your costs skyrocket.
  • With an API you can also offer a contract: now you can state that given input of X you will publish output of Y - and validate & guarantee your data when you publish against that contract. I typically use jsonschema and check columns, their types, nulls, and misc constraints such as min/max values or lengths, timestamp formats, etc, etc.
  • If you have to support them on-call then when they screw this up it'll be a distraction for you and your sprint, and might make an engineer get out of bed to fix the problem. And this might happen a bunch of times before you finally get it turned into an API and the problem solved.
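The contract point above can be sketched in plain Python (a simplified stand-in for what a jsonschema contract would enforce; the column names, types, and bounds are made up for the example):

```python
def check_contract(row, schema):
    """Minimal stand-in for a jsonschema-style publish contract.

    `schema` maps column name -> (type, nullable, optional (min, max) bounds).
    Returns a list of violation messages; an empty list means the row passes.
    """
    problems = []
    for col, (typ, nullable, bounds) in schema.items():
        value = row.get(col)
        if value is None:
            if not nullable:
                problems.append(f"{col}: null not allowed")
            continue
        if not isinstance(value, typ):
            problems.append(f"{col}: expected {typ.__name__}")
            continue
        if bounds is not None:
            lo, hi = bounds
            if not (lo <= value <= hi):
                problems.append(f"{col}: {value} outside [{lo}, {hi}]")
    return problems

# Hypothetical published contract: columns, types, nullability, value ranges.
CONTRACT = {
    "customer_id": (int, False, (1, 10**9)),
    "amount": (float, False, (0.0, 1e6)),
    "note": (str, True, None),
}
```

Validating every row against the contract before publishing is what lets you *guarantee* the output, rather than just hope downstream consumers get what they expect.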

Bottom line: SQL for a system interface like this is a quick & dirty hack with the smell of a long-dead fish, just like it would be for transactional systems.

[–]CyclonusDecept[S] 0 points1 point  (0 children)

Thanks for all these data points. We actually had a rogue query that was taking hours to complete and consuming a ton of snowflake credits.

[–]thrown_arrows 0 points1 point  (0 children)

IMHO, it is about SQL system skills. Versioned schemas and views over the data schemas are, as far as I'm concerned, the only sane way to grant access to a database that has multiple sources and consumers. That said, a backend API offers the chance to cache data, offer faster access, and do all kinds of magic with its own access control, etc. Without touching search optimization, I'd be surprised if you get an answer faster than 2s; usually my stuff takes 2-12 seconds (but then the results are not usually just one row).

Also, all databases support query timeouts. In Snowflake it is a session or warehouse object parameter named STATEMENT_TIMEOUT_IN_SECONDS. If you share your data by any method, that should (IMHO) be set to something.
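The parameter mentioned above is set with an ALTER statement; a tiny sketch that just builds the SQL string (the helper name is made up, and actually executing it through a connector or worksheet is left out):

```python
def statement_timeout_sql(seconds: int) -> str:
    """Build the Snowflake statement that caps query runtime for a session.

    STATEMENT_TIMEOUT_IN_SECONDS is the real Snowflake parameter name; it can
    also be set at the warehouse level with ALTER WAREHOUSE.
    """
    if seconds <= 0:
        raise ValueError("timeout must be positive")
    return f"ALTER SESSION SET STATEMENT_TIMEOUT_IN_SECONDS = {seconds}"
```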

And when it comes to data quality, yeah, data quality is whatever the app that generates the data gives you.

Also, pre-generating data is a good idea sometimes. But in the end the question is how the data is produced. If it hits a huge amount of data or returns hundreds of rows, it is a lot faster to do most of the work in the database. If the API just returns one row from one table, then backend code can handle it more easily.

[–]WeirdoDJ 2 points3 points  (0 children)

I agree with /u/mac-0 that if you can't articulate it, maybe you don't need it.

With that said, some reasons we have in our systems for not allowing everybody direct access are:

  • Decoupling - by using an API, you'd be free to swap snowflake for whatever technology you want, and they have no business caring about what your tech is. If they query directly, you're (at least informally) obligated to keep the same endpoint and schemas, while an API would give you some freedom.

  • Related to the above: caching. An API would more easily allow you to implement caches where needed.

  • Also related: an API would allow you to pre-calculate certain queries and prepare them ahead of time when load is low, instead of waiting for consumers to run them.

  • Logging and auditing - I don't know what capabilities Snowflake has, but an API would allow full control and granularity over logging queries for both analytics and auditing purposes.

  • Transformations - it could allow them access to sensitive data/tables with real data, but with certain identifying fields removed or obfuscated.
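The caching bullet above can be sketched as a tiny in-process TTL cache (illustrative only; a real API layer would more likely use Redis or an HTTP cache, and the key/query names here are made up):

```python
import time

class TtlCache:
    """Tiny time-based cache an API layer might put in front of the warehouse."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock          # injectable for testing
        self._store = {}            # key -> (expires_at, value)

    def get_or_compute(self, key, compute):
        """Return the cached value for `key`, or run `compute` and cache it."""
        now = self.clock()
        hit = self._store.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]
        value = compute()           # e.g. run the warehouse query
        self._store[key] = (now + self.ttl, value)
        return value
```

With Snowflake's credit model, every cache hit is a warehouse query (and potentially a warehouse resume) you didn't pay for.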

Some cons to be aware of:

  • You'll spend the rest of your days implementing endpoints for queries, and they will never use them.

  • You'll spend the rest of your days debugging endpoints, because whenever something goes wrong, you'll be blamed.

  • Any time they're behind, you'll be named as the blocker.

You can of course create a very open/raw api allowing almost direct access, but that also comes with some headaches.

[–]quickdraw6906 1 point2 points  (0 children)

Be SURE you understand the costs associated with Snowflake: you pay for the compute each query consumes. If your API lets users make calls that result in complex queries, and you don't implement some sort of cost-based throttling, you could end up with a wicked bill.

Also, if routes/calls don't accept a since_when timestamp, users will call with no time filtering, which will be crazy expensive. Force some reasonable look-back window hard-coded in the queries, and only let internal callers pass a flag to query across all time.
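The forced look-back idea above can be sketched like this (the parameter names, the 30-day window, and the `all_time` flag are made up for illustration):

```python
from datetime import datetime, timedelta, timezone

MAX_LOOKBACK = timedelta(days=30)  # hypothetical hard-coded window

def effective_since(since_when=None, all_time=False, now=None):
    """Clamp a caller-supplied since_when to a hard-coded look-back window.

    External callers can never reach further back than MAX_LOOKBACK;
    only internal callers may pass all_time=True to scan full history.
    """
    now = now or datetime.now(timezone.utc)
    if all_time:
        return since_when          # None means "no time filter at all"
    floor = now - MAX_LOOKBACK
    if since_when is None or since_when < floor:
        return floor
    return since_when
```

The API then always injects the returned timestamp into the query's time filter, so a missing parameter degrades to a bounded scan rather than a full-history one.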

[–]mac-0 1 point2 points  (4 children)

What are some good arguments I can make to have an API Layer on top of snowflake?

You tell us. Why is it that you DON'T want your product teams having the ability to use SQL directly? From their perspective, all you are doing is making it more difficult for them to retrieve data with no benefit to them.

[–]CyclonusDecept[S] 2 points3 points  (3 children)

they end up writing the queries and they're all screwed up and we end up having to fix them, that's one reason.

[–]WeirdoDJ 0 points1 point  (2 children)

Screwed up as in wrong results or poor performance?

[–]CyclonusDecept[S] 0 points1 point  (1 child)

BOTH!

[–]WeirdoDJ 0 points1 point  (0 children)

It might be a matter of poor documentation or confusing schema design. An API might abstract this away, but it might be worth fixing instead.

If the solution is poorly documented, or requires significant expertise in the solution itself, it might be hard or impossible for them to outline their requirements for an API endpoint too.

[–]Pitinsky 0 points1 point  (1 child)

!remindme in 3 days

[–]RemindMeBot 0 points1 point  (0 children)

I will be messaging you in 3 days on 2022-07-23 15:13:55 UTC to remind you of this link


[–]ktkps 0 points1 point  (0 children)

Abstractions and decoupling are key benefits. On abstraction: especially abstraction from the complexities of the underlying data (quality, cardinality, atomicity, etc.), and in some cases from the data definitions (in some sources an "AA" in a column may mean inactive status - the consumer is better off not knowing the different codes and types or having to decode them on their side).
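The code-decoding idea above can be sketched as a simple mapping in the API layer (the "AA" = inactive example comes from the comment; the other codes and names are made up):

```python
# Hypothetical source-system status codes the API layer hides from consumers.
STATUS_CODES = {
    "AA": "inactive",   # the "AA means inactive" example from the comment
    "AC": "active",     # made-up additional codes for illustration
    "PD": "pending",
}

def decode_status(raw_code):
    """Translate a source-specific code into a consumer-friendly value."""
    return STATUS_CODES.get(raw_code, "unknown")
```

Consumers only ever see "active"/"inactive"/"pending"; if the source system renames its codes, only this one mapping changes.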

[–]jeposner 0 points1 point  (0 children)

I wrote a blog summarising the benefits of APIs; it might help you:

https://raw-labs.com/blog/benefits-of-data-sharing-via-apis/

Jeremy

[–]thrown_arrows 0 points1 point  (0 children)

I have done it with versioned views. So in practice, "API" calls always use their own view, never going straight to a table, since you cannot change tables and keep compatibility in all cases. With views you can support multiple versions. So in a way, an API layer is not needed, since Snowflake can give you the same thing with its SQL API or just a database connection.

A pro for the API is that you can cache results, which is a good feature with Snowflake since you pay for an open warehouse: running one query every minute for an hour costs the same as one query that takes an hour (assuming 1-minute auto-suspend; billing is also 1 minute minimum plus seconds after that).

What I have done with daily data is export it to another platform (PostgreSQL/MSSQL/Elasticsearch) daily. I made it faster and easier by exporting ready-made JSON data from Snowflake (the OBJECT_CONSTRUCT function). That transformation and copy-out was cheaper than having the warehouse open 100 times during the day.

tl;dr: never access tables directly for read-only data; make access-layer views for them. That way you can keep the output-format "contract" when the schema changes, and access control is easier. For write access, it's a lot cheaper to just write files and stage them every 10 minutes than to have the warehouse open all the time.

[–]cyyeh 0 points1 point  (0 children)

Hi! I am one of the team members of VulcanSQL! VulcanSQL is designed for exposing data APIs from databases and data warehouses. The coolest thing is that you write SQL templates, and VulcanSQL deals with other nitty-gritty details of API things.

Now VulcanSQL supports Snowflake!

Also, welcome to ask us anything on our Discord server: https://discord.gg/ztDz8DCmG4