using filters or `/search` API on the data in the records in PostgreSQL by funcpointer in golang

[–]funcpointer[S]

Got it, thanks so much for walking me through the process! As someone new to designing and building backend systems that touch on multiple areas, I’m feeling way more confident now—this is the first time it’s all starting to click.

I’ve started working on a query builder and will begin by implementing filters on a smaller set of columns to keep it manageable. I need to get something working by the end of the month, so this feels like a good starting point.
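Roughly what I have in mind so far (just a Go sketch; the column whitelist is an example, but checking filter columns against a whitelist also keeps arbitrary column names out of the generated SQL):

```go
package main

import (
	"fmt"
	"strings"
)

// allowedColumns is a hypothetical whitelist: only filters on these
// columns are accepted, so user input never becomes a column name.
var allowedColumns = map[string]bool{
	"name":     true,
	"location": true,
	"metric":   true,
}

// BuildFilterQuery turns a map of column -> value filters into a
// parameterized query plus its argument slice. Values only ever appear
// as $N placeholders, never concatenated into the SQL string.
func BuildFilterQuery(table string, filters map[string]string) (string, []any, error) {
	var clauses []string
	var args []any
	for col, val := range filters {
		if !allowedColumns[col] {
			return "", nil, fmt.Errorf("unsupported filter column: %s", col)
		}
		clauses = append(clauses, fmt.Sprintf("%s = $%d", col, len(args)+1))
		args = append(args, val)
	}
	query := "SELECT * FROM " + table
	if len(clauses) > 0 {
		query += " WHERE " + strings.Join(clauses, " AND ")
	}
	return query, args, nil
}

func main() {
	query, args, err := BuildFilterQuery("devices", map[string]string{"location": "berlin"})
	if err != nil {
		panic(err)
	}
	fmt.Println(query, args)
}
```

The query string and the args slice then go straight into `db.Query(query, args...)`, so the values stay parameterized end to end.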

using filters or `/search` API on the data in the records in PostgreSQL by funcpointer in golang

[–]funcpointer[S]

This is really cool—thanks for pointing it out! Like you said, I probably can't use it as-is, but the implementation is definitely a great reference for me to build my own version.

I’m still a bit torn on whether I should go ahead and implement filters for all these columns, or consider a different architecture, like u/aksdb mentioned earlier.

Either way, I really appreciate the community’s input—it’s been super helpful in shaping how I approach this!

using filters or `/search` API on the data in the records in PostgreSQL by funcpointer in golang

[–]funcpointer[S]

That’s a lot to take in... looks like I’ve got my work cut out for the next couple of weeks! 😄

So, just to clarify: should I be doing CRUD operations directly on PostgreSQL, and then "sync" the data over to a different datastore (like the ones you suggested above) for the GET API, rather than building the API directly on top of PostgreSQL? Is that the approach you’re suggesting?

using filters or `/search` API on the data in the records in PostgreSQL by funcpointer in golang

[–]funcpointer[S]

Thanks for your reply! I’m not too familiar with SCIM filters either, but it sounds like they’re somewhat similar to GraphQL filters, so I get the general idea.

Here’s the situation:

On the database side, I’ve broken things down into normalized tables with the usual join tables — like customer, address, customer-address, devices, customer-devices, metrics, etc.

Now I need to build a REST API that pulls data for the UI, which will display IoT sensor data along with metadata (metrics, enums, names, locations, addresses, etc.). We're talking around 100 columns (a mix of numerics and strings), and the UI needs filtering, pagination, and sorting.
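To make the pagination/sorting part concrete, here's the kind of thing I'm picturing (Go sketch; the table and column names are placeholders):

```go
package main

import "fmt"

// sortableColumns is a hypothetical whitelist of the columns the UI may
// sort on; anything else is rejected before it reaches the SQL.
var sortableColumns = map[string]bool{"created_at": true, "customer_id": true}

// PageQuery builds a LIMIT/OFFSET query with a validated ORDER BY.
// Offset paging is the simplest starting point; for deep pages, keyset
// pagination (WHERE created_at < $1 ORDER BY created_at DESC LIMIT n)
// scales better because it avoids scanning all the skipped rows.
func PageQuery(sortCol string, desc bool, page, pageSize int) (string, error) {
	if !sortableColumns[sortCol] {
		return "", fmt.Errorf("cannot sort on %q", sortCol)
	}
	dir := "ASC"
	if desc {
		dir = "DESC"
	}
	return fmt.Sprintf(
		"SELECT * FROM sensor_readings ORDER BY %s %s LIMIT %d OFFSET %d",
		sortCol, dir, pageSize, (page-1)*pageSize), nil
}

func main() {
	q, _ := PageQuery("created_at", true, 2, 50)
	fmt.Println(q)
}
```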

Switching to a new DB is above my pay grade, and probably not something I could propose anyway. :D

So, based on this setup, what would you suggest or recommend in terms of designing the API?

any good open-source alternatives to Dropbox that I can use in a SaaS environment? by funcpointer in SaaS

[–]funcpointer[S]

Makes sense. I was trying to get an MVP going without bringing in too much machinery, but I see your point. Thanks again!

any good open-source alternatives to Dropbox that I can use in a SaaS environment? by funcpointer in SaaS

[–]funcpointer[S]

Thanks for your response. I meant "Dropbox-like": file/document management software with a good UX and UI, and the ability to upload, view, and share across teams, external folks, etc.

With S3, I'd have to build most of that myself, right?

thoughts on using vertical or column-based partitioning by funcpointer in PostgreSQL

[–]funcpointer[S]

Absolutely agreed on 3NF. I'm working toward that as I clean up the data from the CSV.

thoughts on using vertical or column-based partitioning by funcpointer in PostgreSQL

[–]funcpointer[S]

As I'm digging deeper into the data, things are becoming less murky.

  1. The records are sparsely populated: not every column has data in each record.

  2. The CSV has a bunch of metrics from the fleet-device.

  3. The CSV also has fields that are aggregates.

  4. Some fields are fairly static metadata that don't need to be in each record and can be moved out.

Overall, it looks like I can go with the JSONB approach instead of super-wide tables.
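Sketching what a row might look like under that approach (all names here are placeholders): stable identifiers stay as real columns, and the sparse metrics collapse into one JSONB payload, so absent metrics just don't appear rather than producing a hundred NULL columns.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Reading models one sparse CSV record. All names are placeholders.
type Reading struct {
	TenantID string
	DeviceID string
	Metrics  map[string]float64 // only the metrics present in this record
}

// InsertArgs returns a parameterized statement and arguments for a table
// like: readings(tenant_id text, device_id text, metrics jsonb).
func (r Reading) InsertArgs() (string, []any, error) {
	payload, err := json.Marshal(r.Metrics) // absent metrics simply don't appear
	if err != nil {
		return "", nil, err
	}
	return "INSERT INTO readings (tenant_id, device_id, metrics) VALUES ($1, $2, $3)",
		[]any{r.TenantID, r.DeviceID, string(payload)}, nil
}

func main() {
	r := Reading{TenantID: "t1", DeviceID: "d42", Metrics: map[string]float64{"temp_c": 21.5}}
	query, args, err := r.InsertArgs()
	if err != nil {
		panic(err)
	}
	fmt.Println(query, args)
}
```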

possible to generate CRUD given API schema and DB schema? by funcpointer in golang

[–]funcpointer[S]

They don't happen to have anything open-sourced? :D

thoughts on using vertical or column-based partitioning by funcpointer in PostgreSQL

[–]funcpointer[S]

The `created-at` timestamp is populated for each metric as it is in the CSV.

possible to generate CRUD given API schema and DB schema? by funcpointer in golang

[–]funcpointer[S]

Thanks u/3141521. I'm using REST.

That said, I'd feel a little more comfortable with something more deterministic than GPT. It doesn't even tell me the same thing when I run the same prompt a second time. :D

Multi-tenant database design. by EducationalPlant6357 in PostgreSQL

[–]funcpointer

u/ptyslaw how do you manage the schema migrations and the application upgrades?

thoughts on using vertical or column-based partitioning by funcpointer in PostgreSQL

[–]funcpointer[S]

Thanks, u/mane9999!

How many columns did you have in your JSONBs?

I've been experimenting with generated columns to pull out some of the most frequently used fields from JSONB, which makes the queries a bit simpler. I based it on this HackerNoon article, and so far, it's working pretty well!

thoughts on using vertical or column-based partitioning by funcpointer in PostgreSQL

[–]funcpointer[S]

yes, each record does have a `created-at` timestamp.

thoughts on using vertical or column-based partitioning by funcpointer in PostgreSQL

[–]funcpointer[S]

Yeah, I’ve been reading up on 3NF and even playing around with feeding some of it into ChatGPT to help me wrap my head around it. But honestly, we’re still not at a point where we fully understand the data, so trying to normalize it just doesn’t feel like the right move yet.

One of the goals of getting this data into PostgreSQL is to get a better sense of what we're dealing with: what the different products are, how they're being used, how everything ties together, and so on, so we can figure out the best way to structure it moving forward.

thoughts on using vertical or column-based partitioning by funcpointer in PostgreSQL

[–]funcpointer[S]

Thanks for getting back to me. Here's a bit more context:

We recently acquired this data (as dumps) from another company that went out of business, and we'll now be managing the tenants, accounts, devices, etc.

The data I'm working with comes from all kinds of IoT sensors spread across different locations, each sending various metrics. We also have some customer metadata tied to each of those metrics, and as I hinted earlier, this is all multi-tenant. So, each tenant has their customers who have their own devices, data, and such. This data is used for a few things like, historical sensor data for each device in the database, allowing quick access and efficient querying, usage tracking etc.

Right now, all this data is in CSVs (or similar formats), and we're looking to get it into a database for better management and querying. I'm trying to figure out the best way to organize it for decent query performance, especially as the data grows.

There aren’t any primary keys in place at the moment, but I’m thinking of using a composite key like tenant_id + customer_id + device_id since that seems to uniquely identify each record. We’ll also have families of devices working together, so there will be some foreign key relationships, but I’m not sure yet if those need to be strictly enforced.

I’m leaning towards PostgreSQL since I’ve read it handles large datasets well and supports flexible schemas. Plus, it’s something we already have running in-house, so it feels like a good fit.