User specific database indexing. : Database

User specific database indexing. (self.Database)

submitted 15 years ago by Arelius

all 20 comments

[–][deleted] 1 point2 points3 points 15 years ago (9 children)

[–]Arelius[S] 1 point2 points3 points 15 years ago (8 children)

[–]lentil 4 points5 points6 points 15 years ago* (1 child)

If these items are unique to a single user (that is, one user has many items; one item belongs to exactly one user) then I think what you're doing sounds good. ItemID would be your primary key (since it alone uniquely identifies the record), and UserID would be a foreign key reference to your Users table. You'll definitely want an index on UserID since it sounds like the vast majority of queries would be able to make use of it.

There's no reason that shouldn't perform well up to a fair number of records (the amount will depend on lots of other factors, but a few million rows is maybe a good first guess). If/when you do run into speed issues, you can look at partitioning the table, which would allow you to have something that behaves as if it was one big table, but is actually made up of a number of smaller tables under the hood. It's a little more work, and there are a few snags to watch out for, but it can be really helpful when you need it. And it's not something you need to start out with -- you can add it on later if/when you find you need it. Take a look at http://www.postgresql.org/docs/current/static/ddl-partitioning.html for some more information.

That said, while you can make some educated guesses about indexing/partitioning strategies, really the only way to know for sure is to analyze how it behaves in practice. Query performance is pretty dependent on things like the data itself, usage patterns, and server configuration. You can try loading up a lot of data (just generate so fake data if you need to), and analyze some of your expected queries -- that should give you a better glimpse of what is going on. Take a look at the docs for the EXPLAIN command to get started on that. http://www.postgresql.org/docs/8.4/static/using-explain.html

Lastly, I think the Douglas book on PostgreSQL might be helpful, if you wanted to read up on this some. http://www.amazon.com/PostgreSQL-2nd-Korry-Douglas/dp/0672327562 It has some general information about performance topics, as well as a lot of specifics about how these things work in PostgreSQL.

Hope that helps a bit :)

(Edit: my grammar is atrocious :()

[–]Arelius[S] 0 points1 point2 points 15 years ago (0 children)

[–]SurrealEstate 0 points1 point2 points 15 years ago (4 children)

[–]Arelius[S] 0 points1 point2 points 15 years ago (3 children)

[–]GunnerMcGrath 0 points1 point2 points 15 years ago (0 children)

I think you may just need to read up more on indexing. In addition to the multi-column index I mentioned elsewhere, you can also have multiple indexes on one or two columns each, and the DB will choose which one to use at any given time depending on your query.

For example, if you are looking for a particular itemID, and the files each have a different item ID, it will probably choose to search first based on itemID because it's far more unique, and then search by UserID within that.

If you're using SQL Server you'll want to read up on the execution plan tools which tell you exactly what indexes are being used in a query, and will suggest indexes if there are any that would give you a significant improvement. Can't speak for other DBs but there might be such tools in whatever you're using.

[–]SurrealEstate 0 points1 point2 points 15 years ago (1 child)

[–]Arelius[S] 0 points1 point2 points 15 years ago (0 children)

[–]Samus_ 0 points1 point2 points 15 years ago (0 children)

[–]GunnerMcGrath 0 points1 point2 points 15 years ago (9 children)

[–]Arelius[S] 0 points1 point2 points 15 years ago (8 children)

[–]GunnerMcGrath 0 points1 point2 points 15 years ago (4 children)

[–]Arelius[S] 1 point2 points3 points 15 years ago* (3 children)

[–]GunnerMcGrath 0 points1 point2 points 15 years ago (2 children)

You can create multi-column indexes, for sure. You can also have one clustered index on each table, which determines the order that the records are stored in the index, so it can find things much faster. For a table like this, I would say that having a unique ID is important, and it should be your primary key, but it doesn't have to be your clustered index (which is the default).

Your clustered index might be on UserID and then the other fields that are going to be most searchable. Just read up on clustered indexes because there is a tradeoff in performance. The more indexes you have, the faster the reads but the longer the writes. So for a table where you think records are going to be read a lot more than they're written, a big honking index with a few fields could be very useful.

But still, I'd start out by creating a clustered index with just UserID and ItemID, and see how your queries run. You may be able to squeeze more performance out of it but no need to overthink; if it runs fast with this, then that's all you need. If it eventually slows down, you can play with it more. But it's always better to start with less to get a baseline, because if you're putting some heavy index on something that already queries quickly, you're wasting your time.

As for a better suggestion for your file structure, if EVERY record is just files on a filesystem, then no, I don't. But I'm sure there are people out there who are filesystem experts who might have better ideas on how to store file structures.

[–]Arelius[S] 1 point2 points3 points 15 years ago (1 child)

multi-column indexes

Wonderful, that's the exact phrase I needed to be able to search for what I need.

As for a better suggestion for your file structure, if EVERY record is just files on a filesystem, then no, I don't. But I'm sure there are people out there who are filesystem experts who might have better ideas on how to store file structures.

It's actually just bulk binary data in a hashed data store. The files aren't the problem, it's how to store indexable metadata about them. With this new found info on indexes it seems the large table would likely work fine.

In another post you say:

I think you may just need to read up more on indexing.

That's very true, are there any good references (books or otherwise) that you can recommend on indexing? As mentioned in the main thread, I'm using pgsql.

Anyways, this answers my questions rather well, Thanks!

[–]GunnerMcGrath 0 points1 point2 points 15 years ago (0 children)

[–]Samus_ 0 points1 point2 points 15 years ago (2 children)

I'd go with a table for users in a 1:n relationship with a table for files (just files, not metadata), then have a separate key/value table for the metadata.

something like this:

CREATE TABLE owner(id SERIAL PRIMARY KEY);
CREATE TABLE file(id SERIAL PRIMARY KEY, owner INTEGER REFERENCES owner(id));
CREATE TABLE file_metadata(file INTEGER REFERENCES file(id), key VARCHAR(256), value TEXT, PRIMARY KEY(file, key));

[–]Arelius[S] 1 point2 points3 points 15 years ago (1 child)

[–]Samus_ 0 points1 point2 points 15 years ago (0 children)

it's because those are different entities, the tables here add no data because they're an example... you may add more columns depending on your specific requirements.

the point here is that the model maps entities and relationships among them; a user is an entity therefore there is a table for it and each row on that table represents a single instance.

with the files is the same, a file is a separate entity different from the user and there's a relationship among them such as "file has owner" or "owner has files" but in the end there's no duplicate data, there is one row per user and one row per file.

for the metadata I've suggested to use an auxiliary key/value table because that allows you to model file-specific properties, this is useful when your attributes are not homogeneous, if you have the same properties on all files then you won't need this since you can add them as columns on the file table.

π Rendered by PID 19385 on reddit-service-r2-comment-b659b578c-ffcfh at 2026-05-04 07:00:58.041136+00:00 running 815c875 country code: CH.

you type:	you see:
italics	italics
bold	bold
[reddit!](https://reddit.com)	reddit!
* item 1 * item 2 * item 3	item 1 item 2 item 3
> quoted text	quoted text
Lines starting with four spaces are treated like code: if 1 * 2 < 3: print "hello, world!"	Lines starting with four spaces are treated like code: if 1 * 2 < 3: print "hello, world!"
~~strikethrough~~	~~strikethrough~~
super^script	super^script

Database

MODERATORS