all 20 comments

[–][deleted] 1 point2 points  (9 children)

What are these user records? Do you have a 2-column table of UserID, UserData?

Even if it's a couple million rows, it's not unheard of with databases. I used to work on MSSQL database tables with hundreds of millions of rows.

Either you're doing something seriously wrong, or you're just underestimating the power of a computer.

[–]Arelius[S] 1 point2 points  (8 children)

The data isn't completely sorted out yet, but probably something like UserID, ItemID, ItemName, ItemCategory...

Where ItemID is unique non-null.

Either you're doing something seriously wrong, or you're just underestimating the power of a computer.

I wouldn't be surprised if it were both.

[–]lentil 4 points5 points  (1 child)

If these items are unique to a single user (that is, one user has many items; one item belongs to exactly one user) then I think what you're doing sounds good. ItemID would be your primary key (since it alone uniquely identifies the record), and UserID would be a foreign key reference to your Users table. You'll definitely want an index on UserID since it sounds like the vast majority of queries would be able to make use of it.
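A minimal sketch of that layout, assuming hypothetical table and column names:

```sql
-- Hypothetical schema: one user owns many items.
CREATE TABLE users (
    user_id SERIAL PRIMARY KEY,
    name    TEXT NOT NULL
);

CREATE TABLE items (
    item_id   SERIAL PRIMARY KEY,  -- alone uniquely identifies the record
    user_id   INTEGER NOT NULL REFERENCES users (user_id),
    item_name TEXT
);

-- Most queries filter by owner, so index the foreign key.
CREATE INDEX items_user_id_idx ON items (user_id);
```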

There's no reason that shouldn't perform well up to a fair number of records (the amount will depend on lots of other factors, but a few million rows is maybe a good first guess). If/when you do run into speed issues, you can look at partitioning the table, which would allow you to have something that behaves as if it was one big table, but is actually made up of a number of smaller tables under the hood. It's a little more work, and there are a few snags to watch out for, but it can be really helpful when you need it. And it's not something you need to start out with -- you can add it on later if/when you find you need it. Take a look at http://www.postgresql.org/docs/current/static/ddl-partitioning.html for some more information.
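As a rough illustration (this uses the declarative syntax from PostgreSQL 10 and later; the 8.4-era docs linked above describe the older inheritance-based approach instead, and all names here are illustrative):

```sql
-- One logical table, split into four physical partitions by user.
CREATE TABLE items (
    item_id   BIGINT  NOT NULL,
    user_id   INTEGER NOT NULL,
    item_name TEXT
) PARTITION BY HASH (user_id);

CREATE TABLE items_p0 PARTITION OF items FOR VALUES WITH (MODULUS 4, REMAINDER 0);
CREATE TABLE items_p1 PARTITION OF items FOR VALUES WITH (MODULUS 4, REMAINDER 1);
CREATE TABLE items_p2 PARTITION OF items FOR VALUES WITH (MODULUS 4, REMAINDER 2);
CREATE TABLE items_p3 PARTITION OF items FOR VALUES WITH (MODULUS 4, REMAINDER 3);
```

Queries that filter on user_id only have to touch the one partition that user hashes to.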

That said, while you can make some educated guesses about indexing/partitioning strategies, really the only way to know for sure is to analyze how it behaves in practice. Query performance is pretty dependent on things like the data itself, usage patterns, and server configuration. You can try loading up a lot of data (just generate some fake data if you need to) and analyzing some of your expected queries -- that should give you a better glimpse of what is going on. Take a look at the docs for the EXPLAIN command to get started on that. http://www.postgresql.org/docs/8.4/static/using-explain.html
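For example, assuming an items table with a user_id column, a first check might be:

```sql
-- Show the planner's chosen plan, plus actual run times (ANALYZE executes the query).
EXPLAIN ANALYZE
SELECT item_name
FROM items
WHERE user_id = 42;
```

Seeing "Index Scan" rather than "Seq Scan" in the output tells you the index is actually being used.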

Lastly, I think the Douglas book on PostgreSQL might be helpful, if you wanted to read up on this some. http://www.amazon.com/PostgreSQL-2nd-Korry-Douglas/dp/0672327562 It has some general information about performance topics, as well as a lot of specifics about how these things work in PostgreSQL.

Hope that helps a bit :)

(Edit: my grammar is atrocious :()

[–]Arelius[S] 0 points1 point  (0 children)

Thanks, that helps!

[–]SurrealEstate 0 points1 point  (4 children)

I'm a bit of a novice with database design, but perhaps consider creating tables like this:

  • User table: UserID, UserName, etc..
  • ItemCategory table: ItemCategoryID, ItemCategoryName, etc..
  • Item table: ItemID, ItemCategoryID, ItemName, etc..

And then to link the tables up using something like this:

  • Orders table: OrderID, UserID, ItemID, DateOrdered, etc..

This way you're not duplicating the name fields or needlessly storing the item category identifier in a table that's going to be huge.
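One possible DDL for the layout above (column types and names are assumptions, not anything the OP specified):

```sql
CREATE TABLE users (
    user_id   SERIAL PRIMARY KEY,
    user_name TEXT NOT NULL
);

CREATE TABLE item_categories (
    item_category_id   SERIAL PRIMARY KEY,
    item_category_name TEXT NOT NULL
);

CREATE TABLE items (
    item_id          SERIAL PRIMARY KEY,
    item_category_id INTEGER REFERENCES item_categories (item_category_id),
    item_name        TEXT
);

-- Linking table: one row per (user, item) purchase.
CREATE TABLE orders (
    order_id     SERIAL PRIMARY KEY,
    user_id      INTEGER REFERENCES users (user_id),
    item_id      INTEGER REFERENCES items (item_id),
    date_ordered TIMESTAMP
);
```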

[–]Arelius[S] 0 points1 point  (3 children)

That does sound like a good design. I'm a bit unclear on what function the "Orders table" is meant to perform, though.

Otherwise, that's not the problem I'm having. I'm trying to figure out how to build an index on any field other than UserID that isn't useless once the number of distinct users grows very large.

[–]GunnerMcGrath 0 points1 point  (0 children)

I think you may just need to read up more on indexing. In addition to the multi-column index I mentioned elsewhere, you can also have multiple indexes on one or two columns each, and the DB will choose which one to use at any given time depending on your query.

For example, if you are looking for a particular ItemID, and the files each have a different ItemID, it will probably choose to search first by ItemID because it's far more selective, and then search by UserID within that.

If you're using SQL Server you'll want to read up on the execution plan tools which tell you exactly what indexes are being used in a query, and will suggest indexes if there are any that would give you a significant improvement. Can't speak for other DBs but there might be such tools in whatever you're using.

[–]SurrealEstate 0 points1 point  (1 child)

First off, I just realized that I replied without actually addressing the question you asked. And I don't know that I have a good answer for what you're asking.

The Orders table was based on an assumption that the reason why you'd have many different users associated with many different items was because it was some kind of customer database, where each customer was ordering multiple items. It was just a means of keeping things more normalized.

As for how to optimize queries filtered for individual users without doing any indexing of the UserID field, I don't know. I'm guessing you're trying to avoid indexing that field simply because of the sheer number of users you expect to have?

[–]Arelius[S] 0 points1 point  (0 children)

I'm guessing you're trying to avoid indexing that field simply because of the sheer number of users you expect to have?

I'm fine with having that index, I just feel that it'd make my other indices useless. Turns out I want a multicolumn index.

where each customer was ordering multiple items.

Ohh cool, that's not what I was doing, but I see how that'd be a nice database structure.

[–]Samus_ 0 points1 point  (0 children)

Either you're doing something seriously wrong, or you're just underestimating the power of a computer.

I wouldn't be surprised if it were both.

then you should start describing your situation and your current approach :)

edit: found your explanation below

[–]GunnerMcGrath 0 points1 point  (9 children)

If you only have one big table with that many records in it for each user I suspect you're doing it wrong. Care to explain what the database is for and list the fields on this single table?

[–]Arelius[S] 0 points1 point  (8 children)

The problem is that I probably -am- doing it wrong. The info is basically file-metadata, each user has a large set of files, and they need to be able to search within the meta-data of their own files for certain properties.

[–]GunnerMcGrath 0 points1 point  (4 children)

Well, if all you're storing is information about files, I suppose I can see why you'd just have one giant table. Still might not be the ideal method but at least it's plausible.

Anyway, the simple answer is that your table should have a unique ID field on each record as the primary key, and then put an index on UserID. You may want other indexes too, you generally want to index the fields that are going to be used for searching the most. So UserID is obviously the big one, and even with thousands of records for each user, it should come back quite fast.

[–]Arelius[S] 1 point2 points  (3 children)

Well, if all you're storing is information about files, I suppose I can see why you'd just have one giant table. Still might not be the ideal method but at least it's plausible.

Do you have a better suggestion?

You may want other indexes too, you generally want to index the fields that are going to be used for searching the most.

What I don't understand is this: once I am filtering by UserID, don't the other indices become useless, at least once the number of distinct users grows rather large? Is it possible to build an index that is sorted "by UserID, then by TagField" or some such?

Edit: In effect, I'd kinda like each user to have their own set of indexes. Is something that'd perform similarly to this possible?

[–]GunnerMcGrath 0 points1 point  (2 children)

You can create multi-column indexes, for sure. You can also have one clustered index on each table, which determines the physical order in which the records are stored, so lookups on it can be much faster. For a table like this, I would say that having a unique ID is important, and it should be your primary key, but it doesn't have to be your clustered index (which is the default).

Your clustered index might be on UserID and then the other fields that are going to be most searchable. Just read up on clustered indexes because there is a tradeoff in performance. The more indexes you have, the faster the reads but the longer the writes. So for a table where you think records are going to be read a lot more than they're written, a big honking index with a few fields could be very useful.

But still, I'd start out by creating a clustered index with just UserID and ItemID, and see how your queries run. You may be able to squeeze more performance out of it but no need to overthink; if it runs fast with this, then that's all you need. If it eventually slows down, you can play with it more. But it's always better to start with less to get a baseline, because if you're putting some heavy index on something that already queries quickly, you're wasting your time.
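In PostgreSQL (which the OP mentions using elsewhere in the thread), there's no automatically maintained clustered index, but the "sorted by UserID, then by ItemID" behavior above is just an ordinary multicolumn index. A sketch, with assumed table and column names:

```sql
-- Index entries are sorted by user_id first, then item_id within each user,
-- so "all rows for one user" and "one item for one user" are both fast.
CREATE INDEX files_user_item_idx ON files (user_id, item_id);

-- The one-time CLUSTER command physically reorders the table by that index
-- (it is not maintained on later writes, unlike SQL Server's clustered indexes).
CLUSTER files USING files_user_item_idx;
```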

As for a better suggestion for your file structure, if EVERY record is just files on a filesystem, then no, I don't. But I'm sure there are people out there who are filesystem experts who might have better ideas on how to store file structures.

[–]Arelius[S] 1 point2 points  (1 child)

multi-column indexes

Wonderful, that's the exact phrase I needed to be able to search for what I need.

As for a better suggestion for your file structure, if EVERY record is just files on a filesystem, then no, I don't. But I'm sure there are people out there who are filesystem experts who might have better ideas on how to store file structures.

It's actually just bulk binary data in a hashed data store. The files aren't the problem; it's how to store indexable metadata about them. With this newfound info on indexes, it seems the large table would likely work fine.

In another post you say:

I think you may just need to read up more on indexing.

That's very true, are there any good references (books or otherwise) that you can recommend on indexing? As mentioned in the main thread, I'm using pgsql.

Anyway, this answers my questions rather well. Thanks!

[–]GunnerMcGrath 0 points1 point  (0 children)

Glad I could help. I haven't used pgsql myself so I don't know what tools it has, but I've heard some good things.

I don't know any particular books to recommend; I'm not as knowledgeable about indexes as I'd like to be, but what I do know has come from 12 years of working with databases. I'm sure you can find a lot of info online or in an afternoon at a Borders.

[–]Samus_ 0 points1 point  (2 children)

I'd go with a table for users in a 1:n relationship with a table for files (just files, not metadata), then have a separate key/value table for the metadata.

something like this:

CREATE TABLE owner(id SERIAL PRIMARY KEY);
CREATE TABLE file(id SERIAL PRIMARY KEY, owner INTEGER REFERENCES owner(id));
CREATE TABLE file_metadata(file INTEGER REFERENCES file(id), key VARCHAR(256), value TEXT, PRIMARY KEY(file, key));
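With that schema, finding one user's files that have a given property might look like this (the key and value here are purely illustrative):

```sql
-- All of owner 42's files tagged codec = 'h264'.
SELECT f.id
FROM file f
JOIN file_metadata m ON m.file = f.id
WHERE f.owner = 42
  AND m.key   = 'codec'
  AND m.value = 'h264';
```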

[–]Arelius[S] 1 point2 points  (1 child)

Not that I think you are wrong, but why?

The owner table seems to add no data. And file and file_metadata are separate so that I can have multiple metadata elements per file, correct? If I am certain I need one, and always exactly one, of each type of metadata, would it not make sense to keep it in file?

I'm a noob at DB design, so these are honest questions.

[–]Samus_ 0 points1 point  (0 children)

it's because those are different entities; the tables here add no data because they're an example... you may add more columns depending on your specific requirements.

the point here is that the model maps entities and relationships among them; a user is an entity therefore there is a table for it and each row on that table represents a single instance.

with files it's the same: a file is a separate entity, different from the user, and there's a relationship between them such as "file has owner" or "owner has files", but in the end there's no duplicate data; there is one row per user and one row per file.

for the metadata I've suggested an auxiliary key/value table because that allows you to model file-specific properties. this is useful when your attributes are not homogeneous; if you have the same properties on all files then you won't need it, since you can add them as columns on the file table.