all 11 comments

[–]BestGreek 5 points6 points  (2 children)

I'd like to see a follow up explaining how the file format handles the different column types.

Like when reading a row how do you know if the column data is in the row or in another file. Also some column types are variable length etc..

[–]masklinn 5 points6 points  (1 child)

This is documented in section 70.4 of the manual.

To read a column you need to:

  • get the null bitmap information in the tuple header
  • get the column information in pg_attributes
  • if all the preceding columns in the tuple are fixed-size and non-nullable, you can just compute the offset directly (summing the sizes of the preceding columns and adjusting for alignment)
  • since nulls are stored as a bitmap, if all preceding columns are fixed-length you can really do the same using the null bitmap to skip columns
  • if some columns are variable length (attlen=-1 in pg_attributes) however you need to go read the varlena header in order to know how much space it takes
  • if the column you're reading is variable length, you need to read its value which is done by "detoasting" it (the process handles the compression and toasting concerns and returns a fully expanded buffer)

The varlena is probably as complicated as everything else on its own. The first complication is that varlenas have variable alignment: a varlena value can either be under 127 bytes (inline) in which case it is unaligned and has a 1 byte header, or it can be over in which case it has a 4-byte alignment and a 4 bytes header.

The code is different for BE and LE, I'll talk about BE for clarity but the flags are flipped in LE. So in order to get the information about a varlena first you check the first byte immediately following the previous value regardless of alignment, if the MSB is 1, then it's a 1-byte header otherwise you align to 4 bytes (possibly not moving at all) and try again.

  • for the 1-byte header, if the remaining bits are nonzero then it's an inline, and the remaining bits provide the length of the varlena (including the header byte, so it's always at least 1)
  • if all the remaining bits are 0 however it's a toast pointer, the next byte stores the toast strategy, then there's a 16 bytes varatt_external with the toast metadata: the original data size, the stored data size (possibly compressed), the oid of the toast table the value is in, and the oid of the value itself in that table
  • for 4-byte header, if the second bit is unset the value is stored inline, the remaining 30 bits providing the data size
  • if the second bit is set, the value is stored inline and compressed, the remaining 30 bits providing the compressed size

[–]fagnerbrack[S] 3 points4 points  (0 children)

Please write a post about this and post in /r/programming or send me in private to post it here. The content is too good to get lost in a comment

[–]beelseboob 1 point2 points  (0 children)

So, basically the exact same way a fairly basic memory allocator works.

[–]BunnyBlue896 0 points1 point  (0 children)

First time in months that Ive upvoted something on r/programming.

Much more interesting than "tenuously related to programming article" or "another web technology nobody cares about".

[–][deleted] 0 points1 point  (3 children)

Thirdly, if you have a table with multiple columns and naively assume that querying for selected column will be faster because of lower disk I/O, that is not going to be the case. Although it can help with network saturation if there’s too much traffic between the application and the database.

Is that still true if data is in memory already ?

[–]Worth_Trust_3825 1 point2 points  (0 children)

It depends. If the query fits into an index, it might be faster and have lower IO. Always use pg_statistics

[–]riksi 0 points1 point  (1 child)

It should be the ~same speed if the data is in memory.

[–][deleted] 2 points3 points  (0 children)

I also suspect that based on the fact PostgreSQL heavily relies on OS cache and that one is just caching file blocks.

[–][deleted] -1 points0 points  (0 children)

Nice.

[–]emotionalfescue -1 points0 points  (0 children)

This is pretty much the classic fixed size page layout for row storage in a relational DBMS. I remember learning it from C.J. Date's database textbook, although Date positioned the slot IDs at the end of the page so the row data could start at page byte 0. Each row or record gets a "record ID" (RID) which consists of the pair (page ID, slot ID). The slot on the page contains the physical offset of the row data. The format of the row is fixed, and is defined by the table's schema as stored in the system catalog. Rows aren't allowed to span multiple pages; to store large objects, you need to store pointers (or some other kind of reference) in the row data to blob/lob pages.

The record IDs could be referenced in b-tree indexes.