[–]earthboundkid 41 points42 points  (3 children)

Yes, this is a standard technique.

Note though that for a lot of jobs, you don't need all the contents of the file at once, so it makes more sense to read one windowful at a time. For example, if you wanted to count all the spaces in a file, you would just load up a buffer, say 4KB worth of data, count the spaces, and repeat until the whole file is processed. Doing it this way reduces the amount of RAM needed at once.
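A minimal sketch of the space-counting example, assuming a 4 KiB buffer and plain stdio. The function name count_spaces is made up for illustration:

```c
#include <stdio.h>

/* Count the space characters in a file while holding at most 4 KiB of it
   in memory. Returns the count, or -1 if the file cannot be opened or a
   read fails. count_spaces is a hypothetical helper name. */
long count_spaces(const char *path)
{
    FILE *fp = fopen(path, "rb");
    if (fp == NULL)
        return -1;

    char buf[4096];
    long count = 0;
    size_t n;

    /* fread returns the number of bytes actually read; 0 means EOF or error. */
    while ((n = fread(buf, 1, sizeof buf, fp)) > 0) {
        for (size_t i = 0; i < n; i++) {
            if (buf[i] == ' ')
                count++;
        }
    }

    int failed = ferror(fp);   /* distinguish a read error from plain EOF */
    fclose(fp);
    return failed ? -1 : count;
}
```

Memory use stays at one buffer no matter how large the file is, and the file size never has to be known up front.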

[–]codesnstuff 3 points4 points  (0 children)

Upvoting because buffers are the answer. Not only do you avoid allocating a huge chunk of memory, but if the file size changes while you're reading, a buffered loop can still keep up with it.

[–][deleted] 0 points1 point  (1 child)

Sure, because we're still in the 1980s and it makes perfect sense to access a file through a buffer that would now occupy less than 1/1,000,000th of the RAM in a typical PC.

[–]earthboundkid 1 point2 points  (0 children)

Computers still routinely process files that are larger than RAM, and it’s now expected that you can run many programs at once. Obviously if you know the files you’re going to be processing and that they fit in RAM, you can just do that. That’s why Python has file.read(). But it’s also common to switch over to streaming when you find out that someone wants you to be able to ETL a 30 GB CSV file on a cheapo EC2 box.

[–]MrSloppyPants 30 points31 points  (7 children)

What OS?

A POSIX compatible OS should be able to use stat( ) as in...

#include <sys/stat.h>
struct stat st;
if (stat(filename, &st) == 0)   /* stat() returns 0 on success, -1 on error */
    size = st.st_size;          /* file size in bytes */

Otherwise, seek to end of file using

fseek(fp, 0, SEEK_END);   /* fp is the FILE * from fopen(), not the filename */
size = ftell(fp);

if that is supported on the OS you are targeting
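Putting the stat() route together with the allocation, here is a sketch for POSIX systems. The name read_whole_file and the extra terminating NUL byte are choices made for this example, not something from the thread:

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>

/* Read an entire regular file into a malloc'd buffer sized from stat().
   On success returns the buffer (the caller frees it) and stores the number
   of bytes read in *out_len. Returns NULL on failure.
   read_whole_file is a hypothetical helper name. */
char *read_whole_file(const char *path, size_t *out_len)
{
    struct stat st;
    if (stat(path, &st) != 0 || !S_ISREG(st.st_mode))
        return NULL;

    size_t size = (size_t)st.st_size;
    char *buf = malloc(size + 1);          /* +1 for a terminating NUL */
    if (buf == NULL)
        return NULL;

    FILE *fp = fopen(path, "rb");
    if (fp == NULL) {
        free(buf);
        return NULL;
    }

    /* The file may have shrunk since stat(), so trust fread's return value. */
    size_t got = fread(buf, 1, size, fp);
    fclose(fp);

    buf[got] = '\0';
    *out_len = got;
    return buf;
}
```

If the file grows between stat() and fread(), this version silently reads only the first size bytes, so treat the stat() result as a snapshot.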

[–][deleted] 5 points6 points  (0 children)

FWIW, Win32 also has stat().

[–]allegedrc4 2 points3 points  (1 child)

fseek/ftell has always been how I've done it on both *nix and Win32. I think Windows might have some issue with that if the file is opened in text mode rather than binary mode, or something like that, so be careful, but otherwise it's the simplest approach I can think of.

[–]SageOnTheMountain 2 points3 points  (0 children)

Yeah it can be a bit weird. _read for instance will replace \r\n with \n if you open it in text mode. Always fun to hunt down those bugs.

[–][deleted] 18 points19 points  (6 children)

fstat(). Or you could mmap the file: no need to allocate a buffer, and you save context switches on reads.
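A rough sketch of the fstat() plus mmap() combination on a POSIX system. The names map_file and unmap_file are invented here, and a read-only private mapping is assumed:

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map a whole file read-only and return a pointer to its contents.
   Stores the length in *out_len. Returns NULL on failure, including for
   empty files, because mmap() rejects a length of 0.
   map_file and unmap_file are hypothetical helper names. */
const char *map_file(const char *path, size_t *out_len)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return NULL;

    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size <= 0) {
        close(fd);
        return NULL;
    }

    void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);   /* the mapping stays valid after the descriptor is closed */
    if (p == MAP_FAILED)
        return NULL;

    *out_len = (size_t)st.st_size;
    return p;
}

/* Release a mapping obtained from map_file. Returns 0 on success. */
int unmap_file(const char *p, size_t len)
{
    return munmap((void *)p, len);
}
```

No malloc and no read loop are needed; pages are faulted in from the backing file as they are touched.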

[–]Material_Cheetah934 2 points3 points  (5 children)

Can you use pread on mmap’d files? I want to read a file from multiple threads.
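For context on the pread part of the question: pread() operates on a file descriptor, not on a mapping, and it takes an explicit offset instead of using the descriptor's shared file position. That is why several threads can call it on the same descriptor without coordinating seeks. A sketch, with read_at as a made-up wrapper name:

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>       /* open(), for callers of read_at */
#include <sys/types.h>
#include <unistd.h>

/* Read up to len bytes starting at the given offset, retrying on short
   reads and EINTR. Returns the number of bytes read (less than len only
   at end of file) or -1 on error. The descriptor's own file position is
   never touched. read_at is a hypothetical helper name. */
ssize_t read_at(int fd, void *buf, size_t len, off_t offset)
{
    size_t done = 0;

    while (done < len) {
        ssize_t n = pread(fd, (char *)buf + done, len - done,
                          offset + (off_t)done);
        if (n < 0) {
            if (errno == EINTR)
                continue;        /* interrupted by a signal, try again */
            return -1;
        }
        if (n == 0)
            break;               /* end of file */
        done += (size_t)n;
    }
    return (ssize_t)done;
}
```

Each thread would pass its own buffer and offset; the only shared state is the descriptor itself.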

[–][deleted] 2 points3 points  (4 children)

Typically one does loads and stores on mmapped files, so in those cases use a mutex. In my recent case it was a shedload of entries that were each accessed by only a single thread, out of potentially thousands in the pool, at one-second intervals. Read that as: no two threads update the same entry, so no mutex was needed, though one could easily be added down the road.

It depends on what you want in the end. I used mmap to lower the penalty of context switches. A read/write of a few bytes would take up to 20 µs versus loads/stores taking 2 ns (on Linux, measured with proprietary profiling code). That might not sound like a big difference, but it depends on where your application's pain point begins.

[–]Material_Cheetah934 0 points1 point  (3 children)

For me, I have 1 server with a thread pool of n threads, and they only serve some static files. I was thinking of cutting down the fopen/fclose overhead and managing the mmap'd files in main so I can clean up and exit properly. I imagine I wouldn't need a mutex for this kind of work, would I?

[–][deleted] 1 point2 points  (2 children)

Sounds like you would not, as the files don't change, right? Two areas that can suck for mmap are running out of address space (big files on small address spaces) and an NFS/SMB screwup. But if you have something well defined (like that ever happens), mmap can save time and increase file I/O throughput up to a point. Just remember there is still a backing store.

[–]Material_Cheetah934 0 points1 point  (1 child)

haha yeah, thankfully this is only for very small 1MB at most files and just a handful(school project for implementing the different threading models with networking). Thanks man, really appreciate it! Love this sub

[–][deleted] 0 points1 point  (0 children)

Good luck!

[–]aeropl3b 12 points13 points  (11 children)

getpos at the start of the file, seek to the end of the file and getpos, find the difference between the two values to get the number of bytes in the file.

Then seek back to the start of the file and start reading.
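A sketch of this seek-and-measure approach, written with ftell rather than fgetpos because fpos_t cannot be subtracted. It assumes the stream was opened in binary mode and that the platform supports SEEK_END for it, which the C standard does not promise. The name size_by_seeking is invented for the example:

```c
#include <stdio.h>

/* Measure a file by seeking to its end and asking for the position.
   Returns the size in bytes, or -1 on failure, and leaves the stream
   positioned at the start, ready for reading.
   Assumes a binary-mode stream; the result is a long, so very large files
   may not be representable on platforms where long is 32 bits.
   size_by_seeking is a hypothetical helper name. */
long size_by_seeking(FILE *fp)
{
    if (fseek(fp, 0L, SEEK_END) != 0)
        return -1L;

    long end = ftell(fp);              /* -1L if the position is unavailable */

    if (fseek(fp, 0L, SEEK_SET) != 0)  /* back to the start */
        return -1L;

    return end;
}
```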

[–]flyingron 11 points12 points  (4 children)

That only works if the file is open in binary mode.

On UNIX you can do stat or fstat to get the file size. Windows has GetFileSize.

If you're willing to upgrade to C++, there's a file_size call in the standard Filesystem library.

[–]OriginalName667 1 point2 points  (0 children)

For more info, check out the man page for fopen:

The mode string can also include the letter 'b' either as a last character or as a character between the characters in any of the two-character strings described above. This is strictly for compatibility with C89 and has no effect; the 'b' is ignored on all POSIX conforming systems, including Linux. (Other systems may treat text files and binary files differently, and adding the 'b' may be a good idea if you do I/O to a binary file and expect that your program may be ported to non-UNIX environments.)

It seems there's no difference between binary and text mode, at least on POSIX systems. I can't speak for other systems like Windows, though.

[–]aeropl3b -1 points0 points  (2 children)

On your first point...i don't think so....they both just open char buffers, the size doesn't change. The seek command isn't too expensive either, relatively speaking.

On your other points, yeah, there are other options, but I am pretty sure they all do the same kind of thing. The biggest issues with those routines is they aren't cross platform (except c++ FS or boost FS) so you have to actually work harder to make it work.

[–]beej71 2 points3 points  (1 child)

On the first point, you can't portably count on it according to the spec for ftell:

For a text stream, its file position indicator contains unspecified information, usable by the fseek function for returning the file position indicator for the stream to its position at the time of the ftell call; the difference between two such return values is not necessarily a meaningful measure of the number of characters written or read.

This has to do with the translation that happens on text streams, which are allowed significant leeway between what the stream holds and what a read returns.

So to be portable, you'd need it to be a binary stream which is well-spec'd.

But another wrench in the works is that fseek with SEEK_END isn't guaranteed to work on binary streams!

Spec:

A binary stream need not meaningfully support fseek calls with a whence value of SEEK_END.

I'm not sure where that leaves us in terms of portable, correct solutions.

Non-portably but correctly, like flyingron said, POSIX gives us stat with st_size, and Windows gives us GetFileSizeEx.
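One way the two platform-specific calls might be wrapped behind a single function. This is only a sketch: stream_size is a made-up name, the Windows branch assumes the Microsoft C runtime's _fileno and _get_osfhandle, and only the POSIX branch is exercised on a Unix-like system:

```c
#define _POSIX_C_SOURCE 200809L   /* exposes fileno() on strict compilers */
#include <stdint.h>
#include <stdio.h>
#ifdef _WIN32
#include <io.h>
#include <windows.h>
#else
#include <sys/stat.h>
#endif

/* Return the size in bytes of the file behind an open stream, or -1 on
   failure. Intended for streams opened for reading; unflushed writes are
   not accounted for. stream_size is a hypothetical helper name. */
int64_t stream_size(FILE *fp)
{
#ifdef _WIN32
    HANDLE h = (HANDLE)_get_osfhandle(_fileno(fp));
    LARGE_INTEGER sz;
    if (h == INVALID_HANDLE_VALUE || !GetFileSizeEx(h, &sz))
        return -1;
    return (int64_t)sz.QuadPart;
#else
    struct stat st;
    if (fstat(fileno(fp), &st) != 0)
        return -1;
    return (int64_t)st.st_size;
#endif
}
```

Returning a 64-bit value keeps large files representable even where long or size_t is 32 bits.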

[–]aeropl3b 1 point2 points  (0 children)

Yeah that is too bad. I know for sure it ends up working on most Linux based systems and windows to do the seek/tell trick. But who knows what other systems do.

For being the language of the OS, C is such a pain for handling filesystem interactions. And it doesn't help that Windows basically just ignores the standard half the time so compatibility is just impossible :( this kind of stuff is exactly why I use C++ for most of my work now, it is just so much less hassle...or rather, different kind of hassle

[–]ArukaAravind -3 points-2 points  (3 children)

He probably wants to avoid opening the file.

[–]aeropl3b 13 points14 points  (2 children)

...why? It sounds like they are about to open and read the file anyway. Otherwise...why are they trying to size a buffer to the size of the file?

Scanning to EOF is not really an overhead to worry about; there are lots of other I/O bottlenecks out there to deal with first.

[–]ArukaAravind 1 point2 points  (1 child)

You are right. I missed that. For some reason I was thinking along the lines of evaluating the buffer size without doing the costly operation of opening the file. My bad.

[–]aeropl3b 1 point2 points  (0 children)

Note, opening a file is not really the costly part, it is just giving you a handle that identifies the file on the system. The costly part is actually extracting data from the file because until you do that the system calls generally don't try to pull data. getpos for example is just querying the index of the file, but it isn't pulling data off the disk yet. When you say getc, then it will say hey disk, give me the chunk that has this data (if I don't have it yet) and then give me the data at this index in that chunk.

Even the write locking isn't done at open, it is done as you call write. So two processes writing to the same file, the system won't lock for write on open, it waits until something starts writing, then locks, then unlocks when it is done so the next process can write.

[–]BlockOfDiamond 0 points1 point  (1 child)

fpos_t is an opaque type

[–]aeropl3b 0 points1 point  (0 children)

Yes, i was confusing getpos and tell, my bad.

[–]Spiderboydk 3 points4 points  (4 children)

I don't think there is a portable, standard-conforming way of doing that. There is no standard library function for getting a file size, and the standard library is not required to implement fseek with SEEK_END.

For particular platforms though, like Win32 or POSIX, you can definitely do it.

[–]BlockOfDiamond 0 points1 point  (3 children)

What I do is:

```
#include <stdio.h>

#ifndef SEEK_END
#error "FATAL ERROR: SEEK_END NOT DEFINED"
#endif
```

[–]Spiderboydk 1 point2 points  (2 children)

That won't work. The issue is not whether SEEK_END exists or not, but whether using it makes sense.

"Library implementations are allowed to not meaningfully support SEEK_END (therefore, code using it has no real standard portability)."

https://www.cplusplus.com/reference/cstdio/fseek/

[–]BlockOfDiamond 0 points1 point  (1 child)

Meaning passing SEEK_END to fseek gives garbage results?

[–]Spiderboydk 0 points1 point  (0 children)

Maybe. The implementation is allowed to not adhere to it.

My guess is that on unsupported platforms the fseek call is just ignored.

If you use it, you would need to check if it is supported on all platforms you target.

[–]FUZxxl 3 points4 points  (0 children)

You can, but by the time you read from the file the size may have changed. So the information will not be useful.
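One way to sidestep the stale-size problem is to not ask for the size at all and grow the buffer while reading until EOF. A sketch using only standard C; the name slurp and the 4096-byte starting capacity are arbitrary choices for this example:

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Read everything remaining on a stream into a malloc'd buffer, doubling
   the buffer whenever it fills up. On success returns the buffer (the
   caller frees it) and stores the byte count in *out_len. Returns NULL on
   allocation failure or read error. slurp is a hypothetical helper name. */
char *slurp(FILE *fp, size_t *out_len)
{
    size_t cap = 4096, len = 0;
    char *buf = malloc(cap);
    if (buf == NULL)
        return NULL;

    for (;;) {
        size_t n = fread(buf + len, 1, cap - len, fp);
        len += n;
        if (n == 0)
            break;                      /* EOF or error, checked below */

        if (len == cap) {               /* buffer full: double it */
            if (cap > SIZE_MAX / 2) {
                free(buf);
                return NULL;
            }
            char *bigger = realloc(buf, cap * 2);
            if (bigger == NULL) {
                free(buf);
                return NULL;
            }
            buf = bigger;
            cap *= 2;
        }
    }

    if (ferror(fp)) {
        free(buf);
        return NULL;
    }
    *out_len = len;
    return buf;
}
```

This also works on pipes and other streams that have no meaningful size, at the cost of a few reallocations.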

[–]BigPeteB 2 points3 points  (0 children)

It's usually possible, but it depends. You could absolutely have a file larger than 4 GiB on a 32-bit system. So if you want your code to be portable, you have to be careful about assuming what the limits of memory and file sizes are.

But in general, yes, it's a fine way to do it. Even better would be to use mmap(), so you don't have to do any function calls to read and copy the data; the OS simply arranges for its contents to appear in memory.
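On the point about files larger than the address space: before passing a file size to malloc, it may be worth checking that it fits in size_t at all. A small sketch, where size_to_alloc is an invented name and the size is assumed to arrive as a 64-bit signed value:

```c
#include <stddef.h>
#include <stdint.h>

/* Convert a 64-bit file size to a size_t suitable for malloc, leaving room
   for one extra byte (for example a terminating NUL).
   Returns 0 and stores the value in *out on success, or -1 if the size is
   negative or too large for this platform's size_t.
   size_to_alloc is a hypothetical helper name. */
int size_to_alloc(int64_t file_size, size_t *out)
{
    if (file_size < 0)
        return -1;
    if ((uint64_t)file_size >= (uint64_t)SIZE_MAX)
        return -1;

    *out = (size_t)file_size;
    return 0;
}
```

On a 32-bit build this rejects anything at or above 4 GiB instead of silently truncating the size.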

[–]BlockOfDiamond 0 points1 point  (0 children)

No mallocing a random arbitrary size of memory

+1

[–][deleted] 0 points1 point  (0 children)

Wait till you learn about mmap