Structure of UNIX file system

These are my reading notes on Operating Systems: Three Easy Pieces.

Modelling file and directory

We can model File and Directory in Python as follows:

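import typing as T

class File:
    inode: int      # low-level name of the file
    data: bytes     # contents: a flat array of bytes

class Directory:
    inode: int      # a directory has an inode number too
    content: T.List[T.Tuple[str, int]]  # (readable_name, inode) pairs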

Both File and Directory have an internal, low-level name: the inode number.

A File's content is an array of bytes, while a Directory's content is a list of pairs (readable_file_name, the_file_inode).
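To make the model concrete, here is a minimal sketch of resolving a readable name to an object through its inode number. The inode numbers, the inode_table dictionary and the lookup function are made up for illustration; they are not taken from the book.

```python
import typing as T
from dataclasses import dataclass


@dataclass
class File:
    inode: int
    data: bytes


@dataclass
class Directory:
    inode: int
    content: T.List[T.Tuple[str, int]]


# Hypothetical inode table: inode number -> object (the numbers are made up).
foo = File(inode=12, data=b"hello\n")
root = Directory(inode=2, content=[(".", 2), ("..", 2), ("foo", 12)])
inode_table: T.Dict[int, T.Union[File, Directory]] = {2: root, 12: foo}


def lookup(directory: Directory, name: str) -> T.Union[File, Directory]:
    """Resolve a readable name to the object behind its inode number."""
    for entry_name, inode in directory.content:
        if entry_name == name:
            return inode_table[inode]
    raise FileNotFoundError(name)


print(lookup(root, "foo").data)
```

The directory never stores file data itself, only the mapping from names to inode numbers.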

Create a file

In C, a file is created by calling the open() system call with the O_CREAT flag. open() returns a file descriptor: an integer, private to each process, that UNIX systems use to access files.

The file descriptor corresponds to an open file handle object:

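class FileHandle:
    ref: int        # reference count, because the object can be shared
    readable: bool
    writable: bool
    inode: int      # inode number of the underlying file (see File.inode)
    offset: int     # current read/write position in the file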

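In Python, the built-in open returns a file object (a handle) instead of the raw file descriptor; os.open is the call that returns the integer descriptor. os.fdopen is an alias of open, except that its first argument must be a file descriptor instead of a path. Both return a file object.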
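A small sketch of the difference, assuming a POSIX system. The temporary file is only there so the example does not depend on an existing path.

```python
import os
import tempfile

# A throwaway file so the example does not depend on any existing path.
tmp_fd, path = tempfile.mkstemp()
os.close(tmp_fd)

# os.open wraps the open(2) system call and returns a plain integer.
fd = os.open(path, os.O_WRONLY | os.O_TRUNC)
assert isinstance(fd, int)

# os.fdopen wraps an existing descriptor in a file object.
f = os.fdopen(fd, "w")
f.write("hello\n")
wrapped_fd = f.fileno()   # the file object still exposes the descriptor
f.close()                 # closing the file object also closes fd

# The built-in open takes a path and returns the same kind of file object.
with open(path) as g:
    content = g.read()

os.unlink(path)
```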

open, fdopen and fopen

See https://stackoverflow.com/questions/1658476/c-fopen-vs-open

Read a file

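The book introduces strace here, a very useful tool that traces the system calls a program makes.

If read() returns 0, does it mean the file is closed? No! It just means the file has no more content to read (EOF). The descriptor stays open until close() is called.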
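The EOF behaviour can be checked with os.read, which returns an empty bytes object where the C call returns 0. This is a sketch using a temporary file, not an example from the book.

```python
import os
import tempfile

fd, path = tempfile.mkstemp()
os.write(fd, b"abc")
os.lseek(fd, 0, os.SEEK_SET)

first = os.read(fd, 100)    # getting fewer bytes than requested is fine
second = os.read(fd, 100)   # nothing left: empty result, i.e. EOF

# The descriptor is still open: more data can be written and read back.
os.write(fd, b"def")
os.lseek(fd, 3, os.SEEK_SET)
third = os.read(fd, 100)

os.close(fd)
os.unlink(path)
```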

If you open two files, or the same file twice, each open() call gets its own file descriptor and its own FileHandle, so the offsets are independent.
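A quick way to see the independent offsets, sketched with a temporary file opened twice:

```python
import os
import tempfile

tmp_fd, path = tempfile.mkstemp()
os.write(tmp_fd, b"0123456789")
os.close(tmp_fd)

# Two separate open() calls: two descriptors, two open file handles.
fd1 = os.open(path, os.O_RDONLY)
fd2 = os.open(path, os.O_RDONLY)

os.read(fd1, 4)                            # advances only fd1's offset
offset1 = os.lseek(fd1, 0, os.SEEK_CUR)    # current offset of fd1
offset2 = os.lseek(fd2, 0, os.SEEK_CUR)    # current offset of fd2

os.close(fd1)
os.close(fd2)
os.unlink(path)
```

Here lseek with an offset of 0 and SEEK_CUR moves nothing; it only reports the current position.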

Shared file table

After fork(), the parent and child processes share the same entries in the open file table. That means if the child seeks in a file that was opened before the fork, the parent sees the offset change as well, because both descriptors refer to the same FileHandle. ref tracks how many descriptors refer to the handle. When all of them have been closed, the object is removed.
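The shared offset can be observed from Python with os.fork. This is a sketch that assumes a POSIX system (os.fork is not available on Windows); the offset 7 is arbitrary.

```python
import os
import tempfile

fd, path = tempfile.mkstemp()
os.write(fd, b"0123456789")
os.lseek(fd, 0, os.SEEK_SET)

pid = os.fork()
if pid == 0:
    # Child: move the offset of the shared open file, then exit at once.
    os.lseek(fd, 7, os.SEEK_SET)
    os._exit(0)

# Parent: wait for the child, then ask where the offset is now.
os.waitpid(pid, 0)
parent_offset = os.lseek(fd, 0, os.SEEK_CUR)

os.close(fd)
os.unlink(path)
```

The parent never seeks to 7 itself, yet it reports the offset the child set.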

The dup() call creates a new file descriptor that refers to the same underlying open file as an existing descriptor.
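A sketch of the same idea with os.dup, which wraps dup(): the two descriptors are different integers but share one offset, and closing one leaves the other usable.

```python
import os
import tempfile

fd, path = tempfile.mkstemp()
os.write(fd, b"0123456789")
os.lseek(fd, 0, os.SEEK_SET)

fd2 = os.dup(fd)                             # new descriptor, same open file

os.lseek(fd, 5, os.SEEK_SET)                 # seek through the original...
dup_offset = os.lseek(fd2, 0, os.SEEK_CUR)   # ...and observe it via the copy

os.close(fd)                                 # the open file survives this
rest = os.read(fd2, 100)                     # reads from the shared offset
os.close(fd2)
os.unlink(path)
```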


fsync

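write() only buffers the data in memory; the file system writes it to disk at some later point. fsync() forces all dirty (not yet written) data of the given file descriptor to disk and returns only once those writes are complete. It sounds like flush, but one level lower: a flush moves data from the program's own buffer to the kernel, while fsync() moves it from the kernel's buffers to the disk.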

Note that you may need to fsync() the containing directory as well. The book says about the open(), write(), fsync() sequence:

Interestingly, this sequence does not guarantee everything that you might expect; in some cases, you also need to fsync() the directory that contains the file foo. Adding this step ensures not only that the file itself is on disk, but that the file, if newly created, also is durably a part of the directory. Not surprisingly, this type of detail is often overlooked, leading to many application-level bugs [P+13,P+14].
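A minimal sketch of the full sequence in Python, assuming a POSIX system where a directory can be opened read-only and passed to fsync (this works on Linux). The helper name durable_create is made up for illustration.

```python
import os
import tempfile


def durable_create(directory: str, name: str, data: bytes) -> str:
    """Create a file and push both its data and its directory entry to disk."""
    path = os.path.join(directory, name)
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, data)   # assumes a small write that completes in one call
        os.fsync(fd)         # force the file's data and metadata to disk
    finally:
        os.close(fd)

    # fsync the containing directory so the new name itself is durable.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
    return path


workdir = tempfile.mkdtemp()
created = durable_create(workdir, "foo", b"hello\n")
```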

rename

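Renaming a file is atomic with respect to system crashes: afterwards the file has either the old name or the new name, never something in between.

In the Emacs example, I did not fully understand at first why the editor writes the new content to foo.txt.tmp and then renames it to foo.txt. Why not open foo.txt and write the new content directly? Because open() followed by write() is not atomic: a crash in between can leave foo.txt truncated or half-written. rename() is atomic, so foo.txt is always either the complete old version or the complete new one.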
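The write-then-rename pattern can be sketched as follows. The function name atomic_write and the ".tmp" suffix are assumptions for illustration; os.rename wraps rename() and, on POSIX, replaces an existing destination file.

```python
import os
import tempfile


def atomic_write(path: str, data: bytes) -> None:
    """Replace the contents of path so readers see the old or the new file."""
    tmp_path = path + ".tmp"
    fd = os.open(tmp_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, data)   # assumes a small write that completes in one call
        os.fsync(fd)         # the new content must be on disk before the swap
    finally:
        os.close(fd)
    os.rename(tmp_path, path)   # atomically swap the new file into place


workdir = tempfile.mkdtemp()
target = os.path.join(workdir, "foo.txt")
with open(target, "wb") as fh:
    fh.write(b"old content\n")

atomic_write(target, b"new content\n")
```

If the program crashes before the rename, foo.txt still holds the old content and only the temporary file is left behind.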

File stats

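stat() and fstat() return the metadata of a file. The difference is the argument: stat() takes a path, while fstat() takes an open file descriptor. The metadata contains a lot of information: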

stat $(mktemp)

This information is kept in a structure called an inode.
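In Python the two calls are os.stat (by path) and os.fstat (by descriptor). A small sketch showing that both report the same inode for the same file:

```python
import os
import tempfile

fd, path = tempfile.mkstemp()
os.write(fd, b"hello")

by_path = os.stat(path)   # stat(): looks the file up by name
by_fd = os.fstat(fd)      # fstat(): uses an already open descriptor

# A few of the fields stored in the inode.
print(by_path.st_ino, by_path.st_size, by_path.st_nlink, oct(by_path.st_mode))

os.close(fd)
os.unlink(path)
```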

Remove file

Removing a file is the same as unlinking it: rm calls unlink().

> strace rm foo
unlink("foo")
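The same call from Python, as a sketch on a temporary file. os.unlink wraps unlink(), and os.remove is documented as semantically identical to it.

```python
import os
import tempfile

fd, path = tempfile.mkstemp()
os.close(fd)

existed_before = os.path.exists(path)
os.unlink(path)                        # remove the name from its directory
exists_after = os.path.exists(path)
```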

Make directories

mkdir("foo", 0777)
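The Python equivalent is os.mkdir. In this sketch the umask value 0o022 is an assumption chosen for illustration; the requested mode 0o777 is filtered through the process umask, so the directory usually ends up with fewer permission bits than requested.

```python
import os
import stat
import tempfile

workdir = tempfile.mkdtemp()
target = os.path.join(workdir, "foo")

old_umask = os.umask(0o022)     # set a known umask for the demonstration
try:
    os.mkdir(target, 0o777)     # same arguments as mkdir("foo", 0777) in C
finally:
    os.umask(old_umask)         # restore the previous umask

# Permission bits actually granted, after the umask was applied.
mode = stat.S_IMODE(os.stat(target).st_mode)
entries = os.listdir(target)    # a new directory has no visible entries
print(oct(mode), entries)
```

os.listdir leaves out the "." and ".." entries, so the new directory appears empty.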

read

