The Sun VFS interface has been the most widely used of the three described here. It is also the most general of the three, in that filesystem-specific data and operations are best separated from the generic layer. Although it has several disadvantages, which were described above, most of them may be corrected with minor changes to the interface (and, in a few areas, philosophical changes). The DEC GFS has other advantages, in particular the use of the 4.3BSD namei interface and its optimizations. It allows single or multiple components of a pathname to be translated in a single call to the specific filesystem, and thus accommodates filesystems with either preference. The FSS is the least well understood, as there is little public information about the interface; however, its design goals are the least consistent with those of the Berkeley research groups. Accordingly, a new filesystem interface has been devised to avoid some of the problems in the other systems.

The proposed interface derives directly from Sun's VFS but, like GFS, uses a 4.3BSD-style name lookup interface. Additional context information has been moved from the user structure to the nameidata structure so that name translation may be independent of the global context of a user process. This independence is especially desirable in any system where kernel-mode servers operate as light-weight or interrupt-level processes, or where a server may store or cache context for several clients. This calling interface has the additional advantage that the call parameters need not all be pushed onto the stack for each call through the filesystem interface; instead, they may be accessed using short offsets from a base pointer (unlike global variables in the user structure).
The proposed filesystem interface is described very tersely here. For the most part, data structures and procedures are analogous to those used by VFS, and only the changes will be treated here. See [Kleiman86] for complete descriptions of the vfs and vnode operations in Sun's interface.
The central data structure for name translation is the nameidata structure. The same structure is used to pass parameters to namei, to pass these same parameters to filesystem-specific lookup routines, to communicate completion status from the lookup routines back to namei, and to return completion status to the calling routine. For creation or deletion requests, the parameters for the filesystem operation that completes the request are also passed in this same structure. The form of the nameidata structure is:
/*
 * Encapsulation of namei parameters.
 * One of these is located in the u. area to
 * minimize space allocated on the kernel stack
 * and to retain per-process context.
 */
struct nameidata {
	/* arguments to namei and related context: */
	caddr_t		ni_dirp;	/* pathname pointer */
	enum uio_seg	ni_seg;		/* location of pathname */
	short		ni_nameiop;	/* see below */
	struct vnode	*ni_cdir;	/* current directory */
	struct vnode	*ni_rdir;	/* root directory, if not normal root */
	struct ucred	*ni_cred;	/* credentials */

	/* shared between namei, lookup routines and commit routines: */
	caddr_t		ni_pnbuf;	/* pathname buffer */
	char		*ni_ptr;	/* current location in pathname */
	int		ni_pathlen;	/* remaining chars in path */
	short		ni_more;	/* more left to translate in pathname */
	short		ni_loopcnt;	/* count of symlinks encountered */

	/* results: */
	struct vnode	*ni_vp;		/* vnode of result */
	struct vnode	*ni_dvp;	/* vnode of intermediate directory */

/* BEGIN UFS SPECIFIC */
	struct diroffcache {		/* last successful directory search */
		struct vnode	*nc_prevdir;	/* terminal directory */
		long		nc_id;		/* directory's unique id */
		off_t		nc_prevoffset;	/* where last entry found */
	} ni_nc;
/* END UFS SPECIFIC */
};
/*
 * namei operations and modifiers
 */
#define	LOOKUP		0	/* perform name lookup only */
#define	CREATE		1	/* setup for file creation */
#define	DELETE		2	/* setup for file deletion */
#define	WANTPARENT	0x10	/* return parent directory vnode also */
#define	NOCACHE		0x20	/* name must not be left in cache */
#define	FOLLOW		0x40	/* follow symbolic links */
#define	NOFOLLOW	0x0	/* don't follow symbolic links (pseudo) */
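As an illustration of this calling convention, the sketch below fills in a nameidata structure and calls namei to translate a pathname on behalf of a caller. The routine name example_lookup is hypothetical, and the assumption that namei returns an errno-style value (with the result vnode left in ni_vp) is made only for the purpose of the example.

/*
 * Hypothetical example: translate a user-supplied pathname
 * using the caller's current and root directories and credentials.
 * Assumes namei() returns an errno-style value and leaves the
 * resulting vnode in ni_vp; the exact return convention is not
 * specified here.
 */
int
example_lookup(path, cdir, rdir, cred, vpp)
	char *path;
	struct vnode *cdir, *rdir;	/* caller's current and root directories */
	struct ucred *cred;
	struct vnode **vpp;
{
	struct nameidata nd;
	int error;

	nd.ni_dirp = (caddr_t)path;		/* pathname to translate */
	nd.ni_seg = UIO_USERSPACE;		/* pathname is in user space */
	nd.ni_nameiop = LOOKUP | FOLLOW;	/* plain lookup; follow symbolic links */
	nd.ni_cdir = cdir;			/* current directory for the translation */
	nd.ni_rdir = rdir;			/* alternate root, if any */
	nd.ni_cred = cred;			/* credentials used for the translation */
	bzero((caddr_t)&nd.ni_nc, sizeof(nd.ni_nc)); /* leave directory offset cache empty */

	if (error = namei(&nd))
		return (error);
	*vpp = nd.ni_vp;			/* vnode of the result */
	return (0);
}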
The nameidata structure stores the context used during name translation. The current and root directories for the translation are kept here, and, for the local filesystem, so is the per-process directory offset cache. A file server could leave the directory offset cache empty, could use a single cache for all clients, or could hold caches for several recent clients, as sketched below.
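For instance, a kernel-mode server holding context for several clients might keep the directory offset cache in its per-client state and install it in the nameidata before each translation. The svclient structure and its field names below are purely illustrative.

/*
 * Hypothetical per-client state for a kernel-mode file server.
 * The server may copy sc_nc into ni_nc before calling namei and
 * copy it back afterward, retaining a cache per client; zeroing
 * sc_nc instead would simply leave the cache empty.
 */
struct svclient {
	struct ucred	*sc_cred;	/* client's credentials */
	struct vnode	*sc_rdir;	/* client's root directory */
	struct diroffcache sc_nc;	/* client's directory offset cache */
};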
Several other data structures are used in the filesystem operations. One is the ucred structure, which describes a client's credentials to the filesystem. This is modified slightly from the Sun structure: the "accounting" group ID has been merged into the groups array, and the actual number of groups in the array is given explicitly to avoid use of a reserved group ID as a terminator. Also, the typedefs introduced in 4.3BSD for user and group IDs have been used. The ucred structure is thus:
/*
 * Credentials.
 */
struct ucred {
	u_short	cr_ref;			/* reference count */
	uid_t	cr_uid;			/* effective user id */
	short	cr_ngroups;		/* number of groups */
	gid_t	cr_groups[NGROUPS];	/* groups */
	/*
	 * The following either should not be here,
	 * or should be treated as opaque.
	 */
	uid_t	cr_ruid;		/* real user id */
	gid_t	cr_svgid;		/* saved set-group id */
};
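Because cr_ngroups gives the number of valid entries in cr_groups explicitly, a group-membership test can bound its scan by that count rather than searching for a reserved terminator. A minimal sketch of such a check (the routine itself is not part of the proposed interface) is:

/*
 * Sketch of a group-membership test against a ucred.
 * cr_ngroups bounds the scan, so no reserved group ID
 * is needed to terminate the array.
 */
int
groupmember(gid, cred)
	gid_t gid;
	struct ucred *cred;
{
	register int i;

	for (i = 0; i < cred->cr_ngroups; i++)
		if (cred->cr_groups[i] == gid)
			return (1);
	return (0);
}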
A final structure used by the filesystem interface is the uio structure mentioned earlier. This structure describes the source or destination of an I/O operation, with provision for scatter/gather I/O. It is used in the read and write entries to the filesystem. The uio structure presented here is modified from the one used in 4.2BSD to specify the location of each vector of the operation (user or kernel space) and to allow an alternate function to be used to implement the data movement. The alternate function might perform page remapping rather than a copy, for example.
/*
 * Description of an I/O operation which potentially
 * involves scatter-gather, with individual sections
 * described by iovec, below.  uio_resid is initially
 * set to the total size of the operation, and is
 * decremented as the operation proceeds.  uio_offset
 * is incremented by the amount of each operation.
 * uio_iov is incremented and uio_iovcnt is decremented
 * after each vector is processed.
 */
struct uio {
	struct iovec	*uio_iov;
	int		uio_iovcnt;
	off_t		uio_offset;
	int		uio_resid;
	enum uio_rw	uio_rw;
};

enum uio_rw { UIO_READ, UIO_WRITE };
/*
 * Description of a contiguous section of an I/O operation.
 * If iov_op is non-null, it is called to implement the copy
 * operation, possibly by remapping, with the call
 *	(*iov_op)(from, to, count);
 * where from and to are caddr_t and count is int.
 * Otherwise, the copy is done in the normal way,
 * treating base as a user or kernel virtual address
 * according to iov_segflg.
 */
struct iovec {
	caddr_t		iov_base;
	int		iov_len;
	enum uio_seg	iov_segflg;
	int		(*iov_op)();
};
/*
 * Segment flag values.
 */
enum uio_seg {
	UIO_USERSPACE,		/* from user data space */
	UIO_SYSSPACE,		/* from system space */
	UIO_USERISPACE		/* from user I space */
};
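To show how these structures are intended to drive data movement, the sketch below moves n bytes between a kernel buffer and the areas described by a uio, invoking iov_op when one is supplied and otherwise copying according to iov_segflg. It is modeled loosely on the 4.2BSD uiomove routine; the name, the error handling, and the decision to treat user I space like user data space are illustrative only.

/*
 * Sketch of the data movement implied by the uio and iovec
 * structures: move n bytes between the kernel buffer cp and
 * the areas described by uio.
 */
int
example_uiomove(cp, n, uio)
	register caddr_t cp;
	register int n;
	register struct uio *uio;
{
	register struct iovec *iov;
	int cnt, error = 0;

	while (n > 0 && uio->uio_resid > 0) {
		iov = uio->uio_iov;
		cnt = iov->iov_len;
		if (cnt == 0) {
			/* this vector is exhausted; advance to the next */
			uio->uio_iov++;
			uio->uio_iovcnt--;
			continue;
		}
		if (cnt > n)
			cnt = n;
		if (iov->iov_op) {
			/* alternate copy function, e.g. page remapping */
			if (uio->uio_rw == UIO_READ)
				(*iov->iov_op)(cp, iov->iov_base, cnt);
			else
				(*iov->iov_op)(iov->iov_base, cp, cnt);
		} else if (iov->iov_segflg == UIO_SYSSPACE) {
			/* both addresses are kernel virtual; plain copy */
			if (uio->uio_rw == UIO_READ)
				bcopy(cp, iov->iov_base, cnt);
			else
				bcopy(iov->iov_base, cp, cnt);
		} else {
			/* user address; I space treated as data space here */
			if (uio->uio_rw == UIO_READ)
				error = copyout(cp, iov->iov_base, cnt);
			else
				error = copyin(iov->iov_base, cp, cnt);
			if (error)
				break;
		}
		/* account for the bytes moved */
		iov->iov_base += cnt;
		iov->iov_len -= cnt;
		uio->uio_resid -= cnt;
		uio->uio_offset += cnt;
		cp += cnt;
		n -= cnt;
	}
	return (error);
}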