What is a System Call in Unix/Linux

By Sarath Pillai
System Calls

Let me start this article by stating the fact that “Linux is a multi-user and multiprocessing operating system.” That sentence is simple and straightforward, because we all know that multiple users can be logged into the same Linux machine and do their individual tasks, all at the same time.
 

You generally cannot have a multi-user operating system without the capability of multiprocessing. For two users to be logged in at the same time, the computer needs to run multiple processes at the same time (think of two different shell processes for those users, two different text editors opened by them, the GUI sessions opened by them, and so on, all at the same time).
 

Of course there are exceptions. Windows 95, 98, Windows 2000 etc. are all multiprocessing systems without multi-user capability (only one user can access the system at a time).
 

Apart from multiprocessing capability, a multiuser system also needs to be preemptible in nature.

Preemptible means the operating system temporarily interrupts (pauses) the current process with two intentions in mind: one is to let another process run, the other is to resume the paused process later. Each process will be interrupted (paused) and resumed later, and during that period other processes will be executed. This goes on forever. One important thing to note here is that the operating system never requests a process to pause; it is a forceful interruption (no cooperation is required from the process that is being paused).

The component in the operating system that does this sort of pause and resume operation is called the "Scheduler".

Switching the CPU from one process to another in this way is called a "Context Switch". So the bottom line is this: even though we call a system multiprocessing, technically a single CPU core is doing only one thing at a time. The scheduler keeps on switching the tasks, and we get the illusion that things are happening at the same time. This is why we start to experience a slow response from the system when many processes are running: the scheduler has to share the CPU among more processes, and every switch costs some time.

One important thing to note here is that this pause and resume (context switching) is essential for sharing CPU resources fairly, and in a timely manner, among all the processes that are running.

 


 

Now we understand why it is essential for a multiprocessing system to be preemptible in nature.

 

From the early days of Linux, user-space programs were preemptible, meaning the kernel will take the CPU away from them and make room for other processes in a timely manner. As we discussed, this is a good thing, because even if a user writes an infinite loop, that loop won't prevent other programs from running: the kernel does not need the permission of the program to pause it.

However, code running in kernel-space was not preemptible (until the 2.6 version of the Linux kernel).

I have used two new terms in the last two paragraphs. One is user-space and the other is kernel-space. What do they mean?

 

Programs are always considered "suspicious" if you look at them from the perspective of the operating system (kernel). No program is trusted. Only the kernel is trusted and nothing else.

So it is natural that there should be two main modes of operation: one for the operating system itself and the other for all other programs. Hence, the computer hardware (CPU) can be in two states (several CPU architectures have more than two). One is generally called User-mode (the less privileged mode), and the other is called Kernel-mode (the privileged mode).

Remember that these two modes are in the hardware, i.e. the CPU. The operating system can switch the CPU from one mode to the other as and when required (by altering some bit values in the registers on demand), which keeps on happening many times in a running computer.

When the CPU is in kernel-mode, there is unrestricted access to all hardware and system resources. When the CPU is in user-mode, there are restrictions on what it can do, and it cannot directly access hardware and other system resources.

Remember: reading a file, writing a file, accessing the network, and taking input from other devices are all hardware related activities, and none of them can be done directly in user-mode.

 

Generally all user programs start in user-mode. When a program needs access to system resources or hardware, it issues a request to the kernel. Programs use the kernel API to issue this request.

 

"System Call" is simply the term used to refer to a particular function provided by the kernel. Programs invoke these functions (system calls) to fulfill a requirement that needs privileges. The kernel defines many such functions, but the set is well defined and limited in number. Check this link (Linux source code on GitHub) to see the system calls defined for x86-64: https://github.com/torvalds/linux/blob/master/arch/x86/entry/syscalls/syscall_64.tbl

 

The system call API/Interface provided for user programs has several advantages.

  1. Programs using these interfaces are portable, as they can work on any kernel that provides the same interface.
  2. Security: Kernel validates the request before fulfilling it
  3. Programmers can concentrate on the application development rather than worrying about the low level hardware stuff (as they will use the functions - system call for hardware related stuff)

 

So what is the difference between System Call and an Interrupt?

 

The word interrupt itself says what it is: an operation done by either software or hardware to get the attention of the CPU. A hardware interrupt is generated by a hardware device attached to the system. For example, your network interface card will generate an interrupt when it receives a network packet. A printer attached to the system can generate an interrupt when it finishes printing a document.

 

A software interrupt is generated by a program running on the computer to get the attention of the CPU. All possible interrupts (both hardware and software) are numbered, and are stored in memory along with the location of the function that will be executed when that interrupt gets triggered. You can think of this as an array that lists the interrupts and their respective function addresses, with some options such as the CPU privilege level at which a particular interrupt should be handled (the kernel-mode or user-mode we discussed earlier).

 

So when a particular interrupt gets triggered, from the options present in the entry for that interrupt, the CPU knows which privilege level it should set for executing that interrupt handler.

 

A system call is traditionally triggered through a software interrupt. So one important thing to remember is this: both system calls and interrupts are numbered.

 

One major difference between a software interrupt and a hardware interrupt is that a hardware interrupt can fire at any time. For example, a user typing something on the keyboard will trigger a hardware interrupt. A software interrupt can only be triggered by something the CPU is currently executing. How else could software do anything on its own?

 

So how does a program make a system call?

 

The program simply places the identification number of the system call, along with some arguments, in predefined locations (on 32-bit x86, the number goes in a register named EAX, and the arguments go in registers named EBX, ECX, EDX etc.) and then triggers a software interrupt. Interrupt number 128 (0x80 in hexadecimal) is reserved for system calls. The kernel code is then executed, and it looks for the system call number inside the EAX register.

The register names differ on 64-bit Intel processors (there, a register named RAX is used instead of EAX for storing the system call number).

Remember: these are general purpose registers in the CPU, which can be used by user programs. There are special purpose registers in the CPU as well, which cannot be modified by user programs.

 

As a Unix/Linux programmer, you do not have to worry much about how to store the system call number and arguments in the registers, because these complexities are managed by the existing C libraries. When I say "C libraries", I am referring to the GNU C library (glibc). It comes with wrapper functions that take care of almost all the system calls in Linux. The header files that declare these functions are mostly located inside /usr/include/ in a Linux distribution.

 

For example, to change the permission of a file in a C program, you do not have to know how that system call is implemented. You can simply include some header files that provide user friendly functions you can use in the program.

 

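#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>

int main(void)
{
    /* 0555 = read and execute permission for owner, group and others */
    if (chmod("/tmp/testfile", 0555) == -1) {
        perror("chmod");   /* prints the reason the kernel reported in errno */
        return 1;
    }
    return 0;
}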

 

Compile it with gcc as gcc -o <outfile> <infile>. For example, if you have saved the above content in a file named "permission.c", the command to compile it into an executable named "permission" will be: gcc -o permission permission.c

 

The above example uses the function chmod to change the permission of the file /tmp/testfile to 0555. The programmer does not need to know how the system call is fired, or which one. Similar to chmod, there are many C library functions available to achieve different things, for example the open() function from glibc for opening a file, the read() function for reading the contents of a file, etc.

 

It is quite possible that the GNU C library is yet to implement a recently added kernel feature (for example, a new system call that the kernel provides, but for which the GNU C library does not have a function yet).

For such requirements, glibc provides a generic function called "syscall". You can consider it an API that accepts the system call number followed by the arguments of that system call. As with any other function in C, we just have to include the required header file and call the function "syscall" with the system call number and arguments.

 

Let us try the getuid system call using the generic "syscall" function.

 

#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>
 
int main(void)
{
  /* syscall() returns a long, so store and print the result as a long */
  long uid = syscall( __NR_getuid );
  printf( "Current user uid is %ld\n", uid );
  return 0;
}

 

 

Yes, we have passed something called __NR_getuid as an argument to the syscall function. The names and their respective system call numbers are mapped inside /usr/include/asm/unistd_32.h or /usr/include/asm/unistd_64.h (depending upon your computer architecture). An example of the content of that file looks like the below.

 

#define __NR_getrusage                98
__SYSCALL(__NR_getrusage, sys_getrusage)
#define __NR_sysinfo                99
__SYSCALL(__NR_sysinfo, sys_sysinfo)
#define __NR_times                100
__SYSCALL(__NR_times, sys_times)
#define __NR_ptrace                101
__SYSCALL(__NR_ptrace, sys_ptrace)
#define __NR_getuid                102
__SYSCALL(__NR_getuid, sys_getuid)
#define __NR_syslog                103
__SYSCALL(__NR_syslog, sys_syslog)

 

The number towards the right on each #define line in the above snippet is the system call number. The kernel also keeps an internal mapping of function names and numbers for each system call. The generic syscall function takes the arguments that we supply and takes care of the complexity of storing them in the CPU registers appropriate for the architecture.

 

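From the above snippet, we can see that system call number 102 is for getuid. So on this architecture we could just as well write the getuid call as below (using the number 102 instead of __NR_getuid). Keep in mind that system call numbers differ between architectures, so the symbolic name is the portable choice.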

 

uid = syscall( 102 );

 

How to add a new System Call?

 

Now that we understand how to trigger a system call using the standard glibc wrapper functions and the generic "syscall" function, let us look at what needs to be done to add a new system call.

 

We need to do five things to get a new system call.

 

1. A new kernel function that will be executed when we trigger the system call. It is actually quite straightforward. Once you have cloned the kernel source code, create a new directory inside the main source directory (let us call the directory "smpcall"). This directory should contain a .c file (for example samplecall.c), which will house your function. The function should have the below format.

 

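#include <linux/kernel.h>

/* Newer kernels declare system calls with the SYSCALL_DEFINE0() macro
   from <linux/syscalls.h> instead of a hand-written prototype. */
asmlinkage long sys_samplecall(void)
{
        printk(KERN_INFO "Hi this is an example system call\n");
        return 0;
}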

 

2. Create a Makefile in the new directory (with the below content), so that our new .c file will be compiled during kernel compilation.

 

obj-y:=samplecall.o

 

3. Open the main Kernel Makefile and ensure our new directory is included (see that we have added smpcall as the last argument).

 

core-y          += kernel/ certs/ mm/ fs/ ipc/ security/ crypto/ block/ smpcall/

 

4. Now we should give a number to our system call (remember, we discussed earlier that the kernel keeps a mapping between the function name and a number, called the system call number). This number will be passed as the first argument to the generic syscall function. For x86-64 the file is typically this one: https://github.com/torvalds/linux/blob/master/arch/x86/entry/syscalls/syscall_64.tbl

Just make an entry in that file with a new, unused number. Each entry lists the number, the ABI, the system call name and the entry point function.

 

5. The last step is to declare our new system call inside the syscalls.h header file, typically found under include/linux/ in the source tree. Our entry for this example will look like the below.

 

asmlinkage long sys_samplecall(void);

 

 

Now compile the kernel and boot it; this new kernel should then have our newly created system call. As in our getuid example with the syscall function, pass the new system call number and it should work.

As mentioned earlier, the operation of keeping the system call number in the architecture specific CPU registers will be taken care of by the generic syscall function.

 

Is software interrupt the only mechanism to trigger a system call?

The software interrupt instruction is only one of the methods available for triggering a system call. There are other CPU instructions that can be used as well, for example the "sysenter" and "sysexit" instructions provided by Intel, and the "syscall" instruction provided by AMD. Remember that these are dedicated CPU instructions, and not interrupts.

 

Both these methods (sysenter and sysexit provided by Intel, and the syscall instruction provided by AMD) are considerably faster than the legacy interrupt method. In fact Intel calls these instructions Fast System Calls. Check the instruction manual from Intel (Chapter 4: Sysenter - Fast System Call): Intel Instruction Set Manual

 

So the main thing to understand here is that different architectures have their own way of making a system call. Another example: some Motorola architectures use the "trap 0" instruction. Generally, traps are used for exceptions.

 

The register names can also differ depending upon the architecture. For example, Motorola uses a register named d0 for storing the system call number (instead of EAX).

 

How to see the system calls triggered by a process?

The best tool available in Linux is the "strace" command. You can execute "strace" as shown below.

 

strace cat testfile

 

The above will show all system calls that get triggered while viewing "testfile" using cat. Calls like open, close, read, and other necessary calls will all be visible on screen. Strace is a good tool for troubleshooting issues and identifying where a particular process is getting stuck. "Strace" needs a dedicated post of its own.

 
