The Kernel

Last month I mentioned that I’m working on a real-time operating system as part of my CS 452 coursework this term. The first half of the course focuses on developing a kernel that provides basic OS services such as task creation and scheduling, inter-task communication (ITC) and serial port I/O. My partner Rollen and I have been spending a large number of hours in the lab each week, but until recently there hasn’t been anything interesting to write about.

Most of the development time was spent laying the groundwork for the OS — this included writing the context switcher, task scheduler and basic queuing data structures. Over the last few days we moved on to writing the message passing system that allows tasks to communicate with each other. This aspect of the kernel is more interesting, so I figured I would do a write-up on some of the optimizations we made.

The trains used in our lab travel at about half a meter per second; in other words, it takes them about 10 milliseconds to travel half a centimeter. This is the granularity that CS 452 kernels typically operate at to effectively schedule trains. Better system performance means more physics and path finding calculations can fit in between train controller time steps.

The course instructor required us to submit timing data to demonstrate that our system performance was suitable. The primary benchmark measured message round trip time; that is, the time taken for a task to send a message to another task and get a reply back. For CS 452, an “average” kernel usually takes between 30 and 60 microseconds to complete the process. Anything above 100 microseconds is considered too slow for the purposes of the course. Rollen and I managed to achieve a round trip time of 7.5 microseconds in the ideal case, with an upper bound of 10 microseconds when running with non-ideal task priorities. A large contributing factor to our performance was the overall kernel design; however, we also put in extra lab time to optimize for performance.

Context Switching

In the early stages of the project I tested a lot of ideas on a Raspberry Pi at home. One of the areas I focused on was efficient hardware and software interrupt handling. A context switch is required every time a hardware interrupt fires or a user-space task makes a system call (software interrupt), so fast switching is crucial for overall performance.

In our kernel, we wrap our system calls in C functions to trick GCC into storing the parameters where we want them. The ARM calling convention places the first four parameters in registers r0 to r3, and the remaining parameters on the stack. The return value of a function is stored in r0. In our implementation the first parameter is always the system call id. For example, we have something like the following:

// Each wrapper falls off the end without a return statement; the kernel
// leaves the system call's result in r0, which is exactly where the ARM
// calling convention expects a return value to be.
static inline unsigned int sysCall0(unsigned int id)
{
    __swi();
}

static inline unsigned int sysCall1(unsigned int id,
                                    unsigned int arg0)
{
    __swi();
}

static inline unsigned int sysCall2(unsigned int id,
                                    unsigned int arg0,
                                    unsigned int arg1)
{
    __swi();
}

...

#define sysPass()                       sysCall0(0)
#define sysExit()                       sysCall0(1)
#define sysTaskId()                     sysCall0(2)
#define sysCreateTask(priority, entry)  sysCall2(3, priority, entry)
...
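
The __swi intrinsic isn’t shown above. A minimal sketch of what it might look like with GCC extended inline assembly follows; the exact immediate and clobber list are assumptions, since our handler takes the call id from r0 rather than from the SWI comment field:

// Hypothetical sketch of __swi: trigger a software interrupt without
// disturbing r0-r3, where GCC has already placed the call id and args.
// The "memory" clobber stops GCC from caching values across the trap.
#define __swi() __asm__ volatile ("swi 0" : : : "memory")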

Within the task code, we can use the system calls like regular functions. Return results are passed through as expected:

void UserTask()
{
    printf("My id is %d\r\n", sysTaskId());
    unsigned int childId = sysCreateTask(2, &ChildTask);
    printf("Created child task with id %d\r\n", childId);
    ...
    sysExit();
}

Calling __swi triggers a software interrupt, which results in the active task returning control to the kernel. The first thing the interrupt handler does is call the kernel’s system call handler. The inputs to the handler are the same as the ones specified in the user-space portion of the system call. Since GCC has already stored those parameters in r0-r3, no extra work is necessary to extract them. The general structure is as follows, where the kernelSystemCall function does the work of actually executing the specific call:

_enterTask:
    ...

_softwareInterrupt:
    msr     cpsr_c, #0xDF       ; Switch to system mode, IRQs off

    stmfd   sp!, {r1-r15}       ; Stack user registers
    bl      kernelSystemCall    ; Perform a system call. r0 = call id, r1-r3 = args, etc
    stmfd   sp!, {r0}           ; Stack the return value of the system call

    mov     r0, sp              ; Store the user SP to r0
    msr     cpsr_c, #0xD3       ; Switch to supervisor mode, IRQs off
    bl      kernelSchedule      ; Update the old task's descriptor w/ the sp and pick a new task.
                                ; The sp of the next task to run is returned in r0

    cmp     r0, #0              ; Check the new task's stack pointer; 0 signals kernel termination
    bne     _enterTask          ; Switch back into a task

_kernelExit:
    ...

The same design is used for hardware interrupts, except that instead of calling kernelSystemCall we branch to the kernelInterrupt function to process the interrupt. Some aspects of the assembly code listed are intentionally glossed over — for example, the user mode program counter is actually retrieved from the supervisor mode LR, and the SPSR needs to be saved somewhere. The complete system call handler is 14 instructions long; the corresponding _enterTask is 6 instructions.

The reason the system call is done in system mode rather than supervisor mode is that system mode is a privileged mode that still has access to the user-space stack pointer. In supervisor mode the banked sp_svc register is used instead, making it harder to access parameters put on the user stack by GCC. The user registers are stacked before the system call because a record of the system call parameters is sometimes needed later on. For example, the blocking call sysSend(int id, void* message, void* reply) has results written back into its parameters. The parameters are updated at some undetermined point in the future when the task is unblocked by a call to sysReceive. The r0 register is stacked because it holds the system call return value, which will be needed again when the calling task is rescheduled.
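
As a rough illustration of why that record matters, here’s a hypothetical sketch (the TaskDescriptor structure and the helper name are assumptions, not our actual definitions) of how the kernel can reach back into a blocked task’s stacked frame:

// Hypothetical sketch: after _softwareInterrupt runs, the saved user sp
// points at the stacked frame, so sp[0] holds r0 (the return value slot)
// and sp[1..3] hold r1-r3 (the system call arguments).
typedef struct {
    unsigned int* sp;   // saved user stack pointer
} TaskDescriptor;

// Overwrite the stacked r0 so a blocked task sees a new return value
// when it is eventually rescheduled.
static void taskSetReturnValue(TaskDescriptor* task, unsigned int value)
{
    task->sp[0] = value;
}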

The other noteworthy aspect of our context switch is that there is no concept of a kernel stack or a kernel main loop. All of the kernel data is stored in a static structure and accessed by the various kernel functions as needed. The context switch design described during CS 452 lecture was not structured in this way; after stacking the user registers it unstacks kernel state and branches back into a kernel loop written in C. By using an implicit kernel loop as shown above, the cost of stacking and unstacking kernel state is eliminated.
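
Concretely, the kernel data lives in something like this (a simplified sketch reusing the hypothetical TaskDescriptor from above; the sizes and field names are illustrative):

// Simplified sketch of the static kernel state. There is no kernel stack
// to save or restore; kernel functions read and write this structure
// directly. Sizes and names are illustrative, not the kernel's actual ones.
#define MAX_TASKS 64

static struct {
    TaskDescriptor tasks[MAX_TASKS]; // task descriptor table
    unsigned int   activeTask;       // index of the currently running task
} kernel;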

Optimizing Scheduler Performance

One of the design requirements for the project is that all system calls are preemptive. In other words, both software and hardware interrupts result in a call to the task scheduler. As such, scheduler performance has a direct impact on system performance.

The scheduler in our kernel uses a priority queue to manage active tasks. The queue is implemented as a set of circular array buffers — each priority level has a corresponding buffer with a head and tail index. When an element is enqueued it gets stored in the array at the location pointed to by the tail. The tail is then incremented modulo the queue length. The dequeue operation works in reverse; a dequeue returns the value at the head index and advances the head modulo the length. A bit vector (stored as an unsigned int) keeps track of which of the queues have tasks in them. Since our hardware uses an ARMv4 chip it doesn’t support the CLZ instruction, so some of the tricks from this site are used to inspect the bit vector. A simplified sketch of the structure is shown below.
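
The names and sizes here are illustrative rather than our actual ones:

// Illustrative sketch of the ready queues: one circular buffer of task
// ids per priority level, plus a bit vector marking non-empty queues.
#define NUM_PRIORITIES 16
#define QUEUE_LENGTH   64

typedef struct {
    unsigned char tasks[QUEUE_LENGTH]; // circular buffer of task ids
    unsigned int  head;                // next index to dequeue from
    unsigned int  tail;                // next index to enqueue at
} ReadyQueue;

static ReadyQueue   queues[NUM_PRIORITIES];
static unsigned int queueBits;         // bit i set => queues[i] is non-empty

static void enqueue(unsigned int priority, unsigned char taskId)
{
    ReadyQueue* q = &queues[priority];
    q->tasks[q->tail] = taskId;
    q->tail = (q->tail + 1) % QUEUE_LENGTH; // advance modulo the length
    queueBits |= 1u << priority;
}

static unsigned char dequeue(unsigned int priority)
{
    ReadyQueue* q = &queues[priority];
    unsigned char taskId = q->tasks[q->head];
    q->head = (q->head + 1) % QUEUE_LENGTH; // advance modulo the length
    if (q->head == q->tail)                 // queue is now empty
        queueBits &= ~(1u << priority);
    return taskId;
}

The modulus operations on the head and tail here are exactly what the next optimization targets.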

On the ARM system we use, the modulus operator is relatively expensive. When compiling C code that uses the mod operator, e.g. a % b, GCC falls back to a software implementation of the operator. This is also true for division, since the hardware doesn’t have native modulus or division instructions. Using a software mod increases code size and has a negative effect on performance. To speed things up, Rollen suggested that we use queues with power-of-two lengths and a bitwise-and instead. The code boils down to the following, where mod is a power of two and val is any unsigned integer:

// Compute val % mod
unsigned int valMod = val & (mod - 1);

The technique isn’t a generic replacement for the modulus operator — it only works when mod is a power of two — however it works sufficiently well for our queuing structures. Using the bitwise-and technique shaved off around 2 microseconds from our round trip time.

The other priority queue optimization that impacted performance was the addition of a combined push/pop function. Typical scheduler behavior involves pushing the current active task onto the queue and popping off a new task to run. In cases where the same task is pushed and popped, the queue is doing unnecessary work. The combined push/pop function does the same thing as calling push and pop individually, but it’s able to return early when it detects that a queue access is unneeded; a sketch is shown below. The fact that push and pop are combined into one method also allows for some shared calculations and one less function call, resulting in marginal performance increases in all cases.
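
A hedged sketch, building on the queue code above and assuming larger numbers mean higher priority (the naive highestSetBit loop stands in for the real bit-vector tricks):

// Naive stand-in for the bit-vector scan; the real kernel uses
// bit-twiddling tricks since ARMv4 has no CLZ instruction.
static unsigned int highestSetBit(unsigned int bits)
{
    unsigned int index = 0;
    while (bits >>= 1)
        index++;
    return index;
}

// Combined push/pop: schedule taskId at the given priority and return
// the task that should run next.
static unsigned char pushPop(unsigned int priority, unsigned char taskId)
{
    // If no queued task is at the same or a higher priority, the pushed
    // task would be popped right back off -- skip the queue entirely.
    if (queueBits == 0 || highestSetBit(queueBits) < priority)
        return taskId;

    enqueue(priority, taskId);
    return dequeue(highestSetBit(queueBits));
}

The early return is the win here: a sysPass in a system with a single runnable task never touches the queue at all.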

Optimizing Memcpy

One of the benchmarks that we had to provide for our kernel submission was performance with large message sizes. Rather than sending the entire data, our kernel sends a pointer to the message, and the receiving task uses memcpy to copy the message data as needed. A naive memcpy looks something like the following:

// Copy len words (not bytes) from src to dst, one word per iteration.
void memcpy(unsigned int* dst, unsigned int* src, unsigned int len)
{
    while (len--)
    {
        *dst++ = *src++;
    }
}

This can be improved upon by writing memcpy in assembly. The example below copies 8 words from the address in r0 to the address in r1 using ARM’s load/store multiple instructions:

_cp:
    ldmia r0!, {r2, r3, r4, r5, r6, r7, r8, r9}
    stmia r1!, {r2, r3, r4, r5, r6, r7, r8, r9}

The memcpy implementation in our kernel copies data in blocks of 8 words at a time, and jumps to a shorter copy block when there are fewer than 8 words left. Something like:

memcpy:
    stmfd   sp!, {r4-r11}
    add     r11, pc, #12            ; r11 = address of _memfull (pc reads as . + 8)
    _cp:
        subs    r2, r2, #8          ; 8 or more words left?
        bpl     _memfull            ; yes: copy a full block of 8
        rsb     r3, r2, r2, asl #2  ; r3 = 3 * r2 (r2 = remaining - 8, negative here)
        sub     pc, r11, r3, asl #2 ; pc = _memfull + 12 * (8 - remaining), i.e. _memN

    _memfull:
        ldmia   r1!, {r3, r4, r5, r6, r7, r8, r9, r10}
        stmia   r0!, {r3, r4, r5, r6, r7, r8, r9, r10}
        b       _cp

    _mem7:
        ldmia   r1, {r3, r4, r5, r6, r7, r8, r9}
        stmia   r0, {r3, r4, r5, r6, r7, r8, r9}
        b       _cpexit
    ...

    _mem1:
        ldmia   r1, {r3}
        stmia   r0, {r3}
        b       _cpexit

    _cpexit:
        ldmfd sp!, {r4-r11}
        bx lr

The code at the _cp label tracks how many words are left to copy and directly modifies the PC so the CPU takes the correct branch. Any time there are fewer than 8 words left, the appropriate _memN branch is taken and the memcpy function returns. Otherwise the _memfull codepath is repeatedly run until fewer than 8 words are left. The exact explanation of how the code in _cp works is left as an exercise for the reader. In short, the code basically acts as a jump table into the labels below it.

It’s worth noting that performance gains from using stmia and ldmia aren’t all that massive. The main benefit is the reduction in instruction count, and thus a decrease in instruction fetch and decode time. The memory is still copied one word at a time, i.e. the eight values aren’t written in parallel. For truly fast memory copies, some sort of SIMD instruction is required. On ARM this would be the NEON extension, which isn’t available on our hardware. The ARM documentation provides some interesting benchmarks on various memcpy methods, including a NEON-based copy that seems to outperform everything else.

Timing Results

As mentioned in the intro, the round trip time for a one-word message in our kernel is about 7.5 microseconds. Copying a 64 byte message data block incurs an extra cost of around 0.72 microseconds, bringing the round trip time up to 8.22 microseconds.

One of the other interesting benchmarks that we recorded was the cost of a context switch. This was done by timing the sysPass call, which simply returns control to the kernel and immediately schedules another task. sysPass ended up taking around 1.8 microseconds to execute. Since message passing requires three system calls (sysSend, sysReceive and sysReply), the total overhead from the system call handler is around 5.4 microseconds. The remaining time is spent handling the message passing itself.

Future Work

We still have a few items to complete before the kernel is done, all of which should be finished by the end of the month. This week I started working on an event system so tasks can call sysAwait(event_id) to block on specific hardware interrupts. When the desired interrupt fires the task is unblocked and returned to the appropriate ready queue. The events will be used for interrupt-driven I/O and timed delays in the tasks. At the moment the only interrupt sources hooked into the kernel are the timers, but the code is more or less ready for use with other interrupts.
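
The details are still in flux, but the shape is roughly as follows (all names here are hypothetical; taskPriority and enqueue stand in for however the kernel looks up a task’s priority and requeues it):

// Hypothetical sketch of the sysAwait event table. One waiter per event
// is assumed for simplicity; the real kernel may need to allow more.
#define NUM_EVENTS 16
#define NO_TASK    0xFF

static unsigned char eventWaiters[NUM_EVENTS]; // task blocked on each event

// Called from the sysAwait handler: record the waiter and leave the task
// off the ready queues so it stays blocked.
static void eventBlock(unsigned int eventId, unsigned char taskId)
{
    eventWaiters[eventId] = taskId;
}

// Called from kernelInterrupt when the interrupt mapped to eventId fires:
// return the waiter to its ready queue. taskPriority is a hypothetical
// helper that looks up the task's priority.
static void eventUnblock(unsigned int eventId)
{
    unsigned char taskId = eventWaiters[eventId];
    if (taskId != NO_TASK)
    {
        eventWaiters[eventId] = NO_TASK;
        enqueue(taskPriority(taskId), taskId);
    }
}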

If I have time this week I’m also hoping to port the latest version of our project over to the Raspberry Pi. The hardware interface is completely abstracted away from the kernel itself, so running the kernel on the Pi should only require writing a new Pi-specific hardware interface. This isn’t a course requirement, but so far it’s been extremely useful to test ARM code out on the Pi.

The next blog post on my CS 452 project will probably come around the time we start working with the trains near the end of the month.