Tuesday, June 3, 2014

Exploiting CVE-2014-0196 a walk-through of the Linux pty race condition PoC

By Samuel Groß


Recently a severe vulnerability in the Linux kernel was publicly disclosed and patched. In this post we'll analyze what this particular security vulnerability looks like in the Linux kernel code and walk you through the publicly published proof-of-concept exploit code by Matthew Daley released May 12th 2014.

The original post by the SUSE security team to oss-security announced that the vuln was found accidentally by a customer in production! You can find the patch at this link.

The core issue is located in the pty subsystem of the kernel and has been there for about five years. There was about one year in the middle where the vuln was not present, we'll talk about that a bit later in this post.

Background on the pty/tty subsystem

In order to fully understand the vuln we'll have to dive into the pty/tty subsystem of the linux kernel so lets start there.

A tty is "..an electromechanical typewriter paired with a communication channel." Back in the day a tty was made up of a keyboard for the input, a screen or similar display for the output and an OS process that was attached to this concept of tty. The process would then receive the input and it's output would be redirected to the screen. Those days are long gone but command line applications are not (thankfully!) and today we mostly use pseudo terminals. The main difference here is that instead of a keyboard and screen another process sits at the master side of the pty (for example a terminal emulator). Think of a pty as a bidirectional pipe or socket with some additional hooks in place (for example if you type a ctrl-c on the master side the kernel will interpret it instead of sending it to the slave. In this case the kernel will send a SIGINT signal to the slave process which will often cause it to terminate execution).

It's the pty subsystem's job to take input from either side of the pty, look for specific bytes in the byte stream (e.g. a ctrl-c), process them and deliver everything else to the other side. There is additional logic involved here which is not present in other IPC concepts such as pipes or sockets. This logic takes care to ensure things like echoing characters you type at the master end are also written back to it, pressing the backspace key to remove previously typed characters actually works on display, or sending signals like SIGINT when ctrl-c is sent. This logic is called line discipline (ldisc in short). Upon receiving data from either side the kernel will store the data in a temporary buffer (struct tty_buffer) and queue a work item to process the incoming data (flush it to the line discipline) at a later point and deliver them to the client side (I assume this is mainly done for "real" terminals whose input arrives in interrupt context (i.e. keyboard press, USB packet, ...) and should thus be handled as fast as possible). In this vuln we'll be racing one of these worker processes while it processes data to find the exploitable condition.

You can learn more about the pty subsystem here: http://www.linusakesson.net/programming/tty/

The vulnerability

For background we'll first need to introduce some important structures from include/linux/tty.h. (all source code excerpts were taken from Linux 3.2.58 except if stated otherwise):

struct tty_buffer { struct tty_buffer *next; char *char_buf_ptr; unsigned char *flag_buf_ptr; int used; int size; int commit; int read; /* Data points here */ unsigned long data[0]; };
As seen above a tty_buffer data structure temporarily holds a fixed number (well under normal circumstances) of bytes that have arrived at one end of the tty and still need to be processed.
tty_buffer is a dynamically sized object, so the char_buf_ptr will always point at the first byte right after the struct and flag_buf_ptr will point to that address plus 'size'. tty_buffer.size (which is only the size of the char buffer) can be any of the following: 256, 512, 768, 1024, 1280, 1536 and 1792 (TTY_BUFFER_PAGE).

The actual size of the object is then calculated as follows: 2 x size (for characters + flags) + sizeof(tty_buffer) (for the header), causing the tty_buffer to live in one of the following three kernel heap slabs: kmalloc-1024, kmalloc-2048 or kmalloc-4096.

struct tty_bufhead { struct work_struct work; spinlock_t lock; struct tty_buffer *head; /* Queue head */ struct tty_buffer *tail; /* Active buffer */ struct tty_buffer *free; /* Free queue head */ int memory_used; /* Buffer space used excluding free queue */ };
A tty_bufhead is, as the name implies, is the head (or first) data structure for tty_buffers. It keeps a list active buffers (head) while also storing a direct pointer to the last buffer (the currently active one) to improve performance. You will often see references to bufhead->tail in the kernel source code, meaning the currently active buffer is requested. It also keeps it's own freelist for buffers smaller than 512 bytes (see drivers/tty/tty_buffer.c:tty_buffer_free()).

struct tty_struct { int magic; struct kref kref; struct device *dev; struct tty_driver *driver; const struct tty_operations *ops; /* ... */ struct tty_bufhead buf; /* Locked internally */ /* ... */ };
The tty_struct data structure represents a tty/pty in kernel space. For the sake of this post all you need to know is that it stores the tty_bufhead and thus the buffers.

Alright, let's start with the function mentioned in the commit message, tty_insert_flip_string_fixed_flag() in drivers/tty/tty_buffer.c.
It is responsible for storing the given bytes in a tty_buffer of the tty device, allocating a new one if required:

The call chain leading up to this function roughly looks like this: write(pty_fd) in userspace -> sys_write() in kernelspace -> tty_write() -> pty_write() -> tty_insert_flip_string_fixed_flag()
int tty_insert_flip_string_fixed_flag(struct tty_struct *tty, const unsigned char *chars, char flag, size_t size) { int copied = 0; do { int goal = min_t(size_t, size - copied, TTY_BUFFER_PAGE); int space = tty_buffer_request_room(tty, goal); /* -1- */ struct tty_buffer *tb = tty->buf.tail; /* If there is no space then tb may be NULL */ if (unlikely(space == 0)) break; memcpy(tb->char_buf_ptr + tb->used, chars, space); /* -2- */ memset(tb->flag_buf_ptr + tb->used, flag, space); tb->used += space; /* -3- */ copied += space; chars += space; /* There is a small chance that we need to split the data over several buffers. If this is the case we must loop */ } while (unlikely(size > copied)); return copied; }
This function is fairly straightforward: At -1- tty_buffer_request_room ensures that enough space is available in the currently active buffer (tty_bufhead->tail), allocating a new one if required. At -2- the incoming data is written to the active buffer and at -3- the 'used' member is incremented. Note that tb->used is used as an index into the buffer.

The commit message mentions that two separate processes (a kernel worker process echoing data previously written to the master end and the process at the slave end writing to the pty directly) can enter this function at the same time due to a missing lock, thus causing a race condition.
So what could happen here? The commit message provides us with the following scenario:

            A                                       B 
memcpy(buf(tb->used), ...) 
tb->used += space; 
                                        memcpy(buf(tb->used), ...) ->BOOM

In here we see two processes (A and B) writing to the pty at the same time. Since the first process updates tb->used first the memcpy() of the second process will write past the end of the buffer (assuming the first write already filled the buffer) and thus causes the memory corruption.
Now this looks reasonable at first but is actually only part of the story.
Here are some observations that don't quite fit with this scenario:
- When running a simple PoC the kernel seems to crash very fast (on older kernels at least), while the scenario above seems relatively hard to achieve
- Looking at the debugger shows that often multiple pages of kernel data have been overwritten upon crashing. This can hardly be the case when only sending e.g. 2 x 4096 bytes at once

Also take a look at the following (slightly shortened) stacktrace, produced by setting a breakpoint at tty_insert_flip_string_fixed_flag()

#0  tty_insert_flip_string_fixed_flag (tty=tty@entry=0xffff880107a82800, 
    chars=0x0, flag=flag@entry=0 '\000', size=1)                      /* -1. */
#1  tty_insert_flip_string (size=<optimized out>, 
    chars=<optimized out>, tty=0xffff880107a82800)
#2  pty_write (tty=0xffff880117cd3800, buf=<optimized out>, c=<optimized out>)
#3  tty_put_char (tty=tty@entry=0xffff880117cd3800, ch=66 'B')        /* -2- */
#4  process_echoes (tty=0xffff880117cd3800)
#6  n_tty_receive_char (c=<optimized out>, tty=0xffff880117cd3800)
#7  n_tty_receive_buf (tty=0xffff880117cd3800, 
    cp=0xffff880117a78828 'B' ..., fp=0xffff880117a78a2d "", count=512)
#8  flush_to_ldisc (work=0xffff880117cd3910)
#9  process_one_work (worker=worker@entry=0xffff880118f507c0, 
#10 worker_thread (__worker=__worker@entry=0xffff880118f507c0)
#11 kthread (_create=0xffff880118ed9d80)
#12 kernel_thread_helper ()

This is the code path a worker process takes when performing a flush to the line discipline. As can be seen at -1- and -2- the echoing is actually done byte by byte.
Clearly we can't cause much harm by only overwriting a buffer with a single byte when the chunk still has unused space left (as will be the case for tty_buffer objects).

In the following we will now assume that the race went something like this: Process A wrote 256 bytes, process B (performing an echo) entered tty_buffer_request_room() before A updated tb->used, causing it to not allocate a fresh buffer. Afterwards B wrote another byte to the same buffer and incremented tb->used further.

To understand what is really causing the memory corruption take a look at the tty_buffer_request_room() function called by tty_insert_flip_string_fixed_flag().

int tty_buffer_request_room(struct tty_struct *tty, size_t size) { struct tty_buffer *b, *n; int left; /* -1- */ unsigned long flags; spin_lock_irqsave(&tty->buf.lock, flags); /* -2- */ /* OPTIMISATION: We could keep a per tty "zero" sized buffer to remove this conditional if its worth it. This would be invisible to the callers */ if ((b = tty->buf.tail) != NULL) left = b->size - b->used; /* -3- */ else left = 0; if (left < size) { /* -4- */ /* This is the slow path - looking for new buffers to use */ if ((n = tty_buffer_find(tty, size)) != NULL) { if (b != NULL) { b->next = n; b->commit = b->used; } else tty->buf.head = n; tty->buf.tail = n; } else size = left; } spin_unlock_irqrestore(&tty->buf.lock, flags); return size; }
Now things start to get interesting, note how at -1- 'left' has type int while 'size' is of type size_t (aka unsigned long). Assuming we previously won the race and have written 257 bytes while the buffer was only 256 bytes large then we now have the following situation:
b->size is 256
b->used is 257

Looking at the code above, at -3- 'left' will now equal -1 and at -4- will be casted to an unsigned value, resulting in 18446744073709551615 (assuming 64 bit long) which is definitely larger then the given size. The following block will be skipped and no new buffer will be allocated for the current request even though the current buffer is more than full.
At this point sending more data to the pty will result in the data being put into the same buffer, overflowing it further (remember 'used' is used as an index into the buffer). Since b->used will still be incremented for each byte we can now overflow as much data as we want.
Also note that this function is locked internally (at -2-), thus serializing access to it.

Now we are ready to draw an updated scenario that leads to an overflow:
        A (Slave)                          B (Echo)

        |                     // waiting for A to release the lock
                              // tb->used < tb->size,
                              // no new buffer is allocated
memcpy(.., 256);
                              memcpy(.., 1);

tb->used += space; 
                              tb->used += space;    
                              // tb->used is now larger than tb->size

Note that we will win the race as soon as the echoing process enters tty_buffer_request_room and calculates 'left' before the first process gets to update tb->used. Since the whole memcpy() operation is in between, that time frame is relatively large.

So as far as race condition scenarios go, the single case mentioned in the commit message is only one possible way that can result in memory corruption (and only if A fills the buffer completely).
In general any sequence that results in tb->used being larger than tb->size will result in a memory corruption later on. For that to happen the first process must send data to completely fill a buffer (i.e. sending tb->size bytes in total) while the echoing process must enter tty_buffer_request_room() before the first process updates tb->used (this leads to tty_buffer_request_room() not allocating a fresh buffer). The corruption is then caused by sending more data to the pty which will continue to overflow the same buffer.

At this point the vuln turns into a standard kernel heap overflow.

And we'll conclude this section with fun fact: The race in this vuln can actually be won by using just one process. This stems from the fact that we are racing a kernel worker process and not a second user-land process.

Getting to root - The exploit

Here we want to quickly analyze the published exploit code which will hopefully be easy to understand now that the details of the vuln are known.

Going step-by-step with PoC's console output we see...

[+] Resolving symbols

Yep, that's what it's doing. Note that some modern distributions (notably Ubuntu) set /proc/sys/kernel/kptr_restrict to 1, thus disabling /proc/kallsyms. For repository kernels this is merely an inconvenience though since the kernel image (and System.map) can be downloaded locally and the addresses taken from there.

[+] Doing once-off allocations

Stabilizing the heap. We need to make sure existing holes are filled to maximize the chances of getting objects laid out linear in the address space. We want our target buffer to be followed by one of our target objects (struct tty_struct).

[+] Attempting to overflow into a tty_struct... 

Now we are racing.

This is fairly straightforward, open a pty, spawn a new thread and write to both ends at the same time. Afterwards the child thread will send the data needed to overflow into the adjacent chunk. Assuming the race has been won at the start then there is no time pressure on these operations as discussed above.
Also note that only one byte is sent to the master end, this is done so the number of bytes that has yet to be sent can be calculated.

The exploit targets tty_struct structures which end up in the kmalloc-1024 slab cache. The buffer we will overflow will thus have to be in that cache as well (so tb->size = 256 which is also the minimum size). Before writing to the slave end the first time (to allocate a fresh buffer) the exploit creates a bunch of new pty's, thus allocating tty_structs in kernel space. It will then close one of them in hopes that the newly allocated buffer will end up in the freed chunk. If this works out we will have a bunch of tty_structs, followed by the buffer followed by more tty_structs in the kernel address space.

Let's take a quick look at the function executed by the new thread to overflow into the following chunk:
void *overwrite_thread_fn(void *p) { write(slave_fd, buf, 511); write(slave_fd, buf, 1024 - 32 - (1 + 511 + 1)); write(slave_fd, &overwrite, sizeof(overwrite)); }
The first write here will fill the previously allocated buffer (right after closing one of the pty's we allocated a new buffer by writing one byte to the slave fd). Note that the author assumes the buffer to hold 512 bytes while it's size is 256 (MIN_TTYB_SIZE). The reason for that is that on newer releases the kernel can use the flag buffer for data as well (if it knows the flags won't be needed), so the usable size of the buffer is doubled.

The next write fills the memory chunk of the buffer completely. The chunk is 1024 bytes large and so far we have written 32 bytes (sizeof(struct tty_buffer)) + 511 + 1 (the first write to the slave fd) + 1 (the echoed byte from the master fd).

The final write overwrites into the next heap chunk with a fake tty_structure previously created.

Now remember that tty_struct has a member 'ops' that is a pointer to a tty_operations struct? Well those ops members in the linux kernel are always pointers to structures holding function pointers themselves (if you're familiar with C++ this is similar to the vtable pointer of C++ objects). These function pointers correspond to actions performed on the device, there's one for open(), one for close() one for ioctl() and so on. Now assuming we have overwritten the object then 'ops' will now be under our control, pointing into user space. There we have prepared an array of function pointers pointing to our kernel payload.

Now as soon as we perform an ioctl on the tty device we will hijack the kernel control flow and redirect it into the payload. There we'll execute the standard prepare_kernel_cred(0) followed by commit_creds(), elevating our privileges to root:

[+] Got it :)

# id
 uid=0(root) gid=0(root) groups=0(root)

Note that SMEP/SMAP will prevent this exploit (as well as the grsecurity system) as they prevent the kernel from accessing user-land data (SMAP) and code (SMEP).


Unlike most other race conditions, in the case of this vuln the attacker is only able to control one of the two processes. Kernel worker processes will check for new work items regularly but can't really be affected by user space. This seems to make a huge difference for different kernel versions, on 3.2 it usually only takes a couple seconds to win the race while on 3.14 it can take multiple minutes.

As mentioned in the PoC code another thing that limits the reliability is the slab cache size in use. As previously discussed the buffer can only be in one of the following slabs: kmalloc-1024, kmalloc-2048 and kmalloc-4096. At sizes this big the chance of hitting the last chunk in the last page of a slab becomes more likely, further limiting the reliability. When that happens the code will overflow into uncontrolled data. This might have no consequences (no important data has been overwritten), lead to a crash later on (some object has been overwritten that is referenced at some point in the future) or even lead to an immediate panic/Oops (for example when the next page is mapped read only).

As also mentioned in the PoC exploit the flags cause some trouble on older kernels (before the commit acc0f67f307f52f7aec1cffdc40a786c15dd21d9) as b->size bytes following the overwritten part will always be cleared to zero. Thus when overwriting a controlled object either the whole objects needs to be restored (and the zeros written into unused space before the end of the chunk) or an object needs to be found where parts of it can safely be overwritten with zeros.

For the last part it might be possible to target tty_buffer objects when exploiting the vuln on pre 3.14 kernels. Here the header can be overwritten, yielding an arbitrary write (overwrite char_buf_ptr and afterwards send data to the pty) while the zeroes can safely be written into the buffer space and won't cause any trouble.

Is Android vulnerable?

As stated in the advisory the vulnerability dates back to 2.6.x kernels, roughly 5 years old. That would imply that pretty much every android device out there is vulnerable to this issue. Running a quick PoC on a newer device (for example the Nexus 5, HTC One or Galaxy S4) it seems the race can never be won there though. Let's again take a look at some kernel source code, this time from the HTC One (m7) Cyanogenmod kernel source.

int tty_insert_flip_string_fixed_flag(struct tty_struct *tty, const unsigned char *chars, char flag, size_t size) { int copied = 0; do { int goal = min_t(size_t, size - copied, TTY_BUFFER_PAGE); int space; unsigned long flags; struct tty_buffer *tb; spin_lock_irqsave(&tty->buf.lock, flags); /* -1- */ space = __tty_buffer_request_room(tty, goal); tb = tty->buf.tail; if (unlikely(space == 0)) { spin_unlock_irqrestore(&tty->buf.lock, flags); break; } memcpy(tb->char_buf_ptr + tb->used, chars, space); memset(tb->flag_buf_ptr + tb->used, flag, space); tb->used += space; spin_unlock_irqrestore(&tty->buf.lock, flags); copied += space; chars += space; } while (unlikely(size > copied)); return copied; }
The Interesting difference is that at -1- we see that the function here is actually locked internally. Now as stated above to win the race the second process needs to enter __tty_buffer_request_room() before the first process updated tb->used. This is not possible if the function is locked like this.

Taking a look at the git history of the Linux kernel it turns out that all kernels between c56a00a165712fd73081f40044b1e64407bb1875 (march 2012) and 64325a3be08d364a62ee8f84b2cf86934bc2544a (january 2013) are not affected by this vuln as tty_insert_flip_string_fixed_flag() was internally locked there.

For Android that means quite a few of the newer devices are not vulnerable to this issue, most of the older ones are though and there are some newer ones integrated the 64325a3be08d364a62ee8f84b2cf86934bc2544a Linux kernel patch, making them vulnerable again.


Kernel exploits are hard, getting them reliable is even harder! This concludes our analysis of CVE-2014-0196, we hope you have gained some deeper understanding of this vuln and kernel level security in general. For more details on linux kernel exploitation you can take a look at our last post: How to exploit the x32 recvmmsg() kernel vulnerability CVE 2014-0038

If you have feedback or have worked on something similar let us know, you can email us at: info/at\includesecurity.com