How your Computer Works: inside an x86 chip

You press a switch, and Linux starts booting. What’s going on under the hood in the meantime? Let us explain.

Why do this?

  • Understand why Linux works the way it does
  • Troubleshoot occasional problems better
  • Earn yourself some geek points

Linux has a long-established reputation as the operating system for hackers. It became unnecessary to know what happens behind the curtains of major desktop distributions a while ago, but it is still beneficial to understand how computers really work. Being able to decipher cryptic error messages means you can diagnose Linux problems (let’s face it – this happens from time to time) much more quickly. It is also fun and not really difficult if you have some programming background.

We often hear that Linux is ideal for education, as it doesn’t attempt to hide the inner workings from anyone curious enough to look at them. Let’s take this one step further and learn something from our favourite OS. So, brew yourself a cup (or glass) of whatever you prefer, and let’s get started.

Image

Even the high-end x86 computer you may have on the table today follows the architecture John von Neumann described back in 1945. Image provided by LANL, public domain (see http://commons.wikimedia.org/wiki/File:JohnvonNeumann-LosAlamos.gif).

A bird’s eye view

Although your computer is undoubtedly a very sophisticated device, the operations it performs are (in essence) quite simple. The Central Processing Unit, or CPU, continuously fetches program instructions from memory and executes them on data that also comes from RAM. Most modern computers (and even smartphones) have multicore CPUs, which are essentially multiple processors on the same die sharing some resources, like caches. From a programmer’s viewpoint, multicore and multiprocessor don’t differ too much, and in this tutorial, we’ll use both terms interchangeably. Processor instructions are rather low-level: you can load and store bytes of memory, do basic maths, jump unconditionally (like goto) or conditionally (like if-else), and maybe calculate CRC32 checksums or do AES encryption, if your CPU provides these extensions. However, there are no higher-level functions, like “convert integer to string”. Some CPUs (like x86) let ordinary instructions operate on memory directly, while others (like ARM) compute only on registers and use separate load and store instructions to move data.

Registers also store the processor’s state, and carefully saving and restoring them is how Linux switches tasks (threads or processes).

An x86 CPU can operate in different modes. Sometimes, programs are granted access to the whole memory, and no address translation is performed. This is known as ‘real’ mode, and is used during the boot phase or in older operating systems like MS-DOS. Other times, the CPU may check memory access rights and prevent one program from touching another. This is called ‘protected mode’, and it’s often combined with paging, or address translation, to produce the paged protected mode that Linux (and other major OSes) run in.

Finally, most modern CPUs are 64-bit, but can run in either 32-bit ‘compatibility’ or 64-bit mode (the ‘long mode’ term refers to both). The memory address and register size are determined by whether the processor is 32- or 64-bit; 64-bit CPUs can address more memory and handle data in larger chunks, which usually means better performance.

Image

Most modern CPUs are in fact multiprocessors. This AMD A10 (Kaveri) chip is no exception: it has four CPU and eight GPU cores. (Photo: Yulia Sinitsyna)

Add-ons

Computers also have peripheral devices, like video cards or network adapters. Nowadays, they are often built-in and not separate extension cards, but for our discussion, it doesn’t really matter. To communicate with these devices, programs running on the CPU use I/O ports or memory-mapped registers (MMIO). An MMIO region is a memory range that is accessed like the rest of RAM, but resides on the device rather than in a DRAM bank. If a device needs CPU attention, it sends an interrupt, which can be as simple as pulling a selected wire low or high, or as sophisticated as a special type of message within the PCI bus. Either way, this signal ends at the local APIC, or Advanced Programmable Interrupt Controller. It’s integrated with the CPU (or its core) and uses clever algorithms to decide when to tear the CPU away from what it’s currently doing and ask it to service the interrupt. The APIC can also serve as a source of interrupts itself: it has a built-in timer and facilitates sending inter-processor interrupts, or IPIs.

More on registers

A register is essentially a small (typically 64- or 128-bit) and fast memory cell built into the CPU. Some registers have a predefined meaning or purpose, while the others can be used to store arbitrary data. The latter are usually called general-purpose registers, or GPRs. The number of registers available and their names (also known as a register file) are defined by the CPU architecture. A 64-bit x86 CPU has 16 general-purpose registers: RAX-RDX, RDI, RSI, RBP, RSP, and R8-R15. If you’re wondering about the names, then historically the Intel 8086 processor had 16-bit AX, BX, CX and DX registers. The Intel 80386 (I still have this one in the attic!) introduced 32-bit mode. Registers also became 32 bits wide and got an E-prefix (“E” stands for “Extended”). The R-prefix and numeric registers were introduced with 64-bit mode, to increase the size of the register file. In 64-bit mode, E-prefixed registers become the lower halves of their R-prefixed counterparts.
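
To make the naming scheme concrete, here is a small Python sketch that models how the narrower names alias the low bits of one 64-bit register. The helper name and the sample value are made up for illustration, and the 8-bit AL and AH names are not covered in the text above; they are the two halves of AX.

```python
# Illustrative model of x86 register aliasing. The function name and the
# sample value are hypothetical; only the bit layout reflects the hardware.

def sub_registers(rax: int) -> dict:
    """Return the values visible through the names that alias RAX."""
    rax &= 0xFFFFFFFFFFFFFFFF          # keep exactly 64 bits
    return {
        "RAX": rax,                    # full 64-bit register
        "EAX": rax & 0xFFFFFFFF,       # lower 32 bits (80386 and later)
        "AX": rax & 0xFFFF,            # lower 16 bits (8086)
        "AH": (rax >> 8) & 0xFF,       # high byte of AX
        "AL": rax & 0xFF,              # low byte of AX
    }


if __name__ == "__main__":
    for name, value in sub_registers(0x1122334455667788).items():
        print(f"{name:>3} = {value:#x}")
```

Running it shows each shorter name exposing a smaller slice of the same underlying value.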

Special-purpose registers come in several flavours. First there’s RIP, or the program counter, which stores the memory address immediately after the current instruction. The instruction address is defined relative to the code segment base, which is available via the CS register. There are other segment registers, like DS for data, SS for stack, or GS, useful when switching from user space to kernel space in 64-bit mode. Segments date back to 16-bit times, when a single word-sized register wasn’t able to address memory beyond 64k.
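
As a rough illustration of why segments helped in the 16-bit days, the sketch below combines a 16-bit segment and a 16-bit offset the way real mode does: the segment is shifted left by four bits and the offset is added. The function name and the sample values are chosen for illustration only.

```python
# Real-mode address arithmetic: physical = segment * 16 + offset.
# The function name and sample values are illustrative.

def real_mode_address(segment: int, offset: int) -> int:
    """Combine a 16-bit segment and a 16-bit offset into a linear address."""
    return ((segment & 0xFFFF) << 4) + (offset & 0xFFFF)


if __name__ == "__main__":
    # Two different segment:offset pairs can name the same byte.
    print(hex(real_mode_address(0x07C0, 0x0000)))
    print(hex(real_mode_address(0x0000, 0x7C00)))
    # An offset alone tops out at 64k; adding a segment reaches further.
    print(hex(real_mode_address(0xB800, 0x0000)))
```

The first two calls yield the same address, which shows that segment:offset pairs are not unique.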

Nowadays, with registers being at least 32-bit wide, segments are not that important (for more on this, see Mike’s assembler tutorial on page 106). In fact, they are mostly ignored in 64-bit long mode. However, some bits of the CS register are still recognised; these include the privilege level field (so segments can be used as a memory access control mechanism), and the ‘L’ flag, which is set for segments containing 64-bit code. This way, legacy 32-bit processes can run in a 64-bit OS if they are assigned a CS with L=0. Library code, however, shares the CS register with the process it is linked to, and as plugins are often implemented as shared libraries, there’s no way to run a 32-bit plugin in a 64-bit host (except by putting it inside a separate 32-bit process).

Then there are the control registers – CRx. Officially, there are sixteen of them, but only a few are currently used. CR0 determines whether the CPU is in real, protected, or paged mode, or a combination thereof. The CR3 register contains a pointer to the page table root used to translate addresses in paged mode. We’ll cover this and the CR2 register shortly.

When the CPU is in protected mode, only privileged code (the operating system kernel) is allowed to change control registers. This way, for example, Linux processes’ virtual address spaces are kept isolated from each other.

Finally, there are model-specific registers, or MSRs. As the name suggests, different CPU models may have different MSRs even within the same (x86) architecture. These registers are widely employed to support advanced features that weren’t initially part of the x86 architecture. This includes 64-bit mode, or the x2APIC interrupt handling found on modern Intel CPUs and vital for low-latency virtual machines. Model-specific registers are referred to by their numbers and are 64 bits wide.

Image

Serial ports are easy to program, but with modern PCs, you may need a bit of cabling to actually use them. Photo: Yulia Sinitsyna.

Back into labs

There’s much that can be said about registers, but it’s time to practice. Let’s start with a simple experiment. Open a terminal and type:

$ cat /proc/cpuinfo

You’ll get a lot of information about your CPUs, somewhat like this:

processor : 0
vendor_id : GenuineIntel
...
flags : fpu vme ... vmx ... nx ...

Many of these bits of data are obtained with the CPUID processor instruction. It queries CPU functions (also known as “leaves”) by number (passed in the EAX register) and receives results in EAX, EBX, ECX and EDX. For instance, the GenuineIntel string is returned by function 0. For an AMD processor, like the one in the photograph, it would be “AuthenticAMD”. Bit 5 set in ECX for function 1 means the CPU supports VMX (Intel’s virtualisation technology). Checking that /proc/cpuinfo contains the vmx flag is common advice when KVM and VirtualBox refuse to start, and now you know where it really comes from.
/proc/cpuinfo shows only a subset of CPUID leaves. If you want them all, consider the cpuid tool:

$ sudo cpuid 0 # 0 is processor number
Leaf     Subleaf    EAX            EBX            ECX            EDX
00000000 00000000:  0000000d ....  68747541 Auth  444d4163 cAMD  69746e65 enti
00000001 00000000:  00630f01 ..c.  00040800 ....  3e98320b .2.>  178bfbff ....

Note how “AuthenticAMD” is returned.
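
If you’d like to see how the raw numbers turn into text, here is a short Python sketch that decodes the register values from the listing above. It assumes the usual convention that leaf 0 spreads the vendor string across EBX, EDX and ECX (in that order, lowest byte first); the function names are invented for this example.

```python
import struct

# Decode CPUID results by hand. The register values used below are copied
# from the cpuid listing in the article; the function names are hypothetical.

def vendor_string(ebx: int, ecx: int, edx: int) -> str:
    """Leaf 0: the 12-byte vendor ID is stored in EBX, EDX, ECX order."""
    return struct.pack("<III", ebx, edx, ecx).decode("ascii")


def has_vmx(leaf1_ecx: int) -> bool:
    """Leaf 1: bit 5 of ECX signals VMX support."""
    return bool(leaf1_ecx & (1 << 5))


if __name__ == "__main__":
    print(vendor_string(ebx=0x68747541, ecx=0x444D4163, edx=0x69746E65))
    print("VMX supported:", has_vmx(0x3E98320B))
```

For the AMD chip in the listing, the VMX bit is clear, which is expected: AMD CPUs advertise their own SVM technology elsewhere.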

You can play with MSRs in a similar fashion, although I wouldn’t recommend writing to arbitrary registers in a production environment because you can hang your machine quite easily. Reading is safer, but unless you know the register numbers it’s very much an “arbitrary value in – arbitrary value out” experience. Consider the following:

$ sudo rdmsr 0xc0000080
d01

RDMSR is short for “read MSR”; it’s an assembly instruction that the tool was named after. MSR number 0xc0000080 is the Extended-Feature-Enable (EFER) register, and it has various uses. For instance, to enable SVM (AMD’s virtualisation technology), you set bit 12 in EFER to 1. If you are on AMD, start any KVM guest and run the command again:

$ sudo rdmsr 0xc0000080
1d01

0x1d01 is 1110100000001b, so bit 12 is set. Trying to clear it with the wrmsr tool while the guest is running triggers a kernel bug in KVM (don’t say I enticed you to try this):

$ sudo wrmsr 0xc0000080 0xd01
$ dmesg | tail
[25360.493222] ------------[ cut here ]------------
[25360.493279] kernel BUG at arch/x86/kvm/x86.c:290!
[25360.493323] invalid opcode: 0000 [#1] PREEMPT SMP

EFER also enables long mode. Two bits, LME (number 8) and LMA (number 10), indicate that long mode is enabled (E) and active (A). An OS that wants to run in long mode sets the LME bit, and the mode is activated as soon as paging is enabled in CR0. So, both LME and LMA are set on a running 64-bit system. The value above has both bits set, as well as bits 0 and 11. The former enables fast system calls, and bit 11 is the famous NX (or Non-eXecutable) flag that makes it harder to exploit common vulnerabilities. It’s optional, but we saw in /proc/cpuinfo that our CPU supports it.
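
Picking bits out of a hex value by eye gets tedious, so here is a minimal Python sketch that names the EFER bits discussed in this section. The short mnemonics (SCE, NXE, SVME) follow the vendor manuals rather than this article, and the table deliberately lists only the five bits mentioned here, not the whole register.

```python
# Decode the EFER values read with rdmsr. Only the bits discussed in the
# article are listed; the function name is hypothetical.

EFER_BITS = {
    0: "SCE",    # fast system calls
    8: "LME",    # long mode enabled
    10: "LMA",   # long mode active
    11: "NXE",   # no-execute page protection
    12: "SVME",  # AMD virtualisation (SVM)
}


def decode_efer(value: int) -> list:
    """Return the names of the known EFER bits set in value, lowest first."""
    return [name for bit, name in sorted(EFER_BITS.items()) if value & (1 << bit)]


if __name__ == "__main__":
    for raw in (0xD01, 0x1D01):
        print(f"{raw:#06x}: {' '.join(decode_efer(raw))}")
```

The only difference between the two readings is SVME, which appears once a KVM guest is running.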

Image

An example of address translation using two-level page tables. Offset within the page is not translated but added to the result. Image by Yulia Sinitsyna.

A word of memory

How a CPU sees memory depends on processor mode. Here, we’ll speak of paged protected mode, as it is the main one Linux runs in. In this mode, memory can be seen as a collection of pages. Put simply, a page is a 4k (or more, like 2MB or even 1GB) chunk of bytes that share common properties, like cacheability or access rights. Many low-level structures, like virtual machine control blocks, start at a page boundary, or are page-aligned. Linux keeps track of every page frame in the system with a special structure (struct page): it may sound wasteful, but in reality this uses only a tiny fraction of available memory. Linux also combines pages into “zones”, and you can easily see how they are organised on your system in /proc/zoneinfo.
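
A little arithmetic makes both page alignment and the “tiny fraction” claim tangible. The Python sketch below assumes 4k pages and, importantly, assumes that struct page occupies 64 bytes; the real size depends on the kernel version and configuration, so treat that constant as a placeholder.

```python
# Page arithmetic for 4k pages. STRUCT_PAGE_BYTES is an assumption made for
# this example; the actual size of struct page varies between kernels.

PAGE_SIZE = 4096
STRUCT_PAGE_BYTES = 64


def page_base(addr: int) -> int:
    """Round an address down to the start of its page."""
    return addr & ~(PAGE_SIZE - 1)


def is_page_aligned(addr: int) -> bool:
    """True if the address sits exactly on a page boundary."""
    return addr % PAGE_SIZE == 0


def struct_page_overhead() -> float:
    """Fraction of memory spent on bookkeeping, one struct per page frame."""
    return STRUCT_PAGE_BYTES / PAGE_SIZE


if __name__ == "__main__":
    print(hex(page_base(0x12345)))
    print(is_page_aligned(0x2000), is_page_aligned(0x2001))
    print(f"{struct_page_overhead():.2%} of RAM")
```

Under these assumptions the bookkeeping costs well under two per cent of memory.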

When in paged mode, the CPU also uses special tables to translate the addresses it operates on (“virtual” addresses) into real ones (“physical” addresses). A part of the CPU called the Memory Management Unit (MMU) is responsible for this, and the IOMMU does the same for external peripherals. Translation enables each process to have its own private address space, and a kernel to protect itself from user-mode programs.

To support this protection, page tables contain access control flags. A page can be marked as present, writable, executable, or available for the kernel only. If code accesses a page that isn’t present or is otherwise unavailable, the faulting address is stored in the CR2 register and a hardware exception is thrown. If this happens in userspace, the kernel checks if the page accessed was swapped out to the disk and either reads it back or sends a SIGSEGV to the program (that’s the famous “Segmentation fault” error). If the exception occurs in kernel mode, it is usually considered a serious bug, and the kernel reports an oops or panics.
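
The access control flags described above are just individual bits in a page table entry. As a sketch, the Python below decodes the four properties from a 64-bit entry, assuming the standard x86-64 layout (present in bit 0, writable in bit 1, user-accessible in bit 2, no-execute in bit 63). The function name and the sample entry are invented for illustration.

```python
# Decode access flags from a 64-bit x86 page table entry. The sample entry
# value is made up; the bit positions follow the x86-64 entry layout.

PTE_PRESENT = 1 << 0
PTE_WRITABLE = 1 << 1
PTE_USER = 1 << 2
PTE_NO_EXECUTE = 1 << 63


def decode_pte_flags(entry: int) -> dict:
    """Return the access properties encoded in a page table entry."""
    return {
        "present": bool(entry & PTE_PRESENT),
        "writable": bool(entry & PTE_WRITABLE),
        "user_accessible": bool(entry & PTE_USER),   # False: kernel only
        "executable": not (entry & PTE_NO_EXECUTE),
    }


if __name__ == "__main__":
    # A present, writable, kernel-only page that may not be executed.
    print(decode_pte_flags(0x8000000012345003))
```

Note that executability is stored inverted: the hardware flag says “no execute”, which is why the NX bit in EFER has to be enabled for it to take effect.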

Translation tables

Translation tables are essentially chained arrays (see the diagram). First, the virtual address is split into several parts (four, if in 64-bit mode). Each part is then used as an index into the table that contains either an address for the next page table, or (at the lowest level) a physical address of the page itself. The exact format of the page table entry differs depending on page size or extensions, like Physical Address Extension (PAE). If PAE is enabled, more bits are allocated for the page’s physical address, allowing 32-bit code to reference memory whose physical address is beyond 4GB. This doesn’t magically make 32-bit address space more than 4GB in size, but it helped 32-bit servers to have more RAM until 64-bit servers became mainstream.
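
The splitting step can be sketched in a few lines of Python. This example assumes 4k pages and four-level paging with 48-bit virtual addresses, so each index is 9 bits wide and the offset is 12 bits. The level names (pml4, pdpt, pd, pt) are borrowed from Intel’s terminology and do not appear in the text above.

```python
# Split a 64-bit-mode virtual address into four table indexes and a page
# offset, assuming 4k pages and 9 index bits per level.

def split_virtual_address(vaddr: int) -> dict:
    """Return the per-level table indexes and the offset within the page."""
    return {
        "pml4": (vaddr >> 39) & 0x1FF,   # index into the top-level table
        "pdpt": (vaddr >> 30) & 0x1FF,   # second level
        "pd": (vaddr >> 21) & 0x1FF,     # third level
        "pt": (vaddr >> 12) & 0x1FF,     # lowest level, points at the page
        "offset": vaddr & 0xFFF,         # not translated, added at the end
    }


if __name__ == "__main__":
    # An arbitrary address in the upper part of user space.
    for part, value in split_virtual_address(0x7FFFDEADBEEF).items():
        print(f"{part:>6} = {value:#05x}")
```

The hardware performs the same walk, starting from the table that CR3 points to and following one entry per level.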

Now let’s check which memory regions are defined on your computer (this depends on BIOS, RAM size, and the kernel version):

$ cat /proc/iomem
...
00100000-bf680fff : System RAM
  01000000-015427b9 : Kernel code
  015427ba-018e037f : Kernel data
  01a05000-01b2afff : Kernel bss
...
c0000000-feafffff : PCI Bus 0000:00
...
  d7100000-d80fffff : PCI Bus 0000:06
    d7100000-d710ffff : 0000:06:00.0
      d7100000-d710ffff : ath9k

Addresses shown here are physical. You can see where the kernel lives, and also the memory range assigned to the Atheros network adapter (ath9k). If a user space program could modify translation tables and map that address, it could re-program the card. For example, it could send network packets bypassing a system firewall. That’s why managing page tables is the kernel’s job.
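
To get a feel for how large these regions are, you could parse the listing with a few lines of Python. The sketch below works on text in the /proc/iomem format and uses a handful of lines from the sample output as input; the function name is made up, and on a real system you would pass in the contents of /proc/iomem instead.

```python
# Compute region sizes from /proc/iomem-style text. SAMPLE is taken from the
# listing in the article; the function name is hypothetical.

SAMPLE = """\
00100000-bf680fff : System RAM
  01000000-015427b9 : Kernel code
  015427ba-018e037f : Kernel data
  01a05000-01b2afff : Kernel bss
      d7100000-d710ffff : ath9k
"""


def region_sizes(text: str) -> list:
    """Return (name, size_in_bytes) pairs, one per 'start-end : name' line."""
    sizes = []
    for line in text.splitlines():
        if " : " not in line:
            continue                              # skip "..." and blank lines
        addr_range, name = line.split(" : ", 1)
        start, end = (int(part, 16) for part in addr_range.strip().split("-"))
        sizes.append((name.strip(), end - start + 1))   # range is inclusive
    return sizes


if __name__ == "__main__":
    for name, size in region_sizes(SAMPLE):
        print(f"{name:<12} {size / 1024:>12.1f} KiB")
```

Both ends of each range are inclusive, hence the extra byte added to every size.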

Registers in the wild

CPU registers are a scarce resource. As they are much faster to access than memory, using them for data operations drastically increases performance. However, data begins and ends in memory, and the overhead of moving it in and out of registers can make the whole game not worth the candle. That’s why compilers use clever algorithms to allocate registers for a program’s variables as efficiently as possible. The C programming language even provides the register keyword, which can be used as a hint for the compiler. It was useful in the old days, but most optimising compilers are now smart enough to simply ignore it.

A bit of device

Finally, let’s briefly cover external devices. The x86 architecture provides what’s called I/O space – 64k of addresses (known as “ports”) accessible with the IN and OUT assembly instructions. These alone are enough to send data over a serial line. That’s why system programmers often use serial ports for low-level debugging – USB, for instance, is much more elaborate. One way Linux lets you see how I/O ports are used should feel familiar:

$ cat /proc/ioports
0000-0cf7 : PCI Bus 0000:00
  0000-001f : dma1
  0020-0021 : pic1
  0040-0043 : timer0
  0050-0053 : timer1
  0060-0060 : keyboard
  0064-0064 : keyboard
...

Here, we see that ports 0x60 and 0x64 are assigned to a PS/2 keyboard controller. Surprisingly, we can also use it to reboot the computer! This explains why access to I/O ports from Linux userspace is usually prohibited. However, the ioperm() and iopl() system calls can make a desired I/O port range (or the whole I/O address space) accessible. Root privileges are generally required, and it is rarely a good idea to use these unless you run some old X server.

Now you have some understanding of how the computer you use every day works internally. That wasn’t that hard, was it? This tutorial is a kind of high-level overview, but you can find all the details in Intel’s SDM (Software Developer’s Manual) or AMD’s APM (Architecture Programmer’s Manual), available freely (as in beer) from the respective websites. Linux opens up many possibilities to try out the new stuff you have learned. If you come up with some clever experiment, don’t forget to share it with us!

Prepare your tools

cpuid, rdmsr and wrmsr are parts of the Intel-developed msr-tools package. You are unlikely to get them with your package manager, but they are small and trivial to compile yourself. Just download the tarball (only 7k!) from https://01.org/msr-tools, unpack it and run make. If you get errors when running the tools, ensure you have the cpuid and msr stock kernel modules loaded, and the device nodes created; the MAKEDEV-cpuid-msr script bundled with the sources can help with the latter. Root privileges are required, so be careful.