Virtual Memory & x64 Long Mode

In my last blog posting I talked about how you can read and execute the Second Stage Boot Loader of your own Operating System. In today’s blog posting I first want to show you how you can remove the dependency on floppy disks. Afterwards we will switch our CPU into the x64 Long Mode, which is necessary to execute our x64-based OS Kernel.

Reading Data through ATA PIO

Over the last few blog postings, you have already learned how to read from floppy disks through the BIOS interrupt 0x13. Floppy disks are great for the first steps in OS development, because you can interact with them quite nicely through the BIOS. But if you want to run your OS on a physical computer, you need to have a physical floppy drive (there are USB-based versions available). Besides that, you are running your OS on a very ancient technology.

Therefore, I did some research to find a better option for a storage device, which can be interfaced in a quite effortless way. One of my prerequisites was that the storage device can be interfaced through I/O Ports instead of BIOS interrupts. As soon as we switch the CPU into x64 Long Mode, we have no BIOS interrupts available anymore, and therefore we can only communicate with the hardware through its various I/O Ports.

After some time, I came up with the ATA PIO interface. With that interface you can read from and write to an IDE-based hard disk. Yes, you read correctly: the hard drive must be an IDE attached hard disk. This is not a big problem when you deal with a Virtual Machine, because this is just a configuration option within your virtual disk file:

But when we talk about physical hardware it can get a little bit more complicated, because these days hard disks are normally connected through a SATA controller – or even through an NVMe controller. When I test out my OS on physical hardware, I use a 10-year-old Lenovo W510 notebook (Quad-Core CPU with 8 GB RAM). It also has one internal SATA slot where you can put in your hard disk. To be able to deal with this hard disk through ATA PIO, you must switch the SATA Controller into the so-called Compatibility Mode. Then your hard disk just acts like a traditional IDE-based hard disk. This compatibility mode even works with SATA SSDs!

The output of the build process of my OS is still a 1.44 MB large FAT12 formatted floppy image. You can use the command dd to copy that image in raw format to your physical hard drive. With this approach, I can now run my OS on physical hardware directly from an SSD drive which is interfaced through ATA PIO. It does not give you the native speed of the SSD drive, but we do not care about speed in these first baby steps. You can access the ATA PIO interface through the I/O ports ranging from 0x1F0 to 0x1F7:

0x1F0: Access to the read data or the data to be written
0x1F2: Sector count that must be read or written
0x1F3 to 0x1F5: Logical address of the disk sector that must be read or written
0x1F7: Command Port (reading this port returns the status register). The commands used here are:
    0x20: Read
    0x30: Write

The following listing shows the assembly code that is necessary to read one or more disk sectors through ATA PIO into main memory. The register BX must contain the number of sectors to be read, the register ECX must contain the starting LBA address, and the register pair ES:EDI must contain the destination memory address.

;================================================
; This function reads a sector through ATA PIO.
; BX:  Number of sectors to read
; ECX: Starting LBA
; EDI: Destination Address
;================================================
ReadSector:
    ; Sector count
    MOV     DX, 0x1F2
    MOV     AL, BL
    OUT     DX, AL

    ; LBA - Low Byte
    MOV     DX, 0x1F3
    MOV     AL, CL
    OUT     DX, AL

    ; LBA - Middle Byte
    MOV     DX, 0x1F4
    MOV     AL, CH
    OUT     DX, AL

    ; LBA - High Byte
    BSWAP   ECX
    MOV     DX, 0x1F5
    MOV     AL, CH
    OUT     DX, AL

    ; Read Command
    MOV     DX, 0x1F7
    MOV     AL, 0x20    ; Read Command
    OUT     DX, AL

    .ReadNextSector:
        CALL    Check_ATA_BSY
        CALL    Check_ATA_DRQ

        ; Read the sector of 512 bytes into ES:EDI
        ; EDI is incremented by 512 bytes automatically
        MOV     DX, 0x1F0
        MOV     CX, 256
        REP     INSW

        ; Decrease the number of sectors to read and compare it to 0
        DEC     BX
        CMP     BX, 0
        JNE     .ReadNextSector
RET
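
To make the register setup in ReadSector easier to follow, here is a minimal Python sketch that models which byte goes to which I/O port. It performs no real I/O; the function names lba_port_writes and bswap32 are hypothetical and only used for illustration.

```python
# Hypothetical model of the register setup in ReadSector (no real port I/O).

def lba_port_writes(lba: int, sector_count: int) -> list[tuple[int, int]]:
    """Return (port, byte) pairs in the order ReadSector issues them."""
    return [
        (0x1F2, sector_count & 0xFF),   # Sector count (only BL is sent)
        (0x1F3, lba & 0xFF),            # LBA bits 0-7   (CL)
        (0x1F4, (lba >> 8) & 0xFF),     # LBA bits 8-15  (CH)
        (0x1F5, (lba >> 16) & 0xFF),    # LBA bits 16-23 (CH after BSWAP ECX)
        (0x1F7, 0x20),                  # Read command
    ]


def bswap32(value: int) -> int:
    """Reverse the four bytes of a 32-bit value, like the BSWAP instruction."""
    return int.from_bytes(value.to_bytes(4, "little"), "big")


if __name__ == "__main__":
    for port, byte in lba_port_writes(lba=19, sector_count=14):
        print(f"OUT {port:#05x}, {byte:#04x}")
```

The bswap32 helper shows why the listing uses BSWAP: after the byte swap, bits 16 to 23 of the original LBA end up in CH, so the same MOV AL, CH instruction can be reused for the high byte.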

The loop at the label .ReadNextSector also calls the utility functions Check_ATA_BSY and Check_ATA_DRQ. These two functions poll the status register at port 0x1F7 until the BSY flag is cleared and the DRQ flag is set. The following listing shows their implementation.

;================================================
; This function checks the ATA PIO BSY flag.
;================================================
Check_ATA_BSY:
    MOV     DX, 0x1F7
    IN      AL, DX
    TEST    AL, 0x80
    JNZ     Check_ATA_BSY
RET

;================================================
; This function checks the ATA PIO DRQ flag.
;================================================
Check_ATA_DRQ:
    MOV     DX, 0x1F7
    IN      AL, DX
    TEST    AL, 0x08
    JZ      Check_ATA_DRQ
RET
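
The two polling loops can be summarized in a small Python sketch. The flag values 0x80 and 0x08 come from the listing above; the names ready_for_transfer and wait_until_ready, the read_status callback, and the timeout are assumptions of mine (the assembly code polls forever).

```python
ATA_SR_BSY = 0x80  # Bit 7: the drive is busy
ATA_SR_DRQ = 0x08  # Bit 3: the drive has data ready to be transferred


def ready_for_transfer(status: int) -> bool:
    """True when BSY is cleared and DRQ is set, as both checks require."""
    return (status & ATA_SR_BSY) == 0 and (status & ATA_SR_DRQ) != 0


def wait_until_ready(read_status, max_polls: int = 100_000) -> None:
    """Poll read_status() until the drive is ready.

    read_status is a hypothetical callback standing in for 'IN AL, DX'
    on port 0x1F7. Unlike the assembly code, this sketch gives up after
    max_polls attempts instead of looping forever.
    """
    for _ in range(max_polls):
        if ready_for_transfer(read_status()):
            return
    raise TimeoutError("ATA drive did not become ready")
```

A timeout like this is worth considering for real hardware, because a missing or misconfigured drive would otherwise hang the boot loader in the polling loop.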

With these functions you are now able to read a given file from a hard disk. The following listing shows the function LoadFileIntoMemory that is used now from the boot sector code to read a given file from a FAT12 partition into memory.

;=================================
; Loads a given file into memory.
;=================================
LoadFileIntoMemory:
    .LoadRootDirectory:
        ; Load the Root Directory into memory.
        ; It starts at the LBA 19, and consists of 14 sectors.
        MOV     BX,  0xE                                ; 14 sectors to be read (ReadSector counts down the whole BX register)
        MOV     ECX, 0x13                               ; The LBA is 19
        MOV     EDI, ROOTDIRECTORY_AND_FAT_OFFSET       ; Destination address
        CALL    ReadSector                              ; Loads the complete Root Directory into memory

    .FindFileInRootDirectory:
        ; Now we have to find our file in the Root Directory
        MOV     CX, [bpbRootEntries]                    ; The number of root directory entries
        MOV     DI, ROOTDIRECTORY_AND_FAT_OFFSET        ; Address of the Root directory
        .Loop:
            PUSH    CX
            MOV     CX, 11                              ; We compare 11 characters (8.3 convention)
            MOV     SI, FileName                        ; Compare against the file name
            PUSH    DI
            REP     CMPSB                               ; Test for string match

            POP     DI
            JE      .LoadFAT                            ; When we have a match, we load the FAT
            POP     CX
            ADD     DI, 32                              ; When we don't have a match, we go to next root directory entry (+ 32 bytes)
            LOOP    .Loop
            JMP     Failure                             ; The file image wasn't found in the root directory

    .LoadFAT:
        ; Store the first FAT cluster of the file to be read in the variable "Cluster"
        MOV     DX, WORD [DI + 0x001A]              ; Add 26 bytes to the current entry of the root directory, so that we get the start cluster
        MOV     WORD [Cluster], DX                  ; Store the 2 bytes of the start cluster (byte 26 & 27 of the root directory entry) in the variable "cluster"

        ; Load the FATs into memory.
        ; It starts at the LBA 1 (directly after the boot sector), and consists of 18 sectors (2 x 9).
        MOV     BX, 0x12                                ; 18 sectors to be read (ReadSector counts down the whole BX register)
        MOV     ECX, 0x1                                ; The LBA is 1
        MOV     EDI, ROOTDIRECTORY_AND_FAT_OFFSET       ; Offset in memory at which we want to load the FATs
        CALL    ReadSector                              ; Call the load routine
        MOV     EDI, [Loader_Offset]                    ; Address where the first cluster should be stored

    .LoadImage:
        ; Print out the current offset where the cluster is loaded into memory
        ; This introduces a short delay, which is somehow needed by the ATA PIO code...?
        MOV     AX, DI
        CALL    PrintDecimal
        MOV     SI, CRLF
        CALL    PrintLine

        ; Load the first sector of the file into memory
        MOV     AX, WORD [Cluster]                      ; First FAT cluster to read
        ADD     AX, 0x1F                                ; Add 31 to the FAT cluster number to get the LBA of that cluster
        MOVZX   ECX, AX                                 ; LBA, zero-extended so that the upper 16 bits of ECX are clean
        MOV     BX, 1                                   ; 1 sector to be read (ReadSector counts down the whole BX register)
        CALL    ReadSector                              ; Read the cluster into memory
        
        ; Compute the next cluster that we have to load from disk
        MOV     AX, WORD [Cluster]                      ; identify current cluster
        MOV     CX, AX                                  ; copy current cluster
        MOV     DX, AX                                  ; copy current cluster
        SHR     DX, 0x0001                              ; divide by two
        ADD     CX, DX                                  ; sum for (3/2)
        MOV     BX, ROOTDIRECTORY_AND_FAT_OFFSET        ; location of FAT in memory
        ADD     BX, CX                                  ; index into FAT
        MOV     DX, WORD [BX]                           ; read two bytes from FAT
        TEST    AX, 0x0001
        JNZ     .LoadRootDirectoryOddCluster
          
    .LoadRootDirectoryEvenCluster:
        AND     DX, 0000111111111111b                   ; Take the lowest 12 bits
        JMP     .LoadRootDirectoryDone
            
    .LoadRootDirectoryOddCluster:
        SHR     DX, 0x0004                              ; Take the highest 12 bits
            
    .LoadRootDirectoryDone:
        MOV     WORD [Cluster], DX                      ; store new cluster
        CMP     DX, 0x0FF0                              ; Test for end of file
        JB      .LoadImage

    .LoadRootDirectoryEnd:
        ; Remove the loop counter that .Loop left on the stack when the file was found, so that we can do a RET
        POP     BX
RET
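
The cluster arithmetic in .LoadImage is the trickiest part of this function, because FAT12 packs two 12-bit entries into three bytes. The following Python sketch mirrors that calculation on an in-memory copy of the FAT. The function names are hypothetical, and the offset of 31 sectors is the one used in the listing above for a 1.44 MB floppy layout.

```python
def fat12_next_cluster(fat: bytes, cluster: int) -> int:
    """Return the FAT12 entry for the given cluster number."""
    offset = cluster + (cluster >> 1)             # cluster * 3 / 2
    word = fat[offset] | (fat[offset + 1] << 8)   # Little-endian 16-bit read
    if cluster & 1:
        return word >> 4                          # Odd cluster: highest 12 bits
    return word & 0x0FFF                          # Even cluster: lowest 12 bits


def cluster_to_lba(cluster: int) -> int:
    """Data cluster number -> LBA, as done by 'ADD AX, 0x1F'."""
    return cluster + 31


def cluster_chain(fat: bytes, first_cluster: int) -> list[int]:
    """Follow the chain until an entry of 0xFF0 or above is reached."""
    chain = []
    cluster = first_cluster
    while cluster < 0x0FF0:
        chain.append(cluster)
        cluster = fat12_next_cluster(fat, cluster)
    return chain


if __name__ == "__main__":
    # Tiny hand-built FAT: cluster 2 -> 3 -> 4 -> end of file (0xFFF)
    fat = bytes([0xF0, 0xFF, 0xFF, 0x03, 0x40, 0x00, 0xFF, 0x0F, 0x00])
    print([cluster_to_lba(c) for c in cluster_chain(fat, 2)])
```

Note that the assembly code always reads the first cluster before it looks at the FAT, whereas this sketch checks the cluster number first; for a valid start cluster both produce the same sequence of sectors.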

The boot sector code has also changed because now we must call the above-mentioned function during the startup. In addition, the boot sector code loads 2 additional files into memory for execution:

1. KLDR16.BIN
This is the Second Stage Boot Loader that is implemented in x16 Real Mode, which still has access to the BIOS. After getting the necessary information from the BIOS, it switches the CPU into x64 Long Mode, where it executes the x64 based KLDR64.BIN file.

2. KLDR64.BIN
This is the Third Stage Boot Loader that is implemented in x64 Long Mode. It currently just prints out the date and time that we have retrieved from the BIOS. In the next release of my OS, it will read the x64-based OS kernel KERNEL.BIN through ATA PIO from the FAT12 partition into memory at the physical memory address 0x100000 and execute it. That is the only purpose of this additional boot loader file. This task must be done in KLDR64.BIN, because the CPU is then already in x64 Long Mode, where we can access higher memory addresses like 0x100000. This would be impossible to do in KLDR16.BIN, because the CPU is at that point in time still in x16 Real Mode. The implementation of this functionality will be covered in the next blog posting. The following listing shows the rewritten boot sector code.

Main:
    ; Setup the DS and ES register
    XOR     AX, AX
    MOV     DS, AX
    MOV     ES, AX

    ; Prepare the stack
    ; Otherwise we can't call a function...
    MOV     AX, 0x7000
    MOV     SS, AX
    MOV     BP, 0x8000
    MOV     SP, BP

    ; Print out a boot message
    MOV     SI, BootMessage
    CALL    PrintLine

    ; Load the KLDR64.BIN file into memory
    MOV     CX, 11
    LEA     SI, [SecondStageFileName64]
    LEA     DI, [FileName]
    REP     MOVSB
    MOV     WORD [Loader_Offset], KAOSLDR64_OFFSET
    CALL    LoadFileIntoMemory

    ; Load the KLDR16.BIN file into memory
    MOV     CX, 11
    LEA     SI, [SecondStageFileName16]
    LEA     DI, [FileName]
    REP     MOVSB
    MOV     WORD [Loader_Offset], KAOSLDR16_OFFSET
    CALL    LoadFileIntoMemory

    ; Execute the KLDR16.BIN file...
    CALL KAOSLDR16_OFFSET

As you can see, both files are read into memory, and finally we continue our code execution at the memory address 0x2000 where the KLDR16.BIN resides.

BIOS Information Block and A20 Line

The first step in the KLDR16.BIN code execution is to retrieve all the necessary information from the BIOS. At this point in time, we only retrieve the current date and time from the BIOS and store them in a memory area that I call the BIOS Information Block – the BIB. In the future we will enhance the BIB with additional information from the BIOS – like the Memory Map and information about the supported graphics modes. The information from the BIOS Information Block will later be used and processed by the x64-based OS kernel. The following listing shows how the current date and time are stored in the BIB.

;=================================================
; This function retrieves the date from the BIOS.
;=================================================
GetDate:
    ; Get the current date from the BIOS
    MOV     AH, 0x4
    INT     0x1A

    ; Century
    PUSH    CX
    MOV     AL, CH
    CALL    Bcd2Decimal
    MOV     [Year1], AX
    POP     CX

    ; Year
    MOV     AL, CL
    CALL    Bcd2Decimal
    MOV     [Year2], AX

    ; Month
    MOV     AL, DH
    CALL    Bcd2Decimal 
    MOV     WORD [ES:DI + BiosInformationBlock.Month], AX

    ; Day
    MOV     AL, DL
    CALL    Bcd2Decimal
    MOV     WORD [ES:DI + BiosInformationBlock.Day], AX

    ; Calculate the whole year (e.g. "20" * 100 + "22" = 2022)
    MOV     AX, [Year1]
    MOV     BX, 100
    MUL     BX
    MOV     BX, [Year2]
    ADD     AX, BX
    MOV     WORD [ES:DI + BiosInformationBlock.Year], AX
RET

;=================================================
; This function retrieves the time from the BIOS.
;=================================================
GetTime:
    ; Get the current time from the BIOS
    MOV     AH, 0x2
    INT     0x1A

    ; Hour
    PUSH    CX
    MOV     AL, CH
    CALL    Bcd2Decimal
    MOV     WORD [ES:DI + BiosInformationBlock.Hour], AX
    POP     CX

    ; Minute
    MOV     AL, CL
    CALL    Bcd2Decimal
    MOV     WORD [ES:DI + BiosInformationBlock.Minute], AX

    ; Second
    MOV     AL, DH
    CALL    Bcd2Decimal
    MOV     WORD [ES:DI + BiosInformationBlock.Second], AX
RET
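
The BIOS returns all date and time values in BCD format, where each 4-bit nibble holds one decimal digit. The implementation of Bcd2Decimal is not shown in this posting, so the following Python sketch is only my assumption of what such a conversion looks like, together with the year calculation from GetDate.

```python
def bcd_to_decimal(bcd: int) -> int:
    """Convert one packed BCD byte (e.g. 0x59) to its value (59)."""
    return (bcd >> 4) * 10 + (bcd & 0x0F)


def full_year(century_bcd: int, year_bcd: int) -> int:
    """Combine CH (century) and CL (year) as GetDate does: 20 * 100 + 22."""
    return bcd_to_decimal(century_bcd) * 100 + bcd_to_decimal(year_bcd)


if __name__ == "__main__":
    print(full_year(0x20, 0x22))
```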

Before we switch the CPU into the x64 Long Mode, we must also enable the so-called A20 Line. As long as this line is disabled, address bit 20 is forced to zero, so it must be enabled before we can access all memory areas. Unfortunately, there are many different methods to enable this line, depending on the hardware in use. You can check out the complexity of it in the source code of the Linux Kernel. The following listing shows one method (the so-called Fast A20 gate on I/O port 0x92) that currently works for me.

;=============================================
; This function enables the A20 gate
;=============================================
EnableA20:
    CLI	                ; Disables interrupts
    PUSH    AX          ; Save AX on the stack
    MOV     AL, 2
    OUT     0x92, AL
    POP	    AX          ; Restore the value of AX from the stack
    STI                 ; Enable the interrupts again
RET
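
Why the A20 Line matters can be shown with a few lines of Python. This is only a model of the address calculation, with hypothetical function names: as long as A20 is disabled, bit 20 of every physical address is forced to zero, so addresses at and above 1 MB wrap around to low memory.

```python
A20_BIT = 1 << 20


def apply_a20(physical_address: int, a20_enabled: bool) -> int:
    """Model the A20 gate: when disabled, address bit 20 is forced to zero."""
    if a20_enabled:
        return physical_address
    return physical_address & ~A20_BIT


def real_mode_physical(segment: int, offset: int, a20_enabled: bool) -> int:
    """Real Mode address calculation: segment * 16 + offset."""
    return apply_a20((segment << 4) + offset, a20_enabled)


if __name__ == "__main__":
    # FFFF:0010 is the first byte above 1 MB
    print(hex(real_mode_physical(0xFFFF, 0x0010, a20_enabled=True)))
    print(hex(real_mode_physical(0xFFFF, 0x0010, a20_enabled=False)))
```

With a disabled A20 Line, a kernel loaded at the physical memory address 0x100000 would therefore end up at address 0x0 instead.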

Virtual Memory on an x64 System

After we have done all these individual steps, we are finally able to switch our CPU into the x64 Long Mode to gain access to the whole available system memory and to be able to execute 64-bit instructions. But before we do that, we must talk about Virtual Memory on an x64 system.

Every time we have accessed main memory up to this point in time, we have dealt with so-called Physical Memory Addresses. A physical memory address, like 0x2000 where the KLDR16.BIN file resides, is the physical location within the installed RAM module. A Virtual Memory Address, on the other hand, abstracts memory addresses from the underlying RAM modules. The x16 Real Mode has no idea about virtual memory. It only works with physical memory and therefore with physical memory addresses.

But as soon as paging is enabled, your CPU only deals with virtual memory addresses. Paging is optional in x32 Protected Mode, but mandatory in x64 Long Mode. Every memory address that you provide in machine code is then treated as a virtual memory address. But physical memory can still only be accessed with physical memory addresses. Therefore, you need a component which translates a given virtual memory address into a physical memory address. That component is called the Memory Management Unit (MMU) and is part of the CPU.

The translation between a virtual and a physical memory address is performed by a so-called Translation Method. The cool thing about virtual memory addresses is that they add an additional layer of abstraction. With that abstraction layer the CPU can enforce different memory policies, based on what the running OS configures. Here are some examples:

A running process can’t access the memory region from a different process because each process has a different physical memory region assigned.
A user mode process can’t directly access Kernel data structures because the Kernel memory space is not accessible from a user mode process.
Some parts of the main memory can be marked as read-only. When some machine code tries to write to these memory regions, a CPU fault will be triggered.

The idea of virtual memory is to divide the whole available system memory into regions called Pages. A traditional page on an x32/x64 system is normally 4 KB large – 4096 bytes. In addition, Intel CPUs also offer larger page sizes – like Large Pages (2 MB) and Huge Pages (1 GB). A page is always accessed through a virtual memory address. With the provided translation function, the virtual memory address is mapped to a physical memory address – the so-called Page Frame. The page frame has the same size as the page.

Two concurrently running processes can access the same virtual memory address, but through the translation function this virtual address is mapped for each process to a different physical page frame. That’s the power of virtual memory. The following picture illustrates this very important concept.

The question is now how a page is mapped to a physical frame and where that mapping is stored. The CPU uses here so-called Page Tables, which are used to store the mapping/translation information. Each running process has an individual set of page tables assigned. The address of the current active page table is stored in a special CPU register called CR3 – the Control Register 3. As soon as a process switch occurs in the OS, the Kernel must load the address of the new active page table into that register. Afterwards the virtual addresses of the newly active running process are mapped to different physical page frames.
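
The effect of switching CR3 can be illustrated with a deliberately simplified Python model. The class ToyMMU and its flat, single-level page table (a dictionary from page number to frame number) are inventions for this example; the real hierarchy has multiple levels.

```python
PAGE_SIZE = 4096


class ToyMMU:
    """Toy model: CR3 selects one flat page table (page -> frame number)."""

    def __init__(self) -> None:
        self.cr3: dict[int, int] = {}

    def load_cr3(self, page_table: dict[int, int]) -> None:
        """Model of a process switch: activate another page table."""
        self.cr3 = page_table

    def translate(self, virtual_address: int) -> int:
        page, offset = divmod(virtual_address, PAGE_SIZE)
        frame = self.cr3[page]          # A KeyError models a page fault
        return frame * PAGE_SIZE + offset


if __name__ == "__main__":
    process_a = {0x400: 0x10}           # Page 0x400 -> frame 0x10
    process_b = {0x400: 0x27}           # Same page, different frame
    mmu = ToyMMU()
    for name, table in (("A", process_a), ("B", process_b)):
        mmu.load_cr3(table)
        print(name, hex(mmu.translate(0x400123)))
```

Both processes use the same virtual memory address 0x400123, but after each switch of the active page table the address resolves to a different physical page frame.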

The x64 CPU architecture uses a 4-level hierarchy to store the mapping information in various page tables, and each page table has a size of 4 KB. Each entry in a page table has a fixed size of 8 bytes (64 bits), and therefore you can store 512 entries in a page table. Bits 12 – 47 of the 64-bit virtual memory address act as page table indexes for the various levels, and bits 0 – 11 are the offset within the page – as seen in the following picture.

Each page table index is 9 bits long. This makes sense, because each page table has 512 entries, and 2^9 = 512. With 9 bits we can address each of these individual entries. The remaining bits 48 – 63 are not used for the translation, because the current x64 CPU architecture only supports 48-bit virtual addresses. But it is very important that these remaining bits have a correct value: they must all have the same value as bit 47 (Sign Extension), otherwise the CPU raises a fault. Addresses that follow this rule are so-called Canonical Memory Addresses. With only 48 bits in use, the x64 architecture gives us 256 TB of addressable virtual memory. Intel Ice Lake CPUs are introducing 5-Level Paging, where the bits 48 – 56 are also in use. This extends the addressable virtual memory to 128 PB (petabytes). The x64 architecture calls the 4 levels in the page table hierarchy as follows:

Level 4: Page Map Level 4 Table (PML4T)
Level 3: Page Directory Pointer Table (PDPT)
Level 2: Page Directory Table (PDT)
Level 1: Page Table (PT)
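
The rule for Canonical Memory Addresses and the size of the address space can be checked with a short Python sketch. The function name is_canonical and its va_bits parameter are hypothetical; the rule itself (all upper bits must equal the highest implemented address bit) is the one described above.

```python
def is_canonical(virtual_address: int, va_bits: int = 48) -> bool:
    """Check that bits va_bits-1 .. 63 are all zero or all one."""
    upper = virtual_address >> (va_bits - 1)       # Bit 47 and everything above
    all_ones = (1 << (64 - va_bits + 1)) - 1
    return upper == 0 or upper == all_ones


if __name__ == "__main__":
    for address in (0x00007FFFFFFFFFFF, 0x0000800000000000, 0xFFFF800000000000):
        print(f"{address:#018x} canonical: {is_canonical(address)}")
    print("4-level paging:", 2**48 // 2**40, "TB")
    print("5-level paging:", 2**57 // 2**50, "PB")
```

The address 0x0000800000000000 is the first non-canonical address: bit 47 is set, but the bits 48 – 63 are not.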

When you now have a virtual memory address, the CPU must perform a so-called Page Table Walk to translate the virtual memory address to a physical memory address. During the page table walk, parts of the given virtual memory address are used as lookup indexes into the various page tables. A page table walk consists of the following steps:

1. The physical memory address of the Page Map Level 4 Table is read from the CR3 register.

2. Bits 39 – 47 of the virtual memory address are used as the index into the Page Map Level 4 Table. The entry at that index is read and returns the physical memory address of the Page Directory Pointer Table.

3. Bits 30 – 38 of the virtual memory address are used as the index into the Page Directory Pointer Table. The entry at that index is read and returns the physical memory address of the Page Directory Table.

4. Bits 21 – 29 of the virtual memory address are used as the index into the Page Directory Table. The entry at that index is read and returns the physical memory address of the Page Table.

5. Bits 12 – 20 of the virtual memory address are used as the index into the Page Table. The entry at that index is read and returns the physical memory address of the final page frame.

6. Bits 0 – 11 of the virtual memory address are the offset within that page frame and are added to its physical memory address.
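
The index extraction used during the page table walk can be sketched in a few lines of Python. The function name split_virtual_address and the dictionary keys are hypothetical; the bit positions are the ones from the steps above.

```python
def split_virtual_address(virtual_address: int) -> dict[str, int]:
    """Split a 48-bit virtual address into its four indexes and the offset."""
    return {
        "pml4": (virtual_address >> 39) & 0x1FF,   # Bits 39 - 47
        "pdpt": (virtual_address >> 30) & 0x1FF,   # Bits 30 - 38
        "pdt": (virtual_address >> 21) & 0x1FF,    # Bits 21 - 29
        "pt": (virtual_address >> 12) & 0x1FF,     # Bits 12 - 20
        "offset": virtual_address & 0xFFF,         # Bits 0 - 11
    }


if __name__ == "__main__":
    for address in (0x3000, 0x1FFFFF, 0xFFFF800000000000):
        print(f"{address:#018x}", split_virtual_address(address))
```

The mask 0x1FF keeps exactly 9 bits, which matches the 512 entries of each page table.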

The following table describes the format of a 64-bit page table entry.

Bit(s)	Description
0	Is the page currently in memory?
1	Is it allowed to write to the page?
2	If not set, only Kernel mode code can access this page.
3	Writes are going directly to memory.
4	No cache is used for this page.
5	The CPU sets this bit when the page is accessed.
6	The CPU sets this bit when a write occurs on this page.
7	Must be 0 in Level 1 and Level 4. When set it creates 1 GB page in Level 3, and 2 MB pages in Level 2.
8	The page isn’t flushed from caches on an address space switch.
9 – 11	Freely usable by the OS.
12 – 51	Physical address of the frame of the next page table in the next level below.
52 – 62	Freely usable by the OS.
63	If set, it forbids code execution from this page

It is very important to recall here that the bits 12 – 51 store the *physical address* of the page frame that must be accessed in the next level below of the page table hierarchy.

Note: The bits 0 – 11 are of course also part of the physical address of the page frame, but the physical address must be always aligned at a 4 KB boundary (4 KB are 0x1000 in hexadecimal and in binary 1000000000000), and therefore these lower 12 bits are always set to zero and can be used as flags as mentioned in the table above.
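
How the flags and the physical address share one 64-bit entry can be sketched in Python. The names PAGE_PRESENT and PAGE_WRITE also appear in my assembly code; their values here are derived from the bit positions in the table above, and the other names are hypothetical.

```python
PAGE_PRESENT = 1 << 0       # Bit 0: the page is in memory
PAGE_WRITE = 1 << 1         # Bit 1: writes are allowed
PAGE_USER = 1 << 2          # Bit 2: user mode code may access the page
PAGE_NO_EXECUTE = 1 << 63   # Bit 63: code execution is forbidden

ADDRESS_MASK = ((1 << 52) - 1) & ~0xFFF   # Bits 12 - 51


def make_entry(physical_address: int, flags: int) -> int:
    """Combine a 4 KB aligned physical address with flag bits."""
    if physical_address & ~ADDRESS_MASK:
        raise ValueError("address must be 4 KB aligned and below 2^52")
    return physical_address | flags


def entry_address(entry: int) -> int:
    """Extract the physical address (bits 12 - 51) from an entry."""
    return entry & ADDRESS_MASK


if __name__ == "__main__":
    entry = make_entry(0xA000, PAGE_PRESENT | PAGE_WRITE)
    print(hex(entry), hex(entry_address(entry)))
```

Because the lower 12 bits of an aligned address are always zero, a simple OR is enough to combine address and flags, which is exactly what the OR instructions in the assembly code of the next section do.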

Therefore, a page table walk performs a lot of physical memory accesses. As you know, accessing physical memory is very slow compared to the speed of the CPU. That’s the reason why a CPU includes a so-called Translation Lookaside Buffer (TLB). The TLB caches the recent translations of virtual memory addresses to physical memory addresses, so that most memory accesses don’t need a page table walk at all. Of course, the TLB must be flushed by the OS as soon as an address space switch occurs. The TLB flush is normally performed by writing a new value into the CR3 control register.
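
The idea behind the TLB can be modeled as a simple cache in front of the page table walk. The class ToyTLB and its walk callback are inventions for this sketch; a real TLB has a limited size and a replacement policy, which are left out here.

```python
class ToyTLB:
    """Toy model of a TLB: caches page -> frame translations."""

    def __init__(self, walk) -> None:
        self.walk = walk                    # Hypothetical page table walk function
        self.entries: dict[int, int] = {}
        self.walks = 0                      # Counts the expensive walks

    def translate(self, virtual_address: int) -> int:
        page, offset = virtual_address >> 12, virtual_address & 0xFFF
        if page not in self.entries:        # TLB miss: walk the page tables
            self.entries[page] = self.walk(page)
            self.walks += 1
        return (self.entries[page] << 12) | offset

    def flush(self) -> None:
        """Models the flush caused by writing a new value into CR3."""
        self.entries.clear()


if __name__ == "__main__":
    tlb = ToyTLB(walk=lambda page: page + 0x100)   # Made-up mapping
    tlb.translate(0x5010)
    tlb.translate(0x5FFF)                          # Same page: no second walk
    print("walks:", tlb.walks)
```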

Switching the CPU into x64 Long Mode

In the last section I gave you a quick overview of virtual memory on an x64 system. As you have learned, every memory address acts as a virtual memory address. Therefore, it is very important that you have a working page table hierarchy in place when you switch the CPU into the x64 Long Mode. Otherwise, your OS would crash, because even the memory address in the instruction pointer register RIP is treated as a virtual memory address that must be translated to a physical memory address.

Therefore, the first step is to create a simple page table hierarchy in physical memory (we are still in x16 Real Mode!) where a virtual memory address maps to the same physical memory address (a virtual memory address has the same value as its physical memory address). This is called an Identity Mapping. We will identity map the first 2 MB of physical memory. The identity mapping is just done temporarily so that we can switch the CPU into x64 Long Mode. When the real OS Kernel is loaded afterwards into memory and is executed, it will set up the final page table structures in memory. The following picture shows the page table hierarchy structure that we will create in the next step in memory.

As you can see from the picture, the PML4T, the PDPT, and the PDT have only a valid page table entry at the first entry 0. This makes sense, because the bits 21 – 47 are always zero when we deal with a memory address below 2 MB. The maximum value below 2 MB is in decimal 2097151 (2 MB minus 1 byte) and is represented in binary as follows:

0000000000000000 000000000 000000000 000000000 111111111 111111111111
Sign Extension   PML4      PDPT      PDT       PT        Offset

When we perform a page table walk with this limited range of virtual memory addresses (0 – 2 MB), we must *always* access the first entry (zero-based!) in the PML4T, the PDPT, and the PDT – based on the bit pattern from above. Which entry you must access in the PT depends on the bits 12 – 20 of the virtual memory address. One PT covers a memory region of 2 MB (512 entries x 4 KB), and therefore we must fill each of the 512 entries in the PT. Each entry points to the page frame whose physical memory address is equal to the virtual memory address of its page, which is exactly the identity mapping. The following assembly code shows how to create this page table hierarchy in memory where the PML4T starts at the physical memory address 0x9000.

SwitchToLongMode:
    MOV     EDI, 0x9000
    
    ; Zero out the 16KiB buffer.
    ; Since we are doing a rep stosd, count should be bytes/4.   
    PUSH    DI                                  ; REP STOSD alters DI.
    MOV     ECX, 0x1000
    XOR     EAX, EAX
    CLD
    REP     STOSD
    POP     DI                                  ; Get DI back.
 
    ; Build the Page Map Level 4 (PML4)
    ; es:di points to the Page Map Level 4 table.
    LEA     EAX, [ES:DI + 0x1000]               ; Put the address of the Page Directory Pointer Table in to EAX.
    OR      EAX, PAGE_PRESENT | PAGE_WRITE      ; Or EAX with the flags - present flag, writable flag.
    MOV     [ES:DI], EAX                        ; Store the value of EAX as the first PML4E.
 
    ; Build the Page Directory Pointer Table (PDP)
    LEA     EAX, [ES:DI + 0x2000]               ; Put the address of the Page Directory in to EAX.
    OR      EAX, PAGE_PRESENT | PAGE_WRITE      ; Or EAX with the flags - present flag, writable flag.
    MOV     [ES:DI + 0x1000], EAX               ; Store the value of EAX as the first PDPTE.
 
    ; Build the Page Directory (PD)
    LEA     EAX, [ES:DI + 0x3000]               ; Put the address of the Page Table in to EAX.
    OR      EAX, PAGE_PRESENT | PAGE_WRITE      ; Or EAX with the flags - present flag, writeable flag.
    MOV     [ES:DI + 0x2000], EAX               ; Store to value of EAX as the first PDE.
 
    PUSH    DI                                  ; Save DI for the time being.
    LEA     DI, [DI + 0x3000]                   ; Point DI to the page table.
    MOV     EAX, PAGE_PRESENT | PAGE_WRITE      ; Move the flags into EAX - and point it to 0x0000.
    
    ; Build the Page Table (PT)
.LoopPageTable:
    MOV     [ES:DI], EAX
    ADD     EAX, 0x1000
    ADD     DI, 8
    CMP     EAX, 0x200000                       ; If we did all 2MiB, end.
    JB      .LoopPageTable
 
    POP     DI                                  ; Restore DI.
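
To convince yourself that the page tables built above really identity map the first 2 MB, you can rebuild them in a Python model and walk them. This is only a sketch: the physical memory is modeled as a dictionary (address to 64-bit entry), the function names are hypothetical, and the flag values 0x1 and 0x2 are assumed from the page table entry format.

```python
PAGE_PRESENT = 0x1
PAGE_WRITE = 0x2
ADDRESS_MASK = ((1 << 52) - 1) & ~0xFFF    # Bits 12 - 51


def build_identity_map(base: int = 0x9000) -> dict[int, int]:
    """Mirror the assembly code: PML4T, PDPT, PDT and PT starting at base."""
    flags = PAGE_PRESENT | PAGE_WRITE
    memory = {
        base: (base + 0x1000) | flags,             # PML4T[0] -> PDPT
        base + 0x1000: (base + 0x2000) | flags,    # PDPT[0]  -> PDT
        base + 0x2000: (base + 0x3000) | flags,    # PDT[0]   -> PT
    }
    for index in range(512):                       # PT[i] -> frame i * 4 KB
        memory[base + 0x3000 + index * 8] = (index * 0x1000) | flags
    return memory


def translate(memory: dict[int, int], cr3: int, virtual_address: int) -> int:
    """Perform a 4-level page table walk on the modeled memory."""
    table = cr3
    for shift in (39, 30, 21, 12):
        index = (virtual_address >> shift) & 0x1FF
        entry = memory.get(table + index * 8, 0)   # Untouched memory is zero
        if not entry & PAGE_PRESENT:
            raise LookupError(f"page fault at {virtual_address:#x}")
        table = entry & ADDRESS_MASK
    return table | (virtual_address & 0xFFF)


if __name__ == "__main__":
    memory = build_identity_map()
    print(hex(translate(memory, 0x9000, 0x3000)))
```

Every address below 2 MB translates to itself, while the first address above that range (0x200000) runs into an entry without the present flag, which would be a page fault on the real CPU.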

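After we have set up the necessary page table hierarchy in memory, we will perform the switch into the x64 Long Mode. To enable the x64 Long Mode, you must set the Long Mode Enable (LME) bit in a Model Specific Register called the Extended Feature Enable Register (EFER), after you have enabled PAE in CR4 and loaded the address of the PML4T into CR3. Afterwards you turn on paging and protection in CR0. Please check out this article for more details about it. The following listing shows the necessary code.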

    ; Disable IRQs
    MOV     AL, 0xFF                            ; Out 0xFF to 0xA1 and 0x21 to disable all IRQs.
    OUT     0xA1, AL
    OUT     0x21, AL
 
    NOP
    NOP
 
    LIDT    [IDT]                               ; Load a zero length IDT so that any NMI causes a triple fault.
 
    ; Enter long mode.
    MOV     EAX, 10100000b                      ; Set the PAE and PGE bit.
    MOV     CR4, EAX
    MOV     EDX, EDI                            ; Point CR3 at the PML4.
    MOV     CR3, EDX
    MOV     ECX, 0xC0000080                     ; Read from the EFER MSR. 
    RDMSR    
 
    OR      EAX, 0x00000100                     ; Set the LME bit.
    WRMSR
 
    MOV     EBX, CR0                            ; Activate long mode -
    OR      EBX, 0x80000001                     ; - by enabling paging and protection simultaneously.
    MOV     CR0, EBX                    
 
    LGDT    [GDT.Pointer]                       ; Load GDT.Pointer defined below.
 
    JMP     CODE_SEG:LongMode                   ; Load CS with the 64-bit code segment and flush the prefetch queue

[BITS 64]
LongMode:
    MOV     AX, DATA_SEG
    MOV     DS, AX
    MOV     ES, AX
    MOV     FS, AX
    MOV     GS, AX
    MOV     SS, AX

    ; Setup the stack
    MOV     RAX, QWORD 0x70000
    MOV     RSP, RAX
    MOV     RBP, RSP
    XOR     RBP, RBP

    ; Execute the KLDR64.BIN
    JMP     0x3000
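
The magic numbers in the listing above are just combinations of individual control bits. The following Python sketch spells them out; the constant names are my own labels, while the bit positions are the ones used in the assembly code.

```python
CR4_PAE = 1 << 5           # Physical Address Extension
CR4_PGE = 1 << 7           # Page Global Enable
EFER_MSR = 0xC0000080      # MSR number of the EFER register
EFER_LME = 1 << 8          # Long Mode Enable
CR0_PE = 1 << 0            # Protection Enable
CR0_PG = 1 << 31           # Paging


def long_mode_values() -> dict[str, int]:
    """Return the values that the switch code writes or ORs in, by register."""
    return {
        "cr4": CR4_PAE | CR4_PGE,        # MOV EAX, 10100000b
        "efer_or": EFER_LME,             # OR  EAX, 0x00000100
        "cr0_or": CR0_PG | CR0_PE,       # OR  EBX, 0x80000001
    }


if __name__ == "__main__":
    for register, value in long_mode_values().items():
        print(f"{register}: {value:#010x}")
```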

Phew, we are now in x64 Long Mode! The code execution continues at the label LongMode, where we first set up the segment registers and the stack. Afterwards we jump to the virtual memory address 0x3000 (physical memory addresses are now gone!) to continue the code execution. That’s the memory location where the KLDR64.BIN file resides. This virtual memory address is identity mapped to the physical memory address 0x3000. The KLDR64.BIN file currently just prints out a welcome message and the information obtained from the BIOS Information Block that we set up earlier. In the next blog posting I will cover in more detail how to write to the screen in x64 Long Mode, because by now you have no access to the BIOS interrupts anymore!

Summary

This was a very long blog posting today, because we have covered a lot of stuff – especially around virtual memory on an x64 system. This knowledge is crucial because we must set up a whole 4-level page table hierarchy to be able to enable the x64 Long Mode. As soon as you are in the x64 Long Mode, your CPU will treat each memory address as a virtual memory address. Therefore, we also have identity mapped the first 2 MB of memory to be able to continue the code execution in x64 Long Mode.

In the next blog posting we will concentrate in more detail on the KLDR64.BIN file, which finally loads the real x64 OS Kernel into memory and executes it. Stay tuned.

Thanks for your time,

-Klaus