Tuesday, December 30, 2025

Building a Complete DOS Emulator in Go

 




The source code of the DOS emulator can be found here:

https://github.com/ms1963/dos-emulator


Introduction and Rationale

When I started my university studies, the most widely used operating system was Microsoft's MS-DOS, though some had serious doubts whether DOS (Disk Operating System) really was an operating system at all. In that era, memory capacity was measured in kilobytes rather than gigabytes. It was hard to live and work with these constraints, but somehow we managed to write resource-efficient programs. As an homage to those times, I wrote a DOS emulator.

The development of a DOS emulator serves multiple important purposes in modern computing. First and foremost, it provides a means to preserve and run legacy software that was developed during the DOS era, which spans from the early 1980s through the mid-1990s. Many businesses, educational institutions, and individuals still possess valuable software written for DOS that cannot run on modern operating systems. By creating a DOS emulator, we enable these programs to execute on contemporary hardware without requiring vintage computers or complex virtualization setups. Additionally, building a DOS emulator serves as an excellent educational project for understanding low-level computer architecture, operating system concepts, and the x86 instruction set. It provides hands-on experience with CPU emulation, memory management, interrupt handling, and file system operations. Furthermore, a DOS emulator written in a modern language like Go offers portability across different platforms, allowing DOS programs to run on Windows, Linux, macOS, and other operating systems without modification. The project also demonstrates how modern programming languages can be used to implement complex system-level software while maintaining readability and maintainability.


Understanding the DOS Environment

Before diving into implementation details, it is essential to understand what DOS actually is and how it operates. DOS, which stands for Disk Operating System, is a single-tasking operating system that provides a command-line interface and basic services for running programs. The most common version was MS-DOS, developed by Microsoft, though compatible versions like PC-DOS and FreeDOS also existed. DOS programs run in real mode on x86 processors, which means they have direct access to memory and hardware without the protection mechanisms found in modern operating systems. The DOS environment consists of several key components that our emulator must replicate. These include the BIOS, which provides basic input and output services through software interrupts, the DOS kernel itself, which offers file management and program execution services, and the command interpreter, typically COMMAND.COM, which provides the user interface. Programs interact with DOS and the BIOS primarily through software interrupts, which are essentially function calls to operating system services. Understanding this interrupt-based architecture is crucial for building an accurate emulator.


Core Components Overview


A complete DOS emulator requires several major components working together in harmony. The first and most fundamental component is the CPU emulator, which must accurately simulate the behavior of an Intel 8086 or compatible processor. This includes implementing all registers, flags, and the instruction set. The second major component is the memory subsystem, which provides the one megabyte address space that DOS programs expect, organized into segments and offsets. The third component is the BIOS emulation layer, which handles hardware-related interrupts for video output, keyboard input, disk operations, and system services. The fourth component is the DOS kernel emulation, which implements file system operations, program loading, and process management through DOS interrupts. The fifth component is the file system interface, which bridges between DOS file operations and the host operating system's file system. Finally, we need a user interface component that provides either a command-line shell or a way to directly execute DOS programs. Each of these components must be carefully designed and implemented to create a functional emulator.


CPU Emulation Architecture

The CPU emulator forms the heart of our DOS emulator and requires meticulous attention to detail. The Intel 8086 processor, which we are emulating, is a sixteen-bit processor with a segmented memory architecture. We must implement all of the processor's registers, which include four general-purpose registers that can be accessed as sixteen-bit values or split into eight-bit high and low bytes. These registers are AX, BX, CX, and DX, where AX serves as the accumulator, BX as the base register, CX as the counter, and DX as the data register. Each of these can be accessed as AL and AH for AX, BL and BH for BX, CL and CH for CX, and DL and DH for DX. Additionally, we need four index and pointer registers: SI for source index, DI for destination index, BP for base pointer, and SP for stack pointer. The processor also has four segment registers: CS for code segment, DS for data segment, ES for extra segment, and SS for stack segment. The instruction pointer IP tracks the current position in the code being executed. Finally, we must implement the FLAGS register, which contains individual bits representing the state of various conditions such as carry, zero, sign, overflow, parity, auxiliary carry, direction, interrupt enable, and trap flags.
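
To make the register description concrete, here is a minimal sketch of how the register file could be modeled in Go. The type name CPU, the accessor names, and the Flag constants are illustrative assumptions and are not taken from the emulator's repository. Only the AX accessors are shown; BX, CX, and DX would follow the same pattern. The flag bit positions are those of the 8086 FLAGS register.

```go
package main

import "fmt"

// Bit masks for the 8086 FLAGS register.
const (
	FlagCarry     uint16 = 1 << 0
	FlagParity    uint16 = 1 << 2
	FlagAuxCarry  uint16 = 1 << 4
	FlagZero      uint16 = 1 << 6
	FlagSign      uint16 = 1 << 7
	FlagTrap      uint16 = 1 << 8
	FlagInterrupt uint16 = 1 << 9
	FlagDirection uint16 = 1 << 10
	FlagOverflow  uint16 = 1 << 11
)

// CPU is a hypothetical register file for this sketch.
type CPU struct {
	AX, BX, CX, DX uint16 // general-purpose registers
	SI, DI, BP, SP uint16 // index and pointer registers
	CS, DS, ES, SS uint16 // segment registers
	IP             uint16 // instruction pointer
	Flags          uint16 // FLAGS register
}

// AL and AH expose the low and high byte of AX.
func (c *CPU) AL() uint8 { return uint8(c.AX) }
func (c *CPU) AH() uint8 { return uint8(c.AX >> 8) }

// SetAL and SetAH replace one byte of AX and keep the other.
func (c *CPU) SetAL(v uint8) { c.AX = c.AX&0xFF00 | uint16(v) }
func (c *CPU) SetAH(v uint8) { c.AX = c.AX&0x00FF | uint16(v)<<8 }

// Flag reports whether the given flag bit is set.
func (c *CPU) Flag(mask uint16) bool { return c.Flags&mask != 0 }

// SetFlag sets or clears the given flag bit.
func (c *CPU) SetFlag(mask uint16, on bool) {
	if on {
		c.Flags |= mask
	} else {
		c.Flags &^= mask
	}
}

func main() {
	cpu := &CPU{AX: 0x1234}
	cpu.SetAH(0x4C)
	cpu.SetFlag(FlagZero, true)
	fmt.Printf("AX=%04X AL=%02X AH=%02X ZF=%v\n", cpu.AX, cpu.AL(), cpu.AH(), cpu.Flag(FlagZero))
}
```

Storing each register as a single sixteen-bit value and deriving the byte halves on demand keeps the two views consistent without any extra bookkeeping.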


Implementing the Instruction Set

The instruction set implementation is perhaps the most labor-intensive part of building a CPU emulator. The 8086 processor supports over one hundred fifty different instructions, each of which must be accurately implemented. These instructions fall into several categories. Data movement instructions include MOV for moving data between registers and memory, PUSH and POP for stack operations, XCHG for exchanging values, and LEA for loading effective addresses. Arithmetic instructions include ADD, SUB, MUL, DIV, INC, and DEC for basic mathematical operations, as well as ADC and SBB for operations with carry. Logical instructions include AND, OR, XOR, NOT, and TEST for bitwise operations. Shift and rotate instructions include SHL, SHR, SAL, SAR, ROL, ROR, RCL, and RCR for bit manipulation. Control flow instructions include JMP for unconditional jumps, various conditional jumps like JZ, JNZ, JE, JNE, JG, JL, and others, as well as CALL and RET for subroutine operations and LOOP instructions for iteration. String instructions include MOVSB, MOVSW, STOSB, STOSW, LODSB, LODSW, CMPSB, CMPSW, SCASB, and SCASW for efficient string and memory block operations. Finally, we have interrupt instructions like INT and IRET, and special instructions like HLT, NOP, and the flag manipulation instructions. Each instruction must be decoded from its binary representation and executed with proper flag updates.
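
The decode-and-execute cycle described above can be sketched with a handful of opcodes. This is a deliberately tiny illustration, not the emulator's actual decoder: the Machine type and its methods are hypothetical, segmentation is left out, and only instructions that leave the flags untouched are handled (MOV r16, imm16; XCHG AX, r16; JMP rel8; HLT). The opcode values and the register encoding order are those of the 8086.

```go
package main

import "fmt"

// Register indices in 8086 encoding order.
const (
	AX = iota
	CX
	DX
	BX
	SP
	BP
	SI
	DI
)

// Machine is a hypothetical, heavily reduced CPU state for this sketch.
type Machine struct {
	Regs   [8]uint16
	IP     uint16
	Mem    [65536]byte // a single 64KB segment; no segmentation here
	Halted bool
}

// fetch8 reads the byte at IP and advances IP.
func (m *Machine) fetch8() uint8 {
	b := m.Mem[m.IP]
	m.IP++
	return b
}

// fetch16 reads a little-endian word at IP and advances IP by two.
func (m *Machine) fetch16() uint16 {
	lo := uint16(m.fetch8())
	hi := uint16(m.fetch8())
	return hi<<8 | lo
}

// Step decodes and executes one instruction.
func (m *Machine) Step() error {
	op := m.fetch8()
	switch {
	case op >= 0xB8 && op <= 0xBF: // MOV r16, imm16
		m.Regs[op-0xB8] = m.fetch16()
	case op >= 0x90 && op <= 0x97: // XCHG AX, r16 (90h is NOP)
		r := op - 0x90
		m.Regs[AX], m.Regs[r] = m.Regs[r], m.Regs[AX]
	case op == 0xEB: // JMP rel8, relative to the next instruction
		rel := int8(m.fetch8())
		m.IP += uint16(int16(rel))
	case op == 0xF4: // HLT
		m.Halted = true
	default:
		return fmt.Errorf("unimplemented opcode %02Xh at IP=%04Xh", op, m.IP-1)
	}
	return nil
}

// Run executes until HLT, an error, or maxSteps instructions.
func (m *Machine) Run(maxSteps int) error {
	for i := 0; i < maxSteps && !m.Halted; i++ {
		if err := m.Step(); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	vm := &Machine{}
	// MOV AX,1234h; MOV BX,5678h; XCHG AX,BX; JMP +1; (skipped byte); HLT
	copy(vm.Mem[:], []byte{0xB8, 0x34, 0x12, 0xBB, 0x78, 0x56, 0x93, 0xEB, 0x01, 0xCC, 0xF4})
	if err := vm.Run(100); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("AX=%04X BX=%04X halted=%v\n", vm.Regs[AX], vm.Regs[BX], vm.Halted)
}
```

A full decoder additionally has to handle prefixes, ModR/M bytes, and flag updates, which is where most of the work lies.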


Memory Management Implementation

The memory subsystem must provide the segmented memory model that DOS programs expect. In the 8086 architecture, memory addresses are calculated using a segment and offset pair. The physical address is computed by shifting the segment value left by four bits and adding the offset, which allows addressing up to one megabyte of memory despite using sixteen-bit registers. Our emulator allocates a byte array of one megabyte to represent the entire addressable memory space. We implement functions to read and write both bytes and words from memory, handling the little-endian byte ordering that x86 processors use. The memory subsystem must also support the Program Segment Prefix, or PSP, which is a 256-byte data structure that DOS creates at the beginning of each program's memory space. The PSP contains important information such as the command line arguments, file handles, and pointers to the environment. When loading a program, we must set up the PSP correctly and initialize the segment registers to point to appropriate locations. The stack grows downward from high memory, and we must ensure that stack operations correctly update the stack pointer and handle stack overflow conditions gracefully.
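
The address calculation and the little-endian word access can be sketched in a few lines of Go. The names Memory, PhysAddr, Read8, Write8, Read16, and Write16 are assumptions made for this illustration. The sketch wraps addresses at one megabyte, as the 8086 with its twenty address lines does, and lets the offset of a word access wrap within its segment.

```go
package main

import "fmt"

// MemSize is the 1MB real-mode address space.
const MemSize = 1 << 20

// Memory is a hypothetical flat backing store for the emulated RAM.
type Memory struct {
	data [MemSize]byte
}

// PhysAddr computes segment*16 + offset and wraps at 1MB.
func PhysAddr(seg, off uint16) uint32 {
	return (uint32(seg)<<4 + uint32(off)) & (MemSize - 1)
}

// Read8 returns the byte at seg:off.
func (m *Memory) Read8(seg, off uint16) uint8 {
	return m.data[PhysAddr(seg, off)]
}

// Write8 stores a byte at seg:off.
func (m *Memory) Write8(seg, off uint16, v uint8) {
	m.data[PhysAddr(seg, off)] = v
}

// Read16 returns the little-endian word at seg:off.
// The offset of the high byte wraps within the segment.
func (m *Memory) Read16(seg, off uint16) uint16 {
	lo := uint16(m.Read8(seg, off))
	hi := uint16(m.Read8(seg, off+1))
	return hi<<8 | lo
}

// Write16 stores a word at seg:off, low byte first.
func (m *Memory) Write16(seg, off uint16, v uint16) {
	m.Write8(seg, off, uint8(v))
	m.Write8(seg, off+1, uint8(v>>8))
}

func main() {
	mem := new(Memory)
	mem.Write16(0x2000, 0x0100, 0xBEEF)
	fmt.Printf("phys=%05X lo=%02X hi=%02X word=%04X\n",
		PhysAddr(0x2000, 0x0100),
		mem.Read8(0x2000, 0x0100),
		mem.Read8(0x2000, 0x0101),
		mem.Read16(0x2000, 0x0100))
}
```

For example, 1234h:0010h maps to physical address 12350h, and FFFFh:0010h wraps around to address zero.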


BIOS Interrupt Implementation

BIOS interrupts provide low-level hardware services that DOS programs rely upon. The most important BIOS interrupt for our purposes is INT 10h, which handles video services. This interrupt supports numerous functions identified by the value in the AH register. Function 0Eh provides teletype output, which writes a character to the screen and advances the cursor. This is the most commonly used function for simple text output. Function 00h sets the video mode, function 02h sets the cursor position, function 03h gets the cursor position, function 06h scrolls the screen up, function 09h writes a character with attributes multiple times, and function 0Fh gets the current video mode. Our implementation maintains a virtual video buffer and cursor position, updating them as programs call these functions and outputting the results to the host system's console. INT 16h provides keyboard services, with function 00h waiting for a keypress and returning the character, function 01h checking if a key is available without waiting, and function 02h getting the keyboard shift status. INT 13h handles disk services, though for our emulator we can provide minimal implementations since DOS handles most file operations. INT 1Ah provides time and date services, returning the system time in various formats. We implement function 00h to get the tick count since midnight, function 02h to get the real-time clock time, and function 04h to get the date.
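
As an illustration of the virtual video buffer, the following sketch handles three INT 10h functions: 02h (set cursor), 03h (get cursor), and 0Eh (teletype output) on an 80x25 text screen. The types Video and Regs and all method names are assumptions for this example. It ignores display pages and attributes, and function 03h returns only the position in DX, not the cursor shape that a real BIOS also reports in CX.

```go
package main

import (
	"fmt"
	"strings"
)

const (
	Cols = 80
	Rows = 25
)

// Regs is a hypothetical subset of the CPU registers.
type Regs struct{ AX, BX, CX, DX uint16 }

// Video is a hypothetical text-mode screen with a cursor.
type Video struct {
	cells    [Rows][Cols]byte
	row, col int
}

// NewVideo returns a screen filled with spaces.
func NewVideo() *Video {
	v := &Video{}
	for r := range v.cells {
		for c := range v.cells[r] {
			v.cells[r][c] = ' '
		}
	}
	return v
}

func clamp(v, hi int) int {
	if v > hi {
		return hi
	}
	return v
}

// Int10 dispatches on AH, as the BIOS video interrupt does.
func (v *Video) Int10(r *Regs) {
	switch ah := uint8(r.AX >> 8); ah {
	case 0x02: // set cursor position: DH = row, DL = column
		v.row = clamp(int(r.DX>>8), Rows-1)
		v.col = clamp(int(r.DX&0xFF), Cols-1)
	case 0x03: // get cursor position into DH/DL
		r.DX = uint16(v.row)<<8 | uint16(v.col)
	case 0x0E: // teletype output of the character in AL
		v.teletype(uint8(r.AX))
	}
}

// teletype writes one character, interprets CR, LF, BS, and BEL,
// and scrolls the screen up when the cursor leaves the last row.
func (v *Video) teletype(ch byte) {
	switch ch {
	case '\r':
		v.col = 0
	case '\n':
		v.row++
	case '\b':
		if v.col > 0 {
			v.col--
		}
	case 0x07: // bell: no sound in this sketch
	default:
		v.cells[v.row][v.col] = ch
		v.col++
		if v.col == Cols {
			v.col = 0
			v.row++
		}
	}
	if v.row == Rows {
		copy(v.cells[:], v.cells[1:])
		for c := range v.cells[Rows-1] {
			v.cells[Rows-1][c] = ' '
		}
		v.row = Rows - 1
	}
}

// Print sends each byte of s through function 0Eh.
func (v *Video) Print(s string) {
	for i := 0; i < len(s); i++ {
		v.Int10(&Regs{AX: 0x0E00 | uint16(s[i])})
	}
}

// Line returns one screen row without trailing spaces.
func (v *Video) Line(row int) string {
	return strings.TrimRight(string(v.cells[row][:]), " ")
}

func main() {
	v := NewVideo()
	v.Print("Hello\r\nDOS")
	r := &Regs{AX: 0x0300}
	v.Int10(r)
	fmt.Printf("%q %q cursor=%04X\n", v.Line(0), v.Line(1), r.DX)
}
```

Rendering the buffer to the host console can then be a separate step, which keeps the interrupt handler independent of the terminal in use.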


DOS Interrupt Implementation

DOS services are accessed through INT 21h, which is the most complex interrupt to implement as it provides dozens of functions. The function number is specified in the AH register, and parameters are passed in other registers. Function 01h reads a character from standard input with echo, function 02h writes a character to standard output from the DL register, function 06h provides direct console I/O with the ability to check for input without waiting, function 07h and 08h read characters without echo, function 09h writes a string terminated by a dollar sign to standard output, and function 0Ah reads a buffered line of input. Function 0Eh selects the current drive, and function 19h gets the current drive. Function 25h sets an interrupt vector, and function 35h gets an interrupt vector. Function 2Ah gets the system date, function 2Ch gets the system time, and function 30h gets the DOS version number. File operations include function 3Ch to create a file, function 3Dh to open a file, function 3Eh to close a file, function 3Fh to read from a file, function 40h to write to a file, function 41h to delete a file, function 42h to seek within a file, function 43h to get or set file attributes, function 47h to get the current directory, function 4Eh to find the first matching file, function 4Fh to find the next matching file, and function 56h to rename a file. Directory operations include function 39h to create a directory, function 3Ah to remove a directory, and function 3Bh to change the current directory. Function 4Ch terminates a program with a return code, and functions 51h and 62h get the Program Segment Prefix address. Each of these functions must be carefully implemented to match DOS behavior while interfacing with the host operating system's file system and console.
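
A small excerpt of such a dispatcher might look as follows. It covers only functions 02h, 09h, 30h, and 4Ch; the types DOS and Regs are hypothetical, output goes to an in-memory buffer instead of the console, and the reported version number 5.0 is an arbitrary choice for this sketch. Signalling unsupported functions with the carry flag and error code 01h is also a simplification.

```go
package main

import (
	"bytes"
	"fmt"
)

// Regs is a hypothetical subset of the CPU state used by INT 21h.
type Regs struct {
	AX, BX, CX, DX uint16
	DS             uint16
	Carry          bool
}

// DOS is a hypothetical kernel state for this sketch.
type DOS struct {
	Mem      []byte       // 1MB of emulated memory
	out      bytes.Buffer // captured standard output
	Exited   bool
	ExitCode uint8
}

// NewDOS allocates the emulated memory.
func NewDOS() *DOS { return &DOS{Mem: make([]byte, 1<<20)} }

// Output returns everything written to standard output so far.
func (d *DOS) Output() string { return d.out.String() }

// Int21 dispatches on the function number in AH.
func (d *DOS) Int21(r *Regs) {
	switch ah := uint8(r.AX >> 8); ah {
	case 0x02: // write the character in DL
		d.out.WriteByte(uint8(r.DX))
	case 0x09: // write the '$'-terminated string at DS:DX
		base := uint32(r.DS) << 4
		off := r.DX
		// Scan at most one segment so a missing '$' cannot loop forever.
		for i := 0; i < 0x10000; i++ {
			ch := d.Mem[(base+uint32(off))&0xFFFFF]
			if ch == '$' {
				break
			}
			d.out.WriteByte(ch)
			off++
		}
	case 0x30: // get DOS version: AL = major, AH = minor
		r.AX = 0x0005 // reports 5.0, an assumption of this sketch
	case 0x4C: // terminate with the return code in AL
		d.Exited = true
		d.ExitCode = uint8(r.AX)
	default: // unsupported function: simplified error signalling
		r.Carry = true
		r.AX = 0x0001
	}
}

func main() {
	d := NewDOS()
	copy(d.Mem[0x12350:], "Hello, DOS!$") // 1234h:0010h
	d.Int21(&Regs{AX: 0x0900, DS: 0x1234, DX: 0x0010})
	d.Int21(&Regs{AX: 0x4C03})
	fmt.Printf("%q exited=%v code=%d\n", d.Output(), d.Exited, d.ExitCode)
}
```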


Program Loading Mechanisms

Loading DOS programs requires understanding two different executable formats: COM files and EXE files. COM files are the simpler format, consisting of raw machine code with no header or relocation information. A COM file is loaded at offset 0100h within its segment, immediately following the PSP. All segment registers are set to point to the same segment, creating a unified 64KB address space. The instruction pointer is set to 0100h, and the stack pointer is set to FFFEh at the top of the segment. Because the 256-byte PSP, the code, the data, and the stack all share this single segment, a COM file must be somewhat smaller than 64KB. EXE files are more complex and begin with a header containing metadata about the program. The header starts with a two-byte signature, the ASCII letters MZ (the little-endian word 5A4Dh; the reversed form ZM, 4D5Ah, is accepted as well), followed by the number of bytes in the last page, the total number of 512-byte pages, the number of relocation entries, the size of the header in paragraphs, minimum and maximum memory allocation requirements, initial stack segment and pointer values, a checksum, initial instruction pointer and code segment values, the offset to the relocation table, and an overlay number. When loading an EXE file, we must read this header, calculate the actual program size, load the program data into memory at an appropriate segment, process any relocation entries by adjusting segment references to account for where the program was actually loaded, and set the segment registers and instruction pointer according to the header values. Each entry in the relocation table is an offset and segment pair, relative to the start of the loaded program image, that points to a word containing a segment reference; the loader adds the actual load segment to that word.
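
The EXE header handling can be sketched with the standard encoding/binary package. The struct mirrors the fourteen header words in the order listed above; the field names, ParseEXEHeader, ImageSize, and Relocate are names chosen for this illustration and not necessarily those used in the repository.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"errors"
	"fmt"
)

// EXEHeader holds the 28-byte MZ header (field names are illustrative).
type EXEHeader struct {
	Signature     uint16 // "MZ" (5A4Dh) or "ZM" (4D5Ah)
	LastPageBytes uint16 // bytes used in the last 512-byte page
	Pages         uint16 // number of 512-byte pages, including the last
	RelocCount    uint16 // number of relocation entries
	HeaderParas   uint16 // header size in 16-byte paragraphs
	MinAlloc      uint16
	MaxAlloc      uint16
	InitSS        uint16
	InitSP        uint16
	Checksum      uint16
	InitIP        uint16
	InitCS        uint16
	RelocTableOff uint16 // file offset of the relocation table
	Overlay       uint16
}

// ParseEXEHeader decodes and validates the header at the start of data.
func ParseEXEHeader(data []byte) (*EXEHeader, error) {
	var h EXEHeader
	if err := binary.Read(bytes.NewReader(data), binary.LittleEndian, &h); err != nil {
		return nil, fmt.Errorf("reading EXE header: %w", err)
	}
	if h.Signature != 0x5A4D && h.Signature != 0x4D5A {
		return nil, errors.New("not an MZ executable")
	}
	if h.ImageSize() < 0 {
		return nil, errors.New("inconsistent sizes in EXE header")
	}
	return &h, nil
}

// ImageSize returns the size of the program image without the header.
func (h *EXEHeader) ImageSize() int {
	total := int(h.Pages) * 512
	if h.LastPageBytes != 0 {
		total -= 512 - int(h.LastPageBytes)
	}
	return total - int(h.HeaderParas)*16
}

// Relocate adds loadSeg to every word named in the relocation table.
// file is the whole EXE file, image the already loaded program image.
func Relocate(file []byte, h *EXEHeader, image []byte, loadSeg uint16) error {
	for i := 0; i < int(h.RelocCount); i++ {
		entry := int(h.RelocTableOff) + i*4
		if entry+4 > len(file) {
			return errors.New("relocation table truncated")
		}
		off := binary.LittleEndian.Uint16(file[entry:])
		seg := binary.LittleEndian.Uint16(file[entry+2:])
		pos := int(seg)*16 + int(off)
		if pos+2 > len(image) {
			return errors.New("relocation points outside the image")
		}
		v := binary.LittleEndian.Uint16(image[pos:])
		binary.LittleEndian.PutUint16(image[pos:], v+loadSeg)
	}
	return nil
}

func main() {
	// A tiny hand-made EXE: 28-byte header, one relocation entry,
	// and a three-byte image "MOV AX, 0010h" whose operand is a segment.
	file := []byte{
		0x4D, 0x5A, 0x23, 0x00, 0x01, 0x00, 0x01, 0x00, 0x02, 0x00,
		0x00, 0x00, 0xFF, 0xFF, 0x00, 0x00, 0x00, 0x01, 0x00, 0x00,
		0x00, 0x00, 0x00, 0x00, 0x1C, 0x00, 0x00, 0x00,
		0x01, 0x00, 0x00, 0x00, // relocation entry: offset 0001h, segment 0000h
		0xB8, 0x10, 0x00, // image
	}
	h, err := ParseEXEHeader(file)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	image := append([]byte(nil), file[int(h.HeaderParas)*16:]...)
	if err := Relocate(file, h, image, 0x2000); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("image size=%d relocated image=% X\n", h.ImageSize(), image)
}
```

The COM case needs none of this: the file is copied to offset 0100h of the chosen segment and execution starts there.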


File System Integration

Integrating with the host operating system's file system presents interesting challenges because DOS file operations must be translated to modern file system calls. DOS uses a handle-based approach where files are opened and assigned a numeric handle, which is then used for all subsequent operations on that file. We maintain a map of DOS file handles to host operating system file objects. Handles zero, one, and two are reserved for standard input, standard output, and standard error respectively. When a DOS program opens a file, we translate the DOS filename to a host path, open the file using the host operating system's file API, assign it a handle number, and return that handle to the program. Read and write operations translate the DOS handle to the host file object and perform the operation. Seeking within files requires converting between DOS seek modes and host seek modes. Directory operations must translate between DOS directory structures and host directory structures. The Disk Transfer Area, or DTA, is a memory structure that DOS uses to return file information during directory searches. We must populate this structure with file attributes, modification time and date, file size, and filename when programs search for files. DOS filenames follow the 8.3 format (up to eight characters, a dot, and an extension of up to three characters), so we may need to truncate or convert long filenames from the host system.
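
The handle bookkeeping could be organized roughly like this. The sketch is an assumption about one reasonable design, not the emulator's actual code: HandleTable and nullFile are hypothetical names, the table size of twenty matches the DOS default per process, and the error codes 04h (too many open files) and 06h (invalid handle) are the standard DOS values. Following the description above, only handles zero to two are pre-assigned.

```go
package main

import (
	"fmt"
	"io"
	"os"
)

const (
	MaxHandles              = 20   // DOS default number of handles per process
	ErrTooManyFiles  uint16 = 0x04 // DOS error: too many open files
	ErrInvalidHandle uint16 = 0x06 // DOS error: invalid handle
)

// HandleTable maps DOS handles to host-side objects.
type HandleTable struct {
	slots [MaxHandles]io.Closer
}

// NewHandleTable pre-assigns handles 0, 1, and 2 to the host's stdio.
func NewHandleTable() *HandleTable {
	tbl := &HandleTable{}
	tbl.slots[0], tbl.slots[1], tbl.slots[2] = os.Stdin, os.Stdout, os.Stderr
	return tbl
}

// Alloc stores f in the lowest free slot and returns its handle.
// The second result is a DOS error code, or 0 on success.
func (tbl *HandleTable) Alloc(f io.Closer) (uint16, uint16) {
	for i := 3; i < MaxHandles; i++ {
		if tbl.slots[i] == nil {
			tbl.slots[i] = f
			return uint16(i), 0
		}
	}
	return 0, ErrTooManyFiles
}

// Get looks up the host object behind a DOS handle.
func (tbl *HandleTable) Get(h uint16) (io.Closer, bool) {
	if int(h) >= MaxHandles || tbl.slots[h] == nil {
		return nil, false
	}
	return tbl.slots[h], true
}

// Close releases a handle and returns a DOS error code, or 0 on success.
// The host's own stdio streams are released but never actually closed.
func (tbl *HandleTable) Close(h uint16) uint16 {
	f, ok := tbl.Get(h)
	if !ok {
		return ErrInvalidHandle
	}
	tbl.slots[h] = nil
	if h > 2 {
		f.Close()
	}
	return 0
}

// nullFile stands in for *os.File in this demonstration.
type nullFile struct{}

func (nullFile) Close() error { return nil }

func main() {
	tbl := NewHandleTable()
	h1, _ := tbl.Alloc(nullFile{})
	h2, _ := tbl.Alloc(nullFile{})
	tbl.Close(h1)
	h3, _ := tbl.Alloc(nullFile{})
	fmt.Println(h1, h2, h3)
}
```

A fixed array with lowest-free-slot allocation mirrors how DOS itself hands out handles, so programs that assume small, reused handle numbers keep working.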


Building the Interactive Shell

The interactive shell provides a user-friendly interface for running DOS programs and managing files. Our shell implements a command prompt that displays the current drive and directory, reads user input, parses commands, and executes them. Built-in commands include DIR to list directory contents in a DOS-style format showing filename, size, date, and time, CD to change the current directory, MD and MKDIR to create directories, RD and RMDIR to remove directories, DEL and ERASE to delete files, TYPE to display file contents, COPY to copy files, REN and RENAME to rename files, ECHO to display messages, DATE to show the current date, TIME to show the current time, MEM to display memory information, CLS to clear the screen, VER to show the emulator version, and EXIT or QUIT to terminate the emulator. Additionally, we implement emulator-specific commands such as DEBUG to toggle debug mode which shows detailed execution information, STEP to enable single-step execution, TRACE to show each instruction as it executes, REGS to display CPU register contents, DUMP to show memory contents in hexadecimal, STACK to display stack contents, STATS to show execution statistics like instruction count and execution time, and DISASM to disassemble memory contents into assembly language. When a user types a filename with a COM or EXE extension, the shell attempts to load and execute that program. The shell must handle errors gracefully and provide helpful error messages when files are not found or operations fail.
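
The first step of the shell loop, deciding what kind of command the user typed, can be sketched as a small classification function. Classify, Kind, and the abbreviated list of built-ins are assumptions for this illustration; the real shell knows more commands and would go on to execute them.

```go
package main

import (
	"fmt"
	"strings"
)

// Kind describes what a command line refers to.
type Kind int

const (
	Empty   Kind = iota // blank input
	Builtin             // a command implemented by the shell itself
	Program             // a COM or EXE file to load and run
	Unknown             // anything else
)

// builtins is an abbreviated list for this sketch.
var builtins = map[string]bool{
	"DIR": true, "CD": true, "MD": true, "RD": true, "DEL": true,
	"TYPE": true, "COPY": true, "REN": true, "ECHO": true, "CLS": true,
	"VER": true, "EXIT": true,
}

// Classify splits a command line into command and arguments and
// determines its kind. The command name is upper-cased because DOS
// commands and filenames are case-insensitive.
func Classify(line string) (Kind, string, []string) {
	fields := strings.Fields(line)
	if len(fields) == 0 {
		return Empty, "", nil
	}
	name := strings.ToUpper(fields[0])
	args := fields[1:]
	switch {
	case builtins[name]:
		return Builtin, name, args
	case strings.HasSuffix(name, ".COM"), strings.HasSuffix(name, ".EXE"):
		return Program, name, args
	}
	return Unknown, name, args
}

func main() {
	for _, line := range []string{"dir *.txt", "hello.com /v", "frobnicate", ""} {
		kind, name, args := Classify(line)
		fmt.Println(kind, name, args)
	}
}
```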


Debugging and Testing Infrastructure

Building a robust DOS emulator requires extensive testing and debugging capabilities. We implement several debugging features to help diagnose problems. Debug mode prints detailed information about each instruction being executed, including the address, instruction name, and register values. Step mode pauses execution after each instruction and waits for user input, allowing careful examination of program behavior. Trace mode logs execution flow without pausing, useful for understanding program behavior over longer sequences. The register display shows all CPU registers and flags in a formatted manner, making it easy to see the current processor state. Memory dump functionality displays memory contents in both hexadecimal and ASCII, allowing inspection of data structures and code. Stack display shows the most recent values pushed onto the stack, helpful for debugging function calls and returns. The disassembler converts machine code back into assembly language mnemonics, making it easier to understand what a program is doing. Statistics tracking counts instructions executed, measures execution time, and calculates instructions per second, providing performance insights. We also implement breakpoint support, allowing execution to pause when reaching specific addresses. Comprehensive logging helps track interrupt calls and their parameters, making it easier to see how programs interact with DOS and BIOS services.
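
A memory dump in the combined hexadecimal and ASCII style described above could be produced as follows. The function name HexDump and the exact line layout are choices made for this sketch; the emulator's DUMP command may format its output differently.

```go
package main

import (
	"fmt"
	"strings"
)

// HexDump formats n bytes starting at seg:off, sixteen per line, as
// "SSSS:OOOO  hex bytes  ASCII". mem must hold the full 1MB address space.
func HexDump(mem []byte, seg, off uint16, n int) string {
	var sb strings.Builder
	for i := 0; i < n; i += 16 {
		fmt.Fprintf(&sb, "%04X:%04X ", seg, off+uint16(i))
		var ascii [16]byte
		for j := 0; j < 16; j++ {
			if i+j >= n {
				// Pad a short last line so the ASCII column stays aligned.
				sb.WriteString("   ")
				ascii[j] = ' '
				continue
			}
			addr := (uint32(seg)<<4 + uint32(off+uint16(i+j))) & 0xFFFFF
			b := mem[addr]
			fmt.Fprintf(&sb, " %02X", b)
			if b >= 0x20 && b < 0x7F {
				ascii[j] = b
			} else {
				ascii[j] = '.' // non-printable bytes are shown as dots
			}
		}
		fmt.Fprintf(&sb, "  %s\n", ascii[:])
	}
	return sb.String()
}

func main() {
	mem := make([]byte, 1<<20)
	copy(mem[0x100:], "Hello, DOS world")
	copy(mem[0x110:], []byte{0xCD, 0x21, 0x00})
	fmt.Print(HexDump(mem, 0x0000, 0x0100, 19))
}
```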


Handling Edge Cases and Compatibility

Achieving good compatibility with real DOS programs requires handling numerous edge cases and subtle behaviors. Flag updates must be precise, as some programs rely on specific flag states after operations. The parity flag, for example, must correctly reflect whether the low byte of a result has an even number of set bits. Arithmetic operations must properly set carry, overflow, auxiliary carry, zero, and sign flags according to x86 specifications. String operations with the REP prefix must correctly decrement CX and check for zero, and REPZ and REPNZ must also check the zero flag. Segment wraparound must be handled correctly when addresses exceed 64KB boundaries. Stack operations must properly handle stack overflow and underflow conditions. Interrupt handling must save and restore flags correctly, and IRET must pop flags, code segment, and instruction pointer in the correct order. File operations must handle DOS-style line endings with carriage return and line feed, and text mode versus binary mode distinctions. Path handling must support both forward slashes and backslashes, handle drive letters correctly, and support relative and absolute paths. Case insensitivity must be maintained for filenames and commands, as DOS is case-insensitive. Memory allocation must prevent programs from accessing memory outside their allocated space while still allowing the freedom that DOS programs expect.


Performance Optimization Strategies

While correctness is paramount, performance is also important for a usable emulator. Several optimization strategies can significantly improve execution speed. Instruction decoding can be optimized by caching decoded instructions rather than re-decoding the same code repeatedly. A simple instruction cache indexed by memory address can provide substantial speedups for loops and frequently executed code. Register access can be optimized by using native integer types rather than constantly masking and shifting bits. Memory access patterns can be optimized by using direct array indexing rather than function calls for simple reads and writes. Interrupt dispatch can use a jump table or switch statement rather than a long chain of if-else statements. Flag calculations can be optimized by only computing flags that are actually used, though this requires careful analysis of which instructions affect which flags and which subsequent instructions test those flags. String operations can be optimized by implementing them with native loops rather than simulating individual instructions. The REP prefix can execute multiple iterations in a single step rather than decoding the same instruction repeatedly. Just-in-time compilation techniques could theoretically translate x86 code to native code, though this adds significant complexity. Profiling the emulator helps identify bottlenecks and guides optimization efforts toward the areas that will provide the most benefit.
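
The jump-table idea for interrupt dispatch is easy to express in Go with an array of function values indexed by the vector number. Dispatcher, Handler, and Regs are hypothetical names for this sketch, and it only covers interrupts serviced by the emulator itself, not handlers that a DOS program installs in the interrupt vector table.

```go
package main

import "fmt"

// Regs is a hypothetical subset of the CPU registers.
type Regs struct{ AX, BX, CX, DX uint16 }

// Handler services one software interrupt.
type Handler func(r *Regs)

// Dispatcher holds one slot per interrupt vector, so a lookup is a
// single array index instead of a chain of comparisons.
type Dispatcher struct {
	table [256]Handler
}

// Register installs h for the given vector.
func (d *Dispatcher) Register(vector uint8, h Handler) {
	d.table[vector] = h
}

// Dispatch calls the handler for vector, or reports that none exists.
func (d *Dispatcher) Dispatch(vector uint8, r *Regs) error {
	h := d.table[vector]
	if h == nil {
		return fmt.Errorf("unhandled interrupt %02Xh", vector)
	}
	h(r)
	return nil
}

func main() {
	d := &Dispatcher{}
	d.Register(0x21, func(r *Regs) { r.AX = 0x0005 }) // stand-in for a DOS service
	regs := &Regs{AX: 0x3000}
	fmt.Println(d.Dispatch(0x21, regs), regs.AX)
	fmt.Println(d.Dispatch(0x33, regs))
}
```

The same pattern extends to a second table indexed by AH for the many functions behind INT 21h.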


Error Handling and Robustness

A production-quality emulator must handle errors gracefully and provide useful feedback. Invalid opcodes should be detected and reported rather than causing crashes or undefined behavior. Memory access violations should be caught and reported with the address and context. Stack overflow and underflow should be detected and handled appropriately. File operation errors from the host operating system should be translated into appropriate DOS error codes and returned to programs. Division by zero should trigger the appropriate interrupt rather than crashing the emulator. Invalid interrupt numbers should be handled gracefully. Malformed EXE headers should be detected during program loading and reported clearly. Resource exhaustion, such as running out of file handles, should be managed properly. The emulator should never crash due to program behavior, as DOS programs expect to have full control and may intentionally or accidentally perform invalid operations. Logging and error reporting should provide enough context to diagnose problems, including the instruction address, register state, and operation being attempted. Recovery mechanisms should allow the emulator to continue running or shut down cleanly even when errors occur.


Extending and Enhancing the Emulator

Once the basic emulator is functional, numerous enhancements can improve its capabilities and usability. Graphics support could be added by implementing VGA or EGA graphics modes, allowing graphical DOS programs to run. Sound support could emulate the PC speaker or Sound Blaster, enabling games and multimedia applications. Mouse support through INT 33h would allow programs that use a mouse to function properly. Extended memory and expanded memory support would enable programs that require more than 640KB of conventional memory. Disk image support could allow mounting floppy disk images or hard disk images, providing a more authentic DOS environment. Network support could enable DOS networking applications to function. Save state functionality could allow saving and restoring the entire emulator state, useful for debugging or preserving program state. Scripting support could automate testing or program execution. Configuration files could allow customizing emulator behavior, memory size, or available drives. Plugin architecture could allow extending the emulator with custom interrupt handlers or hardware emulation. Performance profiling could identify hot spots in emulated programs. Code coverage analysis could help with testing. Integration with modern development tools could enable using the emulator as part of a development workflow.


Conclusion and Future Directions

Building a DOS emulator is a substantial undertaking that provides deep insights into computer architecture, operating systems, and software engineering. The project demonstrates how modern programming languages like Go can be used to implement complex system software while maintaining clarity and portability. A functional DOS emulator preserves access to legacy software, provides an educational tool for learning about computer systems, and offers a platform for retro computing enthusiasts. The implementation requires careful attention to detail in CPU emulation, accurate interrupt handling, proper file system integration, and robust error handling. While the basic emulator provides core functionality, numerous opportunities exist for enhancement and optimization. The skills developed while building such an emulator transfer to many other domains, including virtual machine implementation, operating system development, compiler construction, and embedded systems programming. As computing continues to evolve, emulators like this ensure that the software heritage of earlier eras remains accessible and functional, allowing future generations to experience and learn from the programs that shaped the development of personal computing. The complete source code and documentation for this emulator serve as both a functional tool and an educational resource for anyone interested in low-level programming, computer architecture, or software preservation.
