tub -- evade TASK_UNMAPPED_BASE
for large dynamic arrays with shared libs under Linux on
February 18, 2005
author: jreiser BitWagon com
If an app wants to use large dynamic arrays (up to 2.8GB or so), then until
recently the default behavior of many Linux distributions on x86 forced
the app to be linked statically, without dynamic linking or shared libraries.
When dynamic linking is used, the first mmap(0, ...) gets
assigned to a fixed address TASK_UNMAPPED_BASE, which is (TASK_SIZE
/ 3) in linux/include/asm-i386/processor.h. Typically
TASK_SIZE is 3GB (0xc0000000), so TASK_UNMAPPED_BASE
is 1GB (0x40000000), which leaves something less than 2GB as the
largest chunk of contiguous address space for dynamic user data. Although
it is nearly trivial to make task_unmapped_base part of inherited
per-process state controlled by setrlimit() and getrlimit(),
Linus has not done so. Even if Linus' source were changed tomorrow,
it would still be a year or two before an application reasonably could rely
on this feature being present in an arbitrary installation. Hardware
with 64-bit address space is still somewhat expensive and uncommon.
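The address arithmetic above can be checked with a few lines of C. This is only a sketch of the numbers quoted in the text; the helper max_span_above_base is a hypothetical name, and the real limit is somewhat lower because the stack and other mappings also live above the base.

```c
#include <stdint.h>

/* Value quoted in the text for the usual 3GB user address space on x86. */
#define TASK_SIZE 0xc0000000u

/* Mirrors the (TASK_SIZE / 3) rule described in the text. */
static uint32_t task_unmapped_base(uint32_t task_size)
{
    return task_size / 3;
}

/* Hypothetical helper: upper bound on one contiguous region that starts
 * at the mmap base. The stack and other mappings reduce it further. */
static uint32_t max_span_above_base(uint32_t task_size)
{
    return task_size - task_unmapped_base(task_size);
}
```

With TASK_SIZE at 0xc0000000 the base comes out at 0x40000000 and the span above it at 0x80000000, which is the "something less than 2GB" mentioned above.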
tub is a user-mode hack which works around the problem on today's
Linux for x86. When linked into the application, tub intercepts
every mmap(0, ...), chooses from its master list the page frames to
be used, and calls the kernel with mmap(frame_address,,,MAP_FIXED,,).
In effect, tub allows the application to set task_unmapped_base
to any address. Setting task_unmapped_base = brk(0); allows
for maximum contiguous address space. Shared libraries and dynamic linking
can still be used because tub intercepts even the mmap() calls performed
by the dynamic linker ld-linux.so.2. Prelinked shared libraries
work OK. The prelinking is ignored, and the library is mapped into
the page range with the lowest available addresses. As long as the
.so was compiled with -fpic, read-only pages are shared
just as much as before. Pages with relocation (including the _GLOBAL_OFFSET_TABLE_)
probably are shared less than usual, because the prelinked relocated values
must be relocated again.
tub consists of about 2.8KB compiled code written in C and assembler,
plus some link-time scripts. The link-time scripts make the app look
like it has no PT_INTERP, and change the entry point to be inside
tub code. Because the on-disk app has no PT_INTERP,
execve() starts the process at Elf32_Ehdr.e_entry,
instead of at the entry to the program interpreter /lib/ld-linux.so.2.
Upon entry at runtime, tub changes the AT_ENTRY
to _start, reverts the current process image to having a PT_INTERP,
and maps the program interpreter itself. tub arranges to intercept
all calls from the program interpreter to mmap/mmap64/munmap,
restores the stack, and then jumps to the entry point of the program interpreter.
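Two of the startup steps above, locating PT_INTERP among the program headers and changing AT_ENTRY in the auxiliary vector, might look roughly like the following. This is a sketch that uses only the types from <elf.h>; the function names are hypothetical, and how tub actually hides and restores PT_INTERP is not shown here.

```c
#include <elf.h>
#include <stdint.h>

/* Return the index of the PT_INTERP program header, or -1 if there is none. */
static int find_pt_interp(const Elf32_Phdr *phdr, int phnum)
{
    for (int i = 0; i < phnum; ++i) {
        if (phdr[i].p_type == PT_INTERP)
            return i;
    }
    return -1;
}

/* Walk an auxiliary vector up to AT_NULL and overwrite the AT_ENTRY value.
 * Returns 1 if an AT_ENTRY item was found and patched, else 0. */
static int patch_at_entry(Elf32_auxv_t *auxv, uint32_t new_entry)
{
    for (; auxv->a_type != AT_NULL; ++auxv) {
        if (auxv->a_type == AT_ENTRY) {
            auxv->a_un.a_val = new_entry;
            return 1;
        }
    }
    return 0;
}
```

The program interpreter reads AT_ENTRY to learn where to transfer control after relocation, which is presumably why it has to name _start before the interpreter runs.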
Any successful mmap/mmap64 with PROT_EXEC,
MAP_PRIVATE, and !MAP_ANONYMOUS is scanned for further instances
of mmap/mmap64/munmap, which are also intercepted.
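The test for which mappings get scanned can be written as a small predicate. The sketch below also shows one plausible first step of a scan, finding the two-byte encoding of int $0x80 (0xcd 0x80); that step is an assumption for illustration, since the actual heuristic tub uses is not spelled out here, and both function names are hypothetical.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/* Nonzero when a successful mapping meets the three conditions in the text:
 * PROT_EXEC, MAP_PRIVATE, and not MAP_ANONYMOUS. */
static int should_scan_mapping(int prot, int flags)
{
    return (prot & PROT_EXEC) && (flags & MAP_PRIVATE) && !(flags & MAP_ANONYMOUS);
}

/* Assumed scan step: offset of the next "int $0x80" at or after start,
 * or -1 if none is found in code[0..len-1]. */
static long find_int80(const unsigned char *code, size_t len, size_t start)
{
    for (size_t i = start; i + 1 < len; ++i) {
        if (code[i] == 0xcd && code[i + 1] == 0x80)
            return (long)i;
    }
    return -1;
}
```

A byte match alone can hit data or the middle of another instruction, which is one reason such detection stays heuristic.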
For example, the interception for mmap looks like:
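mmap:   # as in ld-linux.so.2 or libc.so.6
        mov  ...            # 5-byte instruction
        int  $0x80          # or call *%gs:0x10
        cmp  ...            # 5-byte instruction
mmap:   # as rewritten by tub during execution
        call __pre_mmap     # replaces the mov
        int  $0x80          # or call *%gs:0x10
        call __post_mmap    # replaces the cmp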
Each intercepting call takes 5 bytes, the same as the overwritten
mov and cmp. The assembly-language routines __pre_mmap
and __post_mmap handle scratch register contents and processor flags,
then call corresponding C-language routines tub_pre_mmap and tub_post_mmap.
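The 5-byte size follows from the x86 near call encoding: opcode 0xe8 plus a 32-bit little-endian displacement measured from the end of the call instruction. A minimal sketch of building such a call is given below; encode_call is a hypothetical helper, and real patching would also need the target page to be writable at the time.

```c
#include <stdint.h>

enum { CALL_LEN = 5 };   /* 1 opcode byte + 4 displacement bytes */

/* Encode "call target" into buf, as if the instruction sat at address site.
 * The displacement is relative to the next instruction (site + 5) and
 * wraps modulo 2^32, so backward calls work too. */
static void encode_call(unsigned char buf[CALL_LEN], uint32_t site, uint32_t target)
{
    uint32_t rel = target - (site + CALL_LEN);
    buf[0] = 0xe8;
    buf[1] = (unsigned char)(rel);
    buf[2] = (unsigned char)(rel >> 8);
    buf[3] = (unsigned char)(rel >> 16);
    buf[4] = (unsigned char)(rel >> 24);
}
```

Because the replacement is exactly as long as the instruction it overwrites, nothing after it in the function body has to move.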
By taking care with the subroutine linkage conventions (arguments on
stack [by value-result] and in registers, and return value), everything just
fits. tub_pre_mmap looks for argument values that should be
changed, consults a bitmap of free pages, changes the addr to be
the desired frame, and ORs MAP_FIXED into flags.
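The argument rewriting described above can be sketched in portable C. Everything here is illustrative: the toy arena size, the base address 0x10000000, the first-fit search, and the name sketch_pre_mmap are assumptions, not tub's actual data structures or signatures.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

enum { SK_PAGE_SHIFT = 12, SK_NPAGES = 64 };        /* toy sizes */

static uint32_t sk_arena_base = 0x10000000u;         /* hypothetical start of managed range */
static unsigned char sk_used[SK_NPAGES / 8];         /* 1 bit per page; 1 = in use */

static int  sk_page_used(size_t i) { return (sk_used[i / 8] >> (i % 8)) & 1; }
static void sk_page_mark(size_t i) { sk_used[i / 8] |= (unsigned char)(1u << (i % 8)); }

/* First-fit search for n consecutive free pages; returns first index or -1. */
static long sk_find_free_run(size_t n)
{
    size_t run = 0;
    for (size_t i = 0; i < SK_NPAGES; ++i) {
        run = sk_page_used(i) ? 0 : run + 1;
        if (run == n)
            return (long)(i + 1 - n);
    }
    return -1;
}

/* If the caller passed addr == 0 without MAP_FIXED, pick frames from the
 * bitmap, store the chosen address, and OR MAP_FIXED into the flags.
 * Returns 1 if the arguments were changed, 0 if they were left alone. */
static int sketch_pre_mmap(uint32_t *addr, size_t len, int *flags)
{
    if (*addr != 0 || (*flags & MAP_FIXED) || len == 0)
        return 0;
    size_t n = (len + ((size_t)1 << SK_PAGE_SHIFT) - 1) >> SK_PAGE_SHIFT;
    long first = sk_find_free_run(n);
    if (first < 0)
        return 0;
    for (size_t i = 0; i < n; ++i)
        sk_page_mark((size_t)first + i);
    *addr = sk_arena_base + ((uint32_t)first << SK_PAGE_SHIFT);
    *flags |= MAP_FIXED;
    return 1;
}
```

Since MAP_FIXED silently replaces whatever was mapped at the requested address, the bitmap has to be the single source of truth about which frames are free.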
As of 2004-04-24, tub has been enhanced to work with glibc-2.3.2
and NPTL, and with ld-linux.so.2 "over-mapping" an executable file by including
.bss in the first mmap (in order to guarantee address-space reservations
when there are "holes" in the new PT_LOAD or the existing address space).
Also, version 0.94 fixed bugs in handling mmap64() and exec-shield
(random placement by the Linux kernel of individual mmap() requests
that do not specify MAP_FIXED).
Version 0.95 (2005-02-05) handles mremap(), and accommodates some
quirks of gcc 3.3.1-2mdk and the #include files of
Version 0.96 (2005-02-16) fixes a SIGBUS that happened with some modules
such as libpthread-0.10.so which have a large .bss.
Version 0.97 (2005-02-18) makes tub more robust by removing some dependencies
on the particular code generated by differing versions of gcc.
Version 0.98 (2008-07-16) adapts to evolution of elf.h and Linux 2.6.24.
Detecting the body of mmap/mmap64/munmap
in newly-mapped pages is heuristic and not as robust as it could be.
The allocator for page frames is thread-safe and reasonably efficient;
it spin-waits during thread-to-thread contention.
The allocator also detects re-entrant use by a signal handler.
In theory such re-entrance could be handled, but doing so is too complex,
so the current implementation prints a message on stderr and aborts.
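One way to combine a spin lock with detection of re-entrance is sketched below using C11 atomics and a per-thread flag. This is a swapped-in technique for illustration, not necessarily how tub does it; alloc_enter and alloc_leave are hypothetical names, and the caller is expected to print its message and abort when alloc_enter returns -1.

```c
#include <signal.h>
#include <stdatomic.h>

static atomic_flag alloc_lock = ATOMIC_FLAG_INIT;

/* Nonzero while this thread is inside the allocator (waiting or holding). */
static _Thread_local volatile sig_atomic_t in_allocator;

/* Returns 0 once the lock is held, or -1 if this thread was already inside
 * the allocator, e.g. a signal handler interrupted it and called back in. */
static int alloc_enter(void)
{
    if (in_allocator)
        return -1;
    in_allocator = 1;   /* set before spinning so a handler never spins on us */
    while (atomic_flag_test_and_set_explicit(&alloc_lock, memory_order_acquire))
        ;               /* spin wait during thread-to-thread contention */
    return 0;
}

static void alloc_leave(void)
{
    atomic_flag_clear_explicit(&alloc_lock, memory_order_release);
    in_allocator = 0;
}
```

Setting the flag before taking the lock errs on the side of reporting re-entrance, which avoids the case where a handler would spin forever on a lock held by the code it interrupted.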
Of course, doing an explicit mmap (or any system call)
in a signal handler is a dubious idea.
However, *printf() buffering typically uses mmap.
So, establish buffering (or no buffering) by calling setbuf,
setbuffer, setlinebuf or setvbuf for the FILE
before enabling the handler.
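The ordering recommended above might look like this: fix the buffering of the FILE first, then enable the handler. The sketch uses a static buffer so that stdio does not need to allocate one later; the names out_buf, on_signal, and buffer_then_handle are hypothetical, and the handler body is only a placeholder.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stdio.h>
#include <string.h>

static char out_buf[BUFSIZ];   /* static storage, so stdio need not allocate */

static void on_signal(int sig)
{
    (void)sig;                 /* placeholder handler */
}

/* Establish buffering for fp, then install the handler for signo.
 * Returns 0 on success, -1 on failure. */
static int buffer_then_handle(FILE *fp, int signo)
{
    if (setvbuf(fp, out_buf, _IOFBF, sizeof out_buf) != 0)
        return -1;

    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_signal;
    sigemptyset(&sa.sa_mask);
    return sigaction(signo, &sa, NULL);
}
```

A call such as buffer_then_handle(stdout, SIGUSR1) early in main(), before any output on that stream, keeps the first *printf() from having to set up a buffer after the handler is live.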