A couple of avenues were explored that could be of interest in certain circumstances.
Using mmap(2) instead of brk(2) was actually slower, since brk(2) already knows many of the things that mmap(2) has to find out first.
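To make the two approaches concrete, here is a minimal sketch of how an allocator might obtain fresh pages either way. The helper names pages_from_brk and pages_from_mmap are hypothetical and are not taken from the implementation discussed here. The sketch only shows the shape of the two calls and says nothing about their relative cost on any particular kernel.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE          /* exposes sbrk() and MAP_ANONYMOUS on glibc */
#endif
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MAP_ANONYMOUS
#define MAP_ANONYMOUS MAP_ANON   /* older BSD spelling */
#endif

/* Hypothetical helper: extend the data segment by `len` bytes.
 * Returns the start of the new region, or NULL on failure.
 * brk/sbrk only adjusts the end of the already existing data segment. */
static void *pages_from_brk(size_t len)
{
    void *p = sbrk((intptr_t)len);
    return p == (void *)-1 ? NULL : p;
}

/* Hypothetical helper: request a fresh anonymous mapping of `len` bytes.
 * Returns NULL on failure.
 * Here the kernel has to choose an address range and create a new mapping. */
static void *pages_from_mmap(size_t len)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}
```

Memory from pages_from_mmap can be returned with munmap(2). Memory from pages_from_brk can only be given back by lowering the break again, which is one reason an allocator may treat the two sources differently.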
In general there is little room left to improve the time overhead of malloc itself; further gains will have to come from improving paging behaviour.
It is still under consideration to add a feature whereby calling realloc with two zero arguments causes the internal allocations to be reallocated, in effect performing a garbage collection. Certain types of programs could use this to collapse their memory use, but so far it doesn't seem to be worth the effort.
Calls to malloc and free can be a significant point of contention in multi-threaded programs. Fine-grained locking of the data structures inside the implementation should be added to avoid excessive spin-waiting.
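As an illustration of what fine-grained locking could look like, the following sketch gives each size class its own lock, so that threads working with different sizes do not wait on each other. Everything in it is an assumption made for the example: the names (bucket_alloc, bucket_free), the eight power-of-two size classes, the use of POSIX mutexes rather than spin locks, the refill from the system malloc, and the requirement that the caller passes the size back on free. It is not the locking scheme of the implementation described here.

```c
#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>

#define NBUCKETS  8   /* hypothetical size classes: 16, 32, ..., 2048 bytes */
#define MIN_SHIFT 4   /* smallest class is 1 << 4 = 16 bytes */

struct chunk { struct chunk *next; };

struct bucket {
    pthread_mutex_t lock;      /* protects only this bucket's free list */
    struct chunk *free_list;
};

static struct bucket buckets[NBUCKETS];
static pthread_once_t buckets_once = PTHREAD_ONCE_INIT;

static void buckets_init(void)
{
    for (int i = 0; i < NBUCKETS; i++) {
        pthread_mutex_init(&buckets[i].lock, NULL);
        buckets[i].free_list = NULL;
    }
}

/* Map a request size to a bucket index, or -1 if it is too large. */
static int bucket_index(size_t size)
{
    size_t cap = (size_t)1 << MIN_SHIFT;
    for (int i = 0; i < NBUCKETS; i++, cap <<= 1)
        if (size <= cap)
            return i;
    return -1;
}

/* Allocate from the matching bucket; only that bucket's lock is taken. */
static void *bucket_alloc(size_t size)
{
    int i = bucket_index(size);
    if (i < 0)
        return NULL;           /* large requests are out of scope here */
    pthread_once(&buckets_once, buckets_init);

    struct bucket *b = &buckets[i];
    pthread_mutex_lock(&b->lock);
    struct chunk *c = b->free_list;
    if (c != NULL)
        b->free_list = c->next;
    pthread_mutex_unlock(&b->lock);

    if (c == NULL)             /* empty list: refill from the system allocator */
        c = malloc((size_t)1 << (MIN_SHIFT + i));
    return c;
}

/* Return a chunk to its bucket; `size` must match the original request. */
static void bucket_free(void *p, size_t size)
{
    int i = bucket_index(size);
    if (p == NULL || i < 0)
        return;
    pthread_once(&buckets_once, buckets_init);

    struct bucket *b = &buckets[i];
    struct chunk *c = p;
    pthread_mutex_lock(&b->lock);
    c->next = b->free_list;    /* push onto the front of the list */
    b->free_list = c;
    pthread_mutex_unlock(&b->lock);
}
```

With a single global lock, two threads allocating 24 and 1000 bytes would serialise. Here they take different locks and proceed independently, at the cost of one mutex per size class.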