Add an implementation of thread local storage. This uses a container
that is pointed to by the otherwise-unsed `$tp` register. This container
is allocated on-demand, so threads that use no TLS will not allocate
this extra memory.
Signed-off-by: Sean Cross <sean@xobs.io>