Optimize `needless_doctest_main`, make it short-circuit, make sure that
we don't spin up a new compiler on EVERY code block.
---
The old implementation was creating a new compiler, new parser, new
thread, new SessionGlobals, new everything for each code block. No
matter if they actually didn't even contain `fn main()` or anything
relevant.
On callgrind, seems that we're reducing about a 6.7242% de cycle count
(which turns out to be a 38 million instruction difference, great!).
Benchmarked in `bumpalo-3.16.0`. Also on bumpalo we spawn 78 less
threads. This moves `SessionGlobals::new` from the top time-consuming
function by itself in some benchmarks, into one not even in the top 500.
Also, populate the test files.
changelog:[`needless_doctest_main`]: Avoid spawning so many threads in
unnecessary circumstances