Parallelize analyzeProgram
Summary:
A few analysis steps needed cleaning up, and a few locks were
required, but it was mostly ready to go.
To help find places we were modifying global state in the
AnalysisResult, I changed the AnalysisResultPtr argument of
Construct::analyzeProgram to an AnalysisResultConstPtr. Once done,
this only decreased the time for analyze all by about 15% on a 56 core
machine, so clearly something was not parallelizing well.
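
As a rough illustration of why the const pointer helps find global-state
writes (a minimal sketch, not the real HHVM classes: the AnalysisResult
struct, its addWarning/warningCount members, and this free-function
analyzeProgram are all hypothetical stand-ins), switching the parameter
to a pointer-to-const turns any mutating call into a compile error:

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <type_traits>
#include <vector>

// Hypothetical stand-in for the real AnalysisResult.
struct AnalysisResult {
  std::vector<std::string> warnings;
  // Non-const member: mutates shared state.
  void addWarning(const std::string& w) { warnings.push_back(w); }
  // Const member: read-only, fine to call from many threads at once.
  std::size_t warningCount() const { return warnings.size(); }
};

using AnalysisResultPtr = std::shared_ptr<AnalysisResult>;
using AnalysisResultConstPtr = std::shared_ptr<const AnalysisResult>;

// The const pointer type really does point at a const object.
static_assert(
    std::is_const<AnalysisResultConstPtr::element_type>::value,
    "AnalysisResultConstPtr must point to const");

// Hypothetical analysis step taking the const pointer.
std::size_t analyzeProgram(AnalysisResultConstPtr ar) {
  // ar->addWarning("oops");  // would not compile: addWarning is non-const
  return ar->warningCount();  // read-only access is still allowed
}
```

Callers holding an AnalysisResultPtr can still pass it unchanged, since
shared_ptr<T> converts implicitly to shared_ptr<const T>; only the
callee loses write access, so each compile error marks a write that
needs a lock or a redesign.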
Running perf showed there was no lock contention, and very little
context switching; but all the time was being spent in
AnalysisResult::analyzeProgram(ConstructPtr), which is really just a
recursive walk calling Construct::analyzeProgram as it goes.
A while ago ricklavoie had mentioned that passing shared_ptrs around was
very expensive; I'd experimented with that a long time ago, and after
one huge win (replacing vectors of shared_ptrs that got copied around
a lot with vectors of hphp_raw_ptrs), was unable to get any
others. But based on perf, I rewrote all the getNthKid functions to
return ConstructRawPtr and analyzeProgram to take a ConstructRawPtr,
and it made no difference at all.
Then I realized the problem is with the AnalysisResultPtr we were
passing around. Most of the shared pointers don't really matter; it's a
bit more expensive, but only one thread is using any given pointer at
any given time so it's only an extra couple of memory operations. But
the AnalysisResultPtr is global; every single thread is doing atomic
incs and decs on the same shared pointer's reference count, so almost
all of them miss in the cache. So I changed all the
AnalysisResultConstPtrs that I'd just
added to AnalysisResultConstRawPtrs, and got a 6x win.
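
To make the refcount traffic concrete, here is a single-threaded sketch
(assumptions: AnalysisResult is an empty stand-in, and both walk
functions are hypothetical, not the real recursive walk). Passing the
shared pointer by value bumps the shared count on every call and drops
it on every return; with many threads those become atomic operations on
one contended cache line. The raw-pointer version never touches the
count:

```cpp
#include <algorithm>
#include <cassert>
#include <memory>

// Hypothetical stand-in for the real AnalysisResult.
struct AnalysisResult {};

using AnalysisResultConstPtr = std::shared_ptr<const AnalysisResult>;
using AnalysisResultConstRawPtr = const AnalysisResult*;

// By-value shared_ptr: each call copies the pointer, which increments
// the shared reference count on entry and decrements it on exit.
// Returns the largest use_count seen during the recursion.
long maxUseCountByValue(AnalysisResultConstPtr ar, int depth) {
  long here = ar.use_count();
  if (depth == 0) return here;
  return std::max(here, maxUseCountByValue(ar, depth - 1));
}

// Raw pointer: the recursion passes a plain address, so the reference
// count is never modified. The `owner` reference exists only so this
// demo can observe the count; a real walk would not need it.
long maxUseCountRaw(AnalysisResultConstRawPtr ar,
                    const AnalysisResultConstPtr& owner,
                    int depth) {
  long here = owner.use_count();
  if (depth == 0) return here;
  return std::max(here, maxUseCountRaw(ar, owner, depth - 1));
}
```

With a single owner and depth 3, the by-value walk observes a count of
5 (the owner plus one copy per active frame), while the raw walk only
ever sees 1. The raw pointer is safe here only because the owner
outlives the whole walk.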
There's clearly more available - it should have been a 30x win, but
I'll do that in a followup (which will probably help preOptimize too).
Reviewed By: alexeyt
Differential Revision: D5619284
fbshipit-source-id: b1c8f67f4b6b4a88164997dc379d238b218c59bb