tdf#109042 : Add support for multivariate regression...
to regression tool. This means we now support more than
one X variable (independent variable). One caveat is that
all X variable observations need to be adjacent
to each other in the same table. For example, if data is
grouped by columns, a valid organization of X variables
looks like :-

              X Variables ---->
    A        B        C       ...
  XVar1    XVar2    XVar3     ...    XVarN         |
   0.1      0.45     0.32     ...            Observations
   0.34     0.23     0.54     ...                  |
   0.23     0.56     0.90     ...                  |
   0.32     0.11     0.78     ...                  V

This patch also gives our regression tool output a
structure similar to that of Excel and Gnumeric. This
means more statistical measures are added, including
confidence intervals for all parameter estimates.
We already have support for Logarithmic and Power regression
in addition to plain Linear regression. This patch's
multivariate support extends to all of these types of
regressions.
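
As a rough illustration of how all three regression types can share one
multivariate code path, the sketch below reduces Logarithmic and Power
regression to a linear fit on transformed data. The model forms in the
docstring are the usual textbook ones and are an assumption here, and the
function name `linearize` is hypothetical; neither is taken from the patch.

```python
import math

def linearize(kind, xs, y):
    """Map one observation (xs, y) to the values used in a plain linear fit.

    Assumed model forms (not taken from the patch itself):
      linear:      y = a + b1*x1 + ... + bn*xn
      logarithmic: y = a + b1*ln(x1) + ... + bn*ln(xn)
      power:       y = a * x1^b1 * ... * xn^bn
                   -> ln(y) = ln(a) + b1*ln(x1) + ... + bn*ln(xn)
    """
    if kind == "linear":
        return list(xs), y
    if kind == "logarithmic":
        return [math.log(x) for x in xs], y
    if kind == "power":
        return [math.log(x) for x in xs], math.log(y)
    raise ValueError(f"unknown regression type: {kind}")

# One observation with two X variables, generated from y = 2 * x1^3 * x2^0.5.
xs = [2.0, 4.0]
y = 2.0 * xs[0] ** 3 * xs[1] ** 0.5
lx, ly = linearize("power", xs, y)

# In log space the observation satisfies the linear model exactly.
assert abs(ly - (math.log(2.0) + 3 * lx[0] + 0.5 * lx[1])) < 1e-12
```

With this view, supporting N X variables only has to be solved once for the
linear case; the other two types differ only in how inputs are transformed.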
Earlier, all regression statistics were computed separately
from scratch, which mostly meant computing the same regression
multiple times. This would slow things down if the
data-set being analysed was big. This is no longer the case,
as we use the LINEST() formula. LINEST() provides all
the statistics needed in regression analysis, so
it is called just once and its output components are
referenced to compute the other (derived) statistics.
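
The "fit once, derive the rest" idea can be sketched outside of Calc as
follows. This is only an illustration in plain Python, not the code in the
patch: `solve`, `fit_once` and `derived_stats` are hypothetical names, and
the fit uses ordinary least squares via the normal equations as a stand-in
for what LINEST() returns.

```python
def solve(a, b):
    """Solve a*x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def fit_once(rows, ys):
    """Fit y = b0 + b1*x1 + ... + bn*xn a single time and keep its outputs."""
    design = [[1.0] + list(r) for r in rows]
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(k)]
    coef = solve(xtx, xty)
    mean_y = sum(ys) / len(ys)
    ss_resid = sum(
        (y - sum(c * v for c, v in zip(coef, d))) ** 2 for d, y in zip(design, ys)
    )
    ss_total = sum((y - mean_y) ** 2 for y in ys)
    return {
        "coef": coef,                      # [intercept, b1, ..., bn]
        "ss_resid": ss_resid,              # residual sum of squares
        "ss_reg": ss_total - ss_resid,     # regression sum of squares
        "df": len(ys) - k,                 # residual degrees of freedom
    }

def derived_stats(fit, n_obs):
    """Derive summary statistics from the stored fit, without refitting."""
    ss_total = fit["ss_reg"] + fit["ss_resid"]
    r2 = fit["ss_reg"] / ss_total
    adj_r2 = 1 - (1 - r2) * (n_obs - 1) / fit["df"]
    std_err = (fit["ss_resid"] / fit["df"]) ** 0.5
    return {"r2": r2, "adj_r2": adj_r2, "std_err": std_err}

# Made-up data generated from y = 1 + 2*x1 + 3*x2 (two X variables).
rows = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 5)]
ys = [9, 8, 19, 18, 26]
fit = fit_once(rows, ys)            # the only place a regression is computed
stats = derived_stats(fit, len(ys)) # everything else references the fit
```

The point of the split is that `derived_stats` never touches the raw data,
much like the derived cells in the output only reference the LINEST() result.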
Following are the UI changes for the regression dialog box :-
1. Changed the regression-type selectors from check-boxes
to radio-buttons, so only one type of regression can
be done at a time. This is because the output of a single
regression type already shows a lot of information, and
if we do all types of regression at once, the output is hard
to read and interpret, especially for bigger data-sets with
lots of X variables.
2. Allow the variables' ranges to include labels, via
a checkbox. If labels are provided, they are used to
annotate the variable-specific statistics, so the user
can easily identify the stats corresponding to each
variable.
3. More robust input validity checks, with error messages
at the bottom of the dialog to let the user know which
of their entries is invalid.
4. The user can enter the confidence level (default = 95%)
used for computing the confidence interval of each estimate.
5. Make residual computations optional via a check-box,
as this involves writing a table with all X's and Y
along with the predicted Y and residual for each observation.
If the data-set is big, or the user just cares about
the estimates and confidence intervals, they can
avoid this.
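
For items 4 and 5, the two computations can be sketched as below. This is a
minimal illustration under stated assumptions, not the dialog's actual code:
both function names are hypothetical, and the Student's t critical value is
taken as an input because how it is obtained is not described here.

```python
def confidence_interval(estimate, std_error, t_critical):
    """Return (lower, upper) for one parameter estimate.

    t_critical is assumed to be the two-tailed Student's t quantile for the
    chosen confidence level and the residual degrees of freedom; it is
    supplied by the caller rather than computed here.
    """
    half_width = t_critical * std_error
    return estimate - half_width, estimate + half_width

def residual_table(rows, ys, intercept, slopes):
    """Build one output row per observation: X's, Y, predicted Y, residual."""
    table = []
    for xs, y in zip(rows, ys):
        predicted = intercept + sum(b * x for b, x in zip(slopes, xs))
        table.append((*xs, y, predicted, y - predicted))
    return table

# Illustrative numbers only: 2.228 is the two-tailed 95% t value for 10
# degrees of freedom; the estimate and standard error are made up.
low, high = confidence_interval(estimate=1.5, std_error=0.2, t_critical=2.228)

# The residual table grows with the number of observations, which is why
# skipping it is useful for big data-sets.
table = residual_table([(1.0, 2.0), (2.0, 1.0)], [10.0, 8.0], 1.0, [2.0, 3.0])
```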
Finally, the patch includes a uitest that tests all
3 regression types with a small dataset. The ground
truths for the tests were obtained by running the
regression tool in Gnumeric.
Change-Id: I9762b716eae14b9fbd16e2c7228edf9e1930dc93
Reviewed-on: https://gerrit.libreoffice.org/56809
Tested-by: Jenkins
Reviewed-by: Michael Meeks <michael.meeks@collabora.com>
Reviewed-by: Tomaž Vajngerl <quikee@gmail.com>