We first describe numerical tests that show how well the model fits the data. We then describe the composition and relative sizes of the clusters, and show how program behavior can be inferred by examining the \texttt{ArrayList} clusters.
\subsection{The number of underlying Markov chains}
To investigate how many underlying mixture components to choose, we split our data into 90\% training and 10\% test. ($n$-fold cross-validation would have been preferable had time permitted.) We then ran our learning algorithm for different values of $k$, recording the training and test log likelihoods. Because the EM algorithm is sensitive to initialization, for each $k$ we kept the model with the best training likelihood over 5 random restarts. We also report the BIC score so that it can be compared with test-set performance. (A mixture of $k$ components has $k-1$ parameters for the prior; if $n$ is the number of possible states, each component additionally has $n-1$ parameters for the initial probabilities and $n^2$ parameters for the transition probabilities (since we have an end state, this is $n^2$ rather than $n(n-1)$).) We show our results in Table~\ref{table:crossresults}.
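For concreteness, the parameter count behind Table~\ref{table:crossresults}, and the standard penalized-log-likelihood form of BIC consistent with the scores reported there, are
\[
d(k) = (k-1) + k\bigl((n-1) + n^2\bigr),
\qquad
\mathrm{BIC}(k) = \ell_{\mathrm{train}}(k) - \frac{d(k)}{2}\log N,
\]
where the three terms of $d(k)$ count the mixture prior, the per-component initial probabilities, and the per-component transition probabilities, and $N$ is the number of training traces. The parameter totals in the table correspond to $n = 8$ states.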
\begin{table}
\centering
\begin{tabular}{|c|c|c|c|c|}
\hline
$k$ & training log likelihood & test log likelihood & number of parameters & BIC score (training data)\\
\hline
2 & -12090 & -1295 & 143 & -12673 \\
4 & -10112 & -1078 & 287 & -11284 \\
6 & -9268 & -991 & 431 & -11028 \\
8 & -9214 & -989 & 575 & -11562 \\
10 & -9207 & -988 & 719 & -12143 \\
\hline
\end{tabular}
\caption{Cross-validation results.}
\label{table:crossresults}
\end{table}
The data show that the test log likelihood essentially plateaus at $k=6$: further increases in $k$ yield only marginal gains. The BIC score also selects $k=6$ as the best choice for the number of underlying clusters. One observation of note is that we do not see the classical overfitting signature of the test likelihood degrading as $k$ grows. Taking into account that the training log likelihood is computed over roughly nine times as much data as the test log likelihood, the two are well matched, so we do not expect to be grossly overfitting. This could, however, be because our training set was significantly larger than our test set.
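As a rough per-trace sanity check (using only the 90/10 split and the $k=6$ row of Table~\ref{table:crossresults}), scaling the training log likelihood down to the size of the test set gives
\[
\frac{\ell_{\mathrm{train}}}{9} \approx \frac{-9268}{9} \approx -1030
\qquad\text{vs.}\qquad
\ell_{\mathrm{test}} = -991,
\]
so the held-out traces are, if anything, explained slightly better per trace than the training traces, which is not what severe overfitting would look like.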
\subsection{Relationship between programs and clusters}
If clusters map traces back to their original programs, this suggests that different programs have different patterns of interaction; if not, it suggests that some patterns of interaction are common across programs. (If the number of clusters $k$ is less than the number of programs, then each cluster would be associated with a set of programs.) We find that clusters have mixed composition and that for each value of $k$ there is a single large cluster.
To examine this, we determined the best cluster for each training sample (the cluster with the highest $p(z \mid x)$) and grouped these assignments by program; a sketch of this procedure is given below. If traces from a specific program are more likely to belong to a specific cluster, then we would expect the fraction of the program's traces in its best cluster to be at least as high as that cluster's prior probability. The intuition is that, given no information about a trace, we would assign traces to clusters according to the prior, so clusters with larger priors are expected to receive a larger fraction of traces. If the fraction of a program's traces in its largest cluster exceeds the prior, then there may be some underlying structure binding that program to the cluster.
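The assignment step is only a hard arg-max over the posterior responsibilities. A minimal sketch, assuming the fitted model exposes a matrix of per-trace posteriors $p(z \mid x)$ and the mixture priors (the names below are illustrative, not taken from our implementation):
\begin{verbatim}
import numpy as np
from collections import Counter, defaultdict

def best_cluster_stats(posteriors, programs, priors):
    # posteriors: (num_traces, k) array of p(z | x) for each trace
    # programs:   length-num_traces list of program names
    # priors:     length-k array of mixture weights p(z)
    assignments = posteriors.argmax(axis=1)   # hard-assign each trace
    by_program = defaultdict(list)
    for prog, z in zip(programs, assignments):
        by_program[prog].append(z)
    stats = {}
    for prog, zs in by_program.items():
        best, n_best = Counter(zs).most_common(1)[0]
        stats[prog] = (best, n_best, n_best / len(zs), priors[best])
    return stats
\end{verbatim}
Each program's row in Table~\ref{table:bestclust} then reports the best cluster, the count and fraction of that program's traces assigned to it, and the cluster's prior for comparison.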
The data (Table~\ref{table:bestclust}) show that there was a single dominant cluster composed of samples from every program. It appears that the identity of the program was not a strong determinant of the type of trace or interaction; rather, there was a single dominant way of interacting with the API that was consistent across programs. This held true across different choices of $k$, hinting that adding more mixture components would simply refine performance on the rarer edge cases.
\begin{table}
\centering
\begin{tabular}{|c|c|c|c|c|}
\hline
program & best cluster & number in best cluster & fraction in best cluster & prior of best cluster\\
\hline
mallet-import-data & 2 & 29 & 0.66 & 0.43 \\
mallet-train-profile & 2 & 23 & 0.79 & 0.43 \\
jweather & 2 & 2203 & 0.80 & 0.43 \\
mallet-evaluate-profile & 2 & 23 & 0.79 & 0.43 \\
lucene-index & 2 & 232 & 0.34 & 0.43 \\
\hline
\end{tabular}

\medskip

\begin{tabular}{|c|c|c|c|c|}
\hline
program & best cluster & number in best cluster & fraction in best cluster & prior of best cluster\\
\hline
mallet-import-data & 5 & 28 & 0.64 & 0.47 \\
mallet-train-profile & 5 & 23 & 0.79 & 0.47 \\
jweather & 5 & 2212 & 0.81 & 0.47 \\
mallet-evaluate-profile & 5 & 23 & 0.79 & 0.47 \\
lucene-index & 5 & 342 & 0.50 & 0.47 \\
\hline
\end{tabular}
\caption{Best clusters per program for $k=4$ (top) and $k=6$ (bottom).}
\label{table:bestclust}
\end{table}