\subsection{Call sequence analysis}
We profiled how the code interacted with the \texttt{ArrayList} library methods \texttt{add}, \texttt{get}, \texttt{remove}, \texttt{size}, \texttt{clear}, \texttt{toArray}, \texttt{iterator}, and \texttt{clone}. These were the functions that appeared in the libraries we profiled; other functions, such as \texttt{addAll}, did not appear. We clustered a total of 3522 sequences. The most frequently occurring sequences are very short ([\texttt{add}], [\texttt{size}, \texttt{add}], [\texttt{size}, \texttt{get}, \texttt{remove}]).
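As a rough sketch of how such frequency counts can be tallied from profiled traces, the following Java fragment counts identical call sequences and prints the most common ones; the representation of traces as lists of method names is an assumption for illustration, not our actual instrumentation format.
\begin{verbatim}
import java.util.*;

public class SequenceCounter {
    // Tally how often each exact call sequence occurs and print the most
    // frequent ones. The List<List<String>> input (one list of ArrayList
    // method names per profiled object) is a hypothetical trace format.
    static void printTopSequences(List<List<String>> traces, int top) {
        Map<List<String>, Integer> counts = new HashMap<>();
        for (List<String> trace : traces) {
            counts.merge(trace, 1, Integer::sum);
        }
        counts.entrySet().stream()
              .sorted((a, b) -> b.getValue() - a.getValue())
              .limit(top)
              .forEach(e -> System.out.println(e.getKey() + ": " + e.getValue()));
    }

    public static void main(String[] args) {
        List<List<String>> traces = List.of(
                List.of("add"),
                List.of("add"),
                List.of("size", "add"),
                List.of("size", "get", "remove"));
        printTopSequences(traces, 3);
    }
}
\end{verbatim}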
Figure~\ref{fig:cluster3} shows the most frequent transitions within each cluster. Looking more closely at the $k=2$ case, we end up with clusters whose high-frequency methods are, from most to least frequent:
\begin{enumerate}
\item \texttt{get}, \texttt{size}, \texttt{toArray}, \texttt{add}, \texttt{clone}, \texttt{clear}
\item \texttt{add}, \texttt{get}, \texttt{size}, \texttt{remove}, \texttt{clone}, \texttt{clear}, \texttt{iterator}
\end{enumerate}
For $k=3$ we get the following most frequent calls (from most to least):
\begin{enumerate}
\item \texttt{size}, \texttt{get}, \texttt{add}, \texttt{remove}, \texttt{iterator}, \texttt{clone}
\item \texttt{toArray}, \texttt{size}, \texttt{add}, \texttt{get}
\item \texttt{get}, \texttt{add}, \texttt{size}, \texttt{clear}, \texttt{clone}, \texttt{toArray} (all at much lower frequency)
\end{enumerate}
From these clusters we can infer that \texttt{get}, \texttt{add}, and \texttt{size} are part of all modes of interacting with the library. The clusters for $k=3$ suggest that if we were to categorize uses into three modes, we would have 1) creating \texttt{ArrayList} objects for the purpose of turning them into an array, 2) building an \texttt{ArrayList} and iterating over the elements, and 3) using only random access methods like \texttt{get} and \texttt{add}. These usage patterns also suggest that copying (\texttt{clone}) and emptying (\texttt{clear}) are not typically used when creating an \texttt{ArrayList} for the purpose of turning it into an array.
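For concreteness, the fragments below sketch what client code in each of these three modes might look like; they are illustrative examples written for this discussion, not code drawn from the profiled projects.
\begin{verbatim}
import java.util.ArrayList;

public class UsageModes {
    // Mode 1: build an ArrayList only to convert it into an array.
    static String[] collectAsArray(String[] input) {
        ArrayList<String> buffer = new ArrayList<>();
        for (String s : input) {
            buffer.add(s);                                    // add
        }
        return buffer.toArray(new String[buffer.size()]);     // size, toArray
    }

    // Mode 2: build an ArrayList with add, then iterate over the elements.
    static int totalLength(String[] input) {
        ArrayList<String> words = new ArrayList<>();
        for (String s : input) {
            words.add(s);                                     // add
        }
        int total = 0;
        for (String w : words) {                              // iterator
            total += w.length();
        }
        return total;
    }

    // Mode 3: random access using get and add.
    static ArrayList<Integer> runningSums(ArrayList<Integer> xs) {
        ArrayList<Integer> sums = new ArrayList<>();
        for (int i = 0; i < xs.size(); i++) {                 // size
            int prev = sums.isEmpty() ? 0 : sums.get(sums.size() - 1);
            sums.add(prev + xs.get(i));                       // add, get
        }
        return sums;
    }
}
\end{verbatim}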
For $k=6$ we get similar results, with \texttt{iterator} and \texttt{toArray} continuing to be separated, one cluster of just \texttt{size} and \texttt{get}, and other clusters combining the usual functionalities. We may also be able to separate read-only cases from read-write cases. Though we performed this inference by examining our data, we could potentially automate this process by looking at calls that are not common across clusters and how they are used, perhaps with more processed data (e.g., with multiple equivalent calls compressed into one).
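One possible way to automate this step, sketched below under the assumption that each cluster is summarized by per-method relative frequencies, is to flag methods that are frequent in one cluster but rare in all others; both the representation and the threshold are assumptions for illustration, not part of our pipeline.
\begin{verbatim}
import java.util.*;

public class DistinctiveCalls {
    // For each cluster, report methods whose relative frequency is at least
    // `threshold` in that cluster but below it in every other cluster. The
    // per-cluster frequency maps are a hypothetical summary of the clusters.
    static Map<Integer, List<String>> distinctive(
            List<Map<String, Double>> clusterFreqs, double threshold) {
        Map<Integer, List<String>> result = new HashMap<>();
        for (int i = 0; i < clusterFreqs.size(); i++) {
            List<String> own = new ArrayList<>();
            for (Map.Entry<String, Double> e : clusterFreqs.get(i).entrySet()) {
                if (e.getValue() < threshold) {
                    continue;
                }
                boolean rareElsewhere = true;
                for (int j = 0; j < clusterFreqs.size(); j++) {
                    if (j != i && clusterFreqs.get(j)
                            .getOrDefault(e.getKey(), 0.0) >= threshold) {
                        rareElsewhere = false;
                        break;
                    }
                }
                if (rareElsewhere) {
                    own.add(e.getKey());
                }
            }
            result.put(i, own);
        }
        return result;
    }
}
\end{verbatim}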
In observing the initial and transition probabilities, we noted that a mixture of Markov chains will not necessarily separate sequences into different clusters even when they use disjoint states. Unlike a Gaussian mixture component, which has a single mean, a single Markov chain can actually contain multiple (semantic) clusters if those clusters have disjoint states.
For instance, for one chain we learned as part of a mixture, the initial probability mass is mostly spread between \texttt{get} (0.50 prior) and \texttt{add} (0.41 prior). Examining the transition probabilities, we see (\texttt{get->end}: 0.98), (\texttt{size->toArray}: 1.0), (\texttt{toArray->add}: 0.14), and (\texttt{toArray->size}: 0.8524). This suggests that a generated sequence will look very different depending on the initial state: if \texttt{get} is chosen, we are very likely to terminate immediately, whereas once the chain reaches \texttt{size}, it will likely bounce between \texttt{size} and \texttt{toArray}. Because the initial state probabilities are more or less a coin flip between \texttt{get} and \texttt{add}, this one cluster actually captures two distinct interactions. Consequently, it is possible that our chains cluster together behaviors that should be kept apart, and some post-processing is needed to extract the exact modes of interaction. (For instance, we could break constituent chains into separate chains with deterministic start states whenever the initial probabilities are split roughly equally.)
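A minimal sketch of such post-processing, assuming a chain is represented simply as an initial distribution plus transition rows (the representation and the mass threshold are assumptions for illustration only), could look like this:
\begin{verbatim}
import java.util.*;

public class SplitByStartState {
    // Minimal Markov-chain representation: an initial distribution over
    // states plus one transition row per state.
    static class Chain {
        Map<String, Double> initial;
        Map<String, Map<String, Double>> transitions;
        Chain(Map<String, Double> initial,
              Map<String, Map<String, Double>> transitions) {
            this.initial = initial;
            this.transitions = transitions;
        }
    }

    // If several start states each carry at least `minMass` of the initial
    // probability, split the chain into one copy per such state with a
    // deterministic start; otherwise return the chain unchanged. The value
    // passed for `minMass` (e.g. 0.25) is an arbitrary illustrative choice.
    static List<Chain> splitByStart(Chain chain, double minMass) {
        List<Chain> parts = new ArrayList<>();
        for (Map.Entry<String, Double> e : chain.initial.entrySet()) {
            if (e.getValue() >= minMass) {
                parts.add(new Chain(Map.of(e.getKey(), 1.0), chain.transitions));
            }
        }
        return parts.size() > 1 ? parts : List.of(chain);
    }
}
\end{verbatim}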
Though we had hoped it would not matter, the length of a sequence seems to correspond to its classification. For $k = 2, 4, 5$, the largest cluster is associated with the shortest average sequence length and the smallest cluster with the longest average length. (We could have tried to account for this in our learning algorithm, but we had not expected it to have a significant effect.)
\begin{figure}
\centering
\begin{tabular}{|l|l|}
\hline
Cluster 1 (291) & Cluster 2 (3231) \\
\hline
\texttt{get->get} (77.0\%) & \texttt{size->get} (27.4\%)\\
\texttt{size->toArray} (6.6\%) & \texttt{get->get} (14.6\%)\\
\texttt{toArray->size} (5.6\%) & \texttt{get->size} (14.5\%)\\
\texttt{add->add} (3.5\%) & \texttt{get->remove} (8.5\%)\\
\texttt{size->size} (1.1\%) & \texttt{size->add} (7.7\%)\\
\hline
\end{tabular}

\begin{tabular}{|l|l|l|}
\hline
Cluster 1 (685) & Cluster 2 (208) & Cluster 3 (2629)\\
\hline
\texttt{size->get} (30.4\%) & \texttt{size->toArray} (40.0\%) & \texttt{get->get} (91.5\%)\\
\texttt{get->size} (18.1\%) & \texttt{toArray->size} (34.1\%) & \texttt{get->clear} (1.1\%)\\
\texttt{size->size} (14.1\%) & \texttt{add->add} (17.3\%) & \texttt{size->add} (1.4\%)\\
\texttt{get->remove} (10.7\%) & \texttt{toArray->add} (5.8\%) & \texttt{clear->get} (1.1\%)\\
\texttt{size->add} (9.2\%) & \texttt{add->size} (2.9\%) & \texttt{size->get} (1.1\%)\\
\hline
\end{tabular}
\caption{Most frequently occurring call transitions within each cluster for $k=2$ (top) and $k=3$ (bottom); cluster sizes are shown in parentheses.}
\label{fig:cluster3}
\end{figure}