From b28a824cf8215c59003e1fa2e997b276d324b3d4 Mon Sep 17 00:00:00 2001 From: Jean Yang Date: Thu, 4 Dec 2008 17:59:59 -0500 Subject: [PATCH] small changes to conclusion --- writeup/conclusion.tex | 2 +- writeup/markov.tex | 4 ++-- writeup/results.tex | 2 +- writeup/results_desc.tex | 2 +- 4 files changed, 5 insertions(+), 5 deletions(-) diff --git a/writeup/conclusion.tex b/writeup/conclusion.tex index 76fe43f..52caf14 100644 --- a/writeup/conclusion.tex +++ b/writeup/conclusion.tex @@ -3,7 +3,7 @@ We have explored applying machine learning techniques to the problem of clusteri For classification, we developed the $k$ Markov chains model based on the Gaussian mixture model (GMM) and variable-length HMM models developed in class. We implemented 1) a Java profiler for extracting object-level Java method traces from any Java JAR executable and 2) a MATLAB routine for learning and classifying with a $k$ Markov chains mixture model. We constructed method traces for learning from Java programs we found online. In learning, we explored the problem of determining the number of underlying mixtures by running cross-validation for different values of $k$ and examining the BIC score. We used cross-validation and the BIC score to confirm our expectations that, after a certain point, increasing $k$ does not yield better results. -Through examining cluster contents and parameters of the Markov Chains, we determined that one could automatically learn a set of common ways of interacting with libraries. Individual programs did not appear to have idiosyncratic ways of interacting with the libraries we examined; this is consistent with our expectation that there is some fixed number of modes of interaction. Our model assigned most traces to a single cluster, indicating that there may be a dominant mode of interaction. It seems that for the \texttt{ArrayList} library, using Markov chains provides sufficiently strong assumptions to yield interesting results. We noticed, however, the deficiency that a single Markov chain can comprise multiple modes of interaction, especially if those modes consist of disjoint sets of method calls, but it seems that one could mitigate this problem with post-processing. +Through examining cluster contents and parameters of the Markov chains, we determined that one could automatically learn a set of common ways of interacting with libraries. Individual programs did not appear to have idiosyncratic ways of interacting with the libraries we examined; this is consistent with our expectation that there is some fixed number of modes of interaction. Our model assigned most traces to a single cluster, indicating that there may be a dominant mode of interaction. It seems that for the \texttt{ArrayList} library, using Markov chains provides sufficiently strong assumptions to yield interesting results. We noticed, however, the deficiency that a single Markov chain can comprise multiple modes of interaction; it seems that one could mitigate this problem with post-processing. We would also like to note that we were fortunate that the Markov model structure could capture the \texttt{ArrayList} interactions in a sufficiently interesting way. If we had a library that required interactions that were better expressed with, say, a probabilistic context-free grammar, the learning results might not have been as clean. The results of this work have many applications. Knowing that an API has common use cases, we could generate a useful set of examples for the programmer. 
We could also use this information in code synthesis: knowing about different usage modes, and perhaps about the common cases, could greatly narrow the search space. This information is also useful in test generation: we might be able to predict interactions with the code we write and make sure our code is robust to those cases. diff --git a/writeup/markov.tex b/writeup/markov.tex index 10c4e55..ec9fda2 100644 --- a/writeup/markov.tex +++ b/writeup/markov.tex @@ -20,7 +20,7 @@ We chose this model because it fits our goals of clustering different patterns o \item Mixture models are a natural way to uncover clusters in an unsupervised fashion by examining the component models of the mixture. \item The temporal relationship between method calls is more interesting and informative than, say, the counts of certain method calls. Markov chains capture this in a simple way by assuming that the system only cares about the previous state. While program interaction is not solely determined by the previous function call (in fact, there may be add/remove sequences that have context-free structure), we reasoned that choosing a simpler model would make learning easier while still yielding insight into API interaction\footnote{We also considered using other classification methods ($k$-means clustering, SVMs) and capturing the temporal nature of our data by selecting pairs of function calls as features, but Markov chains capture this in a more straightforward way.}. \end{enumerate} -We considered using hidden Markov models (HMMs) as well, but it was not clear to us what the hidden states would be. Also, the learning problem would have been considerably more difficult if we chose to use a mixture of HMMs, as we would need to choose the number of mixtures as well as the number of hidden states. +We considered using hidden Markov models (HMMs) as well, but it was not clear to us what the hidden states would be. The learning problem would have been considerably more difficult if we chose to use a mixture of HMMs, as we would need to choose the number of mixtures as well as the number of hidden states. \subsection{Maximum Likelihood Estimation} Similar to our derivation of the EM algorithm for other models, we first take a look at maximum likelihood estimation in the case where we observe exactly which underlying Markov chain each observation was drawn from. In this case, similar to GMMs, the maximum likelihood estimates for the parameters in our model are simple. Suppose that we have $n$ training sequences and define $\delta(z, i)$ to be 1 iff sample $i$ was drawn from chain $z$. The maximum likelihood estimates for our parameters will be: @@ -35,4 +35,4 @@ Now that we have the estimates for the case where we know which Markov Chain eac \begin{eqnarray*} \delta(z, i) = \frac{q(z) t_z(x_{i1}) t_z(END | x_{im}) \prod_{j=2}^{m} t_z(x_{ij} | x_{i,j-1})}{\sum_z q(z) t_z(x_{i1}) t_z(END | x_{im}) \prod_{j=2}^{m} t_z(x_{ij} | x_{i,j-1})}. \end{eqnarray*} -We implemented both the representation of the $k$-Markov chains model and EM for this model in MATLAB without using any existing libraries or code. \ No newline at end of file +We implemented both the representation of the $k$-Markov chains model and EM for this model in MATLAB without using any existing libraries or code. 
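For concreteness, the E-step posterior $\delta(z, i)$ above could be computed along the following lines. This is only an illustrative MATLAB sketch, not the implementation referred to in the patch; the data layout (mixture priors q, per-chain initial distributions init, and per-chain transition matrices trans with an extra END column) is a hypothetical assumption. Working in log space guards against underflow on long traces.

% Illustrative E-step for a k-Markov-chains mixture (sketch only; names hypothetical).
%   q     : k x 1 vector of mixture priors q(z)
%   init  : k x S matrix, init(z, s) = t_z(s), probability of starting in state s
%   trans : k x S x (S+1) array, trans(z, s, t) = t_z(t | s); column S+1 is t_z(END | s)
%   x     : one trace, a vector of state indices x(1), ..., x(m)
% Returns post(z) = delta(z, i) for this trace.
function post = estep_posterior(q, init, trans, x)
    k = numel(q);
    S = size(init, 2);
    logp = zeros(k, 1);
    for z = 1:k
        lp = log(q(z)) + log(init(z, x(1)));          % log q(z) + log t_z(x_1)
        for j = 2:numel(x)
            lp = lp + log(trans(z, x(j-1), x(j)));    % log t_z(x_j | x_{j-1})
        end
        lp = lp + log(trans(z, x(end), S + 1));       % log t_z(END | x_m)
        logp(z) = lp;
    end
    mx = max(logp);                                   % normalize in log space
    post = exp(logp - (mx + log(sum(exp(logp - mx)))));
end

Summing these posteriors over all training traces would then give the expected counts that the M-step normalizes into updated estimates of q, init, and trans.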
diff --git a/writeup/results.tex b/writeup/results.tex index 657d676..e3728f9 100644 --- a/writeup/results.tex +++ b/writeup/results.tex @@ -25,7 +25,7 @@ The data shows that $k=6$ is the first $k$ value to yield a locally optimal test \subsection{Relationship between programs and clusters} If clusters map traces back to their original programs, then this suggests different programs have different patterns of interaction; if not, then this suggests there may be some common patterns of interaction. (If the number of clusters $k$ is less than the number of programs, then each cluster would be associated with a set of programs.) Clusters have mixed composition, and for each value of $k$ there is a large cluster. -To examine this, we determined the best cluster for each training sample (by taking the cluster with the highest $p(z | x)$) and grouped these mappings by program. If traces from a specific program are more likely to belong to a specific cluster, then we would expect the highest percentage of traces the program has in a cluster at least as high as the prior probability for the cluster. The intuition is that given no information about the trace, we will map traces to clusters based on the prior, so certain clusters are expected to have a higher fraction of traces. However, if we see that the number in the largest clusters for a program exceeds this fraction, then this suggests there is some underlying structure that is binding a program to a given cluster. +To examine this, we determined the best cluster for each training sample (by taking the cluster with the highest $p(z | x)$) and grouped these mappings by program. If traces from a specific program are more likely to belong to a specific cluster, then we would expect the highest percentage of the program's traces in any single cluster to be at least as high as the prior probability for that cluster. The intuition is that given no information about the trace, we will map traces to clusters based on the prior, so certain clusters are expected to have a higher fraction of traces. If we see that the fraction of a program's traces in its largest cluster exceeds this prior, then there may be some underlying structure binding the program to that cluster. The data (Table~\ref{table:bestclust}) shows that there was a single dominant cluster composed of samples from every program. It appears that the identity of the program was not a strong determinant of the type of trace or interaction; rather, there was a single dominant way of interacting with the API that was consistent across programs. This held true across different choices of $k$, hinting that adding more mixture components would simply refine our performance on the rarer edge cases. diff --git a/writeup/results_desc.tex b/writeup/results_desc.tex index 1aa7a16..40da7f3 100644 --- a/writeup/results_desc.tex +++ b/writeup/results_desc.tex @@ -19,7 +19,7 @@ For $k=6$ we get similar results, with \texttt{iterator} and \texttt{toArray} co In observing the initial and transition probabilities, we noted that a mixture of Markov chains will not necessarily cluster sequences separately even if they use disjoint states. Unlike a GMM component, which has a single mean, a single Markov chain can actually contain multiple (semantic) clusters if those clusters have disjoint states. For instance, for one chain we learned as part of a mixture, the initial probabilities are mostly spread between \texttt{get} (0.50 prior) and \texttt{add} (0.41 prior). 
Examining the transition probabilities, we see that (\texttt{get -> end}: 0.98), (\texttt{size -> toArray}: 1.0), (\texttt{toArray -> add}: 0.14, \texttt{toArray -> size}: 0.8524). This suggests that the generated sequence will look very different depending on the initial state: if \texttt{get} is chosen, then we are very likely to terminate immediately. If \texttt{size} is chosen, then we will likely bounce between \texttt{size} and \texttt{toArray}. Because the initial state probabilities are more or less a coin flip between \texttt{get} and \texttt{size}, this one cluster actually captures two distinct interactions. Because of this, it is possible that our chains cluster together things that should be apart, and we need some post-processing to extract the exact modes of interaction. (For instance, as sketched below, we could break constituent chains into separate chains with deterministic start states if initial probabilities are split equally.) -Though we had hoped it would not matter, the length of the sequence seems to correspond to its classification. For $k = 2, 4, 5$, the largest cluster is associated with the shortest average sequence length and the smallest cluster is associated with the longest average length. (We could have tried to factor this out in our learning algorithm, but we had not thought it would have a significant effect and therefore wouldn't be worth the complexity.) +Though we had hoped it would not matter, the length of the sequence seems to correspond to its classification. For $k = 2, 4, 5$, the largest cluster is associated with the shortest average sequence length and the smallest cluster is associated with the longest average length. (We could have tried to account for this in our learning algorithm, but we had not thought it would have a significant effect.) \begin{figure} \centering -- 2.11.4.GIT
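As a concrete illustration of the post-processing mentioned above (splitting a chain whose initial probability mass is spread over several states into sub-chains with deterministic start states), one rough MATLAB sketch might look as follows. It reuses the hypothetical (q, init, trans) layout from the E-step sketch earlier in the patch, and the 0.25 threshold is an arbitrary choice for illustration.

% Sketch of the post-processing idea above (hypothetical names; same q/init/trans
% layout as the E-step sketch). Any chain whose initial distribution puts mass on
% several states is split into one chain per plausible start state, with a
% deterministic start and unchanged transition probabilities.
function [q2, init2, trans2] = split_mixed_chains(q, init, trans, thresh)
    if nargin < 4, thresh = 0.25; end                  % illustrative threshold
    S = size(init, 2);
    q2 = zeros(0, 1); init2 = zeros(0, S); trans2 = zeros(0, S, S + 1);
    for z = 1:numel(q)
        starts = find(init(z, :) >= thresh);
        if numel(starts) < 2                           % mass concentrated: keep chain as-is
            q2 = [q2; q(z)];
            init2 = [init2; init(z, :)];
            trans2 = cat(1, trans2, trans(z, :, :));
        else
            for s = starts                             % one sub-chain per likely start state
                e = zeros(1, S); e(s) = 1;             % deterministic start
                q2 = [q2; q(z) * init(z, s)];          % reweight by start probability
                init2 = [init2; e];
                trans2 = cat(1, trans2, trans(z, :, :));
            end
        end
    end
    q2 = q2 / sum(q2);                                 % renormalize mixture priors
end

For a chain like the one described above, this would separate the two interaction modes into their own components; the final renormalization compensates for any initial-probability mass below the threshold that gets dropped.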