We present an unsupervised method for learning action symbols from video data that self-tunes the number of symbols to effectively build hierarchical activity grammars. A video stream is given as a sequence of unlabeled segments. Similar segments are incrementally grouped to form a hierarchical tree structure. The tree is cut into clusters, and each cluster is used to train an action symbol. Our goal is to find a good set of clusters, i.e. symbols, such that regularities are best captured in the learned representation, i.e. the induced grammar. Our method is two-fold: 1) create a candidate set of symbols from the initial clusters; 2) build an activity grammar and measure model complexity and likelihood to assess the quality of the candidate symbol set. We propose a balanced model-comparison method that avoids a problem common in model-complexity computations, where one measurement term dominates the other. Our experiments on Towers of Hanoi and human dancing videos show that our method can effectively discover the optimal number of action symbols.
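The balanced model-comparison step can be sketched as follows. This is a hypothetical illustration, not the paper's exact formulation: it assumes each candidate symbol set (indexed by its symbol count `k`) has an already-measured grammar log-likelihood and model complexity, and it normalizes both terms to a common [0, 1] range before combining them, so that neither term dominates the other.

```python
# Hedged sketch of balanced model comparison over candidate symbol sets.
# Assumption: for each candidate number of symbols k we already have a
# log-likelihood L(k) of the induced grammar and a complexity C(k);
# the normalization scheme below is illustrative, not the paper's exact one.

def balanced_scores(loglik, complexity):
    """Combine likelihood and complexity after min-max normalization."""
    def norm(xs):
        lo, hi = min(xs), max(xs)
        span = (hi - lo) or 1.0        # guard against a zero range
        return [(x - lo) / span for x in xs]
    L = norm(loglik)                   # higher is better
    C = norm(complexity)               # lower is better
    return [l - c for l, c in zip(L, C)]

def best_symbol_count(ks, loglik, complexity):
    """Return the candidate k with the highest balanced score."""
    scores = balanced_scores(loglik, complexity)
    return max(zip(scores, ks))[1]

# Toy example: likelihood flattens out while complexity keeps growing,
# so the balanced score peaks at an intermediate symbol count.
ks = [2, 3, 4, 5, 6]
loglik = [-120.0, -80.0, -60.0, -58.0, -57.0]
complexity = [10.0, 18.0, 30.0, 55.0, 90.0]
print(best_symbol_count(ks, loglik, complexity))  # → 4
```

Because both terms are rescaled to the same range, a raw penalized score such as L(k) − C(k) cannot be swamped by whichever term happens to have the larger magnitude.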

Authors: Lee, K., Kim, TK and Demiris, Y.