We present an unsupervised method for learning action symbols from video data that self-tunes the number of symbols to effectively build hierarchical activity grammars. A video stream is given as a sequence of unlabeled segments. Similar segments are incrementally grouped to form a hierarchical tree structure. The tree is cut into clusters, and each cluster is used to train an action symbol. Our goal is to find a good set of clusters, i.e., symbols, such that regularities are best captured in the learned representation, i.e., the induced grammar. Our method proceeds in two steps: 1) create a candidate set of symbols from initial clusters; 2) build an activity grammar and measure model complexity and likelihood to assess the quality of the candidate symbol set. We propose a balanced model comparison method that avoids a problem common in model complexity computations, where one measurement term dominates the other. Our experiments on Towers of Hanoi and human dancing videos show that our method can effectively discover the optimal number of action symbols.
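The overall pipeline (cluster segments into a tree, cut the tree into candidate symbol sets, score each cut by likelihood minus model complexity) can be illustrated with a minimal sketch. Everything here is a simplifying assumption, not the paper's actual method: segments are reduced to 1-D features, single-linkage agglomeration stands in for the incremental grouping, a bigram model stands in for the induced activity grammar, and a BIC-style penalty stands in for the proposed balanced model comparison.

```python
import math
from itertools import combinations

def cluster_levels(points):
    """Single-linkage agglomerative clustering over 1-D segment features.
    Returns {k: assignment}, where assignment[i] is the cluster id of
    segment i when the tree is cut into k clusters."""
    clusters = {i: [i] for i in range(len(points))}
    levels = {len(points): {i: i for i in range(len(points))}}
    while len(clusters) > 1:
        # merge the pair of clusters with the smallest single-linkage distance
        a, b = min(combinations(clusters, 2),
                   key=lambda p: min(abs(points[i] - points[j])
                                     for i in clusters[p[0]]
                                     for j in clusters[p[1]]))
        clusters[a] += clusters.pop(b)
        levels[len(clusters)] = {i: c for c, mem in clusters.items() for i in mem}
    return levels

def gaussian_loglik(points, assign):
    """Log-likelihood of segment features under one Gaussian per symbol
    (variance floored so singleton clusters stay finite)."""
    ll = 0.0
    for c in set(assign.values()):
        xs = [points[i] for i in assign if assign[i] == c]
        mu = sum(xs) / len(xs)
        var = max(sum((x - mu) ** 2 for x in xs) / len(xs), 1e-2)
        ll += sum(-0.5 * math.log(2 * math.pi * var)
                  - (x - mu) ** 2 / (2 * var) for x in xs)
    return ll

def bigram_loglik(symbols):
    """Likelihood of the symbol string under a bigram model -- a crude
    stand-in for the likelihood under the induced activity grammar."""
    counts = {}
    for s, t in zip(symbols, symbols[1:]):
        counts[s, t] = counts.get((s, t), 0) + 1
    total = sum(counts.values())
    return sum(c * math.log(c / total) for c in counts.values())

def select_num_symbols(points):
    """Pick the tree cut (number of symbols) maximizing data likelihood
    plus grammar likelihood minus a BIC-style complexity penalty."""
    levels = cluster_levels(points)
    n = len(points)
    def score(k):
        assign = levels[k]
        symbols = [assign[i] for i in range(n)]
        penalty = 2 * k * math.log(n) / 2  # ~2 parameters per symbol model
        return gaussian_loglik(points, assign) + bigram_loglik(symbols) - penalty
    return max(range(2, n), key=score)
```

On well-separated segment features (e.g. nine segments drawn from three tight groups), the penalized score peaks at three symbols: coarser cuts pay in data likelihood, finer cuts pay in complexity, which is the trade-off the balanced comparison is designed to navigate.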