Gesture analysis in violin performances

Oriol Saña testing the experimental setup


The project started in spring 2010 and is supported by ESMUC and the SIEMPRE project.


This research is performed in collaboration with Alfonso Pérez (researcher at the SIEMPRE Project – MTG – UPF), Oriol Saña (PhD candidate at UAB), and Quim Llimona (student of Sonology at ESMUC and of Audiovisual Engineering at UPF).


The aim of this research is to help determine whether jazz-violin students perform better, rhythmically speaking, than classical-violin students. The pedagogical side of this research is carried out in parallel by a musicology PhD candidate at UAB (Oriol Saña). Our particular goal is to find rhythmic relationships in recordings made by different students, in different sessions, playing different violins and different exercises. The recorded data comes from audio (both microphone and pickup) and from a set of position sensors attached to the bow and the violin (Polhemus®). State-of-the-art descriptors are computed from this data, and statistical methods are applied to find the rhythmic relationships between students.

Experimental setup

The aim of the setup presented here is to make our analysis independent of the violin used, the piece played, and the particular playing conditions of a specific session. In fact, the only variable we are interested in is the jazz background of the students.


We had the collaboration of 10 violin students (participants A to H) from the Escola Superior de Música de Catalunya (ESMUC). Some of them are enrolled in classical music courses (A, G), while others are enrolled in both classical and jazz music courses (B, C, D, E, F, H). We also recorded two professional violinists as references: one from the classical tradition (participant I) and the other from the jazz tradition (participant J).


We recorded 3 pieces from the classical tradition (exercises 1 to 3). They were selected for their different levels of rhythmic complexity, according to the criteria of a professional violinist from the classical tradition: (1) W. A. Mozart, Symphony No. 39 in E-flat major, KV 543, 1st movement. Rhythmic patterns with sixteenth notes and some eighth notes in between; this excerpt presents high rhythmic regularity. (2) R. Strauss, Don Juan, op. 20, excerpt. Rhythmic patterns that are developed throughout the piece; there are small variations in the melody, but the rhythm remains almost constant. (3) R. Schumann, Symphony No. 2 in C major, Scherzo, excerpt. Its rhythmic complexity is higher than that of the two previous pieces, and the excerpt does not present a specific rhythmic pattern. We also recorded 3 pieces from the jazz tradition: (4) a rhythmic exercise proposed by Andreas Schreiber at the Anton Bruckner University (Linz, Austria), (5) an exercise proposed by André Charler at Le centre des musiques Didier Lockwood (Dammarie-Lès-Lys, France), and (6) an exercise proposed by Michael Gustorff at the ArtEZ Conservatory in Arnhem (The Netherlands).


We followed the students through 10 sessions over one trimester, from September to December 2011, in which they had to play all the exercises. The goal of this setup was, where possible, to draw conclusions about the learning process rather than about personal performing ability, and to make the recordings independent of the particular playing conditions on a specific date. The reference violinists were recorded only once.


Each exercise was played twice: once with each participant's own violin, and once with a violin shared by all participants, to which all the gesture sensors were attached.

Data Acquisition

For all the exercises, students, sessions, and violins played, we created a multimodal collection with video, audio, and bow-body relative position information. Part of these recordings is available on the Repovizz frontend for analysis[1].

Video recordings

We used four video cameras to record performer gestures and audio. Each camera is set up to capture a specific component of the performer's gesture: (1) front view, to provide information about the overall position of the violinist; (2) top view, to detect horizontal movement of both body and bow; (3) back view, to record the contact point between the bow and the string and to supervise bow force and bow tilt; and (4) foot view, to record foot tapping in those cases where the student does it. The classical tradition does not consider tapping part of a good performance, but it is widely used in the jazz tradition.

Audio recordings

We recorded audio streams for the two types of violin for each exercise, student, and session. We collected audio from (1) a microphone located 2 m away from the violin, to capture its timbral properties, and (2) a pickup attached to the bridge, to obtain more precise and room-independent data. The exercises played with the students' own violins (the ones from which gestural information is not captured) were also recorded with (3) a clip-on microphone.

Position recordings

As detailed in previous research, gesture-related data can be acquired using position sensors attached to the violin[2]. We use a six-degrees-of-freedom electromagnetic tracker that provides the localization and orientation of a sensor with respect to a source. We need two sensors, one attached to the bow and the other attached to the violin, to obtain a complete representation of their relative movement. From all the information given by the two sensors, we focus on the following streams, which can be computed directly: (1) bow position, (2) bow force (simulation), and (3) bow velocity.
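Of these streams, bow velocity is the one most directly derived from the raw sensor data: it can be estimated by numerically differentiating the bow-position stream. A minimal sketch in Python (not the project's actual Matlab-based toolchain), assuming a 1-D bow-position signal sampled at the tracker's 240 Hz:

```python
import numpy as np

SR_POS = 240.0  # sampling rate of the position stream (Hz)

def bow_velocity(bow_position: np.ndarray, sr: float = SR_POS) -> np.ndarray:
    """Estimate bow velocity by central finite differences of the
    bow-position stream (output has the same length as the input)."""
    return np.gradient(bow_position, 1.0 / sr)

# Example: a bow moving at a constant 0.5 m/s yields a flat velocity curve.
t = np.arange(0, 1, 1.0 / SR_POS)
pos = 0.5 * t
vel = bow_velocity(pos)
```

Central differences keep the velocity stream the same length as, and aligned with, the position stream, which simplifies later frame-based analysis.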

Data Processing

Our goal here is to determine whether there is any relationship between the rhythmic properties of the performed music and the musical background of the student. To that end, we automatically extract rhythmic and amplitude descriptors from the collected streams and look for dependences between them and the student.

Feature extraction

We compute descriptors from the audio recorded from the pickup (1 stream at sr = 22050 Hz) and from the position data provided by the sensors attached to the bow and the violin (3 streams at sr = 240 Hz). For each of the four streams, we compute two sets of descriptors using the MIR toolbox for Matlab® [3]: (1) a set of global descriptors including length, beatedness, event density, tempo estimation (using both autocorrelation and spectral implementations), pulse clarity, and low energy; and (2) a frame-based set of descriptors including onsets, attack time, and attack slope.

Computing distances with respect to the references

Following pedagogic criteria, our work focuses on the differences between the student performances (participants A..H) and the professional references (participants I..J). For the global descriptors, we compute the relative distance from the student to the reference in terms of Euclidean distance. For the frame-based descriptors, we compute the relative distance using dynamic time warping (DTW)[4], which has also proved robust on gesture data[5]. Specifically, we use the total cost of the warping path as a distance measure between two streams.
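For reference, the DTW total-cost measure can be sketched as follows: a textbook dynamic-programming implementation in the spirit of [4], with absolute difference as the local cost (this is an illustration, not the project's actual code):

```python
import numpy as np

def dtw_cost(a: np.ndarray, b: np.ndarray) -> float:
    """Total cost of the optimal warping path between two 1-D streams,
    using absolute difference as the local cost."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)  # accumulated-cost matrix
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # extend the cheapest of the three allowed predecessor steps
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(D[n, m])

# Example: identical streams have zero cost, and a uniformly
# time-stretched copy also warps to zero cost.
a = np.array([1.0, 2.0, 3.0])
b = np.array([1.0, 1.0, 2.0, 2.0, 3.0, 3.0])
```

This robustness to time stretching is what makes DTW suitable for comparing performances at slightly different tempi; for the global descriptors, the Euclidean counterpart is simply `np.linalg.norm(a - b)` on the descriptor vectors.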

Statistical analysis

We are analyzing all the data right now. More information will be available soon.


[1] Mayor, O., Llop, J. and Maestre, E., “Repovizz: A multimodal on-line database and browsing tool for music performance research”. Proceedings of the 12th International Society for Music Information Retrieval Conference (ISMIR), Miami, USA, 2011.

[2] Maestre, E., Blaauw M., Bonada J., Guaus E., and Pérez A., “Statistical Modeling of Bowing Control applied to Violin Sound Synthesis”. IEEE Transactions on Audio, Speech, and Language Processing, Volume: 18 , Issue: 4 , Page(s): 855 – 871,  2010.

[3] Lartillot, O. and Toiviainen, P., “A Matlab Toolbox for Musical Feature Extraction From Audio”. International Conference on Digital Audio Effects, Bordeaux, 2007.

[4] Sakoe, H. and Chiba, S., “Dynamic programming algorithm optimization for spoken word recognition”. IEEE Transactions on Acoustics, Speech and Signal Processing, Volume: 26, Issue: 1; Page(s): 43- 49, 1978.

[5] Müller, M., “Efficient content-based retrieval of motion capture data”. ACM Transactions on Graphics, Volume: 24, Issue: 3, 2005.