This study may contribute useful research to the wider field of audio engineering by providing insight into the design and evaluation of alternative audio mixing interfaces and control schemes.
Introduction
Research Questions and Objectives
This study compares two VR control schemes for a simple audio mixing task, examining subject preference and time-on-task, and tests whether experience level, dividing participants into more experienced and less experienced groups, affects ratings or task completion time. The aim is to identify potential differences in usability and efficiency between control schemes and experience levels. The findings could inform the ergonomic design of VR mixing systems, guiding interface choices, interaction paradigms, and workflow optimization to enhance comfort and performance.
User preference was measured via a survey of perceived accuracy, efficiency, and satisfaction for two control systems: hand-gesture tracking and handheld controller-based control. Both schemes controlled single channels of a musical performance visualized as three-dimensional objects in a VR audio mixing environment. The time taken to achieve an "optimal mix balance" was recorded, and users' verbal feedback was collected to illuminate their experience with each control scheme. Each subject's preference for one control scheme, or lack of preference between schemes, was documented. The null hypothesis stated that, as measured by time-on-task and a satisfaction-oriented survey, there would be no differences in task completion time, subject-reported accuracy, satisfaction ratings, or overall preference between hand-detection control and physical controller systems.
Prior Art
Research Comparing the Stage-Metaphor and Channel-Strip Metaphor
In a study comparing stage-metaphor-based and channel-strip-based audio mixing systems, researchers found comparable performance between the two approaches. Sound sources were visualized on a computer monitor for the channel-strip scheme and on a touchscreen tablet, which also allowed control of source selection, within the stage-metaphor scheme. An audio mixing interface was developed to compare the two control schemes for adjusting the volume and panning of a single channel in a stereo mix. The study measured how accurately subjects could replicate the target volume and panning and how quickly they could complete the tasks.
Experiments were conducted over two days with 15 participants, including 7 experts and 8 novices. Each test lasted 15–20 minutes, and participants first completed a questionnaire to record demographic information such as age before being briefed on the test procedure. They then practiced the controls 1–3 times, using a combination of an interface (Launchpad or iPad) and a piece of music (Drums or Guitar), until they felt comfortable with the interface.
After the practice period, 15 participants completed five trials on each of two interfaces, resulting in 75 trials per interface. In each trial, subjects listened to a prerecorded reference for eight seconds to identify the channel to replicate; then, upon returning to the interface, they played the audio and adjusted the target signal's gain (volume) and stereo position (panning) to match the reference, and completion time and the resulting gain and panning values were recorded when playback stopped. Scores were computed as the difference between the reference MIDI values and the selected values. The analysis found no significant difference between the two interfaces for the mixing task, while a statistically significant difference emerged between novice and expert users in their general panning ability. The authors cautioned that the absence of observed interface differences does not prove that none exist. Participants were rarely able to express a clear preference, with only a few choosing one interface and no consistent trend. The simplicity of the task may have contributed to these findings.
The stage metaphor in this case was preferred for its intuitiveness, enjoyability, and its ability to help users better visualize spatial elements of a mix.
Gestural Audio Mixing Controllers in Prior Research
In 2013, researchers introduced the WAVE audio mixing interface, a camera-based system that uses computer vision to track movements and gestures for controlling a digital audio workstation (DAW), and evaluated subjects' ability to mix eight instrument tracks across different control setups. The study notes that many sound engineers believed mixing exclusively on a DAW, rather than with an analog recording console, could worsen music quality. The authors suggest that differences between the algorithms of mixing software and their analog desk counterparts may contribute to these perceived gaps, and a number of engineers cited ergonomics as a key factor underlying subjective mix quality differences between consoles and DAWs.
Researchers designed a simple audio mixing GUI displayed on a projected screen, with adjustments controlled by a gesture-control dictionary processed by a camera running computer vision software to detect hand movements, gestures, and position. Ten professional mixing engineers took part in the evaluation, each tasked with mixing eight audio tracks that varied significantly in musical and signal features. Each track contained recordings of a single instrument or a group of instruments, with genres spanning instrumental rock and film soundtracks. Participants were trained until familiar with the system's gesture dictionary and could apply five different sound-mixing methods, listed below.
1. Mixing using the custom GUI and gestures, without parametric visual information displayed;
2. Mixing using the custom GUI and gestures, with parametric visual information displayed;
3. Mixing using the custom GUI and a mouse with a keyboard, without parametric visual information displayed;
4. Mixing using the custom GUI and a mouse with a keyboard, with visual information reflecting parametric changes;
5. Mixing using a keyboard, mouse, and MIDI controller for parameter editing.
By using paired mixes generated by each method, the study provides insight into the precision and accuracy of gestural controls relative to other control methods, enabling a clear comparison of performance factors. Engineers were asked to subjectively evaluate the recordings through a structured questionnaire, yielding perceptual assessments that complement the objective data.
Researchers concluded that mixing audio signals with hand gestures instead of traditional controllers like a keyboard and mouse is not only viable but intuitive. The intuitiveness, convenience, and precision of the gestural control system were rated on a 1–5 scale by the engineers who evaluated it, with six engineers awarding a perfect score of 5 and the lowest score being 3. All engineers endorsed the ability to control multiple parameters simultaneously, a feature they primarily used for adjusting both the dynamic compression ratio and the threshold at the same time, while also exploring other uses such as modulating the amount of reverb in the mix.
Some subjects reported weariness resulting from insufficient ergonomics (one participant) and from the running order of the mixing methods (two participants). The results of this study substantiated the researchers' conclusion that engineers are able to create mixes of equal aesthetic value with a gesture-controlled system as they would be able to create with a physical controller system.
Building on Lech and Kostek's WAVE framework, Ratcliffe's study leverages a multi-parametric audio-control framework to advance mixing interface design, addressing deficiencies in the current channel-strip paradigm and uncovering nuanced design opportunities. To achieve this, the researcher adopts Gibson's stage metaphor to create an optical-tracking, computer-vision-assisted mixing system [16].
The author proposed that it can be useful to explore alternatives that do not rely on existing physical interfaces. Although current user interface trends for adjusting audio parameters are grounded in physical controls, Ratcliffe argued that there is a responsibility to broaden design perspectives and to investigate non-traditional interaction paradigms.
Breaking away from skeuomorphic design provides a more direct sense of control for the user and enhances the overall user experience. The Interaction Design Foundation defines skeuomorphism as an approach where interface objects imitate their closest real-world counterparts in both appearance and interaction. A classic example is the recycle bin icon used for deleting files on most operating systems, illustrating how familiar metaphors map real-world actions to digital controls.
To facilitate the study, Ratcliffe employed the Leap Motion controller, an optical tracking device shown in Figure 2, alongside a tablet with integrated control software to allow users to compare different mixing methods. These methods were integrated into Ableton Live, a digital audio workstation, enabling researchers to evaluate how various mixing approaches perform in practice.
Figure 2 The sensor used in the study, adapted from [18]
Ratcliffe used virtual objects to model individual sound sources and their spatial relationship to the listener, guiding the adjustment of gain and stereophonic positioning. Participants were required to have experience with a digital audio workstation (DAW) and some background as a musical performer, producer, or audio engineer. Nine graduate students took part in the pilot study.
Participants in the MotionMix study focused on two core controls: volume level and panning. The channels' left-right panning was driven by the scaled x-position, while the perceived depth (volume) of each source in the mix depended on the objects' z-position. Users could adjust gain by moving a sound source back or forward, and reposition it left or right to place it within a virtual sound stage. Three interface designs were used for all tasks: the MotionMix interface with no visual feedback, the MotionMix interface with the stage-metaphor visual representation, and a virtual mixer within the DAW, viewed as a channel-strip metaphor. After being given the opportunity to familiarize themselves with the controls and expressing readiness for the task, participants proceeded to mix an eight-channel session; the duration of the practice period was not recorded.
Participants in the trials mixed multiple sound sources as if delivering a client-ready mix, with no time constraint, stopping only when they were satisfied with the position of each sound source. The trials were timed from the moment participants first engaged with the system to the moment they informed the author that they were satisfied with their mix. After completing the trials, subjects answered questions about their interface preferences, comparing the two variations of the MotionMix system both with the DAW's channel-strip metaphor controls and with other external mixers they had used.
In the study, 18 participants stated they would adopt such a system into their workflows, while one participant said they would not be interested in integrating the system, and another left the question blank. The results show that 67% of participants preferred MotionMix with visualization, 22% preferred MotionMix without visualization, and 11% preferred Ableton Live for the task.
Comparisons showed no significant difference in accuracy between MotionMix and other mixers, and adding visualization did not substantially increase task duration versus using a digital audio workstation (DAW). However, participants spent more time on the evaluation task when using MotionMix without visualization. The author suggests that gestural control of a DAW offers potential benefits and recommends that future research continue evaluating gestural control systems while ensuring gesture interfaces remain simple and easy to use.
Further Research in Gestural Controls for Audio Mixing Interfaces
Another example of using gestural controls for audio mixing investigates integrating the Leap Motion sensor into an audio mixing interface while delivering a robust graphical user interface (GUI) for the user [20]. The design also aims to extend Ratcliffe's functionality, building on prior work to expand how gesture-based interaction can streamline music production.
Similar to MotionMix, the LAMI system developed by Wakefield, Dewey, and Gale provides a richer stage-metaphor visual interface paired with the Leap Motion controller. The system uses a dictionary of hand-gesture controls and a 3D graphical interface that displays sound sources, the user's hands, and a range of parameter adjustments such as EQ, effects, various modes, and playback triggering. Participants performed a defined mixing task against a benchmark mix in a DAW, and three experts rated the resulting mixes on a 0 to 10 scale.
During testing, several errors in the LAMI system degraded the user experience. For example, an artefact in the smoothing algorithm produced an unintended zeroing behavior, undermining the predictability of the feature. Stabilizing the smoothing algorithm would therefore be necessary for more reliable performance and a smoother user experience.
The study found that the auxiliary Send 2 channel design frustrated users and effectively rendered the feature unusable, and that users preferred controlling parameters individually rather than in bulk. In practice, the GUI frequently failed to follow gestural actions, detracting from the user experience. The authors concluded that mapping multiple parameters to hand movements was incongruent with test subjects' preference for individual parameter control, and many participants described the LAMI system as time-consuming, stressful, or confusing.
While some participants found the LAMI system hard to use, most noted that it was fun to use. The study found no evidence that the stage metaphor itself was cumbersome, but it did indicate that gestural controls for audio mixing with many parametric mappings may detract from the user experience.
VR as a Medium for Sound Source Visualization
Researchers have demonstrated the stage metaphor for audio engineering tasks within virtual reality, showing how stage-based workflows can be integrated with VR for both listening tests and creative applications. The VESPERS system exemplifies this bridge between stage concepts and immersive VR, and has been used to facilitate listening tests as well as creative projects. Developed by the SCENE Laboratory at Stevens Institute of Technology, VESPERS combines a 24.2 multichannel speaker array with a VR headset and controllers to render sound sources as interactive objects in a virtual environment. Built on the Unity engine, VESPERS demonstrates how 3D control schemes and 3D visualization can be used together in a single system to provide an immersive platform for users to interact with sound and perform audio mixing tasks.
The design and evaluation of new user interfaces that do not closely model traditional schema for audio mixing software is by nature a complex task. Thankfully, some researchers have proposed guidance for this process.
For example, the authors argue that audio equipment and software user interfaces rooted in traditional paradigms should be reconsidered to develop more effective and simpler interfacing options, enabling designers to leverage new tools in visual representation and interactive interfaces. This shift can enhance usability and accessibility by aligning hardware and software controls with modern visual and interactive capabilities.
This study proposes a set of guidelines for designing audio-focused user interfaces, including determining the user's skill level with the audio product, performing a task analysis of the audio processes the system will execute, developing paper prototypes for exploratory task analysis before building functional prototypes, selecting a metaphor that conveys sufficient information while clearly representing the concept, and creating an interface that lets users directly manipulate the visualization.
Researchers proposed a framework for evaluating new user interfaces that includes informal evaluation with an expert user consulted throughout the design process, scenario-based simulations to refine prototypes into workable designs, and evaluation based on multiple metrics. The recommended metrics cover efficiency, measured by normalized task completion time; effectiveness, reflected in the accuracy of simple tasks and the subjects' satisfaction when not objectively verifiable; and satisfaction, captured through user preference rankings and observed interaction and feedback. They argued that using many measures can yield contradictory results, and that users may resist change when confronted with interfaces that are radically different from the norm. Additionally, radical redesigns of existing audio mixing paradigms may receive poor ratings due to their individuality.
Methods
Testing Software Design
The program used for the subject evaluation was developed with Unity, a software engine with a native physics engine commonly used for VR applications and video games, building upon prior VR audio mixing demonstrations such as the VESPERS system. A monitor near the test administrator displayed the user's perspective, with the administrator positioned outside the user's field of motion and alongside the program controls implemented in Unity's editor window. An example layout of the system as configured for the evaluation is pictured below in Figure 3.
Figure 3 The testing system’s physical footprint
Two experts were consulted throughout the development process, including one graduate instructor. The software underwent several iterations of pilot testing to ensure it would be able to facilitate the subject evaluation consistently across multiple testing days.
The program is designed to function interchangeably with two control systems. The first system uses two handheld controllers that are tracked by infrared-based base stations mounted on height-adjustable stands out of reach of the subject, ensuring accurate tracking and unobtrusive operation. A diagram of these controllers is provided below in Figure 4.
Figure 4 The HTC VIVE™ handheld controllers, adapted from [22]
The second system uses the Leap Motion optical hand-detection and gesture-recognition controller (shown earlier in section 2.4). Both control modalities are integrated with the Leap Motion Interaction Engine, which provides a consistent method for interacting with virtual objects via either physical controllers or hand-and-gesture input, yielding negligible latency differences and comparable controller resolution [23]. This framework lets users interact with virtual sound objects as if they were physically present, drawing on the GUI approach established by Wakefield, Dewey, and Gale's LAMI system. Participants can use multiple controllers to push and pull individual sound sources, or grab and release objects using either the triggers on the physical controllers or natural grab-like hand gestures detected by the Leap Motion controller.
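To illustrate how both input modalities can feed a single grab-and-release interaction path, the following Unity C# sketch funnels a controller trigger and a detected grab pose into the same grab state. This is a hypothetical, minimal sketch: the class name, the input axis name, and the IsGrabPoseDetected helper are illustrative assumptions rather than the study's released code or the Leap Motion Interaction Engine API.

```csharp
using UnityEngine;

// Hypothetical sketch: unifying trigger-based and gesture-based grabbing for a
// sound-source sphere. Class and member names are illustrative assumptions.
public class SoundObjectGrab : MonoBehaviour
{
    public Transform activeHandOrController;   // set by the tracking layer each frame
    private bool isGrabbed;

    void Update()
    {
        // Physical controllers: the trigger is read through a Unity input axis
        // (the axis name "TriggerAxis" is an assumption for this sketch).
        bool triggerHeld = Input.GetAxis("TriggerAxis") > 0.5f;

        // Hand tracking: a grab-like pose reported by the gesture layer (stubbed below).
        bool gestureHeld = IsGrabPoseDetected();

        isGrabbed = triggerHeld || gestureHeld;

        if (isGrabbed && activeHandOrController != null)
        {
            // While grabbed, the sphere follows the hand or controller position;
            // gain and pan are recalculated elsewhere from the new position.
            transform.position = activeHandOrController.position;
        }
    }

    private bool IsGrabPoseDetected()
    {
        // Placeholder: in the real system this information would come from the
        // Leap Motion Interaction Engine's grasp detection rather than a stub.
        return false;
    }
}
```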
Virtual spheres can be moved in three dimensions to control gain adjustments (depth, measured in decibels) and stereophonic positioning (left/right, expressed as a panning percentage) for individual sound sources assigned to each sphere. This approach builds on the representational objects used in MotionMix, LAMI, and VESPERS, enabling precise 3D spatial audio manipulation and immersive spatialization.
Leaving the height unassigned enables vertical placement of the eight objects to prevent occlusion and crowding; each object in a vertical column therefore shares identical gain and stereophonic parameters. Photograph 1 below shows a participant using the hand-and-gesture control scheme to modify the gain and the stereophonic position of multiple sound sources.
Photograph 1 A participant using the Leap Motion controller to interact with sound objects
During scene playback, the system creates invisible colliders to compute the variables that determine each sound object's stereophonic position and gain. Two vertical planar colliders define the gain-adjustment bounds, running from roughly 0.1 meters in front of the user's headset position to a plane 0.5 meters in front of the monitors. Left and right colliders, centered roughly at the left and right monitor positions, determine the horizontal stereo placement of sound sources. This setup prevents users from placing virtual objects outside their natural range of motion.
First, the software loads the monophonic audio files into the program and then generates virtual representations of the testing-room monitoring system based on measurements of the user-to-speaker distance, the distance between the speakers, and their height from the floor. A script procedurally creates as many representational sound objects as there are audio files in a resource folder, spaces them evenly across the span between the left and right monitors, and names each object after the corresponding audio file. When users interact with any sound object using the controls, a text display appears within range, and the system recalculates the gain (vol) and stereophonic position (pan) every frame until the interaction ends. The vol range is computed by linear scaling between 0.0 dB at 0.1 meters in front of the headset's initial position and a plane about 0.5 meters from the front-edge boundary of the monitors. The pan value ranges from 100% left through centered to 100% right, mimicking common digital audio workstation behavior, with gain values expressed as 0.0–1.0 and pan values as -1.0–1.0 following a constant-power panning law, as shown in Figure 5.
Figure 5 The boundary placement and limit values within the virtual reality environment for the control of audio sources, represented by virtual objects.
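To make the mapping in Figure 5 concrete, the sketch below shows one plausible per-frame computation of the vol and pan values from a sphere's position. It is a minimal sketch under stated assumptions, not the study's released implementation: the boundary distances, field names, and the use of Unity's AudioSource.volume and panStereo properties are illustrative, and the explicit left/right gains are included only to show the constant-power law.

```csharp
using UnityEngine;

// Hypothetical sketch of the per-frame volume/pan mapping described above.
// Distances and field names are assumptions for illustration.
public class SoundObjectMapping : MonoBehaviour
{
    public AudioSource channel;     // mono track assigned to this sphere
    public Transform listener;      // the headset's initial position
    public float nearZ = 0.1f;      // unity-gain plane, metres in front of the listener
    public float farZ = 1.5f;       // minimum-gain plane near the monitors (assumed value)
    public float halfWidth = 1.0f;  // half the spacing between the monitors (assumed value)

    void Update()
    {
        // Depth: linear scale from full gain (1.0) at the near plane to 0.0 at the far plane.
        float depth = transform.position.z - listener.position.z;
        float vol = 1f - Mathf.Clamp01((depth - nearZ) / (farZ - nearZ));

        // Left/right: x position between the monitor colliders maps to pan in [-1, 1].
        float pan = Mathf.Clamp(transform.position.x / halfWidth, -1f, 1f);

        // Constant-power law: the channel gains trade off along a quarter-circle,
        // keeping total power roughly constant as the source sweeps across the stage.
        float angle = (pan + 1f) * Mathf.PI / 4f;   // 0 = hard left, pi/2 = hard right
        float leftGain = Mathf.Cos(angle);
        float rightGain = Mathf.Sin(angle);

        channel.volume = vol;        // 0.0-1.0, recalculated every frame during interaction
        channel.panStereo = pan;     // -1.0 (left) to 1.0 (right), as in a typical DAW
        // leftGain/rightGain only illustrate the law; Unity applies its own pan curve internally.
    }
}
```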
Research Environment
The testing was executed in a small recording studio commonly used for audio mixing and related production work.
An equilateral triangle was formed between the left and right studio monitors and the listening position, with a Genelec 7050B subwoofer placed midway between the monitors at the same distance from the listener as the monitors. The studio monitors and the subwoofer were time- and level-calibrated (74 dB, A-weighted) to ensure the centered panning position appeared directly between the two monitors and the listening level at unity gain remained within a safe range per OSHA noise standard 29 CFR 1910.95. A Focusrite 2i2 USB audio interface fed the Universal Audio 8P with gain at unity and calibrated to +4 dBu, and the USB interface's master gain could be lowered at the start of a test to 60–65 dB(A) as subjects requested, without exceeding safe listening levels. Before each testing round, the VR headset and both controllers were calibrated according to each manufacturer's recommendations, and the base stations were placed in the same positions across trials to ensure consistent virtual object positioning on different testing days.
Stimuli
Eight mono audio tracks at 48 kHz / 24-bit, with minimal editing and processing and similar average loudness, served as the stimuli for the test evaluation. The tracks were derived from a nine-track recording of an acoustic rock song by removing the right mono channel of the overhead drums to produce a mono overhead track. The total length of the song was approximately 3 minutes and 25 seconds, allowing it to loop just under three times during each 10-minute evaluation period. The audio sample filenames were processed to yield simple object names that persist in front of each object for the duration of the test; Table 1 shows the file name to object name conversion. A visual representation of what subjects would see in their headset (with slight lens warping due to the difference between the VR headset display and the on-screen display) is presented below in Photo 2, illustrating all eight sound sources and an example interaction with one sound source using a physical controller.
Table 1 The stimuli presented in the evaluation
Audio .wav File Name → Object Title
Subjects
In this study, 10 participants—two undergraduates, four graduate students, and four instructors—from the audio engineering technology department took part in the evaluation The participants ranged in age from 20 to 65 years, with a mean age of 39.4 years, and they possessed an average of 21.2 years of audio engineering and mixing experience, with individual experience spanning from 2 to 45 years.
Experimental Procedure
Participants completed two 10-minute trials, one with each control scheme in randomized order, attempting to achieve their perceived optimal mix balance among the sound sources. They were informed of the study focus and trial duration. They evaluated each control scheme on three subjective metrics: accuracy (how well virtual objects respond to controller input), efficiency (the ease of control and the speed with which the intended result is achieved), and overall satisfaction with the scheme.
Before each ten-minute evaluation, the VR headset was calibrated for individualized comfort by adjusting placement and focus to suit each participant. Participants were given up to five minutes before each trial to familiarize themselves with the assigned control scheme. Once readiness was indicated, the testing period began, with all sound sources starting simultaneously and looping after each track ended. Participants were instructed to end the test when they believed they had achieved a satisfactory audio mix.
Survey Questions
Following each ten-minute trial, participants completed a survey to rate their perceptions of the accuracy, efficiency, and satisfaction with the volume and panning controls of their assigned control scheme. The questions in this first survey and their corresponding response options are described in Tables 2a and 2b.
Table 2a The survey given to participants after each control scheme’s trial period
Which control scheme did you use?
On a scale of 1-10, how ACCURATE were the controls for adjusting volume?
On a scale of 1-10, how EFFICIENT were the controls for adjusting volume?
On a scale of 1-10, how SATISFYING were the controls for adjusting volume?
On a scale of 1-10, how ACCURATE were the controls for adjusting panning?
On a scale of 1-10, how EFFICIENT were the controls for adjusting panning?
On a scale of 1-10, how SATISFYING were the controls for adjusting panning?
Table 2b The first survey’s response choices, corresponding to the questions in Table 2a
Hand-Detection Controls Physical Controllers
1 - Not accurate at all 2 3 4 5 6 7 8 9 10 - Very accurate
1 - Not efficient at all 2 3 4 5 6 7 8 9 10 - Very efficient
1 - Not satisfying at all 2 3 4 5 6 7 8 9 10 - Very satisfying
1 - Not accurate at all 2 3 4 5 6 7 8 9 10 - Very accurate
1 - Not efficient at all 2 3 4 5 6 7 8 9 10 - Very efficient
1 - Not satisfying at all 2 3 4 5 6 7 8 9 10 - Very satisfying
Participants were allowed to revise their responses to the initial survey after the second ten-minute trial. After completing both ten-minute trials and their associated surveys, they were asked an additional question to express their overall preference between the two control schemes or to indicate no preference. The end-of-study survey administered after both trials is detailed in Table 3.
Table 3 The exit survey of overall control scheme preference
Survey 2 Controller preference exit survey
Q1 Which control scheme did you prefer to use?
Hand-Detection Controls Physical Controllers I did not prefer any individual control
During the ten-minute evaluation periods, participants could provide verbal feedback on the control schemes, and the study recorded their responses as well as the time taken from the start to the end of each period. Each subject completed one trial per control scheme, resulting in a total of 20 trials.
Results
Subject Response Differences Between Controllers, All Subjects
Tables 4a and 4b report the response means for all participants using hand controls and physical controllers in the evaluation; overall, physical controllers tended to yield higher mean evaluation ratings than hand controls, but only a subset of these ratings reached statistical significance at the 95% confidence level (p < .05).
Table 4a Hand controller response ratings for all subjects
Table 4b Physical controller response ratings for all subjects
Independent samples t-tests comparing mean subject ratings and the time spent to complete the task between the two control schemes revealed statistically significant differences in several ratings. Notably, volume accuracy differed significantly (t(19) = -2.60, p = .018).
Efficiency ratings also differed significantly between hand controls and physical controllers (t = -2.287, df = 19, p = .034). In contrast, there was no significant difference in the time spent during each 10-minute evaluation period (t = 0.008, df = 19, p = .993). The results of the t-tests and the one-way ANOVA comparing the two control schemes are summarized in Table 5 below.
Table 5 Comparisons between controls (independent samples t-test, one-way ANOVA)
Figure 6 presents a comparison of the mean ratings across six categories provided by all subjects. The physical controller scheme scored higher on average than the hand-controlled system in every category, though not all differences reached statistical significance.
Figure 6 Subject response means & 95% confidence interval between all groups.
Differences Between Inexperienced & Experienced Subjects
Differences between two respondent groups, those with under 10 years and those with over 10 years of experience in audio engineering and mixing, were further investigated. The response means for the hand controls showed no statistically significant differences in ratings across any category, and no statistically significant difference in time spent during the evaluation. However, the Volume Accuracy p-value of .073 approached significance, suggesting a difference in Volume Accuracy ratings between the under-10 and over-10 year experience groups for the hand-and-gesture controls. The response means and the repeated analysis procedure for the hand controls are presented in Tables 6, 7, and 8.
Table 6 Hand controls subject responses, subject experience under 10 years
Table 7 Hand controls subject responses, subject experience above 10 years
Table 8 Comparisons between experience groups (independent samples t-test, one-way ANOVA) when using hand controls
GROUP (HAND CONTROLS, ALL SUBJECTS) t p
An additional analysis was conducted to determine whether there were statistically significant differences between the two experience groups in the physical controller scheme. The results showed no statistically significant differences in any subject rating category or in the time spent during the evaluation period. However, the Panning Accuracy p-value of .073 approached the conventional significance threshold of p < .05, indicating a potential difference in Panning Accuracy ratings between experience groups for the physical controllers. The means and the accompanying analyses are presented in Tables 9, 10, and 11.
Table 9 Physical controls subject responses, subject experience under 10 years
Table 10 Physical controls subject responses, subject experience above 10 years
Table 11 Comparisons between experience groups (independent samples t-test, one-way ANOVA) when using physical controllers
The comparison of group means is visualized in bar charts in Figures 7 and 8 below, displaying both the means and 95% confidence intervals for the subject ratings split into the under-10-years' experience group and the over-10-years' experience group.
Figure 7 Comparison of means between less and more experienced groups’ responses for the hand controls, with 95% CI
Figure 8 Comparison of means between less and more experienced groups' responses for the physical controllers, with 95% CI
Discussion
Quantitative Results
Mean ratings for the physical controllers were slightly higher than those of the hand-and-gesture controls in every category amongst the total participant group. These results were repeated when the sample group was split into more experienced (>= 10 years) and less experienced (< 10 years) users, with both groups reporting higher mean ratings for the physical controllers than for the hand-and-gesture controls. Several of these differences were statistically significant. It was not possible to test for differences between control schemes within the experience-split groups due to the small sample size (n = 5).
In the data set for all subject responses, the differences in mean subject ratings between the physical controllers and hand-and-gesture controls for the categories of Volume Accuracy (p = .018), Volume Efficiency (p = .018), and Panning Efficiency (p = .034) were statistically significant (p < .05). Volume Satisfaction (p = .057) and Panning Satisfaction (p = .053) approached statistical significance, while Panning Accuracy (p = .214) did not reach statistical significance.
Compared with hand-and-gesture controls, the physical controllers achieved higher average ratings across all subjects and across experience-split groups, a pattern that is reflected in the final preference survey (Figure 9). Seven subjects preferred the physical control scheme, one subject preferred the hand-and-gesture controls, and two subjects did not express a preference for either scheme in the evaluation task.
Figure 9 The percentage of subjects’ preference between the control schemes
Across all comparisons of control schemes and experience groups, the analysis found no statistically significant differences in time-on-task, and there were no observable trends indicating any systematic difference between the groups.
Dividing the sample into two groups of five subjects, those with 10 or more years of experience and those with less than 10 years, the study found that, on average, both groups rated the physical controllers higher than the hand-and-gesture controls. A further analysis explored whether ratings differed significantly between the experience groups. None of the comparison categories reached statistical significance (p < .05), although Volume Accuracy for the hand controls (p = .073) and Panning Accuracy for the physical controls (p = .073) showed differences between the two experience groups that neared the threshold of significance.
Subject Verbal Response
Many subjects mentioned the practicality of being able to directly interact with sound channels.
Across 40 iterations of the testing program, participants most often preferred the physical controllers over the hand-and-gesture scheme, citing improved responsiveness of the physical controls. Several testers noted that using the triggers to drag and drop sound objects felt more effective than the grabbing gesture detection offered by the hand-and-gesture controls. Some participants reported responsiveness problems with the hand-and-gesture system, particularly when trying to grab objects without accidentally colliding with items already placed. A few suggested adding a focus or locking mechanism for sound sources once they are positioned. Others expressed interest in using the system to mix their own records.
Comparison to Prior Research
Historically, research on hand and gestural controls for audio applications has primarily compared these interfaces with keyboard-and-mouse setups or MIDI controllers and relied on screen-based visualization, with findings generally indicating a preference for hand- and gesture-based control. The data presented here provide evidence that participants preferred physical controllers over the hand-and-gesture control system for mixing audio within a virtual reality soundstage.
Conclusions
Within a VR stage-metaphor representation used to mix multichannel audio, physical controls were preferred to optical hand-gesture detection controls, and subject ratings revealed statistically significant differences consistent with this preference. The study found that participants largely favored physical controllers when interacting with objects in a basic VR audio mixing environment. While the two control schemes differed in self-reported efficiency, the difference in task completion time was not large enough to be deemed significant.
Across all measured categories, physical controllers outperformed hand-gesture controls, yielding higher mean accuracy and greater satisfaction in both volume and panning tasks. Even with a small sample size, many differences between the interfaces reached statistical significance, and nearly all other comparisons approached that threshold. Most subjects preferred physical controls, describing them as more perceivably accurate, efficient, and satisfying than the hand-gesture system, though a few reported isolated frustrations with the testing software. No difference was found in the time required to complete the evaluation task. The researcher attributes the similar completion times to the novelty of mixing in virtual reality for many participants and to comparable tracking latency between the two control schemes when used in the program.
Even though more experienced subjects tended to rate individual metrics lower on average, these differences between experience groups were not statistically significant. Even when participants were divided into two experience groups, both groups consistently preferred physical controllers over hand-gesture controls and rated the physical-control categories higher than their hand-gesture counterparts.
Study findings indicate there are significant differences in subject-reported accuracy, satisfaction, and overall preference between hand-detection controls and physical controller systems. The null hypothesis is partially rejected, with evidence of differences in preference, accuracy, efficiency, and satisfaction between the two control schemes evaluated. Overall, most participants favored the physical controller system for the experimental task.
Future research into VR-based stage-metaphor audio mixers should investigate differences between more complex control mappings across two sets of physical controllers to determine how interface variations affect multichannel mixing performance. Expanding participant numbers will enable additional analyses and more conclusive evidence, while developing a hybrid control system, such as a glove that provides hand tracking and gesture input together with the responsive drag-and-drop functionality of physical controllers, could address the functionality issues observed in this study. If the study is repeated, detailed user interaction logging would offer deeper insights into how users engage with control schemes in VR audio mixing, and including a time-on-task measure may be less relevant given the open-ended nature of the task. A more focused study could ask users to match individual channel values to a reference, yielding clearer metrics for evaluating control schemes and informing future design decisions for VR mixing interfaces.
Exploring virtual reality audio workstation design benefits from comparing various control methods to traditional multichannel audio mixing, including console-based mixing and adjusting software parameters with a keyboard and mouse, and assessing a VR soundstage that uses physical controllers and/or hand- and gesture-based controls. As researchers have already integrated VR with gestural controllers to creatively augment physical instruments such as keyboards, there remains ample room to develop and test new and innovative audio control schemes for both creative and corrective adjustments.
Based on participant feedback, the researcher will continue developing features for the test program used in this study to improve usability and experimental accuracy The software has been released as open-source under the MIT License to enable reuse, modification, and collaboration in future projects This approach aims to facilitate the advancement of audio engineering practice and research by supporting broader adoption and ongoing innovation.
[1] M Walther-Hansen, “New and Old User Interface Metaphors In Music Production.” Journal on the Art of Record Production (JARP 2017) Issue 11, (2017) DOI: http://www.arpjournal.com/asarpwp/content/issue-11/
[2] D Daley, “The Engineers Who Changed Recording: Fathers Of Invention.” Sound On Sound (Oct 2004) DOI: https://www.soundonsound.com/people/engineers-who-changed- recording
[3] D Gibson, “The Art Of Mixing: A Visual Guide To Recording, Engineering, And Production.”
[4] S Gelineck, D Korsgaard, and M Büchert, “Stage- vs. Channel-Strip Metaphor: Comparing Performance When Adjusting Volume and Panning of a Single Channel in a Stereo Mix.” Proceedings of the International Conference on New Interfaces for Musical Expression (NIME 2015)
[5] J Mycroft, T Stockman and J Reiss, “Visual Information Search in Digital Audio Workstations.” Presented at the 140th AES Convention, Convention Paper 9510 (May 2016)
[6] R Selfridge and J Reiss, “Interactive Mixing Using Wii Controller.” Presented at the 130th AES Convention, Convention Paper 8396 (May 2011)
[7] M Lech and B Kostek, “Testing a Novel Gesture-Based Mixing Interface.” J Audio Eng Soc., Vol 61, No 5, pp 301-313 (May 2013)
[8] S Bryson, “Virtual Reality in Scientific Visualization.” Communications of the ACM, Vol 39, No 5 (May 1996)
[9] A Kuzminski, “These Fascinating New Tools Let You Do 3D Sound Mixing – Directly In VR.” A Sound Effect (Aug 2018) DOI: https://www.asoundeffect.com/vr-3d-sound- mixing/
[10] T Mäki-Patola, J Laitinen, A Kanerva and T Takala, “Experiments with virtual reality instruments.” Proceedings of the International Conference on New Interfaces for Musical Expression (NIME 2005) pp 11-16 (May 2005)
[11] “DearVR,” DearVR Retrieved from Web DOI: http://dearvr.com/
[12] J Kelly and D Quiroz, “The Mixing Glove and Leap Motion Controller: Exploratory Research and Development of Gesture Controllers for Audio Mixing.” Presented at the 142nd AES Convention, Convention e-Brief 314 (May 2017)
[13] F Rumsey, “Virtual reality: Mixing, rendering, believability.” J Audio Eng Soc., Vol 64,
[14] R Campbell, “Behind the Gear,” Tape Op – The Creative Music Recording Magazine, No
[15] B Owsinski, “The Mixing Engineer’s Handbook: Second Edition.” Boston: Thomson Course Technology PTR (2006)
[16] J Ratcliffe, “MotionMix: A Gestural Audio Mixing Controller.” Presented at the 137th AES Convention, Convention Paper 9215 (Oct 2014)
[17] K Göttling, “What is Skeuomorphism?” The Interaction Design Foundation (2018)
[18] M Young-Lae and C Yong-Chul, “Virtual arthroscopic surgery system using Leap Motion.” Korean Patent KR101872006B1 issued June 27, 2018 DOI: https://patentimages.storage.googleapis.com/4c/8d/85/55932cf18e50d9/112017000213110-pat00001.png
[19] J Wakefield, C Dewey and W Gale, “LAMI: A Gesturally Controlled Three-Dimensional Stage Leap (Motion-Based) Audio Mixing Interface.” Presented at the 142nd AES Convention (May 2017)
[20] R Graham and S Cluett, “The Soundfield as Sound Object: Virtual Reality Environments as a Three-Dimensional Canvas for Music Composition.” Presented at the AES International Conference on Audio for Virtual and Augmented Reality (2016)
[21] C Dewey and J Wakefield, “A Guide to the Design and Evaluation of New User Interfaces for the Audio Industry.” Presented at the 136th AES Convention, Convention Paper 9071 (Apr 2014)
[22] “About the VIVE™ Controllers.” HTC Corporation (2019) DOI: https://www.vive.com/media/filer_public/17/5d/175d4252-dde3-49a2-aa86- c0b05ab4d445/guid-2d5454b7-1225-449c-b5e5-50a5ea4184d6-web.png
[23] “Interaction Engine 1.2.0.” Leap Motion (Jun 2018) DOI: https://developer.leapmotion.com/releases/interaction-engine-120
[24] “TB1.” The Professional Monitor Company Ltd (2019) DOI: https://pmc- speakers.com/products/archive/archive/tb1
[25] “Genelec 7050B Studio Subwoofer.” GENELEC (2018) DOI: https://www.genelec.com/studio-monitors/7000-series-studio-subwoofers/7050b-studio- subwoofer
[26] “OSHA Noise Regulations (Standards-29 CFR): Occupational noise exposure.-1910.95.” Occupational Safety and Health Administration, Appendix E – Acoustic Calibration of Audiometers OSHA, Vol 1, No 9, p 9 (1996)
[27] “Leap Motion Orion.” Leap Motion (Jun 2018) DOI: https://developer.leapmotion.com/orion/
[28] J Desnoyers-Stewart, D Gerhard, and M.L Smith, “Augmenting a MIDI Keyboard Using Virtual Interfaces.” J Audio Eng Soc., Vol 66, No 6, pp 439-447 (Jun 2018)
“Channel (audio) – Glossary” Federal Agencies Digitization Guidelines Initiative (n.d.)
Retrieved from Web site: http://www.digitizationguidelines.gov:8081/term.php?term=channelaudio
“HTC Vive,” Wikipedia (n.d.) Retrieved from Web site: https://en.wikipedia.org/wiki/HTC_Vive
J F Hair, R E Anderson, R L Tatham, and W C Black, “Multivariate Data Analysis”, 5th ed
“Unity User Manual (2018.3).” Unity Technologies (2018)
“GitHub: Tactile Mix.” Justin Bennington (2018) Retrieved from Web Site: https://github.com/justin-bennington/tactile-mix/
Virtual Environment Programming
Full Subject Survey Response Data
Table 12 presents the subject response data collected from both surveys, including the exit survey, along with the task completion times measured in seconds. This dataset underpins the analyses in Section 4.0 of this paper.
Table 12 The full subject response data set
Full Subject Verbal Response Data
Being able to solo or lock channels is there, but the field feels too narrow and parameter updates could be faster; a mute and solo feature, plus reverb zones, would improve the workflow The hands-on controls add a tactile element, yet they can feel sticky, and while the hands are more novel, the physical controllers still perform more reliably—if the hands could match the responsiveness of the hardware, I’d prefer it Overall, the hands don’t perform as well as the physical controls, and it would be great if you could look at the audio sources while using a joystick panner The interface quality recalls early Pro Tools.
My first experience with virtual reality was cool and performed far better than I expected VR proves especially beneficial for younger students, offering an easier, glance-friendly way to grasp concepts like the stereo field image Putting everything in its own immersive world makes a lot of sense, and I prefer using the controllers over hand tracking, because hands can have trouble with proximity in VR.
Using hand controls in virtual reality was an enjoyable experience, with the representation of hands in VR being particularly cool, even though there was no haptic feedback For better usability, visual icons should accompany or replace labels While some object collisions were expected, the hands sometimes collided with other objects during interaction, which interrupted immersion Controllers offered an easier interaction due to their lower profile compared with the hands A simple tap-to-mute or tap-to-solo function would enhance control Overall, it was a super enjoyable VR experience, and it’s exciting to see these features put into practice.
I can't wait to mix records like that, and adding a solo button plus expanded processing options would dramatically streamline the workflow in this DJ software, signaling a clear step toward the inevitable future of music mixing technology; the demo reinforced that impression more than I expected A laser-pointer style control could enhance interaction, even though the tactile feedback was satisfying, and the overall system sometimes felt crowded, prompting me to wonder whether the speakers' physical placements matched their virtual positions.
I envision this technology thriving in a modern studio environment The hand-gesture control is awesome and shows great potential, even though it took some getting used to The testing task felt somewhat limited, but I loved the experience and am genuinely impressed.
I want to use this music mixing system to craft my own tracks, and for the future I’d love to see spectral effects and reverb integrated, along with a trigger that signals when an effect is instantiated in the signal chain.
During the trial, physical controllers offered far superior control, with click-and-drag interactions proving easier than hand-detection options If hand detection is needed, physical sensors would be the way to go The main challenge with hand tracking was figuring out when you could touch an object, and I also had trouble knocking things around Because the physical controllers only take effect after you press a button, they felt more predictable and satisfying, and I could see myself using them It was odd how hot the face got.
“Weird to get used to, especially the hands Sort of distracting, didn't look at visual labels, used ears.”
Virtual reality (VR) mixing is intriguing, but I prefer physical controllers Hand tracking sometimes pushed objects away, creating a distracting experience and making precise placement difficult The physical controllers’ click-and-drag action felt simpler and more reliable for the tasks at hand They handle multitasking more effectively, and deliver more dynamic interactions than hand-tracking controls I quickly got used to the physical controllers.
“Impressive I could not see difference between the two control schemes other than the lack of drag-and-drop functionality in the hand-controlled method.”
This body of work is dedicated first to my parents, Bud and Donna, to my talented sister Olivia, and to Jake Among all the possibilities I could explore—exist, observe, create, and learn—I’m grateful to share this journey with the greatest family I could imagine.
To the experts at Warner Music Group, your wisdom, generosity, and industry expertise gave me a strong start and set me on a fulfilling lifelong career in music Your belief in my potential opened doors and created opportunities in the music industry that I will carry forward I’m grateful for your supportive, gregarious mentorship that helped shape my path and continues to guide my work today.
To my instructors and faculty – thank you for imparting your wisdom, which facilitated my goal to pursue the highest standard of academic success.
To my pioneering classmates and colleagues at Belmont University—Paul, Morgan, Austin, Owen, Tyler, Chris, and Jim—your willingness to welcome me as an outsider and the love you showed me are beyond words Your support transformed my experience and reminded me that belonging blooms in a community of true friendship.
Mentors Will Wright and Andrew Gower taught me early on that learning through experimentation with complex systems and navigating them intuitively matters more than merely winning or losing This mindset has shaped who I am today and turned my earliest opportunities to learn into defining moments of growth.
To all the subjects who took part in the evaluation, thank you for your participation, timeliness, and enthusiasm.
To Clarke Schleicher and Paul Worley, who taught me the value in seeing the “forest for the trees”.