At Best Buy, I was the Lead UX Designer for the APEX product. APEX was an in-house MVP ML testing engine built as an alternative to our third-party A/B testing setup. I won't attempt to dive into the data science behind these ML tests, as it was complex and took quite a while to grasp.
Suffice it to say, APEX users needed a way to not only start and stop a test, but also receive cues to help them gauge the 'health' of a test in progress and fix any problems with it. Here's the use case in a nutshell.
As an APEX user, I need to be able to run an ML test and all of its phases through to completion. Before, during, and after the test, I need to be able to monitor the health (accuracy) of the data being analyzed and the reliability of my data sources, based on a unique set of process-, analytics-, and programming-specific metrics (Trials, Actions, Strategies).
For the APEX team, tests would often fail without developers or data scientists knowing, or knowing why. Sometimes we would hear about data source errors from participating Digital Product Owners, our clients. For context, APEX ML Tests bootstrap the digital products that customers interact with on the BestBuy.com website in order to 'learn' from usage.
On the heels of a very insightful Service Blueprinting exercise that I hosted, our Product Owner was eager to dabble in more design thinking workshops and suggested that we go full tilt into a Google Design Sprint to solve this problem. You can imagine my delight! I'd never worked with a team that was so enthusiastic about the designer's toolkit. I kicked off the design sprint with a brief presentation to the team outlining our goals and agenda.
Click here to access the entire project write-up in Figma!
Day 1 was a tremendous success: we were able to leverage everyone's diverse experience with managing, planning, and monitoring tests to map out a pretty clear user journey. It was clear to me that the team had not really come together before to acknowledge each other's areas of expertise and personal experiences within their respective realms.
Watching them all come together to learn from each other was thrilling. With everyone there to chime in with their perspective, we were also able to align on a shared, honest picture of what running and monitoring a test is actually like, and to see each test phase from many different points of view.
As a designer this day was exceedingly helpful. I’d been able to hold my own against the immense cognitive complexity of the product and ML test monitoring in general, but having this day to just bask in all of the raw user data was very enlightening even after months on the job.
Day 2 was exciting and challenging. Our task was to synthesize what we had learned on Day 1 in order to define, or at least form strong hypotheses about, the ideal user story our sprint would follow, the metrics by which we would measure the success of our design investigation, and the Golden Path of our ideal user journey.
It took a lot of time and discussion to agree on our ideal user journey, because between Days 1 and 2 we began to realize just how segmented our users were. Though they were all working toward the same goal of facilitating and monitoring a successful machine learning trial, our developer users needed different signals for their tasks and analysis than the data scientists did, and our Product Owner users had their own needs as well.
Each of them had a different working definition of data health, and it took some time to decide which data needed to be surfaced first for immediate action and how to prioritize the needs of each user segment.
Ultimately we were able to align on a needs hierarchy based on the domino-effect monitoring and alerting protocol the team followed in the current state. Who sees the smoke? Who rings the alarm? Who puts out the fire? This reasoning is reflected in our Golden Path.
Sketch day is always fun. As a designer, it's practically a truism that the person who thinks they can't 'sketch' will have a brilliant idea that needs to be shared with the team and, most likely, included in the prototype design iterations. In our case it was an idea for a test overlay panel that could follow the user across the different screens of the test environment. This was in fact a pattern I had been sketching and had intended to use in the prototype, given what I'd already learned about the problem space.
The sketching phase corroborated some of what we had agreed on during Day 2 about which metrics were important to surface, but we were still at a user-segmentation crossroads and needed to figure out which metrics mattered most to whom.
Developers are looking for high-level data-health indicators about the Decision (ML test) up front, so they know where to look for potential data problems in either the Action metrics (the digital product's end-user web experience) or the Trial metrics (the number of times the product appears on the site).
A data analyst, on the other hand, wants to know specific information about the Trials in relation to each Action (a test can have multiple Actions).
And in the future, this pattern would need to support the APEX end user, a Digital Product Owner, who is about as new to this as you or I. This person would need the system to point out critical information so that they can report it correctly to an APEX developer or data scientist.
Given what we had discussed during our Sketch Day and the interactions and ideas we voted on, I set to work designing the monitoring widget pattern for the in-progress ML Test (Decision) and for the ML Test (Decision) Overview pages. I also designed a mock-up of the Test Monitoring Details page, paying close attention to the information hierarchy we had discussed during Sketch Day. I designed a flow for a healthy Decision and another for a Decision with a data failure.
After I got my prototype squared away, I hosted usability sessions with a few of our senior developers. Though the monitoring widget and color-highlighting patterns were well received, we still had to align on which information to show and how it fit into each user's monitoring workflow. We also needed to determine what information would appear on which parts of the interface as the user navigated between screens.
We created diagrammatic groupings of the metrics based on interface type (full page, dashboard, or at-a-glance widget) and priority (general high priority, Strategy-specific, Decision Configuration-specific, Trial-specific).
We concluded our engagement by coming together around a Design Brief that would explain our intent for what we wanted to build. This was another day of very fruitful discussion, but we were still caught in a whirlwind trying to finalize our data metrics hierarchy and the display requirements for assessing overall data health for an ML Test (Decision).
A lot of these questions were left unanswered. We ultimately determined that we would need to return to the Decision Overview page in another sprint, and that the At-A-Glance widget could provide all the information necessary to solve the immediate problem: being able to monitor crucial Decision metrics rather than finding out after the fact that a test had failed.
Click here to access the entire project write-up in Figma!