What Used To Be: Night of Deployment Story
*You get on a call, you deploy code, and you try to run your test automation. Janice asks how to run the test automation, and after digging through a 15-page document, figures out how to kick it off. The automation takes 35 minutes to run, and then produces an error. You start investigating the error, and in the meantime re-run the automation. Halfway through the run, Barb, who is a BA, comes off mute and says she verified manually. The automation finishes with a different error. The next night, Barb asks if we really need to wait for the automation, because who really wants to waste their time at 9pm? This is a real scenario that we used to go through. How many of us in the audience have gone through this ourselves? How many of us have had to run test automation that our team didn’t really trust? We sure did…but for us, that was the past.
What Is: Trusting Our Robots Story
*You get on a Teams chat; the deployment coordinator needs to run our test automation. He or she presses a button on the quality dashboard, kicks off the tests, and waits about 10 minutes. He or she then examines a clear results dashboard, and if failures exist, watches the video and looks at the screenshots. If the failures are a result of our code push, we roll back. The entire push process takes us a few minutes, and we wait for green checkmarks, since everything is automated. No green checkmarks, no push. Period.
What is the difference between the scenarios? In the first, we incur risk, increase effort, and introduce the potential for manual failure. In the second, we rely on a known baseline and quickly understand our surface of functionality.
All because of our trust in robots. That’s right, friends. My team has a reliable, automated safety net: a way to quickly ensure that our most important functionality works, without doubt.
How did we get to this point? Does this sound like a story you want to hear? Then stick around because that is what this talk is going to be all about.
We will be discussing how to build trust in test automation and why it is one of the most important parts of your entire test automation journey. I hope to convince you that the best way to do this is to focus on three key components of our test automation efforts.
We will look at each of these core components and, based on my team’s lessons learned, try to establish how you can help your team solidify trust in your test automation. We will hear why these concepts are important, what we did, and what you can learn to improve each core piece.
A Simple Formula
Trusting The Robots Through Reliability, Visibility And Speed
-Achieving Trust in Test Automation = Reliability + Visibility + Speed
*Why is reliability super important to test automation? Because without it, you degrade confidence in whether the machine works at all. Would you wait for a bus that only shows up one time out of five?
How Lack Of Reliability Degraded Trust
Our team faced this problem directly. Our test suite executions were inconsistent: sometimes they would pass, sometimes they would fail. When they failed, sometimes a re-run would help; sometimes it wouldn’t. This was a problem because it eroded our team’s belief that the test automation was useful.
Why Was Our Test Automation Not Reliable?
*After examining our reliability problem, we started to discover a pattern that was the result of years of behaviour.
*At one point we had a large Selenium C# suite, a few Protractor TypeScript suites, and Katalon-based suites.
*Our team made many valiant efforts to write test automation, but they were all individual and disjointed.
*Initially the efforts produced great POCs, login scripts, hello world demonstrations, etc.
*But as time went on, engineers moved on to business features, and when problems came up, they pulled the plug or patched things with duct tape and bubble gum.
*This led to silos, lack of understanding, and eventually lack of reliability, since we were not sharing knowledge about a uniform solution.
Solution: Uniformity Based On Consensus
*As soon as we realized this lack of uniformity, we started searching for a technology that would work for most teams. We asked, inquired, and listened to our devs.
*We chose a technology that everyone who built front ends was excited about (because the devs told us so) and rewrote the existing requirements into a small, simple test suite that covered the most important pieces of our business.
*We used Cypress (TypeScript), because our devs were excited about it and it is the biggest player in the Angular world
*To create a uniform developer experience, we used the same CI/CD platform that we use to deploy our code to initiate the test runs. The tests became just another pipeline.
*We realized that a reliable suite requires constant attention. In order to facilitate this, we had to maintain constant interest. We needed the dev teams to be interested in our test automation solutions, and by switching to something they were excited about, and recognized, we achieved that.
*Why is visibility super important for a test automation suite? Would you board a bus if you couldn’t tell where the next stop was, or if the driver could not see out the front window?
How Lack Of Visibility Degraded Trust
While observing how our team used our test automation, we saw another problem materialize. Visibility was never considered while building our original suites, and it showed. It was not obvious where the results were displayed, or how to debug the failures. The lack of visibility led to our team losing confidence in the results and finding other ways to verify whether the suite executed or not.
Why Were Our Test Suites Hard To Examine?
A. Result Visibility
We found that one of the big reasons our team was hesitant to trust our test automation was that it was not easy to tell what it did. We did not visualize the result, or the flow. We hid it away in CICD and in individual walking, talking knowledge repositories. For a result to be interpreted, you had to navigate a gauntlet of friction. From access issues to not being sure what actually ran, it was difficult to determine whether a passing test actually tested what you expected.
B. Visibility During Test Debugging
Another big problem for the visibility of our test automation was the process of debugging failed results. Not being able to quickly tell whether the test scripts were problematic or the system under test was broken led us to default to blaming the tests. We would examine a failure log, pull the tests to a local computer, and re-run them locally. Our devs would spend hours figuring out issues related to running the tests, then would not be able to replicate the issues encountered in the CICD environment, and would be discouraged from going through the exercise again. The easier path was to turn off the test when an inconvenient result was encountered, and validate it manually.
Solutions: Result Visibility (Visualizations)
*To expose what was going on in our tests, we made a conscious choice to pick a test automation framework with easy-to-interpret visualizations out of the box. When examining Cypress, we noticed that screenshots and videos were produced automatically for each test. That swayed us in the direction of the framework.
*While designing the CICD flow, our choice of execution platform was also swayed by the ability to clearly see a test’s flow and status. We went with an OSS solution that could show us whether a test passed or failed, and why, easily and in an obvious spot. It was important for us to have a clear visualization of the test outcome.
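As a concrete illustration, the artifact capture we leaned on is plain Cypress configuration. Here is a minimal sketch; the `baseUrl` is a hypothetical placeholder, not our real application:

```typescript
// cypress.config.ts -- minimal sketch of "visibility out of the box".
// video and screenshotOnRunFailure are real Cypress config options;
// baseUrl below is a hypothetical Angular dev server.
import { defineConfig } from 'cypress'

export default defineConfig({
  video: true,                  // record a video of every spec run
  screenshotOnRunFailure: true, // capture a screenshot whenever a test fails
  e2e: {
    baseUrl: 'http://localhost:4200',
  },
})
```

With settings like these, `cypress run` writes its artifacts under the project’s `cypress/videos` and `cypress/screenshots` folders, so a failed run already comes with evidence attached.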
Solutions: Visibility During Debugging
*As part of choosing the test automation framework, we also kept in mind that our dev debugging experience was horrible. Most of our devs working on the front end wrote code in TypeScript. Our suites were not written in TypeScript. For a dev to examine a problem, he or she had to pull down a code base they were not familiar with and could not run easily. Some failure analysis required our devs to spin up a virtual machine just to run the test suite on their local computer. Best case scenario, you ‘just’ had to align the version of Chrome with the test suite’s expectations, which in a centrally controlled IT environment is not an easy hurdle to overcome. Just to get the suite in question to run, devs had to spend a significant amount of time (sometimes hours). Then there was reproducing the problem, which was usually not easy, as most failures were related to test automation timing on the execution environment.
*As a result, when we were searching for a way to raise trust around our efforts, we realized that reducing friction for debugging was a key component. We needed to be able to quickly understand where our problems were coming from: our environment, our application, or our test code. We wanted more detail in debugging. We needed to see as much as possible “out of the box”, without adding more code for visibility. This would reduce friction and lead to faster root cause analysis. The tool we ultimately landed on was Cypress, and one of the reasons we chose it was its clearer visibility. Because of the way the testing framework is designed, it exposes not only the UI actions but also the network traffic behind the scenes, giving an insider view of what is going on in the application. This results in a much faster investigation pattern for teams, leading to less wasted time and an overall increase in trust, due to clearer understanding of issues. When test issues can be attributed to clear failure reasons right away, the mantra of tests being flaky goes away.
*Why is speed super important for a test automation suite? Would you ride a bus that took longer to reach a destination than walking?
How Lack Of Speed Degraded Trust
*When our suites ran, each test took a long time, and tests ran consecutively. Because of this, the suite was not useful when time was of the essence (e.g. bug verification). Waiting 30-40 minutes for verification of high-level functionality that could be manually verified in 10 was frustrating, and did not build trust in the suite’s ability to return results when immediately needed.
Why Was Our Test Automation Slow?
A. Proxy Based Test Automation Tooling
The core problem with our traditional test automation approach was the way our individual tests were executed. We started with a test suite run by a framework that required a proxy to carry out browser actions. Additionally, all of our tests were run one after another. These two factors together (using a middle man for browser actions and stacking the tests) produced a run time longer than a manual execution of the same tests. As a result, our team was more motivated to execute the tests manually than to wait for the test automation to finish.
B. Velocity While Scaling
The other problem we encountered with respect to speed was during the scaling of the suite. Because of the design of the execution framework, we could not scale without drastically increasing the suite’s runtime. This runtime increase was initially tolerated, but eventually led to a situation where the suite was simply not executed in its entirety during the day. Some team members close to the suite would pick and choose individual tests, or groups of tests, but execution as a whole was limited to prod push time and doubled the length of our team’s night-time meetings. Additionally, the lack of scaling stifled discussion about adding scenarios. Engineers were hesitant to increase the number of tests, and suite growth stagnated.
The long-running suite, coupled with the inability to scale, put us in a position where our teams were choosing to execute the test suite manually. People were literally following the test suite steps, but instead of letting the robots do it, were doing it themselves. They would walk to the bus stop, then choose not to get on the bus and walk to the destination. Our teams took the time to build test automation, and it was being ignored in the most explicit way.
Solutions: Modern Testing Framework
During our investigation of possible replacement test automation frameworks, we started to notice a pattern in how the frameworks handled browser interactions. Traditionally, the majority of testing frameworks used a proxy to perform actions on a browser, via an API provided by the browser. Recently, however, a different approach had been pioneered by a few testing frameworks: an in-process Node service running inside the browser process, as opposed to external to it. The surface of control and the speed at which commands could be sent to the browser both increased, and we decided this approach was much more attractive as we re-designed our tests. We went with it and chose Cypress as our testing framework of choice. Using an in-process testing framework sped up our testing by about 30% per test.
Solutions: Maintaining Velocity While Scaling
While we gained about 30% per test by switching to an in-process testing framework (Cypress), the fact that our tests ran consecutively still hampered our test suite’s overall execution time. We realized that to future-proof our suite, we needed its run time to reflect the longest single test…even if we added many tests. For this, we decided we had to find an execution framework that allowed parallel test execution at scale. After a brief search, we decided that, given our desire to control as much of our own domain as possible, we would go with an open source test execution environment called “Sorry-Cypress” and deploy it to Kubernetes. The choice of Kubernetes was intentional, because it gave us an easy test execution scaling strategy. We configured our deployments so that each test got its own execution container, and containers ran in parallel. By adopting this strategy, our suite’s run time was approximately the same whether we were running 1 test or 6.
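The scaling math behind the one-container-per-test strategy is simple: a serial suite’s run time is the sum of all test durations, while a fully parallel suite collapses to the longest single test. A small sketch with made-up durations (ignoring container spin-up and reporting overhead):

```typescript
// Sketch: why one-container-per-test keeps suite time roughly flat as tests are added.
// Durations are illustrative, in minutes.

function serialRuntime(durations: number[]): number {
  // Tests stacked one after another: total time is the sum of all tests.
  return durations.reduce((total, d) => total + d, 0);
}

function parallelRuntime(durations: number[]): number {
  // One container per test, all started together: total time is the longest test.
  return durations.length === 0 ? 0 : Math.max(...durations);
}

const sixTests = [4, 3, 5, 2, 6, 4];

console.log(serialRuntime(sixTests));   // 24 minutes when stacked
console.log(parallelRuntime(sixTests)); // 6 minutes -- the slowest single test
console.log(parallelRuntime([6]));      // 6 minutes -- same whether 1 test or 6
```

Adding a seventh test only moves the parallel number if that test becomes the new slowest one, which is exactly why engineers stop being hesitant to grow the suite.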
Summary & Call To Action
- So in summary, we talked about Trust = Reliability + Visibility + Speed
- Remember the story about our push nights? The lack of confidence, the inconsistency of execution, the lack of result understanding, and how long it all took?
- Remember how we solved this?
- To improve reliability, we empowered our team with respect to test automation by using tooling the teams were excited about, which increased the focus on constant care (picture of test suite execution)
- To improve visibility, we built methods for our team members to easily see results, both positive and negative! (pictures of test suite failure)
- To improve speed, we moved to an in-process testing framework and refactored our test suite to run at the speed of the slowest single test, so our team did not have to wait for results longer than it takes to make a fancy coffee (time of execution)
- What is the common thread in all these actions? There are a lot of changes and lessons, and technical improvements, but the common thread is not what we originally thought it was.
- Turns out the biggest problem of using robots to test was really a people problem
Was our biggest lesson technical? Was it that we should’ve always been using Cypress & Kubernetes? Nope.
Our single biggest lesson while re-building trust in test automation…was to focus on our team while designing and building our test automation, and let the robots do the rest.