This question was passed around internally last week: “Does anyone on this distribution list have a list of good reasons why a test environment should approximate that of production to provide accurate test results in lieu of using a cheaper, smaller test environment and math to estimate production-like results?”
In my former career, we had access to the latest hardware and data center technology. We had a great team that supported all our testing and validation processes. We wrote our own tools for generating load, and we also created the monitors at the user and kernel level. Even with this level of access, we would only size customer deployments via interpolation, because computer behavior is not linear outside of the tested regime.
My Response to the Question
This list is not hypothetical; it comes from real-world experience.
- Horizontally scaled components can fail because of:
  - Expensive cache coherency and cross-talk required to maintain horizontal state
  - The cost of multiplied connections on the next tier
- Insufficient data in a database:
  - Missed costs of table scans
  - Cache-setting inaccuracies
- CPU scaling:
  - Writing an app that scales linearly with CPU count is very difficult; most solutions scale logarithmically as you add CPUs
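The CPU-scaling point can be illustrated with Amdahl's law, a simple model of my choosing (real systems also pay coherency and cross-talk costs on top of this): if any fraction of the work is serial, speedup flattens quickly as you add CPUs.

```python
# A minimal sketch using Amdahl's law; the 90%-parallel workload is
# an illustrative assumption, not a measured system.
def amdahl_speedup(cpus: int, parallel_fraction: float) -> float:
    """Speedup when only `parallel_fraction` of the work scales with CPUs."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / cpus)

for n in (1, 2, 4, 8, 16, 32):
    # 32 CPUs on a 90%-parallel workload yields under 8x speedup.
    print(n, round(amdahl_speedup(n, 0.9), 2))
```

This is exactly why extrapolating from a small test box to a big production box misleads: the curve bends, and the bend depends on workload details you can only see at scale.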
However, I would suggest testing quickly and efficiently in a smaller or truncated environment to find MAJOR issues fast. These tests should run the same day as the changes. They should also be backed by deep, investigative monitoring and analysis tools (MODERN APM) that surface anomalous transactions efficiently, instead of looking for macro symptoms like "CPU is high." Debugging without that is massively expensive and runs counter to agile principles.
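To make "find anomalous transactions, not macro symptoms" concrete, here is a minimal sketch. The median-multiple rule and the sample latencies are my illustrative assumptions; a real APM tool does far more, but the principle is the same: flag the individual slow transactions rather than waiting for an aggregate like CPU to move.

```python
# Sketch: flag individual anomalous transactions by latency.
# A median-based rule is used because medians resist outlier skew.
from statistics import median

def anomalous(latencies_ms, factor=5.0):
    """Return latencies more than `factor` times the median."""
    med = median(latencies_ms)
    return [x for x in latencies_ms if x > factor * med]

samples = [12, 14, 11, 13, 15, 12, 240, 13, 14, 12]
print(anomalous(samples))  # the 240 ms outlier is flagged
```

Note that the average CPU on a box running these transactions might look perfectly normal; only per-transaction visibility exposes the 240 ms straggler.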
Our customers should do hypothesis-driven optimization patterns that can be easily compared. The goal should NOT be to look for slowdowns but instead to drive optimizations. That is where the value is. Why plan to be mediocre?
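A hypothesis-driven comparison can be as simple as fixing a metric up front and comparing runs against it. The p95 metric and the latency samples below are illustrative assumptions; the point is that the baseline and candidate are measured the same way, so the runs are directly comparable.

```python
# Sketch: compare a baseline run against an optimization candidate
# using a pre-agreed metric (p95 latency here, as an example).
def p95(values):
    """95th-percentile latency of a run (nearest-rank, simplified)."""
    ordered = sorted(values)
    idx = min(len(ordered) - 1, int(0.95 * len(ordered)))
    return ordered[idx]

baseline  = [120, 130, 125, 500, 128, 122, 131, 127, 124, 126]
candidate = [110, 115, 112, 180, 113, 111, 116, 114, 112, 115]
print("baseline p95:", p95(baseline))
print("candidate p95:", p95(candidate))
```

The hypothesis ("this change improves tail latency") is either confirmed or refuted by the same metric each run, which is what makes optimization cumulative rather than anecdotal.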
Then, they can roll the solution into production, where they also have good monitoring tools to debug quickly, or roll it back if they missed something.
If they do want a clone system, then it should really be a clone. Five years ago, the performance community agreed that an exact duplicate was impossible. Today, with infrastructure as code combined with the cloud, an exact duplicate is possible. The operations team, along with the development and test teams, can construct a solution that is truly identical in network, CPU, disk, and memory. We can even duplicate production data exactly and easily with data virtualization. When all of this is true, the performance results are truly accurate.
Simulations can even give way to duplicated traffic or partially deployed changes. In effect, you can deploy a code change to one of the applications in a cluster to see whether performance actually improves in production.
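The partially-deployed-change idea is often called a canary rollout. The sketch below shows the routing logic in miniature; the handler functions and the 5% split are placeholders of my own, standing in for real service instances behind a load balancer.

```python
# Sketch: send a small, fixed fraction of live traffic to a "canary"
# instance carrying the code change, and count what went where.
import random

def handle_stable(request):
    return f"stable:{request}"   # placeholder for the current build

def handle_canary(request):
    return f"canary:{request}"   # placeholder for the changed build

def route(request, canary_fraction=0.05):
    """Route ~canary_fraction of requests to the canary instance."""
    if random.random() < canary_fraction:
        return "canary", handle_canary(request)
    return "stable", handle_stable(request)

random.seed(7)  # fixed seed so the demo is repeatable
counts = {"stable": 0, "canary": 0}
for i in range(1000):
    variant, _ = route(i, canary_fraction=0.05)
    counts[variant] += 1
print(counts)
```

Comparing per-variant latency and error rates from this split is what tells you, with real production traffic, whether the change helped.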
With another customer, we did the testing in production. We added load via HPE LoadRunner simulation to a live system. This carefully planned test simulation enabled us to plan for a spike event. Testing in production is no longer a bad thing!
NOTE: I do not recommend snapshotting into a separate, fake data center for performance or operational validation. Its value diminishes over time, because most such systems drift out of date without the attention of the operations staff. Collaboration and participation with the development team, by contrast, yields operations-focused automation, which drives improved reliability, shortened deployment cycles, and the ability to iterate quickly through the permutations required to deliver a very high-performance solution.
Orasi can help do all these things for them.
| Practice | Environments and/or Simulation |
| --- | --- |
| Level 4: Optimizing | |
| Level 3: Managed | |
| Level 2: Defined | |
| Level 1: Repeatable | |
| Level 0: Initial | |