This question was passed around internally last week: “Does anyone on this distribution list have a list of good reasons why a test environment should approximate that of production to provide accurate test results in lieu of using a cheaper, smaller test environment and math to estimate production-like results?”

In my former career, we had access to the latest hardware and data center technology. We had a great team that supported all our testing and validation processes. We wrote our own tools for generating the load, and we also created the monitors at the user and kernel level. Even with this level of access, we would size our customer deployments only via interpolation, because computer behavior is not linear outside of the test regime.
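As a minimal sketch of what sizing by interpolation only can look like, the function below estimates a value between measured test points and refuses to answer outside the measured range. The function name and the sample measurements are hypothetical and not taken from any real test.

```python
from bisect import bisect_left


def interpolate_capacity(measurements, target_load):
    """Linearly interpolate between measured (load, value) points.

    Raises ValueError when target_load lies outside the measured range,
    because extrapolating past the test regime is exactly what the text
    above warns against.
    """
    points = sorted(measurements)
    loads = [load for load, _ in points]
    if not loads[0] <= target_load <= loads[-1]:
        raise ValueError(
            f"load {target_load} is outside the tested range "
            f"{loads[0]}..{loads[-1]}; run a test at that scale instead"
        )
    i = bisect_left(loads, target_load)
    if loads[i] == target_load:
        return points[i][1]  # exact measurement, nothing to interpolate
    (x0, y0), (x1, y1) = points[i - 1], points[i]
    return y0 + (y1 - y0) * (target_load - x0) / (x1 - x0)


# Hypothetical measurements: (concurrent users, CPU cores consumed)
measured = [(100, 2.0), (200, 4.5), (400, 12.0)]
print(interpolate_capacity(measured, 300))  # estimate inside the tested range
```

Note that even these made-up points are not linear: doubling the load from 200 to 400 users more than doubles the cost, which is why a request for 800 users raises an error rather than returning a guess.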

My Response to the Question

This list is not hypothetical; it comes from real-world experience.

  1. The horizontally scaled components can fail because of
    • Misconfiguration
    • Expensive cache coherency and cross talk to maintain a horizontal state
    • Cost of multiplied connections on the next tier
  2. Insufficient data in a database
    • Missed costs of table scans
    • Cache setting inaccuracies
  3. CPU scaling
    • Writing an app that scales linearly with the number of CPUs is very difficult; most solutions scale closer to logarithmically as you add CPUs
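To illustrate the CPU scaling point, here is a small sketch based on Amdahl's law. This is a standard textbook model that I am substituting for illustration; it is not the exact logarithmic curve described in the list, and the 90% parallel fraction is an assumed value. It shows the same practical effect: each added CPU buys less than the one before.

```python
def amdahl_speedup(cpus, parallel_fraction):
    """Speedup predicted by Amdahl's law for a given number of CPUs.

    parallel_fraction is the share of the work that can run in parallel
    (an assumed figure here; in practice it has to be measured).
    """
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cpus)


# Assumed workload: 90% of the work parallelizes, 10% stays serial.
for cpus in (1, 2, 4, 8, 16, 32, 64):
    speedup = amdahl_speedup(cpus, 0.90)
    print(f"{cpus:>2} CPUs -> {speedup:5.2f}x speedup ({speedup / cpus:.0%} efficiency)")
```

With a 10% serial share, the speedup can never reach 10x no matter how many CPUs are added, so a result measured on a 4-CPU test box cannot simply be multiplied up to a 64-CPU production host.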

However, I would suggest testing in a smaller or truncated environment to find MAJOR issues quickly and efficiently. These tests should be run the same day as the changes. They should also have deep, investigative monitoring and analysis tools (MODERN APM) that find anomalous transactions efficiently, instead of looking only for macro issues such as high CPU. The cost of debugging without such tools is massive and runs counter to agile principles.
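To make the difference between a macro signal and an anomalous transaction concrete, the sketch below flags individual transactions whose duration sits far from the median, using the median absolute deviation (MAD). This is only an illustration of the idea under assumed data; it is not how any particular APM product works, and the transaction IDs, durations, and threshold are hypothetical.

```python
from statistics import median


def find_anomalous_transactions(transactions, threshold=5.0):
    """Return (transaction_id, duration_ms) pairs that are unusually slow.

    A transaction is flagged when it is more than `threshold` median
    absolute deviations above the median duration.
    """
    durations = [duration for _, duration in transactions]
    med = median(durations)
    mad = median([abs(duration - med) for duration in durations])
    if mad == 0:
        return []  # all durations (nearly) identical; nothing stands out
    return [
        (tid, duration)
        for tid, duration in transactions
        if (duration - med) / mad > threshold
    ]


# Hypothetical sample: one slow transaction hidden among normal ones.
sample = [("t1", 100), ("t2", 105), ("t3", 98), ("t4", 102), ("t5", 950), ("t6", 101)]
print(find_anomalous_transactions(sample))
```

An average over this sample would look only mildly elevated, while the per-transaction view points directly at the one request worth investigating.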

Our customers should do hypothesis-driven optimization patterns that can be easily compared. The goal should NOT be to look for slowdowns but instead to drive optimizations. That is where the value is. Why plan to be mediocre?
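A hypothesis-driven comparison can be as simple as stating the expected gain up front and checking two runs against it. The sketch below assumes a hypothesis of the form "the change improves 95th percentile latency by at least 10%"; the function names, the percentile choice, the threshold, and the sample data are all illustrative assumptions.

```python
def percentile(samples, pct):
    """Nearest-rank percentile; pct is an integer from 1 to 100."""
    ordered = sorted(samples)
    rank = max(1, -(-pct * len(ordered) // 100))  # integer ceiling
    return ordered[rank - 1]


def evaluate_hypothesis(baseline_ms, candidate_ms, pct=95, min_improvement=0.10):
    """Compare two latency samples against a stated improvement target.

    Assumes the baseline percentile is greater than zero.
    """
    base = percentile(baseline_ms, pct)
    cand = percentile(candidate_ms, pct)
    improvement = (base - cand) / base
    return {
        "baseline": base,
        "candidate": cand,
        "improvement": improvement,
        "confirmed": improvement >= min_improvement,
    }


# Hypothetical latency samples (milliseconds) from two comparable test runs.
baseline_run = list(range(101, 121))
candidate_run = list(range(81, 101))
print(evaluate_hypothesis(baseline_run, candidate_run))
```

Writing the target down before the run keeps the exercise focused on driving an optimization rather than on merely confirming that nothing got slower.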

Then, they can roll the solution into production where they also have good monitoring tools to debug quickly, or they can roll it back out if they missed something.

If they do want a clone system, then it should really be a clone. Five years ago, the performance community agreed that it was impossible to make an exact duplicate. Today, with infrastructure as code combined with the cloud, an exact duplicate is possible. The operations team, along with development and test teams, can construct a solution that is truly identical in network, CPU, disk, and memory. We can even exactly duplicate the data from production easily with data virtualization. When this is true, the performance results are truly accurate.

Simulations can even give way to duplicated traffic or partially deployed changes. In effect, you can deploy a code change to one of the application instances in a cluster to see whether performance actually improves in production.
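One common way to send only part of the traffic to a changed instance is to hash a stable request key into a bucket and route a fixed percentage of buckets to the new code. The sketch below is an assumption about how such routing could be done, not a description of any specific load balancer; the function name and the user keys are hypothetical.

```python
import hashlib


def routes_to_canary(request_key, canary_percent):
    """Decide deterministically whether a request goes to the changed instance.

    The same key always lands in the same bucket (0 to 99), so a given
    user consistently sees either the old or the new code.
    """
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < canary_percent


# Hypothetical user keys; roughly 10% should be routed to the canary.
keys = [f"user-{n}" for n in range(1000)]
canary_count = sum(routes_to_canary(key, 10) for key in keys)
print(f"{canary_count} of {len(keys)} requests routed to the canary")
```

Keeping the split deterministic makes the comparison between the changed instance and the rest of the cluster easier to reason about, and rolling back is a matter of setting the percentage to zero.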

With another customer, we did the testing in production. We added load via HPE LoadRunner simulation to a live system. This carefully planned test simulation enabled us to plan for a spike event. Testing in production is no longer a bad thing!

NOTE: I do not recommend taking a snapshot into a different, fake data center for performance or operational validation. Its value diminishes over the long run because most systems fall out of order without the attention of the operations staff. Collaboration between the operations and development teams, by contrast, will yield operations-focused automation, which will drive improved reliability, shortened deployment cycles, and the ability to quickly iterate through the permutations required to deliver a very high-performance solution.

Orasi can help our customers do all of these things.

Practice Environments and/or Simulation
Level 4: Optimizing
  • Scales to meet performance demand at the right cost
  • Ability to test out performance hypotheses effectively; a/b tests
  • Performance/load simulations predictively improve solutions
  • Sizing estimates and tools delivered with major changes
  • Proactive synthetic workload is integrated into production
Level 3: Managed
  • Test and production environments are orchestrated
  • Performance simulation executed in test and production for validation
  • Performance and scalability costs are visible to the business
  • Tests are integrated into CD pipeline
Level 2: Defined
  • Test and production change management processes are part of release management
  • Performance environments include robust data, network, and security requirements
  • Performance simulation includes data coverage, network effects, and device/browser level impacts
  • Tests include sizing and optimization plus Level 1
  • Fully automated, self-service deployment of software
Level 1: Repeatable
  • Test and production environments have documented runbooks and configuration
  • Performance tests are state driven
  • Test environments are close to production in scale and machine class
  • Testing includes Level 0 plus soak and spike
Level 0: Initial
  • Performance test environment not dedicated
  • Performance test run annually or once per cycle
  • Simulations based on record and replay
  • Tests are for stress and status quo only
