From: Thomas Walker Lynch Date: Mon, 4 Nov 2024 11:06:19 +0000 (+0000) Subject: adds the Introduction to Structured Testing doc X-Git-Url: https://git.reasoningtechnology.com/style/static/git-logo.png?a=commitdiff_plain;h=544a3f56f3f0b374899841bc2768ba0dd2dd7f9d;p=Mosaic adds the Introduction to Structured Testing doc --- diff --git a/document/.~lock.adder64.odg# b/document/.~lock.adder64.odg# deleted file mode 100644 index 35b1ad1..0000000 --- a/document/.~lock.adder64.odg# +++ /dev/null @@ -1 +0,0 @@ -,Thomas-developer,Blossac,03.11.2024 03:28,file:///home/Thomas-developer/.config/libreoffice/4; \ No newline at end of file diff --git a/document/An_Introduction_to_Structured_Testing.html b/document/An_Introduction_to_Structured_Testing.html index 6ec4c25..0b29ab3 100644 --- a/document/An_Introduction_to_Structured_Testing.html +++ b/document/An_Introduction_to_Structured_Testing.html @@ -87,7 +87,7 @@

Introduction

This guide provides a general overview of testing concepts. It is - not a reference manual for the Mosaic test bench itself. At the + not a reference manual for the Mosaic Testbench itself. At the time of writing, no such reference document exists, so developers and testers are advised to consult the source code directly for implementation details. A small example can be found in the Test_MockClass @@ -95,7 +95,7 @@ that make use of Mosaic.

A typical testing setup comprises three main components: - the test bench, the test + the Testbench, the test routines, and a collection of units under test (UUTs). Here, a UUT is any individual software or hardware component intended for testing. Because this guide focuses on software, we @@ -108,12 +108,12 @@ outputs, and determines whether the test passes or fails based on those values. A given test routine might repeat this procedure for any number of test cases. The final result from the test - routine is then relayed to the test bench. Testers and developers write - the test routines and place them into the test bench.

+ routine is then relayed to the Testbench. Testers and developers write + the test routines and place them into the Testbench.

-

Mosaic is a test bench. It serves as a structured environment for +

Mosaic is a Testbench. It serves as a structured environment for organizing and executing test routines, and it provides a library of utility - routines for assisting the test writer. When run, the test bench sequences + routines for assisting the test writer. When run, the Testbench sequences through the set of test routines, one by one, providing each test routine with an interface to control and examine standard input and output. Each test routine, depending on its design, might in turn sequence through @@ -234,6 +234,83 @@

The Mosaic tool assists testers in finding failures, but it does not directly help with identifying the underlying fault that led to the failure. Mosaic is a tool for testers. However, these two tasks—finding failures and locating faults—are not entirely separate. Knowing where a failure occurs can provide the developer with a good starting point for locating the fault and help narrow down possible causes. Additionally, once a developer claims to have fixed a fault, that claim can be verified through further testing.

+

Testing Objectives

+ + + +

The Mosaic Testbench is useful for any type of testing that can be formulated as test routines exercising RUTs. This certainly includes verification, regression, development, and exploratory testing. It also covers the portions of performance, compliance, security, compatibility, and acceptance testing that fit the model of test routines and RUTs. Only recently has it become imaginable that the Mosaic Testbench could be used for documentation testing; however, it is now possible to fit an AI API into a test routine and turn a document into a RUT. Usability testing often depends on other types of tests, so to that extent the Mosaic Testbench can play a role. However, usability is also, in part, feedback from users. So, short of putting users in the Matrix, this portion of usability testing remains outside the domain of the Mosaic Testbench, though, come to think of it, the Mosaic Testbench could be used to reduce surveys to pass/fail results.

+ +

Each test objective will lead to writing tests of a different nature.

+

Unstructured Testing

@@ -301,8 +378,9 @@

Spot Checking

-

In spot checking, the function under test is checked against one or - two input vectors.

+

In spot checking, the function under test is checked against one or two + input vectors. When using a black box approach, these are chosen at + random.

Moving from zero to one is an infinite relative change, i.e., running a program for the first time requires that many moving parts work together, @@ -311,22 +389,6 @@ test is called a smoke test, a term that has literal meaning in the field of electronics testing.

-

There are notorious edge cases in software. Zeros and index values just - off the end of arrays come to mind. Checking a middle value and edge cases - is often an effective approach for finding failures.

- -

It takes two points to determine a line. In Fourier analysis, it takes - two samples per period of the highest frequency component to determine an - entire waveform. Code also has patterns, patterns that are disjoint at - edge cases. Hence if a piece of code runs without failures for both edge - cases and spot check values in between, it will often run without - failures over an entire domain of values. This effect explains why ad hoc - testing has lead to so much relatively fail free code.

- -

Spot checking is especially valuable in early development, as it provides - useful insights with minimal investment. At this stage, investing more is - unwise while the code is still in flux.

-

Exhaustive Testing

A test routine will potentially run multiple test cases against a given @@ -401,6 +463,10 @@

Structured Testing

+

Structured testing is a form of white box testing in which the tester examines the code being tested and applies various techniques to make the testing more efficient.

+

The Need for Structured Testing

All types of black-box testing have a serious problem in that the search @@ -548,17 +614,24 @@ -

- A typical response from people when they see this is that the knew it went up - fast, but did not know it went up this fast. -

+

A typical response from people when they see this is that they knew it went up fast, but did not know it went up this fast. It is also important to note that there is a one-to-one relationship between the percentage of the exhaustive test time spent and the percentage of coverage achieved: half the time gives 50 percent coverage. In the last row of the table, keeping test times reasonable would leave coverage on the order of 10^-18 percent. At that level of coverage there is really no reason to test. Hence, this table is not limited to exhaustive testing; rather, it speaks to black box testing in general.
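To make the relationship concrete, here is a small back-of-the-envelope sketch. The test rate of one billion cases per second and the one-year budget are assumptions chosen only for illustration; the point is how the arithmetic works, not the particular numbers.

      #include <cstdio>
      #include <cmath>

      // Coverage achieved on the 64-bit adder's input space for a given
      // test budget. Two independent 64-bit inputs give 2^128 cases.
      int main(){
        const double input_space = std::pow(2.0, 128);
        const double cases_per_s = 1e9;        // assumed test rate
        const double seconds     = 3.156e7;    // roughly one year
        double cases_run = cases_per_s * seconds;
        double coverage  = 100.0 * cases_run / input_space;   // percent
        std::printf("coverage after one year: %.3g percent\n", coverage);
        return 0;
      }

Whatever rate is assumed, the coverage scales linearly with the time spent, which is the one-to-one relationship noted above.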

+ +

Informed Spot Checking

-

White Box Testing

+

In white box testing, we take the opposite approach to black box testing. The test writer does look at the code implementation and must understand how to read the code. Take the 64-bit adder example from the prior section. In this section we will apply to it a white box technique known as Informed Spot Checking.

-

White box testing is the simplest type of structured test. In white box - testing, we take the opposite approach to black box testing. Here, the - test writer does look at the code implementation and must understand how to - read the code. Take our 64-bit adder example. This is it as a black box:

+

This is the prior example as a black box:


       int64 sum(int64 a, int64 b){
@@ -575,11 +648,13 @@
       }
     
-

The tester examines the code and sees there is a special case for a = 5717710 - and b = 27, which becomes the first test case. There’s also a special case - for when the sum exceeds the 64-bit integer range, both in the positive and negative - directions; these become two more test cases. Finally, the tester includes a few - additional cases that are not edge cases.

+

When following the approach of Informed Spot Checking, the tester examines + the code and sees there is a special case for a = 5717710 + and b = 27, which becomes the first test case. There’s also + a special case for when the sum exceeds the 64-bit integer range, both in + the positive and negative directions; these become two more test + cases. Finally, the tester includes a few additional cases that are not + edge cases.

Thus, by using white box testing instead of black box testing, the tester finds all the failures with just 4 or so test cases instead of

@@ -588,91 +663,372 @@

2^128 (about 3.4 × 10^38) cases. Quite a savings, eh?
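A sketch of what such an informed spot check routine might look like follows. The helper, the routine name, and the reporting format are illustrative rather than Mosaic's actual interface, and the expected values simply follow the specification, a + b; checks at the limits of the 64-bit range would be added once the specified behavior there is settled.

      #include <cstdio>
      #include <cstdint>

      typedef int64_t int64;
      int64 sum(int64 a, int64 b);   // the RUT, linked in separately

      static bool check(int64 a, int64 b, int64 expected){
        int64 got = sum(a, b);
        if(got != expected)
          std::printf("fail: sum(%lld, %lld) = %lld, expected %lld\n",
                      (long long)a, (long long)b, (long long)got, (long long)expected);
        return got == expected;
      }

      bool test_sum_informed_spot_check(){
        bool pass = true;
        pass &= check(5717710, 27, 5717737);   // the special case seen in the code
        pass &= check(0, 0, 0);                // ordinary values in between
        pass &= check(-3, 7, 4);
        return pass;
      }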

+

There are notorious edge cases in software, and these can often be seen by looking at the RUT. Zeros, and inputs that lead to index values just off the end of arrays, are common ones. Checking a middle value and the edge cases is often an effective approach for finding failures.

+ +

There is an underlying mechanism at play here. Note that it takes two points to determine a line. In Fourier analysis, it takes two samples per period of the highest frequency component to determine an entire waveform. Code also has patterns, patterns that are disjoint at edge cases. Hence, if a piece of code runs without failures for both the edge cases and spot check values in between, it will often run without failures over the entire domain of values. This effect explains why ad hoc testing has led to so much relatively failure-free code.

+ +

Informed Spot Checking is especially valuable in early development, as it provides useful insights with minimal investment. At this stage, investing more in test code is unwise while the code is still in flux; test work is likely to get ripped up and replaced.

+ +

The idea of test work being ripped up and replaced highlights a drawback + of white box testing. Analysis of code can become stale when implementations + are changed. However, due to the explosion in the size of the input space + with even a modest number of inputs, white box testing is necessary if there + is to be much commitment to producing reliable software or hardware.

+ +

Refactoring the RUT

+ +

Refactoring a RUT to make it more testable can be a powerful method for + turning testing problems that are exponentially hard due to state + variables, or very difficult to debug due to random variables, into + problems that are linearly hard. According to this method, the + tester is encouraged to examine the RUT to make the testing problem + easier.

+ +

By reconstructing the RUT I mean that we refactor the code to bring + any random variables or state variables to the interface where they + are then treated as inputs and outputs.

+ +

If placing state variables on the interface is adopted as a discipline by + the developers, reconstruction will not be needed in the test phase, or if + it is needed, white box testers will see this, and it will be a bug that + has been caught. Otherwise reconstruction leads to two versions of a + routine, one that has been reconstructed, and the other that has not. The + leverage gained on the testing problem by reconstructing a routine + typically more than outweighs the extra verification problem of comparing + the before and after routines.

+ +

As an example, consider our adder function with a random fault. As we + know from prior analysis, changing the fault to a random number makes + testing harder, but perhaps more importantly, it makes it nearly impossible + to debug, as the tester can not hand it to the developer and say, + 'it fails in this case'.

+

+      int64 sum(int64 a, int64 b){
+        if( a == (5717710 * rand()) && b == (27 * rand()) ) return 5;
+        else return a + b;
+      }
+    
+

The tester refactors this function as:

+

+      int64 sum( int64 a, int64 b, int64 a0 = 5717710*rand(), int64 b0 = 27*rand() ){
+        if( a == a0 && b == b0 ) return 5;
+        else return a + b;
+      }
+    
+ +

Here a0 and b0 are added to the interface as optional arguments. During testing their values will be supplied; during production the defaults will be used. Thus, we have broken the one test problem into two: the question of whether sum works, and the question of whether the random number generation works.
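As a sketch of how a test might use the new interface, the routine below pins a0 and b0 so the special case is hit deterministically; the expected values come from the specification, a + b, so the first check reports the planted fault on every run rather than only when rand() happens to line up. The routine name is illustrative.

      bool test_sum_refactored(){
        bool pass = true;
        pass &= ( sum(5717710, 27, 5717710, 27) == 5717737 );  // drives the special case; exposes the fault
        pass &= ( sum(2, 3, 5717710, 27) == 5 );               // ordinary case, away from the special values
        return pass;
      }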

+ +

Failures in sum found during testing are now reproducible. If the tester employs informed spot checking, the failure will be found with few tests, and the point in the input space where the failure occurs can be reported to development and used for debugging.

+ +

Here is a function that keeps a state variable between calls.

+

+    int state = 0;
+    int call_count = 0; 
+    void state_machine(int input) {
+        int choice = (input >> call_count) & 1; 
+        switch (state) {
+            case 0:
+                printf("State 0: Initializing...\n");
+                state = choice ? 0 : 1;
+                break;
+            case 1:
+                printf("State 1: Processing Path A...\n");
+                state = choice ? 0 : 2; 
+                break;
+            case 2:
+                printf("State 2: Processing Path B...\n");
+                state = choice ? 0 : 3;
+                break;
+        }
+        call_count++;
+    }
+    
- - - - - + return {carry_in, sum}; + } + + +

According to the bottom up technique, we first test + the full_adder, which is not a difficult testing problem. It + employs well known trusted operations, and has a couple of interesting + special case conditions. Given the numeric nature of this code, these + special case conditions are probably better verified by proof than by + testing, but they can be tested.
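For orientation, here is a minimal sketch of the kind of full_adder and add_256 decomposition being discussed. The names, the struct layout, and the carry detection are illustrative and may differ from the actual listing.

      #include <cstdint>
      #include <array>

      struct fa_result { uint64_t sum; uint64_t carry_out; };

      // one 64-bit digit of the addition, with carry in and carry out
      fa_result full_adder(uint64_t a, uint64_t b, uint64_t carry_in){
        uint64_t partial = a + b;                 // wraps modulo 2^64
        uint64_t carry1  = partial < a;           // wrapped on a + b
        uint64_t s       = partial + carry_in;
        uint64_t carry2  = s < partial;           // wrapped on + carry_in
        return { s, carry1 | carry2 };            // at most one wrap can occur
      }

      // 256-bit ripple carry add over four 64-bit parts, least significant first
      std::array<uint64_t,4> add_256(const std::array<uint64_t,4>& a,
                                     const std::array<uint64_t,4>& b){
        std::array<uint64_t,4> result{};
        uint64_t carry = 0;
        for(int i = 0; i < 4; ++i){
          fa_result r = full_adder(a[i], b[i], carry);
          result[i] = r.sum;
          carry     = r.carry_out;                // ripple the carry to the next part
        }
        return result;                            // final carry out is dropped here
      }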

+ +

Once the full_adder can be trusted, testing add_256 + reduces to checking that the various 64 bit parts are extracted and then + packed correctly, + and are not, say, offset by one, and that the carries are properly communicated + during the add.

+ +

Note this test also trusts the fact that ripple carry addition is a valid + algorithm for assembling the pieces. Thus there is a new verification + problem, that for the algorithm. In this case, ripple carry addition is + already a trusted algorithm.

+ +

Testing of add_256 could be further simplified with refactoring, by moving the loop control variables and the carry_in and carry_out onto the interface. As i is recycled, it would become two variables, say i and j. Once the loop control variables are on the interface it is straightforward to test the packing. Once the carries are on the interface it is straightforward to test the carries.

+ +

In general, all programs and circuits can be conceptualized as functional units, channels, and protocols. A test that shows these work as specified shifts the test problem from the RUT to the specification.

+ +

Adding to the code

+ +

It is a common practice to add property checks to the code for gathering + data about failures or other potential problems. These will then write to + log files, or even send messages back to the code maintainers. By doing + this the testers benefit from the actual use of the product as though it + were a test run. When failures are found, such code might then trigger + remedial or recovery actions.
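A sketch of such an in-code property check follows. The invariant, the log file name, and the message format are illustrative; the point is that a violation observed in the field is recorded much as a test failure would be.

      #include <cstdio>
      #include <cstdint>

      void check_balance_invariant(int64_t debits, int64_t credits, int64_t balance){
        if(balance != credits - debits){
          std::FILE* log = std::fopen("property_violations.log", "a");
          if(log){
            std::fprintf(log,
                         "balance invariant violated: debits=%lld credits=%lld balance=%lld\n",
                         (long long)debits, (long long)credits, (long long)balance);
            std::fclose(log);
          }
          // remedial or recovery actions could be triggered here
        }
      }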

+ +

About Reference Outputs and Reference Properties

+ +

When testing during development, reference outputs often come from the + developers or testers themselves. They know what they expect from the + routines, but they do not know if the code will meet these expectations, + so they write tests. Typically, they try to imagine the hardest possible + cases. However, sometimes a young developer avoids testing challenging + cases to sidestep the risk of failures—this is, of course, a poor approach + that can lead to undetected issues.

+ +

Often, specification authors provide reference outputs or extensive test + suites that must be passed to achieve certification. Architects also + contribute by creating multi-level specifications—for the entire program, + for the largest components, and for communication protocols between + components. These specifications often serve as high-quality reference + outputs and property checks that can be applied to the model during testing. + The goal of developers and testers is to meet these specifications, making + failures directly relevant to the development process and program design.

+ +

Experts in a specific area sometimes provide test data, maintaining + a database of reference data as a resource for validating outputs. + For some types of code, experts also supply property checks, which + evaluate whether outputs satisfy essential properties rather than specific + values. Depending on the domain, these properties can be an important aspect + of the testing process.

+ +

Each time a bug is found, a test should be created to capture a failure + related to that bug. Ideally, such tests are written with minimal + implementation-specific details so they remain relevant even after code + changes. These tests are then added to a regression testing suite, ensuring + that future changes do not reintroduce the same issues.

+ +

For applications involving multi-precision arithmetic, such as the earlier adder example, reference data is often sourced from another established multi-precision library, whether an open-source or commercial product. The assumption is that an existing product will be more reliable than a newly developed one, and since it is implemented differently, its errors are likely to be uncorrelated. This is competitive testing, an aspect of compatibility testing, here being used for other objectives. In the limit, as the RUT matures, this approach will tend to identify bugs in the reference data from the other product as often as it does in the RUT, which might be an interesting effect.

+ +

In some cases, reference data comes from historical sources or existing + systems. When upgrading or replacing a legacy system, historical data + serves as a benchmark for comparison. Similarly, industry standards + and compliance datasets, particularly from regulatory organizations + like IEEE, NIST, or ISO, provide reliable reference points for applications + requiring standardized outputs. Compliance-driven tests are often required + for certification or regulatory approval in fields such as finance, + healthcare, and aerospace.

+ +

For cases requiring many inputs without needing specific reference values, random number generators can provide extensive test data. Examples include comparative testing and property checking. Random number generators can also be configured to concentrate cases in specific areas of the input domain that for some reason concern the testers.
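A sketch of random-input testing along these lines is shown below. The reference_sum oracle stands in for an established library, the fixed seed keeps any failure reproducible, and the commutativity check is one example of a property that needs no reference value at all; all names are illustrative.

      #include <cstdint>
      #include <limits>
      #include <random>

      int64_t sum(int64_t a, int64_t b);             // the RUT
      int64_t reference_sum(int64_t a, int64_t b);   // trusted reference implementation

      bool test_sum_random(int n_cases){
        std::mt19937_64 gen(12345);                  // fixed seed: failures stay reproducible
        std::uniform_int_distribution<int64_t> dist(
            std::numeric_limits<int64_t>::min(), std::numeric_limits<int64_t>::max());
        bool pass = true;
        for(int i = 0; i < n_cases; ++i){
          int64_t a = dist(gen), b = dist(gen);
          pass &= ( sum(a, b) == reference_sum(a, b) );   // comparative check
          pass &= ( sum(a, b) == sum(b, a) );             // property check: commutativity
        }
        return pass;
      }

A second distribution concentrated near the 64-bit limits could be added to weight cases toward the areas that concern the testers.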

+ +

Customer and user feedback sometimes uncovers additional test cases, + especially when dealing with complex or evolving software. Feedback + reveals edge cases or expected behaviors that developers and testers + may not have anticipated, allowing teams to create reference points + for new test cases that cover real-world use cases and address user needs.

+ +

Conclusion

+ +

If you are a typical tester or developer reading through the previous list, + you might feel a bit disappointed. Unless you work in a specialized area, + are attempting to create a compatible product, or need to exercise the hardware, much + of that list might seem inapplicable. For many developers, the most + applicable advice remains: "During development, reference outputs often + come from the developers or testers themselves." I apologize if this seems + limiting, but consider this: the reason we run programs is to generate the + very data we're looking for. If that data were easily available, we wouldn’t + need the program.

+ +

In many ways, testing is about making developers and testers the first + users of the product. All products will have bugs; it’s far better for + experts to encounter these issues first.

+ +

Testing also facilitates communication among project members. Are the + architects, developers, and testers all on the same page about how the + product should work? The only way to find out is to run what has been built + and observe it in action. For this, we need test cases.

+ +

This circular problem—finding data that our program should generate in order to test the program itself—illustrates a fundamental limitation in software testing. We encountered this in the discussion on unstructured, black-box testing: as soon as we open the box to inspect the code, we are no longer just testing it, but reasoning about it and even verifying it formally.

+ +

This, perhaps, hints at a way forward. Our program is a restatement of the + specification in another language. Verification, then, is an equivalence + check. We can run examples to demonstrate equivalence, but black-box testing + alone will have limited impact. Alternatively, we can examine our code and + try to prove that it matches the specification. Though challenging, this + approach is far more feasible than waiting ten times the age of the universe + to confirm our solution through black box testing.

+ +

Think of testing as a reasoning problem. Explain why the routine works and + how it contributes to meeting the specification. Work from the top down: if + the high-level components behave correctly, the program will meet the + specification. That’s the first step. Then explain why the breakdown of + those top-level components ensures correct behavior. Continue this process, + and then use tests to validate each link in this chain of reasoning. In this + way, you can generate meaningful reference values.

-