Thursday, December 28, 2023

The unbearable lightness of C

We have a Simulink project from which I generate C code to use in a Visual Studio C++ project. The Simulink project works fine, I can build the C code without any errors, but when I ran the C executable, I got an access violation error due to trying to write to address 0x0. The C project was working fine for previous versions. 

I initially identified the revision where this error first appeared. I reviewed the changed code and couldn't find anything wrong.

After a week of debugging I found out that it was due to an off-by-one error; An array was defined with size 47 using Simulink function ssSetNumDiscState(S, 47) in mdlInitializeSizes(...), but later in function mdlInitializeConditions(...), a for loop with upper bound of 48 was executed which resulted in writing to the memory adjacent to the allocated section for that array. 
static void mdlInitializeConditions(SimStruct *S) {
    real_T *states = ssGetRealDiscStates(S);
    for (int i=0; i < 48; i++) {
    	*(states + i) = 0;
    }
}
It has nothing to do with the latest change in the sense that it was not related to the logic of the change. Instead, that change altered the memory mapping of the build, putting another array (TUBufferPtr) after the first and the overflow caused the value at TUBufferPtr[0] to become 0x0 (NULL). When the program tried to write to the address represented by TUBufferPtr[0], it naturally caused an access violation because writing to address 0x0 is not allowed.
When I looked at the repository history, I saw that this error was introduced 3 years ago and for all these years, it did not become visible! The access violation only occurred when other unrelated code updates caused the compiler to arrange memory slightly differently.

This is also one of the reasons why sometimes C programs behave differently between debug and release builds or on different versions (service packs) of the same operating system. That discrepancy is an indication of an error hiding somewhere in your program. Another way such an error can become visible is when you have it in your C++ DLL that you call from your Java program. One day you update your JDK and your program crashes because DLL and JVM share the same memory space. Naturally, your first inclination is to blame the JDK update but in reality there is a buffer overflow in your DLL code.

You can never say for sure that modifying a code in module A won't have an effect on an unrelated module B. As long as these modules are in the same process, i.e. use the same memory, the compiler might put them side by side and an error in A can overflow to B. Functionally distant modules can become "close relatives" in memory address space.

In a way, I was fortunate that the overwrite contained zero values. If it had used some value that was a valid address for the application to write to, it would cause seemingly random behavior and would have been a lot more fun (!)

Note that a typical static code analyzer would not able to catch this problem because we are defining the data structure with Simulink specific ssSetNumDiscState and getting it with ssGetRealDiscStates functions.

I solved the problem by adding #define NUMBER_OF_DISCRETE_STATES (47) and using that define in both ssSetNumDiscState() and mdlInitializeConditions(). This case study also serves as a cautionary tale illustrating why you should use defined constants rather than magic numbers.

In my nightly automated tests I was only checking if exe was generated which only proved that it compiled. I added running the exe and checking if it finished successfully because an access violation can only occur at runtime.

No comments:

Post a Comment