Friday, July 11, 2025

Double-to-Int Conversion with Bit Shifting

We often need to pack a large numeric range into 32 bits. For instance, timestamps in microseconds exceed Integer.MAX_VALUE after roughly 36 minutes. By discarding the least significant bits (via a right shift), we can make the value fit, and we can later recover an approximation by shifting left by the same amount. However, each extra bit of shift doubles the worst-case error, so the goal is to find the smallest shift that covers the range while minimizing error. The following Java sketch investigates this (the one-hour range is an illustrative assumption):
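public class ShiftPacking {
  public static void main(String[] args) {
    // Range to cover: one hour of microsecond timestamps (an illustrative assumption).
    long rangeMicros = 3_600_000_000L;
    for (int shift = 0; shift <= 8; shift++) {
      long packedMax = rangeMicros >> shift;          // value after discarding low bits
      boolean fits = packedMax <= Integer.MAX_VALUE;  // does it fit in a 32-bit int?
      long maxErrorMicros = (1L << shift) - 1;        // worst-case error after shifting back left
      System.out.printf("shift=%d fits=%b maxError=%d us%n", shift, fits, maxErrorMicros);
    }
    // The smallest shift with fits=true covers the range with minimal error.
  }
}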


Monday, June 16, 2025

C/C++ header mismatch bug

I encountered a problem where a field in a global struct (myStruct) held a valid value before entering a function foo, but turned into garbage after entering it. When I consulted AI tools, they suggested that foo might be allocating very large local arrays, causing a stack overflow that could corrupt the global structure. Another possibility was an out-of-bounds write elsewhere in the code.

After a week of debugging and trying various solutions—such as increasing the thread's stack size—I discovered the root cause: The function foo was defined in a C library with multiple versions. Each version resided in a different folder but had the same file names. Which folder was used depended on a #define. I was including the header from one version of the library, but linking against the implementation from another. If the struct definitions had matched, this wouldn’t have caused an issue, but they differed—evident from the differing sizeof(myStruct). As a result, myStruct was interpreted using the wrong layout, leading to corrupted values from an incorrect memory region.

Sunday, June 15, 2025

C++ pointer bug

This C++ code has a significant bug that will cause undefined behavior:
#include <iostream>
class A {
  public:
    int val;
};
void reset(A *p_a) {
  if (p_a != NULL) {
      delete p_a;
  }
  p_a = new A();
}
int main() {
  A *p_a = new A();
  p_a->val = 5;
  std::cout << "Before reset, p_a->val:" << p_a->val << "\n";
  reset(p_a);
  std::cout << "After reset, p_a->val:" << p_a->val << "\n";
  return 0;
}

The reset function receives a copy of the pointer p_a, not a reference to it. When you modify p_a inside the function (with p_a = new A()), you're only changing the local copy - the original pointer in main() remains unchanged. What actually happens:

  1. p_a in main() points to an A object with val = 5 
  2. reset() receives a copy of this pointer 
  3. reset() deletes the original object (memory is freed) 
  4. reset() creates a new object, but assigns it only to the local copy 
  5. The original p_a in main() still points to the deleted memory 
  6. Accessing p_a->val after reset() is undefined behavior (accessing freed memory) 
The Fix: Pass the pointer by reference using a pointer-to-pointer or reference-to-pointer:
// Reference to pointer
void reset(A *&p_a) {
  if (p_a != nullptr) {
    delete p_a;
  }
  p_a = new A();
}
// Call with: reset(p_a);

An even better fix is to use smart pointers, which remove the need for a reset function entirely:

auto p_a = std::make_unique<A>();  // requires #include <memory>
p_a = std::make_unique<A>();       // "reset": the old object is deleted automatically

You can detect such problems by enabling AddressSanitizer (ASAN) in Visual Studio:

  1. Right-click your project → Properties
  2. Go to Configuration Properties → C/C++ → General
  3. Set Enable Address Sanitizer to Yes (/fsanitize=address)
  4. Go to Configuration Properties → C/C++ → Optimization
  5. Set Optimization to Disabled (/Od) for better debugging
  6. Set Whole Program Optimization to No
  7. Go to Configuration Properties → C/C++ → General → Debug Information Format
  8. Set it to Program Database (/Zi); Edit & Continue (/ZI) is not compatible with AddressSanitizer
In Eclipse CDT:

  1. Open your C/C++ project in Eclipse CDT
  2. Right-click project → Properties
  3. Navigate to C/C++ Build → Settings
  4. Under Tool Settings, for both GCC C++ Compiler → Miscellaneous and GCC C Compiler → Miscellaneous, add to "Other flags": -fsanitize=address -g -O1
  5. Under GCC C++ Linker → Miscellaneous, add to "Other objects": -fsanitize=address

Tuesday, May 27, 2025

Fuzzy Logic and Quake III Bots

Fuzzy logic is often used in decision-making systems where a detailed mathematical model of the system is unavailable or impractical. Instead of relying on equations, fuzzy logic encodes expert intuition into human-readable rules. These rules allow systems to make decisions based on approximate or linguistic input values, such as “low health” or “enemy nearby.”

For simple systems — say, with just one input and one output — fuzzy logic may be overkill. In those cases, a 1D interpolation (similar to proportional navigation) is often enough to generate smooth behavior transitions. But as systems grow more complex, fuzzy logic scales better than maintaining large interpolation grids or rigid condition trees.
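
For the one-input case, such an interpolation is only a few lines. A quick Java sketch (the table values are invented for illustration):

// Map one input (e.g. health 0-100) to one output (e.g. aggressiveness 0-1)
// via linear interpolation over a small hand-tuned table.
static double interp1(double[] xs, double[] ys, double x) {
  if (x <= xs[0]) return ys[0];
  for (int i = 1; i < xs.length; i++) {
    if (x <= xs[i]) {
      double t = (x - xs[i - 1]) / (xs[i] - xs[i - 1]);
      return ys[i - 1] + t * (ys[i] - ys[i - 1]);
    }
  }
  return ys[ys.length - 1];
}
// Usage: interp1(new double[]{0, 50, 100}, new double[]{0.9, 0.5, 0.1}, 70.0)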

While neural networks have become dominant in many domains, fuzzy logic still offers distinct advantages, especially in embedded or control-focused systems. The two approaches start from different inputs: fuzzy logic encodes structured human insight, while neural networks thrive on raw data and pattern discovery. For complex or poorly understood systems, writing fuzzy rules by hand is impractical and data-driven methods win out; where domain knowledge exists, though, fuzzy logic has clear advantages over neural networks:

  1. Interpretability: Fuzzy rules are readable and understandable by developers and domain experts.
  2. Minimal training: Rules encode prior knowledge, reducing or eliminating the need for extensive data-driven training.
  3. Lightweight tuning: At most, fuzzy systems may require optimizing rule weights — a much simpler process than full network training.

One of the most interesting uses of fuzzy logic in gaming came from Quake III Arena. The bots in the game used fuzzy logic to evaluate possible behaviors — such as attack, search for health, search for a better weapon, retreat. Each action was assigned a desirability score based on fuzzy evaluations of current game state (e.g., health, distance to enemy, ammo). At each tick, the bot would choose the highest-scoring action.
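
A toy Java sketch of this style of action selection (membership shapes, inputs, and thresholds are invented for illustration; this is not the actual Quake III code):

// Fuzzy memberships: how strongly a crisp value counts as "low" or "high".
static double low(double x, double max)  { return Math.max(0.0, 1.0 - x / max); }
static double high(double x, double max) { return Math.min(1.0, x / max); }

// Desirability of each action from fuzzy evaluations of the game state;
// the bot picks the highest-scoring action each tick.
static String pickAction(double health, double ammo, double enemyDist) {
  double attack       = high(health, 100) * high(ammo, 50) * low(enemyDist, 500);
  double retreat      = low(health, 100) * low(enemyDist, 500);
  double searchHealth = low(health, 100) * high(enemyDist, 500);
  if (attack >= retreat && attack >= searchHealth) return "attack";
  return (retreat >= searchHealth) ? "retreat" : "search for health";
}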

To tune the bot parameters, the developers had bots play against each other and applied genetic algorithms to evolve the best-performing rule sets. Of course, they could not make the bots perfect because then a human player would never be able to win.

Monday, May 12, 2025

CPU, Analog, FPGA, or ASIC?

Algorithms can be implemented across a wide spectrum of hardware, each with its own trade-offs in speed, power, flexibility, cost, and scalability. Let’s compare the four main approaches:

1. Software on General-Purpose CPUs

Pros:

  • Easy to develop and debug: Rich toolchains, IDEs, and profiling tools.
  • Highly flexible: Reprogram anytime; modify algorithms at will.
  • Low development cost: No custom hardware needed; ready to run on PCs, servers, or microcontrollers.
  • Ecosystem and libraries: Access to optimized math libraries (e.g., FFTW, NumPy, BLAS).

Cons:

  • Latency and real-time constraints: OS overhead and unpredictable timing make hard real-time difficult. Soft real-time is achievable.
  • Performance limitations: Limited parallelism compared to hardware solutions.
  • High power consumption per operation: Especially inefficient for repetitive, simple tasks.

Ideal for general-purpose applications, prototyping, and workloads without hard real-time or extreme throughput demands.

2. Analog Circuits

Pros:

  • Ultra-low latency: Signal is processed in real-time with no sampling delay.
  • Potentially high throughput: Continuous operation with no clock constraints.
  • Minimal power: No digital switching, especially useful in low-power sensors or RF front-ends.
  • No need for ADC/DAC: Processes raw analog signals directly.

Cons:

  • Limited precision: Susceptible to noise, drift, and component tolerances.
  • Hard to scale: Each additional function requires more physical components.
  • Difficult to tune or reconfigure: Redesign often requires physical changes.
  • No programmability: Once built, behavior is fixed or only marginally tunable.

Ideal for real-time sensing, analog filters, RF circuits, ultra-low power embedded front-ends.

3. Field-Programmable Gate Arrays (FPGAs)

Pros:

  • High parallelism: True concurrent execution of multiple operations.
  • Low deterministic latency: Ideal for real-time pipelines.
  • Reconfigurable hardware: Algorithms can be updated post-deployment.
  • Power-efficient: Much better performance-per-watt than CPUs for many tasks.

Cons:

  • Steep learning curve: Requires HDL knowledge (VHDL/Verilog) or high-level synthesis.
  • Toolchain complexity: Longer compile/synthesis times, debugging can be difficult.
  • Moderate development cost: More expensive than CPUs in small volumes.
  • Not optimal for floating-point math: Often better with fixed-point arithmetic.

Ideal for real-time video/audio processing, signal processing, robotics, hardware prototyping.

4. Custom Chips (ASICs)

Pros:

  • Maximum performance: Custom datapaths, memory layouts, and logic yield unmatched throughput.
  • Lowest power consumption: Fully optimized for the task at hand.
  • Smallest footprint: No unnecessary hardware or software overhead.
  • Production cost scales well: Extremely cheap per unit at high volumes.

Cons:

  • Astronomically high NRE (non-recurring engineering) cost: Millions of dollars just to reach first silicon.
  • Long time-to-market: Can take 6–24 months from design to tapeout.
  • Zero flexibility: Bugs in logic mean hardware re-spins.
  • High risk: A single design flaw can cost months of work and millions in losses.

Ideal for high-volume commercial products (e.g., smartphones, wireless chips), aerospace, medical devices, and deep learning accelerators. Example: u-blox GNSS receiver chips.

Monday, April 14, 2025

First Step in Safety-Critical Software Development

The first step when developing safety-critical software is to add an automatic commit check that rejects any commit whose build produces compiler warnings. Enable the highest warning level and treat all warnings as errors.

Common Visual C++ warnings relevant to safety-critical systems:

  1. C26451: Arithmetic overflow: Using operator 'op' on a value that may overflow the result (comes from Code Analysis with /analyze). Example: uint64_t c = a + b where a and b are of type uint32_t
  2. C4244: Conversion from ‘type1’ to ‘type2’, possible loss of data. Example: int → char or double → float
  3. C4018: Signed/unsigned mismatch. Can cause logic bugs and unsafe comparisons.
  4. C4701: Potentially uninitialized variable
  5. C4715: Not all control paths return a value
  6. C4013: 'function' undefined; assuming extern returning int
Of course, there is much more that needs to be done; in this blog post, I just wanted to focus on the first step from the perspective of a tech lead.

Saturday, April 12, 2025

UDP vs TCP

When working with real-time systems, it's important to understand how data is sent and received over UDP and TCP. The main attraction of UDP is speed: in the right scenario it can be several times faster than TCP (rough numbers below), but you have to be aware of its limitations.

UDP (User Datagram Protocol) sends data in discrete packets (called datagrams). Each call to sendto() on the sender side corresponds to exactly one recvfrom() on the receiver side.

  • No connection setup or teardown.
  • No built-in guarantees about delivery, order, or duplication (the network itself can occasionally deliver duplicate packets, e.g. through link-layer retransmissions when an acknowledgment is lost).
  • If you call sock.recvfrom(4) and the incoming packet is 9 bytes, you get the first 4 bytes—and the rest are discarded, i.e. you cannot get them with another receive call.
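
A minimal Java sketch of this boundary behavior (loopback and port 9999 are arbitrary choices):

import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;

public class UdpBoundaryDemo {
  public static void main(String[] args) throws Exception {
    try (DatagramSocket rx = new DatagramSocket(9999);
         DatagramSocket tx = new DatagramSocket()) {
      byte[] msg = "Message 1".getBytes();   // one 9-byte datagram
      tx.send(new DatagramPacket(msg, msg.length,
              InetAddress.getLoopbackAddress(), 9999));
      byte[] buf = new byte[4];              // receive buffer smaller than the datagram
      DatagramPacket p = new DatagramPacket(buf, buf.length);
      rx.receive(p);                         // delivers "Mess"; the other 5 bytes are discarded
      System.out.println(new String(buf, 0, p.getLength()));
    }
  }
}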

Rough Performance Advantage:

  • In short, bursty communications or real-time streaming, UDP can be 2x to 10x faster than TCP due to its lack of handshake, retransmission, and flow control mechanisms.
  • In my test case, UDP returned a DNS response in 42 ms, while TCP took 382 ms — nearly 9x faster.

TCP (Transmission Control Protocol) provides a continuous stream of bytes. It breaks your data into segments under the hood, but applications don’t see those packet boundaries.

  • Reliable: Guarantees delivery, order, and no duplication (if a duplicate packet is received, TCP automatically discards it).
  • Stream-oriented: You send bytes, not messages.
  • If you send b"Message 1", and call sock.recv(4), you might receive b"Mess", and then get the rest (b"age 1") in another call.
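
The same experiment over TCP, as a Java sketch (host and port are placeholders; assumes a peer on 127.0.0.1:8888 sends "Message 1"):

import java.io.InputStream;
import java.net.Socket;

public class TcpStreamDemo {
  public static void main(String[] args) throws Exception {
    try (Socket s = new Socket("127.0.0.1", 8888)) {
      InputStream in = s.getInputStream();
      byte[] buf = new byte[4];
      int n = in.read(buf);                       // may return only "Mess"
      while (n != -1) {
        System.out.print(new String(buf, 0, n));  // the rest ("age 1") arrives on later reads
        n = in.read(buf);
      }
    }
  }
}
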
If messages can get corrupted during creation rather than during transmission over the network, and you add a CRC at the application layer to detect this, UDP may be the better fit: the entire message, CRC included, arrives in a single recvfrom call, with no extra framing logic needed.

Wednesday, February 12, 2025

Longitude convention and octal values

Recently, I encountered a bug in code that handled longitude values, where 33 degrees longitude is typically written as 033. Upon investigation, I discovered that Java was calculating 033 + 1 as 28 instead of 34. This happened because, when we write longitude values like "033" in geographic notation, we mean decimal 33 degrees. However, if we write it directly in Java, C++, or similar languages, it is interpreted as octal 33 (which equals decimal 27) due to the leading zero!
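
A minimal Java demonstration of the pitfall, and the safe way to handle such strings:

int lonLiteral = 033;                    // leading zero makes this an octal literal: 27 decimal
System.out.println(lonLiteral + 1);      // prints 28, not 34
int lonParsed = Integer.parseInt("033"); // parseInt defaults to base 10, so this is 33
System.out.println(lonParsed + 1);       // prints 34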

This is a classic example of why it's important to be careful when working with domain-specific number formats - what makes sense in geographic notation can have unexpected behavior when directly translated to programming language syntax!

Monday, February 10, 2025

Verifying Simulink generated C code

I generated embedded C code for the plant in a MATLAB Simulink model consisting of a controller and a plant in closed loop.

To verify that the generated code behaves the same as the Simulink model, I could save the controller outputs to a file, create a C project with a main() function that reads those outputs, and feed them to the plant C code at each time step—essentially converting the closed-loop simulation into an open-loop one. This approach works if the model uses a single-stage solver like ode1 (Euler method). However, I usually prefer ode4 (Runge-Kutta 4, RK4) because it achieves better accuracy with less total computation. Although RK4 performs four derivative evaluations per step, its global error is O(h^4) versus Euler's O(h). If RK4 uses a step size of h, Euler needs a step size on the order of h^4 to reach similar accuracy, so in total Euler requires (1/h^3)/4 times more function evaluations than RK4. For example, if hRK4 = 0.1, Euler would need 250x more evaluations.

Unfortunately, RK4 is a single-step but multi-stage method: within each major step, the solver evaluates the derivatives at intermediate (minor) time steps. Since the Simulink model saves controller outputs only at major steps, reading them from a file to drive the plant might result in divergent behavior—especially if the system has high-frequency dynamics—because the controller outputs at the minor steps can differ from the major-step values available in the file.
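
To make the minor-step issue concrete, here is a textbook RK4 step as a Java sketch (not the generated code). The derivative is evaluated at the intermediate times t + h/2 and t + h, where the controller output u is also needed:

interface Dynamics { double f(double t, double x, double u); }  // plant model: x' = f(t, x, u)

static double rk4Step(Dynamics plant, java.util.function.DoubleUnaryOperator u,
                      double t, double x, double h) {
  double k1 = plant.f(t,         x,              u.applyAsDouble(t));
  double k2 = plant.f(t + h / 2, x + h / 2 * k1, u.applyAsDouble(t + h / 2)); // minor step
  double k3 = plant.f(t + h / 2, x + h / 2 * k2, u.applyAsDouble(t + h / 2)); // minor step
  double k4 = plant.f(t + h,     x + h * k3,     u.applyAsDouble(t + h));
  // If u was logged only at major steps, its minor-step values must be approximated
  // (e.g. held constant), which is exactly where replay can diverge from the simulation.
  return x + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4);
}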

You can only fully verify the system by integrating the C code into your embedded setup, where controller outputs are generated from real-time plant sensor data. Since testing on an embedded system takes time, you can first catch issues such as uninitialized variables in the plant code by creating a C project and building and running your plant model with constant controller inputs. If it compiles and runs without producing NaN values, you can proceed to the embedded testing phase.

If you want to be more rigorous, you could save the plant outputs together with the controller outputs. In your C project, you read a controller output, feed it to the plant, run it for one time step, read the Simulink plant output for that time instant from the file, and compare it with the C code plant output. Even this comparison is imperfect and only works for the first time step, because the Simulink integrator block outputs depend on the system state at the previous time step. To check the whole time span, you also have to save the system state and initialize the C model with the correct state at each time step.

Wednesday, January 29, 2025

Linux: Dynamic vs Static Linking

Today, we had a chat with a colleague about whether a C/C++ binary (ELF) built on one Linux distribution would work on another. After some research, I found that default GCC builds are dynamically linked, and the ELF file contains:

  1. Your program's code
  2. A list of dynamic dependencies (shared libraries) it needs
  3. Symbols that need to be resolved at runtime
However, it does not contain the actual shared libraries - those need to be present on the system where you run the program. You can see these dependencies using ldd. For example, a simple C "hello world" program with only a printf() call can be built with gcc hello.c -o hello. The output of ldd hello:
linux-vdso.so.1 (0x00007fff6a0e1000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fdaf6400000)
/lib64/ld-linux-x86-64.so.2 (0x00007fdaf66b3000)

The file's size is 15960 bytes. To make it truly independent of any shared library, we would build it with gcc -static hello.c -o hello_static. ldd hello_static shows:
not a dynamic executable
The size of hello_static is a whopping 900344 bytes, 56x that of the dynamically linked build.

The good news is that these core libraries are present by default on virtually every Linux distribution, so bundling them into every executable would waste a lot of space. However, you must ensure that the C++-version-specific features (C++17, C++20, etc.) used in your code are supported by the compiler and runtime libraries (glibc/libstdc++) on the target Linux distribution. Of course, the CPU architecture has to be the same too — that goes without saying.

On Windows, the C++ runtime libraries are shipped separately as the Visual C++ Redistributable.

Note that if a Windows PC can successfully load and run a DLL compiled with a specific C++ version, it can also run an EXE compiled with that C++ version, because the C++ runtime requirements are the same whether the code is in a DLL or an EXE. The only difference is how the code is packaged and loaded, not its runtime requirements.

Thursday, January 23, 2025

C++ std::endl vs "\n"

I have a thread-safe print function that uses lock_guard and std::cout and is called from different threads. When I replaced std::endl expressions with "\n", my program stopped printing to the console. The main difference is that std::endl does two things: it inserts a newline character AND flushes the output buffer, while "\n" only inserts a newline character. Without flushing, the output may sit in the buffer and never become visible promptly, particularly in a multithreaded program. To keep the cheaper "\n" but still see output, flush explicitly with std::flush at the points where visibility matters.

Thursday, January 9, 2025

Memory Insights from a Segfault

Recently, a program crashed with a segmentation fault (SIGSEGV). Since segmentation faults can happen for a variety of reasons, it took some time to find the root cause: insufficient stack space (stack overflow) due to a function allocating an extra 16 kB for a local variable, float var[4096]. The quick fix was to increase the stack size of the thread calling that function from 32 kB to 1 MB.

Another solution is to move var to static storage instead of the stack by changing the definition to static float var[4096]. Upon closer inspection of the code, I saw that var was only used to copy a static array before sending it to another function. Since that function was not modifying the array, there was no reason for the copy; removing var eliminated the large stack allocation.

Problems like this are stressful in the short term but provide an opportunity to review the concept of memory regions. Here's a concise breakdown:

Static Storage
- Memory allocated at program start, lives for entire program duration
- Size must be known at compile time 
- Good for: fixed-size buffers that exist for whole program 
- Example: static uint8_t buffer[1024];
- Zero runtime allocation overhead 
- Can't be resized 

Stack
- Memory allocated/deallocated automatically when entering/leaving scope 
- Very fast allocation/deallocation
- Limited size (often a few MB)
- Good for: small-to-medium temporary buffers 
- Example: void foo() { uint8_t temp[1024]; }
- Risk of stack overflow with large allocations

Heap 
- Dynamic runtime allocation 
- More flexible but slower than stack 
- Larger size available 
- Good for: large buffers or unknown sizes 
- Example: auto* buffer = new uint8_t[1024];
- Risk of fragmentation 
- Must manually manage memory 

For embedded systems with fixed-size allocation needs: 
1. Use static storage for long-lived, known-size buffers 
2. Use stack for small temporary buffers 
3. Consider static memory pools instead of raw heap allocation if you need dynamic allocation

In Visual Studio 2022, you can use Build > Run Code Analysis to detect functions with large stack allocations; they will be flagged with warning C6262 (excessive stack usage).