Build your own Error Reporter – Part 2: Android

In this second part on error reporting, we add support for Android. For that, we need to manually walk the call stack and configure the Delphi project to export symbols.

In the previous article in this series, we shared some building blocks for implementing error reporting functionality on iOS. To recap, these blocks are:

  1. Intercepting Exceptions, not only in the main thread, but also in secondary threads,
  2. Retrieving a Stack Trace to determine how we got to the error, and
  3. Symbolicating the stack trace to make it readable and understandable by mere humans.

Building blocks #1 and #3 work on both iOS and Android, so I won’t be repeating these here. Please refer to the previous article if you need a refresher.

Retrieving the stack trace is where things get tricky, and thus interesting. But as a result, this article is a bit technical, since we have to dive into some ARM CPU details. Continue at your own risk…

Walk the Stack

On iOS, we have this convenient API called backtrace that generates a stack trace for us. Unfortunately, this API is not available for Android. There is a similar API called _Unwind_Backtrace, but there is no Delphi import for this API, nor is it easy to import it yourself. It is implemented in the static library libgcc.a, which you cannot link to since it conflicts with Delphi’s librtlhelper.a that contains system level support routines.

Alternatively, you can also find this API in the shared library libc.so, so you can import it by loading the library using dlopen and retrieving the API address using dlsym.

But more importantly, the stack trace that _Unwind_Backtrace generates is unusable for Delphi applications. The API expects that the application uses the Procedure Call Standard for the ARM Architecture (AAPCS). However, it seems that Delphi (or LLVM?) applications use a model that looks more like the ARMv7 Function Calling Conventions for iOS (but not exactly). This ABI specifies that routines should start with a prolog that sets up a stack frame. That is very convenient for us since it allows us to walk the stack ourselves.

This ABI only applies to 32-bit applications. But since Delphi (currently) only supports 32-bit Android apps, this works out fine for us. Once Delphi starts adding support for 64-bit Android, we need to implement an additional stack walker, or use an API that does it for us.

Function Prolog

If you are coming from a Windows background, then you may know that whenever you call a routine, the application pushes the return address onto the stack. When the routine finishes, it pops the return address and jumps to that address.

ARM works differently. When you call a routine, it stores the return address in a special register, called the Link Register (LR). This is one of the 16 general purpose registers that ARM has available. More specifically, it is an alias for register 14 (R14). When a routine finishes, it sets the Program Counter (PC, alias for R15) to the value of LR so it jumps to the return address. This means that the return address doesn’t have to be on the stack. But how can we generate a call stack then?

That is where the function prolog comes to the rescue. On iOS, the specification says that the prolog:

  • Must push all registers that need saving to the stack.
  • Must always push the R7 and LR registers.
  • Must set R7 (aka Frame Pointer on iOS) to the location in the stack where the previous value of R7 was just pushed.

So a minimal prolog (as used a lot in small functions) looks like this:

push {R7, LR}
mov  R7, SP

You probably don’t know ARM assembly language, and you don’t need to. I will explain the minimum you need to know to follow along.

The first line pushes the values of R7 and LR to the stack, and the second line sets R7 to the value of the Stack Pointer (SP, aka R13).

Basic Stack Walking Algorithm

If every prolog looks like this, then we can walk the stack using these steps:

  1. Retrieve the current value of the R7 register. Let’s call it FramePointer.
  2. At this location in the stack, you will find the previous value of the R7 register. Let’s give it the imaginative name PreviousFramePointer.
  3. At the next location in the stack (after R7), we will find the pushed LR value. Add this value to the stack trace so we can use it later to look up the routine name by this address (using symbolication, as discussed in the previous post).
  4. Set FramePointer to PreviousFramePointer and go back to step 2. Rinse and repeat until FramePointer becomes 0 or it falls outside of the stack.
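To make these steps concrete, here is a minimal sketch that runs the same loop over a simulated stack instead of real memory. The TFakeStack type, the WalkFakeStack function and all slot values are made up for illustration. Slots are addressed by index rather than by byte address, and slot 0 is reserved so that a saved frame pointer of 0 marks the outermost frame.

```pascal
program BasicStackWalkSketch;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

const
  STACK_WORDS = 12;

type
  { Hypothetical stand-in for stack memory: one Cardinal per stack slot. }
  TFakeStack = array [0..STACK_WORDS - 1] of Cardinal;
  TReturnAddresses = array of Cardinal;

{ Follows the chain of saved frame pointers. Each frame is assumed to
  consist of the saved R7 at slot FP and the saved LR at slot FP + 1. }
function WalkFakeStack(const AStack: TFakeStack;
  AFramePointer: Cardinal): TReturnAddresses;
var
  Count: Integer;
begin
  SetLength(Result, STACK_WORDS);
  Count := 0;
  { Stop at frame pointer 0, when leaving the stack, or when the result
    is full (which also guards against a cyclic chain). }
  while (AFramePointer <> 0) and (AFramePointer + 1 < STACK_WORDS)
    and (Count < STACK_WORDS) do
  begin
    Result[Count] := AStack[AFramePointer + 1]; // saved LR follows saved R7
    Inc(Count);
    AFramePointer := AStack[AFramePointer];     // step to the previous frame
  end;
  SetLength(Result, Count);
end;

var
  Stack: TFakeStack;
  Addresses: TReturnAddresses;
  I: Integer;
begin
  FillChar(Stack, SizeOf(Stack), 0);
  { Innermost frame at slot 2, its caller at slot 5, outermost at slot 9. }
  Stack[2] := 5;  Stack[3]  := $1000;
  Stack[5] := 9;  Stack[6]  := $2000;
  Stack[9] := 0;  Stack[10] := $3000;

  Addresses := WalkFakeStack(Stack, 2);
  for I := 0 to High(Addresses) do
    WriteLn(IntToHex(Int64(Addresses[I]), 8));
end.
```

The sketch prints the three made-up return addresses from the innermost to the outermost frame. A real walker reads process memory instead of an array and needs a different way to decide whether a frame pointer still lies inside the stack.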

Unfortunately, Delphi doesn’t follow the iOS ABI exactly, and it may push other registers between R7 and LR. For example:

push {R4, R5, R6, R7, R8, R9, LR}
add  R7, SP, #12

Here, it pushes 3 registers (R4 to R6) before R7, so in the second line it sets R7 to point 12 bytes into the stack (so it still points to the previous R7, as required). This line can be read as R7 := SP + 12;. However, it also pushes registers R8 and R9 before it pushes the LR register. This means we cannot assume that the LR register will be located directly after the R7 register in the stack. There may be (up to 6) registers in between, and we don’t know which value represents LR. So we just store all 7 values after R7, and later try to figure out which one represents LR.

In Delphi code, this can look something like this:

type
  TStackValues = array [0..6] of UIntPtr;

type
  TCallStack = record
    Count: Integer;
    Stack: array [0..19] of TStackValues;
  end;
  PCallStack = ^TCallStack;

function GetExceptionStackInfo(P: PExceptionRecord): Pointer;
const
  { On most Android systems, each thread has a stack of 1MB }
  MAX_STACK_SIZE = 1024 * 1024;
var
  Count: Integer;
  FramePointer, PrevFramePointer, MinStack, MaxStack: UIntPtr;
  Address: Pointer;
  CallStack: PCallStack;
begin
  { Allocate a TCallStack record large enough to hold 20 entries }
  GetMem(CallStack, SizeOf(TCallStack));

  { Get the current value of the R7 register (the frame pointer) }
  FramePointer := GetFramePointer;

  { The stack grows downwards, so all entries in the call stack leading to this
    call have addresses greater than FramePointer. We don't know what the start
    and end address of the stack is for this thread, but we do know that the
    stack is at most 1MB in size, so we only investigate entries from
    FramePointer to FramePointer + 1MB. }
  MinStack := FramePointer;
  MaxStack := MinStack + MAX_STACK_SIZE;

  { Now we can walk the stack using the algorithm described above. }
  Count := 0;
  while (Count < 20) and (FramePointer <> 0) 
    and (FramePointer >= MinStack) and (FramePointer < MaxStack) do
  begin
    { The first value at FramePointer contains the previous value of R7. }
    PrevFramePointer := PNativeUInt(FramePointer)^;

    { Store the 7 values after that. }
    Address := Pointer(FramePointer + SizeOf(UIntPtr));
    Move(Address^, CallStack.Stack[Count], SizeOf(TStackValues));
    Inc(Count);

    { Walk the stack to the previous frame pointer. }
    FramePointer := PrevFramePointer;
  end;

  CallStack.Count := Count;
  Result := CallStack;
end;

While walking the stack, we check if the frame pointer lies within the stack. This should be the case for all routines that use a prolog. We don’t know the memory range for the stack, but we do know that every thread on most Android systems has a stack of 1 MB. We also know that the stack grows downwards, so the top of the stack is at most 1 MB above the current frame pointer value.
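The routine above allocates its result with GetMem, so something has to release that memory again. One way to wire this up is through the Exception.GetExceptionStackInfoProc and Exception.CleanUpStackInfoProc class variables in System.SysUtils. The following is only a sketch of how the pieces could fit together, not the code from the repository: TDemoCallStack, DemoGetStackInfo, DemoCleanUpStackInfo and LiveBlocks are hypothetical names, and the stub does not actually walk the stack.

```pascal
program StackInfoHooksSketch;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

type
  { Hypothetical, reduced stand-in for the TCallStack record. }
  TDemoCallStack = record
    Count: Integer;
  end;
  PDemoCallStack = ^TDemoCallStack;

var
  { Number of stack info blocks that were allocated but not yet released. }
  LiveBlocks: Integer = 0;

{ Stub with the same signature as GetExceptionStackInfo.
  A real implementation would walk the stack here. }
function DemoGetStackInfo(P: PExceptionRecord): Pointer;
var
  Info: PDemoCallStack;
begin
  GetMem(Info, SizeOf(TDemoCallStack));
  Info.Count := 0;
  Inc(LiveBlocks);
  Result := Info;
end;

{ Called by the RTL when the exception object is destroyed. }
procedure DemoCleanUpStackInfo(Info: Pointer);
begin
  FreeMem(Info);
  Dec(LiveBlocks);
end;

begin
  Exception.GetExceptionStackInfoProc := DemoGetStackInfo;
  Exception.CleanUpStackInfoProc := DemoCleanUpStackInfo;
  try
    raise Exception.Create('Test');
  except
    on E: Exception do
      WriteLn('StackInfo assigned: ', E.StackInfo <> nil);
  end;
  { The exception object has been destroyed at this point. }
  WriteLn('Blocks still allocated: ', LiveBlocks);
end.
```

Pairing the allocation with a cleanup hook keeps the stack info alive exactly as long as the exception object, so it can still be inspected inside the exception handler.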

Retrieve the Frame Pointer value

But how do we get the current value of the frame pointer (register R7)? This would be very easy to do if Delphi supported inline assembly code for ARM:

function GetFramePointer: UIntPtr; 
asm
  ldr R0, [R7]
  bx  LR
end;

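This loads the contents of the address pointed to by R7 into the R0 register. The R0 register is used to store function results. Since this tiny routine has no prolog of its own, R7 still belongs to the calling routine, so the value returned is the frame pointer that the caller saved in its own prolog. The second line means “return to the address stored in the LR register”, which is how you return from a subroutine.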

Unfortunately, we cannot use inline ARM assembly in Delphi. We could put these two lines of code in a separate assembly file, then assemble it and link it into a static library. You can then import that static library in Delphi, as in:

function GetFramePointer: UIntPtr; cdecl; external 'mystaticlibrary.a' name '_GetFramePointer';

That is the way you should go if you want to link external (3rd party) libraries. Maybe I’ll talk about creating static libraries for iOS and Android in a future post…

In this case however, the GetFramePointer function assembles to just 8 bytes, so I put those 8 bytes into a constant array, create a function type and have a function pointer of that type point to the constant array:

type
  TGetFramePointer = function: NativeUInt; cdecl;

const
  GET_FRAME_POINTER_CODE: array [0..7] of Byte = (
    $00, $00, $97, $E5,  // ldr R0, [R7]
    $1E, $FF, $2F, $E1); // bx  LR

var
  GetFramePointer: TGetFramePointer = @GET_FRAME_POINTER_CODE;

A bit dirty, but it works.

Find Correct LR Value

OK, so now we have a stack trace with 7 potential LR values for each entry in the stack. Now we need to figure out which of those 7 represents the actual LR value. Most of the time, this will be the first of the 7 values, but we cannot be 100% sure which of them (if any) it is. So instead, we just pass these values to the dladdr API one by one (see previous article for details). If that API succeeds, we assume that we found the LR value. However, this is not foolproof, because any value we pass to dladdr may happen to be a valid code address without being the LR value we are looking for.

Also, the LR value contains the address of the next instruction after the call instruction. Delphi usually uses the BL or BLX instruction to call another routine. These instructions take 4 bytes, so LR will be set to 4 bytes after the BL(X) instruction (the return address). However, we want to know at what address the call was made, so we need to subtract 4 bytes.

There is one final complication here: the lowest bit of the LR register indicates the mode the CPU operates in (ARM or Thumb). We need to clear this bit to get to the actual address, by AND’ing it with “not 1”.
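As a small worked example of these two adjustments, the sketch below converts a saved LR value into the address of the call instruction. CallSiteFromLR is a hypothetical helper and the input value is made up. The guard for values smaller than 4 is an addition of this sketch: it avoids an unsigned underflow when overflow checking is enabled, and returns 0 in that case.

```pascal
program CallSiteSketch;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

{ Converts a saved LR value to the address of the calling instruction.
  Returns 0 for values that are too small to subtract from. }
function CallSiteFromLR(const ALinkValue: UIntPtr): UIntPtr;
const
  CALL_INSTRUCTION_SIZE = 4; // size of a BL/BLX instruction
var
  Address: UIntPtr;
begin
  Address := ALinkValue and not UIntPtr(1); // clear the ARM/Thumb mode bit
  if Address < CALL_INSTRUCTION_SIZE then
    Result := 0
  else
    Result := Address - CALL_INSTRUCTION_SIZE;
end;

begin
  { $A51D8881 -> clear bit 0 -> $A51D8880 -> minus 4 -> $A51D887C }
  WriteLn(IntToHex(Int64(CallSiteFromLR($A51D8881)), 8)); // A51D887C
  WriteLn(IntToHex(Int64(CallSiteFromLR(0)), 8));         // 00000000
end.
```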

So the final version (as found in TgoExceptionReporter.GetCallStack in the sample code) looks like this:

class function TgoExceptionReporter.GetCallStack(
  const AStackInfo: Pointer): TgoCallStack;
var
  CallStack: PCallStack;
  I, J: Integer;
  FoundLR: Boolean;
begin
  { Convert TCallStack to TgoCallStack }
  CallStack := AStackInfo;
  SetLength(Result, CallStack.Count);
  for I := 0 to CallStack.Count - 1 do
  begin
    FoundLR := False;
    for J := 0 to Length(CallStack.Stack[I]) - 1 do
    begin
      Result[I].CodeAddress := (CallStack.Stack[I, J] and not 1) - 4;
      if GetCallStackEntry(Result[I]) then
      begin
        { Assume we found LR }
        FoundLR := True;
        Break;
      end;
    end;

    if (not FoundLR) then
      { None of the 7 values were valid.
        Set CodeAddress to 0 to signal we couldn't find LR. }
      Result[I].CodeAddress := 0;
  end;
end;

The dladdr API is called as part of the GetCallStackEntry method, which also demangles the symbol found at the given address.

Symbolication Revisited

We can now run our demo app from the last article to generate an error report, which looks something like this:

Access violation at address A5203DA2, accessing address FFFFFFFF
At address: $A5203DA2

Call stack:
                          $00000000
                          $00000000
                          $00000000
                          $00000000
                          $00000000
                          $00000000
                          $00000000
                          $00000000
                          $00000000
                          $00000000
                          $00000000
                          $00000000
                          $00000000
libandroid.so             $AD544CDE AInputQueue_preDispatchEvent + 1
                          $00000000
                          $00000000
                          $00000000
                          $00000000
                          $00000000
libErrorReportingSample.so $A520453C _NativeMain + 71 

Well, that doesn’t look good. We were able to walk the stack, but symbolication failed for almost all symbols. What is going on here?

Looking at the .so file that Delphi generates for your app, it turns out that almost all routine symbols are stored as local symbols instead of global symbols. The dladdr API only works with global symbols.

If you look at the files that Delphi generates in the Android build directory, then you will find a file with a .vsr extension that looks like this:

EXPORTED {
    global:
        _NativeMain;
        __rsrc_*;
        __rstr_*;
        dbkFCallWrapperAddr;
        __dbk_fcall_wrapper;
        _DbgExcNotify;
        _Unwind_VRS_Get;
        ...
        ExecJNI;
        ANativeActivity_onCreate;
    local: *;
}; 

This file is a so-called version script (hence the .vsr extension). Among other things, it tells the linker which symbols should be global and which ones should be local. Delphi generates a file that only makes a couple of symbols global, and all other symbols local (with the local: *; wildcard in the file).

Fortunately, we can create an additional .vsr file that contains some more rules. The repository contains a sample goExports.vsr file that makes all symbols we care about global:

{
  global: 
    _ZN*;
    _ZZ*;
};

This file says that all symbols that start with _ZN or _ZZ should be global. These are prefixes of mangled symbol names. Every mangled name starts with _Z and scoped names follow with an N after that. Since Delphi includes the name of the unit in every symbol (as in MyUnit.MyProc), almost every symbol starts with _ZN. An exception is nested routines, which start with _ZZ.

For those interested, this name mangling scheme is specified in the Itanium C++ ABI.
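To illustrate what the two patterns in goExports.vsr match, here is a minimal sketch that applies the same prefix test in Delphi code. MatchesExportPattern is a hypothetical helper, not part of the repository. The first sample name is a mangled name as the linker reports it for the Grijjy.ErrorReporting unit: after _ZN it lists each scope as a length followed by the name (6Grijjy, 14Errorreporting, 12cxa_demangle).

```pascal
program ExportPatternSketch;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

{ Mirrors the two glob patterns in goExports.vsr: _ZN* and _ZZ*.
  The comparison is case-sensitive, like the linker's matching. }
function MatchesExportPattern(const AMangledName: string): Boolean;
begin
  Result := AMangledName.StartsWith('_ZN') or AMangledName.StartsWith('_ZZ');
end;

begin
  { A scoped routine name: matched by _ZN* }
  WriteLn(MatchesExportPattern(
    '_ZN6Grijjy14Errorreporting12cxa_demangleEPKcPc9NativeIntRi'));
  { Not a mangled name, so neither pattern matches. It is exported
    by the .vsr file that Delphi generates instead. }
  WriteLn(MatchesExportPattern('_NativeMain'));
end.
```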

Now we need to tell the linker to use this .vsr file in addition to the one that Delphi generates. You do this by setting a linker command line option:

  1. Open the project options (using the menu “Project | Options…”).
  2. Select the target “All configurations – Android platform”.
  3. Go to the page “Delphi Compiler | Linking”.
  4. Set “Options passed to the LD linker” to --version-script=goExports.vsr

If the .vsr file is not in the application directory, you need to specify an absolute or relative path (as in --version-script=..\goExports.vsr).

When you build the app now, the generated .so file with your app will be a bit bigger. This is because it stores most symbols in the export table now. But it does allow you to create meaningful error reports:

Access violation at address A51D8882, accessing address FFFFFFFF
At address: $A51D8882 (Fmain.TFormMain.ButtonAVClick(TObject*) + 21)

Call stack:
libErrorReportingSample.so $A4C3BA38 Sysutils.RaiseExceptObject(TExceptionRecord*) + 35
libErrorReportingSample.so $A4C05B28 _RaiseAtExcept(TObject*, Pointer) + 67
libErrorReportingSample.so $A4C1E8DC Internal.Excutils.SignalConverter(Cardinal, Cardinal, Cardinal) + 35
libErrorReportingSample.so $A51D887C Fmain.TFormMain.ButtonAVClick(TObject*) + 15
libErrorReportingSample.so $A501FBE8 Fmx.Controls.TControl.Click() + 583
libErrorReportingSample.so $A514955C Fmx.Stdctrls.TCustomButton.Click() + 11
libErrorReportingSample.so $A50201C0 Fmx.Controls.TControl.MouseClick(Uitypes.TMouseButton, set of Classes.System_Classes__1, Single, Single) + 67
libErrorReportingSample.so $A5185330 Fmx.Forms.TCommonCustomForm.MouseUp(Uitypes.TMouseButton, set of Classes.System_Classes__1, Single, Single, Boolean) + 207
libErrorReportingSample.so $A5185330 Fmx.Forms.TCommonCustomForm.MouseUp(Uitypes.TMouseButton, set of Classes.System_Classes__1, Single, Single, Boolean) + 207
libErrorReportingSample.so $A50D3886 Fmx.Platform.Android.TWindowManager.MouseUp(Uitypes.TMouseButton, set of Classes.System_Classes__1, Single, Single, Boolean) + 197
libErrorReportingSample.so $A50EEC6C Fmx.Platform.Android.TPlatformAndroid.ProcessAndroidMouseEvents() + 263
libErrorReportingSample.so $A50EEDE4 Fmx.Platform.Android.TPlatformAndroid.HandleAndroidMotionEvent(AInputEvent*) + 299
libErrorReportingSample.so $A50D89D2 Fmx.Platform.Android.TPlatformAndroid.HandleAndroidInputEvent(AInputEvent*) + 57
libErrorReportingSample.so $A50E7496 Fmx.Platform.Android.HandleAndroidInputEvent(var Androidapi.Appglue.TAndroid_app, AInputEvent*) + 21
libErrorReportingSample.so $A4E7FE08 Androidapi.Appglue.process_input(Androidapi.Appglue.TAndroid_app*, Androidapi.Appglue.android_poll_source*) + 103
libErrorReportingSample.so $A50EC19E Fmx.Platform.Android.TPlatformAndroid.InternalProcessMessages() + 149
libErrorReportingSample.so $A50D7CA6 Fmx.Platform.Android.TPlatformAndroid.Run() + 13
libErrorReportingSample.so $A517D79C Fmx.Forms.TApplication.Run() + 75
libErrorReportingSample.so $A517D79C Fmx.Forms.TApplication.Run() + 75
libErrorReportingSample.so $A51D901C _NativeMain + 71 

Source Code

As always, you can find the accompanying source code on GitHub as part of the JustAddCode repository.

As I said in the previous article, this code just presents some building blocks for creating your own error reporter. But those are important building blocks and should give you a good head start…

30 thoughts on “Build your own Error Reporter – Part 2: Android”

  1. Yes. Super Wow. Probably the “blog of the year”! And can we expect that S. Dupont and others at Embarcadero will be a little pro-active to add tools and library like exposed on this site.
    Thanks a lot!
    Eddy

  2. So if I get “At address: $A51D8882 (Fmain.TFormMain.ButtonAVClick(TObject*) + 21)”, how do I find position 21 in my source code? Where do I have to start counting (begin statement, var block, first line of procedure)? Do I have to skip spaces or returns while counting? Should optimization be turned off as a compiler directive?

    1. Hi Robert,
      The building blocks described in these blog posts don’t provide any line number information or ways to get to the line number.
      The offset you get (21 in this example) is the number of *bytes* into the routine, starting from the “begin” statement.
      A way you can use this offset is to set a breakpoint on the “begin” line and open the CPU view when the breakpoint is hit. You can then add the offset to the current instruction address and look at the line at that address. This CPU view should then also show you the source line at that address.
      Of course, this only works if the version that crashes is compiled with the same settings as the version you are debugging…

  3. Hi Erik
    Another fantastic blog, thank you! I’ve been using this in 10.1 Berlin and it works brilliantly.
    On Rio however, I can’t link my app anymore if I have the --version-script=goExports.vsr in the linker options. I get an error “anonymous version tag cannot be combined with other version tags”.
    Has your team discovered this same error and a workaround?

  4. I found a little bug/problem in the unit grijjy.ErrorReporting.pas,

    in class function TgoExceptionReporter.GlobalGetExceptionStackInfo

    you do

    Move(Address^, CallStack.Stack[Count], SizeOf(TStackValues));

    With Count = 1 and Length(CallStack.Stack) = 1, so you get an “out of bounds” exception. To see it, you must enable range checking in the compiler.

    1. The code by itself is correct, it just should be compiled with range checking turned off. I disabled range checking in this unit now. Could you check if that fixes your issue?

      1. Hi Erik,

        You also need to add {$OverFlowChecks Off} because otherwise we get an integer overflow on this line:

        Result[I].CodeAddress := (CallStack.Stack[I, J] and not 1) - 4;

  5. Hi Erik,

    I just tried to migrate to the new Delphi 10.3.3 and I receive this error:

    [DCC Error] E2597 C:\SDKs\android-ndk-r17b\toolchains\aarch64-linux-android-4.9\prebuilt\windows\aarch64-linux-android\bin\ld.exe: cannot find -lgnustl_static

    cf: https://stackoverflow.com/questions/59240345/can-not-compile-my-android-app-in-64-bit-i-receive-dcc-error-cannot-find-lgnus

    Is there any workaround to make your wonderful class work under 64-bit Android?

    1. 64-bit Android does not support gnustl. Maybe it is sufficient to replace this in the source code with another C++ runtime. I cannot test this at the moment, but you could try and experiment by replacing libgnustl_static.a with libc++.a, libc++_static.a or another C++ runtime…

      1. Thanks Erik,

        with libc++_static.a I have this error:

        [DCC Error] E2597 Grijjy.ErrorReporting.o: In function `Grijjy::Errorreporting::cxa_demangle(char const*, char*, NativeInt, int&)’:
        Grijjy.ErrorReporting:(.text._ZN6Grijjy14Errorreporting12cxa_demangleEPKcPc9NativeIntRi[_ZN6Grijjy14Errorreporting12cxa_demangleEPKcPc9NativeIntRi]+0x0): undefined reference to `__cxa_demangle’

        But it seems to compile with libc++.a. Do you think it will work like this without anything else? I discovered that my Android device runs in 32-bit mode, so I cannot test it right now 😦

      2. There may be a different API for 64bit, or maybe the name has one underscore less. I will try to look into this tomorrow and push a fix…

    1. I took a look at the code and it is more complicated than a missing import. The code for Android currently assumes a 32-bit architecture and instruction set to retrieve data like a call stack. It’s going to take more time to look into this…

      1. There are very few people who can do what you did (I think only you, even), so no choice, we need to wait for you 🙂 And it’s great that you share your knowledge with the community! I can’t wait to see your work available for 64-bit Android, I hope it will be soon…

  6. Hi Erik, When do you think you will have the time to take a look at the android 64 bit? Is it a complicated task, can we help you in any way?

    1. Maybe I can find some time this weekend. It will take some research, since getting a stack trace on ARM64 is very different than on ARM32. I will post any updates here…

    2. Just pushed an update with Android64 support. I only have a single 64-bit Android device available for testing, but it did work on that one. Let me know if there is a problem running it on your device(s).
