In this second part on error reporting, we add support for Android. For that, we need to manually walk the call stack and configure the Delphi project to export symbols.
In the previous article in this series, we shared some building blocks for implementing error reporting functionality on iOS. To recap, these blocks are:
1. Intercepting exceptions, not only in the main thread, but also in secondary threads,
2. Retrieving a stack trace to determine how we got to the error, and
3. Symbolicating the stack trace to make it readable and understandable by mere humans.
Building blocks #1 and #3 work on both iOS and Android, so I won’t be repeating these here. Please refer to the previous article if you need a refresher.
Retrieving the stack trace is where things get tricky, and thus interesting. But as a result, this article is a bit technical, since we have to dive into some ARM CPU details. Continue at your own risk…
Walk the Stack
On iOS, we have this convenient API called backtrace that generates a stack trace for us. Unfortunately, this API is not available for Android. There is a similar API called _Unwind_Backtrace, but there is no Delphi import for this API, nor is it easy to import it yourself. It is implemented in the static library libgcc.a, which you cannot link to since it conflicts with Delphi’s librtlhelper.a that contains system level support routines.
Alternatively, you can also find this API in the shared library libc.so, so you can import it by loading the library using dlopen and retrieving the API address using dlsym.
But more importantly, the stack trace that _Unwind_Backtrace generates is unusable for Delphi applications. The API expects that the application uses the Procedure Call Standard for the ARM Architecture (AAPCS). However, it seems that Delphi (or LLVM?) applications use a model that looks more like the ARMv7 Function Calling Conventions for iOS (but not exactly). This ABI specifies that routines should start with a prolog that sets up a stack frame. That is very convenient for us since it allows us to walk the stack ourselves.
This ABI only applies to 32-bit applications. But since Delphi (currently) only supports 32-bit Android apps, this works out fine for us. Once Delphi starts adding support for 64-bit Android, we need to implement an additional stack walker, or use an API that does it for us.
Function Prolog
If you are coming from a Windows background, then you may know that whenever you call a routine, the application pushes the return address onto the stack. When the routine finishes, it pops the return address and jumps to that address.
ARM works differently. When you call a routine, it stores the return address in a special register, called the Link Register (LR). This is one of the 16 general purpose registers that ARM has available. More specifically, it is an alias for register 14 (R14). When a routine finishes, it sets the Program Counter (PC, alias for R15) to the value of LR so it jumps to the return address. This means that the return address doesn’t have to be on the stack. But how can we generate a call stack then?
That is where the function prolog comes to the rescue. On iOS, the specification says that the prolog:
- Must push all registers that need saving to the stack.
- Must always push the R7 and LR registers.
- Must set R7 (aka Frame Pointer on iOS) to the location in the stack where the previous value of R7 was just pushed.
So a minimal prolog (as used a lot in small functions) looks like this:
    push {R7, LR}
    mov R7, SP
You probably don’t know ARM assembly language, and you don’t need to. I will explain the minimum you need to know to follow along.
The first line pushes the values of R7 and LR to the stack, and the second line sets R7 to the value of the Stack Pointer (SP, aka R13).
Basic Stack Walking Algorithm
If every prolog looks like this, then we can walk the stack using these steps:
1. Retrieve the current value of the R7 register. Let’s call it FramePointer.
2. At this location in the stack, you will find the previous value of the R7 register. Let’s give it the imaginative name PreviousFramePointer.
3. At the next location in the stack (after R7), we will find the pushed LR value. Add this value to the stack trace so we can use it later to look up the routine name by this address (using symbolication, as discussed in the previous post).
4. Set FramePointer to PreviousFramePointer and go back to step 2. Rinse and repeat until FramePointer becomes 0 or it falls outside of the stack.
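The steps above can be sketched as a small simulation. This is not the code from the repository: it walks a fake stack stored in an array, where array indices stand in for addresses and index 0 plays the role of the terminating null frame pointer. The names WalkStack, TFakeStack and TTrace, and the return address values, are made up for illustration. It also assumes the simplest prolog, where LR sits directly after R7.

```pascal
program BasicStackWalkSketch;

{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

const
  STACK_WORDS = 16;

type
  { Hypothetical stand-ins: a fake stack of 32-bit words and a small trace. }
  TFakeStack = array [0..STACK_WORDS - 1] of Cardinal;
  TTrace = array [0..7] of Cardinal;

{ Follows the chain of saved frame pointers, starting at FramePointer,
  and collects the return address stored right after each saved R7.
  Returns the number of entries written to Trace. }
function WalkStack(const Stack: TFakeStack; FramePointer: Cardinal;
  var Trace: TTrace): Integer;
var
  PreviousFramePointer: Cardinal;
begin
  Result := 0;
  while (Result <= High(Trace)) and (FramePointer <> 0)
    and (FramePointer < STACK_WORDS - 1) do
  begin
    { Step 2: the word at FramePointer holds the previous value of R7. }
    PreviousFramePointer := Stack[FramePointer];
    { Step 3: the next word holds the pushed LR (the return address). }
    Trace[Result] := Stack[FramePointer + 1];
    Inc(Result);
    { Step 4: continue with the frame of the caller. }
    FramePointer := PreviousFramePointer;
  end;
end;

var
  FakeStack: TFakeStack;
  ReturnAddresses: TTrace;
  I, Count: Integer;
begin
  FillChar(FakeStack, SizeOf(FakeStack), 0);
  FillChar(ReturnAddresses, SizeOf(ReturnAddresses), 0);

  { Three nested calls. Each frame stores (previous R7, LR). }
  FakeStack[2] := 6;   FakeStack[3] := 1000;   { innermost routine }
  FakeStack[6] := 10;  FakeStack[7] := 2000;   { its caller }
  FakeStack[10] := 0;  FakeStack[11] := 3000;  { outermost routine }

  { Step 1: in this simulation the "current R7" is simply index 2. }
  Count := WalkStack(FakeStack, 2, ReturnAddresses);

  for I := 0 to Count - 1 do
    WriteLn('Return address ', I, ': ', ReturnAddresses[I]);
end.
```

With the three frames set up here, the walk visits indices 2, 6 and 10 and collects the return addresses 1000, 2000 and 3000 in that order, innermost call first.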
Unfortunately, Delphi doesn’t follow the iOS ABI exactly, and it may push other registers between R7 and LR. For example:

    push {R4, R5, R6, R7, R8, R9, LR}
    add R7, SP, #12
Here, it pushed 3 registers (R4–R6) before R7, so in the second line it sets R7 to point 12 bytes into the stack (so it still points to the previous R7, as required). This line can be read as R7 := SP + 12;. However, it also pushed registers R8 and R9, before it pushes the LR register. This means we cannot assume that the LR register will be directly located after the R7 register in the stack. There may be (up to 6) registers in between. We don’t know which one represents LR. So we just store all 7 values after R7, and later try to figure out which one represents LR.
In Delphi code, this can look something like this:
    type
      TStackValues = array [0..6] of UIntPtr;

    type
      TCallStack = record
        Count: Integer;
        Stack: array [0..19] of TStackValues;
      end;
      PCallStack = ^TCallStack;

    function GetExceptionStackInfo(P: PExceptionRecord): Pointer;
    const
      { On most Android systems, each thread has a stack of 1MB }
      MAX_STACK_SIZE = 1024 * 1024;
    var
      Count: Integer;
      FramePointer, PrevFramePointer, MinStack, MaxStack: UIntPtr;
      Address: Pointer;
      CallStack: PCallStack;
    begin
      { Allocate a PCallStack record large enough to hold 20 entries }
      GetMem(CallStack, SizeOf(TCallStack));

      { Get the current value of the R7 register (the frame pointer) }
      FramePointer := GetFramePointer;

      { The stack grows downwards, so all entries in the call stack leading
        to this call have addresses greater than FramePointer. We don't know
        what the start and end address of the stack is for this thread, but
        we do know that the stack is at most 1MB in size, so we only
        investigate entries from FramePointer to FramePointer + 1MB. }
      MinStack := FramePointer;
      MaxStack := MinStack + MAX_STACK_SIZE;

      { Now we can walk the stack using the algorithm described above. }
      Count := 0;
      while (Count < 20) and (FramePointer <> 0)
        and (FramePointer >= MinStack) and (FramePointer < MaxStack) do
      begin
        { The first value at FramePointer contains the previous value of R7. }
        PrevFramePointer := PNativeUInt(FramePointer)^;

        { Store the 7 values after that. }
        Address := Pointer(FramePointer + SizeOf(UIntPtr));
        Move(Address^, CallStack.Stack[Count], SizeOf(TStackValues));
        Inc(Count);

        { Walk the stack to the previous frame pointer. }
        FramePointer := PrevFramePointer;
      end;

      CallStack.Count := Count;
      Result := CallStack;
    end;
While walking the stack, we check if the frame pointer lies within the stack. This should be the case for all routines that use a prolog. We don’t know the memory range for the stack, but we do know that every thread on most Android systems has a stack of 1 MB. We also know that the stack grows downwards, so the top of the stack is at most 1 MB more than the current frame pointer value.
Retrieve the Frame Pointer value
But how do we get the current value of the frame pointer (register R7)? This would be very easy to do if Delphi supported inline assembly code for ARM:

    function GetFramePointer: UIntPtr;
    asm
      ldr R0, [R7]
      bx LR
    end;
This loads the contents of the address pointed to by R7 into the R0 register. The R0 register is used to store function results. The second line means “return to the address stored in the LR register”, which is how you return from a subroutine.
Unfortunately, we cannot use inline ARM assembly in Delphi. We could put these two lines of code in a separate assembly file, then assemble it and link it into a static library. You can then import that static library in Delphi, as in:

    function GetFramePointer: UIntPtr; cdecl;
      external 'mystaticlibrary.a' name '_GetFramePointer';
That is the way you should go if you want to link external (3rd party) libraries. Maybe I’ll talk about creating static libraries for iOS and Android in a future post…
In this case however, the GetFramePointer function disassembles to just 8 bytes, so I put those 8 bytes into a constant array, create a function type and have a function pointer of that type point to the constant array:

    type
      TGetFramePointer = function: NativeUInt; cdecl;

    const
      GET_FRAME_POINTER_CODE: array [0..7] of Byte = (
        $00, $00, $97, $E5,  // ldr R0, [R7]
        $1E, $FF, $2F, $E1); // bx LR

    var
      GetFramePointer: TGetFramePointer = @GET_FRAME_POINTER_CODE;
A bit dirty, but it works.
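To see how those 8 bytes relate to the two instructions, the sketch below combines them into 32-bit words. It assumes little-endian byte order, which is how ARM code is laid out on Android, so the bytes of each instruction appear in reverse order in the array. The helper InstructionWord is a name made up for this example; the two expected values are the standard ARM encodings of these instructions.

```pascal
program DecodeFramePointerCode;

{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

const
  GET_FRAME_POINTER_CODE: array [0..7] of Byte = (
    $00, $00, $97, $E5,  // ldr R0, [R7]
    $1E, $FF, $2F, $E1); // bx LR

  { ARM (32-bit, non-Thumb) encodings of the two instructions. }
  ENCODING_LDR_R0_R7 = $E5970000;
  ENCODING_BX_LR     = $E12FFF1E;

{ Combines the 4 bytes starting at Index into one little-endian
  32-bit instruction word. Hypothetical helper for illustration. }
function InstructionWord(Index: Integer): Cardinal;
begin
  Result := Cardinal(GET_FRAME_POINTER_CODE[Index])
    or (Cardinal(GET_FRAME_POINTER_CODE[Index + 1]) shl 8)
    or (Cardinal(GET_FRAME_POINTER_CODE[Index + 2]) shl 16)
    or (Cardinal(GET_FRAME_POINTER_CODE[Index + 3]) shl 24);
end;

begin
  WriteLn('First word is "ldr R0, [R7]": ',
    InstructionWord(0) = ENCODING_LDR_R0_R7);
  WriteLn('Second word is "bx LR": ',
    InstructionWord(4) = ENCODING_BX_LR);
end.
```

Both comparisons hold for the byte array shown above. A check like this is a cheap way to catch a transposed byte before the array is ever executed as code.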
Find Correct LR Value
OK, so now we have a stack trace with 7 potential LR values for each entry in the stack. Now we need to figure out which of those 7 represents the actual LR value. Most of the time, this will be the first of the 7 values. We cannot be 100% sure which of the 7 values (if any) represents the actual LR value. So instead, we just pass these values to the dladdr API (see previous article for details). If that API succeeds, we assume that we found the LR value. However, this is not foolproof because any value we pass to dladdr may happen to be a valid code address, but not the LR value we are looking for.
Also, the LR value contains the address of the next instruction after the call instruction. Delphi usually uses the BL or BLX instruction to call another routine. These instructions take 4 bytes, so LR will be set to 4 bytes after the BL(X) instruction (the return address). However, we want to know at what address the call was made, so we need to subtract 4 bytes.
There is one final complication here: the lowest bit of the LR register indicates the mode the CPU operates in (ARM or Thumb). We need to clear this bit to get to the actual address, by AND’ing it with “not 1”.
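Taken together, the two adjustments are a one-liner. The sketch below isolates them in a hypothetical helper called CallAddressFromLR and applies it to two made-up LR values, one with the Thumb bit set and one without. Overflow and range checking are switched off because the subtraction wraps around for values smaller than 4, such as the zeros found in unused stack slots.

```pascal
program ReturnAddressSketch;

{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}
{$Q-} { overflow checks off: very small values wrap instead of raising }
{$R-} { range checks off for the same reason }

{ Converts a saved LR value to the address of the call instruction:
  clear the lowest bit (the ARM/Thumb mode flag), then step back over
  the 4-byte BL or BLX instruction. Hypothetical helper. }
function CallAddressFromLR(LR: Cardinal): Cardinal;
begin
  Result := (LR and not Cardinal(1)) - 4;
end;

begin
  { Made-up LR value with the Thumb bit set. }
  WriteLn(CallAddressFromLR($A5203DA3) = $A5203D9E);
  { Made-up LR value in ARM mode (lowest bit already clear). }
  WriteLn(CallAddressFromLR($A520453C) = $A5204538);
end.
```

Both comparisons are true. A reader points out in the comments that overflow checking trips on this subtraction, which is why the sketch disables it explicitly instead of relying on project settings.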
So the final version (as found in TgoExceptionReporter.GetCallStack in the sample code) looks like this:

    class function TgoExceptionReporter.GetCallStack(
      const AStackInfo: Pointer): TgoCallStack;
    var
      CallStack: PCallStack;
      I, J: Integer;
      FoundLR: Boolean;
    begin
      { Convert TCallStack to TgoCallStack }
      CallStack := AStackInfo;
      SetLength(Result, CallStack.Count);
      for I := 0 to CallStack.Count - 1 do
      begin
        FoundLR := False;
        for J := 0 to Length(CallStack.Stack[I]) - 1 do
        begin
          Result[I].CodeAddress := (CallStack.Stack[I, J] and not 1) - 4;
          if GetCallStackEntry(Result[I]) then
          begin
            { Assume we found LR }
            FoundLR := True;
            Break;
          end;
        end;

        if (not FoundLR) then
          { None of the 7 values were valid.
            Set CodeAddress to 0 to signal we couldn't find LR. }
          Result[I].CodeAddress := 0;
      end;
    end;
The dladdr API is called as part of the GetCallStackEntry method, which also demangles the symbol found at the given address.
Symbolication Revisited
We can now run our demo app from the last article to generate an error report, which looks something like this:
    Access violation at address A5203DA2, accessing address FFFFFFFF
    At address: $A5203DA2
    Call stack:
    $00000000
    $00000000
    $00000000
    $00000000
    $00000000
    $00000000
    $00000000
    $00000000
    $00000000
    $00000000
    $00000000
    $00000000
    $00000000
    libandroid.so $AD544CDE AInputQueue_preDispatchEvent + 1
    $00000000
    $00000000
    $00000000
    $00000000
    $00000000
    libErrorReportingSample.so $A520453C _NativeMain + 71
Well that doesn’t look good. We were able to walk the stack, but symbolication failed for almost all symbols. What is going on here?
Looking at the .so file that Delphi generates for your app, it turns out that almost all routine symbols are stored as local symbols instead of global symbols. The dladdr API only works with global symbols.
If you look at the files that Delphi generates in the Android build directory, then you will find a file with a .vsr extension that looks like this:

    EXPORTED {
      global:
        _NativeMain;
        __rsrc_*;
        __rstr_*;
        dbkFCallWrapperAddr;
        __dbk_fcall_wrapper;
        _DbgExcNotify;
        _Unwind_VRS_Get;
        ...
        ExecJNI;
        ANativeActivity_onCreate;
      local:
        *;
    };
This file is a so-called version script (hence the .vsr extension). Among other things, it tells the linker which symbols should be global and which ones should be local. Delphi generates a file that only makes a couple of symbols global, and all other symbols local (with the local: *; wildcard in the file).
Fortunately, we can create an additional .vsr file that contains some more rules. The repository contains a sample goExports.vsr file that makes all symbols we care about global:

    {
      global:
        _ZN*;
        _ZZ*;
    };
This file says that all symbols that start with _ZN or _ZZ should be global. These are prefixes of mangled symbol names. Every mangled name starts with _Z and scoped names follow with an N after that. Since Delphi includes the name of the unit in every symbol (as in MyUnit.MyProc), almost every symbol starts with _ZN. An exception is nested routines, which start with _ZZ.
For those interested, this name mangling scheme is specified in the Itanium C++ ABI.
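The two wildcard rules in goExports.vsr boil down to a prefix test. The sketch below mimics that test in Delphi, purely to illustrate which names the patterns match. The linker does the real matching, and MatchesGoExports is a hypothetical helper that does not exist in the repository. The mangled name used as sample input is taken from a linker message quoted in the comments below.

```pascal
program VersionScriptPrefixSketch;

{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

{ Returns True if Symbol would match the _ZN* or _ZZ* pattern.
  Hypothetical helper, for illustration only. }
function MatchesGoExports(const Symbol: string): Boolean;
var
  Prefix: string;
begin
  { Copy uses 1-based positions, so this takes the first 3 characters. }
  Prefix := Copy(Symbol, 1, 3);
  Result := (Prefix = '_ZN') or (Prefix = '_ZZ');
end;

begin
  { A mangled, unit-scoped Delphi routine: matches _ZN*. }
  WriteLn(MatchesGoExports(
    '_ZN6Grijjy14Errorreporting12cxa_demangleEPKcPc9NativeIntRi'));
  { Not a mangled name. It is made global by the Delphi-generated
    version script instead, not by these two patterns. }
  WriteLn(MatchesGoExports('_NativeMain'));
  { A plain C symbol: not matched either. }
  WriteLn(MatchesGoExports('__cxa_demangle'));
end.
```

Only the first name matches. This also shows why the extra script leaves symbols from C libraries alone: they are not mangled with the _Z prefix.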
Now we need to tell the linker to use this .vsr file in addition to the one that Delphi generates. You do this by setting a linker command line option:
- Open the project options (using the menu “Project | Options…”).
- Select the target “All configurations – Android platform”.
- Go to the page “Delphi Compiler | Linking”.
- Set “Options passed to the LD linker” to --version-script=goExports.vsr

If the .vsr file is not in the application directory, you need to specify an absolute or relative directory (as in --version-script=..\goExports.vsr).
When you build the app now, the generated .so file with your app will be a bit bigger. This is because it stores most symbols in the export table now. But it does allow you to create meaningful error reports:

    Access violation at address A51D8882, accessing address FFFFFFFF
    At address: $A51D8882 (Fmain.TFormMain.ButtonAVClick(TObject*) + 21)
    Call stack:
    libErrorReportingSample.so $A4C3BA38 Sysutils.RaiseExceptObject(TExceptionRecord*) + 35
    libErrorReportingSample.so $A4C05B28 _RaiseAtExcept(TObject*, Pointer) + 67
    libErrorReportingSample.so $A4C1E8DC Internal.Excutils.SignalConverter(Cardinal, Cardinal, Cardinal) + 35
    libErrorReportingSample.so $A51D887C Fmain.TFormMain.ButtonAVClick(TObject*) + 15
    libErrorReportingSample.so $A501FBE8 Fmx.Controls.TControl.Click() + 583
    libErrorReportingSample.so $A514955C Fmx.Stdctrls.TCustomButton.Click() + 11
    libErrorReportingSample.so $A50201C0 Fmx.Controls.TControl.MouseClick(Uitypes.TMouseButton, set of Classes.System_Classes__1, Single, Single) + 67
    libErrorReportingSample.so $A5185330 Fmx.Forms.TCommonCustomForm.MouseUp(Uitypes.TMouseButton, set of Classes.System_Classes__1, Single, Single, Boolean) + 207
    libErrorReportingSample.so $A5185330 Fmx.Forms.TCommonCustomForm.MouseUp(Uitypes.TMouseButton, set of Classes.System_Classes__1, Single, Single, Boolean) + 207
    libErrorReportingSample.so $A50D3886 Fmx.Platform.Android.TWindowManager.MouseUp(Uitypes.TMouseButton, set of Classes.System_Classes__1, Single, Single, Boolean) + 197
    libErrorReportingSample.so $A50EEC6C Fmx.Platform.Android.TPlatformAndroid.ProcessAndroidMouseEvents() + 263
    libErrorReportingSample.so $A50EEDE4 Fmx.Platform.Android.TPlatformAndroid.HandleAndroidMotionEvent(AInputEvent*) + 299
    libErrorReportingSample.so $A50D89D2 Fmx.Platform.Android.TPlatformAndroid.HandleAndroidInputEvent(AInputEvent*) + 57
    libErrorReportingSample.so $A50E7496 Fmx.Platform.Android.HandleAndroidInputEvent(var Androidapi.Appglue.TAndroid_app, AInputEvent*) + 21
    libErrorReportingSample.so $A4E7FE08 Androidapi.Appglue.process_input(Androidapi.Appglue.TAndroid_app*, Androidapi.Appglue.android_poll_source*) + 103
    libErrorReportingSample.so $A50EC19E Fmx.Platform.Android.TPlatformAndroid.InternalProcessMessages() + 149
    libErrorReportingSample.so $A50D7CA6 Fmx.Platform.Android.TPlatformAndroid.Run() + 13
    libErrorReportingSample.so $A517D79C Fmx.Forms.TApplication.Run() + 75
    libErrorReportingSample.so $A517D79C Fmx.Forms.TApplication.Run() + 75
    libErrorReportingSample.so $A51D901C _NativeMain + 71
Source Code
As always, you can find the accompanying source code on GitHub as part of the JustAddCode repository.
As I said in the previous article, this code just presents some building blocks for creating your own error reporter. But those are important building blocks and should give you a good head start…
Wow!
Probably the “blog post of the year”!
Thanks for the information and knowledge you’re sharing in your blog. Very much apprecited.
Thanks! I’m glad you liked it!
Yes. Super Wow. Probably the “blog of the year”! And can we expect that S. Dupont and others at Embarcadero will be a little pro-active to add tools and library like exposed on this site.
Thanks a lot!
Eddy
Erik van Bilsen : this is a wonderfull job ! months i was looking for a solution who work with delphi and didn’t find this blog before ..
You should recommend it to EMB.
Thank you!
So if I get “At address: $A51D8882 (Fmain.TFormMain.ButtonAVClick(TObject*) + 21)”, how do I find position 21 in my source code. Where do I have to start counting (begin statement, var block, first line of procedure?), do I have to skip spaces or returns while counting? Should optimalization be turned off as compiler directive?
Hi Robert,
The building blocks described in these blog posts don’t provide any line number information or ways to get to the line number.
The offset you get (21 in this example) is the number of *bytes* into the routine, starting from the “begin” statement.
A way you can use this offset is to set a breakpoint on the “begin” line and open the CPU view when the breakpoint is hit. You can then add the offset to the current instruction address and look at the line at that address. This CPU view should then also show you the source line at that address.
Of course, this only works if the version that crashes is compiled with the same settings as the version you are debugging…
Hi Erik
Another fantastic blog, thank you! I’ve been using this in 10.1 Berlin and it works brilliantly.
On Rio however, I can’t link my app anymore if I have the --version-script=goExports.vsr in the linker options. I get an error “anonymous version tag cannot be combined with other version tags”.
Have your team discovered this same error and a workaround?
Hi Chris,
Glad you like the post and that the code has been working out for you.
Did you pull the latest version of the JustAddCode repo (https://github.com/grijjy/JustAddCode)?
I did make a change to the VSR file a while ago to make it compatible with Rio. That has been working for me…
I Found a little bug/problem in the unit grijjy.ErrorReporting.pas,
in class function TgoExceptionReporter.GlobalGetExceptionStackInfo
you do
Move(Address^, CallStack.Stack[Count], SizeOf(TStackValues));
With count = 1 and length(CallStack.Stack) = 1 so you have “out of bound exception”. To see it you must activate the range check error in the compiler
The code by itself is correct, it just should be compiled with range checking turned off. I disabled range checking in this unit now. Could you check if that fixes your issue?
Hi Erik,
You also need to add {$OverFlowChecks Off} because else we have integer overflow at this row:
Result[I].CodeAddress := (CallStack.Stack[I, J] and not 1) - 4;
Hi Erik,
I just try to migrate to the new Delphi 10.3.3 and I receive this error:
[DCC Error] E2597 C:\SDKs\android-ndk-r17b\toolchains\aarch64-linux-android-4.9\prebuilt\windows\aarch64-linux-android\bin\ld.exe: cannot find -lgnustl_static
cf: https://stackoverflow.com/questions/59240345/can-not-compile-my-android-app-in-64-bit-i-receive-dcc-error-cannot-find-lgnus
Is Their any workaround to make your wonderfull class working under Android 64 bit ?
64-bit Android does not support gnustl. Maybe it is sufficient to replace this in the source code with another C++ runtime. I cannot test this at the moment, but you could try experimenting by replacing libgnustl_static.a with libc++.a, libc++_static.a or another C++ runtime…
Thank Erik,
with libc++_static.a I have this error:
[DCC Error] E2597 Grijjy.ErrorReporting.o: In function `Grijjy::Errorreporting::cxa_demangle(char const*, char*, NativeInt, int&)’:
Grijjy.ErrorReporting:(.text._ZN6Grijjy14Errorreporting12cxa_demangleEPKcPc9NativeIntRi[_ZN6Grijjy14Errorreporting12cxa_demangleEPKcPc9NativeIntRi]+0x0): undefined reference to `__cxa_demangle’
But it’s seam to compile with libc++.a, Do you think it’s will work like this without anything else? I discover that my android device work in 32 bit so I can not test it right now 😦
There may be a different API for 64bit, or maybe the name has one underscore less. I will try to look into this tomorrow and push a fix…
Thanks Erik!
Again congratulation for your excellent work!
A look at the code and it is more complicated than a missing import. The code for Android currently assumes a 32-bit architecture and instruction set to retrieve data like a call stack. It’s going to take more time to look into this…
There very few people who can do what you did (I think only you even), so no choice, need to wait for you 🙂 And it’s very great that you share your knowledge with the community! I can’t wait to see your work available for Android 64 bit, I hope it’s will be soon…
Hi Erik, When do you think you will have the time to take a look at the android 64 bit? Is it a complicated task, can we help you in any way?
Maybe I can find some time this weekend. It will take some research, since getting a stack trace on ARM64 is very different than on ARM32. I will post any updates here…
thanks 🙂
Just pushed an update with Android64 support. I only have a single 64-bit Android device available for testing, but it did work on that one. Let me know if there is a problem running it on your device(s).
That very great, as soon as I m at home, I will test it! Thanks again Erik!