[native/mono] Use mono_jit_thread_attach() by jonpryor · Pull Request #9937 · dotnet/android · GitHub

@jonpryor
Contributor

TODO: what's the explanation?
…o_jit_thread_attach

Get all those CI "ignore networking error" fixes!

Make CI Green Again™
(…wait.)
@lateralusX
Member

lateralusX commented Mar 21, 2025

Analysis switching to mono_jit_thread_attach in dotnet/android:

TL;DR

dotnet/android already runs the majority of its threads as cooperative-suspend-aware threads (GC safe on return from thread attach), and hybrid suspend pre-emptively suspends threads that are currently in GC safe mode. This indicates that changing the remaining calls from mono_thread_attach to mono_jit_thread_attach should be a low-risk change.

The full story:

The Mono runtime can run under different suspend models: pre-emptive, hybrid, or cooperative suspend. In the past, Xamarin Android (mono/mono) used the pre-emptive suspend model, but starting with .NET 6, dotnet Android switched to the hybrid suspend model.

The major difference between these two is how threads are suspended and resumed when a GC is triggered. Pre-emptive suspend relies on signals, meaning that any thread attached to the runtime can be suspended at any location in code, including bad locations that could cause side effects (like while holding low-level locks). The Mono embedding APIs were originally designed with this suspend model in mind.

The hybrid suspend model, on the other hand, is a combination of the pre-emptive and cooperative suspend models. For this discussion, the interesting part is that threads running in GC unsafe mode under hybrid suspend need to hit safe points to be suspended. If a thread does not hit a safe point, the runtime will (by default) wait for that thread forever, hanging the GC in its STW ("stop the world") phase. A safe point is a location in code where a thread yields execution and waits for the GC to complete and resume it. Put differently, a safe point is a location where a thread tells the runtime that it is in a GC safe region, promising not to touch any managed memory or call any runtime functions, as described here: https://www.mono-project.com/docs/advanced/runtime/docs/coop-suspend/#gc-safe-mode.

Switching back and forth between GC unsafe and GC safe is mainly taken care of by the runtime. For example, calling a p/invoke marks the thread as being in GC safe mode while the p/invoke runs; the same applies to internal runtime waits, hitting safe points in C# code, etc. A thread running in GC unsafe mode is executing managed or runtime code and needs to hit a safe point before it can be suspended by the GC.

A thread can start out in either GC unsafe or GC safe mode. The following lists a couple of scenarios:

  • mono_thread_attach: for backward compatibility with the Mono embedding API and embedders, the thread is attached to the runtime in GC unsafe mode, meaning that it needs to reach a safe point to be suspended.
  • mono_jit_thread_attach: the thread is attached to the runtime in GC safe mode.
  • mono_jit_init/mono_jit_init_version: the thread calling these functions to initialize the runtime is put in GC safe mode.
  • Native-to-managed wrappers (unmanaged-callers-only methods, reverse p/invoke function pointers, GetFunctionPointerForDelegate, etc.) attach unattached threads so that they are in GC safe mode on return.
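
The scenarios above can be summarized as a small lookup. The following is only a toy model in C to make the mapping explicit: the enum and function names (`attach_path`, `mode_after_attach`, `may_block_in_native_code`) are illustrative assumptions and are not part of the Mono runtime or its embedding API.

```c
#include <assert.h>

/* Toy model only: these types are illustrative, not Mono runtime types. */
typedef enum { GC_UNSAFE, GC_SAFE } gc_mode;

typedef enum {
    ATTACH_MONO_THREAD_ATTACH,      /* mono_thread_attach */
    ATTACH_MONO_JIT_THREAD_ATTACH,  /* mono_jit_thread_attach */
    ATTACH_RUNTIME_INIT,            /* mono_jit_init / mono_jit_init_version */
    ATTACH_NATIVE_TO_MANAGED        /* reverse p/invoke, unmanaged callers only, ... */
} attach_path;

/* Mode the thread is left in when the attach call (or wrapper) returns. */
static gc_mode mode_after_attach(attach_path path)
{
    switch (path) {
    case ATTACH_MONO_THREAD_ATTACH:
        return GC_UNSAFE;
    case ATTACH_MONO_JIT_THREAD_ATTACH:
    case ATTACH_RUNTIME_INIT:
    case ATTACH_NATIVE_TO_MANAGED:
        return GC_SAFE;
    }
    assert(!"unknown attach path");
    return GC_UNSAFE;
}

/* A thread may wait outside managed/runtime code only while it is GC safe. */
static int may_block_in_native_code(attach_path path)
{
    return mode_after_attach(path) == GC_SAFE;
}
```

In this model only the `mono_thread_attach` path leaves the thread in a mode where blocking in native code would stall a GC.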

A thread that is running in GC safe mode must be switched to GC unsafe mode when re-entering managed or runtime code. When calling through the native-to-managed wrappers, this is taken care of by the wrapper. When calling through the Mono embedding APIs, each individual API needs to decide (based on what it does) whether to switch to GC unsafe and then back to the state the thread had when entering the API (which could actually be GC unsafe if it was called in GC unsafe mode).
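
The "switch to GC unsafe, then restore the entry state" pattern can be sketched as follows. This is a toy model under stated assumptions: `toy_thread`, `toy_enter_gc_unsafe`, `toy_exit_gc_unsafe` and `toy_embedding_api` are hypothetical names invented for illustration and do not correspond to the runtime's actual transition functions.

```c
#include <assert.h>

/* Toy model only: illustrative types, not the runtime's thread state. */
typedef enum { GC_UNSAFE, GC_SAFE } gc_mode;
typedef struct { gc_mode mode; } toy_thread;

/* Switch to GC unsafe and return the mode the caller was in. */
static gc_mode toy_enter_gc_unsafe(toy_thread *t)
{
    gc_mode previous = t->mode;
    t->mode = GC_UNSAFE;
    return previous;
}

/* Restore whatever mode the thread had when it entered the API. */
static void toy_exit_gc_unsafe(toy_thread *t, gc_mode previous)
{
    assert(t->mode == GC_UNSAFE);
    t->mode = previous;
}

/* Hypothetical embedding API that needs to touch managed memory. */
static void toy_embedding_api(toy_thread *t)
{
    gc_mode previous = toy_enter_gc_unsafe(t);
    /* ... work on managed objects would happen here, in GC unsafe mode ... */
    toy_exit_gc_unsafe(t, previous);
}

/* Helper: mode of a thread after calling the API from a given start mode. */
static gc_mode toy_mode_after_call(gc_mode before)
{
    toy_thread t = { before };
    toy_embedding_api(&t);
    return t.mode;
}
```

The point of returning the previous mode is that the API works the same whether the caller was GC safe or already GC unsafe: the thread always leaves in the mode it arrived in.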

A thread's GC mode is critical when running a GC, since the GC needs to do an STW in order to proceed with GC work. The hybrid suspend model's STW is a little more complex in how it operates compared to both pre-emptive and cooperative suspend, but it mainly boils down to two phases. In the first phase, all threads attached to the runtime are checked. A thread that is currently in GC safe mode is ignored in this phase, while all threads in GC unsafe mode are waited upon until they reach a safe point. This is normally where we see deadlocks in ANRs, due to threads not reaching safe points in a timely manner. Once the first phase is done (all threads in GC unsafe mode have reached safe points), the second phase considers all threads still in GC safe mode and pre-emptively suspends them (using signals).
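
The two phases described above can be modeled roughly as below. This is a deliberately simplified sketch, not the runtime's implementation: the struct fields and function names are assumptions made for illustration, and "waiting forever" is represented by returning 0.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only: illustrative types, not the runtime's thread state. */
typedef enum { GC_UNSAFE, GC_SAFE } gc_mode;

typedef struct {
    gc_mode mode;
    int reaches_safe_point; /* 0: e.g. parked in an external thread pool */
    int suspended;
} toy_thread;

/* Returns 1 if the world stops, 0 if the GC would wait forever. */
static int toy_hybrid_stw(toy_thread *threads, size_t count)
{
    /* Phase 1: GC safe threads are skipped; GC unsafe threads are waited on. */
    for (size_t i = 0; i < count; i++) {
        if (threads[i].mode == GC_SAFE)
            continue;
        if (!threads[i].reaches_safe_point)
            return 0;             /* STW never completes */
        threads[i].suspended = 1; /* thread parked itself at a safe point */
    }
    /* Phase 2: threads still in GC safe mode are pre-emptively suspended. */
    for (size_t i = 0; i < count; i++) {
        if (threads[i].suspended)
            continue;
        assert(threads[i].mode == GC_SAFE);
        threads[i].suspended = 1; /* stands in for the suspend signal */
    }
    return 1;
}

/* One managed thread plus one thread parked outside the runtime. */
static int toy_stw_with_parked_thread(gc_mode parked_mode)
{
    toy_thread threads[] = {
        { GC_UNSAFE, 1, 0 },   /* managed thread that polls safe points */
        { parked_mode, 0, 0 }, /* parked in native/Java code */
    };
    return toy_hybrid_stw(threads, sizeof threads / sizeof threads[0]);
}
```

In this model, a parked thread in GC safe mode lets the STW complete, while the same parked thread in GC unsafe mode stalls phase one.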

We have identified several ANRs (Application Not Responding) on Android where we have seen threads attached to the runtime with callstacks like this:

"queue-1-2" tid=8105 Native
  #00  pc 0x000000000006a0c0  /system/lib64/libc.so (__rt_sigsuspend+4)
  #01  pc 0x0000000000029684  /system/lib64/libc.so (sigsuspend+44)
  #02  pc 0x00000000001ff994  /data/app/<app.bundle.id>-_I4CSOAWam382fA8t14IEg==/lib/arm64/libmonosgen-2.0.so (suspend_signal_handler+200)
  #03  pc 0x00000000000005dc  [vdso:000000737837f000]
  #04  pc 0x000000000001dae8  /system/lib64/libc.so (syscall+24)
  #05  pc 0x00000000000e1ee4  /system/lib64/libart.so (art::ConditionVariable::WaitHoldingLocks(art::Thread*)+152)
  #06  pc 0x0000000000392794  /system/lib64/libart.so (art::Monitor::Wait(art::Thread*, long, int, bool, art::ThreadState)+632)
  #07  pc 0x0000000000394288  /system/lib64/libart.so (art::Monitor::Wait(art::Thread*, art::mirror::Object*, long, int, bool, art::ThreadState)+252)
  #08  pc 0x00000000001dcadc  /system/framework/arm64/boot.oat (java.lang.Object.wait [DEDUPED]+140)
  #09  pc 0x00000000001fcebc  /system/framework/arm64/boot.oat (java.lang.Thread.parkFor$+428)
  #10  pc 0x0000000000608cd8  /system/framework/arm64/boot.oat (java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await+808)
  #11  pc 0x00000000005de8ec  /system/framework/arm64/boot.oat (java.util.concurrent.LinkedBlockingQueue.take+156)
  #12  pc 0x00000000005e9e1c  /system/framework/arm64/boot.oat (java.util.concurrent.ThreadPoolExecutor.getTask+492)
  #13  pc 0x00000000005ec2b0  /system/framework/arm64/boot.oat (java.util.concurrent.ThreadPoolExecutor.runWorker+240)
  #14  pc 0x00000000005fb114  /system/framework/arm64/boot.oat (java.util.concurrent.ThreadPoolExecutor$Worker.run+68)
  #15  pc 0x00000000001fd13c  /system/framework/arm64/boot.oat (java.lang.Thread.run+76)
  #16  pc 0x0000000000509384  /system/lib64/libart.so (art_quick_invoke_stub+580)
  #17  pc 0x00000000000d8078  /system/lib64/libart.so (art::ArtMethod::Invoke(art::Thread*, unsigned int*, unsigned int, art::JValue*, char const*)+200)
  #18  pc 0x0000000000431120  /system/lib64/libart.so (art::InvokeWithArgArray(art::ScopedObjectAccessAlreadyRunnable const&, art::ArtMethod*, art::ArgArray*, art::JValue*, char const*)+104)
  #19  pc 0x00000000004322ac  /system/lib64/libart.so (art::InvokeVirtualOrInterfaceWithJValues(art::ScopedObjectAccessAlreadyRunnable const&, _jobject*, _jmethodID*, jvalue*)+432)
  #20  pc 0x0000000000458e8c  /system/lib64/libart.so (art::Thread::CreateCallback(void*)+1140)
  #21  pc 0x00000000000678b4  /system/lib64/libc.so (__pthread_start(void*)+36)
  #22  pc 0x000000000001ef24  /system/lib64/libc.so (__start_thread+68)

This thread is attached to the runtime, since it is running our suspend signal handler, but it also appears to be waiting inside a Java thread pool. This works under pre-emptive suspend. However, if the same thread has been attached to the runtime under hybrid suspend and ends up waiting like this outside of managed and runtime code, then the thread must be in GC safe mode or it violates the runtime's hybrid suspend model: a thread in GC unsafe mode needs to reach a safe point in a timely manner, something the above callstack will probably never do, blocking the completion of the STW.

It turns out that the dotnet/android codebase still has two locations that can attach threads using mono_thread_attach, while the majority of threads are attached either as the runtime init thread, through marshalled methods using mono_jit_thread_attach, or through native-to-managed wrappers. Threads attached in one of these three ways will be in GC safe mode, meaning they either reach safe points or are pre-emptively suspended under the hybrid suspend model. After analyzing the code paths in dotnet/android that end up calling mono_thread_attach, it is still not clear whether they are reachable in real-world scenarios. However, we have ANRs that point to issues suspending threads, we have seen threads with callstacks waiting outside the runtime, and the majority of threads attached to the runtime in dotnet Android appear to attach as cooperative-suspend-aware (GC safe mode on return). It therefore makes sense to standardize and attach all threads as cooperative-suspend-aware in the dotnet/android repo.

As part of this analysis, I also looked over all Mono APIs used by dotnet/runtime and analyzed whether they correctly switch to GC unsafe when called, whether they are cooperative-suspend-aware, and whether they are safe to call only during init or before running managed code. The fact that hybrid suspend pre-emptively suspends threads that are running in GC safe mode reduces the issues with APIs that currently don't enter GC unsafe (but probably should) or that are not cooperative-suspend-aware because they pass raw GC objects as parameters or return values.

We already run the majority of threads as cooperative-suspend-aware threads (GC safe on return from thread attach) in dotnet/android, and hybrid suspend pre-emptively suspends threads that are currently in GC safe mode. This indicates that changing the remaining calls from mono_thread_attach to mono_jit_thread_attach should be a low-risk change.

For completeness, this is the list of Mono embedding APIs used by dotnet/android and their state regarding switching to GC unsafe, being cooperative-suspend-aware, and potential implications. APIs without comments should be safe to call under any suspend model. APIs marked as “init-only” should be called before runtime init or before running managed code: they are either not thread safe when changing runtime state, are used during runtime initialization, or need to be in place before running managed code. APIs marked with “Can’t be called under coop suspend model.” normally use raw GC objects as parameters or return values. These APIs can’t be called under the cooperative suspend model, but since the hybrid suspend model pre-emptively suspends threads in GC safe mode, it can still scan each thread's active stack and registers, so it should be able to handle direct GC references on the stack or in registers for all attached runtime threads. The last category, “Should transition to GC unsafe.”, is mainly APIs that should do a GC unsafe transition internally but currently don’t. This should probably be fixed in the runtime, and until then these APIs can’t be safely called under the coop suspend model. They should however still be safe under hybrid suspend, since threads in GC safe mode will be pre-empted.

| Mono API | Switches to GC unsafe | Cooperative suspend aware | Comment |
| --- | --- | --- | --- |
| mono_add_internal_call | No | Yes | init-only |
| mono_alc_get_default_gchandle | No | Yes | |
| mono_array_new | Yes | No | Can’t be called under coop suspend model. |
| mono_assembly_get_image | Yes | Yes | |
| mono_assembly_load_from_full | Yes | Yes | |
| mono_assembly_load_full | Yes | Yes | |
| mono_assembly_load_full_alc | Yes | Yes | |
| mono_assembly_loaded | Yes | Yes | |
| mono_assembly_name_free | Yes | Yes | |
| mono_assembly_name_get_culture | Yes | Yes | |
| mono_assembly_name_get_name | Yes | Yes | |
| mono_assembly_name_new | Yes | Yes | |
| mono_assembly_open_full | Yes | Yes | |
| mono_check_corlib_version | Yes | Yes | |
| mono_class_from_mono_type | Yes | Yes | |
| mono_class_from_name | Yes | Yes | |
| mono_class_get | Yes | Yes | |
| mono_class_get_field_from_name | Yes | Yes | |
| mono_class_get_image | No | Yes | |
| mono_class_get_method_from_name | Yes | Yes | |
| mono_class_get_name | Yes | Yes | |
| mono_class_get_namespace | Yes | Yes | |
| mono_class_get_type | No | Yes | |
| mono_class_get_type_token | No | Yes | |
| mono_class_is_subclass_of | Yes | Yes | |
| mono_class_vtable | Yes | Yes | |
| mono_config_is_server_mode | No | Yes | |
| mono_debug_init | No | Yes | init-only |
| mono_debug_open_image_from_memory | Yes | Yes | |
| mono_debugger_agent_unhandled_exception | Yes | No | Can’t be called under coop suspend model. |
| mono_dl_fallback_register | No | Yes | init-only |
| mono_domain_foreach | Yes | Yes | |
| mono_domain_get | No | Yes | |
| mono_domain_get_id | No | Yes | |
| mono_domain_set | Yes | Yes | |
| mono_error_get_message | No | Yes | Should transition to GC unsafe. |
| mono_field_get_value | Yes | No | Can’t be called under coop suspend model. |
| mono_field_set_value | Yes | No | Can’t be called under coop suspend model. |
| mono_field_static_set_value | Yes | Yes | |
| mono_gc_register_bridge_callbacks | No | Yes | init-only |
| mono_gc_wait_for_bridge_processing | Yes | Yes | |
| mono_get_byte_class | No | Yes | |
| mono_get_method | No | Yes | Should transition to GC unsafe. |
| mono_get_root_domain | No | Yes | |
| mono_get_runtime_build_info | No | Yes | |
| mono_guid_to_string | No | Yes | |
| mono_image_get_name | No | Yes | |
| mono_image_loaded | Yes | Yes | |
| mono_image_open_from_data_alc | Yes | Yes | |
| mono_image_open_from_data_with_name | Yes | Yes | |
| mono_image_strerror | No | Yes | |
| mono_install_assembly_preload_hook | No | Yes | init-only |
| mono_install_assembly_preload_hook_v3 | No | Yes | init-only |
| mono_jit_init_version | No | Yes | |
| mono_jit_parse_options | No | Yes | init-only |
| mono_jit_set_aot_mode | No | Yes | init-only |
| mono_jit_set_trace_options | No | Yes | init-only |
| mono_jit_thread_attach | No | Yes | |
| mono_method_full_name | Yes | Yes | |
| mono_method_get_unmanaged_callers_only_ftnptr | Yes | Yes | |
| mono_object_get_class | Yes | No | Can’t be called under coop suspend model. |
| mono_profiler_create | No | Yes | init-only |
| mono_reflection_assembly_get_assembly | No | No | Can’t be called under coop suspend model. |
| mono_reflection_type_from_name | Yes | Yes | |
| mono_reflection_type_get_type | Yes | No | Can’t be called under coop suspend model. |
| mono_runtime_init | No | Yes | |
| mono_runtime_invoke | Yes | No | Can’t be called under coop suspend model. |
| mono_runtime_set_main_args | No | Yes | init-only |
| mono_set_crash_chaining | No | Yes | init-only |
| mono_set_signal_chaining | No | Yes | init-only |
| mono_set_use_llvm | No | Yes | init-only |
| mono_string_chars | No | No | Can’t be called under coop suspend model. |
| mono_string_length | No | No | Can’t be called under coop suspend model. |
| mono_string_new | Yes | No | Can’t be called under coop suspend model. |
| mono_string_to_utf8 | Yes | No | Can’t be called under coop suspend model. |
| mono_thread_attach | No | No | Can’t be called under coop suspend model. |
| mono_thread_create | Yes | Yes | |
| mono_trace_set_level_string | No | Yes | init-only |
| mono_trace_set_log_handler | No | Yes | init-only |
| mono_trace_set_mask_string | No | Yes | init-only |
| mono_trace_set_print_handler | No | Yes | init-only |
| mono_trace_set_printerr_handler | No | Yes | init-only |
| mono_type_get_name_full | No | Yes | Should transition to GC unsafe. |
| mono_type_get_object | Yes | No | Can’t be called under coop suspend model. |
| mono_unhandled_exception | Yes | No | Can’t be called under coop suspend model. |
| mono_value_copy_array | No | No | Can’t be called under coop suspend model. |
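
As a reading aid for the Comment column, the categories can be expressed as a small predicate. This is a hedged sketch only: the enum names and `toy_api_is_safe` are assumptions introduced here, and the rule for the pre-emptive model is inferred from the statement that the embedding APIs were originally designed for it.

```c
#include <assert.h>

/* Toy model only: illustrative categories taken from the Comment column. */
typedef enum { SUSPEND_PREEMPTIVE, SUSPEND_HYBRID, SUSPEND_COOP } toy_suspend_model;

typedef enum {
    API_NO_COMMENT,         /* safe under any suspend model */
    API_INIT_ONLY,          /* "init-only" */
    API_RAW_GC_OBJECTS,     /* "Can't be called under coop suspend model." */
    API_MISSING_TRANSITION  /* "Should transition to GC unsafe." */
} toy_api_category;

/* 1 if calling an API of this category is expected to be safe. */
static int toy_api_is_safe(toy_api_category category,
                           toy_suspend_model model,
                           int managed_code_running)
{
    switch (category) {
    case API_NO_COMMENT:
        return 1;
    case API_INIT_ONLY:
        return !managed_code_running;
    case API_RAW_GC_OBJECTS:
    case API_MISSING_TRANSITION:
        /* Hybrid pre-empts GC safe threads and scans their stacks/registers. */
        return model != SUSPEND_COOP;
    }
    assert(!"unknown API category");
    return 0;
}
```

For example, `toy_api_is_safe(API_RAW_GC_OBJECTS, SUSPEND_HYBRID, 1)` evaluates to 1, while the same category under `SUSPEND_COOP` evaluates to 0.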

@lateralusX
Member

@jonpryor should we proceed with this PR?

@srxqds

srxqds commented Apr 6, 2025

Other platforms like iOS/Windows should also use mono_jit_thread_attach instead of mono_thread_attach.

@lateralusX
Member

lateralusX commented Apr 7, 2025

> Other platforms like iOS/Windows should also use mono_jit_thread_attach instead of mono_thread_attach.

The dotnet OSX/iOS SDKs already use mono_jit_thread_attach for their integration, as well as explicit calls to enter/exit GC safe/unsafe to manually transition threads between RUNNING and BLOCKING modes.

For embedders, which APIs you use depends on how you use the embedding APIs from the attached thread. mono_thread_attach is the way to do it for backward compatibility and safety of all embedding APIs. A thread attached using this API remains in RUNNING mode (GC unsafe) and is viewed as running managed/runtime code. Such a thread only enters BLOCKING mode (GC safe) when hitting safe points inside managed or runtime code. In that case, if you have some native code like this:

mono_thread_attach (domain);  /* thread stays in RUNNING (GC unsafe) mode */
/* call embedding API */
/* call embedding API */
/* call embedding API */
/* wait on event in external thread pool */

That violates the threading rules for a thread attached using mono_thread_attach, since the thread will be in RUNNING mode while waiting on an external event, blocking the STW from completing. In cases like the above, the thread needs to be detached from the runtime before it can wait on the native thread pool.

If the same thread was attached using mono_jit_thread_attach, then the thread will be in BLOCKING mode when returning from the mono_jit_thread_attach call, and the wait on an event in the external thread pool won't prevent the STW from completing. This is however only safe in hybrid suspend mode (since it will pre-empt threads in BLOCKING mode). In the coop suspend model, it is only safe if the called embedding APIs do proper GC transitions and don't pass raw managed references in or out of the embedding API. Using mono_jit_thread_attach puts more responsibility on the embedder to only call embedding APIs that are fully cooperative-suspend-aware (unless running in hybrid suspend mode). If you end up calling an embedding API that is not fully cooperative-suspend-aware, you will end up corrupting the managed heap in one way or another. The iOS SDK manually transitions the thread into RUNNING mode if it is calling embedding APIs that are not fully cooperative-suspend-aware, and when it is done using them, it can only keep GC handles around when switching the thread back to BLOCKING mode.
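
The difference between the two attach calls when a thread later waits in an external pool can be modeled as a tiny state machine. This is a sketch under the assumption of hybrid suspend; the `toy_*` names are hypothetical stand-ins for illustration and are not the Mono embedding API.

```c
#include <assert.h>

/* Toy model only: illustrative thread states, assuming hybrid suspend. */
typedef enum { TOY_DETACHED, TOY_RUNNING, TOY_BLOCKING } toy_state;

/* Models mono_thread_attach: thread stays in RUNNING (GC unsafe) mode. */
static toy_state toy_thread_attach(toy_state s)
{
    assert(s == TOY_DETACHED);
    return TOY_RUNNING;
}

/* Models mono_jit_thread_attach: thread is in BLOCKING (GC safe) on return. */
static toy_state toy_jit_thread_attach(toy_state s)
{
    assert(s == TOY_DETACHED);
    return TOY_BLOCKING;
}

/* Models detaching the thread from the runtime. */
static toy_state toy_thread_detach(toy_state s)
{
    assert(s != TOY_DETACHED);
    return TOY_DETACHED;
}

/* 1 if a STW can complete while the thread waits in an external pool. */
static int toy_stw_completes_during_wait(toy_state s)
{
    return s != TOY_RUNNING;
}

/* Attach, (notionally) call embedding APIs, optionally detach, then wait. */
static int toy_attach_call_wait(int use_jit_attach, int detach_before_wait)
{
    toy_state s = TOY_DETACHED;
    s = use_jit_attach ? toy_jit_thread_attach(s) : toy_thread_attach(s);
    /* ... embedding API calls would go here ... */
    if (detach_before_wait)
        s = toy_thread_detach(s);
    return toy_stw_completes_during_wait(s);
}
```

Only the combination "attach in RUNNING mode and wait without detaching" stalls the STW in this model. It does not capture the coop-mode caveat: there, a BLOCKING thread is only safe if every embedding API it calls performs proper GC transitions.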

It is worth noting that the wrappers used for reverse delegates, as well as unmanaged-callers-only methods, internally attach the thread using mono_threads_attach_coop, which works similarly to mono_jit_thread_attach: it leaves the thread in BLOCKING mode on return if the thread was not attached to the runtime before the call. It is also worth noting that the thread that initializes the runtime also returns in BLOCKING state.

We had a PR a long time ago to change the behavior of mono_thread_attach to leave the thread in BLOCKING mode on return, but it was backed out due to the risk of breaking embedders: incorrectly using the embedding APIs from a thread in BLOCKING mode is riskier (possible heap corruption) than incorrectly handing a thread over to an external scheduler in RUNNING mode (hanging the STW).

The safest way of using the embedding APIs under all modes is still to use mono_thread_attach and then detach the thread if it ends up being handled by an external scheduler (like waiting outside of runtime/managed code). If you run in hybrid mode, you can get away with mono_jit_thread_attach/mono_threads_attach_coop. If you run in cooperative suspend mode, you either use mono_thread_attach or need a deep understanding of which embedding APIs can be called and, like iOS, manually switch between BLOCKING and RUNNING when needed.

@jonpryor jonpryor marked this pull request as ready for review May 1, 2025 15:19
@jonathanpeppers jonathanpeppers merged commit 8f800b5 into main Jun 13, 2025
58 of 60 checks passed
@jonathanpeppers jonathanpeppers deleted the dev/jonp/jonp-use-mono_jit_thread_attach branch June 13, 2025 19:53
@yeahg-dev

When can we expect this change to be included in the Android SDK? We're experiencing frequent ANRs due to GC-related issues, so it would be great if this update could be released soon.

@jonathanpeppers
Member

@yeahg-dev do you have an example of a crash that includes mono_thread_attach() in the backtrace?

This is going out in .NET 10 previews first, before we decide how safe it is for servicing.

@yeahg-dev

yeahg-dev commented Jun 18, 2025

The reason I need the work from this PR is not because of crashes caused by mono_thread_attach(), but due to ANRs. Through this issue, I’ve learned that mono_thread_attach() can lead to ANRs because it starts threads in a GC unsafe state. Therefore, I’m hoping that this PR will help reduce the number of ANRs we’re currently experiencing in our app.

Just in case, I'm sharing the stack trace of the ANR we're experiencing. I was wondering if this PR could potentially resolve the issue, or if it would be more appropriate to open a separate issue for it. I would really appreciate it if you could take a look at the logs and let me know your thoughts.

anr_epoll_pwait.txt
anr_sigtimedwait.txt

@jonathanpeppers
Member

@yeahg-dev would you be able to test the next .NET 10 preview and see if your problem goes away?

Right now, the fix is kind of "theoretical"; no one has verified it solves anything.

@yeahg-dev

We are only observing the ANRs in production, so deploying .NET 10 Preview to live users would be too risky for us at this point. Therefore, I’m afraid we won’t be able to test it in a live environment.

@github-actions github-actions bot locked and limited conversation to collaborators Jul 19, 2025