ANR全称为Application Not Response,是由于App主线程长时间卡滞导致的。ANR并不代表我们的应用代码一定存在编码问题,通常不会导致FATAL EXECPTION。直观上来讲,ANR发生时应用会无法及时响应用户操作,这会极大影响用户体验。
Android APP开发中的四大基础组件中Service、BroadcastReceiver与ContentProvider以及Input都会去检测ANR超时,这里以Service为例,简述ANR产生的原因:
当我们的应用组件通过startService启动一个Service时,会通过binder通知AMS启动Service,AMS在经过一系列逻辑处理后判断可以启动,就会通过binder通知应用所在进程创建一个Service并执行Service的onCreate生命周期,与此同时AMS会通过handler发送一个延时任务,此延时任务就是ANR任务。在应用进程Service生命周期执行完后就会通过binder告知AMS解除ANR任务;如果应用进程Service生命周期超时,AMS内的ANR任务就会执行,从而记录ANR报错并dump当前系统状态。
这里的机制解析,我们分别从应用APP以及AMS的角度来进行,并且以Android14的代码进行分析。首先是应用App调用startService启动服务:
Intent serviceIntent = new Intent(context, MyService.class);
context.startService(serviceIntent);
在通过startService启动服务后,程序调用会进入到ContextImpl.java内:
//frameworks/base/core/java/android/app/ContextImpl.java
@Override
public ComponentName startService(Intent service) {
warnIfCallingFromSystemProcess();
return startServiceCommon(service, false, mUser);
}
private ComponentName startServiceCommon(Intent service, boolean requireForeground,
UserHandle user) {
//startService核心
validateServiceIntent(service);
service.prepareToLeaveProcess(this);
ComponentName cn = ActivityManager.getService().startService(
mMainThread.getApplicationThread(), service,
service.resolveTypeIfNeeded(getContentResolver()),requireForeground,
getOpPackageName(), getAttributionTag(), user.getIdentifier());
}
接下来我们进入ActivityManagerService,也就是AMS内部:
//frameworks/base/services/core/java/com/android/server/am/ActivityManagerService.java
public ComponentName startService(IApplicationThread caller, Intent service,
String resolvedType, boolean requireForeground, String callingPackage,
String callingFeatureId, int userId)
throws TransactionTooLargeException {
return startService(caller, service, resolvedType,
requireForeground,callingPackage,callingFeatureId, userId, false /*
isSdkSandboxService */, INVALID_UID, null, null);
}
private ComponentName startService(IApplicationThread caller, Intent service,
String resolvedType, boolean requireForeground, String callingPackage,
String callingFeatureId, int userId, boolean isSdkSandboxService,
int sdkSandboxClientAppUid, String sdkSandboxClientAppPackage, String instanceName)
{
//mServices属于ActiveServices对象
res = mServices.startServiceLocked(caller, service,
resolvedType, callingPid, callingUid,
requireForeground, callingPackage, callingFeatureId, userId,
isSdkSandboxService,
sdkSandboxClientAppUid,sdkSandboxClientAppPackage,
instanceName);
}
紧接着我们进入ActiveServices类:
//frameworks/base/services/core/java/com/android/server/am/ActiveServices.java
ComponentName startServiceLocked(IApplicationThread caller, Intent service, String resolvedType,int callingPid, int callingUid, boolean fgRequired,String callingPackage, @Nullable String callingFeatureId, final int userId,BackgroundStartPrivileges backgroundStartPrivileges, boolean isSdkSandboxService,
int sdkSandboxClientAppUid, String sdkSandboxClientAppPackage, String instanceName)
{
final ComponentName realResult =
startServiceInnerLocked(r, service, callingUid, callingPid,
callingProcessName, callingProcessState,
fgRequired, callerFg,
backgroundStartPrivileges, callingPackage);
……
}
//frameworks/base/services/core/java/com/android/server/am/ActiveServices.java
ComponentName startServiceInnerLocked(ServiceMap smap, Intent service, ServiceRecord r,boolean callerFg, boolean addToStarting, int callingUid, String callingProcessName,int callingProcessState, boolean wasStartRequested, String callingPackage)
{
final int uid = r.appInfo.uid;
final String packageName = r.name.getPackageName();
final String serviceName = r.name.getClassName();
FrameworkStatsLog.write(FrameworkStatsLog.SERVICE_STATE_CHANGED, uid,
packageName,serviceName,FrameworkStatsLog.SERVICE_STATE_CHANGED__STATE__START);
mAm.mBatteryStatsService.noteServiceStartRunning(uid, packageName, serviceName);
//启动Service的核心
String error = bringUpServiceLocked(r, service.getFlags(), callerFg,
false /* whileRestarting */,
false /* permissionsReviewRequired */,
false /* packageFrozen */,
true /* enqueueOomAdj */);
}
在startServiceInnerLocked中会进入到bringUpServiceInnerLocked,这里面会分两种情况进行下一步处理。
一种情况是如果Service所在的App进程已经启动,则会调用realStartServiceLocked方法:
//frameworks/base/services/core/java/com/android/server/am/ActiveServices.java
if (app != null) {
app.addPackage(r.appInfo.packageName, r.appInfo.longVersionCode, mAm.mProcessStats);
//当APP已经启动时,启动service
realStartServiceLocked(r, app, thread, pid, uidRecord, execInFg,enqueueOomAdj);
}
//进入realStartServiceLocked内通过scheduleCreateService创建service
private void realStartServiceLocked(ServiceRecord r, ProcessRecord app,
IApplicationThread thread, int pid, UidRecord uidRecord, boolean execInFg,
boolean enqueueOomAdj) throws RemoteException {
……
boolean created = false;
try {
//这里的thread对象代表应用进程的ApplicationThread
thread.scheduleCreateService(r, r.serviceInfo,
null /* compatInfo (unused but need to keep method signature) */,
app.mState.getReportedProcState());
r.postNotification(false);
created = true;
}
}
另一种情况是如果App进程未启动,又可细分为两类:
//frameworks/base/services/core/java/com/android/server/am/ActiveServices.java
if (app == null && !permissionsReviewRequired && !packageFrozen) {
if (r.isSdkSandbox) {
final int uid = Process.toSdkSandboxUid(r.sdkSandboxClientAppUid);
app = mAm.startSdkSandboxProcessLocked(procName, r.appInfo, true,
intentFlags,hostingRecord, ZYGOTE_POLICY_FLAG_EMPTY, uid,
r.sdkSandboxClientAppPackage);
r.isolationHostProc = app;
} else {
app = mAm.startProcessLocked(procName, r.appInfo, true, intentFlags,
hostingRecord, ZYGOTE_POLICY_FLAG_EMPTY, false, isolated);
}
}
这里我们以App已启动为例,通过realStartServiceLocked启动Service,最终会进入到scheduleServiceTimeoutLocked方法:
//frameworks/base/services/core/java/com/android/server/am/ActiveServices.java
void scheduleServiceTimeoutLocked(ProcessRecord proc) {
if (proc.mServices.numberOfExecutingServices() == 0 || proc.getThread() == null) {
return;
}
Message msg = mAm.mHandler.obtainMessage(
ActivityManagerService.SERVICE_TIMEOUT_MSG);
msg.obj = proc;
mAm.mHandler.sendMessageDelayed(msg, proc.mServices.shouldExecServicesFg()
? mAm.mConstants.SERVICE_TIMEOUT : mAm.mConstants.SERVICE_BACKGROUND_TIMEOUT);
}
这里我们可以看到,在scheduleServiceTimeoutLocked中通过handler为Service启动设定了超时任务,这里的超时时间有两个值:针对前台服务是20s,针对后台服务是200s.
DEFAULT_SERVICE_TIMEOUT = 20 * 1000 * Build.HW_TIMEOUT_MULTIPLIER;
DEFAULT_SERVICE_BACKGROUND_TIMEOUT = DEFAULT_SERVICE_TIMEOUT * 10;
至此在AMS中的任务已经告一段落,接着我们回到应用App自身,前面讲到在realStartServiceLocked方法中通过App的ApplicationThread的scheduleCreateService创建service,这里最终进入的是ActivityThread.java中scheduleCreateService:
public final void scheduleCreateService(IBinder token,
ServiceInfo info, CompatibilityInfo compatInfo, int processState) {
updateProcessState(processState, false);
CreateServiceData s = new CreateServiceData();
s.token = token;
s.info = info;
sendMessage(H.CREATE_SERVICE, s);
}
沿着调用链继续追踪,最终回到了ActiveServices.java内的serviceDoneExecutingLocked方法:
//frameworks/base/services/core/java/com/android/server/am/ActiveServices.java
private void serviceDoneExecutingLocked(ServiceRecord r, boolean inDestroying,
boolean finishing, boolean enqueueOomAdj, @OomAdjReason int oomAdjReason) {
final ProcessServiceRecord psr = r.app.mServices;
if (DEBUG_SERVICE) Slog.v(TAG_SERVICE,"Nesting at 0 of " + r.shortInstanceName);
psr.setExecServicesFg(false);
psr.stopExecutingService(r);
if (psr.numberOfExecutingServices() == 0) {
if (DEBUG_SERVICE || DEBUG_SERVICE_EXECUTING) Slog.v(TAG_SERVICE_EXECUTING, "No more executingServices of " + r.shortInstanceName);
mAm.mHandler.removeMessages(ActivityManagerService.SERVICE_TIMEOUT_MSG, r.app);
} else if (r.executeFg) {
……
}
}
这里我们可以看到,在Service创建完成之后,回通过handler将之前的ANR任务解除。
Android原生记录的ANR日志都位于/data/anr文件夹内,当应用发生ANR时,系统可能会存在问题从而导致众多应用都会发生ANR,针对ANR日志我们重点是需要学会如何分析日志,找到ANR的root cause。
当我们通过adb打开/data/anr文件夹时,我们能够看到如下所示内容:
anr_2024-03-31-02-16-39-479
anr_2024-03-31-02-17-21-731
anr_2024-03-31-02-17-58-405
…
所有的anr日志都以anr_xxxx开头,日志文件会携带时间信息,用于表明该anr捕获的时间;
我们打开anr日志文件,有一些关键字段的信息是极其重要的。
subject字段开头的日志表明anr发生的问题原因,如下所示:
Subject: Input dispatching timed out (141c2c9 com.blacksesame.video/com.example.video.ui.activity.VideoPlayActivity (server) is not responding. Waited 5000ms for MotionEvent(deviceId=4, eventTime=225178371585000, source=TOUCHSCREEN, displayId=2, action=DOWN, actionButton=0x00000000, flags=0x00000000, metaState=0x00000000, buttonState=0x00000000, classification=NONE, edgeFlags=0x00000000, xPrecision=8.5, yPrecision=13.3, xCursorPosition=nan, yCursorPosition=nan, pointers=[0: (756.5, 645.2)]), policyFlags=0x62000001).
上述错误表明在com.example.video应用中的VideoPlayActivity发生了anr,anr的原因是由于Input dispatch超时,表明Input分发存在卡滞导致没有响应用户输入。
events字段用于记录该ANR发生之前的ANR记录,如下所示:
events <
timestamp_ms: 1711851399478
anr <
pid: 2564
process: "com.example.video"
process_class: 1
subject: "Input dispatching timed out (141c2c9 com.blacksesame.video/com.example.video.ui.activity.VideoPlayActivity (server) is not responding. Waited 5000ms for MotionEvent(deviceId=4, eventTime=225178371[...]"
uid: 10114
>
>
events <
timestamp_ms: 1711851441487
half_watchdog <
subject: "Blocked in monitor com.android.server.am.ActivityManagerService on monitor thread (watchdog.monitor) for 30s, Blocked in handler on on foreground thread (android.fg) for 30s, Blocked in handler on o[...]"
>
>
这里的events记录了两次ANR,这些ANR都发生在本次ANR之前,可以作为ANR问题发生时的辅助信息。
上述内容是一些信息总览,用于我们快速获取信息,具体的报错日志需要通过dumping pid字段进行分析,如下所示:
----- dumping pid: 710 at 225225702
----- pid 710 at 2024-03-31 02:17:21.735883895+0000 -----
Cmd line: system_server
Build fingerprint: 'c1200_aaos/c1200:14/UQ1A.231205.015.A1/V1.1.3.0:userdebug/test-keys'
ABI: 'arm64'
Build type: optimized
suspend all histogram: Sum: 11.916ms 99% C.I. 1us-1476.159us Avg: 85.726us Max: 2919us
DALVIK THREADS (181):
//“main”与tid=1表明属于主线程,Blocked表明主线程属于Block状态
"main" prio=5 tid=1 Blocked
| group="main" sCount=1 ucsCount=0 flags=1 obj=0x71e344c8 self=0xb400f270432d4be0
| sysTid=710 nice=-2 cgrp=foreground sched=0/0 handle=0xf271325994f8
| state=S schedstat=( 13723566710 13467097570 113753 ) utm=1082 stm=289 core=3 HZ=100
| stack=0xffffc7177000-0xffffc7179000 stackSize=8188KB
//表明发生死锁
| held mutexes=
at com.android.server.am.ActivityManagerService.broadcastIntentWithFeature(ActivityManagerService.java:15389)
//关键信息,对应lock被thread 14占有
- waiting to lock <0x0d8bc079> (a com.android.server.am.ActivityManagerService) held by thread 14
at android.app.ContextImpl.sendBroadcastAsUser(ContextImpl.java:1478)
at com.android.server.alarm.AlarmManagerService$3.lambda$doAlarm$0(AlarmManagerService.java:1995)
at com.android.server.alarm.AlarmManagerService$3.$r8$lambda$o_OrcmiwSegk0Cl6DW0Phad6Mm8(AlarmManagerService.java:0)
at com.android.server.alarm.AlarmManagerService$3$$ExternalSyntheticLambda0.run(R8$$SyntheticClass:0)
at android.os.Handler.handleCallback(Handler.java:958)
at android.os.Handler.dispatchMessage(Handler.java:99)
at android.os.Looper.loopOnce(Looper.java:205)
at android.os.Looper.loop(Looper.java:294)
at com.android.server.SystemServer.run(SystemServer.java:973)
at com.android.server.SystemServer.main(SystemServer.java:667)
at java.lang.reflect.Method.invoke(Native method)
at com.android.internal.os.RuntimeInit$MethodAndArgsCaller.run(RuntimeInit.java:552)
at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:949)
这里我们可以看到发生的问题在于dead lock,且lock被同一进程内的其他线程(Thread 14)持有,我们看看此时Thread 14此时的状态:
"android.ui" prio=5 tid=14 Native
| group="main" sCount=1 ucsCount=0 flags=1 obj=0x14c01c28 self=0xb400f2704330c5e0
| sysTid=729 nice=-10 cgrp=top-app sched=0/0 handle=0xf26ddf04bcb0
| state=S schedstat=( 754225290 619216305 2951 ) utm=55 stm=20 core=0 HZ=100
| stack=0xf26ddef48000-0xf26ddef4a000 stackSize=1039KB
| held mutexes=
native: #00 pc 000aa698 /apex/com.android.runtime/lib64/bionic/libc.so (__ioctl+8) (BuildId: c74277f481a383c87215b672f6465e24)
native: #01 pc 00063668 /apex/com.android.runtime/lib64/bionic/libc.so (ioctl+156) (BuildId: c74277f481a383c87215b672f6465e24)
native: #02 pc 0005e294 /system/lib64/libbinder.so (android::IPCThreadState::talkWithDriver+280) (BuildId: 1570a59fb47f18bdbd3b7aa21d8ceedc)
native: #03 pc 0005f5f0 /system/lib64/libbinder.so (android::IPCThreadState::waitForResponse+60) (BuildId: 1570a59fb47f18bdbd3b7aa21d8ceedc)
native: #04 pc 0005f330 /system/lib64/libbinder.so (android::IPCThreadState::transact+216) (BuildId: 1570a59fb47f18bdbd3b7aa21d8ceedc)
native: #05 pc 00076640 /system/lib64/libbinder.so (android::BpBinder::transact+188) (BuildId: 1570a59fb47f18bdbd3b7aa21d8ceedc)
native: #06 pc 00114d1c /system/lib64/libgui.so (android::gui::BpSurfaceComposer::captureLayers+436) (BuildId: d6c3eaedda5518d14dd35993bc419095)
native: #07 pc 000ff578 /system/lib64/libgui.so (android::ScreenshotClient::captureLayers+88) (BuildId: d6c3eaedda5518d14dd35993bc419095)
native: #08 pc 00205fd8 /system/lib64/libandroid_runtime.so (android::nativeCaptureLayers+340) (BuildId: e18d4023419fc50e2dfbce708dce5ae6)
at android.window.ScreenCapture.;(Native method)
at android.window.ScreenCapture.captureLayers(ScreenCapture.java:173)
at android.window.ScreenCapture.captureLayers(ScreenCapture.java:136)
at android.window.ScreenCapture.captureLayersExcluding(ScreenCapture.java:163)
这里我们可以看到,在名为android.ui的线程内,通过jni调用进入到了libgui所提供的captureLayers方法,其作为bp端,对应的实现在SurfaceFlinger进程。此时我们应当去观察SurfaceFlinger及其相关组件(Gralloc Mapper,Grolloc Allocator,HWC等),从而进行下一步追查。
以上就是一个ANR日志分析的大致步骤。
大多数ANR发生都是由于应用编码不规范导致的,少部分ANR是由于Native FWK的阻塞引起(binder或者jni调用),也有部分ANR是由于系统资源紧张而导致的。作为应用开发者,我们应当遵循一定的准则从而尽可能规避ANR的产生:
1.避免在主线程上执行涉及 复杂I/O 的操作。
2.避免在主线程上进行长时间的CPU计算。
3.避免在主线程在对另一个进程进行同步 binder 调用,而后者需要很长时间才能返回。
4.避免主线程因长时间等待其他线程状态同步而处于阻塞状态。
5.避免主线程在进程中或通过 binder 调用与另一个线程之间发生死锁。
The post Android系统ANR浅析 first appeared on FranzKafka Blog.