IT博客汇
  • 首页
  • 精华
  • 技术
  • 设计
  • 资讯
  • 扯淡
  • 权利声明
  • 登录 注册

    Android系统ANR浅析

    FranzKafka95发表于 2024-04-08 06:33:00
    love 0
    Read Time:7 Minute, 7 Second

    ANR全称为Application Not Response,是由于App主线程长时间卡滞导致的。ANR并不代表我们的应用代码一定存在编码问题,通常不会导致FATAL EXECPTION。直观上来讲,ANR发生时应用会无法及时响应用户操作,这会极大影响用户体验。

    Android APP开发中的四大基础组件中Service、BroadcastReceiver与ContentProvider以及Input都会去检测ANR超时,这里以Service为例,简述ANR产生的原因:

    当我们的应用组件通过startService启动一个Service时,会通过binder通知AMS启动Service,AMS在经过一系列逻辑处理后判断可以启动,就会通过binder通知应用所在进程创建一个Service并执行Service的onCreate生命周期,与此同时AMS会通过handler发送一个延时任务,此延时任务就是ANR任务。在应用进程Service生命周期执行完后就会通过binder告知AMS解除ANR任务;如果应用进程Service生命周期超时,AMS内的ANR任务就会执行,从而记录ANR报错并dump当前系统状态。

    机制解析

    这里的机制解析,我们分别从应用APP以及AMS的角度来进行,并且以Android14的代码进行分析。首先是应用App调用startService启动服务:

    Intent serviceIntent = new Intent(context, MyService.class);
    context.startService(serviceIntent);

    在通过startService启动服务后,程序调用会进入到ContextImpl.java内:

    //frameworks/base/core/java/android/app/ContextImpl.java
    @Override
    public ComponentName startService(Intent service) {
            warnIfCallingFromSystemProcess();
            return startServiceCommon(service, false, mUser);
    }
    
    private ComponentName startServiceCommon(Intent service, boolean requireForeground,
             UserHandle user) {
    	   //startService核心
    	    validateServiceIntent(service);
                service.prepareToLeaveProcess(this);
                ComponentName cn = ActivityManager.getService().startService(
                   mMainThread.getApplicationThread(), service,
                   service.resolveTypeIfNeeded(getContentResolver()),requireForeground,
                   getOpPackageName(), getAttributionTag(), user.getIdentifier());
    }

    接下来我们进入ActivityManagerService,也就是AMS内部:

    //frameworks/base/services/core/java/com/android/server/am/ActivityManagerService.java
    public ComponentName startService(IApplicationThread caller, Intent service,
                String resolvedType, boolean requireForeground, String callingPackage,
                String callingFeatureId, int userId)
                throws TransactionTooLargeException {
                return startService(caller, service, resolvedType, 
                requireForeground,callingPackage,callingFeatureId, userId, false /* 
                isSdkSandboxService */, INVALID_UID, null, null);
    }
    
    private ComponentName startService(IApplicationThread caller, Intent service,
        String resolvedType, boolean requireForeground, String callingPackage,
        String callingFeatureId, int userId, boolean isSdkSandboxService,
        int sdkSandboxClientAppUid, String sdkSandboxClientAppPackage, String instanceName)
    {
    	//mServices属于ActiveServices对象
            res = mServices.startServiceLocked(caller, service,
                     resolvedType, callingPid, callingUid,
                     requireForeground, callingPackage, callingFeatureId, userId,
                     isSdkSandboxService,
                     sdkSandboxClientAppUid,sdkSandboxClientAppPackage,
                     instanceName);
             
    }

    紧接着我们进入ActiveServices类:

    //frameworks/base/services/core/java/com/android/server/am/ActiveServices.java
    ComponentName startServiceLocked(IApplicationThread caller, Intent service, String resolvedType,int callingPid, int callingUid, boolean fgRequired,String callingPackage, @Nullable String callingFeatureId, final int userId,BackgroundStartPrivileges backgroundStartPrivileges, boolean isSdkSandboxService,
    int sdkSandboxClientAppUid, String sdkSandboxClientAppPackage, String instanceName)
    {
    
    final ComponentName realResult =
                    startServiceInnerLocked(r, service, callingUid, callingPid,
                    callingProcessName, callingProcessState,
                    fgRequired, callerFg,
                    backgroundStartPrivileges, callingPackage);
    	……
    }
    
    //frameworks/base/services/core/java/com/android/server/am/ActiveServices.java
    ComponentName startServiceInnerLocked(ServiceMap smap, Intent service, ServiceRecord r,boolean callerFg, boolean addToStarting, int callingUid, String callingProcessName,int callingProcessState, boolean wasStartRequested, String callingPackage)
    {
         final int uid = r.appInfo.uid;
         final String packageName = r.name.getPackageName();
         final String serviceName = r.name.getClassName();
         FrameworkStatsLog.write(FrameworkStatsLog.SERVICE_STATE_CHANGED, uid, 
         packageName,serviceName,FrameworkStatsLog.SERVICE_STATE_CHANGED__STATE__START);
         mAm.mBatteryStatsService.noteServiceStartRunning(uid, packageName, serviceName);
         //启动Service的核心
         String error = bringUpServiceLocked(r, service.getFlags(), callerFg,
                    false /* whileRestarting */,
                    false /* permissionsReviewRequired */,
                    false /* packageFrozen */,
                    true /* enqueueOomAdj */);
    }
    

    在startServiceInnerLocked中会进入到bringUpServiceInnerLocked,这里面会分两种情况进行下一步处理。

    一种情况是如果Service所在的App进程已经启动,则会调用realStartServiceLocked方法:

    //frameworks/base/services/core/java/com/android/server/am/ActiveServices.java
    if (app != null) {
        app.addPackage(r.appInfo.packageName, r.appInfo.longVersionCode,	mAm.mProcessStats);
        //当APP已经启动时,启动service
        realStartServiceLocked(r, app, thread, pid, uidRecord, execInFg,enqueueOomAdj);
    }
    
    //进入realStartServiceLocked内通过scheduleCreateService创建service
    private void realStartServiceLocked(ServiceRecord r, ProcessRecord app,
          IApplicationThread thread, int pid, UidRecord uidRecord, boolean execInFg,
          boolean enqueueOomAdj) throws RemoteException {
           ……
    	boolean created = false;
            try {
    	//这里的thread对象代表应用进程的ApplicationThread
            thread.scheduleCreateService(r, r.serviceInfo,
            null /* compatInfo (unused but need to keep method signature) */,
            app.mState.getReportedProcState());
            r.postNotification(false);
            created = true;
         }
    }

    另一种情况是如果App进程未启动,又可细分为两类:

    //frameworks/base/services/core/java/com/android/server/am/ActiveServices.java
    if (app == null && !permissionsReviewRequired && !packageFrozen) {	  
         if (r.isSdkSandbox) {
                    final int uid = Process.toSdkSandboxUid(r.sdkSandboxClientAppUid);
                    app = mAm.startSdkSandboxProcessLocked(procName, r.appInfo, true, 
                    intentFlags,hostingRecord, ZYGOTE_POLICY_FLAG_EMPTY, uid, 
                    r.sdkSandboxClientAppPackage);
                    r.isolationHostProc = app;
                } else {
                    app = mAm.startProcessLocked(procName, r.appInfo, true, intentFlags,
                            hostingRecord, ZYGOTE_POLICY_FLAG_EMPTY, false, isolated);
                }
    
    }

    这里我们以App已启动为例,通过realStartServiceLocked启动Service,最终会进入到scheduleServiceTimeoutLocked方法:

    //frameworks/base/services/core/java/com/android/server/am/ActiveServices.java
    void scheduleServiceTimeoutLocked(ProcessRecord proc) {
       if (proc.mServices.numberOfExecutingServices() == 0 || proc.getThread() == null) {
                return;
         }
        Message msg = mAm.mHandler.obtainMessage(
    ActivityManagerService.SERVICE_TIMEOUT_MSG);
            msg.obj = proc;
            mAm.mHandler.sendMessageDelayed(msg, proc.mServices.shouldExecServicesFg()
                    ? mAm.mConstants.SERVICE_TIMEOUT : mAm.mConstants.SERVICE_BACKGROUND_TIMEOUT);
    }

    这里我们可以看到,在scheduleServiceTimeoutLocked中通过handler为Service启动设定了超时任务,这里的超时时间有两个值:针对前台服务是20s,针对后台服务是200s.

    DEFAULT_SERVICE_TIMEOUT = 20 * 1000 * Build.HW_TIMEOUT_MULTIPLIER;
    DEFAULT_SERVICE_BACKGROUND_TIMEOUT = DEFAULT_SERVICE_TIMEOUT * 10;

    至此在AMS中的任务已经告一段落,接着我们回到应用App自身,前面讲到在realStartServiceLocked方法中通过App的ApplicationThread的scheduleCreateService创建service,这里最终进入的是ActivityThread.java中scheduleCreateService:

    public final void scheduleCreateService(IBinder token,
          ServiceInfo info, CompatibilityInfo compatInfo, int processState) {
          updateProcessState(processState, false);
          CreateServiceData s = new CreateServiceData();
           s.token = token;
           s.info = info;
    
            sendMessage(H.CREATE_SERVICE, s);
     }

    沿着调用链继续追踪,最终回到了ActiveServices.java内的serviceDoneExecutingLocked方法:

    //frameworks/base/services/core/java/com/android/server/am/ActiveServices.java
    private void serviceDoneExecutingLocked(ServiceRecord r, boolean inDestroying,
         boolean finishing, boolean enqueueOomAdj, @OomAdjReason int oomAdjReason) {
    final ProcessServiceRecord psr = r.app.mServices;
              if (DEBUG_SERVICE) Slog.v(TAG_SERVICE,"Nesting at 0 of " + r.shortInstanceName);
                    psr.setExecServicesFg(false);
                    psr.stopExecutingService(r);
                    if (psr.numberOfExecutingServices() == 0) {
                        if (DEBUG_SERVICE || DEBUG_SERVICE_EXECUTING) Slog.v(TAG_SERVICE_EXECUTING, "No more executingServices of " + r.shortInstanceName);
    mAm.mHandler.removeMessages(ActivityManagerService.SERVICE_TIMEOUT_MSG, r.app);
    } else if (r.executeFg) {
        ……
      }
    }

    这里我们可以看到,在Service创建完成之后,回通过handler将之前的ANR任务解除。

    日志解析

    Android原生记录的ANR日志都位于/data/anr文件夹内,当应用发生ANR时,系统可能会存在问题从而导致众多应用都会发生ANR,针对ANR日志我们重点是需要学会如何分析日志,找到ANR的root cause。

    当我们通过adb打开/data/anr文件夹时,我们能够看到如下所示内容:

    anr_2024-03-31-02-16-39-479
    anr_2024-03-31-02-17-21-731
    anr_2024-03-31-02-17-58-405
    …

    所有的anr日志都以anr_xxxx开头,日志文件会携带时间信息,用于表明该anr捕获的时间;

    我们打开anr日志文件,有一些关键字段的信息是极其重要的。

    subject字段开头的日志表明anr发生的问题原因,如下所示:

    Subject: Input dispatching timed out (141c2c9 com.blacksesame.video/com.example.video.ui.activity.VideoPlayActivity (server) is not responding. Waited 5000ms for MotionEvent(deviceId=4, eventTime=225178371585000, source=TOUCHSCREEN, displayId=2, action=DOWN, actionButton=0x00000000, flags=0x00000000, metaState=0x00000000, buttonState=0x00000000, classification=NONE, edgeFlags=0x00000000, xPrecision=8.5, yPrecision=13.3, xCursorPosition=nan, yCursorPosition=nan, pointers=[0: (756.5, 645.2)]), policyFlags=0x62000001).

    上述错误表明在com.example.video应用中的VideoPlayActivity发生了anr,anr的原因是由于Input dispatch超时,表明Input分发存在卡滞导致没有响应用户输入。

    events字段用于记录该ANR发生之前的ANR记录,如下所示:

    events <
      timestamp_ms: 1711851399478
      anr <
        pid: 2564
        process: "com.example.video"
        process_class: 1
        subject: "Input dispatching timed out (141c2c9 com.blacksesame.video/com.example.video.ui.activity.VideoPlayActivity (server) is not responding. Waited 5000ms for MotionEvent(deviceId=4, eventTime=225178371[...]"
        uid: 10114
      >
    >
    events <
      timestamp_ms: 1711851441487
      half_watchdog <
        subject: "Blocked in monitor com.android.server.am.ActivityManagerService on monitor thread (watchdog.monitor) for 30s, Blocked in handler on  on foreground thread (android.fg) for 30s, Blocked in handler on  o[...]"
      >
    >
    

    这里的events记录了两次ANR,这些ANR都发生在本次ANR之前,可以作为ANR问题发生时的辅助信息。

    上述内容是一些信息总览,用于我们快速获取信息,具体的报错日志需要通过dumping pid字段进行分析,如下所示:

    ----- dumping pid: 710 at 225225702
    
    ----- pid 710 at 2024-03-31 02:17:21.735883895+0000 -----
    Cmd line: system_server
    Build fingerprint: 'c1200_aaos/c1200:14/UQ1A.231205.015.A1/V1.1.3.0:userdebug/test-keys'
    ABI: 'arm64'
    Build type: optimized
    suspend all histogram:	Sum: 11.916ms 99% C.I. 1us-1476.159us Avg: 85.726us Max: 2919us
    DALVIK THREADS (181):
    //“main”与tid=1表明属于主线程,Blocked表明主线程属于Block状态
    "main" prio=5 tid=1 Blocked
      | group="main" sCount=1 ucsCount=0 flags=1 obj=0x71e344c8 self=0xb400f270432d4be0
      | sysTid=710 nice=-2 cgrp=foreground sched=0/0 handle=0xf271325994f8
      | state=S schedstat=( 13723566710 13467097570 113753 ) utm=1082 stm=289 core=3 HZ=100
      | stack=0xffffc7177000-0xffffc7179000 stackSize=8188KB
    //表明发生死锁
      | held mutexes=
      at com.android.server.am.ActivityManagerService.broadcastIntentWithFeature(ActivityManagerService.java:15389)
    //关键信息,对应lock被thread 14占有
      - waiting to lock <0x0d8bc079> (a com.android.server.am.ActivityManagerService) held by thread 14
      at android.app.ContextImpl.sendBroadcastAsUser(ContextImpl.java:1478)
      at com.android.server.alarm.AlarmManagerService$3.lambda$doAlarm$0(AlarmManagerService.java:1995)
      at com.android.server.alarm.AlarmManagerService$3.$r8$lambda$o_OrcmiwSegk0Cl6DW0Phad6Mm8(AlarmManagerService.java:0)
      at com.android.server.alarm.AlarmManagerService$3$$ExternalSyntheticLambda0.run(R8$$SyntheticClass:0)
      at android.os.Handler.handleCallback(Handler.java:958)
      at android.os.Handler.dispatchMessage(Handler.java:99)
      at android.os.Looper.loopOnce(Looper.java:205)
      at android.os.Looper.loop(Looper.java:294)
      at com.android.server.SystemServer.run(SystemServer.java:973)
      at com.android.server.SystemServer.main(SystemServer.java:667)
      at java.lang.reflect.Method.invoke(Native method)
      at com.android.internal.os.RuntimeInit$MethodAndArgsCaller.run(RuntimeInit.java:552)
      at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:949)
    

    这里我们可以看到发生的问题在于dead lock,且lock被同一进程内的其他线程(Thread 14)持有,我们看看此时Thread 14此时的状态:

    "android.ui" prio=5 tid=14 Native
      | group="main" sCount=1 ucsCount=0 flags=1 obj=0x14c01c28 self=0xb400f2704330c5e0
      | sysTid=729 nice=-10 cgrp=top-app sched=0/0 handle=0xf26ddf04bcb0
      | state=S schedstat=( 754225290 619216305 2951 ) utm=55 stm=20 core=0 HZ=100
      | stack=0xf26ddef48000-0xf26ddef4a000 stackSize=1039KB
      | held mutexes=
      native: #00 pc 000aa698  /apex/com.android.runtime/lib64/bionic/libc.so (__ioctl+8) (BuildId: c74277f481a383c87215b672f6465e24)
      native: #01 pc 00063668  /apex/com.android.runtime/lib64/bionic/libc.so (ioctl+156) (BuildId: c74277f481a383c87215b672f6465e24)
      native: #02 pc 0005e294  /system/lib64/libbinder.so (android::IPCThreadState::talkWithDriver+280) (BuildId: 1570a59fb47f18bdbd3b7aa21d8ceedc)
      native: #03 pc 0005f5f0  /system/lib64/libbinder.so (android::IPCThreadState::waitForResponse+60) (BuildId: 1570a59fb47f18bdbd3b7aa21d8ceedc)
      native: #04 pc 0005f330  /system/lib64/libbinder.so (android::IPCThreadState::transact+216) (BuildId: 1570a59fb47f18bdbd3b7aa21d8ceedc)
      native: #05 pc 00076640  /system/lib64/libbinder.so (android::BpBinder::transact+188) (BuildId: 1570a59fb47f18bdbd3b7aa21d8ceedc)
      native: #06 pc 00114d1c  /system/lib64/libgui.so (android::gui::BpSurfaceComposer::captureLayers+436) (BuildId: d6c3eaedda5518d14dd35993bc419095)
      native: #07 pc 000ff578  /system/lib64/libgui.so (android::ScreenshotClient::captureLayers+88) (BuildId: d6c3eaedda5518d14dd35993bc419095)
      native: #08 pc 00205fd8  /system/lib64/libandroid_runtime.so (android::nativeCaptureLayers+340) (BuildId: e18d4023419fc50e2dfbce708dce5ae6)
      at android.window.ScreenCapture.;(Native method)
      at android.window.ScreenCapture.captureLayers(ScreenCapture.java:173)
      at android.window.ScreenCapture.captureLayers(ScreenCapture.java:136)
      at android.window.ScreenCapture.captureLayersExcluding(ScreenCapture.java:163)

    这里我们可以看到,在名为android.ui的线程内,通过jni调用进入到了libgui所提供的captureLayers方法,其作为bp端,对应的实现在SurfaceFlinger进程。此时我们应当去观察SurfaceFlinger及其相关组件(Gralloc Mapper,Grolloc Allocator,HWC等),从而进行下一步追查。

    以上就是一个ANR日志分析的大致步骤。

    规避措施

    大多数ANR发生都是由于应用编码不规范导致的,少部分ANR是由于Native FWK的阻塞引起(binder或者jni调用),也有部分ANR是由于系统资源紧张而导致的。作为应用开发者,我们应当遵循一定的准则从而尽可能规避ANR的产生:

    1.避免在主线程上执行涉及 复杂I/O 的操作。

    2.避免在主线程上进行长时间的CPU计算。

    3.避免在主线程在对另一个进程进行同步 binder 调用,而后者需要很长时间才能返回。

    4.避免主线程因长时间等待其他线程状态同步而处于阻塞状态。

    5.避免主线程在进程中或通过 binder 调用与另一个线程之间发生死锁。

    Happy
    Happy
    0 0 %
    Sad
    Sad
    0 0 %
    Excited
    Excited
    0 0 %
    Sleepy
    Sleepy
    0 0 %
    Angry
    Angry
    0 0 %
    Surprise
    Surprise
    0 0 %

    The post Android系统ANR浅析 first appeared on FranzKafka Blog.



沪ICP备19023445号-2号
友情链接