Analysis of a SurfaceView Leak in Android N

Source: Internet  Editor: 程序博客网  Date: 2024/06/05 05:20

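I recently ran into a bug in which a SurfaceView's Layer was never destroyed, so the Layer stayed on screen permanently. The case is an interesting one, so I am recording the analysis and the fix here for ROM developers who have some framework background.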


Symptom

The UI of the game 开心消消乐 (package com.happyelements.AndroidAnimal) stays on screen and cannot be dismissed by any means.


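Analysis

The most directly related module is SurfaceFlinger. Since the image is visible, the Layer should exist and should be taking part in composition; if it did not, the problem would lie right there. Dump the state with the following command: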

adb shell dumpsys SurfaceFlinger

Only the part relevant to this Layer is shown here:

+ Layer 0x71b57b0400 (SurfaceView - com.happyelements.AndroidAnimal/com.happyelements.hellolua.MainActivity)
  Region transparentRegion (this=0x71b57b0708, count=1)
    [  0,   0,   0,   0]
  Region visibleRegion (this=0x71b57b0410, count=1)
    [  0,   0, 1080, 1920]
  Region surfaceDamageRegion (this=0x71b57b0488, count=1)
    [  0,   0,   0,   0]
      layerStack=   0, z=    21015, pos=(0,0), size=(1080,1920), crop=(   0,   0,1080,1920), finalCrop=(   0,   0,  -1,  -1), isOpaque=1, invalidate=0, alpha=0xff, flags=0x00000002, tr=[1.00, 0.00][0.00, 1.00]
      FilterRender Layer= 0, FilterMode= 0 availableRect =(   0,   0,   0,   0)
      client=0x71b86f0f40
      format= 4, activeBuffer=[1080x1920:1088,  1], queued-frames=0, mRefreshPending=0
      mSecure=0, mProtectedByApp=0, mFiltering=0, mNeedsFiltering=0
            mTexName=54 mCurrentTexture=-1
            mCurrentCrop=[0,0,0,0] mCurrentTransform=0
            mAbandoned=0
            -BufferQueue mMaxAcquiredBufferCount=1, mMaxDequeuedBufferCount=3, mDequeueBufferCannotBlock=0 mAsyncMode=0, default-size=[1080x1920], default-format=4, transform-hint=00, FIFO(0)={}
             this=0x71b55e3000 (mConsumerName=SurfaceView - com.happyelements.AndroidAnimal/com.happyelements.hellolua.MainActivity, mConnectedApi=0, mConsumerUsageBits=0x900, mId=39, mPid=15358, producer=[-1:com.happyelements.AndroidAnimal], consumer=[15358:/system/bin/surfaceflinger])
             [00:0x0] state=FREE
             [01:0x0] state=FREE
             [02:0x0] state=FREE
             [03:0x0] state=FREE
            *BufferQueueDump mIsBackupBufInited=0, mAcquiredBufs(size=0), mMode=TRACK_CONSUMER
                 [-1] mLastAcquiredBuf->mGraphicBuffer->handle=0x71b7636900

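This tells us the following:

  1. flags=0x00000002: the Layer is shown (the hidden bit is clear) and opaque
  2. alpha=0xff: the alpha value is fully opaque
  3. visibleRegion is [ 0, 0, 1080, 1920]: there is a visible region, and it covers the whole screen

Together with the composition information in the dump, this shows that the state on the SurfaceFlinger side is fine and matches what we see on screen.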
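The flags reading above can be checked with a few lines of Java. This is a minimal sketch, not framework code: the class name LayerFlags is made up for illustration, and the two bit values are assumed to match SurfaceFlinger's layer state bits (hidden = 0x01, opaque = 0x02).

```java
// Minimal sketch: decode the `flags` value printed by `dumpsys SurfaceFlinger`.
// Assumption: bit 0x01 means "hidden" and bit 0x02 means "opaque".
final class LayerFlags {
    static final int LAYER_HIDDEN = 0x01; // set when the layer is hidden
    static final int LAYER_OPAQUE = 0x02; // set when the layer is opaque

    // A layer is shown when the hidden bit is clear.
    static boolean isShown(int flags) {
        return (flags & LAYER_HIDDEN) == 0;
    }

    static boolean isOpaque(int flags) {
        return (flags & LAYER_OPAQUE) != 0;
    }

    static String describe(int flags) {
        return (isShown(flags) ? "shown" : "hidden") + ", "
                + (isOpaque(flags) ? "opaque" : "translucent");
    }

    public static void main(String[] args) {
        // Value taken from the dump above.
        System.out.println(describe(0x00000002));
    }
}
```

For the value in the dump, describe(0x00000002) yields "shown, opaque", which agrees with item 1 in the list.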

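Some of the information also looks odd, in the sense that it differs from a Layer that is taking part in composition normally:

  1. All GraphicBuffer slots are in the FREE state; normally at least one would be ACQUIRED
  2. mCurrentTexture=-1; normally it is >= 0
  3. mConnectedApi=0; normally it is > 0

Of course, once the system is in a buggy state like this, normal expectations do not fully apply, so let's leave SurfaceFlinger here for now.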


Now turn to WMS (WindowManagerService) and dump its state with the following command:

adb shell dumpsys window

The only information related to this SurfaceView is the following:

WINDOW MANAGER SURFACES (dumpsys window surfaces)
  Surface #0: #75499c8 SurfaceView - com.happyelements.AndroidAnimal/com.happyelements.hellolua.MainActivity
    mLayerStack=0 mLayer=21015
    mShown=true mAlpha=1.0 mIsOpaque=false
    mPosition=0.0,0.0 mSize=1080x1920
    mCrop=[0,0][1080,1920]
    mFinalCrop=[0,0][0,0]
    Transform: (1.0, 0.0, 0.0, 1.0)

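This is not printed from the window stack. To keep this article from getting too long, here are the conclusions directly:

  1. The output comes from a static collection of SurfaceTrace objects
  2. SurfaceTrace is a subclass of SurfaceControl, and each SurfaceControl corresponds to one Layer on the SF side
  3. Constructing a new SurfaceTrace instance adds an element to that static collection, and destroying the instance removes the element

A SurfaceTrace is still present in this static collection, which means it was created and never destroyed. This is the most direct cause of the bug and our starting point. This single entry is all WMS has: there is no corresponding window stack or token state, which is rather frustrating, because that state might have offered some clues. Digging through the code for the cause without any lead would be like looking for a needle in a haystack.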
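The register-on-create, unregister-on-destroy pattern described in the three points above can be modelled in plain Java. This is only a sketch under assumptions: TracedSurface and liveNames() are hypothetical names, not the framework's SurfaceTrace API, and the real class does much more than track names.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a SurfaceControl subclass that tracks live instances.
final class TracedSurface {
    // Static collection: every entry is reachable from a GC root until destroy() runs.
    private static final List<TracedSurface> sSurfaces = new ArrayList<>();

    private final String name;

    TracedSurface(String name) {
        this.name = name;
        synchronized (sSurfaces) {
            sSurfaces.add(this); // registered on construction
        }
    }

    void destroy() {
        synchronized (sSurfaces) {
            sSurfaces.remove(this); // unregistered on destruction
        }
    }

    // Roughly what a "surfaces" dump would walk: the names of all live instances.
    static List<String> liveNames() {
        List<String> names = new ArrayList<>();
        synchronized (sSurfaces) {
            for (TracedSurface s : sSurfaces) {
                names.add(s.name);
            }
        }
        return names;
    }

    public static void main(String[] args) {
        TracedSurface parent = new TracedSurface("parent");
        TracedSurface child = new TracedSurface("SurfaceView");
        parent.destroy();
        // The child was never destroyed, so it is still listed.
        System.out.println(liveNames());
    }
}
```

Any instance whose destroy() is skipped stays in the static list forever, which is the situation the WMS surfaces dump reveals.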

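There are no logs, only the live device, but a few more things can be learned:

  • The ps command shows that the target process is dead (strange: the process is dead, so why is the Layer still there?)
  • Recall the odd values in the Layer dump above. After reading the code, I found that they result from a call to SurfaceControl.disconnect(). This API is new in Android N and is called only from the saved-Surface logic. A saved Surface is an optimization added in Android N to make the UI respond faster. So the code must have passed through that path at some point, which helps the analysis a little.

Without any further leads, the analysis would end here, and all that would remain is to "happily" dive into the sea of code to hunt for the bug and pray. Fortunately, I was able to capture an hprof of system_server, and suddenly life was full of hope again.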


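Next, the hprof file. To keep the analysis short, I will not paste large amounts of data.

Start with the SurfaceControl that showed up in the WMS dump. According to the code, this instance can only be a SurfaceTrace or its subclass SurfaceControlWithBackground, and it turns out to be a SurfaceControlWithBackground. In N, a child window whose mSubLayer is less than 0 (that is, one positioned below its parent window) gets a SurfaceControlWithBackground by default when its SurfaceControl is created, and a SurfaceView is exactly such a window. Its GC root confirms that it is held in a static collection.

Following the trail leads to the corresponding WindowState, whose GC root is WMS.mWindowMap; its parent WindowState is still there as well. At this point we can draw an important conclusion:
It is not only the SurfaceView window that leaked, but also its parent window.
And there is a question that we will come back to later:
Why is the parent window's Layer not visible on the SurfaceFlinger side?

Before that, one question needs an immediate answer: didn't we say above that WMS could no longer dump these windows?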

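To answer it, we first need to look at how WindowState objects are organized. They are stored in several places in the system, including:

  • WMS.mWindowMap: looks up a WindowState with an IBinder as the key
  • DisplayContent.mWindows: a list of the WindowStates on a single display
  • WindowToken.windows or AppWindowToken.allAppWindows: a list of the WindowStates that belong to the token
  • WindowState.mChildWindows: a list of child windows

Note: all of the lists above are ArrayLists; the larger the index, the higher the window's z-order.

The answer is that the dump is produced from DisplayContent.mWindows. Since the windows do not appear in the dump, the leaked WindowState has already been removed from that list. Checking whether it is still present in the other places gives:

  • WMS.mWindowMap: present
  • DisplayContent.mWindows: absent
  • AppWindowToken.allAppWindows: absent
  • WindowState.mChildWindows: present

Normally, when a WindowState is removed, every container that holds it should drop its reference. Given the current inconsistency, we need to pick the container that is easiest to investigate. Judging from the code, that is WMS.mWindowMap, because only one place removes a WindowState from it, namely WMS.removeWindowInnerLocked():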

void removeWindowInnerLocked(WindowState win) {
    if (win.mRemoved) {
        // Nothing to do.
        if (DEBUG_ADD_REMOVE) Slog.v(TAG_WM,
                "removeWindowInnerLocked: " + win + " Already removed...");
        return;
    }

    for (int i = win.mChildWindows.size() - 1; i >= 0; i--) {
        WindowState cwin = win.mChildWindows.get(i);
        Slog.w(TAG_WM, "Force-removing child win " + cwin + " from container " + win);
        removeWindowInnerLocked(cwin);
    }

    win.mRemoved = true;
    ...
    mPolicy.removeWindowLw(win);
    win.removeLocked(); // removed from WindowState.mChildWindows

    if (DEBUG_ADD_REMOVE) Slog.v(TAG_WM, "removeWindowInnerLocked: " + win);
    mWindowMap.remove(win.mClient.asBinder()); // removed from WMS.mWindowMap
    ...
    final WindowToken token = win.mToken;
    final AppWindowToken atoken = win.mAppToken;
    if (DEBUG_ADD_REMOVE) Slog.v(TAG_WM, "Removing " + win + " from " + token);
    token.windows.remove(win); // removed from WindowToken.windows
    if (atoken != null) {
        atoken.allAppWindows.remove(win); // removed from AppWindowToken.allAppWindows
    }
    ...
    final WindowList windows = win.getWindowList();
    if (windows != null) {
        windows.remove(win); // removed from DisplayContent.mWindows
    }
}

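In other words, the leaked WindowState definitely never reached this method, which is confirmed by WindowState.mRemoved being false. The only place that removes a window from WindowState.mChildWindows is WindowState.removeLocked():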

void removeLocked() {
    disposeInputChannel();

    if (isChildWindow()) {
        if (DEBUG_ADD_REMOVE) Slog.v(TAG, "Removing " + this + " from " + mAttachedWindow);
        mAttachedWindow.mChildWindows.remove(this); // removed from WindowState.mChildWindows
    }
    mWinAnimator.destroyDeferredSurfaceLocked();
    mWinAnimator.destroySurfaceLocked();
    mSession.windowRemovedLocked();
    try {
        mClient.asBinder().unlinkToDeath(mDeathRecipient, 0);
    } catch (RuntimeException e) {
        // Ignore if it has already been removed (usually because
        // we are doing this as part of processing a death note.)
    }
}

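So WMS.removeWindowInnerLocked() appears to be where the final cleanup happens, since every container that holds a WindowState is cleaned up there. The inconsistency we see means that some other code removes certain references, and the suspects are DisplayContent.mWindows and AppWindowToken.allAppWindows.

Start with AppWindowToken.allAppWindows. A search through the code turns up AppWindowToken.removeAllWindows():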

void removeAllWindows() {
    ...
    allAppWindows.clear(); // clears AppWindowToken.allAppWindows
    windows.clear(); // clears WindowToken.windows
}

One of the call paths is:

WMS.removeAppToken()->AppWindowToken.removeAppFromTaskLocked()->AppWindowToken.removeAllWindows()

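Put simply, we know the target process has died, and WMS.removeAppToken is called at least as part of handling the death notification. Reasoning backward from the result, this part is explained.

How do we explain DisplayContent.mWindows, then? The problem lies in WMS.rebuildAppWindowListLocked():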

private void rebuildAppWindowListLocked(final DisplayContent displayContent) {
    final WindowList windows = displayContent.getWindowList();
    int NW = windows.size();
    int i;
    int lastBelow = -1;
    int numRemoved = 0;

    if (mRebuildTmp.length < NW) {
        mRebuildTmp = new WindowState[NW+10];
    }

    // First remove all existing app windows.
    i=0;
    while (i < NW) {
        WindowState w = windows.get(i);
        if (w.mAppToken != null) {
            // Removed from DisplayContent.mWindows first; may be re-added below
            WindowState win = windows.remove(i);
            win.mRebuilding = true;
            mRebuildTmp[numRemoved] = win;
            mWindowsChanged = true;
            if (DEBUG_WINDOW_MOVEMENT) Slog.v(TAG_WM, "Rebuild removing window: " + win);
            NW--;
            numRemoved++;
            continue;
        } else if (lastBelow == i-1) {
            if (w.mAttrs.type == TYPE_WALLPAPER) {
                lastBelow = i;
            }
        }
        i++;
    }

    // Keep whatever windows were below the app windows still below,
    // by skipping them.
    lastBelow++;
    i = lastBelow;

    // First add all of the exiting app tokens...  these are no longer
    // in the main app list, but still have windows shown.  We put them
    // in the back because now that the animation is over we no longer
    // will care about them.
    final ArrayList<TaskStack> stacks = displayContent.getStacks();
    final int numStacks = stacks.size();
    for (int stackNdx = 0; stackNdx < numStacks; ++stackNdx) {
        AppTokenList exitingAppTokens = stacks.get(stackNdx).mExitingAppTokens;
        int NT = exitingAppTokens.size();
        for (int j = 0; j < NT; j++) {
            i = reAddAppWindowsLocked(displayContent, i, exitingAppTokens.get(j));
        }
    }

    // And add in the still active app tokens in Z order.
    for (int stackNdx = 0; stackNdx < numStacks; ++stackNdx) {
        final ArrayList<Task> tasks = stacks.get(stackNdx).getTasks();
        final int numTasks = tasks.size();
        for (int taskNdx = 0; taskNdx < numTasks; ++taskNdx) {
            final AppTokenList tokens = tasks.get(taskNdx).mAppTokens;
            final int numTokens = tokens.size();
            for (int tokenNdx = 0; tokenNdx < numTokens; ++tokenNdx) {
                final AppWindowToken wtoken = tokens.get(tokenNdx);
                if (wtoken.mIsExiting && !wtoken.waitingForReplacement()) {
                    continue;
                }
                i = reAddAppWindowsLocked(displayContent, i, wtoken);
            }
        }
    }

    i -= lastBelow;
    if (i != numRemoved) {
        displayContent.layoutNeeded = true;
        Slog.w(TAG_WM, "On display=" + displayContent.getDisplayId() + " Rebuild removed "
                + numRemoved + " windows but added " + i + " rebuildAppWindowListLocked() "
                + " callers=" + Debug.getCallers(10));
        for (i = 0; i < numRemoved; i++) {
            WindowState ws = mRebuildTmp[i];
            if (ws.mRebuilding) {
                StringWriter sw = new StringWriter();
                PrintWriter pw = new FastPrintWriter(sw, false, 1024);
                ws.dump(pw, "", true);
                pw.flush();
                Slog.w(TAG_WM, "This window was lost: " + ws);
                Slog.w(TAG_WM, sw.toString());
                ws.mWinAnimator.destroySurfaceLocked();
            }
        }
        Slog.w(TAG_WM, "Current app token list:");
        dumpAppTokensLocked();
        Slog.w(TAG_WM, "Final window list:");
        dumpWindowsLocked();
    }
    Arrays.fill(mRebuildTmp, null);
}

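The logic, in short: remove all app windows first, then re-add them according to the latest AppWindowToken ordering. For a window to be re-added, WindowToken.windows must not be empty, but according to the analysis above it has already been cleared. So, unfortunately, once the window is removed it cannot be added back, which is confirmed by WindowState.mRebuilding being true. This part is now explained as well, and it is tied to WindowToken.windows having been cleared.

So why was the cleanup in WMS.removeWindowInnerLocked() never reached? Back to the death notification: every WindowState registers a death recipient, and when the window's process dies, WMS.removeWindowLocked() is called. There is no doubt about that, and it goes on to call WMS.removeWindowInnerLocked(), but it may return early before getting there. The method is long, so only the part that may return early is listed; the comments rule out the branches one by one: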
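The remove-then-re-add behaviour can be reproduced with a toy model. This is a simplified sketch, not the WMS implementation: RebuildWin, RebuildToken and RebuildSketch are hypothetical names, every window is treated as an app window, and stacks, tasks and z-ordering are left out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for WindowState.
final class RebuildWin {
    final String name;
    boolean rebuilding; // models WindowState.mRebuilding

    RebuildWin(String name) {
        this.name = name;
    }
}

// Hypothetical stand-in for WindowToken and its `windows` list.
final class RebuildToken {
    final List<RebuildWin> windows = new ArrayList<>();
}

final class RebuildSketch {
    // Removes every window from the display list, re-adds the ones still
    // referenced by a token, and returns the windows that could not be re-added.
    static List<RebuildWin> rebuild(List<RebuildWin> displayWindows, List<RebuildToken> tokens) {
        List<RebuildWin> removed = new ArrayList<>(displayWindows);
        for (RebuildWin w : removed) {
            w.rebuilding = true;
        }
        displayWindows.clear();

        for (RebuildToken token : tokens) {
            for (RebuildWin w : token.windows) {
                displayWindows.add(w);
                w.rebuilding = false; // re-added successfully
            }
        }

        List<RebuildWin> lost = new ArrayList<>();
        for (RebuildWin w : removed) {
            if (w.rebuilding) {
                lost.add(w); // still flagged: nothing re-added it
            }
        }
        return lost;
    }

    public static void main(String[] args) {
        RebuildWin alive = new RebuildWin("alive");
        RebuildWin leaked = new RebuildWin("leaked");
        RebuildToken t1 = new RebuildToken();
        RebuildToken t2 = new RebuildToken();
        t1.windows.add(alive);
        t2.windows.add(leaked);
        List<RebuildWin> display = new ArrayList<>(List.of(alive, leaked));

        t2.windows.clear(); // models the effect of removeAllWindows()
        List<RebuildWin> lost = rebuild(display, List.of(t1, t2));
        System.out.println(lost.get(0).name + " rebuilding=" + lost.get(0).rebuilding);
    }
}
```

In this model, a window whose token list was cleared beforehand drops out of the display list and keeps its rebuilding flag set, which is the same combination observed in the hprof.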

if (win.mHasSurface && okToDisplay()) {
    final AppWindowToken appToken = win.mAppToken;
    if (win.mWillReplaceWindow) { // mWillReplaceWindow is false
        // This window is going to be replaced. We need to keep it around until the new one
        // gets added, then we will get rid of this one.
        if (DEBUG_ADD_REMOVE) Slog.v(TAG_WM, "Preserving " + win + " until the new one is "
                + "added");
        // TODO: We are overloading mAnimatingExit flag to prevent the window state from
        // been removed. We probably need another flag to indicate that window removal
        // should be deffered vs. overloading the flag that says we are playing an exit
        // animation.
        win.mAnimatingExit = true;
        win.mReplacingRemoveRequested = true;
        Binder.restoreCallingIdentity(origId);
        return;
    }

    // The only possibility is that this branch was taken and returned
    if (win.isAnimatingWithSavedSurface() && !appToken.allDrawnExcludingSaved) {
        // We started enter animation early with a saved surface, now the app asks to remove
        // this window. If we remove it now and the app is not yet drawn, we'll show a
        // flicker. Delay the removal now until it's really drawn.
        if (DEBUG_ADD_REMOVE) {
            Slog.d(TAG_WM, "removeWindowLocked: delay removal of " + win
                    + " due to early animation");
        }
        // Do not set mAnimatingExit to true here, it will cause the surface to be hidden
        // immediately after the enter animation is done. If the app is not yet drawn then
        // it will show up as a flicker.
        setupWindowForRemoveOnExit(win);
        Binder.restoreCallingIdentity(origId);
        return;
    }

    // If we are not currently running the exit animation, we need to see about starting one
    wasVisible = win.isWinVisibleLw();

    if (keepVisibleDeadWindow) { // This branch cannot be taken
        if (DEBUG_ADD_REMOVE) Slog.v(TAG_WM,
                "Not removing " + win + " because app died while it's visible");

        win.mAppDied = true;
        win.setDisplayLayoutNeeded();
        mWindowPlacerLocked.performSurfacePlacement();

        // Set up a replacement input channel since the app is now dead.
        // We need to catch tapping on the dead window to restart the app.
        win.openInputChannel(null);
        mInputMonitor.updateInputWindowsLw(true /*force*/);

        Binder.restoreCallingIdentity(origId);
        return;
    }

    final WindowStateAnimator winAnimator = win.mWinAnimator;
    if (wasVisible) {
        final int transit = (!startingWindow) ? TRANSIT_EXIT : TRANSIT_PREVIEW_DONE;

        // Try starting an animation.
        if (winAnimator.applyAnimationLocked(transit, false)) {
            win.mAnimatingExit = true;
        }
        //TODO (multidisplay): Magnification is supported only for the default display.
        if (mAccessibilityController != null
                && win.getDisplayId() == Display.DEFAULT_DISPLAY) {
            mAccessibilityController.onWindowTransitionLocked(win, transit);
        }
    }
    final boolean isAnimating =
            winAnimator.isAnimationSet() && !winAnimator.isDummyAnimation();
    final boolean lastWindowIsStartingWindow = startingWindow && appToken != null
            && appToken.allAppWindows.size() == 1;
    // We delay the removal of a window if it has a showing surface that can be used to run
    // exit animation and it is marked as exiting.
    // Also, If isn't the an animating starting window that is the last window in the app.
    // We allow the removal of the non-animating starting window now as there is no
    // additional window or animation that will trigger its removal.
    if (winAnimator.getShown() && win.mAnimatingExit
            && (!lastWindowIsStartingWindow || isAnimating)) { // mAnimatingExit is false
        // The exit animation is running or should run... wait for it!
        if (DEBUG_ADD_REMOVE) Slog.v(TAG_WM,
                "Not removing " + win + " due to exit animation ");
        setupWindowForRemoveOnExit(win);
        if (appToken != null) {
            appToken.updateReportedVisibilityLocked();
        }
        Binder.restoreCallingIdentity(origId);
        return;
    }
}

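In the end only one branch is possible, and judging from the code it is related to saved Surfaces. Ring a bell? Right: as noted above, the leaked Layer went through that code. This branch sets WindowState.mRemoveOnExit to true. If WindowState.mAnimatingExit is also true, the final removal is performed in WindowStateAnimator.finishExit(). The observed values, however, are true for the former and false for the latter, so the window is never removed. In particular, because the target process is dead by then, no further calls arrive and the state stays stuck there. That is how the problem occurs. The conclusion:
The target app had been started and then moved to the background. While it was being started again, the target process died suddenly, before either the parent window or the child window had finished redrawing (that is, before WMS.finishDrawingWindow was called), and the problem occurred.

As you can imagine, this is very hard to hit in everyday use, so the probability of occurrence is extremely low. After adjusting the code so that the conditions in the conclusion are met, the bug reproduces every time!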
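The stuck state can be illustrated with a two-flag model. This is a deliberately reduced sketch based only on the behaviour described above: ExitStateSketch and its methods are hypothetical, and the real finishExit() does considerably more work.

```java
// Hypothetical two-flag model of the removal hand-off described in the text.
final class ExitStateSketch {
    boolean removeOnExit;  // models WindowState.mRemoveOnExit
    boolean animatingExit; // models WindowState.mAnimatingExit
    boolean removed;       // models WindowState.mRemoved

    // Models the early-return branch: removal is deferred and
    // animatingExit is deliberately left untouched.
    void deferRemovalForSavedSurface() {
        removeOnExit = true;
    }

    // Models the relevant part of finishExit(): the window is only removed
    // when both flags are set.
    void finishExit() {
        if (!animatingExit) {
            return; // nothing happens, the deferred removal is never honoured
        }
        animatingExit = false;
        if (removeOnExit) {
            removeOnExit = false;
            removed = true;
        }
    }

    public static void main(String[] args) {
        ExitStateSketch win = new ExitStateSketch();
        win.deferRemovalForSavedSurface();
        win.finishExit();
        // removeOnExit stays true, removed stays false: the window is stuck.
        System.out.println("removeOnExit=" + win.removeOnExit + " removed=" + win.removed);
    }
}
```

With only removeOnExit set, calling finishExit() any number of times never removes the window; something else would have to set animatingExit first, and a dead process cannot trigger that.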


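Now back to the question left open earlier:
Why is the parent window's Layer not visible on the SurfaceFlinger side?
The answer is that a Layer corresponds to a SurfaceControl, and a leaked WindowState does not imply a leaked SurfaceControl. In other words, the child window's SurfaceControl was not destroyed while the parent window's was. The SurfaceControl-related references of the two windows in the hprof are as follows:

  • Parent window: mSurfaceController and mPendingDestroySurface are both null, so the surface has been destroyed
  • Child window: mSurfaceController is null but mPendingDestroySurface is not, so destruction was deferred

In fact, both windows went through WindowStateAnimator.destroySurfaceLocked():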

void destroySurfaceLocked() {
    ...
    if (mSurfaceDestroyDeferred) { // mSurfaceDestroyDeferred is true for the child window
        if (mSurfaceController != null && mPendingDestroySurface != mSurfaceController) {
            if (mPendingDestroySurface != null) {
                if (SHOW_TRANSACTIONS || SHOW_SURFACE_ALLOC) {
                    WindowManagerService.logSurface(mWin, "DESTROY PENDING", true);
                }
                mPendingDestroySurface.destroyInTransaction();
            }
            mPendingDestroySurface = mSurfaceController;
        }
    } else {
        if (SHOW_TRANSACTIONS || SHOW_SURFACE_ALLOC) {
            WindowManagerService.logSurface(mWin, "DESTROY", true);
        }
        destroySurface();
    }
    ...
}

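Now it is clear. The child window's SurfaceControl had its destruction deferred because WindowStateAnimator.mSurfaceDestroyDeferred was true. It was true because SurfaceView passes the RELAYOUT_DEFER_SURFACE_DESTROY flag when it relayouts. Normally SurfaceView would shortly afterwards call WMS.performDeferredDestroyWindow() to destroy mPendingDestroySurface, but the process died before that, so the call never came.

If you are interested, have a look at SurfaceView.updateWindow(). Normally it produces this call sequence: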

WMS.relayoutWindow()->WMS.finishDrawingWindow()->WMS.performDeferredDestroyWindow()

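If the process dies before WMS.finishDrawingWindow(), which matches our conclusion exactly, mPendingDestroySurface is never destroyed. The parent window did not leak its SurfaceControl because that one was destroyed immediately.

The cause is now known, so how do we fix it? In fact, as long as WMS.removeWindowInnerLocked() gets called, nothing leaks. A framework has to keep its state consistent at all times, regardless of whether an app dies in some special scenario!

So the question comes back to what to do when we land in the scenario above. The code is: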
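The difference between the parent and child windows can be modelled as follows. This is a sketch under assumptions, not the WindowStateAnimator code: SurfaceStub and AnimatorSketch are hypothetical names, and clearing the controller reference at the end of destroySurfaceLocked() is inferred from the hprof observation that mSurfaceController is null for both windows.

```java
// Hypothetical stand-in for a surface whose destruction we want to observe.
final class SurfaceStub {
    boolean destroyed;
}

// Hypothetical model of the deferred-destroy bookkeeping.
final class AnimatorSketch {
    SurfaceStub surfaceController = new SurfaceStub(); // models mSurfaceController
    SurfaceStub pendingDestroySurface;                  // models mPendingDestroySurface
    boolean surfaceDestroyDeferred;                     // models mSurfaceDestroyDeferred

    void destroySurfaceLocked() {
        if (surfaceController == null) {
            return;
        }
        if (surfaceDestroyDeferred) {
            // Deferred: destroy any older pending surface, then park the current one.
            if (pendingDestroySurface != null) {
                pendingDestroySurface.destroyed = true;
            }
            pendingDestroySurface = surfaceController;
        } else {
            // Immediate: the surface goes away right now.
            surfaceController.destroyed = true;
        }
        surfaceController = null; // assumption: the reference is dropped either way
    }

    // Models the follow-up call the client is expected to make.
    void performDeferredDestroy() {
        surfaceDestroyDeferred = false;
        if (pendingDestroySurface != null) {
            pendingDestroySurface.destroyed = true;
            pendingDestroySurface = null;
        }
    }

    public static void main(String[] args) {
        AnimatorSketch child = new AnimatorSketch();
        child.surfaceDestroyDeferred = true;
        SurfaceStub surface = child.surfaceController;
        child.destroySurfaceLocked();
        // Without performDeferredDestroy(), the surface is parked but never destroyed.
        System.out.println("destroyed=" + surface.destroyed
                + " pending=" + (child.pendingDestroySurface != null));
    }
}
```

If performDeferredDestroy() is never called, as happens when the client process dies, the parked surface stays alive indefinitely, while the non-deferred path destroys its surface immediately.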

if (win.isAnimatingWithSavedSurface() && !appToken.allDrawnExcludingSaved) {
    // We started enter animation early with a saved surface, now the app asks to remove
    // this window. If we remove it now and the app is not yet drawn, we'll show a
    // flicker. Delay the removal now until it's really drawn.
    if (DEBUG_ADD_REMOVE) {
        Slog.d(TAG_WM, "removeWindowLocked: delay removal of " + win
                + " due to early animation");
    }
    // Do not set mAnimatingExit to true here, it will cause the surface to be hidden
    // immediately after the enter animation is done. If the app is not yet drawn then
    // it will show up as a flicker.
    setupWindowForRemoveOnExit(win);
    Binder.restoreCallingIdentity(origId);
    return;
}

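It says that if an animation is running on a saved Surface and the app has not finished drawing, removal of the window is delayed and mRemoveOnExit is set to true. It also makes a point of not setting mAnimatingExit to true, because that would hide the Surface immediately after the animation ends, all in the name of avoiding a flicker! But nothing guarantees that mAnimatingExit will ever become true.


Solution

One option is to also set mAnimatingExit to true, but WindowStateAnimator.finishExit() may well never get a chance to be called, so that would not help.

The change I finally made was to comment out this block. If the window is being destroyed now, why wait for drawing to complete and the animation to end? Is there any guarantee of another chance to remove it later? Does this so-called optimization really buy anything? After the change, the bug no longer reproduces.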
