Archive for August, 2015

android: binder: arm: SIGSEGV at IPCThreadState::executeCommand in libbinder.so

August 30, 2015

This post discusses a case in which a segmentation fault (native crash) happens during binder transactions. The source code is a Qualcomm release in which the frameworks are android-4.4.3_KTU84L KitKat and the kernel is Linux 3.10.28. The architecture is a quad-core ARM Cortex-A53.

symptom: segmentation fault at IPCThreadState::executeCommand in libbinder.so

pid: 308, tid: 464, name: Binder_1  >>> /system/bin/surfaceflinger <<<
thread: Binder_1
signal 11 (SIGSEGV), code 1 (SEGV_MAPERR), fault addr 00000000
    r0 00000000  r1 b70724f4  r2 00000001  r3 00000003
    r4 b6fad0f0  r5 00000000  r6 b6f4538c  r7 000003e8
    r8 00000134  r9 b64d1000  sl bea4d85c  fp b6f432f4
    ip 000003e8  sp b65cec78  lr b6d6acb3  pc b6d676ec  cpsr 000f0030
    d0  6f207463656a6266  d1  64696f76202c6366
    d2  61503a3a64696f73  d3  65723a3a6c656365
    d4  746e69752074736e  d5  6973202c2a745f38
    d6  6f63202c745f657a  d7  657a69732074736e
    d8  0000000000000000  d9  0000000000000000
    d10 0000000000000000  d11 0000000000000000
    d12 0000000000000000  d13 0000000000000000
    d14 0000000000000000  d15 0000000000000000
    d16 41cdcd6500000000  d17 416fca0560000000
    d18 412e848000000000  d19 0000000000000000
    d20 3ff9d07c84b5dcc6  d21 40cf830000000000
    d22 3efa019fabb79d95  d23 3f2a019f8723aeaa
    d24 3f56c16c16c76a94  d25 3f81111111185da8
    d26 40cf830000000000  d27 3fc55555555554db
    d28 3fe0000000000000  d29 0000000000000001
    d30 fff0000000000000  d31 4000000000000000
    scr 68000010

backtrace:
    #00  pc 000206ec  /system/lib/libbinder.so (android::IPCThreadState::executeCommand(int)+587)
    #01  pc 00020a97  /system/lib/libbinder.so (android::IPCThreadState::getAndExecuteCommand()+38)
    #02  pc 00020b0f  /system/lib/libbinder.so (android::IPCThreadState::joinThreadPool(bool)+50)
    #03  pc 00025935  /system/lib/libbinder.so
    #04  pc 00012671  /system/lib/libutils.so (android::Thread::_threadLoop(void*)+216)
    #05  pc 000120f3  /system/lib/libutils.so
    #06  pc 0000d360  /system/lib/libc.so (__thread_entry+72)
    #07  pc 0000d4f8  /system/lib/libc.so (pthread_create+240)

preliminary: binder thread and IPCThreadState
After opening /dev/binder, a process becomes a binder process and can have up to 16 binder threads. Each binder thread has its own thread-local IPCThreadState object, which can be obtained with IPCThreadState::self(). IPCThreadState::self() provides all the infrastructure a binder thread needs to do its work.

After being spawned, a binder thread calls joinThreadPool to enter its loop. transact initiates a transaction, during which executeCommand is carried out. talkWithDriver is the low level function that ioctls /dev/binder. Finally, shutdown exits a binder thread.

class IPCThreadState
{
public:
    static  IPCThreadState*     self();
    static  IPCThreadState*     selfOrNull();  // self(), but won't instantiate
            sp<ProcessState>    process();
            void                joinThreadPool(bool isMain = true);
            status_t            transact(int32_t handle,
                                         uint32_t code, const Parcel& data,
                                         Parcel* reply, uint32_t flags);
    static  void                shutdown();
private:
            status_t            talkWithDriver(bool doReceive=true);
            status_t            executeCommand(int32_t command);
            Parcel              mIn;
            Parcel              mOut;

preliminary: how parcels and binder transactions work
A binder transaction is an IPC transaction. On the client side, one thread of the client process writes the transaction data into the parcel IPCThreadState::self()->mOut and requests a binder transaction via an ioctl on /dev/binder. This thread enters kernel space and allocates a binder transaction whose data is copied from the parcel mOut. Finally, the binder transaction is enqueued onto the todo list of the target process.

On the server side, a binder thread is blocked in the ioctl on /dev/binder. When a transaction is enqueued onto the todo list of the process, a binder thread wakes up in kernel space and copies the binder transaction data into its IPCThreadState::self()->mIn.

[figure: ipcthreadstate_parcel]
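
To make the client side concrete, here is a minimal sketch; the handle value, the transaction code, and the int32 payload are hypothetical, and real clients normally call IBinder::transact() on a BpBinder, which forwards to IPCThreadState::self()->transact() with its own handle. Only the transact() signature comes from the class declaration above.

// Client-side sketch, for illustration only. The handle (1), the code, and the
// payload are made up.
#include <binder/IBinder.h>
#include <binder/IPCThreadState.h>
#include <binder/Parcel.h>

using namespace android;

status_t call_some_service()
{
    Parcel data, reply;
    data.writeInt32(42);                       // marshalled into mOut by transact()
    status_t err = IPCThreadState::self()->transact(
            1 /* hypothetical handle */,
            IBinder::FIRST_CALL_TRANSACTION,   // transaction code understood by the server
            data, &reply, 0 /* flags: 0 means a two-way transaction */);
    // On return, the server's reply parcel has been read back from mIn into reply.
    return err;
}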

(gdb) p mIn      
$33 = {mError = 0, mData = 0xb6fad198 "\fr", mDataSize = 48, mDataCapacity = 256, mDataPos = 48, mObjects = 0x0, mObjectsSize = 0, mObjectsCapacity = 0, mNextObjectHint = 0, mFdsKnown = true, 
  mHasFds = false, mAllowFds = true, mOwner = 0, mOwnerCookie = 0x2e}

analyze: what the binder thread is doing at the time of the native crash
In this case, the binder thread Binder_1 (tid=464) of the process /system/bin/surfaceflinger (pid=308) encounters a segmentation fault while executing the command after reading a binder transaction from mIn.

(gdb) bt
#0  android::IPCThreadState::executeCommand (this=0xb6fad0f0, cmd=<optimized out>) at frameworks/native/libs/binder/IPCThreadState.cpp:1088
#1  0xb6d67a9a in android::IPCThreadState::getAndExecuteCommand (this=0xb6fad0f0) at frameworks/native/libs/binder/IPCThreadState.cpp:436
#2  0xb6d67b12 in android::IPCThreadState::joinThreadPool (this=0xb6fad0f0, isMain=<optimized out>) at frameworks/native/libs/binder/IPCThreadState.cpp:490
#3  0xb6d6c938 in android::PoolThread::threadLoop (this=0xb6fab648) at frameworks/native/libs/binder/ProcessState.cpp:69
#4  0xb6e85672 in android::Thread::_threadLoop (user=0xb6fab648) at frameworks/native/libs/utils/Threads.cpp:812
#5  0xb6e850f4 in thread_data_t::trampoline (t=<optimized out>) at frameworks/native/libs/utils/Threads.cpp:95
#6  0xb6f00364 in __thread_entry (func=0xb6e850bd <thread_data_t::trampoline(thread_data_t const*)>, arg=0xb6fab698, tls=0xb65cedd0) at bionic/libc/bionic/pthread_create.cpp:105
#7  0xb6f004fc in pthread_create (thread_out=0xbea4d85c, attr=<optimized out>, start_routine=0xb6e850bd <thread_data_t::trampoline(thread_data_t const*)>, arg=0x78)
    at bionic/libc/bionic/pthread_create.cpp:224
#8  0x00000000 in ?? ()
status_t IPCThreadState::executeCommand(int32_t cmd)
{
    BBinder* obj;
    RefBase::weakref_type* refs;
    status_t result = NO_ERROR;

    switch (cmd) {
        ......
        ......
    case BR_TRANSACTION:
        {
            binder_transaction_data tr;
            result = mIn.read(&tr, sizeof(tr));
            ALOG_ASSERT(result == NO_ERROR,
                "Not enough command data for brTRANSACTION");
            if (result != NO_ERROR) break;

            Parcel buffer;
            buffer.ipcSetDataReference(
                reinterpret_cast<const uint8_t*>(tr.data.ptr.buffer),
                tr.data_size,
                reinterpret_cast<const size_t*>(tr.data.ptr.offsets),
                tr.offsets_size/sizeof(size_t), freeBuffer, this);

            const pid_t origPid = mCallingPid;
            const uid_t origUid = mCallingUid;

            mCallingPid = tr.sender_pid;
            mCallingUid = tr.sender_euid;
            ......
            if (tr.target.ptr) {
                sp<BBinder> b((BBinder*)tr.cookie);
                const status_t error = b->transact(tr.code, buffer, &reply, tr.flags);
                if (error < NO_ERROR) reply.setError(error);

            } else {
                const status_t error = the_context_object->transact(tr.code, buffer, &reply, tr.flags);
                if (error < NO_ERROR) reply.setError(error);
            }
    }
}
status_t Parcel::read(void* outData, size_t len) const
{
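    // PAD_SIZE(len) rounds len up to the next multiple of 4; in Parcel.cpp it is defined as (((len)+3)&~3).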
    if ((mDataPos+PAD_SIZE(len)) >= mDataPos && (mDataPos+PAD_SIZE(len)) <= mDataSize
            && len <= PAD_SIZE(len)) {
        memcpy(outData, mData+mDataPos, len);
        mDataPos += PAD_SIZE(len);
        ALOGV("read Setting data pos of %p to %d\n", this, mDataPos);
        return NO_ERROR;
    }    
    return NOT_ENOUGH_DATA;
}

analyze: check if mIn transmitted from binder driver is reasonable
The parcel mIn of the binder thread Binder_1 (tid=464) looks reasonable. The binder request comes from pid=1304 (euid=10071). This binder transaction is a two-way transaction and allows replies with file descriptors. The binder thread will execute the function with transaction code 3 (FIRST_CALL_TRANSACTION + 2) of the requested service 0xb7036d20.

(gdb) x/12x mIn.mData
0xb6fad198:     0x0000720c      0x80287202      0xb7036d20      0xb70724f4
0xb6fad1a8:     0x00000003      0x00000010      0x00000518      0x00002757
0xb6fad1b8:     0x00000064      0x00000000      0xb65cf028      0xb65cf08c
(gdb) p *((struct binder_transaction_data*)(mIn.mData+0x08))             
$40 = {target = {handle = 3070455072, ptr = 0xb7036d20}, cookie = 0xb70724f4, code = 3, flags = 16, sender_pid = 1304, sender_euid = 10071, data_size = 100, offsets_size = 0, data = {ptr = {
      buffer = 0xb65cf028, offsets = 0xb65cf08c}, buf = "(\360\\\266\214\360\\\266"}}

[figure: binder_transaction_data]

analyze: check if tr copied from mIn.mData+8 is corrupted
The binder thread Binder_1 (tid=464) copies 40 bytes from mIn.mData+8 into the local variable tr, then executes the binder transaction using tr as the argument. The expected content of tr is below.
[figure: binder_executecommand_tr_01]

However, the coredump shows that the content of tr is shifted by 4 bytes relative to mIn.mData+8. The problem now reduces to why memcpy copies only 36 bytes with a 4-byte shift when it is supposed to copy 40 bytes. Also, whenever this case happens, the destination address always has dst[3:0] = 4, for example 0xb65cecf4.
[figure: binder_executecommand_tr_02]

(gdb) x/12x mIn.mData
0xb6fad198:     0x0000720c      0x80287202      0xb7036d20      0xb70724f4
0xb6fad1a8:     0x00000003      0x00000010      0x00000518      0x00002757
0xb6fad1b8:     0x00000064      0x00000000      0xb65cf028      0xb65cf08c
(gdb) x/12x &tr 
0xb65cecf4:     0x00000000      0xb7036d20      0xb70724f4      0x00000003
0xb65ced04:     0x00000010      0x00000518      0x00002757      0x00000064
0xb65ced14:     0x00000000      0xb65cf028      0xac59912d      0xb6fad0f0

preliminary: how bionic memcpy works on cortex-a53

According to cortex-a53.mk, the memcpy implementation used on cortex-a53 is the one in bionic/libc/arch-arm/cortex-a15/bionic/memcpy_base.S.

include bionic/libc/arch-arm/cortex-a15/cortex-a15.mk
  • initial condition
    • R0 is dst of memcpy
      R1 is src of memcpy
      Assume that while this issue happens, dst[3:0] = 4

    [figure: memcpy_40bytes_01]

  • copy 4 bytes to make r0(dest) 8 byte alignment
  • // Copies 4 bytes, dst 32 bits aligned before, at least 64 bits after.
    vld4.8      {d0[0], d1[0], d2[0], d3[0]}, [r1]!
    vst4.8      {d0[0], d1[0], d2[0], d3[0]}, [r0, :32]!
    

    [figure: memcpy_40bytes_02]

  • copy 8 bytes to make r0 16 byte alignment
  • // Copies 8 bytes, dst 64 bits aligned before, at least 128 bits after.
    vld1.8      {d0}, [r1]!
    vst1.8      {d0}, [r0, :64]!
    

    [figure: memcpy_40bytes_03]

  • copy 16 bytes
  • // Copies 16 bytes, destination 128 bits aligned.
    vld1.8      {d0, d1}, [r1]!
    vst1.8      {d0, d1}, [r0, :128]!
    

    [figure: memcpy_40bytes_04]

  • src remaining is less than 16 bytes, copy 8 bytes
  • vld1.8      {d0}, [r1]!
    vst1.8      {d0}, [r0]!
    

    [figure: memcpy_40bytes_05]

  • src remaining is less than 8 bytes, copy 4 bytes
  • 1:      bge         2f
    vld4.8      {d0[0], d1[0], d2[0], d3[0]}, [r1]!
    vst4.8      {d0[0], d1[0], d2[0], d3[0]}, [r0]!
    

    [figure: memcpy_40bytes_06]

    Let a = (int8_t *)&mIn.mData[8] and dst[3:0] = 0x4; after the 40-byte memcpy, the contents of the SIMD registers d0, d1, d2, and d3 should be as below.

    d0 = {a[36], a[29], a[30], a[31], a[32], a[33], a[34], a[35]}
    d1 = {a[37], a[21], a[22], a[23], a[24], a[25], a[26], a[27]} 
    d2 = {a[38], ……}
    d3 = {a[39], ……}
    

    analysis: possible reason to make memcpy copy only 36 bytes with 4 bytes shift
    One possible reason for memcpy copying only 36 bytes with a 4-byte shift is that line 108 of bionic/libc/arch-arm/cortex-a15/bionic/memcpy_base.S is skipped by the CPU while the other instructions execute correctly. Let's check what happens if this line is skipped.

     // Copies 4 bytes, dst 32 bits aligned before, at least 64 bits after.
    vld4.8      {d0[0], d1[0], d2[0], d3[0]}, [r1]!
    vst4.8      {d0[0], d1[0], d2[0], d3[0]}, [r0, :32]!
    
  • initial condition
    • R0 is dst of memcpy
      R1 is src of memcpy
      Assume that while this issue happens, dst[3:0] = 4

    [figure: memcpy_40bytes_01]

  • copy 4 bytes to make r0(dest) 8 byte alignment
  • // Copies 4 bytes, dst 32 bits aligned before, at least 64 bits after.
    vld4.8      {d0[0], d1[0], d2[0], d3[0]}, [r1]! -> skipped, so r1 does not advance
    vst4.8      {d0[0], d1[0], d2[0], d3[0]}, [r0, :32]! -> still executes: it stores the stale lane contents to dst[0..3] and advances r0 by 4
    

    [figure: memcpy_40bytes_02]

  • copy 8 bytes to make r0 16 byte alignment
  • // Copies 8 bytes, dst 64 bits aligned before, at least 128 bits after.
    vld1.8      {d0}, [r1]!
    vst1.8      {d0}, [r0, :64]!
    

    [figure: memcpy_40bytes_03]

  • copy 16 bytes
  • // Copies 16 bytes, destination 128 bits aligned.
    vld1.8      {d0, d1}, [r1]!
    vst1.8      {d0, d1}, [r0, :128]!
    

    [figure: memcpy_40bytes_04]

  • src remaining is less than 16 bytes, copy 8 bytes
  • vld1.8      {d0}, [r1]!
    vst1.8      {d0}, [r0]!
    

    [figure: memcpy_40bytes_05]

  • src remaining is less than 8 bytes, copy 4 bytes
  • 1:      bge         2f
    vld4.8      {d0[0], d1[0], d2[0], d3[0]}, [r1]!
    vst4.8      {d0[0], d1[0], d2[0], d3[0]}, [r0]!
    

    [figure: memcpy_40bytes_06]

    If line 108 is indeed skipped, let a = (int8_t *)&mIn.mData[8]; the contents of the SIMD registers d0, d1, d2, and d3 after the memcpy should be as below (the small host-side simulation after this table illustrates the effect).

    d0 = {a[32], a[25], a[26], a[27], a[28], a[29], a[30], a[31]}
    d1 = {a[33], a[17], a[18], a[19], a[20], a[21], a[22], a[23]} 
    d2 = {a[34], ……}
    d3 = {a[35], ……}
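
    To make the effect easier to see, below is a small host-side C++ sketch; it only models the pointer and lane updates of the five steps above and is not the real bionic assembly. With the first vld4.8 skipped, the destination ends up holding src[0..35] shifted up by 4 bytes behind 4 stale bytes, and d0/d1 end up with a[32]/a[33] in lane 0, exactly as listed above.

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    // d0..d3 stand in for the NEON registers; r0/r1 for the dst/src pointers.
    static void model_memcpy40(uint8_t *dst, const uint8_t *src, int skip_first_load)
    {
        uint8_t d0[8] = {0}, d1[8] = {0}, d2[8] = {0}, d3[8] = {0};  // stale lane contents
        const uint8_t *r1 = src;
        uint8_t *r0 = dst;

        if (!skip_first_load) {                      // vld4.8 {d0[0],d1[0],d2[0],d3[0]}, [r1]!
            d0[0] = r1[0]; d1[0] = r1[1]; d2[0] = r1[2]; d3[0] = r1[3]; r1 += 4;
        }
        r0[0] = d0[0]; r0[1] = d1[0]; r0[2] = d2[0]; r0[3] = d3[0]; r0 += 4;  // vst4.8 still runs

        memcpy(d0, r1, 8); r1 += 8; memcpy(r0, d0, 8); r0 += 8;              // 8-byte step
        memcpy(d0, r1, 8); memcpy(d1, r1 + 8, 8); r1 += 16;                  // 16-byte step
        memcpy(r0, d0, 8); memcpy(r0 + 8, d1, 8); r0 += 16;
        memcpy(d0, r1, 8); r1 += 8; memcpy(r0, d0, 8); r0 += 8;              // trailing 8 bytes
        d0[0] = r1[0]; d1[0] = r1[1]; d2[0] = r1[2]; d3[0] = r1[3]; r1 += 4; // trailing 4 bytes
        r0[0] = d0[0]; r0[1] = d1[0]; r0[2] = d2[0]; r0[3] = d3[0]; r0 += 4;

        printf("lane 0: d0 = a[%d], d1 = a[%d]\n", d0[0], d1[0]);
    }

    int main(void)
    {
        uint8_t a[40], dst[44];
        for (int i = 0; i < 40; ++i) a[i] = (uint8_t)i;   // a[i] = i so offsets are visible
        memset(dst, 0xee, sizeof(dst));                   // 0xee marks untouched bytes

        model_memcpy40(dst, a, /*skip_first_load=*/1);
        for (int i = 0; i < 44; ++i)
            printf("dst[%2d] = %3d%s", i, dst[i], (i % 4 == 3) ? "\n" : "   ");
        return 0;
    }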
    

    analysis: how to prove line 108 of bionic/libc/arch-arm/cortex-a15/bionic/memcpy_base.S is skipped
    The contents of the d0, d1, d2, and d3 registers could help prove whether line 108 is skipped. However, by the time the native crash happens these registers have already been reused by other code, so we cannot see their contents right after the memcpy. Thus, I used a debug patch to force a native crash immediately whenever memcpy does not copy the data correctly.

    The resulting coredump and tombstone show that the values of d0, d1, d2, and d3 are exactly those expected when line 108 is skipped, so I believe this assumption is true.
    d0 = {a[32], a[25], a[26], a[27], a[28], a[29], a[30], a[31]}
    d1 = {a[33], a[17], a[18], a[19], a[20], a[21], a[22], a[23]}
    d2 = {a[34], ……}
    d3 = {a[35], ……}

    pid: 975, tid: 1315, name: Binder_5  >>> system_server <<<
    thread: Binder_5
    signal 11 (SIGSEGV), code 1 (SEGV_MAPERR), fault addr deadbeef
        r0 00012345  r1 5ed5ed39  r2 00000004  r3 deadbeef
        r4 614e6958  r5 00000028  r6 00000028  r7 0000000a
        r8 5f978cc4  r9 0000000a  sl 401f7206  fp 4015e2f4
        ip 000000c0  sp 5f978c28  lr 401f2a5f  pc 401f2a64  cpsr 800f0030
        d0  0000000000000108  d1  0000272b000031c3
        d2  0061002e00650068  d3  006f007200640059
        d4  006c007500610066  d5  ffffffff00000074
        d6  0000000000000025  d7  ffffffff00000001
        d8  0000000000000000  d9  0000000000000000
        d10 0000000000000000  d11 0000000000000000
        d12 0000000000000000  d13 0000000000000000
        d14 0000000000000000  d15 0000000000000000
        d16 0004000400040004  d17 0004000400040004
        d18 0002000200020002  d19 0002000200020002
        d20 0000000000000000  d21 0000000000000000
        d22 0000000000000000  d23 0000000000000000
        d24 0080008000800080  d25 0080008000800080
        d26 0080008000800080  d27 0080008000800080
        d28 0080008000800080  d29 0080008000800080
        d30 0000000000000000  d31 8080808080808080
        scr 68000011
    
    backtrace:
        #00  pc 00023a64  /system/lib/libbinder.so (android::Parcel::read(void*, unsigned int) const+151)
        #01  pc 000206d9  /system/lib/libbinder.so (android::IPCThreadState::executeCommand(int)+424)
        #02  pc 00020b4f  /system/lib/libbinder.so (android::IPCThreadState::getAndExecuteCommand()+38)
        #03  pc 00020bd7  /system/lib/libbinder.so (android::IPCThreadState::joinThreadPool(bool)+66)
        #04  pc 00025ac1  /system/lib/libbinder.so
        #05  pc 00012671  /system/lib/libutils.so (android::Thread::_threadLoop(void*)+216)
        #06  pc 0004fb4f  /system/lib/libandroid_runtime.so (android::AndroidRuntime::javaThreadShell(void*)+98)
        #07  pc 000120f3  /system/lib/libutils.so
        #08  pc 0000d360  /system/lib/libc.so (__thread_entry+72)
        #09  pc 0000d4f8  /system/lib/libc.so (pthread_create+240)
    
    (gdb) bt
    #0  0x401f2a64 in android::Parcel::read (this=0x614e6958, outData=0x5f978cc4, len=40) at vendor/htc/native/libs/binder/Parcel.cpp:932
    #1  0x401ef6dc in android::IPCThreadState::executeCommand (this=0x614e6928, cmd=-2144833022) at vendor/htc/native/libs/binder/IPCThreadState.cpp:1059
    #2  0x401efb52 in android::IPCThreadState::getAndExecuteCommand (this=0x614e6928) at vendor/htc/native/libs/binder/IPCThreadState.cpp:448
    #3  0x401efbda in android::IPCThreadState::joinThreadPool (this=0x0, isMain=<optimized out>) at vendor/htc/native/libs/binder/IPCThreadState.cpp:504
    #4  0x5f87b000 in ?? ()
    #5  0x5f87b000 in ?? ()
    Backtrace stopped: previous frame identical to this frame (corrupt stack?)
    (gdb) x/12 mData
    0x5ed5ed30:     0x0000720c      0x80287202      0x54ad22c0      0x547d2cf0
    0x5ed5ed40:     0x0000001d      0x00000010      0x000031b3      0x0000272b
    0x5ed5ed50:     0x00000110      0x00000000      0x5968c308      0x5968c418
    (gdb) x/12 outData
    0x5f978cc4:     0x00000000      0x54ad22c0      0x547d2cf0      0x0000001d
    0x5f978cd4:     0x00000010      0x000031b3      0x0000272b      0x00000110
    0x5f978ce4:     0x00000000      0x5968c308      0x62efbc00      0x614e692
    

    analysis: why an instruction is skipped by the CPU
    At first, I thought it was a hardware issue. After working with the vendor's developers, the root cause turned out to be that something is wrong in the exception handler.

    When a NEON instruction is executed in user space (for example while the NEON/VFP unit is still disabled for the thread), an exception is raised and the CPU enters an exception handler. Usually the exception handler deals with the exception, decreases the saved PC by 4, and returns to the context where the exception happened so that the instruction is re-executed. In this case, something is wrong in the exception handler: the CPU returns to the user space context without decreasing PC by 4, so the NEON instruction is skipped.
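
    As a purely illustrative sketch of that arithmetic (the address is hypothetical), the difference between a correct return and a broken one looks like this:

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        // The saved return address points 4 bytes past the faulting instruction,
        // which is why the handler described above has to decrease it by 4.
        uint32_t faulting_pc = 0xb6f0d108;      // hypothetical address of the NEON instruction
        uint32_t saved_pc    = faulting_pc + 4;

        uint32_t correct_ret = saved_pc - 4;    // handler decreases PC by 4: instruction retried
        uint32_t broken_ret  = saved_pc;        // handler forgets the -4: instruction skipped

        printf("retry at %#x, broken handler resumes at %#x\n",
               (unsigned)correct_ret, (unsigned)broken_ret);
        return 0;
    }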

    Conclusion
    Usually, an exception handler deals with the exception, decreases the saved PC by 4, and returns to the context where the exception happened. If something goes wrong in the exception handler and the CPU returns to the original context without decreasing PC by 4, an instruction is skipped. In this case, a NEON instruction in memcpy is skipped, which makes the process hit a native crash during binder transactions.

    android: binder: kernel panic: binder_get_ref

    August 23, 2015

    This post discusses debugging a kernel panic at binder_get_ref. The reference source code here is the qualcomm msm kernel release 3.4.0.

    [  638.949732] c1    936 Unable to handle kernel paging request at virtual address e714c805
    [  638.949732] c1    936 pgd = dd99c000
    [  638.949763] c1    936 [e714c805] *pgd=9f075811, *pte=00000000, *ppte=00000000
    [  638.949763] c1    936 DFSR=00000007, TTBCR=00000000, TTBR0=9e39c06a, TTBR1=8060406a
    [  638.949763] c1    936 MAIR0=ff0881a8, MAIR1=40e040e0, SCTLR=10c5787d
    [  638.949793] c1    936 Internal error: Oops: 7 [#1] PREEMPT SMP ARM
    [  638.949793] c1    936 Modules linked in: bcmdhd dm_crypt(O) moc_crypto(PO) moc_platform_mod(O)
    [  638.949824] c1    936 CPU: 1    Tainted: P        W  O  (3.4.10-gc50a6b2 #1)
    [  638.949854] c1    936 PC is at binder_get_ref+0x8/0x34
    [  638.949854] c1    936 LR is at binder_transaction_buffer_release+0x190/0x340
    [  638.949854] c1    936 pc : [<c050c304>]    lr : [<c050e48c>]    psr: a0000013
    [  638.949854] c1    936 sp : dae7ddd8  ip : 766045f4  fp : 000001cc
    [  638.949885] c1    936 r10: 00000000  r9 : dc3e6400  r8 : 00000068
    [  638.949885] c1    936 r7 : eb000288  r6 : c0cf22f8  r5 : 0001734e  r4 : eb000220
    [  638.949885] c1    936 r3 : e714c7dd  r2 : 000001cb  r1 : 000001cc  r0 : da231580
    …
    [  638.952601] c1    936 Process Binder_4 (pid: 936, stack limit = 0xdae7c2f8)
    …
    [  638.952815] c1    936 [<c050c304>] (binder_get_ref+0x8/0x34) from [<c050e48c>] (binder_transaction_buffer_release+0x190/0x340)
    [  638.952845] c1    936 [<c050e48c>] (binder_transaction_buffer_release+0x190/0x340) from [<c050fb1c>] (binder_thread_write+0x620/0x1f7c)
    [  638.952845] c1    936 [<c050fb1c>] (binder_thread_write+0x620/0x1f7c) from [<c0512238>] (binder_ioctl+0x1d4/0x588)
    [  638.952876] c1    936 [<c0512238>] (binder_ioctl+0x1d4/0x588) from [<c014d2cc>] (do_vfs_ioctl+0x4d8/0x54c)
    [  638.952876] c1    936 [<c014d2cc>] (do_vfs_ioctl+0x4d8/0x54c) from [<c014d374>] (sys_ioctl+0x34/0x54)
    [  638.952906] c1    936 [<c014d374>] (sys_ioctl+0x34/0x54) from [<c000e480>] (ret_fast_syscall+0x0/0x30)
    [  638.952906] c1    936 Code: e1a00003 e12fff1e e5903010 ea000006 (e5932028) 
    [  638.952937] c1    936 ---[ end trace 3d62df4cc805c0c7 ]---
    [  638.952968] c1    936 Kernel panic - not syncing: Binder_4 PC:binder_get_ref+0x8/0x34 LR:binder_transaction_buffer_release+0x190/0x340
    

    analysis
    In the kernel binder driver, a ref in one process points to a node in some process: a ref is a client-side reference and a node is a service. Each process uses an rb tree to maintain all of its refs, and another rb tree to maintain all of its nodes.

    In this case, thread Binder_4 (tid=936) requests a binder transaction through a binder proxy whose descriptor is 0x1cc. The binder driver tries to find the ref whose desc is 0x1cc in the ref rb tree.

    This panic happens in binder_get_ref(proc=0xdc3e6400, desc=0x1cc). The log shows ref = r3 = 0xe714c7dd and desc = r1 = 0x1cc. The kernel panic happens while dereferencing ref->desc at [0xe714c7dd + 0x28] = [0xe714c805].

    static struct binder_ref *binder_get_ref(struct binder_proc *proc,
                         uint32_t desc)
    {
        struct rb_node *n = proc->refs_by_desc.rb_node;
        struct binder_ref *ref;
    
        while (n) {
            ref = rb_entry(n, struct binder_ref, rb_node_desc);
    
            if (desc < ref->desc)
                n = n->rb_left;
            else if (desc > ref->desc)
                n = n->rb_right;
            else
                return ref; 
        }
        return NULL;
    }
    
    c050c2fc <binder_get_ref>:
    c050c2fc:   e5903010    ldr r3, [r0, #16]
    c050c300:   ea000006    b   c050c320 <binder_get_ref+0x24>
    c050c304:   e5932028    ldr r2, [r3, #40]   ; 0x28
    c050c308:   e2430004    sub r0, r3, #4
    c050c30c:   e1510002    cmp r1, r2
    c050c310:   35933008    ldrcc   r3, [r3, #8]
    c050c314:   3a000001    bcc c050c320 <binder_get_ref+0x24>
    c050c318:   912fff1e    bxls    lr
    c050c31c:   e5933004    ldr r3, [r3, #4]
    c050c320:   e3530000    cmp r3, #0
    c050c324:   1afffff6    bne c050c304 <binder_get_ref+0x8>
    c050c328:   e1a00003    mov r0, r3
    c050c32c:   e12fff1e    bx  lr
    

    analysis: check if ref rb tree is memory corrupted
    Traverse the ref rb tree to find the rb node whose desc is 0x1cc. If the root cause were memory corruption, the traversal would fail at the same place as the kernel panic. However, the traversal succeeds and an rb node with desc 0x1cc is found, so the root cause does not appear to be memory corruption. Let's check the assembly code for further analysis.
    [figure: binder_panic_01]

    analysis: check registers and assembly code
    Since r2 = 0x1cb, this panic happens while traversing from the rb node with desc 0x1cb towards the rb node with desc 0x1cd.

    Given desc = 0x1cc, ref->desc = 0x1cb, and n = 0xda231584: since desc > ref->desc, n = n->rb_right, so n = [0xda231584 + 0x4] = [0xda231588] = 0xda818a84.

    One possible cause is that a bit flip happened in the CPU cache, so the load from [0xda231588] effectively returned the content of [0xdaa31588] = 0xe714c7dd, which matches r3. The two addresses differ in a single bit:
    0xda231588 = 0b11011010001000110001010110001000
    0xdaa31588 = 0b11011010101000110001010110001000

    [figure: binder_panic_02]

    c050c304:   e5932028    ldr r2, [r3, #40]   ; 0x28 
    => r2 = [0xda2315ac] = 0x1cb
    c050c308:   e2430004    sub r0, r3, #4
    => r0 = 0xda231580
    => since 0x1cb < 0x1cc, traverse the right child.
    c050c31c:   e5933004    ldr r3, [r3, #4]
    => r3 = [0xda231588] = 0xda818a84 
    => However, a bit flip happens => r3 = [0xdaa31588] = 0xe714c7dd
    c050c320:   e3530000    cmp r3, #0
    c050c324:   1afffff6    bne c050c304 <binder_get_ref+0x8>
    c050c304:   e5932028    ldr r2, [r3, #40]   ; 0x28
    => Unable to handle kernel paging request at virtual address e714c805
    

    conclusion
    In this case, a kernel panic happens at binder_get_ref. First, the ref rb tree is traversed to check whether it is corrupted; the result shows that it is not. Further analysis of the registers and the assembly code of binder_get_ref suggests that one possible root cause is a bit flip in the CPU cache while the memory content itself is correct. So it is likely a hardware issue rather than a software issue.

    arm: mm: memory usage distribution

    August 22, 2015

    This post discusses the memory usage distribution and explains each item in it. The reference source code here is the qualcomm msm kernel release 3.4.0 on the 32-bit ARM architecture. In the case discussed here, CONFIG_ARM_LPAE is disabled.

    memory usage distribution
    /proc/meminfo provides a spectrum of the memory usage distribution. The MemTotal item indicates the total physical RAM managed by the kernel. The value of MemTotal is approximately the sum of the other highlighted items below.

    formula of MemTotal distribution
    MemTotal = MemFree + Buffers + Cached + SwapCached + AnonPages + Slab + KernelStack + PageTables + VmallocAlloc + KgslAlloc + IonTotal + Kmalloc + DmaAlloc + IommuPGD + HashTables + MemPoolAlloc + Ftrace
    Among these, the first 8 items are original Linux ones.
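
    As a cross-check, the following minimal sketch sums the items of the formula from a live /proc/meminfo and compares the sum with MemTotal; the vendor-specific fields (VmallocAlloc, KgslAlloc, IonTotal, Kmalloc, DmaAlloc, IommuPgd, HashTables, MemPoolAlloc, Ftrace) exist only on this customised kernel, so any field that is missing is simply treated as 0. The dump that follows is from such a device.

    #include <cstdio>
    #include <fstream>
    #include <map>
    #include <sstream>
    #include <string>

    int main()
    {
        std::map<std::string, long> mi;
        std::ifstream f("/proc/meminfo");
        std::string line;
        while (std::getline(f, line)) {
            std::istringstream iss(line);
            std::string key; long kb = 0;
            if (iss >> key >> kb)
                mi[key.substr(0, key.size() - 1)] = kb;   // strip the trailing ':'
        }
        const char *items[] = { "MemFree", "Buffers", "Cached", "SwapCached",
                                "AnonPages", "Slab", "KernelStack", "PageTables",
                                "VmallocAlloc", "KgslAlloc", "IonTotal", "Kmalloc",
                                "DmaAlloc", "IommuPgd", "HashTables", "MemPoolAlloc",
                                "Ftrace" };
        long sum = 0;
        for (const char *k : items) sum += mi[k];         // absent keys read as 0
        std::printf("MemTotal %ld kB, sum of items %ld kB, diff %ld kB\n",
                    mi["MemTotal"], sum, mi["MemTotal"] - sum);
        return 0;
    }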

    MemTotal:        1844636 kB
    MemFree:          560832 kB
    Buffers:           38108 kB
    Cached:           720412 kB
    SwapCached:            0 kB
    Active:           371144 kB
    Inactive:         652828 kB
    Active(anon):     265508 kB
    Inactive(anon):      312 kB
    Active(file):     105636 kB
    Inactive(file):   652516 kB
    Unevictable:           0 kB
    Mlocked:               0 kB
    HighTotal:       1283072 kB
    HighFree:         212004 kB
    LowTotal:         561564 kB
    LowFree:          348828 kB
    SwapTotal:        419424 kB
    SwapFree:         419424 kB
    Dirty:                92 kB
    Writeback:             0 kB
    AnonPages:        265460 kB
    Mapped:           176208 kB
    Shmem:               396 kB
    Slab:              44936 kB
    SReclaimable:      21336 kB
    SUnreclaim:        23600 kB
    KernelStack:        8384 kB
    PageTables:        16988 kB
    NFS_Unstable:          0 kB
    Bounce:                0 kB
    WritebackTmp:          0 kB
    CommitLimit:     1341740 kB
    Committed_AS:   33542204 kB
    VmallocTotal:     406528 kB
    VmallocUsed:       10932 kB
    VmallocIoRemap:     2088 kB
    VmallocAlloc:       8440 kB
    VmallocMap:            0 kB
    VmallocUserMap:        0 kB
    VmallocVpage:       4888 kB
    VmallocChunk:     121936 kB
    KgslAlloc:         29784 kB
    IonTotal:         129224 kB
    IonInUse:         101748 kB
    FreeCma:          201544 kB
    Kmalloc:            5120 kB
    DmaAlloc:           7288 kB
    IommuPgd:            756 kB
    DriverAlloc:        1952 kB
    HashTables:         1696 kB
    MemPoolAlloc:        256 kB
    CachedUnmapped:   549844 kB
    KgslUnmapped:      24372 kB
    Ftrace:              800 kB
    

    distribution item: MemFree
    MemFree is the sum of all pages on the buddy free lists. In this case, the buddy system has two zones. Each zone has 11 free areas (order 0 to 10) and each free area has 6 free lists, one per migration type. Among the 6 free lists, the Isolate migration type is not accounted into MemFree; the other migration types are.

    /proc/pagetypeinfo is a very useful tool to check whether the MemFree accounting is correct and whether the buddy system is suffering from external fragmentation.
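
    As a quick sketch of how these per-order counts turn into kilobytes, the sum below uses the Normal/Unmovable row of the dump that follows; MemFree is the same kind of sum taken over every non-Isolate row of every zone.

    #include <cstdio>

    int main()
    {
        // free block counts for orders 0..10, copied from the Normal/Unmovable row below
        unsigned long counts[11] = { 263, 176, 101, 22, 5, 3, 3, 2, 2, 1, 1 };
        unsigned long pages = 0;
        for (int order = 0; order <= 10; ++order)
            pages += counts[order] << order;          // an order-n block holds 2^n pages
        std::printf("%lu free pages = %lu kB\n", pages, pages * 4);   // 4 kB pages
        return 0;
    }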

    Page block order: 10
    Pages per block:  1024
    
    Free pages count per migrate type at order       0      1      2      3      4      5      6      7      8      9     10 
    Node    0, zone   Normal, type    Unmovable    263    176    101     22      5      3      3      2      2      1      1 
    Node    0, zone   Normal, type  Reclaimable      5      8      7      5      1      0      1      1      0      1      0 
    Node    0, zone   Normal, type      Movable      1      1      1      1      1      0      1      1      1      0     77 
    Node    0, zone   Normal, type      Reserve      0      0      0      0      0      0      0      0      0      0      1 
    Node    0, zone   Normal, type          CMA      0      0      0      0      0      1      0      0      1      0      2 
    Node    0, zone   Normal, type      Isolate      0      0      0      0      0      0      0      0      0      0      0 
    Node    0, zone  HighMem, type    Unmovable     26     33     34     15      8      2      2      1      2      1      0 
    Node    0, zone  HighMem, type  Reclaimable      0      0      0      0      0      0      0      0      0      0      0 
    Node    0, zone  HighMem, type      Movable    509    390    147     34      9      0      0      0      0      0      0 
    Node    0, zone  HighMem, type      Reserve      0      0      0      0      0      0      0      0      0      0      1 
    Node    0, zone  HighMem, type          CMA   1321    912   1726   1572    965    277     16      1      0      0      0 
    Node    0, zone  HighMem, type      Isolate      0      0      0      0      0      0      0      0      0      0      0 
    
    Number of blocks type     Unmovable  Reclaimable      Movable      Reserve          CMA      Isolate 
    Node 0, zone   Normal           43            6           99            1            4            0 
    Node 0, zone  HighMem           39            0          190            1           84            0 
    

    distribution item: Buffers
    Buffers indicates the memory used by block device caches. Below shows how Buffers changes.

  • Initial condition
  • $ adb shell cat /proc/meminfo
    MemTotal:        1844636 kB
    MemFree:          553536 kB
    Buffers:          120160 kB
    Cached:           617456 kB
    Mapped:           187244 kB
    
  • After drop_caches, the Buffers cache is almost entirely dropped.
  • $ adb shell "echo 3 > /proc/sys/vm/drop_caches"
    MemTotal:        1844636 kB
    MemFree:         1126696 kB
    Buffers:             516 kB
    Cached:           181868 kB
    Mapped:           186664 kB
    
  • Read 262144 KB from a block device
  • $ adb shell dd if=/dev/block/mmcblk0p45 of=/dev/null bs=512 count=$((2*1024*256))
    524288+0 records in
    524288+0 records out
    268435456 bytes transferred in 3.102 secs (86536252 bytes/sec)
    
  • Buffers size becomes 264508 KB
  • $ adb shell cat /proc/meminfo
    MemTotal:        1844636 kB
    MemFree:          858232 kB
    Buffers:          264508 kB
    Cached:           182480 kB
    Mapped:           186664 kB
    

    distribution item: Cached
    Cached indicates the memory used by the page cache of files. Below shows how Cached changes.

  • Initial condition
  • $ adb shell cat /proc/meminfo
    MemTotal:        1844636 kB
    MemFree:          610848 kB
    Buffers:           22176 kB
    Cached:           713112 kB
    Mapped:           182868 kB
    
  • After drop_caches, most of Cached is dropped, but mapped file caches cannot be dropped by drop_caches. Mapped file pages can be dropped by reclaim, including direct reclaim and background reclaim.
  • $ adb shell "echo 3 > /proc/sys/vm/drop_caches"
    MemTotal:        1844636 kB
    MemFree:         1166720 kB
    Buffers:             524 kB
    Cached:           188048 kB
    Mapped:           183064 kB
    
  • Write 262144 KB into a file
  • $ adb shell dd if=/dev/zero of=/data/test_file  bs=512 count=$((2*1024*256))
    524288+0 records in
    524288+0 records out
    268435456 bytes transferred in 10.672 secs (25153247 bytes/sec)
    
  • Cached size becomes 451108 KB
  • $ adb shell cat /proc/meminfo
    MemTotal:        1844636 kB
    MemFree:          898296 kB
    Buffers:            1228 kB
    Cached:           451108 kB
    Mapped:           183068 kB
    

    distribution item: AnonPages
    Anonymous pages are physical pages mapped into user space virtual address space, for example memory allocated by malloc(). The test program below shows how AnonPages changes.

    #include <stdlib.h>

    void add_anony_page()
    {
         int *p, i;

         /* allocate and touch 16384 chunks of 4 KB: roughly 64 MB of anonymous memory */
         for (i = 0; i < 16384; i++) {
             p = (int *)malloc(4096);
             *p = 0;    /* touch the memory so the page is actually mapped */
         }
    }
    
  • Initial condition
  • MemTotal:        1844636 kB
    MemFree:          601260 kB
    AnonPages:        259928 kB
    
  • add_anony_pages starts
  • MemTotal:        1844636 kB
    MemFree:          535592 kB
    AnonPages:        325664 kB
    
      PID       Vss      Rss      Pss      Uss  cmdline
     4211    66112K   65828K   65824K   65824K  /data/add_anon_pages
    
  • add_anony_pages ends
  • MemTotal:        1844636 kB
    MemFree:          600752 kB
    AnonPages:        259988 kB
    

    distribution item: Slab
    This is the memory used by the slab allocator, which helps relieve internal fragmentation. Its usage breakdown can be seen in /proc/slabinfo.

    slabinfo - version: 2.1
    # name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
    ext4_groupinfo_4k    133    156    104   39    1 : tunables    0    0    0 : slabdata      4      4      0
    UDPLITEv6              0      0    768   21    4 : tunables    0    0    0 : slabdata      0      0      0
    UDPv6                 43     84    768   21    4 : tunables    0    0    0 : slabdata      4      4      0
    tw_sock_TCPv6         42    105    192   21    1 : tunables    0    0    0 : slabdata      5      5      0
    TCPv6                 44     88   1472   22    8 : tunables    0    0    0 : slabdata      4      4      0
    

    distribution item: KernelStack
    On the 32-bit ARM architecture, each task has a kernel stack of two pages (8 KB).

    #define THREAD_SIZE_ORDER   1
    #define THREAD_SIZE     8192
    
    static void account_kernel_stack(struct thread_info *ti, int account)
    {
        struct zone *zone = page_zone(virt_to_page(ti));
    
        mod_zone_page_state(zone, NR_KERNEL_STACK, account);
    }
    void free_task(struct task_struct *tsk)
    {
        account_kernel_stack(tsk->stack, -1); 
        free_thread_info(tsk->stack);
        rt_mutex_debug_task_free(tsk);
        ftrace_graph_exit_task(tsk);
        free_task_struct(tsk);
    }
    
    static struct task_struct *dup_task_struct(struct task_struct *orig)
    {
    ......
        account_kernel_stack(ti, 1);
    ......
    }
    

    The example below demonstrates that KernelStack = number of threads * 8 KB: 1002 threads * 8 KB = 8016 kB.

    $ adb shell ps -t | wc
       1002    9237   69934
    $ adb shell cat /proc/meminfo | grep KernelStack
    KernelStack:        8016 kB
    

    distribution item: PageTables
    This item is exported by the original Linux kernel to show the memory usage of the lowest level page tables. In this case, the architecture is 32-bit ARM and LPAE is disabled, so each process has a two-level page table. To better approximate the real memory usage, this item is tweaked to also include the first level page table, whose size is 16 KB per process.

    #ifdef CONFIG_ARM_LPAE
    #define __pgd_alloc()   kmalloc(PTRS_PER_PGD * sizeof(pgd_t), GFP_KERNEL)
    #define __pgd_free(pgd) kfree(pgd)
    #else
    static inline pgd_t *__pgd_alloc(void)
    {
        unsigned long addr = __get_free_pages(GFP_KERNEL, 2); 
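        /* the order-2 allocation above is 4 pages, i.e. the 16 KB first level page table */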
    
        pgd_alloc_count(addr);
    
        return (pgd_t *) addr;
    }
    
    static inline void __pgd_free(pgd_t * pgd)
    {
        unsigned long addr = (unsigned long)pgd;
    
        pgd_free_count(addr);
    
        free_pages(addr, 2); 
    }
    #endif
    

    distribution item: VmallocAlloc
    VmallocTotal is the size of the vmalloc address space within the kernel virtual address space, and VmallocUsed is the amount of that address space currently in use.
    Kernel drivers can use vmap and vunmap to map and unmap pages into the vmalloc address space.

    VmallocAlloc is a customised item indicating the amount of physical memory allocated by the vmalloc allocator; vmalloc requests both physical pages and virtual address space.

    vmalloc increases both VmallocUsed and VmallocAlloc, while vmap only increases VmallocUsed.

    VmallocTotal:     406528 kB
    VmallocUsed:       10892 kB
    VmallocIoRemap:     2088 kB
    VmallocAlloc:       8404 kB
    VmallocMap:            0 kB
    VmallocUserMap:        0 kB
    VmallocVpage:       4888 kB
    VmallocChunk:     123608 kB
    

    distribution item: KgslAlloc
    This customised item accounts memory allocated by the KGSL (Qualcomm graphics driver) allocator.

    static int
    _kgsl_sharedmem_page_alloc(struct kgsl_memdesc *memdesc,
                struct kgsl_pagetable *pagetable,
                size_t size)
    {
    ......
        sg_init_table(memdesc->sg, memdesc->sglen_alloc);
    
        len = size;
        while (len > 0) {
            ......
            for (j = 0; j < page_size >> PAGE_SHIFT; j++) {
                pages[pcount++] = nth_page(page, j);
                SetPageKgsl(nth_page(page, j));
            }
    
            sg_set_page(&memdesc->sg[sglen++], page, page_size, 0);
            len -= page_size;
        }
    ......
    }
    
    static void kgsl_page_alloc_free(struct kgsl_memdesc *memdesc)
    {
    ......
       if (memdesc->sg)
            for_each_sg(memdesc->sg, sg, sglen, i) {
                if (sg->length == 0)
                    break;
                ......                
                size = 1 << get_order(sg->length);
                for (j = 0; j < size; j++)
                    ClearPageKgsl(nth_page(sg_page(sg), j));
                __free_pages(sg_page(sg), get_order(sg->length));
            }
    ......
    }
    

    distribution item: IonTotal
    This customised item accounts memory allocated by the Android ION memory allocator. The ION allocator helps relieve external fragmentation.

    static void *ion_page_pool_alloc_pages(struct ion_page_pool *pool)
    {
        struct page *page;
    
        page = alloc_pages(pool->gfp_mask & ~__GFP_ZERO, pool->order);
        ......
    }
    static void ion_page_pool_free_pages(struct ion_page_pool *pool,
                         struct page *page)
    {
        __free_pages(page, pool->order);
    }
    

    distribution item: Kmalloc
    This customised item accounts kmalloc(size, gfp) allocations with size > 8192. In the SLUB implementation of kmalloc, if the requested size is less than or equal to 8192, kmalloc requests an object from the corresponding kmalloc-xxxx kmem_cache; otherwise, kmalloc requests compound pages directly from the buddy allocator.

    CONFIG_SLUB=y
    
    #define SLUB_MAX_SIZE (2 * PAGE_SIZE)
    static __always_inline void *kmalloc(size_t size, gfp_t flags)
    {
        if (__builtin_constant_p(size)) {
            if (size > SLUB_MAX_SIZE)
                return kmalloc_large(size, flags);
    
            if (!(flags & SLUB_DMA)) {
                struct kmem_cache *s = kmalloc_slab(size);
    
                if (!s)
                    return ZERO_SIZE_PTR;
    
                return kmem_cache_alloc_trace(s, flags, size);
            }   
        }   
        return __kmalloc(size, flags);
    }
    
    slabinfo - version: 2.1
    # name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
    kmalloc-8192          80     96   8192    4    8 : tunables    0    0    0 : slabdata     24     24      0   
    kmalloc-4096         380    408   4096    8    8 : tunables    0    0    0 : slabdata     51     51      0   
    kmalloc-2048         216    240   2048   16    8 : tunables    0    0    0 : slabdata     15     15      0   
    kmalloc-1024        1562   1872   1024   16    4 : tunables    0    0    0 : slabdata    117    117      0   
    kmalloc-512         2007   2112    512   16    2 : tunables    0    0    0 : slabdata    132    132      0   
    kmalloc-256          888   1008    256   16    1 : tunables    0    0    0 : slabdata     63     63      0   
    kmalloc-128         8978  11232    128   32    1 : tunables    0    0    0 : slabdata    351    351      0   
    kmalloc-64         75844  79744     64   64    1 : tunables    0    0    0 : slabdata   1246   1246      0   
    kmalloc-192         7599   8694    192   21    1 : tunables    0    0    0 : slabdata    414    414      0   
    kmem_cache            32     32    128   32    1 : tunables    0    0    0 : slabdata      1      1      0   
    
    

    distribution item: DmaAlloc
    This customised item indicates the memory usage of the DMA allocator. In this case, CONFIG_CMA is enabled, so the DMA allocator requests memory from the CMA allocator.

    static void *__dma_alloc(struct device *dev, size_t size, dma_addr_t *handle,
                 gfp_t gfp, pgprot_t prot, const void *caller,
                 struct dma_attrs *attrs)
    {
    ......
        size = PAGE_ALIGN(size);
    
        if (arch_is_coherent() || nommu())
            addr = __alloc_simple_buffer(dev, size, gfp, &page);
        else if (gfp & GFP_ATOMIC)
            addr = __alloc_from_pool(size, &page);
        else if (!IS_ENABLED(CONFIG_CMA))
            addr = __alloc_remap_buffer(dev, size, gfp, prot, &page, caller);
        else
            addr = __alloc_from_contiguous(dev, size, prot, &page,
                            caller, attrs);
    
        if (addr)
            *handle = pfn_to_dma(dev, page_to_pfn(page));
    
        return addr;
    ......
    }
    

    If gfp & GFP_ATOMIC is true, the DMA allocator requests memory from the dma_pool (atomic_pool), whose default size is 256 KB. In this case, CONFIG_CMA is enabled, and the pool requests its memory from the CMA allocator at initialisation.

    struct dma_pool {
        size_t size;
        spinlock_t lock;
        unsigned long *bitmap;
        unsigned long nr_pages;
        void *vaddr;
        struct page *page;
    };
    
    static struct dma_pool atomic_pool = {
        .size = SZ_256K,
    };
    
    static int __init atomic_pool_init(void)
    {
        struct dma_pool *pool = &atomic_pool;
        void *ptr;
        ......
        if (IS_ENABLED(CONFIG_CMA))
            ptr = __alloc_from_contiguous(NULL, pool->size, prot, &page,
                            atomic_pool_init, NULL);
        ......
        if (ptr) {
            spin_lock_init(&pool->lock);
            pool->vaddr = ptr;
            ......
        }
    ......
    }
    

    distribution item: IommuPGD
    In this case, the architecture is 32-bit ARM and LPAE is disabled, so the page table has two levels. This customised item indicates the memory usage of the first and second level page tables of all IOMMU domains.

    distribution item: HashTables
    The customised item indicates the memory usage allocated by the API alloc_large_system_hash.

    /*
     * allocate a large system hash table from bootmem
     * - it is assumed that the hash table must contain an exact power-of-2
     *   quantity of entries
     * - limit is the number of hash buckets, not the total allocation size
     */
    void *__init alloc_large_system_hash(const char *tablename,
                         unsigned long bucketsize,
                         unsigned long numentries,
                         int scale,
                         int flags,
                         unsigned int *_hash_shift,
                         unsigned int *_hash_mask,
                         unsigned long limit);
    

    distribution item: MemPoolAlloc
    The naming of this distribution item is a little misleading: it indicates the memory allocated by the API mempool_alloc_pages.

    /*
     * A simple mempool-backed page allocator that allocates pages
     * of the order specified by pool_data.
     */
    void *mempool_alloc_pages(gfp_t gfp_mask, void *pool_data)
    {
        int order = (int)(long)pool_data;
        return alloc_pages(gfp_mask, order);
    }
    EXPORT_SYMBOL(mempool_alloc_pages);
    
    void mempool_free_pages(void *element, void *pool_data)
    {
        int order = (int)(long)pool_data;
        __free_pages(element, order);
    }
    EXPORT_SYMBOL(mempool_free_pages);
    

    distribution item: Ftrace
    This distribution item indicates the memory used by the Ftrace driver. The usage might be 32 MB on a system with 4 CPUs.

    conclusion
    The memory usage distribution of the total memory in qualcomm msm kernel release 3.4.0 on the 32-bit ARM architecture is MemTotal = MemFree + Buffers + Cached + SwapCached + AnonPages + Slab + KernelStack + PageTables + VmallocAlloc + KgslAlloc + IonTotal + Kmalloc + DmaAlloc + IommuPGD + HashTables + MemPoolAlloc + Ftrace. All these distribution items are disjoint from each other, except that AnonPages and SwapCached have a few pages in common. According to experiments, the formula approximates MemTotal to within 10 MB on a 2 GB device.

    android: device freeze: some threads are blocked at shrink_inactive_list

    August 16, 2015

    This post discusses a case in which many threads are blocked at shrink_inactive_list in kernel space. The reference source code here is the qualcomm msm release with android 4.4.2 and kernel 3.4.0. This post is similar to android: device freeze: many threads blocked at proc_pid_cmdline because some threads blocked at shrink_inactive_list in kernel space.

    Symptom: many threads blocked at shrink_inactive_list in kernel space

    [231481.184237] c0  22335 [KEY] [PWR] Show Blocked State -- long press power key
    [231481.184245] c0  22335   task                PC stack   pid father
    [231481.184252] c0  22335 kthreadd        D c0991bf0     0     2      0 0x00000000 c0 0
    [231481.184266] c0  22335   tgid: 2, group leader: kthreadd
    [231481.184289] c0  22335 [<c0991bf0>] (__schedule+0x430/0x5e4) from [<c0990278>] (schedule_timeout+0x294/0x34c)
    [231481.184302] c0  22335 [<c0990278>] (schedule_timeout+0x294/0x34c) from [<c0991530>] (io_schedule_timeout+0x78/0xa8)
    [231481.184317] c0  22335 [<c0991530>] (io_schedule_timeout+0x78/0xa8) from [<c0245778>] (congestion_wait+0x78/0x140)
    [231481.184334] c0  22335 [<c0245778>] (congestion_wait+0x78/0x140) from [<c023c220>] (shrink_inactive_list+0x7c/0x3e0)
    [231481.184346] c0  22335 [<c023c220>] (shrink_inactive_list+0x7c/0x3e0) from [<c023c8c0>] (shrink_mem_cgroup_zone+0x33c/0x4c0)
    [231481.184359] c0  22335 [<c023c8c0>] (shrink_mem_cgroup_zone+0x33c/0x4c0) from [<c023ca9c>] (shrink_zone+0x58/0x8c)
    [231481.184371] c0  22335 [<c023ca9c>] (shrink_zone+0x58/0x8c) from [<c023d6ec>] (do_try_to_free_pages+0x208/0x49c)
    ..
    [231481.185837] c0  22335 vold            D c0991bf0     0   370      1 0x00000000 c0 0
    [231481.185850] c0  22335   tgid: 370, group leader: vold
    [231481.185862] c0  22335 [<c0991bf0>] (__schedule+0x430/0x5e4) from [<c0990278>] (schedule_timeout+0x294/0x34c)
    [231481.185873] c0  22335 [<c0990278>] (schedule_timeout+0x294/0x34c) from [<c0991530>] (io_schedule_timeout+0x78/0xa8)
    [231481.185886] c0  22335 [<c0991530>] (io_schedule_timeout+0x78/0xa8) from [<c0245778>] (congestion_wait+0x78/0x140)
    [231481.185898] c0  22335 [<c0245778>] (congestion_wait+0x78/0x140) from [<c023c220>] (shrink_inactive_list+0x7c/0x3e0)
    [231481.185948] c0  22335 [<c023c220>] (shrink_inactive_list+0x7c/0x3e0) from [<c023c8c0>] (shrink_mem_cgroup_zone+0x33c/0x4c0)
    [231481.185961] c0  22335 [<c023c8c0>] (shrink_mem_cgroup_zone+0x33c/0x4c0) from [<c023ca9c>] (shrink_zone+0x58/0x8c)
    

    analysis: check whether threads are really blocked in shrink_inactive_list
    The call stack snapshots show that many threads are at shrink_inactive_list() -> congestion_wait() because too_many_isolated() returns true. too_many_isolated() returns true when isolated_file > inactive_file.

    static noinline_for_stack unsigned long
    shrink_inactive_list(unsigned long nr_to_scan, struct mem_cgroup_zone *mz,
                         struct scan_control *sc, enum lru_list lru)
    {
    ......
            while (unlikely(too_many_isolated(zone, file, sc))) {
                    congestion_wait(BLK_RW_ASYNC, HZ/10);
    
                    /* We are about to die and free our memory. Return now. */
                    if (fatal_signal_pending(current))
                            return SWAP_CLUSTER_MAX;
            }
    ......
    }
    
    /*
     * Are there way too many processes in the direct reclaim path already?
     */
    static int too_many_isolated(struct zone *zone, int file,
                    struct scan_control *sc)
    {
            unsigned long inactive, isolated;
    
            if (current_is_kswapd())
                    return 0;
    
            if (!global_reclaim(sc))
                    return 0;
    
            if (file) {
                    inactive = zone_page_state(zone, NR_INACTIVE_FILE);
                    isolated = zone_page_state(zone, NR_ISOLATED_FILE);
            } else {
                    inactive = zone_page_state(zone, NR_INACTIVE_ANON);
                    isolated = zone_page_state(zone, NR_ISOLATED_ANON);
            }
    
            return isolated > inactive;
    }
    

    The memory accounting of isolated_file and inactive_file shows that too_many_isolated() does not always return true, so these threads are not permanently stuck at shrink_inactive_list() -> congestion_wait(); they keep looping through direct reclaim. Besides, there are 160 MB of free pages in highmem and 40 MB of free pages in lowmem, so it is not reasonable that many threads enter direct reclaim at the same time while free memory appears high.

    [figure: isolate_file2_01]

    analysis: why many threads enter direct reclaim at the same time while free memory is high

    Below is the zone status of lowmem. The free memory is only 40 MB, which is equal to the low watermark plus the lowmem_reserve for highmem allocations. Under this condition, once free memory in highmem is exhausted, movable allocations cannot fall back to lowmem.

    [figure: isolate_file2_02]

    Below is the zone status of highmem. The free areas of every order are empty, so the free memory in highmem is actually 0 MB, but vm_stat[0] (NR_FREE_PAGES) incorrectly accounts the free memory in highmem as 160 MB. The incorrect free memory accounting disables the low memory killer and kswapd, because both depend on it. The only way to restore free memory is to reclaim, so many user space processes are busy reclaiming within direct reclaim.

    The post android: mm: dynamic memory system explains why busy reclaiming while Cached is extremely low makes the device freeze. In this case, Cached is only about 1 MB.

    [figure: isolate_file2_03]

    analysis: why free memory accounting in highmem is incorrect
    The post kernel: mm: cma: allocation failure due to buddy system shows that CMA allocation might leave the free memory accounting incorrect.

    Conclusion
    Due to incorrect accounting of free memory, the low memory killer and kswapd are disabled because both depend on the free memory accounting. Many threads are busy reclaiming memory in direct reclaim, and system performance is highly degraded under this condition. This problem is formally fixed in Linux 3.18 by the following patches.

  • mm/page_alloc: fix incorrect isolation behavior by rechecking migratetype
  • mm/page_alloc: add freepage on isolate pageblock to correct buddy list
  • mm/page_alloc: move freepage counting logic to __free_one_page()

    android: device freeze: many threads blocked at proc_pid_cmdline because some threads blocked at shrink_inactive_list in kernel space

    August 14, 2015

    This post discusses a case in which many threads are blocked at proc_pid_cmdline because some threads are blocked at shrink_inactive_list in kernel space. The reference source code here is the qualcomm msm release with android 4.4.2 and kernel 3.4.0. This post is similar to android: device freeze: some threads are blocked at shrink_inactive_list.

    symptom: many threads blocked at proc_pid_cmdline in kernel space

    <6>[75041.954866] c2    994 system_server   D c0991970     0  1005    384 0x00000000 c0 0
    <6>[75041.954878] c2    994   tgid: 1005, group leader: system_server
    <6>[75041.954889] c2    994 [<c0991970>] (__schedule+0x430/0x5e4) from [<c0991ef0>] (schedule_preempt_disabled+0x24/0x34)
    <6>[75041.954899] c2    994 [<c0991ef0>] (schedule_preempt_disabled+0x24/0x34) from [<c0990a40>] (__mutex_lock_slowpath+0x160/0x1c0)
    <6>[75041.954910] c2    994 [<c0990a40>] (__mutex_lock_slowpath+0x160/0x1c0) from [<c0990ad4>] (mutex_lock+0x34/0x58)
    <6>[75041.954920] c2    994 [<c0990ad4>] (mutex_lock+0x34/0x58) from [<c070a778>] (binder_ioctl+0x34/0x5d0)
    <6>[75041.954930] c2    994 [<c070a778>] (binder_ioctl+0x34/0x5d0) from [<c027ce70>] (vfs_ioctl+0x28/0x3c)
    <6>[75041.954940] c2    994 [<c027ce70>] (vfs_ioctl+0x28/0x3c) from [<c027d9c4>] (do_vfs_ioctl+0x534/0x588)
    <6>[75041.954949] c2    994 [<c027d9c4>] (do_vfs_ioctl+0x534/0x588) from [<c027da60>] (sys_ioctl+0x48/0x70)
    <6>[75041.954959] c2    994 [<c027da60>] (sys_ioctl+0x48/0x70) from [<c01061c0>] (ret_fast_syscall+0x0/0x30)
    <6>[75041.955066] c2    994 ActivityManager D c0991970     0  1025    384 0x00000001 c0 0
    <6>[75041.955078] c2    994   tgid: 1005, group leader: system_server
    <6>[75041.955088] c2    994 [<c0991970>] (__schedule+0x430/0x5e4) from [<c09923c8>] (__down_read+0xc0/0xd8)
    <6>[75041.955099] c2    994 [<c09923c8>] (__down_read+0xc0/0xd8) from [<c024e980>] (__access_remote_vm+0x2c/0x10c)
    <6>[75041.955109] c2    994 [<c024e980>] (__access_remote_vm+0x2c/0x10c) from [<c024eb98>] (access_process_vm+0x44/0x5c)
    <6>[75041.955121] c2    994 [<c024eb98>] (access_process_vm+0x44/0x5c) from [<c02b4fc0>] (proc_pid_cmdline+0x50/0xd4)
    <6>[75041.955132] c2    994 [<c02b4fc0>] (proc_pid_cmdline+0x50/0xd4) from [<c02b5c4c>] (proc_info_read+0x54/0xd4)
    <6>[75041.955142] c2    994 [<c02b5c4c>] (proc_info_read+0x54/0xd4) from [<c026e218>] (vfs_read+0xb0/0x19c)
    <6>[75041.955151] c2    994 [<c026e218>] (vfs_read+0xb0/0x19c) from [<c026e338>] (sys_read+0x34/0x68)
    <6>[75041.955160] c2    994 [<c026e338>] (sys_read+0x34/0x68) from [<c01061c0>] (ret_fast_syscall+0x0/0x30)
    <6>[75041.955749] c2    994 watchdog        D c0991970     0  1279    384 0x00000000 c0 0
    <6>[75041.955761] c2    994   tgid: 1005, group leader: system_server
    <6>[75041.955771] c2    994 [<c0991970>] (__schedule+0x430/0x5e4) from [<c09923c8>] (__down_read+0xc0/0xd8)
    <6>[75041.955782] c2    994 [<c09923c8>] (__down_read+0xc0/0xd8) from [<c024e980>] (__access_remote_vm+0x2c/0x10c)
    <6>[75041.955790] c2    994 [<c024e980>] (__access_remote_vm+0x2c/0x10c) from [<c024eb98>] (access_process_vm+0x44/0x5c)
    <6>[75041.955801] c2    994 [<c024eb98>] (access_process_vm+0x44/0x5c) from [<c02b4fc0>] (proc_pid_cmdline+0x50/0xd4)
    <6>[75041.955810] c2    994 [<c02b4fc0>] (proc_pid_cmdline+0x50/0xd4) from [<c02b5c4c>] (proc_info_read+0x54/0xd4)
    <6>[75041.955820] c2    994 [<c02b5c4c>] (proc_info_read+0x54/0xd4) from [<c026e218>] (vfs_read+0xb0/0x19c)
    <6>[75041.955829] c2    994 [<c026e218>] (vfs_read+0xb0/0x19c) from [<c026e338>] (sys_read+0x34/0x68)
    <6>[75041.955838] c2    994 [<c026e338>] (sys_read+0x34/0x68) from [<c01061c0>] (ret_fast_syscall+0x0/0x30)
    

    analysis: threads blocked at proc_pid_cmdline
    While threads of system_server are blocked, frameworks watchdog timeout is supposed to restart frameworks. But in this case, ActivityManager thread and watchdog thread of system_server process are both blocked in kernel space with D state. Frameworks watchdog couldn’t be triggered and device stay in freeze state. The two threads are blocked at __access_remote_vm while requesting down_read some thread’s mm->mmap_sem rwsem.

    At the same time, the TexturesGenerat thread takes a page fault, tries to allocate pages, enters direct reclaim, and gets blocked at shrink_inactive_list. do_page_fault holds the mm->mmap_sem with down_read while this happens. If a down_write is then queued on the same mm->mmap_sem, all subsequent down_read attempts on this rwsem are also blocked. The problem therefore reduces to why the two threads below stay blocked at shrink_inactive_list inside direct reclaim.
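
    The ordering can be sketched with the rwsem API itself (illustration only, not real code; the writer in the middle is whichever thread happened to call mmap/munmap on this mm at the wrong time):

    /* T1: TexturesGenerat, do_page_fault */
    down_read(&mm->mmap_sem);       /* acquired, then stuck in direct reclaim */
    
    /* T2: any writer on the same mm, e.g. an mmap()/munmap() caller */
    down_write(&mm->mmap_sem);      /* queued behind T1's read lock */
    
    /* T3: ActivityManager / watchdog via access_process_vm() */
    down_read(&mm->mmap_sem);       /* the rwsem queues new readers behind the
                                       waiting writer, so T3 sleeps in D state */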

    =======================================================
        Task name: TexturesGenerat pid: 23173 cpu: 0
        state: 0x2 exit_state: 0x0 stack base: 0xe09ac000
        Stack:    [<c0991970>] __schedule+0x430
        [<c098fff8>] schedule_timeout+0x294
        [<c09912b0>] io_schedule_timeout+0x78
        [<c0245774>] congestion_wait+0x78
        [<c023c230>] shrink_inactive_list+0x7c
        [<c023c8bc>] shrink_mem_cgroup_zone+0x33c
        [<c023ca98>] shrink_zone+0x58
        [<c023d6e8>] do_try_to_free_pages+0x208
        [<c023dc2c>] try_to_free_pages+0xe0
        [<c0232e64>] __alloc_pages_nodemask+0x538
        [<c024dd08>] handle_pte_fault+0x13c
        [<c024e3c4>] handle_mm_fault+0xcc
        [<c09946c4>] do_page_fault+0x1a0
        [<c0100364>] do_DataAbort+0x180
        [<c09930b4>] __dabt_usr+0x34
    
    =======================================================
    Process: logcat, cpu: 0 pid: 5169 start: 0xe4d8c400
    =====================================================
        Task name: logcat pid: 5169 cpu: 0
        state: 0x2 exit_state: 0x0 stack base: 0xe601a000
        Stack:    [<c0991970>] __schedule+0x430
        [<c098fff8>] schedule_timeout+0x294
        [<c09912b0>] io_schedule_timeout+0x78
        [<c0245774>] congestion_wait+0x78
        [<c023c230>] shrink_inactive_list+0x7c
        [<c023c8bc>] shrink_mem_cgroup_zone+0x33c
        [<c023ca98>] shrink_zone+0x58
        [<c023d6e8>] do_try_to_free_pages+0x208
        [<c023dc2c>] try_to_free_pages+0xe0
        [<c0232e64>] __alloc_pages_nodemask+0x538
        [<c0275e20>] pipe_write+0x24c
        [<c026d5a0>] do_sync_write+0x98
        [<c026e07c>] vfs_write+0xb4
        [<c026e42c>] sys_write+0xc0
        [<c01061c0>] ret_fast_syscall+0x0
    
    static int __kprobes
    do_page_fault(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
    {
    ......
            if (!down_read_trylock(&mm->mmap_sem)) {
                    if (!user_mode(regs) && !search_exception_tables(regs->ARM_pc))
                            goto no_context;
    retry:
                    down_read(&mm->mmap_sem);
            } else {
                    /*  
                     * The above down_read_trylock() might have succeeded in
                     * which case, we'll have missed the might_sleep() from
                     * down_read()
                     */
                    might_sleep();
    ......
            }   
    
            fault = __do_page_fault(mm, addr, fsr, flags, tsk);
    }
    

    analysis: threads blocked at shrink_inactive_list -> congestion_wait
    If free memory falls below the watermarks or an allocation fails, kswapd reclaims in the background, and the allocating threads themselves may enter direct reclaim.

    In this case, some threads are blocked at shrink_inactive_list => congestion_wait. They wait in congestion_wait because isolated_file is greater than inactive_file in highmem, which makes too_many_isolated return true.

    static noinline_for_stack unsigned long
    shrink_inactive_list(unsigned long nr_to_scan, struct mem_cgroup_zone *mz,
                         struct scan_control *sc, enum lru_list lru)
    {
    ......
            while (unlikely(too_many_isolated(zone, file, sc))) {
                    congestion_wait(BLK_RW_ASYNC, HZ/10);
    
                    /* We are about to die and free our memory. Return now. */
                    if (fatal_signal_pending(current))
                            return SWAP_CLUSTER_MAX;
            }
    ......
    }
    
    /*
     * Are there way too many processes in the direct reclaim path already?
     */
    static int too_many_isolated(struct zone *zone, int file,
                    struct scan_control *sc)
    {
            unsigned long inactive, isolated;
    
            if (current_is_kswapd())
                    return 0;
    
            if (!global_reclaim(sc))
                    return 0;
    
            if (file) {
                    inactive = zone_page_state(zone, NR_INACTIVE_FILE);
                    isolated = zone_page_state(zone, NR_ISOLATED_FILE);
            } else {
                    inactive = zone_page_state(zone, NR_INACTIVE_ANON);
                    isolated = zone_page_state(zone, NR_ISOLATED_ANON);
            }
    
            return isolated > inactive;
    }
    

    Now the problem reduces to why isolated_file stays greater than inactive_file in highmem, making too_many_isolated return true. From T32, isolated_file is as high as 36 MB, so I suspect the accounting is incorrect.

    [figure: isolate_file_01]

    After reviewing the code paths that update zone_page_state(zone, NR_ISOLATED_FILE), the bug appears to be in reclaim_clean_pages_from_list: __mod_zone_page_state may update a per-CPU counter with preemption enabled. Replacing __mod_zone_page_state with mod_zone_page_state guarantees the per-CPU counter is updated with interrupts (and therefore preemption) disabled.
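
    In essence, __mod_zone_page_state does a non-atomic read-modify-write of a per-CPU counter and expects the caller to have preemption (or interrupts) disabled, while reclaim_clean_pages_from_list on the CMA path runs in a preemptible context. A simplified sketch of the update (paraphrased from mm/vmstat.c, so treat the exact lines as approximate):

    /* Simplified sketch of __mod_zone_page_state(), not verbatim kernel code */
    void __mod_zone_page_state(struct zone *zone, enum zone_stat_item item,
                               int delta)
    {
            struct per_cpu_pageset __percpu *pcp = zone->pageset;
            s8 __percpu *p = pcp->vm_stat_diff + item;
            long x, t;
    
            x = delta + __this_cpu_read(*p);        /* read this CPU's diff   */
            t = __this_cpu_read(pcp->stat_threshold);
    
            /* If the task is preempted or migrated between the read above and
             * the write below, the write lands on a stale value or on the
             * wrong CPU's slot and an update is lost; repeated losses make
             * NR_ISOLATED_FILE drift away from the real number of isolated
             * pages. */
            if (unlikely(x > t || x < -t)) {
                    zone_page_state_add(x, zone, item);
                    x = 0;
            }
            __this_cpu_write(*p, x);                /* write the diff back    */
    }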

    $ grep -rnsIE "NR_ISOLATED_FILE" .
    ./mm/vmscan.c:996:	__mod_zone_page_state(zone, NR_ISOLATED_FILE, -ret);
    ./mm/vmscan.c:1204:		isolated = zone_page_state(zone, NR_ISOLATED_FILE);
    ./mm/compaction.c:400:		__mod_zone_page_state(zone, NR_ISOLATED_FILE, count[1]);
    ./mm/compaction.c:403:		mod_zone_page_state(zone, NR_ISOLATED_FILE, count[1]);
    ./mm/compaction.c:416:	isolated = zone_page_state(zone, NR_ISOLATED_FILE) +
    ./mm/page_alloc.c:2938:		global_page_state(NR_ISOLATED_FILE),
    ./mm/page_alloc.c:2998:			K(zone_page_state(zone, NR_ISOLATED_FILE)),
    
    unsigned long reclaim_clean_pages_from_list(struct zone *zone,
                                                struct list_head *page_list)
    {
            ......
            ret = shrink_page_list(&clean_pages, zone, &sc,
                                    TTU_UNMAP|TTU_IGNORE_ACCESS,
                                    &dummy1, &dummy2, true);
            list_splice(&clean_pages, page_list);
            __mod_zone_page_state(zone, NR_ISOLATED_FILE, -ret);
            return ret;
    }
    
    /*
     * Use interrupt disable to serialize counter updates
     */
    void mod_zone_page_state(struct zone *zone, enum zone_stat_item item,
                                            int delta)
    {
            unsigned long flags;
    
            local_irq_save(flags);
            __mod_zone_page_state(zone, item, delta);
            local_irq_restore(flags);
    }
    

    potential solution
    A patch merged in Linux 3.15-rc2 fixes this bug.
    vmscan: reclaim_clean_pages_from_list() must use mod_zone_page_state()
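
    The change itself is a one-liner in reclaim_clean_pages_from_list, sketched below from the patch title (see the upstream commit for the exact diff):

    -        __mod_zone_page_state(zone, NR_ISOLATED_FILE, -ret);
    +        mod_zone_page_state(zone, NR_ISOLATED_FILE, -ret);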

    conclusion
    When this symptom happens, the isolated_file accounting is abnormally high. Linux 3.4.0 has a bug which can corrupt the isolated_file accounting. It is fixed by the patch vmscan: reclaim_clean_pages_from_list() must use mod_zone_page_state(), merged in Linux 3.15-rc2.

    kernel: mm: cma: allocation failure due to buddy system

    August 12, 2015

    This post discusses a CMA (Contiguous Memory Allocator) allocation failure caused by the buddy system. It is similar to kernel: mm: cma: allocation failure due to ksm. The reference source code is the qualcomm msm kernel release 3.4.0.

    CMA allocation code flow
    dma_alloc_from_contiguous is the CMA allocation API. It manages the CMA heap with a bitmap and leverages alloc_contig_range to allocate contiguous pages.

    Two APIs get pages from the buddy system: alloc_pages and alloc_contig_range. The former allocates a page of the requested order; the latter allocates a requested region.

    start_isolate_page_range marks all pageblocks the requested region occupies as isolate and moves all free pages in these pageblocks to the isolate freelist. After this step, pages freed by users should return to the isolate freelist and can no longer be allocated.

    __alloc_contig_migrate_range migrates all used pages to other pageblocks. Since CMA pageblocks can only be used as fallback for Movable allocations, all used pages in CMA pageblocks are assumed to be migratable. Once migrated and freed, they return to the isolate freelist because the pageblocks are already marked isolate.

    test_pages_isolated checks that all pages in the requested region are free and isolated (i.e. on the isolate freelist).

    isolate_freepages_range splits all free pages into order-0 free pages, then loops over them to take each order-0 page off the freelist and kernel_map_pages each of them.

    undo_isolate_page_range restores the pageblocks' migration type to CMA and moves any remaining free pages back to the CMA freelist.
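
    As a concrete example of the entry point, a driver asking for 1 MB of contiguous memory would look roughly like the sketch below (the device pointer is whatever device owns the CMA region; the signature assumed here is the 3.4/3.5-era one without a gfp argument):

    #include <linux/device.h>
    #include <linux/dma-contiguous.h>
    #include <linux/mm.h>
    
    /*
     * Sketch: request 1 MB of physically contiguous memory from the device's
     * CMA region. Internally this reserves a range in the CMA bitmap and runs
     * alloc_contig_range() over it, i.e. the steps described above.
     * dma_alloc_from_contiguous() returns NULL on failure.
     */
    static struct page *cma_alloc_1mb(struct device *dev)
    {
            size_t size = 1UL << 20;                       /* 1 MB                */
            int count = size >> PAGE_SHIFT;                /* pages requested     */
            unsigned int align = get_order(size);          /* alignment, in order */
    
            return dma_alloc_from_contiguous(dev, count, align);
    }
    
    /* The matching release path: dma_release_from_contiguous(dev, pages, count); */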

    [figure: cma_alloc_01]

    symptom: alloc_contig_range test_pages_isolated failed

    <4>[80467.803487] c3  26510 alloc_contig_range test_pages_isolated(33a00, 33b00) failed 
    <4>[80467.821804] c3  26510 alloc_contig_range test_pages_isolated(33b00, 33c00) failed 
    

    The error log shows that the test_pages_isolated check failed. Within alloc_contig_range, after the first two steps complete successfully, all pages in the region are supposed to be free and on the isolate freelist.

    int alloc_contig_range(unsigned long start, unsigned long end,
                           unsigned migratetype)
    {
    ......
            /* Make sure the range is really isolated. */
            if (test_pages_isolated(outer_start, end)) {
                    pr_warn("alloc_contig_range test_pages_isolated(%lx, %lx) failed\n",
                           outer_start, end);
                    ret = -EBUSY;
                    goto done;
            }
    ......
    } 
    

    analysis
    In kernel: mm: cma: allocation failure due to ksm, the second step, __alloc_contig_migrate_range, fails to migrate KSM pages. In this case, however, the test_pages_isolated check fails even though the first two steps execute correctly.

    The root cause is that after the first two steps complete, the PCP lists and other users keep returning pages to the freelists, and some of these pages do not land on the isolate freelist. Free pages that are not on the isolate freelist can be allocated again, so the third step, test_pages_isolated, fails because some of the returned free pages have been re-allocated.

    analysis: pages in the PCP list return to the freelist
    The PCP list is a per-CPU cache of order-0 free pages. When the PCP list falls below its threshold, it refills from the freelist; when an order-0 free page is moved from a freelist into the PCP list, page_private(page) is set to the migration type of the freelist it came from. When the PCP list grows above its threshold, the surplus pages are moved back to the freelist whose migration type equals page_private(page) (see the drain excerpt below and the timeline sketch after it).

    After the first two steps, a page drained from the PCP list can therefore end up on the CMA freelist instead of the isolate freelist. This makes the test_pages_isolated check fail and the CMA allocation fail.

    /*
     * Frees a number of pages from the PCP lists
     * Assumes all pages on list are in same zone, and of same order.
     * count is the number of pages to free.
     *
     * If the zone was previously in an "all pages pinned" state then look to
     * see if this freeing clears that state.
     *
     * And clear the zone's pages_scanned counter, to hold off the "all pages are
     * pinned" detection logic.
     */
    static void free_pcppages_bulk(struct zone *zone, int count,
                                            struct per_cpu_pages *pcp)
    {
    ......
                    do {
                            page = list_entry(list->prev, struct page, lru);
                            mt = get_pageblock_migratetype(page);
                            if (likely(mt != MIGRATE_ISOLATE))
                                    mt = page_private(page);
    
                            /* must delete as __free_one_page list manipulates */
                            list_del(&page->lru);
                            /* MIGRATE_MOVABLE list may include MIGRATE_RESERVEs */
                            __free_one_page(page, zone, 0, mt);
                            trace_mm_page_pcpu_drain(page, 0, mt);
                            if (likely(mt != MIGRATE_ISOLATE)) {
                                    free++;
                                    if (is_migrate_cma(mt))
                                            cma_free++;
                            }
                    } while (--to_free && --batch_free && !list_empty(list));
    
    ......
    }
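
    The scenario this section describes can be compressed into a timeline (pseudo-kernel C for illustration, modeled on the vanilla 3.4-style drain that trusts page_private; not actual source):

    /* (a) Before the CMA allocation starts: a free page from a CMA pageblock
     *     is pulled into a per-CPU PCP list and its origin is remembered.   */
    set_page_private(page, MIGRATE_CMA);
    
    /* (b) The CMA driver runs start_isolate_page_range(): the pageblock is
     *     now MIGRATE_ISOLATE, but page_private(page) still says CMA.       */
    
    /* (c) The PCP list is drained: the page is freed with the stale type, so
     *     it lands on the CMA freelist instead of the isolate freelist and
     *     can be allocated again before test_pages_isolated() runs.         */
    __free_one_page(page, zone, 0, page_private(page));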
    

    analysis: the allocated pages return to freelist
    If a user is about to free a page back to the CMA freelist just as the CMA driver finishes the first two steps, the freed page ends up on the CMA freelist rather than the isolate freelist. That freed page can then be allocated again, so the third step, test_pages_isolated, fails and the CMA allocation fails.

    static void free_one_page(struct zone *zone, struct page *page, int order,
                                    int migratetype)
    {
            spin_lock(&zone->lock);
            zone->pages_scanned = 0;
    
            __free_one_page(page, zone, order, migratetype);
            if (unlikely(migratetype != MIGRATE_ISOLATE))
                    __mod_zone_freepage_state(zone, 1 << order, migratetype);
            spin_unlock(&zone->lock);
    }
    

    analysis: snowball effect: increased allocation failure rate
    After the first two steps, almost all pages in the range are free and on the isolate freelist. If a page is returned to the wrong freelist, it is likely to merge with many buddy pages from the isolate freelist and become a larger free block sitting on the wrong freelist. If even one of these pages is then allocated, the third step fails. The CMA failure rate rises under memory pressure, because free pages that are not on the isolate freelist are allocated easily.

    analysis: snowball effect: incorrect MemFree accounting
    Free pages on the isolate freelist are not accounted in MemFree. If a page is returned to the CMA freelist and merges with free pages from the isolate freelist, up to 4 MB of isolate free pages can become CMA free pages in one merge. When these pages are allocated and later freed again, MemFree incorrectly grows by that amount. In this way the MemFree accounting ended up more than 100 MB above the correct value. On Android this stops the low memory killer from killing applications and keeps memory pressure high.

    conclusion
    I encountered this problem in qualcomm msm kernel release 3.4.0. The same root cause also corrupts the MemFree accounting. I fixed it locally by making sure freed pages are returned to the correct freelist. The problem is formally solved in Linux 3.18 by the patches below.

  • mm/page_alloc: fix incorrect isolation behavior by rechecking migratetype
  • mm/page_alloc: add freepage on isolate pageblock to correct buddy list
  • mm/page_alloc: move freepage counting logic to __free_one_page()

    kernel: mm: cma: allocation failure due to ksm

    August 11, 2015

    This post discusses a CMA (Contiguous Memory Allocator) allocation failure caused by KSM. The reference source code is the qualcomm msm kernel release 3.4.0.

    CMA allocation code flow
    dma_alloc_from_contiguous is the CMA allocation API. It manages the CMA heap with a bitmap and leverages alloc_contig_range to allocate contiguous pages.

    Two APIs get pages from the buddy system: alloc_pages and alloc_contig_range. The former allocates a page of the requested order; the latter allocates a requested region.

    start_isolate_page_range marks all pageblocks the requested region occupies as isolate and moves all free pages in these pageblocks to the isolate freelist. After this step, pages freed by users should return to the isolate freelist and can no longer be allocated.

    __alloc_contig_migrate_range migrates all used pages to other pageblocks. Since CMA pageblocks can only be used as fallback for Movable allocations, all used pages in CMA pageblocks are assumed to be migratable. Once migrated and freed, they return to the isolate freelist because the pageblocks are already marked isolate.

    test_pages_isolated checks that all pages in the requested region are free and isolated (i.e. on the isolate freelist).

    isolate_freepages_range splits all free pages into order-0 free pages, then loops over them to take each order-0 page off the freelist and kernel_map_pages each of them.

    undo_isolate_page_range restores the pageblocks' migration type to CMA and moves any remaining free pages back to the CMA freelist.

    [figure: cma_alloc_01]

    symptom: alloc_contig_range test_pages_isolated failed

    <4>[80467.803487] c3  26510 alloc_contig_range test_pages_isolated(33a00, 33b00) failed 
    <4>[80467.821804] c3  26510 alloc_contig_range test_pages_isolated(33b00, 33c00) failed 
    

    The error log shows that the test_pages_isolated check failed. Within alloc_contig_range, after the first two steps complete successfully, all pages in the region are supposed to be free and on the isolate freelist.

    int alloc_contig_range(unsigned long start, unsigned long end,
                           unsigned migratetype)
    {
    ......
            /* Make sure the range is really isolated. */
            if (test_pages_isolated(outer_start, end)) {
                    pr_warn("alloc_contig_range test_pages_isolated(%lx, %lx) failed\n",
                           outer_start, end);
                    ret = -EBUSY;
                    goto done;
            }
    ......
    } 
    

    analysis
    The second step, __alloc_contig_migrate_range, calls migrate_pages to migrate anonymous pages. If an anonymous page is a KSM page, __unmap_and_move returns -EBUSY.

    alloc_contig_range
    -> __alloc_contig_migrate_range
       -> migrate_pages
          -> unmap_and_move
             -> __unmap_and_move
    
    static int __unmap_and_move(struct page *page, struct page *newpage,
                            int force, bool offlining, enum migrate_mode mode)
    {
    ......
            /*   
             * Only memory hotplug's offline_pages() caller has locked out KSM,
             * and can safely migrate a KSM page.  The other cases have skipped
             * PageKsm along with PageReserved - but it is only now when we have
             * the page lock that we can be certain it will not go KSM beneath us
             * (KSM will not upgrade a page from PageAnon to PageKsm when it sees
             * its pagecount raised, but only here do we take the page lock which
             * serializes that).
             */
            if (PageKsm(page) && !offlining) {
                    rc = -EBUSY;
                    goto unlock;
            }
    ......
    } 
    

    However, migrate_pages ignores this error and still returns success. This design probably reflects the fact that KSM pages cannot be migrated at all in kernel 3.4.0: if the failure is permanent, there is nothing useful an error code could tell the caller.

    int migrate_pages(struct list_head *from,
                    new_page_t get_new_page, unsigned long private, bool offlining,
                    enum migrate_mode mode)
    {
    ......
            for(pass = 0; pass < 10 && retry; pass++) {
                    retry = 0;
    
                    list_for_each_entry_safe(page, page2, from, lru) {
                            cond_resched();
    
                            rc = unmap_and_move(get_new_page, private,
                                                    page, pass > 2, offlining,
                                                    mode);
    
                            switch(rc) {
                            case -ENOMEM:
                                    goto out;
                            case -EAGAIN:
                                    retry++;
                                    trace_migrate_retry(retry);
                                    break;
                            case 0:
                                    nr_succeeded++;
                                    break;
                            default:
                                    /* Permanent failure */
                                    nr_failed++;
                                    break;
                            }
                    }
            }
            rc = 0;
    ......
    }
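
    For context, KSM pages show up inside Movable/CMA pageblocks in the first place because some process marked its anonymous memory mergeable. A minimal user-space sketch of how that happens:

    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>
    
    /*
     * Map anonymous memory, fill it with identical content and hand it to
     * ksmd with madvise(MADV_MERGEABLE). Once ksmd merges such pages they
     * cannot be migrated on this 3.4 kernel, so a CMA range containing them
     * fails in __alloc_contig_migrate_range.
     */
    int main(void)
    {
            size_t len = 16 * 4096;
            char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    
            if (buf == MAP_FAILED)
                    return 1;
            memset(buf, 0x5a, len);                 /* good merge candidates */
            if (madvise(buf, len, MADV_MERGEABLE))  /* register with ksmd    */
                    perror("madvise");
            pause();                                /* keep the mapping alive */
            return 0;
    }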
    

    conclusion
    In qualcomm msm kernel release 3.4.0, KSM page migration is not supported, so CMA allocation fails easily when KSM is enabled. KSM page migration is supported since Linux 3.10.
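
    One pragmatic mitigation on kernels without KSM page migration is simply to keep KSM off, or to unmerge before large CMA allocations. A sketch using the standard sysfs knob (run as root; writing 2 means "unmerge everything and stop ksmd"):

    #include <stdio.h>
    
    int main(void)
    {
            FILE *f = fopen("/sys/kernel/mm/ksm/run", "w");
    
            if (!f) {
                    perror("/sys/kernel/mm/ksm/run");
                    return 1;
            }
            fputs("2\n", f);        /* 0 = stop, 1 = run, 2 = unmerge + stop */
            fclose(f);
            return 0;
    }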

    kernel: patch: memory leakage: single_open

    August 10, 2015

    On linux-kernel, Al Viro contributed several patches titled single_open() leaks. This post discusses memory leakage related to single_open.
    [figure: kernel_memory_leakage_single_open_01]

    single_open itself doesn't cause a memory leak. The root cause is an asymmetric open/release pair on the file: if a file's open is single_open but its release is seq_release, the seq_operations allocated by single_open (seq_file->op) is never freed and leaks.

    seq_open:              alloc seq_file, seq->buf
    single_open:           alloc seq_file, seq->buf, seq_file->op
    seq_open_private:      alloc seq_file, seq->buf, seq_file->private
    seq_release:         release seq_file, seq->buf
    single_release:      release seq_file, seq->buf, seq_file->op
    seq_release_private: release seq_file, seq->buf, seq_file->private
    

    [figure: kernel_memory_leakage_single_open_02]

    In conclusion, to avoid memory leaks, open and release a sequential file with the matching pair of functions; a minimal example follows the list below.

  • seq_open vs. seq_release
  • single_open vs. single_release
  • seq_open_private vs. seq_release_private
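
    A minimal proc file that pairs them correctly might look like the sketch below (all demo_* names are made up for illustration):

    #include <linux/module.h>
    #include <linux/proc_fs.h>
    #include <linux/seq_file.h>
    
    static int demo_show(struct seq_file *m, void *v)
    {
            seq_printf(m, "hello from single_open\n");
            return 0;
    }
    
    static int demo_open(struct inode *inode, struct file *file)
    {
            return single_open(file, demo_show, NULL);  /* allocates seq_file->op */
    }
    
    static const struct file_operations demo_fops = {
            .owner   = THIS_MODULE,
            .open    = demo_open,
            .read    = seq_read,
            .llseek  = seq_lseek,
            .release = single_release,  /* not seq_release: that would leak seq_file->op */
    };
    
    static int __init demo_init(void)
    {
            proc_create("single_open_demo", 0444, NULL, &demo_fops);
            return 0;
    }
    
    static void __exit demo_exit(void)
    {
            remove_proc_entry("single_open_demo", NULL);
    }
    
    module_init(demo_init);
    module_exit(demo_exit);
    MODULE_LICENSE("GPL");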

    arm: mm: kernel panic: BUG: Bad page map in process

    August 10, 2015

    This post discusses debugging the kernel panic: BUG: Bad page map in process. The reference source code is the qualcomm msm kernel release 3.4.0. In the case discussed here, CONFIG_ARM_LPAE is disabled and the virtual address space is 4 GB.

    symptom: BUG: Bad page map in process
    This issue happens when user space asks to unmap an interval of virtual memory. The kernel removes the interval and walks the page table to find the physical page the virtual memory maps to. If the physical page exists, the kernel clears the page table entry and decreases the map count of the mapped page. In this issue, the unmapped virtual memory maps to a physical page that is already free. Unmapping a free page is contradictory, so a BUG is triggered.

    [ 518.237101] c1 6508 BUG: Bad page map in process c.sense.browser pte:22880747 pmd:22892831
    [ 518.237115] c1 6508 page:d3847a00 count:0 mapcount:-128 mapping: (null) index:0x0
    [ 518.237122] c1 6508 page flags: 0x14(referenced|dirty)
    [ 518.237138] c1 6508 addr:90130000 vm_flags:000c00fb anon_vma: (null) mapping:efa02118 index:bb
    [ 518.237152] c1 6508 vma->vm_ops->fault: kgsl_gpumem_vm_fault+0x0/0x48
    [ 518.237161] c1 6508 vma->vm_file->f_op->mmap: kgsl_mmap+0x0/0x2b4
    [ 518.237204] c1 6508 kernel BUG at … kernel/mm/memory.c:710!
    [ 518.237212] c1 6508 Internal error: Oops - BUG: 0 [#1] PREEMPT SMP ARM
    [ 518.237265] c1 6508 PC is at print_bad_pte+0x154/0x178
    [ 518.237276] c1 6508 LR is at __print_symbol+0x2c/0x4c
    [ 518.239850] c1 6508 Kernel panic - not syncing: c.sense.browser PC:print_bad_pte+0x154/0x178 LR:__print_symbol+0x2c/0x4c
    [ 518.239760] c1 6508 [<c025df78>] (print_bad_pte+0x154/0x178) from [<c025f2d8>] (unmap_single_vma+0x330/0x5a8)
    [ 518.239772] c1 6508 [<c025f2d8>] (unmap_single_vma+0x330/0x5a8) from [<c025fd9c>] (unmap_vmas+0x38/0x54)
    [ 518.239784] c1 6508 [<c025fd9c>] (unmap_vmas+0x38/0x54) from [<c026296c>] (unmap_region+0xa4/0x194)
    [ 518.239795] c1 6508 [<c026296c>] (unmap_region+0xa4/0x194) from [<c026394c>] (do_munmap+0x220/0x288)
    [ 518.239806] c1 6508 [<c026394c>] (do_munmap+0x220/0x288) from [<c02639f0>] (vm_munmap+0x3c/0x50)
    [ 518.239820] c1 6508 [<c02639f0>] (vm_munmap+0x3c/0x50) from [<c01061e0>] (ret_fast_syscall+0x0/0x30)
    [ 518.239829] c1 6508 Code: 0a000002 e5931028 e59f0020 ebfe8752 (e7f001f2)
    

    user space virtual memory management
    The kernel manages the user space virtual address space with a doubly linked list, current->mm->mmap. When user space unmaps a virtual memory interval, the kernel removes the interval from current->mm->mmap. The kernel also indexes the same address space with an rb-tree so that looking up the vma covering a given address is fast.
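
    The lookup that readers such as the fault handler rely on looks roughly like this (a fragment, not a complete function; mm and addr come from the caller):

    struct vm_area_struct *vma;
    
    down_read(&mm->mmap_sem);
    vma = find_vma(mm, addr);               /* first vma with vm_end > addr */
    if (vma && vma->vm_start <= addr) {
            /* addr is covered by this vma */
    }
    up_read(&mm->mmap_sem);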

    [figure: BUG_Bad_page_map_01]

    page table
    Since CONFIG_ARM_LPAE is disabled, the page table has 2 levels rather than 3. In the ARMv7 short-descriptor format, bits [31..20] of a virtual address select the first level page table entry, bits [19..12] select the second level page table entry, and bits [11..0] are the offset within the physical page.

    On ARM, the Linux kernel places two consecutive second level page tables together and treats them as one: effectively bits [31..21] select the first level entry and bits [20..12] select the second level entry, as the worked example below shows.
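
    A quick worked example for the fault address used later in this post, 0x90130000, with this 2 MB first-level granularity (plain user-space C, just to show the arithmetic):

    #include <stdio.h>
    
    int main(void)
    {
            unsigned long addr = 0x90130000UL;
    
            printf("pgd index  : 0x%lx\n", addr >> 21);            /* 0x480 */
            printf("pte index  : 0x%lx\n", (addr >> 12) & 0x1ff);  /* 0x130 */
            printf("page offset: 0x%lx\n", addr & 0xfff);          /* 0x0   */
            return 0;
    }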

    [figure: BUG_Bad_page_map_02]

    debug
    do_munmap might do the following three jobs:

  • Remove the vm_area_struct from current->mm->mmap
  • Clear the page table entry at current->mm->pgd[pgd_offset] -> pte[pte_offset]
  • Modify the page structure pointed to by the page table entry

    In this case, the kernel unmaps an interval starting at 0x90130000 in the virtual address space. After walking the page table, the kernel concludes that the address maps to a physical page whose page structure sits at 0xD3847A00. After being decreased by one, page->_mapcount equals -129, meaning it was -128 (PAGE_BUDDY_MAPCOUNT_VALUE) before: the page is an order-4 Buddy page. Following page->lru confirms that the page is on a freelist and that its predecessor and successor are order-4 Buddy pages, too. So this virtual address should not map to this physical page at all; let's walk the page table to see what happened.

    [figure: BUG_Bad_page_map_03]

    #define PAGE_BUDDY_MAPCOUNT_VALUE (-128)
     
    static inline int PageBuddy(struct page *page)
    {
             return atomic_read(&page->_mapcount) == PAGE_BUDDY_MAPCOUNT_VALUE;
    }
     
    static inline void __SetPageBuddy(struct page *page)
    {
            VM_BUG_ON_PAGE(atomic_read(&page->_mapcount) != -1, page);
            atomic_set(&page->_mapcount, PAGE_BUDDY_MAPCOUNT_VALUE);
    }
     
    static inline void __ClearPageBuddy(struct page *page)
    {
            VM_BUG_ON_PAGE(!PageBuddy(page), page);
            atomic_set(&page->_mapcount, -1);
    }
    

    debug: find the first level page table entry
    The first level page table is task->mm->pgd. For the virtual address 0x90130000 the first level index is 0x480, so the entry sits at &task->mm->pgd[0x480] = 0xE923A400. From the ramdump, the pgd variable inside print_bad_pte has the same value, so this is probably correct.

    (gdb) p/x 0x90130000 >> 21
    $1 = 0x480
    (gdb) p/x ((pgd_t *)0xE9238000) + 0x480
    $2 = 0xe923a400
    

    [figure: BUG_Bad_page_map_04]

    debug: find the second level page table
    The first level page table entry holds the physical address of the second level page table, which is 1 KB in size and 1 KB aligned. Here the entry value is 0x22892831: the second level page table sits at physical address 0x22892800 and 0x31 holds the flag bits.

    [0xE923A400] = 0x22892831
    

    For the second level, the 32-bit ARM kernel packs two Linux page tables and two hardware page tables into a single page.

    Within page 0x22892000:

  • The linux page table 0x22892000 is related to task->mm->pgd[0x480]
  • The linux page table 0x22892400 is related to task->mm->pgd[0x481]
  • The hw page table 0x22892800 is related to task->mm->pgd[0x480]
  • The hw page table 0x22892C00 is related to task->mm->pgd[0x481]
  • task->mm->pgd[0x480] points to the hw page table 0x22892800
  • task->mm->pgd[0x481] points to the hw page table 0x22892C00

    [figure: BUG_Bad_page_map_05]

    debug: find the second level page table entry

    (gdb) p/x (0x90130000 >> 12) & ((1<<9)-1)
    $1 = 0x130
    (gdb) p/x ((pte_t *)0x22892000) + 0x130
    $2 = 0x228924c0
    

    [figure: BUG_Bad_page_map_06]

    debug: find the physical page
    The content of the second level page table entry has already been cleared by the unmap. Judging from the adjacent second level page table entries, the physical address of the page should be 0x22980000, yet last_kmsg shows a pte of 0x22880747, i.e. a physical page at 0x22880000.

    [ 518.237101] c1 6508 BUG: Bad page map in process c.sense.browser pte:22880747 pmd:22892831
    
    mem_map[22880] = 0xD3847A00
    

    [figure: BUG_Bad_page_map_07]

    conclusion
    The vma should map to the physical page at 0x22980000, but the log shows a physical page at 0x22880000. The two addresses differ by a single bit (0x22980000 ^ 0x22880000 = 0x00100000), so it's plausible that a bit flip happened in the CPU cache and the corrupted pte, now pointing at a free page, triggered the BUG.

