Force-Unloading Linux Kernel Driver Modules

August 7, 2018

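Although Linux is a large, all-inclusive monolithic kernel that bundles drivers, file systems, memory management and more into one image, its much-touted loadable module support lets it borrow the modular design ideas of microkernels. This article does not debate the performance of monolithic kernels against the stability of microkernels. It looks purely at the technical question of how to force-unload a Linux kernel module that can no longer be unloaded by normal means, because there are times when that becomes important¹.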

When a module cannot be unloaded

The long-awaited OOPS

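Kernel developers mostly run into one of two situations. The first is that the module crashes and the kernel prints an OOPS, as the following module does by dereferencing a NULL pointer in its init function.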

#include <linux/module.h>
#include <linux/init.h>

static int __init hello_init(void)
{
        int *p = NULL;
        *p = 1;

        return 0;
}

static void __exit hello_exit(void)
{
        printk(KERN_ALERT "Bye man, I have been unload.\n");
}

module_init(hello_init);
module_exit(hello_exit);
MODULE_AUTHOR("Jackie Liu <liuyun01@kylinos.cn>");
MODULE_LICENSE("GPL");

Build it and load it, and the kernel reports a NULL pointer dereference.

jackieliu@jackieliu-virtual-machine:~/hello$ dmesg
[ 3520.142923] BUG: unable to handle kernel NULL pointer dereference at           (null)
[ 3520.142989] IP: hello_init+0x3/0x1000 [hello]
[ 3520.142992] PGD 0
[ 3520.142995] Oops: 0002 [#1] SMP
[ 3520.143044] CPU: 0 PID: 4073 Comm: insmod Tainted: G
[ 3520.143045] Hardware name: VMware, Inc.
[ 3520.143100] Call Trace:
[ 3520.143330]  ? do_one_initcall+0x53/0x1c0
[ 3520.143375]  ? kmem_cache_alloc_trace+0x152/0x1c0
[ 3520.143381]  do_init_module+0x5f/0x1ff
[ 3520.143385]  load_module+0x18ef/0x1cd0
[ 3520.143474]  ? ima_post_read_file+0x7d/0xa0
[ 3520.143479]  SYSC_finit_module+0xdf/0x110
[ 3520.143481]  ? SYSC_finit_module+0xdf/0x110
[ 3520.143484]  SyS_finit_module+0xe/0x10
[ 3520.143487]  do_syscall_64+0x5b/0xc0
[ 3520.143617]  entry_SYSCALL64_slow_path+0x25/0x25
[ 3520.143620] RIP: 0033:0x7fb090e7f959
[ 3520.143621] RSP: 002b:00007ffd524f7108 EFLAGS: 00000202 ORIG_RAX: 0000000000000139
[ 3520.143623] RAX: ffffffffffffffda RBX: 000000eeb3dd21f0 RCX: 00007fb090e7f959
[ 3520.143625] RDX: 0000000000000000 RSI: 000000eeb298f246 RDI: 0000000000000003
[ 3520.143626] RBP: 000000eeb298f246 R08: 0000000000000000 R09: 00007fb091144ea0
[ 3520.143627] R10: 0000000000000003 R11: 0000000000000202 R12: 0000000000000000
[ 3520.143629] R13: 000000eeb3dd2b70 R14: 0000000000000000 R15: 0000000000000000
[ 3520.143631] Code: <c7> 04 25 00 00 00 00 01 00 00 00 48 89 e5 5d c3 00 00 00 00 00 00 
[ 3520.143648] RIP: hello_init+0x3/0x1000 [hello] RSP: ffff987704d77c78
[ 3520.143649] CR2: 0000000000000000
[ 3520.143652] ---[ end trace 226554bc8680d245 ]---

Naturally, the module cannot be unloaded either:

jackieliu@jackieliu-virtual-machine:~/hello$ sudo rmmod hello 
rmmod: ERROR: Module hello is in use
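
Before forcing anything, it can help to see what the kernel itself reports for the module. The sketch below is a userspace helper, not part of the original article. It assumes the usual /proc/modules line layout (name, size, reference count, dependents, state, address), and the names mod_info, parse_modules_line and lookup_module are made up for illustration.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical container for the fields we care about. */
struct mod_info {
    char name[64];
    unsigned long size;
    int refcnt;
    char state[16];   /* e.g. "Live", "Loading", "Unloading" */
};

/* Parse one /proc/modules line: name size refcount dependents state address.
 * Returns 0 on success, -1 if the line does not match that layout. */
int parse_modules_line(const char *line, struct mod_info *out)
{
    char deps[256];   /* "-" or a comma-separated list; parsed but not kept */

    if (sscanf(line, "%63s %lu %d %255s %15s",
               out->name, &out->size, &out->refcnt, deps, out->state) != 5)
        return -1;
    return 0;
}

/* Search /proc/modules for a module by name.
 * Returns 0 if found, 1 if it is not loaded, -1 if the file cannot be read. */
int lookup_module(const char *name, struct mod_info *out)
{
    char line[512];
    FILE *fp = fopen("/proc/modules", "r");

    if (fp == NULL)
        return -1;
    while (fgets(line, sizeof(line), fp) != NULL) {
        if (parse_modules_line(line, out) == 0 &&
            strcmp(out->name, name) == 0) {
            fclose(fp);
            return 0;
        }
    }
    fclose(fp);
    return 1;
}
```

Calling lookup_module("hello", &info) and printing info.state and info.refcnt shows whether the module is still listed, what state it is in, and how many references the kernel counts against it.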

The missing exit function

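The other situation is that the driver developer never declared a module_exit function at all. Whether that was deliberate or a coding mistake, the problem is real.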

#include <linux/module.h>
#include <linux/init.h>

static int __init hello_init(void)
{
        int i;

        for(i = 0; i < 10; i++ ) {
            printk(KERN_ALERT "Hey, How are you. digit %d\n", i);
        }

        return 0;
}

module_init(hello_init);
MODULE_AUTHOR("Jackie Liu <liuyun01@kylinos.cn>");
MODULE_LICENSE("GPL");

After building and loading it, try to unload the module:

jackieliu@jackieliu-virtual-machine:~/hello$ sudo rmmod hello
rmmod: ERROR: ../libkmod/libkmod-module.c:793 kmod_module_remove_module() could not remove 'hello': Device or resource busy
rmmod: ERROR: could not remove module hello: Device or resource busy

Why the module cannot be unloaded

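Having seen the symptoms, we first need to understand why a normal rmmod fails. The rmmod source shows that it removes modules through the delete_module system call, which the kernel implements as sys_delete_module. The system call number is registered per architecture; for 32-bit compat tasks on arm64, for example, the entry lives in arch/arm64/include/asm/unistd32.h.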

#define __NR_delete_module 129
__SYSCALL(__NR_delete_module, sys_delete_module)
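
To make the userspace side concrete, here is a minimal sketch of invoking the same system call directly, which is roughly what rmmod ends up doing. It is an illustration rather than rmmod's actual code: the helper names unload_flags and unload_module are invented, and the flag meanings (O_NONBLOCK to avoid blocking, O_TRUNC to request a forced unload) follow the delete_module(2) manual page.

```c
#define _GNU_SOURCE
#include <fcntl.h>        /* O_NONBLOCK, O_TRUNC */
#include <unistd.h>       /* syscall() */
#include <sys/syscall.h>  /* SYS_delete_module */

/* Build the flags argument: never block, optionally ask for a forced unload.
 * The kernel only honours O_TRUNC when built with CONFIG_MODULE_FORCE_UNLOAD. */
unsigned int unload_flags(int force)
{
    unsigned int flags = O_NONBLOCK;

    if (force)
        flags |= O_TRUNC;
    return flags;
}

/* Ask the kernel to remove the named module.
 * Returns 0 on success, or -1 with errno set (EPERM, ENOENT, EBUSY, ...). */
int unload_module(const char *name, int force)
{
    return (int)syscall(SYS_delete_module, name, unload_flags(force));
}
```

A caller without CAP_SYS_MODULE gets EPERM back from unload_module, which matches the first check in the kernel implementation.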

The system call itself is implemented in the kernel's module code, in kernel/module.c.

SYSCALL_DEFINE2(delete_module, const char __user *, name_user,
                unsigned int, flags)
{
        struct module *mod;
        char name[MODULE_NAME_LEN];
        int ret, forced = 0;

        if (!capable(CAP_SYS_MODULE) || modules_disabled)
                return -EPERM;

        if (strncpy_from_user(name, name_user, MODULE_NAME_LEN-1) < 0)
                return -EFAULT;
        name[MODULE_NAME_LEN-1] = '\0';

        if (mutex_lock_interruptible(&module_mutex) != 0)
                return -EINTR;

        // Look up the module to unload by name
        mod = find_module(name);
        if (!mod) {
                ret = -ENOENT;
                goto out;
        }

        // Refuse to unload while other modules still depend on this one
        if (!list_empty(&mod->source_list)) {
                /* Other modules depend on us: get rid of them first. */
                ret = -EWOULDBLOCK;
                goto out;
        }

        // Only a module in the LIVE state may be unloaded
        if (mod->state != MODULE_STATE_LIVE) {
                /* FIXME: if (force), slam module count damn the torpedoes */
                pr_debug("%s already dying\n", mod->name);
                ret = -EBUSY;
                goto out;
        }

        // A module with an init function but no exit function can only be
        // unloaded by force, which requires CONFIG_MODULE_FORCE_UNLOAD
        if (mod->init && !mod->exit) {
                forced = try_force_unload(flags);
                if (!forced) {
                        /* This module can't be removed */
                        ret = -EBUSY;
                        goto out;
                }
        }

        // Stop the module; the kernel will not unload a module that is still in use
        ret = try_stop_module(mod, flags, &forced);
        if (ret != 0)
                goto out;

        mutex_unlock(&module_mutex);
        /* Final destruction now no one is using it. */
        if (mod->exit != NULL)
                mod->exit();
        blocking_notifier_call_chain(&module_notify_list,
                                     MODULE_STATE_GOING, mod);
        async_synchronize_full();

        /* Store the name of the last unloaded module for diagnostic purposes */
        strlcpy(last_unloaded_module, mod->name, sizeof(last_unloaded_module));

        free_module(mod);
        return 0;
out:
        mutex_unlock(&module_mutex);
        return ret;
}

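The function above is simple and easy to follow. It is the normal unload path for a module: it refuses to unload the module when other modules depend on it, when it is not in the live state, or when it has no exit function. These restrictions may look heavy, but they exist for safety and stability. A module in the wrong state really should not be unloaded casually, because once it is gone it can be loaded again, and the new instance may conflict with whatever the old one left behind or corrupt data. Does that make force-unloading pointless? Of course not: for debugging kernel modules, at least, it is very useful.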
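
The order of the checks in delete_module can be summarised in a small userspace model. This is only a sketch for reasoning about the error codes, not kernel code: struct model_module and model_delete_module are hypothetical names, the force argument stands for a forced unload that the kernel actually accepts, and the final check is my reading of what try_stop_module does, which the listing above does not show.

```c
#include <errno.h>
#include <stdbool.h>

enum model_state { MODEL_LIVE, MODEL_COMING, MODEL_GOING };

/* Hypothetical, simplified stand-in for the fields delete_module inspects. */
struct model_module {
    bool has_dependents;      /* other modules still use this one */
    enum model_state state;
    bool has_init;
    bool has_exit;
    int users;                /* references held by anyone else */
};

/* Returns 0 if the unload would go ahead, otherwise a negative errno. */
int model_delete_module(const struct model_module *m, bool force)
{
    if (m->has_dependents)
        return -EWOULDBLOCK;  /* dependents must be removed first */
    if (m->state != MODEL_LIVE)
        return -EBUSY;        /* still loading, or already dying */
    if (m->has_init && !m->has_exit && !force)
        return -EBUSY;        /* no exit function and no forced unload */
    if (m->users != 0 && !force)
        return -EWOULDBLOCK;  /* assumed behaviour of try_stop_module() */
    return 0;
}
```

In this model a module without an exit function fails with -EBUSY, matching the "Device or resource busy" message from rmmod, and a module that is not in the live state is rejected even when force is set.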

Force-unloading a driver with no exit function

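Since a module without an exit function cannot be unloaded, why not simply give it one? And how? From the outside, of course. First write a simple module; the article on implementing a simple Linux kernel module can serve as a reference.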

#include <linux/module.h>
#include <linux/init.h>

static char *modname = NULL;
module_param(modname, charp, 0644);
MODULE_PARM_DESC(modname, "The name of module you wanna clean.\n");

void force_exit(void)
{
        printk(KERN_ALERT "Hey, Thanks for force unload %s.\n", modname);
}

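static int __init rmmod_force_init(void)
{
        struct module *mod;
        int ret = 0;

        // Without a module name there is nothing to look up
        if (modname == NULL)
                return -EINVAL;

        // find_module() expects module_mutex to be held
        mutex_lock(&module_mutex);
        // Look up the module that needs an exit function
        mod = find_module(modname);
        if (mod == NULL) {
                printk(KERN_ALERT "This [%s] not found!\n", modname);
                ret = -ENOENT;
        } else if (mod->exit == NULL) {
                // Point the target module's exit function at force_exit
                mod->exit = force_exit;
        }
        mutex_unlock(&module_mutex);
        return ret;
}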

static void __exit rmmod_force_exit(void)
{
        printk(KERN_ALERT "Bye man, I have been unload.\n");
}

module_init(rmmod_force_init);
module_exit(rmmod_force_exit);
MODULE_AUTHOR("Jackie Liu <liuyun01@kylinos.cn>");
MODULE_LICENSE("GPL");

Build and load this module. Be sure to pass the module parameter, otherwise rmmod_force has no way of knowing which module should be given the exit function².

jackieliu@machine:~/rmmod_force$ sudo insmod rmmod_force.ko modname=hello
jackieliu@machine:~/rmmod_force$ sudo rmmod hello

Afterwards, use lsmod to check whether the hello module has been unloaded, and dmesg to check whether the message from force_exit was printed.

Force-unloading a module that hit an OOPS

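Unloading a module that hit an OOPS is just as simple. From the analysis of the delete_module system call, once the OOPS has happened the module is no longer in a state the normal path accepts and its reference count is not 1, so the unload is refused. Since those two fields are what block the unload, an external module can set the state back to MODULE_STATE_LIVE and the reference count back to 1.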

#include <linux/module.h>
#include <linux/init.h>
#include <asm-generic/local.h>

static char *modname = NULL;
module_param(modname, charp, 0644);
MODULE_PARM_DESC(modname, "The name of module you wanna clean.\n");

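static int __init rmmod_force_init(void)
{
        struct module *mod;
        int ret = 0;

        // Without a module name there is nothing to look up
        if (modname == NULL)
                return -EINVAL;

        // find_module() expects module_mutex to be held
        mutex_lock(&module_mutex);
        mod = find_module(modname);
        if (mod == NULL) {
                printk(KERN_ALERT "This [%s] not found!\n", modname);
                ret = -ENOENT;
                goto out;
        }
        // Mark the module as live so that delete_module accepts it
        mod->state = MODULE_STATE_LIVE;
        // Set mod->refcnt to 1. atomic_set() only compiles when refcnt is a
        // single atomic_t, and such a counter has no per-CPU copies to clear;
        // applying per_cpu_ptr() to it would write to unrelated memory.
        atomic_set(&mod->refcnt, 1);
out:
        mutex_unlock(&module_mutex);
        return ret;
}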

static void __exit rmmod_force_exit(void)
{
        printk(KERN_ALERT "Bye man, I have been unload.\n");
}

module_init(rmmod_force_init);
module_exit(rmmod_force_exit);
MODULE_AUTHOR("Jackie Liu <liuyun01@kylinos.cn>");
MODULE_LICENSE("GPL");
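
Why the target value is 1 rather than 0 can be illustrated with a small userspace model. As far as I can tell, on kernels where refcnt is an atomic_t the module itself holds one base reference, and the unload path only proceeds if removing that base reference leaves nothing behind. The sketch below uses C11 atomics; MODEL_REF_BASE and model_release_ref are hypothetical names, and the logic is a simplification of the kernel's, not a copy of it.

```c
#include <stdatomic.h>

/* Assumed base reference that a loaded, unused module holds on itself. */
#define MODEL_REF_BASE 1

/* Try to drop the base reference.
 * Returns 0 if the module is now unreferenced and may be unloaded,
 * otherwise the number of references still held by others, in which
 * case the base reference is put back. */
int model_release_ref(atomic_int *refcnt)
{
    int left = atomic_fetch_sub(refcnt, MODEL_REF_BASE) - MODEL_REF_BASE;

    if (left != 0)
        atomic_fetch_add(refcnt, MODEL_REF_BASE);   /* still in use: restore */
    return left;
}
```

With the counter at 1 the release succeeds and the unload can continue. With any larger value it fails and the counter is left unchanged, which is the situation that resetting the counter to 1 works around.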

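Forcing the unload this way does ultimately remove the hello module, but it still causes some problems. On my system the symptom was that vim could no longer start and crashed with a segmentation fault.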

jackieliu@jackieliu-virtual-machine:~/rmmod_force$ vim
Segmentation fault (core dumped)
jackieliu@jackieliu-virtual-machine:~/rmmod_force$ dmesg
[ 4705.269554] traps: vim[6326] general protection ip:e6b045b311 sp:7ffd4bc70140 error:0 in vim.basic[e6b030e000+21e000]

Either way, the module can ultimately be unloaded and the original module can be loaded again, which is very convenient when debugging a simple driver. Enjoy it.

References

Force-unloading a Linux kernel module (when a driver fault prevents rmmod from unloading it)

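¹ Sometimes the machine cannot be shut down even though one of its modules has gone wrong. The best option then is to unload the module, rewrite the driver, and load the new version into the kernel.
² Note that rmmod_force must only be unloaded after hello has been unloaded, because hello's exit function is provided by rmmod_force. Otherwise, even though rmmod_force gave hello an exit function, unloading rmmod_force removes that function and leaves hello pointing at code that no longer exists.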

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.