Analysis of an Intermittent Crash During CPU Core Hotplug at System Startup

free-jdx 2020-12-30 16:36:48
1. Preface

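While developing a system on an Allwinner platform, we hit an intermittent crash.
This article walks through how the crash was analyzed.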

2. Stack trace
[   27.892505] init: open path: /dev/bus/usb/005/002
[   29.580872] Unable to handle kernel NULL pointer dereference at virtual address 00000004
[   29.589952] pgd = c0004000
[   29.593117] [00000004] *pgd=00000000
[   29.597211] sunxi oops: enable sdcard JTAG interface
[   29.602744] sunxi oops: cpu frequency: 1008 MHz
[   29.602963] sunxi oops: ddr frequency: 576 MHz
[   29.602963] sunxi oops: gpu frequency: 576 MHz
[   29.602963] sunxi oops: cpu temperature: 66
[   29.602963] Internal error: Oops: 5 [#1] PREEMPT SMP ARM
[   29.602963] Modules linked in: 8188eu dummy_acc snd_usb_audio snd_usbmidi_lib snd_hwdep gpio_sunxi sunxi_ir_rx sunxi_sndspdif sndspdif sunxi_spdma sunxi_spdif uvcvideo videobuf_dma_contig videobuf_core mali(O) nand(O) [last unloaded: 8188eu]
[   29.602963] CPU: 0    Tainted: G        W  O  (3.4.39 #1)
[   29.602963] PC is at cpufreq_governor_interactive+0x2cc/0x5c8
[   29.602963] LR is at cpufreq_governor_interactive+0x2cc/0x5c8
[   29.602963] pc : [<c0406568>]    lr : [<c0406568>]    psr: 600f0013
[   29.602963] sp : e6221d90  ip : e6221d90  fp : e6221dd4
[   29.602963] r10: 00000000  r9 : 00000000  r8 : 00000002
[   29.602963] r7 : ffffffff  r6 : 00000000  r5 : 00000000  r4 : e61c6680
[   29.602963] r3 : c08d9378  r2 : 012bb000  r1 : 00000000  r0 : c09389b8
[   29.602963] Flags: nZCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment kernel
[   29.602963] Control: 10c5387d  Table: 648a006a  DAC: 00000015
.....
.....
[   29.602963] [<c0406568>] (cpufreq_governor_interactive+0x2cc/0x5c8) from [<c04007e4>] (__cpufreq_governor+0xd0/0x17c)
[   29.602963] [<c04007e4>] (__cpufreq_governor+0xd0/0x17c) from [<c0400d9c>] (__cpufreq_remove_dev.isra.13+0x2f0/0x354)
[   29.602963] [<c0400d9c>] (__cpufreq_remove_dev.isra.13+0x2f0/0x354) from [<c05f7b6c>] (cpufreq_cpu_callback+0x6c/0x88)
[   29.602963] [<c05f7b6c>] (cpufreq_cpu_callback+0x6c/0x88) from [<c004e05c>] (notifier_call_chain+0x48/0x78)
[   29.602963] [<c004e05c>] (notifier_call_chain+0x48/0x78) from [<c004e0e8>] (__raw_notifier_call_chain+0x24/0x2c)
[   29.602963] [<c004e0e8>] (__raw_notifier_call_chain+0x24/0x2c) from [<c0029ca8>] (__cpu_notify+0x3c/0x58)
[   30.600053] fence timeout on [e1b1ddc0] after 1000ms
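
The "Oops: 5" number and the fault address in the log above can be read by hand. The sketch below assumes the classic (non-LPAE) ARM short-descriptor fault status layout: bits [3:0] plus bit 10 form the fault status and bit 11 is the write flag. The helper names are hypothetical and only the two translation-fault codes are named.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helpers for reading the "Oops: <fsr>" number by hand.
 * ASSUMPTION: ARM short-descriptor (non-LPAE) fault status layout. */

/* Fault status: FS[3:0] in bits 3..0, FS[4] in bit 10. */
static unsigned fsr_status(uint32_t fsr)
{
    return (fsr & 0xfu) | ((fsr >> 6) & 0x10u);
}

/* Bit 11 (WnR) is set when the aborting access was a write. */
static int fsr_is_write(uint32_t fsr)
{
    return (int)((fsr >> 11) & 1u);
}

static const char *fsr_name(uint32_t fsr)
{
    switch (fsr_status(fsr)) {
    case 0x05: return "section translation fault";
    case 0x07: return "page translation fault";
    default:   return "other (not decoded in this sketch)";
    }
}

/* Values copied from the oops log above. */
static void check_against_log(void)
{
    const uint32_t fsr = 5;            /* "Oops: 5" */
    const uint32_t addr = 0x00000004;  /* "virtual address 00000004" */

    assert(!fsr_is_write(fsr));        /* the faulting access was a read */
    assert(strcmp(fsr_name(fsr), "section translation fault") == 0);
    assert(addr < 4096);               /* first page: NULL plus a small offset */
}
```

Read this way, the log describes a read through a NULL base pointer at offset 4, which agrees with "*pgd=00000000" (no mapping for that address).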
3. Problem analysis

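Preliminary conclusion:
. The crash is caused by an access to an invalid address. After investigation there are two suspects:
-- the data loaded from memory is corrupted
-- the cpufreq driver has a bug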

4. Detailed analysis
Step 1: locate the faulting instruction in the disassembly
c040596c <cpufreq_governor_interactive>:
c040596c:   e1a0c00d    mov ip, sp
c0405970:   e92ddff0    push    {r4, r5, r6, r7, r8, r9, sl, fp, ip, lr, pc}
c0405974:   e24cb004    sub fp, ip, #4
c0405978:   e24dd01c    sub sp, sp, #28
c040597c:   e92d4000    push    {lr}
.....
.....
c0405c34:   eb001352    bl  c040a984 <cpufreq_frequency_get_table>
c0405c38:   e5953004    ldr r3, [r5, #4]             //+0x2cc: the crash is here, so r5 holds an invalid address (r5 is 00000000 in the oops, hence the fault at 00000004)
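
The two facts used in this step can be checked mechanically: the symbol start plus 0x2cc gives the address of the faulting instruction, and the word e5953004 is a load from [r5, #4]. The sketch below is a minimal decoder for the A32 "LDR Rt, [Rn, #imm12]" immediate-offset form only; the function and struct names are hypothetical, and the condition field is ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fields of an A32 "LDR Rt, [Rn, #imm12]" word (immediate-offset form). */
struct ldr_imm {
    unsigned rt;      /* destination register */
    unsigned rn;      /* base register */
    int32_t  offset;  /* signed byte offset (the U bit selects add/subtract) */
};

/* Returns 1 and fills *out if insn is a word-sized LDR with an immediate
 * offset and no write-back; returns 0 for anything else. */
static int decode_ldr_imm(uint32_t insn, struct ldr_imm *out)
{
    assert(out != NULL);
    /* bits 27..25 = 010, P = 1 (bit 24), B = 0 (bit 22), W = 0 (bit 21),
     * L = 1 (bit 20); the U bit (23) may be either value. */
    if ((insn & 0x0f700000u) != 0x05100000u)
        return 0;
    out->rn = (insn >> 16) & 0xfu;
    out->rt = (insn >> 12) & 0xfu;
    out->offset = (int32_t)(insn & 0xfffu);
    if (!(insn & (1u << 23)))  /* U = 0: the offset is subtracted */
        out->offset = -out->offset;
    return 1;
}

/* Values copied from the oops log and the disassembly. */
static void check_against_log(void)
{
    struct ldr_imm d;
    const uint32_t r5 = 0x00000000;  /* r5 in the oops register dump */

    /* symbol start + 0x2cc is the faulting instruction */
    assert(0xc040596cu + 0x2ccu == 0xc0405c38u);

    assert(decode_ldr_imm(0xe5953004u, &d) == 1);
    assert(d.rn == 5 && d.rt == 3 && d.offset == 4);

    /* effective address = r5 + 4 = the reported fault address */
    assert(r5 + (uint32_t)d.offset == 0x00000004u);
}
```

With r5 equal to zero in the register dump, the effective address is exactly the 00000004 reported by the kernel.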

Step 2: check cpufreq_frequency_get_table
c040a984 <cpufreq_frequency_get_table>:                     //no write to r5 found
c040a984:   e1a0c00d    mov ip, sp
c040a988:   e92dd800    push    {fp, ip, lr, pc}
c040a98c:   e24cb004    sub fp, ip, #4
c040a990:   e92d4000    push    {lr}
c040a994:   ebf00e1b    bl  c000e208 <__gnu_mcount_nc>
c040a998:   e59f200c    ldr r2, [pc, #12]   ; c040a9ac <cpufreq_frequency_get_table+0x28>
c040a99c:   e59f300c    ldr r3, [pc, #12]   ; c040a9b0 <cpufreq_frequency_get_table+0x2c>
c040a9a0:   e7922100    ldr r2, [r2, r0, lsl #2]
c040a9a4:   e7930002    ldr r0, [r3, r2]
c040a9a8:   e89da800    ldm sp, {fp, sp, pc}
c040a9ac:   c08fcc70    .word   0xc08fcc70
c040a9b0:   c08d8378    .word   0xc08d8378

Step 3: looking further up the disassembly, r5 was assigned much earlier
Use gdb to map the faulting address to a source line
(gdb) b*0xc0405c38
Note: breakpoint 1 also set at pc 0xc0405c38.
Breakpoint 2 at 0xc0405c38: file drivers/cpufreq/cpufreq_interactive.c, line 1561.

C source:
1560         freq_table = cpufreq_frequency_get_table(policy->cpu);
1561         if (!tunables->hispeed_freq) {                             //the fault is here
1562 #if defined(CONFIG_ARCH_SUN9IW1P1)

Double-check that the disassembly matches the source:
c0405c34:   eb001352    bl  c040a984 <cpufreq_frequency_get_table>
c0405c38:   e5953004    ldr r3, [r5, #4]
c0405c3c:   e59f82dc    ldr r8, [pc, #732]  ; c0405f20 <cpufreq_governor_interactive+0x5b4>
c0405c40:   e50b0040    str r0, [fp, #-64]  ; 0x40
c0405c44:   e3530000    cmp r3, #0 //it does match:   if (!tunables->hispeed_freq) {
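
The match between "ldr r3, [r5, #4]" and "tunables->hispeed_freq" implies that hispeed_freq sits 4 bytes into the tunables structure and that r5 holds the tunables pointer. The sketch below illustrates this with offsetof. The struct is an ASSUMED stand-in holding only the first two members; the real structure in cpufreq_interactive.c has more fields, and its layout should be confirmed against the actual source tree.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* ASSUMED layout, first two members only. It is consistent with the
 * "#4" offset in the disassembly but is not copied from the vendor tree. */
struct tunables_sketch {
    int          usage_count;   /* offset 0 */
    unsigned int hispeed_freq;  /* offset 4 with a 4-byte int */
};

/* Address the CPU reads when evaluating ptr->hispeed_freq. */
static uintptr_t hispeed_freq_addr(uintptr_t tunables_ptr)
{
    return tunables_ptr + offsetof(struct tunables_sketch, hispeed_freq);
}

static void check_against_log(void)
{
    /* A NULL tunables pointer yields the fault address from the oops. */
    assert(hispeed_freq_addr(0) == 0x00000004);
}
```

Under that assumption, a NULL tunables pointer reproduces the reported fault address, which points at tunables itself rather than at corrupted memory contents.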

Step 4: from the stack trace, the call path is __cpufreq_remove_dev() -> __cpufreq_governor(START) -> cpufreq_governor_interactive(START)
  Faulting code:
static int cpufreq_governor_interactive(struct cpufreq_policy *policy,
        unsigned int event)
{
    if (have_governor_per_policy())  //tunables is assigned here; the switch below then jumps to the CPUFREQ_GOV_START case
        tunables = policy->governor_data;
    else
        tunables = common_tunables;  //taken when the CPUs form a single class

    WARN_ON(!tunables && (event != CPUFREQ_GOV_POLICY_INIT));

    switch (event) {
    case CPUFREQ_GOV_START:
        mutex_lock(&gov_lock);

        freq_table = cpufreq_frequency_get_table(policy->cpu);
        if (!tunables->hispeed_freq) {  //crashes here
        }

Suspicions: 1. the variable was modified 2. the cpufreq interactive governor had already exited

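5. Investigation

(1) Add a print of the event type (governor start, stop, exit, and so on) to the interactive governor callback, then watch the sequence of events after the system boots and enters the factory test.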

[   34.181471] **[interactive] event = 2
[   34.185587] **common_tunables addr: e1a50840
[   34.191101] **tunables addr: e1a50840
[   34.195176] **[interactive] event = 5
[   34.199333] **common_tunables addr: e1a50840
[   34.204289] **tunables addr: e1a50840

Result: the interactive governor is stopped and then exited.
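
The raw event numbers in the log are easier to read as names. The sketch below maps them using the values from mainline include/linux/cpufreq.h of the kernels that introduced per-policy governors; that this vendor tree uses the same numbering is an ASSUMPTION, so check its own header. The function name is hypothetical.

```c
#include <assert.h>
#include <string.h>

/* ASSUMED event numbering (mainline cpufreq.h of that era). */
enum {
    GOV_START       = 1,
    GOV_STOP        = 2,
    GOV_LIMITS      = 3,
    GOV_POLICY_INIT = 4,
    GOV_POLICY_EXIT = 5
};

/* Human-readable name for a governor event number. */
static const char *gov_event_name(unsigned int event)
{
    switch (event) {
    case GOV_START:       return "CPUFREQ_GOV_START";
    case GOV_STOP:        return "CPUFREQ_GOV_STOP";
    case GOV_LIMITS:      return "CPUFREQ_GOV_LIMITS";
    case GOV_POLICY_INIT: return "CPUFREQ_GOV_POLICY_INIT";
    case GOV_POLICY_EXIT: return "CPUFREQ_GOV_POLICY_EXIT";
    default:              return "unknown";
    }
}

static void check_against_log(void)
{
    /* The log shows "event = 2" followed by "event = 5". */
    assert(strcmp(gov_event_name(2), "CPUFREQ_GOV_STOP") == 0);
    assert(strcmp(gov_event_name(5), "CPUFREQ_GOV_POLICY_EXIT") == 0);
}
```

With this numbering, "event = 2" followed by "event = 5" reads as a stop followed by a policy exit, which is the teardown sequence of the governor.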

(2) Add dump_stack() at the point where the interactive governor exits

[   34.181471] **[interactive] event = 2
[   34.185587] **common_tunables addr: e1a50840
[   34.191101] **tunables addr: e1a50840
[   34.195176] **[interactive] event = 5
[   34.199333] **common_tunables addr: e1a50840
[   34.204289] **tunables addr: e1a50840
[   34.208502] [<c00169fc>] (unwind_backtrace+0x0/0xec) from [<c05f8190>] (dump_stack+0x20/0x24)
[   34.218108] [<c05f8190>] (dump_stack+0x20/0x24) from [<c0406524>] (cpufreq_governor_interactive+0x270/0x658)
[   34.229727] [<c0406524>] (cpufreq_governor_interactive+0x270/0x658) from [<c04007e4>] (__cpufreq_governor+0xd0/0x17c)
[   34.241874] [<c04007e4>] (__cpufreq_governor+0xd0/0x17c) from [<c0400f9c>] (__cpufreq_set_policy+0x130/0x1d0)
[   34.253429] [<c0400f9c>] (__cpufreq_set_policy+0x130/0x1d0) from [<c0401808>] (store_scaling_governor+0x13c/0x17c)
[   34.265173] [<c0401808>] (store_scaling_governor+0x13c/0x17c) from [<c04005bc>] (store+0x6c/0x94)
[   34.275511] [<c04005bc>] (store+0x6c/0x94) from [<c0166a58>] (sysfs_write_file+0x118/0x14c)
[   34.285090] [<c0166a58>] (sysfs_write_file+0x118/0x14c) from [<c010f2a4>] (vfs_write+0xc4/0x140)

Result: an upper-layer application switches the cpufreq governor, and that operation is what triggers the crash.

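6. Summary

Full analysis:
(1) The stack traces show that an upper-layer application changing the cpufreq governor on its own triggers the problem. In short, after boot the governor is switched from performance to interactive, and then cpu hotplug is enabled so that the system scales frequency and brings cores online on demand.
(2) There is one special case. When upper-layer software switches from interactive to another governor through the sysfs node, the interactive governor is first stopped and then exited, which releases its resources, for example common_tunables = NULL and policy->governor_data = NULL. The policy itself (the per-CPU policy set) is not released. If the switch has not finished at that point (policy->governor == interactive), bringing a core online or offline uses the interactive governor's resources directly and crashes on the NULL pointer.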
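
The ordering described in (2) can be modeled in user space. The sketch below is not the fix used by the author (the article does not state one); it only replays the sequence INIT, START, STOP, EXIT, START against a simplified governor callback and shows one possible defensive check. All names, the -1 error value, and the max_freq parameter are hypothetical simplifications.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the governor state (hypothetical). */
struct tunables {
    unsigned int hispeed_freq;
};

static struct tunables tunables_storage;
static struct tunables *common_tunables;  /* NULL again after POLICY_EXIT */

enum { EV_START = 1, EV_STOP = 2, EV_POLICY_INIT = 4, EV_POLICY_EXIT = 5 };

/* Model of the governor callback. Returns 0 on success, -1 on error. */
static int governor_event(unsigned int event, unsigned int max_freq)
{
    struct tunables *t = common_tunables;

    switch (event) {
    case EV_POLICY_INIT:
        tunables_storage.hispeed_freq = 0;
        common_tunables = &tunables_storage;
        return 0;
    case EV_STOP:
        return 0;
    case EV_POLICY_EXIT:
        common_tunables = NULL;        /* resources released */
        return 0;
    case EV_START:
        if (t == NULL)                 /* guard: governor already exited */
            return -1;
        if (!t->hispeed_freq)          /* the line that faulted in the oops */
            t->hispeed_freq = max_freq;
        return 0;
    default:
        return -1;
    }
}

/* Replays the sequence from the summary; returns the result of the
 * late START issued by the hotplug path after the governor has exited. */
static int replay_hotplug_after_exit(void)
{
    assert(governor_event(EV_POLICY_INIT, 0) == 0);
    assert(governor_event(EV_START, 1008000) == 0);
    assert(governor_event(EV_STOP, 0) == 0);
    assert(governor_event(EV_POLICY_EXIT, 0) == 0);
    return governor_event(EV_START, 1008000);
}
```

In this model the late START is rejected instead of dereferencing NULL. In a real kernel a check like this may still leave a window unless the check and the teardown are serialized by the same lock, so it should be read as an illustration of the failure ordering rather than a complete fix.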
