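Analysis of an Intermittent Crash During Core Hotplug at System Boot
1. Preface
While developing a system on an Allwinner platform, we ran into an intermittent crash.
This article walks through how the crash was analyzed.
2. Stack trace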
[ 27.892505] init: open path: /dev/bus/usb/005/002
[ 29.580872] Unable to handle kernel NULL pointer dereference at virtual address 00000004
[ 29.589952] pgd = c0004000
[ 29.593117] [00000004] *pgd=00000000
[ 29.597211] sunxi oops: enable sdcard JTAG interface
[ 29.602744] sunxi oops: cpu frequency: 1008 MHz
[ 29.602963] sunxi oops: ddr frequency: 576 MHz
[ 29.602963] sunxi oops: gpu frequency: 576 MHz
[ 29.602963] sunxi oops: cpu temperature: 66
[ 29.602963] Internal error: Oops: 5 [#1] PREEMPT SMP ARM
[ 29.602963] Modules linked in: 8188eu dummy_acc snd_usb_audio snd_usbmidi_lib snd_hwdep gpio_sunxi sunxi_ir_rx sunxi_sndspdif sndspdif sunxi_spdma sunxi_spdif uvcvideo videobuf_dma_contig videobuf_core mali(O) nand(O) [last unloaded: 8188eu]
[ 29.602963] CPU: 0 Tainted: G W O (3.4.39 #1)
[ 29.602963] PC is at cpufreq_governor_interactive+0x2cc/0x5c8
[ 29.602963] LR is at cpufreq_governor_interactive+0x2cc/0x5c8
[ 29.602963] pc : [<c0406568>] lr : [<c0406568>] psr: 600f0013
[ 29.602963] sp : e6221d90 ip : e6221d90 fp : e6221dd4
[ 29.602963] r10: 00000000 r9 : 00000000 r8 : 00000002
[ 29.602963] r7 : ffffffff r6 : 00000000 r5 : 00000000 r4 : e61c6680
[ 29.602963] r3 : c08d9378 r2 : 012bb000 r1 : 00000000 r0 : c09389b8
[ 29.602963] Flags: nZCv IRQs on FIQs on Mode SVC_32 ISA ARM Segment kernel
[ 29.602963] Control: 10c5387d Table: 648a006a DAC: 00000015
.....
.....
[ 29.602963] [<c0406568>] (cpufreq_governor_interactive+0x2cc/0x5c8) from [<c04007e4>] (__cpufreq_governor+0xd0/0x17c)
[ 29.602963] [<c04007e4>] (__cpufreq_governor+0xd0/0x17c) from [<c0400d9c>] (__cpufreq_remove_dev.isra.13+0x2f0/0x354)
[ 29.602963] [<c0400d9c>] (__cpufreq_remove_dev.isra.13+0x2f0/0x354) from [<c05f7b6c>] (cpufreq_cpu_callback+0x6c/0x88)
[ 29.602963] [<c05f7b6c>] (cpufreq_cpu_callback+0x6c/0x88) from [<c004e05c>] (notifier_call_chain+0x48/0x78)
[ 29.602963] [<c004e05c>] (notifier_call_chain+0x48/0x78) from [<c004e0e8>] (__raw_notifier_call_chain+0x24/0x2c)
[ 29.602963] [<c004e0e8>] (__raw_notifier_call_chain+0x24/0x2c) from [<c0029ca8>] (__cpu_notify+0x3c/0x58)
[ 30.600053] fence timeout on [e1b1ddc0] after 1000ms
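3. Problem analysis
Preliminary conclusion:
. The crash was caused by an access to an illegal address. After a first pass there were two suspects:
-- data loaded from memory was corrupted
-- a bug in the cpufreq driver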
4. Detailed analysis
Step 1: find the faulting instruction
c040596c <cpufreq_governor_interactive>:
c040596c: e1a0c00d mov ip, sp
c0405970: e92ddff0 push {r4, r5, r6, r7, r8, r9, sl, fp, ip, lr, pc}
c0405974: e24cb004 sub fp, ip, #4
c0405978: e24dd01c sub sp, sp, #28
c040597c: e92d4000 push {lr}
.....
.....
c0405c34: eb001352 bl c040a984 <cpufreq_frequency_get_table>
c0405c38: e5953004 ldr r3, [r5, #4] // +0x2cc: the crash is here. r5 is 00000000 in the register dump, so this load hits 00000004, the faulting address reported in the oops
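To double-check the reading of that instruction, the sketch below decodes the instruction word e5953004 by hand and adds the immediate to the value of r5 taken from the register dump. The struct and function names are hypothetical helpers written for illustration, and the decoder only covers the immediate-offset LDR form seen here.

```c
#include <assert.h>
#include <stdint.h>

/* Fields of an ARM (A32) "LDR Rd, [Rn, #imm12]" immediate-offset load. */
struct ldr_imm {
    unsigned rd;      /* destination register number */
    unsigned rn;      /* base register number */
    int32_t  offset;  /* signed byte offset applied to the base */
};

/* Decode one instruction word. Returns 0 on success, or -1 if the word is
 * not a word-sized, immediate-offset LDR without writeback. */
static int decode_ldr_imm(uint32_t insn, struct ldr_imm *out)
{
    /* bits 27..26 = 01, I(25) = 0, P(24) = 1, B(22) = 0, W(21) = 0, L(20) = 1 */
    if ((insn & 0x0f700000u) != 0x05100000u)
        return -1;

    out->rn = (insn >> 16) & 0xfu;
    out->rd = (insn >> 12) & 0xfu;
    out->offset = (int32_t)(insn & 0xfffu);
    if (!(insn & (1u << 23)))       /* U(23) = 0 means the offset is subtracted */
        out->offset = -out->offset;
    return 0;
}

/* Address the load touches, given the current value of the base register. */
static uint32_t ldr_target(const struct ldr_imm *d, uint32_t rn_value)
{
    return rn_value + (uint32_t)d->offset;
}
```

Calling decode_ldr_imm(0xe5953004, &d) yields rn = 5, rd = 3, offset = 4, and ldr_target(&d, 0x00000000) gives 0x00000004, the same address the oops reports. The same kind of arithmetic places the instruction in this listing: c040596c + 0x2cc = c0405c38.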
Step 2: inspect cpufreq_frequency_get_table
c040a984 <cpufreq_frequency_get_table>: // nothing in this function touches r5
c040a984: e1a0c00d mov ip, sp
c040a988: e92dd800 push {fp, ip, lr, pc}
c040a98c: e24cb004 sub fp, ip, #4
c040a990: e92d4000 push {lr}
c040a994: ebf00e1b bl c000e208 <__gnu_mcount_nc>
c040a998: e59f200c ldr r2, [pc, #12] ; c040a9ac <cpufreq_frequency_get_table+0x28>
c040a99c: e59f300c ldr r3, [pc, #12] ; c040a9b0 <cpufreq_frequency_get_table+0x2c>
c040a9a0: e7922100 ldr r2, [r2, r0, lsl #2]
c040a9a4: e7930002 ldr r0, [r3, r2]
c040a9a8: e89da800 ldm sp, {fp, sp, pc}
c040a9ac: c08fcc70 .word 0xc08fcc70
c040a9b0: c08d8378 .word 0xc08d8378
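Step 3: looking further up the disassembly, r5 is assigned very early in the function
Use gdb to map the faulting address to a source line: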
(gdb) b*0xc0405c38
Note: breakpoint 1 also set at pc 0xc0405c38.
Breakpoint 2 at 0xc0405c38: file drivers/cpufreq/cpufreq_interactive.c, line 1561.
C source:
1560 freq_table = cpufreq_frequency_get_table(policy->cpu);
1561 if (!tunables->hispeed_freq) { // the fault is here
1562 #if defined(CONFIG_ARCH_SUN9IW1P1)
Confirm once more that the source line matches the disassembly:
c0405c34: eb001352 bl c040a984 <cpufreq_frequency_get_table>
c0405c38: e5953004 ldr r3, [r5, #4]
c0405c3c: e59f82dc ldr r8, [pc, #732] ; c0405f20 <cpufreq_governor_interactive+0x5b4>
c0405c40: e50b0040 str r0, [fp, #-64] ; 0x40
c0405c44: e3530000 cmp r3, #0 // this does match: if (!tunables->hispeed_freq) {
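The match between `tunables->hispeed_freq` and `ldr r3, [r5, #4]` implies that hispeed_freq sits 4 bytes into the tunables structure and that r5 holds the tunables pointer. The following sketch illustrates that with a hypothetical, trimmed-down struct; the member order is an assumption made for illustration, not a copy of the vendor source.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the governor's tunables structure. Only the
 * first two members matter here; the real layout is an assumption. */
struct tunables_sketch {
    int          usage_count;   /* offset 0 */
    unsigned int hispeed_freq;  /* offset 4 when int is 4 bytes wide */
};

/* Address that "&t->hispeed_freq" evaluates to for a given pointer value.
 * Integer arithmetic is used so that no invalid pointer is dereferenced. */
static uintptr_t hispeed_freq_addr(uintptr_t tunables_ptr)
{
    return tunables_ptr + offsetof(struct tunables_sketch, hispeed_freq);
}
```

With a NULL tunables pointer, hispeed_freq_addr(0) is 4, which lines up with the fault at virtual address 00000004.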
Step 4: from the stack trace, the call path is __cpufreq_remove_dev() -> __cpufreq_governor(START) -> cpufreq_governor_interactive(START)
The offending code:
static int cpufreq_governor_interactive(struct cpufreq_policy *policy,
		unsigned int event)
{
	.....
	/* tunables is assigned here; the switch below then jumps to CPUFREQ_GOV_START */
	if (have_governor_per_policy())
		tunables = policy->governor_data;
	else
		tunables = common_tunables;	/* taken when there is only one class of CPU */
	WARN_ON(!tunables && (event != CPUFREQ_GOV_POLICY_INIT));
	switch (event) {
	case CPUFREQ_GOV_START:
		mutex_lock(&gov_lock);
		freq_table = cpufreq_frequency_get_table(policy->cpu);
		if (!tunables->hispeed_freq) {	/* crashes here when tunables is NULL */
		}
	.....
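Suspicions: 1. the variable was modified; 2. the cpufreq interactive governor had already exited.
5. Investigation
(1) Print the event type (governor start, stop, exit, and so on) in the interactive governor callback, then watch the sequence of events as the system boots and enters the factory test.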
[ 34.181471] **[interactive] event = 2
[ 34.185587] **common_tunables addr: e1a50840
[ 34.191101] **tunables addr: e1a50840
[ 34.195176] **[interactive] event = 5
[ 34.199333] **common_tunables addr: e1a50840
[ 34.204289] **tunables addr: e1a50840
Result: the interactive governor is being exited.
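The raw event numbers in the log are easier to follow once they are mapped to names. The sketch below assumes the vendor 3.4.39 tree uses the same governor event values as mainline include/linux/cpufreq.h of that era (START = 1, STOP = 2, LIMITS = 3, POLICY_INIT = 4, POLICY_EXIT = 5); the enum and helper names are hypothetical.

```c
#include <assert.h>
#include <string.h>

/* Governor event codes. The values follow mainline cpufreq.h of that era;
 * that the vendor kernel matches them is an assumption. */
enum gov_event {
    GOV_START       = 1,
    GOV_STOP        = 2,
    GOV_LIMITS      = 3,
    GOV_POLICY_INIT = 4,
    GOV_POLICY_EXIT = 5,
};

/* Map an event code to a printable name for debug output. */
static const char *gov_event_name(unsigned int event)
{
    switch (event) {
    case GOV_START:       return "CPUFREQ_GOV_START";
    case GOV_STOP:        return "CPUFREQ_GOV_STOP";
    case GOV_LIMITS:      return "CPUFREQ_GOV_LIMITS";
    case GOV_POLICY_INIT: return "CPUFREQ_GOV_POLICY_INIT";
    case GOV_POLICY_EXIT: return "CPUFREQ_GOV_POLICY_EXIT";
    default:              return "unknown";
    }
}
```

Under that assumption, "event = 2" followed by "event = 5" reads as a governor stop followed by a policy exit, which is consistent with the conclusion that the interactive governor was torn down.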
(2) Add a dump_stack() call at the point where the interactive governor exits
[ 34.181471] **[interactive] event = 2
[ 34.185587] **common_tunables addr: e1a50840
[ 34.191101] **tunables addr: e1a50840
[ 34.195176] **[interactive] event = 5
[ 34.199333] **common_tunables addr: e1a50840
[ 34.204289] **tunables addr: e1a50840
[ 34.208502] [<c00169fc>] (unwind_backtrace+0x0/0xec) from [<c05f8190>] (dump_stack+0x20/0x24)
[ 34.218108] [<c05f8190>] (dump_stack+0x20/0x24) from [<c0406524>] (cpufreq_governor_interactive+0x270/0x658)
[ 34.229727] [<c0406524>] (cpufreq_governor_interactive+0x270/0x658) from [<c04007e4>] (__cpufreq_governor+0xd0/0x17c)
[ 34.241874] [<c04007e4>] (__cpufreq_governor+0xd0/0x17c) from [<c0400f9c>] (__cpufreq_set_policy+0x130/0x1d0)
[ 34.253429] [<c0400f9c>] (__cpufreq_set_policy+0x130/0x1d0) from [<c0401808>] (store_scaling_governor+0x13c/0x17c)
[ 34.265173] [<c0401808>] (store_scaling_governor+0x13c/0x17c) from [<c04005bc>] (store+0x6c/0x94)
[ 34.275511] [<c04005bc>] (store+0x6c/0x94) from [<c0166a58>] (sysfs_write_file+0x118/0x14c)
[ 34.285090] [<c0166a58>] (sysfs_write_file+0x118/0x14c) from [<c010f2a4>] (vfs_write+0xc4/0x140)
Result: an upper-layer application is switching the cpufreq governor, and that switch is what triggers the crash.
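The sysfs_write_file -> store_scaling_governor path in the trace is what a plain write to the scaling_governor node from user space looks like. As a rough illustration of what the application is presumably doing, here is a minimal sketch; write_governor is a hypothetical helper, not code taken from the application.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Write a governor name to a scaling_governor-style file.
 * Returns 0 on success or a negative errno value on failure. */
static int write_governor(const char *path, const char *governor)
{
    FILE *f = fopen(path, "w");
    int rc = 0;

    if (!f)
        return -errno;
    if (fputs(governor, f) == EOF)
        rc = -EIO;
    /* Buffered data is flushed on close, so a failure here also counts. */
    if (fclose(f) == EOF && rc == 0)
        rc = -errno;
    return rc;
}
```

On a target board this would be called as write_governor("/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor", "performance"). Each such write leads the kernel through __cpufreq_set_policy, which stops and exits the old governor before starting the new one.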
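6. Summary
The complete analysis:
(1) The stack traces show that the problem is caused by an upper-layer application changing the frequency-scaling governor on its own. In short, after boot the governor is switched from performance to interactive, and then CPU hotplug is enabled so that the system scales frequency and brings cores online on demand.
(2) There is one special case. If upper-layer software switches from interactive to another governor through the sysfs node, the interactive governor is first stopped and exited, and its resources are released: for example, common_tunables = NULL and policy->governor_data = NULL. The policy itself (the policy set that every CPU has) is not released. If the switch has not yet completed at that moment (policy->governor == interactive), bringing a core online or offline uses the interactive governor's resources directly, which dereferences a NULL pointer and crashes the system.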
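The sequence described in (2) can be reproduced in a small user-space model. The sketch below is not the governor's code and not a fix from the author: the event values, the struct layout, and the policy_max fallback for hispeed_freq are assumptions, and the NULL check in the START case is only one possible defensive measure.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

/* Event codes, assumed to match mainline cpufreq.h of that era. */
enum { EV_START = 1, EV_STOP = 2, EV_POLICY_INIT = 4, EV_POLICY_EXIT = 5 };

/* Hypothetical, trimmed-down tunables structure. */
struct tunables_sketch {
    int          usage_count;
    unsigned int hispeed_freq;
};

/* Stand-in for the governor's shared tunables pointer. */
static struct tunables_sketch *common_tunables;

/* Model of the governor callback: POLICY_EXIT frees the tunables,
 * and START refuses to run once they are gone. */
static int governor_event(unsigned int event, unsigned int policy_max)
{
    struct tunables_sketch *t = common_tunables;

    switch (event) {
    case EV_POLICY_INIT:
        if (!t) {
            t = calloc(1, sizeof(*t));
            if (!t)
                return -ENOMEM;
            common_tunables = t;
        }
        t->usage_count++;
        return 0;
    case EV_POLICY_EXIT:
        if (t && --t->usage_count == 0) {
            free(t);
            common_tunables = NULL;   /* resources released on exit */
        }
        return 0;
    case EV_START:
        if (!t)                       /* guard: tunables already torn down */
            return -EINVAL;
        if (!t->hispeed_freq)
            t->hispeed_freq = policy_max;
        return 0;
    case EV_STOP:
        return 0;
    default:
        return -EINVAL;
    }
}
```

Running INIT, START, STOP, POLICY_EXIT and then START again returns -EINVAL on the last call instead of dereferencing NULL. In a real kernel such a check would turn the oops into an error return, but it would not by itself close the race between the governor switch and hotplug; that needs proper serialization, or the application must stop switching governors at that point.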
Author: free-jdx