A MongoDB rollback failure: replSet too much data to roll back
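Environment
- Operating system: CentOS Linux release 8.2.2004 (Core)
- MongoDB version: 3.6.21
Problem description
A production MongoDB node crashed. It was restarted automatically, but shortly afterwards it could no longer start at all. The log showed that a rollback had failed: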
$ grep 'rsBackgroundSync' replication.log
2022-04-25T15:56:22.777+0800 I REPL [rsBackgroundSync] Starting rollback due to OplogStartMissing: Our last op time fetched: { ts: Timestamp(1650870067, 9), t: 149 }. source's GTE: { ts: Timestamp(1650870144, 1), t: 150 } hashes: (-4603273711463716908/-773514121576334543)
2022-04-25T15:56:22.777+0800 I REPL [rsBackgroundSync] Replication commit point: { ts: Timestamp(0, 0), t: -1 }
2022-04-25T15:56:22.777+0800 I REPL [rsBackgroundSync] Rollback using the 'rollbackViaRefetch' method because UUID support is feature compatible with featureCompatibilityVersion 3.6.
2022-04-25T15:56:22.777+0800 I REPL [rsBackgroundSync] transition to ROLLBACK from SECONDARY
2022-04-25T15:56:22.777+0800 I NETWORK [rsBackgroundSync] Skip closing connection for connection # 1
2022-04-25T15:56:22.777+0800 I ROLLBACK [rsBackgroundSync] Starting rollback. Sync source: 172.16.31.47:27018
2022-04-25T15:56:22.779+0800 I ROLLBACK [rsBackgroundSync] Finding the Common Point
2022-04-25T15:56:22.782+0800 I ROLLBACK [rsBackgroundSync] our last optime: Timestamp(1650870067, 9)
2022-04-25T15:56:22.782+0800 I ROLLBACK [rsBackgroundSync] their last optime: Timestamp(1650873382, 167)
2022-04-25T15:56:22.782+0800 I ROLLBACK [rsBackgroundSync] diff in end of log times: -3315 seconds
2022-04-25T15:56:45.012+0800 I ROLLBACK [rsBackgroundSync] Rollback common point is { ts: Timestamp(1650869833, 2586), t: 149 }
2022-04-25T15:56:45.012+0800 I REPL [rsBackgroundSync] Incremented the rollback ID to 22
2022-04-25T15:56:45.012+0800 I ROLLBACK [rsBackgroundSync] Starting refetching documents
2022-04-25T15:57:58.580+0800 I ROLLBACK [rsBackgroundSync] Rollback finished. The final minValid is: { ts: Timestamp(1650776551, 102), t: 148 }
2022-04-25T15:57:58.580+0800 F ROLLBACK [rsBackgroundSync] Unable to complete rollback. A full resync may be needed: UnrecoverableRollbackError: replSet too much data to roll back.
2022-04-25T15:57:58.580+0800 F - [rsBackgroundSync] Fatal Assertion 40507 at src/mongo/db/repl/rs_rollback.cpp 1516
2022-04-25T15:57:58.580+0800 F - [rsBackgroundSync] \n\n***aborting after fassert() failure\n\n
The error: Unable to complete rollback. A full resync may be needed: UnrecoverableRollbackError: replSet too much data to roll back
I searched online and found no documented fix.
Why a rollback can fail is covered in other people's articles, so I won't repeat it here.
Resolution
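The timestamps in the log can be cross-checked with a few lines of Python. This is only a sketch: it assumes the seconds part of each Timestamp(...) is a Unix epoch value, and the names `optime_seconds`, `end_of_log_diff` and `rollback_window` are made up for illustration. It reproduces the "diff in end of log times" value and shows how far the local oplog had run past the common point.

```python
import re

# Message text copied from the ROLLBACK lines in the log above.
LOG_LINES = [
    "our last optime: Timestamp(1650870067, 9)",
    "their last optime: Timestamp(1650873382, 167)",
    "Rollback common point is { ts: Timestamp(1650869833, 2586), t: 149 }",
]

TS_RE = re.compile(r"Timestamp\((\d+), (\d+)\)")


def optime_seconds(line):
    """Return the seconds part of the first Timestamp(...) found on the line."""
    match = TS_RE.search(line)
    if match is None:
        raise ValueError(f"no Timestamp found in: {line!r}")
    return int(match.group(1))


ours, theirs, common = (optime_seconds(line) for line in LOG_LINES)

# Same value as the "diff in end of log times" log line.
end_of_log_diff = ours - theirs
# Seconds of local oplog after the common point, i.e. the span being rolled back.
rollback_window = ours - common

print(end_of_log_diff)  # -3315
print(rollback_window)  # 234
```

The window only measures time. The limit that triggers the fatal error is on the total size of the documents that have to be refetched, so a short window can still exceed it on a write-heavy node.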
https://jira.mongodb.org/browse/SERVER-47918
Under condition #1 the 300MB rollback limit is no longer enforced post-4.0
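According to this ticket, the limit is no longer enforced from 4.0 onwards.
So could it simply be a hard-coded limit? And if it is, could it be lifted by changing the code?
Let's check out the source and take a look.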
$ git checkout r3.6.21
Open src/mongo/db/repl/rs_rollback.cpp in vim and go to line 1028:
// Checks that the total amount of data that needs to be refetched is at most
// 300 MB. We do not roll back more than 300 MB of documents in order to
// prevent out of memory errors from too much data being stored. See SERVER-23392.
if (totalSize >= 300 * 1024 * 1024) {
    throw RSFatalException("replSet too much data to roll back.");
}
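It turns out to be nothing more than a hard-coded limit which, according to its own comment, exists only to guard against out-of-memory errors. So I commented out this if block and recompiled mongod.
I won't go into the build process here. After that, the problem was solved.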
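For reference, the check that was removed can be paraphrased in Python as below. This is a hypothetical sketch, not MongoDB's implementation: `check_refetch_size` and `ROLLBACK_REFETCH_LIMIT_BYTES` are invented names, RuntimeError stands in for RSFatalException, and it assumes the total is tested as document sizes accumulate. Passing limit=None models the patched binary with the if block commented out.

```python
# Same arithmetic as the C++ condition: 300 * 1024 * 1024 bytes.
ROLLBACK_REFETCH_LIMIT_BYTES = 300 * 1024 * 1024  # 314572800


def check_refetch_size(doc_sizes, limit=ROLLBACK_REFETCH_LIMIT_BYTES):
    """Sum the sizes of refetched documents; raise once the total reaches the limit.

    limit=None disables the check, which mirrors commenting out the if block.
    """
    total_size = 0
    for size in doc_sizes:
        total_size += size
        # The original comparison is >=, so exactly 300 MB already fails.
        if limit is not None and total_size >= limit:
            raise RuntimeError("replSet too much data to roll back.")
    return total_size


if __name__ == "__main__":
    print(check_refetch_size([1024, 2048]))  # well under the limit
    print(check_refetch_size([ROLLBACK_REFETCH_LIMIT_BYTES], limit=None))  # check disabled
```

With the check disabled, nothing bounds how much refetched data the rollback holds, so the node needs enough memory for whatever the rollback has to refetch.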