5.5.24,一个怪异的死锁问题导致DB无法连接

来源:互联网 发布:维氏刀具在淘宝上买 编辑:程序博客网 时间:2024/06/14 04:58

故障现象:

mysql 5.5.24,突然所有新连接卡住,db无法连接,无任何报错



故障原因分析:

找到对应的mysqld进程号,pstack $pid > /tmp/jj.sql
Thread 28 (Thread 0x7fc08bb32700 (LWP 3402)):
#0  0x000000364f800614 in ?? () from /lib64/libaio.so.1
#1  0x0000000000870ad7 in os_aio_linux_handle ()
#2  0x0000000000835086 in fil_aio_wait ()
#3  0x00000000007d11a8 in io_handler_thread ()
#4  0x00007fc08c9ab9d1 in start_thread () from /lib64/libpthread.so.0
#5  0x00007fc08bc1b8fd in clone () from /lib64/libc.so.6
Thread 27 (Thread 0x7fc048e6b700 (LWP 3403)):
#0  0x000000364f800614 in ?? () from /lib64/libaio.so.1
#1  0x0000000000870ad7 in os_aio_linux_handle ()
#2  0x0000000000835086 in fil_aio_wait ()
#3  0x00000000007d11a8 in io_handler_thread ()
#4  0x00007fc08c9ab9d1 in start_thread () from /lib64/libpthread.so.0
#5  0x00007fc08bc1b8fd in clone () from /lib64/libc.so.6
Thread 26 (Thread 0x7fc04846a700 (LWP 3404)):
#0  0x000000364f800614 in ?? () from /lib64/libaio.so.1
#1  0x0000000000870ad7 in os_aio_linux_handle ()
#2  0x0000000000835086 in fil_aio_wait ()
#3  0x00000000007d11a8 in io_handler_thread ()
#4  0x00007fc08c9ab9d1 in start_thread () from /lib64/libpthread.so.0
#5  0x00007fc08bc1b8fd in clone () from /lib64/libc.so.6
Thread 25 (Thread 0x7fc047a69700 (LWP 3405)):
#0  0x000000364f800614 in ?? () from /lib64/libaio.so.1
#1  0x0000000000870ad7 in os_aio_linux_handle ()
#2  0x0000000000835086 in fil_aio_wait ()
#3  0x00000000007d11a8 in io_handler_thread ()
#4  0x00007fc08c9ab9d1 in start_thread () from /lib64/libpthread.so.0
#5  0x00007fc08bc1b8fd in clone () from /lib64/libc.so.6
Thread 24 (Thread 0x7fc047068700 (LWP 3406)):
#0  0x000000364f800614 in ?? () from /lib64/libaio.so.1
#1  0x0000000000870ad7 in os_aio_linux_handle ()
#2  0x0000000000835086 in fil_aio_wait ()
#3  0x00000000007d11a8 in io_handler_thread ()
#4  0x00007fc08c9ab9d1 in start_thread () from /lib64/libpthread.so.0
#5  0x00007fc08bc1b8fd in clone () from /lib64/libc.so.6
Thread 23 (Thread 0x7fc046667700 (LWP 3407)):
#0  0x000000364f800614 in ?? () from /lib64/libaio.so.1
#1  0x0000000000870ad7 in os_aio_linux_handle ()
#2  0x0000000000835086 in fil_aio_wait ()
#3  0x00000000007d11a8 in io_handler_thread ()
#4  0x00007fc08c9ab9d1 in start_thread () from /lib64/libpthread.so.0
#5  0x00007fc08bc1b8fd in clone () from /lib64/libc.so.6
Thread 22 (Thread 0x7fc045c66700 (LWP 3408)):
#0  0x000000364f800614 in ?? () from /lib64/libaio.so.1
#1  0x0000000000870ad7 in os_aio_linux_handle ()
#2  0x0000000000835086 in fil_aio_wait ()
#3  0x00000000007d11a8 in io_handler_thread ()
#4  0x00007fc08c9ab9d1 in start_thread () from /lib64/libpthread.so.0
#5  0x00007fc08bc1b8fd in clone () from /lib64/libc.so.6
Thread 21 (Thread 0x7fc045265700 (LWP 3409)):
#0  0x000000364f800614 in ?? () from /lib64/libaio.so.1
#1  0x0000000000870ad7 in os_aio_linux_handle ()
#2  0x0000000000835086 in fil_aio_wait ()
#3  0x00000000007d11a8 in io_handler_thread ()
#4  0x00007fc08c9ab9d1 in start_thread () from /lib64/libpthread.so.0
#5  0x00007fc08bc1b8fd in clone () from /lib64/libc.so.6
Thread 20 (Thread 0x7fc044864700 (LWP 3410)):
#0  0x000000364f800614 in ?? () from /lib64/libaio.so.1
#1  0x0000000000870ad7 in os_aio_linux_handle ()
#2  0x0000000000835086 in fil_aio_wait ()
#3  0x00000000007d11a8 in io_handler_thread ()
#4  0x00007fc08c9ab9d1 in start_thread () from /lib64/libpthread.so.0
#5  0x00007fc08bc1b8fd in clone () from /lib64/libc.so.6
Thread 19 (Thread 0x7fc043e63700 (LWP 3411)):
#0  0x000000364f800614 in ?? () from /lib64/libaio.so.1
#1  0x0000000000870ad7 in os_aio_linux_handle ()
#2  0x0000000000835086 in fil_aio_wait ()
#3  0x00000000007d11a8 in io_handler_thread ()
#4  0x00007fc08c9ab9d1 in start_thread () from /lib64/libpthread.so.0
#5  0x00007fc08bc1b8fd in clone () from /lib64/libc.so.6
Thread 18 (Thread 0x7fc043462700 (LWP 3412)):
#0  0x000000364f800614 in ?? () from /lib64/libaio.so.1
#1  0x0000000000870ad7 in os_aio_linux_handle ()
#2  0x0000000000835086 in fil_aio_wait ()
#3  0x00000000007d11a8 in io_handler_thread ()
#4  0x00007fc08c9ab9d1 in start_thread () from /lib64/libpthread.so.0
#5  0x00007fc08bc1b8fd in clone () from /lib64/libc.so.6
Thread 17 (Thread 0x7fc042a61700 (LWP 3413)):
#0  0x000000364f800614 in ?? () from /lib64/libaio.so.1
#1  0x0000000000870ad7 in os_aio_linux_handle ()
#2  0x0000000000835086 in fil_aio_wait ()
#3  0x00000000007d11a8 in io_handler_thread ()
#4  0x00007fc08c9ab9d1 in start_thread () from /lib64/libpthread.so.0
#5  0x00007fc08bc1b8fd in clone () from /lib64/libc.so.6
Thread 16 (Thread 0x7fc042060700 (LWP 3414)):
#0  0x000000364f800614 in ?? () from /lib64/libaio.so.1
#1  0x0000000000870ad7 in os_aio_linux_handle ()
#2  0x0000000000835086 in fil_aio_wait ()
#3  0x00000000007d11a8 in io_handler_thread ()
#4  0x00007fc08c9ab9d1 in start_thread () from /lib64/libpthread.so.0
#5  0x00007fc08bc1b8fd in clone () from /lib64/libc.so.6
Thread 15 (Thread 0x7fc04165f700 (LWP 3415)):
#0  0x000000364f800614 in ?? () from /lib64/libaio.so.1
#1  0x0000000000870ad7 in os_aio_linux_handle ()
#2  0x0000000000835086 in fil_aio_wait ()
#3  0x00000000007d11a8 in io_handler_thread ()
#4  0x00007fc08c9ab9d1 in start_thread () from /lib64/libpthread.so.0
#5  0x00007fc08bc1b8fd in clone () from /lib64/libc.so.6
Thread 14 (Thread 0x7fc040c5e700 (LWP 3416)):
#0  0x000000364f800614 in ?? () from /lib64/libaio.so.1
#1  0x0000000000870ad7 in os_aio_linux_handle ()
#2  0x0000000000835086 in fil_aio_wait ()
#3  0x00000000007d11a8 in io_handler_thread ()
#4  0x00007fc08c9ab9d1 in start_thread () from /lib64/libpthread.so.0
#5  0x00007fc08bc1b8fd in clone () from /lib64/libc.so.6
Thread 13 (Thread 0x7fc04025d700 (LWP 3417)):
#0  0x000000364f800614 in ?? () from /lib64/libaio.so.1
#1  0x0000000000870ad7 in os_aio_linux_handle ()
#2  0x0000000000835086 in fil_aio_wait ()
#3  0x00000000007d11a8 in io_handler_thread ()
#4  0x00007fc08c9ab9d1 in start_thread () from /lib64/libpthread.so.0
#5  0x00007fc08bc1b8fd in clone () from /lib64/libc.so.6
Thread 12 (Thread 0x7fc03f85c700 (LWP 3418)):
#0  0x000000364f800614 in ?? () from /lib64/libaio.so.1
#1  0x0000000000870ad7 in os_aio_linux_handle ()
#2  0x0000000000835086 in fil_aio_wait ()
#3  0x00000000007d11a8 in io_handler_thread ()
#4  0x00007fc08c9ab9d1 in start_thread () from /lib64/libpthread.so.0
#5  0x00007fc08bc1b8fd in clone () from /lib64/libc.so.6
Thread 11 (Thread 0x7fc03ee5b700 (LWP 3419)):
#0  0x000000364f800614 in ?? () from /lib64/libaio.so.1
#1  0x0000000000870ad7 in os_aio_linux_handle ()
#2  0x0000000000835086 in fil_aio_wait ()
#3  0x00000000007d11a8 in io_handler_thread ()
#4  0x00007fc08c9ab9d1 in start_thread () from /lib64/libpthread.so.0
#5  0x00007fc08bc1b8fd in clone () from /lib64/libc.so.6
Thread 10 (Thread 0x7fc03d450700 (LWP 3421)):
#0  0x00007fc08c9af98e in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x0000000000873de3 in os_event_wait_time_low ()
#2  0x00000000007cd900 in srv_lock_timeout_thread ()
#3  0x00007fc08c9ab9d1 in start_thread () from /lib64/libpthread.so.0
#4  0x00007fc08bc1b8fd in clone () from /lib64/libc.so.6
Thread 9 (Thread 0x7fc03ca4f700 (LWP 3422)):
#0  0x00007fc08c9af98e in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x0000000000873de3 in os_event_wait_time_low ()
#2  0x00000000007cd51b in srv_error_monitor_thread ()
#3  0x00007fc08c9ab9d1 in start_thread () from /lib64/libpthread.so.0
#4  0x00007fc08bc1b8fd in clone () from /lib64/libc.so.6
Thread 8 (Thread 0x7fc03c04e700 (LWP 3423)):
#0  0x00007fc08c9af98e in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x0000000000873de3 in os_event_wait_time_low ()
#2  0x00000000007d053b in srv_monitor_thread ()
#3  0x00007fc08c9ab9d1 in start_thread () from /lib64/libpthread.so.0
#4  0x00007fc08bc1b8fd in clone () from /lib64/libc.so.6
Thread 7 (Thread 0x7fc03e45a700 (LWP 3424)):
#0  0x00007fc08c9af5bc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x0000000000873f26 in os_event_wait_low ()
#2  0x00000000007ce43e in srv_master_thread ()
#3  0x00007fc08c9ab9d1 in start_thread () from /lib64/libpthread.so.0
#4  0x00007fc08bc1b8fd in clone () from /lib64/libc.so.6
Thread 6 (Thread 0x7fc088630700 (LWP 3444)):
#0  0x00007fc08c9b34b5 in sigwait () from /lib64/libpthread.so.0
#1  0x00000000005009eb in signal_hand ()
#2  0x00007fc08c9ab9d1 in start_thread () from /lib64/libpthread.so.0
#3  0x00007fc08bc1b8fd in clone () from /lib64/libc.so.6
Thread 5 (Thread 0x7fc0885ef700 (LWP 3445)):
#0  0x00007fc08c9af5bc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x0000000000737573 in Event_queue::cond_wait(THD*, timespec*, char const*, char const*, unsigned int) ()
#2  0x00000000007375fb in Event_queue::get_top_for_execution_if_time(THD*, Event_queue_element_for_exec**) ()
#3  0x00000000007366d7 in Event_scheduler::run(THD*) ()
#4  0x000000000073681b in event_scheduler_thread ()
#5  0x00007fc08c9ab9d1 in start_thread () from /lib64/libpthread.so.0
#6  0x00007fc08bc1b8fd in clone () from /lib64/libc.so.6
Thread 4 (Thread 0x7fc0885ae700 (LWP 3467)):
#0  0x00007fc08c9b2264 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007fc08c9ad53e in _L_lock_995 () from /lib64/libpthread.so.0
#2  0x00007fc08c9ad4a6 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x000000000057bb6a in log_in_use(char const*) ()
#4  0x00000000006ce273 in MYSQL_BIN_LOG::purge_logs_before_date(long) ()
#5  0x00000000006d26f8 in MYSQL_BIN_LOG::rotate_and_purge(bool) ()
#6  0x000000000061243e in reload_acl_and_cache(THD*, unsigned long, TABLE_LIST*, int*) ()
#7  0x0000000000567359 in mysql_execute_command(THD*) ()
#8  0x000000000056b209 in mysql_parse(THD*, char*, unsigned int, Parser_state*) ()
#9  0x000000000056cb36 in dispatch_command(enum_server_command, THD*, char*, unsigned int) ()
#10 0x00000000005f6386 in do_handle_one_connection(THD*) ()
#11 0x00000000005f63ea in handle_one_connection ()
#12 0x00007fc08c9ab9d1 in start_thread () from /lib64/libpthread.so.0
#13 0x00007fc08bc1b8fd in clone () from /lib64/libc.so.6
Thread 3 (Thread 0x7fc08856d700 (LWP 5461)):
#0  0x00007fc08c9b2264 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007fc08c9ad508 in _L_lock_854 () from /lib64/libpthread.so.0
#2  0x00007fc08c9ad3d7 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00000000006d2eef in MYSQL_BIN_LOG::reset_logs(THD*) ()
#4  0x000000000057a0ac in reset_master(THD*) ()
#5  0x00000000006127c8 in reload_acl_and_cache(THD*, unsigned long, TABLE_LIST*, int*) ()
#6  0x0000000000567359 in mysql_execute_command(THD*) ()
#7  0x000000000056b209 in mysql_parse(THD*, char*, unsigned int, Parser_state*) ()
#8  0x000000000056cb36 in dispatch_command(enum_server_command, THD*, char*, unsigned int) ()
#9  0x00000000005f6386 in do_handle_one_connection(THD*) ()
#10 0x00000000005f63ea in handle_one_connection ()
#11 0x00007fc08c9ab9d1 in start_thread () from /lib64/libpthread.so.0
#12 0x00007fc08bc1b8fd in clone () from /lib64/libc.so.6
Thread 2 (Thread 0x7fc08852c700 (LWP 5528)):
#0  0x00007fc08c9af5bc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00000000005016cf in one_thread_per_connection_end(THD*, bool) ()
#2  0x00000000005f62dd in do_handle_one_connection(THD*) ()
#3  0x00000000005f63ea in handle_one_connection ()
#4  0x00007fc08c9ab9d1 in start_thread () from /lib64/libpthread.so.0
#5  0x00007fc08bc1b8fd in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x7fc08cdce720 (LWP 3400)):
#0  0x00007fc08c9b2264 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007fc08c9ad53e in _L_lock_995 () from /lib64/libpthread.so.0
#2  0x00007fc08c9ad4a6 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x0000000000547ae2 in THD::THD() ()
#4  0x0000000000501985 in handle_connections_sockets() ()
#5  0x000000000050612f in mysqld_main(int, char**) ()
#6  0x00007fc08bb51d5d in __libc_start_main () from /lib64/libc.so.6
#7  0x00000000004fd9dd in _start ()


从堆栈信息看出,1号主线程处于锁等待,导致新连接无法连上,#0  0x00007fc08c9b2264 in __lll_lock_wait () from /lib64/libpthread.so.0
。3号线程执行了reset  master的操作,4号线程看起来像是一个flush logs的动作,通过观察binlog看到故障时间点正好是一个binlog达到上限后切换的时间点,初步推测是flush logs和reset master操作之间存在死锁问题,通过查看源代码来验证这个观点。




信息从pstack看Thread 4先调用MYSQL_BIN_LOG::purge_logs_before_date,加锁语句为mysql_mutex_lock(&LOCK_index),完成后并无释放;后执行log_in_use,加锁语句为mysql_mutex_lock(&LOCK_thread_count);
也就是说Thread 4的加锁顺序是先LOCK_index,后LOCK_thread_count
以下是相关的函数代码


int MYSQL_BIN_LOG::purge_logs_before_date(time_t purge_time)
{
  int error;
  char to_log[FN_REFLEN];
  LOG_INFO log_info;
  MY_STAT stat_area;
  THD *thd= current_thd;


  DBUG_ENTER("purge_logs_before_date");


  mysql_mutex_lock(&LOCK_index);
  to_log[0]= 0;


  if ((error=find_log_pos(&log_info, NullS, 0 /*no mutex*/)))
    goto err;


  while (strcmp(log_file_name, log_info.log_file_name) &&
         !is_active(log_info.log_file_name) &&
         !log_in_use(log_info.log_file_name))
  {
    if (!mysql_file_stat(m_key_file_log,
                         log_info.log_file_name, &stat_area, MYF(0)))
    {
      if (my_errno == ENOENT)
      {
        /*
          It's not fatal if we can't stat a log file that does not exist.
        */
        my_errno= 0;
      }
      else
      {
        /*
          Other than ENOENT are fatal
        */
        if (thd)
        {
          push_warning_printf(thd, MYSQL_ERROR::WARN_LEVEL_WARN,
                              ER_BINLOG_PURGE_FATAL_ERR,
                              "a problem with getting info on being purged %s; "
                              "consider examining correspondence "


                              "of your binlog index file "
                              "to the actual binlog files",
                              log_info.log_file_name);
        }
        else
        {
          sql_print_information("Failed to delete log file '%s'",
                                log_info.log_file_name);
        }
        error= LOG_INFO_FATAL;
        goto err;
      }
    }
    else
    {
      if (stat_area.st_mtime < purge_time)
        strmake(to_log,
                log_info.log_file_name,
                sizeof(log_info.log_file_name) - 1);
      else
        break;
    }
    if (find_next_log(&log_info, 0))
      break;
  }


  error= (to_log[0] ? purge_logs(to_log, 1, 0, 1, (ulonglong *) 0) : 0);


err:
  mysql_mutex_unlock(&LOCK_index);
  DBUG_RETURN(error);
}
#endif /* HAVE_REPLICATION */


bool log_in_use(const char* log_name)
{
  size_t log_name_len = strlen(log_name) + 1;
  THD *tmp;
  bool result = 0;


  mysql_mutex_lock(&LOCK_thread_count);
  I_List_iterator<THD> it(threads);


  while ((tmp=it++))
  {
    LOG_INFO* linfo;
    if ((linfo = tmp->current_linfo))
    {
      mysql_mutex_lock(&linfo->lock);
      result = !memcmp(log_name, linfo->log_file_name, log_name_len);
      mysql_mutex_unlock(&linfo->lock);
      if (result)
        break;
    }
  }


  mysql_mutex_unlock(&LOCK_thread_count);
  return result;
}


信息从pstack看Thread 3调用了MYSQL_BIN_LOG::reset_logs,加锁语句为mysql_mutex_lock(&LOCK_thread_count);
 mysql_mutex_lock(&LOCK_log);
mysql_mutex_lock(&LOCK_index);
也就是说先给LOCK_thread_count加锁,再给LOCK_index加锁,Thread 4和Thread 3的加锁顺序正好相反,出现了死锁现象,这个死锁mysql无法检测出来。
以下是相关函数的代码
bool MYSQL_BIN_LOG::reset_logs(THD* thd)
{
  LOG_INFO linfo;
  bool error=0;
  int err;
  const char* save_name;
  DBUG_ENTER("reset_logs");


  ha_reset_logs(thd);


  /*
    The following mutex is needed to ensure that no threads call
    'delete thd' as we would then risk missing a 'rollback' from this
    thread. If the transaction involved MyISAM tables, it should go
    into binlog even on rollback.
  */
  mysql_mutex_lock(&LOCK_thread_count);


  /*
    We need to get both locks to be sure that no one is trying to
    write to the index log file.
  */
  mysql_mutex_lock(&LOCK_log);
  mysql_mutex_lock(&LOCK_index);


  /* Save variables so that we can reopen the log */
  save_name=name;
  name=0;                                       // Protect against free
  close(LOG_CLOSE_TO_BE_OPENED);








而1号主线程显示也是等待编号为_L_lock_995的锁,从代码推测也就是等待LOCK_thread_count的锁释放,但因为3和4线程陷入死锁,永远也无法释放,从而一直卡着。


复现方法

后来了解到用户为了节约磁盘空间,定期会执行reset master清理所有binlog。
查看了下mysql5.6没法复现,已经修复了这个bug。5.5.24的复现方法如下:
1 gdb -p $pidofmysqld
2 在log_in_use函数处设置断点,c继续运行
3 打开1个连接,执行flush logs
4 再打开一个连接,执行reset  master
5 gdb执行到断点处,按c继续
6 这时再打开新连接,发现无法连接
7 pstack查看信息跟之前的pstack信息吻合,故障复现
0 0
原创粉丝点击