关于mysql innodb 如何保存大对象(BLOB等),最强解析

来源:互联网 发布:java数组 nullpointer 编辑:程序博客网 时间:2024/06/05 20:48

                       Externally Stored Fields in InnoDB


This article discusses the storage (inline and external) of field data in the InnoDB storage engine. All fields of variable length types like VARCHAR, VARBINARY, BLOB and TEXT can be stored inline within the clustered index record, or stored externally in separate BLOB pages outside of the index record (but within the same tablespace). All of these fields can be classified as large objects. These large objects are either binary large objects or character large objects. The binary large objects do not have an associated character set, whereas the character large objects do.    >>本文主要讨论mysql innodb存储引擎下如何存储大对象。所有的变长类型,如VARCHAR,VARBINARY,BLOG,TEXT即可能被保存在索引页中(innodb 表是索引组织表),也可能被保存在external pages(溢出页,溢出页同索引页在同一个表空间)。至于这些列是保存在B-tree pages还是被保存在external pages,主要取决于下面三个因素:
1).大对象的size;2).整个行的size;3).innodb 的 row format。大对象可能是二进制大对象也可能是字符串大对象,二进制大对象没有相关字符集,而二进制大对象有字符集。

Within the InnoDB storage engine there is no difference between the way character large objects and binary large objects are handled. Throughout this article we will use the term “BLOB field” to refer to any of the aforementioned field types that can be chosen for external storage.    >>对于innodb存储引擎来说,对于二进制和字符串大对象的处理都是一样的。本文下面的描述中 “BLOB field”只是一个统称,它代表mysql中支持的所有大对象比如BLOB,TEXT,LONG VARCHAR(应该还包括LONG VARBINARY)

This article will provide information about the following: >>下面的文章包含如下几部分内容

  • Explain when a BLOB field will be stored inline and when it will be stored externally, with respect to the clustered index record.    >>本文说明了主键索引中的大对象什么情况下会被保存在B-tree pages,什么情况下会被保存在external pages中(注意只有主键索引中可能存在大对象)
  • The structure of the BLOB reference.    >>BLOB reference的结构
  • The BLOB prefix that is stored in the clustered index record, when the BLOB is stored externally.    >>如果表的row_format为COMPACT或者REDUNDANT时如果大对象被保存在external pages,BLOG字段会在索引页保存 768 bytes数据加上一个20 bytes的指针(BLOB reference),其他的保存在external pages。如果表的row_format为DYNAMIC或者COMPRESSED(innodb_file_format=Barracuda在支持之前的REDUNDANT和COMPACT row_format基础上新增了Dynamic和Compressed两种 row_format),时,如果大对象被保存在external pages,索引页中不会再保存BLOB prefix(BLOB 前768 bytes 数据)只保存BLOB reference。
  • Utility gdb functions to examine the BLOB reference and the record offsets array.    >>使用gdb 程序分析BLOB reference和记录的位置偏移量

The BLOB fields are associated with the clustered index records (the primary key) of a table. Only the clustered index can store a BLOB field externally. A secondary index cannot have externally stored fields. For the purposes of this article, we won’t deal with any secondary indexes.    >>只有聚集索引才能存储大对象在external pages,所以我们下面说的都是聚集索引中大对象如何存储。

The Schema

The following example table will be used to present the information:    >>下面是用来说明问题的示例表

Note: ௱ – Tamil number one hundred (Unicode 0BF1), ௲ – Tamil number one thousand (Unicode 0BF2)

A single clustered index record can have 1 or more externally stored BLOBs. So for the given table definition of t1, there are 4 possible ways that the BLOB fields of f2 and f3 can be stored:    >>测试表t1中有f2和f3两个列存储的大对象,所以理论上innodb 有如下4中方式存储它们

  1. f2 and f3 are both stored inline within the clustered index page    >>f2和f3都存储在索引页中
  2. f2 is stored inline, while f3 is stored externally    >>f2保存在索引页中,f3保存在 external pages中
  3. f3 is stored inline, while f2 is stored externally    >>f3保存在索引页中,f2保存在 external pages中
  4. both f2 and f3 are stored externally    >>f2和f3都保存在 external pages中

In the following sections, let us see which of the BLOB columns are externally stored and which of them are stored inline, for each sample row we created above. Note that the row format of table t1 is not explicitly specified. In MySQL 5.6, it will default to the COMPACT row format.  Please keep this in mind as we discuss the example.    >>下面我们来看一下,上面表中的每一行记录中大对象是如何保存的。注意我们建表时没有指定row_format ,所以t1表使用 mysql 5.6中的默认值 COMPACT。

Overview of BLOB Storage

The BLOB data can be stored inline in the clustered index record, or it can be stored externally in separate BLOB pages. These external BLOB pages are allocated from the same tablespace in which the clustered index resides. The BLOB data will always be stored inline whenever possible though. If and only if this is not possible because of the record size, then the BLOB field will be stored externally.  This is true for all of the current row formats: REDUNDANT, COMPACT, DYNAMIC, and COMPRESSED. Let’s now take a look at the storage details for the BLOB columns in our example table.    >>innodb引擎下,大对象即可能被保存在索引页中也可能被保存在external pages中(external pages同聚集索引页在同一个表空间)。在可能的情况下大对象会被尽可能的保存在索引页中。当记录的size 过大,无法保证一个索引页中存储最少两条记录时,大对象就会被保存在external pages中。这个原则对当前的row format,REDUNDANT,COMPACT,DYNAMIC,COMPRESSED都是适用的。

In MySQL 5.6, the default row format of a table is COMPACT, so that’s what our t1 table is using. The default page size is 16K, so that’s also what we’re using. The maximum record size that can be stored in a 16K page using the COMPACT row format is 8126 bytes. The function page_get_free_space_of_empty() will tell us the total free space available in a page. The value returned by this function, which is then divided by 2, will give us the maximum allowed record size. The division by 2 is required because an index page must contain a minimum of 2 records. Let’s look at an example (the argument value of “1” tells the function that the row format for the page is COMPACT):    >>Mysql 5.6 下 row format 的默认值是COMPACT,默认的page size是 16k。这时表中单条记录最大尺寸为 8126 bytes. page_get_free_space_of_empty()能够计算出page 总得剩余空间,用该值除以2 即得出记录的最大尺寸(除以2是因为一个页中最少要保存两个记录)。在如上条件下,记录的最大尺寸为8126 bytes

The following table shows the storage details of the BLOB columns for each row in table t1. Keeping in mind the maximum allowed record size, it is clear as to why the 60000 bytes of BLOB data is always stored externally. It simply will not fit within a single clustered index record.    >>下面的表格显示了测试表中,各行中 f2和f3列如何存储。正如上面所分析的,单条记录最大 size 为 8216 bytes,所以size 为60000 bytes的 大对象都被保存在 external pages中

PRIMARY KEYLENGTH OF F2STORAGE OF F2LENGTH OF F3STORAGE OF F313000 bytesInline3000 bytesInline260000 bytesExternal60000 bytesExternal360000 bytesExternal4500 bytesInline44500 bytesInline60000 bytesExternal

Inline Storage of BLOBs

As mentioned previously, no BLOB fields will be externally stored if the size of the record is less than the maximum record size allowed in a page. In our example table, for the row with a primary key value of 1, no BLOB fields need to be stored externally because the full record size is less than 8126 bytes. The following tables give the sizing details for each of the rows in our example table:    >>当行的 size小于 8126 bytes时,所有大对象都保存在索引页。所以测试表中的第一条记录中的大对象不需要存储在 external pages中。下面的表格描述了测试表中各行的具体大小,以及被保存在external pages中的列

PRIMARY KEYCLUSTERED INDEX RECORD SIZE (IN BYTES)FIELDS MOVED OUTRECORD SIZE AFTER MOVING BLOB OUTSIDE (IN BYTES)MAXIMUM ALLOWED RECORD SIZE (IN BYTES)16027-602781262120027f2, f316038126364527f253158126464527f353158126

As we can see, the BLOB fields are stored externally until the record size falls below the limit. In the table above, column 2 gives the initial clustered index record size. If this size is greater than the allowed maximum size of the record (shown in column 5), then the function dtuple_convert_big_rec() is invoked to choose the fields destined for external storage. Column 3 lists the fields that have been chosen by this function for external storage. Column 4 shows the clustered index record size after moving the chosen fields to external storage. Again, this value must be less than the maximum record size, shown in column 5 (which is 8126 bytes in our example).    >>上面的表中第二列是每行记录的 size. 如果行的size 大于 所允许的最大size(本例中是8126 bytes),那么会调用 dtuple_convert_big_rec()函数来选择哪些列将被保存在external pages。上面表格的第三列是每行中被保存在external pages中的列信息。上面表格的第四列表示除了external pages保存数据外,主键索引的size(这个值可定是小于 maximum record size的)。上面表格的第五列表示当前条件下maximum record size值

All of the size details provided above are obtained through the debugger from the callers of this function.

For more clarity, let me explain the clustered index record length for the first row with a primary key value of 1  (shown as 6027 bytes). The length of the user fields f1, f2, and f3 are 4 bytes, 3000 bytes, and 3000 bytes respectively. The length of the system fields DB_ROLL_PTR and DB_TRX_ID are 7 bytes and 6 bytes respectively. The record header stores the length of the f2 and f3 fields, taking 4 bytes (2 bytes for each field). The record header also contains a null bit array, which for this record takes up 1 byte. Lastly, the record header contains REC_N_NEW_EXTRA_BYTES—which is 5 bytes for the COMPACT row format—of additional information. The complete storage details for the final record are presented in the following table:    >>下面我们来分析一下测试表中第一条记录的6027 bytes是如何计算出来的。第一行中,f1,f2,f3的长度分别为4 bytes,3000 bytes,3000 bytes。标识回滚段和事务id的列分别占7 bytes和6 bytes。record header中还包含 null标识占1 byte。最后REC_N_NEW_EXTRA_BYTES占用5 bytes(row format 为 COMPACT时)。这些值加起来 正好为6027 bytes

FIELDLENGTH (IN BYTES)TOTAL6027Length of f32Length of f22Null bit array1REC_N_NEW_EXTRA_BYTES5f14DB_TRX_ID6DB_ROLL_PTR7f23000f33000

You can refer to the documentation in storage/innobase/rem/rem0rec.cc for more details about the COMPACT and REDUNDANT row formats. The REDUNDANT row format is also explained in the blog article InnoDB REDUNDANT Row Format.

Choosing Fields for External Storage

As discussed above, the function dtuple_convert_big_rec() is invoked to decide which parts of the oversized clustered index record will be chosen for external storage. This function makes use of the following rules to decide this:    >>当行的size 大于maximum record size值时,会调用dtuple_convert_big_rec()决定哪些大对象被保存在 external pages。该函数通过如下规则确定哪些大对象会被保存在external中

  • No fixed length fields can be chosen for external storage.    >>固定长度类型的列不会被保存在 external pages中,如char
  • No variable length fields whose size is less than or equal to 40 (2 * BTR_EXTERN_FIELD_REF_SIZE) bytes will be chosen for external storage.  >>可变长度列类型 如果size小于等于40 bytes(2 * BTR_EXTERN_FIELD_REF_SIZE),不会被保存在external pages中
  • No variable length fields whose size is less than the BLOB prefix size will be chosen for external storage.  This means that in the case of REDUNDANT and COMPACT row formats, if the field data length is less than or equal to 768 bytes (DICT_ANTELOPE_MAX_INDEX_COL_LEN), then it will not be chosen for external storage. This rule is not applicable for DYNAMIC and COMPRESSED row formats, because their BLOB prefix is 0 bytes.    >>可变长度列类型,如果size 小于 BLOB prefix size,不会被保存在external pages中(这条规则使用于 row_format 为COMPACT或者REDUNDANT 的表,不适于row_format为DYNAMIC 和 COMPRESSED 的表)。

In the function dtuple_convert_big_rec(), we examine one BLOB field at a time for potential external storage, and then move it to external storage if it passes the criteria noted above, until the clustered index record size falls within the maximum allowed. Larger fields will be selected for external storage before smaller fields, to ensure that maximum space savings happens in the clustered index page. This ensures that more records can be stored in each clustered index page.    >>dtuple_convert_big_rec()函数一次检查一个BLOB field列,如果满足上面所说的条件,该BLOB field会被保存在external pages中,然后再检查该行中的下一个BLOB field,直到列的size 小于当前条件下的 maximum size。在同一行中,size越大的field 越先被dtuple_convert_big_rec()函数检查,以确保index pages中能保留根多的数据。

BLOB Reference

When a BLOB field is stored externally, a BLOB reference is stored in the clustered index record. The BLOB reference will be stored after the BLOB prefix, if any. This BLOB reference is 20 bytes, and it contains the following information:    >>>>当一个BLOB field被保存在external pages时,会在index pages中保存该BLOB field的BLOB reference(指示该BLOB的size,external page位置等元数据信息)。如果表的row format为COMPACT或者REDUNDANT,那么BLOB reference保存在BLOB prefix之后。BLOB reference 占用20 bytes空间,包含如下信息:

  1. The space identifier (4 bytes)    >>>>表空间信息(下图中的Space ID部分)
  2. The page number where the first page of the BLOB is stored (4 bytes)    >>>>储存BLOB field的起始 page号(下图中的Page Number部分)
  3. The offset of the BLOB header within that page (4 bytes)    >>>>起始page中 BLOB header 的位移量(下图中的Offset)
  4. The total size of the BLOB data (8 bytes)    >>>>该BLOB 的 total size(除了上面三个以外的剩余部分)

Even though 8 bytes are available to store the total size of the BLOB data, only the last 4 bytes are actually used. This means that within InnoDB, the maximum size of a single BLOB field is currently 4GB.    >>>>尽管上面我们提到有8 bytes的空间可以用来存储BLOB size信息,但是实际上仅有 4 bytes空间可以真正用来保存BLOB size 信息。这就意味着,innodb中可以保存的最大 BLOB field size 是4 GB(4 bytes=32 bit ,2^32=4GB)

Structure of InnoDB BLOB Reference

In the length field, two bits are used to store the ownership and inheritance information, which are not discussed in this article. We will cover that in a subsequent blog article. The most significant bit of the length field is used to store ownership information and the second most significant bit is used to store the inheritance information.    >>>>上图中最后保存BLOB size信息部分,用 2 bits来保存 ownership和inheritance 信息,至于ownership和inheritance本文我们不做讨论,在下篇的博客中再做说明。

Here is a gdb function to print the contents of a BLOB reference. This function takes a pointer to the external BLOB reference as an argument. The calls to ntohl() are required because all data on disk is stored in network byte order by InnoDB.   >>>>这里有一个 gdb 函数,可以用来打印 BLOB reference。

BLOB Prefix

When a BLOB field is stored externally, we may also store a prefix of the value in the clustered index record, depending on the row format used. For the REDUNDANT and COMPACT row formats, a BLOB prefix of 768 bytes is stored in the clustered index record. For the DYNAMIC and COMPRESSED row formats, a BLOB prefix is never stored in the clustered index record. The BLOB prefix would be followed by the BLOB reference.    >>>>如果我们表的row format是COMPACT 或者REDUNDANT,那么external pages保存的BLOB field,会在index pages中保存一个 BLOB prefix(BLOB field最开始的 768 bytes数据),在BLOB prefix 之后保存着该BLOB 的BLOB reference。如果表的row format是 DYNAMIC,或者compressed,那么不会在index pages中保存BLOB prefix,所有的BLOB数据都被保存在 external pages中,index page中只保留该BLOB相关的BLOB reference。

The BLOB prefix, when available, helps to calculate the secondary index key without needing to fetch the externally stored BLOB data (which involves at least one extra page load/fetch). This is possible because the maximum length of a secondary index key is 767 bytes. If we attempt to create a secondary index with a bigger length it will be automatically truncated with a warning. For example, consider the following statement:    >>>>

This will generate the following warning to the user: “Specified key was too long; max key length is 767 bytes”. The resulting index records of s1 will only be on a prefix of f2, covering only the first 767 bytes.    >>>>索引的最大长度不能超过 767 bytes,超过该长度创建会报错

External Storage of BLOBs

A BLOB field is considered to be stored externally if it is stored outside of the clustered index B-tree, and just a reference to the BLOB (along with any BLOB prefix) is stored in the clustered index record. This means that the size and structure of the clustered index B-tree will not be affected by the externally stored BLOB data.    >>>>

For the row in our sample table with a primary key value of 2, the BLOB fields f2 and f3 will be stored externally from the clustered index record. This is because their size is too big to fit within the clustered index record. Instead, a reference to the externally stored BLOB data will be stored in the clustered index record. A pictorial view of this external storage is shown below:

Externally Stored Fields in InnoDB

Note: The diagram above shows each clustered index record having exactly one externally stored BLOB. This is just a simplication. Each clustered index record can have any number of externally stored BLOBs.

Let’s verify that the BLOB fields for the row with a primary key value of 2 are indeed externally stored with the help of the debugger. Run the server through a debugger (we used gdb) and issue the query SELECT * FROM t1 WHERE f1 = 2. Put a breakpoint in the function row_search_mvcc(), which searches for rows using a cursor. Within this function, once the offsets are calculated for the selected record, let’s print the offsets to verify which of the columns are stored externally. I’ll use the gdb function ib_print_offsets (provided in the appendix) to examine the offsets array. The (*) in the field output shows that it is in fact stored externally.

The above output shows that the record has 5 fields (3 fields are specified by the user, and 2 are internal system fields, namely DB_TRX_ID and DB_ROLL_PTR). The field number 0 begins at offset 0 and ends at offset 4. The first field begins at offset 4 and ends at offset 10. The second field begins at offset 10 and ends at offset 17. The third field, whose data is stored externally, begins at offset 17 and ends at offset 805. The fourth field, whose data is stored externally, begins at offset 805 and ends at 1593. This information is presented in a tabular format below.

FIELD NUMBERDESCRIPTIONEXTERNALBEGIN OFFSETEND OFFSETLENGTH0The field f1 (Primary Key)No0441DB_TRX_IDNo41062DB_ROLL_PTRNo101773The field f2Yes17805788 (768 bytes blob prefix + 20 bytes of blob reference)4The field f3Yes8051593788 (768 bytes blob prefix + 20 bytes of blob reference)

The length of the fields f2 and f3 are each 788 bytes. This includes the 768 bytes of the BLOB prefix, and the 20 bytes for the external BLOB reference. The gdb function that I used to examine the offsets array is provided in the appendix.

The BLOB reference can be examined by using the gdb function ib_print_blob_ref (which we provided earlier). It takes the BLOB reference as an argument. The BLOB reference of field f2 is available at rec+805-20, where 805 is the end offset of f2 and 20 is the size of the BLOB reference. The BLOB reference of field f3 is available at rec+1593-20, where 1593 is the end offset of f2 and 20 is the size of the BLOB reference. The output is provided below:

The length of the externally stored part of the BLOB is 59232 bytes. The BLOB prefix is 768 bytes. So the total length of the field is 59232 + 768 = 60000 bytes, which is the size of both the f2 and f3 fields.

Conclusion

In this article we saw when the BLOB data will be stored inline, and when it will be stored externally. We also saw the rules applied in choosing particular field data for external storage. In a subsequent blog article, we will cover further details regarding BLOB handling within InnoDB.


Appendix: A gdb Function to Examine the Offsets Array

Within InnoDB, to manipulate and access the various fields of a record, an offsets array is calculated by the function rec_get_offsets(). This function returns an offsets array of type ulint*. This array contains the following pieces of information:

  1. The number of fields in the index.
  2. Whether the record has any externally stored fields.
  3. The offset at which each of the fields begin and end.
  4. Whether the row format is COMPACT or REDUNDANT. Note that DYNAMIC and COMPRESSED row formats are a special type of the COMPACT row format. In other words, a DYNAMIC row format is a COMPACT row format plus additional rules. And the COMPRESSED row format is a DYNAMIC row format with additional rules.
  5. In a debug build, it also contains the rec pointer and the index pointer.

Here is the gdb function referenced earlier that can be used to examine an offsets array in a debug build of MySQL:

Authors

This article is jointly authored by Annamalai Gurusami (கு அண்ணாமலை), Aditya and Thirunarayanan Balathandayuthapani of the MySQL Server Sustaining Team, Bengaluru.

I would also like to thank Marko Makela and Matt Lord for their reviews and support. They helped to improve this article to a great extent.












原创粉丝点击