随机主键对InnoDB插入性能的影响
孔子说得好,“学而不思则罔,思而不学则殆”。看书看多了不去想不去做对比,很快脑子就会一片混乱。这篇blog稍微介绍下随机主键对InnoDB插入性能的影响。
InnoDB引擎的行结构相当复杂(compact、redundant、Barracuda)。为了弄明白这一个细节,我翻了很多遍官方手册。后来又对比着看了很多本书,发现他们自相矛盾,无奈之下,只能从简单的开始测起。下面先摘抄两段矛盾的描述:
《Pro MySQL》,169页:
Remember that InnoDB tables follow a clustered data organization where the data page is clustered, or ordered, based on the primary key value. Would it then surprise you to know that InnoDB does not actually store records in the order of the primary key?
“But wait!” you say. “How is it possible that a clustered data organization can be built on index pages without those records being laid out in primary key order?” The answer lies in the storage engine’s use of next-key pointers in the data records.
The designers of InnoDB knew that maintaining clustered index data pages in sort order of the primary key would be a performance problem. When records were inserted, the storage engine would need to find where the record “fit” into the appropriate data page, then move records around within the file in order to sort correctly. Updating a record would likewise cause problems. Additionally, the designers knew that inserting records on a heap structure (with no regard to the order of the records) would be faster, since multiple insertions could be serialized to go into contiguous blocks on the data page. Therefore, the developers came up with a mechanism whereby records can be inserted into the data page in no particular order (a heap), but be affixed with a pointer to the record that had the next primary key value.
The InnoDB storage engine inserts a record wherever the first available free space is located. It gets this free record space address from the page header section. To determine the next-key pointer, it uses the small, ondensed page directory trailing section of the data page to locate the appropriate place to insert the primary key value for the inserted record. In this way, only the small page directory set of key values and pointers must be rearranged. Note also that the next-key pointers are a one-way (forward-only) list.
《High Performance MySQL 2nd》,117页:
If you’re using InnoDB and don’t need any particular clustering, it can be a good idea to define a surrogate key, which is a primary key whose value is not derived from your application’s data. The easiest way to do this is usually with an AUTO_INCREMENT column. This will ensure that rows are inserted in sequential order and will offer better performance for joins using primary keys.
It is best to avoid random (nonsequential) clustered keys. For example, using UUID values is a poor choice from a performance standpoint: it makes clustered index insertion random, which is a worst-case scenario, and does not give you any helpful data clustering.
是不是很崩溃?《Pro MySQL》介绍了原理;而《High Performance MySQL 2nd》反对了《Pro MySQL》的观点,而且还在之后的篇幅里面用一个并不太能说明问题的实验来证明自己的观点。算了,还是得靠自己……
1、建表:
CREATE TABLE `test` (
`a` char(32) NOT NULL default ”,
`b` varchar(5000) default NULL,
PRIMARY KEY (`a`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
加上一个varchar的大字段的原因是,如果InnoDB主键随机插入存在行数据的搬迁,那么性能下降将会非常明显。这里的MySQL采用了O_DIRECT的模式,并且将buffer_pool减少到500M,尽量减少内存中的数据以体现这一差别。
2、写两个单线程程序。一个产生递增主键并插入到test表内;另外一个产生随机主键并插入到test表内
测试结果是以下两个图:
事实证明,《Pro MySQL》对InnoDB的描述是不对的。不过《Pro MySQL》已经出版了很多年了,或许以前的InnoDB是按照他说的方式实现的也未可知。从测试结果来看,InnoDB(MySQL5.0.84自带)是真正采用了聚集索引,数据存放的物理位置与聚集索引相关。
我觉得pro MySQL中指定应该是在一个page中记录是按堆存储的,没有顺序,这个直接看ibd的文件就能知道。但page与page之间是有顺序的,所以大量的随机插入还是使得性能下降。
呵呵,有疑惑,做实验验证。
感觉两个看起来都没有错的,pro mysql并没有说随机插入不影响性能的。
innodb数据是按照cluster index进行聚合的。
但是这个聚合是以page为单位的,page在磁盘上并不是连续的,如果整个磁盘文件数据是完全有序的话就会导致说,如果我在一张表的前面插入一行,就会导致所有的数据向后移动了。
通过page进行粒度上的控制可以避免这点,当然会带来一定的性能下降。
@liecs:
是的,很有可能是我误解了Pro MySQL的意思。
结论是,在InnoDB的聚集索引上做随机插入是会大大影响性能的。
我想请教那个性能分析图是如何生成的啊,
用的是第三方工具?
你可以把索引看做是deque那样的数组。如果要向中间插入数据,那么就会引起数据移动,而数据移动是昂贵的。向deque后面追加数据是很便宜的,只要allocate新的block追加在尾巴上就可以,所以递增的插入要便宜得多。
这么说来,many to many关系中间表是组合主键,它的组合值也不会是有顺序的,这样的表也要加一个auto increment的列?可是基本上所有的资料上都是推荐用组合索引的。