Creating Indexes/Sorting on Very Large Tables in MySQL

If your dataset is anywhere between 1 million and 180 million rows, you have come to the right blog. Normally a procedure like this requires at least a day or more depending on the dataset, but I will show you how to get results in as little as 3 hours. You will be able to run queries that fully utilize indexes, without going through the painful process of creating them on the entire table.

Recently at work I had to deal with a dataset of 180 million rows, and this entry details the workflow I adopted. I will not actually sort the complete table; instead I will show you how to achieve the same effect in 2-3 hours for 180 million rows. I first tried the well-known approaches. I started an ALTER TABLE statement and stopped it after 2 hours. I tried to dump the data in the hope of inserting it into an already-indexed table, but stopped that after an hour as well. Then I explored MySQL's functionality a bit and discovered three features that can really reduce the time: Memory tables, Merge tables, and Partitioning (not strictly necessary). Fortunately, merge sort is a recursive job, so in the process of sorting 180 million rows I also had to learn how to sort 2 million rows faster than normal index creation. So the first thing you have to do is find out the order of magnitude of rows you have.

If you have fewer than 1 million rows, you are better off using MySQL's built-in command:

ALTER TABLE <tblname> ADD INDEX (<column name>);

If you have more than that, read on…

The concepts I will be using are as follows:

Memory Tables:

These tables are hash-indexed tables that MySQL stores entirely in memory. They are temporary (in the sense that they disappear after a MySQL server restart) and can hold only a limited number of rows, but they are lightning fast at index creation and sorting.

Partitioning:

Somebody at MySQL finally realized that to be fast, a database has to split up its data. In version 5.1 they introduced partitioning, which splits a table into multiple physical pieces that can be worked on independently. If you have the luxury of a setup that supports it, you can use this. I did not, so I will do the indexing without it.

Merge tables:

These are the single-server counterpart of partitioning. If you want to split your dataset into multiple tables and run a single query across all of them, you can use merge tables. They have their own share of problems, but for the general operations of insertion, deletion, searching, and dumping, they work great. The two major limitations: they only work if the base tables are MyISAM, and searching on them costs a multiple of log(n) (see below if interested).

Irrelevant mathematical jargon (feel free to skip)

If you search a table on an indexed column, the search takes log(n) time (n = number of rows). Now imagine a table of 100·n rows split into 100 sub-tables of n rows each. A search on one sub-table takes log(n) time, so searching all of them takes 100·log(n) = log(n^100). A single table would have taken log(100·n) = log(100) + log(n). So a single table's search cost grows logarithmically with size, while a merge table's grows linearly with the number of sub-tables (the multiplier moves into the exponent inside the log).
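To put rough numbers on this, here is a small sketch; the table sizes below are made up purely for illustration, not measured:

```python
import math

# Compare index-lookup cost (counted as ~log2 comparisons) for one big
# table versus probing the index of every sub-table in a merge table.
n = 1_800_000      # rows per sub-table (illustrative)
sub_tables = 100   # number of sub-tables (illustrative)

single_table = math.log2(sub_tables * n)   # log(100*n) = log(100) + log(n)
merge_table = sub_tables * math.log2(n)    # 100 * log(n) = log(n^100)

print(f"single table: ~{single_table:.1f} comparisons")
print(f"merge table : ~{merge_table:.1f} comparisons")
```

The gap is dramatic: the single table pays only an additive log(100) while the merge table multiplies the whole cost by 100.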

I will explain the steps I took below; you can customize them to your dataset.

There are several ways of creating an index; mine requires a fairly powerful machine and a reliable power supply. At least 2 GB of RAM is required, and more is better here. If you have around 2 million rows, skip to Step 4.

Step 0

This step is reconnaissance; it determines the numbers used in the next steps. You need to know approximately how much space one row takes, because that determines the size of your memory table. You can also find the memory-table size by trial and error, but a numerical estimate helps. In the steps below I will call the initial table client_data and the final table client_data_sorted. All other tables are temporary and created on the spot.
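As a sketch, the estimate can be done like this; the average row size and overhead factor below are hypothetical placeholders (measure your own, e.g. from the Avg_row_length column of SHOW TABLE STATUS):

```python
# Back-of-envelope estimate of how many rows fit in one memory table.
avg_row_bytes = 200                     # hypothetical average row size
overhead = 1.5                          # rough factor for hash index + engine overhead
heap_limit_bytes = 4096 * 1024 * 1024   # max_heap_table_size set in Step 1

rows_that_fit = int(heap_limit_bytes / (avg_row_bytes * overhead))
print(f"roughly {rows_that_fit:,} rows per memory table")
```

With these made-up numbers the answer lands around 14 million rows, which matches the ~10 million chunks used later with some safety margin.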

Step 1: Increasing allowed size of Memory table to fit data

The first thing you need to do is raise MySQL's memory limits, because the fast sort relies on memory tables. Do so by adding the two size settings shown below under [mysqld] in the my.cnf file, typically located at /etc/my.cnf or /etc/mysql/my.cnf depending on your distribution.

my.cnf file

[mysqld]

datadir=/var/lib/mysql
socket=/var/lib/mysql/mysql.sock
..
max_heap_table_size=4096M
tmp_table_size=4096M

Adding these lines ensures that MySQL can now use up to 4 GB for storing memory tables. Feel free to increase this number if you have more memory, but make sure you don't push the machine into swapping.

Step 2: Dump data in memory table for indexing

Now create a memory table with the same structure as your final table.

create table client_data_memory like client_data_sorted;
alter table client_data_memory engine=Memory;

The second statement switches the engine to MEMORY. Keep in mind that the MEMORY engine holds all data in RAM, so if the MySQL server or the machine restarts, the data is gone. Never rely on memory tables as durable storage.

Next, figure out how many rows your memory table can hold. In a typical scenario this is on the order of 10 million, but the value depends on your system's memory and the limits set in Step 1. You can test it by inserting data from the source table into the memory table; past the limit, the insert aborts midway with an error like "table is full".

Insert into client_data_memory select * from client_data limit <test limit value>;

Ex:  Insert into client_data_memory select * from client_data limit 0,10000000;    # 10 million rows from 0th row

Step 4: Storing data from Memory table to MyISAM table

If you have on the order of 10 million rows, the process ends here for you. Just use an ALTER TABLE command to create the index on the memory table (3-4 minutes), then insert the memory table's data into client_data_sorted. You just saved yourself hours of pain. If you have more than 10 million rows, a single pass won't be enough; read on.

Alter table client_data_memory add index (<column name>);

(or sort the table)

Insert into client_data_sorted select * from client_data_memory;

If your memory table holds 10 million rows and you have around 40 million in total, you are better off repeating the steps above iteratively: sort 10 million, insert, truncate the memory table, load the next 10 million, sort, insert. Each pass takes progressively longer, but the whole thing still finishes far faster than a normal index build. If you have more than 40 million, read on. A world of pain awaits you…

Step 5: Create Merge table for accessing data

The iterative process above breaks down past roughly 50 million rows, because every insert into the sorted table also has to merge into an ever-larger index. Suppose your memory table holds 10 million rows and you have 180 million in total: create 18 temporary tables of engine type MyISAM.

create table data_temp0 like client_data_sorted;
alter table data_temp0 engine=MyISAM;

Now use the technique above to load the data 10 million rows at a time into each of these tables: data_temp0 gets rows 0-10 million, data_temp1 gets 10-20 million, and so forth.

truncate table client_data_memory;
insert into client_data_memory select * from client_data limit 0,10000000;
insert into data_temp0 select * from client_data_memory;

truncate table client_data_memory;
insert into client_data_memory select * from client_data limit 10000000,10000000;
insert into data_temp1 select * from client_data_memory;

…….

This will take quite a while; for me it was about 10 minutes per 10 million rows, so 180 minutes in total. While it runs, consider some of the more purpose-built alternatives like Hive; it is too late for this run, but it is worth a look.
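Rather than typing 18 near-identical statement groups by hand, a short script can generate them; the table names and the 10-million chunk size follow the walkthrough above:

```python
# Generate the TRUNCATE / INSERT statement groups for every 10-million chunk.
total_rows = 180_000_000
chunk = 10_000_000

statements = []
for i, offset in enumerate(range(0, total_rows, chunk)):
    statements.append(
        "truncate table client_data_memory; "
        f"insert into client_data_memory select * from client_data limit {offset},{chunk}; "
        f"insert into data_temp{i} select * from client_data_memory;"
    )

print(len(statements))   # 18 chunks for 180 million rows
print(statements[0])
```

You can pipe the output straight into the mysql client, or paste the groups one at a time to keep an eye on progress.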

Now create a merge table. A merge table sits on top of several MyISAM tables that all share the same structure; both conditions (identical structure, MyISAM engine) are mandatory. You can then use it as a single table, though every search runs against all the underlying tables and is therefore slower than on a true single table.

create table client_data_merge like client_data_sorted;
alter table client_data_merge engine=MERGE
union=(data_temp0,data_temp1,data_temp2,data_temp3,data_temp4,data_temp5,
data_temp6,data_temp7,data_temp8,data_temp9,data_temp10,data_temp11,
data_temp12,data_temp13,data_temp14,data_temp15,data_temp16,data_temp17)
INSERT_METHOD=LAST;
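With 18 sub-tables the UNION list is easy to get wrong by hand, so a one-liner can generate the full ALTER TABLE statement (table names as in the steps above):

```python
# Build the UNION=(...) clause so no sub-table is missed or duplicated.
tables = ",".join(f"data_temp{i}" for i in range(18))
sql = (f"alter table client_data_merge engine=MERGE "
       f"union=({tables}) INSERT_METHOD=LAST;")
print(sql)
```

If you used a different number of chunks, just change the range accordingly.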

Step 6: Final sorted table (if required)

For most temporary uses this merge table will do. You can use it normally, insert data into it, and so on. But if the limitations of merge tables are not acceptable, you have to build the final table from the merge table. You can use the merge table to dump the data sorted by the index columns, or simply run:

INSERT INTO <tblname> SELECT * FROM <merge table> ORDER BY <index1>,<index2>…

The insert is much faster this way because the data arrives already sorted, so MySQL only has to append rows and update the indexes. If this method helped, do let me know in the comments.


11 responses to this post.

  1. Posted by fuzzy on August 6, 2009 at 6:16 am

    great work and very well written.
    keep it up.

    Reply

  2. Great hack, and a great way to sort large tables effectively. Must say, I will have to bookmark this for future reference :-)

    Friendfeed also had a unique mechanism for storing schema-less data, and creating fast indices on them: http://bret.appspot.com/entry/how-friendfeed-uses-mysql

    Reply

  3. This is a sorting speed problem and databases are not the right tool for sorting large amounts of data. The fastest way I have found to index a large mysql table is to presort the data using the unix sort command.

    Reply

  4. Posted by Muhammad on November 16, 2009 at 10:53 am

    This method doesn't work for tables that have BLOB or TEXT columns, since MEMORY tables cannot contain BLOB or TEXT columns.
    see http://dev.mysql.com/doc/refman/5.0/en/memory-storage-engine.html

    Reply

  5. Posted by Mark on November 17, 2009 at 11:46 am

    Wow, great help, just used memory table bit, turned my reindex cron job from a 10 hour job into 25mins! amazing! Just to mention, when you have your completed table still as a memory table, it takes no time at all to convert it back to normal eg MyISAM (i know it seems obvious, but im guessing some people may just leave it in the memory, which is rather a waste)
    Simply use ALTER TABLE `table_name` engine=MyISAM;
    to convert back to MyISAM or whatever engine you use.

    Reply

  6. Posted by Cory on May 25, 2010 at 10:49 pm

    THANK YOU THANK YOU THANK YOU

    Reply

  7. Posted by Cory on May 25, 2010 at 10:51 pm

    Oh, I should add: 140M records; 6 hours -> 20 minutes.

    Reply

  8. Posted by David on July 5, 2010 at 6:02 pm

    Hi,

    Great tutorial and was very useful to handle 470M records.

    However, it does not truly solve the problem of wanting to place all records in a SINGLE table. In step 5 you simply say that you can insert the sorted table into a new single table — but this process takes days upon days (if not weeks or more).

    I’m still looking for a way to insert 470M records from my merged myisam table into a single myisam/innodb table. Running the sorted insert query was not practical in my case.

    Cheers,
    David

    Reply

  9. Posted by David on July 6, 2010 at 4:13 pm

    As a response to my own comment above: to get the merged table into a single InnoDB table, what I finally did is a SELECT … INTO OUTFILE … ORDER BY index1 on my merged table. Then:

    SET autocommit=0;
    SET unique_checks=0;
    SET foreign_key_checks=0;
    LOAD DATA INFILE … INTO TABLE … (destination table is innodb)
    COMMIT;
    SET unique_checks=1;
    SET foreign_key_checks=1;

    The whole process above took about 8 hours for nearly 500M records. Any other method kept going for days and days with very slow progress.

    Reply

  10. I agree with Doug wholeheartedly. Although this is a nice workaround, sorting large data sets should be done on Unix using the sort command. MUCH faster than attempting to do it after the data is already inserted into a database table.

    Reply

  11. Posted by Alain on March 9, 2012 at 12:10 pm

    Using partitions in mysql is also hugely advantageous.
    On a table with 300million records, adding a UNIQUE index across multiple columns was horribly slow (days to complete), moved it to partitioned table, and job completed in under 2 hours. And the resultant SELECTs benefitted with a performance boost too!

    Reply
