Archive for the ‘open-source’ Category

Creating Indexes/Sorting on very large tables in Mysql

If your dataset is anywhere between 1 million to  180 million then you have come to the right blog.  Normally a procedure like this will require atleast 1 day or more depending on dataset. But I will teach you how to get results in as quick as 3 hrs. You will be able to execute queries on indexes fully utilizing them without going through the painful process of creating them on the the entire table. Recently at work I had to deal with a dataset of 180 million  and this entry gives details of the work flow that I adopted.  I will not effectively sort the complete table but will tell you how to achieve the same effect in less than 2-3 hrs for 180 million rows. I tried well known methods to achieve indexing. I tried to use Alter table statement after 2 hrs stopped it. I tried to dump the data in hope of inserting it in an indexed table but after 1 hr stopped that as well. Then I explored Mysql functionality a bit to discover 3 features which can really reduce time, those were Memory tables, Merge tables and Partitioning(not necessary). Fortunately sorting using merge sort is a recursive job, so in the process of sorting 180 million rows I also had to learn how to sort 2 million faster than normal index creation. So the first thing that you have to do is  find out the order of rows that you have.

If you have less than 1 million rows then you are better off using mysql’s inbuilt commands

Alter table <tblname> create index (<column name>);

If you have more than than then read on…..

The concepts I will be using are as follows:

Memory Tables:

These tables are hash maps that mysql stores in memory. These tables are temporary (in the sense they disappear after mysql server restart)  and have limited rows. However they are lighting fast on index creation or sorting.


Somebody in Mysql finally realized that to be fast database has to split the data across servers. Finally in version 5 they have introduced partitioning. If you have the luxery of more than 1 server you can use this. I did not have this luxury so will do indexing without it.

Merge tables:

So these are 1 server counterpart of Partitioning. If you want to split your dataset into multiple tables and run single query on all of them you can use Merge tables. They have their own share of problems, but for general functionality of insertion, deletion, searching and dumping, they work great. The only 2 major limitations one that they only work if base tables are MyISAM and searching on them is multiples of logn(see below if interested)

Irrelevant Mathematical jargon(feel free to skip)

So if you search  table on indexed columns then search takes Log(n) time (n = number of tuples). So imagine you have 100*n size of table and you create 100 sub tables of n each. Now searching on one table will take Log(n) time and total time is 100*log(n)=log(n to the power 100). If it were a single table it would have taken Log(100*n) =Log(100)+log(n). So in a single table it scale logarithmically but merge table scales exponentially.

I will explain the steps that I took below, you can customize them to your dataset.

There are several ways of creating index however mine will require you to have quite a powerful machine and a reliable power supply. Atleast 2gb of ram is required and more is better here. If you have around 2 million rows then skip to step 4.

Step 0

This step is about reconnaissance. It will determine the numerical data in the next steps. You need to know the approximate amount of space your row takes. This will determine the size of your memory table. You can determine the size of memory table by trial and error also but a numerical estimate will help.  In the steps below I will assume initial table as client_data and final table as client_data_sorted. All other tables are temp and generated on the spot.

Step 1: Increasing allowed size of Memory table to fit data

The first thing you need to do is extend the allowed memory limit of mysql. To sort faster we will need memory tables. Do so by adding the lines marked as red below under  [mysqld] in my.cnf file. Typically located in /etc/my.cnf or /etc/mysql/my.cnf depending on your distribution.

my.cnf file



Adding these lines will ensure that mysql can now take 4GB space for storing memory tables.Feel free to increase this number if you have higher memory but ensure that you dont encounter swapping.

Step 2: Dump data in memory table for indexing

Now you need to create a  memory table with same structure as your final table.

create table client_data_memory like client_data_sorted;
alter table client_data_memory engine=Memory;

The second line alters the engine to memory. The only thing to keep in mind is that memory engine keeps all the data in memory hence if your mysql server restarts or machine restarts then all the data is lost. Ensure that you never rely on them as a reliable storage.

Figure out  how many rows your memory table can contain. In a typical scenario this would be order of 10 million. This value will change depending on your system’s memory and values set in step 1.  You can test it out by inserting data from source table to memory table. After the limit the process will interrupt in the middle giving error like table is full.

Insert into client_data_memory select * from client_data limit <test limit value>;

Ex:  Insert into client_data_memory select * from client_data limit 0,10000000;    # 10 million rows from 0th row

Step 4: Storing data from Memory table to MyISAM table

If you have order of 10 million rows then the process stops for you here. Just use an alter table command to create index on memory table(3-4 minutes) then insert the memory table data into client_data_sorted table. You just saved yourself  hours of pain. If you have more than 10 million then  skip this step.

Alter table client_data_memory add index (<column name>);

(or sort the table)

Insert into client_data_sorted select * from client_data_memory;

If you can store 10million rows in your memory table and total tables are around 40 million then you are better off iteratively repeating the above steps. Merely sort 10 million then insert, then truncate memory table , then insert next 10 million , sort then insert. The process will take exponentially more time every time but still will finish far faster than normal.  If you have more than 40 million then read on. A world of pain awaits you….

Step 5: Create Merge table for accessing data

The above process will not work iteratively for you if you have more than 50 million rows as while inserting you have to merge too. Suppose your memory can store 10 million and you have 180 million rows then create 18 temporary tables of engine type MyISAM.

create table data_temp1 like client_data_sorted;
alter table data_temp1 engine=MyISAM;

Now use the above technique to insert data 10 million a piece into each of these tables. So data_temp0 will contain 0-10 million then data_temp1 will have 10 miliion to 20 million and so forth.

truncate table client_data_memory;insert into client_data_memory select * from client_data limit
0,15000000;insert into data_temp0 select * from client_data_memory;
truncate table client_data_memory;insert into client_data_memory select * from client_data limit
10000000,15000000;insert into data_temp1 select * from client_data_memory;


This will take quite a long time, for me it took 10 min for every 10 million so 180 min in total. Meanwhile consider some of the more worthwhile alternatives like  hive, its too late for it now but its worth a look. This is also useful.

Now create a merge table. A merge table is created on top of several MyISAM tables of same structure. Its important to have same structure and MyISAM tables. You can use this as  a single table however all the searching is performed on all tables and hence is slower than single table.

create table client_data_merge like client_data_sorted;
alter table client_data_merge engine=merge union=


9,data_temp10,data_temp11,data_temp12) INSERT_METHOD=LAST;

Step 5: Final Sorted table(if required)

For most temporary uses this merge table will work. You can use this table normally. Insert data into the table etc etc. However if the limitations of merge tables are not acceptable you have to create the final table  from this merge table. You can use merge tables to dump the data in a sorted fashion according to indexes or simply use

Insert into <tblname> Select * from <merge table> order by <index1>,<index2>…..

This way you get to Insert much faster as the data is already sorted and mysql simply has to insert the data and update the indexes. If this method helped.. do let me know in the comments.


Enable Multitouch(two-fingers) on Synaptics trackpad on Linux

This could be your Trackpad

This could be your Trackpad

Very few people know that every synaptics trackpad supports multitouch and as long as your OS supports it  you can always use multitouch to enhance your computers interface. Apple has patented several UI design elements ,hence has inhibited Windows and Linux to come out with similar UI elements like pinch and zoom, scroll etc.   

microsoft-windows-vista-logoIf you are using Windows xp, vista or Windows 7 then click here

signature_color_high_res_0 Install Synaptics driver  in ubuntu

apt-get install  xserver-xorg-input-synaptics 

Now  edit your xorg.conf file

sudo vi /etc/X11/xorg.conf

find the input section

Section “InputDevice” Identifier “Synaptics Touchpad”

Add these lines in it

Option “HorizEdgeScroll” “on”

Option “VertEdgeScroll” “on”
Option “VertTwoFingerScroll” “on
Option “HorizTwoFingerScroll” “on”
Option “CornerCoasting” “on”
Option “PalmDetect” “on”
Option “CircularScrolling” “on”
Option “CircScrollTrigger” “3”


MPEG 4 Part -2 / AVI video codec performance review: Divx vs Xvid which is better?

MPEG-4 Part 2 is a video compression technology developed by MPEG. It belongs to the MPEG-4 ISO/IEC standard (ISO/IEC 14496-2). It is a discrete cosine transform compression standard, similar to previous standards such as MPEG-1 and MPEG-2. Several popular codecs including DivX, Xvid and Nero Digital are implementations of this standard.

If you were looking for MP4/Mpeg4 Part 10  CoreAVC vs FFmpeg video codec performance review then click here


Though there are several codecs that perform the encoding owing to the universal use of this encoding in dvds, common implementations are Nero, Divx, Xvid etc. Among these arguably the most popular codecs are Xvid and Divx. We will be seeing side by side comparison of both in the article. You can find both codecs individual history on their wiki pages. I find the history of Divx quite amusing,They adopted their name Divx to mock a company that used to charge users for viewing by the hour and themselves used spyware in their later version hence becoming a subject of mockery. You might have already noticed that “Xvid” is “Divx” backwards.

I will talk of decoding performance here. Decoding probably makes much more sense to a normal user as encoding is done by few but everybody decodes videos. Decoding is quite a subjective criteria hence besides my own comments I have also attached screenshots that allow the user to make his/her own judgment. The codecs were taken from Final build site.

Xvid Logo
There is a very good chance that any video that you come across is encoded with Xvid. Across my search of encoders I found Xvid is the most preferred one for encoding. Its opensource hence free, much more configurable hence lets you draw the last byte’s worth. Though it seems logical that if video is encoded with encoder its respective decoder should also be the best, however I found the results didn’t fare as I expected.

Divx Logo

The good old closed source version. Divx is costly and costing is on per PC basis hence encoding costly. Decoding can be done for free using their web player. The pro version is not free however there are other ways of using the codec in Media Player classic shipped with Klite Codec as explained below.

Here is the screenshot of Gspot codec analyzation. This provides the codec information with which the file was encoded in:

GSPOT Screenshot of Sherlock Holmes Important things that can be highlighted from the Gspot codec is

1. Video Encoding was done by Divx version 5.1.1

2. Audio Encoding was done by MPEG-1 Layer 3 more popularly known as MP3

3. The default resolution is 352×272 and should be judged at that resolution

Further information can also be seen from the screenshot

PS: you can also use other codec information utils like AVI codec etc.

Following are the screenshots taken from the video with different filters(All the left screenshots are Divx and right ones are Xvid)
Divx Filter with Sherlock Holmes

Xvid filter with Sherlock Holmes

Screenshot with Divx Filter

Screenshot with Xvid Filter

Click the pics and view them at full-size and compare them. As the videos clearly depicted the Divx Filter far outperforms Xvid Filter in visual apeal.

There is a very important decoder that I have not mentioned here, that is the FFmpeg decoder(default decoder in most open players like mplayer, media player classic and vlc). This decoder generally uses libavcodec to decode the media files and even though the performance is not as good as the above decoders but it is good enough but the difference in CPU usage is extreme. Ex for a normal movie where Divx and Xvid take as much as 25-30% CPU usage, FFmpeg will take barely 10-15%. So if its quality you are looking for then look no further than Divx but for performance nobody matches FFmpeg

So overall the results are quite ambigous. If you are looking for quality then Divx, for free encoding/decoding Xvid and for decoding performance FFmpeg . You can download the filters from here, and you can learn how to switch your filter from here. So what are you going to do with your player???

MP4/ HD video codec performance review: CoreAVC vs FFmpeg which is better ?

MPEG-4 is used for AV data for web (streaming media) and CD distribution, voice (telephone, videophone) and broadcast television applications.MPEG-4 adds new features such as (extended) VRML support for 3D rendering, object-oriented composite files (including audio, video and VRML objects), support for externally-specified Digital Rights Management and various types of interactivity. AAC (Advanced Audio Codec) was standardized as an adjunct to MPEG-2 (as Part 7) before MPEG-4 was issued.The key parts to be aware of are MPEG-4 part 2 (MPEG-4 SP/ASP, used by codecs such as DivX, Xvid, Nero Digital and 3ivx and by Quicktime 6) and MPEG-4 part 10 (MPEG-4 AVC/H.264, used by the x264 codec, by Nero Digital AVC, by Quicktime 7, and by next-gen DVD formats like HD DVD and Blu-ray Disc).

If you were looking for MPEG 4 Part 2/ AVI video codec performance review: Divx vs Xvid then click here


As told previously 2 parts of MPEG-4 format are popular. There are several codecs available that implement  MPEG-4 with different level of perfection, popular ones are Quicktime(Part 10), FFmpeg(Part 10), Xvid(Part 2), Divx(Part 2) and CoreAVC(Part 10). After consulting several forums, I found 2 of those very talked about CoreAVC and FFmpeg. I have compared the 2 codecs below.

FFmpeg Logo
Whether you are a codec geek or not, this is the most probable codec that you have been using since time  immemorial.  Its free, opensource, can allow a lot of tweaking, plays a major number of containers and codecs. I will try not to go beyond H.264 in this article but one thing is for sure, whether you are using CoreAVC or not, you definitely have to use FFmpeg due to its overwhelmingly large number of codec compatibility.

CoreAVC or Corecodec Logo
Its a closed source alternative for decoding H.264, infact its one of the accepted formats for Blue-ray. When it came out it astonished everyone with the sheer speed and performance outputs. Its known to exceed several hardware implementations (reminds me of John Carmack implementation of square root in Quake 3 which was faster than FPU). Corecodec people call it the fastest codec on earth.(and I am nobody to argue).

Left one is CoreAVC and right is FFmpeg

x264 encoded movie Sunshine decoded with CoreAVC

x264 encoded movie Sunshine decoded with CoreAVC

x264 encoded movie Sunshine decoded with CoreAVC

x264 encoded movie Sunshine decoded with CoreAVC
x264 encoded movie Sunshine decoded with CoreAVC

x264 encoded movie Sunshine decoded with CoreAVC

Doesnt take a rocket scientist to notice that better one. However when I discussed this on outside forums, I faced considerable shouting and screaming so I did more tweaking and testing and found another bit of information that convinced me to use CoreAVC for good.
Left is CoreAVC and right is FFmpeg.
Top is W/O post processing and bottom is with Post Processing

CPU usage between FFmpeg and CoreAVC
Cpu Cycles with CoreAVC
CPU cycles consumption when using FFmpeg codec
Difference between CPU usage when using FFmpeg and CoreAVC is huge. It can be seen from the above CPU usage or by seeing CPU cycles consumption . Considering that I have a Core 2 duo system with 1.8Ghz this difference is a substantial one, infact with preprocessing the player hung at 100% cpu usage in FFmpeg.

I found out to my amazement that Corecodecs are actually the fastest codecs in the world. I used all possible codecs at my disposal(VLC, ffmpeg, quicktime, windows media player and Nero) but CoreAVC not only gave better performance compared to all of them but also astonishingly smooth one at that. I ran it on 1080p trailer of 10000BC on MPC using quicktime alternative(had to rename .mov to .hdmov to use coreAVC) and performance was mind boggling.
Bottomline, I’d suggest you one thing. just go right now to Final build site and download the coreavc format and start using it. You can find out how to do that here.

Sins of Opensource

Well, you are an opensource user. Your geek rating is much higher than any of your other puny friends on facebook. You bask in the glory of being clean and untainted. But dont confuse being opensource as “not being evil”. The real trick is to use the good will of opensource and still use it for your nefarious (defination) purposes. Behold the sins of Opensource and how/where to use it in your own company:

  1. As Strategy (aka Greed): Most evil are those who employ opensource as a strategy, this not only allows them to publisize there product but also gives them a GNU loving status. Google like always leading the way with Google Gears and Libjingle. By making Gears opensource it caused a 3 way effect
    1. It allowed the people to look into the code and ensure that there are no google spies sitting in it, hence building trust.Without this nobody would opt for it
    2. It also became instant favorite to all those who hate The Microsoft way, giving them the satisfaction of using an opensource alternative on an opensource platform(firefox).
    3. Allowing other people to step in with it and start making more tools to replace more softwares(ZOHO, Offline WIkipedia etc)

    So using it as a stratigic weapon you can assasinate a lot more quickly than otherwise.

  2. As a User (aka Gluttony): You reading this blog in firefox and me writing it in wordpress doesnt make things a whole lot better. If you have used a product of opensource and have never cared to:
    1. Submit a bug report.
    2. donate for it(and are never planning to).
    3. Contribute back.

    then you are evil. You are another one of those selfish people who just wanna use the product for free and run away with the benifits. I wouldnt blame you for it but we are all sinners

  3. As a developer(aka Pride): If you are a developer associated with a opensource community because of any of the following reasons then you are also a sinner:
    1. Even if you see your own project suffering because of unsufferable GUI, you still stand by and keep working on core internal library which is already ruthlessly efficient. Since GUI is one of the most boring things to work on you choose to ignore it and because of that your project suffers. This is the sin that leads to poor interfaces of so many opensource software.
    2. You have been laid off, and just want to code to keep your skills sharp and to build your CV and have no interest in the concept of “greater good” and benifit of all man kind.
  4. As a startup(aka Sloth): If you are a startup or for that matter any company and you are using opensource software because of any of the reasons then you are also sinners:
    1. Using your expert skills you are able to disprove Microsoft’s TCO model and use opensource as cheap software.
    2. Use plenty of opensource tools in your company but have never released any of the scripts etc neither have cared enough to put even a simple logo on your official site.
    3. You rely of free service of opensource community that is 10 times larger than that of any company’s support staff.
    4. Worse are all are those developers who use the free experience provided by opensource community in development of features that they eventually want to put in there own software(best oppurtunity is GSOC).

Disclaimer: I’d like to apologize to anybody who feels offended. This article fundamentally represents how opensource is practically being used by different catagory of people(coders, users and companies). There are several other ways of employing/deploying opensource which will be discussed in future articles.

Ubuntu: Bridging the gap of linux usability

GNU license (whether 1 or 2 or…) makes sense to few. But whats more important in this world is compatibility. As Linux grows, the arrogance of few are blocking the freedom that comes with it. Its sad when somebody giving something as “free” also wants people using it to be bound by their philosophy, its like a religion being forced upon others.

Recently bought a PCI-express gigabit ethernet card(Agere chipset ET1310)].  I put it in linux (secretly  praying for a stable driver), reboot and pop goes ubuntu, restricted driver found, turns out the restricted driver is put up, under BSD license. And so another day goes by where ubuntu saves countless hours of setup time and futile searching and debugging, thanx to a few people who can accept other people for not following their philosophy as bible.