PHP vs Perl?

Disclaimer / Motivation

First of all, it is important to note that this isn't supposed to be a benchmark. Take the results for what they are worth: they only show that PHP performed better than Perl for this particular program.

For my thesis, I need to analyse large sets of data. The data is stored in DataSeries, a format developed by HP Labs specifically for this kind of thing. I need to do several things with the data, so I created a script to do some basic analysis.

I had several options: I could write a shell script, or use PHP, Perl or something else instead. I realized that writing a shell script for this would be very complex, so I weighed PHP against Perl. My feeling is that PHP is more suitable for the Web and Perl is more suitable for sysadmin tasks, parsing, etc. That pointed to Perl, but my knowledge of Perl is very basic, so I would need to learn it first. Unfortunately I am short on time, so I ended up choosing PHP since it would be very easy for me to write the script.

The Test

So I wrote a first version of the PHP script. The script is not optimized at all, but it works as expected. The problem with a large dataset is that parsing it takes ages, and my first run was no exception. I started wondering whether a Perl equivalent would be a lot faster than the script I wrote, so I asked a friend to write an equivalent script in Perl. He wrote it and I ran both scripts at the same time on the same server. My friend noticed later that he had left an extra instruction in the main loop that doesn't exist in the PHP version, but both scripts were already running, so I didn't abort the run. The results were quite surprising: the PHP version took 551m56.349s and the Perl equivalent took 712m16.792s.

The Run

Here is how I ran the scripts, followed by the respective output:
            $ time nfsdsanalysis -Z common archive/lindump_total.ds | ./stats_basic.php > stats_basic.txt
            real    551m56.349s
            user    605m38.263s
            sys     28m44.936s
        
            $ time nfsdsanalysis -Z common archive/lindump_total.ds | perl stats_basic.pl > stats_basic1.txt
            real    712m16.792s
            user    677m48.698s
            sys     66m32.526s
        
The file lindump_total.ds is an 80 GB file. The output of nfsdsanalysis (which is piped to the script) looks something like this:
            # Extent, type='Trace::NFS::common'
            packet_at source source_port dest dest_port is_udp is_request nfs_version transaction_id op_id operation rpc_status payload_length record_id
            1253831523212739 3a163121 790 01c633c7 2049 TCP request V3 21ff6e38 3 lookup null 56 0
            1253831523212743 3a163121 790 01c633c7 2049 TCP request V3 21ff6e38 3 lookup null 56 1
            1253831523212746 3a163121 897 01c633c7 2049 TCP request V3 2eff9a5e 1 getattr null 36 2
            1253831523212748 3a163121 897 01c633c7 2049 TCP request V3 2eff9a5e 1 getattr null 36 3
            1253831523214877 2a2622c2 2049 1a264421 790 TCP response V3 2ffdae28 3 lookup 0 216 4
            1253831523214886 2a2622c2 2049 1a264421 897 TCP response V3 2ffca15e 1 getattr 0 88 5       
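
To make the input format concrete, here is a minimal Python sketch of how one of these lines could be split into named fields. It is not the stats_basic script itself (that code is not shown here): the column names are copied from the header line above, while the names parse_line and count_operations, and the choice to count rows per operation, are assumptions made purely for illustration.

```python
from collections import Counter

# Column names copied from the header line of the sample output above.
FIELDS = (
    "packet_at source source_port dest dest_port is_udp is_request "
    "nfs_version transaction_id op_id operation rpc_status "
    "payload_length record_id"
).split()


def parse_line(line):
    """Return a dict for a data row, or None for comments, headers and blanks."""
    parts = line.split()
    # Data rows have one token per column and start with a numeric timestamp.
    if len(parts) != len(FIELDS) or not parts[0].isdigit():
        return None
    return dict(zip(FIELDS, parts))


def count_operations(lines):
    """Count rows per (operation, request/response) pair (illustrative statistic)."""
    counts = Counter()
    for line in lines:
        row = parse_line(line)
        if row is not None:
            counts[(row["operation"], row["is_request"])] += 1
    return counts


# A few lines taken from the sample output above.
SAMPLE = [
    "# Extent, type='Trace::NFS::common'",
    "packet_at source source_port dest dest_port is_udp is_request "
    "nfs_version transaction_id op_id operation rpc_status payload_length record_id",
    "1253831523212739 3a163121 790 01c633c7 2049 TCP request V3 21ff6e38 3 lookup null 56 0",
    "1253831523212746 3a163121 897 01c633c7 2049 TCP request V3 2eff9a5e 1 getattr null 36 2",
    "1253831523214877 2a2622c2 2049 1a264421 790 TCP response V3 2ffdae28 3 lookup 0 216 4",
]

for (operation, kind), n in sorted(count_operations(SAMPLE).items()):
    print(operation, kind, n)
```

For the real run, the same functions could be fed sys.stdin instead of SAMPLE, so that the script sits at the end of the pipe like the PHP and Perl versions.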
        

Optimizations

Some people asked me to run the scripts in isolation, i.e., not in parallel like last time. I got optimized versions from several people, and I even got some versions in other languages like Python and C.

Apparently, the Perl version was so slow because of a serious performance bug related to list assignment. Thanks to Pedro Figueiredo for the tip. Just by installing Perl 5.10.1 I got a 37% performance improvement. Even though the improvement was significant, Perl still came in last.

Below you can see the results of the runs of the optimized scripts in the different languages. The results are ordered by run time, from fastest to slowest:

C Version (by Jose Celestino):
            $ time nfsdsanalysis -Z common archive/lindump_total.ds | ./stats_basic > stats_basic4.txt
            real    202m37.347s
            user    265m46.817s
            sys     9m39.888s


PHP Version (optimized by Diogo Neves, and modified by me since there were several bugs):
            $ time nfsdsanalysis -Z common archive/lindump_total.ds | ./stats_basic_optimized.php > stats_basic5.txt
            real    270m48.511s
            user    444m43.480s
            sys     8m56.562s


Python Version (by Andre Cruz):
            $ time nfsdsanalysis -Z common archive/lindump_total.ds | python stats_basic.py > stats_basic3.txt
            real    322m55.569s


Perl Version (original by Carlos Pires, optimized version by Joao Pedro):
            $ time nfsdsanalysis -Z common archive/lindump_total.ds | perl stats_basic_optimized.pl > stats_basic2.txt
            real    419m11.267s
            user    508m26.699s
            sys     16m20.717s
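
To put the wall-clock ("real") times above on a common scale, a small Python sketch like the following converts them to seconds and expresses each one relative to the C version. The helper name to_seconds is made up for this example; the time stamps are copied from the runs above.

```python
def to_seconds(stamp):
    """Convert a `time`-style stamp such as '202m37.347s' to seconds."""
    minutes, seconds = stamp.rstrip("s").split("m")
    return int(minutes) * 60 + float(seconds)


# "real" times copied from the optimized runs above.
REAL = {
    "C": "202m37.347s",
    "PHP": "270m48.511s",
    "Python": "322m55.569s",
    "Perl": "419m11.267s",
}

baseline = to_seconds(REAL["C"])
for name, stamp in REAL.items():
    secs = to_seconds(stamp)
    # Hours of wall-clock time, then the slowdown factor relative to C.
    print(f"{name:<7}{secs / 3600:6.2f} h  {secs / baseline:5.2f}x")
```

Seen this way, the optimized Perl run takes roughly twice as long as the C version, with PHP and Python in between.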
    


Some Conclusions

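Some people have already asked me why the user time is greater than the real time. Keep in mind that the server where I ran these scripts has 8 cores: the user time is CPU time added up across cores, so when the processes in the pipeline run in parallel it can exceed the wall-clock time.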
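
As a quick check of that explanation, the Python sketch below parses a block of `time` output and divides the CPU time (user + sys) by the real time. The function names are invented for this example, and the figures are the ones from the optimized PHP run above. A ratio above 1 means that, on average, more than one core was busy during the run.

```python
import re

# Matches lines such as "real    270m48.511s".
TIME_RE = re.compile(r"^(real|user|sys)\s+(\d+)m([\d.]+)s$")


def parse_time_output(text):
    """Return a dict mapping 'real', 'user' and 'sys' to seconds."""
    result = {}
    for line in text.splitlines():
        match = TIME_RE.match(line.strip())
        if match:
            label, minutes, seconds = match.groups()
            result[label] = int(minutes) * 60 + float(seconds)
    return result


def cpu_to_wall_ratio(times):
    """CPU time (user + sys) divided by wall-clock time."""
    return (times["user"] + times["sys"]) / times["real"]


# Output of the optimized PHP run, copied from above.
OPTIMIZED_PHP = """
real    270m48.511s
user    444m43.480s
sys     8m56.562s
"""

times = parse_time_output(OPTIMIZED_PHP)
print(f"CPU/wall ratio: {cpu_to_wall_ratio(times):.2f}")
```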

It really surprised me that Perl performed the worst; I wasn't expecting it. I also ran PHP without APC and the results were similar.