Internal PHP function usage

Warning: This blog post was published over two years ago. That is a long time in the development world! The story here may not be relevant, complete or secure. Code might be incomplete or obsolete, and even my current view on the subject might have (completely) changed. So please do read on, but use it with caution.


Posted on 25 Jul 2014

How many internal functions (things like count(), strpos(), array_merge() etc.) does PHP have? Depending on which version you use, and how many extensions you have loaded, somewhere between 1000 and 2000 would be a good guess. But how many of these internal functions are you REALLY using? I don’t hear many people talking about iconv_strlen(), is_soap_fault() or mb_http_output(), yet these functions do exist. And how often are people actually calling these functions?

One of the biggest (if not THE biggest) sources of PHP applications is obviously GitHub. With a simple search query, we can fetch repositories based on their language. To get the best results, I’ve only used repositories that GitHub marked as “PHP” repositories, with at least 50 stars, and without any forks. You can try this query on GitHub as well.

To automate things, I’ve set up a system in four steps:

step 1: fetch the query result from github.
step 2: download the repositories from the results.
step 3: parse the repositories for functions in the php files.
step 4: aggregate the result.

Step 1: fetching the query results

Fetching the query results is fairly easy. Using the GitHub API, you can extract the wanted repositories without much trouble. Each page gives you 30 results (the default page size), so I iterate over all the pages and store the results in a redis store. I’m aiming for 1000 repositories, which happens to be the limit of the search results on GitHub anyway.
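
The paging could be sketched roughly as follows. This is not the original tool: the helper names searchUrl() and pageCount() are made up for illustration, and the exact query string (language:php stars:>=50, sorted by stars) is an assumption based on the description above. Forks are left out by GitHub's repository search by default, so no extra qualifier is used for them.

```php
<?php
// Hypothetical helper: build the GitHub search URL for one result page.
// The query qualifiers are an assumption based on the description in the post.
function searchUrl(int $page, int $perPage = 30): string
{
    $query = http_build_query([
        'q'        => 'language:php stars:>=50',
        'sort'     => 'stars',
        'order'    => 'desc',
        'per_page' => $perPage,
        'page'     => $page,
    ]);

    return 'https://api.github.com/search/repositories?' . $query;
}

// Hypothetical helper: number of pages needed to collect $wanted results.
function pageCount(int $wanted, int $perPage = 30): int
{
    return (int) ceil($wanted / $perPage);
}
```

Looping from page 1 up to pageCount(1000) and fetching searchUrl($page) each time would cover the 1000 results the search API is willing to return. Actually fetching the pages (and storing them in redis) is left out here.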

Step 2: downloading the repositories

As I’m not really interested in anything related to version control or a repository’s history, I’m just downloading the archive of each repository from GitHub. This can be found at https://github.com/<account>/<repo>/archive/master.zip. With a bit of string manipulation, it’s quite easy to iterate over all repositories stored in redis and fetch the corresponding archive of the “master” branch.
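
The string manipulation amounts to something like the sketch below. The function name archiveUrl() is hypothetical, and it assumes the repository is stored as an "account/repo" string, which is how the GitHub API reports a repository's full name.

```php
<?php
// Hypothetical helper: turn "account/repo" into the archive URL of a branch.
function archiveUrl(string $fullName, string $branch = 'master'): string
{
    // Guard against anything that does not look like "account/repo".
    if (!preg_match('~^[\w.-]+/[\w.-]+$~', $fullName)) {
        throw new InvalidArgumentException("Not an account/repo name: $fullName");
    }

    return sprintf('https://github.com/%s/archive/%s.zip', $fullName, $branch);
}
```

Note that this simply assumes every repository has a "master" branch, as the post does.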

Step 3: parse the repositories

Parsing the repositories is a two-part process. First of all, I need to unpack the corresponding archive and iterate through all its files. However, I soon discovered that parsing every single file doesn’t work very well. With a bit of tweaking, I came up with the following conditions:

Filename must end in .php
Directory must not be called “test”, “tests”, “examples” or “documentation”.
File should be less than 1 megabyte.

Obviously, any unit-test files are not really helpful. The same goes for anything that is part of examples or documentation. This narrows our results down a bit to only “real” code.

Some repositories had some HUGE files around (mostly data-files), that would literally crash the tokenizer. Since those files are data, I just skip these files as well. Everything above 1 megabyte is considered either generated code, or just data.
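
Put together, the three conditions could look like the sketch below. The function name shouldParse() is hypothetical. Two details are assumptions and not taken from the original tool: directory names are compared case-insensitively at any depth of the path, and "1 megabyte" is read as 1024 * 1024 bytes.

```php
<?php
// Hypothetical filter: decide whether a file inside an unpacked repository
// should be handed to the tokenizer.
function shouldParse(string $relativePath, int $sizeInBytes): bool
{
    // Condition 1: only files ending in .php
    if (substr($relativePath, -4) !== '.php') {
        return false;
    }

    // Condition 2: skip tests, examples and documentation.
    // Assumption: match any directory in the path, ignoring case.
    $excluded = ['test', 'tests', 'examples', 'documentation'];
    foreach (explode('/', dirname($relativePath)) as $directory) {
        if (in_array(strtolower($directory), $excluded, true)) {
            return false;
        }
    }

    // Condition 3: skip huge (generated or data) files.
    // Assumption: 1 megabyte means 1024 * 1024 bytes.
    return $sizeInBytes < 1024 * 1024;
}
```

The path is expected to use forward slashes and to be relative to the root of the unpacked repository.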

Once we have unpacked our archive and can iterate over the correct PHP files, we have two functions that do the core work: token_get_all() and get_defined_functions(). The latter gives us a list of all internal functions known to the current PHP version (1937 functions, in my case), and token_get_all() returns all the tokens of a PHP file.
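
As a small illustration of the first of those two functions: get_defined_functions() returns an array with an 'internal' and a 'user' key. The sketch below turns the internal list into a lookup table. The helper names internalFunctions() and isInternal() are hypothetical, and the number of functions you get depends on your PHP version and loaded extensions.

```php
<?php
// Hypothetical helper: lookup table with lowercase internal function
// names as keys, so membership checks are a cheap isset().
function internalFunctions(): array
{
    $names = array_map('strtolower', get_defined_functions()['internal']);

    return array_flip($names);
}

// Hypothetical helper: function names are case-insensitive in PHP,
// so normalise the name before looking it up.
function isInternal(string $name, array $internal): bool
{
    return isset($internal[strtolower($name)]);
}

$internal = internalFunctions();
echo count($internal), " internal functions known to this PHP build\n";
```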

Instead of finding all the internal functions with strpos(), regular expressions or something worse, we use the token_get_all() function. This will “tokenize” a PHP file and return a list of all its tokens. Without going into what tokenizing is, this lets us easily find all the functions that are called by a script. The only thing left to do is check whether those functions are on the internal functions list, and if so, increase the counters for those functions. As with everything else, we save these counters in redis as well (per repository).
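
A minimal sketch of such a counter is shown below (PHP 7 or later). It is not the original tool, and the function name countInternalCalls() is hypothetical. The heuristic is an assumption: a T_STRING token directly followed by "(" counts as a function call, unless the token before it shows that it is a declaration, a method call, a static call, an instantiation or part of a namespaced name. As a consequence, fully qualified calls such as \strlen() are not counted.

```php
<?php
// Hypothetical counter: how often each internal function is called in $source.
// $internal is a lookup table with lowercase function names as keys.
function countInternalCalls(string $source, array $internal): array
{
    // Drop whitespace and comments so neighbouring tokens are meaningful.
    $skip = [T_WHITESPACE, T_COMMENT, T_DOC_COMMENT];
    $tokens = [];
    foreach (token_get_all($source) as $token) {
        if (is_array($token) && in_array($token[0], $skip, true)) {
            continue;
        }
        $tokens[] = $token;
    }

    // A name preceded by one of these tokens is not a plain function call.
    $notACall = [T_FUNCTION, T_OBJECT_OPERATOR, T_DOUBLE_COLON, T_NEW, T_NS_SEPARATOR];
    if (defined('T_NULLSAFE_OBJECT_OPERATOR')) {
        $notACall[] = T_NULLSAFE_OBJECT_OPERATOR; // only exists on PHP 8+
    }

    $counts = [];
    foreach ($tokens as $i => $token) {
        // Only identifiers that are directly followed by "(" are candidates.
        if (!is_array($token) || $token[0] !== T_STRING) {
            continue;
        }
        if (($tokens[$i + 1] ?? null) !== '(') {
            continue;
        }

        $previous = $tokens[$i - 1] ?? null;
        if (is_array($previous) && in_array($previous[0], $notACall, true)) {
            continue;
        }

        $name = strtolower($token[1]);
        if (isset($internal[$name])) {
            $counts[$name] = ($counts[$name] ?? 0) + 1;
        }
    }

    return $counts;
}

// Usage: build the lookup table once, then count per file.
$internal = array_flip(array_map('strtolower', get_defined_functions()['internal']));
print_r(countInternalCalls('<?php echo strlen("abc") + count([1, 2]);', $internal));
```

Because the tokenizer knows about strings and comments, a strlen( inside a string literal or a comment is not counted, which is exactly where strpos() or a regular expression would go wrong.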

Step 4: aggregate the results

So all that is left to do is to aggregate all the results from redis and output them. Unfortunately, there are some “flukes” in the data which skew the results a bit, so we have to clean up a bit.

I counted 35648 calls to class_exists() inside the googleads/googleads-php-lib repository.
The https://github.com/s9y/Serendipity/ repository uses define() for pretty much everything. I counted a total of 74004 defines.

I’ve removed these two anomalies from the results. I’ve scanned the results for anything that is used over 1000 times within a single repository. These are the only two that matched, so I consider everything else valid enough.
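
The aggregation and cleanup can be sketched as below. The function name aggregate() and the array layout (repository name mapped to per-function counts) are assumptions for illustration; the original counters lived in redis. The sketch drops every counter above the threshold automatically, which is one way to mechanise the cleanup described above.

```php
<?php
// Hypothetical aggregation: sum the per-repository counters into totals,
// setting aside any counter that exceeds $threshold within one repository.
function aggregate(array $perRepository, int $threshold = 1000): array
{
    $totals = [];
    $outliers = [];

    foreach ($perRepository as $repository => $counts) {
        foreach ($counts as $function => $count) {
            if ($count > $threshold) {
                // Remember the anomaly, but keep it out of the totals.
                $outliers[] = [$repository, $function, $count];
                continue;
            }
            $totals[$function] = ($totals[$function] ?? 0) + $count;
        }
    }

    // Most used functions first, keeping the function names as keys.
    arsort($totals);

    return ['totals' => $totals, 'outliers' => $outliers];
}
```

Taking array_slice($result['totals'], 0, 10, true) on the returned totals would then give a top 10.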

The results:

I’ve collected data from 967 repositories, sorted by stars. I was surprised to find an old, unmaintained repository of my own in there as well :). In total, 727693 calls to internal PHP functions are made, averaging 753 calls per repository.

The top 10 of most called functions:

Position	Function	Count
#1	count	31566
#2	substr	29402
#3	sprintf	25623
#4	is_array	24303
#5	strlen	22705
#6	define	18870
#7	str_replace	16672
#8	strpos	15565
#9	preg_match	15430
#10	in_array	14695

A complete list of all the called functions can be found in this gist.

Some unscientific conclusions, drawn from unscientific data, generated by unscientific tools, created by an unscientific developer:

Nobody is using preg_filter() or soundex(). Something that surprised me.
One repository is using phpcredits(). For what I do not know.
Of the array_* functions, array_merge() is used most often (11867 times), while the array_intersect_* functions are used the least. Also, nobody uses array_diff_uassoc() or array_diff_ukey().
The php 5.5 function array_column() is spotted in the wild only once.
assert() is used 1383 times.
1508 times base64_encode() is used, but only 831 times base64_decode().
Pretty much no repository uses the socket_* functions.

The “bad” ones:

exec() is used 1559 times and shell_exec() 185 times, while escapeshellarg() (504) and escapeshellcmd() (83) lag behind.
Unfortunately, eval is not checked, as it is a language construct, and not a function. Might be nice to check it later though.
We still find mysql_connect() 110 times, and mysql_query() is called 435 times.
97 times people dared to unload the spl autoload queue through spl_autoload_unregister().
Only 27 times people decided str_rot13() was a perfect function for their needs. I didn’t dare to find out why.
Somebody really doesn’t want you to debug your code: xdebug_disable() is called once.
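
On the eval point: since the tokenizer gives language constructs their own token type, checking for eval later on could be as simple as counting T_EVAL tokens. The sketch below was not part of this analysis, and the function name countEval() is hypothetical.

```php
<?php
// Hypothetical helper: count uses of the eval language construct in $source.
function countEval(string $source): int
{
    $count = 0;
    foreach (token_get_all($source) as $token) {
        // Language constructs have dedicated token types; eval is T_EVAL.
        if (is_array($token) && $token[0] === T_EVAL) {
            $count++;
        }
    }

    return $count;
}
```

Occurrences of the word inside string literals or comments are not counted, because those are tokenized as strings and comments.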

The tools used to generate this are so horrible that they will never see the light of day. But the data (repositories and function calls) is available, so happy processing!
