Internal PHP function usage
How many internal PHP functions (things like count(), strpos(), array_merge(), etc.) does PHP have? Depending on which version you use and how many extensions you have loaded, somewhere between 1000 and 2000 would be a good guess. But how many of these internal functions are you REALLY using? I don’t hear many people talking about iconv_strlen(), is_soap_fault() or mb_http_output(), yet these functions do exist. And how often are people actually calling them?
One of the biggest (if not THE biggest) sources of PHP applications is obviously GitHub. With a simple search query, we can fetch repositories based on their language. To get the best results, I’ve only used repositories that GitHub marked as “PHP” repositories, with at least 50 stars, and without any forks. You can try this query on GitHub as well.
To automate things, I’ve set up a system in four steps:
- step 1: fetch the query results from GitHub.
- step 2: download the repositories from the results.
- step 3: parse the PHP files in the repositories for function calls.
- step 4: aggregate the results.
Step 1: fetching the query results
Fetching the query results is fairly easy: the GitHub API lets you extract the wanted repositories directly. Each page gives you 30 results by default, so I iterate over all the pages and store the results in a Redis store. I’m aiming for 1000 repositories, which happens to be the limit of the search results for GitHub anyway.
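To make the paging concrete, here is a minimal sketch of how the search requests could be built. It is not the tool used for this post: the function names are made up for illustration, and the `sort`/`order` parameters are an assumption based on the results being sorted by stars. The sketch only builds URLs and reads repository names from an already decoded response, so the actual HTTP call and the Redis storage are left out.

```php
<?php
// Hypothetical helper: build the search URL for one result page.
// The qualifiers mirror the criteria in the text; GitHub leaves forks
// out of search results unless you explicitly ask for them.
function buildSearchUrl(int $page, int $perPage = 30): string
{
    return 'https://api.github.com/search/repositories?' . http_build_query([
        'q'        => 'language:php stars:>=50',
        'sort'     => 'stars',   // assumption: most starred first
        'order'    => 'desc',
        'per_page' => $perPage,
        'page'     => $page,
    ]);
}

// Number of pages needed to cover the targeted number of repositories.
function pagesNeeded(int $target = 1000, int $perPage = 30): int
{
    return (int) ceil($target / $perPage);
}

// Pull the "account/repo" names out of one decoded response page.
function repositoryNames(array $response): array
{
    return array_column($response['items'] ?? [], 'full_name');
}
```

Looping `$page` from 1 up to `pagesNeeded()` and feeding each decoded response through `repositoryNames()` would yield the list of names to store.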
Step 2: downloading the repositories
As I’m not really interested in anything related to version control or a repository’s history, I’m just downloading the archive of the “master” branch from GitHub. This can be found on https://github.com/<account>/<repo>/archive/master.zip. With a bit of string manipulation, it’s quite easy to iterate over all repositories stored in Redis and fetch the corresponding archive for each of them.
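The string manipulation could look roughly like the sketch below. The function name is hypothetical, and it assumes the repositories were stored as “account/repo” names.

```php
<?php
// Hypothetical helper: turn an "account/repo" name into the archive URL
// pattern mentioned in the text.
function archiveUrl(string $fullName, string $branch = 'master'): string
{
    $parts = explode('/', $fullName);
    if (count($parts) !== 2 || $parts[0] === '' || $parts[1] === '') {
        throw new InvalidArgumentException("Expected account/repo, got: $fullName");
    }

    return sprintf(
        'https://github.com/%s/%s/archive/%s.zip',
        rawurlencode($parts[0]),
        rawurlencode($parts[1]),
        rawurlencode($branch)
    );
}
```

For example, `archiveUrl('s9y/Serendipity')` gives `https://github.com/s9y/Serendipity/archive/master.zip`.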
Step 3: parse the repositories
Parsing the repositories is a two-part process. First, I need to unpack the corresponding archive and iterate through its files. I soon discovered that simply parsing every file doesn’t work very well, so with a bit of tweaking I came up with the following conditions:
- Filename must end in .php
- Directory must not be called “test”, “tests”, “examples” or “documentation”.
- File should be less than 1 megabyte.
Obviously, unit-test files are not really helpful, and neither is anything that is part of the examples or documentation. This narrows our results down to only “real” code.
Some repositories had some HUGE files around (mostly data files) that would literally crash the tokenizer. Since those files are data, I just skip them as well. Everything above 1 megabyte is considered either generated code or just data.
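The three conditions can be sketched as a single predicate. This is only an illustration: the names are hypothetical, “1 megabyte” is read here as 1024 * 1024 bytes, and matching directory names case-insensitively is an assumption of this sketch.

```php
<?php
const MAX_BYTES = 1048576; // assumption: 1 megabyte = 1024 * 1024 bytes
const SKIPPED_DIRS = ['test', 'tests', 'examples', 'documentation'];

// Hypothetical helper: decide whether a file inside an unpacked
// repository should be handed to the tokenizer.
function shouldParse(string $relativePath, int $sizeInBytes): bool
{
    // Condition 1: the filename must end in .php.
    if (substr($relativePath, -4) !== '.php') {
        return false;
    }

    // Condition 2: no directory on the path may be a test, example or
    // documentation directory (compared case-insensitively here).
    $segments = explode('/', str_replace('\\', '/', $relativePath));
    array_pop($segments); // drop the filename itself
    foreach ($segments as $segment) {
        if (in_array(strtolower($segment), SKIPPED_DIRS, true)) {
            return false;
        }
    }

    // Condition 3: the file must be smaller than 1 megabyte.
    return $sizeInBytes < MAX_BYTES;
}
```

Only whole directory names are compared, so a path like src/testing/Foo.php still passes.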
Once we have unpacked our archive and can iterate over the right PHP files, two functions do the core work: token_get_all() and get_defined_functions(). The latter gives us a list of all internal functions known to the current PHP version (1937 functions, in my case), and token_get_all() returns all the tokens of a PHP file.
Instead of finding all the internal functions with strpos(), regular expressions or something worse, we use the token_get_all() function. This “tokenizes” a PHP file and returns a list of all its tokens. Without going into what tokenizing is, we can easily find all the functions that are called by a script this way. The only thing left to do is check whether those functions are on the internal functions list and, if so, increase the counters for those functions. As with everything else, we save these counters in Redis as well (per repository).
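A minimal sketch of that counting step follows. It is an approximation, not the original tool: the function name is made up, and the rule used to recognise a call (a name token directly followed by an opening parenthesis that is not a method call, a declaration or an instantiation) is an assumption of this sketch. On PHP 8, qualified names such as \strlen() are tokenized differently and are not counted here.

```php
<?php
// Hypothetical helper: count calls to internal functions in PHP source.
function countInternalCalls(string $code, array $internalFunctions): array
{
    // Drop whitespace and comments so neighbouring tokens are adjacent.
    $tokens = [];
    foreach (token_get_all($code) as $token) {
        if (is_array($token)
            && in_array($token[0], [T_WHITESPACE, T_COMMENT, T_DOC_COMMENT], true)) {
            continue;
        }
        $tokens[] = $token;
    }

    $known = array_flip($internalFunctions);

    // A name preceded by one of these is not a plain function call.
    $notACall = [T_OBJECT_OPERATOR, T_DOUBLE_COLON, T_FUNCTION, T_NEW];
    if (defined('T_NULLSAFE_OBJECT_OPERATOR')) {
        $notACall[] = constant('T_NULLSAFE_OBJECT_OPERATOR'); // PHP 8+
    }

    $counts = [];
    foreach ($tokens as $i => $token) {
        if (!is_array($token) || $token[0] !== T_STRING) {
            continue; // not a bare name
        }
        if (($tokens[$i + 1] ?? null) !== '(') {
            continue; // a name without a call
        }
        $previous = $tokens[$i - 1] ?? null;
        if (is_array($previous) && in_array($previous[0], $notACall, true)) {
            continue; // method call, declaration or "new Foo("
        }

        $name = strtolower($token[1]); // function names are case-insensitive
        if (isset($known[$name])) {
            $counts[$name] = ($counts[$name] ?? 0) + 1;
        }
    }

    return $counts;
}
```

It would be called per file, for example as `countInternalCalls(file_get_contents($path), get_defined_functions()['internal'])`, with the per-file counts summed per repository.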
Step 4: aggregate the results
So all that is left to do is to aggregate all the results from Redis and output them. Unfortunately there are some “flukes” in the data which skew the results a bit, so we have to clean up a bit.
- I counted 35648 calls to class_exists() inside the googleads/googleads-php-lib repository.
- The s9y/Serendipity repository (https://github.com/s9y/Serendipity/) uses define() for pretty much everything. I counted a total of 74004 defines.
I’ve removed these two anomalies from the results. I’ve scanned the results for any function that is used more than 1000 times within a single repository. These are the only two that matched, so everything else I consider valid enough.
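The aggregation and the anomaly scan can be sketched together. The function name and the input shape (repository name mapped to per-function counts) are assumptions, and dropping only the offending count rather than the whole repository is one possible reading of “removed”.

```php
<?php
// Hypothetical helper: sum per-repository counters into overall totals,
// setting aside any single count above the threshold as an anomaly.
function aggregateCounts(array $perRepository, int $threshold = 1000): array
{
    $totals = [];
    $anomalies = [];

    foreach ($perRepository as $repository => $counts) {
        foreach ($counts as $function => $count) {
            if ($count > $threshold) {
                // Keep a record of what was left out, and why.
                $anomalies[] = [$repository, $function, $count];
                continue;
            }
            $totals[$function] = ($totals[$function] ?? 0) + $count;
        }
    }

    arsort($totals); // most called functions first
    return ['totals' => $totals, 'anomalies' => $anomalies];
}
```

Fed with the 74004 define() calls from s9y/Serendipity, the sketch would report that entry under `anomalies` and leave it out of the totals.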
I’ve collected data from 967 repositories, sorted by stars. I was surprised to find an old, unmaintained repository of my own in there as well :). A total of 727693 calls to internal PHP functions are made, an average of about 753 calls per repository.
The top 10 most called functions:
A complete list of all the called functions can be found in this gist.
Some unscientific conclusions, drawn from unscientific data, generated by unscientific tools, created by an unscientific developer:
- Nobody is using soundex(). Something that surprised me.
- One repository is using phpcredits(). For what I do not know.
- From the array_* functions, array_merge() is used the most times (11867), while array_intersect_* functions are used the least. Also, nobody uses the
- The PHP 5.5 function array_column() is spotted in the wild only once.
- 1383 times,
- 1508 times base64_encode() is used, but only 831 times
- Pretty much no repository uses socket_* methods.
The “bad” ones:
- exec() is used 1559 times, shell_exec() 185 times, while escapeshellcmd() (83) stay behind.
- eval is not checked, as it is a language construct and not a function. Might be nice to check it later though.
- Still 110 times we find a
- mysql_query() is called 435 times.
- 97 times people dared to unload the spl autoload queue through
- Only 27 times people decided str_rot13() was a perfect function for their needs. I didn’t dare to find out why.
- Somebody really doesn’t want you to debug your code: xdebug_disable() is called once.
The tooling that generated this is so horrible that it will never see the light of day. But the data (repositories and function calls) is available, so happy processing!