Top 10 Apache Top Level Projects

Warning: This blog post was published over two years ago. That is a long time in the development world! The story here may no longer be relevant, complete or secure. Code might be incomplete or obsolete, and even my current view on the subject might have (completely) changed. So please do read further, but use it with caution.
Posted on 22 May 2011
Tagged with: Apache

When you say Apache, most developers immediately think of a web server. And this of course is true: the Apache httpd web server is the most used web server today and the number of users keeps growing every day. What not everybody knows is that Apache is a foundation that hosts many other open source (web) projects. Here is a short introduction to my top 10 of interesting projects for the (PHP) programmer.

“The ASF is made up of nearly 100 top level projects that cover a wide range of technologies. Chances are if you are looking for a rewarding experience in Open Source, you are going to find it here.”, according to the Apache Foundation website.

These “Top Level Projects” are projects that have received the highest Apache status and can be considered part of the Apache Foundation. In order to receive that status, a project has to go through an incubator period (http://incubator.apache.org/). There are many promising projects inside the Apache incubator right now. Some of them will make it to top level project (TLP) status, but of course some of them won’t. I’ve made a quick and dirty top 10 list of projects from the Top Level Project directory that might be useful for a (PHP) web developer.

1. httpd

How else to start the list than with Apache’s main project: httpd. This highly flexible, stable and fast web server has dominated the internet for a long time now, and even though many lightweight web servers are on the rise, it still sits firmly in its seat as the number one web server. Every web programmer should at least be aware of the pros (and cons) of this web server and know how to configure it properly. There are lots of options and extensions available, and even writing your own extensions is a breeze.

2. Hadoop

Hadoop is a mix of several smaller projects that are closely tied together. Its most important feature is the ability to map/reduce data efficiently. By using map/reduce you can process data in a parallel (clustered) way, so your total processing time will be much shorter. For instance, say you have a system that outputs daily statistics. The map step turns that data into key/value pairs. The reduce step then aggregates all values that share a key to create overall statistics. Because of the sheer volume of that data, you need an efficient infrastructure to achieve that goal, and that is what Hadoop provides.

Hadoop has its own map/reduce implementation. On top of it runs HBase, a database system that you can see as the open source version of Google’s Bigtable, the system that Google uses for its own search engine.
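To make the map and reduce steps a bit more concrete, here is a minimal single-process sketch in plain Python. It does not use Hadoop at all; the records, the function names and the "hits per page" statistic are all made up for illustration. A real Hadoop job would run the same three phases (map, group by key, reduce) spread over many machines.

```python
from collections import defaultdict

# Hypothetical daily statistics: (date, page, hits) records.
DAILY_STATS = [
    ("2011-05-20", "/home", 120),
    ("2011-05-20", "/blog", 45),
    ("2011-05-21", "/home", 98),
    ("2011-05-21", "/blog", 60),
    ("2011-05-22", "/home", 143),
]


def map_record(record):
    """Map step: turn one input record into a (key, value) pair."""
    _date, page, hits = record
    return page, hits


def group_by_key(pairs):
    """Collect all values that share a key (done by the framework in Hadoop)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_values(key, values):
    """Reduce step: aggregate all values of one key into a single result."""
    return key, sum(values)


def run_job(records):
    """Run map, grouping and reduce one after the other in a single process."""
    pairs = [map_record(record) for record in records]
    grouped = group_by_key(pairs)
    return dict(reduce_values(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(run_job(DAILY_STATS))
```

Because each record is mapped independently and each key is reduced independently, both steps can be handed out to different machines, which is where the speedup comes from.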

3. Mahout

Mahout is a machine learning library that is best known for collaborative filtering. That is a complex name for a system that can answer questions like: “If a person likes book 1 and book 2, will they also like book 3?”, or that can give suggestions based on other types of data you feed it. It finds the correlations in your data so you don’t have to, which matters because doing this yourself gets very difficult as the data grows (both in processing time and in size).

A very complex system made easy by Mahout.
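The book question above can be sketched with a very small co-occurrence recommender. This is not Mahout code and not necessarily the algorithm Mahout would pick; it is a toy illustration in Python, with hypothetical readers and books, of the idea that books often liked together are good suggestions for each other.

```python
from collections import Counter
from itertools import combinations

# Hypothetical data: which books each reader liked.
LIKES = {
    "alice": {"book1", "book2", "book3"},
    "bob": {"book1", "book2", "book3"},
    "carol": {"book1", "book2", "book4"},
    "dave": {"book2", "book4"},
}


def cooccurrence(likes):
    """Count how often each ordered pair of books is liked by the same reader."""
    counts = Counter()
    for books in likes.values():
        for a, b in combinations(sorted(books), 2):
            counts[(a, b)] += 1
            counts[(b, a)] += 1
    return counts


def recommend(liked, likes):
    """Score every book not yet liked by how often it co-occurs with liked books."""
    scores = Counter()
    for (a, b), n in cooccurrence(likes).items():
        if a in liked and b not in liked:
            scores[b] += n
    return scores.most_common()


if __name__ == "__main__":
    # Someone who likes book 1 and book 2: which other books score highest?
    print(recommend({"book1", "book2"}, LIKES))
```

With four readers this is trivial, but the number of pairs grows quickly with the size of the catalogue, which is the time and size problem mentioned above.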

4. Tika

Tika is a file format parser which allows you to parse many different kinds of documents into raw data. For instance, it can easily be used to capture text and metadata from PDF files, MS Office formats and even music formats like MP3. It’s even possible to write your own parser if you have special needs or proprietary formats.

Installing and using it is easy:

  • Download the Tika source.
  • Run “mvn install”.
  • Run “java -jar tika-app/target/tika-app-0.9.jar --text” on the document you want to parse.

5. Solr

Solr is an enterprise search platform built on the Apache Lucene project. It is a highly scalable and fast search system. If you have done fulltext searches in MySQL, you will love the power of Solr. There are a few common misconceptions about it though.

First: Solr is NOT a database, it’s a search engine. There is a big difference here. Even though you feed Solr the documents you would like to search in, it does not need to return the actual documents. It stores the information in an index that is optimized for searching, which is why searches are blazingly fast. (Actually, it CAN store the actual documents, but there are other and better solutions for that.)

For instance: you have a web shop with 100,000 articles and you would like your customers to be able to search through them. Instead of querying your database with queries like “select * from articles where description like ‘%searchterm%’”, you send the query to Solr. Solr will then return (depending on the configuration) a list of document IDs (in this case, probably the article IDs of your articles), which you use to fetch the articles from your database. Sounds like a hassle, but keep in mind that the speed is absolutely amazing. Furthermore, it can be configured to use things like stemming and alternative word lists, so looking for “elevator” will also return documents containing the word “lift”, and searching for “trees” will return “tree” documents as well. Complex queries are possible too, for instance if you want to search for the word “Windows” and “98 or XP”, but only when they are in close vicinity (so it matches “windows 98” or “windows xp”).
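The idea behind returning IDs, stemming and alternative word lists can be shown with a toy inverted index. The sketch below is plain Python and has nothing to do with how Solr is actually implemented or configured: the articles, the synonym table and the "strip a trailing s" stemmer are simplifying assumptions made up for this example.

```python
from collections import defaultdict

# Hypothetical web shop articles, keyed by article ID.
ARTICLES = {
    1: "Replacement cable for a lift",
    2: "Artificial trees for the office",
    3: "Elevator maintenance handbook",
    4: "Bonsai tree starter kit",
}

# Alternative word list: map each alternative onto one canonical term.
SYNONYMS = {"lift": "elevator"}


def normalize(word):
    """Lowercase, apply a very crude stemmer, then resolve synonyms."""
    word = word.lower()
    if word.endswith("s") and len(word) > 3:
        word = word[:-1]  # "trees" becomes "tree"
    return SYNONYMS.get(word, word)


def build_index(articles):
    """Build an inverted index: normalized term -> set of article IDs."""
    index = defaultdict(set)
    for article_id, text in articles.items():
        for word in text.split():
            index[normalize(word)].add(article_id)
    return index


def search(index, term):
    """Return the sorted article IDs that contain the (normalized) term."""
    return sorted(index.get(normalize(term), set()))


if __name__ == "__main__":
    index = build_index(ARTICLES)
    print(search(index, "elevator"))
    print(search(index, "trees"))
```

The search only ever touches the index and hands back IDs; looking up the full article is left to the database, which is the division of labour described above.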

There is another very sweet option in Solr called faceted search. Just like in a database schema, you can define different “fields” for your documents. For instance, when you store picture information, you might store dimensions, size, author, GPS information, EXIF data, camera type etc. In the end you would be able to search for all pictures made by a specific author, or every picture that is at least 1024x768.

With faceted search, Solr returns these classifications together with the number of matching documents in each of them. You can see faceted search in action on many advanced search forms on the internet nowadays.
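What a facet result looks like can be sketched in a few lines of Python. Again, this is not Solr and not its API: the picture records, the field names and the helper functions are hypothetical, and the sketch only shows the "classification plus count" idea on a handful of in-memory documents.

```python
from collections import Counter

# Hypothetical picture documents with a few searchable fields.
PICTURES = [
    {"id": 1, "author": "anna", "camera": "Canon", "width": 1024, "height": 768},
    {"id": 2, "author": "anna", "camera": "Nikon", "width": 800, "height": 600},
    {"id": 3, "author": "ben", "camera": "Canon", "width": 1600, "height": 1200},
    {"id": 4, "author": "anna", "camera": "Canon", "width": 2048, "height": 1536},
]


def at_least(min_width, min_height):
    """Build a filter that keeps pictures of at least the given dimensions."""
    return lambda doc: doc["width"] >= min_width and doc["height"] >= min_height


def facet_counts(docs, field, matches=lambda doc: True):
    """Count the matching documents per distinct value of the given field."""
    return Counter(doc[field] for doc in docs if matches(doc))


if __name__ == "__main__":
    # Facet on author over all pictures.
    print(facet_counts(PICTURES, "author"))
    # Facet on camera type, restricted to pictures of at least 1024x768.
    print(facet_counts(PICTURES, "camera", at_least(1024, 768)))
```

These counts are what advanced search forms show next to each filter option, for example "Canon (3)".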

Also, do check out the Solarium project, a Solr client library for PHP.

6. Nutch

Nutch is a spider / web crawler. It can crawl sites for specific content and store it inside a Lucene index, which can be searched with Solr. One of the nice things about Nutch is that you can run it on a Hadoop cluster to parallelize the crawling.

7. SpamAssassin

SpamAssassin is a very good spam filter. Not only are there many different rulesets you can use, but it can also use different methods to scan for spam, including checking against blacklists and using heuristic and Bayesian scans. Chances are that a mail server you use has SpamAssassin running on it.

So this application is not so much for developers, as it is for everybody :-)
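For the curious: rule-based filters like this work by adding up a score for every rule a message triggers and comparing the total against a threshold. The Python sketch below illustrates that principle only. The three rules, their names and their scores are invented for this example and are not real SpamAssassin rules, and the threshold of 5.0 is simply chosen to mirror a commonly used default.

```python
import re

# Hypothetical rules: (name, pattern, score). Not real SpamAssassin rules.
RULES = [
    ("FREE_MONEY", re.compile(r"free money", re.IGNORECASE), 3.0),
    ("ALL_CAPS_SUBJECT", re.compile(r"^Subject: [A-Z !]+$", re.MULTILINE), 1.5),
    ("CLICK_HERE", re.compile(r"click here", re.IGNORECASE), 1.0),
]

# Assumed threshold: messages scoring at least this much are flagged as spam.
THRESHOLD = 5.0


def score(message):
    """Return the total score and the names of all rules that matched."""
    hits = [(name, points) for name, pattern, points in RULES
            if pattern.search(message)]
    total = sum(points for _name, points in hits)
    return total, [name for name, _points in hits]


def is_spam(message):
    """A message is flagged when its total score reaches the threshold."""
    total, _hits = score(message)
    return total >= THRESHOLD


if __name__ == "__main__":
    spam = "Subject: WIN NOW!\n\nGet FREE MONEY today, click here"
    ham = "Subject: Meeting notes\n\nPlease click here for the agenda"
    print(score(spam), is_spam(spam))
    print(score(ham), is_spam(ham))
```

A single matching rule is rarely enough to flag a message; it is the combination of several weak signals that pushes a mail over the threshold.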

8. Subversion

You might not know this, but Subversion has been part of the Apache Foundation since the end of 2009. Subversion is still one of the most popular version control systems, but more and more users are making the switch to distributed systems like Git or Mercurial. Until everybody has moved on, Subversion is still taking care of all your commits. If you do not know about Subversion (or version control in general), you should definitely take a look at it (and stop living under that rock!).

9. Apache JMeter

JMeter is an awesome tool to generate test cases for your website. It can automate walkthroughs so you don’t have to test every single page on your site by hand over and over again. It can be used for testing your site, or even for benchmarking it by running many tests in parallel so that it simulates many users visiting your website at once.

10. Apache Traffic Server

Apache Traffic Server is a very fast and very flexible caching server initially developed by Yahoo. Since April 2010 it has been an Apache top level project, and it can compete with other caching servers like Varnish. It is used at Yahoo as a forward and reverse proxy and handles over 400 terabytes of data on a daily basis.