IMHO, the current state of search is depressing. This is not a new realization for me. It was seven or eight years ago that I first imagined a social search engine which would not rely solely on algorithms to determine the relative importance of search results but would consider both machine and end-user feedback. This was in the early days of Nutch. I began researching the possibility of using Nutch as the underlying core engine for such an endeavor, rounded up some short-term investment capital, and so on. Unfortunately, this was also at the peak of my struggle with Obsessive Compulsive Disorder (OCD) and my efforts eventually fell through.
Over the years I have watched as promising engine after promising engine has come along and, in its turn, failed to take the lead or even maintain its momentum. Years have passed and at each step of the way I have said, “It must be just around the corner… This is ages in technology time.” Even Google came out with SearchWiki; while not a perfect implementation, it was a huge step in the right direction. For the last year or two I’ve been using Zakta, and I’ve spent time on almost every other social search engine currently (or previously) available – yet I find that in the long run they have all failed me.
So here I am, so many years later, still longing for just such an engine. I’ve written on this blog about the topic before, but I will write again. In this post I will specifically propose the formation of an endeavor to create a social search engine, and I hope it will foster some interest in the community. I am neither ready nor able to undertake such an endeavor myself – but I am interested in being part of one.
Open Source: Ensuring Continuity
It is worth noting at this juncture that I intend for this project to be open source. Too many times I have lost the social search data I have accumulated because a specific engine has folded. My hope is that the resulting project would be open source with commercial implementations and would provide a significant amount of data portability between engines, in case one engine should fold. We’ll talk more about the open source and portability aspects of the project later in this proposal.
What is Social Search?
Before we jump into a discussion about how to build a social search engine, it is necessary first to define what is meant by social search. Unfortunately, the term is used for several concepts which are very different from one another.
There are the real-time search engines which focus on aggregating information from various social media networks – and sometimes prioritizing links based on their popularity within a network. Examples include the now-defunct Topsy, the no-longer-real-time OneRiot, and the now-defunct Scoopler.
There are the engines which are focused on finding humans – e.g. allowing one to gather information about a person. Wink eventually became this sort of engine, and Spokeo would be another example. They are essentially white pages on steroids.
Finally there is what I mean by social search – and I would use another term but there is no other term I am aware of which is so widely used to delineate this type of engine (and I want to ensure the widest possible audience). It is sometimes called a “human-powered search engine.” Google and Wikia may have come closest by terming it a “Wiki” (SearchWiki and Wikia Search), but it seems to me that there is a need for an entirely new term that better and more precisely defines the idea… perhaps one result of this proposal and its aftermath will be just such a term.
In this section I will delineate what I believe are the core required features for a social search engine. I believe an engine which included these features would constitute a 1.0 release. There is certainly room for numerous improvements, but this defines a baseline by which to measure the proposal’s progress. I am not infallible, and I am sure there are aspects of the baseline which should be edited, removed, or replaced – I am open to suggestions.
- Web Crawler – The engine must include a robust web crawler which can index the web, not just a subset of sites (e.g. Nutch).
- Interpretive Ability – The engine must be able to interpret a wide variety of file formats, minimizing the invisible web (e.g. Tika).
- Engine – The engine must be able to quickly query the aggregated web index and return results in an efficient manner (e.g. Nutch).
- Search Interface – The engine must include a powerful search interface for quickly and accurately returning relevant results (e.g. Solr).
- Scalability – The engine must be scalable to sustain worldwide utilization (e.g. Hadoop).
- Algorithms – In addition to the standard automated algorithms for page relevance the system must integrate human-based feedback including:
  - Positive and negative votes on a per-page basis.
  - The ability to add and remove pages from query results.
  - Influence of votes based on a calculation of user trustworthiness (merit).
  - Promotion of results by administrative users.
- Custom Results – The results must be customized for the user. While the aggregate influence of users affects the results, the individual user is also able to customize results. One should see a search page which reflects the results one has chosen and not the results one has removed.
- Ability to annotate individual entries.
- Portability – The engine should define a standard format for user data which can be exported and imported between engines. This should include customized query results, annotations, votes, removed and added pages, etc. This will be available to the user for export/import at any time. While additional data may be maintained by individual engines, the basic customizations should be portable.
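To make the Algorithms and Custom Results items a little more concrete, here is a minimal Python sketch of how machine relevance, merit-weighted votes, administrative promotion, and a user's own additions and removals might be combined. Everything in it is an assumption of mine for illustration: the names (`Vote`, `UserPrefs`, `rerank`), the `vote_weight` parameter, and the ordering rule (the user's own additions first, then promoted pages, then everything else by score). None of it comes from Nutch, Solr, or any existing engine.

```python
from dataclasses import dataclass, field


@dataclass
class Vote:
    user: str
    url: str
    value: int  # +1 for a positive vote, -1 for a negative vote


@dataclass
class UserPrefs:
    # Hypothetical per-user customizations for a single query.
    added: list = field(default_factory=list)   # pages the user added
    removed: set = field(default_factory=set)   # pages the user removed


def rerank(base_scores, votes, merit, promoted=(), vote_weight=0.1, prefs=None):
    """Blend machine relevance with merit-weighted votes, then apply overrides.

    base_scores: {url: relevance score from the underlying engine}
    merit:       {user: trust weight}; users without merit count for nothing
    promoted:    urls pinned near the top by administrative users
    """
    scores = dict(base_scores)
    for vote in votes:
        if vote.url in scores:
            scores[vote.url] += vote_weight * vote.value * merit.get(vote.user, 0.0)

    prefs = prefs or UserPrefs()
    ranked = sorted(scores, key=lambda url: scores[url], reverse=True)

    # Assumed ordering: the user's additions, then promoted pages, then the rest.
    results, seen = [], set()
    for url in list(prefs.added) + list(promoted) + ranked:
        if url in prefs.removed or url in seen:
            continue
        seen.add(url)
        results.append(url)
    return results


# Example data (made up for illustration).
base = {"a.example": 1.0, "b.example": 0.9, "c.example": 0.5}
votes = [
    Vote("alice", "b.example", +1),
    Vote("bob", "a.example", -1),
    Vote("newbie", "c.example", +1),
]
merit = {"alice": 2.0, "bob": 1.0, "newbie": 0.1}

print(rerank(base, votes, merit))
print(rerank(base, votes, merit, promoted=["c.example"],
             prefs=UserPrefs(added=["d.example"], removed={"a.example"})))
```

In this sketch a trusted user's vote moves a page further than a newcomer's, while the individual user's removals always win on their own results page. A real engine would need a far more careful merit calculation than a fixed lookup table.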
I’m sure I’m missing some essentials – please share with me whatever essentials I have forgotten that come to your mind.
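As a starting point for discussion of the Portability item, here is a rough Python sketch of what an export/import round trip could look like using plain JSON. The field names (`format_version`, `queries`, `added`, `removed`, `votes`, `annotations`) and the functions are hypothetical; no such standard exists yet, and defining one would be part of the project.

```python
import json

FORMAT_VERSION = 1  # hypothetical version tag for the export format


def export_user_data(user_id, queries):
    """Serialize one user's customizations to a JSON string.

    queries: {query text: {"added": [...], "removed": [...],
                           "votes": {url: 1 or -1},
                           "annotations": {url: note}}}
    """
    document = {
        "format_version": FORMAT_VERSION,
        "user": user_id,
        "queries": queries,
    }
    return json.dumps(document, indent=2, sort_keys=True)


def import_user_data(text):
    """Parse an export, rejecting versions this engine does not understand."""
    document = json.loads(text)
    if document.get("format_version") != FORMAT_VERSION:
        raise ValueError("unsupported export format version")
    return document["user"], document["queries"]
```

The version tag is there so that an engine can refuse data it cannot interpret rather than silently misreading it. Engines could store extra data of their own, but only the fields in the shared format would be expected to survive a move.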
Starting from Zero?
It is not necessary for this project to begin from nothing. Significant portions of the work toward creating an open source search engine have already been undertaken – largely by Apache’s Nutch project. The available code should be utilized and, with customization, could integrate social search features. This would allow some of the most significant aspects of the project to be offloaded to already existing projects.
Additionally, it might be hoped that companies and individuals who have previously created endeavors in this direction would open source their code. For example, Wikia was built on Nutch and the code, including the distributed crawler (GRUB) and UI (Wikia), was released into the open source world.
What We Need
Now the question becomes, “What do we need?” and more importantly, “Whom do we need?”
First off, we could use donated hosting. Perhaps one of the larger cloud-based hosting companies would consider offering us space for a period of time? I’m thinking here of someone like Rackspace, Amazon Web Services, or GoGrid.
Secondly, we’d need developers. I’m not a Java developer…though I’ve downloaded the code and am preparing to jump in. I also don’t have a ton of time – so depending on me to get the development done…well, it could take a while.
Thirdly, we’d need content curators…and I think this is key (and also one of the areas I love the most). We’d need people to edit the content and make the results awesome. These individuals would be “power users” whose influence on results would be more significant than that of a new user. With time, other individuals could increase their reputation, but this would seed us with a trusted core of individuals who would ensure that the results returned would be high quality right from the get-go for new users.
Finally, we’d need some designers. I’m all for simplicity in search – but goodness knows most of us developers have very limited design abilities and an aesthetic touch here and there would be a huge boon to the endeavor.
At this juncture it’s all about gathering interest: finding projects that have already begun the process, looking for old hidden open source code that may be of use, etc. Leave a comment if you’d like to be part of the discussion.
Current Open Source Search Engines
- DataparkSearch – GNU GPL, diverged from mnGoSearch in 2003, coded in C and CGI.
- Egothor – Open source, written in Java, currently undergoing a complete from-scratch rewrite for version 3.
- Grubng – Open source, distributed crawler.
- BeeSeek – Open source, P2P, focuses on user anonymity.
- Yioop! (SearchQuarry) – GNU GPLv3, documentation is very informative.
- Heritrix – Open source, by Archive.org for their web archives.
- Seeks Project – AGPLv3, P2P, fairly impressive project which attempts to take social search into consideration.
- OpenWebSpider – Open source, written in .NET, appears to be abandoned.
- Ex-Crawler – Open source, Java, impressive, last update released in 2010.
- Jumper Search – Open source, social search, website appears to be down, currently linking to SourceForge (SF).
- Open Search Server – Open source.