Google search is not broken

Posted on January 14, 2011 at 00:19 in search

Google search is not broken.  And, Google search is broken.  Let’s start with the not broken. 

Google search is not broken because Google users are not motivated enough to move off Google:

  1. For how most people use Google, it meets their expectations.
  2. Most people don’t believe the grass is greener on the other side.

So when you read the “broken” blog posts, keep in mind that broken is in the eye of the beholder, and mainstream Google users don’t write many posts.  This is an obvious point but it is important to keep this context when we discuss how Google search is broken.  I’ll write about how Google is broken in future posts.  I’m not a search expert but have encountered signal to noise challenges while working on algorithms in other fields.

Back to not broken.  Why don’t mainstream Google users see Google search as broken?  I’d love to see a histogram and analysis of use cases for Google search but here’s a crude estimate:

  1. 20%: As a global bookmark (see data below)
  2. 20%: News
  3. 15%: Images and videos
  4. 10%: Entertainment (sports, movies, music, celebrities, TV, etc.)
  5. 15%: Commerce-related
  6. 20%: Everything else

Google is clearly not broken for the first three categories (there are better alternatives for subsets, but Google’s results are very good for mainstream queries in those categories).  See appendix for data on #1.

Entertainment and commerce are in the worst shape.  That’s for further discussion.  But suffice it to say that Google search is not broken across the whole of either category.

Category six can’t be painted with a broad brush.  However, it is perhaps the most interesting.  A part of it is long tail search.  Google, while not great, is better than any other engine for long tail searches – Google simply indexes more sites faster and ranks them better.  The rest of this category needs to be broken down to be properly analyzed.

So, bottom line, Google is not only not broken, but is the best choice, period, for at least 60% of broad brush use cases and maybe 80%.  So even if you crudely divide up the remaining 20% to 40% evenly among Google, Bing and niche/vertical search engines, Google is likely the best choice for 75%+ of all searches.  Hardly broken.
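
The arithmetic behind that estimate can be checked with a short sketch. It assumes, as the paragraph does, that Google wins 60% to 80% of use cases outright and that the remainder splits evenly three ways (Google, Bing, niche/vertical engines). The function name and parameters are made up for illustration.

```python
def google_best_share(clear_win: float, contenders: int = 3) -> float:
    """Share of searches where Google is the best choice, assuming it wins
    `clear_win` outright plus an even 1/contenders slice of the remainder."""
    remainder = 1.0 - clear_win
    return clear_win + remainder / contenders


# Low and high ends of the "at least 60% and maybe 80%" range.
for clear_win in (0.60, 0.80):
    overall = google_best_share(clear_win)
    print(f"{clear_win:.0%} outright -> {overall:.1%} overall")
```

Under these assumptions the result runs from about 73% to about 87%, so the 75%+ figure sits near the low end of the range.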

Appendix: data for the global bookmark use case (category #1), included because it is surprising if you don’t have firsthand knowledge of it.  Experian Hitwise data for 2010 searches helps quantify how Google is used:

  1. Queries for getting to Facebook and other popular sites make up the top 10 Google search queries.  Queries like “Facebook” alone account for 3.5% of searches.
  2. Facebook had 8.9% of total web visits and 3.5% of the search queries, while YouTube had 2.7% of the visits and 1.1% of the search queries.  Long-tail websites see similar ratios – I was shocked to learn this from LocalReplay (more of the startup learnings here) – which suggests that when a user wants to get to a known site, he Googles the site name somewhere between 20% and 40% of the time.
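
As a rough check on item 2, the sketch below divides each site's share of search queries by its share of web visits. This is a crude proxy, not a measurement: it only approximates "how often a visit starts with a site-name search" if total queries and total visits are of comparable size, which the Hitwise figures quoted here do not state. The dictionary layout and function name are illustrative.

```python
# Shares quoted in the post (Experian Hitwise, 2010), in percent.
shares = {
    "Facebook": {"visits": 8.9, "queries": 3.5},
    "YouTube": {"visits": 2.7, "queries": 1.1},
}


def query_to_visit_ratio(site: str) -> float:
    """Share of search queries divided by share of web visits for a site."""
    s = shares[site]
    return s["queries"] / s["visits"]


for site in shares:
    print(f"{site}: {query_to_visit_ratio(site):.0%}")
```

Both ratios come out at roughly 40%, the top of the 20% to 40% range.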

So, Google is used heavily just to get to a site that the user already knows he wants to get to.  It could be around 20% of Google’s total traffic.

    car stereo + internet

    Posted on December 16, 2010 at 15:55 in internet

    I want a car stereo with a mobile broadband connection, receiver and a separate interface to customize my car audio experience from the web – customized algorithmically but with a layer of individual curation to bound the algorithms.

    Traffic and weather “channels” will push content (broadcast signal or web streaming) to me based on my location (mobile triangulation is precise enough, or GPS can provide it) and future location (partially based on the Google Maps query I did last night). Algorithms will push “station one” to me if it is 8:00 and that station does traffic at the top of the hour, or push station two if its timing is a better match, or push the most recent cached update if none matches, or push me the station with the highest ratings for traffic. When Twitter and similar data is better curated, I can listen to tweets about an accident that just happened at an intersection that I’m due to hit in ten minutes.
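
    The station-picking rules above could be sketched roughly as follows. This is one reading of the paragraph, treating the rules as a fallback chain: best time match first, then a cached update, then the highest-rated station. The `Station` fields, the `max_wait` threshold and the function names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass
class Station:
    name: str
    traffic_minutes: Tuple[int, ...]  # minutes past the hour when traffic airs
    rating: float                     # hypothetical listener rating for traffic


def minutes_until_traffic(station: Station, minute: int) -> int:
    """Minutes from `minute` (0-59) until the station's next traffic report."""
    return min((m - minute) % 60 for m in station.traffic_minutes)


def pick_traffic_source(
    stations: Sequence[Station],
    minute: int,
    cached_update: Optional[str] = None,
    max_wait: int = 2,
) -> Optional[str]:
    """Return the name of the station (or cached update) to play."""
    if stations:
        # 1. The station whose traffic report best matches the current time.
        best = min(stations, key=lambda s: minutes_until_traffic(s, minute))
        if minutes_until_traffic(best, minute) <= max_wait:
            return best.name
    # 2. No station matches closely enough: use the most recent cached update.
    if cached_update is not None:
        return cached_update
    # 3. Nothing cached either: fall back to the highest-rated station.
    if stations:
        return max(stations, key=lambda s: s.rating).name
    return None


stations = [
    Station("station one", (0, 30), rating=3.5),
    Station("station two", (10, 40), rating=4.5),
]
print(pick_traffic_source(stations, minute=0))   # top of the hour
print(pick_traffic_source(stations, minute=9))   # just before :10
print(pick_traffic_source(stations, minute=15, cached_update="cached report"))
print(pick_traffic_source(stations, minute=15))  # nothing close, nothing cached
```

    A real system would presumably weigh location, ratings and timing together instead of checking them strictly in order. The chain here only mirrors the order in which the paragraph lists them.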

    My favorite music will be pulled in from Internet radio services like Pandora. If I want to surf randomly, but based on my preferences, other stations will pull in local broadcast stations that meet my criteria. Programmed, smart “seek”. Same for sports, news, weather etc. Similar model for all other content in the cloud – podcasts, audio books, MP3s etc. Ditto for other content that I may want in audio form – voicemails, texts converted to voice, etc.

    The algorithms that build my channels will incorporate social networking and social graph, recommendation engines and crowdsourcing, along with my preferences and curation. Channels that feature content that is most in play in my social network. Or least in play if that’s my persuasion. Brand new content that I might not know about but is recommended based on my preferences, listening history and social graph. Etc. When my wife drives my car, she switches to her profile – as long as I’m not a passenger ; ). And that’s just from a non-professional driver’s perspective – there are more interesting use cases for professional drivers, from trucking to FedEx to taxis.

    I think we’ll see this model of individually customized pushes of content slices in many other areas too, e.g. TV, as a specific type of signal to noise solution delivered via web services. But specific use cases like the car audio experience could be the first to develop, with the fewest barriers in the way and the most opportunity for the various players.

      21st century research

      Posted on December 8, 2010 at 19:12 in healthcare and medicine, science

      “Daily Aspirin Linked to Steep Drop in Cancer Risk,” screams the headline about a recent study. You and I and six billion other people should be data too, but we are not among the 0.0004% of the population represented by this study, so the study is not nearly as important as the headline may suggest.

      Data is a gas that fuels progress, but the engine of medicine and healthcare – biotech, medical research, human biology, etc. – is operating on a few drops of kerosene.  Meanwhile we accumulate data every day that would fuel more advances in medical science in a few years than we’ve seen in the last 100 years.  Unfortunately, all of that data might as well be a sun in another solar system; it is not fueling any progress here.

      With that data, statistical programs could provide researchers with multi-variable correlations that are specific to individual level combinations of characteristics that small sample size studies will never find. Statistical analysis algorithms would drive focused professional testing of specific areas identified by the stats, and the results of the controlled testing would feed back into the universe of data in an incredibly powerful feedback loop.

      “These findings provide the first proof in man that aspirin reduces deaths due to several common cancers,” the study team noted in the news release. Maybe it is that simple – aspirin is a cancer killer – regardless of genetics, diet, environment, gender, blood type, age, habits, medications and combinations of those and a myriad of other factors. But we can’t conclude that from this study, and unfortunately it is not likely to be that simple.  We need data, all of our data, and it can happen in today’s age of ubiquitous web connectivity.

      In most areas, low signal to noise ratio problems dominate.  In medical science, that would be a good problem to have.  Right now, it is way too quiet.

        reputation

        Posted on December 6, 2010 at 22:11 in internet

        The disintermediation of the layers between content producer and consumer has been a boon. However, one element that was forged between all those layers was a proxy for individual reputation. The brand of the big media outlet became your reputation and this was a decent approximation in most cases.

        We can improve upon that approximation with individual, granular reputation. You (individual) are an expert in civil engineering but know nothing about global warming (granular). You are an expert in civil engineering no matter where you or your work goes (portable), e.g. if you move from your publisher, web address or social network then your reputation moves with you – you take your PageRank with you so to speak.

        This is largely broken today. PageRank helps to a degree, although it too is declining in effectiveness. I can Google you, check a few links and triangulate to my own opinion of your reputation, but the result often won’t be very granular. It is a human algorithm that isn’t as quick or simple as most would want, and it doesn’t extend well into mobile use cases. The same goes for other examples: the development of our social graph and its integration with social networking is an ingredient but not a full recipe. For example, volume, number of friends or followers, etc. don’t always correlate directly with reputation, and certainly not in a granular or portable manner.

        Big media and traditional content – articles, videos etc. – make an easy example because they are so visible. But this is true everywhere and is much more important in areas like medicine, science and research, where the web would enable much more progress if individual reputation were essentially metadata tied into every bit of content.

        Thoughts on solutions? Reputation is one of the critical ingredients missing in our efforts to improve the signal to noise ratio, maybe the most important one?

          signal to noise

          Posted on December 2, 2010 at 15:08 in internet

          Share this. Tweet that. Like everything. Noise screaming from every site, page, app and applet. Signal to noise ratio approaching zero.

          Everyone is now a publisher. Blogs, tweets, videos, podcasts, wall updates, broadcasts, forums, magazines, movies, ebooks, reviews, comments, aggregations. Long-form to 140-character form. All goodness but lots of Noise.

          Signal? Google. Google can get the first page of results “right” most of the time. PageRank has been brilliant but even today falls short and in the future it too will be a dinosaur.

          David Segal did a nice job showing one example of a chink in the armor in his Bully Finds a Pulpit on the Web article. Google showed that it can act quickly like a small company while leveraging the long-term signal to noise R&D that its large company resources fund: it added some signal to noise algorithms for this specific use case, as described by Amit Singhal in this blog post.

          This is just one use case. The more general signal to noise development will be fascinating. More on that another time, except to list a few variables that need to be better developed in the signal to noise algorithms:

          + individual-level, granular, portable reputation
          + the intersection of algorithmic curation, human curation and crowdsourcing
          + social graph intelligence
          + use of presence and location to add metadata automatically
          + feedback loops amongst these