Cyber-langagerie: Dans le fouillis rechercher par agrégation de notions (clustering)

Nous avons tous comme but lors d'une recherche sur le Web de repérer rapidement l'information. Les moteurs courants sont souvent décevants parce que, pour la plupart, ils ne font pas d'élagage des données et sont souvent axés sur des critères commerciaux.

Google par exemple devient de moins en moins séduisant pour les langagiers se permettant par exemple de laisser tomber les accents, d'extrapoler des formes dérivées, etc. Nous sommes biens habitués à Google qui devient de plus en plus marchand mais peut être un jour faudra-t-il songer à un autre moteur. Voici un texte qui fait réfléchir :

« It's called SEO—search engine optimization—and it's pretty much all anyone working with Web sites ever talks about nowadays ... But in fact, it centers around the idea that Google sucks so much that companies think they need to use SEO to get the results they deserve.

... From a user's perspective, once you learn how Google does what it does, it's a miracle that you ever get the right results. And from my experience, the right results in many circumstances are nearly impossible to obtain—and may never be obtainable in the future.

Let's look at some of the problems that have developed over the years.

Inability to identify a home site. All the search engines have this habit, but often it is laughable. You'd think that if I were looking for Art Jenkins, and Art Jenkins had a Web site named Artjenkins.com, search engines would list that first, right? Most often this page is never listed anywhere.

Too much commerce, not enough information. There seems to be an underlying belief, especially at Google, that the only reason you go online is to buy something. People merely looking for information are a nuisance. This is made apparent anytime you look for information about a popular product. All you find are sites trying to sell you the product. Hey, here's a challenge: Ask Google to find you a site that honestly compares cell-phone plans and tells you which is best. Try it! All you get are thousands of sites with fake comparisons promoting something they are selling.

...

Parked sites. Have you ever gone to look for something and found what seems like the perfect site near the top of the Google results? You click on it only to find one of those fake "parked" sites, where people park domain names, pack them with links to other sites, and hope for random clicks that pay them 10 cents each. How does page ranking, if it works, ever manage to give these bogus sites a high number?

Unrepeatable search results. Ever run a search a week later and get completely different results? In the end, you have to use the search history and hope you can find it. Can things change so drastically day-to-day that the search results vary to an extreme month-to-month? This is compounded by the weird results you get when you are logged in to Google. These are somehow customized for you? In what way?

Google sign-in changes a query's results to an extreme with no discernible benefit. Often two people are on a call trying to discuss something and both will try finding something online. The conversation often goes like this: "Here it is, I found it. Type in the search term 'ABCD Fix' and it's the fourth result listed." "I don't see it. The fourth one down is a pill company." "You typed in ABCD Fix, right?" "Yeah." This goes on for a while until you realize that one of the two people is logged into Google.

The solution to this entire mess, which is slowly worsening, is to "wikify" search results somehow without overdoing it. Yahoo! had a good idea when its search engine was actually a directory with segments "owned" by communities of experts. These people could isolate the best of breed, something Google has never managed to do. The basis for Google page-ranking is to equate popularity with quality, and once you look at the information developed by SEO experts, you learn that this strategy barely works.

We have to suffer until something better comes along, but there is at least one crucial fix that could be easily implemented: user flagging. Parked sites, for instance, could be flagged the way you flag spam on a message board or a miscategorized post on craigslist. The risk here is that creeps trying to shut down a specific site could swamp Google with false flags, so maintaining integrity would be difficult. People with their own agendas have already infiltrated and controlled aspects of craigslist and Wikipedia, unfortunately. On Wikipedia, for example, a group pushing the global-warming agenda prevents almost any post with contrary data or opinions, no matter how minor the point.

One suggestion floating around involves the semantic Web, which anticipates even more SEO tricks—and requires a certain level of honesty that can never be maintained. I suggest rethinking the basic organization of the Web itself, using the Google News concept. In other words, compartmentalize the Web to an extreme. Tagging might help. But you should be able just to search through a subsegment and check a box that eliminates merchants with faux-informational sites.

And speaking of check boxes, over the years there have been numerous attempts at creating an advanced search mechanism utilizing check boxes and a question-and-response AI network. You'd think that idea would have gotten further than it has. Hopefully, someone will conceptualize something new that works better than what we have today. The situation is just deteriorating too fast. »

SOURCE : http://www.pcmag.com/article2/0,2817,2334870,00.asp

AUTEUR : John C. Dvorak

Personnellement je pense qu'il importe d'avoir l'oeil ouvert. Si on tient compte de arguments troublants mentionnés plus haut et devant l'abondance de la Toile un début de solution pourrait être une pré-classification des résultats de recherche pour sauver du temps.

C'est le principe même des banques de terminologie dans lesquelles le filtrage par domaines est apparu depuis le début comme un moyen efficace de rechercher plus rapidement le bon équivalent (terme traduit) face à la polysémie de certains mots. Ce principe du filtrage est appliqué aux moteurs de recherche par agrégation qui catégorisent les données à l'aide de descripteurs. Un de ces moteurs est Xclustering.

L'avantage de ce moteur est qu'il affiche à gauche de l'écran une hiérarchie de descripteurs et de sous-descripteurs qui permet un débroussaillage souvent très efficace des données brutes qui deviennent des informations utilisables. Dans le cas d'expressions contenant des termes polysémiques comme « heat sensor » par exemple il est préférable d'interroger « sensor » pour obtenir les usages du terme dans plusieurs domaines.