The Semantic Search Engine

Source: Internet. Published by: 淘宝著作权原始图. Editor: 程序博客网. Time: 2024/04/27 18:59

How the Search Engines of the future are going to operate is anybody’s guess, and guessing about the future is always hard. But if you read up on the subject in the more theoretical parts of the net, and if you take an interest in algorithms and the like, you will quickly see some signs of what lies in the (not so) distant future. One of the buzzwords in the business is “Semantic”, as in “The Semantic Web” or “Latent Semantic Indexing”. These principles will have quite an impact on how we search the net for information and on how we, as webmasters, design web sites and optimize our pages to attract vital traffic from the Search Engines.

If you are to succeed in coming to terms with the future, you must first understand the past and the present. Let us, therefore, start by looking into how Search Engines work today and what principles govern the algorithms in use.

If we are to design a search system, we can apply two basically different methods of approach. We can choose either text indexing or META indexing. (We could of course combine the two, but we will get to that later.)

Text Indexing

In text indexing, the Search Engine will harvest all the text of each page and process it to extract a list of relevant content words. This, of course, can be done in many ways, but a likely approach could be the following:

1. Discard articles, prepositions, and conjunctions
2. Discard common verbs (know, see, do, be)
3. Discard pronouns
4. Discard common adjectives (big, late, high)
5. Discard frilly words (therefore, thus, however, albeit, etc.)
6. Discard any words that appear in every document
7. Discard any words that appear in only one document
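
The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stop list `STOP_WORDS` is a tiny hypothetical stand-in for steps 1 to 5, the function name `content_words` is made up for this example, and no real Search Engine is claimed to work exactly like this.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop list standing in for steps 1-5.
STOP_WORDS = {
    "a", "an", "the", "in", "on", "of", "to", "and", "or", "but",  # articles, prepositions, conjunctions
    "know", "see", "do", "be", "is", "are", "was", "were",          # common verbs
    "i", "you", "he", "she", "it", "we", "they",                    # pronouns
    "big", "late", "high",                                          # common adjectives
    "therefore", "thus", "however", "albeit",                       # frilly words
}

def content_words(documents):
    """Return one list of content words per document."""
    tokenized = [re.findall(r"[a-z]+", doc.lower()) for doc in documents]
    # Document frequency: in how many documents does each word occur?
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    total = len(documents)
    return [
        [w for w in tokens
         if w not in STOP_WORDS      # steps 1-5: stop list
         and doc_freq[w] < total     # step 6: not in every document
         and doc_freq[w] > 1]        # step 7: not in only one document
        for tokens in tokenized
    ]

pages = [
    "The painter Monet lived in Giverny",
    "Monet and Renoir were painter friends",
    "The painter Renoir visited the museum",
]
print(content_words(pages))
```

On a collection this small, step 7 throws away most of the words; on a real collection of pages it mainly removes typos and one-off terms.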

META Indexing

META indexing works with META data placed in the documents and web pages by the author or webmaster. It is the author or webmaster who decides which keywords are relevant for the webpage and inserts these in META tags, which in turn are indexed by the Search Engine. The advantage of this system is that searches can be made fast and efficient, especially if the keywords used in META are applied intelligently and with a degree of standardisation. Unfortunately, the reality of the web is not like this, and shady types have misused META tags by placing popular but irrelevant keywords in them just to get traffic. There are very few Search Engines, if any at all, that rely solely on META today.

The Real World

The reality is that Search Engines use a combination of the two systems. Text indexing is used to extract the content words, and META++ (++ meaning a lot of other tags and codes apart from META tags) is used to weight the content words individually. Once the content words of a webpage are found, the other codes are analyzed. The Search Engine will evaluate things like content word density, frequency and proximity. These parameters must be within certain threshold levels for the page to do well in the Search Engine.
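
To make frequency, density and proximity a little more concrete, here is a minimal sketch of how they could be measured for a page. The function name `keyword_stats` and the exact definitions are assumptions made for illustration; the thresholds real Search Engines apply are not public and are not modelled here.

```python
import re

def keyword_stats(text, word, other_word):
    """Frequency and density of `word`, plus its closest distance to `other_word`."""
    tokens = re.findall(r"[a-z]+", text.lower())
    frequency = tokens.count(word)
    # Density: share of all words on the page that are the content word.
    density = frequency / len(tokens) if tokens else 0.0
    positions = [i for i, t in enumerate(tokens) if t == word]
    other_positions = [i for i, t in enumerate(tokens) if t == other_word]
    # Proximity: smallest number of word positions between the two words,
    # or None if one of them does not occur at all.
    proximity = None
    if positions and other_positions:
        proximity = min(abs(i - j) for i in positions for j in other_positions)
    return frequency, density, proximity

page = "French impressionism began in Paris. Impressionism changed French art."
frequency, density, proximity = keyword_stats(page, "impressionism", "french")
print(frequency, round(density, 3), proximity)
```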

Content words, keywords and search words are the same thing; which name you use depends on your viewpoint. The Search Engine will call them content words, the searcher search words and the webmaster keywords.

Search Engines will also often use different site-specific parameters. A good example of this is Google’s PageRank, where links are used to rank the pages and sites for relevant content words.
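
As a rough illustration of ranking by links, the sketch below runs the simplified PageRank iteration on a three-page toy web. The damping factor of 0.85, the iteration count and the link graph are assumptions for the example; this is the textbook form of the idea, not Google’s production system.

```python
def pagerank(links, damping=0.85, iterations=100):
    """links maps each page to the list of pages it links to."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Pages without outbound links spread their rank evenly over all pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            for target in targets:
                # Each page shares its rank equally among the pages it links to.
                new_rank[target] += damping * rank[page] / len(targets)
        rank = new_rank
    return rank

# Hypothetical toy web: pages "a" and "b" both link to "c", and "c" links back to "a".
toy_web = {"a": ["c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(toy_web)
print(max(ranks, key=ranks.get))
```

Page "c" comes out on top because it is the only page with two inbound links, while "b", which nobody links to, ends up with the minimum rank.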

The Problem

The problem with the Search Engines of today is lack of intelligence. The Search Engine can only find pages that have the chosen key/search/content word in the text. If you, for instance, need information on “French impressionism”, the Search Engines will only find pages which have the words French and impressionism on them. Pages about Claude Monet, Renoir exhibitions, the museum at Giverny, or the Salon des Refusés will not appear in the Search Engine result pages even though they are, or could be, very relevant. If you yourself know very little about French impressionism, you will perhaps never think of searching for these words and will therefore never find this relevant information.

The ideal Search Engine

The ideal Search Engine does not exist and probably never will, but describing it will be helpful if we are to design a Search Engine better than the ones we have today. Again, this could be done in many ways, but this is how it is done in the article on Latent Semantic Indexing (see the links below):

- Scope: The ideal engine would be able to search every document on the Internet
- Speed: Results would be available immediately
- Currency: All the information would be kept completely up-to-date
- Recall: We could always find every document relevant to our query. No false negatives
- Precision: There would be no irrelevant documents in our result set. No false positives
- Ranking: The most relevant results would come first, and the ones furthest afield would come last
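
Recall and precision are the two items on this list that are easy to put numbers on. The sketch below computes both for a single query; the function name `precision_recall` and the page names are made up for illustration.

```python
def precision_recall(retrieved, relevant):
    """Precision and recall of one result set against the truly relevant documents."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    # Precision: how much of what we returned is relevant.
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    # Recall: how much of what is relevant we actually returned.
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical result set for the query "French impressionism".
results = ["monet", "renoir", "cubism", "seo-tips"]
truly_relevant = ["monet", "renoir", "giverny"]
precision, recall = precision_recall(results, truly_relevant)
print(precision, round(recall, 3))
```

The ideal engine described above would score 1.0 on both: nothing irrelevant returned, nothing relevant missed.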

The Search Engine of the future

The Search Engine of the future will be semantic, one might say to the greatest extent possible. I believe that the Search Engine of the future will use all the elements in use today plus Latent Semantic Indexing and probably some elements we haven’t seen yet.

Latent Semantic Indexing is a well-defined mathematical method, which uses pure mathematics to establish the semantic cohesion between documents and collections of documents. The mathematics is not that complicated and, should you be interested, just follow the links below for further explanation.
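
For readers who would rather see the idea than the formulas, here is a very small sketch in plain Python. Latent Semantic Indexing is normally done with a singular value decomposition of the term-document matrix; as a simplified stand-in, this example finds only the two strongest latent dimensions by power iteration with deflation. The toy corpus, the page names and all function names are assumptions made for the illustration.

```python
import math

# Toy corpus (hypothetical): three art pages and two search-engine pages.
DOCS = {
    "monet_page":   "french impressionism monet",
    "giverny_page": "monet giverny painting",
    "renoir_page":  "french impressionism renoir",
    "index_page":   "search engine index",
    "ranking_page": "search engine ranking",
}
NAMES = list(DOCS)
TOKENS = [DOCS[name].split() for name in NAMES]
VOCAB = sorted({w for toks in TOKENS for w in toks})
N = len(NAMES)

# Term-document matrix: A[i][j] = how often term i occurs in document j.
A = [[toks.count(term) for toks in TOKENS] for term in VOCAB]
# Gram matrix G = A^T A; its eigenvectors are the right singular vectors of A.
G = [[sum(row[a] * row[b] for row in A) for b in range(N)] for a in range(N)]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def top_eigenpair(M, iterations=1000):
    """Power iteration: dominant eigenvalue and unit eigenvector of M."""
    v = [1.0] * len(M)
    for _ in range(iterations):
        w = matvec(M, v)
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(a * b for a, b in zip(v, matvec(M, v)))
    return eigenvalue, v

def latent_dimensions(k=2):
    """The k strongest latent dimensions, via power iteration plus deflation."""
    M = [[float(x) for x in row] for row in G]
    dims = []
    for _ in range(k):
        lam, v = top_eigenpair(M)
        dims.append((lam, v))
        # Deflation: remove the dimension just found so the next one can surface.
        M = [[M[r][c] - lam * v[r] * v[c] for c in range(N)] for r in range(N)]
    return dims

def cosine(p, q):
    norm = math.sqrt(sum(x * x for x in p)) * math.sqrt(sum(x * x for x in q))
    return sum(a * b for a, b in zip(p, q)) / norm if norm else 0.0

def semantic_scores(query, k=2):
    dims = latent_dimensions(k)
    # q^T A: for each document, how many query-word occurrences it contains.
    overlap = [sum(toks.count(w) for w in query.split()) for toks in TOKENS]
    # Fold the query into the latent space: q_hat_i = (q^T A v_i) / lambda_i.
    q_hat = [sum(o * x for o, x in zip(overlap, v)) / lam for lam, v in dims]
    return {name: cosine(q_hat, [v[j] for _, v in dims])
            for j, name in enumerate(NAMES)}

scores = semantic_scores("impressionism")
for name in sorted(scores, key=scores.get, reverse=True):
    print(name, round(scores[name], 3))
```

The point of the toy is that "giverny_page" never mentions the word impressionism, yet it lands next to the query in the latent space because it shares the word monet with a page that does. With only two dimensions all the art pages score about the same; a real index would keep far more dimensions to tell them apart.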

The way I see things, the Semantic Search Engines will harvest content words in much the same way as is done today. They will also weight them in much the same way, but where today’s Search Engine stops and presents the results, the Semantic Search Engine will go some steps further.

The Semantic Search Engine will analyse the collection of content words, their relative weight, the cohesion between them and the way they are (semantically) connected. The Search Engine will then find other pages or collections of pages with the same semantic profile or with a semantic profile that falls within an acceptable threshold of values. This might sound like an impossible task, but it is already being done with very good results at some American universities. It is worth remembering here that Google started as a research project at Stanford University and that Google still keeps close ties with Stanford. Trials have shown that it is indeed possible to produce search results that are relevant even though not all of the pages contain the actual search word.
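
The matching step, finding pages whose semantic profile falls within a threshold, could look something like the sketch below. It treats a profile as a set of weighted content words and compares profiles by cosine similarity. The weights, the threshold of 0.3 and the names `cosine` and `similar_pages` are assumptions for illustration only; how a real Semantic Search Engine would build and compare its profiles is not known.

```python
import math

def cosine(profile_a, profile_b):
    """Cosine similarity between two {content word: weight} profiles."""
    dot = sum(w * profile_b.get(term, 0.0) for term, w in profile_a.items())
    norm_a = math.sqrt(sum(w * w for w in profile_a.values()))
    norm_b = math.sqrt(sum(w * w for w in profile_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def similar_pages(target, candidates, threshold=0.3):
    """Names of candidate pages whose profile is close enough, best match first."""
    scored = [(cosine(target, profile), name) for name, profile in candidates.items()]
    return [name for score, name in sorted(scored, reverse=True) if score >= threshold]

# Hypothetical weighted profiles.
target_profile = {"impressionism": 2.0, "monet": 1.0}
candidate_pages = {
    "giverny": {"monet": 2.0, "giverny": 1.0, "garden": 1.0},
    "salon": {"impressionism": 1.0, "salon": 1.0},
    "seo": {"search": 1.0, "engine": 1.0},
}
matches = similar_pages(target_profile, candidate_pages)
print(matches)
```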

The future seen from the point of view of the searcher

When the Search Engines are equipped with semantic intelligence, they will invite the searcher to search for information in a semantic manner, instead of using just single or double keywords. Where today you would search for “French Impressionism”, tomorrow you might be better off searching for “the French impressionism, in particular Claude Monet and Renoir, not exhibitions”. Exactly how such a search will be formulated is impossible to say. What operators and wildcards will be made available remains to be seen, but the days of the single keyword search are coming to an end for professional searchers.

The future seen from the point of view of the webmaster

Things will change, and if you want to have a presence in the semantic Search Engines of tomorrow, you should start your planning and preparations today. The Search Engines of today are relatively easy to second-guess, and achieving a good placement is also relatively easy if you are not in a heavily competitive market segment. The semantic Search Engine will present a more complex problem. Multi-keyword searches alone open up many more possibilities when you operate in a semantic world. Guessing the actual search sentence might not be possible, or even desirable. You would be much better off trying to optimize your site to land in the middle of the semantic cloud, enabling your pages to appear at the top of many different searches. Here is my bid on what is important to emphasise:

Content

Content will be even more important tomorrow than it is today, and the way in which we write our content will be essential. Today it is important to know your relevant and realistic keywords and then optimize your pages to rank high for these words. Tomorrow the content will need to be much more varied. When you are aiming for the semantic middle of something, keywords are not enough. You will also need synonyms, acronyms, alternatives, opposites and variations, in fact all the nyms, tives, sites and tions you can think of.

The scope of the content across the entire site will matter more. Where the Search Engines of today look at each page separately (with site parameters such as PageRank added, of course), the semantic Search Engine will also consider the semantic context between the pages of the site as a whole. The content space of the site will therefore increase in importance.

In other words, if you want to do well in the Search Engines of the future, you need to rewrite and add to your content and maybe broaden the scope of the site.

Internal links

As Search Engines move by links, and as links bind the site together, it is natural that the internal link structure of the site will tell a lot about the semantic cohesion of that site. It is my belief that Google, for example, will use links as a parameter in the semantic evaluation of a site even though links are not part of the math behind Latent Semantic Indexing. As a webmaster, you need to look at your link structure as a semantic road map that, properly made, will enhance the semantic content of your site. Analyse your links and design them in such a way that they support the content space you are aiming for.

External links

Where Google today uses external links to calculate the PageRank of your pages, it is my belief that external links, inbound as well as outbound, will have enhanced importance in the future. Who you link to, and who links to you, says a lot about which semantic neighbourhood your site most clearly belongs to. External links will therefore also be an important parameter in the semantic ranking of your site, outbound links maybe more so than inbound, since it is the webmaster who controls who the site links to.

Miscellaneous

There will be a shift in the way we as webmasters look at the sites we manage. Where the basis today is the all-important keywords, tomorrow we will need to emphasize context and cohesion. We might still start off with the keyword analysis, but we need to look at the individual keyword in a semantic context as well. Today we might end up with a list of 25 words or phrases that we optimize our pages for. In the future I think we will start with 25 pages, each with a keyword or phrase as the headline and the semantic “ingredients” underneath. These 25 pages will end up as a collection of content where words like “collection” and “context” are just as important as the word “content”.

It’s a wrap

This is my bid for the future of SEO; how it actually goes, time will tell. This article might change a lot in the near future, or it might disappear altogether because I’m embarrassingly wrong.

For further info follow these links:

http://javelina.cet.middlebury.edu/lsa/out/lsa_definition.htm
http://www.w3.org/2001/sw/
http://infomesh.net/2001/swintro/
