What is the Indexing Service?


Microsoft Indexing Service provides a means of quickly searching for files on a machine. Its most familiar use is on web servers, where it provides the functionality behind site searches. It is built into Windows 2000 and Windows Server 2003, and it gives you a straightforward way to index and search your web site.

Setting up the Indexing Service is explained at windowswebsolutions.com and will not be covered here.

Connecting to the Indexing Service

The Indexing Service exposes itself to the developer as the OLE DB provider MSIDXS, which ADO.NET can use through System.Data.OleDb, with the data source set to the indexing catalog name. For example, the connection string used in searching this site is

Provider="MSIDXS";Data Source="idunno.org";

As with any other OLE DB provider, you assign the connection string to the ConnectionString property of a System.Data.OleDb.OleDbConnection object.

using System.Data.OleDb;
// Instantiate the connection before setting its properties.
protected OleDbConnection odbSearch = new OleDbConnection();
odbSearch.ConnectionString =
  "Provider=\"MSIDXS\";Data Source=\"idunno.org\";";
odbSearch.Open();
// Query and process results
odbSearch.Close();

You can also use the connection string in Visual Studio by dragging and dropping an OleDbConnection onto your ASP.NET page and setting the ConnectionString property in the Properties tab.

Building your search query

Once you have your connection you can execute queries against it. The syntax used to query the Indexing Service is a limited subset of SQL, documented in MSDN's Indexing Service section.

For example, the search I use on the .NET portion of this site looks like this:

select doctitle, filename, vpath, rank, characterization
from scope('"/dotNet"')
where FREETEXT(Contents, 'searchText')
and filename <>'search.aspx'
order by rank desc
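
In practice 'searchText' comes from the user, so it has to be placed into the query string safely. The following is a minimal sketch of one way to do that, not the code this site uses. The class and method names (SearchQuery, EscapeLiteral, Build) are hypothetical, and it assumes the Indexing Service follows the usual SQL convention of doubling a single quote inside a string literal.

```csharp
using System;

public static class SearchQuery
{
    // Doubles single quotes so user text cannot end the SQL string literal early.
    // Assumption: the Indexing Service SQL dialect uses the standard '' escape.
    public static string EscapeLiteral(string text)
    {
        return (text ?? string.Empty).Replace("'", "''");
    }

    // Builds the query shown above for a given virtual path and search text.
    public static string Build(string virtualPath, string searchText)
    {
        return
            "select doctitle, filename, vpath, rank, characterization " +
            "from scope('\"" + EscapeLiteral(virtualPath) + "\"') " +
            "where FREETEXT(Contents, '" + EscapeLiteral(searchText) + "') " +
            "and filename <> 'search.aspx' " +
            "order by rank desc";
    }
}
```

Calling SearchQuery.Build("/dotNet", userInput) then gives a string you can assign to the command text of an OleDbCommand.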

select parameters

Let's examine some of the columns that an Index Server catalog exposes.

  • Access
    This is the last accessed date of a document
  • AllocSize
    This is the disk space currently allocated to the document.
  • Attrib
    This is the set of file system attributes (read only, system etc.) currently set on the file.
  • Characterization
    This is the document abstract, if available. This is configurable as part of the index properties, where you can also set the number of characters in the abstract. The abstract is then produced from the document body.
  • ClassID
    This is the OLE ClassID for the document.
  • Contents
    The complete contents of a file in the index. This can be queried on but not retrieved as part of the select clause.
  • Create
    This is the creation date of the document.
  • DocAuthor
    This is the document author, if the document provides this meta data. Office documents, Adobe PDFs and media files generally have this property; HTML, XML, ASP and ASP.NET documents do not.
  • DocTitle
    This is the document title, extracted from the document meta data in Office documents, or from the <title /> tag in markup documents.
  • FileIndex
    This is the unique ID of the file.
  • FileName
    This is the document filename.
  • HitCount
    This is the number of times the search term appears in your document.
  • Path
    This is the full physical path to the document, including the file name, for example c:\inetpub\wwwroot\examples\example.aspx
  • Rank
    This is an indication of relevance. Index Server ranks its results between 0 and 1000. The higher the rank, the more relevant Index Server believes the document is to the search criteria.
  • ShortFileName
    This is the 8.3 short DOS file name of the document.
  • Size
    This is the document size.
  • USN
    This is the update sequence number, an NTFS attribute.
  • VPath
    This is the virtual path to the document, including the file name, relative to the root of the web site the index is for, for example /examples/example.aspx. If more than one virtual path exists to a file, Index Server chooses the virtual path it believes best matches the query.
  • WorkID
    This is the internal ID Index Server uses for each file.
  • Write
    This is the date the file was last written to.

Additional columns are provided for OLE aware documents, such as those produced by Microsoft Office. These include DocAppName, DocAuthor, DocCharCount, DocComments, DocCreateDTM, DocEditTime, DocKeywords, DocLastAuthor, DocLastPrinted, DocLastSaveDTM, DocPageCount, DocRevNumber, DocSecurity, DocSubject, DocTemplate, DocTitle and DocWordCount.

from parameters

Now that we know what we can select, we need to examine where we are selecting from. The sample query above uses this from clause:

from scope('"/dotNet"')


There are a few methods to limit your search.

You can use the scope() function, as shown in the example. This is the main component of the from clause. The scope function takes zero or more comma-separated scope arguments. A scope argument combines a Traversal_Type and a Path. You can also specify scope with an empty argument list, (). This is the default scope and effectively sets the scope to start at the virtual root of your web site ( / ). Each Scope_Argument must be surrounded by single quotes.

As an alternative to using scope(), you can use any one of a set of predefined views that Index Server provides. You can reference one of these predefined views in the from clause by specifying the View_Name.

For example,

select fileName from scope()

This returns all file names in the current index, with no limitation on the directories to search.

select filename
from scope('shallow traversal of "D:\inetpub\wwwroot\examples"',
'deep traversal of "/examples2"',
'hierarchical traversal of "/examples3"')

In this example we have three scope arguments, with three different Traversal_Type values and three different paths, one physical and two virtual.

A shallow traversal searches the resources in the specified folder, but not in any of the subfolders.

A deep traversal searches against any and all subfolders of the given folder, all the way to the bottom of the folder hierarchy.

A hierarchical traversal searches against folder resources in a specified folder. A hierarchical traversal search can be used for a task such as determining the folder hierarchy of a specified folder.

select filename from EXTENDED_WEBINFO

In this example we can see one of the pre-defined views, EXTENDED_WEBINFO. The pre-defined views are documented in MSDN.
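
If the folders to search are chosen at run time, the scope() call can be assembled from its parts. The sketch below is illustrative only: ScopeBuilder, Argument and Scope are hypothetical names, it covers just the shallow and deep traversal types, and it assumes paths never contain a double quote.

```csharp
using System;

public static class ScopeBuilder
{
    // One scope argument: a traversal type plus a physical or virtual path,
    // wrapped in single quotes as each Scope_Argument must be.
    public static string Argument(bool deep, string path)
    {
        string traversal = deep ? "deep traversal of" : "shallow traversal of";
        return "'" + traversal + " \"" + path.Replace("'", "''") + "\"'";
    }

    // Joins the arguments with commas; with no arguments this yields scope(),
    // the default scope.
    public static string Scope(params string[] arguments)
    {
        return "scope(" + string.Join(", ", arguments) + ")";
    }
}
```

For example, "select filename from " + ScopeBuilder.Scope(ScopeBuilder.Argument(true, "/examples2")) searches /examples2 and everything beneath it.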

where parameters

Finally we are able to specify what we want to search for and exclude from the search. In the original example the where predicate was

where FREETEXT(Contents, 'searchText') and
filename <> 'search.aspx'

The rules for searching are as follows

  • Consecutive words are treated as a phrase; they must appear in the same order within a matching document.
  • Queries are case-insensitive, so you can type your query in uppercase or lowercase.
  • You can search for any word except for those in the exception list (for English, this includes a, an, and, as, and other common words), which are ignored during a search. Words in the exception list are treated as placeholders in phrase and proximity queries. For example, if you searched for "Word for Windows", the results could give you "Word for Windows" and "Word and Windows", because for is a noise word and appears in the exception list.
  • Punctuation marks such as the period (.), colon (:), semicolon (;), and comma (,) are ignored during a search. To use specially treated characters such as &, |, ^, #, @, $, (, ), in a query, enclose your query in quotation marks ("). To search for a word or phrase containing quotation marks, enclose the entire phrase in quotation marks and then double the quotation marks around the word or words you want to surround with quotes. For example, "World-Wide Web or ""Web""" searches for World-Wide Web or "Web".
  • You can insert Boolean operators (AND, OR, and NOT) and the proximity operator (NEAR) to specify additional search information.
  • The wildcard character (*) can match words with a given prefix. The query esc* matches the terms "ESC", "escape," and so on.
  • Free-text queries can be specified without regard to query syntax.

The where syntax can be a simple comparison such as DocAuthor = 'Barry Dorrans', but the real power of the search engine starts to show when you use the contains predicate.

A simple contains predicate might look for a single word or a combination of words in the contents of a document. With near() you can use proximity searching, looking for one word near another, for example contains(' "Visual" near() "Studio" '). Even more useful is formsof(), which performs fuzzy searches. For example, contains('formsof(inflectional, "drive") ') examines the root of a word and will match 'drive', 'driving', 'driven' or 'drives'. Of course, which search method you use depends on the type of search you wish to perform.

The most common method of searching uses the freetext() predicate. This works much like searching in the Microsoft Office wizards does (without the drawback of having that damned paperclip pop up). It analyses the meaning of the search criteria in addition to the words they contain, and so brings back the documents it feels are relevant.

Limiting results or excluding documents and directories from them can be done using normal SQL syntax, for example filename <> 'secretpasswords.htm'.
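
The near() and formsof() forms described above are easy to mistype because of the nested quoting, so it can help to generate them. This is a rough sketch under stated assumptions: ContainsPredicates, Near and Inflectional are hypothetical names, and terms containing a double quote are simply rejected rather than escaped.

```csharp
using System;

public static class ContainsPredicates
{
    // Wraps a word or phrase in double quotes for use inside contains(' ... ').
    // Single quotes are doubled because the whole condition is a SQL literal.
    private static string Term(string word)
    {
        if (string.IsNullOrEmpty(word) || word.IndexOf('"') >= 0)
            throw new ArgumentException(
                "Term must be non-empty and free of double quotes.", "word");
        return "\"" + word.Replace("'", "''") + "\"";
    }

    // Proximity search: one term near another.
    public static string Near(string first, string second)
    {
        return "contains(' " + Term(first) + " near() " + Term(second) + " ')";
    }

    // Inflectional search: matches other forms of the same root word.
    public static string Inflectional(string word)
    {
        return "contains('formsof(inflectional, " + Term(word) + ") ')";
    }
}
```

The result goes straight into the where clause, for example "where " + ContainsPredicates.Near("Visual", "Studio").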

Putting it all together

So finally we have the search query built and we know how to open the Index Server catalogue. We can simply treat the query as a normal OLE DB command and get a DataReader, or use a DataAdapter and DataSet, with the results databound to a control if you require.

System.Data.OleDb.OleDbConnection odbSearch =
  new System.Data.OleDb.OleDbConnection();
System.Data.OleDb.OleDbCommand cmdSearch =
  new System.Data.OleDb.OleDbCommand();
odbSearch.ConnectionString =
  "Provider=\"MSIDXS\";Data Source=\"idunno.org\";";
cmdSearch.Connection = odbSearch;
// "searchSQL" is a placeholder; use the query text you built earlier
cmdSearch.CommandText = "searchSQL";
odbSearch.Open();
System.Data.OleDb.OleDbDataReader rdrSearch = cmdSearch.ExecuteReader();
// your databinding or display code here
rdrSearch.Close();
odbSearch.Close();
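
If you would rather work with plain objects than bind the reader directly, the rows can be copied into a list first. The sketch below is one possible approach, not part of the original site code. SearchHit and SearchResults are hypothetical names, it assumes the query selected the doctitle, vpath and rank columns, and it treats NULL values defensively because not every file has a title.

```csharp
using System;
using System.Collections.Generic;
using System.Data;

public class SearchHit
{
    public string Title;
    public string VirtualPath;
    public int Rank;
}

public static class SearchResults
{
    // Copies every row of the reader into a list of SearchHit objects.
    // Written against IDataReader, which OleDbDataReader implements.
    public static List<SearchHit> ReadAll(IDataReader reader)
    {
        List<SearchHit> hits = new List<SearchHit>();
        int titleCol = reader.GetOrdinal("doctitle");
        int pathCol = reader.GetOrdinal("vpath");
        int rankCol = reader.GetOrdinal("rank");
        while (reader.Read())
        {
            SearchHit hit = new SearchHit();
            // Fall back to placeholder values when a column is NULL.
            hit.Title = reader.IsDBNull(titleCol)
                ? "(untitled)" : Convert.ToString(reader.GetValue(titleCol));
            hit.VirtualPath = reader.IsDBNull(pathCol)
                ? "" : Convert.ToString(reader.GetValue(pathCol));
            hit.Rank = reader.IsDBNull(rankCol)
                ? 0 : Convert.ToInt32(reader.GetValue(rankCol));
            hits.Add(hit);
        }
        return hits;
    }
}
```

With the code above you would call SearchResults.ReadAll(rdrSearch) before closing the reader, then bind the returned list to a Repeater or DataGrid.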

After all that you should have a working search facility for your web site.

Indexing .aspx and .ascx files

By default the Indexing Service does not index .aspx and .ascx files. If you want to extend the file types indexed to include your ASP.NET content, follow the instructions in the Microsoft Knowledge Base article 311521.
