Do or Recommended
1) For a SharePoint sites content source: if you want to crawl the content of a
particular site collection on a different schedule than other site collections,
select "Crawl only the SharePoint site of each start address". This option
accepts any URL, but starts the crawl from the top-level site of the site
collection specified in the URL you enter. For example, if you enter
http://contoso/sites/sales/car and http://contoso/sites/sales is the top-level
site of the site collection, then the site collection http://contoso/sites/sales
and all of its subsites are crawled.
2) For a SharePoint sites content source: if you want to crawl all content in
all site collections in a particular Web application on the same schedule,
select "Crawl everything under the host name of each start address". This
option accepts only host names as start addresses, such as http://contoso. You
cannot enter the URL of a subsite, such as http://contoso/sites/sales, when
using this option.
3) For a Web sites content source: if the only relevant content is on the first
page, select "Crawl only the first page of each start address".
4) For a Web sites content source: if you want to limit how deep the crawler
follows links from the start addresses, select "Custom" and specify the number
of pages deep and the number of server hops to crawl. We recommend starting
with a small number on a highly connected site, because specifying more than
three pages deep or more than three server hops can end up crawling the entire
Internet. You can also use one or more crawl rules to specify what content to
crawl.
5) For a File shares or Exchange public folders content source: if content in
the subfolders is not likely to be relevant, select "Crawl only the folder of
each start address".
6) For a File shares or Exchange public folders content source: if content in
the subfolders is likely to be relevant, select "Crawl the folder and
subfolders of each start address".
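The two SharePoint options above treat start addresses differently, which is
easy to get wrong. Here is a minimal Python sketch of that behaviour. It is not
SharePoint code: the function names and the list of known site collections are
hypothetical, and URL comparison is assumed to be a plain case-sensitive prefix
match.

```python
from urllib.parse import urlsplit


def is_host_only(url):
    """True if the URL is just a host name, e.g. http://contoso (assumed rule)."""
    parts = urlsplit(url)
    return bool(parts.netloc) and parts.path in ("", "/") and not parts.query


def resolve_crawl_root(url, site_collections):
    """Return the deepest known site collection that contains `url`, or None.

    `site_collections` is a hypothetical list of top-level site URLs.
    """
    candidates = [
        sc for sc in site_collections
        if url == sc or url.startswith(sc.rstrip("/") + "/")
    ]
    return max(candidates, key=len) if candidates else None


collections = ["http://contoso", "http://contoso/sites/sales"]
root = resolve_crawl_root("http://contoso/sites/sales/car", collections)
print(root)
print(is_host_only("http://contoso"), is_host_only("http://contoso/sites/sales"))
```

The first function mirrors the "host names only" restriction of option 2, and
the second mirrors how option 1 starts from the top-level site of the site
collection rather than from the subsite you typed.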
Avoid
Search:
1) You cannot crawl the same address using multiple content sources. For
example, if you use a particular content source to crawl a site collection and
all its subsites, you cannot use a different content source to crawl one of
those subsites on a different schedule. For performance reasons, you cannot add
the same start addresses to multiple content sources.
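Before creating content sources, it can help to check a planned list for start
addresses that appear more than once. The following Python sketch does that
over a plain dictionary. The function name and the normalization (lowercasing
and dropping a trailing slash) are assumptions for illustration, and it only
catches identical start addresses, not a subsite that sits under another
content source's site collection.

```python
from collections import defaultdict


def find_shared_start_addresses(content_sources):
    """Map each start address used by more than one content source to its owners.

    `content_sources` is a hypothetical dict: content source name -> list of URLs.
    """
    owners = defaultdict(list)
    for name, addresses in content_sources.items():
        # Normalize first so "http://contoso/" and "http://contoso" count as one.
        normalized = {address.rstrip("/").lower() for address in addresses}
        for address in normalized:
            owners[address].append(name)
    return {a: sorted(names) for a, names in owners.items() if len(names) > 1}


planned = {
    "Sales": ["http://contoso/sites/sales"],
    "Everything": ["http://contoso/sites/sales/", "http://contoso/sites/hr"],
}
print(find_shared_start_addresses(planned))
```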
————-
Search
1) Content is any item that can be crawled, such as a Web page, a Microsoft
Office Word document, business data, or an e-mail message. Content resides in a
content repository, such as a Web site, file share, or SharePoint site.
2) A content source is a set of rules that tells the crawler where it can find
content, how to access the content, and how to behave when it is crawling the
content. It includes one or more addresses of a content repository from which
to start crawling, also called start addresses. These settings apply to all
start addresses within the entire content source.
3) Types of content repository: Web sites, SharePoint sites, Exchange public
folders, file shares (network folders), and the Business Data Catalog (BDC).
4) Each content source contains a list of start addresses that the crawler uses
to connect to the repository of content.
5) When the crawler accesses the start addresses listed in a content source,
the crawler must be authenticated by and granted access to the servers that
host that content. The user account that is used by the crawler must have at
least read permission to crawl content.
6) The crawler uses protocol handlers to access content and then IFilters to
extract content from the files that are crawled. IFilters remove
application-specific formatting before the engine indexes the content of a
document. Office SharePoint Server 2007 crawls only the file types for which a
protocol handler and an IFilter are installed.
7) The crawler uses protocol handlers and IFilters as follows:
· The crawler retrieves the start addresses of content sources and calls the
protocol handler based on the URL’s prefix.
· The protocol handler connects to the content source and extracts system-level
metadata and access control list (ACL) information.
· The protocol handler identifies the file type of each content item, based on
the file name extension, and calls the appropriate IFilter associated with that
file type.
· The IFilter extracts content, removing any embedded formatting, and then
retrieves content item metadata.
· Content is parsed by one or more language-appropriate word breakers and is
added to the content index, also called the full-text index. Metadata and
access control lists are added to the search database.
8) If there is no IFilter for a file type that you want to crawl, the content
index in SharePoint can include only the file’s properties, not the file’s
content. If you want to index content for a file type that does not have an
IFilter installed by default (PDF, for example), you have to install and
register an IFilter for that file type.
9) Content is crawled only if the relevant file name extension is included in
the file-type inclusions list and an IFilter that supports that file type is
installed on the index server.
10) The crawler uses protocol handlers to access content. When creating a
content source, shared services administrators specify the protocol handler
that the crawler will use when crawling the URLs specified in that content
source.
11) Examples of default protocol handlers: http, https, bdc, bdc2, sps, sps3,
sts, rb.
12) If you want to crawl
content that does not have a protocol handler installed, you must install a
third-party or custom protocol handler before you can crawl that content.
Several third-party protocol handlers
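To tie the protocol handler, file-type inclusion, and IFilter notes together,
here is a rough Python sketch of the decision the crawler makes for one
file-like URL. It is only a model of the notes above, not SharePoint's
implementation: the three sets, their contents, and the function name are
hypothetical, and URLs without a file name extension are not handled specially.

```python
import posixpath
from urllib.parse import urlsplit

# Hypothetical registries for illustration only (not SharePoint APIs).
PROTOCOL_HANDLERS = {"http", "https", "bdc", "bdc2", "sps", "sps3", "sts", "rb"}
FILE_TYPE_INCLUSIONS = {".doc", ".txt", ".html", ".pdf"}
INSTALLED_IFILTERS = {".doc", ".txt", ".html"}  # assume no PDF IFilter yet


def crawl_decision(url):
    """Describe what would be indexed for `url` under the assumptions above."""
    parts = urlsplit(url)
    # Step 1: the URL prefix selects the protocol handler.
    if parts.scheme.lower() not in PROTOCOL_HANDLERS:
        return "skipped: no protocol handler"
    # Step 2: the file name extension must be in the inclusions list.
    extension = posixpath.splitext(parts.path)[1].lower()
    if extension not in FILE_TYPE_INCLUSIONS:
        return "skipped: file type not included"
    # Step 3: without an IFilter, only the file's properties can be indexed.
    if extension not in INSTALLED_IFILTERS:
        return "properties only"
    return "full content"


for url in ("http://contoso/docs/plan.doc",
            "http://contoso/docs/manual.pdf",
            "http://contoso/tools/setup.exe",
            "ftp://contoso/readme.txt"):
    print(url, "->", crawl_decision(url))
```

Adding ".pdf" to the IFilter set in this sketch corresponds to installing and
registering a PDF IFilter on the index server.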
Search Performance
1) Database server: The index server writes metadata that it collects from
crawled documents into tables on the database server. When Indexer Performance
is set to Maximum, the index server can generate data at a rate that overloads
the database server. This can affect the performance of other applications that
are using the same database server. It can also affect the performance of other
shared services that are running under the shared services provider (SSP), such
as Excel Calculation Services.
2) Index server: Indexing can place considerable demands on index server
resources such as the disk, processors, and memory. An index server must have
sufficient hardware to accommodate the amount of indexing required by your
organization.
3) Web front-end server: To crawl content on local SharePoint sites, the index
server sends requests to the Web front-end servers that host the content. Such
requests consume resources on the Web front-end servers and can thus reduce the
responsiveness of the SharePoint sites that are hosted on these servers for end
users.
4) Monitoring server performance during crawls can help you determine the
appropriate setting for Indexer Performance. We recommend that you conduct your
own testing to balance crawl speed, network latency, database load, and the
load on crawled servers.
5) Consider the following suggestions regarding adjusting the Indexer
Performance setting:
· If you are using the index server and database server only for searching
(using the Office SharePoint Server Search service), you might want to set the
Indexer Performance level to Maximum and note how this affects your database
server performance. If the increase in database server CPU utilization exceeds
30 percent, we recommend changing the Indexer Performance level to Partly
reduced.
· If the index server and database server are shared across multiple services,
such as the Office SharePoint Server Search service and Excel Calculation
Services, we recommend that you select the Partly reduced or Reduced setting
for Indexer Performance.
6) Manage crawler impact: Content crawls can place a significant load on
crawled servers and thereby adversely affect response times for server users.
Therefore, we recommend that you use crawler impact rules to specify how
aggressively your crawler should perform. A search services administrator can
manage the effect of the crawler on a crawled site by using a crawler impact
rule to specify one of the following:
· The maximum number of documents that the crawler can request at a time from
the specified site.
· The frequency with which the crawler can request any particular document from
the specified site.
7) Try to avoid crawling internal servers at peak load times.
8) You can increase or limit the quantity of content that is crawled by using:
· Crawl settings in the content sources: For example, you can specify to crawl
only the start addresses that are specified in a particular content source, or
you can specify how many levels deep in the namespace (from those start
addresses) to crawl and how many server hops to allow. Note that the options
that are available within a content source for specifying the quantity of
content that is crawled vary by content-source type.
· File type inclusions: You can choose the file types that you want to crawl.
· Crawl rules: You can use crawl rules to exclude all items in a given path
from being crawled. This is a good way to ensure that subsites that you do not
want to index are not crawled with a parent site that you are crawling. You can
also use crawl rules to increase the amount of content that is crawled, for
example by crawling complex URLs for a given path.
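To get a feel for how page depth, server hops, and exclusion rules limit a
crawl, here is a small Python sketch that walks an in-memory link graph. It is
a simplified model and not how SharePoint counts internally: the function name
and the link graph are made up, depth is counted as links followed from the
start address, a server hop is counted each time a link changes host name, and
an exclusion rule is a plain URL prefix.

```python
from collections import deque
from urllib.parse import urlsplit


def plan_crawl(start, links, max_depth, max_hops, excluded_prefixes=()):
    """Breadth-first walk over `links` (url -> list of linked urls).

    Returns the URLs that fall inside the depth, hop, and exclusion limits,
    in the order they would be visited.
    """
    seen = {start}
    queue = deque([(start, 0, 0)])  # (url, page depth, server hops so far)
    order = []
    while queue:
        url, depth, hops = queue.popleft()
        order.append(url)
        for target in links.get(url, []):
            if target in seen or target.startswith(tuple(excluded_prefixes)):
                continue
            next_depth = depth + 1
            changed_host = urlsplit(target).netloc != urlsplit(url).netloc
            next_hops = hops + int(changed_host)
            if next_depth > max_depth or next_hops > max_hops:
                continue
            seen.add(target)
            queue.append((target, next_depth, next_hops))
    return order


links = {
    "http://a/": ["http://a/1", "http://b/", "http://a/private/x"],
    "http://a/1": ["http://a/2"],
    "http://b/": ["http://c/"],
}
print(plan_crawl("http://a/", links, max_depth=1, max_hops=0,
                 excluded_prefixes=["http://a/private/"]))
print(plan_crawl("http://a/", links, max_depth=2, max_hops=1))
```

Even on this tiny graph, raising the depth and hop limits by one adds several
pages, which is why starting with small numbers on a highly connected site is
the safer choice.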