
Solr入门之官方文档6.0阅读笔记系列(十)

2016-09-07 09:38:24           

The Well-Configured Solr Instance

Notes on how to tune a Solr instance for best performance.

Configuring solrconfig.xml

The configuration in solrconfig.xml has a major impact on how Solr works. It covers:

request handlers, which process the requests to Solr, such as requests to add documents to the index or requests to return results for a query

listeners, processes that "listen" for particular query-related events; listeners can be used to trigger the execution of special code, such as invoking some common queries to warm-up caches

the Request Dispatcher for managing HTTP communications

the Admin Web interface

parameters related to replication and duplication (these parameters are covered in detail in Legacy Scaling and Distribution)

Topics covered:

DataDir and DirectoryFactory in SolrConfig

Lib Directives in SolrConfig

Schema Factory Definition in SolrConfig

IndexConfig in SolrConfig

RequestHandlers and SearchComponents in SolrConfig

InitParams in SolrConfig

UpdateHandlers in SolrConfig

Query Settings in SolrConfig

RequestDispatcher in SolrConfig

Update Request Processors

Codec Factory

Substituting Properties in Solr Config Files

solrconfig.xml supports dynamic substitution of property values:

${propertyname[:option default value]}

The value after the optional colon is a default; it can be overridden by a value supplied at runtime, and if neither a default nor a runtime value exists, Solr reports an error.

There are several ways to specify these variables:

JVM System Properties

Any JVM System properties, usually specified using the -D flag when starting the JVM, can be used as variables in any XML configuration file in Solr.

For example, in the sample solrconfig.xml files, you will see this value which defines the locking type to use:

${solr.lock.type:native}

This means the lock type defaults to "native", but when starting Solr you could override it using a JVM system property by launching Solr with:

bin/solr start -Dsolr.lock.type=none

In general, any Java system property that you want to set can be passed through the bin/solr script using the standard -Dproperty=value syntax. Alternatively, you can add common system properties to the SOLR_OPTS environment variable defined in the Solr include file (bin/solr.in.sh). For more information about how the Solr include file works, refer to: Taking Solr to Production.

So there are two ways to set these parameters:

pass them on the command line at startup

set them in the Solr include file

solrcore.properties

If the configuration directory for a Solr core contains a file named solrcore.properties, that file can contain any arbitrary user-defined property names and values using the Java standard properties file format, and those properties can be used as variables in the XML configuration files for that Solr core.

For example, the following solrcore.properties file could be created in the conf/ directory of a collection using one of the example configurations, to override the lockType used.

#conf/solrcore.properties

solr.lock.type=none

This second approach uses solrcore.properties. By default this file lives in the core's conf/ directory; a different name and location can be specified in core.properties.

User defined properties from core.properties

For example, consider the following core.properties file:

#core.properties

name=collection2

my.custom.prop=edismax

The my.custom.prop property can then be used as a variable, such as in solrconfig.xml:

${my.custom.prop}
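A sketch of how that property might then be consumed (the surrounding handler is illustrative; defType=edismax is the standard query-parser setting that the value "edismax" suggests):

```xml
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <!-- substituted with "edismax" from core.properties at load time -->
    <str name="defType">${my.custom.prop}</str>
  </lst>
</requestHandler>
```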

Implicit Core Properties

Implicitly defined core properties:

All implicit properties use the solr.core. name prefix, and reflect the runtime value of the equivalent core.properties property:

solr.core.name

solr.core.config

solr.core.schema

solr.core.dataDir

solr.core.transient

solr.core.loadOnStartup
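As a quick sketch of using one of these (the handler path and parameter name here are hypothetical, purely for illustration):

```xml
<!-- hypothetical handler default that echoes which core served the request -->
<requestHandler name="/coreinfo" class="solr.SearchHandler">
  <lst name="defaults">
    <!-- resolves to the current core's name at load time -->
    <str name="servedBy">${solr.core.name}</str>
  </lst>
</requestHandler>
```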

DataDir and DirectoryFactory in SolrConfig

Specifying a Location for Index Data with the dataDir Parameter

The dataDir parameter specifies where Solr stores its index data, for example:

<dataDir>/var/data/solr/</dataDir>

If you are using replication to replicate the Solr index (as described in Legacy Scaling and Distribution), then the directory should correspond to the index directory used in the replication configuration.

Both relative and absolute paths are supported; with replication the path must match the replication configuration.

Specifying the DirectoryFactory For Your Index

You can force a particular implementation by specifying solr.MMapDirectoryFactory, solr.NIOFSDirectoryFactory, or solr.SimpleFSDirectoryFactory.

<directoryFactory name="DirectoryFactory"
                  class="${solr.directoryFactory:solr.StandardDirectoryFactory}"/>

The solr.RAMDirectoryFactory is memory based, not persistent, and does not work with replication. Use this DirectoryFactory to store your index in RAM.

Different DirectoryFactory implementations suit different operating systems and file systems. The index can also be stored on HDFS, by using solr.HdfsDirectoryFactory instead of either of the above implementations.
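A minimal sketch of the HDFS case (the namenode address and path are illustrative; solr.hdfs.home is the parameter documented in "Running Solr on HDFS"):

```xml
<directoryFactory name="DirectoryFactory" class="solr.HdfsDirectoryFactory">
  <str name="solr.hdfs.home">hdfs://namenode:8020/solr</str>
</directoryFactory>
```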

Lib Directives in SolrConfig

Lib directives may use regular expressions, and all directories are resolved relative to the Solr instance:

All directories are resolved as relative to the Solr instanceDir
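For example, the stock solrconfig.xml files load contrib jars with directives along these lines (the relative paths match the example configs' layout):

```xml
<!-- load every jar in a directory -->
<lib dir="../../../contrib/extraction/lib" regex=".*\.jar" />
<!-- load only the jars whose names match the regex -->
<lib dir="../../../dist/" regex="solr-cell-\d.*\.jar" />
```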

Schema Factory Definition in SolrConfig

While the "read" features of the Solr API are supported for all Schema types, support for making Schema modifications programmatically depends on the schemaFactory in use.

Managed Schema Default

Solr implicitly uses a ManagedIndexSchemaFactory

An example:

<schemaFactory class="ManagedIndexSchemaFactory">
  <bool name="mutable">true</bool>
  <str name="managedSchemaResourceName">managed-schema</str>
</schemaFactory>

mutable - controls whether changes may be made to the Schema data. This must be set to true to allow edits to be made with the Schema API.

managedSchemaResourceName is an optional parameter that defaults to "managed-schema", and defines a new name for the schema file that can be anything other than "schema.xml".

Classic schema.xml

Disallows any programmatic changes to the Schema at run time: the schema is edited manually, and changes take effect only when the core is reloaded.
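Classic behavior is selected with a one-line factory declaration in solrconfig.xml:

```xml
<schemaFactory class="ClassicIndexSchemaFactory"/>
```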

Switching from schema.xml to Managed Schema

A hand-edited, non-editable schema.xml can be converted to the editable managed-schema mode by changing the schemaFactory configured in solrconfig.xml.

Changing to Manually Edited schema.xml

To switch back to a manually edited schema.xml, follow these steps:

Rename the managed-schema file to schema.xml.

Modify solrconfig.xml to replace the schemaFactory class.

Remove any ManagedIndexSchemaFactory definition if it exists.

Add a ClassicIndexSchemaFactory definition as shown above.

Reload the core(s).

If you are using SolrCloud, you may need to modify the files via ZooKeeper.

IndexConfig in SolrConfig

In most cases, the defaults are fine

...

Parameters covered in this section:

Writing New Segments

Merging Index Segments

Compound File Segments

Index Locks

Other Indexing Settings

Writing New Segments

ramBufferSizeMB and maxBufferedDocs control when buffered updates are flushed to a new segment, and useCompoundFile controls whether segments use the compound file format. The stock values:

<ramBufferSizeMB>100</ramBufferSizeMB>
<maxBufferedDocs>1000</maxBufferedDocs>
<useCompoundFile>false</useCompoundFile>

These settings control how updates are written out to files.

Merging Index Segments

mergePolicyFactory

The default in Solr is to use a TieredMergePolicy. Other policies available are the LogByteSizeMergePolicy and LogDocMergePolicy.

<mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
  <int name="maxMergeAtOnce">10</int>
  <int name="segmentsPerTier">10</int>
</mergePolicyFactory>

Controlling Segment Sizes: Merge Factors

For TieredMergePolicy, this is controlled by setting the maxMergeAtOnce and segmentsPerTier options, while LogByteSizeMergePolicy has a single mergeFactor option (all of which default to "10").

Merging index segments speeds up searches, at the cost of extra time spent during indexing and commits.

Customizing Merge Policies

An example (a SortingMergePolicyFactory wrapping a TieredMergePolicyFactory, reconstructed from the reference guide's example):

<mergePolicyFactory class="org.apache.solr.index.SortingMergePolicyFactory">
  <str name="sort">timestamp desc</str>
  <str name="wrapped.prefix">inner</str>
  <str name="inner.class">org.apache.solr.index.TieredMergePolicyFactory</str>
  <int name="inner.maxMergeAtOnce">10</int>
  <int name="inner.segmentsPerTier">10</int>
</mergePolicyFactory>

mergeScheduler

The merge scheduler controls how merges are performed

The default, ConcurrentMergeScheduler, performs merges in background threads.

The alternative, SerialMergeScheduler, performs merges one at a time, serially.

mergedSegmentWarmer

Warming merged segments benefits near-real-time search.
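As an example, the reference guide's stock snippet plugs in Lucene's simple warmer:

```xml
<mergedSegmentWarmer class="org.apache.lucene.index.SimpleMergedSegmentWarmer"/>
```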

Compound File Segments

Compound file segments combine a segment's various index files into a single file.

Index Locks

lockType

The available lock types with StandardDirectoryFactory (the default) are:

native
simple
single
hdfs

The default is native:

<lockType>${solr.lock.type:native}</lockType>

writeLockTimeout

The maximum time to wait for a write lock, in milliseconds:

<writeLockTimeout>1000</writeLockTimeout>

Other Indexing Settings

Some remaining parameters:

reopenReaders

deletionPolicy

infoStream

Example:

<reopenReaders>true</reopenReaders>
<deletionPolicy class="solr.SolrDeletionPolicy">
  <str name="maxCommitsToKeep">1</str>
  <str name="maxOptimizedCommitsToKeep">0</str>
  <str name="maxCommitAge">1DAY</str>
</deletionPolicy>
<infoStream>false</infoStream>

RequestHandlers and SearchComponents in SolrConfig

Request Handlers

SearchHandlers

UpdateRequestHandlers

ShardHandlers

Other Request Handlers

Search Components

Default Components

First-Components and Last-Components

Components

Other Useful Components

Request Handlers

Request handlers map request-processing logic onto URL paths.

SearchHandlers

Covers their parameters and characteristics.

UpdateRequestHandlers

ShardHandlers

Other Request Handlers

In practice Solr has only a handful of handler types, four or five at present, and two are in common use: search and update.

Search Components

Search components define the logic that is used by the SearchHandler to perform queries for users.

They are the counterpart of the search handler.

Default Components

Unless modified with first-components and last-components, the default components run in this order:

query solr.QueryComponent Described in the section Query Syntax and Parsing.

facet solr.FacetComponent Described in the section Faceting.

mlt solr.MoreLikeThisComponent Described in the section MoreLikeThis.

highlight solr.HighlightComponent Described in the section Highlighting.

stats solr.StatsComponent Described in the section The Stats Component.

debug solr.DebugComponent Described in the section on Common Query Parameters.

expand solr.ExpandComponent Described in the section Collapse and Expand Results.

A default component can be replaced by configuring a component with the same name.

First-Components and Last-Components

<arr name="first-components">
  <str>mycomponent</str>
</arr>
<arr name="last-components">
  <str>spellcheck</str>
</arr>

Components

If components are declared directly, rather than added with first-components/last-components, the default components are not used:

<arr name="components">
  <str>mycomponent</str>
  <str>query</str>
  <str>debug</str>
</arr>

Other Useful Components

SpellCheckComponent, described in the section Spell Checking.

TermVectorComponent, described in the section The Term Vector Component.

QueryElevationComponent, described in the section The Query Elevation Component.

TermsComponent, described in the section The Terms Component.

InitParams in SolrConfig

An <initParams> section of solrconfig.xml allows you to define request handler parameters outside of the handler configuration. For example, the stock configs set the default search field for a group of paths:

<initParams path="/update/**,/query,/select,/tvrh,/elevate,/spell,/browse">
  <lst name="defaults">
    <str name="df">_text_</str>
  </lst>
</initParams>

This applies a uniform set of defaults to every handler matched by the path.

If we later want to change the /query request handler to search a different field by default, we could override this by defining the df parameter in the section for /query.

So defaults can be overridden for an individual path.

Wildcards

An example:

<initParams name="myParams" path="/myhandler,/root/*,/root1/**">
  <lst name="defaults">
    <str name="fl">_text_</str>
  </lst>
  <lst name="invariants">
    <str name="rows">10</str>
  </lst>
  <lst name="appends">
    <str name="df">title</str>
  </lst>
</initParams>

UpdateHandlers in SolrConfig

...

Topics covered in this section:

Commits

commit and softCommit

autoCommit

commitWithin

Event Listeners

Transaction Log

Commits

Data sent to Solr is not searchable until it has been committed to the index.

commit and softCommit

commit is a hard commit: the data is fully flushed to disk.

softCommit makes index changes visible to searches quickly, enabling near-real-time indexing, but the changes are not durable; they can be lost if the machine goes down.
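As a sketch, a hard commit can be requested explicitly inside an XML update message, while a soft commit is typically requested via the softCommit=true request parameter on the update request:

```xml
<!-- explicit hard commit; waitSearcher=false returns before the new searcher is registered -->
<commit waitSearcher="false"/>
```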

autoCommit

<autoCommit>
  <maxDocs>10000</maxDocs>
  <maxTime>1000</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>

<autoSoftCommit>
  <maxTime>1000</maxTime>
</autoSoftCommit>

commitWithin

commitWithin is most often used for near-real-time search, and for that reason the default is to perform a soft commit. This can be changed:

<commitWithin>
  <softCommit>false</softCommit>
</commitWithin>

With this configuration, when you call commitWithin as part of your update message, it will automatically perform a hard commit every time.

Event Listeners

These can be triggered to occur after any commit (event="postCommit") or only after optimize commands (event="postOptimize").

Either kind of listener can be configured, and a listener can run custom handling when its event fires, for example:

RunExecutableListener - runs an external executable; its parameters include exe, dir, wait, args, and env.
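The reference guide's example for this listener runs the snapshooter script shipped with Solr:

```xml
<listener event="postCommit" class="solr.RunExecutableListener">
  <str name="exe">solr/bin/snapshooter</str>
  <str name="dir">.</str>
  <bool name="wait">true</bool>
  <arr name="args"> <str>arg1</str> <str>arg2</str> </arr>
  <arr name="env"> <str>MYVAR=val1</str> </arr>
</listener>
```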

Transaction Log

a transaction log is required for that feature. It is configured in the updateHandler section of solrconfig.xml.

It is enabled with:

<updateLog>
  <str name="dir">${solr.ulog.dir:}</str>
</updateLog>

There are a few configurable parameters:

<updateLog>
  <str name="dir">${solr.ulog.dir:}</str>
  <int name="numRecordsToKeep">500</int>
  <int name="maxNumLogsToKeep">20</int>
  <int name="numVersionBuckets">65536</int>
</updateLog>

Query Settings in SolrConfig

The settings in this section affect the way that Solr will process and respond to queries

...

Topics covered in this section:

Caches

Query Sizing and Warming

Query-Related Listeners

Caches

Caches store query inputs and their results; when the same query arrives again it can be answered from the cache, speeding up queries.

When a new index searcher is opened, its caches can be pre-warmed from the old searcher's caches.

In Solr, there are three cache implementations: solr.search.LRUCache, solr.search.FastLRUCache, and solr.search.LFUCache.

filterCache

When a query uses the fq parameter, the filter and its matching document set are cached, so that a later query with the same filter can hit the cache and return quickly.

<filterCache class="solr.LRUCache"
             size="512"
             initialSize="512"
             autowarmCount="128"/>

queryResultCache

This cache holds the results of previous searches: ordered lists of document IDs (DocList) based on a query, a sort, and the range of documents requested.

<queryResultCache class="solr.LRUCache"
                  size="512"
                  initialSize="512"
                  autowarmCount="128"
                  maxRamMB="1000"/>

documentCache

This cache holds Lucene Document objects (the stored fields for each document). Since Lucene internal document IDs are transient, this cache is not auto-warmed.

<documentCache class="solr.LRUCache"
               size="512"
               initialSize="512"
               autowarmCount="0"/>

User Defined Caches

You can also declare your own named caches:

<cache name="myUserCache"
       class="solr.LRUCache"
       size="4096"
       initialSize="1024"
       autowarmCount="1024"
       regenerator="org.mycompany.mypackage.MyRegenerator" />

If autowarming is wanted without custom regeneration logic, the stock regenerator can be used instead: regenerator="solr.NoOpRegenerator".

Query Sizing and Warming

maxBooleanClauses

The maximum number of Boolean clauses in a query. This is a global Lucene setting, so the last core to be initialized determines the effective value:

<maxBooleanClauses>1024</maxBooleanClauses>

enableLazyFieldLoading

<enableLazyFieldLoading>true</enableLazyFieldLoading>

useFilterForSortedQuery

Useful when queries do not sort by score:

<useFilterForSortedQuery>true</useFilterForSortedQuery>

queryResultWindowSize

When a result page is cached, a window of results larger than the requested range (rounded up to this size) is cached along with it:

<queryResultWindowSize>20</queryResultWindowSize>

queryResultMaxDocsCached

<queryResultMaxDocsCached>200</queryResultMaxDocsCached>

useColdSearcher

This setting controls whether search requests for which there is not a currently registered searcher should wait for a new searcher to warm up (false) or proceed immediately (true). When set to "false", requests will block until the searcher has warmed its caches.

<useColdSearcher>false</useColdSearcher>

maxWarmingSearchers

<maxWarmingSearchers>2</maxWarmingSearchers>

Query-Related Listeners

Two types of listeners can be configured, firstSearcher and newSearcher. For example, a static warming query for firstSearcher:

<listener event="firstSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst>
      <str name="q">static firstSearcher warming in solrconfig.xml</str>
    </lst>
  </arr>
</listener>

RequestDispatcher in SolrConfig

Topics in this section:

handleSelect Element

requestParsers Element

httpCaching Element

handleSelect Element

handleSelect exists for backwards compatibility with the legacy /select?qt=... request style.

...
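A sketch of where the attribute lives (on the request dispatcher element, as in the stock configs):

```xml
<requestDispatcher handleSelect="false">
  <!-- requestParsers, httpCaching, etc. go here -->
</requestDispatcher>
```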

requestParsers Element

The <requestParsers> sub-element controls values related to parsing requests. This is an empty XML element that doesn't have any content, only attributes. Some of the attributes:

<requestParsers enableRemoteStreaming="true"
                multipartUploadLimitInKB="2048000"
                formdataUploadLimitInKB="2048"
                addHttpRequestToContext="false" />

httpCaching Element

<httpCaching never304="false"
             lastModFrom="openTime"
             etagSeed="Solr">
  <cacheControl>max-age=30, public</cacheControl>
</httpCaching>

cacheControl Element

Update Request Processors

Anatomy and life cycle

Configuration

Update processors in SolrCloud

Using custom chains

Update Request Processor Factories

Anatomy and life cycle

An update request flows through a default processor chain unless you configure a chain of your own.

Processors are created by processor factories, which must satisfy two requirements:

An update request processor need not be thread safe because it is used by one and only one request thread and destroyed once the request is complete.

The factory class can accept configuration parameters and maintain any state that may be required between requests. The factory class must be thread-safe.

Configuration

Chains are configured in solrconfig.xml and loaded at startup, or selected at request time via request parameters.

A custom chain should be modeled on the default chain, which contains some essential processing steps.

The default update request processor chain

In order:

LogUpdateProcessorFactory - Tracks the commands processed during this request and logs them.

DistributedUpdateProcessorFactory - Responsible for distributing update requests to the right node, e.g. routing requests to the leader of the right shard and distributing updates from the leader to each replica. This processor is activated only in SolrCloud mode.

RunUpdateProcessorFactory - Executes the update using internal Solr APIs.

Custom update request processor chain

<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">id</str>
    <bool name="overwriteDupes">false</bool>
    <str name="fields">name,features,cat</str>
    <str name="signatureClass">solr.processor.Lookup3Signature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

Solr will automatically insert DistributedUpdateProcessorFactory into any chain that does not include it, just prior to the RunUpdateProcessorFactory.

Configuring individual processors as top-level plugins

Individual processors can also be declared as top-level <updateProcessor> plugins:

<updateProcessor class="solr.processor.SignatureUpdateProcessorFactory" name="signature">
  <bool name="enabled">true</bool>
  <str name="signatureField">id</str>
  <bool name="overwriteDupes">false</bool>
  <str name="fields">name,features,cat</str>
  <str name="signatureClass">solr.processor.Lookup3Signature</str>
</updateProcessor>
<updateProcessor class="solr.RemoveBlankFieldUpdateProcessorFactory" name="remove_blanks"/>

These named processors can then be referenced from updateRequestProcessorChains, or via request parameters.

Update processors in SolrCloud

A critical SolrCloud functionality is the routing and distributing of requests. For update requests this routing is implemented by the DistributedUpdateRequestProcessor, and this processor is given a special status by Solr due to its important function.

In a chain, the processors before the distributed processor run on whichever node first receives the update; at the distributed processor the request is routed to the correct shard leader, logged, and distributed to the replicas, where the remainder of the chain runs.

An example:

For example, consider the "dedupe" chain which we saw in a section above. Assume that a 3 node SolrCloud cluster exists where node A hosts the leader of shard1, node B hosts the leader of shard2 and node C hosts the replica of shard2. Assume that an update request is sent to node A which forwards the update to node B (because the update belongs to shard2) which then distributes the update to its replica node C. Let's see what happens at each node:

Node A: Runs the update through the SignatureUpdateProcessor (which computes the signature and puts it in the "id" field), then LogUpdateProcessor and then DistributedUpdateProcessor. This processor determines that the update actually belongs to node B and is forwarded to node B. The update is not processed further. This is required because the next processor which is RunUpdateProcessor will execute the update against the local shard1 index which would lead to duplicate data on shard1 and shard2.

Node B: Receives the update and sees that it was forwarded by another node. The update is directly sent to DistributedUpdateProcessor because it has already been through the SignatureUpdateProcessor on node A and doing the same signature computation again would be redundant. The DistributedUpdateProcessor determines that the update indeed belongs to this node, distributes it to its replica on Node C and then forwards the update further in the chain to RunUpdateProcessor.

Node C: Receives the update and sees that it was distributed by its leader. The update is directly sent to DistributedUpdateProcessor which performs some consistency checks and forwards the update further in the chain to RunUpdateProcessor.

In summary:

All processors before DistributedUpdateProcessor are only run on the first node that receives an update request, whether it be a forwarding node (e.g. node A in the above example) or a leader (e.g. node B). We call these pre-processors or just processors.

All processors after DistributedUpdateProcessor run only on the leader and the replica nodes. They are not executed on forwarding nodes. Such processors are called "post-processors".

Post-processors are declared with the post-processor attribute on a chain, e.g.:

<updateRequestProcessorChain name="mychain" post-processor="remove_blanks">
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

Using custom chains

The update.chain request parameter lets you choose which update processor chain handles a request:

curl "http://localhost:8983/solr/gettingstarted/update/json?update.chain=dedupe&commit=true" -H 'Content-type: application/json' -d '

[

{

"name" : "The Lightning Thief",

"features" : "This is just a test",

"cat" : ["book","hardcover"]

},

{

"name" : "The Lightning Thief",

"features" : "This is just a test",

"cat" : ["book","hardcover"]

}

]'

processor & post-processor request parameters

These two parameters can be used to construct a processing chain dynamically at request time.

Constructing a chain at request time

# Executing processors configured in solrconfig.xml as (pre)-processors

curl "http://localhost:8983/solr/gettingstarted/update/json?processor=remove_blanks,signature&commit=true" -H 'Content-type: application/json' -d '

[

{

"name" : "The Lightning Thief",

"features" : "This is just a test",

"cat" : ["book","hardcover"]

},

{

"name" : "The Lightning Thief",

"features" : "This is just a test",

"cat" : ["book","hardcover"]

}

]'

# Executing processors configured in solrconfig.xml as pre and post processors

curl "http://localhost:8983/solr/gettingstarted/update/json?processor=remove_blanks&post-processor=signature&commit=true" -H 'Content-type: application/json' -d '

[

{

"name" : "The Lightning Thief",

"features" : "This is just a test",

"cat" : ["book","hardcover"]

},

{

"name" : "The Lightning Thief",

"features" : "This is just a test",

"cat" : ["book","hardcover"]

}

]'

Configuring a custom chain as a default

There are two ways to make a custom chain the default: add either "update.chain" or "processor" and "post-processor" as default parameters for a given path. This can be done either via InitParams in SolrConfig or by adding them in a "defaults" section, which is supported by all request handlers.

Example using InitParams:

<initParams path="/update/**">
  <lst name="defaults">
    <str name="update.chain">add-unknown-fields-to-the-schema</str>
  </lst>
</initParams>

Example using defaults on a request handler:

<requestHandler name="/update/extract" startup="lazy" class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <str name="update.chain">add-unknown-fields-to-the-schema</str>
  </lst>
</requestHandler>

Update Request Processor Factories

The following factory classes are available; see the reference guide for the details of each:

AddSchemaFieldsUpdateProcessorFactory:

CloneFieldUpdateProcessorFactory:

DefaultValueUpdateProcessorFactory:

DocBasedVersionConstraintsProcessorFactory:

DocExpirationUpdateProcessorFactory:

IgnoreCommitOptimizeUpdateProcessorFactory:

RegexpBoostProcessorFactory:

SignatureUpdateProcessorFactory:

StatelessScriptUpdateProcessorFactory:

TimestampUpdateProcessorFactory:

URLClassifyProcessorFactory:

UUIDUpdateProcessorFactory:

FieldMutatingUpdateProcessorFactory derived factories

ConcatFieldUpdateProcessorFactory

CountFieldValuesUpdateProcessorFactory

FieldLengthUpdateProcessorFactory

FirstFieldValueUpdateProcessorFactory

HTMLStripFieldUpdateProcessorFactory

IgnoreFieldUpdateProcessorFactory

LastFieldValueUpdateProcessorFactory

MaxFieldValueUpdateProcessorFactory

MinFieldValueUpdateProcessorFactory

ParseBooleanFieldUpdateProcessorFactory

ParseDateFieldUpdateProcessorFactory

ParseNumericFieldUpdateProcessorFactory derived classes

ParseDoubleFieldUpdateProcessorFactory: Attempts to mutate selected fields that have only CharSequence-typed values into Double values.

ParseFloatFieldUpdateProcessorFactory: Attempts to mutate selected fields that have only CharSequence-typed values into Float values.

ParseIntFieldUpdateProcessorFactory:Attempts to mutate selected fields that have only CharSequence-typed values into Integer values.

ParseLongFieldUpdateProcessorFactory:Attempts to mutate selected fields that have only CharSequence-typed values into Long values.

PreAnalyzedUpdateProcessorFactory

RegexReplaceProcessorFactory:

RemoveBlankFieldUpdateProcessorFactory:

TrimFieldUpdateProcessorFactory:

TruncateFieldUpdateProcessorFactory:

UniqFieldsUpdateProcessorFactory:

Update Processor factories that can be loaded as plugins

Several factory packages can be loaded as plugins and extended:

LangDetectLanguageIdentifierUpdateProcessorFactory (based on the langdetect language-detection library)

TikaLanguageIdentifierUpdateProcessorFactory

UIMAUpdateRequestProcessorFactory

Update Processor factories you should not modify or remove

Do not casually modify or remove Solr's built-in update processor factories.

Codec Factory

Defines the codec used when writing the index to disk; if none is defined, Solr uses the default. It is configured in solrconfig.xml.

A compressionMode option:

BEST_SPEED (default) is optimized for search speed performance

BEST_COMPRESSION is optimized for disk space usage

Example:

<codecFactory class="solr.SchemaCodecFactory">
  <str name="compressionMode">BEST_COMPRESSION</str>
</codecFactory>

Solr Cores and solr.xml

In Solr, the term core is used to refer to a single index and associated transaction log and configuration files (including the solrconfig.xml and Schema files, among others).

In standalone mode, solr.xml must reside in solr_home. In SolrCloud mode, solr.xml will be loaded from ZooKeeper if it exists, with fallback to solr_home.

The recommended way is to dynamically create cores/collections using the APIs

The following sections describe these options in more detail.

Format of solr.xml: Details on how to define solr.xml, including the acceptable parameters for the solr.xml file

Defining core.properties: Details on placement of core.properties and available property options.

CoreAdmin API: Tools and commands for core administration using a REST API.

Config Sets: How to use configsets to avoid duplicating effort when defining a new core.

Format of solr.xml

This section will describe the default solr.xml file included with Solr and how to modify it for your needs. For details on how to configure core.properties, see the section Defining core.properties.

Defining solr.xml

Solr.xml Parameters

The <solr> Element

The <solrcloud> element

The <logging> element

The <logging><watcher> element

The <shardHandlerFactory> element

Substituting JVM System Properties in solr.xml

Defining solr.xml

You can find solr.xml in your Solr Home directory or in Zookeeper. The default solr.xml file looks like this:

<solr>

  <solrcloud>
    <str name="host">${host:}</str>
    <int name="hostPort">${jetty.port:8983}</int>
    <str name="hostContext">${hostContext:solr}</str>
    <int name="zkClientTimeout">${zkClientTimeout:15000}</int>
    <bool name="genericCoreNodeNames">${genericCoreNodeNames:true}</bool>
  </solrcloud>

  <shardHandlerFactory name="shardHandlerFactory"
    class="HttpShardHandlerFactory">
    <int name="socketTimeout">${socketTimeout:0}</int>
    <int name="connTimeout">${connTimeout:0}</int>
  </shardHandlerFactory>

</solr>

Unless -DzkHost or -DzkRun is specified at startup time, the <solrcloud> section is ignored.

Solr.xml Parameters

The <solr> Element

Several attribute values can be set here; see the reference guide's table.

The <solrcloud> element

This section is ignored unless the Solr instance is started with either -DzkRun or -DzkHost.

It holds the parameters for SolrCloud mode, including the access-control token configuration.

The <logging> element

The logging class and whether logging is enabled.

The <logging><watcher> element

Configuration for the log watcher.

The <shardHandlerFactory> element

Defines a shard handler:

Custom shard handlers can be defined in solr.xml if you wish to create a custom shard handler.

Since this is a custom shard handler, sub-elements are specific to the implementation.

Substituting JVM System Properties in solr.xml

JVM system properties can be referenced in solr.xml with the ${propertyname[:option default value]} syntax, giving a default after the colon; a property set dynamically at JVM startup overrides that default. For example:

<shardHandlerFactory name="shardHandlerFactory"
  class="HttpShardHandlerFactory">
  <int name="socketTimeout">${socketTimeout:0}</int>
  <int name="connTimeout">${connTimeout:0}</int>
</shardHandlerFactory>

Defining core.properties

core.properties is an ordinary Java properties file, for example:

name=my_core_name

Placement of core.properties

core.properties is placed in each core's directory under solr_home.

Defining core.properties Files

name

The name of the SolrCore. You'll use this name to reference the SolrCore when running commands with the CoreAdminHandler.

config

The configuration file name for a given core. The default is solrconfig.xml.

schema

The schema file name for a given core. The default is schema.xml but please note that if you are using a "managed schema" (the default behavior) then any value for this property which does not match the effective managedSchemaResourceName will be read once, backed up, and converted for managed schema use.

dataDir

The core's data directory (where indexes are stored) as either an absolute pathname, or a path relative to the value of instanceDir. This is data by default.

configSet

The name of a defined configset, if desired, to use to configure the core

properties

The name of the properties file for this core. The value can be an absolute pathname or a path relative to the value of instanceDir.

transient

If true, the core can be unloaded if Solr reaches the transientCacheSize. The default if not specified is false. Cores are unloaded in order of least recently used first. Setting to true is not recommended in SolrCloud mode.

loadOnStartup

If true, the default if it is not specified, the core will be loaded when Solr starts. Setting to false is not recommended in SolrCloud mode.

coreNodeName

Used only in SolrCloud, this is a unique identifier for the node hosting this replica. By default a coreNodeName is generated automatically, but setting this attribute explicitly allows you to manually assign a new core to replace an existing replica. For example: when replacing a machine that has had a hardware failure by restoring from backups on a new machine with a new hostname or port.

ulogDir

The absolute or relative directory for the update log for this core (SolrCloud)

shard

The shard to assign this core to (SolrCloud)

collection

The name of the collection this core is part of (SolrCloud).

roles

Future param for SolrCloud or a way for users to mark nodes for their own use

(A note from the original author: this one is not well understood.)

Additional "user defined" properties may be specified for use as variables. For more information on how to define local properties, see the section Substituting Properties in Solr Config Files.

CoreAdmin API

The CoreAdmin API applies mainly to standalone Solr; SolrCloud users should not typically use the CoreAdmin API directly.

It is implemented by the CoreAdminHandler, which manages the cores running in a node and is accessible at the /solr/admin/cores path.

Requests are HTTP requests that specify an "action" request parameter. All action names are uppercase, and are defined in depth in the sections below.

STATUS

CREATE

RELOAD

RENAME

SWAP

UNLOAD

MERGEINDEXES

SPLIT

REQUESTSTATUS

STATUS

The STATUS action returns the status of all running Solr cores, or status for only the named core.

http://localhost:8983/solr/admin/cores?action=STATUS&core=core0

Input

core - the name of a core for which to return status.

indexInfo - whether to return index information for the core; defaults to true. With many cores, setting it to false makes the response faster.

CREATE

The CREATE action creates a new core and registers it.

If a Solr core with the given name already exists, it will continue to handle requests while the new core isinitializing. When the new core is ready, it will take new requests and the old core will be unloaded.

In other words, creating a core whose name already exists replaces the old core once the new one is ready.

http://localhost:8983/solr/admin/cores?action=CREATE&name=coreX&instanceDir=path/to/dir&config=config_file_name.xml&dataDir=data

Input

name

instanceDir

config

schema

dataDir

configSet

collection

shard

property.name=value

async

Example

http://localhost:8983/solr/admin/cores?action=CREATE&name=my_core&collection=my_collection&shard=shard2

RELOAD

The RELOAD action loads a new core from the configuration of an existing, registered Solr core.

http://localhost:8983/solr/admin/cores?action=RELOAD&core=core0

Input

core

RENAME

The RENAME action changes the name of a Solr core.

http://localhost:8983/solr/admin/cores?action=RENAME&core=core0&other=core5

Input

core

other

async

SWAP

SWAP atomically swaps the names used to access two existing cores:

http://localhost:8983/solr/admin/cores?action=SWAP&core=core1&other=core0

UNLOAD

UNLOAD removes a core from Solr:

http://localhost:8983/solr/admin/cores?action=UNLOAD&core=core0

Input

MERGEINDEXES

Merging indexes:

The MERGEINDEXES action merges one or more indexes to another index.

http://localhost:8983/solr/admin/cores?action=MERGEINDEXES&core=new_core_name&indexDir=/solr_home/core1/data/index&indexDir=/solr_home/core2/data/index

Alternatively, we can instead use a srcCore parameter, as in this example:

http://localhost:8983/solr/admin/cores?action=mergeindexes&core=new_core_name&srcCore=core1&srcCore=core2

SPLIT

The SPLIT action splits an index into two or more indexes.

The SPLIT action supports five parameters, which are described in the table below

Input

core - the name of the core to be split.

path - multi-valued; the directory path(s) in which pieces of the index will be written.

targetCore - multi-valued; the target Solr core(s) into which pieces of the index will be merged.

ranges - a comma-separated list of hash ranges in hexadecimal format.

split.key - the key to be used for splitting the index.

async - a request ID to track this action, which will be processed asynchronously.

Examples

The core index will be split into as many pieces as the number of path or targetCore parameters.

Usage with two targetCore parameters:

http://localhost:8983/solr/admin/cores?action=SPLIT&core=core0&targetCore=core1&targetCore=core2

Usage with two path parameters:

http://localhost:8983/solr/admin/cores?action=SPLIT&core=core0&path=/path/to/index/1&path=/path/to/index/2

Usage with the split.key parameter:

http://localhost:8983/solr/admin/cores?action=SPLIT&core=core0&targetCore=core1&split.key=A!

Usage with ranges parameter:

http://localhost:8983/solr/admin/cores?action=SPLIT&core=core0&targetCore=core1&targetCore=core2&targetCore=core3&ranges=0-1f4,1f5-3e8,3e9-5dc

REQUESTSTATUS

REQUESTSTATUS returns the status of an already submitted asynchronous request.

Input

requestid

http://localhost:8983/solr/admin/cores?action=REQUESTSTATUS&requestid=1

Config Sets

Configsets are a way to share configuration files among multiple cores.

On a multicore Solr instance, you may find that you want to share configuration between a number of different cores. You can achieve this using named configsets, which are essentially shared configuration directories stored under a configurable configset base directory.

To create a configset, simply add a new directory under the configset base directory. The configset will be identified by the name of this directory. Then into this copy the config directory you want to share. The structure should look something like this:

/<configSetBaseDir>
    /configset1
        /conf
            /managed-schema
            /solrconfig.xml
    /configset2
        /conf
            /managed-schema
            /solrconfig.xml

The default base directory is $SOLR_HOME/configsets, and it can be configured in solr.xml.
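As a sketch of overriding it (the path is illustrative; configSetBaseDir is the documented solr.xml property):

```xml
<solr>
  <!-- cores created with configSet=... will look here for named configsets -->
  <str name="configSetBaseDir">/var/solr/configsets</str>
</solr>
```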

To create a new core using a configset, pass configSet as one of the core properties. For example, if you do this via the core admin API:

http://localhost:8983/solr/admin/cores?action=CREATE&name=mycore&instanceDir=path/to/instance&configSet=configset2

Configuration APIs

Solr includes several APIs that can be used to modify settings in solrconfig.xml.

These APIs can be used to modify solrconfig.xml:

Blob Store API

Config API

Request Parameters API

Managed Resources

Blob Store API

The Blob Store REST API provides REST methods to store, retrieve or list files in a Lucene index.

The blob store is only available when running in SolrCloud mode

The blob store API is implemented as a requestHandler. A special collection named ".system" must be created as the collection that contains the blob store index.

Create a .system Collection

You can create the .system collection with the Collections API, as in this example:

curl "http://localhost:8983/solr/admin/collections?action=CREATE&name=.system&replicationFactor=2"

Upload Files to Blob Store

After the .system collection has been created, files can be uploaded to the blob store with a request similar to the following:

curl -X POST -H 'Content-Type: application/octet-stream' --data-binary @{filename} http://localhost:8983/solr/.system/blob/{blobname}

For example, to upload a file named "test1.jar" as a blob named "test", you would make a POST request like:

curl -X POST -H 'Content-Type: application/octet-stream' --data-binary @test1.jar http://localhost:8983/solr/.system/blob/test

A GET request will return the list of blobs and other details:

curl http://localhost:8983/solr/.system/blob?omitHeader=true

Output:

{

"response":{"numFound":1,"start":0,"docs":[

{

"id":"test/1",

"md5":"20ff915fa3f5a5d66216081ae705c41b",

"blobName":"test",

"version":1,

"timestamp":"2015-02-04T16:45:48.374Z",

"size":13108}]

}

}

Details on individual blobs can be accessed with a request similar to:

curl http://localhost:8983/solr/.system/blob/{blobname}

For example, this request will return only the blob named 'test':

curl http://localhost:8983/solr/.system/blob/test?omitHeader=true

Output:

{

"response":{"numFound":1,"start":0,"docs":[

{

"id":"test/1",

"md5":"20ff915fa3f5a5d66216081ae705c41b",

"blobName":"test",

"version":1,

"timestamp":"2015-02-04T16:45:48.374Z",

"size":13108}]

}

}

The filestream response writer can return a particular version of a blob for download, as in:

curl http://localhost:8983/solr/.system/blob/{blobname}/{version}?wt=filestream > {outputfilename}

For the latest version of a blob, the {version} can be omitted:

curl http://localhost:8983/solr/.system/blob/{blobname}?wt=filestream > {outputfilename}

文件的上传,查看,及下载

Use a Blob in a Handler or Component

To use the blob as the class for a request handler or search component, you create a request handler in solrconfig.xml as usual. You will need to define the following parameters:

class: the fully qualified class name. For example, if you created a new request handler class called CRUDHandler, you would enter org.apache.solr.core.CRUDHandler.

runtimeLib: Set to true to require that this component be loaded from the classloader that loads the runtime jars.

Config API

This feature is enabled by default and works similarly in both SolrCloud and standalone mode.

When using this API, solrconfig.xml is not changed. Instead, all edited configuration is stored in a file called configoverlay.json. The values in configoverlay.json override the values in solrconfig.xml.
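The override semantics can be pictured as a two-layer lookup. This is just a sketch of the idea, not Solr code; the property names are borrowed from the examples later in this section:

```python
def effective_value(prop, solrconfig, overlay):
    """A property written to configoverlay.json wins over the
    value from solrconfig.xml; otherwise fall through."""
    return overlay.get(prop, solrconfig.get(prop))

# Hypothetical values for illustration only.
solrconfig = {"query.filterCache.size": 512}
overlay = {"query.filterCache.autowarmCount": 1000}  # written by set-property
```

A lookup of query.filterCache.autowarmCount yields the overlay's 1000, while query.filterCache.size still comes from solrconfig.xml.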

API Entry Points

Commands

Commands for Common Properties

Commands for Custom Handlers and Local Components

Commands for User-Defined Properties

How to Map solrconfig.xml Properties to JSON

Examples

Creating and Updating Common Properties

Creating and Updating Request Handlers

Creating and Updating User-Defined Properties

How It Works

Empty Command

Listening to config Changes

API Entry Points

/config: retrieve or modify the config. GET to retrieve and POST for executing commands

/config/overlay: retrieve the details in the configoverlay.json alone

/config/params: allows creating parameter sets that can override or take the place of parameters defined in solrconfig.xml.

Commands

The config commands are categorized into three different sections, which manipulate various data structures in solrconfig.xml. Each of these is described below.

Common Properties

Components

User-defined properties

The common properties are those that frequently need to be customized in a Solr instance. They are manipulated with two commands:

set-property: Set a well known property. The names of the properties are predefined and fixed. If theproperty has already been set, this command will overwrite the previous setting.

unset-property: Remove a property set using the set-property command.

Commands for Custom Handlers and Local Components

Note: the commands are case-insensitive and fall into three kinds of actions: add, update, and delete.

The full list of available commands follows below:

General Purpose Commands

These commands are the most commonly used:

add-requesthandler

update-requesthandler

delete-requesthandler

add-searchcomponent

update-searchcomponent

delete-searchcomponent

add-initparams

update-initparams

delete-initparams

add-queryresponsewriter

update-queryresponsewriter

delete-queryresponsewriter

Advanced Commands

These commands allow registering more advanced customizations to Solr:

add-queryparser

update-queryparser

delete-queryparser

add-valuesourceparser

update-valuesourceparser

delete-valuesourceparser

add-transformer

update-transformer

delete-transformer

add-updateprocessor

update-updateprocessor

delete-updateprocessor

add-queryconverter

update-queryconverter

delete-queryconverter

add-listener

update-listener

delete-listener

add-runtimelib

update-runtimelib

delete-runtimelib

What about updateRequestProcessorChain?

The Config API does not let you create or edit updateRequestProcessorChain elements. However, it is possible to create updateProcessor entries and use them by name to create a chain.

For example:

curl http://localhost:8983/solr/techproducts/config -H 'Content-type:application/json' -d '{
  "add-updateprocessor" : {
    "name" : "firstFld",
    "class" : "solr.FirstFieldValueUpdateProcessorFactory",
    "fieldName" : "test_s" }
}'

You can use this directly in your request by adding a parameter for the specific update processor, i.e., processor=firstFld.

Commands for User-Defined Properties

Solr lets users templatize solrconfig.xml using the placeholder format ${variable_name:default_val}. You could set the values using system properties, for example, -Dvariable_name=my_customvalue. The same can be achieved at runtime using these commands:

set-user-property: Set a user-defined property. If the property has already been set, this command will overwrite the previous setting.

unset-user-property: Remove a user-defined property.

The structure of the request is similar to the structure of requests using other commands, in the format of "command":{"variable_name":"property_value"}. You can add more than one variable at a time if necessary.

Note: this is like setting JVM properties at runtime.
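The placeholder rule (use the property if set, else the default, else fail to load) can be sketched as follows. This is an illustration of the ${name:default} syntax, not Solr's actual parser:

```python
import re

# Matches ${name} or ${name:default}.
_PLACEHOLDER = re.compile(r"\$\{([^}:]+)(?::([^}]*))?\}")

def substitute(text, props):
    """Expand ${name:default} placeholders: use the supplied property
    if present, else the default, else raise (Solr refuses to load a
    config containing an unresolvable property)."""
    def repl(match):
        name, default = match.group(1), match.group(2)
        if name in props:
            return str(props[name])
        if default is not None:
            return default
        raise KeyError("no value or default for property " + name)
    return _PLACEHOLDER.sub(repl, text)
```

With no property supplied, "${solr.lock.type:native}" expands to "native"; starting Solr with -Dsolr.lock.type=none makes it expand to "none" instead.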

How to Map solrconfig.xml Properties to JSON

Note: this maps solrconfig.xml elements to their JSON command equivalents.

Here is what a request handler looks like in solrconfig.xml:

<requestHandler name="/query" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <str name="wt">json</str>
    <str name="indent">true</str>
  </lst>
</requestHandler>

The same request handler defined with the Config API would look like this:

{
  "add-requesthandler":{
    "name":"/query",
    "class":"solr.SearchHandler",
    "defaults":{
      "echoParams":"explicit",
      "wt":"json",
      "indent":true
    }
  }
}

A searchComponent in solrconfig.xml looks like this:

<searchComponent name="elevator" class="QueryElevationComponent">
  <str name="queryFieldType">string</str>
  <str name="config-file">elevate.xml</str>
</searchComponent>

And the same searchComponent with the Config API:

{
  "add-searchcomponent":{
    "name":"elevator",
    "class":"QueryElevationComponent",
    "queryFieldType":"string",
    "config-file":"elevate.xml"
  }
}

Set autoCommit properties in solrconfig.xml:

<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <maxTime>15000</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>
</updateHandler>

Define the same properties with the Config API:

{
  "set-property": {
    "updateHandler.autoCommit.maxTime":15000,
    "updateHandler.autoCommit.openSearcher":false
  }
}

Name Components for the Config API

Note: components that have no name must be given one.

Examples

Creating and Updating Common Properties

This change sets the query.filterCache.autowarmCount to 1000 items and unsets the query.filterCache.size.

curl http://localhost:8983/solr/techproducts/config -H 'Content-type:application/json' -d '{
  "set-property" : {"query.filterCache.autowarmCount":1000},
  "unset-property" : "query.filterCache.size"}'

Using the /config/overlay endpoint, you can verify the changes with a request like this:

curl http://localhost:8983/solr/techproducts/config/overlay?omitHeader=true

And you should get a response like this:

{
  "overlay":{
    "znodeVersion":1,
    "props":{"query":{"filterCache":{
      "autowarmCount":1000,
      "size":25}}}}}

Creating and Updating Request Handlers

To create a request handler, we can use the add-requesthandler command:

curl http://localhost:8983/solr/techproducts/config -H 'Content-type:application/json' -d '{
  "add-requesthandler" : {
    "name": "/mypath",
    "class": "solr.DumpRequestHandler",
    "defaults": { "x":"y", "a":"b", "wt":"json", "indent":true },
    "useParams": "x"
  }
}'

Make a call to the new request handler to check if it is registered:

curl http://localhost:8983/solr/techproducts/mypath?omitHeader=true

And you should see the following as output:

{
  "params":{
    "indent":"true",
    "a":"b",
    "x":"y",
    "wt":"json"},
  "context":{
    "webapp":"/solr",
    "path":"/mypath",
    "httpMethod":"GET"}}

To update a request handler, you should use the update-requesthandler command:

curl http://localhost:8983/solr/techproducts/config -H 'Content-type:application/json' -d '{
  "update-requesthandler": {
    "name": "/mypath",
    "class": "solr.DumpRequestHandler",
    "defaults": { "x":"new value for X", "wt":"json", "indent":true },
    "useParams": "x"
  }
}'

As another example, we'll create another request handler, this time adding the 'terms' component as part of the definition:

curl http://localhost:8983/solr/techproducts/config -H 'Content-type:application/json' -d '{
  "add-requesthandler": {
    "name": "/myterms",
    "class": "solr.SearchHandler",
    "defaults": { "terms":true, "distrib":false },
    "components": [ "terms" ]
  }
}'

Creating and Updating User-Defined Properties

This command sets a user property.

curl http://localhost:8983/solr/techproducts/config -H 'Content-type:application/json' -d '{
  "set-user-property" : {"variable_name":"some_value"}}'

Again, we can use the /config/overlay endpoint to verify the changes have been made:

curl http://localhost:8983/solr/techproducts/config/overlay?omitHeader=true

And we would expect to see output like this:

{"overlay":{
  "znodeVersion":5,
  "userProps":{
    "variable_name":"some_value"}}
}

To unset the variable, issue a command like this:

curl http://localhost:8983/solr/techproducts/config -H 'Content-type:application/json' -d '{"unset-user-property" : "variable_name"}'

How It Works

Every core watches the ZooKeeper directory for the configset being used with that core. In standalone mode, however, there is no watch (because ZooKeeper is not running). If there are multiple cores in the same node using the same configset, only one ZooKeeper watch is used. For instance, if the configset 'myconf' is used by a core, the node would watch /configs/myconf. Every write operation performed through the API 'touches' the directory (sets an empty byte[] to trigger watches), and all watchers are notified. Every core then checks whether the schema file, solrconfig.xml, or configoverlay.json was modified by comparing the znode versions, and if so, the core is reloaded.

If params.json is modified, the params object is just updated without a core reload.

Note: when the configuration changes, the ZooKeeper watchers are notified, check for changes, and automatically reload the core.

Empty Command

If an empty command is sent to the /config endpoint, the watch is triggered on all cores using this configset.

For example:

curl http://localhost:8983/solr/techproducts/config -H 'Content-type:application/json' -d '{}'

Directly editing any files without 'touching' the directory will not make the changes visible to all nodes.

It is possible for components to watch for the configset 'touch' events by registering a listener using SolrCore#registerConfListener().

Note: an empty command triggers the watch.

Listening to config Changes

Any component can register a listener using SolrCore#addConfListener(Runnable listener) to get notified of config changes. This is not very useful if the modified files result in core reloads (i.e., configoverlay.json or the schema). Components can use this to reload the files they are interested in.

Note: registering a listener.

Request Parameters API

The Request Parameters API allows creating parameter sets that can override or take the place of parameters defined in solrconfig.xml.

In this case, the parameters are stored in a file named params.json. This file is kept in ZooKeeper or in the conf directory of a standalone Solr instance.

The settings stored in params.json are used at query time to override settings defined in solrconfig.xml in some cases, as described below.

When might you want to use this feature?

To avoid frequently editing your solrconfig.xml to update request parameters that change often.

To reuse parameters across various request handlers.

To mix and match parameter sets at request time.

To avoid a reload of your collection for small parameter changes.

The Request Parameters Endpoint

All requests are sent to the /config/params endpoint of the Config API.

Setting Request Parameters

The request to set, unset, or update request parameters is sent as a set of Maps with names. These objects can be directly used in a request or a request handler definition.

The available commands are:

set: Create or overwrite a parameter set map.

unset: Delete a parameter set map.

update: Update a parameter set map. This is equivalent to map.putAll(newMap). Both maps are merged, and if the new map has the same keys as the old, they are overwritten.

You can mix these commands into a single request if necessary.
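The difference between set and update is worth spelling out. A sketch of the merge semantics (update behaves like Java's map.putAll); an illustration, not Solr's implementation:

```python
def set_paramset(store, name, params):
    """'set' creates or completely overwrites the named map."""
    store[name] = dict(params)

def update_paramset(store, name, params):
    """'update' merges: keys in the new map overwrite matching old
    keys, all other existing keys are kept (map.putAll semantics)."""
    merged = dict(store.get(name, {}))
    merged.update(params)
    store[name] = merged
```

So updating myFacets with {"facet.limit": 10} keeps its existing "facet" entry, while a set with the same payload would discard it.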

Each map must include a name so it can be referenced later, either in a direct request to Solr or in a request handler definition.

Note: each parameter set must have a name so it can be referenced.

In the following example, we are setting two sets of parameters named 'myFacets' and 'myQueries'.

curl http://localhost:8983/solr/techproducts/config/params -H 'Content-type:application/json' -d '{
  "set":{
    "myFacets":{
      "facet":"true",
      "facet.limit":5}},
  "set":{
    "myQueries":{
      "defType":"edismax",
      "rows":"5",
      "df":"text_all"}}
}'

In the above example, all the parameters are equivalent to the "defaults" in solrconfig.xml. It is possible to add invariants and appends as follows:

curl http://localhost:8983/solr/techproducts/config/params -H 'Content-type:application/json' -d '{
  "set":{
    "my_handler_params":{
      "facet.limit":5,
      "_invariants_": {
        "facet":true,
        "wt":"json"
      },
      "_appends_":{"facet.field":["field1","field2"]}
    }}
}'

Now it is possible to define a request handler as follows:

<requestHandler name="/my_handler" class="solr.SearchHandler" useParams="my_handler_params"/>

It will be equivalent to a request handler definition as follows:

<requestHandler name="/my_handler" class="solr.SearchHandler">
  <lst name="defaults">
    <int name="facet.limit">5</int>
  </lst>
  <lst name="invariants">
    <bool name="facet">true</bool>
    <str name="wt">json</str>
  </lst>
  <lst name="appends">
    <str name="facet.field">field1</str>
    <str name="facet.field">field2</str>
  </lst>
</requestHandler>

An update example:

curl http://localhost:8983/solr/techproducts/config/params -H 'Content-type:application/json' -d '{
  "update":{
    "myFacets":{
      "facet.limit":10}}
}'

This command will add (or replace) the facet.limit param in the myFacets map, keeping all other existing myFacets params.

To see the parameters that have been set, you can use the /config/params endpoint to read the contents of params.json, or use the name in the request:

curl http://localhost:8983/solr/techproducts/config/params

#Or use the params name

curl http://localhost:8983/solr/techproducts/config/params/myQueries

The useParams Parameter

When making a request, the useParams parameter applies the named parameter sets to the request. This is translated at request time to the actual params.

For example (using the names we set up in the earlier examples; replace them with your own):

http://localhost/solr/techproducts/select?useParams=myQueries

It is possible to pass more than one parameter set in the same request. For example:

http://localhost/solr/techproducts/select?useParams=myFacets,myQueries

In the above example, the param set 'myQueries' is applied on top of 'myFacets'. So, values in 'myQueries' take precedence over values in 'myFacets'. Additionally, any values passed in the request take precedence over 'useParams' params. This acts like the "defaults" specified in the '<requestHandler>' definition in solrconfig.xml.

The parameter sets can be used directly in a request handler definition as follows. Please note that the 'useParams' specified is always applied even if the request contains useParams.

<requestHandler name="/terms" class="solr.SearchHandler" useParams="myQueries">
  <lst name="defaults">
    <bool name="terms">true</bool>
    <bool name="distrib">false</bool>
  </lst>
  <arr name="components">
    <str>terms</str>
  </arr>
</requestHandler>

Note: this shows how to use the defined parameter sets.

To summarize, parameters are applied in this order:

parameters defined in <invariants> in solrconfig.xml.

parameters defined in _invariants_ in params.json, whether specified in the request handler definition or in the request.

parameters defined in the request directly.

parameter sets defined in the request, in the order they have been listed with useParams.

parameter sets defined in params.json that have been defined in the request handler.

parameters defined in <defaults> in solrconfig.xml.
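This precedence order can be sketched as a first-wins merge over the layers listed above, highest precedence first. An illustration of the layering idea, not Solr's implementation; the parameter values are hypothetical:

```python
def resolve(layers):
    """Merge parameter layers; earlier layers win, mirroring the
    precedence order from invariants down to defaults."""
    merged = {}
    for layer in layers:
        for key, value in layer.items():
            # setdefault keeps a value from an earlier (stronger) layer.
            merged.setdefault(key, value)
    return merged
```

For example, an invariant "wt" cannot be overridden by the request, while a request-supplied "rows" beats both useParams sets and defaults.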

Public APIs

Note: accessing request parameters from Java.

The RequestParams object can be accessed using the method SolrConfig#getRequestParams(). Each paramset can be accessed by its name using the method RequestParams#getRequestParams(String name).

Managed Resources

Note: several kinds of resources can be managed this way.

All of the examples in this section assume you are running the "techproducts" Solr example:

bin/solr -e techproducts

Overview

Let's begin learning about managed resources by looking at a couple of examples provided by Solr for managing stop words and synonyms using a REST API. After reading this section, you'll be ready to dig into the details of how managed resources are implemented in Solr so you can start building your own implementation.

Stop words

To begin, you need to define a field type that uses the ManagedStopFilterFactory, such as:

<fieldType name="managed_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.ManagedStopFilterFactory" managed="english" />
  </analyzer>
</fieldType>

There are two important things to notice about this field type definition. First, the filter implementation class is solr.ManagedStopFilterFactory. This is a special implementation of the StopFilterFactory that uses a set of stop words that are managed from a REST API. Second, the managed="english" attribute gives a name to the set of managed stop words, in this case indicating the stop words are for English text.

The REST endpoint for managing the English stop words in the techproducts collection is: /solr/techproducts/schema/analysis/stopwords/english.

The example resource path should be mostly self-explanatory. It should be noted that the ManagedStopFilterFactory implementation determines the /schema/analysis/stopwords part of the path, which makes sense because this is an analysis component defined by the schema. It follows that a field type that uses the following filter:

<filter class="solr.ManagedStopFilterFactory" managed="french" />

would resolve to the path: /solr/techproducts/schema/analysis/stopwords/french.

So now let’s see this API in action, starting with a simple GET request:

curl "http://localhost:8983/solr/techproducts/schema/analysis/stopwords/english"

Assuming you sent this request to Solr, the response body is a JSON document:

{
  "responseHeader":{
    "status":0,
    "QTime":1
  },
  "wordSet":{
    "initArgs":{"ignoreCase":true},
    "initializedOn":"2014-03-28T20:53:53.058Z",
    "managedList":[
      "a",
      "an",
      "and",
      "are",
      ... ]
  }
}

The sample_techproducts_configs config set ships with a pre-built set of managed stop words; however, you should only interact with this file using the API and not edit it directly.

One thing that should stand out to you in this response is that it contains a managedList of words as well as initArgs. This is an important concept in this framework: managed resources typically have configuration and data. For stop words, the only configuration parameter is a boolean that determines whether to ignore the case of tokens during stop word filtering (ignoreCase=true|false). The data is a list of words, which is represented as a JSON array named managedList in the response.
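The interplay between initArgs and the managed word list can be sketched as a toy model of the filtering behavior (case folding when ignoreCase=true); this is an illustration, not Solr's implementation:

```python
class ManagedWordSet:
    """Toy model of a managed stop word list with the ignoreCase init arg."""

    def __init__(self, words, ignore_case=True):
        self.ignore_case = ignore_case
        # Normalize once at load time when case is ignored.
        self.words = {w.lower() for w in words} if ignore_case else set(words)

    def __contains__(self, token):
        # Fold the incoming token the same way before testing membership.
        return (token.lower() if self.ignore_case else token) in self.words
```

With ignoreCase=true, "An" matches the stored "an"; with ignoreCase=false, only exact-case tokens are filtered.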

Now, let’s add a new word to the English stop word list using an HTTP PUT:

curl -X PUT -H 'Content-type:application/json' --data-binary '["foo"]' "http://localhost:8983/solr/techproducts/schema/analysis/stopwords/english"

Here we're using cURL to PUT a JSON list containing a single word "foo" to the managed English stop words set. Solr will return 200 if the request was successful. You can also put multiple words in a single PUT request.

You can test whether a specific word exists by sending a GET request for that word as a child resource of the set, such as:

curl "http://localhost:8983/solr/techproducts/schema/analysis/stopwords/english/foo"

This request will return a status code of 200 if the child resource (foo) exists, or 404 if it does not exist in the managed list.

To delete a stop word, you would do:

curl -X DELETE "http://localhost:8983/solr/techproducts/schema/analysis/stopwords/english/foo"

Note: PUT/POST is used to add terms to an existing list instead of replacing the list entirely. This is because it is more common to add a term to an existing list than to replace the list altogether, so the API favors the more common approach of incrementally adding terms, especially since deleting individual terms is also supported.

Note: CRUD operations on stop words.

Synonyms

For the most part, the API for managing synonyms behaves similarly to the API for stop words, except instead of working with a list of words, it uses a map, where the value for each entry in the map is a set of synonyms for a term. As with stop words, the sample_techproducts_configs config set includes a pre-built set of synonym mappings suitable for the sample data that is activated by the following field type definition in schema.xml:

<fieldType name="managed_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.ManagedStopFilterFactory" managed="english" />
    <filter class="solr.ManagedSynonymFilterFactory" managed="english" />
  </analyzer>
</fieldType>

Note: similar to stop words, but synonyms are stored and used as a map.

To get the map of managed synonyms, send a GET request to:

curl "http://localhost:8983/solr/techproducts/schema/analysis/synonyms/english"

This request will return a response that looks like:

{
  "responseHeader":{
    "status":0,
    "QTime":3},
  "synonymMappings":{
    "initArgs":{
      "ignoreCase":true,
      "format":"solr"},
    "initializedOn":"2014-12-16T22:44:05.33Z",
    "managedMap":{
      "GB":
        ["GiB",
         "Gigabyte"],
      "TV":
        ["Television"],
      "happy":
        ["glad",
         "joyful"]}}}

Managed synonyms are returned under the managedMap property, which contains a JSON map where the value of each entry is a set of synonyms for a term; for example, "happy" has synonyms "glad" and "joyful" in the example above.

To add a new synonym mapping, you can PUT/POST a single mapping such as:

curl -X PUT -H 'Content-type:application/json' --data-binary '{"mad":["angry","upset"]}' "http://localhost:8983/solr/techproducts/schema/analysis/synonyms/english"

The API will return status code 200 if the PUT request was successful. To determine the synonyms for a specific term, send a GET request for the child resource; for example, /schema/analysis/synonyms/english/mad would return ["angry","upset"].

You can also PUT a list of symmetric synonyms, which will be expanded into a mapping for each term in the list. For example, you could PUT the following list of symmetric synonyms using the JSON list syntax instead of a map:

curl -X PUT -H 'Content-type:application/json' --data-binary '["funny", "entertaining", "whimiscal", "jocular"]' "http://localhost:8983/solr/techproducts/schema/analysis/synonyms/english"

Note that the expansion is performed when processing the PUT request, so the underlying persistent state is still a managed map. Consequently, if after sending the previous PUT request you did a GET for /schema/analysis/synonyms/english/jocular, you would receive a list containing ["funny", "entertaining", "whimiscal"]. Once you've created synonym mappings using a list, each term must be managed separately.

Lastly, you can delete a mapping by sending a DELETE request to the managed endpoint.

Note: CRUD operations on synonyms.
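The expansion of a symmetric synonym list into the underlying managed map can be sketched like this; an illustration consistent with the jocular example above, not Solr's code:

```python
def expand_symmetric(terms):
    """Expand a symmetric synonym list into a map where each
    term maps to all the other terms in the list."""
    return {term: [t for t in terms if t != term] for term in terms}
```

After expansion, each term is an independent map entry, which is why subsequent edits must manage each term separately.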

Applying Changes

Changes made to managed resources via this REST API are not applied to the active Solr components until the Solr collection (or Solr core in single-server mode) is reloaded. For example, after adding or deleting a stop word, you must reload the core/collection before the changes become active.

Note: a reload is required for changes to take effect.

This approach is required when running in distributed mode so that we are assured changes are applied to all cores in a collection at the same time, keeping behavior consistent and predictable. It goes without saying that you don't want one of your replicas working with a different set of stop words or synonyms than the others.

One subtle outcome of this apply-changes-at-reload approach is that once you make changes with the API, there is no way to read the active data. In other words, the API returns the most up-to-date data from an API perspective, which could be different from what is currently being used by Solr components. However, the intent of this API implementation is that changes will be applied using a reload within a short time frame after making them, so the time during which the data returned by the API differs from what is active in the server is intended to be negligible.

Note: the proper workflow is to apply changes, reload, and re-index.

RestManager Endpoint

Metadata about registered ManagedResources is available using the /schema/managed and /config/managed endpoints for each collection. Assuming you have the managed_en field type shown above defined in your schema.xml, sending a GET request to the following resource will return metadata about which schema-related resources are being managed by the RestManager:

curl "http://localhost:8983/solr/techproducts/schema/managed"

The response body is a JSON document containing metadata about managed resources under the /schema root:

{
  "responseHeader":{
    "status":0,
    "QTime":3
  },
  "managedResources":[
    {
      "resourceId":"/schema/analysis/stopwords/english",
      "class":"org.apache.solr.rest.schema.analysis.ManagedWordSetResource",
      "numObservers":"1"
    },
    {
      "resourceId":"/schema/analysis/synonyms/english",
      "class":"org.apache.solr.rest.schema.analysis.ManagedSynonymFilterFactory$SynonymManager",
      "numObservers":"1"
    }
  ]
}

You can also create new managed resources using PUT/POST to the appropriate URL, before ever configuring anything that uses these resources.

For example, imagine we want to build up a set of German stop words. Before we can start adding stop words, we need to create the endpoint:

/solr/techproducts/schema/analysis/stopwords/german

To create this endpoint, send the following PUT/POST request to the endpoint we wish to create:

curl -X PUT -H 'Content-type:application/json' --data-binary '{"class":"org.apache.solr.rest.schema.analysis.ManagedWordSetResource"}' "http://localhost:8983/solr/techproducts/schema/analysis/stopwords/german"

Solr will respond with status code 200 if the request is successful. Effectively, this action registers a new endpoint for a managed resource in the RestManager. From here you can start adding German stop words as we saw above:

curl -X PUT -H 'Content-type:application/json' --data-binary '["die"]' "http://localhost:8983/solr/techproducts/schema/analysis/stopwords/german"

For most users, creating resources in this way should never be necessary, since managed resources are created automatically when configured.

However, you may want to explicitly delete managed resources if they are no longer being used by a Solr component.

For instance, the managed resource for German that we created above can be deleted because there are no Solr components using it, whereas the managed resource for English stop words cannot be deleted because there is a token filter declared in schema.xml that is using it.

curl -X DELETE "http://localhost:8983/solr/techproducts/schema/analysis/stopwords/german"

Note: you can create a managed resource and delete it, but a resource referenced in schema.xml cannot be deleted.

Solr Plugins

Solr allows you to load custom code to perform a variety of tasks within Solr, from custom request handlers to process your searches, to custom analyzers and token filters for your text fields. You can even load custom field types. These pieces of custom code are called plugins.

Not everyone will need to create plugins for their Solr instances; what's provided is usually enough for most applications. However, if there's something that you need, you may want to review the Solr Wiki documentation on plugins at SolrPlugins.

If you have a plugin you would like to use, and you are running in SolrCloud mode, you can use the Blob Store API and the Config API to load the jars into Solr. The commands to use are described in the section Adding Custom Plugins in SolrCloud Mode.

Note: Solr lets you define custom plugins if you need them.

Adding Custom Plugins in SolrCloud Mode

When running Solr in SolrCloud mode and you want to use custom code (such as custom analyzers, tokenizers, query parsers, and other plugins), it can be cumbersome to add jars to the classpath on all nodes in your cluster. Using the Blob Store API and special commands with the Config API, you can upload jars to a special system-level collection and dynamically load plugins from them at runtime without needing to restart any nodes.

Note: uploading jars to SolrCloud with commands is more convenient than placing a jar on every node.

This Feature is Disabled By Default

In addition to requiring that Solr be running in SolrCloud mode, this feature is also disabled by default unless all Solr nodes are run with the -Denable.runtime.lib=true option on startup.

Before enabling this feature, users should carefully consider the issues discussed in the Securing Runtime Libraries section below.

Note: this feature is disabled by default and must be enabled with a startup parameter.

Uploading Jar Files

The first step is to use the Blob Store API to upload your jar files. This will put your jars in the .system collection and distribute them across your SolrCloud nodes. These jars are added to a separate classloader and are only accessible to components that are configured with the property runtimeLib=true. These components are loaded lazily because the .system collection may not be loaded when a particular core is loaded.

Note: uploading the jars.

Config API Commands to use Jars as Runtime Libraries

The runtime library feature uses a special set of commands for the Config API to add, update, or remove jar files currently available in the blob store to the list of runtime libraries.

The following commands are used to manage runtime libs:

add-runtimelib

update-runtimelib

delete-runtimelib

curl http://localhost:8983/solr/techproducts/config -H 'Content-type:application/json' -d '{
  "add-runtimelib": { "name":"jarblobname", "version":2 },
  "update-runtimelib": { "name":"jarblobname", "version":3 },
  "delete-runtimelib": "jarblobname"
}'

Note: managing the jars with commands.

The name to use is the name of the blob that you specified when you uploaded your jar to the blob store. You should also include the version of the jar found in the blob store that you want to use. These details are added to configoverlay.json.

The default SolrResourceLoader does not have visibility to the jars that have been defined as runtime libraries. There is a classloader that can access these jars, and it is made available only to those components which are specially annotated.

Every pluggable component can have an optional extra attribute called runtimeLib=true, which means that the component is not loaded at core load time. Instead, it will be loaded on demand. If all the dependent jars are not available when the component is loaded, an error is thrown.

This example shows creating a ValueSourceParser using a jar that has been loaded to the blob store.

curl http://localhost:8983/solr/techproducts/config -H 'Content-type:application/json' -d '{
  "add-valuesourceparser": {
    "name": "nvl",
    "runtimeLib": true,
    "class": "solr.org.apache.solr.search.function.NvlValueSourceParser",
    "nvlFloatValue": 0.0 }
}'

Note: the runtimeLib attribute tells the resource loader to load this component at runtime.

Securing Runtime Libraries

A drawback of this feature is that it could be used to load malicious executable code into the system. However, it is possible to restrict the system to load only trusted jars, using PKI to verify that the executables loaded into the system are trustworthy.

The following steps will allow you to enable security for this feature. The instructions assume you have started all your Solr nodes with -Denable.runtime.lib=true.

Note: a drawback of this approach is that a malicious jar could be loaded into the system; the following describes how to load jars securely.

Step 1: Generate an RSA Private Key

The first step is to generate an RSA private key. The example below uses a 512-bit key, but you should use the strength appropriate to your needs.

$ openssl genrsa -out priv_key.pem 512

Note: generate a private key using the RSA algorithm.

Step 2: Output the Public Key

The public portion of the key should be output in DER format so Java can read it.

$ openssl rsa -in priv_key.pem -pubout -outform DER -out pub_key.der

Note: the public key is exported in a format Java can read.

Step 3: Load the Key to ZooKeeper

The .der files that are output from Step 2 should then be loaded to ZooKeeper under a node /keys/exe so they

are available throughout every node. You can load any number of public keys to that node and all are valid. If a

key is removed from the directory, the signatures of that key will cease to be valid. So, before removing the a

key, make sure to update your runtime library configurations with valid signatures with the update-runtimeli

b command.

At the current time, you can only use the ZooKeeper zkCli.sh (or zkCli.cmd on Windows) script to issue these commands (the Solr version has the same name, but is not the same). If you are running the embedded ZooKeeper that is included with Solr, you do not have this script already; in order to use it, you will need to download a copy of ZooKeeper v3.4.6 from http://zookeeper.apache.org/. Don't worry about configuring the download; you're just trying to get the command-line utility script. When you start the script, you will connect to the embedded ZooKeeper. If you have your own ZooKeeper ensemble running already, you can find the script in $ZK_INSTALL/bin/zkCli.sh (or zkCli.cmd if you are using Windows).

To load the keys, you will need to connect to ZooKeeper with zkCli.sh, create the directories, and then create the key file, as in the following example.

# Connect to ZooKeeper.
# Replace the server location below with the correct ZooKeeper connect string for your installation.
$ .bin/zkCli.sh -server localhost:9983

# After connecting, you will interact with the ZK prompt.
# Create the directories:
[zk: localhost:9983(CONNECTED) 5] create /keys
[zk: localhost:9983(CONNECTED) 5] create /keys/exe

# Now create the public key file in ZooKeeper.
# The second path is the path to the .der file on your local machine:
[zk: localhost:9983(CONNECTED) 5] create /keys/exe/pub_key.der /myLocal/pathTo/pub_key.der

After this, any attempt to load a jar will fail. All your jars must be signed with one of your private keys for Solr to trust it. The process to sign your jars and use the signature is outlined in Steps 4-6.

Use the ZooKeeper command line to create the required path and upload the public key to it.

Step 4: Sign the jar File

Next you need to sign the sha1 digest of your jar file and get the base64 string.

$ openssl dgst -sha1 -sign priv_key.pem myjar.jar | openssl enc -base64

The output of this step will be a string that you will need in Step 6 below, when you add the jar to your classpath.

Generate a signature for your jar and save it.
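Steps 1 and 4 can be rehearsed end to end on any machine with OpenSSL before touching a real deployment; the file names below are placeholders, and a throwaway file stands in for the jar.

```shell
# Rehearse the signing flow locally (file names are placeholders).
# Step 1: generate an RSA private key, as the guide does.
openssl genrsa -out priv_key.pem 512 2>/dev/null

# Any file can stand in for the jar while experimenting.
echo "dummy jar contents" > myjar.jar

# Step 4: sign the SHA-1 digest of the jar and base64-encode the signature.
SIG=$(openssl dgst -sha1 -sign priv_key.pem myjar.jar | openssl enc -base64 | tr -d '\n')
echo "sig: $SIG"

# Sanity check: the public half of the key should verify the raw signature.
openssl dgst -sha1 -sign priv_key.pem myjar.jar > sig.bin
openssl rsa -in priv_key.pem -pubout -out pub_key.pem 2>/dev/null
openssl dgst -sha1 -verify pub_key.pem -signature sig.bin myjar.jar
```

The base64 string printed here is exactly the kind of value Step 6 expects in the `sig` field.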

Step 5: Load the jar to the Blob Store

Load your jar to the Blob store, using the Blob Store API. This step does not require a signature; you will need the signature in Step 6 to add it to your classpath.

curl -X POST -H 'Content-Type: application/octet-stream' --data-binary @{filename} http://localhost:8983/solr/.system/blob/{blobname}

The blob name that you give the jar file in this step will be used as the name in the next step.

Upload the jar to the .system collection; this step is the same as the unsigned case.

Step 6: Add the jar to the Classpath

Finally, add the jar to the classpath using the Config API as detailed above. In this step, you will need to provide the signature of the jar that you got in Step 4.

curl http://localhost:8983/solr/techproducts/config -H 'Content-type:application/json' -d '{
  "add-runtimelib": {
    "name": "blobname",
    "version": 2,
    "sig": "mW1Gwtz2QazjfVdrLFHfbGwcr8xzFYgUOLu68LHqWRDvLG0uLcy1McQ+AzVmeZFBf1yLPDEHBWJb5KXr8bdbHN/PYgUB1nsr9pk4EFyD9KfJ8TqeH/ijQ9waa/vjqyiKEI9U550EtSzruLVZ32wJ7smvV0fj2YYhrUaaPzOn9g0="
  }
}'

Using the signature obtained in Step 4, load the uploaded jar onto the classpath so it can be used.
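Putting Steps 4 and 6 together, the payload can be assembled in the shell from the captured signature; `blobname`, the version number, and the placeholder signature below are all stand-ins for your own values.

```shell
# Build the add-runtimelib payload from a signature captured in Step 4.
# SIG would normally come from:
#   openssl dgst -sha1 -sign priv_key.pem myjar.jar | openssl enc -base64
SIG="PLACEHOLDER_BASE64_SIGNATURE"
PAYLOAD=$(printf '{"add-runtimelib": {"name":"blobname", "version":2, "sig":"%s"}}' "$SIG")
echo "$PAYLOAD"

# With Solr running and the blob already uploaded, Step 6 would then be:
# curl http://localhost:8983/solr/techproducts/config \
#   -H 'Content-type:application/json' -d "$PAYLOAD"
```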

JVM Settings

Configuring your JVM can be a complex topic. A full discussion is beyond the scope of this document. Luckily, most modern JVMs are quite good at making the best use of available resources with default settings. The following sections contain a few tips that may be helpful when the defaults are not optimal for your situation. For more general information about improving Solr performance, see https://wiki.apache.org/solr/SolrPerformanceFactors.

The default JVM settings are already good; adjust them only if you have special requirements.

Choosing Memory Heap Settings

The two JVM heap parameters:

These are -Xms, which sets the initial size of the JVM's memory heap, and -Xmx, which sets the maximum size to which the heap is allowed to grow.

Also consider the effects on JVM garbage collection and I/O performance.

Use the Server HotSpot VM

If you are using Sun's JVM, add the -server command-line option when you start Solr. This tells the JVM that it should optimize for a long running, server process. If the Java runtime on your system is a JRE, rather than a full JDK distribution (including javac and other development tools), then it is possible that it may not support the -server JVM option. Test this by running java -help and look for -server as an available option in the displayed usage message.

Pass the -server option when running Solr.

Checking JVM Settings

A great way to see what JVM settings your server is using, along with other useful information, is to use the admin RequestHandler, solr/admin/system. This request handler will display a wealth of server statistics and settings.

You can also use any of the tools that are compatible with the Java Management Extensions (JMX). See the section Using JMX with Solr in Managing Solr for more information.

How to inspect the JVM settings Solr is running with.

Managing Solr

This section describes how to run Solr and how to look at Solr when it is running. It contains the following sections:

Taking Solr to Production: Describes how to install Solr as a service on Linux for production environments.

Securing Solr: How to use the Basic and Kerberos authentication and rule-based authorization plugins for Solr,

and how to enable SSL.

Running Solr on HDFS: How to use HDFS to store your Solr indexes and transaction logs.

Making and Restoring Backups of SolrCores: Describes backup strategies for your Solr indexes.

Configuring Logging: Describes how to configure logging for Solr.

Using JMX with Solr: Describes how to use Java Management Extensions with Solr.

MBean Request Handler: How to use Solr's MBeans for programmatic access to the system plugins and stats.

As you can see, this part covers quite a lot of ground.

Taking Solr to Production

Using Solr in a production environment.

This section provides guidance on how to setup Solr to run in production on *nix platforms, such as Ubuntu. Specifically, we’ll walk through the process of setting up to run a single Solr instance on a Linux host and then provide tips on how to support multiple Solr nodes running on the same host.

Service Installation Script

Planning your directory structure

Solr Installation Directory

Separate Directory for Writable Files

Create the Solr user

Run the Solr Installation Script

Solr Home Directory

Environment overrides include file

Log settings

init.d script

Progress Check

Fine tune your production setup

Memory and GC Settings

Out-of-Memory Shutdown Hook

SolrCloud

ZooKeeper chroot

Solr Hostname

Override settings in solrconfig.xml

Enable Remote JMX Access

Running multiple Solr nodes per host

Service Installation Script

Solr includes a service installation script (bin/install_solr_service.sh) to help you install Solr as a service on Linux. Currently, the script only supports Red Hat, Ubuntu, Debian, and SUSE Linux distributions.

Before running the script, you need to determine a few parameters about your setup. Specifically, you need to decide where to install Solr and which system user should be the owner of the Solr files and process.

Use bin/install_solr_service.sh to install a Solr instance quickly.

Planning your directory structure

We recommend separating your live Solr files, such as logs and index files, from the files included in the Solr distribution bundle, as that makes it easier to upgrade Solr and is considered a good practice to follow as a system administrator.

Recommendations for laying out the Solr installation directories.

Solr Installation Directory

By default, the service installation script will extract the distribution archive into /opt. You can change this location using the -i option when running the installation script. The script will also create a symbolic link to the versioned directory of Solr. For instance, if you run the installation script for Solr X.0.0, then the following directory structure will be used:

/opt/solr-X.0.0
/opt/solr -> /opt/solr-X.0.0

Using a symbolic link insulates any scripts from being dependent on the specific Solr version. If, down the road, you need to upgrade to a later version of Solr, you can just update the symbolic link to point to the upgraded version of Solr. We’ll use /opt/solr to refer to the Solr installation directory in the remaining sections of this page.

The installer creates a symbolic link by default; when upgrading later, you only need to point the link at the new Solr directory.

Separate Directory for Writable Files

You should also separate writable Solr files into a different directory; by default, the installation script uses /var/solr, but you can override this location using the -d option. With this approach, the files in /opt/solr will remain untouched and all files that change while Solr is running will live under /var/solr.

The default locations for the Solr program files and the writable data files (both can be overridden).

Create the Solr user

Running Solr as root is not recommended for security reasons. Consequently, you should determine the username of a system user that will own all of the Solr files and the running Solr process. By default, the installation script will create the solr user, but you can override this setting using the -u option. If your organization has specific requirements for creating new user accounts, then you should create the user before running the script. The installation script will make the Solr user the owner of the /opt/solr and /var/solr directories.

You are now ready to run the installation script.

For security, Solr should run as a non-root user; by default the script creates a solr user, and this can also be overridden.

Run the Solr Installation Script

To run the script, you'll need to download the latest Solr distribution archive and then do the following (NOTE: replace solr-X.Y.Z with the actual version number):

$ tar xzf solr-X.Y.Z.tgz solr-X.Y.Z/bin/install_solr_service.sh --strip-components=2

The previous command extracts the install_solr_service.sh script from the archive into the current directory. If installing on Red Hat, please make sure lsof is installed before running the Solr installation script (sudo yum install lsof). The installation script must be run as root:

$ sudo bash ./install_solr_service.sh solr-X.Y.Z.tgz

By default, the script extracts the distribution archive into /opt, configures Solr to write files into /var/solr, and runs Solr as the solr user. Consequently, the following command produces the same result as the previous command:

$ sudo bash ./install_solr_service.sh solr-X.Y.Z.tgz -i /opt -d /var/solr -u solr -s solr -p 8983

You can customize the service name, installation directories, port, and owner using options passed to the installation script. To see available options, simply do:

$ sudo bash ./install_solr_service.sh -help

Once the script completes, Solr will be installed as a service and running in the background on your server (on port 8983). To verify, you can do:

$ sudo service solr status

We'll cover some additional configuration settings you can make to fine-tune your Solr setup in a moment. Before moving on, let's take a closer look at the steps performed by the installation script. This gives you a better overview and will help you understand important details about your Solr installation when reading other pages in this guide; such as when a page refers to Solr home, you'll know exactly where that is on your system.

Some example installation commands and status checks; you can also inspect the options and tailor them to your setup.

My own attempt:

solr-6.0.1/bin/install_solr_service.sh solr-6.0.1.zip -i /usr/local/ -d /zyy/solr

This uses the defaults for everything else.

$ sudo bash ./install_solr_service.sh solr-X.Y.Z.tgz -i /opt -d /var/solr -u solr -s solr -p 8983

This is the full command with every default spelled out.

Solr Home Directory

The Solr home directory (not to be confused with the Solr installation directory) is where Solr manages core directories with index files. By default, the installation script uses /var/solr/data. If the -d option is used on the install script, then this will change to the data subdirectory in the location given to the -d option. Take a moment to inspect the contents of the Solr home directory on your system. If you do not store solr.xml in ZooKeeper, the home directory must contain a solr.xml file. When Solr starts up, the Solr start script passes the location of the home directory using the -Dsolr.solr.home system property.

The Solr home is set via the -Dsolr.solr.home system property.

solr.xml must exist, either in ZooKeeper or in the data directory.

Environment overrides include file

The service installation script creates an environment specific include file that overrides defaults used by the bin/solr script. The main advantage of using an include file is that it provides a single location where all of your environment-specific overrides are defined. Take a moment to inspect the contents of the /etc/default/solr.in.sh file, which is the default path setup by the installation script. If you used the -s option on the install script to change the name of the service, then the first part of the filename will be different. For a service named solr-demo, the file will be named /etc/default/solr-demo.in.sh. There are many settings that you can override using this file. However, at a minimum, this script needs to define the SOLR_PID_DIR and SOLR_HOME variables, such as:

SOLR_PID_DIR=/var/solr
SOLR_HOME=/var/solr/data

The SOLR_PID_DIR variable sets the directory where the start script will write out a file containing the Solr server’s process ID.

By default, /etc/default/solr.in.sh is used; this file sets the startup parameters, and the two required variables mentioned above live in this include file. Combined with the init.d script (in my install, /usr/local/solr/bin/init.d/solr), this gives you service-style startup. (Let's see whether the later sections cover this; if not, I'll summarize it myself.)
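As a sketch of the idea above, a minimal include file only needs the two required variables; the paths mirror the installer defaults, and the file name here is just a local example, not the real /etc/default/solr.in.sh.

```shell
# Write a minimal include file with the two required variables.
cat > solr.in.sh.example <<'EOF'
SOLR_PID_DIR=/var/solr
SOLR_HOME=/var/solr/data
# Other overrides commonly live here too, e.g.:
# SOLR_JAVA_MEM="-Xms512m -Xmx512m"
# SOLR_LOGS_DIR=/var/solr/logs
EOF

# bin/solr sources the include file, so plain shell sourcing shows
# exactly the values the start script will see.
. ./solr.in.sh.example
echo "PID dir: $SOLR_PID_DIR"
echo "Solr home: $SOLR_HOME"
```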

Log settings

Solr uses Apache Log4J for logging. The installation script copies /opt/solr/server/resources/log4j.properties to /var/solr/log4j.properties and customizes it for your environment. Specifically it updates the Log4J settings to create logs in the /var/solr/logs directory. Take a moment to verify that the Solr include file is configured to send logs to the correct location by checking the following settings in /etc/default/solr.in.sh:

LOG4J_PROPS=/var/solr/log4j.properties
SOLR_LOGS_DIR=/var/solr/logs

For more information about Log4J configuration, please see: Configuring Logging

How to configure the logging properties file and the log directory.

init.d script

When running a service like Solr on Linux, it’s common to setup an init.d script so that system administrators can control Solr using the service tool, such as: service solr start. The installation script creates a very basic init.d script to help you get started. Take a moment to inspect the /etc/init.d/solr file, which is the default script name setup by the installation script. If you used the -s option on the install script to change the name of the service, then the filename will be different. Notice that the following variables are setup for your environment based on the parameters passed to the installation script:

The /etc/init.d/solr file is the script invoked by "service solr start"; this one is important.

SOLR_INSTALL_DIR=/opt/solr

SOLR_ENV=/etc/default/solr.in.sh

RUNAS=solr

What these three variables mean:

SOLR_INSTALL_DIR: where Solr is installed, used to invoke Solr

SOLR_ENV: the file of environment overrides for Solr

RUNAS: the user that runs the Solr process

The SOLR_INSTALL_DIR and SOLR_ENV variables should be self-explanatory. The RUNAS variable sets the owner of the Solr process, such as solr; if you don’t set this value, the script will run Solr as root, which is not recommended for production. You can use the /etc/init.d/solr script to start Solr by doing the following as root:

# service solr start

The /etc/init.d/solr script also supports the stop, restart, and status commands. Please keep in mind that the init script that ships with Solr is very basic and is intended to show you how to setup Solr as a service. However, it’s also common to use more advanced tools like supervisord or upstart to control Solr as a service on Linux. While showing how to integrate Solr with tools like supervisord is beyond the scope of this guide, the init.d/solr script should provide enough guidance to help you get started. Also, the installation script sets the Solr service to start automatically when the host machine initializes.

Progress Check

In the next section, we cover some additional environment settings to help you fine-tune your production setup. However, before we move on, let's review what we've achieved thus far. Specifically, you should be able to control Solr using /etc/init.d/solr. Please verify the following commands work with your setup:

$ sudo service solr restart
$ sudo service solr status

The status command should give some basic information about the running Solr node that looks similar to:

Solr process PID running on port 8983
{
  "version":"5.0.0 - ubuntu - 2014-12-17 19:36:58",
  "startTime":"2014-12-19T19:25:46.853Z",
  "uptime":"0 days, 0 hours, 0 minutes, 8 seconds",
  "memory":"85.4 MB (%17.4) of 490.7 MB"}

If the status command is not successful, look for error messages in /var/solr/logs/solr.log.

Fine tune your production setup

Memory and GC Settings

By default, the bin/solr script sets the maximum Java heap size to 512M (-Xmx512m), which is fine for getting started with Solr. For production, you’ll want to increase the maximum heap size based on the memory requirements of your search application; values between 10 and 20 gigabytes are not uncommon for production servers. When you need to change the memory settings for your Solr server, use the SOLR_JAVA_MEM variable in the include file, such as:

SOLR_JAVA_MEM="-Xms10g -Xmx10g"

Also, the include file comes with a set of pre-configured Java Garbage Collection settings that have shown to work well with Solr for a number of different workloads. However, these settings may not work well for your specific use of Solr. Consequently, you may need to change the GC settings, which should also be done with the GC_TUNE variable in the /etc/default/solr.in.sh include file. For more information about tuning your memory and garbage collection settings, see: JVM Settings.

Memory and GC parameter settings.

Out-of-Memory Shutdown Hook

The bin/solr script registers the bin/oom_solr.sh script to be called by the JVM if an OutOfMemoryError occurs. The oom_solr.sh script will issue a kill -9 to the Solr process that experiences the OutOfMemoryError. This behavior is recommended when running in SolrCloud mode so that ZooKeeper is immediately notified that a node has experienced a non-recoverable error. Take a moment to inspect the contents of the /opt/solr/bin/oom_solr.sh script so that you are familiar with the actions the script will perform if it is invoked by the JVM.

When an OutOfMemoryError occurs, Solr invokes oom_solr.sh to kill the current process.
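The hook's behavior can be mimicked in a few lines of shell. This is a simplified stand-in for bin/oom_solr.sh (the real script receives the port and logs directory from the JVM hook and does more bookkeeping), demonstrated against a throwaway process rather than a real Solr node.

```shell
# Simplified stand-in for bin/oom_solr.sh: hard-kill a process and log it.
oom_kill() {
  pid=$1
  echo "$(date) - OutOfMemoryError hook fired, killing PID $pid" >> oom_killer.log
  kill -9 "$pid" 2>/dev/null
}

# Demo against a throwaway process instead of a real Solr node.
sleep 60 &
victim=$!
oom_kill "$victim"
wait "$victim" 2>/dev/null
if kill -0 "$victim" 2>/dev/null; then echo "still alive"; else echo "process killed"; fi
```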

SolrCloud

To run Solr in SolrCloud mode, you need to set the ZK_HOST variable in the include file to point to your ZooKeeper ensemble. Running the embedded ZooKeeper is not supported in production environments. For instance, if you have a ZooKeeper ensemble hosted on the following three hosts on the default client port 2181 (zk1, zk2, and zk3), then you would set:

ZK_HOST=zk1,zk2,zk3

When the ZK_HOST variable is set, Solr will launch in "cloud" mode.

To use SolrCloud mode, set the ZK_HOST variable; once it is set, Solr starts in SolrCloud mode automatically.

ZooKeeper chroot

If you're using a ZooKeeper instance that is shared by other systems, it's recommended to isolate the SolrCloud znode tree using ZooKeeper's chroot support. For instance, to ensure all znodes created by SolrCloud are stored under /solr, you can put /solr on the end of your ZK_HOST connection string, such as:

ZK_HOST=zk1,zk2,zk3/solr

Before using a chroot for the first time, you need to create the root path (znode) in ZooKeeper by using the zkcli.sh script. We can use the makepath command for that:

$ server/scripts/cloud-scripts/zkcli.sh -zkhost zk1,zk2,zk3 -cmd makepath /solr

If you also want to bootstrap ZooKeeper with existing solr_home, you can instead use zkcli.sh / zkcli.bat's bootstrap command, which will also create the chroot path if it does not exist. See Command Line Utilities for more info.

If you share a ZooKeeper ensemble with other systems, it is recommended to change SolrCloud's root znode as shown above; you need to create that path yourself.

Solr Hostname

Use the SOLR_HOST variable in the include file to set the hostname of the Solr server.

SOLR_HOST=solr1.example.com

Setting the hostname of the Solr server is recommended, especially when running in SolrCloud mode, as this determines the address of the node when it registers with ZooKeeper.

In SolrCloud mode, setting the SOLR_HOST parameter is recommended.

Override settings in solrconfig.xml

Solr allows configuration properties to be overridden using Java system properties passed at startup using the -Dproperty=value syntax. For instance, in solrconfig.xml, the default auto soft commit settings are set to:

${solr.autoSoftCommit.maxTime:-1}

In general, whenever you see a property in a Solr configuration file that uses the ${solr.PROPERTY:DEFAULT_VALUE} syntax, then you know it can be overridden using a Java system property. For instance, to set the maxTime for soft-commits to be 10 seconds, then you can start Solr with -Dsolr.autoSoftCommit.maxTime=10000, such as:

$ bin/solr start -Dsolr.autoSoftCommit.maxTime=10000

The bin/solr script simply passes options starting with -D on to the JVM during startup. For running in production, we recommend setting these properties in the SOLR_OPTS variable defined in the include file. Keeping with our soft-commit example, in /etc/default/solr.in.sh, you would do:

SOLR_OPTS="$SOLR_OPTS -Dsolr.autoSoftCommit.maxTime=10000"

An example of how to set a system property in the include file: adding SOLR_OPTS="$SOLR_OPTS -Dsolr.autoSoftCommit.maxTime=10000" is all that's needed.

Enable Remote JMX Access

If you need to attach a JMX-enabled Java profiling tool, such as JConsole or VisualVM, to a remote Solr server, then you need to enable remote JMX access when starting the Solr server. Simply change the ENABLE_REMOTE_JMX_OPTS property in the include file to true. You’ll also need to choose a port for the JMX RMI connector to bind to, such as 18983. For example, if your Solr include script sets:

An example configuration:

ENABLE_REMOTE_JMX_OPTS=true

RMI_PORT=18983

The JMX RMI connector will allow Java profiling tools to attach to port 18983. When enabled, the following properties are passed to the JVM when starting Solr:

-Dcom.sun.management.jmxremote \
-Dcom.sun.management.jmxremote.local.only=false \
-Dcom.sun.management.jmxremote.ssl=false \
-Dcom.sun.management.jmxremote.authenticate=false \
-Dcom.sun.management.jmxremote.port=18983 \
-Dcom.sun.management.jmxremote.rmi.port=18983

We don’t recommend enabling remote JMX access in production, but it can sometimes be useful when doing performance and user-acceptance testing prior to going into production.

How to enable remote JMX access (not recommended for production).

Running multiple Solr nodes per host

The bin/solr script is capable of running multiple instances on one machine, but for a typical installation, this is not a recommended setup. Extra CPU and memory resources are required for each additional instance. A single instance is easily capable of handling multiple indexes.

When to ignore the recommendation

For every recommendation, there are exceptions, particularly when discussing extreme scalability. The best reason for running multiple Solr nodes on one host is decreasing the need for extremely large heaps.

When the Java heap gets very large, it can result in extremely long garbage collection pauses, even with the GC tuning that the startup script provides by default. The exact point at which the heap is considered "very large" will vary depending on how Solr is used. This means that there is no hard number that can be given as a threshold, but if your heap is reaching the neighborhood of 16 to 32 gigabytes, it might be time to consider splitting nodes. Ideally this would mean more machines, but budget constraints might make that impossible.

There is another issue once the heap reaches 32GB. Below 32GB, Java is able to use compressed pointers, but above that point, larger pointers are required, which uses more memory and slows down the JVM.

Because of the potential garbage collection issues and the particular issues that happen at 32GB, if a single instance would require a 64GB heap, performance is likely to improve greatly if the machine is set up with two nodes that each have a 31GB heap.

Running multiple Solr instances on one machine is not recommended, mainly because of Java garbage collection behavior.

If your use case requires multiple instances, at a minimum you will need unique Solr home directories for each node you want to run; ideally, each home should be on a different physical disk so that multiple Solr nodes don’t have to compete with each other when accessing files on disk. Having different Solr home directories implies that you’ll need a different include file for each node. Moreover, if using the /etc/init.d/solr script to control Solr as a service, then you’ll need a separate script for each node. The easiest approach is to use the service installation script to add multiple services on the same host, such as:

$ sudo bash ./install_solr_service.sh solr-X.Y.Z.tgz -s solr2 -p 8984

The command shown above will add a service named solr2 running on port 8984 using /var/solr2 for writable (aka "live") files; the second server will still be owned and run by the solr user and will use the Solr distribution files in /opt. After installing the solr2 service, verify it works correctly by doing:

$ sudo service solr2 restart
$ sudo service solr2 status

If you really need to run multiple instances, handle it with a command like the one above.

This effectively installs a second copy, so we could write a script to automate the deployment.

Once I finish studying the Linux material, I'll write automated deployment scripts for both the cluster and standalone setups.
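A rough sketch of that automation idea: one wrapper that installs N Solr services on a single host by calling install_solr_service.sh with a distinct service name, port, and data directory per node. The DRY_RUN switch and the node count are my own conventions for this sketch, not part of Solr.

```shell
# Install N Solr services on one host (dry-run sketch).
ARCHIVE=solr-6.0.1.tgz
NODES=2
DRY_RUN=1   # set to 0 to actually run the install commands

i=1
while [ "$i" -le "$NODES" ]; do
  name="solr$i"
  port=$((8982 + i))          # 8983, 8984, ...
  datadir="/var/$name"
  cmd="sudo bash ./install_solr_service.sh $ARCHIVE -s $name -p $port -d $datadir"
  if [ "$DRY_RUN" = "1" ]; then
    echo "$cmd"               # just show what would be run
  else
    $cmd
  fi
  i=$((i + 1))
done
```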

Securing Solr

When planning how to secure Solr, you should consider which of the available features or approaches are right for you.

Authentication or authorization of users using:

Kerberos Authentication Plugin

Basic Authentication Plugin

Rule-Based Authorization Plugin

Custom authentication or authorization plugin

Enabling SSL

If using SolrCloud, ZooKeeper Access Control

This part covers Solr security: user authentication and authorization.

I've written about this before; here I'll only look at access control for the standalone version.

Kerberos Authentication Plugin

Kerberos still requires a dedicated server to act as a ticket-granting center.

SolrCloud

Apache Solr includes the ability to set up a cluster of Solr servers that combines fault tolerance and high availability. Called SolrCloud, these capabilities provide distributed indexing and search capabilities, supporting the following features:

Central configuration for the entire cluster

Automatic load balancing and fail-over for queries

ZooKeeper integration for cluster coordination and configuration.

In this section, we'll cover everything you need to know about using Solr in SolrCloud mode. We've split up the details into the following topics:

Getting Started with SolrCloud

How SolrCloud Works

Shards and Indexing Data in SolrCloud

Distributed Requests

Read and Write Side Fault Tolerance

SolrCloud Configuration and Parameters

Setting Up an External ZooKeeper Ensemble

Using ZooKeeper to Manage Configuration Files

ZooKeeper Access Control

Collections API

Parameter Reference

Command Line Utilities

SolrCloud with Legacy Configuration Files

ConfigSets API

Rule-based Replica Placement

Cross Data Center Replication (CDCR)

Getting Started with SolrCloud

SolrCloud is designed to provide a highly available, fault tolerant environment for distributing your indexed content and query requests across multiple servers. It's a system in which data is organized into multiple pieces, or shards, that can be hosted on multiple machines, with replicas providing redundancy for both scalability and fault tolerance, and a ZooKeeper server that helps manage the overall structure so that both indexing and search requests can be routed properly.

This section explains SolrCloud and its inner workings in detail, but before you dive in, it's best to have an idea of what it is you're trying to accomplish. This page provides a simple tutorial to start Solr in SolrCloud mode, so you can begin to get a sense for how shards interact with each other during indexing and when serving queries. To that end, we'll use simple examples of configuring SolrCloud on a single machine, which is obviously not a real production environment; a real production environment would include several servers or virtual machines, and you would use the real machine names instead of the "localhost" we've used here.

In this section you will learn how to start a SolrCloud cluster using startup scripts and a specific configset.

SolrCloud Example

Interactive Startup

The bin/solr script makes it easy to get started with SolrCloud as it walks you through the process of launching Solr nodes in cloud mode and adding a collection. To get started, simply do:

$ bin/solr -e cloud

This starts an interactive session to walk you through the steps of setting up a simple SolrCloud cluster with embedded ZooKeeper. The script starts by asking you how many Solr nodes you want to run in your local cluster, with the default being 2.

Welcome to the SolrCloud example!

This interactive session will help you launch a SolrCloud cluster on your local workstation.

To begin, how many Solr nodes would you like to run in your local cluster? (specify 1-4 nodes) [2]

The script supports starting up to 4 nodes, but we recommend using the default of 2 when starting out. These nodes will each exist on a single machine, but will use different ports to mimic operation on different servers.

Next, the script will prompt you for the port to bind each of the Solr nodes to, such as:

Please enter the port for node1 [8983]

Choose any available port for each node; the default for the first node is 8983 and 7574 for the second node. The script will start each node in order and shows you the command it uses to start the server, such as:

solr start -cloud -s example/cloud/node1/solr -p 8983

The first node will also start an embedded ZooKeeper server bound to port 9983. The Solr home for the first node is in example/cloud/node1/solr as indicated by the -s option.

After starting up all nodes in the cluster, the script prompts you for the name of the collection to create:

Please provide a name for your new collection: [gettingstarted]

The suggested default is "gettingstarted" but you might want to choose a name more appropriate for your specific search application.

Next, the script prompts you for the number of shards to distribute the collection across. Sharding is covered in more detail later on, so if you're unsure, we suggest using the default of 2 so that you can see how a collection is distributed across multiple nodes in a SolrCloud cluster.

Next, the script will prompt you for the number of replicas to create for each shard. Replication is covered in more detail later in the guide, so if you're unsure, then use the default of 2 so that you can see how replication is handled in SolrCloud.

Lastly, the script will prompt you for the name of a configuration directory for your collection. You can choose basic_configs, data_driven_schema_configs, or sample_techproducts_configs. The configuration directories are pulled from server/solr/configsets/ so you can review them beforehand if you wish. The data_driven_schema_configs configuration (the default) is useful when you're still designing a schema for your documents and need some flexibility as you experiment with Solr.

At this point, you should have a new collection created in your local SolrCloud cluster. To verify this, you can run the status command:

$ bin/solr status

If you encounter any errors during this process, check the Solr log files in example/cloud/node1/logs and example/cloud/node2/logs.

You can see how your collection is deployed across the cluster by visiting the cloud panel in the Solr Admin UI: http://localhost:8983/solr/#/~cloud. Solr also provides a way to perform basic diagnostics for a collection using the healthcheck command:

$ bin/solr healthcheck -c gettingstarted

The healthcheck command gathers basic information about each replica in a collection, such as number of docs, current status (active, down, etc), and address (where the replica lives in the cluster).

Documents can now be added to SolrCloud using the Post Tool.

To stop Solr in SolrCloud mode, you would use the bin/solr script and issue the stop command, as in:

$ bin/solr stop -all

How to start an example in SolrCloud mode.

Starting with -noprompt

Start SolrCloud with all the default values.

You can also get SolrCloud started with all the defaults instead of the interactive session using the following command:

$ bin/solr -e cloud -noprompt

Restarting Nodes

Restart the corresponding cluster nodes.

You can restart your SolrCloud nodes using the bin/solr script. For instance, to restart node1 running on port 8983 (with an embedded ZooKeeper server), you would do:

$ bin/solr restart -c -p 8983 -s example/cloud/node1/solr

To restart node2 running on port 7574, you can do:

$ bin/solr restart -c -p 7574 -z localhost:9983 -s example/cloud/node2/solr

Notice that you need to specify the ZooKeeper address (-z localhost:9983) when starting node2 so that it can join the cluster with node1.

Adding a node to a cluster

Add a new node to the SolrCloud cluster.

Adding a node to an existing cluster is a bit advanced and involves a little more understanding of Solr. Once you startup a SolrCloud cluster using the startup scripts, you can add a new node to it by:

$ mkdir <solr home for new node>
$ cp <existing solr.xml path> <new solr home>
$ bin/solr start -cloud -s <new solr home> -p <port> -z <zk hosts>

Notice that the above requires you to create a Solr home directory. You either need to copy solr.xml to the solr_home directory, or keep it centrally in ZooKeeper /solr.xml.

Example (with directory structure) that adds a node to an example started with "bin/solr -e cloud":

$ mkdir -p example/cloud/node3/solr
$ cp server/solr/solr.xml example/cloud/node3/solr
$ bin/solr start -cloud -s example/cloud/node3/solr -p 8987 -z localhost:9983

The previous command will start another Solr node on port 8987 with Solr home set to example/cloud/node3/solr. The new node will write its log files to example/cloud/node3/logs.

Once you're comfortable with how the SolrCloud example works, we recommend using the process described in

Taking Solr to Production for setting up SolrCloud nodes in production.

How SolrCloud Works

The following sections provide general information about how various SolrCloud features work. To

understand these features, it's important to first understand a few key concepts that relate to SolrCloud.

Shards and Indexing Data in SolrCloud

Distributed Requests

Read and Write Side Fault Tolerance

If you are already familiar with SolrCloud concepts and basic functionality, you can skip to the section covering SolrCloud Configuration and Parameters.

Key SolrCloud Concepts

A SolrCloud cluster consists of some "logical" concepts layered on top of some "physical" concepts.

Logical

A Cluster can host multiple Collections of Solr Documents.

A collection can be partitioned into multiple Shards, which contain a subset of the Documents in the Collection.

The number of Shards that a Collection has determines:

The theoretical limit to the number of Documents that Collection can reasonably contain.

The amount of parallelization that is possible for an individual search request.

Physical

A Cluster is made up of one or more Solr Nodes, which are running instances of the Solr server process.

Each Node can host multiple Cores.

Each Core in a Cluster is a physical Replica for a logical Shard.

Every Replica uses the same configuration specified for the Collection that it is a part of.

The number of Replicas that each Shard has determines:

The level of redundancy built into the Collection and how fault tolerant the Cluster can be in the event that some Nodes become unavailable.

The theoretical limit on the number of concurrent search requests that can be processed under heavy load.

Shards and Indexing Data in SolrCloud

When your data is too large for one node, you can break it up and store it in sections by creating one or more shards. Each is a portion of the logical index, or core, and it's the set of all nodes containing that section of the index.

A shard is a way of splitting a core over a number of "servers", or nodes. For example, you might have a shard

for data that represents each state, or different categories that are likely to be searched independently, but are

often combined.

Before SolrCloud, Solr supported Distributed Search, which allowed one query to be executed across multiple

shards, so the query was executed against the entire Solr index and no documents would be missed from the

search results. So splitting the core across shards is not exclusively a SolrCloud concept. There were, however,

several problems with the distributed approach that necessitated improvement with SolrCloud:

Splitting of the core into shards was somewhat manual.

There was no support for distributed indexing, which meant that you needed to explicitly send documents

to a specific shard; Solr couldn't figure out on its own what shards to send documents to.

There was no load balancing or failover, so if you got a high number of queries, you needed to figure out

where to send them and if one shard died it was just gone.

SolrCloud fixes all those problems. There is support for distributing both the index process and the queries

automatically, and ZooKeeper provides failover and load balancing. Additionally, every shard can also have

multiple replicas for additional robustness.

In SolrCloud there are no masters or slaves. Instead, there are leaders and replicas. Leaders are automatically

elected, initially on a first-come-first-served basis, and then based on the ZooKeeper process described at http://zookeeper.apache.org/doc/trunk/recipes.html#sc_leaderElection.

If a leader goes down, one of its replicas is automatically elected as the new leader. As each node is started, it's

assigned to the shard with the fewest replicas. When there's a tie, it's assigned to the shard with the lowest shard

ID.
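The node-assignment rule above (fewest replicas, ties broken by lowest shard ID) can be sketched in a few lines of Python. This is a simplified model of the described policy, not Solr's actual implementation, and the function name is illustrative:

```python
def assign_shard(replica_counts):
    """Pick the shard a newly started node should join: the shard with the
    fewest replicas, ties broken by the lowest shard ID."""
    # Sorting the shard IDs first makes min() return the lowest ID on a tie.
    return min(sorted(replica_counts), key=lambda shard: replica_counts[shard])
```

For example, `assign_shard({"shard1": 2, "shard2": 1})` picks "shard2", while a tie between shard1 and shard2 resolves to "shard1".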

When a document is sent to a machine for indexing, the system first determines whether the machine is a replica or a leader.

If the machine is a replica, the document is forwarded to the leader for processing.

If the machine is a leader, SolrCloud determines which shard the document should go to, forwards the document to the leader for that shard, indexes the document for this shard, and forwards the index notation to itself and any replicas.

Why shards and replicas are used, and how they work.

Document Routing

Solr offers the ability to specify the router implementation used by a collection by specifying the router.name parameter when creating your collection. If you use the "compositeId" router, you can send documents with a

prefix in the document ID which will be used to calculate the hash Solr uses to determine the shard a document

is sent to for indexing. The prefix can be anything you'd like it to be (it doesn't have to be the shard name, for

example), but it must be consistent so Solr behaves consistently. For example, if you wanted to co-locate

documents for a customer, you could use the customer name or ID as the prefix. If your customer is "IBM", for

example, with a document with the ID "12345", you would insert the prefix into the document id field:

"IBM!12345". The exclamation mark ('!') is critical here, as it distinguishes the prefix used to determine which

shard to direct the document to.
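The idea can be illustrated with a toy routing function in Python: everything before the '!' is hashed to pick the shard, so all documents sharing a prefix land together. Note this is only a sketch; crc32 stands in for Solr's actual MurmurHash-based compositeId hashing:

```python
import zlib

def shard_for(doc_id, num_shards):
    """Toy compositeId-style routing: hash the routing prefix (the part
    before '!') if present, otherwise the whole id. crc32 stands in for
    Solr's real MurmurHash-based hashing; this is only a sketch."""
    route_key = doc_id.split("!", 1)[0] if "!" in doc_id else doc_id
    return zlib.crc32(route_key.encode("utf-8")) % num_shards
```

Because only "IBM" is hashed, "IBM!12345" and "IBM!99999" always map to the same shard.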

Then at query time, you include the prefix(es) in your query with the _route_ parameter (i.e., q=solr&_route_=IBM!) to direct queries to specific shards. In some situations, this may improve query performance because it overcomes network latency when querying all the shards.

The compositeId router supports prefixes containing up to 2 levels of routing. For example: a prefix routing

first by region, then by customer: "USA!IBM!12345"

Another use case could be if the customer "IBM" has a lot of documents and you want to spread it across

multiple shards. The syntax for such a use case would be : "shard_key/num!document_id" where the /num is the

number of bits from the shard key to use in the composite hash.

So "IBM/3!12345" will take 3 bits from the shard key and 29 bits from the unique doc id, spreading the tenant

over 1/8th of the shards in the collection. Likewise if the num value was 2 it would spread the documents across

1/4th the number of shards. At query time, you include the prefix(es) along with the number of bits into your

query with the _route_ parameter (i.e., q=solr&_route_=IBM/3!) to direct queries to specific shards.
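The /num bit-splitting can be modelled as combining the top `num` bits of the shard-key hash with the remaining bits of the document-id hash. This is a simplified 32-bit model for illustration; Solr's real implementation differs in detail:

```python
def composite_hash(shard_key_hash, doc_id_hash, bits):
    """Combine the top `bits` bits of the shard-key hash with the remaining
    bits of the doc-id hash into one 32-bit routing hash. With bits=3 a
    tenant is confined to 1/8th of the hash ring, with bits=2 to 1/4th."""
    high_mask = ((1 << bits) - 1) << (32 - bits)
    return (shard_key_hash & high_mask) | (doc_id_hash & ~high_mask & 0xFFFFFFFF)
```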

If you do not want to influence how documents are stored, you don't need to specify a prefix in your document ID.

If you created the collection and defined the "implicit" router at the time of creation, you can additionally define a

router.field parameter to use a field from each document to identify a shard where the document belongs. If

the field specified is missing in the document, however, the document will be rejected. You could also use the _route_ parameter to name a specific shard.
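The router.field behaviour described here, including rejection of documents missing the field, can be sketched as follows (a hypothetical helper for illustration, not a Solr API):

```python
def route_by_field(doc, route_field, shard_names):
    """Sketch of the 'implicit' router with router.field: the value of
    route_field names the target shard directly; a document missing the
    field is rejected, as described above."""
    value = doc.get(route_field)
    if value is None:
        raise ValueError("document rejected: missing routing field %r" % route_field)
    if value not in shard_names:
        raise ValueError("unknown shard %r" % value)
    return value
```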

Document routing rules when sharding, and how the routing parameters are used.

Shard Splitting

When you create a collection in SolrCloud, you decide on the initial number of shards to be used. But it can be

difficult to know in advance the number of shards that you need, particularly when organizational requirements

can change at a moment's notice, and the cost of finding out later that you chose wrong can be high, involving

creating new cores and re-indexing all of your data.

The ability to split shards is in the Collections API. It currently allows splitting a shard into two pieces. The

existing shard is left as-is, so the split action effectively makes two copies of the data as new shards. You can

delete the old shard at a later time when you're ready.

More details on how to use shard splitting is in the section on theCollections API.

Shard splitting lets you grow the number of shards later, solving the problem that the shard count is fixed when the collection is created.

Ignoring Commits from Client Applications in SolrCloud

In most cases, when running in SolrCloud mode, indexing client applications should not send explicit commit

requests. Rather, you should configure auto commits with openSearcher=false and auto soft-commits to

make recent updates visible in search requests. This ensures that auto commits occur on a regular schedule in

the cluster. To enforce a policy where client applications should not send explicit commits, you should update all

client applications that index data into SolrCloud. However, that is not always feasible, so Solr provides

the IgnoreCommitOptimizeUpdateProcessorFactory, which allows you to ignore explicit commits and/or optimize requests from client applications without having to refactor your client application code. To activate this request

processor you'll need to add the following to your solrconfig.xml:

For distributed SolrCloud you should avoid explicit client commits and rely on automatic commits plus soft commits instead. Since clients cannot always be trusted to do this, you can configure solrconfig.xml as follows to ignore commit and optimize requests:

<updateRequestProcessorChain name="ignore-commit-from-client" default="true">
  <processor class="solr.IgnoreCommitOptimizeUpdateProcessorFactory">
    <int name="statusCode">200</int>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.DistributedUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

As shown in the example above, the processor will return 200 to the client but will ignore the commit / optimize request. Notice that you need to wire in the implicit processors needed by SolrCloud as well, since this custom chain is taking the place of the default chain.

In the following example, the processor will raise an exception with a 403 code with a customized error message:

You can instead raise an exception when a client sends a commit or optimize command:

<updateRequestProcessorChain name="ignore-commit-from-client" default="true">
  <processor class="solr.IgnoreCommitOptimizeUpdateProcessorFactory">
    <int name="statusCode">403</int>
    <str name="responseMessage">Thou shall not issue a commit!</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.DistributedUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

Lastly, you can also configure it to just ignore optimize and let commits pass thru by doing:

It can also be configured to ignore only optimize operations while letting commit commands through.

<updateRequestProcessorChain name="ignore-optimize-only-from-client-403">
  <processor class="solr.IgnoreCommitOptimizeUpdateProcessorFactory">
    <str name="responseMessage">Thou shall not issue an optimize, but commits are OK!</str>
    <bool name="ignoreOptimizeOnly">true</bool>
  </processor>
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

Distributed Requests

When a Solr node receives a search request, that request is routed behind the scenes to a replica of some

shard that is part of the collection being searched. The chosen replica will act as an aggregator: creating internal

requests to randomly chosen replicas of every shard in the collection, coordinating the responses, issuing any

subsequent internal requests as needed (For example, to refine facets values, or request additional stored fields)

and constructing the final response for the client.

Limiting Which Shards are Queried

Querying only specific shards.

One of the advantages of using SolrCloud is the ability to have very large collections distributed among various shards – but in some cases you may know that you are only interested in results from a subset of your shards. You

have the option of searching over all of your data or just parts of it.

Querying all shards for a collection should look familiar; it's as though SolrCloud didn't even come into play:

http://localhost:8983/solr/gettingstarted/select?q=*:*

If, on the other hand, you wanted to search just one shard, you can specify that shard by its logical ID, as in:

http://localhost:8983/solr/gettingstarted/select?q=*:*&shards=shard1

If you want to search a group of shard Ids, you can specify them together:

http://localhost:8983/solr/gettingstarted/select?q=*:*&shards=shard1,shard2

In both of the above examples, the shard Id(s) will be used to pick a random replica of that shard.

Alternatively, you can specify the explicit replicas you wish to use in place of shard Ids:

http://localhost:8983/solr/gettingstarted/select?q=*:*&shards=localhost:7574/solr/gettingstarted,localhost:8983/solr/gettingstarted

Or you can specify a list of replicas to choose from for a single shard (for load balancing purposes) by using the

pipe symbol (|):

http://localhost:8983/solr/gettingstarted/select?q=*:*&shards=localhost:7574/solr/gettingstarted|localhost:7500/solr/gettingstarted

And of course, you can specify a list of shards (separated by commas) each defined by a list of replicas (separated by pipes). In this example, 2 shards are queried, the first being a random replica from shard1, the

second being a random replica from the explicit pipe delimited list:

http://localhost:8983/solr/gettingstarted/select?q=*:*&shards=shard1,localhost:7574/solr/gettingstarted|localhost:7500/solr/gettingstarted
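The grammar of the shards parameter (commas between shards, pipes between alternative replicas of one shard) is easy to model client-side. The following is illustrative code, not Solr's internal load balancer:

```python
import random

def pick_endpoints(shards_param):
    """Parse a 'shards' value: commas separate shards, '|' separates
    alternative replicas of one shard; pick one replica per shard."""
    return [random.choice(group.split("|")) for group in shards_param.split(",")]
```

For "shard1,host1|host2" this yields a two-element list: "shard1" followed by one of the two named replicas.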

Configuring the ShardHandlerFactory

You can directly configure aspects of the concurrency and thread-pooling used within distributed search in Solr.

This allows for finer grained control and you can tune it to target your own specific requirements. The default

configuration favors throughput over latency.

To configure the standard handler, provide a configuration like this in the solrconfig.xml:

<requestHandler name="standard" class="solr.SearchHandler" default="true">
  <!-- other params go here -->
  <shardHandler class="HttpShardHandlerFactory">
    <int name="socketTimeOut">1000</int>
    <int name="connTimeOut">5000</int>
  </shardHandler>
</requestHandler>

Shard handler configuration and the relevant parameter settings.

Configuring statsCache (Distributed IDF)

Document and term statistics are needed in order to calculate relevancy. Solr provides four implementations out

of the box when it comes to document stats calculation:

LocalStatsCache: This only uses local term and document statistics to compute relevance. In cases

with uniform term distribution across shards, this works reasonably well.

This option is the default if no <statsCache> is configured.

ExactStatsCache: This implementation uses global values (across the collection) for document

frequency.

ExactSharedStatsCache: This is exactly like the exact stats cache in its functionality but the global

stats are reused for subsequent requests with the same terms.

LRUStatsCache: This implementation uses an LRU cache to hold global stats, which are shared

between requests.

The implementation can be selected by setting <statsCache> in solrconfig.xml. For example, the following line makes Solr use the ExactStatsCache implementation:

<statsCache class="org.apache.solr.search.stats.ExactStatsCache"/>

Document and term-frequency statistics and their implementations.

Avoiding Distributed Deadlock

Each shard serves top-level query requests and then makes sub-requests to all of the other shards. Care should

be taken to ensure that the max number of threads serving HTTP requests is greater than the possible number

of requests from both top-level clients and other shards. If this is not the case, the configuration may result in a

distributed deadlock.

For example, a deadlock might occur in the case of two shards, each with just a single thread to service HTTP

requests. Both threads could receive a top-level request concurrently, and make sub-requests to each other.

Because there are no more remaining threads to service requests, the incoming requests will be blocked until the

other pending requests are finished, but they will not finish since they are waiting for the sub-requests. By

ensuring that Solr is configured to handle a sufficient number of threads, you can avoid deadlock situations like

this.
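As a rough back-of-the-envelope check (a heuristic motivated by the discussion above, not a formula from the reference guide): each concurrent top-level request may hold one thread plus one sub-request thread per shard, so the servlet container's thread pool should comfortably exceed:

```python
def min_request_threads(num_shards, concurrent_top_level):
    """Heuristic lower bound on threads needed to avoid the deadlock
    described above: each top-level request can occupy one thread and fan
    out one sub-request to every shard."""
    return concurrent_top_level * (1 + num_shards)
```

In the two-shard example above, even a single concurrent top-level request already needs three threads available cluster-wide, so one thread per node is never enough.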

Configure enough threads to avoid deadlock.

Prefer Local Shards

Solr allows you to pass an optional boolean parameter named preferLocalShards to indicate that a

distributed query should prefer local replicas of a shard when available. In other words, if a query includes preferLocalShards=true, then the query controller will look for local replicas to service the query instead of

selecting replicas at random from across the cluster. This is useful when a query requests many fields or large

fields to be returned per document because it avoids moving large amounts of data over the network when it is

available locally. In addition, this feature can be useful for minimizing the impact of a problematic replica with

degraded performance, as it reduces the likelihood that the degraded replica will be hit by other healthy replicas.

Lastly, it follows that the value of this feature diminishes as the number of shards in a collection increases

because the query controller will have to direct the query to non-local replicas for most of the shards. In other

words, this feature is mostly useful for optimizing queries directed towards collections with a small number of

shards and many replicas. Also, this option should only be used if you are load balancing requests across all

nodes that host replicas for the collection you are querying, as Solr's CloudSolrClient will do. If not

load-balancing, this feature can introduce a hotspot in the cluster since queries won't be evenly distributed

across the cluster.
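The selection logic reads roughly like the following sketch (a simplified model of the behaviour described above; the names are illustrative):

```python
import random

def choose_replica(replicas, coordinator_node, prefer_local=False):
    """Sketch of preferLocalShards: if a replica of the shard lives on the
    node coordinating the query, use it; otherwise pick one at random."""
    if prefer_local:
        local = [r for r in replicas if r.startswith(coordinator_node)]
        if local:
            return random.choice(local)
    return random.choice(replicas)
```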

Preferring local replicas is sometimes more efficient.

Read and Write Side Fault Tolerance

Read/write fault tolerance.

SolrCloud supports elasticity, high availability, and fault tolerance in reads and writes. What this means,basically, is that when you have a large cluster, you can always make requests to the cluster: Reads will returnresults whenever possible, even if some nodes are down, and Writes will be acknowledged only if they aredurable; i.e., you won't lose data.

Read Side Fault Tolerance

In a SolrCloud cluster each individual node load balances read requests across all the replicas in a collection. You

still need a load balancer on the 'outside' that talks to the cluster, or you need a smart client which understands

how to read and interact with Solr's metadata in ZooKeeper and only requests the ZooKeeper ensemble's

address to start discovering to which nodes it should send requests. (Solr provides a smart Java SolrJ client

called CloudSolrClient.)

Even if some nodes in the cluster are offline or unreachable, a Solr node will be able to correctly respond to a

search request as long as it can communicate with at least one replica of every shard, or one replica of every relevant shard if the user limited the search via the 'shards' or '_route_' parameters. The more replicas there are

of every shard, the more likely that the Solr cluster will be able to handle search results in the event of node

failures.

zkConnected

The zkConnected flag reports whether the node handling the request was connected to ZooKeeper.

A Solr node will return the results of a search request as long as it can communicate with at least one replica of

every shard that it knows about, even if it cannot communicate with ZooKeeper at the time it receives the

request. This is normally the preferred behavior from a fault tolerance standpoint, but may result in stale or

incorrect results if there have been major changes to the collection structure that the node has not been informed

of via ZooKeeper (ie: shards may have been added or removed, or split into sub-shards)

A zkConnected header is included in every search response indicating if the node that processed the request

was connected with ZooKeeper at the time:

{
  "responseHeader": {
    "status": 0,
    "zkConnected": true,
    "QTime": 20,
    "params": {
      "q": "*:*"
    }
  },
  "response": {
    "numFound": 107,
    "start": 0,
    "docs": [ ... ]
  }
}

shards.tolerant

In the event that one or more shards queried are completely unavailable, then Solr's default behavior is to fail the

request. However, there are many use-cases where partial results are acceptable and so Solr provides a

boolean shards.tolerant parameter (default 'false'). If shards.tolerant=true then partial results may

be returned. If the returned response does not contain results from all the appropriate shards then the response

header contains a special flag called 'partialResults'. The client can specify 'shards.info' along with the 'shards.tolerant' parameter to retrieve more fine-grained details.

Example response with partialResults flag set to 'true':

{
  "responseHeader": {
    "status": 0,
    "zkConnected": true,
    "partialResults": true,
    "QTime": 20,
    "params": {
      "q": "*:*"
    }
  },
  "response": {
    "numFound": 77,
    "start": 0,
    "docs": [ ... ]
  }
}
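A client can check both headers defensively before trusting a result set; a minimal sketch of such client-side handling (not part of Solr itself):

```python
def response_warnings(response):
    """Inspect a SolrCloud search response for the zkConnected and
    partialResults fault-tolerance flags discussed above."""
    header = response.get("responseHeader", {})
    warnings = []
    if header.get("zkConnected") is False:
        warnings.append("node was disconnected from ZooKeeper; results may be stale")
    if header.get("partialResults"):
        warnings.append("one or more shards were unavailable; results are partial")
    return warnings
```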

Write Side Fault Tolerance

Write-side fault tolerance.

SolrCloud is designed to replicate documents to ensure redundancy for your data, and enable you to send

update requests to any node in the cluster. That node will determine if it hosts the leader for the appropriate

shard, and if not it will forward the request to the leader, which then forwards it to all existing replicas, using versioning to make sure every replica has the most up-to-date version. If the leader goes down, another replica can take its place. This architecture enables you to be certain that your data can be recovered in the

event of a disaster, even if you are using Near Real Time Searching.

Describes the data update flow and disaster recovery.

Recovery

A Transaction Log is created for each node so that every change to content or organization is noted. The log is

used to determine which content in the node should be included in a replica. When a new replica is created, it

refers to the Leader and the Transaction Log to know which content to include. If it fails, it retries.

Replicas learn from the leader and its transaction log which data they should store.

Since the Transaction Log consists of a record of updates, it allows for more robust indexing because it includes

redoing the uncommitted updates if indexing is interrupted.

If a leader goes down, it may have sent requests to some replicas and not others. So when a new potential

leader is identified, it runs a synch process against the other replicas. If this is successful, everything should be

consistent, the leader registers as active, and normal actions proceed. If a replica is too far out of sync, the

system asks for a full replication/replay-based recovery.

If an update fails because cores are reloading schemas and some have finished but others have not, the leader

tells the nodes that the update failed and starts the recovery procedure.

Achieved Replication Factor

When using a replication factor greater than one, an update request may succeed on the shard leader but fail on

one or more of the replicas. For instance, consider a collection with one shard and replication factor of three. In

this case, you have a shard leader and two additional replicas. If an update request succeeds on the leader but

fails on both replicas, for whatever reason, the update request is still considered successful from the perspective

of the client. The replicas that missed the update will sync with the leader when they recover.

Behind the scenes, this means that Solr has accepted updates that are only on one of the nodes (the current

leader). Solr supports an optional min_rf parameter on update requests that causes the server to return the

achieved replication factor for an update request in the response. For the example scenario described above, if

the client application included min_rf >= 1, then Solr would return rf=1 in the Solr response header because the

request only succeeded on the leader. The update request will still be accepted as the min_rf parameter only

tells Solr that the client application wishes to know what the achieved replication factor was for the update

request. In other words, min_rf does not mean Solr will enforce a minimum replication factor as Solr does not

support rolling back updates that succeed on a subset of replicas.

On the client side, if the achieved replication factor is less than the acceptable level, then the client application

can take additional measures to handle the degraded state. For instance, a client application may want to keep a

log of which update requests were sent while the state of the collection was degraded and then resend the

updates once the problem has been resolved. In short, min_rf is an optional mechanism for a client application

to be warned that an update request was accepted while the collection is in a degraded state.
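Client-side handling of the achieved replication factor might look like the following sketch. The 'rf' response-header field follows the description above; the resend log is an application-level idea, not a Solr feature:

```python
def record_achieved_rf(response_header, min_rf, resend_log, docs):
    """If the achieved replication factor 'rf' is below the acceptable
    level, remember the docs so the application can resend them later.
    The update itself is still accepted by Solr either way."""
    achieved = response_header.get("rf", 0)
    if achieved < min_rf:
        resend_log.append(docs)
    return achieved
```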

When an update succeeds on the leader but fails on the replicas, SolrCloud still considers the update successful; the remaining replicas will later recover the data from the leader. You can, however, use the min_rf parameter to learn how many replicas actually applied the update, and then handle the result accordingly on the client side.

SolrCloud Configuration and Parameters

Cluster configuration and parameters.

In this section, we'll cover the various configuration options for SolrCloud.

The following sections cover these topics:

Setting Up an External ZooKeeper Ensemble

Using ZooKeeper to Manage Configuration Files

ZooKeeper Access Control

Collections API

Parameter Reference

Command Line Utilities

SolrCloud with Legacy Configuration Files

ConfigSets API

Setting Up an External ZooKeeper Ensemble

Using an external ZooKeeper ensemble.

Although Solr comes bundled with Apache ZooKeeper, you should consider yourself discouraged from using this

internal ZooKeeper in production, because shutting down a redundant Solr instance will also shut down its

ZooKeeper server, which might not be quite so redundant. Because a ZooKeeper ensemble must have a quorum

of more than half its servers running at any given time, this can be a problem.

The solution to this problem is to set up an external ZooKeeper ensemble. Fortunately, while this process can

seem intimidating due to the number of powerful options, setting up a simple ensemble is actually quite

straightforward, as described below.

Why you should use an external ZooKeeper ensemble, and reassurance that building one is simple.

How Many ZooKeepers?

ZooKeeper deployments are usually made up of an odd number of machines.

When planning how many ZooKeeper nodes to configure, keep in mind that the main principle for a ZooKeeper

ensemble is maintaining a majority of servers to serve requests. This majority is also called a quorum. It is

generally recommended to have an odd number of ZooKeeper servers in your ensemble, so a majority is

maintained. For example, if you only have two ZooKeeper nodes and one goes down, 50% of available servers

is not a majority, so ZooKeeper will no longer serve requests. However, if you have three ZooKeeper nodes and

one goes down, you have 66% of your servers available, and ZooKeeper will continue normally while you

repair the one down node. If you have 5 nodes, you could continue operating with two down nodes if necessary.
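The majority arithmetic above is simple to verify:

```python
def zk_failures_tolerated(ensemble_size):
    """Number of ZooKeeper servers that can fail while a strict majority
    (quorum) remains: 3 nodes tolerate 1 failure, 5 tolerate 2, and an
    even-sized ensemble gains nothing over the next smaller odd size."""
    quorum = ensemble_size // 2 + 1
    return ensemble_size - quorum
```

This is why odd ensemble sizes are recommended: a fourth node adds no extra fault tolerance over three.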

More information on ZooKeeper clusters is available from the ZooKeeper documentation at http://zookeeper.apache.org/doc/r3.4.5/zookeeperAdmin.html#sc_zkMulitServerSetup.

Download Apache ZooKeeper

Download.

The first step in setting up Apache ZooKeeper is, of course, to download the software. It's available from http://zookeeper.apache.org/releases.html.

Solr currently uses Apache ZooKeeper v3.4.6.

Steps for creating the ensemble.

Setting Up a Single ZooKeeper

Create the instance

Creating the instance is a simple matter of extracting the files into a specific target directory. The actual directory

itself doesn't matter, as long as you know where it is, and where you'd like to have ZooKeeper store its internal

data.

Configure the instance

The next step is to configure your ZooKeeper instance. To do that, create the following file: <ZOOKEEPER_HOME>/conf/zoo.cfg. To this file, add the following information:

tickTime=2000

dataDir=/var/lib/zookeeper

clientPort=2181

The parameters are as follows:

tickTime: Part of what ZooKeeper does is to determine which servers are up and running at any given time, and

the minimum session timeout is defined as two "ticks". The tickTime parameter specifies, in milliseconds, how

long each tick should be.

dataDir: This is the directory in which ZooKeeper will store data about the cluster. This directory should start out

empty.

clientPort: This is the port on which Solr will access ZooKeeper.

Once this file is in place, you're ready to start the ZooKeeper instance.

Configure the relevant settings.

Run the instance

To run the instance, you can simply use the ZOOKEEPER_HOME/bin/zkServer.sh script provided, as with this

command: zkServer.sh start

Again, ZooKeeper provides a great deal of power through additional configurations, but delving into them is

beyond the scope of this tutorial. For more information, see the ZooKeeper Getting Started page. For this

example, however, the defaults are fine.

Point Solr at the instance

Pointing Solr at the ZooKeeper instance you've created is a simple matter of using the -z parameter when using

the bin/solr script. For example, in order to point the Solr instance to the ZooKeeper you've started on port 2181,

this is what you'd need to do:

Starting cloud example with Zookeeper already running at port 2181 (with all other defaults):

bin/solr start -e cloud -z localhost:2181 -noprompt

Add a node pointing to an existing ZooKeeper at port 2181:

bin/solr start -cloud -s <path to solr home for new node> -p 8987 -z localhost:2181

NOTE: When you are not using an example to start Solr, make sure you upload the configuration set to ZooKeeper before creating the collection.

Start the cluster using this ZooKeeper instance.

Shut down ZooKeeper

To shut down ZooKeeper, use the zkServer script with the "stop" command: zkServer.sh stop

Setting up a ZooKeeper Ensemble

Setting up the ZooKeeper ensemble.

With an external ZooKeeper ensemble, you need to set things up just a little more carefully as compared to the

Getting Started example.

The difference is that rather than simply starting up the servers, you need to configure them to know about and

talk to each other first. So your original zoo.cfg file might look like this:

dataDir=/var/lib/zookeeperdata/1

clientPort=2181

initLimit=5

syncLimit=2

server.1=localhost:2888:3888

server.2=localhost:2889:3889

server.3=localhost:2890:3890

Here you see three new parameters:

initLimit: Amount of time, in ticks, to allow followers to connect and sync to a leader. In this case, you have 5

ticks, each of which is 2000 milliseconds long, so the server will wait as long as 10 seconds to connect and sync

with the leader.

syncLimit: Amount of time, in ticks, to allow followers to sync with ZooKeeper. If followers fall too far behind a

leader, they will be dropped.

server.X: These are the IDs and locations of all servers in the ensemble, the ports on which they communicate

with each other. The server ID must additionally be stored in the <dataDir>/myid file, located in the dataDir of each ZooKeeper instance. The ID identifies each server, so in the case of this first instance, you would

create the file /var/lib/zookeeperdata/1/myid with the content "1".

Now, whereas with Solr you need to create entirely new directories to run multiple instances, all you need for a

new ZooKeeper instance, even if it's on the same machine for testing purposes, is a new configuration file. To

complete the example you'll create two more configuration files.

The <ZOOKEEPER_HOME>/conf/zoo2.cfg file should have the content:

tickTime=2000

dataDir=c:/sw/zookeeperdata/2

clientPort=2182

initLimit=5

syncLimit=2

server.1=localhost:2888:3888

server.2=localhost:2889:3889

server.3=localhost:2890:3890

You'll also need to create <ZOOKEEPER_HOME>/conf/zoo3.cfg:

tickTime=2000

dataDir=c:/sw/zookeeperdata/3

clientPort=2183

initLimit=5

syncLimit=2

server.1=localhost:2888:3888

server.2=localhost:2889:3889

server.3=localhost:2890:3890

Finally, create your myid files in each of the dataDir directories so that each server knows which instance it is.

The id in the myid file on each machine must match the "server.X" definition. So, the ZooKeeper instance (or

machine) named "server.1" in the above example, must have a myid file containing the value "1". The myid file

can be any integer between 1 and 255, and must match the server IDs assigned in the zoo.cfg file.
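Writing the three config files and their matching myid files by hand is error prone; a small generator script keeps the server IDs consistent. This is a convenience sketch, and the directory layout and ports are illustrative:

```python
import os

def write_ensemble(base_dir, client_ports=(2181, 2182, 2183)):
    """Generate zoo<N>.cfg and matching <dataDir>/myid files for a local
    N-node ZooKeeper ensemble like the one described above."""
    n = len(client_ports)
    servers = "\n".join("server.%d=localhost:%d:%d" % (i, 2887 + i, 3887 + i)
                        for i in range(1, n + 1))
    for i, port in enumerate(client_ports, start=1):
        data_dir = os.path.join(base_dir, "zookeeperdata", str(i))
        os.makedirs(data_dir, exist_ok=True)
        with open(os.path.join(data_dir, "myid"), "w") as f:
            f.write(str(i))  # must match the server.<i> line in the cfg
        cfg = ("tickTime=2000\ndataDir=%s\nclientPort=%d\n"
               "initLimit=5\nsyncLimit=2\n%s\n") % (data_dir, port, servers)
        with open(os.path.join(base_dir, "zoo%d.cfg" % i), "w") as f:
            f.write(cfg)
```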

To start the servers, you can simply explicitly reference the configuration files:

cd <ZOOKEEPER_HOME>

bin/zkServer.sh start zoo.cfg

bin/zkServer.sh start zoo2.cfg

bin/zkServer.sh start zoo3.cfg

Once these servers are running, you can reference them from Solr just as you did before:

bin/solr start -e cloud -z localhost:2181,localhost:2182,localhost:2183 -noprompt

For more information on getting the most power from your ZooKeeper installation, check out the ZooKeeper Administrator's Guide.

Configure and start the ZooKeeper ensemble, then start Solr against it.

Securing the ZooKeeper connection

You may also want to secure the communication between ZooKeeper and Solr.

To set up ACL protection of znodes, see ZooKeeper Access Control.

Using ZooKeeper to Manage Configuration Files

With SolrCloud your configuration files are kept in ZooKeeper. These files are uploaded in any of the following cases:

When you start a SolrCloud example using the bin/solr script.

When you create a collection using the bin/solr script.

When you explicitly upload a configuration set to ZooKeeper.

Startup Bootstrap

When you try SolrCloud for the first time using bin/solr -e cloud, the related configset gets uploaded to ZooKeeper automatically and is linked with the newly created collection.

The below command would start SolrCloud with the default collection name (gettingstarted) and default configset

(data_driven_schema_configs) uploaded and linked to it.

$ bin/solr -e cloud -noprompt

You can also explicitly upload a configuration directory when creating a collection using the bin/solr

script with the -d option, such as:

$ bin/solr create -c mycollection -d data_driven_schema_configs

The create command will upload a copy of the data_driven_schema_configs configuration directory to

ZooKeeper under /configs/mycollection. Refer to the Solr Start Script Reference page for more details about the create command for creating collections.

Once a configuration directory has been uploaded to ZooKeeper, you can update it using the ZooKeeper Command Line Interface (zkCLI).

By default, configuration files are uploaded when a collection is created.

Uploading configs using zkcli or SolrJ

Configuration files can be updated using zkcli or the SolrJ client.

In production situations, Config Sets can also be uploaded to ZooKeeper independent of collection creation, using either Solr's zkcli.sh script or the CloudSolrClient.uploadConfig Java method.

The below command can be used to upload a new configset using the zkcli script.

$ sh zkcli.sh -cmd upconfig -zkhost <host:port> -confname <name for configset> \
  -solrhome <solrhome> -confdir <path to directory with configset>

More information about the ZooKeeper Command Line Utility to help manage changes to configuration files can be found in the section on Command Line Utilities.

Managing Your SolrCloud Configuration Files

To update or change your SolrCloud configuration files:

Download the latest configuration files from ZooKeeper, using the source control checkout process.

Make your changes.

Commit your changed file to source control.

Push the changes back to ZooKeeper.

Reload the collection so that the changes will be in effect.

The configuration file management workflow. (What does "source control" refer to here?)
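The steps above can be sketched as a shell session (the collection name, confname, local path, and the use of git for source control are all assumptions for illustration):

```shell
# 1. Download the current config from ZooKeeper into a local working copy
./server/scripts/cloud-scripts/zkcli.sh -zkhost localhost:2181 \
  -cmd downconfig -confname myconf -confdir /tmp/myconf

# 2-3. Make your changes, then commit them to source control
git -C /tmp/myconf add -A && git -C /tmp/myconf commit -m "tune config"

# 4. Push the changed config back to ZooKeeper
./server/scripts/cloud-scripts/zkcli.sh -zkhost localhost:2181 \
  -cmd upconfig -confname myconf -confdir /tmp/myconf

# 5. Reload the collection so the changes take effect
curl "http://localhost:8983/solr/admin/collections?action=RELOAD&name=mycollection"
```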

Preparing ZooKeeper before first cluster start

If you will share the same ZooKeeper instance with other applications you should use a chroot in ZooKeeper. Please see Taking Solr to Production#ZooKeeper chroot for instructions.

There are certain configuration files containing cluster-wide configuration. Since some of these are crucial for the cluster to function properly, you may need to upload such files to ZooKeeper before starting your Solr cluster for the first time. Examples of such configuration files (not exhaustive) are solr.xml, security.json and clusterprops.json.

If, for example, you would like to keep your solr.xml in ZooKeeper to avoid having to copy it to every node's solr_home directory, you can push it to ZooKeeper with the zkcli.sh utility (Unix example):

zkcli.sh -zkhost localhost:2181 -cmd putfile /solr.xml /path/to/solr.xml

Before the Solr cluster starts for the first time, upload required configuration files to ZooKeeper, e.g. solr.xml, security files, and cluster properties.

ZooKeeper Access Control

This section describes using ZooKeeper access control lists (ACLs) with Solr. For information about ZooKeeper ACLs, see the ZooKeeper documentation at http://zookeeper.apache.org/doc/r3.4.6/zookeeperProgrammers.html#sc_ZooKeeperAccessControl.

About ZooKeeper ACLs

How to Enable ACLs

Changing ACL Schemes

Example Usages

This part covers ZooKeeper access control.

About ZooKeeper ACLs

SolrCloud uses ZooKeeper for shared information and for coordination.

This section describes how to configure Solr to add more restrictive ACLs to the ZooKeeper content it creates, and how to tell Solr about the credentials required to access that content. If you want to use ACLs in your ZooKeeper nodes, you will have to activate this functionality; by default, Solr puts the open-unsafe ACL on everything it creates and uses no credentials.

Changing Solr-related content in ZooKeeper might damage a SolrCloud cluster. For example:

Changing configuration might cause Solr to fail or behave in an unintended way.

Changing cluster state information into something wrong or inconsistent might very well make a SolrCloud

cluster behave strangely.

Adding a delete-collection job to be carried out by the Overseer will cause data to be deleted from the

cluster.

You may want to enable ZooKeeper ACLs with Solr if you grant access to your ZooKeeper ensemble to entities

you do not trust, or if you want to reduce risk of bad actions resulting from, e.g.:

Malware that found its way into your system.

Other systems using the same ZooKeeper ensemble (a "bad thing" might be done by accident).

You might even want to limit read-access, if you think there is stuff in ZooKeeper that not everyone should know

about. Or you might just in general work on need-to-know-basis.

Protecting ZooKeeper itself could mean many different things. This section is about protecting Solr content in ZooKeeper. ZooKeeper content basically lives persisted on disk and (partly) in memory of the ZooKeeper processes. This section is not about protecting ZooKeeper data at the storage or ZooKeeper process levels; that is for ZooKeeper to deal with.

But this content is also available to "the outside" via the ZooKeeper API. Outside processes can connect to

ZooKeeper and create/update/delete/read content; for example, a Solr node in a SolrCloud cluster wants to

create/update/delete/read, and a SolrJ client wants to read from the cluster. It is the responsibility of the outside

processes that create/update content to setup ACLs on the content. ACLs describe who is allowed to read,

update, delete, create, etc. Each piece of information (znode/content) in ZooKeeper has its own set of ACLs, and

inheritance or sharing is not possible. The default behavior in Solr is to add one ACL on all the content it creates

- one ACL that gives anyone the permission to do anything (in ZooKeeper terms this is called "the open-unsafe

ACL").

Why to use ACLs, and what ACLs cover.

How to Enable ACLs

We want to be able to:

Control the credentials Solr uses for its ZooKeeper connections. The credentials are used to get

permission to perform operations in ZooKeeper.

Control which ACLs Solr will add to znodes (ZooKeeper files/folders) it creates in ZooKeeper.

Control it "from the outside", so that you do not have to modify and/or recompile Solr code to turn this on.

Solr nodes, clients and tools (e.g. ZkCLI) always use a Java class called SolrZkClient to deal with their ZooKeeper interactions. The implementation of the solution described here is all about changing SolrZkClient. If you use SolrZkClient in your application, the descriptions below will be true for your application too.

Solr's various ways of interacting with ZooKeeper all go through SolrZkClient, which is the key point here.

Controlling Credentials

You control which credentials provider will be used by configuring the zkCredentialsProvider property in solr.xml's <solrcloud> section to the name of a class (on the classpath) implementing the following interface:

package org.apache.solr.common.cloud;

public interface ZkCredentialsProvider {

  public class ZkCredentials {
    String scheme;
    byte[] auth;

    public ZkCredentials(String scheme, byte[] auth) {
      super();
      this.scheme = scheme;
      this.auth = auth;
    }

    String getScheme() {
      return scheme;
    }

    byte[] getAuth() {
      return auth;
    }
  }

  Collection<ZkCredentials> getCredentials();
}

Solr determines which credentials to use by calling the getCredentials() method of the given credentials provider. If no provider has been configured, the default implementation, DefaultZkCredentialsProvider, is used.

How Solr obtains its credentials.

Out of the Box Implementations

You can always make your own implementation, but Solr comes with two implementations:

org.apache.solr.common.cloud.DefaultZkCredentialsProvider: Its getCredentials()

returns a list of length zero, or "no credentials used". This is the default and is used if you do not configure

a provider in solr.xml.

org.apache.solr.common.cloud.VMParamsSingleSetCredentialsDigestZkCredentialsProvider: This lets you define your credentials using system properties. It supports at most one set of credentials.

The scheme is "digest". The username and password are defined by the system properties "zkDigestUsername" and "zkDigestPassword", respectively. This set of credentials will be added to the list of credentials returned by getCredentials() if both username and password are provided.

If the one set of credentials above is not added to the list, this implementation will fall back to

default behavior and use the (empty) credentials list from DefaultZkCredentialsProvider

Solr ships two implementations: the first (the default) returns empty credentials, and the second lets you supply one set of credentials via system properties.

Controlling ACLs

You control which ACLs will be added by configuring the zkACLProvider property in solr.xml's <solrcloud> section to the name of a class (on the classpath) implementing the following interface:

package org.apache.solr.common.cloud;

public interface ZkACLProvider {
  List<ACL> getACLsToAdd(String zNodePath);
}

When Solr wants to create a new znode, it determines which ACLs to put on the znode by calling the getACLsToAdd() method of the given ACL provider. If no provider has been configured, the default implementation, DefaultZkACLProvider, is used.

When Solr creates a znode, it calls the ACL provider to determine who has which permissions on that node.

Out of the Box Implementations

You can always make your own implementation, but Solr comes with:

org.apache.solr.common.cloud.DefaultZkACLProvider: It returns a list of length one for all zNodePaths. The single ACL entry in the list is "open-unsafe". This is the default and is used if you do not configure a provider in solr.xml.

org.apache.solr.common.cloud.VMParamsAllAndReadonlyDigestZkACLProvider: This lets you define your ACLs using system properties. Its getACLsToAdd() implementation does not use zNodePath for anything, so all znodes get the same set of ACLs. It supports adding one or both of these options:

A user that is allowed to do everything.

The permission is "ALL" (corresponding to all of CREATE, READ, WRITE, DELETE, and ADMIN), and the scheme is "digest".

The username and password are defined by the system properties "zkDigestUsername" and "zkDigestPassword", respectively.

This ACL will not be added to the list of ACLs unless both username and password are provided.

A user that is only allowed to perform read operations.

The permission is "READ" and the scheme is "digest".

The username and password are defined by the system properties "zkDigestReadonlyUsername" and "zkDigestReadonlyPassword", respectively.

This ACL will not be added to the list of ACLs unless both username and password are provided.

If neither of the above ACLs is added to the list, the (empty) ACL list of DefaultZkACLProvider will

be used by default.

Notice the overlap in system property names with the credentials provider VMParamsSingleSetCredentialsDigestZkCredentialsProvider (described above). This is to let the two providers collaborate in a nice and perhaps common way: we always protect access to content by limiting it to two users, an admin user and a readonly user, AND we always connect with credentials corresponding to this same admin user, basically so that we can do anything to the content/znodes we create ourselves.

You can give the readonly credentials to "clients" of your SolrCloud cluster - e.g. to be used by SolrJ clients.

They will be able to read whatever is necessary to run a functioning SolrJ client, but they will not be able to

modify any content in ZooKeeper.

ACL providers: two out-of-the-box implementations.

Changing ACL Schemes

Over the lifetime of operating your Solr cluster, you may decide to move from an unsecured ZooKeeper to a

secured instance. Changing the configured zkACLProvider in solr.xml will ensure that newly created nodes

are secure, but will not protect the already existing data. To modify all existing ACLs, you can use: ZkCLI -cmd updateacls /zk-path.

Changing ACLs in ZK should only be done while your SolrCloud cluster is stopped. Attempting to do so while

Solr is running may result in inconsistent state and some nodes becoming inaccessible. To configure the new

ACLs, run ZkCli with the following VM properties: -DzkACLProvider=... -DzkCredentialsProvider=...

The Credential Provider must be one that has current admin privileges on the nodes. When omitted, the

process will use no credentials (suitable for an unsecure configuration).

The ACL Provider will be used to compute the new ACLs. When omitted, the process will set all permissions to all users, removing any security present.

You may use the VMParamsSingleSetCredentialsDigestZkCredentialsProvider and VMParamsAllAndReadonlyDigestZkACLProvider implementations as described earlier in the page for these properties.

After changing the ZK ACLs, make sure that the contents of your solr.xml match, as described for initial setup.

Applying access control to already-existing znodes: configure the new providers, stop the cluster, and run the update command.
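A sketch of the migration command described above (the classpath, ZooKeeper host, znode path, and credentials are placeholder assumptions):

```shell
# Run only while the SolrCloud cluster is stopped.
# The supplied credentials must carry current admin privileges on the znodes.
java -DzkACLProvider=org.apache.solr.common.cloud.VMParamsAllAndReadonlyDigestZkACLProvider \
     -DzkCredentialsProvider=org.apache.solr.common.cloud.VMParamsSingleSetCredentialsDigestZkCredentialsProvider \
     -DzkDigestUsername=admin-user -DzkDigestPassword=admin-password \
     -cp <solr classpath> org.apache.solr.cloud.ZkCLI \
     -zkhost localhost:2181 -cmd updateacls /solr
```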

Example Usages

A worked example:

Let's say that you want all Solr-related content in ZooKeeper protected. You want an "admin" user that is able to

do anything to the content in ZooKeeper - this user will be used for initializing Solr content in ZooKeeper and for

server-side Solr nodes. You also want a "readonly" user that is only able to read content from ZooKeeper - this

user will be handed over to "clients".

In the examples below:

The "admin" user's username/password is admin-user/admin-password.

The "readonly" user's username/password is readonly-user/readonly-password.

The provider class names must first be configured in solr.xml:

<solr>
  ...
  <solrcloud>
    ...
    <str name="zkCredentialsProvider">org.apache.solr.common.cloud.VMParamsSingleSetCredentialsDigestZkCredentialsProvider</str>
    <str name="zkACLProvider">org.apache.solr.common.cloud.VMParamsAllAndReadonlyDigestZkACLProvider</str>
  </solrcloud>
</solr>

To use ZkCLI:

SOLR_ZK_CREDS_AND_ACLS="-DzkDigestUsername=admin-user -DzkDigestPassword=admin-password \
  -DzkDigestReadonlyUsername=readonly-user -DzkDigestReadonlyPassword=readonly-password"

java ... $SOLR_ZK_CREDS_AND_ACLS ... org.apache.solr.cloud.ZkCLI -cmd ...

For operations using bin/solr, add the following at the bottom of bin/solr.in.sh:

SOLR_ZK_CREDS_AND_ACLS="-DzkDigestUsername=admin-user -DzkDigestPassword=admin-password \
  -DzkDigestReadonlyUsername=readonly-user -DzkDigestReadonlyPassword=readonly-password"

SOLR_OPTS="$SOLR_OPTS $SOLR_ZK_CREDS_AND_ACLS"

For operations using bin\solr.cmd, add the following at the bottom of bin\solr.in.cmd:

set SOLR_ZK_CREDS_AND_ACLS=-DzkDigestUsername=admin-user -DzkDigestPassword=admin-password ^
  -DzkDigestReadonlyUsername=readonly-user -DzkDigestReadonlyPassword=readonly-password

set SOLR_OPTS=%SOLR_OPTS% %SOLR_ZK_CREDS_AND_ACLS%

To start your own "clients" (using SolrJ):

SOLR_ZK_CREDS_AND_ACLS="-DzkDigestUsername=readonly-user -DzkDigestPassword=readonly-password"

java ... $SOLR_ZK_CREDS_AND_ACLS ...

Or, since you yourself are writing the code creating the SolrZkClients, you might want to override the provider implementations at the code level instead.

Collections API

The Collections API enables you to create, remove, or reload collections. In the context of SolrCloud you can also use it to create collections with a specific number of shards and replicas.

Purpose of the API.

API Entry Points

The base URL for all API calls below is http://<hostname>:<port>/solr.

/admin/collections?action=CREATE: create a collection

/admin/collections?action=MODIFYCOLLECTION: modify certain attributes of a collection

/admin/collections?action=RELOAD: reload a collection

/admin/collections?action=SPLITSHARD: split a shard into two new shards

/admin/collections?action=CREATESHARD: create a new shard

/admin/collections?action=DELETESHARD: delete an inactive shard

/admin/collections?action=CREATEALIAS: create or modify an alias for a collection

/admin/collections?action=DELETEALIAS: delete an alias for a collection

/admin/collections?action=DELETE: delete a collection

/admin/collections?action=DELETEREPLICA: delete a replica of a shard

/admin/collections?action=ADDREPLICA: add a replica of a shard

/admin/collections?action=CLUSTERPROP: add/edit/delete a cluster-wide property

/admin/collections?action=MIGRATE: migrate documents to another collection

/admin/collections?action=ADDROLE: add a specific role to a node in the cluster

/admin/collections?action=REMOVEROLE: remove an assigned role

/admin/collections?action=OVERSEERSTATUS: get status and statistics of the overseer

/admin/collections?action=CLUSTERSTATUS: get cluster status

/admin/collections?action=REQUESTSTATUS: get the status of a previous asynchronous request

/admin/collections?action=DELETESTATUS: delete the stored response of a previous asynchronous request

/admin/collections?action=LIST: list all collections

/admin/collections?action=ADDREPLICAPROP: add an arbitrary property to a replica specified by collection/shard/replica

/admin/collections?action=DELETEREPLICAPROP: delete an arbitrary property from a replica specified by collection/shard/replica

/admin/collections?action=BALANCESHARDUNIQUE: distribute an arbitrary property, one per shard, across the nodes in a collection

/admin/collections?action=REBALANCELEADERS: distribute the leader role based on the "preferredLeader" assignments

/admin/collections?action=FORCELEADER: force a leader election in a shard if the leader is lost

/admin/collections?action=MIGRATESTATEFORMAT: migrate a collection from shared clusterstate.json to per-collection state.json

The full list of API entry points; detailed parameter descriptions and examples follow in the reference.
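As a quick illustration of the entry-point pattern above, a CREATE call might look like this (the host, port, collection name, shard/replica counts and configset name are assumptions):

```shell
curl "http://localhost:8983/solr/admin/collections?action=CREATE&name=mycollection&numShards=2&replicationFactor=2&collection.configName=myconf"
```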

Parameter Reference

Cluster Parameters

numShards

SolrCloud Instance Parameters

These are set in solr.xml, but by default the host and hostContext parameters are set up to also work with system properties.

SolrCloud Instance ZooKeeper Parameters

Command Line Utilities

The difference between Solr's embedded ZooKeeper and a standalone ZooKeeper connection, and where each is configured.

Using Solr's ZooKeeper CLI

-cmd

CLI command to be executed: bootstrap, upconfig, downconfig, linkconfig, makepath, get, getfile, put, putfile, list, clear or clusterprop.

This parameter is mandatory.

Quite a lot of commands.

ZooKeeper CLI Examples

Upload a configuration directory

Upload a configuration directory to ZooKeeper.

./server/scripts/cloud-scripts/zkcli.sh -zkhost 127.0.0.1:9983 \
  -cmd upconfig -confname my_new_config -confdir server/solr/configsets/basic_configs/conf

Bootstrap ZooKeeper from existing SOLR_HOME

(Not sure what this one does; bootstrap uploads the configuration of each core found under the given solrhome into ZooKeeper.)

./server/scripts/cloud-scripts/zkcli.sh -zkhost 127.0.0.1:2181

-cmd bootstrap -solrhome /var/solr/data

Put arbitrary data into a new ZooKeeper file

Create a znode in ZooKeeper and put data into it.

./server/scripts/cloud-scripts/zkcli.sh -zkhost 127.0.0.1:9983

-cmd put /my_zk_file.txt 'some data'

Put a local file into a new ZooKeeper file

Store a local file as a ZooKeeper file.

./server/scripts/cloud-scripts/zkcli.sh -zkhost 127.0.0.1:9983

-cmd putfile /my_zk_file.txt /tmp/my_local_file.txt

Link a collection to a configuration set

Map a collection to its configuration set.

./server/scripts/cloud-scripts/zkcli.sh -zkhost 127.0.0.1:9983

-cmd linkconfig -collection gettingstarted -confname my_new_config

Create a new ZooKeeper path

Create a new ZooKeeper znode, e.g.:

./server/scripts/cloud-scripts/zkcli.sh -zkhost 127.0.0.1:2181 -cmd makepath /solr

Set a cluster property

./server/scripts/cloud-scripts/zkcli.sh -zkhost 127.0.0.1:2181

-cmd clusterprop -name urlScheme -val https

---->> Not all functionality is shown here; there are many more commands.

SolrCloud with Legacy Configuration Files

ConfigSets API

Only applicable in SolrCloud mode.

API Entry Points

The base URL for all API calls is http://<hostname>:<port>/solr.

/admin/configs?action=CREATE: create a ConfigSet, based on an existing ConfigSet

/admin/configs?action=DELETE: delete a ConfigSet

/admin/configs?action=LIST: list all ConfigSets
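A sketch of calling these entry points (the host and ConfigSet names are assumptions; CREATE derives the new ConfigSet from an existing base ConfigSet):

```shell
# Create a new ConfigSet by copying an existing base ConfigSet
curl "http://localhost:8983/solr/admin/configs?action=CREATE&name=myNewConfigSet&baseConfigSet=myBaseConfigSet"

# List all ConfigSets
curl "http://localhost:8983/solr/admin/configs?action=LIST"

# Delete a ConfigSet
curl "http://localhost:8983/solr/admin/configs?action=DELETE&name=myNewConfigSet"
```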

Rule-based Replica Placement

When Solr needs to assign nodes to collections, it can either assign them automatically and randomly, or the user can specify a set of nodes where the replicas should be created. With very large clusters, it is hard to specify exact node names, and it still does not give you fine-grained control over how nodes are chosen for a shard. The user should be in complete control of where the nodes are allocated for each collection, shard and replica. This helps to optimally allocate hardware resources across the cluster.

Rule-based replica assignment allows the creation of rules to determine the placement of replicas in the cluster.

In the future, this feature will help to automatically add or remove replicas when systems go down, or when

higher throughput is required. This enables a more hands-off approach to administration of the cluster.

This feature is used in the following instances:

Collection creation

Shard creation

Replica creation

Shard splitting

Automates resource allocation for collections, shards, and replicas.

Common Use Cases

There are several situations where this functionality may be used. A few of the rules that could be implemented

are listed below:

Don’t assign more than 1 replica of this collection to a host.

Assign all replicas to nodes with more than 100GB of free disk space or, assign replicas where there is

more disk space.

Do not assign any replica on a given host because I want to run an overseer there.

Assign only one replica of a shard in a rack.

Assign replica in nodes hosting less than 5 cores.

Assign replicas in nodes hosting the least number of cores

Rule Conditions

A rule is a set of conditions that a node must satisfy before a replica core can be created there.

Conditions a node must satisfy.

Rule Conditions

There are three possible conditions.

shard: this is the name of a shard or a wild card (* means for all shards). If shard is not specified, then the

rule applies to the entire collection.

replica: this can be a number or a wild-card (* means any number zero to infinity).

tag: this is an attribute of a node in the cluster that can be used in a rule, e.g. “freedisk”, “cores”, “rack”,

“dc”, etc. The tag name can be a custom string. If creating a custom tag, a snitch is responsible for

providing tags and values. The section Snitches below describes how to add a custom tag, and defines

six pre-defined tags (cores, freedisk, host, port, node, and sysprop).

The six pre-defined tags.

Rule Operators

A condition can have one of the following operators to set the parameters for the rule.

equals (no operator required): tag:x means tag value must be equal to ‘x’

greater than (>): tag:>x means tag value greater than ‘x’. x must be a number

less than (<): tag:<x means tag value must be less than ‘x’. x must be a number

not equal (!): tag:!x means tag value MUST NOT be equal to ‘x’. The equals check is performed on the String value

Rule operators.

Fuzzy Operator (~)

This can be used as a suffix to any condition. This would first try to satisfy the rule strictly. If Solr can’t find

enough nodes to match the criterion, it tries to find the next best match which may not satisfy the criterion. For

example, if we have a rule such as, freedisk:>200~, Solr will try to assign replicas of this collection on nodes

with more than 200GB of free disk space. If that is not possible, the node which has the most free disk space will

be chosen instead.

Choosing Among Equals

The nodes are sorted first and the rules are used to sort them. This ensures that even if many nodes match the

rules, the best nodes are picked up for node assignment. For example, if there is a rule such as freedisk:>20,

nodes are sorted first on disk space descending and the node with the most disk space is picked up first. Or, if

the rule is cores:<5, nodes are sorted with number of cores ascending and the node with the least number of

cores is picked up first.

Nodes are sorted and the best match is chosen.

Rules for new shards

The rules are persisted along with the collection state. So, when a new replica is created, the system will assign replicas satisfying the rules. When a new shard is created as a result of create shard, ensure that you have created rules specific for that shard name. Rules can be altered using the modify collection command. However, it is not required to do so if the rules do not specify explicit shard names. For example, a rule such as shard:shard1,replica:*,ip_3:168 will not apply to any new shard created. But, if your rule is replica:*,ip_3:168, then it will apply to any new shard created.

The same is applicable to shard splitting. Shard splitting is treated exactly the same way as shard creation. Even

though shard1_1 and shard1_2 may be created from shard1, the rules treat them as distinct, unrelated

shards.

How rules apply to newly created shards.

Snitches

Tag values come from a plugin called a Snitch. If there is a tag named ‘rack’ in a rule, there must be a Snitch which provides the value of ‘rack’ for each node in the cluster. A snitch implements the Snitch interface. Solr, by default, provides a default snitch which provides the following tags:

cores: Number of cores in the node

freedisk: Disk space available in the node

host: host name of the node

port: port of the node

node: node name

ip_1, ip_2, ip_3, ip_4: These are IP fragments for each node. For example, in a host with IP 192.168.1.2, ip_1 = 2, ip_2 = 1, ip_3 = 168 and ip_4 = 192

sysprop.{PROPERTY_NAME}: These are values available from system properties. sysprop.key means a value that is passed to the node as -Dkey=keyValue during node startup. It is possible to use rules like sysprop.key:expectedVal,shard:*

Tags available from the default snitch.

How Snitches are Configured

It is possible to use one or more snitches for a set of rules. If the rules only need tags from the default snitch, it need not be explicitly configured. For example:

snitch=class:fqn.ClassName,key1:val1,key2:val2,key3:val3

How Tag Values are Collected

Identify the set of tags in the rules

Create instances of Snitches specified. The default snitch is always created.

Ask each Snitch if it can provide values for any of the tags. If even one tag does not have a snitch, the assignment fails.

After identifying the Snitches, they provide the tag values for each node in the cluster.

If the value for a tag is not obtained for a given node, it cannot participate in the assignment.

How snitches are configured and how they work.

Examples

Keep less than 2 replicas (at most 1 replica) of this collection on any node

For this rule, we define the replica condition with operators for "less than 2", and use a pre-defined tag named

node to define nodes with any name.

replica:<2,node:*

Ensures no node in the cluster holds more than one replica of this collection.

For a given shard, keep less than 2 replicas on any node

For this rule, we use the shard condition to define any shard name, the replica condition with operators for

"less than 2", and finally a pre-defined tag named node to define nodes with any name.

shard:*,replica:<2,node:*

For any shard, no node may hold more than one of its replicas.

Assign all replicas in shard1 to rack 730

This rule limits the shard condition to 'shard1', but any number of replicas. We're also referencing a custom tag

named rack. Before defining this rule, we will need to configure a custom Snitch which provides values for the

tag rack.

shard:shard1,replica:*,rack:730

In this case, the default value of replica is * (or, all replicas). So, it can be omitted and the rule can be reduced

to:

shard:shard1,rack:730

Uses a custom snitch providing the rack tag.

Create replicas in nodes with less than 5 cores only

This rule uses the replica condition to define any number of replicas, but adds a pre-defined tag named cores and uses operators for "less than 5".

replica:*,cores:<5

Again, we can simplify this to use the default value for replica, like so:

cores:<5

Replicas are created only on nodes hosting fewer than 5 cores.

Do not create any replicas in host 192.45.67.3

This rule uses only the pre-defined tag host to define an IP address where replicas should not be placed.

host:!192.45.67.3

Do not create replicas on the specified host.

Defining Rules

Rules are specified per collection during collection creation as request parameters. It is possible to specify

multiple ‘rule’ and ‘snitch’ params as in this example:

snitch=class:EC2Snitch&rule=shard:*,replica:1,dc:dc1&rule=shard:*,replica:<2,dc:dc3

These rules are persisted in clusterstate.json in Zookeeper and are available throughout the lifetime of the

collection. This enables the system to perform any future node allocation without direct user interaction. The

rules added during collection creation can be modified later using the MODIFYCOLLECTION API.

Defining your own snitch and rules at collection-creation time.
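Putting this into a full collection-creation request, the rule and snitch parameters ride along with the usual CREATE parameters (host and names here are hypothetical, and the rule values contain characters such as <, > and * that should be URL-encoded in practice):

```shell
curl "http://localhost:8983/solr/admin/collections?action=CREATE&name=mycollection\
&numShards=2&replicationFactor=2\
&rule=shard:*,replica:<2,node:*\
&rule=freedisk:>100"
```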

Cross Data Center Replication (CDCR)

The SolrCloud architecture is not particularly well suited for situations where a single SolrCloud cluster consists of nodes in separated data centers connected by an expensive pipe. The root problem is that SolrCloud is designed to support Near Real Time Searching by immediately forwarding updates between nodes in the cluster on a per-shard basis.

"CDCR" features exist to help mitigate the risk of an entire Data Center outage.

Intended to mitigate the risk of a complete data center outage.

What is CDCR?

Glossary

Architecture

Major Components

CDCR Configuration

CDCR Initialization

Inter-Data Center Communication

Updates Tracking & Pushing

Synchronization of Update Checkpoints

Maintenance of Updates Log

Monitoring

CDC Replicator

Limitations

Configuration

Source Configuration

Target Configuration

Configuration Details

The Replica Element

The Replicator Element

The updateLogSynchronizer Element

The Buffer Element

CDCR API

API Entry Points (Control)

API Entry Points (Monitoring)

Control Commands

Monitoring commands

Initial Startup

Monitoring

ZooKeeper settings

Upgrading and Patching Production

What CDCR is, its components, how to configure it, how to use it, and so on.

What is CDCR?

The goal of the project is to replicate data to multiple Data Centers. The initial version of the solution will cover

the active-passive scenario where data updates are replicated from a Source Data Center to a Target Data

Center. Data updates include adding/updating and deleting documents.

Initially designed to replicate data updates from a Source data center to a Target data center.

Data changes on the Source Data Center are replicated to the Target Data Center only after they are persisted

to disk. The data changes can be replicated in real-time (with a small delay) or could be scheduled to be sent in

intervals to the Target Data Center. This solution pre-supposes that the Source and Target data centers begin

with the same documents indexed. Of course the indexes may be empty to start.

Data changes are replicated to the Target in near real time after being persisted to disk; both data centers should start from the same set of indexed documents.

Each shard leader in the Source Data Center will be responsible for replicating its updates to the appropriate

collection in the Target Data Center. When receiving updates from the Source Data Center, shard leaders in the

Target Data Center will replicate the changes to their own replicas.

Each shard leader in the Source data center pushes its updates to the corresponding shard leader in the Target, completing the replication.

This replication model is designed to tolerate some degradation in connectivity, accommodate limited bandwidth, and support batch updates to optimize communication.

Replication supports both a new empty index and pre-built indexes. In the scenario where the replication is set

up on a pre-built index, CDCR will ensure consistency of the replication of the updates, but cannot ensure

consistency on the full index. Therefore any index created before CDCR was set up will have to be replicated by

other means (described in the section Starting CDCR the first time with an existing index) in order that Source

and Target indexes be fully consistent.

Indexes built before CDCR was enabled must be synchronized by other means to keep Source and Target consistent.

The active-passive nature of the initial implementation implies a "push" model from the Source collection to the

Target collection. Therefore, the Source configuration must be able to "see" the ZooKeeper ensemble in the

Target cluster. The Target ZooKeeper ensemble is configured in the Source's solrconfig.xml file.

A push model is used to send data from the Source collection to the Target collection, so the Target's ZooKeeper configuration must appear in the Source's solrconfig.xml.

CDCR is configured to replicate from collections in the Source cluster to collections in the Target cluster on a

collection-by-collection basis. Since CDCR is configured in solrconfig.xml (on both Source and Target

clusters), the settings can be tailored for the needs of each collection.

CDCR can be configured to replicate from one collection to a second collection within the same cluster. That is a specialized scenario not covered in this document.

Supports replication both across clusters and within the same cluster.
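As a preview of the Source Configuration section referenced above, a minimal CDCR fragment in the Source cluster's solrconfig.xml might look roughly like this (the zkHost value and collection names are assumptions):

```xml
<!-- Hypothetical Source-side snippet; zkHost and collection names are assumptions -->
<requestHandler name="/cdcr" class="solr.CdcrRequestHandler">
  <lst name="replica">
    <!-- ZooKeeper ensemble of the Target cluster -->
    <str name="zkHost">target-zk1:2181,target-zk2:2181,target-zk3:2181</str>
    <str name="source">mycollection</str>
    <str name="target">mycollection</str>
  </lst>
</requestHandler>
```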

Glossary

Terms used in this document include:

Node: A JVM instance running Solr; a server.

Cluster: A set of Solr nodes managed as a single unit by a ZooKeeper ensemble, hosting one or more

Collections.

Data Center: A group of networked servers hosting a Solr cluster. In this document, the terms Cluster and Data Center are interchangeable as we assume that each Solr cluster is hosted in a different group of networked servers.

Shard: A sub-index of a single logical collection. This may be spread across multiple nodes of the cluster.

Each shard can have as many replicas as needed.

Leader: Each shard has one node identified as its leader. All the writes for documents belonging to a

shard are routed through the leader.

Replica: A copy of a shard for use in failover or load balancing. Replicas comprising a shard can either be

leaders or non-leaders.

Follower: A convenience term for a replica that is not the leader of a shard.

Collection: Multiple documents that make up one logical index. A cluster can have multiple collections.

Updates Log: An append-only log of write operations maintained by each node.

Architecture

The data flow is as follows.

Updates and deletes are first written to the Source cluster, then forwarded to the Target cluster. The data flow

sequence is:

1. A shard leader receives a new data update that is processed by its Update Processor.

2. The data update is first applied to the local index.

3. Upon successful application of the data update on the local index, the data update is added to the Updates Log queue.

4. After the data update is persisted to disk, the data update is sent to the replicas within the Data Center.

5. After Step 4 is successful, CDCR reads the data update from the Updates Log and pushes it to the corresponding collection in the Target Data Center. This is necessary in order to ensure consistency between the Source and Target Data Centers.

6. The leader on the Target Data Center writes the data locally and forwards it to all its followers.

Steps 1, 2, 3 and 4 are performed synchronously by SolrCloud; Step 5 is performed asynchronously by a

background thread. Given that CDCR replication is performed asynchronously, it becomes possible to push

batch updates in order to minimize network communication overhead. Also, if CDCR is unable to push the

update at a given time -- for example, due to a degradation in connectivity -- it can retry later without any impact

on the Source Data Center.

One implication of the architecture is that the leaders in the Source cluster must be able to "see" the leaders in

the Target cluster. Since leaders may change, this effectively means that all nodes in the Source cluster must be

able to "see" all Solr nodes in the Target cluster so firewalls, ACL rules, etc. must be configured with care.

A detailed breakdown of the data update flow.

Major Components

There are a number of key features and components in CDCR’s architecture:

CDCR Configuration

In order to configure CDCR, the Source Data Center requires the host address of the ZooKeeper cluster

associated with the Target Data Center. The ZooKeeper host address is the only information needed by CDCR

to instantiate the communication with the Target Solr cluster. The CDCR configuration file on the Source cluster

will therefore contain a list of ZooKeeper hosts. The CDCR configuration file might also contain

secondary/optional configuration, such as the number of CDC Replicator threads, batch updates related settings,

etc.

The ZooKeeper information must be configured, plus some optional settings.

CDCR Initialization

CDCR supports incremental updates to either new or existing collections. CDCR may not be able to keep up

with very high volume updates, especially if there are significant communications latencies due to a slow "pipe"

between the data centers. Some scenarios:

There is an initial bulk load of a corpus followed by lower volume incremental updates. In this case, one can do the initial bulk load, replicate the index, and then keep them synchronized via CDCR. See the section Starting CDCR the first time with an existing index for more information.

The index is being built up from scratch, without a significant initial bulk load. CDCR can be set up on

empty collections and keep them synchronized from the start.

The index is always being updated at a volume too high for CDCR to keep up. This is especially possible

in situations where the connection between the Source and Target data centers is poor. This scenario is

unsuitable for CDCR in its current form.

Notes on when CDCR is suitable.

Inter-Data Center Communication

Communication between Data Centers will be achieved through HTTP and the Solr REST API using the SolrJ

client. The SolrJ client will be instantiated with the ZooKeeper host of the Target Data Center. SolrJ will manage

the shard leader discovery process.

Cross-cluster communication happens over HTTP via the SolrJ client.

Updates Tracking & Pushing

CDCR replicates data updates from the Source to the Target Data Center by leveraging the Updates Log.

Replication is driven by the Updates Log.

A background thread regularly checks the Updates Log for new entries, and then forwards them to the Target

Data Center. The thread therefore needs to keep a checkpoint in the form of a pointer to the last update

successfully processed in the Updates Log. Upon acknowledgement from the Target Data Center that updates

have been successfully processed, the Updates Log pointer is updated to reflect the current checkpoint.

A background thread on the Source cluster periodically checks the Updates Log, starting from the last checkpoint; once the Target cluster acknowledges the updates, the checkpoint is advanced for the next pass.

This pointer must be synchronized across all the replicas. In the case where the leader goes down and a new

leader is elected, the new leader will be able to resume replication from the last update by using this

synchronized pointer. The strategy to synchronize such a pointer across replicas will be explained next.

This checkpoint must be synchronized across all replicas in the Source cluster.

If for some reason, the Target Data Center is offline or fails to process the updates, the thread will periodically try

to contact the Target Data Center and push the updates.

If the Target cluster goes down, the Source cluster will periodically retry pushing the updates.

Synchronization of Update Checkpoints

A reliable synchronization of the update checkpoints between the shard leader and shard replicas is critical to

avoid introducing inconsistency between the Source and Target Data Centers. Another important requirement is

that the synchronization must be performed with minimal network traffic to maximize scalability.

In order to achieve this, the strategy is to:

Uniquely identify each update operation. This unique identifier will serve as a pointer.

Rely on two storages: an ephemeral storage on the Source shard leader, and a persistent storage on the Target cluster.

A unique identifier per update, plus storage in both places.

The shard leader in the Source cluster will be in charge of generating a unique identifier for each update

operation, and will keep a copy of the identifier of the last processed updates in memory. The identifier will be

sent to the Target cluster as part of the update request. On the Target Data Center side, the shard leader will

receive the update request, store it along with the unique identifier in the Updates Log, and replicate it to the

other shards.

SolrCloud is already providing a unique identifier for each update operation, i.e., a "version" number. This version number is generated using a time-based Lamport clock which is incremented for each update operation sent. This provides a "happened-before" ordering of the update operations that will be leveraged in (1) the initialization of the update checkpoint on the Source cluster, and in (2) the maintenance strategy of the Updates Log.

SolrCloud uses the time-based version field as this unique identifier.

The persistent storage on the Target cluster is used only during the election of a new shard leader on the Source

cluster. If a shard leader goes down on the Source cluster and a new leader is elected, the new leader will

contact the Target cluster to retrieve the last update checkpoint and instantiate its ephemeral pointer. On such a

request, the Target cluster will retrieve the latest identifier received across all the shards, and send it back to the

Source cluster. To retrieve the latest identifier, every shard leader will look up the identifier of the first entry in its

Update Logs and send it back to a coordinator. The coordinator will have to select the highest among them.

When a shard leader in the Source cluster goes down, the newly elected leader retrieves the checkpoint identifiers from the Target cluster and uses the highest version among them.

This strategy does not require any additional network traffic and ensures reliable pointer synchronization.
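As an illustrative sketch (not Solr's actual code), the coordinator's selection step described above amounts to taking the maximum version reported by the shard leaders; `select_checkpoint` is a hypothetical helper:

```python
def select_checkpoint(shard_versions):
    """Pick the update checkpoint for a newly elected Source leader.

    Each Target shard leader reports the version of the first entry in its
    Updates Log; the coordinator keeps the highest one. (Hypothetical helper,
    not part of the Solr codebase.)
    """
    if not shard_versions:
        raise ValueError("no shard reported a checkpoint")
    return max(shard_versions)


# Versions are SolrCloud's time-based "version" numbers.
print(select_checkpoint([1630000000000000512, 1630000000000000777]))
```

Because the versions are ordered by a Lamport-style clock, the maximum is the most recent update acknowledged by any Target shard.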

Consistency is principally achieved by leveraging SolrCloud. The update workflow of SolrCloud ensures that

every update is applied to the leader but also to any of the replicas. If the leader goes down, a new leader is

elected. During the leader election, a synchronization is performed between the new leader and the other

replicas. As a result, this ensures that the new leader has an Updates Log consistent with that of the previous leader.

Having a consistent Updates Log means that:

On the Source cluster, the update checkpoint can be reused by the new leader.

On the Target cluster, the update checkpoint will be consistent between the previous and new leader. This

ensures the correctness of the update checkpoint sent by a newly elected leader from the Target cluster.

Maintenance of Updates Log

The CDCR replication logic requires modification to the maintenance logic of the Updates Log on the Source

Data Center. Initially, the Updates Log acts as a fixed size queue, limited to 100 update entries. In the CDCR

scenario, the Update Logs must act as a queue of variable size as they need to keep track of all the updates up

through the last processed update by the Target Data Center. Entries in the Update Logs are removed only when

all pointers (one pointer per Target Data Center) are after them.

If the communication with one of the Target Data Center is slow, the Updates Log on the Source Data Center

can grow to a substantial size. In such a scenario, it is necessary for the Updates Log to be able to efficiently find

a given update operation given its identifier. Given that its identifier is an incremental number, it is possible to

implement an efficient search strategy. Each transaction log file contains as part of its filename the version

number of the first element. This is used to quickly traverse all the transaction log files and find the transaction

log file containing one specific version number.

Maintenance of the transaction log files.
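The filename-based lookup described above can be sketched as follows; the `tlog.<version>` filename pattern is an illustrative assumption, not Solr's exact on-disk format:

```python
import bisect


def find_tlog(filenames, version):
    """Locate the transaction log file that contains a given version.

    Assumes each filename ends with the version number of its first entry
    (e.g. 'tlog.0000000000000000100'); the containing file is the last one
    whose starting version is <= the requested version.
    """
    starts = sorted((int(name.rsplit('.', 1)[1]), name) for name in filenames)
    keys = [s for s, _ in starts]
    i = bisect.bisect_right(keys, version) - 1
    if i < 0:
        raise KeyError("version %d predates all transaction logs" % version)
    return starts[i][1]


logs = ["tlog.0000000000000000100", "tlog.0000000000000000200"]
print(find_tlog(logs, 150))  # version 150 falls in the file starting at 100
```

Because the identifiers are monotonically increasing, this lookup is a binary search over file start versions rather than a scan of log contents.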

Monitoring

CDCR provides the following monitoring capabilities over the replication operations:

Monitoring of the outgoing and incoming replications, with information such as the Source and Target

nodes, their status, etc.

Statistics about the replication, with information such as operations (add/delete) per second, number of

documents in the queue, etc.

Information about the lifecycle and statistics will be provided on a per-shard basis by the CDC Replicator thread.

The CDCR API can then aggregate this information at the collection level.

Some of the monitoring information provided.

CDC Replicator

The CDC Replicator is a background thread that is responsible for replicating updates from a Source Data

Center to one or more Target Data Centers. It will also be responsible in providing monitoring information on a

per-shard basis. As there can be a large number of collections and shards in a cluster, we will use a fixed-size

pool of CDC Replicator threads that will be shared across shards.

Limitations

The current design of CDCR has some limitations. CDCR will continue to evolve over time and many of these

limitations will be addressed. Among them are:

CDCR is unlikely to be satisfactory for bulk-load situations where the update rate is high, especially if the

bandwidth between the Source and Target clusters is restricted. In this scenario, the initial bulk load

should be performed, the Source and Target data centers synchronized and CDCR be utilized for

incremental updates.

CDCR is currently only active-passive; data is pushed from the Source cluster to the Target cluster. There

is active work being done in this area in the 6x code line to remove this limitation.

Some current limitations:

Heavy update rates require sufficient bandwidth between the clusters.

Replication is currently active-passive, pushed from the Source only.

Configuration

The Source and Target configurations differ in the case of the data centers being in separate clusters. "Cluster"

here means separate ZooKeeper ensembles controlling disjoint Solr instances. Whether these data centers are

physically separated or not is immaterial for this discussion.

Separate clusters (separate ZooKeeper ensembles) are required for the cross-data-center case.

Source Configuration

An example Source cluster configuration.

Here is a sample of a Source configuration file, a section in solrconfig.xml. The presence of the <replica> section causes CDCR to use this cluster as the Source and should not be present in the Target collections in the cluster-to-cluster case. Details about each setting are after the two examples:

<requestHandler name="/cdcr" class="solr.CdcrRequestHandler">
  <lst name="replica">
    <str name="zkHost">10.240.18.211:2181</str>
    <str name="source">collection1</str>
    <str name="target">collection1</str>
  </lst>
  <lst name="replicator">
    <str name="threadPoolSize">8</str>
    <str name="schedule">1000</str>
    <str name="batchSize">128</str>
  </lst>
  <lst name="updateLogSynchronizer">
    <str name="schedule">1000</str>
  </lst>
</requestHandler>

<updateHandler class="solr.DirectUpdateHandler2">
  <updateLog class="solr.CdcrUpdateLog">
    <str name="dir">${solr.ulog.dir:}</str>
  </updateLog>
</updateHandler>

Target Configuration

The Target cluster configuration.

Here is a typical Target configuration.

Target instances must configure an update processor chain that is specific to CDCR. The update processor chain must include the CdcrUpdateProcessorFactory. The task of this processor is to ensure that the version numbers attached to update requests coming from a CDCR Source SolrCloud are reused and not overwritten by the Target. A properly configured Target configuration looks similar to this:

<requestHandler name="/cdcr" class="solr.CdcrRequestHandler">
  <lst name="buffer">
    <str name="defaultState">disabled</str>
  </lst>
</requestHandler>

<requestHandler name="/update" class="solr.UpdateRequestHandler">
  <lst name="defaults">
    <str name="update.chain">cdcr-processor-chain</str>
  </lst>
</requestHandler>

<updateRequestProcessorChain name="cdcr-processor-chain">
  <processor class="solr.CdcrUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

<updateHandler class="solr.DirectUpdateHandler2">
  <updateLog class="solr.CdcrUpdateLog">
    <str name="dir">${solr.ulog.dir:}</str>
  </updateLog>
</updateHandler>

Configuration Details

The configuration details, defaults and options are as follows:

The Replica Element

CDCR can be configured to forward update requests to one or more replicas. A replica is defined with a "replica" list as follows:

The Replicator Element

The replicator element.

The CDC Replicator is the component in charge of forwarding updates to the replicas. The replicator will monitor the update logs of the Source collection and will forward any new updates to the Target collection. The replicator uses a fixed thread pool to forward updates to multiple replicas in parallel. If more than one replica is configured, one thread will forward a batch of updates from one replica at a time in a round-robin fashion. The replicator can be configured with a "replicator" list as follows:

The updateLogSynchronizer Element

Expert: Non-leader nodes need to synchronize their update logs with their leader node from time to time in order

to clean deprecated transaction log files. By default, such a synchronization process is performed every minute.

The schedule of the synchronization can be modified with an "updateLogSynchronizer" list as follows:

Log synchronization between non-leader nodes and the leader.

The Buffer Element

CDCR is configured by default to buffer any new incoming updates. When buffering updates, the updates log will

store all the updates indefinitely. Replicas do not need to buffer updates, and it is recommended to disable buffer

on the Target SolrCloud. The buffer can be disabled at startup with a "buffer" list and the parameter "defaultState" as follows:

CDCR API

The CDCR API is used to control and monitor the replication process. Control actions are performed at a collection level, i.e., by using the following base URL for API calls: http://<host>:<port>/solr/<collection>.

Monitor actions are performed at a core level, i.e., by using the following base URL for API calls: http://<host>:<port>/solr/<core>.

Currently, none of the CDCR API calls have parameters.

The API entry points and what they do.

API Entry Points (Control)

collection/cdcr?action=STATUS: Returns the current state of CDCR.

collection/cdcr?action=START: Starts CDCR replication.

collection/cdcr?action=STOP: Stops CDCR replication.

collection/cdcr?action=ENABLEBUFFER: Enables the buffering of updates.

collection/cdcr?action=DISABLEBUFFER: Disables the buffering of updates.

API Entry Points (Monitoring)

core/cdcr?action=QUEUES: Fetches statistics about the queue for each replica and about the update logs.

core/cdcr?action=OPS: Fetches statistics about the replication performance (operations per second) for each replica.

core/cdcr?action=ERRORS: Fetches statistics and other information about replication errors for each replica.
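To make the entry points concrete, here is a small hypothetical helper (plain string construction, no network calls) that builds the control and monitoring URLs listed above:

```python
def cdcr_url(host, port, name, action):
    """Build a CDCR API URL.

    For control actions (STATUS, START, STOP, ENABLEBUFFER, DISABLEBUFFER)
    'name' is the collection name; for monitoring actions (QUEUES, OPS,
    ERRORS) 'name' is the core name.
    """
    return "http://%s:%d/solr/%s/cdcr?action=%s" % (host, port, name, action)


print(cdcr_url("localhost", 8983, "collection1", "STATUS"))
# prints http://localhost:8983/solr/collection1/cdcr?action=STATUS
```

None of the CDCR actions currently take additional parameters, so the action name is the whole query string.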

Control Commands

Initial Startup

Upload the modified solrconfig.xml to ZooKeeper on both Source and Target

Sync the index directories from the Source collection to Target collection across to the

corresponding shard nodes.

Tip: rsync works well for this.

For example: if there are 2 shards on collection1 with 2 replicas for each shard, copy the

corresponding index directories from

Start the ZooKeeper on the Target (DR) side

Start the SolrCloud on the Target (DR) side

Start the ZooKeeper on the Source side

Start the SolrCloud on the Source side

Tip: As a general rule, the Target (DR) side of the SolrCloud should be started before the

Source side.

Activate CDCR on the Source instance using the CDCR API:

http://host:port/solr/collection_name/cdcr?action=START


There is no need to run the /cdcr?action=START command on the Target

Disable the buffer on the Target

http://host:port/solr/collection_name/cdcr?action=DISABLEBUFFER

Re-enable indexing.

Monitoring

Network and disk space monitoring are essential. Ensure that the system has plenty of available storage

to queue up changes if there is a disconnect between the Source and Target. A network outage between

the two data centers can cause your disk usage to grow.

Tip: Set a monitor for your disks to send alerts when the disk gets over a certain percentage (e.g., 70%).

Tip: Run a test. With moderate indexing, how long can the system queue changes before you run

out of disk space?

Create a simple way to check the counts between the Source and the Target.

Keep in mind that if indexing is running, the Source and Target may not match document for

document. Set an alert to fire if the difference is greater than some percentage of the overall cloud

size.

Monitor disk space, and set an alert on the document-count difference between the two clusters.
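A minimal sketch of such an alert check; `counts_diverged` is a hypothetical helper, and the counts themselves would come from a rows=0 query against each cluster:

```python
def counts_diverged(source_count, target_count, threshold_pct=1.0):
    """Return True when the Source and Target document counts differ by
    more than threshold_pct percent of the Source count."""
    if source_count == 0:
        return target_count != 0
    drift = abs(source_count - target_count) / source_count * 100.0
    return drift > threshold_pct


# While indexing is running the clusters rarely match document for
# document, so only alert past a tolerance.
print(counts_diverged(1_000_000, 995_000, threshold_pct=1.0))
```

The tolerance absorbs the normal replication lag; pick a threshold based on how far behind the Target is allowed to fall.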

ZooKeeper settings

With CDCR, the Target ZooKeepers will have connections from the Target clouds and the Source clouds.

You may need to increase the maxClientCnxns setting in the zoo.cfg.

## raise the per-client connection limit (maxClientCnxns=0 means no limit)

maxClientCnxns=800

Upgrading and Patching Production

When rolling in upgrades to your indexer or application, you should shut down the Source (production) and the Target (DR). Depending on your setup, you may want to pause/stop indexing. Deploy the release or patch, re-enable indexing, and then start the Target (DR).

Tip: There is no need to reissue the DISABLEBUFFER or START commands. These are persisted.

Tip: After starting the Target, run a simple test. Add a test document to each of the Source clouds.

Then check for it on the Target.

#send to the Source
curl http://<Source>/solr/cloud1/update -H 'Content-type:application/json' -d '[{"SKU":"ABC"}]'

#check the Target
curl "http://<Target>:8983/solr/cloud1/select?q=SKU:ABC&wt=json&indent=true"

Legacy Scaling and Distribution

What Problem Does Distribution Solve?

If searches are taking too long or the index is approaching the physical limitations of its machine, you should

consider distributing the index across two or more Solr servers.

To distribute an index, you divide the index into partitions called shards, each of which runs on a separate

machine. Solr then partitions searches into sub-searches, which run on the individual shards, reporting results

collectively. The architectural details underlying index sharding are invisible to end users, who simply experience

faster performance on queries against very large indexes.

The problems distribution solves: searches too slow, index too large.

What Problem Does Replication Solve?

Replicating an index is useful when:

You have a large search volume which one machine cannot handle, so you need to distribute searches

across multiple read-only copies of the index.

There is a high volume/high rate of indexing which consumes machine resources and reduces search

performance on the indexing machine, so you need to separate indexing and searching.

You want to make a backup of the index (see Making and Restoring Backups of Solr Cores).

Distributed Search with Index Sharding

Distributing Documents across Shards

Configuring the ReplicationHandler

In addition to ReplicationHandler configuration options specific to the master/slave roles, there are a few

special configuration options that are generally supported (even when using SolrCloud).

maxNumberOfBackups an integer value dictating the maximum number of backups this node will keep

on disk as it receives backup commands.

Similar to most other request handlers in Solr, you may configure a set of "defaults, invariants, and/or appends" parameters corresponding with any request parameters supported by the ReplicationHandler when processing commands.

The example below shows a possible 'master' configuration for the ReplicationHandler, including a fixed number of backups and an invariant setting for the maxWriteMBPerSec request parameter to prevent slaves from saturating its network interface:

<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <str name="replicateAfter">optimize</str>
    <str name="backupAfter">optimize</str>
    <str name="confFiles">schema.xml,stopwords.txt,elevate.xml</str>
    <str name="commitReserveDuration">00:00:10</str>
  </lst>
  <int name="maxNumberOfBackups">2</int>
  <lst name="invariants">
    <str name="maxWriteMBPerSec">16</str>
  </lst>
</requestHandler>

Replicating solrconfig.xml

In the configuration file on the master server, include a line like the following:

solrconfig_slave.xml:solrconfig.xml,x.xml,y.xml

This ensures that the local configuration solrconfig_slave.xml will be saved as solrconfig.xml on the

slave. All other files will be saved with their original names.

On the master server, the file name of the slave configuration file can be anything, as long as the name is

correctly identified in the confFiles string; then it will be saved as whatever file name appears after the colon

':'.

Configuring the Replication RequestHandler on a Slave Server

The code below shows how to configure a ReplicationHandler on a slave.

<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <str name="masterUrl">http://remote_host:port/solr/core_name/replication</str>
    <str name="pollInterval">00:00:20</str>
    <str name="compression">internal</str>
    <str name="httpConnTimeout">5000</str>
    <str name="httpReadTimeout">10000</str>
    <str name="httpBasicAuthUser">username</str>
    <str name="httpBasicAuthPassword">password</str>
  </lst>
</requestHandler>

Setting Up a Repeater with the ReplicationHandler

A master may be able to serve only so many slaves without affecting performance. Some organizations have

deployed slave servers across multiple data centers. If each slave downloads the index from a remote data

center, the resulting download may consume too much network bandwidth. To avoid performance degradation in

cases like this, you can configure one or more slaves as repeaters. A repeater is simply a node that acts as both

a master and a slave.

To configure a server as a repeater, the definition of the Replication requestHandler in the solrconfig.xml file must include file lists of use for both masters and slaves.

Be sure to set the replicateAfter parameter to commit, even if replicateAfter is set to optimize

on the main master. This is because on a repeater (or any slave), a commit is called only after the index is

downloaded. The optimize command is never called on slaves.

Optionally, one can configure the repeater to fetch compressed files from the master through the compression parameter to reduce the index download time.

Here is an example of a ReplicationHandler configuration for a repeater:

<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <str name="replicateAfter">commit</str>
    <str name="confFiles">schema.xml,stopwords.txt,synonyms.txt</str>
  </lst>
  <lst name="slave">
    <str name="masterUrl">http://master.solr.company.com:8983/solr/core_name/replication</str>
    <str name="pollInterval">00:00:60</str>
  </lst>
</requestHandler>

Commit and Optimize Operations

The replicateAfter parameter can accept multiple arguments. For example:

<str name="replicateAfter">startup</str>
<str name="replicateAfter">commit</str>
<str name="replicateAfter">optimize</str>

Slave Replication

A detailed walkthrough of the slave replication process.

The master is totally unaware of the slaves. The slave continuously keeps polling the master (depending on the

pollInterval parameter) to check the current index version of the master. If the slave finds out that the

master has a newer version of the index it initiates a replication process. The steps are as follows:

The slave issues a filelist command to get the list of the files. This command returns the names of

the files as well as some metadata (for example, size, a lastmodified timestamp, an alias if any).

The slave checks with its own index if it has any of those files in the local index. It then runs the filecontent

command to download the missing files. This uses a custom format (akin to the HTTP chunked encoding)

to download the full content or a part of each file. If the connection breaks in between, the download

resumes from the point it failed. At any point, the slave tries 5 times before giving up a replication

altogether.

The files are downloaded into a temp directory, so that if either the slave or the master crashes during the

download process, no files will be corrupted. Instead, the current replication will simply abort.

After the download completes, all the new files are moved to the live index directory and the file's

timestamp is same as its counterpart on the master.

A commit command is issued on the slave by the Slave's ReplicationHandler and the new index is loaded.
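The resume-and-retry download behaviour described above can be sketched like this (illustrative logic only, not Solr's implementation; `read_chunk` is a hypothetical callback that returns bytes starting at an offset):

```python
def fetch_file(read_chunk, total_size, max_retries=5):
    """Download a file, resuming from the last good offset on failure.

    read_chunk(offset) returns the next chunk of bytes starting at 'offset'.
    After max_retries failed attempts the download is abandoned, mirroring
    the slave's give-up-after-5-tries behaviour.
    """
    data = bytearray()
    failures = 0
    while len(data) < total_size:
        try:
            data.extend(read_chunk(len(data)))  # resume where we left off
        except IOError:
            failures += 1
            if failures >= max_retries:
                raise  # abort this replication attempt
    return bytes(data)
```

Because the offset is simply the number of bytes already received, a broken connection costs only the in-flight chunk, not the whole file.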

Replicating Configuration Files

To replicate configuration files, list them using the confFiles parameter. Only files found in the conf directory of the master's Solr instance will be replicated.

Solr replicates configuration files only when the index itself is replicated. That means even if a configuration file is

changed on the master, that file will be replicated only after there is a new commit/optimize on master's index.

Unlike the index files, where the timestamp is good enough to figure out if they are identical, configuration files

are compared against their checksum. The schema.xml files (on master and slave) are judged to be identical if

their checksums are identical.
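The checksum comparison can be sketched as follows; MD5 is an assumption for illustration, since the text does not say which checksum Solr uses:

```python
import hashlib


def config_files_identical(master_bytes, slave_bytes):
    """Compare two configuration files by checksum rather than timestamp,
    as described above for schema.xml and similar files."""
    def digest(b):
        return hashlib.md5(b).hexdigest()
    return digest(master_bytes) == digest(slave_bytes)


print(config_files_identical(b"<schema/>", b"<schema/>"))  # prints True
```

A checksum catches content changes that a preserved timestamp would hide, which is why config files are treated differently from index files.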

As a precaution when replicating configuration files, Solr copies configuration files to a temporary directory before

moving them into their ultimate location in the conf directory. The old configuration files are then renamed and

kept in the same conf/ directory. The ReplicationHandler does not automatically clean up these old files.

If a replication involved downloading of at least one configuration file, the ReplicationHandler issues a

core-reload command instead of a commit command.

Automatic replication of configuration files to slaves.

Resolving Corruption Issues on Slave Server

If documents are added to the slave, then the slave is no longer in sync with its master. However, the slave will

not undertake any action to put itself in sync, until the master has new index data. When a commit operation

takes place on the master, the index version of the master becomes different from that of the slave. The slave

then fetches the list of files and finds that some of the files present on the master are also present in the local

index but with different sizes and timestamps. This means that the master and slave have incompatible indexes.

To correct this problem, the slave then copies all the index files from master to a new index directory and asks

the core to load the fresh index from the new directory.

How a slave recovers when its index conflicts with the master's.

HTTP API Commands for the ReplicationHandler

You can use HTTP commands to control the ReplicationHandler's operations.

Distribution and Optimization

Optimizing an index is not something most users should generally worry about - but in particular users should be

aware of the impacts of optimizing an index when using the ReplicationHandler.

The time required to optimize a master index can vary dramatically. A small index may be optimized in minutes.

A very large index may take hours. The variables include the size of the index and the speed of the hardware.

Distributing a newly optimized index may take only a few minutes or up to an hour or more, again depending on

the size of the index and the performance capabilities of network connections and disks. During optimization the

machine is under load and does not process queries very well. Given a schedule of updates being driven a few

times an hour to the slaves, we cannot run an optimize with every committed snapshot.

Copying an optimized index means that the entire index will need to be transferred during the next snappull. This is a large expense, but not nearly as huge as running the optimize everywhere. Consider this example: on a three-slave one-master configuration, distributing a newly-optimized index takes approximately 80 seconds total.

Rolling the change across a tier would require approximately ten minutes per machine (or machine group). If this

optimize were rolled across the query tier, and if each slave node being optimized were disabled and not

receiving queries, a rollout would take at least twenty minutes and potentially as long as an hour and a half.

Additionally, the files would need to be synchronized so that, following the optimize, snappull would not think

that the independently optimized files were different in any way. This would also leave the door open to

independent corruption of indexes instead of each being a perfect copy of the master.

Optimizing on the master allows for a straight-forward optimization operation. No query slaves need to be taken

out of service. The optimized index can be distributed in the background as queries are being normally serviced.

The optimization can occur at any time convenient to the application providing index updates.

While optimizing may have some benefits in some situations, a rapidly changing index will not retain those

benefits for long, and since optimization is an intensive process, it may be better to consider other options, such

as lowering the merge factor (discussed in the section onIndex Configuration).

Combining Distribution and Replication

----> Just use SolrCloud instead.

Merging Indexes

If you need to combine indexes from two different projects or from multiple servers previously used in a

distributed configuration, you can use either the IndexMergeTool included in lucene-misc or the CoreAdminHandler.

To merge indexes, they must meet these requirements:

The two indexes must be compatible: their schemas should include the same fields and they should

analyze fields the same way.

The indexes must not include duplicate data.

Optimally, the two indexes should be built using the same schema.

Using IndexMergeTool

Merging:

To merge the indexes, do the following:

Make sure that both indexes you want to merge are closed.

Issue this command:

java -cp $SOLR/server/solr-webapp/webapp/WEB-INF/lib/lucene-core-VERSION.jar:$SOLR/server/solr-webapp/webapp/WEB-INF/lib/lucene-misc-VERSION.jar \
  org/apache/lucene/misc/IndexMergeTool \
  /path/to/newindex \
  /path/to/old/index1 \
  /path/to/old/index2

This will create a new index at /path/to/newindex that contains both index1 and index2.

Copy this new directory to the location of your application's Solr index (move the old one aside first, of course) and start Solr.

Using CoreAdmin

The MERGEINDEXES command of the CoreAdminHandler can be used to merge indexes into a new core, either from one or more arbitrary indexDir directories or by merging from one or more existing srcCore core names. See the CoreAdminHandler section for details.

Client APIs

This section discusses the available client APIs for Solr. It covers the following topics:

Introduction to Client APIs: A conceptual overview of Solr client APIs.

Choosing an Output Format: Information about choosing a response format in Solr.

Using JavaScript: Explains why a client API is not needed for JavaScript responses.

Using Python: Information about Python and JSON responses.

Client API Lineup: A list of all Solr Client APIs, with links.

Using SolrJ: Detailed information about SolrJ, an API for working with Java applications.

Using Solr From Ruby: Detailed information about using Solr with Ruby applications.

MBean Request Handler: Describes the MBean request handler for programmatic access to Solr server statistics

and information.

Introduction to Client APIs

At its heart, Solr is a Web application, but because it is built on open protocols, any type of client application can

use Solr.

HTTP is the fundamental protocol used between client applications and Solr. The client makes a request and

Solr does some work and provides a response. Clients use requests to ask Solr to do things like perform queries

or index documents.

Client applications can reach Solr by creating HTTP requests and parsing the HTTP responses. Client APIs

encapsulate much of the work of sending requests and parsing responses, which makes it much easier to write

client applications.

Clients use Solr's five fundamental operations to work with Solr. The operations are query, index, delete, commit, and optimize.

Queries are executed by creating a URL that contains all the query parameters. Solr examines the request URL,

performs the query, and returns the results. The other operations are similar, although in certain cases the HTTP

request is a POST operation and contains information beyond whatever is included in the request URL. An index

operation, for example, may contain a document in the body of the request.
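The URL-construction step above can be sketched with nothing but the standard library; `solr_select_url` is a hypothetical helper:

```python
from urllib.parse import urlencode


def solr_select_url(base, collection, **params):
    """Build a Solr query URL. Solr parses the parameters from the URL,
    runs the query, and returns results in the format chosen by 'wt'."""
    return "%s/solr/%s/select?%s" % (base, collection, urlencode(params))


print(solr_select_url("http://localhost:8983", "techproducts",
                      q="SKU:ABC", wt="json"))
```

Client APIs such as SolrJ do exactly this kind of construction and parsing on your behalf.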

Solr also features an EmbeddedSolrServer that offers a Java API without requiring an HTTP connection. For

details, see Using SolrJ.

Choosing an Output Format

Many programming environments are able to send HTTP requests and retrieve responses. Parsing the

responses is a slightly more thorny problem. Fortunately, Solr makes it easy to choose an output format that will

be easy to handle on the client side.

Specify a response format using the wt parameter in a query. The available response formats are documented in

Response Writers.

Most client APIs hide this detail for you, so for many types of client applications, you won't ever have to specify a

wt parameter. In JavaScript, however, the interface to Solr is a little closer to the metal, so you will need to add

this parameter yourself.

Client API Lineup

Using JavaScript

Using Solr from JavaScript clients is so straightforward that it deserves a special mention. In fact, it is so

straightforward that there is no client API. You don't need to install any packages or configure anything.

HTTP requests can be sent to Solr using the standard XMLHttpRequest mechanism.

Out of the box, Solr can send JavaScript Object Notation (JSON) responses, which are easily interpreted in

JavaScript. Just add wt=json to the request URL to have responses sent as JSON.

For more information and an excellent example, read the SolJSON page on the Solr Wiki:

http://wiki.apache.org/solr/SolJSON

Using SolrJ

At this point I have read through the official documentation fairly carefully, though it only leaves a general impression. There is a lot of material, and it is worth a read if you work with Solr.
