
SmartChineseAnalyzer Source Code Analysis


smartcn is the Chinese word-segmentation tool bundled with Lucene; it derives from the ICTCLAS Chinese segmentation system of the Chinese Academy of Sciences. Research on the ICTCLAS algorithm can be found here. Analysis of SmartChineseAnalyzer's behavior can start from either the reusableTokenStream or the tokenStream method. The former can be reused to improve performance (at a glance, some instances are kept in a ThreadLocal, so the next call avoids rebuilding them). The notes below take reusableTokenStream as the entry point.

The reusableTokenStream method checks whether a SavedStreams instance already exists; if not, it creates one, otherwise it resets some state and returns it directly. I will only look at the creation path. Creating the streams object does no real work yet:
streams = new SavedStreams();
setPreviousTokenStream(streams);
streams.tokenStream = new SentenceTokenizer(reader);
streams.filteredTokenStream = new WordTokenFilter(streams.tokenStream);
streams.filteredTokenStream = new PorterStemFilter(streams.filteredTokenStream);
if (!stopWords.isEmpty()) {
  streams.filteredTokenStream = new StopFilter(
      StopFilter.getEnablePositionIncrementsVersionDefault(matchVersion),
      streams.filteredTokenStream, stopWords, false);
}

This method returns a TokenStream instance, and calling its incrementToken() method yields the tokens one by one. Tracing incrementToken leads to PorterStemFilter.incrementToken(), which in turn calls WordTokenFilter.incrementToken().
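To see the chain end to end, the analyzer can be driven directly. A minimal sketch against the Lucene 3.0-era API described in this post (the class name SmartcnDemo, the field name "content", and the sample text are mine):

import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.util.Version;

public class SmartcnDemo {
  public static void main(String[] args) throws Exception {
    SmartChineseAnalyzer analyzer = new SmartChineseAnalyzer(Version.LUCENE_30);
    // reusableTokenStream caches the SavedStreams in a ThreadLocal, as noted above.
    TokenStream ts = analyzer.reusableTokenStream("content", new StringReader("我是中国人"));
    TermAttribute termAtt = ts.addAttribute(TermAttribute.class);
    ts.reset();
    while (ts.incrementToken()) { // pulls one token at a time through the filter chain
      System.out.println(termAtt.term());
    }
    ts.end();
  }
}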

WordTokenFilter.incrementToken() first calls SentenceTokenizer.incrementToken() to split off a sentence. A sentence ends when a full stop or another sentence-ending punctuation mark (defined in Utility.PUNCTION) is reached, or when the end of the text is reached; whitespace characters (Utility.SPACES) are skipped.
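As a quick check, SentenceTokenizer can be exercised on its own. A sketch reusing the imports above, plus org.apache.lucene.analysis.Tokenizer and org.apache.lucene.analysis.cn.smart.SentenceTokenizer (the expected output is my reading of the code, not verified):

Tokenizer sentences = new SentenceTokenizer(new StringReader("今天天气不错。我们出去玩吧!"));
TermAttribute term = sentences.addAttribute(TermAttribute.class);
sentences.reset();
while (sentences.incrementToken()) {
  // One token per sentence; presumably "今天天气不错。" then "我们出去玩吧!".
  System.out.println(term.term());
}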

Once a sentence has been cut off, segmentSentence() is called to segment it into words. The most important step inside segmentSentence() is the call to HHMMSegmenter.process(), which is the core of the algorithm; as the class name indicates, it is based on a Hidden Markov Model (HMM). Here is the method:
/**
 * Return a list of {@link SegToken} representing the best segmentation of a sentence
 * @param sentence input sentence
 * @return best segmentation as a {@link List}
 */
public List<SegToken> process(String sentence) {
  SegGraph segGraph = createSegGraph(sentence);
  BiSegGraph biSegGraph = new BiSegGraph(segGraph);
  List<SegToken> shortPath = biSegGraph.getShortPath();
  return shortPath;
}

There are three steps here: first build the SegGraph, second build the BiSegGraph, and third find the shortest path through it as the result. For the two graphs, see the corresponding part of the ICTCLAS algorithm description. Roughly, the first graph holds all candidate words, and the second holds the possible combinations of adjacent words (the edges of a directed graph). Nodes in both graphs carry weights; the weights in the second graph are derived from bigram frequencies of word pairs via a smoothing formula. Finding the shortest path amounts to finding the best word sequence. A sketch for driving this method from outside follows, and then the code for each step is annotated.
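A minimal sketch that drives HHMMSegmenter directly (the class is public in the 3.x source; the class name HhmmDemo and the sample sentence are mine):

import java.util.List;

import org.apache.lucene.analysis.cn.smart.hhmm.HHMMSegmenter;
import org.apache.lucene.analysis.cn.smart.hhmm.SegToken;

public class HhmmDemo {
  public static void main(String[] args) {
    HHMMSegmenter segmenter = new HHMMSegmenter();
    // The returned list is the shortest path through the BiSegGraph,
    // still including the sentence-begin and sentence-end marker tokens.
    List<SegToken> best = segmenter.process("我是中国人");
    for (SegToken t : best) {
      System.out.println(new String(t.charArray) + " [" + t.startOffset + "," + t.endOffset + ")");
    }
  }
}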

Step one: org.apache.lucene.analysis.cn.smart.hhmm.HHMMSegmenter.createSegGraph(String). The main field of SegGraph is a Map whose keys are integers and whose values are lists of SegToken:

private Map<Integer, ArrayList<SegToken>> tokenListTable = new HashMap<Integer, ArrayList<SegToken>>();

Its keys are the start offsets of the segmented words within the source string; another field, maxStart = -1, records the largest start offset seen so far.
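For reference, addToken() simply buckets each token under its start offset and keeps maxStart up to date. Paraphrased from the SegGraph source (details assumed):

public void addToken(SegToken token) {
  int s = token.startOffset;
  if (!isStartExist(s)) {
    // First token at this start offset: create its bucket.
    ArrayList<SegToken> newList = new ArrayList<SegToken>();
    newList.add(token);
    tokenListTable.put(Integer.valueOf(s), newList);
  } else {
    // Otherwise append to the existing bucket.
    tokenListTable.get(Integer.valueOf(s)).add(token);
  }
  if (s > maxStart)
    maxStart = s;
}

Back to createSegGraph():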
/**
 * Create the {@link SegGraph} for a sentence.
 *
 * @param sentence
 *          input sentence, without start and end markers
 * @return {@link SegGraph} corresponding to the input sentence.
 */
private SegGraph createSegGraph(String sentence) {
  int i = 0, j;
  int length = sentence.length();
  int foundIndex;
  int[] charTypeArray = getCharTypes(sentence);
  StringBuilder wordBuf = new StringBuilder();
  SegToken token;
  int frequency = 0; // the number of times the word appears.
  boolean hasFullWidth;
  int wordType;
  char[] charArray;

  SegGraph segGraph = new SegGraph();
  /* Walk the sentence from start to end. */
  while (i < length) {
    hasFullWidth = false;
    switch (charTypeArray[i]) {
      case CharType.SPACE_LIKE:
        /* Skip whitespace. */
        i++;
        break;
      case CharType.HANZI:
        /* Handle Chinese characters (Hanzi). */
        j = i + 1;
        wordBuf.delete(0, wordBuf.length());
        // It doesn't matter whether a single Hanzi can form a phrase or not:
        // it is always stored in the SegGraph as a candidate, otherwise the
        // word division would break.
        wordBuf.append(sentence.charAt(i));
        charArray = new char[] { sentence.charAt(i) };
        frequency = wordDict.getFrequency(charArray);
        token = new SegToken(charArray, i, j, WordType.CHINESE_WORD, frequency);
        /* Add the single Hanzi. */
        segGraph.addToken(token);

        foundIndex = wordDict.getPrefixMatch(charArray);
        /* Extend to the right, adding every word that can be formed to the
           graph; foundIndex != -1 means the buffer may still form a word. */
        while (j <= length && foundIndex != -1) {
          /* The buffer already forms a word ("may" above, because it could
             also just be a prefix of a word). */
          if (wordDict.isEqual(charArray, foundIndex) && charArray.length > 1) {
            // It is the phrase we are looking for; in other words, we have
            // found a phrase SegToken from i to j. It is not a monosyllabic
            // (single-character) word.
            frequency = wordDict.getFrequency(charArray);
            token = new SegToken(charArray, i, j, WordType.CHINESE_WORD, frequency);
            segGraph.addToken(token);
          }

          /* Skip whitespace while extending (smartcn can still correctly
             segment words that are separated by spaces). */
          while (j < length && charTypeArray[j] == CharType.SPACE_LIKE)
            j++;

          /* If the next character is a Hanzi, append it and test again. */
          if (j < length && charTypeArray[j] == CharType.HANZI) {
            wordBuf.append(sentence.charAt(j));
            charArray = new char[wordBuf.length()];
            wordBuf.getChars(0, charArray.length, charArray, 0);
            // charArray has already been found (foundIndex != -1) as a prefix
            // before. Therefore, after being lengthened it can only appear
            // after foundIndex, so start searching from there.
            foundIndex = wordDict.getPrefixMatch(charArray, foundIndex);
            j++;
          } else {
            break; /* End of sentence, or not a Hanzi. */
          }
        }
        i++;
        break;
      /* The remaining character types are handled below; I will not analyze them further. */
      case CharType.FULLWIDTH_LETTER:
        hasFullWidth = true; /* intentional fall-through */
      case CharType.LETTER:
        j = i + 1;
        while (j < length
            && (charTypeArray[j] == CharType.LETTER || charTypeArray[j] == CharType.FULLWIDTH_LETTER)) {
          if (charTypeArray[j] == CharType.FULLWIDTH_LETTER)
            hasFullWidth = true;
          j++;
        }
        // Found a token from i to j: a LETTER char string.
        charArray = Utility.STRING_CHAR_ARRAY;
        frequency = wordDict.getFrequency(charArray);
        wordType = hasFullWidth ? WordType.FULLWIDTH_STRING : WordType.STRING;
        token = new SegToken(charArray, i, j, wordType, frequency);
        segGraph.addToken(token);
        i = j;
        break;
      case CharType.FULLWIDTH_DIGIT:
        hasFullWidth = true; /* intentional fall-through */
      case CharType.DIGIT:
        j = i + 1;
        while (j < length
            && (charTypeArray[j] == CharType.DIGIT || charTypeArray[j] == CharType.FULLWIDTH_DIGIT)) {
          if (charTypeArray[j] == CharType.FULLWIDTH_DIGIT)
            hasFullWidth = true;
          j++;
        }
        // Found a token from i to j: a NUMBER char string.
        charArray = Utility.NUMBER_CHAR_ARRAY;
        frequency = wordDict.getFrequency(charArray);
        wordType = hasFullWidth ? WordType.FULLWIDTH_NUMBER : WordType.NUMBER;
        token = new SegToken(charArray, i, j, wordType, frequency);
        segGraph.addToken(token);
        i = j;
        break;
      case CharType.DELIMITER:
        j = i + 1;
        // No need to look up the weight for punctuation;
        // picking the highest frequency works.
        frequency = Utility.MAX_FREQUENCE;
        charArray = new char[] { sentence.charAt(i) };
        token = new SegToken(charArray, i, j, WordType.DELIMITER, frequency);
        segGraph.addToken(token);
        i = j;
        break;
      default:
        j = i + 1;
        // Treat an unrecognized char as an unknown string.
        // For example, any symbol not in GB2312 falls into this case.
        charArray = Utility.STRING_CHAR_ARRAY;
        frequency = wordDict.getFrequency(charArray);
        token = new SegToken(charArray, i, j, WordType.STRING, frequency);
        segGraph.addToken(token);
        i = j;
        break;
    }
  }

  // Add two more tokens: "beginning xx beginning"
  /* Prepend a sentence-begin marker to simplify later processing
     (see the ICTCLAS algorithm description). */
  charArray = Utility.START_CHAR_ARRAY;
  frequency = wordDict.getFrequency(charArray);
  token = new SegToken(charArray, -1, 0, WordType.SENTENCE_BEGIN, frequency);
  segGraph.addToken(token);

  // "end xx end"
  /* Likewise, append a sentence-end marker. */
  charArray = Utility.END_CHAR_ARRAY;
  frequency = wordDict.getFrequency(charArray);
  token = new SegToken(charArray, length, length + 1, WordType.SENTENCE_END, frequency);
  segGraph.addToken(token);

  return segGraph;
}
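To make the flow concrete, here is a hypothetical trace of the tokens added for the input 中国人, assuming the dictionary contains the two-character entries 中国 and 国人:

start=-1 : [SENTENCE_BEGIN (-1,0)]
start=0  : [中 (0,1), 中国 (0,2)]   // the single Hanzi is always added; the prefix match extends it
start=1  : [国 (1,2), 国人 (1,3)]
start=2  : [人 (2,3)]
start=3  : [SENTENCE_END (3,4)]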

Step two: org.apache.lucene.analysis.cn.smart.hhmm.BiSegGraph.generateBiSegGraph(SegGraph), which builds the BiSegGraph.

In its constructor, BiSegGraph first generates an index over segGraph, that is, it assigns every token an index number, which is used later. It then calls generateBiSegGraph() to build the path graph.
/*
 * Generate a BiSegGraph based upon a SegGraph
 */
private void generateBiSegGraph(SegGraph segGraph) {
  double smooth = 0.1;
  int wordPairFreq = 0;
  int maxStart = segGraph.getMaxStart();
  double oneWordFreq, weight, tinyDouble = 1.0 / Utility.MAX_FREQUENCE;

  int next;
  char[] idBuffer;
  // get the list of tokens ordered and indexed
  segTokenList = segGraph.makeIndex(); /* Isn't this redundant? The constructor has already done it. */
  // Because the beginning position of startToken is -1,
  // startToken can be obtained when key = -1.
  int key = -1;
  List<SegToken> nextTokens = null;
  /* Walk from start to end. */
  while (key < maxStart) {
    /* Check whether any token starts at this position (compare with the source above). */
    if (segGraph.isStartExist(key)) {

      List<SegToken> tokenList = segGraph.getStartList(key);

      /*
       * For every token starting at the given key (i.e., start offset), walk
       * the tokens that can immediately follow it, and store each adjacency
       * (token pair) in the graph. A token pair is in effect an edge between
       * the two tokens.
       */
      // Calculate all tokens for a given key.
      for (SegToken t1 : tokenList) {
        oneWordFreq = t1.weight;
        next = t1.endOffset;
        nextTokens = null;
        // Find the next corresponding token.
        // For example: "Sunny seashore"; if the present token is
        // "sunny", the next one should be "sea" or "seashore".
        // If we cannot find the next token, go to the end and
        // repeat the same cycle.
        /* Find the start position of the tokens that can follow. */
        while (next <= maxStart) {
          // Because the beginning position of endToken is
          // sentenceLen, equal to sentenceLen finds endToken.
          if (segGraph.isStartExist(next)) {
            nextTokens = segGraph.getStartList(next);
            break;
          }
          next++;
        }
        if (nextTokens == null) {
          break;
        }
        /* Walk the adjacent tokens. */
        for (SegToken t2 : nextTokens) {
          idBuffer = new char[t1.charArray.length + t2.charArray.length + 1];
          System.arraycopy(t1.charArray, 0, idBuffer, 0, t1.charArray.length);
          idBuffer[t1.charArray.length] = BigramDictionary.WORD_SEGMENT_CHAR;
          System.arraycopy(t2.charArray, 0, idBuffer,
              t1.charArray.length + 1, t2.charArray.length);

          // Frequency of the two linked words as a pair.
          wordPairFreq = bigramDict.getFrequency(idBuffer);

          /* I have not yet worked through the smoothing used to compute the
             weight; my math is long forgotten, embarrassingly. To be analyzed
             another time (see the note after this method). */

          // Smoothing

          // -log{a*P(Ci-1)+(1-a)P(Ci|Ci-1)}, note 0 < a < 1
          weight = -Math.log(smooth
              * (1.0 + oneWordFreq)
              / (Utility.MAX_FREQUENCE + 0.0)
              + (1.0 - smooth)
              * ((1.0 - tinyDouble) * wordPairFreq
                  / (1.0 + oneWordFreq) + tinyDouble));

          SegTokenPair tokenPair = new SegTokenPair(idBuffer,
              t1.index, t2.index, weight);
          this.addSegTokenPair(tokenPair);
        }
      }
    }
    key++;
  }
}
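Although I deferred the smoothing above, the expression can be read straight off the code. With a = smooth = 0.1, f(w1) = oneWordFreq (the unigram frequency of the first word), f(w1,w2) = wordPairFreq (the bigram frequency of the pair), and t = tinyDouble = 1/MAX_FREQUENCE:

weight = -log( a*(1 + f(w1))/MAX_FREQUENCE + (1 - a)*((1 - t)*f(w1,w2)/(1 + f(w1)) + t) )

My reading: the second term approximates the conditional probability P(w2|w1); the first mixes in the unigram probability of w1 so that unseen bigrams never get probability zero, and t keeps the conditional term itself nonzero. The negative log turns a product of probabilities into a sum of costs, which is exactly why the most probable segmentation can be found as a shortest path in the next step.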

Step three: org.apache.lucene.analysis.cn.smart.hhmm.BiSegGraph.getShortPath(), which finds the best path. This is essentially a dynamic programming algorithm; any data structures and algorithms text covers the idea.
/**
 * Find the shortest path with the Viterbi algorithm.
 *
 * @return {@link List}
 */
public List<SegToken> getShortPath() {
  int current;
  int nodeCount = getToCount();
  List<PathNode> path = new ArrayList<PathNode>();
  PathNode zeroPath = new PathNode();
  zeroPath.weight = 0;
  zeroPath.preNode = 0;
  path.add(zeroPath); /* A dummy head node, serving as the starting "origin". */
  /*
   * From start to end, compute the best path to each node and record it.
   * Later steps reuse the earlier results, so once this loop finishes, the
   * best path can be recovered by walking backwards from the last node.
   */
  for (current = 1; current <= nodeCount; current++) {
    double weight;
    List<SegTokenPair> edges = getToList(current);

    double minWeight = Double.MAX_VALUE;
    SegTokenPair minEdge = null;
    for (SegTokenPair edge : edges) {
      weight = edge.weight;
      PathNode preNode = path.get(edge.from);
      if (preNode.weight + weight < minWeight) {
        minWeight = preNode.weight + weight;
        minEdge = edge;
      }
    }
    PathNode newNode = new PathNode();
    newNode.weight = minWeight;
    newNode.preNode = minEdge.from;
    path.add(newNode);
  }

  // Calculate PathNodes
  int preNode, lastNode;
  lastNode = path.size() - 1;
  current = lastNode;
  List<Integer> rpath = new ArrayList<Integer>(); /* This feels a bit redundant. */
  List<SegToken> resultPath = new ArrayList<SegToken>();

  rpath.add(current);
  while (current != 0) {
    PathNode currentPathNode = path.get(current);
    preNode = currentPathNode.preNode;
    rpath.add(Integer.valueOf(preNode));
    current = preNode;
  }
  /* Why not build resultPath directly in the previous loop? Somewhat redundant? */
  for (int j = rpath.size() - 1; j >= 0; j--) {
    int id = rpath.get(j).intValue();
    SegToken t = segTokenList.get(id);
    resultPath.add(t);
  }
  return resultPath;
}
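Continuing the hypothetical 中国人 trace from step one (dictionary contents assumed as before), the backtracking follows the preNode pointers from the last node, so rpath collects token indices in reverse, something like [END, 人, 中国, BEGIN], and the final loop reverses them into resultPath = [BEGIN, 中国, 人, END]. segmentSentence() then strips the two marker tokens, so the analyzer would ultimately emit 中国 and 人.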

That is the main algorithmic flow of smartcn. HMM is reputedly a rather academic approach, and fairly involved.
