Elasticsearch Search and Related Plugins

Posted by hyperxu on 2017-07-01 08:42:01

ES Version Selection

elasticsearch-2.3.4

ES Features

SuggestionDiscovery

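SuggestionDiscovery is responsible for discovering suggestion terms.
Suggestion terms can come from product category names, brand names, product tags, high-frequency words in product names, and hot search terms. They can also be combinations such as "category + gender" and "category + tag", or custom terms added by hand.
Suggestion terms must be deduplicated when they are maintained. For example, two entries that both read “卫衣男” (men's hoodie), for instance when they differ only in whitespace, should count as the same term, and so should “Nike” and “nike”.
Because the sources of suggestion terms are usually stable, this job can run on a fairly long cycle, for example once a week.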
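
The deduplication rule can be sketched as a normalization step that produces a comparison key. This is only an illustration: the function names are hypothetical, and dropping all whitespace and folding case is an assumption about what "the same term" means here, not something the original notes specify.

```python
import unicodedata


def normalize_suggestion(term):
    """Return a canonical key used only for duplicate detection."""
    # NFKC folds full-width letters and spaces into their plain forms.
    folded = unicodedata.normalize("NFKC", term)
    # Drop all whitespace so that "卫衣 男" and "卫衣男" share one key.
    compact = "".join(folded.split())
    # casefold() makes "Nike" and "nike" compare equal.
    return compact.casefold()


def deduplicate(terms):
    """Keep the first spelling seen for each canonical key."""
    seen = {}
    for term in terms:
        seen.setdefault(normalize_suggestion(term), term)
    return list(seen.values())
```

The original spelling is kept for display, and the key is used only to decide whether two entries are duplicates.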

SuggestionCounter

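SuggestionCounter is responsible for fetching the number of products associated with each suggestion term. If needed, it can also run aggregations, for example on category and tag.
Because SuggestionCounter has to call the real search interface, it should interfere with user searches as little as possible, for example by running in the early morning and issuing its calls from a single thread.
For efficiency, it should use Elasticsearch's Multi Search API to run the counts in batches, and update the count values of the suggestion terms in the database in batches as well.
SuggestionCounter is fairly resource-intensive, so a longer cycle is worth considering. The trade-off is that the stored count can drift from what a live search would return, so the cycle has to be chosen based on the actual situation.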
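
A minimal sketch of how the batched counts could be prepared for the Multi Search API. The index name `products` and the field `name` used in the usage note are assumptions for illustration, and the helper names are hypothetical. The functions only build the request body; sending it to `/_msearch` is left to whatever HTTP client is in use.

```python
import json


def build_msearch_body(index, field, terms):
    """Build the newline-delimited body for /_msearch, one count-only search per term."""
    lines = []
    for term in terms:
        # Header line: the index this search targets.
        lines.append(json.dumps({"index": index}))
        # Body line: size 0 returns no hits, only the total in hits.total.
        lines.append(json.dumps(
            {"size": 0, "query": {"match": {field: term}}},
            ensure_ascii=False,
        ))
    # The Multi Search body must end with a newline.
    return "\n".join(lines) + "\n"


def chunked(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]
```

For example, `build_msearch_body("products", "name", batch)` can be called for each `batch` from `chunked(all_terms, 100)`, and the `hits.total` of each response is then written back to the database in one batched update.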

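SuggestionIndexRebuilder

SuggestionIndexRebuilder is responsible for rebuilding the index.
To accommodate users' search habits, Multi-fields can be used to give a suggestion term several analyzers. For example, for the suggestion term 【卫衣套头】 (pullover hoodie), Multi-fields can add a not-analyzed field, a pinyin field, a pinyin-first-letter field, and an IK-analyzed field, so that typing either 【weiyi】 or 【套头】 matches the term.
When rebuilding, the documents are added to a temporary index in batches through the bulk API, and the alias is then switched over to the new index.
The rebuilt index depends on the data produced by SuggestionCounter, so it should run on the same cycle as SuggestionCounter.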
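
The alias switch can be done in a single request to the `_aliases` endpoint, so searches never see a half-built index. A sketch of the request body follows. The alias and index names in the usage note are made up for illustration, and the helper name is hypothetical.

```python
def build_alias_swap(alias, old_index, new_index):
    """Body for POST /_aliases that moves an alias from one index to another."""
    return {
        "actions": [
            # Both actions are applied together in the same request.
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }
```

For example, `build_alias_swap("suggest", "suggest_v1", "suggest_v2")` would point the `suggest` alias at the freshly built `suggest_v2`, after which the old index can be deleted.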

SuggestionService

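SuggestionService is the service class that actually handles users' search suggestion requests.
A typical implementation first checks the cache and returns directly on a hit. Otherwise it calls Elasticsearch with a Prefix Query. Because Multi-fields were defined when the index was rebuilt, the search should combine the sub-fields with a boolQuery. If Elasticsearch returns a non-empty result, it is added to the cache and returned.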
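
The lookup flow above can be sketched as follows. The sub-field names (`term.raw`, `term.pinyin`, `term.first_letter`, `term.ik`) are assumptions standing in for whatever Multi-fields the index actually defines, and `search` is any callable that sends the query to Elasticsearch and returns a list of suggestions.

```python
SUGGEST_FIELDS = ["term.raw", "term.pinyin", "term.first_letter", "term.ik"]


def build_suggest_query(prefix, fields, size=10):
    """bool query with one prefix clause per sub-field; any single clause may match."""
    return {
        "size": size,
        "query": {
            "bool": {
                "should": [{"prefix": {field: prefix}} for field in fields],
                "minimum_should_match": 1,
            }
        },
    }


def suggest(prefix, cache, search):
    """Cache-first lookup that falls back to a prefix search."""
    if prefix in cache:
        return cache[prefix]
    results = search(build_suggest_query(prefix, SUGGEST_FIELDS))
    if results:
        # Only non-empty results are cached, as described above.
        cache[prefix] = results
    return results
```

Here `cache` is a plain dict for simplicity; a real service would more likely use a cache with expiry so that suggestions follow the rebuilt index.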

ES Configuration

Elasticsearch Configuration

elasticsearch.yml

[elk@M-WEB-098 config]$ cat elasticsearch.yml

# ======================== Elasticsearch Configuration =========================
#
# NOTE: Elasticsearch comes with reasonable defaults for most settings.
#       Before you set out to tweak and tune the configuration, make sure you
#       understand what are you trying to accomplish and the consequences.
#
# The primary way of configuring a node is via this file. This template lists
# the most important settings you may want to configure for a production cluster.
#
# Please see the documentation for further information on configuration options:
# <http://www.elastic.co/guide/en/elasticsearch/reference/current/setup-configuration.html>
#
# ---------------------------------- Cluster -----------------------------------
#
# Use a descriptive name for your cluster:
#
cluster.name: pmh_es
#
# ------------------------------------ Node ------------------------------------
#
# Use a descriptive name for the node:
#
node.name: node-1
#
# Add custom attributes to the node:
#
# node.rack: r1
#
# ----------------------------------- Paths ------------------------------------
#
# Path to directory where to store the data (separate multiple locations by comma):
#
path.data: /data/elasticsearch/data/
#
# Path to log files:
#
path.logs: /data/elasticsearch/logs/
#
# ----------------------------------- Memory -----------------------------------
#
# Lock the memory on startup:
#
bootstrap.memory_lock: true
#
# Make sure that the `ES_HEAP_SIZE` environment variable is set to about half the memory
# available on the system and that the owner of the process is allowed to use this limit.
#
# Elasticsearch performs poorly when the system is swapping the memory.
#
# ---------------------------------- Network -----------------------------------
#
# Set the bind address to a specific IP (IPv4 or IPv6):
#
network.host: 192.168.1.98
#
# Set a custom port for HTTP:
#
http.port: 9200
#
# For more information, see the documentation at:
# <http://www.elastic.co/guide/en/elasticsearch/reference/current/modules-network.html>
#
# --------------------------------- Discovery ----------------------------------
#
# Pass an initial list of hosts to perform discovery when new node is started:
# The default list of hosts is ["127.0.0.1", "[::1]"]
#
# discovery.zen.ping.unicast.hosts: ["host1", "host2"]
#
# Prevent the "split brain" by configuring the majority of nodes (total number of nodes / 2 + 1):
#
discovery.zen.ping.unicast.hosts: ["192.168.1.82", "192.168.1.98"]
discovery.zen.ping_timeout: 10s
# discovery.zen.minimum_master_nodes: 3
#
# For more information, see the documentation at:
# <http://www.elastic.co/guide/en/elasticsearch/reference/current/modules-discovery.html>
#
# ---------------------------------- Gateway -----------------------------------
#
# Block initial recovery after a full cluster restart until N nodes are started:
#
# gateway.recover_after_nodes: 3
#
# For more information, see the documentation at:
# <http://www.elastic.co/guide/en/elasticsearch/reference/current/modules-gateway.html>
#
# ---------------------------------- Various -----------------------------------
#
# Disable starting multiple nodes on a single system:
#
# node.max_local_storage_nodes: 1
#
# Require explicit names when deleting indices:
#
# action.destructive_requires_name: true
#ik
#index.analysis.analyzer.ik.type : "ik"
index:
  analysis:
    analyzer:
      ik_max_word:
          type: ik
          use_smart: false
      ik_smart:
          type: ik
          use_smart: true
# bootstrap.memory_lock (set to true in the Memory section above) locks the process in memory so that it is not swapped to disk

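Port Configuration

The port on which ES serves external requests defaults to 9200.
It can be used to reach ES plugins and management interfaces such as head.

The TCP port for traffic between nodes defaults to 9300.
It carries the communication between ES cluster nodes and also serves the internal API used by application code, for example calls from the Java API.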

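Security Configuration

ES ships without any built-in authentication or other security measures, so anyone can call our ES service API and management API with full permissions, which is extremely unsafe. For this reason we:
Closed off external network access and bound the ES service to the internal network only
Resolve the ES IP address through the local hosts file and expose a domain-name API through openresty
Hide port 9200 behind openresty, which acts as a reverse proxy for ES and gives it convenient extensibility and security
Serve kibana through openresty as a password-protected HTTP service for secure data visualization (ask the relevant staff for the password)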

JDBC Configuration

Copy the ojdbc6.jar package into /usr/local/elasticsearch-2.3.4/elasticsearch-jdbc-2.3.4.0/lib
Configure the index import scripts

oracle-pmh_es.sh

#!/bin/sh
# This example is a template to connect to Oracle
# The JDBC URL and SQL must be replaced by working ones.
DIR=/usr/local/elasticsearch-2.3.4/elasticsearch-jdbc-2.3.4.0
bin=${DIR}/bin
lib=${DIR}/lib
echo '
{
    "type" : "jdbc",
    "jdbc" : {
        "url" : "jdbc:oracle:thin:@//192.168.1.129:1521/pomoho",
        "connection_properties" : {
            "oracle.jdbc.TcpNoDelay" : false,
            "useFetchSizeWithLongColumn" : false,
            "oracle.net.CONNECT_TIMEOUT" : 10000,
            "oracle.jdbc.ReadTimeout" : 50000
        },
        "user" : "****",
        "password" : "******",
        "sql" : "select * from PMH_SOLR",
        "index" : "pmh_es_smart-test",
        "type" : "myoracle",
        "elasticsearch" : {
            "cluster" : "pmh_es",
            "host" : "192.168.1.98",
            "port" : 9300
        },
        "max_bulk_actions" : 20000,
        "max_concurrent_bulk_requests" : 8,
        "index_settings" : {
            "index" : {
                "number_of_shards" : 1,
                "number_of_replicas" : 1
            },
            "analysis" : {
                "analyzer" : {
                    "ik" : {
                        "tokenizer" : "ik_smart"
                    }
                }
            }
        },
        "type_mapping": {
                "myoracle":{
                        "properties" : {
                                "IMDBID":{
                                        "type" : "integer"
                                },
                                "FILMNAME":{
                                        "type" : "string",
                                        "analyzer" : "ik",
                                        "search_analyzer": "ik"
                                },
                                "CREATETIME":{
                                        "type":"date"
                                },
                                "CREATEUSER":{
                                        "type":"integer"
                                },
                                "PLAYCOST":{
                                        "type":"integer"
                                },
                                "STATUS":{
                                        "type":"integer"
                                },
                                "STATUSTIME":{
                                        "type":"date"
                                },
                                "SOLRTIME":{
                                        "type":"date"
                                },
                                "DEALSTATUS":{
                                        "type":"integer"
                                },
                                "FILETYPE":{
                                        "type":"string"
                                },
                                "TAGS":{
                                        "type":"string"
                                },
                                "BELONGEDFLAG":{
                                        "type":"integer"
                                },
                                "CLASSID":{
                                        "type":"integer"
                                },
                                "CLASSIDTWO":{
                                        "type":"integer"
                                },
                                "CLASSIDTHREE":{
                                        "type":"string"
                                },
                                "CLASSIDFOUR":{
                                        "type":"integer"
                                },
                                "CHANNELID":{
                                        "type":"integer"
                                },
                                "CHANNELNAME":{
                                        "type":"string"
                                },
                                "CHANNELDESC":{
                                        "type":"string"
                                }
                        }
                }
        }
    }
}
' | java \
    -cp "${lib}/*" \
    -Dlog4j.configurationFile=${bin}/log4j2.xml \
    org.xbib.tools.Runner \
    org.xbib.tools.JDBCImporter

oracle-pmh_mhh_deltaImport.sh

#!/bin/sh
# This example is a template to connect to Oracle
# The JDBC URL and SQL must be replaced by working ones.
DIR=/usr/local/elasticsearch-2.3.4/elasticsearch-jdbc-2.3.4.0
bin=${DIR}/bin
lib=${DIR}/lib
echo '
{
    "type" : "jdbc",
    "jdbc" : {
        "url" : "jdbc:oracle:thin:@//IP:PORT/INSTANCE",
        "connection_properties" : {
            "oracle.jdbc.TcpNoDelay" : false,
            "useFetchSizeWithLongColumn" : false,
            "oracle.net.CONNECT_TIMEOUT" : 10000,
            "oracle.jdbc.ReadTimeout" : 50000
        },
        "user" : "****",
        "password" : "****",
        "statefile" : "statefile-PMH_ES_MHH.json",
        "schedule" : "0 55 0/1 * * ?",
        "sql" : [
                {
                "statement" : "select * from PMH_MHH_SLORUSER where CREATETIME > ?",
                "parameter" : ["$metrics.lastexecutionstart"]
                }
],
        "index" : "pmh_es_mhh",
        "type" : "myoracle",
        "elasticsearch" : {
            "cluster" : "pmh_es",
            "host" : "192.168.1.82",
            "port" : 9300
        },
        "max_bulk_actions" : 20000,
        "max_concurrent_bulk_requests" : 8,
        "index_settings" : {
            "index" : {
                "number_of_shards" : 1,
                "number_of_replicas" : 1
            },
            "analysis" : {
                "analyzer" : {
                    "ik" : {
                        "tokenizer" : "ik_smart",
                        "filter" : ["full_pinyin_no_space","full_pinyin_with_space","first_letter_pinyin"]
                    }
                },
                "filter" : {
                    "full_pinyin_no_space" : {
                        "type" : "pinyin",
                        "first_letter" : "none",
                        "padding_char" : ""
                    },
                    "full_pinyin_with_space" : {
                        "type" : "pinyin",
                        "first_letter" : "none",
                        "padding_char" : " "
                    },
                    "first_letter_pinyin" : {
                        "type" : "pinyin",
                        "first_letter" : "only",
                        "padding_char" : ""
                    }
                }
            }
        },
        "type_mapping": {
                "myoracle":{
                        "properties" : {
                                "USERID":{
                                        "type" : "integer"
                                },
                                "NICKNAME":{
                                        "type" : "string",
                                        "analyzer" : "ik",
                                        "search_analyzer": "ik"
                                },
                                "USERTYPE":{
                                        "type":"integer"
                                },
                                "HEADIMAGE":{
                                        "type":"string"
                                },
                                "REMARK":{
                                        "type":"string"
                                },
                                "CREATETIME":{
                                        "type":"date"
                                },
                                "STATUS":{
                                        "type":"integer"
                                }
                        }
                }
        }
    }
}
' | java \
    -cp "${lib}/*" \
    -Dlog4j.configurationFile=${bin}/log4j2.xml \
    org.xbib.tools.Runner \
    org.xbib.tools.JDBCImporter

oracle-pmh_es_nopinyin_deltaImport.sh

#!/bin/sh
# This example is a template to connect to Oracle
# The JDBC URL and SQL must be replaced by working ones.
DIR=/usr/local/elasticsearch-2.3.4/elasticsearch-jdbc-2.3.4.0
bin=${DIR}/bin
lib=${DIR}/lib
echo '
{
    "type" : "jdbc",
    "jdbc" : {
        "url" : "jdbc:oracle:thin:@//IP:PORT/INSTANCE",
        "connection_properties" : {
            "oracle.jdbc.TcpNoDelay" : false,
            "useFetchSizeWithLongColumn" : false,
            "oracle.net.CONNECT_TIMEOUT" : 10000,
            "oracle.jdbc.ReadTimeout" : 50000
        },
        "user" : "****",
        "password" : "****",
        "statefile" : "statefile-PMH_SOLR_NOPY.json",
        "schedule" : "0 15 0/1 * * ?",
        "sql" : [
                {
                "statement" : "select * from PMH_SOLR where SOLRTIME > ?",
                "parameter" : ["$metrics.lastexecutionstart"]
                }
],
        "index" : "pmh_es_so_nopy",
        "type" : "myoracle",
        "elasticsearch" : {
            "cluster" : "pmh_es",
            "host" : "192.168.1.82",
            "port" : 9300
        },
        "max_bulk_actions" : 20000,
        "max_concurrent_bulk_requests" : 8,
        "index_settings" : {
            "index" : {
                "number_of_shards" : 1,
                "number_of_replicas" : 1
            },
            "analysis" : {
                "analyzer" : {
                    "ik" : {
                        "tokenizer" : "ik_smart"
                    }
                }
            }
        },
        "type_mapping": {
                "myoracle":{
                        "properties" : {
                                "IMDBID":{
                                        "type" : "integer"
                                },
                                "FILMNAME":{
                                        "type" : "string",
                                        "analyzer" : "ik",
                                        "search_analyzer": "ik"
                                },
                                "CREATETIME":{
                                        "type":"date"
                                },
                                "CREATEUSER":{
                                        "type":"integer"
                                },
                                "PLAYCOST":{
                                        "type":"integer"
                                },
                                "STATUS":{
                                        "type":"integer"
                                },
                                "STATUSTIME":{
                                        "type":"date"
                                },
                                "SOLRTIME":{
                                        "type":"date"
                                },
                                "DEALSTATUS":{
                                        "type":"integer"
                                },
                                "FILETYPE":{
                                        "type":"string"
                                },
                                "TAGS":{
                                        "type":"string"
                                },
                                "BELONGEDFLAG":{
                                        "type":"integer"
                                },
                                "CLASSID":{
                                        "type":"integer"
                                },
                                "CLASSIDTWO":{
                                        "type":"integer"
                                },
                                "CLASSIDTHREE":{
                                        "type":"string"
                                },
                                "CLASSIDFOUR":{
                                        "type":"integer"
                                },
                                "CHANNELID":{
                                        "type":"integer"
                                },
                                "CHANNELNAME":{
                                        "type":"string"
                                },
                                "CHANNELDESC":{
                                        "type":"string"
                                }
                        }
                }
        }
    }
}
' | java \
    -cp "${lib}/*" \
    -Dlog4j.configurationFile=${bin}/log4j2.xml \
    org.xbib.tools.Runner \
    org.xbib.tools.JDBCImporter

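Indexes

Shards

Once your index is configured in an ElasticSearch cluster, be aware that you cannot change its number of primary shards while the cluster is running. Even if you later find that the shard count needs adjusting, your only option is to create a new index and reindex the data (reindexing takes time, but at least it can be done without downtime).
Configuring primary shards is much like partitioning a hard disk: when you repartition raw disk space, you have to back up the data first, configure the new partitions, and finally write the data onto them.
When allocating shards, the main thing to consider is the growth trend of your data set.

We also often see unnecessary over-sharding. Judging from what ES community users share on this popular topic (shard configuration), many seem to regard over-allocation as an absolutely safe strategy (over-allocation here means giving each index more shards than the current data volume, that is, the document count, of the data set requires).

Elastic did advocate this practice early on, and many users then took it to extremes, for example by allocating 1000 shards. Elastic now takes a more cautious position on it.

A little headroom is good, but heavy over-allocation of shards is a serious mistake. There is no fixed rule for how many shards to define; it depends on the user's data volume and usage pattern. 100 shards, even if rarely used, might be fine, while 2 shards, even if used very heavily, might be more than necessary.

Keep in mind that every shard you allocate carries extra cost:

Every shard is essentially a Lucene index, so it consumes file handles, memory, and CPU resources

Every search request is dispatched to every shard of the index. If the shards are spread across different nodes this is not much of a problem, but once shards start competing for the same hardware resources, performance gradually degrades

ES uses term frequency statistics to compute relevance, and these statistics are kept per shard. If a large number of shards each hold only a little data, document relevance ends up being poor

Our customers usually assume that their data volume will grow along with their business, so they see a need to plan for the long term. Many users believe they will hit explosive growth (although most never even see a peak), and they also want to avoid resharding and reduce possible downtime.

If you are really worried about rapid data growth, we suggest paying attention to this limit: the maximum JVM heap size recommended for ElasticSearch is 30~32G, so cap the size of each shard at 30GB and then estimate the shard count from there. For example, if you think your data could reach 200GB, we recommend allocating at most 7 to 8 shards.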
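
The arithmetic behind that estimate, as a small sketch. The 30GB cap comes from the paragraph above, and the function name is hypothetical.

```python
import math


def estimate_primary_shards(expected_data_gb, max_shard_gb=30):
    """Smallest shard count that keeps every shard at or below max_shard_gb."""
    if expected_data_gb <= 0:
        return 1
    return max(1, math.ceil(expected_data_gb / max_shard_gb))
```

For 200GB this gives ceil(200 / 30) = 7 shards, which matches the lower end of the 7 to 8 recommended above.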

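In short, do not over-allocate now for 10TB of data that you might only reach three years from now. If that day does come, you will notice the performance change well in advance.

Dynamic Replicas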

PUT /my_temp_index/_settings 
{     
"number_of_replicas": 1
}

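analysis

The standard analyzer is the default analyzer for full-text fields and is a good choice for most Western languages. It consists of:
The standard tokenizer, which splits the input text on word boundaries.
The standard token filter, which is intended to tidy up the tokens emitted by the tokenizer (but currently does nothing).
The lowercase token filter, which converts all tokens to lowercase.
The stop token filter, which removes stopwords such as a, the, and, is.
By default the stopwords filter is disabled. To enable it, create a custom analyzer based on the standard analyzer and set the stopwords parameter. You can either provide a list of stopwords or use a predefined stopword list for a specific language.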

PUT /spanish_docs
{
    "settings": {
        "analysis": {
            "analyzer": {
                "es_std": {
                    "type":      "standard",
                    "stopwords": "_spanish_"
                }
            }
        }
    }
}

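Chinese Word Segmentation

Uses https://github.com/medcl/elasticsearch-analysis-ik
Both ik_max_word and ik_smart are configured. ik_smart is the one currently in use because its results are more user-friendly.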

ik_max_word

ik_smart

Configuration in elasticsearch.yml:
index:
  analysis:
    analyzer:
      ik_max_word:
          type: ik
          use_smart: false
      ik_smart:
          type: ik
          use_smart: true

Analyzer definition in the index_settings of the import scripts:
"analysis" : {
    "analyzer" : {
        "ik" : {
            "tokenizer" : "ik_smart"
        }
    }
}

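Pinyin

Uses https://github.com/medcl/elasticsearch-analysis-pinyin at version 1.7.4, built with mvn package (the build takes a long time and may need to download packages from the external network)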

# Install Maven 3.3.9
wget http://mirror.bit.edu.cn/apache/maven/maven-3/3.3.9/binaries/apache-maven-3.3.9-bin.tar.gz
tar zxvf apache-maven-3.3.9-bin.tar.gz
cp -r apache-maven-3.3.9 /usr/local/maven
vim /etc/profile      # edit the profile so that mvn is on the PATH
source /etc/profile
mvn                   # check that Maven starts

# Download the v1.7.4 source and build the plugin
cd /tmp/software/product/
wget https://github.com/medcl/elasticsearch-analysis-pinyin/archive/v1.7.4.zip
mkdir elasticsearch-analysis-pinyin
mv v1.7.4.zip elasticsearch-analysis-pinyin/
cd elasticsearch-analysis-pinyin/
unzip v1.7.4.zip
cd elasticsearch-analysis-pinyin-1.7.4/
mvn package

# The packaged plugin is written to target/releases/
cd target/releases/
cp elasticsearch-analysis-pinyin-1.7.4.zip ../../../

Applying the pinyin filters after Chinese word segmentation

{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "custom_pinyin_analyzer" : {
                    "tokenizer" : "ik_smart",
                    "filter" : ["full_pinyin_no_space","full_pinyin_with_space","first_letter_pinyin"]
                }
            },
            "filter" : {
                "full_pinyin_no_space" : {
                    "type" : "pinyin",
                    "first_letter" : "none",
                    "padding_char" : ""
                },
                "full_pinyin_with_space" : {
                    "type" : "pinyin",
                    "first_letter" : "none",
                    "padding_char" : " "
                },
                "first_letter_pinyin" : {
                    "type" : "pinyin",
                    "first_letter" : "only",
                    "padding_char" : ""
                }
            }
        }
    }
}

Word Breaking

Synonyms

Custom Dictionaries (custom, third-party)

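Plugins

Plugins currently in use:

elasticsearch-head: cluster management tool
http://estest.baomihua.com/_plugin/head/
Shows basic information about indexes and shards and supports the related operations, along with basic create, read, update, and delete operations, index configuration, cluster status, and plugin configuration.
bigdesk: cluster monitoring tool
http://estest.baomihua.com/_plugin/bigdesk/#nodes
Provides real-time performance monitoring of the ES cluster, covering JVM, Thread Pools, OS, Process, HTTP & Transport, Indices, and File system information.
kibana: data visualization tool
http://estest.baomihua.com:5602
kibana is a log visualization tool. In this environment it provides real-time detailed queries over index records, as well as charts and analysis built from the index data.
Marvel: visual monitoring tool for ES cluster status
http://estest.baomihua.com:5602/app/marvel
Provides a better-looking visual view of real-time ES cluster performance.
elasticsearch-jdbc: data import tool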

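Index Updates

Full Indexing

A full index is built the same way as the initial index: run the full import script oracle-pmh_es.sh

Incremental Indexing

ES-sql parameters:
To fetch a table, a "select * from table" query can be used. Such queries are the simplest way to select data from the database; they dump the table into Elasticsearch row by row. If no _id column name is given, IDs are generated automatically.
id as _id: aliasing the key column this way makes incremental sync possible, because _id is the default ID field name in ES
"interval" : "1800", sets the sync frequency to 1800s (half an hour); it can be set to 1s or any other value as needed
"schedule" : "0 0/60 0-23 ? * *", runs the sync task once every 60 minutes
"flush_interval" : "5s", sets the flush interval to 5s
sql.parameter: binds parameters to the SQL statement (in order). The following special values can be used:

$now: the current timestamp
$state: the state, one of BEFORE_FETCH, FETCH, AFTER_FETCH, IDLE, EXCEPTION
$metrics.counter: a counter
$lastrowcount: the number of rows from the last statement
$lastexceptiondate: the SQL timestamp of the last exception
$lastexception: the full stack trace of the last exception
$metrics.lastexecutionstart: the SQL timestamp of the time the last execution started
$metrics.lastexecutionend: the SQL timestamp of the time the last execution ended
$metrics.totalrows: the total number of rows fetched
$metrics.totalbytes: the total number of bytes fetched
$metrics.failed: the total number of failed SQL executions
$metrics.succeeded: the total number of succeeded SQL executions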

deltaImportQuery="SELECT * FROM PMH_SOLR WHERE SOLRTIME > TO_DATE('${metrics.lastexecutionstart}','YYYY-MM-DD hh24:mi:ss')"
"statefile" : "statefile-article.json",
        "schedule" : "0 0-59 0-23 ? * *",
"sql" : [
            {
                "statement" : "select *, id as _id from article where update_time > ?",
                "parameter" : [ "$metrics.lastexecutionstart" ]
            }
        ]

ES Query API

Simple Search

http://estest.baomihua.com/pmh_es_smart-test/_search?&pretty
pretty: pretty-prints the JSON response

http://estest.baomihua.com/pmh_es_smart-test/_search?q=FILMNAME:%E4%B8%AD%E5%9B%BD+CHANNELNAME:%E4%B8%AD%E5%9B%BD&pretty
Field search: _search?q=FILMNAME:中国+CHANNELNAME:中国

GET /_search?timeout=10ms
Sets the timeout for the response
/_search
Search all types in all indices
/gb/_search
Search all types in the gb index
/gb,us/_search
Search all types in the gb and us indices
/g*,u*/_search
Search all types in every index whose name starts with g or u
/gb/user/_search
Search the user type in the gb index
/gb,us/user,tweet/_search
Search the user and tweet types in the gb and us indices
/_all/user,tweet/_search
Search the user and tweet types in all indices
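
The path patterns and the percent-encoded field search above can be generated programmatically. A small sketch using only the standard library; the helper names are hypothetical.

```python
from urllib.parse import quote


def search_path(indices=(), types=()):
    """Build the /index/type/_search path for the given indices and types."""
    parts = []
    if indices or types:
        # Types without indices need the _all placeholder in the index slot.
        parts.append(",".join(indices) if indices else "_all")
    if types:
        parts.append(",".join(types))
    parts.append("_search")
    return "/" + "/".join(parts)


def field_query(conditions):
    """Percent-encode field:value pairs for the q= parameter, joined by '+'."""
    return "+".join(f"{field}:{quote(value)}" for field, value in conditions)
```

For example, `search_path(["gb", "us"], ["user", "tweet"])` gives `/gb,us/user,tweet/_search`, and `field_query([("FILMNAME", "中国"), ("CHANNELNAME", "中国")])` reproduces the encoded `q=` value in the URL above.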

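Pagination

size: the number of results to return, default 10
from: the number of initial results to skip, default 0
To show 5 results per page, for pages 1 through 3: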
GET /_search?size=5
GET /_search?size=5&from=5
GET /_search?size=5&from=10
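
The from value for a page follows directly from the page number and the page size. A minimal sketch, with `page_params` as a hypothetical helper name:

```python
def page_params(page, per_page=10):
    """Translate a 1-based page number into the size/from search parameters."""
    if page < 1 or per_page < 1:
        raise ValueError("page and per_page must be positive")
    return {"size": per_page, "from": (page - 1) * per_page}
```

With 5 results per page, pages 1, 2 and 3 map to from values of 0, 5 and 10, matching the three requests above.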

Highlighting

"highlight": {
    "pre_tags": [
      "<tag1>",
      "<tag2>"
    ],
    "post_tags": [
      "</tag1>",
      "</tag2>"
    ],
    "fields": {
      "FILMNAME": {}
    }
  }

{
  "query": {
    "match": {
      "FILMNAME": "中国"
    }
  },
  "highlight": {
    "fields": {
      "FILMNAME": {}
    }
  }
}

ES Structured API

Request Body Search

GET /_search
{}

Returns all documents in all indices

GET /_search
{
  "from": 30,
  "size": 10
}

POST /_search
{
  "from": 30,
  "size": 10
}

Pagination with from and size, sent as either GET or POST

Query DSL

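GET /_search
{
    "query": {
        "match_all": {}
    }
}

Matches all documents

Combining Multiple Clauses

Query clauses are like building blocks: simple clauses can be combined into complex queries. For example:
Leaf clauses (such as the match clause) compare a query string against a field (or several fields)
Compound clauses combine other clauses. For example, the bool clause lets you combine other valid clauses under must, must_not, or should as needed: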

{
    "bool": {
        "must":     { "match": { "tweet": "elasticsearch" }},
        "must_not": { "match": { "name":  "mary" }},
        "should":   { "match": { "tweet": "full text" }}
    }
}

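Filter DSL

term filter

term is mainly used for exact matching of values such as numbers, dates, booleans, or not_analyzed strings (text data that has not been analyzed):

{ "term": { "age":    26           }}
{ "term": { "date":   "2014-09-01" }}
{ "term": { "public": true         }}
{ "term": { "tag":    "full_text"  }}

terms filter

terms is similar to term, but it lets you specify several values to match. A document matches if the field contains any one of the listed values:

{
    "terms": {
        "tag": [ "search", "full_text", "nosql" ]
    }
}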

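range filter

{
    "range": {
        "age": {
            "gte":  20,
            "lt":   30
        }
    }
}

Range operators:
gt :: greater than
gte:: greater than or equal to
lt :: less than
lte:: less than or equal to

exists and missing filters

The exists filter finds documents in which the given field has one or more values, and the missing filter finds documents in which it has none:

{
    "exists": {
        "field": "title"
    }
}

bool filter

The bool filter combines the results of multiple filter clauses with Boolean logic. It has the following operators:
must :: all of the clauses must match, equivalent to and.
must_not :: none of the clauses may match, equivalent to not.
should :: at least one of the clauses must match, equivalent to or.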
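
To make the semantics of these filters concrete, here is a toy evaluator that applies them to a plain Python dict. It is a deliberately simplified model and not how Elasticsearch works internally: real filters run against indexed terms, while this sketch compares raw values. The should handling follows the rule stated above, that at least one should clause must match when any are given.

```python
def _values(doc, field):
    """Return the field's values as a list; a missing or None field has none."""
    value = doc.get(field)
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def _as_list(clauses):
    return clauses if isinstance(clauses, list) else [clauses]


_RANGE_OPS = {
    "gt": lambda v, bound: v > bound,
    "gte": lambda v, bound: v >= bound,
    "lt": lambda v, bound: v < bound,
    "lte": lambda v, bound: v <= bound,
}


def matches(flt, doc):
    """Evaluate a small subset of the filter DSL against one document."""
    (kind, body), = flt.items()
    if kind == "term":
        (field, wanted), = body.items()
        return wanted in _values(doc, field)
    if kind == "terms":
        (field, wanted), = body.items()
        return any(v in wanted for v in _values(doc, field))
    if kind == "range":
        (field, bounds), = body.items()
        return any(
            all(_RANGE_OPS[op](v, bound) for op, bound in bounds.items())
            for v in _values(doc, field)
        )
    if kind == "exists":
        return bool(_values(doc, body["field"]))
    if kind == "missing":
        return not _values(doc, body["field"])
    if kind == "bool":
        must = _as_list(body.get("must", []))
        must_not = _as_list(body.get("must_not", []))
        should = _as_list(body.get("should", []))
        return (
            all(matches(c, doc) for c in must)
            and not any(matches(c, doc) for c in must_not)
            and (not should or any(matches(c, doc) for c in should))
        )
    raise ValueError(f"unsupported filter: {kind}")
```

For a document such as `{"age": 26, "tag": ["search", "nosql"]}`, the range filter with gte 20 and lt 30 matches, and so does a terms filter on tag listing "full_text" and "nosql", because one listed value is enough.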

match query

multi_match query

The multi_match query lets you run a match query on several fields at once:

{
    "multi_match": {
        "query":    "full text search",
        "fields":   [ "title", "body" ]
    }
}

Sorting

Sorting by Field Value

GET /_search
{
    "query" : {
        "filtered" : {
            "filter" : { "term" : { "user_id" : 1 }}
        }
    },
    "sort": { "date": { "order": "desc" }}
}

Date fields are converted to milliseconds for sorting
Sorting by _score puts the most relevant results first
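
The filtered query used in the sorting example still works on elasticsearch-2.3.4 but has been deprecated since Elasticsearch 2.0 in favor of a bool query with a filter clause. A sketch of the mechanical rewrite, with `filtered_to_bool` as a hypothetical helper name:

```python
def filtered_to_bool(query):
    """Rewrite {"filtered": {...}} as the equivalent {"bool": {...}} query."""
    if "filtered" not in query:
        return query
    filtered = query["filtered"]
    bool_body = {}
    if "query" in filtered:
        # The scoring part of a filtered query becomes a must clause.
        bool_body["must"] = filtered["query"]
    if "filter" in filtered:
        # The non-scoring part becomes a filter clause.
        bool_body["filter"] = filtered["filter"]
    return {"bool": bool_body}
```

Applied to the query above, this yields a bool query whose filter clause is the same term filter on user_id; the sort clause is unaffected.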

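Note: this article is a set of working notes that has not been turned into proper documentation, so parts of it are hard to read. If you find anything technically misleading or wrong, please leave a comment. I may write a more systematic series of ES articles later.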