ElasticSearch


Overview

Elasticsearch and Solr are search engines (think Baidu, GitHub, or Taobao e-commerce search).

Whenever you need search, you can use ES! It shines when the amount of data is large.

ElasticSearch, es for short, is an open-source, highly scalable, distributed full-text search engine. It can store and retrieve data in near real time, scales out to hundreds of servers, and can process petabytes of data. es is developed in Java and uses Lucene at its core to implement all indexing and searching functions, but its purpose is to hide the complexity of Lucene behind a simple RESTful API, making full-text search simple.

Who is using

  • Wikipedia, similar to Baidu Baike: full-text search, highlighting, search suggestions
  • News sites, similar to Sohu News: user behavior logs (clicks, views, favorites, comments) plus social-network data, for data analysis
  • Stack Overflow: a programming Q&A forum, full-text search of questions and answers
  • GitHub (open-source code hosting): searching hundreds of billions of lines of code
  • E-commerce sites: product search
  • Log analysis: Logstash collects logs and ES performs complex analysis, i.e. the ELK stack (Elasticsearch + Logstash + Kibana)
  • Commodity price monitoring sites
  • Business intelligence systems
  • Site search

Differences between ES and Solr

Introduction to ElasticSearch

ElasticSearch is a distributed search and analytics engine. It makes it possible to process big data faster than ever before. It is used for full-text search, structured search, analytics, and any mix of the three:

Wikipedia uses es to provide full-text search with highlighted keywords, plus search-as-you-type and search error correction (did-you-mean suggestions); The Guardian uses es to combine visitor logs with social-network data, giving its editors real-time feedback on how the public is responding to newly published articles.

es is an open-source search engine built on Apache Lucene(TM). Whether in the open-source or proprietary space, Lucene can be considered the most advanced, performant, and full-featured search engine library to date. But to use Lucene directly, you must develop in Java and integrate it into your application yourself.

Introduction to solr

Solr is a top-level open-source project under Apache, developed in Java, and is a full-text search server based on Lucene. Solr provides a richer query language than Lucene, and is configurable, extensible, and optimized for indexing and search performance. It can run standalone, as an independent enterprise search application server, exposing an API similar to a web service. Users can submit files in a certain format to the search server via HTTP requests to generate indexes; they can also send search requests and receive results.

Compare the two

  • When simply searching existing data, Solr is faster
  • When indexes are updated in real time, Solr suffers from I/O blocking and its query performance degrades, while ElasticSearch has a clear advantage
  • As data volume grows, Solr's search efficiency drops, while Elasticsearch's barely changes

Summary

  1. es works basically out of the box and is very simple; Solr is a bit more complicated.
  2. Solr uses ZooKeeper for distributed management, while elasticsearch has distributed coordination built in.
  3. Solr supports more data formats, such as JSON, XML, and CSV, while es only supports JSON.
  4. Solr officially provides more features, while elasticsearch focuses on core functionality and leaves advanced features to third-party plugins.
  5. Solr queries are fast, but index updates are slow; it suits query-heavy applications such as e-commerce.
  6. es builds indexes quickly, so real-time queries are fast; it suits searches such as Facebook and Sina.
  7. Solr is more mature, with a larger community of users, developers, and contributors; elasticsearch has fewer developers and maintainers, updates very fast, and costs more to learn and use.

ES core concepts

What are cluster, node, index, type, document, shard, and mapping?

elasticsearch is document-oriented. Here is a rough correspondence between a relational database and elasticsearch. Everything is JSON!

Relational DB        Elasticsearch
database             index (indices)
tables               types (deprecated since version 7; default _doc)
rows                 documents
fields (columns)     fields

An Elasticsearch cluster can contain multiple indexes (databases), each index can contain multiple types (tables), each type contains multiple documents (rows), and each document contains multiple fields (columns).
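To make the hierarchy concrete, here is a hypothetical request (index name, id, and field values are all made up for illustration) that stores one "row" as a JSON document:

```
PUT /my_index/_doc/1
{
  "name": "ahui",
  "age": 21,
  "tags": ["coder", "otaku"]
}
```

The path is index / type (_doc) / document id; the body's keys are the fields.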

Physical Design

Elasticsearch divides each index into multiple shards in the background, and each shard can be migrated between different servers in the cluster

Even a single node is a cluster by itself, and the default cluster name is elasticsearch

Logic Design

An index type contains multiple documents, such as document 1 and document 2. When we index a document, we can locate it in this order: index > type > document ID; this combination lets us address a specific document. Note: the ID does not have to be an integer; it is actually a string.

Documents

I said earlier that elasticsearch is document-oriented, which means that the smallest unit of indexing and searching data is a document. In elasticsearch, documents have several important properties:

  • Self-contained: a document contains both fields and their corresponding values, i.e. key:value pairs!
  • Hierarchical: a document can contain sub-documents, which is where complex logical entities come from!
  • Flexible structure: documents do not depend on a predefined schema. In relational databases, fields must be defined in advance before use; in elasticsearch, fields are very flexible. We can sometimes omit a field, or dynamically add a new one.

Although we can add or omit a field at will, the type of each field matters. For example, an age field could be a string or an integer, and elasticsearch saves the mapping between fields and types along with other settings. This mapping is specific to each type, which is why in elasticsearch types are sometimes called mapping types.
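For example, on es 7+ a mapping can be declared explicitly when the index is created (the index name test2 is illustrative; versions before 7 also require a type level inside "mappings"):

```
PUT /test2
{
  "mappings": {
    "properties": {
      "name":     { "type": "text" },
      "age":      { "type": "integer" },
      "birthday": { "type": "date" }
    }
  }
}
```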

type

Types are logical containers for documents, just as in relational databases tables are containers for rows. The definition of the fields in a type is called a mapping; for example, a name field might be mapped to the string type. We say documents are schema-less: they do not need to have all the fields defined in the mapping. So what happens when a document brings a new field? Elasticsearch automatically adds it to the mapping, but since the field's type is unknown, elasticsearch guesses: if the value is 18, it assumes an integer. The guess can be wrong, so the safest approach is to define the required mappings in advance, just as in a relational database you define fields first and then use them.

index

An index is like a database.

An index is a container of mapping types, and an index in elasticsearch is a very large collection of documents. The index stores the fields and other settings of the mapped type. They are then stored on individual shards. Let’s examine how sharding works.

Physical Design: How Nodes and Shards Work

A cluster has at least one node, and a node is an elasticsearch process. A node can hold multiple indexes. By default, a newly created index has 5 primary shards, and each primary shard has one replica (replica shard).

For example, in a cluster with 3 nodes, a primary shard and its replica shard are never placed on the same node, so if a node goes down, no data is lost. In effect, a shard is a Lucene index: a directory of files containing an inverted index. The inverted index is structured so that elasticsearch can tell you which documents contain a particular keyword without scanning all documents. But wait, what exactly is an inverted index?
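The shard and replica counts can be set when an index is created (index name is illustrative; note that since version 7 the default is 1 primary shard rather than 5):

```
PUT /test_index
{
  "settings": {
    "number_of_shards": 5,
    "number_of_replicas": 1
  }
}
```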

Inverted index

Elasticsearch uses a structure called an inverted index, with Lucene's inverted index as the underlying layer. This structure is suited to fast full-text search: an index consists of a list of all distinct words appearing in any document, and for each word, a list of the documents that contain it. For example, suppose there are two documents, each containing the following:

Study every day, good good up to forever # What Document 1 contains
To forever, study every day, good good up # What Document 2 contains

To create an inverted index, we first split each document into individual words (also called terms or tokens), then create a sorted list of all distinct terms, and then record, for each term, which documents it appears in.

Now suppose we search for "to forever". Both documents match, but the first matches more closely than the second. If nothing else, both documents containing the keywords are returned. Consider another example: searching blog posts by tag.

If you want to search for articles with python tags, it will be much faster to find the inverted index data than to find all the original data. Just look at the tags column and get the relevant article ID. Completely filter out all irrelevant data and improve efficiency!
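The steps above can be sketched in Python (a toy illustration only; Lucene's actual data structures are far more sophisticated):

```python
# Toy inverted index: map each distinct term to the set of doc ids containing it.
from collections import defaultdict

def build_inverted_index(docs):
    """docs: {doc_id: text}. Returns {term: {doc_ids}}."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # crude tokenization: lowercase, strip commas, split on whitespace
        for term in text.lower().replace(",", "").split():
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents containing every term in the query."""
    terms = query.lower().split()
    results = [index.get(t, set()) for t in terms]
    return set.intersection(*results) if results else set()

docs = {
    1: "Study every day, good good up to forever",
    2: "To forever, study every day, good good up",
}
index = build_inverted_index(docs)
print(search(index, "to forever"))  # both documents contain "to" and "forever"
```

Only the postings lists for the query terms are touched; irrelevant documents are never scanned, which is the efficiency win described above.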

Comparison of elasticsearch index and Lucene index

In elasticsearch, the word index (as a data repository) is used frequently. An elasticsearch index is divided into multiple shards, and each shard is a Lucene index, so an elasticsearch index is composed of multiple Lucene indexes. (No mystery here: elasticsearch uses Lucene as its underlying layer.) Unless otherwise specified, "index" below refers to an elasticsearch index.

All of the following operations are performed in the Console under Dev Tools in Kibana. These are the basic operations!

IK tokenizer

What is an IK tokenizer?

Word segmentation means splitting a passage of Chinese (or other text) into keywords. When searching, we segment the query, segment the data in the database or index library, and then match them. The default Chinese tokenizer treats every single character as a word: for example, "I love Kuang Shen" (我爱狂神) would be split into "I", "love", "Kuang", "Shen", which obviously does not meet our needs, so we install the Chinese tokenizer ik to solve this problem.

If you want to use Chinese, it is recommended to use the ik tokenizer

IK provides two segmentation algorithms: ik_smart and ik_max_word, where ik_smart is the coarsest segmentation and ik_max_word is the most fine-grained segmentation!

ik_smart performs the coarsest segmentation (fewest terms)

ik_max_word performs the most fine-grained segmentation, exhausting every combination in the dictionary
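Assuming the IK plugin is installed, the two algorithms can be compared in Kibana's console (the sample text 中华人民共和国 is arbitrary):

```
GET _analyze
{
  "analyzer": "ik_smart",
  "text": "中华人民共和国"
}

GET _analyze
{
  "analyzer": "ik_max_word",
  "text": "中华人民共和国"
}
```

Roughly, ik_smart keeps the phrase as one term, while ik_max_word also emits every dictionary sub-word it can find.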

Later, when we need custom terms, we configure the tokenizer ourselves by listing them in our own .dic dictionary file.
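As a sketch of how that configuration typically looks with the IK plugin: custom words go into a .dic file (one word per line) referenced from the plugin's IKAnalyzer.cfg.xml; the file name my.dic below is made up, and es must be restarted after the change:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <!-- custom dictionary file, relative to this config directory -->
    <entry key="ext_dict">my.dic</entry>
    <!-- custom stopword file (left empty here) -->
    <entry key="ext_stopwords"></entry>
</properties>
```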

RESTful style description

REST is a software architectural style rather than a standard; it only provides a set of design principles and constraints. It is mainly used in client-server interaction. Software designed in this style can be more concise and more layered, and mechanisms such as caching become easier to implement.
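A commonly used summary of es's basic REST commands (paths follow the older index/type scheme used throughout this article):

```
method    url                                      description
PUT       localhost:9200/index/type/id             create a document (specify id)
POST      localhost:9200/index/type                create a document (random id)
POST      localhost:9200/index/type/id/_update     modify a document
DELETE    localhost:9200/index/type/id             delete a document
GET       localhost:9200/index/type/id             query a document by id
POST      localhost:9200/index/type/_search        query all documents
```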

Basic operations on indexing

  1. Create the first index

    PUT /index_name/~type_name (may be removed in the future)~/document_id
    {
          request body
    }
    

    The index is created automatically and the data is inserted successfully. This is why, early on, you can even use es as a simple database for learning.

    1) Note that the name field did not need a declared type, whereas a relational database requires types to be specified in advance.

    • String type

      text, keyword

    • Numeric type

      long, integer, short, byte, double, float, half_float, scaled_float

    • date type

      date

    • Boolean type

      boolean

    • binary type

      binary

    • and many more…

    2) Specify the type of field

    3) The mapping rules of an index can be retrieved with a GET request

    4) View the default information

    If your own document field is not specified, then es will give us the default configuration field type!

    Extended:

    A lot of current information can be obtained through the GET _cat/ endpoints

  2. Modify: submit with PUT again to overwrite the document, or use the newer POST _update method

  3. Delete the index

    Deletion is done with the DELETE command; whether an index or a single document record is deleted depends on the request path.

    Using the RESTful style is what we recommend when working with ES
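For example, the _cat endpoints mentioned above expose cluster metadata in human-readable form (?v adds a header row):

```
GET _cat/health?v     # cluster health
GET _cat/indices?v    # all indices and their sizes
GET _cat/nodes?v      # nodes in the cluster
```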

Basic operations on documents

  1. Add data
PUT ahui/test/1
{
  "name": "ahui",
  "age": 21,
  "desc": "May you have strong wind and strong wine, but also enjoy solitude and freedom",
  "tags": ["Secondary Yuan", "Otaku", "Code Farmer"]
}
  2. Get data
GET ahui/test/1
  3. Update data with PUT (equivalent to overwriting the previous document)
PUT ahui/test/1
{
  "name": "ahui",
  "age": 22,
  "desc": "May you have strong wind and strong wine, but also enjoy solitude and freedom",
  "tags": ["Secondary Yuan", "Otaku", "Code Farmer"]
}
  4. Modify data with POST _update; this modification method is recommended
POST /test2/user/2/_update
{
  "doc":{
    "age": 18
  }
  
}

simple search

GET ahui/user/1

A simple conditional query can generate a basic query based on the default mapping rules

Complex search, like a SQL select (sorting, pagination, highlighting, fuzzy query, exact query, etc.)

GET ahui/user/_search?q=name:ahui
GET ahui/user/_search
{
  "query":{
    "match": {
    	"name": "ahui"
    }
  }
}
GET ahui/user/_search
{
  "query":{
    "match": {
    	"name": "ahui"
    }
  },
  "_source": ["name","tags"]
}

Later we will operate es from Java; all those methods and objects correspond to the keys here

Sort

GET ahui/user/_search
{
  "query":{
    "match": {
    	"name": "ahui"
    }
  },
  "sort":{
  	"age":{
  		"order": "desc"
  	}
  }
}

Paging query

GET ahui/user/_search
{
  "query":{
    "match": {
    	"name": "ahui"
    }
  },
  "sort":{
  	"age":{
  		"order": "desc"
  	}
  },
  "from": 0,
  "size": 2
}

Result offsets still start from 0, like array subscripts
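The from/size arithmetic for a 1-based page number can be sketched as follows (the helper name page_params is made up for illustration):

```python
def page_params(page, page_size):
    """Translate a 1-based page number into Elasticsearch from/size values.

    "from" is the zero-based offset of the first hit to return.
    """
    if page < 1:
        raise ValueError("page numbers start at 1")
    return {"from": (page - 1) * page_size, "size": page_size}

# Page 1 with 2 hits per page corresponds to "from": 0, "size": 2,
# matching the query above.
print(page_params(1, 2))
print(page_params(3, 10))
```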

Boolean query

must (and): all conditions must be met. Equivalent to where id = 1 and name = xxx

GET ahui/user/_search
{
  "query":{
  	"bool":{
  		"must":[
  			{
  				"match": {
    				"name": "ahui"
    			}
    		},
    		{
  				"match": {
    				"age": 23
    			}
    		}
  		]
  	}
  }
}

should (or): at least one of the conditions must be met. Equivalent to where id = 1 or name = xxx

GET ahui/user/_search
{
  "query":{
  	"bool":{
  		"should":[
  			{
  				"match": {
    				"name": "ahui"
    			}
    		},
    		{
  				"match": {
    				"age": 23
    			}
    		}
  		]
  	}
  }
}

must_not (not): inverse query, matching documents that do not satisfy the condition

GET ahui/user/_search
{
  "query":{
  	"bool":{
  		"must_not":[
    		{
  				"match": {
    				"age": 23
    			}
    		}
  		]
  	}
  }
}

filter

GET ahui/user/_search
{
  "query":{
  	"bool":{
  		"must":[
  			{
  				"match": {
    				"name": "ahui"
    			}
    		}
  		],
        "filter":{
            "range":{
                "age":{
                    "lt": 10
                }
            }
        }
  	}
  }
}
  • gt is greater than
  • gte is greater than or equal to
  • lt is less than
  • lte is less than or equal to

match multiple conditions

GET ahui/user/_search
{
  "query":{
    "match": {
    	"tags": "tagA tagB"
    }
  }
}

Exact query

A term query searches the inverted index directly for the exact term specified, without analysis.

About tokenization:

  • term: direct exact query, no analysis
  • match: analyzed with the tokenizer first (the document is analyzed, then queried through the analyzed terms)

Two string types: text and keyword

GET _analyze
{
  "analyzer": "keyword", 
  "text": "tagA tagB"
}

With the keyword analyzer, the text is kept whole and not split

GET _analyze
{
  "analyzer": "standard", 
  "text": "tagA tagB"
}

With the standard analyzer, the text is tokenized

Exact query for multiple value matches

GET ahui/user/_search
{
  "query":{
  	"bool":{
  		"should":[
  			{
  				"term": {
    				"name": "ahui"
    			}
    		},
    		{
  				"term": {
    				"name": "bbbb"
    			}
    		}
  		]
  	}
  }
}

Highlight query

GET ahui/user/_search
{
  "query":{
    "match": {
    	"name": "ahui"
    }
  },
  "highlight": {
      "pre_tags": ["<p class='key' style='color:red'>"],
      "post_tags": ["</p>"],
      "fields": {
          "name": {}
      }
  }
}

In fact, MySQL can also do all of these, but its efficiency is comparatively low.

  • match
  • conditional match (bool must/should)
  • exact match (term)
  • field filtering (_source)
  • multi-condition query
  • highlight query
  • inverse query (must_not)