ELK on docker
Integrating Hadoop and Elasticsearch – Part 1 – Loading into and Querying Elasticsearch from Apache Hive
Integrating Hadoop and Elasticsearch – Part 2 – Writing to and Querying Elasticsearch from Apache Spark
curl -XPOST 'http://localhost:9200/_shutdown'
A request to Elasticsearch consists of the same parts as any HTTP request:
curl -X<VERB> '<PROTOCOL>://<HOST>:<PORT>/<PATH>?<QUERY_STRING>' -d '<BODY>'
The parts marked with < > above are:
VERB: The appropriate HTTP method or verb: GET, POST, PUT, HEAD, or DELETE.
PROTOCOL: Either http or https (if you have an https proxy in front of Elasticsearch).
HOST: The hostname of any node in your Elasticsearch cluster, or localhost for a node on your local machine.
PORT: The port running the Elasticsearch HTTP service, which defaults to 9200.
QUERY_STRING: Any optional query-string parameters (for example, ?pretty will pretty-print the JSON response to make it easier to read).
BODY: A JSON-encoded request body (if the request needs one).
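The parts listed above can be assembled programmatically. The following is a minimal sketch using only the Python standard library; the helper name `build_request` is hypothetical, and the `match_all` body is an assumption chosen for illustration. The request is built but not sent, so no running cluster is needed.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request


def build_request(verb, protocol, host, port, path, query=None, body=None):
    """Assemble (but do not send) an HTTP request for Elasticsearch.

    build_request is a hypothetical helper, not part of any client library.
    """
    url = f"{protocol}://{host}:{port}/{path.lstrip('/')}"
    if query:
        # Optional query-string parameters, e.g. {"pretty": "true"}
        url += "?" + urlencode(query)
    # The body, when present, is JSON-encoded
    data = json.dumps(body).encode("utf-8") if body is not None else None
    headers = {"Content-Type": "application/json"} if data else {}
    return Request(url, data=data, method=verb, headers=headers)


# Assumed example: a count request with a match_all query body
req = build_request(
    "GET", "http", "localhost", 9200, "_count",
    query={"pretty": "true"},
    body={"query": {"match_all": {}}},
)
print(req.get_method(), req.full_url)
```

Passing the returned object to `urllib.request.urlopen` would send it to a running node.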
For instance, to count the number of documents in the cluster, we could use this:
curl -XGET 'http://localhost:9200/_count?pretty' -d '{"query": {"match_all": {}}}'
Search all types in all indices
Search all types in the gb index
Search all types in the gb and us indices
Search all types in any indices beginning with g or beginning with u
Search type user in the gb index
Search types user and tweet in the gb and us indices
Search types user and tweet in all indices
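Each of the search scopes above corresponds to a URL path in which indices and types are comma-separated. The sketch below shows one way to build those paths. It assumes the path conventions of older Elasticsearch versions that still support multiple types per index, and `search_path` is a hypothetical helper name.

```python
def search_path(indices=None, types=None):
    """Build a _search path from optional lists of indices and types.

    Assumes the older /<indices>/<types>/_search URL layout.
    """
    parts = []
    if indices:
        parts.append(",".join(indices))
    elif types:
        # Types without an explicit index need the _all placeholder
        parts.append("_all")
    if types:
        parts.append(",".join(types))
    parts.append("_search")
    return "/" + "/".join(parts)


# The seven scopes listed above, in the same order
print(search_path())
print(search_path(["gb"]))
print(search_path(["gb", "us"]))
print(search_path(["g*", "u*"]))
print(search_path(["gb"], ["user"]))
print(search_path(["gb", "us"], ["user", "tweet"]))
print(search_path(types=["user", "tweet"]))
```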
- In Elasticsearch, a document belongs to a type, and those types live inside an index. You can draw some (rough) parallels to a traditional relational database:
Relational DB ⇒ Databases ⇒ Tables ⇒ Rows ⇒ Columns
Elasticsearch ⇒ Indices ⇒ Types ⇒ Documents ⇒ Fields
- Elasticsearch supports the following simple field types:
- String: string
- Whole number: byte, short, integer, long
- Floating-point: float, double
- Boolean: boolean
- Date: date
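As an illustration of how these type names might appear in a mapping, the sketch below builds a `properties` dictionary and rejects any type that is not in the list above. The helper `build_properties` and the sample field names are hypothetical; the type names are those listed in the text, which apply to older Elasticsearch versions.

```python
# Simple field types, grouped as in the list above
SIMPLE_TYPES = {
    "string": {"string"},
    "whole number": {"byte", "short", "integer", "long"},
    "floating-point": {"float", "double"},
    "boolean": {"boolean"},
    "date": {"date"},
}
ALLOWED = set().union(*SIMPLE_TYPES.values())


def build_properties(fields):
    """Turn {field_name: type_name} into a mapping 'properties' dict."""
    unknown = {name: t for name, t in fields.items() if t not in ALLOWED}
    if unknown:
        raise ValueError(f"unsupported field types: {unknown}")
    return {"properties": {name: {"type": t} for name, t in fields.items()}}


# Hypothetical fields for a tweet document
mapping = build_properties({"tweet": "string", "likes": "integer", "date": "date"})
print(mapping)
```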
What Is Relevance?
We’ve mentioned that, by default, results are returned in descending order of relevance. But what is relevance? How is it calculated?
The relevance score of each document is represented by a positive floating-point number called the _score. The higher the _score, the more relevant the document.
A query clause generates a _score for each document. How that score is calculated depends on the type of query clause. Different query clauses are used for different purposes: a fuzzy query might determine the _score by calculating how similar the spelling of the found word is to the original search term; a terms query would incorporate the percentage of terms that were found. However, what we usually mean by relevance is the algorithm that we use to calculate how similar the contents of a full-text field are to a full-text query string.
The standard similarity algorithm used in Elasticsearch is known as term frequency/inverse document frequency, or TF/IDF, which takes the following factors into account:
Term frequency
How often does the term appear in the field? The more often, the more relevant. A field containing five mentions of the same term is more likely to be relevant than a field containing just one mention.
Inverse document frequency
How often does each term appear in the index? The more often, the less relevant. Terms that appear in many documents have a lower weight than more-uncommon terms.
Field-length norm
How long is the field? The longer it is, the less likely it is that words in the field will be relevant. A term appearing in a short title field carries more weight than the same term appearing in a long content field.
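The three factors above can be sketched numerically. The formulas below follow the classic Lucene TF/IDF formulation; that choice is an assumption, since the text describes the factors only qualitatively. The function names are hypothetical, and the simple product in `term_weight` is a simplification of what a real scoring function does.

```python
import math


def term_frequency(freq):
    """More occurrences in the field means higher relevance (dampened by sqrt)."""
    return math.sqrt(freq)


def inverse_document_frequency(num_docs, doc_freq):
    """Terms found in many documents get a lower weight."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def field_length_norm(num_terms):
    """Longer fields give each term less weight."""
    return 1.0 / math.sqrt(num_terms)


def term_weight(freq, num_docs, doc_freq, num_terms):
    """Simplified weight of one term in one field: the product of the factors."""
    return (term_frequency(freq)
            * inverse_document_frequency(num_docs, doc_freq)
            * field_length_norm(num_terms))


# Same term, same index: a short title field versus a long content field
print(term_weight(freq=1, num_docs=1000, doc_freq=10, num_terms=4))
print(term_weight(freq=1, num_docs=1000, doc_freq=10, num_terms=400))
```

Under these assumed formulas, the term in the four-term field receives ten times the weight of the same term in the 400-term field.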
Individual queries may combine the TF/IDF score with other factors such as the term proximity in phrase queries, or term similarity in fuzzy queries.
Relevance is not just about full-text search, though. It can equally be applied to yes/no clauses, where the more clauses that match, the higher the _score.
When multiple query clauses are combined using a compound query like the bool query, the _score from each of these query clauses is combined to calculate the overall _score for the document.
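One way to picture this combination is sketched below. It assumes the approach used by classic Lucene scoring, where the scores of the matching clauses are summed and then scaled by the fraction of clauses that matched; the text itself does not specify the formula, and `combine_bool_score` is a hypothetical name.

```python
def combine_bool_score(clause_scores):
    """Combine per-clause scores into one document score.

    clause_scores holds one entry per clause: a float if the clause
    matched the document, or None if it did not.
    """
    matched = [s for s in clause_scores if s is not None]
    if not matched:
        return 0.0
    # Assumed coordination factor: reward documents matching more clauses
    coord = len(matched) / len(clause_scores)
    return sum(matched) * coord


# A document matching two of three clauses
print(combine_bool_score([1.5, None, 0.5]))
# A document matching both of two clauses
print(combine_bool_score([1.0, 1.0]))
```

This also illustrates the yes/no case: with equal clause scores, a document that matches more clauses ends up with a higher overall score.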