Managing ElasticSearch metadata

ElasticSearch is a schema-less search engine. This means that documents don't need to keep a fixed structure over time. A document can be indexed once with a given structure and later with a new property. Under the hood, ElasticSearch manages a mapping between the provided data structure and the documents of the index. If we do nothing, the engine applies default strategies for this mapping.

Relying on these defaults is fine to get started, but for real-life applications the mapping needs to be tweaked to exactly match our needs. This is what we will tackle in this post.

Configure ElasticSearch Java client in the project

The simplest way to configure the ElasticSearch Java client to interact with the server is to use Maven and define the client as a dependency in the file pom.xml, as described below:

<?xml version="1.0" encoding="UTF-8"?>
<project (...)>
    <modelVersion>4.0.0</modelVersion>
    (...)
    <dependencies>
        <dependency>
            <groupId>org.elasticsearch</groupId>
            <artifactId>elasticsearch</artifactId>
            <version>1.3.2</version>
        </dependency>
    </dependencies>
    (...)
</project>

Maven can then be used to generate the configuration for your IDE. For example, for Eclipse, simply execute the following command:

mvn eclipse:eclipse

Now that we have a configured project, let's have a look at what the metadata in ElasticSearch stand for.

What are ElasticSearch metadata for?

Before describing how to manage ElasticSearch metadata, we need to describe what they are and what they stand for.

The metadata describe how the data received and returned by the index relate to the data actually stored within the index. This is also called the index mapping.

The following code describes the structure of the mapping returned by ElasticSearch:

{
  "index1": {
    "mappings": {
      "type1": {
        "properties": {
          "property1a": { "type":"double" },
          "property1b": { "type":"long" },
          "property1c": { "type":"long" },
          "property1d": {
            "type":"string",
            "index" : "not_analyzed",
            "store" : true
          },
          (...)
        }
      },
      "type2":{
        "properties":{
          "property2a": {
            "properties": {
              "subProperty1": { "type":"double" },
              "subProperty2": { "type":"double" }
            }
          },
          "property2b": { "type":"long" },
          (...)
        }
      },
      (...)
    }
  },
  (...)
}

This content can be reached at the following addresses:

ElasticSearch provides a set of parameters at the property level to control how fields are handled in the index. Here are the main families of parameters:

  • Property types
  • Property type auto detection
  • Way to index properties (analysis, storage, name and so on)
  • Dynamic mappings

Let's start with types. Of course, common primitive types are supported:

  • Strings with type string.
  • Numbers with types integer, long, float and double.
  • Booleans with type boolean.
  • Dates with the type date. ElasticSearch provides date detection and automatic conversion from string to date values. Note that dates can also be stored as fields of type long.

In addition to primitive types, ElasticSearch supports complex types through the two following types. They correspond to embedded data and make it possible to store JSON documents directly in indices. These two types are similar since they provide the same feature. Only the way the data are stored and the way to query them differ.

  • Type object. In this case, the data are flattened into a single Lucene document.
  • Type nested. In this case, the data are internally split into several Lucene documents for the inner data.

Here are the differences between these two approaches. We will use the example provided by the page Nested type.

// Provided data
{
  "group" : "fans",
  "user" : [
    {
      "first" : "John",
      "last" : "Smith"
    },
    {
      "first" : "Alice",
      "last" : "White"
    }
  ]
}

// Internally indexed document (type object)
{
  "group" : "fans",
  "user.first" : [ "alice", "john" ],
  "user.last" : [ "smith", "white" ]
}

// Internally indexed documents (type nested)
// Hidden child document
{
  "user.first" : "alice",
  "user.last" : "white"
}
// Hidden child document
{
  "user.first" : "john",
  "user.last" : "smith"
}
// Visible parent document
{
  "group" : "fans"
}

Note that these differences must be taken into account when building queries, according to the type chosen for inner data.
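To make the impact on queries more concrete, here is a minimal sketch that mimics both storage strategies with plain Java collections. It is only an illustration under simplifying assumptions: the class ObjectVersusNestedSketch and its methods are hypothetical names, and nothing here reflects how ElasticSearch or Lucene are actually implemented.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Illustration only: plain collections standing in for the index internals.
final class ObjectVersusNestedSketch {

    // Builds one inner "user" object (hypothetical helper).
    static Map<String, String> user(String first, String last) {
        Map<String, String> user = new LinkedHashMap<>();
        user.put("first", first);
        user.put("last", last);
        return user;
    }

    // Type object: the values of all inner objects are merged per field name.
    static Map<String, List<String>> flatten(List<Map<String, String>> users) {
        Map<String, List<String>> flat = new LinkedHashMap<>();
        for (Map<String, String> user : users) {
            for (Map.Entry<String, String> entry : user.entrySet()) {
                flat.computeIfAbsent("user." + entry.getKey(), key -> new ArrayList<>())
                    .add(entry.getValue().toLowerCase(Locale.ROOT));
            }
        }
        return flat;
    }

    // "first AND last" evaluated against the single flattened document.
    static boolean matchesAsObject(List<Map<String, String>> users, String first, String last) {
        Map<String, List<String>> flat = flatten(users);
        return flat.getOrDefault("user.first", Collections.<String>emptyList()).contains(first)
            && flat.getOrDefault("user.last", Collections.<String>emptyList()).contains(last);
    }

    // Type nested: "first AND last" must hold within the same inner object.
    static boolean matchesAsNested(List<Map<String, String>> users, String first, String last) {
        for (Map<String, String> user : users) {
            if (first.equalsIgnoreCase(user.get("first"))
                    && last.equalsIgnoreCase(user.get("last"))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<Map<String, String>> users
            = Arrays.asList(user("John", "Smith"), user("Alice", "White"));
        System.out.println("object: " + matchesAsObject(users, "alice", "smith"));
        System.out.println("nested: " + matchesAsNested(users, "alice", "smith"));
    }
}
```

With the sample data, the flattened variant reports a match for the combination alice / smith although no such user exists, whereas the nested variant doesn't. This is the kind of cross-object match that the type nested is meant to avoid.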

Regarding arrays, there is no specific configuration since they are natively and transparently supported. This can be disturbing at first sight since it's not explicitly described in the mapping. The following mapping can store either a single value or an array for the field property1a:

"type1": {
  "properties": {
    "property1a": { "type":"double" },
    (...)
  }
}

It accepts both of the following contents:

// Without array
{
  "property1a": 20.7,
  (...)
}
// With array
{
  "property1a": [ 20.7, 43.8 ],
  (...)
}

Note that this approach also applies to complex types.
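Since the mapping doesn't tell whether a field holds one value or several, code that reads documents back has to cope with both shapes. A possible way to do so, sketched here with a hypothetical helper (FieldValuesSketch.asList is not part of the ElasticSearch client), is to normalize every field value to a list:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of the ElasticSearch client.
final class FieldValuesSketch {

    // Returns the field content as a list, whether it was a single value or an array.
    static List<Object> asList(Object value) {
        if (value == null) {
            return Collections.emptyList();
        }
        if (value instanceof Collection) {
            return new ArrayList<Object>((Collection<?>) value);
        }
        return Collections.singletonList(value);
    }

    public static void main(String[] args) {
        System.out.println(asList(20.7));
        System.out.println(asList(Arrays.asList(20.7, 43.8)));
    }
}
```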

ElasticSearch also provides type detections:

  • For dates. JSON doesn't support dates, so they must be provided either as long or as string values. For the latter, the engine can detect that the field contains a date and convert it to a long value. The autodetection is controlled by the parameter date_detection. The default value is true. The parameter dynamic_date_formats is also provided to define the patterns that match date values. Note that these two parameters must be defined at the root object level. Here is a sample of use:

// Date detection enabled (default)
{
  "type1" : {
    "dynamic_date_formats" : ["yyyy-MM-dd", "dd-MM-yyyy"],
    "date_detection" : true,
    "properties" : {
      "date" : {"type" : "date"}
    }
  }
}

// Date detection disabled
{
  "type1" : {
    "date_detection" : false,
    "properties" : {
      "date" : {"type" : "string"}
    }
  }
}

  • For numeric values. JSON supports numbers, but they may also be provided as strings. In this case, the engine can detect that the field contains a numeric value and convert it to the appropriate numeric type. This detection is controlled by the parameter numeric_detection, which is disabled by default and must be defined at the root object level. Here is a sample of use:

// Numeric detection enabled
{
  "type1" : {
    "numeric_detection" : true,
    "properties" : {
      "value" : {"type" : "integer"}
    }
  }
}

// Numeric detection disabled (default)
{
  "type1" : {
    "numeric_detection" : false,
    "properties" : {
      "value" : {"type" : "string"}
    }
  }
}
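The two detections described above can be pictured with the following sketch. It is an approximation under stated assumptions: the class TypeDetectionSketch and the method detectType are hypothetical, and the date patterns are parsed with java.time rather than with the date library used by the engine (the two sample patterns read the same way in both).

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.Arrays;
import java.util.List;

// Rough approximation of type detection for string values; not the engine's code.
final class TypeDetectionSketch {

    static String detectType(String value, List<String> dateFormats,
            boolean dateDetection, boolean numericDetection) {
        if (dateDetection) {
            for (String pattern : dateFormats) {
                try {
                    LocalDate.parse(value, DateTimeFormatter.ofPattern(pattern));
                    return "date";
                } catch (DateTimeParseException e) {
                    // Not a date for this pattern, try the next one
                }
            }
        }
        if (numericDetection) {
            try {
                Long.parseLong(value);
                return "long";
            } catch (NumberFormatException e) {
                // Not an integer value, try a floating point one
            }
            try {
                Double.parseDouble(value);
                return "double";
            } catch (NumberFormatException e) {
                // Not a number at all
            }
        }
        return "string";
    }

    public static void main(String[] args) {
        List<String> formats = Arrays.asList("yyyy-MM-dd", "dd-MM-yyyy");
        System.out.println(detectType("2014-09-01", formats, true, false));
        System.out.println(detectType("20.7", formats, true, true));
        System.out.println(detectType("20.7", formats, true, false));
    }
}
```

When both flags are disabled, every string value simply stays a string, which matches the "detection disabled" samples above.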

ElasticSearch also lets us define which value is used in place of null (see this link for more details: Dealing with null values). The parameter null_value configures this case:

{
  "type1" : {
    "properties" : {
      "message" : {
        "type" : "string",
        "null_value" : "na"
      }
    }
  }
}

We can also provide hints on indexing and analyzing:

  • Parameter index_name. This specifies the name of the field in the indexed document.
  • Parameter index. This specifies what must be done during the indexing phase (whether the field must be analyzed or not). The possible values are analyzed, not_analyzed and no (the field is not searchable at all).
  • Parameter store. This specifies if the field must be stored or not in the index. The possible values are true or false.
  • Several parameters target the configuration of analyzers. The parameter analyzer specifies the analyzer to use for both indexing and search. Parameters index_analyzer and search_analyzer can be used to define different analyzers for indexing and search. The following link lists the analyzers provided by default: Built-in analyzers. Note that we can also provide and configure custom ones as described in the page Custom analyzers.

Below is a sample of use of these parameters:

{
  "type1" : {
    "properties" : {
      "message" : {
        "type" : "string",
        "index": "analyzed",
        "store": true,
        "index_name" : "msg",
        "analyzer": "standard"
      }
    }
  }
}
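To see why the value of the parameter index matters for searches, here is a deliberately simplified sketch of the terms that end up in the index for a string field. The class IndexOptionSketch is hypothetical, and the analysis step is only a crude stand-in for the standard analyzer (lowercasing and splitting on non-alphanumeric characters), which does more than that in reality.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Crude illustration of analyzed versus not_analyzed; not the real analysis chain.
final class IndexOptionSketch {

    static List<String> indexedTerms(String value, String index) {
        if ("not_analyzed".equals(index)) {
            // The whole value is kept as a single, exact term
            return Collections.singletonList(value);
        }
        // Any other value is treated as "analyzed" in this sketch:
        // lowercase the value and split it on non-alphanumeric characters
        List<String> terms = new ArrayList<>();
        for (String token : value.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                terms.add(token);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(indexedTerms("Hello World!", "analyzed"));
        System.out.println(indexedTerms("Hello World!", "not_analyzed"));
    }
}
```

An analyzed field can be found through any of its words, whereas a not_analyzed field only matches the exact original value, which is usually what we want for identifiers or codes.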

When dynamic mapping is enabled (the default), we can configure how the mapping is deduced. This feature is called dynamic templates and can be configured at the type level with the parameter dynamic_templates. The following sample describes how to add the parameter store to every field:

{
  "type1" : {
    "dynamic_templates" : [
      {
        "store_generic" : {
          "match" : "*",
          "mapping" : {
            "store" : true
          }
        }
      }
    ]
  }
}

Note that the order of templates matters since only the first matching template is applied.
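The "first match wins" rule can be sketched as follows. This is an illustration only: DynamicTemplateSketch is a hypothetical class, templates are reduced to a match pattern and a mapping fragment, and only the * wildcard is handled.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustration of template ordering; real dynamic templates support more options.
final class DynamicTemplateSketch {

    // Checks a field name against a pattern where * stands for any characters.
    static boolean matches(String pattern, String fieldName) {
        String regex = "\\Q" + pattern.replace("*", "\\E.*\\Q") + "\\E";
        return fieldName.matches(regex);
    }

    // Returns the mapping of the first template matching the field name.
    // The map must preserve insertion order (LinkedHashMap for instance).
    static Map<String, Object> resolve(
            Map<String, Map<String, Object>> templates, String fieldName) {
        for (Map.Entry<String, Map<String, Object>> template : templates.entrySet()) {
            if (matches(template.getKey(), fieldName)) {
                return template.getValue();
            }
        }
        return Collections.emptyMap();
    }

    public static void main(String[] args) {
        Map<String, Map<String, Object>> templates = new LinkedHashMap<>();
        templates.put("*_id", Collections.<String, Object>singletonMap("index", "not_analyzed"));
        templates.put("*", Collections.<String, Object>singletonMap("store", true));
        System.out.println(resolve(templates, "user_id"));
        System.out.println(resolve(templates, "title"));
    }
}
```

In this sketch, a field named user_id only receives the mapping of the *_id template: the generic * template is never evaluated for it, so the most specific templates should come first.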

Finally, ElasticSearch lets us disable dynamic mapping for types, which means that we need to explicitly define the mapping. With the value strict, documents containing undeclared fields are rejected.

{
  "type1" : {
    "dynamic": "strict",
    "properties" : {
      (...)
    }
  }
}
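With a strict mapping, it can be convenient to check documents on the client side before sending them. The following sketch lists the fields of a document that are not declared in the properties of a mapping. StrictMappingSketch is a hypothetical helper and the check is simplified (arrays of inner objects are not inspected, for instance).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical client-side check; the engine performs its own validation.
final class StrictMappingSketch {

    // Returns the dotted paths of document fields missing from the mapping properties.
    static List<String> undeclaredFields(
            Map<String, Object> properties, Map<String, Object> document) {
        List<String> undeclared = new ArrayList<>();
        collect("", properties, document, undeclared);
        return undeclared;
    }

    @SuppressWarnings("unchecked")
    private static void collect(String prefix, Map<String, Object> properties,
            Map<String, Object> document, List<String> undeclared) {
        for (Map.Entry<String, Object> field : document.entrySet()) {
            String path = prefix + field.getKey();
            Object declared = properties == null ? null : properties.get(field.getKey());
            if (!(declared instanceof Map)) {
                undeclared.add(path);
                continue;
            }
            // Descend into inner properties when both sides are complex
            Object inner = ((Map<String, Object>) declared).get("properties");
            if (inner instanceof Map && field.getValue() instanceof Map) {
                collect(path + ".", (Map<String, Object>) inner,
                        (Map<String, Object>) field.getValue(), undeclared);
            }
        }
    }
}
```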

Let's now tackle the way to manage these mappings using the ElasticSearch Java client.

Getting index metadata

The ElasticSearch Java client lets us get index metadata. It calls the URI /indexname/_mapping on an ElasticSearch node. This is implemented using the GetMappingsRequest as described below:

GetMappingsResponse response = client.admin().indices()
        .prepareGetMappings(indexName)
        .execute().actionGet();
ImmutableOpenMap<String, ImmutableOpenMap<String, MappingMetaData>> mappings
        = response.mappings();

We can also specify the types we want to get in the mapping request with the method setTypes, as described below:

String typeName = (...);
GetMappingsResponse response = client.admin().indices()
        .prepareGetMappings(indexName)
        .setTypes(typeName)
        .execute().actionGet();

We can notice that in such case the URI called is /indexname/_mapping/typename.

Now that we have the mappings, we can iterate over the map to get the metadata. Be aware that the structure is always the same whatever the specified parameters (index name, index type). This means that we always browse the mapping structure in the same way. The following code describes how to get the mapping of types within a particular index:

ImmutableOpenMap<String, MappingMetaData> indexMappings
             = mappings.get(indexName);
// Iterate over the types
for (Iterator<String> iterator = indexMappings.keysIt(); iterator.hasNext();) {
    String typeName = iterator.next();
    MappingMetaData mappingMetadata = indexMappings.get(typeName);

    // Get mapping content for the type
    Map<String, Object> source = mappingMetadata.sourceAsMap();
    Map<String, Object> properties
          = (Map<String, Object>) source.get("properties");
    if (properties != null) {
        // Iterate over mapping properties for the type
        for (String propertyName : properties.keySet()) {
            // Values are not all strings (store is a boolean, for example)
            Map<String, Object> property
                = (Map<String, Object>) properties.get(propertyName);
            String type = (String) property.get("type");
        }
    }
}

Note that ElasticSearch supports mapping inner properties (complex types) as well, as described below:

"type2": {
  "properties": {
    "property2a": {
      "properties": {
        "subProperty1": { "type":"double" },
        "subProperty2": { "type":"double" }
      }
    },
    (...)
  }
}

The processing to get metadata for all properties should take care of this. Moreover, we only read the type within the property mappings, but many more hints are available at this level. Let's now have a look at this aspect.

We need to add a bit of recursion to our processing to take the mapping of inner properties into account. There is no restriction on the depth of such properties (inner properties can also contain inner properties).

Below is the new code:

ImmutableOpenMap<String, MappingMetaData> indexMappings
             = mappings.get(indexName);
// Iterate over the types
for (Iterator<String> iterator = indexMappings.keysIt(); iterator.hasNext();) {
    String typeName = iterator.next();
    MappingMetaData mappingMetadata = indexMappings.get(typeName);

    // Get mapping content for the type
    Map<String, Object> source = mappingMetadata.sourceAsMap();
    Map<String, Object> properties
          = (Map<String, Object>) source.get("properties");
    handleTypeMappingProperties(properties);
}

This code is based on the method handleTypeMappingProperties:

private void handleTypeMappingProperties(
        Map<String, Object> properties) {
    if (properties == null || properties.isEmpty()) {
      return;
    }

    for (String propertyName : properties.keySet()) {
        Map<String, Object> property
            = (Map<String, Object>) properties.get(propertyName);

        Map<String, Object> subProperties
            = (Map<String, Object>) property.get("properties");
        if (subProperties != null && !subProperties.isEmpty()) {
            // Handle complex property
            handleTypeMappingProperties(subProperties);
        } else {
            // Handle primitive property
            String type = (String) property.get("type");
        }
    }
}
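The same recursion can be tried without a running node by working directly on a map shaped like the properties element of a mapping. The sketch below collects the type of every primitive property under its dotted path. MappingPathsSketch and its method names are hypothetical, and the dotted path convention is an assumption made for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Standalone variant of the recursive traversal, working on plain maps.
final class MappingPathsSketch {

    // Returns dotted property paths mapped to their declared type.
    static Map<String, String> propertyTypes(Map<String, Object> properties) {
        Map<String, String> types = new LinkedHashMap<>();
        collect("", properties, types);
        return types;
    }

    @SuppressWarnings("unchecked")
    private static void collect(String prefix, Map<String, Object> properties,
            Map<String, String> types) {
        if (properties == null) {
            return;
        }
        for (Map.Entry<String, Object> entry : properties.entrySet()) {
            String path = prefix.isEmpty() ? entry.getKey() : prefix + "." + entry.getKey();
            Map<String, Object> property = (Map<String, Object>) entry.getValue();
            Object subProperties = property.get("properties");
            if (subProperties instanceof Map) {
                // Complex property: recurse into inner properties
                collect(path, (Map<String, Object>) subProperties, types);
            } else {
                // Primitive property: keep its type
                types.put(path, String.valueOf(property.get("type")));
            }
        }
    }
}
```

Applied to the mapping of type2 shown earlier, this yields entries such as property2a.subProperty1 and property2b.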

Custom metadata for types

ElasticSearch lets us attach custom metadata to index types using the property _meta, at the same level as the property properties within the mapping definition for types, as described below:

{
  "blog": {
     "mappings": {
       "article": {
         "properties": {
           (...)
         },
         "_meta": {
           "constraints" : {
             "title": {
               "notnull":true,
               "regexp":"^[a-zA-Z]$"
             }
           }
         }
       }
     }
  }
}

The following code describes how to get the content of custom metadata:

String typeName = iterator.next();
MappingMetaData mappingMetadata = indexMappings.get(typeName);

// Get mapping content for the type
Map<String, Object> source = mappingMetadata.sourceAsMap();
Map<String, Object> customMetadata
          = (Map<String, Object>) source.get("_meta");
(...)
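ElasticSearch doesn't interpret the content of _meta itself, so it is up to the application to give it a meaning. As an illustration, the sketch below applies the constraints of the sample above to a document. The class MetaConstraintsSketch is hypothetical, and the semantics given to notnull and regexp are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical interpretation of the "constraints" entry stored in _meta.
final class MetaConstraintsSketch {

    // Returns one message per violated constraint, or an empty list.
    @SuppressWarnings("unchecked")
    static List<String> validate(Map<String, Object> customMetadata,
            Map<String, Object> document) {
        List<String> errors = new ArrayList<>();
        Object constraints = customMetadata == null ? null : customMetadata.get("constraints");
        if (!(constraints instanceof Map)) {
            return errors;
        }
        for (Map.Entry<String, Object> entry : ((Map<String, Object>) constraints).entrySet()) {
            String field = entry.getKey();
            Map<String, Object> rules = (Map<String, Object>) entry.getValue();
            Object value = document.get(field);
            if (value == null) {
                // Assumed meaning of notnull: the field must be present
                if (Boolean.TRUE.equals(rules.get("notnull"))) {
                    errors.add(field + " must not be null");
                }
                continue;
            }
            // Assumed meaning of regexp: the whole value must match
            Object regexp = rules.get("regexp");
            if (regexp != null && !String.valueOf(value).matches(String.valueOf(regexp))) {
                errors.add(field + " does not match " + regexp);
            }
        }
        return errors;
    }
}
```

The map passed as first argument corresponds to the customMetadata variable obtained in the previous snippet.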

To go further, we can refer to the following post about data validation, Implementing data validation in ElasticSearch.

Updating index metadata

The ElasticSearch Java client lets us put index metadata, but only for a specific type. It calls the URI /indexname/mytype/_mapping on an ElasticSearch node.

This is implemented using the PutMappingRequest as described below:

PutMappingResponse response = client.admin().indices()
        .preparePutMapping(indexName).setSource(mapping)
        .setType(typeName).execute().actionGet();

The mapping provided to the method setSource can simply be a map, as described below:

Map<String, Object> mapping = new HashMap<String, Object>();

Map<String, Object> type1MappingContent = new HashMap<String, Object>();
mapping.put("type1", type1MappingContent);
Map<String, Object> type1Properties = new HashMap<String, Object>();
type1MappingContent.put("properties", type1Properties);

addPropertyParameter(type1Properties, "property1a", "double");
addPropertyParameter(type1Properties, "property1b", "long");
addPropertyParameter(type1Properties, "property1c", "long");
addPropertyParameter(type1Properties, "property1d",
        "string", "not_analyzed", true);

Here is the content of the methods addPropertyParameter:

private void addPropertyParameter(
        Map<String, Object> properties,
        String name, String type) {
    addPropertyParameter(properties, name, type, null, null);
}

private void addPropertyParameter(
        Map<String, Object> properties,
        String name, String type, String index, Boolean store) {
    Map<String, Object> propertyContent = new HashMap<String, Object>();
    properties.put(name, propertyContent);
    propertyContent.put("type", type);
    if (index != null) {
        propertyContent.put("index", index);
    }
    if (store != null) {
        propertyContent.put("store", store);
    }
}

Note that we can recursively nest maps within maps to support complex types, as described below:

Map<String, Object> mapping = new HashMap<String, Object>();

Map<String, Object> type2MappingContent = new HashMap<String, Object>();
mapping.put("type2", type2MappingContent);
Map<String, Object> type2Properties = new HashMap<String, Object>();
type2MappingContent.put("properties", type2Properties);

// property2a is a complex property holding two inner properties
Map<String,Object> property2aProperties
        = addComplexPropertyParameter(type2Properties, "property2a");
addPropertyParameter(property2aProperties, "subProperty1", "double");
addPropertyParameter(property2aProperties, "subProperty2", "double");
addPropertyParameter(type2Properties, "property2b", "long");

Here is the content of the method addComplexPropertyParameter:

private Map<String,Object> addComplexPropertyParameter(
        Map<String, Object> properties,
        String name) {
    Map<String, Object> propertyContent = new HashMap<String, Object>();
    properties.put(name, propertyContent);
    Map<String, Object> subProperties = new HashMap<String, Object>();
    propertyContent.put("properties", subProperties);
    return subProperties;
}

Configuring schema at engine level

Finally, note that we can explicitly configure default mappings in the configuration folder of ElasticSearch. It's a bit outside the scope of this post but needs to be mentioned. For more details, have a look at the page Default mappings.
