Implementing data validation in ElasticSearch

ElasticSearch is a schema-less engine, which means that no data validation is done before indexing documents in the underlying Lucene indices. In some cases, it is important to enforce validation at the data store itself to ensure that the data is correct. We describe here a way to implement such an approach, based on the type mapping and a dedicated plugin.

We will use documents of type article with the following structure (an example document is shown after the list):

  • id: the identifier of the article
  • title: its title
  • content: its content
  • tags: its associated tags, used to categorize the article
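
For illustration, such a document could look like the following (the values here are made up, not taken from the original post):

{
  "id": "1",
  "title": "My first article",
  "content": "Some content about ElasticSearch validation",
  "tags": ["elasticsearch", "validation"]
}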

Defining constraints on fields in metadata

ElasticSearch implicitly manages a type mapping that defines the rules to apply when storing data in indices. This mapping is automatically created when documents are first indexed for a type, but it can also be updated manually through the REST API.

For a type article in an index blog, the corresponding mapping is available at the address http://localhost:9200/blog/article/_mapping. The entries under the properties block correspond to the mapping itself. ElasticSearch also allows a _meta block to store metadata associated with the type. The structure of this metadata is free and consists of an arbitrary JSON element.

The following snippet describes the mapping of our type article in ElasticSearch:

{ "blog" : {
  "mappings" : {
    "article" : {
      "properties":{
        "id":{"type":"string"},
        "title":{"type":"string"},
        "content":{"type":"string"},
        "tags":{ "type":"string"}
      }
    }
  }
}}

Under the _meta block, we define the rules to apply to the fields of the type article. We define only two kinds of constraints here (notnull and regexp), but this can be extended according to the needs of the application.

The following snippet shows the previous mapping of our type article, extended with a _meta block defining field constraints:

{ "blog": {
  "mappings": {
    "article": {
      "properties": {
        (...)
      },
      "_meta": {
        "constraints" : {
          "title":{
            "notnull":true,
            "regexp":"^[a-zA-Z]$"
          }
        }
      }
    }
  }
}}
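
As a side note, the _meta block can be added to an existing mapping through the put-mapping API, for instance with a PUT request on http://localhost:9200/blog/article/_mapping. A request body such as the following (reusing the constraints defined above) should do; this is a sketch rather than a command taken from the original post:

{ "article": {
  "_meta": {
    "constraints": {
      "title": {
        "notnull": true,
        "regexp": "^[a-zA-Z]$"
      }
    }
  }
}}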

Now that we have configured field constraints for our type article, we need to implement the processing that checks these constraints when documents are indexed or updated.

Constraint checking plugin

We don’t describe the basics of plugin implementation here; a minimal plugin class wiring the pieces together is sketched after the REST handler below.

Constraint checking filter

A REST filter is suitable for implementing preprocessing for both indexing and updating documents. Within its process method, we will be able to check whether the configured constraints are satisfied for these kinds of operations.

The following snippet describes the skeleton of a REST filter in ElasticSearch:

public class ConstraintsRestFilter extends RestFilter {
    private Client client;

    public ConstraintsRestFilter(Client client) {
        this.client = client;
    }

    @Override
    public void process(RestRequest request, RestChannel channel,
                        RestFilterChain chain) {
        (....)
    }
}

This filter can be simply registered within our REST handler ConstraintsHandler, as described below:

public class ConstraintsHandler extends BaseRestHandler {

    @Inject
    public ConstraintsHandler(Settings settings, Client client,
                              RestController controller) {
        super(settings, client);

        controller.registerFilter(new ConstraintsRestFilter(client));
    }

    (...)
}
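
To wire everything into ElasticSearch, the handler is exposed through a plugin class. The following is only a minimal sketch assuming the ElasticSearch 1.x plugin API; the class name ConstraintsPlugin, the plugin name and the description string are assumptions, not taken from the original post:

import org.elasticsearch.plugins.AbstractPlugin;
import org.elasticsearch.rest.RestModule;

// Hypothetical plugin entry point (class name is an assumption)
public class ConstraintsPlugin extends AbstractPlugin {

    @Override
    public String name() {
        return "constraints-plugin";
    }

    @Override
    public String description() {
        return "Checks field constraints defined in the _meta block of mappings";
    }

    // Called by ElasticSearch through reflection; registering the handler
    // lets its constructor install the ConstraintsRestFilter.
    public void onModule(RestModule module) {
        module.addRestAction(ConstraintsHandler.class);
    }
}

The plugin class is then declared as the entry point in the plugin's es-plugin.properties file.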

Now that we have implemented the foundations of our constraints plugin, it’s time to tackle the details of its processing:

  • specifying when the filter must check constraints
  • loading the configuration of constraints from mapping
  • checking constraints on the provided document

Checking constraints

Within the process method of our REST filter, we need to determine when constraint checking must be applied. This corresponds to detecting when a request to index or update data is sent. Let’s call a method mustCheckConstraints at the very beginning of the process method of our filter.

@Override
public void process(RestRequest request, RestChannel channel,
                    RestFilterChain chain) {

    if (mustCheckConstraints(request)) {
        // The path has the form /{index}/{type}[/{id}]; the leading '/'
        // produces an empty first token, so index and type are at 1 and 2
        String[] tokens = request.path().split("/");
        String index = tokens[1];
        String type = tokens[2];

        try {
            if (!validateContent(request, index, type)) {
                channel.sendResponse(new StringRestResponse(
                                         RestStatus.BAD_REQUEST));
                return;
            }
        } catch (IOException e) {
            // The content could not be parsed at all
            channel.sendResponse(new StringRestResponse(
                                     RestStatus.BAD_REQUEST));
            return;
        }
    }

    chain.continueProcessing(request, channel);
}

The method mustCheckConstraints is responsible for determining whether we need to check constraints for the provided request. This part is tricky, since we must detect that the request corresponds to either a document indexing or a document update. We base this detection on the URI patterns registered by default in ElasticSearch; be aware that registering additional endpoints could affect it.

We consider that no segment of the path may begin with the character _, since such segments correspond to features other than indexing or updating data. The path then has the form /{index}/{type} for an indexing operation with the HTTP method POST, and /{index}/{type}/{id} for an update operation with the HTTP method PUT. Because of the leading slash, splitting the path produces an empty first token, so these two cases correspond respectively to three and four tokens.

The following snippet describes the implementation of the method mustCheckConstraints.

private boolean mustCheckConstraints(RestRequest request) {
    String[] tokens = request.path().split("/");
    for (String token : tokens) {
        // Segments starting with '_' (_search, _mapping, ...) correspond
        // to features other than indexing or updating data
        if (token.startsWith("_")) {
            return false;
        }
    }

    // The leading '/' yields an empty first token:
    // /{index}/{type}       -> 3 tokens (indexing with POST)
    // /{index}/{type}/{id}  -> 4 tokens (update with PUT)
    if (tokens.length == 3 && request.method().equals(Method.POST)) {
        return true;
    } else if (tokens.length == 4 && request.method().equals(Method.PUT)) {
        return true;
    }

    return false;
}

The method validateContent actually applies the constraints to the submitted content. We will describe its implementation in the last section of this article. Let’s now deal with how to load the constraint configuration from the metadata of the mapping.

Loading constraint configuration

The first step in loading the constraint configuration consists of loading the mapping using the Java client of ElasticSearch. The latter provides a GetMappingsRequest for this purpose.

The following snippet describes how to do it:

private GetMappingsResponse loadMappings(
                     RestRequest request, String index) {
    GetMappingsRequest getMappingsRequest = new GetMappingsRequest();
    getMappingsRequest.indices(new String[] { index }).local(false);
    ActionFuture<GetMappingsResponse> res = client.admin().indices()
               .getMappings(getMappingsRequest);
    return res.actionGet();
}

Now that the mapping is loaded, we can extract from it the _meta block and our constraint configuration. For this, we need to browse the nested maps of the mapping. The source map for the specified index and type contains the raw configuration of our constraints.

private Map<String, Object> getConstraints(
                     GetMappingsResponse getMappingsResponse,
                     String index, String type) {
    ImmutableOpenMap<String, ImmutableOpenMap<String,
         MappingMetaData>> mappings = getMappingsResponse.mappings();
    ImmutableOpenMap<String, MappingMetaData> indexMappings
                                = mappings.get(index);
    MappingMetaData mappingMetaData = indexMappings.get(type);
    try {
        Map<String, Object> map = mappingMetaData.getSourceAsMap();
        Map<String, Object> metadata
                                = (Map<String, Object>) map.get("_meta");
        if (metadata != null) {
            return (Map<String, Object>) metadata.get("constraints");
        }
    } catch (Exception ex) {
        (...)
    }
    return new HashMap<String, Object>();
}

Based on the map of constraints, we can build a list of constraint objects to be checked per field. This list is easier to work with than the raw configuration map.
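
The TypeFieldConstraints class itself is not shown in the original post; it is just a holder for the constraints of one field. A minimal sketch, in which the notNull property is an assumption matching the notnull constraint defined earlier, could be:

// Hypothetical holder for the constraints configured on a single field
public class TypeFieldConstraints {
    private String fieldName;
    private boolean notNull;
    private String regexp;

    public String getFieldName() { return fieldName; }
    public void setFieldName(String fieldName) { this.fieldName = fieldName; }

    public boolean isNotNull() { return notNull; }
    public void setNotNull(boolean notNull) { this.notNull = notNull; }

    public String getRegexp() { return regexp; }
    public void setRegexp(String regexp) { this.regexp = regexp; }
}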

Moreover, for simplicity, constraints are only loaded for first-level fields. To go beyond this, the method getAllTypeFieldConstraints would have to be made recursive so that it traverses the full depth of the configuration.

private List<TypeFieldConstraints> getAllTypeFieldConstraints(
                              Map<String, Object> constraints) {
    List<TypeFieldConstraints> allConstraints
                  = new ArrayList<TypeFieldConstraints>();

    for (String fieldName : constraints.keySet()) {
        TypeFieldConstraints fieldConstraints = new TypeFieldConstraints();
        allConstraints.add(fieldConstraints);
        fieldConstraints.setFieldName(fieldName);
        Map<String, Object> values
                      = (Map<String, Object>) constraints.get(fieldName);
        fieldConstraints.setRegexp((String) values.get("regexp"));
    }
    return allConstraints;
}

We now have the constraint configuration loaded. Let’s check the constraints against the provided document.

Implementing constraint checking

After loading the constraints from the metadata, we need to parse the input content. For this, we use the ElasticSearch API to create a dedicated parser. The latter allows parsing the JSON content using a token-based approach; we can distinguish tokens for field names from tokens for field values.

When a field-name token occurs, we look up the constraints for the corresponding field, and we then use them to validate its value when the matching field-value token occurs. The following snippet describes the implementation of this processing.

private boolean validateContent(RestRequest request,
                   final String index,
                   final String type) throws IOException {
    // Load constraints from metadata
    GetMappingsResponse getMappingsResponse = loadMappings(request, index);
    Map<String, Object> constraintsConfig = getConstraints(
                            getMappingsResponse, index, type);
    List<TypeFieldConstraints> constraints
                            = getAllTypeFieldConstraints(constraintsConfig);

    // Create parser from source content
    BytesReference source = request.content();
    XContentType xContentType = XContentFactory.xContentType(source);
    XContent xContent = XContentFactory.xContent(xContentType);
    XContentParser parser = xContent.createParser(source);

    XContentParser.Token t = parser.nextToken();
    if (t == null) {
        throw new IllegalArgumentException("Unable to parse input data");
    }

    // Parse the content and check constraints for fields
    String currentFieldName = null;
    TypeFieldConstraints currentConstraints = null;
    while ((t = parser.nextToken()) != XContentParser.Token.END_OBJECT) {
        if (t == XContentParser.Token.FIELD_NAME) {
            currentFieldName = parser.currentName();
            currentConstraints = getTypeFieldConstraints(constraints,
                                currentFieldName);
        } else if (t == XContentParser.Token.VALUE_STRING) {
            String value = parser.text();
            if (currentConstraints != null
                    && currentConstraints.getRegexp() != null) {
                if (!value.matches(currentConstraints.getRegexp())) {
                    return false;
                }
            }
        }
    }

    return true;
}

You can notice that, to keep the processing simple and readable, we only validate fields immediately under the root level of our JSON objects.
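
The lookup helper getTypeFieldConstraints used above is not shown either; a minimal sketch returning the constraints configured for a given field, or null when there are none, could be:

private TypeFieldConstraints getTypeFieldConstraints(
                   List<TypeFieldConstraints> constraints,
                   String fieldName) {
    for (TypeFieldConstraints fieldConstraints : constraints) {
        if (fieldConstraints.getFieldName().equals(fieldName)) {
            return fieldConstraints;
        }
    }
    // No constraints configured for this field
    return null;
}

With all the pieces in place, indexing or updating a document whose title does not match the configured regular expression is rejected with a 400 Bad Request response.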
