Configure your Elasticsearch/OpenSearch cluster credentials in both the core and apps Helm values:
global:
  storage:
    events:
      driver: elasticsearch # or opensearch
      existingSecret: "core-prismeai-events-store"
      prefix: "" # Optional, set when sharing the cluster
prismeai-crawler and prismeai-searchengine consume ELASTIC_INDICES_PREFIX to namespace their indices when sharing a cluster.
See Helm install for the full install context.
Prisme.ai automatically configures ILM policies to automate index rollover and segment merging when a primary shard reaches 40 GB, as recommended by Elasticsearch/OpenSearch.
Our Elasticsearch driver also configures an ILM policy to automate events deletion 30 days (default, configurable with EVENTS_SCHEDULED_DELETION_DAYS) after workspace deletion. This is not yet supported by the OpenSearch driver, which deletes events as soon as the workspace is deleted.
Events expiration is not configured through ILM, as ILM policies do not offer the precision needed to tune different expiration periods for different kinds of data.
Instead, events expiration is enforced by prismeai-events/sys/cleanup/* APIs which are automatically called from a Kubernetes CronJob as described below.
In addition to ILM, we provide a lightweight Kubernetes-native cleaner service that automatically:
Deletes expired events to enforce data retention regulations (e.g. GDPR).
Deletes datastreams from small & inactive workspaces to reduce shard usage and avoid reaching the 1000 shards per node limit.
Removes payload and output fields from runtime.automations.executed technical events to save disk space without compromising audit/debug capabilities in the short term.
These 3 tasks are configurable from helm values:
prismeai-events:
  ...
  events:
    cleanupjob: true # Create a cronjob to call /cleanup API in order to regularly apply retention, clean unused & inactive workspaces (see EVENTS_CLEANUP_* vars)
    ...
    # 1. Delete events older than 3 years
    retention: 1080
    # 2. Delete all events from small AND inactive workspaces:
    workspaceMaxEvents: 50 # with max N events
    workspaceInactivityDays: 30 # & inactive for N days
    # 3. Delete payload & output fields from all runtime.automations.executed events older than:
    automationExecutedExpiration: '14d'
Steps 1 and 2 are executed from a cleanup-es-indices Kubernetes CronJob scheduled every Sunday at midnight, while step 3 is executed from a cleanup-exec-events CronJob every night at 3:30 AM.
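Step 2 combines two conditions: a workspace is only cleaned when it is both small and inactive. The following Python sketch illustrates that reading of the two settings. The function name and the boundary semantics (`<=` on the event count, `>=` on the inactivity period) are assumptions made for illustration, not the cleaner's actual implementation.

```python
from datetime import datetime, timedelta, timezone


def is_cleanup_candidate(event_count, last_event_at, now,
                         workspace_max_events=50,
                         workspace_inactivity_days=30):
    """Return True when a workspace is both small AND inactive.

    Hypothetical helper: the thresholds mirror the workspaceMaxEvents and
    workspaceInactivityDays values shown above; the exact comparisons are assumed.
    """
    small = event_count <= workspace_max_events
    inactive = (now - last_event_at) >= timedelta(days=workspace_inactivity_days)
    return small and inactive


now = datetime(2024, 1, 31, tzinfo=timezone.utc)
# Small and inactive for 45 days: candidate for deletion.
print(is_cleanup_candidate(10, now - timedelta(days=45), now))
# Small but active 5 days ago: kept.
print(is_cleanup_candidate(10, now - timedelta(days=5), now))
```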
Scale your search cluster by adding more nodes and optimizing node roles. Configure dedicated master nodes for cluster management and data nodes for storage and search operations.
Optimize index settings including primary shard count, replica count, and refresh intervals based on your data volume and query patterns.
Implement Index Lifecycle Management (ILM) policies to automatically manage index aging, including hot, warm, cold, and delete phases.
Here’s a typical configuration to apply to an index (or datastream, as for Knowledges) to improve write performance:
Retrieve your Knowledges (or other) index template configuration:
GET _index_template/index-template-events-<workspaceId>
Keep it, adjust existing configuration as needed and add the last template settings:
Here, we configure the index template with 3 primary shards and 1 replica per primary, allowing you to distribute write traffic to all of your 3 nodes.
Decrease index.number_of_shards to 2 if you only have 2 nodes.
index.refresh_interval configures how often Elasticsearch will make your freshly written data available for search.
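The settings described above can be sketched in Python as a fragment to merge under `template.settings` of the index template retrieved earlier. This is a minimal sketch: the helper name is hypothetical, the "one primary shard per data node" rule generalizes the 3-node and 2-node cases mentioned above, and the `30s` refresh interval is an assumed value to tune for your own workload.

```python
import json


def write_optimized_settings(data_nodes=3, refresh_interval="30s"):
    """Build the index settings fragment to merge into the template.

    Hypothetical helper: refresh_interval="30s" is an assumed example value.
    """
    if data_nodes < 1:
        raise ValueError("data_nodes must be >= 1")
    return {
        "index": {
            "number_of_shards": data_nodes,        # one primary shard per data node
            "number_of_replicas": 1,               # one replica per primary
            "refresh_interval": refresh_interval,  # longer interval, fewer refreshes
        }
    }


print(json.dumps(write_optimized_settings(), indent=2))
```

The resulting fragment would then be sent back with a `PUT _index_template/<name>` request together with the rest of the existing template body.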
Rollover your datastream in order to create a new index with the updated template:
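A rollover is triggered with `POST /<datastream>/_rollover`. The Python sketch below only builds that request. The base URL and datastream name are placeholders, and authentication is left out because it depends on your cluster setup.

```python
import urllib.request


def rollover_request(base_url, data_stream):
    """Build (without sending) a POST /<data_stream>/_rollover request."""
    url = f"{base_url.rstrip('/')}/{data_stream}/_rollover"
    return urllib.request.Request(url, method="POST")


# Placeholder URL and datastream name, for illustration only.
req = rollover_request("http://localhost:9200", "events-my-workspace")
print(req.get_method(), req.full_url)
```

Passing the request to `urllib.request.urlopen(req)` would send it, once the credentials your cluster expects have been added.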
When running a self-hosted Elasticsearch or OpenSearch cluster, ensure nodes are distributed across different physical machines for proper redundancy. Use high-performance disks and monitor CPU iowait metrics to identify potential disk bottlenecks that could impact search performance.
Pay attention to cluster health metrics and ensure adequate disk space for index growth and operations like merging and replication.
Elasticsearch limits each node to 1000 shards by default. List your shards and count them with:
GET /_cat/shards
A 3-node cluster can therefore hold at most 3000 shards. The events cleanup CronJob keeps this under control; if you approach the limit on a tenant you may need to consolidate older data streams or raise the per-node cap.
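The arithmetic above can be scripted against the JSON form of the same endpoint (`GET /_cat/shards?format=json`). The Python sketch below is illustrative only: the helper name and the sample payload are made up, and the default per-node cap of 1000 should be replaced by your own `cluster.max_shards_per_node` value if you changed it.

```python
import json


def shard_usage(cat_shards_json, node_count, max_shards_per_node=1000):
    """Compare the number of shards with the cluster-wide limit.

    cat_shards_json is the body returned by GET /_cat/shards?format=json.
    """
    shards = json.loads(cat_shards_json)
    limit = node_count * max_shards_per_node
    return {
        "total": len(shards),
        "limit": limit,
        "remaining": limit - len(shards),
    }


# Made-up sample: one primary plus one replica for events-a, one primary for events-b.
sample = json.dumps([
    {"index": "events-a", "shard": "0", "prirep": "p", "state": "STARTED"},
    {"index": "events-a", "shard": "0", "prirep": "r", "state": "STARTED"},
    {"index": "events-b", "shard": "0", "prirep": "p", "state": "STARTED"},
])
usage = shard_usage(sample, node_count=3)
print(usage)
```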
Move an Elasticsearch node without losing redundancy
To migrate an Elasticsearch node to a different Kubernetes node without ever dropping below the original replica count (so shards keep two live copies at all times), proceed as follows:
Provision the new Kubernetes node.
Increase the Elasticsearch nodeSet.replicas by one (e.g. kubectl -n <ns> edit elasticsearch/core).
Wait for shards to rebalance onto the new ES node (GET /_cat/allocation/?v — relocating shards should drop to 0 for a few minutes).
Drain the Kubernetes node you want to retire (kubectl drain node/<name>). The old ES pod terminates and shards relocate to the remaining nodes.
The StatefulSet will try to recreate the missing pod elsewhere — it will stay Pending because no node has the resources requested.
Decrease the Elasticsearch nodeSet.replicas back to the original count. The Pending pod is removed cleanly without disturbing the running ones.
Delete the drained Kubernetes node.
Relocating shards takes hours on a large cluster (ES processes them two at a time). If you don’t need zero-downtime relocation, the simpler path is to delete the target node first and let the cluster rebalance with one fewer node — at the cost of a short window where some shards run on a single replica.
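Before draining the old Kubernetes node, it helps to confirm that rebalancing has really finished. The procedure above relies on `GET /_cat/allocation`; as an alternative check, the Python sketch below inspects a `GET /_cluster/health` response instead. The helper name and the sample payloads are illustrative assumptions.

```python
def safe_to_drain(health):
    """Return True when the cluster is green and no shard is moving or missing.

    health is the parsed JSON body of GET /_cluster/health.
    Missing fields are treated as unsafe.
    """
    return (
        health.get("status") == "green"
        and health.get("relocating_shards") == 0
        and health.get("initializing_shards") == 0
        and health.get("unassigned_shards") == 0
    )


# Made-up sample responses, reduced to the fields used here.
rebalancing = {"status": "green", "relocating_shards": 2,
               "initializing_shards": 0, "unassigned_shards": 0}
settled = {"status": "green", "relocating_shards": 0,
           "initializing_shards": 0, "unassigned_shards": 0}
print(safe_to_drain(rebalancing), safe_to_drain(settled))
```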
When formatting the Elasticsearch/OpenSearch filesystem volume, it is important to first shut down the prismeai-events microservice in the core namespace. This can be easily done from Kubernetes by editing the deployment and setting replicas: 0.
Index mappings are initialized when prismeai-events starts up. If indexes and index mappings are deleted (such as when formatting a volume) without first stopping prismeai-events, the next event persistence request sent to the cluster causes it to automatically infer an incorrect index mapping, which makes most other persistence requests fail and results in data loss. This is because, when Prisme.ai events mappings are not initialized before the first event write requests, ES/OS automatically infers mappings for payload.* nested fields, which is incompatible with the required flattened (ES) / flat_object (OS) mapping on the entire payload field.
This situation causes errors like Limit of total fields [1000] has been exceeded, as ES/OS tries to map every single payload.* nested field until it reaches the 1000-field limit.
If it is possible to delete the events data (which includes workspace debug events and AIK usage metrics), this can be easily solved by:
Shutting down prismeai-events by editing the deployment and setting replicas: 0. If a HorizontalPodAutoscaler exists for prismeai-events, first delete it or set its min/max replicas to 0.
Removing every failed index/datastream either from Kibana or curl:
Failed index names can be found in _index field of prismeai-events error logs.
Names starting with .ds-events- are the underlying indexes of a datastream. Map them to the datastream name like this to delete all underlying indexes at once: .ds-events-<id>-000001 -> events-<id>
Restarting prismeai-events by editing the deployment and setting replicas: 1
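The renaming rule from step 2 can be automated when many failed indexes appear in the logs. The Python sketch below is a hypothetical helper: it strips the `.ds-` prefix and the six-digit generation suffix, and also tolerates the `yyyy.MM.dd` date component that recent Elasticsearch versions insert in underlying index names.

```python
import re

# .ds-<datastream>[-yyyy.MM.dd]-<6-digit generation>
_BACKING_INDEX = re.compile(
    r"^\.ds-(?P<stream>.+?)(?:-\d{4}\.\d{2}\.\d{2})?-(?P<generation>\d{6})$"
)


def data_stream_name(index_name):
    """Map an underlying index name to its datastream name.

    Names that do not look like underlying indexes are returned unchanged.
    """
    match = _BACKING_INDEX.match(index_name)
    return match.group("stream") if match else index_name


def delete_path(index_name):
    """Path of the DELETE request removing the whole datastream."""
    return f"/_data_stream/{data_stream_name(index_name)}"


print(delete_path(".ds-events-abc123-000001"))
```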
Follow these steps in order to reindex a workspace events datastream with the default index settings & mappings initialized (or updated) by prismeai-events:
Find the correct name for the datastream you want to reindex and make sure it exists:
GET /events-<id1>/_search
List existing index templates and make sure an index template exists for your events-* pattern:
Some workspaces, like Knowledges, have a dedicated index template tuned for their needs, with the workspace id included in their index & component template names.
This index template automatically applies its configuration and component templates (composed_of) to all indices matching index_patterns.
The component template is where Prisme.ai custom index settings & mappings are configured.
Create a temporary datastream:
PUT _data_stream/events-<id1>-tmp
Make sure your temporary index name matches the index_patterns seen above so this new datastream will inherit default index settings & mappings.
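A quick way to check a candidate name against the `index_patterns` of a template is sketched below in Python. It approximates Elasticsearch wildcard matching with `fnmatchcase`, which also gives special meaning to `?` and `[...]`, so treat it as a sanity check only. The `events-*` pattern and the names used are example values.

```python
from fnmatch import fnmatchcase


def matches_index_patterns(name, index_patterns):
    """Return True when name matches at least one template index pattern."""
    return any(fnmatchcase(name, pattern) for pattern in index_patterns)


# Example values: replace with the index_patterns of your own template.
patterns = ["events-*"]
print(matches_index_patterns("events-abc123-tmp", patterns))
print(matches_index_patterns("tmp-events-abc123", patterns))
```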
Reindex your data from the current to the temporary & remapped datastream:
You can optionally add a query filter to the source, which is very useful if you want to drop all error & execution events (which can take a lot of disk space & are only useful for debugging the last few days of activity):
When retrying this request multiple times (for example with a different source query to drop documents incompatible with the new mapping), you can add a "conflicts": "proceed" option to the body in order to ignore documents already created in the destination index.
A response with {"failures": []} indicates that all data has been reindexed & matches the destination mapping.
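The body of the `POST /_reindex` request discussed above could be assembled as in the Python sketch below. The helper is hypothetical. The `"op_type": "create"` entry is required by Elasticsearch when the destination is a datastream, while the `type` field name and the `error` value used in the example filter are assumptions about the events schema that should be checked against your own documents.

```python
import json


def reindex_body(source, dest, query=None, proceed_on_conflicts=False):
    """Build the JSON body of a POST /_reindex request targeting a datastream."""
    body = {
        "source": {"index": source},
        # Datastreams only accept documents written with op_type=create.
        "dest": {"index": dest, "op_type": "create"},
    }
    if query is not None:
        body["source"]["query"] = query
    if proceed_on_conflicts:
        # Ignore documents already present in the destination when retrying.
        body["conflicts"] = "proceed"
    return body


# Assumed field name and values: drop error & execution events.
drop_debug_events = {
    "bool": {
        "must_not": [
            {"terms": {"type": ["error", "runtime.automations.executed"]}}
        ]
    }
}
body = reindex_body("events-<id1>", "events-<id1>-tmp",
                    query=drop_debug_events, proceed_on_conflicts=True)
print(json.dumps(body, indent=2))
```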
In case of a mismatch between the source data and the destination mapping, you can receive an error response like this:
{
  "took": 79,
  "timed_out": false,
  "total": 12,
  "updated": 0,
  "created": 6,
  "deleted": 0,
  "batches": 1,
  "version_conflicts": 0,
  "noops": 0,
  "retries": { "bulk": 0, "search": 0 },
  "throttled_millis": 0,
  "requests_per_second": -1,
  "throttled_until_millis": 0,
  "failures": [
    {
      "index": ".ds-test-events-test-000001",
      "id": "1757668991388-0",
      "cause": {
        "type": "mapper_parsing_exception",
        "reason": "failed to parse field [payload.output] of type [double] in document with id '1757668991388-0'. Preview of field's value: '{some={nested=field}, foo=bar}'",
        "caused_by": {
          "type": "json_parse_exception",
          "reason": "Current token (START_OBJECT) not numeric, can not use numeric value accessors\n at [Source: REDACTED (`StreamReadFeature.INCLUDE_SOURCE_IN_LOCATION` disabled); line: 1, column: 530]"
        }
      },
      "status": 400
    },
    ...
  ]
}
Either adapt the destination mapping or filter out these documents using the source query.
Delete the current datastream:
DELETE /_data_stream/events-<id1>
Clone our temporary datastream to the “current” datastream exactly like we previously did the other way around:
hits.total.value: the number of matching documents
aggregations.latestDate.value_as_string: the latest matching document date
aggregations.oldestDate.value_as_string: the oldest matching document date
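The three fields listed above can be collected with a search that uses `min` and `max` aggregations, then compared between the temporary and the recreated datastream. The Python sketch below builds such a search body and summarizes a response. The `createdAt` date field is an assumption about the events schema, and the sample response is made up.

```python
def verification_query(date_field="createdAt"):
    """Search body returning the document count and the oldest/latest dates.

    date_field is an assumed field name: adapt it to your events mapping.
    """
    return {
        "size": 0,
        "track_total_hits": True,  # exact count instead of the 10,000 cap
        "aggs": {
            "latestDate": {"max": {"field": date_field}},
            "oldestDate": {"min": {"field": date_field}},
        },
    }


def summarize(response):
    """Extract the three values to compare from a search response."""
    return {
        "count": response["hits"]["total"]["value"],
        "latest": response["aggregations"]["latestDate"]["value_as_string"],
        "oldest": response["aggregations"]["oldestDate"]["value_as_string"],
    }


# Made-up response, reduced to the fields used here.
sample_response = {
    "hits": {"total": {"value": 12, "relation": "eq"}},
    "aggregations": {
        "latestDate": {"value_as_string": "2025-09-12T09:23:11.388Z"},
        "oldestDate": {"value_as_string": "2025-01-03T14:00:00.000Z"},
    },
}
summary = summarize(sample_response)
print(summary)
```

Running the same search against both datastreams and checking that the two summaries are equal gives a quick indication that no document was lost in the copy.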
Check from your browser that the target workspace events feed is not empty and contains old data, and that events previously failing to persist are now persisted.
If everything is fine, you can delete the temporary datastream:
DELETE /_data_stream/events-<id1>-tmp