Need to tune Elasticsearch cluster performance.
Project detail
ELASTICSEARCH INTEGRATION
IN SHORT
We expanded our big data pipeline with a hot storage layer built on top of Elasticsearch. The goal was to query data with very fast response times so that we can make quick analytical decisions, but performance is poor: data indexing is very slow and queries have long response times. We need an ELK expert who can help us fix that.
DATA AND FORMAT
Our data (mainly text) is currently stored in Parquet format in S3 and in raw formats (TXT, CSV, XLSX, etc.). It totals around 10 TB and is expected to grow exponentially.
CURRENT ARCHITECTURE
Our current architecture consists of:
• A Spark cluster of 10 nodes (16 CPUs, 64 GB RAM, 256 GB disk each) to process raw files.
• S3 storage for processed data in Parquet format.
• A PostgreSQL database to store sessions, history and some metadata.
• A web app built with the Play framework (Scala), from which all requests (Spark jobs included) are triggered.
• A (non-optimized) Elasticsearch cluster of 5 nodes (16 CPUs, 64 GB RAM, 700 GB disk each). Indexing ~170 GB of data (~900 million rows) takes 5 hours.
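As a rough sanity check on the figures listed above, the total disk in the Elasticsearch cluster can be compared with the data volume. This is only a back-of-the-envelope sketch: it assumes the index takes about as much disk as the source data and that one replica is kept, neither of which is stated in this brief.

```python
# Rough capacity check using the node counts and sizes listed above.
# ASSUMPTIONS (not from the brief): on-disk index size ~= source data size,
# and one replica copy is kept for every primary shard.

ES_NODES = 5
DISK_PER_NODE_GB = 700
DATA_TB = 10
REPLICAS = 1  # assumed value, for illustration only


def cluster_disk_tb(nodes: int, disk_gb: int) -> float:
    """Total raw disk across the cluster in TB (1 TB = 1000 GB)."""
    return nodes * disk_gb / 1000


def required_disk_tb(data_tb: float, replicas: int) -> float:
    """Disk needed for primaries plus replicas, ignoring any overhead."""
    return data_tb * (1 + replicas)


available = cluster_disk_tb(ES_NODES, DISK_PER_NODE_GB)
needed = required_disk_tb(DATA_TB, REPLICAS)
print(f"available: {available} TB, needed: {needed} TB")
print("fits on current cluster:", needed <= available)
```

Under these assumptions the current 5 nodes could not hold the full data set, so the cluster size or the subset of data kept in hot storage would need to be part of the discussion.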
OUR APPROACH AND PROBLEMS
1. After data transformation, we save the resulting data in S3 (in Parquet format).
2. We then read these Parquet files with Spark into a DataFrame.
3. We then save this DataFrame to an Elasticsearch index (we tried many combinations of sharding and replication settings without any gain in performance).
4. We query/search data from Elasticsearch and feed Kibana/Grafana, or display it in whatever format the business requires.
While the first two steps are relatively fast (~1 hour for 1 billion rows), the third step takes around 5 hours for a 170 GB file.
Query response times are also very poor.
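To put the indexing numbers above in perspective, the sketch below converts them into throughput figures and extrapolates to the full data set. It uses only the values quoted in this brief (900 million rows, 170 GB, 5 hours, 5 Elasticsearch nodes, 10 TB total) and assumes, purely for illustration, that throughput stays constant as volume grows.

```python
# Throughput implied by the indexing figures quoted in this brief.
# ASSUMPTION: the rate stays constant when extrapolating to 10 TB.

ROWS = 900_000_000
SIZE_GB = 170
HOURS = 5
ES_NODES = 5
TOTAL_DATA_GB = 10_000  # ~10 TB


def indexing_rate(rows: int, size_gb: float, hours: float, nodes: int) -> dict:
    """Return cluster-wide and per-node indexing rates."""
    seconds = hours * 3600
    return {
        "rows_per_sec": rows / seconds,
        "mb_per_sec": size_gb * 1000 / seconds,
        "rows_per_sec_per_node": rows / seconds / nodes,
        "bytes_per_row": size_gb * 1e9 / rows,
    }


def hours_at_current_rate(total_gb: float, size_gb: float, hours: float) -> float:
    """Hours needed to index total_gb at the observed GB-per-hour rate."""
    return total_gb / size_gb * hours


rate = indexing_rate(ROWS, SIZE_GB, HOURS, ES_NODES)
full_load_hours = hours_at_current_rate(TOTAL_DATA_GB, SIZE_GB, HOURS)

for name, value in rate.items():
    print(f"{name}: {value:,.1f}")
print(f"full 10 TB load: {full_load_hours:.0f} h (~{full_load_hours / 24:.1f} days)")
```

At roughly 10,000 rows per second per node and under 10 MB/s for the whole cluster, a full migration at the current rate would take on the order of twelve days, which is why indexing speed matters for this project.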
OUR REQUIREMENTS
• Set up a cost-effective and efficient ELK (Elasticsearch, Logstash, Kibana) cluster, or optimize our existing one.
• Write an indexer that migrates existing data from S3 to Elasticsearch.
• Fast indexing of documents into Elasticsearch.
• Very fast queries and data retrieval. This is very important for our business needs; a response time of 1-3 seconds is acceptable.
• Improve communication between the Spark cluster and the Elasticsearch cluster. Any bottleneck between the two should be detected and fixed.
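For the indexing and Spark-to-Elasticsearch requirements above, one common starting point is to relax refresh and replication on the target index during the bulk load and to enlarge the connector's bulk batches. The sketch below is not a diagnosis of this cluster. The setting names (`index.refresh_interval`, `index.number_of_replicas`, and the elasticsearch-hadoop options `es.nodes`, `es.batch.size.entries`, `es.batch.size.bytes`, `es.batch.write.refresh`) are real, but every value, the host names, and the helper functions are hypothetical and would need to be tuned by measurement. It is written in Python (PySpark) for brevity; the same options apply from Scala.

```python
import json

# Illustrative values only: all numbers and host names below are assumptions.


def bulk_load_settings() -> dict:
    """Body for PUT /<index>/_settings before a large load:
    disable periodic refresh and drop replicas while indexing."""
    return {"index": {"refresh_interval": "-1", "number_of_replicas": 0}}


def restore_settings(replicas: int = 1, refresh: str = "1s") -> dict:
    """Body for PUT /<index>/_settings once the load has finished."""
    return {"index": {"refresh_interval": refresh, "number_of_replicas": replicas}}


def es_spark_options(nodes: list) -> dict:
    """Options for the elasticsearch-hadoop Spark connector."""
    return {
        "es.nodes": ",".join(nodes),        # hypothetical host list
        "es.batch.size.entries": "5000",    # assumed; connector default is 1000
        "es.batch.size.bytes": "5mb",       # assumed; connector default is 1mb
        "es.batch.write.refresh": "false",  # skip refresh after each bulk write
    }


def write_to_es(df, index: str, options: dict) -> None:
    """Write a Spark DataFrame to Elasticsearch.
    Requires the elasticsearch-hadoop connector on the Spark classpath."""
    (df.write
       .format("org.elasticsearch.spark.sql")
       .options(**options)
       .mode("append")
       .save(index))


opts = es_spark_options(["es-node-1:9200", "es-node-2:9200"])
print(json.dumps(bulk_load_settings()))
print(json.dumps(restore_settings()))
print(json.dumps(opts, indent=2))
```

A typical sequence would be: apply `bulk_load_settings()` to the index, call `write_to_es(df, "my_index", opts)` (the index name is a placeholder), then apply `restore_settings()`.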
PROFILE NEEDED
You need to:
• Have strong experience with Elasticsearch (ELK) in a big data processing environment.
• Be comfortable with the Play framework (or at least Scala).
• Have good experience working with Spark
IMPORTANT CONSIDERATIONS
• Data is about 10 TB and is quickly growing.
• Spark jobs are triggered from a web app built with the Play framework (Scala).
• The project needs to be completed in a reasonably short time (no more than one week).
• You will need to connect to our internal network in order to work, so you will need good internet bandwidth and the TeamViewer application installed.