Solr

Solr, which Cloudera packages as Cloudera Search, is a distributed search service built on Lucene. It is used for indexing and searching data stored in HDFS.

Configure Solr

Install Solr

Cloudera Manager distributes Solr in CDH and offers the following services:

  • Solr Server – Add a Solr Server to a host that is not hosting ZooKeeper or Oozie, because Solr uses a lot of CPU and memory. You can collocate a Solr Server with a YARN NodeManager, an HBase RegionServer, and an HDFS DataNode. When collocating with a NodeManager, make sure that the resources of the machine are not oversubscribed. RegionServers use a large amount of memory, so if you collocate a Solr Server with a RegionServer, calculate memory carefully to avoid oversubscribing the server. Do not install a Solr Server on a node running a YARN ResourceManager or HBase Master.
  • Gateway – Add a Solr Gateway to all application servers where a CLI and network map are required.
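
The oversubscription check described above comes down to simple arithmetic. The following is a minimal sketch, assuming you already know the memory you plan to give each collocated role. The function name fits_in_memory and all of the gigabyte figures are hypothetical and for illustration only, not recommendations.

```shell
# Hypothetical helper: succeeds (exit 0) when the summed per-role memory
# allocations fit into the host's RAM minus a reserve for the operating system.
# All values are whole gigabytes.
fits_in_memory() {
  local total_gb="$1" os_reserve_gb="$2"
  shift 2
  local sum=0 alloc
  for alloc in "$@"; do
    sum=$((sum + alloc))
  done
  [ "$sum" -le $((total_gb - os_reserve_gb)) ]
}

# Illustrative host: 64 GB RAM, 4 GB reserved for the OS, running
# Solr heap (12) + Solr direct memory (12) + RegionServer (16)
# + NodeManager containers (16) + DataNode (1) = 57 GB.
if fits_in_memory 64 4 12 12 16 16 1; then
  echo "allocations fit"
else
  echo "host is oversubscribed"
fi
```

With these example numbers the allocations add up to 57 GB against 60 GB available, so the check passes. Raising any single allocation by more than 3 GB would make it fail.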

Configure Solr

| Configuration | Description | Value | Calculation |
| --- | --- | --- | --- |
| Java Heap Size of Solr Server | Maximum size in bytes for the Java process heap memory. Passed to Java -Xmx. | 1 GB | |
| Java Direct Memory Size of Solr Server | Maximum amount of off-heap memory in bytes that may be allocated by the Java process. Passed to Java -XX:MaxDirectMemorySize. If unset, defaults to the size of the heap. | 1 GB | The amount of data in memory to be indexed and available to a search. In some cases can be MUCH higher than the Java heap. See the notes below. |

Notes: To ensure an appropriate amount of memory, consider your requirements and experiment in your environment. In general:

  • 4 GB is sufficient for some smaller loads or for evaluation.
  • 12 GB is sufficient for some production environments.
  • 48 GB is sufficient for most situations.
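
Because the heap and direct memory settings are expressed in bytes, it can help to convert the gigabyte figures above before entering them. This is a small sketch under the assumption that 1 GB means 1024^3 bytes. The helper names gb_to_bytes and solr_jvm_flags are hypothetical, and the printed options only illustrate how the two settings map to -Xmx and -XX:MaxDirectMemorySize.

```shell
# Convert a size in whole gigabytes to bytes (1 GB taken as 1024^3 bytes).
gb_to_bytes() {
  echo $(( $1 * 1024 * 1024 * 1024 ))
}

# Print the JVM options that the two settings correspond to,
# for a heap size and a direct memory size given in gigabytes.
solr_jvm_flags() {
  local heap_gb="$1" direct_gb="$2"
  echo "-Xmx$(gb_to_bytes "$heap_gb") -XX:MaxDirectMemorySize=$(gb_to_bytes "$direct_gb")"
}

gb_to_bytes 12        # the 12 GB production figure, in bytes
solr_jvm_flags 1 1    # the 1 GB values from the table
```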

Here is Cloudera’s current Solr guide: http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH5/latest/Search/Cloudera-Search-User-Guide/Cloudera-Search-User-Guide.html

To use Solr for the first time, you must create a collection; you will then be able to create a new core. For instructions, see the heading "Creating Your First Solr Collection" at: http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH5/latest/Search/Cloudera-Search-Installation-Guide/csig_deploy_search_solrcloud.html

Reference for an article about managing distributed Solr Servers: http://blog.mgm-tp.com/2010/09/hadoop-log-management-part4/

With regard to resourcing the system for Solr, here is some good insight from an expert:

Whether or not you separate the Solr servers into their own cluster or collocate Solr with your existing Hadoop/YARN nodes depends on the size of your search index. If the index fits into one Core, I would recommend using a dedicated Solr-Server separated from the Hadoop-Cluster.

If on the other hand the index is too large for a single core and you need a kind of sharding, you might be able to reuse your cluster for Solr. But first you need to evaluate the use of your Hadoop Cluster. If the Cluster is also heavily used for Map/Reduce-Jobs, you will not have enough resources for Solr.

Bottom line: If your Hadoop cluster is primarily used for storage and has only a light Map/Reduce load, you can reuse it for running Solr. In all other cases you are better off with a separate Solr Cluster.

Creating Your First Solr Collection

By default, the Solr server comes up with no collections. Create your first collection from the instancedir that you provided to Solr in previous steps, using the same name for the collection. In the command below, numOfShards is the number of SolrCloud shards across which you want to partition the collection; it cannot exceed the total number of Solr servers in your SolrCloud cluster:

solrctl collection --create collection1 -s {{numOfShards}}
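
The shard limit can be checked before the command is run. The following is a minimal sketch, assuming you supply the number of Solr servers yourself. The function name build_create_cmd is hypothetical, and it only prints the solrctl command rather than running it.

```shell
# Hypothetical helper: print the solrctl command for a new collection,
# refusing shard counts larger than the number of Solr servers.
build_create_cmd() {
  local name="$1" shards="$2" servers="$3"
  if [ "$shards" -gt "$servers" ]; then
    echo "error: $shards shards requested but only $servers Solr servers available" >&2
    return 1
  fi
  echo "solrctl collection --create $name -s $shards"
}

# Example: two shards on a cluster with three Solr servers (illustrative numbers).
build_create_cmd collection1 2 3
```

Printing the command instead of running it keeps the sketch safe to try on a machine without solrctl; paste the output into a shell on a Solr host once it looks right.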

You should now be able to verify that the collection is active, for example by navigating to http://ServerName:8983/solr/collection1/select?q=*%3A*&wt=json&indent=true in a browser. Similarly, you should be able to observe the topology of your SolrCloud using a URL similar to: http://ServerName:8983/solr/#/~cloud
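
The same check can be made from the command line. This is a rough sketch, assuming curl is installed and that Solr listens on port 8983 as in the URL above. The function names solr_select_url and check_collection are hypothetical, and ServerName remains a placeholder for your own host.

```shell
# Build the match-all query URL for a collection on a given host.
solr_select_url() {
  local host="$1" collection="$2"
  printf '%s\n' "http://${host}:8983/solr/${collection}/select?q=*%3A*&wt=json&indent=true"
}

# Succeeds only if the query URL answers with an HTTP success status.
# -f makes curl fail on HTTP errors, -sS hides progress but keeps error messages.
check_collection() {
  curl -fsS -o /dev/null "$(solr_select_url "$1" "$2")"
}

# Print the URL that would be queried; ServerName is a placeholder.
solr_select_url ServerName collection1

# On a machine that can reach the Solr server, you could then run:
#   check_collection ServerName collection1 && echo "collection1 is responding"
```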

Reference: http://blog.cloudera.com/blog/2013/11/how-to-add-cloudera-search-to-your-cluster-using-cloudera-manager/

Adding another Collection with Replication

To support scaling for query load, create a second collection with replication. Having multiple servers with replicated collections distributes the request load for each shard. Create one shard cluster with a replication factor of two. Your cluster must have at least two running servers to support this configuration, so ensure Cloudera Search is installed on at least two servers before continuing with this process. A replication factor of two causes two copies of the index files to be stored in two different locations.

1. Generate the config files for the collection:

solrctl instancedir --generate $HOME/solr_configs2

2. Upload the instance directory to ZooKeeper:

solrctl instancedir --create collection2 $HOME/solr_configs2

3. Create the second collection:

solrctl collection --create collection2 -s 1 -r 2

Verify the collection is live and that your one shard is being served by two nodes. For example, you should receive content from: http://ServerName:8983/solr/#/~cloud
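
The three steps in this section can be wrapped into one small script. The following is a sketch rather than a tested deployment script: the names run, create_replicated_collection and DRY_RUN are hypothetical, and by default the script only prints the solrctl commands so that you can review them first.

```shell
# With DRY_RUN=1 (the default here) each command is only printed.
# Set DRY_RUN=0 on a host where solrctl is installed to actually run them.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Generate the config files, upload them to ZooKeeper, then create the
# collection with the given shard count and replication factor.
# Each step runs only if the previous one succeeded.
create_replicated_collection() {
  local name="$1" config_dir="$2" shards="$3" replicas="$4"
  run solrctl instancedir --generate "$config_dir" &&
  run solrctl instancedir --create "$name" "$config_dir" &&
  run solrctl collection --create "$name" -s "$shards" -r "$replicas"
}

# One shard with a replication factor of two, as in the steps above.
create_replicated_collection collection2 "$HOME/solr_configs2" 1 2
```

Chaining the steps with && stops the script at the first failure, which avoids creating a collection from an instance directory that was never uploaded.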