Apache Solr

Russell Bateman
December 2021
last update:

Useful links

Apache Solr homepage.
Crawl Websites and Search in Apache Solr. This tutorial's resources are found at Crawl Websites with Apache Solr.

Though I abandoned this effort as unworkable and flawed, I decided to keep these notes. What didn't work? Indexing .html files using Apache Solr 8.11.1, the prospect of connecting it up to an application, and the magnitude of effort for anything beyond a simple, static website. I might come back one day.

Steps to setting up Apache Solr

Download Solr and explode it into your filesystem.

russ@tirion ~/dev $ unzip solr-8.11.1.zip
Archive:  solr-8.11.1.zip
   creating: solr-8.11.1/
   creating: solr-8.11.1/contrib/
   creating: solr-8.11.1/contrib/analysis-extras/
   creating: solr-8.11.1/contrib/analysis-extras/lib/
   creating: solr-8.11.1/contrib/analysis-extras/lucene-libs/
   .
   .
   .
  inflating: solr-8.11.1/contrib/prometheus-exporter/bin/solr-exporter
  inflating: solr-8.11.1/contrib/prometheus-exporter/bin/solr-exporter.cmd
  inflating: solr-8.11.1/example/exampledocs/test_utf8.sh
  inflating: solr-8.11.1/server/scripts/cloud-scripts/snapshotscli.sh
  inflating: solr-8.11.1/server/scripts/cloud-scripts/zkcli.sh

What's inside that we'll need? Look under the bin subdirectory. We'll be using
- bin/solr, which has 12 commands, such as bin/solr status
- bin/post, to post documents like web pages to create an index
Other subdirectories?
- Once built, the core and all of its data will reside in the server/ subdirectory
Make the root of the Solr installation your current working directory.

Type

russ@tirion ~/dev/solr-8.11.1 $ bin/solr status

No Solr nodes are running.

Start Solr with default settings, then get status again:

russ@tirion ~/dev/solr-8.11.1 $ bin/solr start
*** [WARN] *** Your open file limit is currently 1024.
 It should be set to 65000 to avoid operational disruption.
 If you no longer wish to see this warning, set SOLR_ULIMIT_CHECKS to false in your profile or solr.in.sh
Waiting up to 180 seconds to see Solr running on port 8983 [|]
Started Solr server on port 8983 (pid=582369). Happy searching!

russ@tirion ~/dev/solr-8.11.1 $ bin/solr status

Found 1 Solr nodes:

Solr process 582369 running on port 8983
{
  "solr_home":"/home/russ/dev/solr-8.11.1/server/solr",
  "version":"8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:50:55",
  "startTime":"2021-12-28T14:21:02.838Z",
  "uptime":"0 days, 0 hours, 1 minutes, 24 seconds",
  "memory":"89.8 MB (%17.5) of 512 MB"}

Build our first Solr core, solrhelp:

russ@tirion ~/dev/solr-8.11.1 $ bin/solr create_core -c solrhelp
WARNING: Using _default configset with data driven schema functionality. NOT RECOMMENDED for production use.
         To turn off: bin/solr config -c solrhelp -p 8983 -action set-user-property -property update.autoCreateFields -value false

Created new core 'solrhelp'

Here is where what we just created lives:

russ@tirion ~/dev/solr-8.11.1 $ ll server/solr
total 36
drwxr-xr-x  6 russ russ 4096 Dec 28 07:28 .
drwxr-xr-x 11 russ russ 4096 Dec 28 07:21 ..
drwxr-xr-x  4 russ russ 4096 Dec 14 13:51 configsets
drwxrwxr-x  2 russ russ 4096 Dec 28 07:21 filestore
-rw-r--r--  1 russ russ 3095 Dec 14 13:51 README.txt
drwxrwxr-x  4 russ russ 4096 Dec 28 07:28 solrhelp
-rw-r--r--  1 russ russ 2487 Dec 14 13:51 solr.xml
drwxrwxr-x  2 russ russ 4096 Dec 28 07:21 userfiles
-rw-r--r--  1 russ russ 1083 Dec 14 13:51 zoo.cfg

Practically speaking, we need documents to ingest into Solr, for example, from the web. For this, we'll use the post tool under the bin subdirectory. We could also use this same tool to ingest, for example, HTML documents in our local filesystem (what I'm planning to do). The latter is more illustrative of what happens in a production environment.

Here's what that looks like schematically:

bin/post -c <collection> [OPTIONS] <files|directories|urls|-d ["...",...]>
- the -c option identifies which core to post data to (in our case, "solrhelp")
- Via OPTIONS, you can override any of the defaults as to where to find the core (ours is just local at server/solr)
- Still via OPTIONS are web-crawl options you can set to accomplish useful things like extending the directory depth (-recursive option) to which the crawl will go looking for documents or...
- ...option -delay by which you can be kind to the web server serving up the data by specifying how long (in seconds) to wait between Solr's HTTP requests for data.
- Other OPTIONS include specifying which -filetypes to look for, whether to post Solr responses to the console (-out yes|no).
- And still others.
Examples:
```
$ bin/post -c solrhelp -filetypes html https://factorpad.com/tech/solr/ -recursive 1 -delay 10
```
Notes:
- The artifacts of indexing will go into the solrhelp core.
- We're only looking to index HTML documents.
- The slash at the end of the URL/path indicates that the last element is a directory (rather than a file).
- We'll only look one directory deep (subdirectory solr only).
- We'll only importune the web server with our HTTP requests once every 10 seconds.
```
$ bin/post -c solrhelp -filetypes html https://factorpad.com/tech/solr/index.html
```
...would only search the top page at that location.
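The way these options combine can be sketched in Python. This is only a minimal illustration that assembles the argument list for bin/post from the flags discussed above (-c, -filetypes, -recursive, -delay); the function name build_post_command and its parameters are my own invention, not part of Solr.

```python
def build_post_command(core, target, filetypes=None, recursive=None, delay=None):
    """Assemble a bin/post command line as a list of arguments.

    Hypothetical helper: only the flags shown in the notes above are handled.
    """
    cmd = ["bin/post", "-c", core]
    if filetypes is not None:
        cmd += ["-filetypes", filetypes]
    # The file, directory or URL to post
    cmd.append(target)
    if recursive is not None:
        cmd += ["-recursive", str(recursive)]
    if delay is not None:
        cmd += ["-delay", str(delay)]
    return cmd


if __name__ == "__main__":
    # Reproduces the first example command above
    print(" ".join(build_post_command(
        "solrhelp", "https://factorpad.com/tech/solr/",
        filetypes="html", recursive=1, delay=10)))
```

The returned list could be handed to subprocess.run() with cwd set to the Solr installation directory, assuming Solr is already running.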

Let's pull the trigger on our second example to see what we get. Remember that this is an HTML page we're going to index. This is the full output from Solr:

russ@tirion ~/dev/solr-8.11.1 $ bin/post -c solrhelp -filetypes html https://factorpad.com/tech/solr/index.html
/home/russ/dev/jdk-11.0.10+9/bin/java -classpath /home/russ/dev/solr-8.11.1/dist/solr-core-8.11.1.jar -Dauto=yes -Dfiletypes=html -Dc=solrhelp -Ddata=web org.apache.solr.util.SimplePostTool https://factorpad.com/tech/solr/index.html
SimplePostTool version 5.0.0
Posting web pages to Solr url http://localhost:8983/solr/solrhelp/update/extract
Entering auto mode. Indexing pages with content-types corresponding to file endings html
Entering crawl at level 0 (1 links total, 1 new)
SimplePostTool: WARNING: Solr returned an error #404 (Not Found) for url:
http://localhost:8983/solr/solrhelp/update/extract?literal.id=https%3A%2F%2Ffactorpad.com%2Ftech%2Fsolr%2Findex.html&literal.url=https%3A%2F%2Ffactorpad.com%2Ftech%2Fsolr%2Findex.html
SimplePostTool: WARNING: Response: <html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset=ISO-8859-1"/>
<title>Error 404 Not Found</title>
</head>
<body><h2>HTTP ERROR 404 Not Found</h2>
<table>
<tr><th>URI:</th><td>/solr/solrhelp/update/extract</td></tr>
<tr><th>STATUS:</th><td>404</td></tr>
<tr><th>MESSAGE:</th><td>Not Found</td></tr>
<tr><th>SERVLET:</th><td>default</td></tr>
</table>

</body>
</html>
SimplePostTool: WARNING: IOException while reading response:
java.io.FileNotFoundException:
http://localhost:8983/solr/solrhelp/update/extract?literal.id=https%3A%2F%2Ffactorpad.com%2Ftech%2Fsolr%2Findex.html&literal.url=https%3A%2F%2Ffactorpad.com%2Ftech%2Fsolr%2Findex.html
SimplePostTool: WARNING: An error occurred while posting https://factorpad.com/tech/solr/index.html
0 web pages indexed.
COMMITting Solr index changes to http://localhost:8983/solr/solrhelp/update/extract...
SimplePostTool: WARNING: Solr returned an error #404 (Not Found) for url: http://localhost:8983/solr/solrhelp/update/extract?commit=true
SimplePostTool: WARNING: Response: <html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset=ISO-8859-1"/>
<title>Error 404 Not Found</title>
</head>
<body><h2>HTTP ERROR 404 Not Found</h2>
<table>
<tr><th>URI:</th><td>/solr/solrhelp/update/extract</td></tr>
<tr><th>STATUS:</th><td>404</td></tr>
<tr><th>MESSAGE:</th><td>Not Found</td></tr>
<tr><th>SERVLET:</th><td>default</td></tr>
</table>

</body>
</html>
Time spent: 0:00:00.696
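
To read the log above, it helps to see how the request URL in the warnings is put together: it is the core's /update/extract endpoint on localhost, with the page's address passed percent-encoded in the literal.id and literal.url parameters. Here is a small Python sketch that rebuilds that URL; the function name extract_url is hypothetical, and the base address is simply the one that appears in the log.

```python
from urllib.parse import urlencode


def extract_url(core, page_url, base="http://localhost:8983/solr"):
    """Rebuild the /update/extract URL that appears in the post tool's log."""
    # The page address is passed twice, as the document id and as its url
    params = {"literal.id": page_url, "literal.url": page_url}
    return f"{base}/{core}/update/extract?{urlencode(params)}"


if __name__ == "__main__":
    print(extract_url("solrhelp", "https://factorpad.com/tech/solr/index.html"))
```

The point of the exercise is that the 404 in the log is reported for this localhost URL, not for the address of the page being crawled.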

To see a graphical report of the above, I'll launch a new tab in my browser at http://localhost:8983/solr/#/. Once there, I use the drop-down Core Selector to select solrhelp and I see statistics displayed.
Down the left navigational thumb, I click on Query. The syntax of searching in Lucene/Solr is very different from that of SQL, with which I'm more familiar.

It would be useful to examine the first document's contents at this point. Open a tab in your browser to https://factorpad.com/tech/solr/index.html. There you will see words like "Enterprise," "Search," "scalable" and "production-ready."

The tutorial bids us enter in the query edit field (in place of *:*) the word "apple," which does not occur in the indexed document. Do that, then click Execute Query below.

Solr returns discouraging output in JSON to the effect that it did not find the term:

{
  "responseHeader":{
    "status":0,
    "QTime":1,
    "params":{
      "q":"apple",
      "indent":"true",
      "q.op":"OR",
      "_":"1640704276130"}},
  "response":{"numFound":0,"start":0,"numFoundExact":true,"docs":[]
  }}

As the tutorial advises, compare this with asking Solr to find the term, "website," which should exist in that document:

{
  "responseHeader":{
    "status":0,
    "QTime":0,
    "params":{
      "q":"website",
      "indent":"true",
      "q.op":"OR",
      "_":"1640704276130"}},
  "response":{"numFound":0,"start":0,"numFoundExact":true,"docs":[]
  }}
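
The same check can be made outside the admin page. Below is a minimal Python sketch, assuming the Query screen sends its request to the core's /select handler with the parameters echoed in the responses above (q, q.op, indent). The function names select_url and num_found are mine, not Solr's.

```python
import json
from urllib.parse import urlencode


def select_url(core, term, base="http://localhost:8983/solr"):
    """Build a query URL for a core's /select handler (assumed endpoint)."""
    params = {"q": term, "q.op": "OR", "indent": "true"}
    return f"{base}/{core}/select?{urlencode(params)}"


def num_found(response_text):
    """Return the hit count from a Solr JSON response like the ones above."""
    return json.loads(response_text)["response"]["numFound"]


if __name__ == "__main__":
    sample = '{"responseHeader":{"status":0},"response":{"numFound":0,"start":0,"docs":[]}}'
    print(select_url("solrhelp", "website"))
    print(num_found(sample))
```

With Solr running, the text to pass to num_found() could be fetched with urllib.request.urlopen(select_url("solrhelp", "website")).read(). A count of 0 for every term is a hint that nothing was indexed at all.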

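This didn't work either (unlike the video). This is a little trip down the rabbit hole: if you were paying attention to the output from Solr when we attempted to post this webpage to it, we got an HTTP 404 error. My first guess was that we aren't allowed to index the tutorial author's web site, but the URI reported as not found is /solr/solrhelp/update/extract on localhost, so the 404 came from our own Solr rather than from the remote site.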

Let's index something we own instead. Create this tiny document locally and name it sample.html:

<!DOCTYPE html>
<html>
<head>
	<title> Sample document </title>
</head>
<body>
<h1> Sample document </h1>

<p>
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Haec igitur Epicuri
non probo, inquam. Quid est enim aliud esse versutum? An hoc usque quaque,
aliter in vita? Falli igitur possumus.
</p>
</body>
</html>

Now post it to solrhelp:

russ@tirion ~/dev/solr-8.11.1 $ bin/post -c solrhelp -filetypes html ./sample.html

Discoveries

I am discovering that the tutorial is based on a differently configured installation of Solr, one that is set up to index HTML files. My new, naked installation will not accomplish it. However, if I remove all the HTML elements inside sample.html (renaming it to sample.txt in order not to confuse it with the original), it works.
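
I removed the elements by hand, but the stripping step can be sketched in Python using only the standard library. This is a rough illustration rather than what I actually ran: the names TextExtractor and html_to_text are made up, and it assumes a simple page with no script or style blocks, whose contents it would otherwise keep as text.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text nodes of an HTML document, ignoring the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Keep only non-blank runs of text between tags
        text = data.strip()
        if text:
            self.chunks.append(text)


def html_to_text(html):
    """Return the text content of html, one text run per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


if __name__ == "__main__":
    print(html_to_text("<html><body><h1> Sample document </h1><p>Lorem ipsum</p></body></html>"))
```

Writing the result of html_to_text() on the contents of sample.html out to sample.txt gives a file that bin/post can handle without the extraction handler.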

There appear to be libraries (JARs) on the path contrib/extraction that are what is needed to do this. However, following:

Configuring the ExtractingRequestHandler in solrconfig.xml

and Lib Directives in SolrConfig

appear not to tell the story, at least not for Apache Solr 8.11.1. (10 minutes in)

<lib dir="${solr.install.dir}/contrib/extraction/lib" regex=".*\.jar" />
<lib dir="${solr.install.dir}/dist/" regex="solr-cell-\d.*\.jar" />
. . .
<requestHandler name="/update" class="solr.UpdateRequestHandler" />
<requestHandler name="/update/extract" startup="lazy" class="solr.extraction.ExtractingRequestHandler" >
  <lst name="defaults">
    <str name="lowernames">true</str>
    <str name="fmap.content">_text_</str>
  </lst>
</requestHandler>
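
One way to see whether a core's configuration declares the handler at all is to look for it in solrconfig.xml. The Python sketch below does just that; has_request_handler is a hypothetical helper, and the file location I would check (server/solr/solrhelp/conf/solrconfig.xml) is an assumption about where create_core puts the core's configuration.

```python
import xml.etree.ElementTree as ET


def has_request_handler(solrconfig_xml, name):
    """Return True if the config text declares a requestHandler with this name."""
    root = ET.fromstring(solrconfig_xml)
    # requestHandler elements are searched at any depth under the root
    return any(h.get("name") == name for h in root.iter("requestHandler"))


if __name__ == "__main__":
    config = """<config>
      <requestHandler name="/update" class="solr.UpdateRequestHandler" />
    </config>"""
    print(has_request_handler(config, "/update/extract"))
```

For a file on disk, ET.parse(path).getroot() is the usual route in place of ET.fromstring(). A handler that is declared but still answers 404 would point at the lib directives instead.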