Apache Solr
Russell Bateman
Though I abandoned this effort as unworkable and flawed, I decided to keep these notes. What didn't work? Indexing .html files using Apache Solr 8.11.1, the prospect of connecting it up to an application, and the magnitude of the effort for anything beyond a simple, static website. I might come back to it one day.
russ@tirion ~/dev $ unzip solr-8.11.1.zip
Archive: solr-8.11.1.zip
creating: solr-8.11.1/
creating: solr-8.11.1/contrib/
creating: solr-8.11.1/contrib/analysis-extras/
creating: solr-8.11.1/contrib/analysis-extras/lib/
creating: solr-8.11.1/contrib/analysis-extras/lucene-libs/
.
.
.
inflating: solr-8.11.1/contrib/prometheus-exporter/bin/solr-exporter
inflating: solr-8.11.1/contrib/prometheus-exporter/bin/solr-exporter.cmd
inflating: solr-8.11.1/example/exampledocs/test_utf8.sh
inflating: solr-8.11.1/server/scripts/cloud-scripts/snapshotscli.sh
inflating: solr-8.11.1/server/scripts/cloud-scripts/zkcli.sh
russ@tirion ~/dev/solr-8.11.1 $ bin/solr status
No Solr nodes are running.
russ@tirion ~/dev/solr-8.11.1 $ bin/solr start
*** [WARN] *** Your open file limit is currently 1024.
 It should be set to 65000 to avoid operational disruption.
 If you no longer wish to see this warning, set SOLR_ULIMIT_CHECKS to false in your profile or solr.in.sh
Waiting up to 180 seconds to see Solr running on port 8983 [|]
Started Solr server on port 8983 (pid=582369). Happy searching!
russ@tirion ~/dev/solr-8.11.1 $ bin/solr status
Found 1 Solr nodes:
Solr process 582369 running on port 8983
{
  "solr_home":"/home/russ/dev/solr-8.11.1/server/solr",
  "version":"8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:50:55",
  "startTime":"2021-12-28T14:21:02.838Z",
  "uptime":"0 days, 0 hours, 1 minutes, 24 seconds",
  "memory":"89.8 MB (%17.5) of 512 MB"}
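The open-file warning above can be checked before starting Solr. What follows is only a minimal sketch: the function name open_files_ok is made up for illustration, and the 65000 threshold is simply the figure quoted in the warning.

```shell
# Classify an open-file limit against the 65000 that Solr's warning asks for.
# open_files_ok is a hypothetical helper name, not part of Solr.
open_files_ok() {
  limit="$1"
  wanted="${2:-65000}"
  # "ulimit -n" may print the word "unlimited" instead of a number.
  if [ "$limit" = "unlimited" ] || [ "$limit" -ge "$wanted" ]; then
    echo "ok"
  else
    echo "low"
  fi
}

# Check the soft limit of the current shell.
open_files_ok "$(ulimit -n)"
```

If this reports "low", the choices are probably either raising the limit in the shell that launches Solr (ulimit -n 65000, which can fail when the hard limit is lower) or silencing the check with SOLR_ULIMIT_CHECKS as the warning itself suggests.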
russ@tirion ~/dev/solr-8.11.1 $ bin/solr create_core -c solrhelp
WARNING: Using _default configset with data driven schema functionality. NOT RECOMMENDED for production use.
To turn off: bin/solr config -c solrhelp -p 8983 -action set-user-property -property update.autoCreateFields -value false
Created new core 'solrhelp'
Here is where what we just created lives:
russ@tirion ~/dev/solr-8.11.1 $ ll server/solr
total 36
drwxr-xr-x 6 russ russ 4096 Dec 28 07:28 .
drwxr-xr-x 11 russ russ 4096 Dec 28 07:21 ..
drwxr-xr-x 4 russ russ 4096 Dec 14 13:51 configsets
drwxrwxr-x 2 russ russ 4096 Dec 28 07:21 filestore
-rw-r--r-- 1 russ russ 3095 Dec 14 13:51 README.txt
drwxrwxr-x 4 russ russ 4096 Dec 28 07:28 solrhelp
-rw-r--r-- 1 russ russ 2487 Dec 14 13:51 solr.xml
drwxrwxr-x 2 russ russ 4096 Dec 28 07:21 userfiles
-rw-r--r-- 1 russ russ 1083 Dec 14 13:51 zoo.cfg
$ bin/post -c solrhelp -filetypes html https://factorpad.com/tech/solr/ -recursive 1 -delay 10
Notes:
$ bin/post -c solrhelp -filetypes html https://factorpad.com/tech/solr/index.html
...would only index the top page at that location.
russ@tirion ~/dev/solr-8.11.1 $ bin/post -c solrhelp -filetypes html https://factorpad.com/tech/solr/index.html
/home/russ/dev/jdk-11.0.10+9/bin/java -classpath /home/russ/dev/solr-8.11.1/dist/solr-core-8.11.1.jar -Dauto=yes -Dfiletypes=html -Dc=solrhelp -Ddata=web org.apache.solr.util.SimplePostTool https://factorpad.com/tech/solr/index.html
SimplePostTool version 5.0.0
Posting web pages to Solr url http://localhost:8983/solr/solrhelp/update/extract
Entering auto mode. Indexing pages with content-types corresponding to file endings html
Entering crawl at level 0 (1 links total, 1 new)
SimplePostTool: WARNING: Solr returned an error #404 (Not Found) for url:
http://localhost:8983/solr/solrhelp/update/extract?literal.id=https%3A%2F%2Ffactorpad.com%2Ftech%2Fsolr%2Findex.html&literal.url=https%3A%2F%2Ffactorpad.com%2Ftech%2Fsolr%2Findex.html
SimplePostTool: WARNING: Response: <html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset=ISO-8859-1"/>
<title>Error 404 Not Found</title>
</head>
<body><h2>HTTP ERROR 404 Not Found</h2>
<table>
<tr><th>URI:</th><td>/solr/solrhelp/update/extract</td></tr>
<tr><th>STATUS:</th><td>404</td></tr>
<tr><th>MESSAGE:</th><td>Not Found</td></tr>
<tr><th>SERVLET:</th><td>default</td></tr>
</table>
</body>
</html>
SimplePostTool: WARNING: IOException while reading response:
java.io.FileNotFoundException:
http://localhost:8983/solr/solrhelp/update/extract?literal.id=https%3A%2F%2Ffactorpad.com%2Ftech%2Fsolr%2Findex.html&literal.url=https%3A%2F%2Ffactorpad.com%2Ftech%2Fsolr%2Findex.html
SimplePostTool: WARNING: An error occurred while posting https://factorpad.com/tech/solr/index.html
0 web pages indexed.
COMMITting Solr index changes to http://localhost:8983/solr/solrhelp/update/extract...
SimplePostTool: WARNING: Solr returned an error #404 (Not Found) for url: http://localhost:8983/solr/solrhelp/update/extract?commit=true
SimplePostTool: WARNING: Response: <html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset=ISO-8859-1"/>
<title>Error 404 Not Found</title>
</head>
<body><h2>HTTP ERROR 404 Not Found</h2>
<table>
<tr><th>URI:</th><td>/solr/solrhelp/update/extract</td></tr>
<tr><th>STATUS:</th><td>404</td></tr>
<tr><th>MESSAGE:</th><td>Not Found</td></tr>
<tr><th>SERVLET:</th><td>default</td></tr>
</table>
</body>
</html>
Time spent: 0:00:00.696
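Both failures above are 404s on /solr/solrhelp/update/extract, which suggests nothing is answering at that path in this core. A small sketch for reading the status code of such a probe follows; diagnose_extract is a hypothetical helper name, and the wording of each diagnosis is my own interpretation rather than anything Solr prints.

```shell
# Map an HTTP status code from a request to /update/extract to a short diagnosis.
# diagnose_extract is a hypothetical helper; the messages are interpretations.
diagnose_extract() {
  case "$1" in
    404) echo "no handler at this path (or no such core)" ;;
    2??) echo "handler responding" ;;
    000) echo "no connection" ;;   # curl reports 000 when it got no HTTP response
    *)   echo "unexpected status $1" ;;
  esac
}

# The status seen in the transcript above:
diagnose_extract 404
```

Against a running instance, the code could be obtained with curl -s -o /dev/null -w '%{http_code}' 'http://localhost:8983/solr/solrhelp/update/extract?commit=true', which is the same commit URL that SimplePostTool requested above.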
{
  "responseHeader":{
    "status":0,
    "QTime":1,
    "params":{
      "q":"apple",
      "indent":"true",
      "q.op":"OR",
      "_":"1640704276130"}},
  "response":{"numFound":0,"start":0,"numFoundExact":true,"docs":[]
  }}
{
  "responseHeader":{
    "status":0,
    "QTime":0,
    "params":{
      "q":"website",
      "indent":"true",
      "q.op":"OR",
      "_":"1640704276130"}},
  "response":{"numFound":0,"start":0,"numFoundExact":true,"docs":[]
  }}
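Both responses report numFound of 0, which is consistent with nothing having been indexed. To check this from the command line, a rough sketch like the following pulls the count out of a response; num_found is a hypothetical helper, the sample response is a trimmed copy of the one above, and a real JSON parser would be more robust than sed.

```shell
# Pull the numFound value out of a Solr JSON response read from stdin.
# num_found is a hypothetical helper; it relies on pattern matching, not JSON parsing.
num_found() {
  sed -n 's/.*"numFound":[[:space:]]*\([0-9][0-9]*\).*/\1/p'
}

# Trimmed copy of the "apple" response shown above.
response='{ "responseHeader":{ "status":0, "QTime":1}, "response":{"numFound":0,"start":0,"numFoundExact":true,"docs":[] }}'
printf '%s\n' "$response" | num_found
```

Assuming the core answers on the standard /select handler, the same function could be fed live with curl -s 'http://localhost:8983/solr/solrhelp/select?q=apple' | num_found.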
<!DOCTYPE html>
<html>
<head>
  <title> Sample document </title>
</head>
<body>
  <h1> Sample document </h1>
  <p>
  Lorem ipsum dolor sit amet, consectetur adipiscing elit. Haec igitur Epicuri non probo, inquam. Quid est enim aliud esse versutum? An hoc usque quaque, aliter in vita? Falli igitur possumus.
  </p>
</body>
</html>
russ@tirion ~/dev/solr-8.11.1 $ bin/post -c solrhelp -filetypes html ./sample.html
I am discovering that the tutorial is based on a differently configured installation of Solr, one that is set up to index HTML files. My new, naked installation will not do it. However, if I remove all the HTML elements inside sample.html (renaming it to sample.txt so as not to confuse it with the original), it works.
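The workaround of removing the markup by hand can be approximated with sed. This is a crude sketch, not a real HTML parser: strip_tags is a hypothetical helper, it only handles tags that open and close on the same line, and it leaves entities such as &amp; untouched.

```shell
# Crude tag stripper: deletes anything between < and > on each line.
# strip_tags is a hypothetical helper; tags spanning several lines are not handled.
strip_tags() {
  sed -e 's/<[^>]*>//g'
}

# One line from the sample document above.
printf '%s\n' '<h1> Sample document </h1>' | strip_tags
# prints " Sample document " (without the quotes)
```

For the sample file, that would be strip_tags < sample.html > sample.txt, followed by posting sample.txt with bin/post -c solrhelp ./sample.txt.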
The libraries (JARs) needed to do this appear to be under contrib/extraction. However, following:
appear not to tell the story, at least not for Apache Solr 8.11.1.
(10 minutes in)
<lib dir="${solr.install.dir}/contrib/extraction/lib" regex=".*\.jar" />
<lib dir="${solr.install.dir}/dist/" regex="solr-cell-\d.*\.jar" />
.
.
.
<requestHandler name="/update" class="solr.UpdateRequestHandler" />
<requestHandler name="/update/extract"
                startup="lazy"
                class="solr.extraction.ExtractingRequestHandler" >
  <lst name="defaults">
    <str name="lowernames">true</str>
    <str name="fmap.content">_text_</str>
  </lst>
</requestHandler>
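To see whether a core's configuration declares the handler at all, a plain grep over its solrconfig.xml is a quick first check. This is a sketch under assumptions: has_extract_handler is a hypothetical helper, the path server/solr/solrhelp/conf/solrconfig.xml is where I would expect the core created above to keep its configuration, and the pattern would also match a handler that is commented out.

```shell
# Report whether a solrconfig.xml mentions a handler named /update/extract.
# has_extract_handler is a hypothetical helper; it does not parse XML,
# so a commented-out declaration would still count as a match.
has_extract_handler() {
  if grep -q 'name="/update/extract"' "$1"; then
    echo "yes"
  else
    echo "no"
  fi
}

# Demonstration on a throwaway file holding one line of the fragment above.
demo="$(mktemp)"
printf '%s\n' '<requestHandler name="/update/extract" startup="lazy" class="solr.extraction.ExtractingRequestHandler" >' > "$demo"
has_extract_handler "$demo"
rm -f "$demo"
```

On the installation above, the call would be has_extract_handler server/solr/solrhelp/conf/solrconfig.xml. After editing that file, Solr would need a restart or a core reload before the change could take effect.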