Notes on Tika
The Apache Tika™ toolkit detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF). I needed to use Tika and wrote a NiFi processor that interfaces with it.
These are quick, disorganized notes that I just need to get written down somewhere. Tika's not that complicated, so it's unlikely that I'll need to come back and flesh these notes out much.
I set up a Tika server this way:
~/Downloads $ java -jar tika-server-1.13.jar --port 9998
Sep 16, 2016 3:32:26 PM org.apache.tika.server.TikaServerCli main
INFO: Starting Apache Tika 1.13 server
Sep 16, 2016 3:32:27 PM org.apache.cxf.endpoint.ServerImpl initDestination
INFO: Setting the server's publish address to be http://localhost:9998/
Sep 16, 2016 3:32:27 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: jetty-8.y.z-SNAPSHOT
Sep 16, 2016 3:32:27 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: Started SelectChannelConnector@localhost:9998
Sep 16, 2016 3:32:27 PM org.apache.tika.server.TikaServerCli main
INFO: Started
Next, from Python, I PUT the PDF data at the server (pdfdata and TIKA_HEADER are set up earlier in the script):

req = urllib2.Request( 'http://127.0.0.1:9998/tika', pdfdata, headers=TIKA_HEADER )
req.get_method = lambda: 'PUT'
try:
    tikaout = urllib2.urlopen( req, timeout=10 ).read().decode( 'iso-8859-1' ).encode( 'utf8' )
...returned 422: Unprocessable Entity which basically means that the base-64 encoding was garbage (?). I saw this output from Jetty:
Sep 16, 2016 3:36:29 PM org.apache.tika.server.resource.TikaResource logRequest
INFO: tika (application/pdf)
Sep 16, 2016 3:36:29 PM org.apache.tika.server.resource.TikaResource parse
WARNING: tika: Text extraction failed
org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@5ef290ec
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:286)
    at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:188)
    . . .
Sep 16, 2016 3:36:29 PM org.apache.cxf.jaxrs.utils.JAXRSUtils logMessageHandlerProblem
SEVERE: Problem with writing the data, class org.apache.tika.server.resource.TikaResource$4, ContentType: text/plain
~/Downloads $
...so, it thinks it's PDF, but that it's bad somehow. And it died. What's up with that? Of course, this example was tossed to me in e-mail. Maybe it's not a real one.
I don't know what the figures after boundary639 and boundary556 mean. I am sort of looking for offsets to the beginning and end of the base-64 data.
It's important to use Tika on actual unencoded content. It doesn't work on base-64 encoded data.
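So when content arrives base-64 encoded (as it apparently did in the e-mailed example above), decode it before handing it to Tika. A minimal sketch, assuming Java 8's java.util.Base64 and a placeholder string standing in for the real payload:

package com.etretatlogiciels.tika;

import java.util.Base64;

import org.apache.tika.Tika;

public class DecodeThenDetect
{
    public static void main( String[] args )
    {
        // placeholder: in real life this would be the base-64 attachment pulled out of the message
        String base64Payload = "JVBERi0xLjQK";                 // "%PDF-1.4\n", encoded

        // decode to raw bytes first--Tika wants the unencoded content
        byte[] raw = Base64.getMimeDecoder().decode( base64Payload );

        // detection then runs against the decoded bytes (magic numbers, etc.)
        Tika   tika = new Tika();
        String type = tika.detect( raw );

        System.out.println( "Detected MIME type: " + type );
    }
}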
Tika has four modules: tika-core, tika-parsers, tika-app (the command-line jar), and tika-server.
Comprehensive Tika guide at: Tika Quick Guide. This doc is pretty useful too: Apache Tika API Usage Examples.
All the tests below document what I've learned. Tika's very easy to use, but the assertions assume specific test data; your mileage may vary.
I wrote TikaDetection in order to play with the Tika façade a bit. The language detection is bogus, but the MIME detection is okay on my sample data (HL7 pipe message).
package com.etretatlogiciels.tika;

import java.io.File;
import java.io.IOException;

import org.junit.Test;
import static org.junit.Assert.assertTrue;

import org.apache.tika.Tika;

/**
 * @author Russell Bateman
 * @since September 2016
 */
public class MimeDetectionTest
{
    @Test
    public void testByOpenFile() throws IOException
    {
        File file = new File( "src/test/resources/Statement_201412.pdf" );
        Tika tika = new Tika();
        String filetype = tika.detect( file );

        System.out.println( "File type is " + filetype + "." );
        assertTrue( "Wrong MIME type", filetype.equals( "application/pdf" ) );
    }

    @Test
    public void testByData()
    {
        final String text = "This is a test.";
        Tika tika = new Tika();
        String filetype = tika.detect( text.getBytes() );

        System.out.println( "Content type is " + filetype + "." );
        assertTrue( "Wrong MIME type", filetype.equals( "text/plain" ) );
    }
}
package com.etretatlogiciels.tika;

import org.junit.Test;
import static org.junit.Assert.assertTrue;

import org.apache.tika.language.LanguageIdentifier;

/**
 * Guess what? This simply doesn't work--what did I miss here?
 *
 * @author Russell Bateman
 * @since September 2016
 */
@SuppressWarnings( "deprecation" )
public class LanguageDetectionTest
{
    @Test
    public void testEnglish()
    {
        LanguageIdentifier identifier = new LanguageIdentifier( "this is english" );
        String language = identifier.getLanguage();

        System.out.println( "Language of string content is " + language );
        assertTrue( "Wrong language", language.equals( "en" ) );
    }

    /**
     * Well, this certainly doesn't work.
     */
    @Test
    public void testFrench()
    {
        LanguageIdentifier identifier = new LanguageIdentifier( "ce texte est en français" );
        String language = identifier.getLanguage();

        System.out.println( "Language of string content is " + language );
        assertTrue( "Wrong language", language.equals( "fr" ) );
    }
}
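Apparently LanguageIdentifier was deprecated in favor of the newer detectors in the tika-langdetect module. I haven't verified this against 1.13, but something like the following ought to be the modern way (the sample strings are just the ones from the test above; this needs the org.apache.tika:tika-langdetect artifact):

package com.etretatlogiciels.tika;

import java.io.IOException;

import org.apache.tika.langdetect.OptimaizeLangDetector;
import org.apache.tika.language.detect.LanguageDetector;
import org.apache.tika.language.detect.LanguageResult;

public class NewerLanguageDetection
{
    public static void main( String[] args ) throws IOException
    {
        // loadModels() reads the bundled language profiles
        LanguageDetector detector = new OptimaizeLangDetector().loadModels();

        LanguageResult english = detector.detect( "this is english" );
        LanguageResult french  = detector.detect( "ce texte est en français" );

        System.out.println( "English sample detected as " + english.getLanguage() );
        System.out.println( "French sample detected as  " + french.getLanguage() );
    }
}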
Tika in library form: I have been using the Tika JAR, but it's available via Maven.
<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-core</artifactId>
    <version>1.13</version>
</dependency>
<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-parsers</artifactId>
    <version>1.13</version>
</dependency>
Tika can be used from the command line or from an HTTP API (as shown yesterday). However, to use Tika from Java, use the Tika façade class.
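For the simplest cases, the façade does everything in a call or two. A quick sketch (the path is just my test PDF from the earlier tests):

package com.etretatlogiciels.tika;

import java.io.File;
import java.io.IOException;

import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;

public class FacadeSketch
{
    public static void main( String[] args ) throws IOException, TikaException
    {
        Tika tika = new Tika();
        File file = new File( "src/test/resources/Statement_201412.pdf" );

        // detect() sniffs the MIME type; parseToString() picks the right parser and returns plain text
        System.out.println( "MIME type: " + tika.detect( file ) );
        System.out.println( tika.parseToString( file ) );
    }
}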
Parsing a PDF with Tika's PDFParser:
package com.etretatlogiciels.tika;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

import org.junit.Test;
import static org.junit.Assert.assertTrue;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.pdf.PDFParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

/**
 * @author Russell Bateman
 * @since September 2016
 */
public class ExtractPdfTest
{
    @Test
    public void test() throws IOException, TikaException, SAXException
    {
        BodyContentHandler handler  = new BodyContentHandler();
        Metadata           metadata = new Metadata();
        FileInputStream    stream   = new FileInputStream( new File( "src/test/resources/Statement_201412.pdf" ) );
        ParseContext       context  = new ParseContext();
        PDFParser          parser   = new PDFParser();

        parser.parse( stream, handler, metadata, context );

        String contents = handler.toString();

        System.out.println( "PDF contents:" );
        System.out.println( contents );
        System.out.println( "Metadata: " );

        String[]       metadataNames = metadata.names();
        List< String > names         = Arrays.asList( metadataNames );

        for( String name : names )
            System.out.println( " " + name + " : " + metadata.get( name ) );

        assertTrue( "Not the expected content", contents.contains( "Bateman, Russell" ) );
        assertTrue( "Missing Content-Type",     names.contains( "Content-Type" ) );
        assertTrue( "Content-Type is not PDF",  metadata.get( "Content-Type" ).equals( "application/pdf" ) );
    }
}
Using AutoDetectParser (below) is simpler still than the façade.
package com.etretatlogiciels.tika;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

import org.junit.BeforeClass;
import org.junit.Test;

import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.Parser;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.sax.BodyContentHandler;
import org.apache.tika.exception.TikaException;
import org.apache.tika.parser.ParseContext;
import org.xml.sax.SAXException;

/**
 * I undertook this to determine whether AutoDetectParser is preferable
 * over Tika Facade. On the face of it, it is. It does more types, one
 * need not retro-fit new types later, it's as fast, etc.
 *
 * @author Russell Bateman
 * @since October 2016
 */
public class AutoDetectParserTest
{
    private static BodyContentHandler handler;
    private static Metadata           metadata;
    private static ParseContext       context;
    private static Parser             parser;

    private static final boolean VERBOSE = true;

    @BeforeClass
    public static void erectTikaParsingArtifacts()
    {
        long start = System.currentTimeMillis();

        // here we do not tell Tika we want to parse PDF; it will figure that out...
        // this statement takes about 3 seconds to execute:
        parser = new AutoDetectParser();

        System.out.println( "new AutoDetectParser() elapsed time: "
                + ( System.currentTimeMillis() - start )
                + " milliseconds (single instance saved for all tests)" );

        handler  = new BodyContentHandler();
        metadata = new Metadata();
        context  = new ParseContext();
    }

    private void printSummary()
    {
        String[]       metadataNames = metadata.names();
        List< String > names         = Arrays.asList( metadataNames );

        Collections.sort( names );

        if( VERBOSE )
        {
            System.out.println( metadata.get( "Content-Type" ) + " contents:" );
            System.out.println( handler.toString() );
            System.out.println( "Metadata: " );

            for( String name : names )
                System.out.println( " " + name + " : " + metadata.get( name ) );
        }
    }

    @Test
    public void testPdf() throws IOException, TikaException, SAXException
    {
        FileInputStream stream = new FileInputStream( new File( "src/test/resources/Statement_201412.pdf" ) );

        // this statement takes roughly 7 seconds to execute:
        long start = System.currentTimeMillis();
        parser.parse( stream, handler, metadata, context );
        System.out.println( "parser.parse( small PDF ) elapsed time: "
                + ( System.currentTimeMillis() - start ) + " milliseconds" );
        printSummary();
    }

    @Test
    public void testXml() throws IOException, TikaException, SAXException
    {
        FileInputStream stream = new FileInputStream( new File( "pom.xml" ) );

        long start = System.currentTimeMillis();
        parser.parse( stream, handler, metadata, context );
        System.out.println( "parser.parse( pom.xml ) elapsed time: "
                + ( System.currentTimeMillis() - start ) + " milliseconds" );
        printSummary();
    }

    @Test
    public void testHtml() throws IOException, TikaException, SAXException
    {
        FileInputStream stream = new FileInputStream( new File( "src/test/resources/sample.html" ) );

        long start = System.currentTimeMillis();
        parser.parse( stream, handler, metadata, context );
        System.out.println( "parser.parse( tiny HTML ) elapsed time: "
                + ( System.currentTimeMillis() - start ) + " milliseconds" );
        printSummary();
    }

    @Test
    public void testText() throws IOException, TikaException, SAXException
    {
        FileInputStream stream = new FileInputStream( new File( "src/test/resources/sample.txt" ) );

        long start = System.currentTimeMillis();
        parser.parse( stream, handler, metadata, context );
        System.out.println( "parser.parse( tiny text ) elapsed time: "
                + ( System.currentTimeMillis() - start ) + " milliseconds" );
        printSummary();
    }

    @Test
    public void testJpeg() throws IOException, TikaException, SAXException
    {
        FileInputStream stream = new FileInputStream( new File( "src/test/resources/test-first-code-second.jpg" ) );

        long start = System.currentTimeMillis();
        parser.parse( stream, handler, metadata, context );
        System.out.println( "parser.parse( JPeG ) elapsed time: "
                + ( System.currentTimeMillis() - start ) + " milliseconds" );
        printSummary();
    }

    @Test
    public void testPng() throws IOException, TikaException, SAXException
    {
        FileInputStream stream = new FileInputStream( new File( "src/test/resources/etl-fhir-quiver.png" ) );

        long start = System.currentTimeMillis();
        parser.parse( stream, handler, metadata, context );
        System.out.println( "parser.parse( PNG ) elapsed time: "
                + ( System.currentTimeMillis() - start ) + " milliseconds" );
        printSummary();
    }
}