Notes on Tika
The Apache Tika™ toolkit detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF). I needed to use Tika and wrote a NiFi processor that interfaces with it.
These are quick, disorganized notes that I just need to get written down somewhere. Tika's not that complicated, so it's unlikely that I'll need to come back and flesh these notes out much.
I set up a Tika server this way:
~/Downloads $ java -jar tika-server-1.13.jar --port 9998
Sep 16, 2016 3:32:26 PM org.apache.tika.server.TikaServerCli main
INFO: Starting Apache Tika 1.13 server
Sep 16, 2016 3:32:27 PM org.apache.cxf.endpoint.ServerImpl initDestination
INFO: Setting the server's publish address to be http://localhost:9998/
Sep 16, 2016 3:32:27 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: jetty-8.y.z-SNAPSHOT
Sep 16, 2016 3:32:27 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: Started SelectChannelConnector@localhost:9998
Sep 16, 2016 3:32:27 PM org.apache.tika.server.TikaServerCli main
INFO: Started
Next, from Python, I PUT the PDF data at the server (pdfdata and TIKA_HEADER are set up earlier in the script):

req = urllib2.Request( 'http://127.0.0.1:9998/tika', pdfdata, headers=TIKA_HEADER )
req.get_method = lambda: 'PUT'
try:
    tikaout = urllib2.urlopen( req, timeout=10 ).read().decode( 'iso-8859-1' ).encode( 'utf8' )
...returned 422: Unprocessable Entity which basically means that the base-64 encoding was garbage (?). I saw this output from Jetty:
Sep 16, 2016 3:36:29 PM org.apache.tika.server.resource.TikaResource logRequest
INFO: tika (application/pdf)
Sep 16, 2016 3:36:29 PM org.apache.tika.server.resource.TikaResource parse
WARNING: tika: Text extraction failed
org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@5ef290ec
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:286)
    at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:188)
    . . .
Sep 16, 2016 3:36:29 PM org.apache.cxf.jaxrs.utils.JAXRSUtils logMessageHandlerProblem
SEVERE: Problem with writing the data, class org.apache.tika.server.resource.TikaResource$4, ContentType: text/plain
~/Downloads $
...so, it thinks it's PDF, but that it's bad somehow. And it died. What's up with that? Of course, this example was tossed to me in e-mail. Maybe it's not a real one.
I don't know what the figures after boundary639 and boundary556 mean. I am sort of looking for offsets to the beginning and end of the base-64 data.
It's important to use Tika on actual unencoded content. It doesn't work on base-64 encoded data.
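So when content arrives base-64 encoded (as it apparently did in the e-mailed example above), decode it before handing it to Tika. A minimal sketch, assuming Java 8's java.util.Base64 and a placeholder string standing in for the real payload:

package com.etretatlogiciels.tika;

import java.util.Base64;

import org.apache.tika.Tika;

public class DecodeThenDetect
{
    public static void main( String[] args )
    {
        // placeholder: in real life this would be the base-64 attachment pulled out of the message
        String base64Payload = "JVBERi0xLjQK";                 // "%PDF-1.4\n", encoded

        // decode to raw bytes first--Tika wants the unencoded content
        byte[] raw = Base64.getMimeDecoder().decode( base64Payload );

        // detection then runs against the decoded bytes (magic numbers, etc.)
        Tika   tika = new Tika();
        String type = tika.detect( raw );

        System.out.println( "Detected MIME type: " + type );
    }
}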
Tika has four modules: tika-core, tika-parsers, tika-app (the command-line jar), and tika-server.
Comprehensive Tika guide at: Tika Quick Guide. This doc is pretty useful too: Apache Tika API Usage Examples.
All the tests below document what I've learned. Tika's very easy to use, but the assertions assume specific test data; your mileage may vary.
I wrote TikaDetection in order to play with the Tika façade a bit. The language detection is bogus, but the MIME detection is okay on my sample data (HL7 pipe message).
package com.etretatlogiciels.tika;

import java.io.File;
import java.io.IOException;

import org.junit.Test;
import static org.junit.Assert.assertTrue;

import org.apache.tika.Tika;

/**
 * @author Russell Bateman
 * @since September 2016
 */
public class MimeDetectionTest
{
    @Test
    public void testByOpenFile() throws IOException
    {
        File file = new File( "src/test/resources/Statement_201412.pdf" );
        Tika tika = new Tika();
        String filetype = tika.detect( file );

        System.out.println( "File type is " + filetype + "." );
        assertTrue( "Wrong MIME type", filetype.equals( "application/pdf" ) );
    }

    @Test
    public void testByData()
    {
        final String text = "This is a test.";
        Tika tika = new Tika();
        String filetype = tika.detect( text.getBytes() );

        System.out.println( "Content type is " + filetype + "." );
        assertTrue( "Wrong MIME type", filetype.equals( "text/plain" ) );
    }
}
package com.etretatlogiciels.tika;

import org.junit.Test;
import static org.junit.Assert.assertTrue;

import org.apache.tika.language.LanguageIdentifier;

/**
 * Guess what? This simply doesn't work--what did I miss here?
 *
 * @author Russell Bateman
 * @since September 2016
 */
@SuppressWarnings( "deprecation" )
public class LanguageDetectionTest
{
    @Test
    public void testEnglish()
    {
        LanguageIdentifier identifier = new LanguageIdentifier( "this is english" );
        String language = identifier.getLanguage();

        System.out.println( "Language of string content is " + language );
        assertTrue( "Wrong language", language.equals( "en" ) );
    }

    /**
     * Well, this certainly doesn't work.
     */
    @Test
    public void testFrench()
    {
        LanguageIdentifier identifier = new LanguageIdentifier( "ce texte est en français" );
        String language = identifier.getLanguage();

        System.out.println( "Language of string content is " + language );
        assertTrue( "Wrong language", language.equals( "fr" ) );
    }
}
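Apparently LanguageIdentifier was deprecated in favor of the newer detectors in the tika-langdetect module. I haven't verified this against 1.13, but something like the following ought to be the modern way (the sample strings are just the ones from the test above; this needs the org.apache.tika:tika-langdetect artifact):

package com.etretatlogiciels.tika;

import java.io.IOException;

import org.apache.tika.langdetect.OptimaizeLangDetector;
import org.apache.tika.language.detect.LanguageDetector;
import org.apache.tika.language.detect.LanguageResult;

public class NewerLanguageDetection
{
    public static void main( String[] args ) throws IOException
    {
        // loadModels() reads the bundled language profiles
        LanguageDetector detector = new OptimaizeLangDetector().loadModels();

        LanguageResult english = detector.detect( "this is english" );
        LanguageResult french  = detector.detect( "ce texte est en français" );

        System.out.println( "English sample detected as " + english.getLanguage() );
        System.out.println( "French sample detected as  " + french.getLanguage() );
    }
}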
Tika in library form: I have been using the Tika JAR, but it's available via Maven.
<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-core</artifactId>
    <version>1.13</version>
</dependency>
<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-parsers</artifactId>
    <version>1.13</version>
</dependency>
Tika can be used from the command line or from an HTTP API (as shown yesterday). However, to use Tika from Java, use the Tika façade class.
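For the simplest cases, the façade does everything in a call or two. A quick sketch (the path is just my test PDF from the earlier tests):

package com.etretatlogiciels.tika;

import java.io.File;
import java.io.IOException;

import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;

public class FacadeSketch
{
    public static void main( String[] args ) throws IOException, TikaException
    {
        Tika tika = new Tika();
        File file = new File( "src/test/resources/Statement_201412.pdf" );

        // detect() sniffs the MIME type; parseToString() picks the right parser and returns plain text
        System.out.println( "MIME type: " + tika.detect( file ) );
        System.out.println( tika.parseToString( file ) );
    }
}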
Parsing a PDF with Tika's PDFParser:
package com.etretatlogiciels.tika;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

import org.junit.Test;
import static org.junit.Assert.assertTrue;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.pdf.PDFParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

/**
 * @author Russell Bateman
 * @since September 2016
 */
public class ExtractPdfTest
{
    @Test
    public void test() throws IOException, TikaException, SAXException
    {
        BodyContentHandler handler  = new BodyContentHandler();
        Metadata           metadata = new Metadata();
        FileInputStream    stream   = new FileInputStream( new File( "src/test/resources/Statement_201412.pdf" ) );
        ParseContext       context  = new ParseContext();
        PDFParser          parser   = new PDFParser();

        parser.parse( stream, handler, metadata, context );

        String contents = handler.toString();

        System.out.println( "PDF contents:" );
        System.out.println( contents );
        System.out.println( "Metadata: " );

        String[]       metadataNames = metadata.names();
        List< String > names         = Arrays.asList( metadataNames );

        for( String name : names )
            System.out.println( " " + name + " : " + metadata.get( name ) );

        assertTrue( "Not the expected content", contents.contains( "Bateman, Russell" ) );
        assertTrue( "Missing Content-Type",     names.contains( "Content-Type" ) );
        assertTrue( "Content-Type is not PDF",  metadata.get( "Content-Type" ).equals( "application/pdf" ) );
    }
}
Using AutoDetectParser (below) is simpler still than the façade.
package com.etretatlogiciels.tika;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

import org.junit.BeforeClass;
import org.junit.Test;

import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.Parser;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.sax.BodyContentHandler;
import org.apache.tika.exception.TikaException;
import org.apache.tika.parser.ParseContext;
import org.xml.sax.SAXException;

/**
 * I undertook this to determine whether AutoDetectParser is preferable
 * over Tika Facade. On the face of it, it is. It does more types, one
 * need not retro-fit new types later, it's as fast, etc.
 *
 * @author Russell Bateman
 * @since October 2016
 */
public class AutoDetectParserTest
{
    private static BodyContentHandler handler;
    private static Metadata           metadata;
    private static ParseContext       context;
    private static Parser             parser;

    private static final boolean VERBOSE = true;

    @BeforeClass
    public static void erectTikaParsingArtifacts()
    {
        long start = System.currentTimeMillis();

        // here we do not tell Tika we want to parse PDF; it will figure that out...
        // this statement takes about 3 seconds to execute:
        parser = new AutoDetectParser();

        System.out.println( "new AutoDetectParser() elapsed time: "
                + ( System.currentTimeMillis() - start )
                + " milliseconds (single instance saved for all tests)" );

        handler  = new BodyContentHandler();
        metadata = new Metadata();
        context  = new ParseContext();
    }

    private void printSummary()
    {
        String[]       metadataNames = metadata.names();
        List< String > names         = Arrays.asList( metadataNames );

        Collections.sort( names );

        if( VERBOSE )
        {
            System.out.println( metadata.get( "Content-Type" ) + " contents:" );
            System.out.println( handler.toString() );
            System.out.println( "Metadata: " );

            for( String name : names )
                System.out.println( " " + name + " : " + metadata.get( name ) );
        }
    }

    @Test
    public void testPdf() throws IOException, TikaException, SAXException
    {
        FileInputStream stream = new FileInputStream( new File( "src/test/resources/Statement_201412.pdf" ) );

        // this statement takes roughly 7 seconds to execute:
        long start = System.currentTimeMillis();
        parser.parse( stream, handler, metadata, context );
        System.out.println( "parser.parse( small PDF ) elapsed time: "
                + ( System.currentTimeMillis() - start ) + " milliseconds" );
        printSummary();
    }

    @Test
    public void testXml() throws IOException, TikaException, SAXException
    {
        FileInputStream stream = new FileInputStream( new File( "pom.xml" ) );

        long start = System.currentTimeMillis();
        parser.parse( stream, handler, metadata, context );
        System.out.println( "parser.parse( pom.xml ) elapsed time: "
                + ( System.currentTimeMillis() - start ) + " milliseconds" );
        printSummary();
    }

    @Test
    public void testHtml() throws IOException, TikaException, SAXException
    {
        FileInputStream stream = new FileInputStream( new File( "src/test/resources/sample.html" ) );

        long start = System.currentTimeMillis();
        parser.parse( stream, handler, metadata, context );
        System.out.println( "parser.parse( tiny HTML ) elapsed time: "
                + ( System.currentTimeMillis() - start ) + " milliseconds" );
        printSummary();
    }

    @Test
    public void testText() throws IOException, TikaException, SAXException
    {
        FileInputStream stream = new FileInputStream( new File( "src/test/resources/sample.txt" ) );

        long start = System.currentTimeMillis();
        parser.parse( stream, handler, metadata, context );
        System.out.println( "parser.parse( tiny text ) elapsed time: "
                + ( System.currentTimeMillis() - start ) + " milliseconds" );
        printSummary();
    }

    @Test
    public void testJpeg() throws IOException, TikaException, SAXException
    {
        FileInputStream stream = new FileInputStream( new File( "src/test/resources/test-first-code-second.jpg" ) );

        long start = System.currentTimeMillis();
        parser.parse( stream, handler, metadata, context );
        System.out.println( "parser.parse( JPeG ) elapsed time: "
                + ( System.currentTimeMillis() - start ) + " milliseconds" );
        printSummary();
    }

    @Test
    public void testPng() throws IOException, TikaException, SAXException
    {
        FileInputStream stream = new FileInputStream( new File( "src/test/resources/etl-fhir-quiver.png" ) );

        long start = System.currentTimeMillis();
        parser.parse( stream, handler, metadata, context );
        System.out.println( "parser.parse( PNG ) elapsed time: "
                + ( System.currentTimeMillis() - start ) + " milliseconds" );
        printSummary();
    }
}