Reading and Writing Avro Files from the Command Line
Apache Avro has become one of the most popular data serialization formats, particularly for Hadoop-based big data platforms, because tools like Pig, Hive and of course Hadoop itself natively support reading and writing data in Avro format. Many users seem to enjoy Avro, but I have heard many complaints about not being able to conveniently read or write Avro files with command line tools: “Avro is nice, but why do I have to write Java or Python code just to quickly see what’s in a binary Avro file, or at least discover its Avro schema?”
It often comes as a surprise to those users that Avro actually ships with exactly such command line tools, but apparently they are not prominently advertised or documented. In this short article I will show a few hands-on examples of how to read, write, compress and convert data from and to binary Avro using Avro Tools 1.7.4.
- What we want to do
- Getting Avro Tools
- Tools included in Avro Tools
- Example data
- Converting to and from binary Avro
- Known Issues of Snappy with JDK 7 on Mac OS X
- Where to go from here
What we want to do
Here is an overview of what we want to do:
- We will start with an example Avro schema and a corresponding data file in plain-text JSON format.
- We will use Avro Tools to convert the JSON file into binary Avro, without and with compression (Snappy), and from binary Avro back to JSON.
Getting Avro Tools
You can get a copy of the latest stable Avro Tools jar file from the Avro Releases page. The actual file is in the java subdirectory of a given Avro release version. Here is a direct link to avro-tools-1.7.4.jar (11 MB) on the US Apache mirror site.
Save avro-tools-1.7.4.jar to a directory of your choice. I will use ~/avro-tools-1.7.4.jar for the examples shown below.
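If you prefer to fetch the jar from the command line, you can also download it directly from the Apache archive. The URL below is an assumption based on the release layout described above (the java subdirectory of the avro-1.7.4 release); any regular mirror will work just as well:
$ wget -O ~/avro-tools-1.7.4.jar http://archive.apache.org/dist/avro/avro-1.7.4/java/avro-tools-1.7.4.jar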
Tools included in Avro Tools
Just run Avro Tools without any parameters to see what’s included:
$ java -jar ~/avro-tools-1.7.4.jar
[...snip...]
Available tools:
compile Generates Java code for the given schema.
concat Concatenates avro files without re-compressing.
fragtojson Renders a binary-encoded Avro datum as JSON.
fromjson Reads JSON records and writes an Avro data file.
fromtext Imports a text file into an avro data file.
getmeta Prints out the metadata of an Avro data file.
getschema Prints out schema of an Avro data file.
idl Generates a JSON schema from an Avro IDL file
induce Induce schema/protocol from Java class/interface via reflection.
jsontofrag Renders a JSON-encoded Avro datum as binary.
recodec Alters the codec of a data file.
rpcprotocol Output the protocol of a RPC service
rpcreceive Opens an RPC Server and listens for one message.
rpcsend Sends a single RPC message.
tether Run a tethered mapreduce job.
tojson Dumps an Avro data file as JSON, one record per line.
totext Converts an Avro data file to a text file.
trevni_meta Dumps a Trevni file's metadata as JSON.
trevni_random Create a Trevni file filled with random instances of a schema.
trevni_tojson Dumps a Trevni file as JSON.
Likewise, run any particular tool without parameters to see its usage/help output. For example, here is the help of the fromjson tool:
$ java -jar ~/avro-tools-1.7.4.jar fromjson
Expected 1 arg: input_file
Option Description
------ -----------
--codec Compression codec (default: null)
--schema Schema
--schema-file Schema File
Note that most of the tools write to STDOUT, so normally you will want to redirect their output to a file via the Bash > redirection operator (particularly when working with large files).
Example data
In the next sections I will use the following example data to demonstrate Avro Tools. You can also download the example files from https://github.com/miguno/avro-cli-examples.
Avro schema
The schema below defines a tuple of (username, tweet, timestamp) as the format of our example data records.
File: twitter.avsc:
{
"type" : "record",
"name" : "twitter_schema",
"namespace" : "com.miguno.avro",
"fields" : [ {
"name" : "username",
"type" : "string",
"doc" : "Name of the user account on Twitter.com"
}, {
"name" : "tweet",
"type" : "string",
"doc" : "The content of the user's Twitter message"
}, {
"name" : "timestamp",
"type" : "long",
"doc" : "Unix epoch time in seconds"
} ],
"doc:" : "A basic schema for storing Twitter messages"
}
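As an aside: if you ever do want generated Java classes for this schema (not needed for any of the command line examples in this article), the compile tool from the list above can create them for you. A minimal sketch, assuming you want the generated sources written to a directory named generated-sources (the class will end up under the schema's namespace, i.e. com/miguno/avro/):
$ java -jar ~/avro-tools-1.7.4.jar compile schema twitter.avsc generated-sources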
Data records in JSON format
And here is some corresponding example data with two records that follow the schema defined in the previous section. We store this data in the file twitter.json.
Example data in twitter.json:
{"username":"miguno","tweet":"Rock: Nerf paper, scissors is fine.","timestamp": 1366150681 }
{"username":"BlizzardCS","tweet":"Works as intended. Terran is IMBA.","timestamp": 1366154481 }
Converting to and from binary Avro
JSON to binary Avro
Without compression:
$ java -jar ~/avro-tools-1.7.4.jar fromjson --schema-file twitter.avsc twitter.json > twitter.avro
With Snappy compression:
$ java -jar ~/avro-tools-1.7.4.jar fromjson --codec snappy --schema-file twitter.avsc twitter.json > twitter.snappy.avro
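If you want to double-check which codec actually ended up in a file, the getmeta tool from the list above prints an Avro data file's metadata; for the Snappy-compressed file this should include an avro.codec entry (alongside the embedded avro.schema):
$ java -jar ~/avro-tools-1.7.4.jar getmeta twitter.snappy.avro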
Binary Avro to JSON
The same command will work on both uncompressed and compressed data.
$ java -jar ~/avro-tools-1.7.4.jar tojson twitter.avro > twitter.json
$ java -jar ~/avro-tools-1.7.4.jar tojson twitter.snappy.avro > twitter.json
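Because tojson writes one JSON record per line to STDOUT, it also combines nicely with standard Unix tools when you just want a quick peek at a large file rather than a full conversion, for example:
$ java -jar ~/avro-tools-1.7.4.jar tojson twitter.snappy.avro | head -n 1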
Retrieve Avro schema from binary Avro
The same command will work on both uncompressed and compressed data.
$ java -jar ~/avro-tools-1.7.4.jar getschema twitter.avro > twitter.avsc
$ java -jar ~/avro-tools-1.7.4.jar getschema twitter.snappy.avro > twitter.avsc
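Related to the conversions above: the recodec tool from the list changes the compression codec of an existing Avro data file directly, i.e. without a detour through JSON. Treat the invocation below as a sketch and check the tool's usage output first (run it without parameters, just like the other tools); I am assuming here that it takes an input file and an output file as arguments:
$ java -jar ~/avro-tools-1.7.4.jar recodec --codec snappy twitter.avro twitter.recodec.snappy.avro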
Known Issues of Snappy with JDK 7 on Mac OS X
If you happen to use JDK 7 on Mac OS X 10.8, you will run into the error below when trying to run the Snappy-related commands above. In that case make sure to explicitly use JDK 6. On Mac OS X 10.8 the JDK 6 java binary is by default available at /System/Library/Java/JavaVirtualMachines/1.6.0.jdk/Contents/Home/bin/java.
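For example, to run the Snappy compression command from above with the JDK 6 binary:
$ /System/Library/Java/JavaVirtualMachines/1.6.0.jdk/Contents/Home/bin/java -jar ~/avro-tools-1.7.4.jar fromjson --codec snappy --schema-file twitter.avsc twitter.json > twitter.snappy.avro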
The cause of this problem is documented in the snappy-java bug report Native (Snappy) library loading fails on openjdk7u4 for mac. The bug is already fixed in the latest snappy-java 1.0.5 milestone releases, but Avro 1.7.4 still depends on the latest stable release of snappy-java, which is 1.0.4.1 (see lang/java/pom.xml in the Avro source code).
I also found that one way to fix this problem when writing your own Java code is to explicitly require a snappy-java 1.0.5 milestone release. Here is the relevant dependency declaration for build.gradle in case you are using Gradle. This seems to solve the problem, but I have yet to confirm whether this is a safe approach for production scenarios.
// Required to fix a Snappy native library error on OS X when trying to compress Avro files with Snappy;
// Avro 1.7.4 uses the latest stable release of Snappy, 1.0.4.1 (see avro/lang/java/pom.xml) that still contains
// the original bug described at https://github.com/xerial/snappy-java/issues/6.
//
// Note that in a production setting we do not care about OS X, so we could use Snappy 1.0.4.1 as required by
// Avro 1.7.4 as is.
//
compile group: 'org.xerial.snappy', name: 'snappy-java', version: '1.0.5-M4'
Detailed error message:
$ uname -a
Darwin mac.local 12.3.0 Darwin Kernel Version 12.3.0: Sun Jan 6 22:37:10 PST 2013; root:xnu-2050.22.13~1/RELEASE_X86_64 x86_64
$ java -version
java version "1.7.0_17"
Java(TM) SE Runtime Environment (build 1.7.0_17-b02)
Java HotSpot(TM) 64-Bit Server VM (build 23.7-b01, mixed mode)
$ java -jar ~/avro-tools-1.7.4.jar fromjson --codec snappy --schema-file twitter.avsc twitter.json > twitter.snappy.avro
java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:601)
at org.xerial.snappy.SnappyLoader.loadNativeLibrary(SnappyLoader.java:317)
at org.xerial.snappy.SnappyLoader.load(SnappyLoader.java:219)
at org.xerial.snappy.Snappy.<clinit>(Snappy.java:44)
at org.apache.avro.file.SnappyCodec.compress(SnappyCodec.java:43)
at org.apache.avro.file.DataFileStream$DataBlock.compressUsing(DataFileStream.java:349)
at org.apache.avro.file.DataFileWriter.writeBlock(DataFileWriter.java:348)
at org.apache.avro.file.DataFileWriter.writeIfBlockFull(DataFileWriter.java:295)
at org.apache.avro.file.DataFileWriter.append(DataFileWriter.java:266)
at org.apache.avro.tool.DataFileWriteTool.run(DataFileWriteTool.java:109)
at org.apache.avro.tool.Main.run(Main.java:80)
at org.apache.avro.tool.Main.main(Main.java:69)
Caused by: java.lang.UnsatisfiedLinkError: no snappyjava in java.library.path
at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1860)
at java.lang.Runtime.loadLibrary0(Runtime.java:845)
at java.lang.System.loadLibrary(System.java:1084)
at org.xerial.snappy.SnappyNativeLoader.loadLibrary(SnappyNativeLoader.java:52)
... 15 more
Exception in thread "main" org.xerial.snappy.SnappyError: [FAILED_TO_LOAD_NATIVE_LIBRARY] null
at org.xerial.snappy.SnappyLoader.load(SnappyLoader.java:229)
at org.xerial.snappy.Snappy.<clinit>(Snappy.java:44)
at org.apache.avro.file.SnappyCodec.compress(SnappyCodec.java:43)
at org.apache.avro.file.DataFileStream$DataBlock.compressUsing(DataFileStream.java:349)
at org.apache.avro.file.DataFileWriter.writeBlock(DataFileWriter.java:348)
at org.apache.avro.file.DataFileWriter.writeIfBlockFull(DataFileWriter.java:295)
at org.apache.avro.file.DataFileWriter.append(DataFileWriter.java:266)
at org.apache.avro.tool.DataFileWriteTool.run(DataFileWriteTool.java:109)
at org.apache.avro.tool.Main.run(Main.java:80)
at org.apache.avro.tool.Main.main(Main.java:69)
Where to go from here
The example commands above show just a few variants of how to use Avro Tools to read, write and convert Avro files. The Avro Tools library itself is documented in the Avro Javadocs (see the org.apache.avro.tool package).
That said, I found those docs not that helpful (the sources are, however). I'd recommend simply running the command line tools without parameters and reading the usage instructions they print to STDOUT. Normally this is enough to understand how they should be used.