Avro is a data serialization format with rich features: it supports complex data structures and RPC, and it does not require generating code to read or write its files. From release 1.4.0 onwards you can also use Avro from within Hadoop's MapReduce (though only Java supports that at the moment).
Here is a sample code snippet that shows how one can serialize (write, in human terms) and deserialize (read) Avro's 'record' data type using its Python module (installable via `easy_install avro`).
# Import the schema, datafile and io submodules
# from avro (easy_install avro)
from avro import schema, datafile, io

OUTFILE_NAME = 'sample.avro'

SCHEMA_STR = """{
    "type": "record",
    "name": "sampleAvro",
    "namespace": "AVRO",
    "fields": [
        { "name": "name"   , "type": "string" },
        { "name": "age"    , "type": "int"    },
        { "name": "address", "type": "string" },
        { "name": "value"  , "type": "long"   }
    ]
}"""

SCHEMA = schema.parse(SCHEMA_STR)

def write_avro_file():
    # Let's generate our data
    data = {}
    data['name']    = 'Foo'
    data['age']     = 19
    data['address'] = '10, Bar Eggs Spam'
    data['value']   = 800

    # Create a 'record' (datum) writer
    rec_writer = io.DatumWriter(SCHEMA)

    # Create a 'data file' (avro file) writer
    df_writer = datafile.DataFileWriter(
                    # The file to contain the records
                    open(OUTFILE_NAME, 'wb'),
                    # The 'record' (datum) writer
                    rec_writer,
                    # Schema, if writing a new file
                    # (aka not 'appending')
                    # (Schema is stored into the file,
                    # so not needed when you want the
                    # writer to append instead)
                    writers_schema=SCHEMA,
                    # An optional codec name for
                    # compression ('null' for none)
                    codec='deflate'
                )

    # Write our data
    # (You can call append multiple times
    # to write more than one record, of course)
    df_writer.append(data)

    # Close to ensure writing is complete
    df_writer.close()

def read_avro_file():
    # Create a 'record' (datum) reader
    # You can pass an 'expected=SCHEMA' kwarg
    # if you want it to expect a particular
    # schema (Strict)
    rec_reader = io.DatumReader()

    # Create a 'data file' (avro file) reader
    df_reader = datafile.DataFileReader(
                    open(OUTFILE_NAME, 'rb'),
                    rec_reader
                )

    # Read all records stored inside
    for record in df_reader:
        print record['name'], record['age']
        print record['address'], record['value']
        # Do whatever read-processing you wanna do
        # for each record here ...

if __name__ == '__main__':
    # Write an AVRO file first
    write_avro_file()

    # Now, read it
    read_avro_file()
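As the comments above note, the schema gets stored inside the data file itself, so a writer that is appending to an existing file does not need the schema passed in again. Below is a minimal sketch of what appending could look like with the same Python module; the 'ab+' open mode is my assumption (the writer has to read the existing header back before it can append), so treat it as a starting point rather than a definitive recipe.

from avro import datafile, io

OUTFILE_NAME = 'sample.avro'  # the file written by the snippet above

def append_avro_file(more_data):
    # No schema given to the datum writer; DataFileWriter fills it in
    # from the header of the existing file
    rec_writer = io.DatumWriter()

    df_writer = datafile.DataFileWriter(
                    # Assumption: opened for reading and appending,
                    # since the stored schema and sync marker have to
                    # be read back before new records are written
                    open(OUTFILE_NAME, 'ab+'),
                    rec_writer
                )

    df_writer.append(more_data)
    df_writer.close()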
I hope the snippet explains enough about how one can write and read Avro data files. The same technique works for Java and Ruby as well, although those languages have slightly different abstractions around it.
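One more reading detail worth calling out: as the comment in read_avro_file() says, the datum reader can be handed the schema you expect, so records get resolved into that shape rather than whatever the writer happened to use. A small sketch of that follows; the expected keyword is taken from the comment above (newer releases of the Python module call this parameter readers_schema), so adjust to whichever your installed version accepts.

from avro import datafile, io

OUTFILE_NAME = 'sample.avro'  # the file written by the snippet above

def read_avro_file_strict(expected_schema):
    # Ask the datum reader to resolve every record into expected_schema
    rec_reader = io.DatumReader(expected=expected_schema)

    df_reader = datafile.DataFileReader(
                    open(OUTFILE_NAME, 'rb'),
                    rec_reader
                )

    for record in df_reader:
        print record['name'], record['value']

    df_reader.close()

For example, read_avro_file_strict(SCHEMA) with the SCHEMA defined earlier reads the file back while checking the records against that exact schema.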
rgds.
to be continued...
