Name

HDFS — enables you to read and write messages from/to an HDFS file system

Overview

HDFS is the distributed file system at the heart of Hadoop. This component can only be built with JDK 1.6 or later, because that is a strict requirement of Hadoop itself. The component is currently hosted at http://github.com/dgreco/camel-hdfs. It is kept there temporarily because Apache Camel is still built and tested with JDK 1.5, which prevents the component from being included in the official Apache Camel distribution.

URI format

The URI format for an HDFS endpoint is:

hdfs://hostname[:port][/path][?options]

The path is treated in the following way:

  1. As a consumer, if the path refers to a file, the consumer simply reads that file; if it refers to a directory, the consumer scans all the files under the path that satisfy the configured pattern. All the files in that directory must be of the same type.

  2. As a producer, if at least one split strategy is defined, the path is treated as a directory, and the producer creates a separate file for each split under that directory, named seg0, seg1, seg2, and so on (see the example routes below).
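
For example, the following Java DSL routes sketch the two usages. The host name, port, paths, and the file endpoints on the other side of each route are placeholders chosen for the example, not part of the component:

import org.apache.camel.builder.RouteBuilder;

public class HdfsRoutes extends RouteBuilder {
    @Override
    public void configure() {
        // Producer: with no split strategy defined, the path names the file that
        // is written to HDFS. Host name, port, and path are placeholders.
        from("file:/tmp/outbox")
            .to("hdfs://namenode.example.com:8020/user/camel/outbox/data");

        // Consumer: the path refers to a directory, so the consumer scans all the
        // files under it that satisfy the configured pattern.
        from("hdfs://namenode.example.com:8020/user/camel/inbox")
            .to("file:/tmp/inbox");
    }
}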

Options

Table 15, “HDFS options” lists the options for an HDFS endpoint.

Table 15. HDFS options

Name            Default Value  Description
overwrite       true           Specifies whether the file can be overwritten.
bufferSize      4096           Specifies the buffer size used by HDFS.
replication     3              Specifies the HDFS replication factor.
blockSize       67108864       Specifies the size of the HDFS blocks, in bytes.
fileType        NORMAL_FILE    Specifies the type of file to use. Valid values are
                               NORMAL_FILE, SEQUENCE_FILE, MAP_FILE, ARRAY_FILE, and
                               BLOOMMAP_FILE. See the Hadoop documentation for more
                               information.
fileSystemType  HDFS           Specifies the file system type. Set this option to
                               LOCAL to use the local file system.
keyType         NULL           Specifies the type of the key, in the case of
                               sequence or map files.
valueType       TEXT           Specifies the type of the value, in the case of
                               sequence or map files.
splitStrategy                  A string describing the strategy for splitting the
                               file, based on different criteria.
openedSuffix    opened         While a file is open for reading or writing, it is
                               renamed with this suffix so that it is not read
                               during the writing phase.
readSuffix      read           Once a file has been read, it is renamed with this
                               suffix so that it is not read again.
initialDelay    0              Specifies how long, in milliseconds, a consumer
                               waits before it starts scanning the directory.
delay           0              Specifies the interval, in milliseconds, between
                               directory scans.
pattern         *              The pattern used for scanning the directory.
chunkSize       4096           When reading a normal file, the file is split into
                               chunks of this size, producing one message per chunk.

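As a concrete illustration, the following routes sketch how several of these options can be combined on the endpoint URI. The host name, port, paths, surrounding endpoints, and option values shown here are placeholders chosen for the example, not recommendations:

import org.apache.camel.builder.RouteBuilder;

public class HdfsOptionsRoutes extends RouteBuilder {
    @Override
    public void configure() {
        // Producer: write a sequence file with a custom replication factor and
        // block size; an existing file is not overwritten.
        from("direct:toHdfs")
            .to("hdfs://namenode.example.com:8020/user/camel/output"
                + "?fileType=SEQUENCE_FILE&overwrite=false"
                + "&replication=2&blockSize=134217728");

        // Consumer: poll the directory every 5 seconds (after an initial 1 second
        // delay) and read each matching file as a sequence file with text values.
        from("hdfs://namenode.example.com:8020/user/camel/input"
                + "?fileType=SEQUENCE_FILE&valueType=TEXT"
                + "&initialDelay=1000&delay=5000")
            .to("log:hdfs-consumer");
    }
}
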

Related topics

Hadoop