Quickly import massive data into HBase through BulkLoad
When creating an HBase table for the first time, we may need to import a large amount of initialization data into it at once. The first approaches that come to mind are inserting records one by one through the HBase API, or writing them with a MapReduce job. However, these methods are either slow or occupy Region resources during the import, so they are inefficient and not suitable for loading a large amount of data in one go. This article introduces how to quickly import massive data into HBase with HBase's BulkLoad mechanism.
BulkLoad takes advantage of the fact that HBase stores its data in HDFS in a fixed format: it generates the persistent HFile files directly in HDFS and then hands them over to the cluster, which makes it possible to load huge amounts of data very quickly. It does not occupy Region resources and does not generate a huge amount of write I/O, so it needs far less CPU and network bandwidth than the normal write path. Under the hood, BulkLoad runs a MapReduce job that writes files in HBase's internal HFile format, and these files are then loaded directly into a running cluster. Importing data with BulkLoad therefore consumes less CPU and network resources than importing it through the HBase API.
Implementation principle
The BulkLoad process mainly consists of three parts:
• Extract data from the data source (usually text files or another database) and upload it to HDFS. This step has nothing to do with HBase, so you can use whatever method you are comfortable with; this article will not cover it.
• Process the prepared data with a MapReduce job. In most cases we need to write the Map function ourselves, while the Reduce function is provided by HBase and needs no attention from us. The job must use the rowkey as the output key, and a KeyValue, Put or Delete as the output value. The job uses HFileOutputFormat2 to generate the HBase data files. To import data efficiently, HFileOutputFormat2 must be configured so that each output file fits into a single region; to achieve this, the job uses Hadoop's TotalOrderPartitioner to partition the output according to the table's region boundaries. The method HFileOutputFormat2.configureIncrementalLoad() sets all of this up automatically.
• Tell the RegionServers where the data is and import it. This is the easiest step: it usually uses LoadIncrementalHFiles (better known as the completebulkload tool); given the location of the files on HDFS, it asks the RegionServers to load them into the appropriate regions.
The whole process diagram is as follows:
We have introduced the principle of HBase's BulkLoad above. Next we need to write a Mapper and a driver, which are implemented as follows:
Generate HFile files using MapReduce
import java.io.IOException;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class IteblogBulkLoadMapper extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // Each input line is expected to be a tab-separated record: rowkey \t url \t name
        String line = value.toString();
        String[] items = line.split("\t");
        ImmutableBytesWritable rowKey = new ImmutableBytesWritable(Bytes.toBytes(items[0]));
        Put put = new Put(Bytes.toBytes(items[0])); // ROWKEY
        put.addColumn(Bytes.toBytes("f1"), Bytes.toBytes("url"), Bytes.toBytes(items[1]));
        put.addColumn(Bytes.toBytes("f1"), Bytes.toBytes("name"), Bytes.toBytes(items[2]));
        context.write(rowKey, put);
    }
}
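For reference, the Mapper above expects every input line to be a tab-separated record whose fields are the rowkey, a url and a name. A couple of hypothetical input lines (the rowkeys and values here are invented purely to illustrate the format) would look like this:

row0001	https://www.iteblog.com	iteblog
row0002	https://hbase.apache.org	hbase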
Driver
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class IteblogBulkLoadDriver {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        final String SRC_PATH = "hdfs://iteblog:9000/user/iteblog/input";
        final String DESC_PATH = "hdfs://iteblog:9000/user/iteblog/output";
        Configuration conf = HBaseConfiguration.create();

        Job job = Job.getInstance(conf);
        job.setJarByClass(IteblogBulkLoadDriver.class);
        job.setMapperClass(IteblogBulkLoadMapper.class);
        job.setMapOutputKeyClass(ImmutableBytesWritable.class);
        job.setMapOutputValueClass(Put.class);
        job.setOutputFormatClass(HFileOutputFormat2.class);
        // Configure the job so that each output HFile falls within a single region of blog_info
        HTable table = new HTable(conf, "blog_info");
        HFileOutputFormat2.configureIncrementalLoad(job, table, table.getRegionLocator());
        FileInputFormat.addInputPath(job, new Path(SRC_PATH));
        FileOutputFormat.setOutputPath(job, new Path(DESC_PATH));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
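Because configureIncrementalLoad() produces one set of HFiles per region of the target table, for large imports it usually pays off to create blog_info with pre-split regions before running the job. Below is a minimal sketch using the same HBase 1.x client API as the rest of this article; the table name, column family and split points are only assumptions and should match your own rowkey distribution:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.util.Bytes;

public class CreatePreSplitTable {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Admin admin = connection.getAdmin()) {
            HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("blog_info"));
            desc.addFamily(new HColumnDescriptor("f1"));
            // Hypothetical split points; choose boundaries that match your rowkey distribution
            byte[][] splits = new byte[][]{Bytes.toBytes("g"), Bytes.toBytes("n"), Bytes.toBytes("t")};
            admin.createTable(desc, splits);
        }
    }
}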
Load HFile files with BulkLoad
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;

public class LoadIncrementalHFileToHBase {
    public static void main(String[] args) throws Exception {
        Configuration configuration = HBaseConfiguration.create();
        // Hand the generated HFiles over to the RegionServers serving the blog_info table
        LoadIncrementalHFiles loader = new LoadIncrementalHFiles(configuration);
        HTable hTable = new HTable(configuration, "blog_info");
        loader.doBulkLoad(new Path("hdfs://iteblog:9000/user/iteblog/output"), hTable);
    }
}
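Alternatively, instead of writing the small loader class above, you can invoke the completebulkload tool mentioned earlier directly from the command line, passing it the HFile directory and the table name, for example (with the paths and table name assumed in this article):

hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles hdfs://iteblog:9000/user/iteblog/output blog_info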
Since HBase's BulkLoad path bypasses writing to the WAL, writing to the MemStore and flushing to disk, data loaded this way is not recorded in the WAL, so WAL-based replication will not pick it up. Later, I will introduce how to use HBase's BulkLoad method to initialize data through Spark.
Use Cases for BulkLoad
• Loading the raw dataset into HBase for the first time. The initial dataset is likely to be large, and bypassing the HBase write path can significantly speed up the process.
• Incremental load. To load new data periodically, use BulkLoad and import the data in batches at your desired interval. This alleviates latency issues and helps you meet service level agreements (SLAs). However, the compaction trigger is the number of HFiles on the RegionServer, so frequently importing large numbers of HFiles may cause major compactions to run more often and hurt performance. You can mitigate this by tuning the compaction settings so that the maximum number of HFiles allowed before a compaction is triggered stays high, and by relying on other factors, such as MemStore size, to trigger compactions; a hypothetical hbase-site.xml fragment is shown after this list.
• The data needs to originate elsewhere. If another system captures the data you want in HBase and must stay live for business reasons, you can periodically bulk load that data from it into HBase, so you can operate on the data in HBase without disrupting the source system.
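For the compaction tuning mentioned in the second use case, the relevant knobs are the store-file thresholds. A hypothetical hbase-site.xml fragment (the values here are examples only, not recommendations) could look like this:

<property>
  <name>hbase.hstore.compactionThreshold</name>
  <value>10</value>
</property>
<property>
  <name>hbase.hstore.blockingStoreFiles</name>
  <value>30</value>
</property>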