Published 2024-06-02 · Updated 2024-07-07 · Big Data Technology / HBase

QuickPassHBase

Quick Start with HBase

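⚙ 1. Introduction to HBase

1.1 What HBase Is

Apache HBase is a distributed, scalable NoSQL database that uses HDFS for data storage.

HBase's design follows Google's BigTable paper, whose opening description of the data model reads:

A Bigtable is a sparse, distributed, persistent multi-dimensional sorted map. The map is indexed by a row key, column key, and a timestamp; each value in the map is an uninterpreted array of bytes.

HBase uses a data model very similar to that of BigTable. Users store data rows in labelled tables. A data row has a sortable key and an arbitrary number of columns. The table is stored sparsely, so rows in the same table can have wildly varying columns if the user likes.

1.2 The HBase Data Model

1.2.1 Logical Structure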

{
    "row_key1": {
        "personal_info": {
            "name": "ZhangSan",
            "city": "Beijing",
            "phone": "156****0000"
        },
        "office_info": {
            "tel": "010-1234567",
            "address": "Shandong"
        }
    },
    "row_key11": {
        "personal_info": {
            "city": "Shanghai",
            "phone": "131****0000"
        },
        "office_info": {
            "tel": "010-1234567"
        }
    },
    "row_key2": {
        ...
    }
}

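Column headers are written as column family:column.

| RowKey | personal_info:name | personal_info:city | personal_info:phone | office_info:tel | office_info:address |
| --- | --- | --- | --- | --- | --- |
| row_key1 | ZhangSan | Beijing | 156****0000 | 010-1234567 | Shandong |
| row_key11 | | Shanghai | 131****0000 | 010-1234567 | |
| row_key2 | ... | ... | ... | ... | ... |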

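In the table above:

- personal_info and office_info are column families.
- name, city, phone, tel, and address are columns.
- row_key1 and row_key11 are row keys.
- Splitting the whole table horizontally, by row, yields several smaller tables, each called a **Region**; Regions are what make the storage distributed.
- Splitting the whole table vertically, by column family, yields several **Stores**, each kept in its own directory in the underlying storage so that files map cleanly to column families.

The data is sparse and multi-dimensional, and different rows can have different columns. Storage is ordered as a whole: rows are sorted by RowKey in lexicographic (byte) order, and a RowKey is a byte array.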
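To make the ordering and the row-wise split concrete, here is a minimal sketch in plain Java (no HBase dependency). It sorts the example row keys as unsigned byte arrays and then looks up which Region would hold a key. The class name, the region names, and the split point at row_key2 are assumptions made up for this illustration; they are not part of HBase.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeMap;

final class RowKeyOrderSketch {

    /** Sorts row keys as unsigned byte arrays, i.e. in lexicographic byte order. */
    static List<String> sortAsBytes(List<String> keys) {
        List<byte[]> raw = new ArrayList<>();
        for (String key : keys) {
            raw.add(key.getBytes(StandardCharsets.UTF_8));
        }
        raw.sort((a, b) -> Arrays.compareUnsigned(a, b));
        List<String> sorted = new ArrayList<>();
        for (byte[] bytes : raw) {
            sorted.add(new String(bytes, StandardCharsets.UTF_8));
        }
        return sorted;
    }

    /**
     * Finds the Region whose key range contains rowKey. Regions are indexed by
     * their inclusive start key. String order matches byte order here only
     * because the example keys are plain ASCII.
     */
    static String locateRegion(TreeMap<String, String> regionsByStartKey, String rowKey) {
        return regionsByStartKey.floorEntry(rowKey).getValue();
    }

    public static void main(String[] args) {
        System.out.println(sortAsBytes(List.of("row_key2", "row_key11", "row_key1")));

        // Hypothetical split: region-A holds ["", "row_key2"), region-B holds ["row_key2", end).
        TreeMap<String, String> regions = new TreeMap<>();
        regions.put("", "region-A");
        regions.put("row_key2", "region-B");
        System.out.println(locateRegion(regions, "row_key11"));
        System.out.println(locateRegion(regions, "row_key2"));
    }
}
```

Note that row_key11 sorts before row_key2 because the comparison is byte by byte, not numeric, which is why it lands in region-A in this sketch.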

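1.2.2 Physical Structure

The physical storage structure is the actual mapping of the data; cells that are empty in the conceptual view are simply not stored at all.

A Store as laid out in HDFS looks like this: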

| RowKey | personal_info:name | personal_info:city | personal_info:phone |
| --- | --- | --- | --- |
| row_key1 | ZhangSan | Beijing | 156****0000 |
| row_key11 | | Shanghai | 131****0000 |
| row_key2 | ... | ... | ... |

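Underneath, this must be stored as a map of **(Key, Value)** pairs, where the Value is a field value such as "ZhangSan". So what is the Key?

To pin down the value "ZhangSan", the Key has to identify it uniquely, which leads to the following layout: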

| Row Key | Column Family | Column Qualifier | Timestamp | Type | Value |
| --- | --- | --- | --- | --- | --- |
| row_key1 | personal_info | name | t1 | Put | ZhangSan |
| row_key1 | personal_info | city | t2 | Put | Beijing |
| row_key1 | personal_info | phone | t3 | Put | 156****0000 |
| row_key1 | personal_info | phone | t4 | Put | 156****0001 |
| row_key1 | personal_info | phone | t5 | Delete | 156****0001 |
| … | … | … | … | … | … |

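Files in HDFS cannot be modified in place, yet HBase must support updates. The **timestamp** solves this: different versions of a value are distinguished by their timestamps, and a read returns the newest version by default.

In the table above, the entry at t4 is a modification of the one at t3: phone changes from 156****0000 at t3 to 156****0001 at t4, and a read returns the t4 value by default. That is how an update is carried out.

Deletes work the same way. Rather than removing data in place, HBase inserts a record whose **Type** is Delete.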
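The append-only idea can be sketched in a few lines of plain Java. This is a toy model of a single column, not HBase code: the class and method names are invented for the example, and it makes one simplifying assumption, namely that a delete marker hides every version at or before its timestamp (HBase itself has several delete marker types with narrower or wider scope).

```java
import java.util.Map;
import java.util.Optional;
import java.util.TreeMap;

final class VersionedCellSketch {
    enum Type { PUT, DELETE }

    /** One appended record: its type and, for PUT, the value written. */
    static final class Version {
        final Type type;
        final String value;

        Version(Type type, String value) {
            this.type = type;
            this.value = value;
        }
    }

    // History of one column, keyed by timestamp; records are only ever added.
    private final TreeMap<Long, Version> history = new TreeMap<>();

    void put(long timestamp, String value) {
        history.put(timestamp, new Version(Type.PUT, value));
    }

    /** A delete is just another record; nothing is removed from the history. */
    void delete(long timestamp) {
        history.put(timestamp, new Version(Type.DELETE, null));
    }

    /** Returns the newest value, or empty if the newest record is a delete marker. */
    Optional<String> read() {
        Map.Entry<Long, Version> newest = history.lastEntry();
        if (newest == null || newest.getValue().type == Type.DELETE) {
            return Optional.empty();
        }
        return Optional.of(newest.getValue().value);
    }

    int recordCount() {
        return history.size();
    }

    public static void main(String[] args) {
        VersionedCellSketch phone = new VersionedCellSketch();
        phone.put(3L, "156****0000");
        phone.put(4L, "156****0001");
        System.out.println(phone.read());        // the t4 value shadows the t3 value
        phone.delete(5L);
        System.out.println(phone.read());        // hidden by the delete marker
        System.out.println(phone.recordCount()); // all three records are still stored
    }
}
```

The point of the sketch is that both the update and the delete are writes: the history grows to three records, and only the read path decides what is visible.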

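1.2.3 Data Model Terms

Namespace

Similar to a database in a relational DBMS; each namespace holds multiple tables. HBase ships with two namespaces, hbase and default: hbase holds HBase's built-in tables, and default is the namespace users work in when none is specified.

Table

Similar to a table in a relational database, except that an HBase table definition only declares column families, not individual columns. Because storage is sparse, fields can be specified dynamically, as needed, at write time. Compared with a relational database, HBase therefore copes easily with changing fields.

Note that column families are what make dynamically added columns (fields) possible.

Row

Each row in an HBase table consists of **one RowKey and multiple columns**. Rows are stored in lexicographic order of RowKey, and **direct lookups can only go through the RowKey**, so RowKey design matters a great deal.

Column

Every column is identified by a column family and a column qualifier, for example info:name and info:age. Only the column family is declared when the table is created; qualifiers need no advance definition. "Column qualifier" sounds fancy, but it simply means the column name.

Timestamp

Identifies the **version** of a value. Unless the writer supplies one, the system adds this field automatically on every write, using the time the data was written to HBase.

Cell

The unit uniquely identified by {rowkey, column family:column qualifier, timestamp}. Everything in a Cell is stored as bytes.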
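As a rough illustration of how these coordinates combine into one sortable key, the sketch below orders cell keys by row, then column family, then qualifier, and finally by timestamp in descending order so that the newest version of a column comes first. The CellKey class is a stand-in written for this example, not an HBase type, and plain String comparison stands in for byte comparison, which is only equivalent for ASCII keys.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class CellKeyOrderSketch {

    /** Illustrative coordinates of one Cell. */
    static final class CellKey {
        final String row;
        final String family;
        final String qualifier;
        final long timestamp;

        CellKey(String row, String family, String qualifier, long timestamp) {
            this.row = row;
            this.family = family;
            this.qualifier = qualifier;
            this.timestamp = timestamp;
        }

        @Override
        public String toString() {
            return row + "/" + family + ":" + qualifier + "/" + timestamp;
        }
    }

    // Row, family and qualifier ascending; timestamp descending (newest first).
    static final Comparator<CellKey> ORDER =
        Comparator.comparing((CellKey k) -> k.row)
            .thenComparing(k -> k.family)
            .thenComparing(k -> k.qualifier)
            .thenComparing(Comparator.comparingLong((CellKey k) -> k.timestamp).reversed());

    /** Returns the keys in storage order, rendered as strings. */
    static List<String> sorted(List<CellKey> keys) {
        List<CellKey> copy = new ArrayList<>(keys);
        copy.sort(ORDER);
        List<String> rendered = new ArrayList<>();
        for (CellKey key : copy) {
            rendered.add(key.toString());
        }
        return rendered;
    }

    public static void main(String[] args) {
        List<CellKey> keys = List.of(
            new CellKey("row_key1", "personal_info", "phone", 3L),
            new CellKey("row_key1", "personal_info", "name", 1L),
            new CellKey("row_key1", "personal_info", "phone", 4L));
        for (String line : sorted(keys)) {
            System.out.println(line);
        }
    }
}
```

With newest-first ordering inside a column, a reader that wants the current value can stop at the first entry it finds for that column.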

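1.3 Basic Architecture

Master

The main process; its implementation class is HMaster, and it is usually deployed on the NameNode host.

Main role: monitors RegionServer processes through ZooKeeper, serves as the entry point for all metadata changes, and runs internal threads that watch over region failover and splitting.

In more detail:
- Manages the metadata table hbase:meta: receives and executes users' commands to create, modify, and delete tables.
- Watches whether RegionServers need load balancing, failover, or Region splitting, using several background threads:
  - LoadBalancer

    Periodically checks whether regions are evenly distributed across RegionServers. The interval is set by hbase.balancer.period and defaults to 5 minutes.
  - CatalogJanitor

    Periodically checks and cleans up the data in hbase:meta.
  - MasterProcWAL (Master write-ahead log)

    Records the tasks the Master has to carry out in a write-ahead log (WAL); if the Master goes down, the backup Master picks up where it left off.

RegionServer

The main process; its implementation class is HRegionServer, and it is usually deployed on the DataNode hosts.

Main role: handles Cell data, and is the component that actually carries out region splits and merges.

In more detail:
- Handles Cell data, for example writes (put) and reads (get).
- Actually performs region splits and merges: the Master monitors, the RegionServer executes.

ZooKeeper

HBase uses ZooKeeper for Master high availability, for recording which RegionServers are deployed, and for storing the location of the meta table.
For reads and writes, clients contact ZooKeeper directly. Version 2.3 introduced the Master Registry mode, in which clients can contact the Master directly instead; enabling it shifts load from ZooKeeper onto the Master.

HDFS

HDFS provides the underlying data storage for HBase and gives HBase its fault tolerance.

In the figure above, the Regions are spread across the three RegionServers at random, kept as balanced as possible. The table hbase:meta is a special case: it is stored in HDFS but managed by the Master.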

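🔧 2. Quick Start

2.1 Installation and Deployment

2.1.1 Distributed Deployment

At least 3 virtual machines:
hadoop101
hadoop102
hadoop103
Make sure ZooKeeper is deployed correctly, then start it:
zkServer.sh start
Make sure Hadoop is deployed correctly, then start it:
start-dfs.sh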

Configure the HBase environment

① Download the HBase release archive, assumed here to be hbase-2.4.11-bin.tar.gz

② Extract the archive into a directory

tar -zxvf /path/to/hbase-2.4.11-bin.tar.gz -C /path/to/module

③ In your home directory, add the user environment variables

vim .bashrc

#HBASE_HOME
export HBASE_HOME=/path/to/module/hbase-2.4.11
export PATH=$PATH:$HBASE_HOME/bin

④ Apply the environment variables

source .bashrc

⑤ Edit the configuration files

hbase-env.sh

# Whether HBase should manage its own bundled ZooKeeper; defaults to true
# We use the ZooKeeper already set up on these machines, so set it to false
export HBASE_MANAGES_ZK=false

hbase-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <!-- ZooKeeper quorum addresses -->
    <property>
        <name>hbase.zookeeper.quorum</name>
        <value>hadoop101,hadoop102,hadoop103</value>
    </property>

    <!-- Where HBase stores its data in HDFS -->
    <property>
        <name>hbase.rootdir</name>
        <value>hdfs://hadoop101:8020/hbase</value>
    </property>

    <!-- HBase run mode -->
    <!-- false: standalone mode, HBase and ZooKeeper run in the same JVM -->
    <!-- true: distributed mode -->
    <property>
        <name>hbase.cluster.distributed</name>
        <value>true</value>
    </property>

    <!-- Where ZooKeeper snapshots are stored -->
    <!-- Replace with your own /path/to/ZooKeeperDir -->
    <property>
        <name>hbase.zookeeper.property.dataDir</name>
        <value>/opt/module/zookeeper-3.4.6/data</value>
    </property>

    <!-- Whether to enforce hflush/hsync stream capability checks on the underlying filesystem -->
    <!-- This tutorial sets it to false for the distributed deployment -->
    <property>
        <name>hbase.unsafe.stream.capability.enforce</name>
        <value>false</value>
    </property>
</configuration>

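regionservers
hadoop101
hadoop102
hadoop103

⑥ Resolve the log4j incompatibility by removing the conflicting .jar from either HBase or Hadoop

⑦ Sync the HBase configuration to the other nodes with scp (passwordless SSH must be set up beforehand), or use xsync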

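Starting the HBase services

Starting a single node

# Start HMaster on this node only
hbase-daemon.sh start master
# Start HRegionServer on this node only
hbase-daemon.sh start regionserver

Starting the cluster
start-hbase.sh
Stopping the services
stop-hbase.sh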

2.1.2 High Availability

If HBase is already running, stop it first:
stop-hbase.sh

Add the configuration file backup-masters

# Either touch or echo can be used to create the file
touch $HBASE_HOME/conf/backup-masters
vim $HBASE_HOME/conf/backup-masters

Add the content: hadoop102

Distribute the configuration file with scp
Start HBase. A normal startup runs the following processes:
hadoop101 -> HMaster HRegionServer
hadoop102 -> HMaster HRegionServer
hadoop103 -> HRegionServer
The HMaster on hadoop101 starts first and becomes the active master; the HMaster on hadoop102 starts later and becomes the **Backup Master**.

2.2 Usage

2.2.1 Shell Operations

Run hbase shell to open the HBase shell; every command can be looked up with help.

Typing help inside hbase shell prints the usage notes:

hbase shell

hbase(main):001:0> help
HBase Shell, version 2.1.8, rd8333e556c8ed739cf39dab58ddc6b43a50c0965, Tue Nov 19 15:29:04 UTC 2019
Type 'help "COMMAND"', (e.g. 'help "get"' -- the quotes are necessary) for help on a specific command.
Commands are grouped. Type 'help "COMMAND_GROUP"', (e.g. 'help "general"') for help on a command group.

COMMAND GROUPS:
  Group name: general
  Commands: processlist, status, table_help, version, whoami

  Group name: ddl
  Commands: alter, alter_async, alter_status, clone_table_schema, create, describe, disable, disable_all, drop, drop_all, enable, enable_all, exists, get_table, is_disabled, is_enabled, list, list_regions, locate_region, show_filters

  Group name: namespace
  Commands: alter_namespace, create_namespace, describe_namespace, drop_namespace, list_namespace, list_namespace_tables

  Group name: dml
  Commands: append, count, delete, deleteall, get, get_counter, get_splits, incr, put, scan, truncate, truncate_preserve

  Group name: tools
  Commands: assign, balance_switch, balancer, balancer_enabled, catalogjanitor_enabled, catalogjanitor_run, catalogjanitor_switch, cleaner_chore_enabled, cleaner_chore_run, cleaner_chore_switch, clear_block_cache, clear_compaction_queues, clear_deadservers, close_region, compact, compact_rs, compaction_state, flush, hbck_chore_run, is_in_maintenance_mode, list_deadservers, major_compact, merge_region, move, normalize, normalizer_enabled, normalizer_switch, split, splitormerge_enabled, splitormerge_switch, stop_master, stop_regionserver, trace, unassign, wal_roll, zk_dump

  Group name: replication
  Commands: add_peer, append_peer_exclude_namespaces, append_peer_exclude_tableCFs, append_peer_namespaces, append_peer_tableCFs, disable_peer, disable_table_replication, enable_peer, enable_table_replication, get_peer_config, list_peer_configs, list_peers, list_replicated_tables, remove_peer, remove_peer_exclude_namespaces, remove_peer_exclude_tableCFs, remove_peer_namespaces, remove_peer_tableCFs, set_peer_bandwidth, set_peer_exclude_namespaces, set_peer_exclude_tableCFs, set_peer_namespaces, set_peer_replicate_all, set_peer_serial, set_peer_tableCFs, show_peer_tableCFs, update_peer_config

  Group name: snapshots
  Commands: clone_snapshot, delete_all_snapshot, delete_snapshot, delete_table_snapshots, list_snapshots, list_table_snapshots, restore_snapshot, snapshot

  Group name: configuration
  Commands: update_all_config, update_config

  Group name: quotas
  Commands: list_quota_snapshots, list_quota_table_sizes, list_quotas, list_snapshot_sizes, set_quota

  Group name: security
  Commands: grant, list_security_capabilities, revoke, user_permission

  Group name: procedures
  Commands: list_locks, list_procedures

  Group name: visibility labels
  Commands: add_labels, clear_auths, get_auths, list_labels, set_auths, set_visibility

  Group name: rsgroup
  Commands: add_rsgroup, balance_rsgroup, get_rsgroup, get_server_rsgroup, get_table_rsgroup, list_rsgroups, move_namespaces_rsgroup, move_servers_namespaces_rsgroup, move_servers_rsgroup, move_servers_tables_rsgroup, move_tables_rsgroup, remove_rsgroup, remove_servers_rsgroup

SHELL USAGE:
Quote all names in HBase Shell such as table and column names.  Commas delimit
command parameters.  Type <RETURN> after entering a command to run it.
Dictionaries of configuration used in the creation and alteration of tables are
Ruby Hashes. They look like this:

  {'key1' => 'value1', 'key2' => 'value2', ...}

and are opened and closed with curley-braces.  Key/values are delimited by the
'=>' character combination.  Usually keys are predefined constants such as
NAME, VERSIONS, COMPRESSION, etc.  Constants do not need to be quoted.  Type
'Object.constants' to see a (messy) list of all constants in the environment.

If you are using binary keys or values and need to enter them in the shell, use
double-quote'd hexadecimal representation. For example:

  hbase> get 't1', "key\x03\x3f\xcd"
  hbase> get 't1', "key\003\023\011"
  hbase> put 't1', "test\xef\xff", 'f1:', "\x01\x33\x40"

The HBase shell is the (J)Ruby IRB with the above HBase-specific commands added.
For more on the HBase Shell, see http://hbase.apache.org/book.html

With the information above we can go on to operate the HBase database. The **command groups** used most often in day-to-day development are general, namespace, ddl, and dml, introduced in turn below:

General commands: general

Check HBase status with status, which reports the cluster state, such as the number of servers

hbase(main):001:0> status
1 active master, 0 backup masters, 1 servers, 0 dead, 4.0000 average load
Took 0.5268 seconds

Check the HBase version with version, which reports the version in use

hbase(main):002:0> version
2.1.8, rd8333e556c8ed739cf39dab58ddc6b43a50c0965, Tue Nov 19 15:29:04 UTC 2019
Took 0.0002 seconds

Help on table-reference commands: table_help

Information about the current user: whoami

hbase(main):003:0> whoami
nilera (auth:SIMPLE)
    groups: nilera
Took 0.0283 seconds

Working with namespaces

A **namespace** corresponds to a database in MySQL. The namespace commands are alter_namespace, create_namespace, describe_namespace, drop_namespace, list_namespace, and list_namespace_tables. The commonly used ones are introduced below:
- List all namespaces: list_namespace
  hbase(main):001:0> list_namespace
  NAMESPACE
  default
  hbase
  2 row(s)
  Took 0.5484 seconds
- Create a namespace: create_namespace

  Usage: create_namespace 'ns'
  hbase(main):001:0> create_namespace 'bigdata'
  Took 0.0432 seconds
  hbase(main):002:0> list_namespace
  NAMESPACE
  bigdata
  default
  hbase
  3 row(s)
  Took 0.0224 seconds
- Drop a namespace: drop_namespace

  Usage: drop_namespace 'ns'. The namespace must be empty before it can be dropped.
- Describe a namespace: describe_namespace

  Usage: describe_namespace 'ns'
  hbase(main):001:0> describe_namespace 'bigdata'
  DESCRIPTION
  {NAME => 'bigdata'}
  Took 0.0068 seconds
  => 1
- List the tables in a namespace: list_namespace_tables

  Usage: list_namespace_tables 'ns'
  hbase(main):001:0> list_namespace_tables 'default'
  TABLE
  logs
  user
  2 row(s)
  Took 0.3790 seconds
  => ["logs", "user"]

Data definition language: ddl

DDL (Data Definition Language) covers defining and changing table structure, data types, the links between tables, and so on. The ddl commands are: alter, alter_async, alter_status, clone_table_schema, create, describe, disable, disable_all, drop, drop_all, enable, enable_all, exists, get_table, is_disabled, is_enabled, list, list_regions, locate_region, show_filters. The commonly used ones are introduced below:

Creating a table: create

Common usage:

① create 'ns:tb', {NAME => 'cf', VERSIONS => 5}

Creates table tb in namespace ns with one column family, cf, that keeps up to 5 versions.

② When creating a table in the default namespace, ns can be omitted

③ create 'tb', 'cf1', 'cf2'

Creates table tb in the default namespace with two column families, cf1 and cf2

④ create 'tb', {NAME => 'cf1', VERSIONS => 5}, {NAME => 'cf2', VERSIONS => 5}

Creates table tb in the default namespace with two column families, cf1 and cf2, each keeping up to 5 versions.
hbase(main):001:0> create 'bigdata:person', {NAME => 'name', VERSIONS => 5}, {NAME => 'msg', VERSIONS => 5}
Created table bigdata:person
Took 1.5638 seconds
=> Hbase::Table - bigdata:person

Viewing a table's details: describe

Usage: describe 'tb'

hbase(main):010:0> describe 'bigdata:person'
Table bigdata:person is ENABLED
bigdata:person
COLUMN FAMILIES DESCRIPTION
{NAME => 'msg', VERSIONS => '5', EVICT_BLOCKS_ON_CLOSE => 'false', NEW_VERSION_BEHAVIOR => 'false', KEEP_DELETED_CELLS => 'FALSE', CACHE_DATA_ON_WRITE => 'fal
se', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', MIN_VERSIONS => '0', REPLICATION_SCOPE => '0', BLOOMFILTER => 'ROW', CACHE_INDEX_ON_WRITE => 'false', IN
_MEMORY => 'false', CACHE_BLOOMS_ON_WRITE => 'false', PREFETCH_BLOCKS_ON_OPEN => 'false', COMPRESSION => 'NONE', BLOCKCACHE => 'true', BLOCKSIZE => '65536'}
{NAME => 'name', VERSIONS => '5', EVICT_BLOCKS_ON_CLOSE => 'false', NEW_VERSION_BEHAVIOR => 'false', KEEP_DELETED_CELLS => 'FALSE', CACHE_DATA_ON_WRITE => 'fa
lse', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', MIN_VERSIONS => '0', REPLICATION_SCOPE => '0', BLOOMFILTER => 'ROW', CACHE_INDEX_ON_WRITE => 'false', I
N_MEMORY => 'false', CACHE_BLOOMS_ON_WRITE => 'false', PREFETCH_BLOCKS_ON_OPEN => 'false', COMPRESSION => 'NONE', BLOCKCACHE => 'true', BLOCKSIZE => '65536'}
2 row(s)
Took 0.1536 seconds

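Altering a table: alter

Every column-family setting given when the table was created can be changed later with alter, including adding and removing column families.

① Adding a column family and changing its settings both work by overwriting

Change a column family's version count, VERSIONS => 6: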

hbase(main):001:0> alter 'bigdata:person', NAME => 'msg', VERSIONS => 6
Updating all regions with the new schema...
1/1 regions updated.
Done.
Took 4.0145 seconds

Add the column family tel:

hbase(main):002:0> alter 'bigdata:person', NAME => 'tel', VERSIONS => 6
Updating all regions with the new schema...
1/1 regions updated.
Done.
Took 2.4498 seconds

Check the table after the changes:

hbase(main):003:0> describe 'bigdata:person'
Table bigdata:person is ENABLED
bigdata:person
COLUMN FAMILIES DESCRIPTION
{NAME => 'msg', VERSIONS => '6', EVICT_BLOCKS_ON_CLOSE => 'false', NEW_VERSION_BEHAVIOR => 'false', KEEP_DELETED_CELLS => 'FALSE', CACHE_DATA_ON_WRITE => 'fal
se', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', MIN_VERSIONS => '0', REPLICATION_SCOPE => '0', BLOOMFILTER => 'ROW', CACHE_INDEX_ON_WRITE => 'false', IN
_MEMORY => 'false', CACHE_BLOOMS_ON_WRITE => 'false', PREFETCH_BLOCKS_ON_OPEN => 'false', COMPRESSION => 'NONE', BLOCKCACHE => 'true', BLOCKSIZE => '65536'}

{NAME => 'name', VERSIONS => '5', EVICT_BLOCKS_ON_CLOSE => 'false', NEW_VERSION_BEHAVIOR => 'false', KEEP_DELETED_CELLS => 'FALSE', CACHE_DATA_ON_WRITE => 'fa
lse', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', MIN_VERSIONS => '0', REPLICATION_SCOPE => '0', BLOOMFILTER => 'ROW', CACHE_INDEX_ON_WRITE => 'false', I
N_MEMORY => 'false', CACHE_BLOOMS_ON_WRITE => 'false', PREFETCH_BLOCKS_ON_OPEN => 'false', COMPRESSION => 'NONE', BLOCKCACHE => 'true', BLOCKSIZE => '65536'}

{NAME => 'tel', VERSIONS => '6', EVICT_BLOCKS_ON_CLOSE => 'false', NEW_VERSION_BEHAVIOR => 'false', KEEP_DELETED_CELLS => 'FALSE', CACHE_DATA_ON_WRITE => 'fal
se', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', MIN_VERSIONS => '0', REPLICATION_SCOPE => '0', BLOOMFILTER => 'ROW', CACHE_INDEX_ON_WRITE => 'false', IN
_MEMORY => 'false', CACHE_BLOOMS_ON_WRITE => 'false', PREFETCH_BLOCKS_ON_OPEN => 'false', COMPRESSION => 'NONE', BLOCKCACHE => 'true', BLOCKSIZE => '65536'}
3 row(s)
Took 0.0795 seconds

② Removing a column family

A column family can be removed in either of the following two ways:

hbase(main):001:0> alter 'bigdata:person', NAME => 'tel', METHOD => 'delete'
Updating all regions with the new schema...
1/1 regions updated.
Done.
Took 2.1046 seconds

hbase(main):002:0> alter 'bigdata:person', 'delete' => 'msg'
Updating all regions with the new schema...
1/1 regions updated.
Done.
Took 2.9721 seconds

Then check the table after the changes:

hbase(main):003:0> describe 'bigdata:person'
Table bigdata:person is ENABLED
bigdata:person
COLUMN FAMILIES DESCRIPTION
{NAME => 'name', VERSIONS => '5', EVICT_BLOCKS_ON_CLOSE => 'false', NEW_VERSION_BEHAVIOR => 'false', KEEP_DELETED_CELLS => 'FALSE', CACHE_DATA_ON_WRITE => 'fa
lse', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', MIN_VERSIONS => '0', REPLICATION_SCOPE => '0', BLOOMFILTER => 'ROW', CACHE_INDEX_ON_WRITE => 'false', I
N_MEMORY => 'false', CACHE_BLOOMS_ON_WRITE => 'false', PREFETCH_BLOCKS_ON_OPEN => 'false', COMPRESSION => 'NONE', BLOCKCACHE => 'true', BLOCKSIZE => '65536'}
1 row(s)
Took 0.0391 seconds

Disabling a table: disable

Usage: disable 'ns:tb' or disable 'tb'

hbase(main):001:0> disable 'bigdata:person'
Took 0.9384 seconds

Dropping a table: drop

Usage: drop 'ns:tb' or drop 'tb'. The table must be disabled first; otherwise the following error appears:

hbase(main):001:0> drop 'bigdata:person'

ERROR: Table bigdata:person is enabled. Disable it first.

For usage try 'help "drop"'

Took 0.0248 seconds

After disabling the table, drop it:

hbase(main):001:0> drop 'bigdata:person'
Took 1.7106 seconds

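Data manipulation language: dml

DML (Data Manipulation Language) covers inserting, deleting, and modifying data.

Writing data: put

In HBase, the only thing you can write is a Cell, the lowest level of the structure. You can supply a timestamp by hand to set the Cell's version, but leaving it out is recommended; the current system time is then used. Writing repeatedly to the same rowKey and column stores multiple versions, with the newest one shadowing the older ones, so put serves as both insert and update.

Usage:

① put 'ns:tb', 'rk', 'col', 'value'

Writes value into table tb of namespace ns, at row key rk and column col, where col has the form cf:col (column family:column name).

Writing again to the same row key rk and the same col overwrites the previous value.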

hbase(main):001:0> put 'bigdata:student', '1001', 'info:name', 'zhangsan'
Took 0.2415 seconds
hbase(main):002:0> put 'bigdata:student', '1001', 'info:name', 'lisi'
Took 0.0121 seconds
hbase(main):003:0> put 'bigdata:student', '1001', 'info:name', 'wangwu'
Took 0.0342 seconds

hbase(main):004:0> put 'bigdata:student', '1002', 'info:name', 'zhaoliu'
Took 0.0082 seconds
hbase(main):005:0> put 'bigdata:student', '1003', 'info:age', '10'
Took 0.0050 seconds
hbase(main):006:0> put 'bigdata:student', '1003', 'info:sex', 'male'
Took 0.0054 seconds

② put 't1', 'r1', 'c1', 'value' works the same way, with the table in the default namespace.
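The two naming conventions used by put, 'ns:tb' for tables and 'cf:col' for columns, can be illustrated with a small plain-Java sketch. The helper names below are hypothetical and exist only for this example. The sketch also deliberately rejects a column without a family separator, which is a choice of this example and may be stricter than the shell itself.

```java
final class NameParsingSketch {

    /** Splits 'ns:tb' into {namespace, table}; a bare 'tb' falls into the default namespace. */
    static String[] splitTableName(String name) {
        int colon = name.indexOf(':');
        if (colon < 0) {
            return new String[] {"default", name};
        }
        return new String[] {name.substring(0, colon), name.substring(colon + 1)};
    }

    /** Splits 'cf:col' into {family, qualifier}; this sketch insists on the separator. */
    static String[] splitColumn(String column) {
        int colon = column.indexOf(':');
        if (colon < 0) {
            throw new IllegalArgumentException("expected cf:col, got " + column);
        }
        return new String[] {column.substring(0, colon), column.substring(colon + 1)};
    }

    public static void main(String[] args) {
        String[] table = splitTableName("bigdata:student");
        String[] column = splitColumn("info:name");
        System.out.println("namespace=" + table[0] + ", table=" + table[1]);
        System.out.println("family=" + column[0] + ", qualifier=" + column[1]);
        System.out.println("namespace=" + splitTableName("t1")[0]);
    }
}
```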

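Reading data: get/scan

There are two ways to read data: get and scan

get reads at most one row; it can also filter by column, and its result is one or more Cells.
scan scans the data and can read multiple rows. Scanning too much data is discouraged; use startRow and stopRow to bound the read. The range is half-open by default: start inclusive, stop exclusive.

① The get command

Some examples:
  hbase> t.get 'r1'                                      # show the data of row 'r1'
  hbase> t.get 'r1', {TIMERANGE => [ts1, ts2]}
  hbase> t.get 'r1', {COLUMN => 'c1'}                    # filter to a single column, 'c1'
  hbase> t.get 'r1', {COLUMN => ['c1', 'c2', 'c3']}      # filter to several columns: 'c1', 'c2', 'c3'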
  hbase> t.get 'r1', {COLUMN => 'c1', TIMESTAMP => ts1}
  hbase> t.get 'r1', {COLUMN => 'c1', TIMERANGE => [ts1, ts2], VERSIONS => 4}
  hbase> t.get 'r1', {COLUMN => 'c1', TIMESTAMP => ts1, VERSIONS => 4}
  hbase> t.get 'r1', {FILTER => "ValueFilter(=, 'binary:abc')"}
  hbase> t.get 'r1', 'c1'
  hbase> t.get 'r1', 'c1', 'c2'
  hbase> t.get 'r1', ['c1', 'c2']
  hbase> t.get 'r1', {CONSISTENCY => 'TIMELINE'}
  hbase> t.get 'r1', {CONSISTENCY => 'TIMELINE', REGION_REPLICA_ID => 1}

hbase(main):001:0> get 'bigdata:student', '1001'
COLUMN                                   CELL
 info:name                               timestamp=1717580289267, value=wangwu
1 row(s)
Took 0.0645 seconds

hbase(main):002:0> get 'bigdata:student', '1001', {COLUMN => 'info:name'}
COLUMN                                   CELL
 info:name                               timestamp=1717580289267, value=wangwu
1 row(s)
Took 0.0107 seconds

hbase(main):003:0> get 'bigdata:student', '1003', {COLUMN => 'info:age'}
COLUMN                                   CELL
 info:age                                timestamp=1717580366636, value=10
1 row(s)
Took 0.0185 seconds

② The scan command

Some examples:
  hbase> scan 'hbase:meta'
  hbase> scan 'hbase:meta', {COLUMNS => 'info:regioninfo'}
  hbase> scan 'ns1:t1', {COLUMNS => ['c1', 'c2'], LIMIT => 10, STARTROW => 'xyz'}
  hbase> scan 't1', {COLUMNS => ['c1', 'c2'], LIMIT => 10, STARTROW => 'xyz'}
  hbase> scan 't1', {COLUMNS => 'c1', TIMERANGE => [1303668804000, 1303668904000]}
  hbase> scan 't1', {REVERSED => true}
  hbase> scan 't1', {ALL_METRICS => true}
  hbase> scan 't1', {METRICS => ['RPC_RETRIES', 'ROWS_FILTERED']}
  hbase> scan 't1', {ROWPREFIXFILTER => 'row2', FILTER => "
    (QualifierFilter (>=, 'binary:xyz')) AND (TimestampsFilter ( 123, 456))"}
  hbase> scan 't1', {FILTER =>
    org.apache.hadoop.hbase.filter.ColumnPaginationFilter.new(1, 0)}
  hbase> scan 't1', {CONSISTENCY => 'TIMELINE'}
  hbase> scan 't1', {ISOLATION_LEVEL => 'READ_UNCOMMITTED'}
  hbase> scan 't1', {MAX_RESULT_SIZE => 123456}

hbase(main):001:0> scan 'bigdata:student'
ROW                                      COLUMN+CELL
 1001                                    column=info:name, timestamp=1717580289267, value=wangwu
 1002                                    column=info:name, timestamp=1717580320927, value=zhaoliu
 1003                                    column=info:age, timestamp=1717580366636, value=10
 1003                                    column=info:sex, timestamp=1717581149533, value=male
3 row(s)
Took 0.0338 seconds

hbase(main):025:0> scan 'bigdata:student', {STARTROW => '1001', STOPROW => '1003'}
ROW                                      COLUMN+CELL
 1001                                    column=info:name, timestamp=1717580289267, value=wangwu
 1002                                    column=info:name, timestamp=1717580320927, value=zhaoliu
2 row(s)
Took 0.0118 seconds
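The half-open range seen in the last scan can be mimicked with a sorted map in plain Java. This is only a model of the range semantics, not HBase client code; the class and method names are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

final class ScanRangeSketch {

    /** Returns the row keys in [startRow, stopRow): start inclusive, stop exclusive. */
    static List<String> scan(TreeMap<String, String> rows, String startRow, String stopRow) {
        return new ArrayList<>(rows.subMap(startRow, true, stopRow, false).keySet());
    }

    public static void main(String[] args) {
        TreeMap<String, String> rows = new TreeMap<>();
        rows.put("1001", "wangwu");
        rows.put("1002", "zhaoliu");
        rows.put("1003", "10");

        // Same bounds as the shell example: row 1003 is excluded.
        System.out.println(scan(rows, "1001", "1003"));

        // Appending a zero byte to the stop row makes 1003 itself fall inside the range.
        System.out.println(scan(rows, "1001", "1003\0"));
    }
}
```

The first call yields rows 1001 and 1002, matching the shell output above; the second also includes 1003, because "1003" sorts just before "1003" followed by a zero byte.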

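Deleting data: delete/deleteall

There are two ways to delete data: delete and deleteall

delete removes one version of the data, that is, a single Cell; if no version is given, the newest one is removed.
deleteall removes all versions, that is, every Cell of the given column in the given row. The command only marks the data for deletion; nothing is erased immediately, and the actual removal happens later, when the files on disk are cleaned up.

① delete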

hbase(main):001:0> put 'bigdata:student', '1001', 'info:name', 'zhangsan'
Took 0.3910 seconds
hbase(main):002:0> put 'bigdata:student', '1001', 'info:name', 'lisi'
Took 0.2024 seconds
hbase(main):003:0> put 'bigdata:student', '1001', 'info:name', 'wangwu'
Took 0.1559 seconds

hbase(main):004:0> scan 'bigdata:student'
ROW                                      COLUMN+CELL
 1001                                    column=info:name, timestamp=1717584831277, value=wangwu
 1002                                    column=info:name, timestamp=1717580320927, value=zhaoliu
 1003                                    column=info:age, timestamp=1717580366636, value=10
 1003                                    column=info:sex, timestamp=1717581149533, value=male
3 row(s)
Took 0.0083 seconds

hbase(main):005:0> delete 'bigdata:student', '1001', 'info:name'
Took 0.0055 seconds

hbase(main):006:0> scan 'bigdata:student'
ROW                                      COLUMN+CELL
 1001                                    column=info:name, timestamp=1717584831277, value=lisi
 1002                                    column=info:name, timestamp=1717580320927, value=zhaoliu
 1003                                    column=info:age, timestamp=1717580366636, value=10
 1003                                    column=info:sex, timestamp=1717581149533, value=male
3 row(s)
Took 0.0087 seconds

② deleteall

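2.2.2 API Operations

According to the official API documentation, client connections to HBase are created by the ConnectionFactory class, and the user must close the connection manually when done. A connection is heavyweight, so one connection per process is recommended. Commands are issued through two objects obtained from the connection, Admin and Table. Admin manages HBase metadata, such as creating and altering tables, i.e. DDL operations; Table inserts and deletes table data, i.e. DML operations.

Setting up the environment

Create a Maven project in IDEA and edit pom.xml to add the dependencies HBase needs.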

<dependencies>
    <dependency>
        <groupId>org.apache.hbase</groupId>
        <artifactId>hbase-client</artifactId>
        <version>2.4.11</version>
        <!-- If you hit an error, exclude the javax.el extension -->
        <!-- 2.4.11 pulls in a pre-release version of the javax.el package -->
        <!-- Exclude that package first, then add the release version of javax.el -->
        <exclusions>
            <exclusion>
                <groupId>org.glassfish</groupId>
                <artifactId>javax.el</artifactId>
            </exclusion>
        </exclusions>
    </dependency>

    <!-- Add the release version of javax.el -->
    <dependency>
        <groupId>org.glassfish</groupId>
        <artifactId>javax.el</artifactId>
        <version>3.0.1-b06</version>
    </dependency>
</dependencies>

Using a connection from a single thread

The code below uses a connection from a single thread; real projects rarely do it this way.

package com.sdutcm;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

import java.io.IOException;


public class HBaseConnection {
    public static void main(String[] args) throws IOException {
        // 1. Create the configuration object
        // HBaseConfiguration.create() also loads hbase-default.xml and hbase-site.xml from the classpath
        Configuration conf = HBaseConfiguration.create();

        // 2. Add configuration parameters
        conf.set("hbase.zookeeper.quorum", "bigdata");      // these settings normally live in hbase-site.xml

        // 3. Create the connection
        // A synchronous connection is created by default
        Connection connection = ConnectionFactory.createConnection(conf);

        // An asynchronous connection is also possible, but not recommended here:
        // CompletableFuture<AsyncConnection> asyncConnection = ConnectionFactory.createAsyncConnection(conf);

        // 4. Use the connection
        System.out.println(connection);

        // 5. Close the connection
        connection.close();
    }
}

Using a connection from multiple threads

In practice, because an HBase connection is heavyweight, each client usually creates only one (much like the singleton pattern). The code is revised accordingly:

package com.sdutcm;

import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

import java.io.IOException;


public class HBaseConnection {
    // A static field shared by the whole process
    public static Connection connection = null;

    static {
        // 1. Configuration object: once hbase-site.xml is in the resources directory,
        // the explicit configuration below is no longer needed
        // Configuration conf = new Configuration();

        // 2. Configuration parameters
        // In real projects, parameters belong in a configuration file under resources, not in code:
        // copy hbase-site.xml from the virtual machine into the resources directory
        // conf.set("hbase.zookeeper.quorum", "bigdata");

        // 3. Create the connection (synchronous by default)
        try {
            // Switched to the no-argument overload:
            // connection = ConnectionFactory.createConnection(conf);
            // Following ConnectionFactory.createConnection() -> create() in the source shows that
            // HBase loads two configuration files, hbase-default.xml and hbase-site.xml.
            // With hbase-site.xml in the resources directory, the no-argument overload
            // reads its parameters from that file.
            connection = ConnectionFactory.createConnection();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    // Closes the shared connection
    public static void closeConnection() throws IOException {
        // Only close it if it was actually created
        if (connection != null) {
            connection.close();
        }
    }

    public static void main(String[] args) throws IOException {
        // Use the shared connection; do not create a separate one in the main thread
        System.out.println(HBaseConnection.connection);

        // Close the connection when done
        HBaseConnection.closeConnection();
    }
}
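The "one connection per process" idea is independent of HBase, so it can be tried out without a cluster. The sketch below uses a stand-in FakeConnection class (an assumption of this example, not an HBase type) together with Java's initialization-on-demand holder idiom, which is one common way to get a lazily created, thread-safe single instance.

```java
import java.util.concurrent.atomic.AtomicInteger;

final class SharedConnectionSketch {

    /** Stand-in for a heavyweight connection; not an HBase class. */
    static final class FakeConnection implements AutoCloseable {
        static final AtomicInteger CREATED = new AtomicInteger();
        private volatile boolean closed;

        FakeConnection() {
            CREATED.incrementAndGet();
        }

        boolean isClosed() {
            return closed;
        }

        @Override
        public void close() {
            closed = true;
        }
    }

    // The JVM initializes Holder once, on first access, in a thread-safe way.
    private static final class Holder {
        static final FakeConnection INSTANCE = new FakeConnection();
    }

    /** Returns the single shared connection, creating it on first use. */
    static FakeConnection get() {
        return Holder.INSTANCE;
    }

    public static void main(String[] args) throws InterruptedException {
        Thread[] workers = new Thread[4];
        for (int i = 0; i < workers.length; i++) {
            workers[i] = new Thread(() -> get());
            workers[i].start();
        }
        for (Thread worker : workers) {
            worker.join();
        }
        System.out.println("connections created: " + FakeConnection.CREATED.get());
        get().close();
    }
}
```

However many threads call get(), only one FakeConnection is ever constructed. Whether to create the shared object eagerly in a static block, as in the tutorial code, or lazily as here is a design choice; the lazy form avoids doing I/O during class loading.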

Getting an Admin

// Get an Admin
// An Admin is lightweight and not thread-safe; pooling or caching it is not recommended
Admin admin = connection.getAdmin();

Creating a namespace

package com.sdutcm;

import org.apache.hadoop.hbase.NamespaceDescriptor;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;

import java.io.IOException;

public class HBaseDDL {
    // A static field, so that different classes all use the same connection object
    public static Connection connection = HBaseConnection.connection;

    /**
     * Creates a namespace.
     *
     * @param namespace name of the namespace
     */
    public static void createNamespace(String namespace) throws IOException {
        // 1. Get an Admin
        // An Admin is lightweight and not thread-safe; pooling or caching it is not recommended.
        // try-with-resources closes it even if the calls below throw
        try (Admin admin = connection.getAdmin()) {
            // 2. Create the namespace
            // 2.1 Build the namespace descriptor
            NamespaceDescriptor.Builder builder = NamespaceDescriptor.create(namespace);

            // 2.2 Attach a configuration entry to the namespace
            builder.addConfiguration("user", "sdutcm");

            // 2.3 Build the finished descriptor and create the namespace
            admin.createNamespace(builder.build());
        }
    }

    public static void main(String[] args) throws IOException {
        // Try out namespace creation
        createNamespace("sdutcm");

        // Other code
        System.out.println("other code");

        // Close the HBase connection
        HBaseConnection.closeConnection();
    }
}

The result:

hbase(main):001:0> list_namespace
NAMESPACE
default
hbase
sdutcm		<<< sdutcm has been created
3 row(s)
Took 8.0120 seconds

hbase(main):002:0> describe_namespace "sdutcm"
DESCRIPTION
{NAME => 'sdutcm', user => 'sdutcm'}	<<< the configuration entry we added
Took 0.7576 seconds
=> 1

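Handling multiple exceptions

Checking whether a table exists
Creating a table

📕 3. Internals

3.1 Process Architecture

3.1.1 Master Architecture

3.1.2 RegionServer Architecture

3.2 Write Path

3.2.1 Write Order

3.2.2 Flush Mechanism

3.3 Read Path

3.3.1 Read Order

3.3.2 Merged-Read Optimization

3.4 File Compaction

3.4.1 Major Compaction

3.4.2 Minor Compaction

Region Splitting

Custom Pre-splitting

Automatic Splitting

🔧 Enterprise Development

TSDB Pattern

Basic Table Pattern

Custom API

Framework Integration

Reading and Writing Data with Phoenix

Analyzing Data with Hive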

QuickPassHBase

https://hello-nilera.com/2024/06/02/QuickPassHBase/

Author: NilEra

Published 2024-06-02, updated 2024-07-07
