Differences Between Hive Managed (Internal) and External Tables

Reference: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-ManagedandExternalTables

Managed and External Tables

By default Hive creates managed tables, where files, metadata and statistics are managed by internal Hive processes. A managed table is stored under the hive.metastore.warehouse.dir path property, by default in a folder path similar to /apps/hive/warehouse/databasename.db/tablename/. The default location can be overridden by the location property during table creation. If a managed table or partition is dropped, the data and metadata associated with that table or partition are deleted. If the PURGE option is not specified, the data is moved to a trash folder and kept there for a defined duration. Use managed tables when Hive should manage the lifecycle of the table, or when generating temporary tables.

An external table describes the metadata / schema on external files. External table files can be accessed and managed by processes outside of Hive. External tables can access data stored in sources such as Azure Storage Volumes (ASV) or remote HDFS locations.
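As a sketch of the difference, the following HiveQL creates one table of each kind; the table names, columns, and the HDFS path are illustrative, not from the wiki page:

```sql
-- Managed table: data lives under hive.metastore.warehouse.dir;
-- dropping it removes both metadata and data.
CREATE TABLE page_views_managed (
  user_id BIGINT,
  url     STRING
);

-- External table: only the schema is registered in the metastore;
-- the files at LOCATION are left untouched when the table is dropped.
CREATE EXTERNAL TABLE page_views_external (
  user_id BIGINT,
  url     STRING
)
LOCATION '/data/page_views';

DROP TABLE page_views_managed;        -- data goes to the trash folder
-- DROP TABLE page_views_managed PURGE;  -- would skip the trash entirely
DROP TABLE page_views_external;       -- files under /data/page_views remain
```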
If the structure or partitioning of an external table is changed, an MSCK REPAIR TABLE table_name statement can be used to refresh the metadata information. Use external tables when the files already exist or are in remote locations, and the files should remain even if the table is dropped.

Managed and external tables can be distinguished using the DESCRIBE FORMATTED table_name command, which will display either MANAGED_TABLE or EXTERNAL_TABLE depending on the table type. Statistics can be managed on both managed and external tables and their partitions for query optimization.

Recover Partitions (MSCK REPAIR TABLE)

Hive stores a list of partitions for each table in its metastore. If, however, new partitions are added directly to HDFS (say by using the hadoop fs -put command), the metastore (and hence Hive) will not be aware of these partitions unless the user runs an ALTER TABLE table_name ADD PARTITION command for each of the newly added partitions. Alternatively, users can run a metastore check command with the repair table option:

MSCK REPAIR TABLE table_name;

This will add metadata to the Hive metastore for partitions for which such metadata doesn't already exist. In other words, it will add any partitions that exist on HDFS but not in the metastore. See HIVE-874 for more details. When there is a large number of untracked partitions, MSCK REPAIR TABLE can be run batch wise to avoid an OOME (Out of Memory Error).
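For example, if partition directories are added straight to HDFS, the two ways of registering them look like this; the table name, partition column, and paths are hypothetical:

```sql
-- Suppose new files were placed directly in HDFS, e.g. from a shell:
--   hadoop fs -put local_logs /warehouse/logs/dt=2024-05-24/
-- The metastore does not yet know about the dt=2024-05-24 partition.

-- Option 1: register each new partition explicitly.
ALTER TABLE logs ADD PARTITION (dt='2024-05-24');

-- Option 2: let Hive scan HDFS and add every missing partition at once.
MSCK REPAIR TABLE logs;
```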
By setting a batch size through the hive.msck.repair.batch.size property, Hive can run the repair in batches internally. The default value of the property is zero, which means all of the partitions are processed at once.
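A minimal sketch of that setting, assuming a session-level SET is acceptable in your environment (the table name is hypothetical):

```sql
-- Repair untracked partitions 100 at a time instead of all at once
-- (the default of 0 processes every partition in a single batch).
SET hive.msck.repair.batch.size=100;
MSCK REPAIR TABLE logs;
```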
