The Spark Distributed SQL Engine in Practice

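I. Overview
Besides the interactive environment started with the spark-sql command, Spark SQL can also serve distributed queries over JDBC/ODBC or a command-line interface. In this mode, end users and applications can run SQL queries against Spark SQL interactively without writing any Scala code.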

II. Using the Thrift JDBC server
Spark version: 1.4.0
YARN version:  CDH 5.4.0
1. Preparation
Copy or symlink hive-site.xml into $SPARK_HOME/conf.
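
The preparation step can be sketched as a small shell helper. This is only an illustration: the function name link_hive_site is hypothetical, and the location of hive-site.xml (for example /etc/hive/conf on many CDH installations) is an assumption that has to be checked on the actual cluster. Pass an absolute source path so the symlink resolves from any directory.

```shell
#!/bin/sh
# Hypothetical helper: symlink an existing hive-site.xml into Spark's conf directory.
# $1 = absolute path to hive-site.xml (assumed location, adjust for your cluster)
# $2 = Spark installation directory (SPARK_HOME)
link_hive_site() {
    src="$1"
    spark_home="$2"
    if [ ! -f "$src" ]; then
        echo "hive-site.xml not found: $src" >&2
        return 1
    fi
    mkdir -p "$spark_home/conf"
    # -s: create a symbolic link, -f: replace a stale link from an earlier run
    ln -sf "$src" "$spark_home/conf/hive-site.xml"
}
```

A typical call would be: link_hive_site /etc/hive/conf/hive-site.xml "$SPARK_HOME"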

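2. Start the Hive Thrift server with the script in the Spark installation directory. Without arguments it starts in local mode and occupies a single JVM process on the local machine.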
sbin/start-thriftserver.sh

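3. Start in yarn-client mode. In this setup the server listens on port 10001 (Spark's own default is 10000 unless hive.server2.thrift.port or HIVE_SERVER2_THRIFT_PORT is set).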
sbin/start-thriftserver.sh --master yarn
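
If the listening port should not depend on whatever hive-site.xml happens to contain, it can be passed explicitly with --hiveconf hive.server2.thrift.port. The following is a dry-run sketch: the helper name thriftserver_cmd is hypothetical, and it only prints the command instead of launching anything, so the line can be reviewed first.

```shell
#!/bin/sh
# Hypothetical dry-run helper: print the command that would start the
# Thrift server on YARN with an explicit listening port.
# $1 = port (defaults to 10001, the port used in this article)
thriftserver_cmd() {
    port="${1:-10001}"
    echo "sbin/start-thriftserver.sh --master yarn --hiveconf hive.server2.thrift.port=$port"
}

# Print the command; run it from $SPARK_HOME once it looks right.
thriftserver_cmd 10001
```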
Next, the YARN web UI shows that 25 containers have been started.


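Why does starting a single JDBC service take up so many resources? Because conf/spark-env.sh sets SPARK_EXECUTOR_INSTANCES to 24, and YARN adds one more container for the ApplicationMaster of the yarn-client application (the driver itself runs inside the local Thrift server process):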
export SPARK_EXECUTOR_INSTANCES=24
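On the YARN NodeManager nodes, the Thrift server keeps resident processes named org.apache.spark.executor.CoarseGrainedExecutorBackend (one per executor), ready to launch tasks for later SQL jobs. The benefit is that Spark SQL queries skip the container startup time; the cost is that while the Thrift server is idle these containers stay allocated and are not released to other Spark or MapReduce jobs.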
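
The container count seen in the YARN UI follows from simple arithmetic: one container per executor plus one for the ApplicationMaster. A minimal sketch, with a hypothetical function name, that can be used to estimate the footprint before changing SPARK_EXECUTOR_INSTANCES:

```shell
#!/bin/sh
# Expected YARN container count in yarn-client mode:
# one container per executor plus one for the ApplicationMaster.
# $1 = number of executors (e.g. the value of SPARK_EXECUTOR_INSTANCES)
expected_containers() {
    executors="$1"
    echo $((executors + 1))
}

expected_containers 24   # prints 25, the number of containers reported for this setup
```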



4. Connect to the Spark SQL interactive engine with beeline
bin/beeline -u jdbc:hive2://localhost:10001 -n root -p root
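Note: on a non-secure Hadoop cluster, use the current system user as the user name; the password can be empty or any value. On a Kerberos-secured Hadoop cluster, valid principal credentials must be supplied to log in through beeline.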
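
The two cases can be illustrated by how the JDBC URL is assembled. This is a sketch under assumptions: the helper name jdbc_url is hypothetical, the principal shown is a placeholder, and the ";principal=" suffix follows the usual HiveServer2 JDBC URL convention, which should be confirmed against the cluster's own security setup.

```shell
#!/bin/sh
# Hypothetical helper: build a HiveServer2-style JDBC URL for beeline.
# $1 = host, $2 = port, $3 = optional Kerberos service principal
jdbc_url() {
    url="jdbc:hive2://$1:$2"
    if [ -n "${3:-}" ]; then
        # On a Kerberos cluster the server principal is appended to the URL;
        # the client also needs a valid ticket (obtained with kinit).
        url="$url/default;principal=$3"
    fi
    echo "$url"
}

jdbc_url localhost 10001
jdbc_url localhost 10001 "hive/server.example.com@EXAMPLE.COM"   # placeholder principal
```

On a non-secure cluster the result can be used directly, for example: bin/beeline -u "$(jdbc_url localhost 10001)" -n "$(whoami)" -e "SHOW TABLES;"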

III. Command-line help
1. Thrift server
sbin/start-thriftserver.sh accepts all bin/spark-submit command-line options, plus a --hiveconf option for specifying Hive properties.
Run sbin/start-thriftserver.sh --help to print the complete list of available options.
2. beeline
-u <database url> the JDBC URL to connect to
-n <username> the username to connect as
-p <password> the password to connect as
-d <driver class> the driver class to use
-e <query> query that should be executed
-f <file> script file that should be executed
--hiveconf property=value Use value for given property
--hivevar name=value hive variable name and value
This is Hive specific settings in which variables
can be set at session level and referenced in Hive
commands or queries.
--color=[true/false] control whether color is used for display
--showHeader=[true/false] show column names in query results
--headerInterval=ROWS; the interval between which heades are displayed
--fastConnect=[true/false] skip building table/column list for tab-completion
--autoCommit=[true/false] enable/disable automatic transaction commit
--verbose=[true/false] show verbose error messages and debug info
--showWarnings=[true/false] display connection warnings
--showNestedErrs=[true/false] display nested errors
--numberFormat=[pattern] format numbers using DecimalFormat pattern
--force=[true/false] continue running script even after errors
--maxWidth=MAXWIDTH the maximum width of the terminal
--maxColumnWidth=MAXCOLWIDTH the maximum width to use when displaying columns
--silent=[true/false] be more silent
--autosave=[true/false] automatically save preferences
--outputformat=[table/vertical/csv/tsv] format mode for result display
--isolation=LEVEL set the transaction isolation level
--nullemptystring=[true/false] set to true to get historic behavior of printing null as empty string
--help display this message
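
Several of the options above combine naturally for non-interactive use, for instance running a SQL script and getting CSV output. The following dry-run sketch only prints such a command; the helper name beeline_batch_cmd and the script name in the usage line are hypothetical.

```shell
#!/bin/sh
# Hypothetical dry-run helper: print a beeline command that runs a SQL
# script non-interactively and formats the result as CSV.
# $1 = JDBC URL, $2 = SQL script file
beeline_batch_cmd() {
    # The URL and file are single-quoted in the printed command so that
    # characters such as ';' in the URL survive a copy-paste into a shell.
    echo "bin/beeline -u '$1' -n $(id -un) -f '$2' --outputformat=csv --showHeader=true --silent=true"
}
```

For example, beeline_batch_cmd jdbc:hive2://localhost:10001 report.sql prints a command line that can be reviewed and then run from $SPARK_HOME.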

