Authentication in HDFS and Hadoop Common

Summary

This article introduces authentication in HDFS and Hadoop Common, and also discusses some common authentication methods that can be used elsewhere.

An overall picture of authentication

As indicated in the flow above, when a client wants to access a server, it performs the following steps:

1. Log in using JAAS. This step generates a subject that contains the user's identity information; this subject will be used during authentication.

2. Authenticate to the server. In this step, the client and server authenticate to each other using the SASL/JGSS/JSSE libraries.
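
As a minimal sketch of step 1, a JAAS login that produces a Subject looks like the following. The login entry name "HadoopLogin" is a made-up example, not Hadoop's actual configuration; in Hadoop itself this step is wrapped by UserGroupInformation (e.g. UserGroupInformation.getLoginUser()).

```java
import javax.security.auth.Subject;
import javax.security.auth.login.LoginContext;
import javax.security.auth.login.LoginException;

public class JaasLoginSketch {
    public static void main(String[] args) throws LoginException {
        // "HadoopLogin" is a hypothetical entry name; it must match an entry
        // in the JAAS configuration file supplied via
        // -Djava.security.auth.login.config=<path to jaas.conf>.
        LoginContext login = new LoginContext("HadoopLogin");
        login.login();

        // The populated Subject carries the principals and credentials
        // (e.g. Kerberos tickets) that the later authentication step consumes.
        Subject subject = login.getSubject();
        System.out.println("Logged in as: " + subject.getPrincipals());
    }
}
```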

Where is authentication needed?

Authentication is needed wherever two processes communicate. In HDFS, these places include RPC calls via Protocol Buffers, HTTP/HTTPS, and data transfer. Protocol Buffers are used for method invocation; HTTP/HTTPS is used for editlog transfer, WebHDFS, etc.; data transfer between datanodes and clients uses raw sockets.

Authentication in RPC

Hadoop Common provides the utility classes org.apache.hadoop.ipc.Client/Server to help build RPC clients and servers. All method invocations in HDFS are built upon these classes. The Client and Server perform authentication at the initial stage of connection setup. As shown in the picture above, the authentication process is as follows:

1. Negotiate a mechanism that both sides support. The server sends the client the list of mechanisms it supports; the client chooses the first mechanism it also supports and sends it back to tell the server which mechanism it chose.

2. Authenticate using the negotiated mechanism. One side sends the first challenge, and the other side calls SaslClient.evaluateChallenge()/SaslServer.evaluateResponse() to evaluate the message and generate another challenge if the authentication exchange has not yet completed.
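
The challenge/response loop can be sketched with the JDK's javax.security.sasl API and DIGEST-MD5. The user name, password, protocol name "hdfs", and server name below are made up for illustration, and both ends live in one process, whereas in Hadoop the challenge bytes travel over the RPC connection:

```java
import java.util.HashMap;
import java.util.Map;
import javax.security.auth.callback.*;
import javax.security.sasl.*;

public class SaslLoopSketch {
    public static void main(String[] args) throws Exception {
        // One handler serves both sides in this toy example: it supplies the
        // user name/password and authorizes the authenticated user as itself.
        CallbackHandler handler = callbacks -> {
            for (Callback cb : callbacks) {
                if (cb instanceof NameCallback) {
                    ((NameCallback) cb).setName("testUser");
                } else if (cb instanceof PasswordCallback) {
                    ((PasswordCallback) cb).setPassword("testPass".toCharArray());
                } else if (cb instanceof RealmCallback) {
                    ((RealmCallback) cb).setText(((RealmCallback) cb).getDefaultText());
                } else if (cb instanceof AuthorizeCallback) {
                    AuthorizeCallback ac = (AuthorizeCallback) cb;
                    ac.setAuthorized(ac.getAuthenticationID().equals(ac.getAuthorizationID()));
                }
            }
        };

        Map<String, String> props = new HashMap<>();
        SaslServer server =
            Sasl.createSaslServer("DIGEST-MD5", "hdfs", "localhost", props, handler);
        SaslClient client =
            Sasl.createSaslClient(new String[] {"DIGEST-MD5"}, null, "hdfs",
                                  "localhost", props, handler);

        // DIGEST-MD5 starts with a server challenge.
        byte[] challenge = server.evaluateResponse(new byte[0]);
        byte[] response = client.evaluateChallenge(challenge);
        while (!server.isComplete() || !client.isComplete()) {
            if (!server.isComplete()) {
                challenge = server.evaluateResponse(response);
            }
            if (challenge != null && !client.isComplete()) {
                response = client.evaluateChallenge(challenge);
            }
        }
        System.out.println("Authenticated as: " + server.getAuthorizationID());
    }
}
```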

Note:

1. Note the ugi.doAs() calls on both sides in the picture above. ugi.doAs() is needed to create the SaslClient/SaslServer, as it makes available the subject that contains the identity information required for authentication.

2. The supported authentication mechanisms are: TOKEN (backed by DIGEST-MD5), KERBEROS (backed by GSSAPI), and SIMPLE (no SASL exchange).

Authentication in HTTP/HTTPS

Authentication workflows in HTTP/HTTPS

Hadoop Common provides the utility classes Authenticator and AuthenticationHandler to help with authentication over HTTP/HTTPS. Authenticator is used on the client side, and AuthenticationHandler on the server side. On the server side you can configure AuthenticationFilter as the servlet filter for specified paths, and AuthenticationFilter will use an AuthenticationHandler to perform the authentication. In HDFS, requests that carry a delegation token do not need to authenticate again; the server only needs to validate that the delegation token is valid.
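
On the client side, the hadoop-auth module wraps this exchange in AuthenticatedURL. A minimal sketch follows; the host, port, and path are placeholders, and running it against a Kerberized cluster requires a valid Kerberos ticket (kinit) on the client:

```java
import java.net.HttpURLConnection;
import java.net.URL;
import org.apache.hadoop.security.authentication.client.AuthenticatedURL;
import org.apache.hadoop.security.authentication.client.KerberosAuthenticator;

public class AuthenticatedUrlSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder endpoint; point it at a real WebHDFS URL.
        URL url = new URL("http://namenode.example.com:9870/webhdfs/v1/?op=LISTSTATUS");

        // The token is populated during the first request and can be reused
        // on later requests to avoid re-authenticating.
        AuthenticatedURL.Token token = new AuthenticatedURL.Token();

        // KerberosAuthenticator performs SPNEGO against the server's
        // AuthenticationFilter, falling back to simple auth if allowed.
        AuthenticatedURL authUrl = new AuthenticatedURL(new KerberosAuthenticator());
        HttpURLConnection conn = authUrl.openConnection(url, token);
        System.out.println("HTTP status: " + conn.getResponseCode());
    }
}
```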

As noted in the picture above, a ugi is created for the user (or the doAs user) at the end of authentication. This ugi represents the user in all subsequent processing.

Authentication in WebHDFS

As noted in the pictures above, an HTTP/HTTPS request in HDFS involves both the NN and the DN.

1. The client first authenticates to the NN, and for some operations, such as Create and Append, the NN redirects the client to a DN. When it does so, the NN generates a delegation token for the client if necessary.

2. The client accesses the DN. Since the DN will use the client's delegation token to access the NN, the DN does not need to authenticate the client itself; that is done by the NN (step 3).

3. The DN creates a DFSClient to carry out the client's request. The DFSClient uses the delegation token to authenticate to the NN, and the NN checks that the delegation token is valid.
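
The two-step WebHDFS exchange can be sketched with plain HttpURLConnection. The host names, the file path, and the delegation token value below are all placeholders:

```java
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class WebHdfsCreateSketch {
    public static void main(String[] args) throws Exception {
        // Step 1: ask the NN to create the file. The NN answers with a
        // 307 redirect whose Location header points at a DN.
        URL nnUrl = new URL("http://namenode.example.com:9870/webhdfs/v1/tmp/demo.txt"
                + "?op=CREATE&delegation=<token>");
        HttpURLConnection nnConn = (HttpURLConnection) nnUrl.openConnection();
        nnConn.setRequestMethod("PUT");
        nnConn.setInstanceFollowRedirects(false); // we want the Location header
        String dnLocation = nnConn.getHeaderField("Location");
        nnConn.disconnect();

        // Step 2: send the file content to the DN the NN redirected us to.
        HttpURLConnection dnConn =
            (HttpURLConnection) new URL(dnLocation).openConnection();
        dnConn.setRequestMethod("PUT");
        dnConn.setDoOutput(true);
        try (OutputStream out = dnConn.getOutputStream()) {
            out.write("hello webhdfs".getBytes("UTF-8"));
        }
        System.out.println("DN response: " + dnConn.getResponseCode());
    }
}
```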

Authentication in other places

In HDFS, authentication is needed not only for RPC and HTTP/HTTPS but also in some other places, such as data transfer between a client and a DN, or between DNs.

Data Transfer

The DN creates a DataXceiverServer, which is responsible for data transfer. Authentication works as follows:

1) The client authenticates to the NN and gets a BlockToken, which grants access to a specific block.

2) The client accesses the DN with the BlockToken obtained in step 1), so the DN can authenticate the client by validating the BlockToken.

3) The client authenticates the DN via its port number: only the root user can listen on privileged ports (< 1024), so a DN on such a port is trusted. (Please refer to HDFS-1150 for details.)

4) If the DN listens on a port above 1024, it triggers SASL authentication with the client using DIGEST-MD5.
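
The decision in steps 3) and 4) can be summarized in a small sketch. The class and method names here are hypothetical, written only to illustrate the rule; the real logic lives in Hadoop's SASL data-transfer classes:

```java
import java.net.InetSocketAddress;

public class DataTransferAuthSketch {
    // Hypothetical helper: a privileged port (< 1024) implies the DN was
    // started by root, which the client accepts as proof of the DN's
    // identity; otherwise a SASL (DIGEST-MD5) handshake is required.
    static boolean needSaslHandshake(InetSocketAddress dnAddr) {
        return dnAddr.getPort() >= 1024;
    }

    public static void main(String[] args) {
        InetSocketAddress privileged = new InetSocketAddress("dn1.example.com", 1004);
        InetSocketAddress unprivileged = new InetSocketAddress("dn2.example.com", 9866);
        System.out.println("dn1 needs SASL: " + needSaslHandshake(privileged));   // false
        System.out.println("dn2 needs SASL: " + needSaslHandshake(unprivileged)); // true
    }
}
```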

UserGroupInformation

UserGroupInformation contains the identity information of the user, such as its subject.

doAs()

UserGroupInformation.doAs(PrivilegedAction<T> action) calls Subject.doAs(Subject subject, PrivilegedAction<T> action).

The magic of Subject.doAs() is as follows:

  • You can retrieve the identity information of the subject when needed inside the implementation of the action. UserGroupInformation provides a facility for this: UserGroupInformation.getCurrentUser() (see the sketch after this list).

    Let's take "mkdir" as an example of why this is useful: when a client makes an RPC call to the server to perform "mkdir", it first authenticates to the server; the server creates a clientUgi to represent the authenticated user and then executes "mkdir" under clientUgi.doAs(). When the server needs to check whether the client has permission for this operation, it can call UserGroupInformation.getCurrentUser() to retrieve the client's user name and group names.

  • The SASL/JGSS authentication libraries work under Subject.doAs(). As these libraries need credentials to perform authentication, they retrieve the credentials from the subject.
  • In proxy use cases, such as B making a request to C to perform some operation as A, A can send its token to B; B then creates a UserGroupInformation with this token and authenticates to C. The benefit is that A does not leak its private credentials to B and can revoke the token at any time.
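
Here is a minimal sketch of the pattern. The user name "alice" is made up; a real server would build the ugi from the authenticated connection rather than from a bare string:

```java
import java.security.PrivilegedExceptionAction;
import org.apache.hadoop.security.UserGroupInformation;

public class DoAsSketch {
    public static void main(String[] args) throws Exception {
        // createRemoteUser builds a ugi without credentials; it is enough to
        // demonstrate how the identity propagates through doAs().
        UserGroupInformation clientUgi = UserGroupInformation.createRemoteUser("alice");

        clientUgi.doAs((PrivilegedExceptionAction<Void>) () -> {
            // Anywhere inside the action, code can recover the caller's
            // identity without it being passed around explicitly.
            UserGroupInformation current = UserGroupInformation.getCurrentUser();
            System.out.println("Operating as: " + current.getUserName());
            return null;
        });
    }
}
```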

Comparison of different authentication/authorization/security libraries (the following comparisons are cited from this site)

"The Java SE 6 platform provides three standard APIs that allow applications to perform secure communications: The Java Generic Security Service (GSS), the Java SASL API, and the Java Secure Socket Extension (JSSE). When building an application, which of these APIs should you use? The answer depends on many factors, including requirements of the protocol or service, deployment infrastructure, and integration with other security services. For example, if you are building an LDAP client library, you would need to use the Java SASL API because use of SASL is part of LDAP's protocol definition." Pls refer tothis site for detailed discussion.

Simple Authentication and Security Layer (SASL)

  • You can use SASL to negotiate a security layer to be used after authentication. The properties of the security layer (such as whether you want integrity or confidentiality) are decided at negotiation time (see the sketch after this list).
  • Compared with JSSE and Java GSS, SASL is relatively lightweight and is popular among more recent protocols. It also has the advantage that several popular, lightweight (in terms of infrastructure support) SASL mechanisms have been defined. The primary JSSE and Java GSS mechanisms, on the other hand, are relatively heavyweight and require more elaborate infrastructures (Public Key Infrastructure and Kerberos, respectively).
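
The desired security-layer properties are requested via properties when creating the SaslClient/SaslServer. A sketch; the values shown are the standard javax.security.sasl QOP tokens:

```java
import java.util.HashMap;
import java.util.Map;
import javax.security.sasl.Sasl;

public class SaslQopSketch {
    public static void main(String[] args) {
        Map<String, String> props = new HashMap<>();
        // Preference-ordered list: auth-conf = confidentiality (encryption),
        // auth-int = integrity only, auth = authentication only.
        props.put(Sasl.QOP, "auth-conf,auth-int,auth");
        // These props would be passed to Sasl.createSaslClient(...) /
        // Sasl.createSaslServer(...); after the exchange completes, the
        // negotiated value is available via getNegotiatedProperty(Sasl.QOP),
        // and wrap()/unwrap() apply the security layer to each message.
        System.out.println("Requested QOP: " + props.get(Sasl.QOP));
    }
}
```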

Java Generic Security Services API (JGSS)

  • You can use JGSS to negotiate a security layer to be used after authentication. The properties of the security layer, such as confidentiality, can be turned on or off per message (see the sketch after this list).
  • "Java GSS is the Java language bindings for the Generic Security Service Application Programming Interface (GSS-API). The only mechanism currently supported underneath this API on J2SE is Kerberos v5." (cited from this site)

Java Secure Socket Extension (JSSE)

  • JSSE provides a secure end-to-end channel that takes care of the I/O and transport, while Java GSS and Java SASL provide encryption and integrity-protection on the data, but the application is responsible for transporting the secured data to its peer.
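
By contrast, a JSSE client hands the plaintext to the socket and lets the library handle encryption and transport. A minimal sketch; the host and port are placeholders:

```java
import java.io.InputStream;
import java.io.OutputStream;
import javax.net.ssl.SSLSocket;
import javax.net.ssl.SSLSocketFactory;

public class JsseChannelSketch {
    public static void main(String[] args) throws Exception {
        // Unlike SASL/GSS, the bytes written here are encrypted and
        // transported by JSSE itself.
        SSLSocketFactory factory = (SSLSocketFactory) SSLSocketFactory.getDefault();
        try (SSLSocket socket =
                 (SSLSocket) factory.createSocket("server.example.com", 443)) {
            socket.startHandshake(); // TLS handshake, including server authentication
            OutputStream out = socket.getOutputStream();
            out.write("hello over TLS\n".getBytes("UTF-8"));
            out.flush();
            InputStream in = socket.getInputStream();
            System.out.println("first byte of reply: " + in.read());
        }
    }
}
```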