Apache Hadoop is used for processing big data at many enterprises. A Hadoop cluster is formed by assembling a large number of commodity machines, and it enables the distributed processing of data. Enterprises store lots of important data on the cluster. Different users and teams process this data to obtain summary information, generate insights, and extract other useful information.
Figure 1. Typical Hadoop cluster with clients communicating from all sides
Depending on the size of the enterprise and the domain, there can be lots of data stored on the cluster and a lot of clients working on the data. Depending on the type of client, its network location, and the data in transit, the security requirements of the communication channel between the client and the servers can vary. Some of the interactions need to be authenticated. Some of the interactions need to be authenticated and protected from eavesdropping. How to enable different qualities of protection for different cluster interactions forms the core of this blog post.
The requirement for different qualities of protection
At eBay, we have multiple large clusters that hold hundreds of petabytes of data and are growing by many terabytes daily. We have different types of data. We have thousands of clients interacting with the cluster. Our clusters are kerberized, which means clients have to authenticate via Kerberos.
Some clients are within the same network zone and interact directly with the cluster. In most cases, there is no requirement to encrypt the communication for these clients. In some cases, it is required to encrypt the communication because of the sensitive nature of the data. Some clients are outside the firewall, and it is required to encrypt all communication between these external clients and Hadoop servers. Thus we have the requirement for different qualities of protection.
Figure 2. Cluster with clients with different quality of protection requirements (Lines in red indicate communication channel requiring confidentiality)
Also note that there is communication between the Hadoop servers within the cluster. These servers also fall into the category of clients within the same network zone, that is, internal clients.
Table 1. Required quality of protection for different client and data types

# | Client | Data | QOP |
---|---|---|---|
1 | Internal | Normal data | Authentication |
2 | Internal | Sensitive data | Authentication + Encryption + Integrity |
3 | External | Normal or sensitive data | Authentication + Encryption + Integrity |
Hadoop protocols
All Hadoop servers support both RPC and HTTP protocols. In addition, Datanodes support the Data Transfer Protocol.
Figure 3. Hadoop Resource Manager with multiple protocols on different ports
RPC protocol
Hadoop servers mainly interact via the RPC protocol. As an example, file system operations like listing and renaming happen over the RPC protocol.
HTTP
All Hadoop servers expose their status, JMX metrics, etc. via HTTP. Some Hadoop servers support additional functionality to offer a better user experience over HTTP. For example, the Resource Manager provides a web interface to browse applications over HTTP.
Data Transfer Protocol
Datanodes store and serve the blocks via the Data Transfer Protocol.
Figure 4. A Hadoop Datanode with multiple protocols on different ports
Quality of Protection for the RPC Protocol
The quality of protection for the RPC protocol is specified via the configuration property hadoop.rpc.protection. The default value is authentication on a kerberized cluster. Note that hadoop.rpc.protection is effective only in a cluster where hadoop.security.authentication is set to kerberos.
Hadoop supports Kerberos authentication and quality of protection via the SASL (Simple Authentication and Security Layer) Java library. The Java SASL library supports the three levels of quality of protection shown in Table 2.
Table 2. SASL QOP levels and the corresponding hadoop.rpc.protection values

# | SASL QOP | Description | hadoop.rpc.protection |
---|---|---|---|
1 | auth | Authentication only | authentication |
2 | auth-int | Authentication + Integrity protection | integrity |
3 | auth-conf | Authentication + Integrity protection + Privacy protection | privacy |
If privacy is chosen for hadoop.rpc.protection, data transmitted between client and server will be encrypted. The algorithm used for encryption will be 3DES.
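As a minimal sketch of how this looks in practice, the snippet below sets the relevant properties programmatically on a Hadoop Configuration object; on a real cluster these values would normally be placed in core-site.xml rather than set in code.

```java
import org.apache.hadoop.conf.Configuration;

public class RpcQopConfig {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // hadoop.rpc.protection is honored only on a kerberized cluster.
        conf.set("hadoop.security.authentication", "kerberos");
        // Require authentication, integrity, and encryption (SASL auth-conf) on all RPC.
        conf.set("hadoop.rpc.protection", "privacy");
        System.out.println("RPC QOP: " + conf.get("hadoop.rpc.protection"));
    }
}
```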
Figure 5. Resource Manager supporting privacy on its RPC port
Quality of protection for HTTP
The quality of protection for HTTP can be controlled by the policies dfs.http.policy and yarn.http.policy. The policy value can be one of the following:

- HTTP_ONLY. The interface is served only on HTTP.
- HTTPS_ONLY. The interface is served only on HTTPS.
- HTTP_AND_HTTPS. The interface is served on both HTTP and HTTPS.
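The sketch below shows HTTPS-only web endpoints configured programmatically for brevity; these properties normally live in hdfs-site.xml and yarn-site.xml, and a working TLS setup (keystores, ssl-server.xml) is assumed and not shown.

```java
import org.apache.hadoop.conf.Configuration;

public class HttpPolicyConfig {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Serve HDFS and YARN web endpoints (UI, WebHDFS, JMX) only over TLS.
        conf.set("dfs.http.policy", "HTTPS_ONLY");
        conf.set("yarn.http.policy", "HTTPS_ONLY");
    }
}
```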
Figure 6. Resource Manager supporting privacy on its RPC and HTTP ports
Quality of protection for the Data Transfer Protocol
The quality of protection for the Data Transfer Protocol can be specified in a similar way to that for the RPC protocol. The configuration property is dfs.data.transfer.protection. Like RPC, the value can be one of authentication, integrity, or privacy. Specifying this property makes SASL effective on the Data Transfer Protocol.
Figure 7. Datanode supporting privacy on all its ports
Specifying authentication as the value of dfs.data.transfer.protection forces the client to require a block access token while reading and storing blocks. Setting dfs.data.transfer.protection to privacy results in encrypting the data transfer. The algorithm used for encryption will be 3DES. It is possible to agree upon a different algorithm by setting dfs.encrypt.data.transfer.cipher.suites on both client and server sides. The only value supported is AES/CTR/NoPadding. Using AES results in better performance and security. It is possible to further speed up encryption and decryption by using Advanced Encryption Standard New Instructions (AES-NI) via the libcrypto.so native library on Datanodes and clients.
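A sketch of the corresponding Data Transfer Protocol settings, including switching the cipher from the default 3DES to AES, is shown below. The key bit-length property is optional and reflects an assumed, typical setup; these values normally go into hdfs-site.xml on both clients and Datanodes.

```java
import org.apache.hadoop.conf.Configuration;

public class DataTransferQopConfig {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Require SASL privacy (auth-conf) on the Data Transfer Protocol.
        conf.set("dfs.data.transfer.protection", "privacy");
        // Negotiate AES/CTR/NoPadding instead of the default 3DES; set on client and server.
        conf.set("dfs.encrypt.data.transfer.cipher.suites", "AES/CTR/NoPadding");
        // Optional AES key length in bits (128, 192, or 256); 128 is the default.
        conf.set("dfs.encrypt.data.transfer.cipher.key.bitlength", "128");
    }
}
```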
Performance impact
Enabling privacy comes at the cost of performance. Encryption and decryption add extra processing overhead on both clients and servers.
Setting hadoop.rpc.protection to privacy encrypts all RPC communication: from clients to Namenodes, from clients to Resource Managers, from Datanodes to Namenodes, from Node Managers to Resource Managers, and so on.
Setting dfs.data.transfer.protection to privacy encrypts all data transfer between clients and Datanodes. The client could be any HDFS client, such as a map task reading data, a reduce task writing data, or a client JVM reading or writing data.
Setting dfs.http.policy and yarn.http.policy to HTTPS_ONLY causes all HTTP traffic to be encrypted. This includes the web UI for Namenodes and Resource Managers, WebHDFS interactions, and others.
While this guarantees the privacy of interactions, it slows down the cluster considerably. The cumulative effect will be a fivefold decrease in cluster throughput. We had set the quality of protection for both the RPC and data transfer protocols to privacy in our main cluster. We experienced severe delays in processing that resulted in days of backlog in data processing. The Namenode throughput was low, data reads and writes were very slow, and this increased the completion times of individual applications. Since applications were occupying containers for five times longer than before, there was severe resource contention that ultimately caused the application backlogs.

The immediate solution was to change the quality of protection back to authentication for both protocols, but we still required privacy for some of our interactions, so we explored the possibility of choosing a quality of protection dynamically during connection establishment.
Selective encryption
As noted, the RPC protocol and Data Transfer Protocol internally use SASL to incorporate security in the protocol. On each side (client/server), SASL can support an ordered list of QOPs. During client-server handshake, the first common QOP is selected as the QOP of that interaction. Table 2 lists the valid QOP values.
Figure 8. Sequence of client and server agreeing on a QOP based on configured QOPs
In our case, we wanted most client-server interactions to require only authentication and very few client-server interactions to use privacy. To achieve this, we implemented the interface SaslPropertiesResolver. Its method getSaslProperties is invoked during the handshake to determine the list of QOPs to be offered. The default implementation of SaslPropertiesResolver simply returns the values specified in hadoop.rpc.protection and dfs.data.transfer.protection for the RPC and data transfer protocols, respectively.
More details on SaslPropertiesResolver can be found by reviewing the work on the related JIRA HADOOP-10221.
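To make the idea concrete, here is an illustrative sketch of a whitelist-based resolver in the spirit of the one described in the next section. It extends org.apache.hadoop.security.SaslPropertiesResolver from HADOOP-10221; the class name and the hard-coded prefix check stand in for a real CIDR-based whitelist lookup and are assumptions, not eBay's production code.

```java
import java.net.InetAddress;
import java.util.Map;
import java.util.TreeMap;
import javax.security.sasl.Sasl;
import org.apache.hadoop.security.SaslPropertiesResolver;

// Hypothetical resolver: offers both QOPs to whitelisted (internal) clients
// and only privacy (auth-conf) to everyone else.
public class WhitelistQopResolver extends SaslPropertiesResolver {

  // Placeholder for a real CIDR-based whitelist lookup.
  private boolean isWhitelisted(InetAddress client) {
    return client.getHostAddress().startsWith("10.");
  }

  @Override
  public Map<String, String> getServerProperties(InetAddress clientAddress) {
    Map<String, String> props = new TreeMap<>(getDefaultProperties());
    if (isWhitelisted(clientAddress)) {
      // Internal client: the first common QOP (auth) will be negotiated.
      props.put(Sasl.QOP, "auth,auth-conf");
    } else {
      // External client: offer privacy only, forcing encryption.
      props.put(Sasl.QOP, "auth-conf");
    }
    return props;
  }
}
```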
eBay’s implementation of SaslPropertiesResolver
In our implementation of SaslPropertiesResolver, we use a whitelist of IP addresses on the server side. If the client’s IP address is in the whitelist, then the list of QOPs specified in hadoop.rpc.protection/dfs.data.transfer.protection is used. If the client’s IP address is not in the whitelist, then we use the list of QOPs specified in a custom configuration property, namely hadoop.rpc.protection.non-whitelist. To avoid very long whitelists, we use CIDRs to represent IP address ranges.
In our clusters, we set hadoop.rpc.protection and dfs.data.transfer.protection on both clients and servers to authentication,privacy. The configuration hadoop.rpc.protection.non-whitelist was set to privacy.
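A sketch of the corresponding server-side settings is shown below. hadoop.rpc.protection.non-whitelist is the custom eBay property described above; the resolver class name is the hypothetical one from the earlier sketch, and the plug-in property (hadoop.security.saslproperties.resolver.class) comes from HADOOP-10221, so treat the exact names as assumptions for your Hadoop version.

```java
import org.apache.hadoop.conf.Configuration;

public class SelectiveEncryptionServerConfig {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Offer both QOPs; order matters, and the first common value wins.
        conf.set("hadoop.rpc.protection", "authentication,privacy");
        conf.set("dfs.data.transfer.protection", "authentication,privacy");
        // Custom property consulted by the whitelist-based resolver for
        // clients that are NOT in the whitelist.
        conf.set("hadoop.rpc.protection.non-whitelist", "privacy");
        // Plug in the custom resolver (hypothetical class from the earlier sketch).
        conf.set("hadoop.security.saslproperties.resolver.class",
                 "com.example.security.WhitelistQopResolver");
        // A similar resolver property exists for the Data Transfer Protocol
        // (assumed: dfs.data.transfer.saslproperties.resolver.class).
    }
}
```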
When whitelisted clients connect to servers, they will agree upon authentication as the QOP for the connection, since both clients and servers support the list of QOP values authentication,privacy. The order of QOP values in the list is important. If both clients and servers supported the list privacy,authentication, then they would agree upon privacy as the QOP for their connections.
Figure 9. Internal clients, external clients, and the cluster
Only clients inside the firewall are in the whitelist. When these clients connect to servers, the QOP used will be authentication, as both clients and servers use authentication,privacy for the RPC and data transfer protocols.
The external clients are not in the whitelist. When these clients connect to servers, the QOP used will be privacy, as the servers offer only privacy based on the value of hadoop.rpc.protection.non-whitelist. The clients in this case could be using either authentication,privacy or privacy in their configuration. If a client specifies only authentication, the connection will fail, as there will be no common QOP supported between the client and the servers.
Figure 10. Sequence of Hadoop client and server choosing QOP using SaslPropertiesResolver
The diagram above depicts the sequence of an external client and a Hadoop server agreeing on a QOP. Both client and server support two potential QOPs, authentication,privacy. When the client initiates the connection, the server consults the SaslPropertiesResolver with the client IP address to determine the QOP. The whitelist-based SaslPropertiesResolver checks the whitelist, finds that the client’s IP address is not whitelisted, and hence offers only privacy as the QOP. The server then offers privacy as the only QOP choice to the client during SASL negotiation. Thus the QOP of the subsequent connection will be privacy. This necessitates setting up a secret key based on the cipher suites available at the client and server.
In some cases, internal clients need to transmit sensitive data and prefer to encrypt all communication with the cluster. In this case, the internal clients can force encryption by setting both hadoop.rpc.protection and dfs.data.transfer.protection to privacy. Even though the servers support authentication,privacy, connections will use the common QOP, privacy.
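For completeness, a brief sketch of that client-side override: by advertising only privacy, a client handling sensitive data forces the negotiation with a server offering authentication,privacy to settle on privacy (row 3 of Table 3).

```java
import org.apache.hadoop.conf.Configuration;

public class SensitiveClientConfig {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Client advertises only privacy; the common QOP with the server becomes privacy.
        conf.set("hadoop.rpc.protection", "privacy");
        conf.set("dfs.data.transfer.protection", "privacy");
    }
}
```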
Table 3. Negotiated QOP for different combinations of client and server QOP lists

# | Client QOP | Server QOP | Negotiated QOP | Comments |
---|---|---|---|---|
1 | Authentication, Privacy | Authentication, Privacy | Authentication | For normal clients |
2 | Authentication, Privacy | Privacy | Privacy | For external clients |
3 | Privacy | Authentication, Privacy | Privacy | For clients transmitting sensitive data |
Selective secure communication using Apache Knox
Supporting multiple QOPs on Hadoop protocols enables selective encryption. This protects any data transfer, whether it carries real data or signaling data like delegation tokens. Another approach is to use a reverse proxy like Apache Knox. Here the external clients have connectivity only to the Apache Knox servers, and Apache Knox allows only HTTPS traffic. The cluster itself supports only one QOP, which is authentication.
Figure 11. Protecting data transfer using Reverse Proxy (Apache Knox)
As shown in the diagram, the external clients interact with the cluster via the Knox servers. The transmission between a client and Knox is encrypted. Knox, being an internal client, forwards the transmission to the cluster in plain text.
Selective Encryption using reverse proxy vs SaslPropertiesResolver
As we discussed, it is possible to achieve a different quality of protection for the client-cluster communication in two ways.
- With a reverse proxy to receive encrypted traffic
- By enabling the cluster to support multiple QOPs simultaneously
There are pros and cons with both approaches, as shown in Table 4.
Table 4. Pros and cons of the two approaches to selective encryption

Approach | Pros | Cons |
---|---|---|
SaslPropertiesResolver | Natural and efficient. Clients can communicate directly with the Hadoop cluster using Hadoop commands and native Hadoop protocols. | Maintenance of the whitelist. The whitelist needs to be maintained on the cluster. Care should be taken to keep the whitelist minimal by using CIDRs, to avoid looking up a large list of IP addresses for each connection. |
Reverse proxy | Closed cluster. External clients need to communicate only with the Knox servers, so all other ports can be closed. | Maintenance of additional components. A pool of Knox servers needs to be maintained to handle traffic from external clients. Depending on the amount of data, quite a lot of Knox servers could be needed. |
Depending on the organization’s requirements, the proper approach should be chosen.
Further work
Enable separate SaslPropertiesResolver properties for client and server
A process can be a client, a server, or both. In some environments, it is desirable to use different logic for choosing the QOP depending on whether the process acts as a client or a server for the specific interaction. Currently, there is a provision to specify only one SaslPropertiesResolver via configuration. If a client needs a different SaslPropertiesResolver, it has to use a different configuration. If the same process needs different SaslPropertiesResolvers while acting as a client and as a server, there is no way to do that. It would be a good enhancement to be able to specify different SaslPropertiesResolvers for the client and server roles.
Use ZooKeeper to maintain the whitelist of IP addresses
Currently, the whitelist of IP addresses is maintained in local files. This introduces the problem of updating thousands of machines and keeping them all in sync. Storing the whitelist information in a ZooKeeper node may be a better alternative.
Currently, the whitelist is cached by each server during initialization and then reloaded periodically. A better approach may be to reload the whitelist based on an update event.
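As a rough illustration of that event-driven alternative, the sketch below caches the whitelist from a ZooKeeper znode and re-reads it when a data-change watch fires. The znode path, the quorum string, and the idea of storing the CIDR list as the znode payload are assumptions; this is not an existing Hadoop feature.

```java
import java.nio.charset.StandardCharsets;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class ZkWhitelistWatcher implements Watcher {
  private static final String WHITELIST_ZNODE = "/hadoop/qop/whitelist"; // hypothetical path
  private final ZooKeeper zk;
  private volatile String whitelistCidrs = "";

  public ZkWhitelistWatcher(String zkQuorum) throws Exception {
    this.zk = new ZooKeeper(zkQuorum, 30000, this);
    reload();
  }

  // Re-register the watch and cache the latest CIDR list.
  private void reload() throws Exception {
    byte[] data = zk.getData(WHITELIST_ZNODE, this, null);
    whitelistCidrs = new String(data, StandardCharsets.UTF_8);
  }

  @Override
  public void process(WatchedEvent event) {
    if (event.getType() == Event.EventType.NodeDataChanged) {
      try {
        reload(); // event-driven reload instead of periodic polling
      } catch (Exception e) {
        // keep the previously cached whitelist on failure
      }
    }
  }

  public String getWhitelistCidrs() {
    return whitelistCidrs;
  }
}
```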