# PXF Access to HDFS

YMatrix supports access to data in HDFS through PXF (Platform Extension Framework). Through PXF, YMatrix can read structured or semi-structured data on HDFS in parallel, avoiding the redundancy and latency caused by data migration and delivering a "connected and ready-to-use" data analysis experience.

This section covers two access methods: [PXF Direct Access to HDFS](#m01) and [PXF Access to HDFS via Ranger](#m02); the latter enables permission control for HDFS file access through Ranger. Click a link to jump directly to the corresponding chapter.

If PXF is not installed, refer to the [PXF Installation](https://ymatrix.cn/doc/latest/dataquery/pxf_install) chapter.

## <a id="m01">PXF Direct Access to HDFS</a>

This section will provide detailed information on how to configure and use PXF to access HDFS in YMatrix.

### Test environment

| Software | Version Information |
|----------|---------------------|
| YMatrix  | MatrixDB 6.2.2+enterprise |
| Hadoop   | Hadoop 3.2.4 |
| PXF      | pxf-matrixdb4-6.3.0-1.el7.x86_64 |

### Usage Example

#### 1. YMatrix reads HDFS data

**(I) Create folders in Hadoop (executed on the Hadoop master node)**  
First, we need to create some directories on HDFS to store data files. These directories serve as data exchange locations between the YMatrix and Hadoop clusters.

1. Create the `/greenplum` directory under the HDFS root directory. This directory is used to store all database-related data files:

   ```bash
   hdfs dfs -mkdir /greenplum
   ```

2. Create the `pxf_examples` subdirectory under `/greenplum`. This subdirectory is used to store the PXF sample data:

   ```bash
   hdfs dfs -mkdir /greenplum/pxf_examples
   ```

3. List the contents of the HDFS root directory and verify that `/greenplum` has been created:

   ```bash
   hdfs dfs -ls /
   ```

4. List the contents of the `/greenplum` directory and verify that `pxf_examples` has been created:

   ```bash
   hdfs dfs -ls /greenplum/
   ```

5. Generate a test data file. Create a file locally containing city, month, number of orders, and total sales, one record per line:

   ```bash
   echo 'Prague,Jan,101,4875.33
   Rome,Mar,87,1557.39
   Bangalore,May,317,8936.99
   Beijing,Jul,411,11600.67' > pxf_hdfs_simple.txt
   ```

6. Upload the generated local file to the `/greenplum/pxf_examples` directory on HDFS:

   ```bash
   hdfs dfs -put pxf_hdfs_simple.txt /greenplum/pxf_examples/
   ```

7. Check the uploaded file content and verify that the upload succeeded:

   ```bash
   hdfs dfs -cat /greenplum/pxf_examples/pxf_hdfs_simple.txt
   ```

8. Grant file permissions. Assign ownership to the `mxadmin` user with the `hdfs dfs -chown` command:

   ```bash
   hdfs dfs -chown -R mxadmin:mxadmin /greenplum/pxf_examples/pxf_hdfs_simple.txt
   ```

**(II) Configure environment variables under the mxadmin user (added on all YMatrix nodes)**  
To ensure that YMatrix can interact with the Hadoop cluster, configure the relevant environment variables on all YMatrix nodes.
- In each YMatrix node's `.bashrc` or environment configuration file, add the following:

```bash
export HADOOP_HOME=/opt/modules/hadoop-3.2.4
export HADOOP_CONF_DIR=/opt/modules/hadoop-3.2.4/etc/hadoop
export PXF_CONF=/usr/local/pxf-matrixdb4
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
```
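
After adding these lines, reload the shell environment so the variables take effect (a standard step, assuming `bash` with `~/.bashrc`; adjust for your shell):

```bash
# Reload the current shell's environment
source ~/.bashrc
```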

**(III) Add mxadmin user access permissions to the Hadoop cluster (modified on all Hadoop nodes)**  
To ensure that the `mxadmin` user can access the Hadoop cluster, add the corresponding settings to the `core-site.xml` configuration file on all Hadoop nodes.
- Edit the `core-site.xml` file and add the following configuration:

```xml
<property>
    <name>hadoop.proxyuser.mxadmin.groups</name>
    <value>*</value>
</property>
<property>
    <name>hadoop.proxyuser.mxadmin.hosts</name>
    <value>*</value>
</property>
```
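
A change to the proxy-user settings normally requires the NameNode to reload its configuration. On many Hadoop deployments this can be done without a full restart; verify against your Hadoop version:

```bash
# Ask the NameNode to re-read the proxy-user (superuser group) configuration
hdfs dfsadmin -refreshSuperUserGroupsConfiguration
```
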
**(IV) Create the `single_hdfs` folder (created on all YMatrix nodes)**  
Create a new folder to store the configuration files for the Hadoop cluster.
- Create the folder:

```bash
mkdir /usr/local/pxf-matrixdb4/servers/single_hdfs/
```

**(V) Copy the configuration files from the Hadoop machine to the PXF folder (transferred from the Hadoop master node to the YMatrix master node)**  
Transfer the Hadoop configuration files to the PXF directory on the YMatrix master node.
- Use the `scp` command to copy the Hadoop configuration files to the YMatrix master node:

```bash
scp $HADOOP_CONF_DIR/core-site.xml 192.168.50.95:/usr/local/pxf-matrixdb4/servers/single_hdfs/
scp $HADOOP_CONF_DIR/hdfs-site.xml 192.168.50.95:/usr/local/pxf-matrixdb4/servers/single_hdfs/
scp $HADOOP_CONF_DIR/yarn-site.xml 192.168.50.95:/usr/local/pxf-matrixdb4/servers/single_hdfs/
```

**(VI) Create the `pxf-site.xml` file (created on the YMatrix master node)**  
Create the PXF configuration file `pxf-site.xml` on the YMatrix master node to ensure that the PXF service connects properly to the Hadoop cluster.
- Create and edit the `pxf-site.xml` file with the following command:

```bash
vi /usr/local/pxf-matrixdb4/servers/single_hdfs/pxf-site.xml
```

- Add the corresponding configuration content to the file, starting with the XML declaration:

```xml
<?xml version="1.0" encoding="UTF-8"?>
```
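
The exact contents of `pxf-site.xml` depend on your deployment. As a minimal sketch, the following assumes you want PXF to impersonate the logged-in database user when accessing HDFS; `pxf.service.user.impersonation` is a standard PXF property, but confirm the value for your environment:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
    <!-- Assumption: access HDFS as the end database user rather than the PXF service user -->
    <property>
        <name>pxf.service.user.impersonation</name>
        <value>true</value>
    </property>
</configuration>
```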

**(VII) Modify the `fs.defaultFS` setting in `core-site.xml` to the IP of the node where the Active NameNode is located, and synchronize the PXF configuration (executed on the YMatrix master node)**  
Configure the `fs.defaultFS` property in the `core-site.xml` file to ensure that YMatrix connects properly to the NameNode of the Hadoop cluster.
- Modify the `fs.defaultFS` configuration, replacing `<Active NameNode IP>` with the actual address:

```xml
<property>
    <name>fs.defaultFS</name>
    <value>hdfs://<Active NameNode IP>:9000</value>
</property>
```

- Synchronize the PXF configuration:

```bash
pxf cluster sync
```
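
To confirm the configuration was distributed to every host, you can check the state of the PXF cluster (part of the standard `pxf` CLI):

```bash
pxf cluster status
```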

**(VIII) Use the PXF plug-in to query in the YMatrix database (executed on the YMatrix master node)**  
1. Create the plug-in
Enable the PXF plug-in in YMatrix to allow YMatrix to access HDFS data.
- Execute the following command to create the PXF plug-in:

```sql
create extension pxf_fdw;
```

2. Create the FDW Server
Create an external data source that connects to the Hadoop cluster's file system.
- Execute the following SQL command to create an FDW Server:

```sql
CREATE SERVER single_hdfs FOREIGN DATA WRAPPER hdfs_pxf_fdw OPTIONS ( config 'single_hdfs' );
```

3. Create the FDW User Mapping
Configure the mapping of YMatrix users to the FDW server.
- Execute the following SQL command:

```sql
CREATE USER MAPPING FOR mxadmin SERVER single_hdfs;
```

4. Create an FDW Foreign Table that maps the file on HDFS into the YMatrix database.
- Execute the following SQL command to create a Foreign Table:

```sql
CREATE FOREIGN TABLE pxf_hdfs_table (
    location text,
    month text,
    num_orders int,
    total_sales float8
) SERVER single_hdfs
OPTIONS ( resource '/greenplum/pxf_examples/pxf_hdfs_simple.txt', format 'text', delimiter ',' );
```

5. Query the Foreign Table
Execute the query statement to verify that the data is read successfully.
- Execute the following SQL query command:

```sql
SELECT * FROM pxf_hdfs_table;
```
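
The file uploaded earlier contains four records, so the query should return output along the following lines (a sketch; exact formatting depends on your SQL client):

```
 location  | month | num_orders | total_sales
-----------+-------+------------+-------------
 Prague    | Jan   |        101 |     4875.33
 Rome      | Mar   |         87 |     1557.39
 Bangalore | May   |        317 |     8936.99
 Beijing   | Jul   |        411 |    11600.67
(4 rows)
```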

#### 2. YMatrix writes data to HDFS

**(I) Create a directory (executed on the Hadoop master node)**  
Create a new directory on HDFS to store the data to be written.
- Create the directory:

```bash
hdfs dfs -mkdir /greenplum/pxf_dir_examples
```

**(II) Create a Foreign Table (executed on the YMatrix master node)**  
Create a Foreign Table in YMatrix that points to the new HDFS directory, using the `single_hdfs` server created earlier.
- Execute the following SQL command to create the Foreign Table:

```sql
CREATE FOREIGN TABLE pxf_hdfsdir_table (
    location text,
    month text,
    num_orders int,
    total_sales float8
) SERVER single_hdfs
OPTIONS ( resource '/greenplum/pxf_dir_examples', format 'text', delimiter ',' );
```

**(III) Write data (executed on the YMatrix master node)**  
Insert data from the existing Foreign Table `pxf_hdfs_table` into the newly created Foreign Table `pxf_hdfsdir_table`.
- Execute the following SQL command to insert the data:

```sql
INSERT INTO pxf_hdfsdir_table SELECT * FROM pxf_hdfs_table;
```

**(IV) Query the Foreign Table to obtain all data in the directory (executed on the YMatrix master node)**  
Execute the query command to verify that the data was successfully written to the new directory.
- Execute the following query command:

```sql
SELECT COUNT(*) FROM pxf_hdfsdir_table;
```
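
You can also confirm from the Hadoop side that PXF wrote data files into the target directory. PXF generates the file names, so a wildcard is used here:

```bash
hdfs dfs -cat /greenplum/pxf_dir_examples/*
```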

## <a id="m02">PXF Access HDFS via Ranger</a>
This section will provide detailed information on how to use PXF to manage access to HDFS files through Ranger.

### Hadoop Pre-configuration

#### Test environment
- Hadoop: 3 nodes (hadoop01 (NameNode), hadoop02, hadoop03)
- YMatrix: 3 nodes (gp01 (Master), gp02, gp03)
- Ranger: Deployed on hadoop01

#### 1. Create a folder in Hadoop (executed on the Hadoop master node)
First, we create some directories on HDFS to store data files. These directories serve as data exchange locations between the YMatrix and Hadoop clusters.

(I) Create the `/ranger` directory under the HDFS root directory
This directory is used to store all database-related data files. Create it with the following command:

```bash
hdfs dfs -mkdir /ranger
```

(II) List the contents of the HDFS root directory and verify that `/ranger` has been created
Use the `ls` command to list the root directory contents and confirm that the new directory was created successfully:

```bash
hdfs dfs -ls /
```

(III) Generate a test data file
Create a test data file locally with the following command:

```bash
echo "testuser,May,100,999.99" > /tmp/pxf_test_verify.txt
```

![pxf_01](https://image.ymatrix.cn/jsdelivr-web/web/doc-images/pxf_hdfs/pxf_01.png)

(IV) Upload the generated local file to the `/ranger` directory on HDFS
Use the `hdfs dfs -put` command to upload the local file to HDFS:

```bash
hdfs dfs -put -f /tmp/pxf_test_verify.txt /ranger/
```

(V) Check the uploaded file content and verify that the upload succeeded
Use the following command to view the file content and confirm that the upload was successful:

```bash
hdfs dfs -cat /ranger/pxf_test_verify.txt
```

![pxf_02](https://image.ymatrix.cn/jsdelivr-web/web/doc-images/pxf_hdfs/pxf_02.png)

#### 2. Configure environment variables under the mxadmin user (added on all YMatrix nodes)
To ensure that YMatrix can interact with the Hadoop cluster, configure the relevant environment variables on all YMatrix nodes.  
In the `.bashrc` or environment configuration file of each YMatrix node, add the following:

```bash
export HADOOP_HOME=/opt/modules/hadoop-3.2.4
export HADOOP_CONF_DIR=/opt/modules/hadoop-3.2.4/etc/hadoop
export PXF_CONF=/usr/local/pxf-matrixdb4
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
```
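
As in the direct-access setup, reload the shell environment after editing (assuming `bash` with `~/.bashrc`):

```bash
source ~/.bashrc
```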

#### 3. Add mxadmin user access permissions to the Hadoop cluster (modified on all Hadoop nodes)
To ensure that the `mxadmin` user can access the Hadoop cluster, add the corresponding settings to the `core-site.xml` configuration file on all Hadoop nodes.  
Edit the `core-site.xml` file and add the following configuration:

```xml
<property>
    <name>hadoop.proxyuser.mxadmin.groups</name>
    <value>*</value>
</property>
<property>
    <name>hadoop.proxyuser.mxadmin.hosts</name>
    <value>*</value>
</property>
```
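
As before, proxy-user changes typically require the NameNode to reload its configuration (version-dependent; verify for your cluster):

```bash
hdfs dfsadmin -refreshSuperUserGroupsConfiguration
```
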
#### 4. Create the `single_hdfs` folder (created on all YMatrix nodes)
Create a new folder to store the configuration files for the Hadoop cluster.

```bash
mkdir /usr/local/pxf-matrixdb4/servers/single_hdfs/
```

#### 5. Copy the configuration files from the Hadoop machine to the PXF folder (transferred from the Hadoop master node to the YMatrix master node)
Transfer the Hadoop configuration files to the PXF directory on the YMatrix master node.  
Use the `scp` command to copy the Hadoop configuration files to the YMatrix master node:

```bash
scp $HADOOP_CONF_DIR/core-site.xml 192.168.50.95:/usr/local/pxf-matrixdb4/servers/single_hdfs/
scp $HADOOP_CONF_DIR/hdfs-site.xml 192.168.50.95:/usr/local/pxf-matrixdb4/servers/single_hdfs/
scp $HADOOP_CONF_DIR/yarn-site.xml 192.168.50.95:/usr/local/pxf-matrixdb4/servers/single_hdfs/
```

#### 6. Create the `pxf-site.xml` file (created on the YMatrix master node)
Create the PXF configuration file `pxf-site.xml` on the YMatrix master node to ensure that the PXF service connects properly to the Hadoop cluster.  
Use the following command to create and edit the `pxf-site.xml` file:

```bash
vi /usr/local/pxf-matrixdb4/servers/single_hdfs/pxf-site.xml
```

Add the corresponding configuration content to the file, starting with the XML declaration:

```xml
<?xml version="1.0" encoding="UTF-8"?>
```
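
As in the direct-access section, the exact contents depend on your deployment; a minimal sketch follows. Note that `pxf.service.user.impersonation` determines which user PXF presents to HDFS, and therefore which user Ranger evaluates policies against; the `false` value assumed here matches the operating-system-user model described in the Ranger configuration below:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
    <!-- Assumption: access HDFS as the OS user running PXF, so Ranger policies for that user apply -->
    <property>
        <name>pxf.service.user.impersonation</name>
        <value>false</value>
    </property>
</configuration>
```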

#### 7. Modify the `fs.defaultFS` setting in `core-site.xml` to the IP of the node where the Active NameNode is located, and synchronize the PXF configuration (executed on the YMatrix master node)
Configure the `fs.defaultFS` property in the `core-site.xml` file to ensure that YMatrix connects properly to the NameNode of the Hadoop cluster.  
- Modify the `fs.defaultFS` configuration, replacing `<Active NameNode IP>` with the actual address:

```xml
<property>
    <name>fs.defaultFS</name>
    <value>hdfs://<Active NameNode IP>:9000</value>
</property>
```

- Synchronize the PXF configuration:

```bash
pxf cluster sync
```
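
As before, you can verify the synchronization with the PXF CLI:

```bash
pxf cluster status
```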

### Ranger Solution Configuration
PXF runs in the background and interacts with HDFS as an operating system user, not a database user. Therefore, HDFS permission control for PXF in Ranger is based on the operating system user.
#### 1. Fill in the HDFS Service configuration
Create the HDFS service and set policies in Ranger:
- Log in to the Ranger Web console (http://hadoop01:6080)
- Go to Access Manager > Resource Based Policies > HDFS
- Click Add New Service to create a new HDFS service, configured as follows:
  - Service Name: `hadoopdev`
  - Display Name: `hadoopdev`
![pxf_03](https://image.ymatrix.cn/jsdelivr-web/web/doc-images/pxf_hdfs/pxf_03.png)
  - Username: `root`
  - Password: the `root` user's password
  - Namenode URL: `hdfs://hadoop01:9000` (fill in according to the `core-site.xml` file)
![pxf_04](https://image.ymatrix.cn/jsdelivr-web/web/doc-images/pxf_hdfs/pxf_04.png)

> In Ranger, the difference between Kerberos and Simple authentication is:
> Kerberos: authenticates with tickets and encryption; high security, suitable for production environments, especially where authentication across multiple systems is required.  
> Simple: a simple verification method based on username and password; low security, suitable for development or test environments.

Save the service configuration.
![pxf_05](https://image.ymatrix.cn/jsdelivr-web/web/doc-images/pxf_hdfs/pxf_05.png)

> **Note!**
> After creating the Service, you need to restart Hadoop's Ranger plugin for it to take effect.

#### 2. Create the `mxadmin` user in Ranger
(I) Open the "Users/Groups" page
In the menu, select User/Groups and click Add New User.

(II) Create the `mxadmin` user
In the "User Name" input box, enter `mxadmin` and click Save to save the user.
![pxf_06](https://image.ymatrix.cn/jsdelivr-web/web/doc-images/pxf_hdfs/pxf_06.png)

(III) Configure HDFS access permissions for the `mxadmin` user
In Ranger, go to the `hadoopdev` service and click Add New Policy.  
The configuration is as follows:
- Policy Name: `test_allow_mxadmin`
- Resource Path: `/ranger/pxf_test_verify.txt`
![pxf_07](https://image.ymatrix.cn/jsdelivr-web/web/doc-images/pxf_hdfs/pxf_07.png)
- User: `mxadmin`
- Permissions: `read`
![pxf_08](https://image.ymatrix.cn/jsdelivr-web/web/doc-images/pxf_hdfs/pxf_08.png)
Save the policy.


### Manage HDFS access policies using Ranger

#### 1. Create a test database

```sql
create database mxadmin;
```

Switch to the `mxadmin` database:

```sql
\c mxadmin
```

#### 2. Create the plug-in
Enable the PXF plug-in in YMatrix to allow YMatrix to access HDFS data.  
Run the following command to create the PXF plug-in:

```sql
create extension pxf_fdw;
```

#### 3. Create the FDW Server
Create an external data source that connects to the Hadoop cluster's file system.  
Run the following SQL command to create an FDW Server:

```sql
CREATE SERVER single_hdfs FOREIGN DATA WRAPPER hdfs_pxf_fdw OPTIONS ( config 'single_hdfs' );
```

#### 4. Create the FDW User Mapping
Configure the mapping between YMatrix users and the FDW server by executing the following SQL command:

```sql
CREATE USER MAPPING FOR mxadmin SERVER single_hdfs;
```

#### 5. Create the FDW Foreign Table
Create a Foreign Table that maps the file on HDFS into the YMatrix database by executing the following SQL command:

```sql
CREATE FOREIGN TABLE pxf_hdfs_verify (
    name text,
    month text,
    count int,
    amount float8
) SERVER single_hdfs
OPTIONS ( resource '/ranger/pxf_test_verify.txt', format 'text', delimiter ',' );
```

#### 6. Query the Foreign Table
**When Ranger is configured with Deny Conditions**
![pxf_09](https://image.ymatrix.cn/jsdelivr-web/web/doc-images/pxf_hdfs/pxf_09.png)
- Execute the following SQL query command:

```sql
SELECT * FROM pxf_hdfs_verify;
```

- Query results
![pxf_10](https://image.ymatrix.cn/jsdelivr-web/web/doc-images/pxf_hdfs/pxf_10.png)

**When Ranger is configured with Allow Conditions**
![pxf_11](https://image.ymatrix.cn/jsdelivr-web/web/doc-images/pxf_hdfs/pxf_11.png)
- Execute the following SQL query command:

```sql
SELECT * FROM pxf_hdfs_verify;
```


- Query results
![pxf_12](https://image.ymatrix.cn/jsdelivr-web/web/doc-images/pxf_hdfs/pxf_12.png)
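
Since `/ranger/pxf_test_verify.txt` contains a single record, the allowed query should return one row along these lines (a sketch to compare with the screenshot above):

```
   name   | month | count | amount
----------+-------+-------+--------
 testuser | May   |   100 | 999.99
(1 row)
```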