Accessing HDFS via PXF

YMatrix supports accessing data in HDFS through PXF (Platform Extension Framework). With PXF, YMatrix can parallelize reads of structured or semi-structured data from HDFS, eliminating redundancy and latency caused by data migration, enabling an "instant connect and use" analytics experience.

This section includes usage instructions for two access methods: Direct HDFS Access via PXF and HDFS Access via PXF with Ranger.

If PXF is not installed, refer to the PXF Installation section for installation guidance.

Direct HDFS Access via PXF

This section details how to configure and use PXF in YMatrix to access HDFS.

Environment Configuration

Software Version Information
  • YMatrix: MatrixDB 6.2.2+enterprise
  • Hadoop: Hadoop 3.2.4
  • PXF: pxf-matrixdb4-6.3.0-1.el7.x86_64

Usage Guide

Reading Data from HDFS in YMatrix

(1) Create Directories in Hadoop (Execute on Hadoop Master Node)
First, create directories in HDFS to store data files. These directories serve as data exchange locations between YMatrix and the Hadoop cluster.

  1. Create the /ymatrix directory under the HDFS root. This directory stores all database-related data files. Use the following command:
hdfs dfs -mkdir /ymatrix
  2. Under /ymatrix, create a subdirectory named pxf_examples. This subdirectory holds PXF example data. Run the following command:
hdfs dfs -mkdir /ymatrix/pxf_examples
  3. List the contents of the HDFS root directory to verify that /ymatrix was created. Use the ls command:
hdfs dfs -ls /
  4. List the contents of the /ymatrix directory to confirm creation of pxf_examples.
hdfs dfs -ls /ymatrix/
  5. Generate a test data file
    Create a local file containing city, month, order count, and sales amount using the following command:
echo 'Prague,Jan,101,4875.33
Rome,Mar,87,1557.39
Bangalore,May,317,8936.99
Beijing,Jul,411,11600.67' > pxf_hdfs_simple.txt
  6. Upload the generated local file to the /ymatrix/pxf_examples directory in HDFS.
hdfs dfs -put pxf_hdfs_simple.txt /ymatrix/pxf_examples/
  7. View the uploaded file content to verify successful upload.
hdfs dfs -cat /ymatrix/pxf_examples/pxf_hdfs_simple.txt
  8. Set file permissions
    Use the hdfs dfs -chown command to assign ownership of the file to the mxadmin user (a quick verification follows this list):
hdfs dfs -chown -R mxadmin:mxadmin /ymatrix/pxf_examples/pxf_hdfs_simple.txt
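To confirm the ownership change took effect, list the directory again; the file should now show mxadmin as both owner and group:

hdfs dfs -ls /ymatrix/pxf_examples/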

(2) Configure Environment Variables for the mxadmin User (Add on All YMatrix Nodes)
To ensure effective interaction between YMatrix and the Hadoop cluster, configure environment variables on all YMatrix nodes.

  • Add the following lines to .bashrc or the environment configuration file on each YMatrix node:
export HADOOP_HOME=/opt/modules/hadoop-3.2.4
export HADOOP_CONF_DIR=/opt/modules/hadoop-3.2.4/etc/hadoop
export PXF_CONF=/usr/local/pxf-matrixdb4
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
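After adding the lines, reload the shell configuration and run a quick sanity check that the Hadoop client and PXF paths resolve (expected values follow from the exports above):

source ~/.bashrc
hdfs version      # should report Hadoop 3.2.4
echo $PXF_CONF    # should print /usr/local/pxf-matrixdb4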

(3) Grant the mxadmin User Access to the Hadoop Cluster (Modify on All Hadoop Nodes)
To allow the mxadmin user to access the Hadoop cluster, update the core-site.xml configuration file on all Hadoop nodes.

  • Edit the core-site.xml file and add the following properties:
<property>
  <name>hadoop.proxyuser.mxadmin.groups</name>
  <value>*</value>
</property>

<property>
  <name>hadoop.proxyuser.mxadmin.hosts</name>
  <value>*</value>
</property>
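The NameNode reads proxy-user settings at startup. On a running cluster you may be able to reload them without a full restart using the standard Hadoop admin commands below (run as the HDFS superuser on the NameNode); if in doubt, restart the affected daemons:

hdfs dfsadmin -refreshSuperUserGroupsConfiguration
yarn rmadmin -refreshSuperUserGroupsConfiguration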

(4) Create the single_hdfs Directory (Create on YMatrix Master Node)
Create a new directory on the YMatrix master node to hold Hadoop cluster configuration files; the pxf cluster sync step below propagates it to the other nodes.

  • Run the following command:
mkdir /usr/local/pxf-matrixdb4/servers/single_hdfs/

(5) Copy Configuration Files from Hadoop to the PXF Directory (Transfer from Hadoop Master to YMatrix Master)
Copy Hadoop configuration files to the PXF directory on the YMatrix master node.

  • Use the scp command:
scp $HADOOP_CONF_DIR/core-site.xml 192.168.50.95:/usr/local/pxf-matrixdb4/servers/single_hdfs/
scp $HADOOP_CONF_DIR/hdfs-site.xml 192.168.50.95:/usr/local/pxf-matrixdb4/servers/single_hdfs/
scp $HADOOP_CONF_DIR/yarn-site.xml 192.168.50.95:/usr/local/pxf-matrixdb4/servers/single_hdfs/

(6) Create the pxf-site.xml File (Create on YMatrix Master Node)
Create the PXF configuration file pxf-site.xml on the YMatrix master node to ensure proper connection between PXF and the Hadoop cluster.

  • Use the following command to create and edit the file:
vi /usr/local/pxf-matrixdb4/servers/single_hdfs/pxf-site.xml
  • Add the following content:
<?xml version="1.0" encoding="UTF-8"?>
<configuration> 
</configuration>
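The empty <configuration> block above is sufficient for this walkthrough. If PXF should access HDFS as the logged-in database user rather than the PXF service user, the pxf.service.user.impersonation property can be set here; depending on your PXF version it may already default to true. The snippet below is an optional, illustrative sketch, not required configuration:

<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <!-- Optional sketch: have PXF impersonate the end user when reading HDFS -->
  <property>
    <name>pxf.service.user.impersonation</name>
    <value>true</value>
  </property>
</configuration>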

(7) Update core-site.xml's fs.defaultFS to the Active NameNode IP and Synchronize PXF Configuration (Execute on YMatrix Master Node)

Set the fs.defaultFS property in the copied core-site.xml so YMatrix can correctly connect to the Hadoop NameNode.

  • Modify the fs.defaultFS configuration:
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://<namenode_ip>:9000</value>
</property>
  • Synchronize PXF configuration:
pxf cluster sync
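After the sync, restarting PXF and checking its status helps confirm the updated configuration was picked up on every host. These are standard pxf CLI subcommands, though whether a restart is required may depend on your deployment:

pxf cluster restart
pxf cluster status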

(8) Query Using the PXF Plugin in the YMatrix Database (Execute on YMatrix Master Node)

  1. Install the plugin
    Enable the PXF plugin in YMatrix to allow HDFS access.
  • Run the command:
CREATE EXTENSION pxf_fdw;
  2. Create FDW Server
    Create an external data source pointing to the Hadoop file system.
  • Execute the SQL command:
CREATE SERVER single_hdfs FOREIGN DATA WRAPPER hdfs_pxf_fdw OPTIONS ( config 'single_hdfs' );
  3. Create FDW User Mapping
    Map the YMatrix user to the FDW server.
  • Run the SQL command:
CREATE USER MAPPING FOR mxadmin SERVER single_hdfs;
  4. Create FDW Foreign Table
    Map the HDFS file to a table in YMatrix.
  • Execute:
CREATE FOREIGN TABLE pxf_hdfs_table (location text, month text, num_orders int, total_sales float8)
SERVER single_hdfs OPTIONS ( resource '/ymatrix/pxf_examples/pxf_hdfs_simple.txt', format 'text', delimiter ',' );
  5. Query the External Table
    Verify successful data retrieval (expected output shown below).
  • Run:
SELECT * FROM pxf_hdfs_table;
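If the setup is correct, the query returns the four rows uploaded earlier (illustrative psql output; row order may vary):

 location  | month | num_orders | total_sales
-----------+-------+------------+-------------
 Prague    | Jan   |        101 |     4875.33
 Rome      | Mar   |         87 |     1557.39
 Bangalore | May   |        317 |     8936.99
 Beijing   | Jul   |        411 |    11600.67
(4 rows)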

Writing Data from YMatrix to HDFS

(1) Create Directory (Execute on Hadoop Master Node)
Create a new HDFS directory to store written data.

  • Command:
hdfs dfs -mkdir /ymatrix/pxf_dir_examples

(2) Create External Table (Execute on YMatrix Master Node)
Create a foreign table in YMatrix pointing to the new HDFS directory, using the single_hdfs server created earlier.

  • SQL command:
CREATE FOREIGN TABLE pxf_hdfsdir_table (location text, month text, num_orders int, total_sales float8)
SERVER single_hdfs OPTIONS (resource '/ymatrix/pxf_dir_examples', format 'text', delimiter ',' );

(3) Write Data (Execute on YMatrix Master Node)
Insert data from the existing foreign table pxf_hdfs_table into the new table pxf_hdfsdir_table.

  • SQL command:
INSERT INTO pxf_hdfsdir_table SELECT * FROM pxf_hdfs_table;

(4) Query the Foreign Table to Verify the Write (Execute on YMatrix Master Node)
Count the rows in the foreign table to verify the write succeeded.

  • Query:
SELECT COUNT(*) FROM pxf_hdfsdir_table;
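Expected result, since four rows were copied from pxf_hdfs_table (illustrative psql output):

 count
-------
     4
(1 row)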

HDFS Access via PXF with Ranger

This section describes how to use PXF with Ranger to manage HDFS file access permissions.

Prerequisites for Hadoop

Environment Setup

  • Hadoop: 3 nodes (hadoop01 as NameNode, hadoop02, hadoop03)
  • YMatrix: 3 nodes (gp01 as Master, gp02, gp03)
  • Ranger: Deployed on hadoop01

Configuration Steps

(1) Create Directories in Hadoop (Execute on Hadoop Master Node)

Create directories in HDFS for data exchange between YMatrix and Hadoop.

  1. Create the /ranger directory under HDFS root
    Used for storing database-related data files. Run:

    hdfs dfs -mkdir /ranger
  2. List root directory contents to verify /ranger exists
    Use ls to list and confirm:

    hdfs dfs -ls /
  3. Generate test data file
    Create a local test file using:

    echo "testuser,May,100,999.99" > /tmp/pxf_test_verify.txt

    [Screenshot: pxf_01]

  4. Upload local file to /ranger directory in HDFS
    Use hdfs dfs -put to upload:

    hdfs dfs -put -f /tmp/pxf_test_verify.txt /ranger/
  5. View uploaded file content to confirm success
    Run:

    hdfs dfs -cat /ranger/pxf_test_verify.txt

    [Screenshot: pxf_02]

(2) Configure Environment Variables for the mxadmin User (Add on All YMatrix Nodes)

Ensure YMatrix can interact with Hadoop by setting environment variables on all nodes.
In .bashrc or environment file, add:

export HADOOP_HOME=/opt/modules/hadoop-3.2.4
export HADOOP_CONF_DIR=/opt/modules/hadoop-3.2.4/etc/hadoop
export PXF_CONF=/usr/local/pxf-matrixdb4
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

(3) Grant the mxadmin User Access in the Hadoop Cluster (Modify on All Hadoop Nodes)

Allow mxadmin user access by updating core-site.xml on all Hadoop nodes.
Edit core-site.xml and add:

<property>
  <name>hadoop.proxyuser.mxadmin.groups</name>
  <value>*</value>
</property>

<property>
  <name>hadoop.proxyuser.mxadmin.hosts</name>
  <value>*</value>
</property>

(4) Create the single_hdfs Directory (Create on YMatrix Master Node)

Create a directory on the YMatrix master node for Hadoop configuration files; the pxf cluster sync step below propagates it to the other nodes.
Command:

mkdir /usr/local/pxf-matrixdb4/servers/single_hdfs/

(5) Copy Configuration Files from Hadoop to the PXF Folder (Transfer from Hadoop Master to YMatrix Master)

Copy Hadoop configs to the PXF directory on YMatrix master.
Use scp:

scp $HADOOP_CONF_DIR/core-site.xml 192.168.50.95:/usr/local/pxf-matrixdb4/servers/single_hdfs/
scp $HADOOP_CONF_DIR/hdfs-site.xml 192.168.50.95:/usr/local/pxf-matrixdb4/servers/single_hdfs/
scp $HADOOP_CONF_DIR/yarn-site.xml 192.168.50.95:/usr/local/pxf-matrixdb4/servers/single_hdfs/

(6) Create the pxf-site.xml File (Create on YMatrix Master Node)

Create the PXF config file pxf-site.xml to ensure correct connectivity.

  1. Create and edit the file:
    vi /usr/local/pxf-matrixdb4/servers/single_hdfs/pxf-site.xml
  2. Add the following content:
    <?xml version="1.0" encoding="UTF-8"?>
    <configuration> 
    </configuration>

(7) Update core-site.xml's fs.defaultFS to the Active NameNode IP and Sync PXF Config (Execute on YMatrix Master Node)

Set the fs.defaultFS property in the copied core-site.xml for a correct NameNode connection.

  1. Update fs.defaultFS:
    <property>
    <name>fs.defaultFS</name>
    <value>hdfs://<namenode_ip>:9000</value>
    </property>
  2. Sync PXF config:
    pxf cluster sync

Ranger Configuration

PXF interacts with HDFS using OS-level users, not database users. Therefore, Ranger's HDFS permission control is based on operating system users.
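Because authorization is evaluated against the OS user, the mxadmin account must be resolvable on the Hadoop side. A minimal sketch, assuming the account does not yet exist on the Hadoop nodes (run as root; skip if it already exists):

useradd mxadmin   # create the OS user that Ranger policies will match
id mxadmin        # verify the account resolves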

  1. Configure HDFS Service in Ranger
  • Create an HDFS service and define policies in Ranger.
  • Log in to the Ranger Web console (http://hadoop01:6080).
  • Navigate to Access Manager > Resource Based Policies > HDFS.
  • Click Add New Service to create a new HDFS service with the following settings:
    Service Name: hadoopdev
    Display Name: hadoopdev
    [Screenshot: pxf_03]
    Username: root
    Password: password for root
    Namenode URL: hdfs://hadoop01:9000 (based on core-site.xml)
    [Screenshot: pxf_04]

In Ranger, Kerberos vs. Simple authentication:
  • Kerberos: uses tickets and encryption; high security, suitable for production environments, especially across multiple systems.
  • Simple: username/password-based; lower security, suitable for development or testing.

Save the service configuration.
[Screenshot: pxf_05]

Note!
After creating the service, restart the Hadoop Ranger Plugin for changes to take effect.

  2. Create the mxadmin User in Ranger

(i) Go to the "User / Groups" page
From the menu, select User / Groups, then click Add New User.

(ii) Create the mxadmin user
Enter mxadmin in the User Name field and click Save.
[Screenshot: pxf_06]

(iii) Assign HDFS Access Permissions to mxadmin
In Ranger, go to the hadoopdev service created above and click Add New Policy. Configure as follows:
Policy Name: test_allow_mxadmin
Resource Path: /ranger/pxf_test_verify.txt
[Screenshot: pxf_07]
User: mxadmin
Permissions: read
[Screenshot: pxf_08]
Save the policy.

Managing HDFS Access Policies with Ranger

1. Create Test Database

CREATE DATABASE mxadmin;
-- Switch to the mxadmin database
\c mxadmin

2. Install Plugin

Enable the PXF plugin in YMatrix to access HDFS data.
Run:

CREATE EXTENSION pxf_fdw;

3. Create FDW Server

Create an external data source for the Hadoop file system.
SQL command:

CREATE SERVER single_hdfs FOREIGN DATA WRAPPER hdfs_pxf_fdw OPTIONS ( config 'single_hdfs' );

4. Create FDW User Mapping

Map the YMatrix user to the FDW server.
Command:

CREATE USER MAPPING FOR mxadmin SERVER single_hdfs;

5. Create FDW Foreign Table

Map the HDFS file to a YMatrix table.
Run:

CREATE FOREIGN TABLE pxf_hdfs_verify (
    name text,
    month text,
    count int,
    amount float8
)
SERVER single_hdfs
OPTIONS (
    resource '/ranger/pxf_test_verify.txt',
    format 'text',
    delimiter ','
);

6. Query External Table

  1. When the Ranger Policy Is a Deny Condition
    [Screenshot: pxf_09]
  • Execute query:
    SELECT * FROM pxf_hdfs_verify;
  • Result: [Screenshot: pxf_10]
  2. When the Ranger Policy Is an Allow Condition
    [Screenshot: pxf_11]
  • Execute query:
    SELECT * FROM pxf_hdfs_verify;
  • Result: [Screenshot: pxf_12]
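For reference, the two outcomes typically look like the following; the exact error text varies by Hadoop and PXF version (abbreviated here), and the allowed query returns the single test row uploaded earlier.

-- Deny policy in effect: PXF surfaces the underlying HDFS permission error
SELECT * FROM pxf_hdfs_verify;
ERROR:  ... Permission denied: user=mxadmin, access=READ ...

-- Allow policy in effect
SELECT * FROM pxf_hdfs_verify;
   name   | month | count | amount
----------+-------+-------+--------
 testuser | May   |   100 | 999.99
(1 row)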