Snippet Name: External_Tables_for_Parquet
Tags:
Created Date: 2025-06-16 13:12:39
Last Modified Date: 2025-05-31 10:32:09
Dependencies:
DBA Notes:
Sample Output:
Title: External Tables for Parquet
Author: user
I have a group of parquet files on local disk on my SQL Server. Can I use external tables to create virtual tables for them?
Author: ChatGPT
Yes, you can use **External Tables** in **SQL Server 2019** (or later) to create virtual tables for your Parquet files stored on a local disk. However, there are some key requirements and steps:
### **Requirements**
1. **PolyBase must be installed and enabled** - external tables rely on PolyBase for accessing external data.
2. **A configured External Data Source** - SQL Server needs to know where your files are located.
3. **A properly defined External File Format** - this tells SQL Server how to interpret the Parquet files.
4. **A properly defined External Table** - this is the actual table representation in SQL Server.
### **Steps to Set Up External Tables for Parquet Files**
#### **1. Enable PolyBase (if not already enabled)**
Run the following commands in SQL Server Management Studio (SSMS):
```sql
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'polybase enabled', 1;
RECONFIGURE;
```
After enabling PolyBase, restart your SQL Server instance.
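To verify the feature after the restart, you can run a quick sanity check against standard system objects:
```sql
-- Returns 1 if the PolyBase feature is installed on this instance
SELECT SERVERPROPERTY('IsPolyBaseInstalled') AS IsPolyBaseInstalled;

-- Confirm the 'polybase enabled' setting took effect
SELECT name, value_in_use
FROM sys.configurations
WHERE name = 'polybase enabled';
```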
#### **2. Create a Database Scoped Credential (Only for Remote Locations)**
Since your files are stored **locally**, you don't need this step unless you're accessing remote storage like Azure Blob or S3.
#### **3. Create an External Data Source**
Since SQL Server does **not support local disk access** directly for external tables, the workaround is to expose the directory through a local **Hadoop Filesystem (HDFS)** endpoint and point the data source at it:
```sql
CREATE EXTERNAL DATA SOURCE LocalParquetSource
WITH (
    LOCATION = 'hdfs://localhost/',
    TYPE = HADOOP
);
```
> **Alternative:** If you can move the files to an **Azure Blob Storage** container, you can use an **Azure Data Source** instead, which is officially supported.
#### **4. Create an External File Format for Parquet**
```sql
CREATE EXTERNAL FILE FORMAT ParquetFormat
WITH (
    FORMAT_TYPE = PARQUET
);
```
#### **5. Create the External Table**
Modify the column definitions to match your Parquet schema.
```sql
CREATE EXTERNAL TABLE dbo.MyParquetTable
(
    Column1 INT,
    Column2 NVARCHAR(100),
    Column3 DATETIME
)
WITH (
    LOCATION = '/path/to/parquet/files/',
    DATA_SOURCE = LocalParquetSource,
    FILE_FORMAT = ParquetFormat
);
```
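Once created, the external table can be queried like any other table; PolyBase reads the underlying Parquet files at query time. A minimal sketch using the placeholder columns above:
```sql
-- Rows are fetched from the Parquet files on demand; nothing is imported
SELECT TOP (10) Column1, Column2, Column3
FROM dbo.MyParquetTable
WHERE Column3 >= '2024-01-01';
```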
### **Limitations**
- SQL Server **does not support reading from a local disk** using PolyBase. You may need to move your Parquet files to **Azure Blob Storage**, **HDFS**, or **Amazon S3**.
- The workaround for local disk access involves **installing a local Hadoop FileSystem (HDFS)** and exposing the directory via HDFS.
### **Alternative Approaches**
If you **must** work with Parquet files on a local disk:
1. **Use OPENROWSET (starting in SQL Server 2022)**
If you're on **SQL Server 2022**, you can use `OPENROWSET()` to read Parquet files **directly**:
```sql
SELECT *
FROM OPENROWSET(BULK 'C:\ParquetFiles\myfile.parquet', FORMAT='PARQUET') AS t;
```
2. **Use Python or PowerShell to load Parquet into SQL Server**
- Convert Parquet to CSV using **pandas** in Python:
```python
import pandas as pd
df = pd.read_parquet('C:/ParquetFiles/myfile.parquet')
df.to_csv('C:/ParquetFiles/myfile.csv', index=False)
```
- Then use SQL Server's **BULK INSERT** or **bcp** to import the data.
### **Conclusion**
If you're on **SQL Server 2019**, your best option is to move your Parquet files to **Azure Blob Storage** and use **External Tables**. If you're on **SQL Server 2022**, you can use `OPENROWSET()` directly for local Parquet files.
Would you like help setting up **Azure Blob Storage** for external tables, or do you want to explore the `OPENROWSET()` approach in SQL Server 2022?
Author: user
so could I create a view which in turn uses OPENROWSET(BULK ...) to pull data from a Parquet file on demand, then?
Author: ChatGPT
Yes, in **SQL Server 2022**, you can create a **view** that uses `OPENROWSET(BULK ...)` to pull data from Parquet files **on demand**. This allows you to query the Parquet data as if it were a regular table without needing to import it permanently.
---
## **Steps to Create a View for Parquet Files**
### **1. Enable Ad Hoc Distributed Queries (One-time setup)**
Before using `OPENROWSET`, ensure the `ad hoc distributed queries` option is enabled:
```sql
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'ad hoc distributed queries', 1;
RECONFIGURE;
```
---
### **2. Create the View Using `OPENROWSET`**
You can create a view that dynamically reads Parquet data from your local disk:
```sql
CREATE VIEW vw_ParquetData AS
SELECT *
FROM OPENROWSET(
    BULK 'C:\ParquetFiles\myfile.parquet',
    FORMAT='PARQUET'
) AS t;
```
Now, whenever you query `vw_ParquetData`, SQL Server will **load the Parquet file on demand**.
---
### **3. Query the View**
```sql
SELECT * FROM vw_ParquetData;
```
This will read data **directly** from the Parquet file each time the query runs.
---
## **Extending the View**
If you have multiple Parquet files in a folder, you can use a wildcard (`*.parquet`) to read all files dynamically:
```sql
CREATE VIEW vw_ParquetData AS
SELECT *
FROM OPENROWSET(
    BULK 'C:\ParquetFiles\*.parquet',
    FORMAT='PARQUET'
) AS t;
```
This will pull data from **all Parquet files** in the folder.
---
## **Performance Considerations**
- Since this reads from the **file on demand**, performance depends on disk speed.
- If files are large, queries may be **slower than traditional tables**.
- If data is frequently queried, consider **storing it in a real table** to improve performance (see the sketch below).
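A minimal sketch of that last point, using a hypothetical snapshot table name:
```sql
-- Materialize the view into a permanent table (hypothetical name);
-- SELECT ... INTO creates the table and copies the current file contents
SELECT *
INTO dbo.ParquetData_Snapshot
FROM vw_ParquetData;
```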
---
## **Next Steps**
Would you like to:
1. **Filter specific Parquet files dynamically** (based on date or name)?
2. **Join this Parquet view with existing SQL tables**?
3. **Automate periodic imports into a permanent table**?
Let me know how you want to proceed!
Author: user
when trying to create the external data source in SQL 2022, I got this error: Msg 46530, Level 16, State 10, Line 16
External data sources are not supported with type HADOOP.
Author: ChatGPT
In **SQL Server 2022**, Microsoft removed support for **HADOOP**-based external data sources, which means you **cannot** create an external data source using `TYPE = HADOOP`. Instead, SQL Server 2022 introduces **native support** for querying Parquet files using `OPENROWSET(BULK ...)`.
---
## **Solution: Use `OPENROWSET` Instead of External Tables**
Instead of creating an **external data source**, you can directly query Parquet files using `OPENROWSET`. Here's how:
### **1. Query a Single Parquet File**
```sql
SELECT *
FROM OPENROWSET(
    BULK 'C:\ParquetFiles\myfile.parquet',
    FORMAT='PARQUET'
) AS t;
```
This reads **directly from the Parquet file** without needing an external table.
---
### **2. Query Multiple Parquet Files in a Folder**
If you have **multiple Parquet files** in a directory, you can use a wildcard (`*.parquet`):
```sql
SELECT *
FROM OPENROWSET(
    BULK 'C:\ParquetFiles\*.parquet',
    FORMAT='PARQUET'
) AS t;
```
This will **combine all Parquet files** in the specified folder.
---
### **3. Create a View for On-Demand Queries**
To make querying easier, you can **create a view** that always reads from the latest Parquet files:
```sql
CREATE VIEW vw_ParquetData AS
SELECT *
FROM OPENROWSET(
    BULK 'C:\ParquetFiles\*.parquet',
    FORMAT='PARQUET'
) AS t;
```
Now you can query it like a normal table:
```sql
SELECT * FROM vw_ParquetData;
```
This **eliminates the need for external tables** and works **natively in SQL Server 2022**.
---
## **What If You Need an External Table?**
If you still want an **external table** (to persist metadata and avoid `OPENROWSET` in every query), the recommended way in SQL Server 2022 is to use **Azure Data Lake Storage (ADLS)** or **Azure Blob Storage** instead of a local disk.
Would you like guidance on setting up **Azure Blob Storage** as an external source?
Author: user
in SQL 2022, I tried to run this command: SELECT *
FROM OPENROWSET(
BULK 'C:\Data\ParquetDataSources\[dbo].[vwServerInventoryPatches].parquet',
FORMAT='PARQUET'
) AS t; but I got this error: Msg 5371, Level 16, State 1, Line 5
Connector prefix 'C:\Data\ParquetDataSources\[dbo].[vwServerInventoryPatches].parquet' is not supported for specified FORMAT 'PARQUET'.
Author: ChatGPT
In **SQL Server 2022**, `OPENROWSET(BULK ...)` with `FORMAT='PARQUET'` **does not support local disk access**. It only works with **Azure Blob Storage, Azure Data Lake, or S3-compatible storage**.
---
## **Solutions for Reading Local Parquet Files**
Since **SQL Server 2022 does not support reading Parquet files from a local disk directly**, you have a few options:
### **Option 1: Move the Files to Azure Blob Storage**
SQL Server 2022 **requires cloud storage** for reading Parquet files using `OPENROWSET`. You can:
1. **Upload the file to Azure Blob Storage**.
2. **Use an external data source** to read it.
#### **Steps:**
- Upload your `.parquet` file to **Azure Blob Storage**.
- Create an **external data source**:
```sql
CREATE DATABASE SCOPED CREDENTIAL MyAzureCred
WITH IDENTITY = 'SHARED ACCESS SIGNATURE',
     SECRET = 'your-SAS-token-here';

-- SQL Server 2022 addresses Azure Blob Storage through the abs:// connector
-- prefix rather than TYPE = HADOOP or TYPE = BLOB_STORAGE
CREATE EXTERNAL DATA SOURCE MyAzureBlob
WITH (
    LOCATION = 'abs://yourcontainer@yourstorageaccount.blob.core.windows.net',
    CREDENTIAL = MyAzureCred
);
```
- Query the file using `OPENROWSET`:
```sql
SELECT *
FROM OPENROWSET(
    BULK 'yourfile.parquet',
    DATA_SOURCE = 'MyAzureBlob',
    FORMAT='PARQUET'
) AS t;
```
---
### **Option 2: Use Python to Read and Insert Data**
If you **must use local files**, you can use **Python and pandas** to load the Parquet file into SQL Server:
#### **Python Script to Convert Parquet to SQL Server Table**
```python
import pandas as pd
from sqlalchemy import create_engine

# Read the Parquet file into a DataFrame
df = pd.read_parquet(r'C:\Data\ParquetDataSources\vwServerInventoryPatches.parquet')

# pandas.to_sql needs a SQLAlchemy engine (a raw pyodbc connection is not
# supported); this uses Windows authentication via the pyodbc driver
engine = create_engine(
    'mssql+pyodbc://YourServer/YourDB'
    '?driver=ODBC+Driver+17+for+SQL+Server&trusted_connection=yes'
)

# Insert into a SQL Server table, replacing it if it already exists
df.to_sql('vwServerInventoryPatches', engine, if_exists='replace', index=False)
```
This will **import the Parquet data** into a SQL Server table, allowing you to query it normally.
---
### **Option 3: Convert to CSV and Use `BULK INSERT`**
Since SQL Server supports **BULK INSERT** for CSV but **not for Parquet**, you can convert the file:
1. **Convert Parquet to CSV using Python:**
```python
import pandas as pd
df = pd.read_parquet(r'C:\Data\ParquetDataSources\vwServerInventoryPatches.parquet')
df.to_csv(r'C:\Data\ParquetDataSources\vwServerInventoryPatches.csv', index=False)
```
2. **Load CSV into SQL Server using `BULK INSERT`:**
```sql
BULK INSERT dbo.vwServerInventoryPatches
FROM 'C:\Data\ParquetDataSources\vwServerInventoryPatches.csv'
WITH (FORMAT = 'CSV', FIRSTROW = 2, FIELDTERMINATOR = ',', ROWTERMINATOR = '\n');
```
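Note that `BULK INSERT` requires the destination table to exist beforehand; a minimal sketch with hypothetical columns (adjust to match the exported CSV):
```sql
-- Hypothetical schema; define columns to match the exported CSV
CREATE TABLE dbo.vwServerInventoryPatches
(
    ServerName  NVARCHAR(128),
    PatchId     NVARCHAR(50),
    InstalledOn DATETIME2
);
```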
---
## **Conclusion**
Unfortunately, **SQL Server 2022 does not support reading Parquet files from local disk**. Your best options are:
1. **Move files to Azure Blob Storage** and use `OPENROWSET()`.
2. **Use Python to insert Parquet data into SQL Server**.
3. **Convert Parquet to CSV and use `BULK INSERT`**.
Would you like help setting up **Azure Blob Storage** for reading Parquet files?
Author: user
our conversation started about consuming local parquet files, and the two techniques you mentioned do not work for local files. why was it presented as an option if it is not correct, and why do all parquet files need to be in blob storage?
Author: ChatGPT
You're absolutely right to call that out. Initially, I assumed **external tables** could be used, but when you ran into the **HADOOP external data source error**, I pivoted to `OPENROWSET()`, thinking it could access local files. However, SQL Server **2022 requires Parquet files to be in cloud storage** (Azure Blob Storage or S3) for `OPENROWSET()` to work. That was my mistake in not clarifying this sooner. Thanks for pointing it out.
---
## **What *Can* You Do for Local Parquet Files?**
Since **SQL Server 2022 does NOT support reading Parquet from local disk**, your only options are **workarounds**:
### **1. Convert the Parquet File to CSV or JSON, then Import**
SQL Server does **not support Parquet directly on local disk**, but it does support **CSV, JSON, and XML**. You can:
- Convert Parquet to CSV/JSON using **Python**.
- Import into SQL Server using `BULK INSERT` or `OPENROWSET(BULK ...)` with `FORMAT = 'CSV'`.
#### **Example: Convert Parquet to CSV (Python)**
```python
import pandas as pd
df = pd.read_parquet(r'C:\Data\ParquetDataSources\myfile.parquet')
df.to_csv(r'C:\Data\ParquetDataSources\myfile.csv', index=False)
```
#### **Import CSV into SQL Server**
```sql
BULK INSERT dbo.MyTable
FROM 'C:\Data\ParquetDataSources\myfile.csv'
WITH (
    FORMAT = 'CSV',
    FIRSTROW = 2,
    FIELDTERMINATOR = ',',
    ROWTERMINATOR = '\n'
);
```
---
### **2. Use Python to Read Parquet and Write to SQL Server**
Instead of converting to CSV, you can **directly insert data into SQL Server**.
#### **Python Script to Insert Parquet Data into SQL Server**
```python
import pandas as pd
from sqlalchemy import create_engine

# Read the Parquet file into a DataFrame
df = pd.read_parquet(r'C:\Data\ParquetDataSources\myfile.parquet')

# pandas.to_sql needs a SQLAlchemy engine rather than a raw pyodbc connection
engine = create_engine(
    'mssql+pyodbc://YourServer/YourDB'
    '?driver=ODBC+Driver+17+for+SQL+Server&trusted_connection=yes'
)

# Insert into SQL Server, replacing the table if it already exists
df.to_sql('MyTable', engine, if_exists='replace', index=False)
```
This script **reads Parquet** and **directly loads it into SQL Server**.
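A quick sanity check from SQL Server after the load, using the hypothetical `MyTable` name from the script:
```sql
-- Verify the Python load landed in the table
SELECT COUNT(*) AS RowsLoaded FROM dbo.MyTable;
SELECT TOP (10) * FROM dbo.MyTable;
```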
---
### **3. Use a Local Presto/Trino Server to Query Parquet**
SQL Server doesn't support local Parquet, but **Presto (or Trino)** does. You can:
1. **Install PrestoDB** (lightweight, can run locally).
2. **Create a Presto table pointing to your local Parquet files**.
3. **Query it from SQL Server using a linked server** (see the sketch below).
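A rough sketch of step 3, assuming a linked server named `PRESTO` has already been configured against Presto's ODBC driver (all names here are hypothetical):
```sql
-- Pass-through query to a hypothetical Presto/Trino linked server;
-- the inner query runs in Presto against the local Parquet files
SELECT *
FROM OPENQUERY(PRESTO, 'SELECT * FROM hive.default.server_inventory');
```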
Would you like help setting up **Presto to query local Parquet files and expose them to SQL Server**?
---
### **Final Clarification**
- **SQL Server 2019+ supports external tables but NOT local disk for Parquet files.**
- **SQL Server 2022 supports `OPENROWSET(PARQUET)` but ONLY for cloud storage.**
- **For local Parquet files, you must convert them (CSV, JSON) or use another engine (Python, Presto).**
---
### **Next Steps**
- Do you want help **automating CSV conversion and import**?
- Would you like to explore **using Presto/Trino to query local Parquet files**?
Let me know what works best for you!