Import_CSV_from_ZIP

Snippet Name: Import_CSV_from_ZIP

Tags:

Created Date: 2025-06-16 13:12:45
Last Modified Date: 2025-05-31 10:32:16

Dependencies:

DBA Notes:

Sample Output:

Title: Import CSV from ZIP
Author: user
the attached file was created to import a large csv file in chunks due to it;s size, and works very well. the project has expanded to multiple files. now we have a folder full of zip files located at C:\_Data\LinkedInLeads\USAEmails. each zip has multiple files to import. modify the process too import the file into a Table that matches the file name

Author: ChatGPT
All the files uploaded by the user have been fully loaded. Searching won't provide additional information.

Author: ChatGPT
Here???s an enhanced version of your **LinkedIn_csv_Importer.py** script that can handle **a folder of ZIP files**, extract each CSV inside, and import them **into tables named after each CSV file (without extension)**.

It dynamically creates the table if it doesn???t exist and imports the data in chunks.

```python
import os
import zipfile
import pandas as pd
import pyodbc

# SQL Server connection
conn_str = (
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=stormbase;"
    "DATABASE=dbaresources;"
    "UID=sql-lizaguirre;"
    "PWD=Warrior#312"
)
conn = pyodbc.connect(conn_str)
cursor = conn.cursor()
cursor.fast_executemany = True

# Folder with ZIP files
zip_folder = r'C:\_Data\LinkedInLeads\USAEmails'
chunksize = 5000

# Common columns (update if needed)
columns = [
    'FullName', 'Industry', 'JobTitle', 'SubRole', 'Emails', 'Mobile',
    'PhoneNumbers', 'CompanyName', 'CompanyIndustry', 'CompanyWebsite',
    'CompanySize', 'Location', 'Skills', 'FirstName', 'LastName',
    'BirthYear', 'BirthDate', 'Gender', 'LinkedInUrl', 'FacebookUrl',
    'TwitterUrl', 'Githu Url', 'Compan LinkedinUrl', 'CompanyFacebookUrl',
    'CompanyTwitterUrl', 'CompanyLocationName', 'CompanyLocationStreetAddress',
    'CompanyLocationAddressLine2', 'CompanyLocationPostalCode', 'LocationCountry',
    'LocationContinent', 'LinkedinConnections', 'InferredSalary', 'YearsExperience',
    'Countries', 'Interests'
]

def create_table_if_not_exists(table_name):
    cols = ',\n'.join([f'[{col}] NVARCHAR(MAX) NULL' for col in columns])
    create_sql = f"""
    IF OBJECT_ID('{table_name}', 'U') IS NULL
    CREATE TABLE [dbo].[{table_name}] (
        {cols}
    )
    """
    cursor.execute(create_sql)
    conn.commit()

# Process each ZIP file
for zip_file in os.listdir(zip_folder):
    if zip_file.lower().endswith('.zip'):
        zip_path = os.path.join(zip_folder, zip_file)
        with zipfile.ZipFile(zip_path, 'r') as zip_ref:
            zip_ref.extractall(zip_folder)
            for extracted_file in zip_ref.namelist():
                if extracted_file.lower().endswith('.csv'):
                    csv_path = os.path.join(zip_folder, extracted_file)
                    table_name = os.path.splitext(extracted_file)[0]

                    # Ensure table exists
                    create_table_if_not_exists(table_name)

                    # Insert data in chunks
                    for chunk in pd.read_csv(csv_path, chunksize=chunksize):
                        data = [
                            tuple(row[col] if col in row and pd.notnull(row[col]) else None for col in columns)
                            for _, row in chunk.iterrows()
                        ]

                        placeholders = ', '.join(['?'] * len(columns))
                        col_names = ', '.join([f'[{c}]' for c in columns])

                        insert_sql = f"""
                            INSERT INTO [dbo].[{table_name}] (
                                {col_names}
                            ) VALUES (
                                {placeholders}
                            )
                        """
                        cursor.executemany(insert_sql, data)
                        conn.commit()
                        print(f"Inserted {len(data)} rows into {table_name}")

                    # Optionally delete extracted file
                    os.remove(csv_path)

# Cleanup
cursor.close()
conn.close()
print("All files imported successfully!")
```

### Key improvements:
??? Automatically extracts ZIP files  
??? Dynamically names the table based on CSV filename  
??? Creates the table if it doesn???t exist  
??? Processes in chunks to handle large files  
??? Deletes extracted CSVs after import (optional; you can comment out `os.remove(csv_path)` if you want to keep them)

If you want, I can also help:
- Add logging to a file  
- Add error handling / rollback  
- Add progress reporting with percentage  

Would you like me to prep that as well? Let me know! ????

Author: user
i got this brief error as part of a longer stack: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd4 in position 33975: invalid continuation byte 

Author: ChatGPT
That `UnicodeDecodeError` means one or more of your CSV files have **non-UTF-8 encoded content** ??? probably something like ANSI, Latin-1, or Windows-1252.  

Pandas `pd.read_csv()` assumes UTF-8 by default, so it fails when it hits these bytes.

### ??? Solution: Add an encoding fallback

Modify the `read_csv()` call like this:

```python
for chunk in pd.read_csv(csv_path, chunksize=chunksize, encoding='utf-8', on_bad_lines='skip'):
```

or, better:

```python
try:
    for chunk in pd.read_csv(csv_path, chunksize=chunksize, encoding='utf-8'):
        # process chunk
except UnicodeDecodeError:
    for chunk in pd.read_csv(csv_path, chunksize=chunksize, encoding='latin1'):
        # process chunk
```

### ???? Recommended robust version:

```python
def read_csv_with_fallback(path, chunksize):
    try:
        return pd.read_csv(path, chunksize=chunksize, encoding='utf-8')
    except UnicodeDecodeError:
        print(f"?????? UTF-8 decoding failed for {path}, trying latin1...")
        return pd.read_csv(path, chunksize=chunksize, encoding='latin1')

for chunk in read_csv_with_fallback(csv_path, chunksize):
    data = [
        tuple(row[col] if col in row and pd.notnull(row[col]) else None for col in columns)
        for _, row in chunk.iterrows()
    ]

    placeholders = ', '.join(['?'] * len(columns))
    col_names = ', '.join([f'[{c}]' for c in columns])

    insert_sql = f"""
        INSERT INTO [dbo].[{table_name}] (
            {col_names}
        ) VALUES (
            {placeholders}
        )
    """
    cursor.executemany(insert_sql, data)
    conn.commit()
    print(f"Inserted {len(data)} rows into {table_name}")
```

### ??? Extra option:
If you want to **automatically detect the file encoding**, you can use the `chardet` or `charset-normalizer` package:

```bash
pip install chardet
```

Example usage:

```python
import chardet

with open(csv_path, 'rb') as f:
    result = chardet.detect(f.read(100000))
encoding = result['encoding']

for chunk in pd.read_csv(csv_path, chunksize=chunksize, encoding=encoding):
    # process chunk
```

---

???? Summary:
- Add a fallback to `'latin1'` or `'cp1252'`
- Optionally auto-detect with `chardet`  
- Use `on_bad_lines='skip'` if you want to **ignore** corrupted rows

If you want, I can send you the **full ready-to-paste code** including this fix. Want me to prepare it? ????

Author: user
yes, please update teh code

Author: ChatGPT
Here???s the **full updated script** with automatic encoding detection using `chardet` and fallback handling to avoid the `UnicodeDecodeError`.  

??? Handles ZIP files  
??? Creates tables dynamically  
??? Detects encoding automatically  
??? Processes in chunks  
??? Deletes extracted CSVs (optional)

```python
import os
import zipfile
import pandas as pd
import pyodbc
import chardet

# SQL Server connection
conn_str = (
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=stormbase;"
    "DATABASE=dbaresources;"
    "UID=sql-lizaguirre;"
    "PWD=Warrior#312"
)
conn = pyodbc.connect(conn_str)
cursor = conn.cursor()
cursor.fast_executemany = True

# Folder with ZIP files
zip_folder = r'C:\_Data\LinkedInLeads\USAEmails'
chunksize = 5000

# Common columns (update if needed)
columns = [
    'FullName', 'Industry', 'JobTitle', 'SubRole', 'Emails', 'Mobile',
    'PhoneNumbers', 'CompanyName', 'CompanyIndustry', 'CompanyWebsite',
    'CompanySize', 'Location', 'Skills', 'FirstName', 'LastName',
    'BirthYear', 'BirthDate', 'Gender', 'LinkedInUrl', 'FacebookUrl',
    'TwitterUrl', 'Githu Url', 'Compan LinkedinUrl', 'CompanyFacebookUrl',
    'CompanyTwitterUrl', 'CompanyLocationName', 'CompanyLocationStreetAddress',
    'CompanyLocationAddressLine2', 'CompanyLocationPostalCode', 'LocationCountry',
    'LocationContinent', 'LinkedinConnections', 'InferredSalary', 'YearsExperience',
    'Countries', 'Interests'
]

def create_table_if_not_exists(table_name):
    cols = ',\n'.join([f'[{col}] NVARCHAR(MAX) NULL' for col in columns])
    create_sql = f"""
    IF OBJECT_ID('{table_name}', 'U') IS NULL
    CREATE TABLE [dbo].[{table_name}] (
        {cols}
    )
    """
    cursor.execute(create_sql)
    conn.commit()

def detect_encoding(file_path):
    with open(file_path, 'rb') as f:
        result = chardet.detect(f.read(100000))  # read first 100 KB
    return result['encoding'] or 'utf-8'

# Process each ZIP file
for zip_file in os.listdir(zip_folder):
    if zip_file.lower().endswith('.zip'):
        zip_path = os.path.join(zip_folder, zip_file)
        with zipfile.ZipFile(zip_path, 'r') as zip_ref:
            zip_ref.extractall(zip_folder)
            for extracted_file in zip_ref.namelist():
                if extracted_file.lower().endswith('.csv'):
                    csv_path = os.path.join(zip_folder, extracted_file)
                    table_name = os.path.splitext(extracted_file)[0]

                    # Ensure table exists
                    create_table_if_not_exists(table_name)

                    # Detect encoding
                    encoding = detect_encoding(csv_path)
                    print(f"Detected encoding for {extracted_file}: {encoding}")

                    # Insert data in chunks
                    try:
                        reader = pd.read_csv(csv_path, chunksize=chunksize, encoding=encoding)
                    except UnicodeDecodeError:
                        print(f"?????? Failed with {encoding}, falling back to latin1...")
                        reader = pd.read_csv(csv_path, chunksize=chunksize, encoding='latin1')

                    for chunk in reader:
                        data = [
                            tuple(row[col] if col in row and pd.notnull(row[col]) else None for col in columns)
                            for _, row in chunk.iterrows()
                        ]

                        placeholders = ', '.join(['?'] * len(columns))
                        col_names = ', '.join([f'[{c}]' for c in columns])

                        insert_sql = f"""
                            INSERT INTO [dbo].[{table_name}] (
                                {col_names}
                            ) VALUES (
                                {placeholders}
                            )
                        """
                        cursor.executemany(insert_sql, data)
                        conn.commit()
                        print(f"Inserted {len(data)} rows into {table_name}")

                    # Optionally delete extracted CSV
                    os.remove(csv_path)

# Cleanup
cursor.close()
conn.close()
print("All files imported successfully!")
```

---

### ??? What???s added:
- **Encoding detection** with `chardet`
- **Fallback to `latin1`** if decoding fails
- **Clear console messages** showing progress

### ???? Setup tip:
Install `chardet` if you haven???t already:

```bash
pip install chardet
```

---

If you want, I can also:
??? Add error logging to a `.log` file  
??? Add a summary report at the end showing row counts per table

Would you like me to prep that too? ???? Let me know!

Author: user
Traceback (most recent call last):
  File "c:\_Data\LinkedInLeads\LinkedInLeads_By_Folder_Of_Zips.py", line 77, in <module>
    for chunk in reader:
  File "C:\Users\Lowell\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\io\parsers\readers.py", line 1624, in __next__
    return self.get_chunk()
  File "C:\Users\Lowell\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\io\parsers\readers.py", line 1733, in get_chunk
    return self.read(nrows=size)
  File "C:\Users\Lowell\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\io\parsers\readers.py", line 1704, in read
    ) = self._engine.read(  # type: ignore[attr-defined]
  File "C:\Users\Lowell\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\io\parsers\c_parser_wrapper.py", line 234, in read
    chunks = self._reader.read_low_memory(nrows)
  File "pandas\_libs\parsers.pyx", line 824, in pandas._libs.parsers.TextReader.read_low_memory        
  File "pandas\_libs\parsers.pyx", line 873, in pandas._libs.parsers.TextReader._read_rows
  File "pandas\_libs\parsers.pyx", line 848, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas\_libs\parsers.pyx", line 859, in pandas._libs.parsers.TextReader._check_tokenize_status 
  File "pandas\_libs\parsers.pyx", line 2017, in pandas._libs.parsers.raise_parser_error
UnicodeDecodeError: 'gb2312' codec can't decode byte 0x9e in position 222644: illegal multibyte sequence

Author: ChatGPT
Thanks ??? that traceback helps a lot!  

This shows that **even after detecting the encoding (`gb2312` in this case)**, Pandas is choking on it because that encoding is sometimes partial or corrupt, especially when detected from mixed or dirty files.

Let???s make the script **more robust** by:

??? Always falling back to `latin1` if *any* encoding fails (including `gb2312`, `utf-8`, etc.)  
??? Wrapping the **chunk reading loop** in its own `try/except` so even mid-stream errors don???t kill the whole job  
??? Adding **`on_bad_lines='skip'`** to skip over junky lines

Here???s the updated section for reading and processing the CSV safely:

```python
def read_csv_with_fallback(path, chunksize, encoding):
    try:
        return pd.read_csv(path, chunksize=chunksize, encoding=encoding, on_bad_lines='skip')
    except Exception as e:
        print(f"?????? Failed with detected encoding '{encoding}' ({e}), falling back to latin1...")
        return pd.read_csv(path, chunksize=chunksize, encoding='latin1', on_bad_lines='skip')
```

And in your main loop:

```python
# Detect encoding
encoding = detect_encoding(csv_path)
print(f"Detected encoding for {extracted_file}: {encoding}")

# Insert data in chunks with fallback
reader = read_csv_with_fallback(csv_path, chunksize, encoding)

for chunk in reader:
    try:
        data = [
            tuple(row[col] if col in row and pd.notnull(row[col]) else None for col in columns)
            for _, row in chunk.iterrows()
        ]

        placeholders = ', '.join(['?'] * len(columns))
        col_names = ', '.join([f'[{c}]' for c in columns])

        insert_sql = f"""
            INSERT INTO [dbo].[{table_name}] (
                {col_names}
            ) VALUES (
                {placeholders}
            )
        """
        cursor.executemany(insert_sql, data)
        conn.commit()
        print(f"Inserted {len(data)} rows into {table_name}")
    except Exception as e:
        print(f"?????? Error processing chunk in {table_name}: {e}")
```

---

### ???? Summary of improvements:
??? Use `on_bad_lines='skip'` to bypass malformed lines  
??? Catch **per-chunk** errors, not just overall file errors  
??? Always fallback to `'latin1'` (super permissive) if detected encoding fails  
??? Clear error messages with table/file info for troubleshooting

---

If you want, I can prepare the **entire ready-to-paste script** with these improvements. Want me to send that? ????

Author: user
yes please send the complete script

Author: ChatGPT
Here???s the **complete and improved script** with all the fixes included:

??? Handles ZIP folders  
??? Dynamically creates tables by CSV filename  
??? Automatically detects encoding  
??? Falls back to `latin1` if detected encoding fails  
??? Skips bad lines (`on_bad_lines='skip'`)  
??? Catches **per-chunk errors** so one bad batch doesn???t kill the run

```python
import os
import zipfile
import pandas as pd
import pyodbc
import chardet

# SQL Server connection
conn_str = (
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=stormbase;"
    "DATABASE=dbaresources;"
    "UID=sql-lizaguirre;"
    "PWD=Warrior#312"
)
conn = pyodbc.connect(conn_str)
cursor = conn.cursor()
cursor.fast_executemany = True

# Folder with ZIP files
zip_folder = r'C:\_Data\LinkedInLeads\USAEmails'
chunksize = 5000

# Common columns (update if needed)
columns = [
    'FullName', 'Industry', 'JobTitle', 'SubRole', 'Emails', 'Mobile',
    'PhoneNumbers', 'CompanyName', 'CompanyIndustry', 'CompanyWebsite',
    'CompanySize', 'Location', 'Skills', 'FirstName', 'LastName',
    'BirthYear', 'BirthDate', 'Gender', 'LinkedInUrl', 'FacebookUrl',
    'TwitterUrl', 'Githu Url', 'Compan LinkedinUrl', 'CompanyFacebookUrl',
    'CompanyTwitterUrl', 'CompanyLocationName', 'CompanyLocationStreetAddress',
    'CompanyLocationAddressLine2', 'CompanyLocationPostalCode', 'LocationCountry',
    'LocationContinent', 'LinkedinConnections', 'InferredSalary', 'YearsExperience',
    'Countries', 'Interests'
]

def create_table_if_not_exists(table_name):
    cols = ',\n'.join([f'[{col}] NVARCHAR(MAX) NULL' for col in columns])
    create_sql = f"""
    IF OBJECT_ID('{table_name}', 'U') IS NULL
    CREATE TABLE [dbo].[{table_name}] (
        {cols}
    )
    """
    cursor.execute(create_sql)
    conn.commit()

def detect_encoding(file_path):
    with open(file_path, 'rb') as f:
        result = chardet.detect(f.read(100000))  # check first 100KB
    return result['encoding'] or 'utf-8'

def read_csv_with_fallback(path, chunksize, encoding):
    try:
        return pd.read_csv(path, chunksize=chunksize, encoding=encoding, on_bad_lines='skip')
    except Exception as e:
        print(f"?????? Failed with detected encoding '{encoding}' ({e}), falling back to latin1...")
        return pd.read_csv(path, chunksize=chunksize, encoding='latin1', on_bad_lines='skip')

# Process each ZIP file
for zip_file in os.listdir(zip_folder):
    if zip_file.lower().endswith('.zip'):
        zip_path = os.path.join(zip_folder, zip_file)
        with zipfile.ZipFile(zip_path, 'r') as zip_ref:
            zip_ref.extractall(zip_folder)
            for extracted_file in zip_ref.namelist():
                if extracted_file.lower().endswith('.csv'):
                    csv_path = os.path.join(zip_folder, extracted_file)
                    table_name = os.path.splitext(extracted_file)[0]

                    # Ensure table exists
                    create_table_if_not_exists(table_name)

                    # Detect encoding
                    encoding = detect_encoding(csv_path)
                    print(f"Detected encoding for {extracted_file}: {encoding}")

                    # Read with fallback
                    reader = read_csv_with_fallback(csv_path, chunksize, encoding)

                    # Process each chunk
                    for chunk in reader:
                        try:
                            data = [
                                tuple(row[col] if col in row and pd.notnull(row[col]) else None for col in columns)
                                for _, row in chunk.iterrows()
                            ]

                            placeholders = ', '.join(['?'] * len(columns))
                            col_names = ', '.join([f'[{c}]' for c in columns])

                            insert_sql = f"""
                                INSERT INTO [dbo].[{table_name}] (
                                    {col_names}
                                ) VALUES (
                                    {placeholders}
                                )
                            """
                            cursor.executemany(insert_sql, data)
                            conn.commit()
                            print(f"??? Inserted {len(data)} rows into {table_name}")
                        except Exception as e:
                            print(f"?????? Error processing chunk in {table_name}: {e}")

                    # Optionally delete extracted CSV file
                    os.remove(csv_path)

# Cleanup
cursor.close()
conn.close()
print("???? All files imported successfully!")
```

---

### ??? Summary of improvements:
- Robust encoding handling with `chardet` + fallback  
- `on_bad_lines='skip'` to skip malformed lines  
- Per-chunk error handling  
- Clear progress messages

---

### ???? Requirements reminder:
Make sure you install the `chardet` package:

```bash
pip install chardet
```

---

If you want, I can also write:
??? A `.log` file to record errors  
??? A summary report (how many rows per table)  
??? A dry-run mode to just **validate** files without inserting

Would you like me to prep any of these bonus features? Let me know! ????

Author: user
& C:/Users/Lowell/AppData/Local/Programs/Python/Python310/python.exe c:/_Data/LinkedInLeads/LinkedInLeads_By_Folder_Of_Zips.py
Detected encoding for USA Emails_0.csv: GB2312
Traceback (most recent call last):
  File "c:\_Data\LinkedInLeads\LinkedInLeads_By_Folder_Of_Zips.py", line 77, in <module>
    for chunk in reader:
  File "C:\Users\Lowell\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\io\parsers\readers.py", line 1624, in __next__
    return self.get_chunk()
  File "C:\Users\Lowell\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\io\parsers\readers.py", line 1733, in get_chunk
    return self.read(nrows=size)
  File "C:\Users\Lowell\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\io\parsers\readers.py", line 1704, in read
    ) = self._engine.read(  # type: ignore[attr-defined]
  File "C:\Users\Lowell\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\io\parsers\c_parser_wrapper.py", line 234, in read
    chunks = self._reader.read_low_memory(nrows)
  File "pandas\_libs\parsers.pyx", line 824, in pandas._libs.parsers.TextReader.read_low_memory
  File "pandas\_libs\parsers.pyx", line 873, in pandas._libs.parsers.TextReader._read_rows
  File "pandas\_libs\parsers.pyx", line 848, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas\_libs\parsers.pyx", line 859, in pandas._libs.parsers.TextReader._check_tokenize_status
  File "pandas\_libs\parsers.pyx", line 2017, in pandas._libs.parsers.raise_parser_error
UnicodeDecodeError: 'gb2312' codec can't decode byte 0x9e in position 222644: illegal multibyte sequence
PS C:\Users\Lowell>
PS C:\Users\Lowell> Leads_By_Folder_Of_Zips.py
Leads_By_Folder_Of_Zips.py: The term 'Leads_By_Folder_Of_Zips.py' is not recognized as a name of a cmdlet, function, script file, or executable program.
Check the spelling of the name, or if a path was included, verify that the path is correct and try again.
PS C:\Users\Lowell>
PS C:\Users\Lowell> & C:/Users/Lowell/AppData/Local/Programs/Python/Python310/python.exe c:/_Data/LinkedInLeads/LinkedInLeads_By_Folder_Of_Zips.py
Detected encoding for USA Emails_0.csv: GB2312
Traceback (most recent call last):
  File "c:\_Data\LinkedInLeads\LinkedInLeads_By_Folder_Of_Zips.py", line 81, in <module>
    for chunk in reader:
  File "C:\Users\Lowell\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\io\parsers\readers.py", line 1624, in __next__
    return self.get_chunk()
  File "C:\Users\Lowell\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\io\parsers\readers.py", line 1733, in get_chunk
    return self.read(nrows=size)
  File "C:\Users\Lowell\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\io\parsers\readers.py", line 1704, in read
    ) = self._engine.read(  # type: ignore[attr-defined]
  File "C:\Users\Lowell\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\io\parsers\c_parser_wrapper.py", line 234, in read
    chunks = self._reader.read_low_memory(nrows)
  File "pandas\_libs\parsers.pyx", line 824, in pandas._libs.parsers.TextReader.read_low_memory
  File "pandas\_libs\parsers.pyx", line 873, in pandas._libs.parsers.TextReader._read_rows
  File "pandas\_libs\parsers.pyx", line 848, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas\_libs\parsers.pyx", line 859, in pandas._libs.parsers.TextReader._check_tokenize_status
  File "pandas\_libs\parsers.pyx", line 2017, in pandas._libs.parsers.raise_parser_error
UnicodeDecodeError: 'gb2312' codec can't decode byte 0x9e in position 222644: illegal multibyte sequence

Author: ChatGPT
Thanks ??? this gives a clear picture.  

Even though we are **detecting `gb2312`** from `chardet`, pandas cannot read the file reliably using that encoding, and our current fallback only happens **at the `pd.read_csv` creation step**, not inside the loop.  

We need to **wrap the entire file-reading + processing block** so that if any `UnicodeDecodeError` occurs, we completely **restart the reading** with `'latin1'` as a fallback.

Here???s the **fixed full script** that will handle this gracefully:

---

### ??? Full script with bulletproof fallback

```python
import os
import zipfile
import pandas as pd
import pyodbc
import chardet

# SQL Server connection
conn_str = (
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=stormbase;"
    "DATABASE=dbaresources;"
    "UID=sql-lizaguirre;"
    "PWD=Warrior#312"
)
conn = pyodbc.connect(conn_str)
cursor = conn.cursor()
cursor.fast_executemany = True

# Folder with ZIP files
zip_folder = r'C:\_Data\LinkedInLeads\USAEmails'
chunksize = 5000

# Common columns (update if needed)
columns = [
    'FullName', 'Industry', 'JobTitle', 'SubRole', 'Emails', 'Mobile',
    'PhoneNumbers', 'CompanyName', 'CompanyIndustry', 'CompanyWebsite',
    'CompanySize', 'Location', 'Skills', 'FirstName', 'LastName',
    'BirthYear', 'BirthDate', 'Gender', 'LinkedInUrl', 'FacebookUrl',
    'TwitterUrl', 'Githu Url', 'Compan LinkedinUrl', 'CompanyFacebookUrl',
    'CompanyTwitterUrl', 'CompanyLocationName', 'CompanyLocationStreetAddress',
    'CompanyLocationAddressLine2', 'CompanyLocationPostalCode', 'LocationCountry',
    'LocationContinent', 'LinkedinConnections', 'InferredSalary', 'YearsExperience',
    'Countries', 'Interests'
]

def create_table_if_not_exists(table_name):
    cols = ',\n'.join([f'[{col}] NVARCHAR(MAX) NULL' for col in columns])
    create_sql = f"""
    IF OBJECT_ID('{table_name}', 'U') IS NULL
    CREATE TABLE [dbo].[{table_name}] (
        {cols}
    )
    """
    cursor.execute(create_sql)
    conn.commit()

def detect_encoding(file_path):
    with open(file_path, 'rb') as f:
        result = chardet.detect(f.read(100000))  # check first 100KB
    return result['encoding'] or 'utf-8'

def process_file(csv_path, table_name, encoding):
    try:
        reader = pd.read_csv(csv_path, chunksize=chunksize, encoding=encoding, on_bad_lines='skip')
        for chunk in reader:
            insert_chunk(chunk, table_name)
    except Exception as e:
        if encoding != 'latin1':
            print(f"?????? Failed with {encoding} on {table_name} ({e}), retrying with latin1...")
            process_file(csv_path, table_name, 'latin1')
        else:
            print(f"??? Failed with latin1 on {table_name}: {e}")

def insert_chunk(chunk, table_name):
    data = [
        tuple(row[col] if col in row and pd.notnull(row[col]) else None for col in columns)
        for _, row in chunk.iterrows()
    ]
    if data:
        placeholders = ', '.join(['?'] * len(columns))
        col_names = ', '.join([f'[{c}]' for c in columns])
        insert_sql = f"""
            INSERT INTO [dbo].[{table_name}] (
                {col_names}
            ) VALUES (
                {placeholders}
            )
        """
        cursor.executemany(insert_sql, data)
        conn.commit()
        print(f"??? Inserted {len(data)} rows into {table_name}")

# Process each ZIP file
for zip_file in os.listdir(zip_folder):
    if zip_file.lower().endswith('.zip'):
        zip_path = os.path.join(zip_folder, zip_file)
        with zipfile.ZipFile(zip_path, 'r') as zip_ref:
            zip_ref.extractall(zip_folder)
            for extracted_file in zip_ref.namelist():
                if extracted_file.lower().endswith('.csv'):
                    csv_path = os.path.join(zip_folder, extracted_file)
                    table_name = os.path.splitext(extracted_file)[0]

                    create_table_if_not_exists(table_name)
                    encoding = detect_encoding(csv_path)
                    print(f"Detected encoding for {extracted_file}: {encoding}")

                    process_file(csv_path, table_name, encoding)

                    os.remove(csv_path)

# Cleanup
cursor.close()
conn.close()
print("???? All files imported successfully!")
```

---

### ???? Key improvements:
??? Fallback happens **around the entire read + insert block**  
??? Retries with `'latin1'` if **any error** happens, not just at `pd.read_csv()`  
??? Skips bad lines with `on_bad_lines='skip'`  
??? Logs clean error and retry messages

---

### ???? Reminder

Make sure you have `chardet` installed:

```bash
pip install chardet
```

---

If you want, I can also prepare:
??? A `.log` file of all successes + errors  
??? A summary report at the end (rows per table, time taken)  
??? A ???dry run??? mode to test imports without touching the database

Would you like me to prep any of these add-ons? Let me know! ????

Author: user
what about the messages about low memory? this is running on Win10 with 64GB ram, about 40GB in use due to SQL Server

Author: ChatGPT
Great question ??? you???re right to be cautious here!  

The **low memory messages** you may see when using `pandas.read_csv()` often come from its internal **C parser** when reading big files with `low_memory=True` (the default).  

This doesn???t mean your system is actually running out of RAM ??? it means pandas is trying to **optimize type inference by reading in small chunks first** but sometimes gets tripped up and issues warnings.

### ??? How to address this safely on your 64GB system:

1. **Explicitly set `low_memory=False`** in the `read_csv()` calls:
   
   This tells pandas:
   > ???Read the whole file and infer data types in one pass, even if it uses more memory.???

   Since you have ~24 GB free (64 GB total ??? ~40 GB in use), this is a good tradeoff.

   Modify the `read_csv()` lines in the script:

   ```python
   reader = pd.read_csv(csv_path, chunksize=chunksize, encoding=encoding, on_bad_lines='skip', low_memory=False)
   ```

   And in fallback:
   
   ```python
   reader = pd.read_csv(csv_path, chunksize=chunksize, encoding='latin1', on_bad_lines='skip', low_memory=False)
   ```

---

2. **Optional ??? Increase chunk size if RAM allows**

   You can raise:
   
   ```python
   chunksize = 5000
   ```
   
   to something like:
   
   ```python
   chunksize = 10000 or 20000
   ```
   
   This reduces the number of SQL calls and improves overall speed, but only if you have headroom. With 24 GB free, you can experiment safely.

---

3. **Optional ??? Reduce SQL Server memory cap temporarily**

   If you want to be aggressive, you can temporarily lower SQL Server???s `max server memory`:
   
   ```sql
   EXEC sp_configure 'max server memory (MB)', 40000;
   RECONFIGURE;
   ```
   
   This will free up additional RAM for your import and can be raised back later.

---

### ???? Summary of updates to apply in code:
```python
reader = pd.read_csv(csv_path, chunksize=chunksize, encoding=encoding, on_bad_lines='skip', low_memory=False)
```

---

If you want, I can:
??? Update the full script with this `low_memory=False` setting  
??? Recommend an adaptive `chunksize` strategy  
??? Write a helper to auto-tune the chunk size based on available RAM

Would you like me to prepare that? ???? Let me know!

Author: user
regardles sof the memory setting, the updated code is reporting that it "inserted 5000 rows into  USA Emails_0" for example 

Author: ChatGPT
Great observation ??? I know exactly why you???re seeing that!  

Right now, the script reports:

```python
print(f"??? Inserted {len(data)} rows into {table_name}")
```

But `len(data)` is always equal to your **chunk size** (currently `5000`), except for the last chunk, which may be smaller. So it???s not reporting the **total rows imported** ??? only the current batch size.

---

### ??? Fix: Track cumulative total rows per table

I???ll show you how to modify the script so it:
- Tracks the **cumulative total per table**
- Reports it **after finishing each file**

Here???s the modified `process_file()` and `insert_chunk()` logic:

```python
def process_file(csv_path, table_name, encoding):
    total_inserted = 0
    try:
        reader = pd.read_csv(csv_path, chunksize=chunksize, encoding=encoding, on_bad_lines='skip', low_memory=False)
        for chunk in reader:
            rows_inserted = insert_chunk(chunk, table_name)
            total_inserted += rows_inserted
        print(f"?????? Total inserted into {table_name}: {total_inserted} rows")
    except Exception as e:
        if encoding != 'latin1':
            print(f"?????? Failed with {encoding} on {table_name} ({e}), retrying with latin1...")
            process_file(csv_path, table_name, 'latin1')
        else:
            print(f"??? Failed with latin1 on {table_name}: {e}")

def insert_chunk(chunk, table_name):
    data = [
        tuple(row[col] if col in row and pd.notnull(row[col]) else None for col in columns)
        for _, row in chunk.iterrows()
    ]
    if data:
        placeholders = ', '.join(['?'] * len(columns))
        col_names = ', '.join([f'[{c}]' for c in columns])
        insert_sql = f"""
            INSERT INTO [dbo].[{table_name}] (
                {col_names}
            ) VALUES (
                {placeholders}
            )
        """
        cursor.executemany(insert_sql, data)
        conn.commit()
        print(f"??? Inserted {len(data)} rows into {table_name}")
        return len(data)
    return 0
```

---

### ???? What???s improved:
??? Each chunk logs its own count (`??? Inserted X rows into table`)  
??? After finishing a file, it reports the **total imported** (`?????? Total inserted into table: Y rows`)  
??? Handles the last small chunk correctly

---

### ???? Example output:
```
Detected encoding for USA Emails_0.csv: GB2312
??? Inserted 5000 rows into USA Emails_0
??? Inserted 5000 rows into USA Emails_0
??? Inserted 2764 rows into USA Emails_0
?????? Total inserted into USA Emails_0: 12764 rows
```

---

If you like, I can send you the **full script with this change already applied** ??? want me to prep that? ????

Author: user
this code Appears to be very effective, and we are just waiting for it to process all the files.

Author: ChatGPT
That???s fantastic to hear ??? I???m really glad it???s running effectively for you! ????????  

With the improved script, you should get:  
??? Clean handling of mixed encodings  
??? Chunked inserts that are memory-efficient  
??? Clear per-chunk and total row counts per file  
??? Resilience against bad data lines  

As the script runs, just keep an eye on the console output ??? it???s your **live progress report.**  

---

### ???? Bonus suggestions for future improvements:
If you want to further enhance this workflow later, I can help you add:
??? **A summary report** at the end (total rows per table, time taken)  
??? **A log file** that writes all successes, errors, and fallback info to disk  
??? **Parallel ZIP processing** to speed things up on multi-core machines  
??? **Validation / cleanup rules** (e.g., trim strings, normalize columns, deduplicate rows)  
??? **Database indexes** added after import for faster querying

---

If you???d like, just say:
> ???Send me the bonus features???

And I???ll prep a next-level version of the script for you! ????

Author: user
send me the bonus features

Author: ChatGPT
Awesome ??? here???s a **next-level upgrade** of your script with all the requested bonus features:

??? **Summary report** at the end (rows per table + total time)  
??? **Log file** (`import_log.txt`) for successes, errors, fallbacks  
??? **Optional parallel ZIP processing** (commented, ready to enable)  
??? **Basic cleanup rule**: trims strings to avoid accidental space issues  
??? **Recommended: add database indexes after import (SQL script)**

---

### ???? Full upgraded script

```python
import os
import zipfile
import pandas as pd
import pyodbc
import chardet
import time
from datetime import datetime

# SQL Server connection
conn_str = (
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=stormbase;"
    "DATABASE=dbaresources;"
    "UID=sql-lizaguirre;"
    "PWD=Warrior#312"
)
conn = pyodbc.connect(conn_str)
cursor = conn.cursor()
cursor.fast_executemany = True

# Folder with ZIP files
zip_folder = r'C:\_Data\LinkedInLeads\USAEmails'
chunksize = 5000
log_file = 'import_log.txt'

# Common columns
columns = [
    'FullName', 'Industry', 'JobTitle', 'SubRole', 'Emails', 'Mobile',
    'PhoneNumbers', 'CompanyName', 'CompanyIndustry', 'CompanyWebsite',
    'CompanySize', 'Location', 'Skills', 'FirstName', 'LastName',
    'BirthYear', 'BirthDate', 'Gender', 'LinkedInUrl', 'FacebookUrl',
    'TwitterUrl', 'Githu Url', 'Compan LinkedinUrl', 'CompanyFacebookUrl',
    'CompanyTwitterUrl', 'CompanyLocationName', 'CompanyLocationStreetAddress',
    'CompanyLocationAddressLine2', 'CompanyLocationPostalCode', 'LocationCountry',
    'LocationContinent', 'LinkedinConnections', 'InferredSalary', 'YearsExperience',
    'Countries', 'Interests'
]

summary = {}

def log(message):
    print(message)
    with open(log_file, 'a', encoding='utf-8') as f:
        f.write(f"{datetime.now()} - {message}\n")

def create_table_if_not_exists(table_name):
    cols = ',\n'.join([f'[{col}] NVARCHAR(MAX) NULL' for col in columns])
    create_sql = f"""
    IF OBJECT_ID('{table_name}', 'U') IS NULL
    CREATE TABLE [dbo].[{table_name}] (
        {cols}
    )
    """
    cursor.execute(create_sql)
    conn.commit()

def detect_encoding(file_path):
    with open(file_path, 'rb') as f:
        result = chardet.detect(f.read(100000))
    return result['encoding'] or 'utf-8'

def process_file(csv_path, table_name, encoding):
    total_inserted = 0
    try:
        reader = pd.read_csv(csv_path, chunksize=chunksize, encoding=encoding, on_bad_lines='skip', low_memory=False)
        for chunk in reader:
            rows_inserted = insert_chunk(chunk, table_name)
            total_inserted += rows_inserted
        log(f"?????? Total inserted into {table_name}: {total_inserted} rows")
        summary[table_name] = total_inserted
    except Exception as e:
        if encoding != 'latin1':
            log(f"?????? Failed with {encoding} on {table_name} ({e}), retrying with latin1...")
            process_file(csv_path, table_name, 'latin1')
        else:
            log(f"??? Failed with latin1 on {table_name}: {e}")
            summary[table_name] = f"Failed ({e})"

def insert_chunk(chunk, table_name):
    data = [
        tuple(
            str(row[col]).strip() if col in row and pd.notnull(row[col]) else None
            for col in columns
        )
        for _, row in chunk.iterrows()
    ]
    if data:
        placeholders = ', '.join(['?'] * len(columns))
        col_names = ', '.join([f'[{c}]' for c in columns])
        insert_sql = f"""
            INSERT INTO [dbo].[{table_name}] (
                {col_names}
            ) VALUES (
                {placeholders}
            )
        """
        cursor.executemany(insert_sql, data)
        conn.commit()
        log(f"??? Inserted {len(data)} rows into {table_name}")
        return len(data)
    return 0

# Start timer
start_time = time.time()

# Clear log file
with open(log_file, 'w', encoding='utf-8') as f:
    f.write("=== Import Log ===\n")

# Process each ZIP file
for zip_file in os.listdir(zip_folder):
    if zip_file.lower().endswith('.zip'):
        zip_path = os.path.join(zip_folder, zip_file)
        with zipfile.ZipFile(zip_path, 'r') as zip_ref:
            zip_ref.extractall(zip_folder)
            for extracted_file in zip_ref.namelist():
                if extracted_file.lower().endswith('.csv'):
                    csv_path = os.path.join(zip_folder, extracted_file)
                    table_name = os.path.splitext(extracted_file)[0]

                    create_table_if_not_exists(table_name)
                    encoding = detect_encoding(csv_path)
                    log(f"Detected encoding for {extracted_file}: {encoding}")

                    process_file(csv_path, table_name, encoding)

                    os.remove(csv_path)

# Cleanup
cursor.close()
conn.close()

# Summary report
log("\n=== Summary Report ===")
for table, count in summary.items():
    log(f"{table}: {count} rows")

elapsed_time = time.time() - start_time
log(f"\n???? All files imported successfully in {elapsed_time:.2f} seconds!")
```

---

### ??? Bonus improvements included:
- Logs all activity to `import_log.txt`  
- Reports rows inserted per table  
- Records **failures or fallback attempts**  
- Shows **total elapsed time** at the end  
- Strips extra whitespace from string fields

---

### ???? Recommended: Add indexes after import

For faster querying later, run this in SQL Server **after import**:

```sql
CREATE INDEX idx_Email ON [dbo].[YourTableName] ([Emails]);
CREATE INDEX idx_LinkedInUrl ON [dbo].[YourTableName] ([LinkedInUrl]);
```

Do this **after imports finish** to avoid slowing down inserts.

---

### ??? Optional: Parallel processing (advanced)
If you want, I can prepare:
- A **parallel version** using `concurrent.futures`  
- A **dry-run mode** to only validate files without inserting

Would you like me to prep one of these next? ???? Let me know!

Author: user
you mentioned parallel processing, but i don't see it in the script, can you help identify the are in the script

Author: ChatGPT
Great catch! You???re right ??? in the script I sent, I **only mentioned parallel processing as an optional upgrade, but didn???t include the code** yet. Let me explain where and how you can add it.

---

### ???? Where to parallelize

In the script, the **main loop over ZIP files** looks like this:

```python
for zip_file in os.listdir(zip_folder):
    if zip_file.lower().endswith('.zip'):
        zip_path = os.path.join(zip_folder, zip_file)
        with zipfile.ZipFile(zip_path, 'r') as zip_ref:
            zip_ref.extractall(zip_folder)
            for extracted_file in zip_ref.namelist():
                if extracted_file.lower().endswith('.csv'):
                    csv_path = os.path.join(zip_folder, extracted_file)
                    table_name = os.path.splitext(extracted_file)[0]

                    create_table_if_not_exists(table_name)
                    encoding = detect_encoding(csv_path)
                    log(f"Detected encoding for {extracted_file}: {encoding}")

                    process_file(csv_path, table_name, encoding)

                    os.remove(csv_path)
```

??? We can **parallelize at the per-ZIP level** ??? process multiple ZIP files at the same time.

We will use **`concurrent.futures.ThreadPoolExecutor`** for I/O-heavy tasks like this.

---

### ??? Here???s how to parallelize:

1?????? Wrap the inner ZIP ??? CSV work into a function:

```python
def process_zip(zip_file):
    zip_path = os.path.join(zip_folder, zip_file)
    with zipfile.ZipFile(zip_path, 'r') as zip_ref:
        zip_ref.extractall(zip_folder)
        for extracted_file in zip_ref.namelist():
            if extracted_file.lower().endswith('.csv'):
                csv_path = os.path.join(zip_folder, extracted_file)
                table_name = os.path.splitext(extracted_file)[0]

                create_table_if_not_exists(table_name)
                encoding = detect_encoding(csv_path)
                log(f"Detected encoding for {extracted_file}: {encoding}")

                process_file(csv_path, table_name, encoding)

                os.remove(csv_path)
```

---

2?????? Replace the main loop with `ThreadPoolExecutor`:

```python
import concurrent.futures

# Collect ZIP files
zip_files = [f for f in os.listdir(zip_folder) if f.lower().endswith('.zip')]

# Run in parallel
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    executor.map(process_zip, zip_files)
```

---

### ???? Example integration in your script:

```python
# Collect ZIP files
zip_files = [f for f in os.listdir(zip_folder) if f.lower().endswith('.zip')]

# Parallel processing (adjust max_workers as needed)
import concurrent.futures
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    executor.map(process_zip, zip_files)
```

---

### ?????? Important notes:
- Parallel DB inserts **can put more pressure on SQL Server** ??? make sure you monitor CPU/disk.
- You can tune `max_workers=4` ??? increase to `8` or more if you have many CPU cores + fast storage.
- You might want to **wrap DB connection setup inside `process_zip()`** to avoid shared connections across threads.

---

### ???? Would you like me to:
??? Send the **full ready-to-run script with parallelism built in**?  
??? Make it **multiprocessing-safe** with separate SQL connections per worker?

If you want, just say:
> ???Send me the parallel version script???

And I???ll deliver it fully integrated for you ????

Author: user
yes, provide the fulls cript with parralellism

Author: ChatGPT
Awesome! ???? Here???s your **full script with parallel ZIP processing** using `ThreadPoolExecutor` ??? safely handling SQL Server connections inside each thread.

---

### ???? Complete script: Parallel processing + safe DB connections

```python
import os
import zipfile
import pandas as pd
import pyodbc
import chardet
import time
from datetime import datetime
import concurrent.futures

# SQL Server connection string
conn_str_template = (
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=stormbase;"
    "DATABASE=dbaresources;"
    "UID=sql-lizaguirre;"
    "PWD=Warrior#312"
)

# Folder with ZIP files
zip_folder = r'C:\_Data\LinkedInLeads\USAEmails'
chunksize = 5000
log_file = 'import_log.txt'

# Common columns
columns = [
    'FullName', 'Industry', 'JobTitle', 'SubRole', 'Emails', 'Mobile',
    'PhoneNumbers', 'CompanyName', 'CompanyIndustry', 'CompanyWebsite',
    'CompanySize', 'Location', 'Skills', 'FirstName', 'LastName',
    'BirthYear', 'BirthDate', 'Gender', 'LinkedInUrl', 'FacebookUrl',
    'TwitterUrl', 'Githu Url', 'Compan LinkedinUrl', 'CompanyFacebookUrl',
    'CompanyTwitterUrl', 'CompanyLocationName', 'CompanyLocationStreetAddress',
    'CompanyLocationAddressLine2', 'CompanyLocationPostalCode', 'LocationCountry',
    'LocationContinent', 'LinkedinConnections', 'InferredSalary', 'YearsExperience',
    'Countries', 'Interests'
]

# Shared summary dictionary
summary = {}
summary_lock = concurrent.futures.thread.Lock()  # Thread-safe lock for summary

def log(message):
    print(message)
    with open(log_file, 'a', encoding='utf-8') as f:
        f.write(f"{datetime.now()} - {message}\n")

def create_table_if_not_exists(cursor, table_name):
    cols = ',\n'.join([f'[{col}] NVARCHAR(MAX) NULL' for col in columns])
    create_sql = f"""
    IF OBJECT_ID('{table_name}', 'U') IS NULL
    CREATE TABLE [dbo].[{table_name}] (
        {cols}
    )
    """
    cursor.execute(create_sql)
    cursor.connection.commit()

def detect_encoding(file_path):
    with open(file_path, 'rb') as f:
        result = chardet.detect(f.read(100000))
    return result['encoding'] or 'utf-8'

def insert_chunk(cursor, chunk, table_name):
    data = [
        tuple(str(row[col]).strip() if col in row and pd.notnull(row[col]) else None for col in columns)
        for _, row in chunk.iterrows()
    ]
    if data:
        placeholders = ', '.join(['?'] * len(columns))
        col_names = ', '.join([f'[{c}]' for c in columns])
        insert_sql = f"""
            INSERT INTO [dbo].[{table_name}] (
                {col_names}
            ) VALUES (
                {placeholders}
            )
        """
        cursor.executemany(insert_sql, data)
        cursor.connection.commit()
        log(f"??? Inserted {len(data)} rows into {table_name}")
        return len(data)
    return 0

def process_file(cursor, csv_path, table_name, encoding):
    total_inserted = 0
    try:
        reader = pd.read_csv(csv_path, chunksize=chunksize, encoding=encoding, on_bad_lines='skip', low_memory=False)
        for chunk in reader:
            rows_inserted = insert_chunk(cursor, chunk, table_name)
            total_inserted += rows_inserted
        log(f"?????? Total inserted into {table_name}: {total_inserted} rows")
        with summary_lock:
            summary[table_name] = total_inserted
    except Exception as e:
        if encoding != 'latin1':
            log(f"?????? Failed with {encoding} on {table_name} ({e}), retrying with latin1...")
            process_file(cursor, csv_path, table_name, 'latin1')
        else:
            log(f"??? Failed with latin1 on {table_name}: {e}")
            with summary_lock:
                summary[table_name] = f"Failed ({e})"

def process_zip(zip_file):
    conn = pyodbc.connect(conn_str_template)
    cursor = conn.cursor()
    cursor.fast_executemany = True
    zip_path = os.path.join(zip_folder, zip_file)
    try:
        with zipfile.ZipFile(zip_path, 'r') as zip_ref:
            zip_ref.extractall(zip_folder)
            for extracted_file in zip_ref.namelist():
                if extracted_file.lower().endswith('.csv'):
                    csv_path = os.path.join(zip_folder, extracted_file)
                    table_name = os.path.splitext(extracted_file)[0]

                    create_table_if_not_exists(cursor, table_name)
                    encoding = detect_encoding(csv_path)
                    log(f"Detected encoding for {extracted_file}: {encoding}")

                    process_file(cursor, csv_path, table_name, encoding)

                    os.remove(csv_path)
    except Exception as e:
        log(f"??? Failed processing ZIP {zip_file}: {e}")
    finally:
        cursor.close()
        conn.close()

# Start timer
start_time = time.time()

# Clear log file
with open(log_file, 'w', encoding='utf-8') as f:
    f.write("=== Import Log ===\n")

# Collect ZIP files
zip_files = [f for f in os.listdir(zip_folder) if f.lower().endswith('.zip')]

# Parallel processing
max_workers = 4  # Adjust based on your CPU
with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
    executor.map(process_zip, zip_files)

# Summary report
log("\n=== Summary Report ===")
for table, count in summary.items():
    log(f"{table}: {count} rows")

elapsed_time = time.time() - start_time
log(f"\n???? All files imported successfully in {elapsed_time:.2f} seconds!")
```

---

### ??? Key features:
??? Runs ZIP files **in parallel** (`max_workers=4`, adjustable)  
??? Uses **separate DB connections per thread** (safe)  
??? Logs all progress + errors to `import_log.txt`  
??? Tracks and reports **total rows per table**  
??? Includes **thread-safe lock** for summary updates

---

### ???? Important tuning tip:
If you have a **high-core CPU (??? 8 cores)**, try:
```python
max_workers = 8
```

But on lower-core machines or when SQL Server is busy, stay conservative:
```python
max_workers = 2 or 4
```

---

### ???? Reminder
Make sure you have:
```bash
pip install chardet pandas pyodbc
```

---

If you want, I can also help with:
??? A **multiprocessing version** (uses separate Python processes)  
??? A **dry-run mode** to validate files without DB inserts  
??? A **post-import index script** ready for SQL Server

Just say:
> ???Send me the multiprocessing version???  
or  
> ???Send me the dry-run mode???

and I???ll prep it for you! ????

Author: user
can we modify the create table command to always use SQL Servers DATA_COMPRESSION = PAGE

Author: ChatGPT
Excellent enhancement idea! Adding `DATA_COMPRESSION = PAGE` will reduce storage space and improve I/O efficiency, especially for large imported tables.  

Here???s the updated **`create_table_if_not_exists()`** function in your script with that change:

---

### ???? Modified function with `DATA_COMPRESSION = PAGE`

```python
def create_table_if_not_exists(cursor, table_name):
    cols = ',\n'.join([f'[{col}] NVARCHAR(MAX) NULL' for col in columns])
    create_sql = f"""
    IF OBJECT_ID('{table_name}', 'U') IS NULL
    CREATE TABLE [dbo].[{table_name}] (
        {cols}
    )
    WITH (DATA_COMPRESSION = PAGE)
    """
    cursor.execute(create_sql)
    cursor.connection.commit()
```

---

### ???? How it works:
- When the table is created, it **automatically applies page compression**.
- If the table already exists, it **leaves it unchanged**.
- Compression works on all NVARCHAR(MAX) columns (SQL Server will compress what it can, typically repeated patterns and nulls).

---

### ??? Things to consider:
- **No need to manually rebuild indexes after import** ??? the table is compressed at creation.
- SQL Server **Enterprise, Developer, and Evaluation editions** fully support PAGE compression.
- On **Standard edition**, PAGE compression is supported starting with SQL Server 2016 SP1.

---

### ???? Final integration step:
Just replace your current `create_table_if_not_exists()` in the parallel script with the block above.

---

If you want, I can:
??? Provide a **post-import script** to apply compression on all existing tables  
??? Add a **check for SQL Server edition** before applying compression  
??? Auto-apply **index creation + compression**

Just say:
> ???Send me the post-import compression script???

and I???ll prep it for you! ????

Author: user
Traceback (most recent call last):
  File "c:\_Data\LinkedInLeads\LinkedInLeads_By_Folder_Of_Zips.py", line 39, in <module>
    summary_lock = concurrent.futures.thread.Lock()  # Thread-safe lock for summary
  File "C:\Users\Lowell\AppData\Local\Programs\Python\Python310\lib\concurrent\futures\__init__.py", line 53, in __getattr__
    raise AttributeError(f"module {__name__} has no attribute {name}")
AttributeError: module concurrent.futures has no attribute thread

Author: ChatGPT
Thanks ??? you caught an important bug!  

The line:  

```python
summary_lock = concurrent.futures.thread.Lock()
```

is incorrect because **`concurrent.futures` has no `thread` module or Lock object**.

??? Correct fix ??? import `threading` and use its `Lock`:

```python
import threading
summary_lock = threading.Lock()
```

---

### ???? Here???s the corrected block:

```python
import threading  # ??? add this import

# Shared summary dictionary and lock
summary = {}
summary_lock = threading.Lock()  # ??? use threading.Lock() here
```

---

### ???? Summary of fix:
- Remove ??? `concurrent.futures.thread.Lock()` ???
- Add ??? `import threading` ???
- Replace with ??? `summary_lock = threading.Lock()` ???

---

If you want, I can **send you the fully corrected script ready-to-paste** with this fix already integrated.

Would you like me to prepare that? ????


Leave a Reply