PowerShell and SQL.
So my most recent focus has been a huge set of LinkedIn leads I purchased. It’s a monstrous scan of LinkedIn from August of 2024, so slightly dated but still a good baseline: 31 million USA-location LinkedIn records with email addresses, plus 93 million USA-location LinkedIn records to harvest for, say, company names that a user mentions but without an email address listed.
And not just the US! For every country whose members mention their country, I have a record for them, also separated into two groups: those with emails to reference, and other records from the same country that do not feature an email.
What I purchased was access to all the flat file dumps, and the leads with email addresses are obviously the most valuable as far as marketing goes.
It turned into a fascinating project: I wanted to bring that data into a proper database and normalize it to make it useful. What’s really interesting are the exception records, like a value that should be an email address but imports as NVARCHAR(MAX) because it is actually a chunk of narrative text from whoever that record relates to. 99% of the columns are what they “should” be, but that remaining 1% is the stuff every DBA has to work through and clean up.
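To give a concrete flavor of that cleanup, here is a rough triage query of the kind involved, written against the [Emails] column and the [dbo].[USAEmails_Original] table that come up below. The LIKE heuristics are illustrative only, not the exact rules I settled on.

-- Rough triage: values sitting in the email column that clearly are not email addresses.
SELECT TOP (100)
       [Emails],
       LEN([Emails]) AS EmailLength
FROM   [dbo].[USAEmails_Original]
WHERE  [Emails] IS NOT NULL
  AND  ( [Emails] NOT LIKE '%@%'      -- no @ anywhere in the value
      OR LEN([Emails]) > 500 );       -- suspiciously long for a list of addresses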
So far, this has been my strategy. The code below is parameterized so I can apply it to every country.
Looking at the data, I want to denormalize it and limit the column sizes to VARCHAR(128), except for true descriptions. The data will require some AI or human review and intervention to eliminate the garbage data that is preventing the resizing. I imported each file into SQL Server 2022, one table per file.
- Kept the original 31M row table as [dbo].[USAEmails_Original] for reference and cross checks.
- Added a BinaryCheckSum column to help find duplicates later (see the identity/checksum/compression sketch after this list)
- Added a BIGINT IDENTITY() column to create a unique primary key to properly reference each row
- Compressed every table with DATA_COMPRESSION = PAGE
- Cleaned up data with noise or garbage repeating characters, like values that were all “zzz” (see the cleanup and resizing sketch after this list)
- Converted my [LinkedInLeads] table to VARCHAR instead of NVARCHAR
- Properly sized the columns from VARCHAR(MAX) down to a calculated size just slightly larger than the current longest value (e.g. [FullName] from NVARCHAR(MAX) to VARCHAR(630) NULL); the resulting [USAEmails] definition is shown further down
- Created a prototype LinkedInLeads table with the VARCHAR(128) sizes I envisioned, and just arbitrarily truncated everything except descriptions to VARCHAR(128)
- Created the LinkedInLeads table as memory optimized to see if it was faster to query, as well as seeing the size shrink to 1/3 of its original 34GB (see the memory-optimized sketch after this list)
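The identity/checksum/compression steps look roughly like this. This is a sketch, not the full production script, and the column list inside BINARY_CHECKSUM is abbreviated here.

-- Add a surrogate key and a checksum column to the staging table.
ALTER TABLE [dbo].[USAEmails_Original]
    ADD [USAEmailsId]    BIGINT IDENTITY(1,1) NOT NULL,
        [BinaryCheckSum] INT NULL;
GO

-- Populate the checksum from the business columns (abbreviated list here).
UPDATE [dbo].[USAEmails_Original]
SET    [BinaryCheckSum] = BINARY_CHECKSUM([FullName], [Emails], [CompanyName], [LinkedInUrl]);
GO

-- Rebuild with PAGE compression to shrink the table on disk.
ALTER TABLE [dbo].[USAEmails_Original]
    REBUILD WITH (DATA_COMPRESSION = PAGE);
GO

-- Candidate duplicates: the same checksum appearing more than once.
SELECT [BinaryCheckSum], COUNT(*) AS RowsWithSameChecksum
FROM   [dbo].[USAEmails_Original]
GROUP BY [BinaryCheckSum]
HAVING COUNT(*) > 1;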
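The cleanup and resizing sketch, using [FullName] as the example column; the same pattern applies column by column, whether you ALTER in place or carry the measured size into a new table definition like the [USAEmails] DDL further down.

-- Null out "all z" garbage values: strings that contain nothing but z / Z.
UPDATE [dbo].[USAEmails_Original]
SET    [FullName] = NULL
WHERE  [FullName] IS NOT NULL
  AND  [FullName] NOT LIKE '%[^zZ]%';
GO

-- Measure the true longest value so the column can be right-sized with some headroom.
SELECT MAX(LEN([FullName])) AS MaxFullNameLength
FROM   [dbo].[USAEmails_Original];
GO

-- Resize once the garbage is gone (630 came from the measurement above).
ALTER TABLE [dbo].[USAEmails_Original]
    ALTER COLUMN [FullName] VARCHAR(630) NULL;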
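The memory-optimized experiment looks roughly like this, trimmed to a handful of columns; the filegroup name and container path are placeholders, and the primary key has to be nonclustered on a memory-optimized table.

-- Prerequisite (run once per database): a memory-optimized filegroup and container.
-- ALTER DATABASE CURRENT ADD FILEGROUP [MemOptFG] CONTAINS MEMORY_OPTIMIZED_DATA;
-- ALTER DATABASE CURRENT ADD FILE (NAME = 'MemOptData', FILENAME = 'D:\Data\MemOptData')
--     TO FILEGROUP [MemOptFG];

CREATE TABLE [dbo].[LinkedInLeads] (
    [LinkedInLeadsId] BIGINT IDENTITY(1,1) NOT NULL,
    [FullName]        VARCHAR(128) NULL,
    [JobTitle]        VARCHAR(128) NULL,
    [CompanyName]     VARCHAR(128) NULL,
    [Skills]          VARCHAR(MAX) NULL,   -- true descriptions keep MAX
    CONSTRAINT [PK_LinkedInLeads] PRIMARY KEY NONCLUSTERED ([LinkedInLeadsId])
) WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA);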
From there, it was time to start denormalizing and making the data usable.
There are a few columns that lend themselves to leveraging vectors to find similar data:
- “industry”
- “jobtitle”
- “subrole”
- “location”
- “skills”
- “company”
- “interests”
Using a Python script I created, I grabbed each field; for fields that were comma delimited, I split them, reordered the items alphabetically, lower cased the data, and created a vector for each unique value. Then I associated each unique value back to the original records in an M:1 relationship to my Distinct Vector Tables.
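The shape of one of those distinct vector tables, using job title as the example; the object and column names here are illustrative rather than my exact ones, and the embedding is stored as JSON text since SQL Server 2022 has no native vector type.

-- One row per unique, cleaned-up job title, with its embedding.
CREATE TABLE [dbo].[DistinctJobTitle] (
    [DistinctJobTitleId] BIGINT IDENTITY(1,1) NOT NULL PRIMARY KEY CLUSTERED,
    [JobTitle]           VARCHAR(128) NOT NULL,   -- lower-cased, alphabetized value
    [Embedding]          VARCHAR(MAX) NULL,       -- JSON array of floats from the model
    CONSTRAINT [UQ_DistinctJobTitle_JobTitle] UNIQUE ([JobTitle])
);
GO

-- Many lead rows point at one distinct value (the M:1 relationship).
ALTER TABLE [dbo].[USAEmails]
    ADD [DistinctJobTitleId] BIGINT NULL
        CONSTRAINT [FK_USAEmails_DistinctJobTitle]
        REFERENCES [dbo].[DistinctJobTitle] ([DistinctJobTitleId]);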
Things like emails, phone numbers, and skills were comma-delimited lists, so I pulled them out to a separate table as a one-to-many relationship (a sketch follows below). The Python script is slow, since it queries my local LMStudio service for each value, using the text-embedding-nomic-embed-text-v1 model to vectorize it; even at six records a second, five million distinct values is going to take roughly ten days to populate!
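And the one-to-many split for one of the comma-delimited columns, again with illustrative names, referencing the [USAEmails] table whose DDL follows and using STRING_SPLIT to populate it.

-- Child table: one row per individual email address per lead.
CREATE TABLE [dbo].[USAEmailsEmailAddress] (
    [USAEmailsEmailAddressId] BIGINT IDENTITY(1,1) NOT NULL PRIMARY KEY CLUSTERED,
    [USAEmailsId]             BIGINT NOT NULL
        CONSTRAINT [FK_USAEmailsEmailAddress_USAEmails]
        REFERENCES [dbo].[USAEmails] ([USAEmailsId]),
    [EmailAddress]            VARCHAR(128) NOT NULL
);
GO

-- Explode the comma-delimited [Emails] column into the child table.
INSERT INTO [dbo].[USAEmailsEmailAddress] ([USAEmailsId], [EmailAddress])
SELECT e.[USAEmailsId],
       LOWER(LTRIM(RTRIM(s.[value])))
FROM   [dbo].[USAEmails] AS e
CROSS APPLY STRING_SPLIT(e.[Emails], ',') AS s
WHERE  e.[Emails] IS NOT NULL
  AND  LTRIM(RTRIM(s.[value])) <> '';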
IF OBJECT_ID('[dbo].[USAEmails]') IS NOT NULL
DROP TABLE [dbo].[USAEmails]
GO
CREATE TABLE [dbo].[USAEmails] (
[USAEmailsId] BIGINT IDENTITY(1,1) NOT NULL,
[FullName] VARCHAR(630) NULL,
[Industry] VARCHAR(max) NULL,
[JobTitle] VARCHAR(2980) NULL,
[SubRole] VARCHAR(2850) NULL,
[Emails] VARCHAR(360) NULL,
[Mobile] VARCHAR(560) NULL,
[PhoneNumbers] VARCHAR(680) NULL,
[CompanyName] VARCHAR(230) NULL,
[CompanyIndustry] VARCHAR(290) NULL,
[CompanyWebsite] VARCHAR(250) NULL,
[CompanySize] VARCHAR(1110) NULL,
[Location] VARCHAR(2350) NULL,
[Skills] VARCHAR(max) NULL,
[FirstName] VARCHAR(1090) NULL,
[LastName] VARCHAR(470) NULL,
[BirthYear] VARCHAR(470) NULL,
[BirthDate] VARCHAR(190) NULL,
[Gender] VARCHAR(810) NULL,
[LinkedInUrl] VARCHAR(720) NULL,
[FacebookUrl] VARCHAR(120) NULL,
[TwitterUrl] VARCHAR(830) NULL,
[GithubUrl] VARCHAR(780) NULL,
[CompanyLinkedinUrl] VARCHAR(290) NULL,
[CompanyFacebookUrl] VARCHAR(240) NULL,
[CompanyTwitterUrl] VARCHAR(100) NULL,
[CompanyLocationName] VARCHAR(90) NULL,
[CompanyLocationStreetAddress] VARCHAR(130) NULL,
[CompanyLocationAddressLine2] VARCHAR(90) NULL,
[CompanyLocationPostalCode] VARCHAR(100) NULL,
[LocationCountry] VARCHAR(1100) NULL,
[LocationContinent] VARCHAR(2090) NULL,
[LinkedinConnections] VARCHAR(850) NULL,
[InferredSalary] VARCHAR(990) NULL,
[YearsExperience] VARCHAR(1090) NULL,
[Countries] VARCHAR(1110) NULL,
[Interests] VARCHAR(3620) NULL,
[OldLocation] VARCHAR(2350) NULL,
[BinaryCheckSum] INT NULL,
CONSTRAINT [PK__USAEmails__USAEmailsId] PRIMARY KEY CLUSTERED ([USAEmailsId] ASC) WITH (FILLFACTOR = 100, DATA_COMPRESSION = PAGE))
GO
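Since the same definition gets applied to every country, the country name has to be swapped into the object names somehow. One way is sqlcmd scripting variables; this is a sketch of that pattern, not necessarily the exact mechanism I used, and the variable name is illustrative.

-- Sketch (in SQLCMD mode): change the value per country, or drive it from a calling script.
:setvar CountryName "USA"

IF OBJECT_ID('[dbo].[$(CountryName)Emails]') IS NOT NULL
    DROP TABLE [dbo].[$(CountryName)Emails];
GO

CREATE TABLE [dbo].[$(CountryName)Emails] (
    [$(CountryName)EmailsId] BIGINT IDENTITY(1,1) NOT NULL,
    [FullName]               VARCHAR(630) NULL,
    -- ...same remaining column list as the [USAEmails] table above...
    [BinaryCheckSum]         INT NULL,
    CONSTRAINT [PK__$(CountryName)Emails__Id] PRIMARY KEY CLUSTERED ([$(CountryName)EmailsId] ASC)
        WITH (FILLFACTOR = 100, DATA_COMPRESSION = PAGE)
);
GO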