I have two lists of song titles, each in a plain text file; the titles are the filenames of licensed lyric files. I want to check whether the titles in the shorter list (needle) are in the longer list (haystack). The script/app should return the list of titles in the needle that aren't in the haystack.
I'd prefer to use Python or a shell script (Bash), or just use a visual diff program that can handle the fuzziness needed.
The main problem is that the titles need to be fuzzy matched to account for data entry errors and possibly also word ordering.
Haystack sample (note some duplicate and near-duplicate lines; matches are highlighted in bold):
Yearn
Yesterday, Today And Forever
Yesterday, Today, Forever
You
You Alone
You Are Here (The Same Power)
You Are Holy
You Are Holy (Prince Of Peace)
You Are Mighty
You Are Mine
You Are My All In All
You Are My Hiding Place
You Are My King (Amazing Love)
You Are Righteous (Hope)
You Are So Faithful
You Are So Good to Me
You Are Worthy Of My Praise
You Have Been Good
You Led Me To The Cross
You Reign
You Rescued Me
You Said
You Sent Your Own
You Set Me Apart (Dwell In Your House)
You alone are worthy (Glory in the highest)
You are God in heaven (let my words be few)
You are always fighting for us (Hallelujah you have overcome)
You are beautiful (I stand in awe)
You are beautiful beyond description
You are mighty
You are my all in all
You are my hiding place
You are my passion
You are still Holy
You are the Holy One (We exalt Your name)
You are the mighty King
You are the mighty warrior
You are the vine
**You chose the cross (Lost in wonder)**
You have shown me favour unending
You hold the broken hearted
You laid aside Your majesty
You said
You're Worthy Of My Praise
You're calling me (Unashamed love)
You're the God of this city
You're the Lion of Judah
You're the word of God the Father (Across the lands)
You've put a new song in my heart
Your Beloved
Your Grace is Enough
Your Great Name We Praise
Your Great Name We Praise-2
Your Light (You Have Turned)
Your Light Is Over Me (His Love)
**Your Love**
**Your Love Is Amazing**
Your Love Is Deep
Your Love Is Deeper - Jesus, Lord of Heaven (Phil Wickham)
Your Love Oh Lord
Your Love Oh Lord (Psalm 36)
Your Love is Extravagant
Your Power (Send Me)
Your blood speaks a better word
Your everlasting love
**Your grace is enough**
**Your grace is enough (Great is Your faithfulness)**
Your mercy is falling
Your mercy taught us how to dance (Dancing generation)
Your voice stills the oceans (nothing is impossible)
Yours Is The Kingdom
Needle sample:
You Are Good (I Want To Scream It Out)
You Are My Strength (In The Fullness)
You Are My Vision O King Of My Heart
You Are The King Of Glory (Hosanna To The Son)
**You Chose The Cross (Lost In Wonder)**
**Your Grace Is Enough (This Is Our God)**
**Your Love Is Amazing Steady And Unchanging**
**Your Love Shining Like The Sun**
Note that the needle title "Your Love Shining Like The Sun" is only a possible match for "Your Love". It's better to fail towards not matching, so any uncertain title matches should appear in the output.
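To make the "fail towards not matching" rule concrete, here is a minimal sketch using only Python's standard-library `difflib`. The function names `normalize` and `unmatched` and the `threshold` value of 0.9 are illustrative assumptions, not part of any existing tool. A needle title counts as present only if some haystack title scores at or above the threshold after case, punctuation and word order are normalized away; everything else is reported.

```python
import difflib
import re


def normalize(title):
    """Lowercase, drop punctuation, and sort the words so word order is ignored."""
    words = re.findall(r"[a-z0-9']+", title.lower())
    return " ".join(sorted(words))


def unmatched(needles, haystack, threshold=0.9):
    """Return the needle titles whose best haystack score is below threshold.

    threshold is an assumed cut-off: a high value fails towards "not matched",
    so uncertain titles end up in the output for manual review.
    """
    hay_keys = [normalize(h) for h in haystack]
    missing = []
    for title in needles:
        key = normalize(title)
        best = max(
            (difflib.SequenceMatcher(None, key, h).ratio() for h in hay_keys),
            default=0.0,
        )
        if best < threshold:
            missing.append(title)
    return missing


if __name__ == "__main__":
    # File names taken from the commands in this question.
    with open("haystack.txt") as f:
        haystack = [line.strip() for line in f if line.strip()]
    with open("needle.txt") as f:
        needles = [line.strip() for line in f if line.strip()]
    print("\n".join(unmatched(needles, haystack)))
```

With this normalization, "You Chose The Cross (Lost In Wonder)" and "You chose the cross (Lost in wonder)" become identical strings and score 1.0, while "Your Love Shining Like The Sun" scores well below 0.9 against "Your Love" and is therefore reported. The threshold would need tuning against real data.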
`comm -1 -3 <(sort haystack.txt) <(sort needle.txt)` doesn't find any of the matches. `diff` or `grep` seem likely to have the same problem of not being fuzzy enough. Kdiff3 and diffnow.com were only as quick as a manual comparison, as I still had to scan through for nearly all matches; they could only cope with whitespace and letter-case differences.
ExamDiffPro from prestosoft.com looks like a possibility, but it is MS Windows only and I'd prefer a native Linux solution before I go messing with WINE or VirtualBox.
The needle is actually a CSV, so I've thought about using LibreOffice and treating it as a database and doing SQL queries, or using a spreadsheet with hlookup or something. Another question led me to OpenRefine (formerly Google Refine).
This seems like a common category of problem (it's basically "record linkage", which often uses Levenshtein edit-distance calculation). How should I approach it? Suggestions please?
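For reference, the Levenshtein edit distance mentioned above can be sketched in plain Python as follows. This is a textbook dynamic-programming version, not taken from any particular record-linkage library; the `similarity` helper that scales the distance to the range 0 to 1 is my own assumption about how it might be turned into a match score.

```python
def levenshtein(a, b):
    """Minimum number of single-character inserts, deletes and substitutions
    needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(
                min(
                    previous[j] + 1,         # delete ca
                    current[j - 1] + 1,      # insert cb
                    previous[j - 1] + cost,  # substitute (or keep) the character
                )
            )
        previous = current
    return previous[-1]


def similarity(a, b):
    """Hypothetical helper: 1.0 for identical strings, approaching 0.0 as they diverge."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest
```

A single typo in a long title costs one edit, so it barely lowers the similarity, whereas reordered words cost many edits. That is why sorting or tokenizing the words before comparing may matter for these titles.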