I've tried to parse song lyrics from the biggest Russian lyrics site, http://amalgama-lab.com, and save the lyrics (translated and original) into the audio list of my Vkontakte account (sadly, amalgama doesn't have an API):
import urllib
from BeautifulSoup import BeautifulSoup
import vkontakte

vk = vkontakte.API(token=<SECRET_TOKEN>)
audios = vk.getAudios(count='2')
# Example item:
# {u'artist': u'The Beatles', u'url': u'http://cs4519.vkontakte.ru/u4665445/audio/4241af71a888.mp3', u'title': u'Yesterday', u'lyrics_id': u'2365986', u'duration': 130, u'aid': 166194990, u'owner_id': 173505924}
base_url = 'http://amalgama.mobi/songs/'
for i in audios:
    print i['artist']
    # Build a fresh URL for every track (appending to one shared string would
    # keep growing it); a leading "The " is dropped from the artist name.
    if i['artist'].startswith('The '):
        url = base_url + i['artist'][4:5] + '/' + i['artist'][4:].replace(' ', '_') + '/' + i['title'].replace(' ', '_') + '.html'
    else:
        url = base_url + i['artist'][:1] + '/' + i['artist'].replace(' ', '_') + '/' + i['title'].replace(' ', '_') + '.html'
    url = url.lower()
    page = urllib.urlopen(url)
    soup = BeautifulSoup(page.read(), fromEncoding="utf-8")
    texts = soup.findAll('ol')
    # Both the original and the translated block are needed below
    if len(texts) >= 2:
        en = texts[0].text #this!
        ru = texts[1].text #this!
        vk.get('audio.edit', aid=i['aid'], oid=i['owner_id'], artist=i['artist'], title=i['title'], text=ru, no_search=0)
But .text returns a string without any separators between the lines:
"Yesterday, all my troubles seemed so far awayNow it look as though they're here to stayOh, I believe in yesterdaySuddenly, I'm not half the man I used to beThere's a shadow hanging over meOh, yesterday came suddenly[Chorus:]Why she had to go I don't know, she wouldn't sayI said something wrong, now I long for yesterdayYesterday, love was such an easy game to playNow I need a place to hide awayOh, I believe in"
That is the main problem. Second, what is the best way to save the lyrics in the following layout?
Lyrics line 1 (Original)
Lyrics line 1 (Translated)
Lyrics line 2 (Original)
Lyrics line 2 (Translated)
Lyrics line 3 (Original)
Lyrics line 3 (Translated)
...
All I get is messy code. Thanks.
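The layout above could be produced by pairing the two texts line by line, assuming each one is available as a string with its line breaks intact. What follows is only a minimal sketch in Python 3 (in Python 2 the equivalent function is itertools.izip_longest); interleave_lyrics is a hypothetical helper name, not part of any library used in the question.

```python
from itertools import zip_longest


def interleave_lyrics(original, translated):
    """Alternate the lines of two newline-separated lyrics strings.

    Hypothetical helper: blank lines are dropped, and if one text has
    more lines than the other, the extra lines are appended unpaired.
    """
    en_lines = [line.strip() for line in original.splitlines() if line.strip()]
    ru_lines = [line.strip() for line in translated.splitlines() if line.strip()]

    merged = []
    for en_line, ru_line in zip_longest(en_lines, ru_lines, fillvalue=''):
        # Skip the empty filler that zip_longest adds for the shorter text
        merged.extend(part for part in (en_line, ru_line) if part)
    return '\n'.join(merged)
```

The result is a single string, so it could be passed as the text argument of the audio.edit call in place of ru. Pairing by position only works if both texts have the same number of lines in the same order, which is an assumption about the site's pages.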
<br/> tags, which the OP is stripping out. – Wamble
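Following the hint about <br/> tags, one way to keep the line breaks is to turn each <br/> into a newline while the tags are being stripped. The sketch below is not the BeautifulSoup 3 code from the question: it uses only the Python 3 standard library (html.parser), the names LineBreakText and text_with_breaks are hypothetical, and the sample markup is an assumption about how the page separates its lines.

```python
from html.parser import HTMLParser


class LineBreakText(HTMLParser):
    """Collect the text of an HTML fragment, emitting a newline per <br>."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Called for <br> and, via the default handle_startendtag, for <br/>
        if tag == 'br':
            self.parts.append('\n')

    def handle_data(self, data):
        self.parts.append(data)


def text_with_breaks(html):
    """Return the fragment's text with one lyrics line per output line."""
    parser = LineBreakText()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    lines = (line.strip() for line in ''.join(parser.parts).splitlines())
    return '\n'.join(line for line in lines if line)


# Assumed markup: lines inside the <ol> are separated by <br/> tags
sample = ("<ol>Yesterday, all my troubles seemed so far away<br/>"
          "Now it look as though they're here to stay<br/>"
          "Oh, I believe in yesterday</ol>")
print(text_with_breaks(sample))
```

If the script stays on BeautifulSoup, the bs4 release has get_text(), which accepts a separator string such as '\n' and may be the simpler route.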