Just to clarify in advance, I don't have a Facebook account and I have no intent to create one. Also, what I'm trying to achieve is perfectly legal in my country and the USA.
Instead of using the Facebook API to get the latest timeline posts of a Facebook page, I want to send a GET request directly to the page URL (e.g. this page) and extract the posts from the HTML source code.
(I'd like to get the text and the creation time of the post.)
When I run this in the web console:
document.getElementsByClassName('userContent')
I get a list of elements containing the text of the latest posts.
But I'd like to extract that information from a Node.js script. I could probably do it quite easily using a headless browser like puppeteer or the like, but that would create a ton of unnecessary overhead. I'd really prefer a simple approach: download the HTML code, pass it to cheerio, and use cheerio's jQuery-like API to extract the posts.
Here is my attempt at exactly that:
// npm i request cheerio request-promise-native
const rp = require('request-promise-native'); // requires installation of `request`
const cheerio = require('cheerio');
rp.get('https://www.facebook.com/pg/officialstackoverflow/posts/').then(postsHtml => {
  const $ = cheerio.load(postsHtml);
  const timeLinePostEls = $('.userContent');
  console.log(timeLinePostEls.html()); // should NOT be null
  // .eq(0) keeps the cheerio wrapper; .get(0) would return a raw node without .html()/.text()
  const newestPostEl = timeLinePostEls.eq(0);
  console.log(newestPostEl.html()); // should NOT be null
  const newestPostText = newestPostEl.text();
  console.log(newestPostText);
  //const newestPostTime = newestPostEl.parent(??).child('.livetimestamp').title;
  //console.log(newestPostTime);
}).catch(console.error);
Unfortunately, $('.userContent') does not work. However, I was able to verify that the data I'm looking for is embedded somewhere in that HTML code.
But I couldn't come up with a good regex approach or the like to extract that data.
Depending on the post content, the number of HTML tags within the post varies heavily.
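One possible explanation, which I have not confirmed, is that the post markup is delivered inside HTML comments, so an HTML parser sees a comment node instead of elements. The following is only a minimal sketch under that assumption; `unwrapHtmlComments` is a hypothetical helper and the sample markup is made up for illustration:

```javascript
// Hypothetical helper: strip the <!-- and --> delimiters so that markup
// hidden inside HTML comments becomes visible to an HTML parser.
function unwrapHtmlComments(html) {
  return html.replace(/<!--([\s\S]*?)-->/g, '$1');
}

// Made-up sample (an assumption, not the real page source): the post
// element only exists as text inside a comment.
const rawHtml =
  '<code id="c1"><!-- <div class="_5pbx userContent _3576"><p>Hi</p></div> --></code>';

const unwrapped = unwrapHtmlComments(rawHtml);
console.log(unwrapped);
```

If the assumption holds, passing the unwrapped string to `cheerio.load` instead of the raw response should let `$('.userContent')` match. If it does not hold, the sketch changes nothing and the cause lies elsewhere.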
Here is a simple example of a post containing one link:
<div class="_5pbx userContent _3576" data-ft="{&quot;tn&quot;:&quot;K&quot;}"><p>We're proud to be named one of Built In NYC's Best Places to Work in 2019, ranking in the top 10 for Best Midsize Places to Work and top 3 (!) for Best Perks and Benefits. See what it took to make the list and check out our profile to see some of our job openings. <a href="https://l.facebook.com/l.php?u=https%3A%2F%2Fbit.ly%2F2H3Kbr2&h=AT29h2HyDsEk0rHRWqJA-Fa4M1qi3nJT1NBi95othaR3qeFuFAMNiVS2Dgtv5KR5m0xqjw6kfwZdhZt0_D3UQT1Oel2UhxRql-KwkA1xqWvrql4u1jDhzrkGVT_XxoUd8_w8_fzLZzzhz23a8yPCK6IPfWKB76_CEFjG3b78y4dFJvY9Z08AYlR01dmi5_FvWVEVytkN-123u6alYE8pqL6Jb6dtIQUTWGXYJPaNMrtxkCUZniEVXEcILkwHGSuHqCTAarboyMP55F1vhYO3OAiVMkvjbN274fVq92YvbK3bi90bU9T-5ADWHDUJ-CwcofSBTW47chstQeY0n_UluD_rBIPLsfXVSnCtpRkR2kXi9zzHLnNeIYeNssv3i7UKS_f5Z2pnVT6xe3zJbNpB68doH1Z__I9nsTCNIyFyKx2VxabecoL03DIawbRrzBoxLAwzNPLACBjTkpEQhdVn4_wdAIjXRL4cLQDcZkLEoG_sspBgRePH23TFbNufQOBly-FNtLHnkUDO2Ca-FYvAGXpcu6J4B1aH3XFPB803lsz-GRdACyOFOgXDXJfwr4WtWzUHxfiOPULWiI43yI5L4aU6wYRhPjxua3RuRZ8oj9fXa1w4Jrht94Ue2wfKtz8" target="_blank" data-ft="{&quot;tn&quot;:&quot;-U&quot;}" rel="noopener nofollow" data-lynx-mode="async">http://*******/2H3Kbr2</a></p></div>
Formatted more readably, it looks somewhat like this:
<div class="_5pbx userContent _3576" data-ft="{&quot;tn&quot;:&quot;K&quot;}">
  <p>
    We're proud to be named one of Built In NYC's Best Places to Work in
    2019, ranking in the top 10 for Best Midsize Places to Work and top 3 (!) for
    Best Perks and Benefits. See what it took to make the list and check out our
    profile to see some of our job openings.
    <a href="VERY_LONG_URL.........." target="_blank" data-ft="{&quot;tn&quot;:&quot;-U&quot;}" rel="noopener nofollow" data-lynx-mode="async">SHORT_LINK.....</a>
  </p>
</div>
This regex seems to work okay, but I don't think it is very reliable:
/<div class="[^"]+ userContent [^"]+" data-ft="[^"]+">(.+?)<\/div>/g
If, for example, the post contained another div element, the lazy match would stop at the first closing </div> and cut the post short. In addition, this approach gives me no way of knowing the date/time the post was created.
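To illustrate the nested-div problem, here is a minimal sketch using the regex above on made-up markup. `innerHtmlOfDiv` is a hypothetical helper that counts opening and closing div tags instead of stopping at the first `</div>`. It is still string matching rather than real parsing, so it would break on attribute values that contain `>`:

```javascript
// The regex from above, unchanged.
const postRegex = /<div class="[^"]+ userContent [^"]+" data-ft="[^"]+">(.+?)<\/div>/g;

// Made-up sample: a post body that contains a nested <div>.
const sample =
  '<div class="_5pbx userContent _3576" data-ft="x">' +
  '<p>Hello <div class="inner">nested</div> world</p>' +
  '</div>';

// The lazy (.+?) ends at the first </div>, which closes the inner element.
const regexCapture = [...sample.matchAll(postRegex)].map(m => m[1]);
console.log(regexCapture); // [ '<p>Hello <div class="inner">nested' ]

// Hypothetical helper: return the inner HTML of the <div> whose opening tag
// starts at `openIndex`, tracking nesting depth of <div ...> and </div>.
function innerHtmlOfDiv(html, openIndex) {
  const tagRegex = /<div\b[^>]*>|<\/div>/g;
  tagRegex.lastIndex = openIndex;
  let depth = 0;
  let contentStart = -1;
  let match;
  while ((match = tagRegex.exec(html)) !== null) {
    if (match[0] === '</div>') {
      if (depth === 0) return null; // closing tag before any opening tag
      depth -= 1;
      if (depth === 0) return html.slice(contentStart, match.index);
    } else {
      if (depth === 0) contentStart = match.index + match[0].length;
      depth += 1;
    }
  }
  return null; // unbalanced markup
}

console.log(innerHtmlOfDiv(sample, 0));
// <p>Hello <div class="inner">nested</div> world</p>
```

This only addresses the truncation. It does nothing for the creation time, which lives outside the `userContent` element.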
Any ideas on how I could extract the 2-3 most recent posts reasonably reliably, including their creation date/time?