Dealing with Thin Content


(Nick Emmett) #1

Continuing the discussion from So, what are you working on?:

@Patrick_Curtis, it may be worth filling in more detail, but essentially Patrick has seen growth stall somewhat and has attributed this, at least in part, to the amount of thin content on the site, how Google indexes it, and how that can now be detrimental to a site's ranking. Patrick's looking for your views, and for advice from anyone who has dealt with this before.

There was a really good session at SPRINT London about this from Dominic Woodman. It’s a shame he’s not on here to invite to the conversation.


(Patrick Curtis) #2

Nick, thanks for starting this. It might be helpful to give a little history and then dive into what we’ve done.

History: back in late 2012 (November 17, to be exact), many large forums seemed to see a dramatic falloff, including large, popular ones like MetaFilter: http://metatalk.metafilter.com/23251/Metafilter-SEO

While we were NOT hit like that, there has definitely been a dramatic decline in the growth rate of our organic traffic, right around the same time… we went from ~20% YoY growth to basically flat.

Now that I’ve been doing more research, it seems like Google is devaluing forums (not targeting / attacking on purpose per se, but as a natural result of UGC and how much thin content is typical in a forum)…

Now, whether Panda is binary or not can be debated forever… but the timing of all of this is very suspicious. While I have been one of the "lucky" large forum owners, since we weren't completely whacked, it has also felt like swimming upstream. We've invested over $400,000 over the past four years in making the site faster, improving the UX, making the community more engaging, user testing, etc. While I think the platform we have today is leaps and bounds better than what we had, that's not reflected in our search rankings… SO, we've started digging into the SEO / site-audit side of things to see what we could clean up. Over the past few months (with minimal to no improvement yet), we have done the following:

  1. Using Varnish to make the site much faster for logged-out users (heavy cache layer), with smart purge so that if new content is added, it's still showing / fresh.

  2. Worked on JavaScript aggregation settings (one suspicious thing is that when we tweak these settings, it tends to have a dramatic impact on perceived speed and organic search traffic… although the signals could be getting mixed with other changes).

  3. Noindexed nodes with under 500 characters and 0 comments (only about 2,700 nodes out of 200,000+)… this was the first step, and we'll likely get more aggressive, but we want to be more sophisticated and look at organic search traffic to these pages as well, since thin pages can often still rank well (in which case it's worth beefing them up).

  4. Ran the site through a Botify free trial to help us clean up ~10,000 dupes (titles, H1s, metas)… still in process.

  5. Pagination… increased the number of nodes per forum container page to reduce average site depth from 55 to ~22… still high, but I'm not sure how else to get this down (we already have a very flat structure). We're using rel="next" / rel="prev" and noindex,follow on page 2+.
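
For illustration, the head of a page-2+ listing carries roughly this markup (the URLs below are placeholders, not our real paths):

```html
<!-- Hypothetical paginated forum page (page 3); URLs are placeholders -->
<head>
  <!-- Declare the page's position in the paginated series -->
  <link rel="prev" href="https://example.com/forums/investment-banking?page=2">
  <link rel="next" href="https://example.com/forums/investment-banking?page=4">
  <!-- Keep the listing page out of the index, but let crawlers follow links to the threads -->
  <meta name="robots" content="noindex, follow">
</head>
```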

NEXT STEPS:

  1. Using Botify, we have noticed that about 6% of the nodes make up about 75% of the traffic, and ~20% of the URLs make up about 95% of the traffic… but I think we have an opportunity here to do better than just mass-deleting or noindexing a huge portion of the site… what about MERGING and PURGING…

  2. With #1 in mind, we have been developing an internal tool that lets us type in the title of a URL and see up to 50 related URLs pop up… many of them talking about the exact same or a very similar niche topic, or asking the same question. The tool (currently) shows us the title, a preview of the post, character count, comment count, and number of pageviews over the last year (via the Analytics API)…

The tool basically lets you select a "master" URL (the one with the most comments/traffic, etc.) and any number of "children"… what I would call "weaker" versions of the master… and merge them. It takes the children's comments, moves them to the master thread, and then puts a 301 redirect from each child to the master (in case there is any link juice, we don't want to lose it)…

Any thoughts on this? Obviously, we have to be careful in how we use this tool (and it's still not 100% done), but I thought it could be a way to at least combine and thicken some topics… for example, a small company may only have ~5 discussions about it over 5 years, but a user would have to go to 5 different pages to read all the comments about it. Isn't it a better user experience if we could merge those and have one thread with more comments???
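
For context, the core merge step is conceptually something like this (a minimal sketch; the comments and redirects tables here are hypothetical stand-ins, not our actual Drupal schema):

```python
import sqlite3

def merge_threads(conn, master_id, child_ids):
    """Move each child's comments onto the master thread, then 301 the child to it."""
    cur = conn.cursor()
    for child_id in child_ids:
        # Re-parent the child's comments onto the master node
        cur.execute("UPDATE comments SET node_id = ? WHERE node_id = ?",
                    (master_id, child_id))
        # Record a permanent (301) redirect so any link juice flows to the master
        cur.execute("INSERT INTO redirects (source_node_id, target_node_id, status_code) "
                    "VALUES (?, ?, 301)", (child_id, master_id))
    conn.commit()
```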

Wow, that was longer than expected! Thanks for your thoughts!

Talk soon,
Patrick


(Sarah Hawk) #3

Good call. I’ll see what I can do. :slight_smile:

@Patrick_Curtis Interesting challenge! When you talk about thin content, are you referring to automatically generated pages (ie profiles etc)?


(Kristen Gastaldo) #4

Ohh, we have so much of this. I actually just had to search and see if I had a profile page on this site, as I never use it.

The community I work on has your public-facing profile and your private dashboard (my topics, bookmarked topics, etc). We also migrated over about 10k profiles that haven't been activated yet. I wonder how common that is!


(Sarah Hawk) #5

Have a read of this article, especially step 5. All of the advice that I've heard lately points to noindexing this kind of thin community content.


(Nick Emmett) #6

Dominic talked about similar themes. His slides are here


(Patrick Curtis) #7

Thanks… yeah, everything I've read has pointed to either deleting, noindexing, or redirecting to related, thicker content. The hard part, across 200,000 nodes, is deciding what to do with what.

We are using Botify to do a deep crawl and cross-reference our organic traffic stats… URLs with thin content (i.e. less than 1,000 characters and, say, 1 or 0 organic search sessions) we are thinking of just noindexing to start, but we are still settling on those numbers.

The next step after that will be to identify pages that have good potential but are thin… maybe less than 2,000 characters of unique content but with 100+ organic visits in the last 90 days (is there a "parent" or similar topic that it can be merged with?).
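
To make those thresholds concrete, the triage could be roughly this (a sketch only; the CSV columns are assumptions, not an actual Botify export format):

```python
import csv

# Thresholds from the plan above; both are open to tuning
NOINDEX_MAX_CHARS = 1000     # very thin...
NOINDEX_MAX_SESSIONS = 1     # ...and getting (almost) no organic traffic
MERGE_MAX_CHARS = 2000       # still thin...
MERGE_MIN_SESSIONS = 100     # ...but attracting meaningful organic traffic

def triage(crawl_csv):
    """Bucket URLs into noindex candidates and merge/beef-up candidates."""
    noindex, merge = [], []
    with open(crawl_csv, newline="") as fh:
        for row in csv.DictReader(fh):
            chars = int(row["char_count"])
            sessions = int(row["organic_sessions_90d"])
            if chars < NOINDEX_MAX_CHARS and sessions <= NOINDEX_MAX_SESSIONS:
                noindex.append(row["url"])
            elif chars < MERGE_MAX_CHARS and sessions >= MERGE_MIN_SESSIONS:
                merge.append(row["url"])
    return noindex, merge
```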

…lots of work to do :slight_smile: will keep you updated!


(Joe Velez) #8

No need to delete … NOINDEX will do the job. Deleting should only be used when you're running out of space/money.

Some questions…

Are you referring to pageviews, sessions, or both?

Is WSO the site in question?

Is site optimized for mobile?

Mobile vs desktop traffic (percentage)?

When was the last time you changed site (big changes)?


(Patrick Curtis) #9

Ok, thanks Joe…my plan was to only delete pages that were super thin and added no value and/or were actual dupes and to NoIndex the majority of other low quality / thin pages.

When I say organic visits, I am referring to sessions that originated from organic search.

Yes, WSO (Wall Street Oasis) is the site in question :slight_smile:

Yes, the site is optimized for mobile.

Mobile vs desktop = 30% vs 70%

Some big changes… well, we upgraded from Drupal 6 to 7 on March 31, 2015, including a redesign (but one that was just cleaner than the previous version, no dramatic changes all at once). We have been treading water on organic search for many years now… some months it looks like we are back on the right track with 10-15% YoY growth, then other times we are right back down to -2% to +2%… in other words, basically 0% growth.

Another recent change (not sure if you would classify it as big) was to dramatically simplify our navigation and reduce the amount of space given to the header, to bring the content up even higher above the fold. Any changes we have made have been to try to make Google happier (we worked a lot on speed for a few months)…

I welcome any thoughts you have!

Thanks everyone!
Patrick


(Dominic Woodman) #10

My ears are burning :slight_smile: Thanks for the heads-up, Sarah.

Judging from the description of Varnish & AdvAgg, I'm gonna assume you're reasonably technical and just jump in.

Getting more information

  • Check out Screaming Frog. The license is super cheap (unlike Botify), and with about 8GB of RAM you should be able to crawl 200k URLs comfortably (you can alter how much RAM it uses).

  • Check out your logs - they will give you a better idea of whether or not pagination is an issue and how often Google is getting to your threads. (If you start noindexing a bunch, this might also start being an issue, although something is better than nothing, etc.) Note: this is technical and can be fiddly if you can't do it in Excel. Looking at SimilarWeb, it looks like you guys get a fair bit of traffic, so you might have to go pretty small scale to fit it in Excel (maybe grab a couple of weeks; I can usually work with 600k rows or so comfortably in Excel). If not, do your devs have any ordinary log monitoring going on that you could borrow? (You won't be able to do a reverse DNS lookup on Googlebot, but it should still give you some sort of idea.)

Thoughts on what’s been mentioned so far

  • When you migrated from Drupal 6 to 7, was there much migration and redirection? (Could there have been a drop from that which knocked down your otherwise good growth?)

  • "6% of the nodes make up about 75% of the traffic and ~20% of the URLs make up about 95% of the traffic" – this definitely seems like a big thin-content warning sign. Judging from what you've said here (so I could be totally missing something), I think I'd probably spend most of my time trying to find ways to hide that low-quality content away and to improve titles on valuable posts. With that in mind:

Working with all that content

  • Your internal tool for aggregating posts sounds really interesting. I would be worried, though, that it might create a lot of discussions that make very little sense from a user-experience POV, but assuming you can get around that, then yes, there's definitely potential there. (Another variant of this tool, if you can't solve the making-sense issue, might be one which noindexes all the cruft and then adds the threads as nofollow further-reading links at the bottom.)

  • Do the articles you guys write on the site end up bumping out valuable community content on the same topic? The answer might be to pull all the community content into the article.

  • You've got a nice "best response" format, which might be a good way to pull out immediately valuable content for more prominent linking.

  • You've got a lot of terrible titles; here's just one - "Ask Natalie from Accepted about Business School Admissions" – amazing thread – switch that title tag to something like "Commonly asked questions about business school admissions - 300 Q&As" (not perfect, but something like that).

  • I think you could probably mass-pull on certain title patterns to find bad threads to start noindexing, e.g. "!" (probably a lower-quality question), "macbook" (an example of something off-topic, or at least not immediately useful in search), anything with 1-3 words, and so on.
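
Purely as a sketch of that kind of pattern pull (these are just the example patterns above, not a vetted list):

```python
import re

# Example patterns only - a real list would be built by eyeballing the title export
BAD_TITLE_PATTERNS = [
    re.compile(r"!"),                  # exclamation marks often flag low-quality questions
    re.compile(r"\bmacbook\b", re.I),  # off-topic, or at least not useful in search
]

def is_noindex_candidate(title):
    """Flag titles that match a bad pattern or are only 1-3 words long."""
    if len(title.split()) <= 3:
        return True
    return any(p.search(title) for p in BAD_TITLE_PATTERNS)

print(is_noindex_candidate("Help!!!"))                     # True (short + "!")
print(is_noindex_candidate("Which macbook should I buy"))  # True (off-topic pattern)
print(is_noindex_candidate("Commonly asked questions about business school admissions"))  # False
```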

Random other thoughts

  • I found the threads really hard to parse because of all the information in between each reply. Maybe try A/B testing pushing some of that over to the left (or removing it) and leaving less in between, so it's easier to read as a conversation.


(Patrick Curtis) #11

Wow @dominic_woodman , you rule! Thanks for taking the time to dive into my issues (just tried to reach out to you over LinkedIn)…maybe I can hire you to help :slight_smile: I’m at Patrick@wallstreetoasis.com

Getting more information

  1. Screaming Frog… we were able to negotiate a great rate from Botify (they realized we weren't an agency with clients, so they gave us a break)… I find Botify easier for visualizing my issues (see report attached). WSO-botify-3-23-16-crawl-report.pdf (456.2 KB)

  2. Re: logs. Yeah, the Botify report has a "depth" section… it's one of the reasons we dramatically increased the number of items per page (dropped average depth from 54 down to 14… still bad, but I think better)… With such a large forum, I'm not sure how else we can reduce the depth effectively at scale… I'm not sure I fully follow what else may pop up in a log report, but I can ask one of my devs to grab it?

Thoughts on what's been mentioned so far

  1. No noticeable drop in organic search traffic… I've attached a year-over-year graph of the past 16 months mapped against the organic traffic the year before. As you can see, with the exception of a few short periods, we map almost exactly to the prior year's organic traffic. No noticeable move from the D6 to D7 upgrade on March 31, 2015. 16mo-YOY-org-traffic-trend.pdf (158.3 KB)

  2. Thin content warning sign. Yes, 100% agreed. I have "total character count" and organic search traffic over a 90-day period for over 150,000 URLs. I am starting by noindexing ~31,000 nodes tomorrow that are under 1,000 total characters and had 0 search traffic… the next stage will be (likely, assuming no adverse effects from the first step) to noindex another ~30,000 nodes (0 or 1 organic visit and under 2,000 characters)… then wait another ~2-3 weeks. This isn't really "risky" since these nodes aren't generating much (if any) traffic, but I'm hoping Google will see the noindexing as some sort of signal. I think, however, that in order to really see an improvement site-wide, we may have to cut deeper… something like URLs that have fewer than 10 visits and are under 3,000 characters… Before we go there (a bit scary to cut that deep), though, I think there are other ways to merge / beef up high-potential pages (thin but getting traffic)…

Working with all that content

  1. Our internal merging tool. Yes, I think if it's not done carefully, it can create havoc and confusion for a user. We have to be very careful to only merge nodes that are truly talking about the same thing… so we'd be able to merge the ~5 discussions just asking for general information on "Duke MMS", for example, but not the discussions asking "Duke MMS or UVA MSF?"

  2. Great question, but we write VERY little of our own content; about 99% is UGC. I think that, regardless, finding a "master" and merging + redirecting the weaker children to that master discussion would provide a much better user experience and really thicken up our content. So if it's good for users, engagement, etc., it's probably good for SEO (just obviously a huge pain in the ass)…

  3. Thanks for the "best response" compliment… that is actually new, and just this past week we started floating the best response to the top. I think we can do more around some things you mentioned in your presentation (I went through it last night, but I think I'd need the audio on some slides :slight_smile:

  4. Agreed on terrible titles! The horror of UGC and not writing for SEO…definitely think we can & should identify the best content and optimize it more (again, very manual)

  5. Yes, we've done some of this noindexing (threads that started with "delete", for example: there was a time when we allowed users to change the topic title, and after many got their answer, they would remove everything and ask for it to be deleted = bad for SEO, dupe titles, etc.)… On the issue of dupes… still way higher than I want (which you can see from the Botify report), but we have dramatically reduced these over the past month (no noticeable improvement in SEO yet).

Random Other Thoughts

  1. Great feedback, thanks! I think in the past we had it so those buttons would only appear if you hovered over the comment, but I wanted people to know they could take action… Not sure if we can remove the buttons, but maybe we can do away with the user signatures and tighten it up?

Thanks again, Dominic, for all of your thoughts and this helpful post. Hopefully, for other community managers dealing with forum SEO, this will serve as an interesting case study :slight_smile: Of course, after we get through this massive project, I'll be happy to report the results (good or bad) back here to the community.

Happy to chat offline as well if you have time.

-Patrick


(Joe Velez) #12

Love the site. Great design!

I only ask about sessions vs pageviews because a lot of people get them confused.

Pageviews usually drop because of modernization of platforms and subtle changes such as increasing topics/comments per page. What used to take 3 clicks now takes only 1, etc. Yes, small changes can have huge effects (negative or positive) on sessions and pageviews.

Mobile is another biggie. The more growth you have in mobile, the more likely you are to see a decrease in overall pageviews and/or sessions.

With your mobile numbers so low there may be something else going on. What are your demographics (age)?

Our mobile growth is currently at 60% (sessions) / 55% (pageviews).

By the way, your topic list pages do not look good on mobile. Content is wider than the screen. https://www.google.com/webmasters/tools/mobile-friendly/?url=http%3A%2F%2Fwww.wallstreetoasis.com%2Fforum%2Ftrading

Google does not hate forums. It never has.

A few things are happening…

Google is getting really good at finding "quality" topics. A quality topic isn't just a 1,000-word topic. It's whatever people react to… a 200-character post, an image, a video, downloadable files, etc. Whatever people like … share … comment on … revisit, etc. … over and over again.

Google wants to make sure that their users are satisfied with their search results. They will show the most relevant, popular, local, etc topics first. (Focus on FRESH hot topics. Get traffic to these pages. Share on social, email, etc.)

Your organic search visitors are not getting what they want and leave quickly (bounce). This is an important signal to Google nowadays. They will rank pages lower as this continues. (This could be thin content but most likely it’s user intent. Google got it wrong. But, they are getting really good at it nowadays.)

Google previously showed a lot of duplicate “junk”. I’m not talking about the same topics … I’m referring to different URLs pointing to the same topic but appearing on Google. Poof. All gone now. This was years ago but some people fail to make a connection (or forgot about it).

Anyway, all of these changes plus some have decreased traffic to many sites.

Oh, don’t forget that there’s more competition these days. This means Google has more content to rank which may rank better than yours. So, you need to work harder!

So, how do we help Google find "quality" pages on your site?

For starters, I think you are doing a good job. Increasing internal links to topics is very important and you are doing a good job here. I even found them in profile pages which is great.

Increase internal links wherever possible. More links to a page tell Google that the page is important to you (regardless of whether it's a thin or auto-generated page)… Check Search Console > Internal Links and find the most linked-to pages. If you find that these pages are not getting that much traffic, then you need to fix it. (e.g. sites usually have the highest number of links to Terms, Privacy, Register, etc.… change this so that the hottest/most popular topics are always on top - revolving.)

Check Google Search Console (aka Webmaster Tools)…

Look into Crawl Errors, Crawl Stats, Index Status, Search Analytics, Links to Your Site (maybe disavow a few sites), Internal Links, etc.

Pay close attention to Index Status. This should be growing but it can drop. If you delete or NOINDEX you will see a drop here.


Be really careful because you can do more harm than good.


NOINDEX is good because shared links to these pages within and outside of the site will still work.

You are not gaining anything by deleting pages. Once it’s gone it’s gone. Manual deletes are ok but stay away from mass deleting.

As you know, topics in forums move quickly. Sometimes a great topic ends up with no views or comments because it was just a busy day in the forums.

If you are going to flag thin pages I recommend looking at some or all of the following:

  • 3+ years old only
  • no views
  • no comments
  • less than 300 words ?
  • RSS generated content (from outside sources posted on your site)?

My approach to SEO is to learn everything that you want to learn, make corrections, but let Google do its thing. Don't focus too much on Google. As long as you have the basics covered, you will be OK.

DON’T FORGET THAT THIS IS A FORUM that we are talking about. Content is user generated. You can only do so much.

With that said, I recommend creating more in-house articles/images/videos/files/etc. After posting, let 1 day pass, then share on social media… after another day passes, share in the newsletter.


Focus on improving ALL your traffic channels…

  • Organic search
  • Social (Use it to direct traffic to site)
  • Referrals
  • Newsletter / Email
  • Conferences / Events ??
  • Mobile apps
  • External pages (sister sites)
  • Others

Your Facebook page is a big FAIL. (Almost) every post links to a third party. Take this opportunity to direct readers to your site. Mix it up with images and throw in some third-party links every 10+ posts. Grow your followers/likes.

This isn’t EVERYTHING but I hope it helps. :slight_smile:

(Never thought about merging topics. I think it’s a great idea!)


(Dominic Woodman) #13

Thanks for the heads-up; I'm terrible at checking LinkedIn.

Getting more information

  1. Excellent, in that case stick with Botify. SF is a great, very cheap, very ugly Swiss Army knife. Sometimes if all you need is cutlery, it's better to stick with that, especially if it's a good deal (I'm not quite sure that analogy works…). Nothing else leaps out of that Botify report, or at least nothing I'd prioritize over fixing your title tags, aggregating, and removing all the fluff.

  2. So the great thing about logs is that they're the only data we can get from Google about our site which is utterly concrete. Botify can tell you your site's average depth is 14. Your logs can tell you whether Google is visiting pagination up to page 14, for example. Maybe it is, and depth 14 is fine. You filter your logs for Googlebot entries, reverse-lookup the IPs to make sure it is really Googlebot, and then see what it's spending its time crawling. You would also be amazed at the kind of stuff Googlebot will find on your website (I've seen ridiculous cases where it crawls a config file 100k times in a day; very, very unlikely, but it does happen). (Or you could test other hypotheses, like how often does it re-crawl a thread to discover fresh content?)
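
As a rough sketch of that filtering step (the log path and format here are just assumptions; the reverse-then-forward DNS check is the standard way to confirm a hit really came from Googlebot):

```python
import socket

def is_real_googlebot(ip):
    """Reverse-resolve the IP, check the hostname, then forward-confirm it."""
    try:
        host = socket.gethostbyaddr(ip)[0]  # e.g. crawl-66-249-66-1.googlebot.com
        if not host.endswith((".googlebot.com", ".google.com")):
            return False
        return socket.gethostbyname(host) == ip  # forward lookup must match the original IP
    except OSError:
        return False

# Assumed combined log format: IP is the first field, user agent appears later in the line
with open("access.log") as fh:
    for line in fh:
        if "Googlebot" in line and is_real_googlebot(line.split()[0]):
            print(line.rstrip())  # a genuine Googlebot hit - what is it spending time on?
```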

Thoughts on what's been mentioned so far

  1. Yep that does look reasonably constant. Would be interesting to dig into exactly what increased for those two periods of increased performance.

  2. Definitely, anything with no traffic over 3 months seems like a pretty reasonable noindex. I think you have exactly the right idea in terms of staggered roll-outs. Let me know how that goes.

Working with all that content

1 & 2. Ah OK, yes, cutting quite specifically then. Slight tangent, but to create deeper content and fewer one-off questions, have you also considered doing something similar to what inbound.org does? They have a lot of "ask the community" questions which are explicitly aimed at bringing in a bunch of opinions. Depending on the community feel, that's possibly something you could lift successfully. (You could also create the starting post with content lifted from existing threads, which you then close.) It seems like you have a lot of threads that get opened because it's not obvious where people should go (perhaps steal the Stack Exchange pre-question post search).

The forums are also relatively unstructured beyond the top level. Adding more common tabs or sections would let you create more stickied resources for each section, have them more prominently linked, and target the curated content more accurately (how do you curate content for an entire site with different intents? Really hard; you've got to segment). This would also perhaps prevent more common or bad questions.

3. Cheers! Highlighting the ones with good answers in the page title, or with tagging, makes a big difference too, I think. Some communities, like Inbound again or Growth Hackers, put tags next to threads to help users pull out the useful ones (these could be automated to some extent based on the feedback in the thread).

Random Other Thoughts

Maybe not remove them, but perhaps turn them into logos, or shift them to the right-hand side so they're not in the main flow? It looks like you don't have indenting when people are replying to each other either; I would think that's definitely worth adding too (then you could cut out the "in reply to x" header above each post).

Definitely let us know how it works out! I’m relatively busy, but I’m sure we could grab a drink sometime and have a chat!

- Dominic

(Patrick Curtis) #14

@jgsnatos, please read this from Dominic, thanks.


(João Santos) #15

OK


(Patrick Curtis) #16

Also check out the cool email notifications the Discourse platform uses!


(Sarah Hawk) #17

I thought I’d update this topic because we’ve spent this month cleaning up a lot of thin content issues.

The main things I've done so far are:

  • removed extraneous tags in WP (we had a lot of legacy tags that weren't relevant, e.g. there was onlinecommunities and onlinecommunity, both of which created near-duplicate URLs for every post)
  • cleaned up author names from WP (similar reason as above – every piece of content had both a /articlename and a richardmillington/articlename URL)
  • tidied up near-duplicate URLs
  • tidied up 4xx and 5xx errors

We're already seeing a noticeable difference. If you're on WP, there are lots of easy wins. If you're not, I'd still run a crawl report and tidy up redirects and onsite errors.


(Patrick Curtis) #18

Thought it would only be fair to give you guys an update on where things stand:

  1. MERGING THREADS: our internal merging tool is now basically a script. We manually use a massive Excel file to look over / filter thousands of discussions to identify "masters" and "children"… once the script is run, the children get merged into the master node and a 301 redirect is placed on the old children. RESULT: as of right now it's pretty inconclusive… we aren't seeing many dramatic increases, but we've only "touched" about 10% of all threads, and my guess is we won't see big wins until we get into the weaker middle "belly" of the thin content… will keep you guys updated (probably another ~6 months and ~6 batch merges).

  2. Botify has helped us to the point where our page speed is very fast. We have also started using what I'd call a "smart warm-up script", where we crawl the site but also look at the AGE of the cached page… if the node has been in our cache for 12+ days (it gets purged by day 14 = TTL), then we force-purge it and recache it for another 12 days. The idea here is to keep as high a percentage of content in Varnish as possible, to provide lightning-fast speeds for logged-out users and bots… and it's working, since our cache hit rate is as high as ever.
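
For anyone curious, the warm-up logic is roughly this (a sketch; the 12-day threshold comes from our TTL, and it assumes Varnish is configured to accept PURGE requests and to expose an Age header):

```python
import requests

MAX_CACHE_AGE = 12 * 24 * 3600  # seconds; objects older than this get re-warmed (TTL is 14 days)

def warm_url(url):
    """If a page has sat in Varnish for 12+ days, purge it and fetch a fresh copy."""
    resp = requests.get(url, timeout=30)
    age = int(resp.headers.get("Age", 0))   # Varnish reports the cached object's age in seconds
    if age >= MAX_CACHE_AGE:
        requests.request("PURGE", url, timeout=30)  # needs a VCL rule allowing PURGE from this host
        requests.get(url, timeout=30)               # re-fetch so a fresh copy goes back into the cache

# Example: warm a (hypothetical) list of URLs pulled from a sitemap or crawl export
for url in ["https://example.com/forums/investment-banking"]:
    warm_url(url)
```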

@Joe_Velez, I also want to thank you… while we are trying to increase our traffic from all sources (social, direct, newsletters, etc.), we are still at 83% organic search and we are still seeing little to no growth…

  1. We just made the jump to PHP 7 (from 5) and Ubuntu 16 (from 14), plus a new server, and our speed for logged-in users is even better now… here's hoping that translates into more pages/visit and more traffic and engagement :slight_smile:

The forums are also relatively unstructured beyond the top level. Adding more common tabs or sections would let you create more stickied resources for each section, have them more prominently linked, and target the curated content more accurately (how do you curate content for an entire site with different intents? Really hard; you've got to segment). This would also perhaps prevent more common or bad questions.

We have now created a “Hall of Fame” for each forum (helps internal linking and helps curate our content)…but I think we may be able to do a better job with taking this advice to the next level - categories, etc

Random Other Thoughts…

We have implemented comment nesting and have removed the "in reply to" tabs… but we still have the headers. My concern is that pushing that stuff (user image, username, title, rank, etc.) to the left by stacking it will make the minimum comment height very tall and cause much more scrolling than necessary…

Just wanted to say thank you to everyone that contributed here…

We’re still battling and I’ll keep you updated as we keep working our way through the merge process, etc…

Thanks!
Patrick


(Sarah Hawk) #19

Curious about this. Can you explain how it works?


(Patrick Curtis) #20

The Halls of Fame are stickied threads that highlight the best content from that specific forum over the years…

here is an example: http://www.wallstreetoasis.com/forums/wso-hall-of-fame-investment-banking-forum

I’m trying to think of other ways we may be able to “segment”/curate our content to have more “master pages” that are great references for members and visitors…