When you write AI and social in the same sentence you typically end up somewhere close to ad targeting and selling member metadata to third parties. At Haaartland we go about things a bit differently. Since we don’t have any ads on our platform the ad targeting part just doesn’t apply to us. And we have, since day one, pledged to never sell your data to anyone else. Your data is your data.
However, to be able to continuously improve our platform and recommend relevant content to members in our communities we still need to track things like clicks, reads, and claps in addition to various other metrics that we collect for performance optimizations. We use the same type of algorithms as the rest of the industry, but the purpose is always to try to help the community help itself. Our goal is never to just maximize time spent habitually browsing a feed hoping for new content (to have more time to show ads). We want interactions with the platform to have a higher purpose. That’s why we don’t use any algorithms (or random cat videos) to populate your community and room feeds. The content you see there is the content from the community in a predictable order.
Topic modeling is at the core of Haaartland’s AI. We decided early on that content recommendations needed to be based on three things: interest, peers, and the content itself.
Content-Based Recommendations (Topics)
We generate topic models from the posts in a community and then use the underlying topics to recommend and search that community’s collection of posts. Topic search is a great complement to our more standard keyword search and can often return relevant results even when keyword search fails to find anything. Sometimes it is hard to know exactly which keywords to use, and this is where searching based on topics makes sense, especially on a large data set.
Interest (For You)
Since we internally keep track of what members show interest in (popularly named the ‘clickstream’), we can personalize content recommendations based on what the member has shown interest in earlier. This still uses the same topic models to find recommendations, but the ‘prediction’ of what the member would like to read next is based on the type of content within the community that the member has already read. We currently don’t promote newer posts over older ones but treat all posts equally and trust that a community’s content trove stays relevant over time. Again, Haaartland’s goal is not to create flame wars over current events (there are plenty of those platforms around already), but to cater for deeper, fact-based discussions.
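To make the idea concrete, here is a minimal sketch of interest-based ranking. It assumes each post already has a topic distribution from the topic model and that a member’s interest can be summarized as the average of the posts they have read. The names (post_topics, interest_profile, recommend_for_you) and the choice of cosine similarity are illustrative assumptions, not a description of Haaartland’s actual implementation.

```python
import math

def cosine(a, b):
    """Cosine similarity between two topic vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def interest_profile(read_ids, post_topics):
    """Average the topic vectors of the posts a member has read."""
    vectors = [post_topics[pid] for pid in read_ids]
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def recommend_for_you(read_ids, post_topics, limit=2):
    """Rank unread posts by similarity to the member's interest profile.

    Post age plays no role here: all posts are treated equally.
    """
    profile = interest_profile(read_ids, post_topics)
    unread = [pid for pid in post_topics if pid not in read_ids]
    unread.sort(key=lambda pid: cosine(profile, post_topics[pid]), reverse=True)
    return unread[:limit]

# Hypothetical topic distributions for five posts over three topics.
post_topics = {
    "p1": [0.90, 0.05, 0.05],
    "p2": [0.80, 0.15, 0.05],
    "p3": [0.10, 0.85, 0.05],
    "p4": [0.70, 0.10, 0.20],
    "p5": [0.05, 0.05, 0.90],
}

# A member who has read p1 and p2 (both dominated by the first topic).
recommendations = recommend_for_you({"p1", "p2"}, post_topics)
print(recommendations)  # ['p4', 'p3']
```

The post dominated by the same topic as the member’s reading history (p4) ranks first, regardless of when it was written.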
Collaborative Filtering (Peers)
Keeping track of members’ clickstreams also gives us a chance to find members that seem to be interested in the same topics. Given that we have a group of members that show similar interests, we can use that knowledge to recommend missed content nuggets that peers seem to enjoy. This also helps solve the cold-start problem where new members can get relevant recommendations based on work done by older members.
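As a rough illustration of peer-based recommendations, the sketch below compares members by the overlap of the posts they have read and suggests posts that similar peers have read but the member has not. The read sets, the function names, and the use of Jaccard similarity are all assumptions made for this example. A real system would likely use richer signals such as clicks and claps.

```python
from collections import defaultdict

def jaccard(a, b):
    """Overlap between two sets of read posts, from 0.0 to 1.0."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def recommend_from_peers(member, reads, limit=3):
    """Suggest posts the member missed, weighted by how similar each peer is."""
    seen = reads[member]
    scores = defaultdict(float)
    for peer, peer_reads in reads.items():
        if peer == member:
            continue
        similarity = jaccard(seen, peer_reads)
        for post in peer_reads - seen:
            scores[post] += similarity
    # Highest score first; ties are broken by post id for a stable order.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [post for post, score in ranked if score > 0][:limit]

# Hypothetical clickstream data: which posts each member has read.
reads = {
    "anna": {"p1", "p2", "p3"},
    "ben": {"p1", "p2", "p4"},
    "carla": {"p2", "p3", "p4", "p5"},
    "dave": {"p6"},
}

peer_picks = recommend_from_peers("anna", reads)
print(peer_picks)  # ['p4', 'p5']
```

Here p4 ranks first because both peers who resemble anna have read it, while p6 is left out because dave shares no reading history with her.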
Our topic modeling algorithm of choice is Latent Dirichlet Allocation (LDA), one of the more popular and well-proven algorithms in the industry. We use LDA unsupervised, meaning that the topics are latent topics the algorithm discovers in the community’s collection of posts, not manually curated ones. We do offer a way of tagging content manually, but that is different and separate from the latent topics.
You can think of LDA as your community’s private librarian working relentlessly to organize your books (posts) into the shelves of your community library. Every time a new book is added to your library, the librarian may decide it is time to add a new shelf or reorganize all the existing shelves. As the library grows so does the value of having your private librarian working 24/7.
Topics are derived from the corpus (the collection of text documents, typically the posts in your community) and are a statistical snapshot of the underlying topic distribution. Topics are typically represented visually as word clouds or just the most representative words from the topic. Some examples: [content marketing social] or [living diabetes chronic]. Every text document will have some contribution by every topic, but typically one or a few topics will dominate for every text document. For humans, 20 topics seem to be the upper threshold for what can be understood, though machines downstream can benefit from more topics (especially for large data sets).
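For readers who want to see the mechanics, below is a toy LDA trainer using collapsed Gibbs sampling, written with the Python standard library only. It is a sketch for intuition, not the implementation running on the platform. The function name train_lda, the hyperparameters alpha and beta, and the four tiny example posts are assumptions for illustration. In practice you would reach for an established library.

```python
import random
from collections import Counter

def train_lda(docs, num_topics, alpha=0.1, beta=0.01, iterations=200, seed=0):
    """Toy collapsed Gibbs sampler for LDA; docs is a list of token lists."""
    rng = random.Random(seed)
    vocab_size = len({word for doc in docs for word in doc})
    doc_topic = [[0] * num_topics for _ in docs]         # topic counts per document
    topic_word = [Counter() for _ in range(num_topics)]  # word counts per topic
    topic_total = [0] * num_topics                       # total words per topic
    assignments = []

    # Start by assigning every word occurrence to a random topic.
    for d, doc in enumerate(docs):
        assignments.append([rng.randrange(num_topics) for _ in doc])
        for word, k in zip(doc, assignments[d]):
            doc_topic[d][k] += 1
            topic_word[k][word] += 1
            topic_total[k] += 1

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                # Remove the current assignment, then resample it given all others.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][word] -= 1
                topic_total[k] -= 1
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][word] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][word] += 1
                topic_total[k] += 1

    # Smoothed topic mixture per document: every topic contributes a little.
    theta = [
        [(count + alpha) / (len(doc) + num_topics * alpha) for count in counts]
        for doc, counts in zip(docs, doc_topic)
    ]
    # The most frequent words per topic serve as a human-readable label.
    top_words = [
        [word for word, count in topic_word[t].most_common(3) if count > 0]
        for t in range(num_topics)
    ]
    return theta, top_words

# Hypothetical mini corpus built from the example topic words above.
posts = [
    "content marketing social content marketing".split(),
    "social marketing content social".split(),
    "living diabetes chronic living diabetes".split(),
    "chronic diabetes living chronic".split(),
]

theta, top_words = train_lda(posts, num_topics=2)
for mixture in theta:
    print([round(share, 2) for share in mixture])
print(top_words)
```

Each row of theta is one post’s distribution over topics and sums to 1. Because the sampler is random, which topic ends up with which words depends on the seed, so the printed values are not shown here.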
Now, if we ask for two or three topics, visualizing them becomes trivial because they fit naturally into a 2D/3D plot. This can be useful for understanding the algorithm, but in any real scenario we are likely asking for more than 3 dimensions (topics). To bridge this gap we project the many dimensions (for example, 20 topics) onto a 2D/3D space using another statistical method, t-distributed stochastic neighbor embedding (t-SNE). Exploring the t-SNE visualization of the topic model makes the corpus easier to understand. Beyond the topics themselves, various types of metadata can be overlaid, such as popularity (read/comment counts per post) or sentiment (average sentiment of a post’s comments). This helps community owners better understand their corpus: which topics are popular, where heated discussions most often appear, and so on. Even if the topic model is not used for anything else, anomalies such as duplicates or garbage posts without any real content typically stand out in the visualization. As a corpus grows, so does the need for curation.
Visualization of a corpus for topic modeling: 20 newsgroups (20ng), a data set commonly used as a benchmark for classifying text documents. The test corpus holds about 18.5k documents (in labeled newsgroups) about anything from religion to computer hardware. In the visualization, the color of each data point comes from its newsgroup. Looking at the colors of the clusters, you can clearly see that documents originating from the same newsgroup typically end up near each other in the plot. Naturally, even in a controlled test data set like this, some documents will be off-topic, but the overall clusters still show strong separation. The bars (top left) show each document’s distribution over topics. Remember, each document will have some contribution by every topic, but the further out in each cluster you get, the more dominant one topic becomes.
Once the topic model has been trained, recommendations and search can be served via inference (aka prediction). There are two ways to do this: look up the ‘position’ of an existing document and return its closest neighbors, or infer the ‘position’ of a new document in the model and then return its closest neighbors. (Remember that the number of topics is likely greater than the 3 dimensions we can represent visually.)
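Both lookups boil down to a nearest-neighbor search in topic space, which can be sketched as follows. The topic distributions are made up, query_position stands in for whatever the trained model would infer for a new document or search text, and Hellinger distance is just one reasonable choice for comparing probability distributions. None of this is claimed to be what the platform uses.

```python
import math

def hellinger(p, q):
    """Hellinger distance between two topic distributions (0 = identical, 1 = disjoint)."""
    return math.sqrt(0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)))

def nearest_posts(position, post_topics, limit=2, exclude=None):
    """Return the ids of the posts whose topic distributions lie closest to position."""
    candidates = [pid for pid in post_topics if pid != exclude]
    candidates.sort(key=lambda pid: hellinger(position, post_topics[pid]))
    return candidates[:limit]

# Hypothetical positions of four existing posts in a three-topic model.
post_topics = {
    "p1": [0.90, 0.05, 0.05],
    "p2": [0.80, 0.15, 0.05],
    "p3": [0.10, 0.85, 0.05],
    "p4": [0.05, 0.05, 0.90],
}

# Case 1: recommendations for an existing post (look up its position).
similar_to_p1 = nearest_posts(post_topics["p1"], post_topics, exclude="p1")
print(similar_to_p1)  # ['p2', 'p3']

# Case 2: a new document or search query. Its position is assumed here;
# in a real system it would be inferred by the trained topic model.
query_position = [0.15, 0.80, 0.05]
matches_for_query = nearest_posts(query_position, post_topics)
print(matches_for_query)  # ['p3', 'p2']
```

With many posts and topics, a brute-force scan like this would typically be replaced by an index built for nearest-neighbor search, but the principle stays the same.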
This is the essence of what we use our topic model for in the system, but visualizations can also be used to further understand and explore a corpus using other overlaid metadata.
Hopefully, this should give you an idea of some of the work we do for our communities behind the scenes here at Haaartland. There’s of course a lot more going on technically to implement all this, but for you as a manager or member of a community running on Haaartland, this is another thing you don’t have to worry about.