This post was written by Nicolas Grasset, CTO at Tripl, formerly creative technology director at the Swedish digital agency RIVER and software engineer at Yahoo! Mobile. Follow him on Twitter: @fellowshipofone
Big data is a very hot area right now: new technical possibilities are opening up to smaller companies, and many industries have yet to benefit from big data analysis or “smarter” products. While big data is more common in web analytics, finance and enterprise solutions, Tripl is building a product with Big Data for Travel. We are starting with a consumer product and evolving into an open platform for the industry to tap into. In this post, I describe how we came across these big data problems in everyday life and the challenges involved.
Tripl, time geography and paths
The problem we are trying to solve is simple: we want to help people meet, whether they are traveling or locals in their own city. And while the problem itself is already fairly social, we also want to leverage users’ social graphs together with friend recommendations to make these meetings more meaningful. So in terms of data, we look at four main dimensions:
Location: where are you? We store it as geo-coordinates, visualize it at the city level, and design the UI so that Brooklyn and Manhattan are both distinct parts of New York City at different distances from Jersey City, NJ.
Time: when are you traveling, or when were you last spotted somewhere? Time helps us visualize future plans or recent check-ins.
Social: how are you related to this other person? Are you friends, do you know anyone in common, etc.
Interest: are there any passions or activities that could be reasons for you to meet someone? We use this information mostly for relevance sorting at the moment.
As a result, we can build a time-space path for each of our users, which illustrates the movement and limitations of individuals across time and space, and how these paths may overlap. So at first we want to gather enough data to accurately identify these overlaps in the context of travel, not just when two locals stay at home, and then highlight the most relevant ones.
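As a toy illustration of what an overlap between two time-space paths looks like, here is a minimal sketch in Python. The path representation, names and dates are assumptions made up for this example, not our actual data model:

```python
from datetime import date

# A time-space path: a user's sequence of stays, each a (city, start, end)
# tuple. Structure and names here are illustrative only.
alice = [("New York", date(2012, 5, 1), date(2012, 5, 10)),
         ("San Francisco", date(2012, 5, 11), date(2012, 5, 15))]
bob = [("San Francisco", date(2012, 5, 13), date(2012, 5, 20))]

def overlaps(path_a, path_b):
    """Return the (city, start, end) windows where two paths intersect."""
    hits = []
    for city_a, a_start, a_end in path_a:
        for city_b, b_start, b_end in path_b:
            if city_a == city_b:
                start, end = max(a_start, b_start), min(a_end, b_end)
                if start <= end:  # the two stays actually overlap in time
                    hits.append((city_a, start, end))
    return hits

print(overlaps(alice, bob))  # one San Francisco window, May 13-15
```

The real problem is of course harder: paths have fuzzy boundaries, cities have neighborhoods, and the comparison has to run across millions of users rather than a nested loop over two of them.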
Why now: active vs passive data
There are two ways we can compile information about time and location for users: we can ask them (the active data), or we can infer that they are traveling from data we gather on other services (most often the passive data).
There are different ways to collect future travel intent. Tripl has its own interface for storing planned trips, and we have already integrated with TripIt to import trips planned there. Going forward, we plan on connecting with other social media services, and most importantly with actual travel agencies, airlines or hotel websites, which hold even more accurate plans.
But as we try to make our service useful immediately to any new user, waiting for a critical mass of users is not an option. In the past months, Facebook started geo-tagging every photo and status message, and Foursquare and Instagram have been growing their user bases faster than ever. Much of this information, such as geo-tagged photos or check-ins, can give hints about unannounced trips or future plans. This is the passive data that is becoming key to making Tripl accurate already at signup, while decreasing our need for a critical mass of users.
Juggling different database solutions
A typical query is: “Who will be in San Francisco next weekend, for how long, and how do I know them?”
User location in time: One way to look at it is to first solve the user-location-in-time part of the problem, “in San Francisco next weekend and for how long”. This is easily scalable: we can just store user locations with dates as layers. Each entry, whether a home location or a trip, is rarely updated, but as we build up integrations, more entries are added all the time, and the read frequency is obviously very high. NoSQL databases such as MongoDB are well suited to this problem.
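To make the “layers” idea concrete, here is a sketch of that document model in plain Python, emulating the date-range query a document store would run. The field names and sample data are assumptions for illustration, not our production schema:

```python
from datetime import date

# One document per stay ("layer"): rarely updated, frequently read.
locations = [
    {"user": "alice", "city": "San Francisco",
     "start": date(2012, 6, 8), "end": date(2012, 6, 10)},
    {"user": "bob", "city": "San Francisco",
     "start": date(2012, 6, 1), "end": date(2012, 6, 30)},
    {"user": "carol", "city": "New York",
     "start": date(2012, 6, 8), "end": date(2012, 6, 10)},
]

def in_city_during(docs, city, start, end):
    """Who is in `city` at any point between `start` and `end`?
    Equivalent to an indexed equality + range query in a document store."""
    return [d["user"] for d in docs
            if d["city"] == city and d["start"] <= end and d["end"] >= start]

print(in_city_during(locations, "San Francisco",
                     date(2012, 6, 9), date(2012, 6, 10)))
# → ['alice', 'bob']
```

In MongoDB this becomes a single query over an index on city and dates, which is why the read-heavy workload scales so well.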
Social: The second part of the question, “how do I know them”, is more typically a graph problem. As long as the number of users remains reasonable, we can easily emulate graph behavior with a relational database. The social graph is frequently updated with new nodes (users) during the early stages of the product, but as their friends join, the graph edges (relationships) become the elements updated most often. In the context of travel, the first degree of separation (friends) and the second degree (friends of friends) are by far the most interesting, but two levels still mean a very large number of nodes to update frequently. That makes caching at the user level very inefficient and the use of a relational database expensive. This is why we are exploring different scaling options such as dynamic caching with Redis or new projects on distributed graph databases based on Hadoop.
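Emulating the graph in a relational database boils down to self-joining a friendship table. A minimal sketch using SQLite (the table, column and user names are made up for the example; the same join works on any SQL engine):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE friends (a TEXT, b TEXT)")

# Store each undirected friendship as two directed rows.
for a, b in [("alice", "bob"), ("bob", "carol"), ("carol", "dave")]:
    db.executemany("INSERT INTO friends VALUES (?, ?)", [(a, b), (b, a)])

def friends_of_friends(user):
    """Second degree of separation: friends of friends, excluding
    the user and their direct friends. One self-join per extra degree."""
    rows = db.execute("""
        SELECT DISTINCT f2.b
        FROM friends f1
        JOIN friends f2 ON f1.b = f2.a
        WHERE f1.a = ? AND f2.b != ?
          AND f2.b NOT IN (SELECT b FROM friends WHERE a = ?)
    """, (user, user, user)).fetchall()
    return sorted(r[0] for r in rows)

print(friends_of_friends("alice"))
# → ['carol']
```

Each additional degree adds another self-join, which is exactly why this approach gets expensive as the graph grows and why graph-native or cache-backed options become attractive.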
So one of the biggest challenges Tripl is setting out to solve is combining the time-geography queries with the social-graph queries. There are several ways of doing this once data is stored on multiple data stores instead of a single relational database cluster:
if a user has a small network and travels to a busy city, it might be more efficient to first get the list of friends and friends of friends, and then check them against the city’s travelers and locals at that time
if a user has a large social network and travels to a quiet city, it might be more efficient to first get the list of travelers and locals in the city at that time, and then check them against the social network
and if no clear pattern emerges, we then need to handle more cases, such as extending the radius around a city when the number of visiting users is too small (think of backpackers in Southeast Asia), or simply relying on a more appropriate database engine once the data set gets too big
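The first two cases above are really one decision: intersect two sets that live in different stores, scanning the smaller set and probing the larger one. A sketch, with made-up data standing in for the results of the social and location queries:

```python
def known_people_in_city(friend_ids, visitor_ids):
    """Intersect the social set with the city set, iterating over the
    smaller side. In practice the two sets come from different data
    stores, so this order decides which store absorbs the bulk lookups."""
    small, large = sorted((friend_ids, visitor_ids), key=len)
    return {person for person in small if person in large}

friends = {"bob", "carol", "dave"}                             # small network
visitors = {f"user{i}" for i in range(10000)} | {"carol"}      # busy city
print(known_people_in_city(friends, visitors))
# → {'carol'}
```

With a small friend set and a busy city this scans three people instead of ten thousand; with the sizes reversed, the same rule flips the order, which is the heuristic behind the two bullet points above.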
In the end, much of the problem is about understanding the data, simplifying the queries and visualizations to focus on what matters, and relying on statistics and pre-computing based on patterns (conferences, spring break, summer vacation, holidays, …). We are just getting started; it will get more complex with many more data types, and this is an exciting area to be working in as new projects emerge all the time. We hope to be able to contribute ourselves, and for that we are hiring, so feel free to contact us!