Two weeks ago I went on a business trip to San Francisco. As always, I enjoyed being in Silicon Valley and SF. The trip began with Lucene Revolution, which had interesting talks about Lucene and Solr; I presented our Panasonic Search Case Study there. It ended with the TYPO3 Conference in San Francisco, where I spoke about "how a complete Telco runs on TYPO3". In between there was the opportunity to do on-site workshops and meetings with some clients in the Bay Area.
Now that I am back in Germany and have shrunk the to-do queue a bit, it's time to summarize some of the inspiration and information from this trip. In general it's always fascinating how much of internet technology is connected with the Bay Area. Not only are Google, Facebook and co. located there, but you can also go to a Node.js meetup every week and speak with all the core developers, or knock on the doors of the GitHub people...
It's commonly known that the amount of data is growing exponentially and that it's more and more important to have access to the most recent and relevant information. Users also expect search and information delivery to be fast and to show the most relevant results. That means that information may have to be personalized, and that parts of a website need to be different per user (to fit their semantic context). As a consequence, that content may not be cacheable at all anymore.
Lucene and Solr are great open source software for searching large amounts of heterogeneous data. They scale well and offer flexible relevancy (score) calculations.
Here are some tools and services that caught my interest during the conference:
In times of Twitter & Co., real-time search is getting more and more important, and Twitter in particular requires real-time indexing of new content. They are indexing 100 million tweets per day and have about 2 billion searches per day, using Lucene! The Twitter blog explains some more details: http://engineering.twitter.com/2010/10/twitters-new-search-architecture.html
There is also a very nice blog by Lucene core developer Mike McCandless that covers recent work on real-time search: blog.mikemccandless.com
Scaling and handling big data is always very interesting. Someone published a nice quote: "7 Dwarves of Big Data -- Hadoop, MongoDB, CouchDB, Cassandra, HBASE, memcached, Voldemort ...". A quick look at some of these tools:
Apache Hadoop: An open-source framework for reliable, scalable, distributed computing and data storage. For example, it includes components for map-reduce implementations.
MongoDB, CouchDB: NoSQL databases that are built for scaling, including across multiple servers.
Cassandra: "The Apache Cassandra Project develops a highly scalable second-generation distributed database, bringing together Dynamo's fully distributed design and Bigtable's ColumnFamily-based data model."
HBase is the Hadoop database. Use it when you need random, realtime read/write access to your Big Data. This project's goal is the hosting of very large tables -- billions of rows X millions of columns -- atop clusters of commodity hardware.
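To give a feeling for the map-reduce idea mentioned for Hadoop above, here is a minimal single-process sketch in Python. It does not use Hadoop at all: the function names (map_phase, shuffle, reduce_phase, word_count) are made up for illustration, and a real cluster would distribute each phase across many machines.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group all emitted values by key, as a framework would do between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values for each key to get the final word counts."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(documents):
    """Run map, shuffle and reduce in sequence over a list of documents."""
    pairs = []
    for document in documents:
        pairs.extend(map_phase(document))
    return reduce_phase(shuffle(pairs))


if __name__ == "__main__":
    print(word_count(["Lucene and Solr", "Solr and Hadoop"]))
```

The point of the split is that the map and reduce steps have no shared state, so each can run in parallel on different parts of the data.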
Including available semantic information to get better search results is a good idea.
A nice extraction service is Calais: based on large training data, this service can extract semantic context from any English text. Try the online demo with an English newspaper article: viewer.opencalais.com
edismax in Solr 3.1
Old news but still great: the edismax handler is available in Solr 3.1 and makes life easier when you want to use both dismax and Lucene syntax in your Solr queries.
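As a rough sketch of what such a request could look like, the snippet below builds a Solr select URL that switches the query parser to edismax via the defType parameter and lists the searched fields in qf. The host, the field names (title, content) and the helper name build_edismax_url are assumptions for illustration, not taken from a real setup.

```python
from urllib.parse import urlencode


def build_edismax_url(base_url, user_query, query_fields):
    """Build a Solr select URL that uses the edismax query parser."""
    params = {
        "defType": "edismax",          # switch the query parser to edismax
        "q": user_query,               # may mix plain keywords and Lucene syntax
        "qf": " ".join(query_fields),  # fields to search, with optional ^boosts
        "wt": "json",                  # ask for a JSON response
    }
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    url = build_edismax_url(
        "http://localhost:8983/solr/select",  # assumed local Solr instance
        'title:"case study" OR lucene',       # fielded phrase plus a plain keyword
        ["title^2", "content"],               # hypothetical fields, title boosted
    )
    print(url)
```

The query mixes a fielded phrase with a plain keyword, which is exactly the combination that is awkward with plain dismax.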
Stemming and Language detection
Especially for European languages, or languages like Chinese and Japanese, it is hard to do good stemming based on algorithms alone. Basis Technology offers different dictionary-based parsers that do a great job for different languages. Unfortunately the products are not cheap.
TYPO3 Conference San Francisco
Of course one of the highlights was the TYPO3 Conference in San Francisco: great people, great location, great weather...
A special highlight was the keynote by Jez Humble (yes, the one who wrote one of the best IT books, "Continuous Delivery").
Here are some of the many interesting topics at the conference:
As written in previous posts, a continuous delivery process with the help of a deployment pipeline is a very good thing to have. We release nearly every project through an automated deployment pipeline and have learned a lot over the last years. The keynote was a good summary of the core ideas. I like the statement "Without testing the default state of your application is broken - unless you prove otherwise. With a deployment-pipeline and automated tests the default state is ok and you are fine to deploy urgent changes to production."
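The "prove otherwise" idea can be sketched as a simple gate: a build only counts as deployable once every stage has passed. This is a toy illustration, not our actual pipeline; the function name run_pipeline and the stage names are hypothetical.

```python
def run_pipeline(stages):
    """Run each stage in order and stop at the first one that fails.

    Returns (True, None) if the build may be deployed, otherwise
    (False, name_of_failed_stage).
    """
    for name, check in stages:
        if not check():
            return False, name
    return True, None


if __name__ == "__main__":
    # Hypothetical stages; real ones would call the build and test tools.
    stages = [
        ("unit tests", lambda: True),
        ("integration tests", lambda: True),
        ("acceptance tests", lambda: False),
    ]
    deployable, failed_stage = run_pipeline(stages)
    print(deployable, failed_stage)
```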
Also nice was the mention of two possible deployment methods:
canary releasing: A method that routes only some users to the new version. This way you can monitor the application and the user behaviour and then decide whether to roll it out to everyone or not. That's what Google often does, too.
dark launching: A nice method to deploy a new feature that should replace an existing one. You already run the new implementation against real traffic in the background, while the customers are still using the old implementation. This way you can test new implementations with less risk before making them visible.
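Both methods can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the function names are invented, the canary group is picked by hashing a user ID (one of several possible strategies), and in a real setup the routing would more likely live in a load balancer or a feature toggle framework.

```python
import hashlib


def in_canary_group(user_id, canary_percent):
    """Deterministically place a user in the canary group via a hash of the ID."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in the range 0..99
    return bucket < canary_percent


def canary_route(user_id, old_version, new_version, canary_percent=5):
    """Canary releasing: only a small share of users sees the new version."""
    if in_canary_group(user_id, canary_percent):
        return new_version(user_id)
    return old_version(user_id)


def dark_launch(request, old_impl, new_impl, mismatches):
    """Dark launching: run the new code on real traffic, but serve the old result."""
    visible = old_impl(request)
    try:
        shadow = new_impl(request)
        if shadow != visible:
            # Record the difference for later analysis; the user never sees it.
            mismatches.append((request, visible, shadow))
    except Exception as error:
        # A failing shadow call must never affect what the user gets.
        mismatches.append((request, visible, error))
    return visible
```

Hashing the user ID keeps each user on the same version across requests, which makes the monitoring data easier to interpret than a random choice per request.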
Robert and Karsten did a great job preparing nearly 2 days of FLOW3 workshops. All of them are available in the Vimeo channel.
It is only a matter of days until the team finishes the last work on the FLOW3 beta release and the documentation. Then it's time to consider FLOW3 when deciding on a framework for a new project.
Also very interesting was the talk by Andrei about deploying and hosting TYPO3 projects in the cloud. The video is online: Video T3CON11-SF: Fluffy TYPO3 Automatic Deployment of TYPO3 in the Cloud
He also mentioned Chef and Puppet; both are tools that help to automate the setup and configuration of your infrastructure.
I am still searching for the promised code samples in the presentation as well as the mentioned TYPO3 improvements (like storing sessions in a key-value store).
Puppet: With Puppet you can describe your system configuration and dependencies in a central place and automate deployment and infrastructure management. The learning curve is quite steep, but it seems to be worth it. www.puppetlabs.com/puppet/introduction/
Scalr: Seems to be a nice GUI-based tool to set up your first cloud-based infrastructure. www.scalr.net
That's it for now. Here are some relevant links:
- T3CON11 San Francisco Vimeo Channel: vimeo.com/channels/207300/
- Slideshare: http://www.slideshare.net/event/t3con11sf
- FLOW3 News: news.typo3.org/news/article/on-the-road-to-flow3-10-beta-1/
- FLOW3 Tutorials from Thomas: www.layh.com/work/flow3-fluid.html