Berlin Buzzwords 2016: what was hot and what was not?

Berlin Buzzwords 2016, the seventh edition of Germany's leading conference on open source Big Data technologies, was held from 5-7 June at the Kulturbrauerei in Berlin. A very interesting venue for cultural events and a protected heritage site, the Kulturbrauerei is a spacious former brewery with many courtyards and buildings.

The program was packed with interesting sessions, organized around three basic tracks: Search, Store and Scale. The "Scale" track covers topics such as Apache Kafka, Flink and Spark. The "Search" track deals with Apache Solr, Lucene, Elasticsearch and machine learning. The "Store" track discusses Apache Parquet, Cassandra and NoSQL databases.

To help attendees choose talks matching their skill level, each presentation was tagged with its "entry barrier": Beginner, Intermediate or Advanced.


It felt like a community, with many of the same people participating every year (including me, attending for the second time) and several social channels: the #bbuzz Twitter hashtag, a Facebook page, a website and a YouTube channel. Within a few days most of the slides and videos were available online (https://berlinbuzzwords.de/16/sessions).

Sunday – BarCamp

Buzzwords kicked off with the barcamp session, held in "un-conference" fashion: the attendees suggest the topics to be discussed, each within a 30-minute timebox. This bottom-up approach allowed me to attend discussions on a wide range of topics. Here are the basic lessons:

Docker: Image size matters when deploying to production. Also, use your own (private) images. You have to know where your application stores its data before you dockerize it.

Word2Vec model in NLP applications: Widely used, although for Dutch the available language resources are lacking.

Deep Learning: I left with the impression that it is a re-branding of neural networks for Big Data.

Monday

Building a Real-time news Search Engine: Ramkumar Aiyengar talked about the backend behind News Search at Bloomberg LP. Four main goals were set: make it work, make it fast, make it stable, make it better. The architecture was redesigned around Solr/Lucene to achieve scalability. A noteworthy remark from Ramkumar: customers had very personalized requests, so Bloomberg exposed the search query language to them.

Real-time analytics with Flink and Druid: Jan Graßegger's talk was about building a streaming-only data processing pipeline with Kafka, Flink and Druid. This basically constitutes a lambda architecture, which proves very useful when you want to draw conclusions from live data with minimal latency but also store and process your historical data with the same codebase.

The Stream Processor as a Database: Building Online Applications directly on Streams with Apache Flink and Apache Kafka: In the same domain as the previous talk, Stephan Ewen, from Apache Flink, focused on continuous processing of data that is continuously produced. He presented a setup where the stream processor (Flink) takes the role of the database, storing and continuously updating the processing results. The so-called Queryable State in Flink mitigates the big bottleneck of long communication times with the external key/value stores otherwise used to publish real-time results.
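
The idea of querying a processor's own state, rather than a copy of it in an external store, can be illustrated with a toy sketch. This is not Flink's API: the class `CountingProcessor` and its methods are hypothetical names chosen for illustration, and the example assumes events arrive as simple (key, amount) pairs.

```python
from collections import defaultdict


class CountingProcessor:
    """Toy stream processor keeping a running total per key in local state.

    Hypothetical illustration only; it does not mirror Flink's classes.
    """

    def __init__(self):
        # Keyed state lives inside the processor, not in an external store.
        self._state = defaultdict(int)

    def process(self, event):
        # Each event is assumed to be a (key, amount) pair.
        key, amount = event
        self._state[key] += amount

    def query(self, key):
        # Read-only lookup against the live state; unknown keys yield 0
        # and are not added to the state.
        return self._state.get(key, 0)


processor = CountingProcessor()
for event in [("clicks", 1), ("views", 2), ("clicks", 3)]:
    processor.process(event)

current_clicks = processor.query("clicks")
```

Because reads go straight to the state that the processing step updates, no separate write to a key/value store is needed before a result becomes visible.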

Apache Lucene 6: What's coming next? Uwe Schindler gave a thorough presentation of the new features of the upcoming Lucene 6. A new data type called "points" (also known as dimensional values), with corresponding queries, serves as a faster, multidimensional replacement for NumericRangeQuery. Lucene 6 will also completely remove the concept of "filters" from the query API in favour of non-scoring queries. Last but not least, Okapi BM25 (a bag-of-words retrieval function) will be the default scoring algorithm, replacing TF/IDF. Apache Solr 6 will be bundled with the Lucene 6 release. The most important features here are the SQL parser and the new Streaming API running on SolrCloud, Cross Data Center Replication, and GraphQuery for graph traversal.
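
To make the BM25 scoring function mentioned above more concrete, here is a minimal sketch of the textbook formula in plain Python. It is an illustration under assumptions, not Lucene's implementation: documents are assumed to be pre-tokenized lists of terms, the function name `bm25_scores` is made up for this example, and the parameter values k1 = 1.2 and b = 0.75 are the commonly used defaults.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document against the query terms with BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # IDF variant that stays positive for very common terms.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates; longer documents are penalized via b.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    ["solr", "lucene", "search"],
    ["lucene", "lucene", "lucene"],
    ["kafka", "flink"],
]
scores = bm25_scores(["lucene"], docs)
```

Unlike plain TF/IDF, repeating a term three times does not triple its contribution: the term-frequency component flattens out as the count grows, which is one of the main reasons BM25 tends to rank better on real text.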

Parallel SQL and Streaming Expressions in Apache Solr 6: Shalin Shekhar Mangar discussed Solr's Parallel Computing Framework, including Parallel SQL, which provides a simplified interface for parallel execution of SQL commands across SolrCloud collections, and the Streaming API and Streaming Expressions for parallel computation. The framework also performs operations such as sorting and shuffling inside Solr, using a Map/Reduce implementation for massive speedups, and provides query optimization based on best practices.
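
As a rough sketch of what a Parallel SQL call can look like from a client, the snippet below builds (but does not send) an HTTP request for Solr's `/sql` handler, passing the statement in the `stmt` parameter and selecting the Map/Reduce aggregation mode. The host, the collection name `news`, the field `source` and the helper `build_sql_request` are all assumptions made for illustration; check the Solr reference guide for the parameters supported by your version.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_sql_request(base_url, collection, stmt, aggregation_mode="map_reduce"):
    """Build a POST request for a collection's /sql handler (not sent here)."""
    url = f"{base_url}/{collection}/sql?" + urlencode(
        {"aggregationMode": aggregation_mode}
    )
    # The SQL statement travels form-encoded in the request body.
    body = urlencode({"stmt": stmt}).encode("utf-8")
    return Request(url, data=body, method="POST")


# Hypothetical collection and field names.
request = build_sql_request(
    "http://localhost:8983/solr",
    "news",
    "SELECT source, count(*) FROM news GROUP BY source",
)
```

Sending the request with `urllib.request.urlopen(request)` would require a running SolrCloud instance that has the collection in question.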

What We Talk About When We Talk About Distributed Systems: Alvaro Videla gave a very interactive and provocative talk about the issues that arise when designing and implementing distributed systems. He reviewed the different models: asynchronous vs. synchronous distributed systems; message passing vs. shared memory communication; failure detectors and leader election problems; consensus and different kinds of replication. Alvaro also pointed to some well-regarded books and papers on the subject during his talk.

Fast Cars, Big Data – How Streaming Can Help Formula 1: Ted Dunning, always an entertaining speaker, showcased a high-fidelity, physics-based automotive simulator that produces realistic, but fake, data from simulated cars running on the Spa-Francorchamps track. The resulting data can be used to prove, test and tune software architectures as well as to simulate system failure scenarios. The talk was also a great demonstration of how to synthesize data based on KPI Matching Simulation and then move that data using messaging systems like Kafka.

The final event of the day was the FutureTDM cafe, a roundtable discussion organized into four main areas: skills and education, the economic influence of TDM, technical barriers to adopting TDM, and legal and IPR barriers. It was a free-form discussion where representatives from business and academia had the chance to openly express their concerns and experiences.

For photos and notes from the FutureTDM cafe, go to the FutureTDM website.


Tuesday

Predictive maintenance: from POC to Production with Spark: Heloise Nonne chose public transportation, in particular trains (which are increasingly digital nowadays), to illustrate predictive maintenance using machine learning techniques. Handling faults on trains proactively, i.e. predicting them in advance, minimizes train delays and reduces maintenance costs. Her approach uses random forests and artificial neural networks with Theano, and transitioned progressively from Python to a distributed environment using Spark and MLlib.

Running High Performance And Fault Tolerant Elasticsearch Clusters on Docker: Rafał Kuć talked about containerized Elasticsearch nodes and how to run them effectively and at scale. It was a focused and technical presentation where Rafał gave useful insight (and command-line examples) on dealing with storage and network, persisting data, improving performance, achieving high availability, monitoring, and metrics. A very important take-away was, once again, the need for security measures when dockerizing an application.

SMACK Stack – Data done Right: Stefan Siprell covered a fully-fledged data platform consisting of Spark, Mesos, Akka, Cassandra and Kafka. He described it as a sweet spot between batch and real-time processing that best suits use cases such as updating news pages, user classification, real-time bidding for advertising, automotive IoT, and Industry 4.0. SMACK is an architecture toolbox that facilitates resilient ingestion pipelines, offers a range of query alternatives, and has baked-in support for management and flow control.


Conclusion

The tracks and discussions of Berlin Buzzwords 2016 resonate with most of the topics of TDM, and of OpenMinTeD in particular. Search technologies are at the core of TDM applications, since they cover a wide range: from low-level text processing to indexing, ranking and personalization. The newest features of Lucene/Elasticsearch/Solr will be used to improve search at both the algorithmic and the performance level. Scale technologies are essential for managing big data and big workloads inside the OpenMinTeD infrastructure. Well-established platforms (Spark, Flink, Kafka) and their combinations (SMACK) pave the way here. Special consideration goes to Docker, which will allow TDM applications to be ported and deployed to any host OS, locally or in the cloud. Lastly, Store technologies are important as the backbone of data management. The champions here are the NoSQL platforms and the way they integrate seamlessly with the Hadoop, Flink and Spark ecosystems. In summary, it was a really worthwhile conference, with first-rate speakers and presentations.

This blog post was written by Byron Georgantopoulos.