dotScale 2016

After dotSecurity last Friday, it is now time to go to dotScale. This follows the same pattern as the dotCSS/dotJS conferences. One short (one afternoon) on the Friday and one longer (one full day) on the next Monday.

dotScale

As usual, this took place in the Théatre de Paris. The main issue of the last dotJS that took place here (the main room getting hotter and hotter along the day) was fixed, which let us completely enjoy the talks. There were more than 800 attendees this time.

Mickael Rémond

First talk was by Mickael Rémond, about the Erlang language. It was a high level talk, explaining the history of the language.

Mickael

It started in 1973, in big telecommunication firms. It follows what is called the Actor Model. Each actor has a single responsibility and can send messages to other actors. It can also create new actors and process its list of incoming message in a sequential order, one at a time. Each actor is an independent entity and they share nothing.

By definition they are scalable, because they all do the same thing without being tied to any specific state or data. You can distribute them on one machine, or across several machines in a cluster.

In 1986, Erlang was officially born. It added a second principle of "let it crash". As actors have one responsibility, then do not have to bother about handling errors. If it fails, it crashes, and the whole language embraces that. Other actors can be added on top of the first to handle the errors, but the main responsibility of each actor does not care about errors.

More than ten years later, in 1998, Erlang was finally released as open-source. The most interesting features of Erlang are not the language itself, but all the primitives of it that are implemented by the Erlang VM. Today new languages are born on top of Erlang, that uses the same VM, like Elixir (done by the core contributors of Ruby on Rails, it is gaining some popularity lately).

Overall the talk was a nice way to start the day. Even if not directly a story of scaling, it was great generic knowledge.

Vasia Kalavri

Next talk continued on that generic knowledge pattern, with Vasia Kalavri who told us more about graph databases. I didn't really see the link with scaling here either, I must say. The slides are available here.

The dotPost

In graph databases, relationships between objects are first class citizen, even more than the objects themselves. You can run machine learning algorithm on those relationships, and infer recommendation, "who to follow"-like features. Even search in a graph database is nothing more than a shortest path algorithm. Even the Panama Paper leaks was mapped using a graph database.

When we start talking about big data, we often think about distributed graph processing. Vasia was here to tell us that distributed processing was not always the right answer for graph databases. Because we are mapping relationships between objects, the size of the dataset to map does not have a direct correlation with the final size of the data in memory. You might think that with your TB of user data you might need a huge cluster to compute everything. Actually, to process its "who to follow" feature, Twitter only needs 64GB of RAM. And that's more than 2 billion users. When you start thinking you'll need a distributed cluster to host all your data, please think again. The chances that you have more users than Twitter are actually quite low.

If storage space is not a relevant argument, maybe speed is? You might think that using a cluster of several nodes will be faster than using one single node. Actually, this is not true either. The biggest challenge in building graph databases is cleaning the input. You will aggregate input from various sources, in various formats, at various intervals. First thing you have to do is clean it and process it, and this is what will take the longest time. When you think of it, your data is already kind of distributed across all your sources. Just process/clean it where it is before sending it to your graph database.

Then the talk went deeper into the various applications of graph databases, like iterative value propagation, which is the algorithm used for calculating things like Google PageRank. You basically compute a metric on each node and aggregate it with the same metric from the neighbor nodes.

It then went into specifics of Flink, Sparks and Storm but I didn't really manage to follow here, except that the strength of Flink was its decoupling of streaming and batch operation while Spark does streaming through micro-batches.

Not really what I was expecting from dotScale here, but why not.

Sandeepan Banerjee

After that the ex-head of data of Google, Sandeepan Banerjee, talked about containers and how they deal with state.

Sandeepan

His point was what we've all heard for the past two years. How Docker is awesome, and how Docker still does not answer the question of stateful data. How do you handle your dependencies in a container? How do you handle logging from a container? How do you run health checks on containers? Where do you save your container database?

As long as your container only compute data coming to it, and sending it back, you're fine. But as soon as you have to store anything anywhere, your container starts to be linked to its surroundings. You cannot move it

When an issue is raised in production, it would be nice to be able to get the container back with its data to your dev environment and run your usual debugging tools on it, with the production data. It also raises the question of integration testing of containers. How do you test that your container is working correctly, when it is talking to other micro-services?

In a perfect world, you would be able to move a container with its data from one cloud provider to another, as easily as if there was no data attached. You could move it back to local dev as well and share the data with other teams so each can debug the issue in their service, with the same data from production. But we're not in an ideal world.

He then gave his solution, which still does not have a name, of a way to organise containers by grouping them and moving the whole group at once. Even if each internal container will be stateless, the group itself will have knowledge of the associated data and relationships between containers, thus becoming stateful.

This solution would be based on a git-like approach, where the snapshots would not only contain relationships and data, but would also include a time dimension. This would allow you, just like in git, to get back in time and reload the snapshot like it was two days ago, and share a unique reference to that moment in time with other teams. He also talked about using the same semantics of branches and merges across snapshots.

He went even further talking about a GitHub-like service where you could push your container snapshots and share them, versioning your infra as you go and integrate it into CI tools to deploy a whole network of containers with a simple commit.

Overall this looked like a very promising idea, even if we are really far from it. A few other talks during the day mentioned the Git approach of branches and unique commit identifier as a way to solve a lot of issues. I hope this project will be released one day because if it actually fixes the issues that were raised, it will greatly help. Still, today it felt more like a wish list than anything real.

Oliver Keeble

Next talk was one of the most interesting of the day, one that makes you put in perspective your own scalability issues. It was about the Large Hadron Collider of the CERN, and how they manage to handle all the data they got from the experiments.

Oliver

The LHC is the world's largest particle accelerator, it takes the form of a 27km long underground tunnel where they collide particles at the speed of light, and analyse the results. It's an international scientific collaboration. When the collision happens, things happen, things disappear, and the only things that are left are the data the machines where able to capture.

Because such an experiment is a huge scale, they need to capture as much data as possible per experiment. They have more than 700 tons of detection devices, that record data at a rate of 40 millions times per seconds. That is a lot of data to capture, send and process.

Because there is so much data, they know they won't be able to capture all of it, so they have to make estimates based on what they captured, and extrapolate from that. To do so, they have to calibrate the machines with empty tests, to see the amount of data they manage to capture from a known experiment. Then, it's just drawing a statistical picture for when they will do it with more data.

In addition to data that can fail to be recorded, data transmission can also fail. Sending so much data from the source of the experiment to the machines that will process it will obviously incur some data loss in the transfer. To counter that, they have a validation mechanism to check that the received data is not corrupted. It makes the transfer slower, but is needed because every piece of information counts.

They also have a dynamic throttling of the data sent. They analyze in real time the percentage of success of such checksum validation, and if the percentage improves, they will push more data. If it decreases, they will push less, which should dynamically alter the throughput to be the most efficient. As the speaker said:

You'll always be disappointed by your infrastructure, then better to embrace it.

Because of its international structure, the data analysis of the data is done in parallel in every parts of the globe, along what they call the grid. Machines in all lab research facility can help computing the data. They just register to a queue of available machines, announcing their current load. The CERN on the other end will submit jobs to another queue and attribute each job to a matching machine. Once the job is finished on the machine, it will just call home with the results and final results of all jobs are merged into a new queue.

It takes 25 years to build such a collider, and they are building a new one, a 100km ring around Geneva, so they expect to catch 10 times more data in the future.

That was the kind of talk I like to see at dotScale. Real world use cases of crazy scalability issues and how to solve them.

Lightning talks

After the lunch break was time for the lightning talks. The overall quality was much better than at dotSecurity.

lightning

Datadog

First talk was by a guy from Datadog, but he never advertised its company directly. Instead he used the same colors as the company logo, and talked about a dating website for dogs. You know, a place to date a dog… It was not completely obvious, so fair enough :)

His talk was about the need to monitor everything, not only some key metrics, because you never really know what metrics you should be careful about. By logging more and more data, you can correlate parts of your infra/business together. The more you monitor, the more you get to know your data and you can refine non-stop what you should be looking at.

Their pricing model is based on how many metrics you keep track of, so I'm not sure how biased his speech was, but the idea is interesting nonetheless.

CRDTs

Then Dan Brown came again (he was here the year before as well) to talk about CRDT, or Conflict-free Replicated Data Types. This gives a set of guidelines to know how to reconcile conflicts across two different versions of a data in an eventually consistent architecture.

By defining a standard set of rules, conflictual merges can be reasoned about with the same outcome by everybody. When you have to changes on a boolean value, let's use true by default. When changing the value of a counter, let's always use the most positive (or negative) value. When changing the value of a simple key/value pair, let's say that the last write wins. For arrays, if you happen to have several conflicting add/remove jobs, give priority to the additions.

This will not avoid conflicts, but will give a commonly known set of rules to automatically handle them, without requiring manual interaction.

Scaling bitcoins

Then was Gilles Cadignan, about how to make a proof of existence using bitcoins. He started with the assumption that everybody knew what bitcoin is and roughly how it works. That was my case, I still only roughly get how it works.

In a nutshell, he explained how to embed some document hashes into the bitcoin chain to prove that a document existed at a given date. Not sure I can give more explanation than that, sorry.

Monorepos and many repos

Then, Fabien Potencier from Symfony explained how the monorepo organization of Symfony is working. They currently have 43 projects in the same monorepo. They use this repo for development, not for deployment. They have scripts to split this big monorepo into individual repositories.

They still actually use both one big monorepo and several small many repos. I'm not sure I get the story completely straight, because when I talked about it with other attendees, we all understood something different. But what I get was that the monorepo is the source of truth, this is where all the code related to the Symfony project is hosted. It makes synchronizing dependencies and running tests much easier as everything is in the same place.

But they still automatically split this main monorepo into as many repos as there are projects. Those projects actually reference the same commit id as the one in the main monorepo. By doing so they still keep a logical link between commits in two repos, but they can give specific ACL to people on specific parts of the project.

What I did not understand is how you contribute. Do you submit a PR to the main monorepo and it will trickle down to the simple repo, or do you submit a PR to the small repo and it will then in turn submit a PR to the main one?

Juan Benet

After that, it was time to get back to the real talks. The next one was by Juan Benet, about IPFS, the InterPlanetary File System. One of the most awesome and inspiring talks of the day. To be honest, there was so much awesome content in it that it could have lasted the whole day.

Juan

He started by reminding us how bad the whole concept of centralizing data is. Distributed data is not enough, you just create local copies of your data, but they still all depend on a central point. What we need is decentralized. We have more and more connected devices in our lives, but still, it is not possible, as of today, to easily connect two phones together, directly. We still have to pass through external servers to send files. Considering the machine power of the devices we have in our pockets, this is insane. Those devices should be able to speak to each other without requiring a third party.

Centralization is also a security issue. Even if the connection is encrypted, the data stored on the machine is not. It has to be stored in plain somewhere, which means that it can be accessed, copied or altered.

Most of the time, we cannot directly access the data. We always have to use a specific application to access its data. There is no way to directly connect to a DB of a distant server for example, we have to go through their API (when it exists).

Webpages are getting bigger and bigger, but we do not really care because we have faster and faster data connection. Except when we don't. When we are in the countryside or in the subway for example. Or living in a country where the data connection infrastructure is not as good. And we're not only talking about third world countries here, but every country after a natural disaster such as an earthquake. And it's in those very specific moments that you do need to have your means of communication working. But there are also the human disasters to take into account, the government oppression, the freedom of speech to defend...

In our current web, URLs are not eternal. Links breaks all the time because website evolve, or because url shorteners are no longer maintained.

Well, that's a huge list of problems that I can only agree with. All of this comes from the same root cause, that all the online resources are linked to an address. Internet is an awesome pool of knowledge, but it is still very awkward and old-fashion in certain aspects. 50 years ago if I were to tell you "Oh, I read this book, it is awesome, you should read it too", you just had to go to the closest library and borrow the book. Today with the way internet is working you would have to go to the very specific library I'll point to you, even if the same content could have been found somewhere else. We are not pointing at content anymore, but at addresses to find that content.

It's a big problem, and there is a way to fix this, that draw its inspiration from git. SVN was centralized, and it exhibited all the issues I talked about earlier. Git and its decentralized model fixed it all.

IPFS would be like git, but for content. Each object would be linked to another through crypto hashes, like commit hashes, and that way ensure the integrity of the whole. You couldn't just change something in the chain of commits without everybody knowing. And when you would talk about a specific file, through its hash, everybody would know what you are talking about. Because it would be distributed, there would not be any central source to get data from, you could get it directly from any of your peers that have it as well.

There is already a lot of P2P systems that works through this logic, like bitcoins. OpenBazaar is an eBay-like, distributed and using bitcoins as a currency. The beauty of such a system is that you can still add some central nodes, like GitHub does for git projects, but they won't become SPOF, they are just here for performance and convenience.

libp2p is a library that integrates a lot of existing P2P protocols to build this IPFS. All of them are compatible and allow sharing of documents through a forest of Merkle trees from various compatibles origins.

This idea of using a git structure for various usages was discussed by a few speakers, but this presentation really went deeper into it. Two years ago everybody was talking about Docker in their presentations like it will be the next big thing, I really hope that IPFS will be a reality in two years. This will just be one of the greatest improvement in the worlds of data sharing, hosting and security.

Such networks are already in place for some blogs, web apps and scientific papers. Following the talk of Diogo Monica at dotSecurity, such a system will also be of great value for package managers.

Such an awesome talk really, this is my personal highlight of the whole day.

Greg Lindhal

Next talk was also one of my highlights. It was from Greg Lindhal, and explained how the Internet Archive works, from the inside.

Greg

The Internet Archive is a non profit library that aims to store everything that is produced on the internet. It contains more than 2 million videos, books and audio recording, more than 3 million hours of television and 478 billions of website captures. Overall, this weight around 25 petabytes of data.

One of their main focus is on not losing any data, ever. They store the data on two geographic locations, and store it on two different physical disks on those locations. They extract metadata from the documents and have a centralized server that contains only the metadata and information about where to find the complete file.

Metadata information is also stored alongside the data itself, so if the centralized server gets corrupted, its content can still be reconstructed from the raw data.

They get their initial data from various crawlers, some are their owns, other are from external sources, either professional (like Alexa) or volunteers.

They have a lot of what they call derived data. From a given source, they can run OCR, extract the transcript, or other kind of data. To do so, they need to operate on a copy of the original data, to avoid corrupting the data in case something goes wrong.

They currently have issues with their search, which is not very efficient. They do manually build an index of all their resources in a text file of 60TB compressed. It is so big they have to build an index of the index itself.

In the future they would like to do a partnership with major browsers, that could provide a fallback on a previous version of a website, in case of a 404 error. But before being able to provide that, they need to figure out how to scale the lookup time.

They are currently operating on the maximum capacity. They plan to move on SSD, but this is still too expensive. They are also investigating migrating to a NoSQL database to replace their manual text index file. It currently takes 100 days, just to index content. But this is an old project, with hacks built on top of hacks built on top of hacks. Changing the stack means backporting all the (undocumented) hacks. Technical debt is not about technology, it's about people, and how they used the technology along the years.

As the speaker was saying, all of their projects, even the smallest one can be considered big data. They tried to move to ElasticSearch, but 100GB of their data exponentially explode to 12TB once put into ES.

Surprisingly, they do not receive that many take down requests for copyright infringement, but this is mainly because their search is so bad, that it is nearly impossible to find something. I especially like this quote by the speaker about that:

First thing you do when building a website search engine with ElasticSearch is to replace the default ES algorithm.

Really nice overview of how this massive project operates. The amount of data stored is extraordinary, and I've always wondered how a non-profit managed to handle this. The answer is: they struggle.

Sean Owen

Then it was a talk from Sean Owen, from Cloudera. He mainly talked about machine learning in the context of content recommendation for music genres. It went quickly too technical for me to follow the main idea.

The dotPost

He told that what started as a machine learning project, quickly became a big data project. They realized that to build a relevant content recommendation system, they will need data from their users. They use Spark to handle the processing of the data, because the technology is apparently well suited for scaling this kind of recommendation and machine learning issue.

As I said, he quickly started giving a mathematical explanation of how to compute two matrix of music genres, with specific tips and tricks to make it faster to compute. I cannot remember the exact tricks of this specific use-case, except that he suggested to always use the native math libs of Sparks instead of the JVM for anything math-heavy related.

So yeah, sorry this recap is not as thorough as the others, that's really all I can say about it.

Spencer Kimball

The next talk was a joke. Spencer Kimball was supposed to tell us more about CockroachDB, but he seemed like he had a plane to catch. He did in less than 20 minutes what was supposed to be a 1 hour long talk.

Spencer

We barely had time to read a slide before he jumped to the next. The only thing I grasped from his talk was that he knew what he was talking about. Good for him, because honestly, I think he was the only one.

Gleb Budman

Fortunately, the next talk was its exact opposite and was one of my favorites from the day. Gleb Budman, from BackBlaze gave an overview of how they built BackBlaze from the first days.

The dotPost

BackBlaze is a cloud backup system. They offer unlimited backup for 5$/month (unfortunately not compatible with Linux, otherwise I would already be a customer). What is really interesting is their approach into building such a system, in an iterative process.

They first had to choose where to store the data. They couldn't go to Amazon because for 5$ a month, they could only get 30GB, and they needed to offer infinite storage space, because you never know how much data you'll have to save. And you want a backup system that "just works", without you worrying about disk space.

So they started to investigate into buying their own hardware. They quickly realized that they could buy a 1TB drive for 100$ in a local shop, while it would cost more than 1000$ for the same capacity for its server-side version.

They never imagined that they would start building their own hardware. Amazon already does that, and there was no way they could be more cost efficient than Amazon. But it was either trying, or closing the company. So they tried.

They started with external drives and cascading USB hubs. Didn't work.

So they started thinking about what they do not need. They are only storing data, they do not need to run application on their hardware. The data will only be accessible from time to time, not continuously. The various machines will not need to be able to talk to each other directly.

But they do need to be cost and space efficient. They need to be able to store as much data as possible in the smallest possible space, without being too expensive. So they bought a bulk of hard drives. Not high quality ones, but commodity parts. They created a custom case, made out of wood, to host as many drives as possible.

This worked well, but didn't scale really well. They will need more and more of those cases if they want to grow. So they decided to do some Agile, but for hardware. They contacted a company that could build a metal case from the wood prototype. And in small iterations of 4 weeks, they manage to learn a lot about what they needed, by testing it constantly, and improving each case from the knowledge they got from the previous case. They tested different shops, different 3D printing techniques, and regularly improved the hard drive case.

But then developers started to ask for an API, to directly access the data. So they needed to keep iterating on the case, but now to hold hardware that could host both the data and run a webserver.

They managed to fit 60 hard drive in a box, or what they call a pod. Because it's commodity hardware, it will fail. So they replicate the data across the drives. For each chunk of 20 drives, they can afford 3 of them down at any time. The data is replicated enough so that they could rebuild the missing pieces as long as 17 drives are working.

He then gave more details about some real life issues they had when building the pods. Like forgetting to put a hole to set a power button, or putting a coat of paint that was too thick and prevented the case to fit in the datacenter. Or how a flood in Thailand stopped the worldwide production of hard drives and how they had to manually go buy external drives and rip them out to get the hard drives to put in the pods.

Overall, his talk was really inspiring. Telling us that there is nothing more important than field knowledge. Go test your product on the field as soon as possible. It can work with hardware as well as with software. Find ways to build, even if not perfect, you will learn and you will improve it over time. Toss the unnecessary, don't buy things that are too expensive and provide features you don't need. Create prototypes fast, and often.

Ted Dunning

Then Ted Dunning talked about real time processing, and how streams are the future. The talk seemed really meta and high level and hard to follow.

Ted

He started by saying that everything in the world works in a symmetric way, with an associated conservative law, like energy and time. And that this is true both in the microscopic and macroscopic world. He then throws a few dollar bills on the floor, telling us that if we had to pick those bills, the most clever way would be to pick the highest value bills first. I told you it was meta.

The idea behind the bills show is that, if bytes were money, we should pick the best bytes first. A data processing system should first take the more interesting data, and then continue by picking the less interesting data. This would create a graph were the value we get from the data is high at first, and then just hit a plateau where we got less and less value per data. The flatter the graph goes, the more expensive it gets to get value.

He then went on talking about the fact that data should never be altered. That data should only be reasoned about in terms of now, after and before. That all processing of data is actually a stream, and that data will just keep growing and that scaling is inevitable.

He also went on talking about the communication cost of developers. In a perfect world, putting more developer on a task should make this task faster to resolve. In practice, this adds a communication cost. There is an optimum number of developers for a given task where the boost in productivity outweighs the cost of the communication. Each company, and even each project, should find this optimum.

He concluded by saying that even if REST is nice, and that you can technically do anything with it, it still creates a communication cost in between all the nodes that needs to communicate. On the other hand, streaming (through something like Kafka), removes the need for coordination, or acknowledgement, and will win this part in the future.

To be honest, I had a hard time following where he wanted to bring us with this talk. I am still not sure what is point was.

Eliot Horowitz

The next talk was easier to follow, even if I didn't really share the excitement of the speaker. Eliot Horowitz is the co-founder and CTO of MongoDB.

Eliot

He talked about how we all already have microservices in the form of third parties in our infra. We have analytics in GA, project management in JIRA, various tools for the NPS, the emailing and a lot of other applications.

The issue with that is how to keep the data consistent across all those services, and more importantly how to search into it all at once. Each service has its own part of the data, and see it through his own system. It is very hard to get a big picture of it all, and link data together.

You could build a regular report by fetching data from all sources, aggregate them and output a coherent set of values. This would take some time and money, and will be fixed in time. This is not the way to search and explore your data.

You could build a data warehouse, where you put all the structured data you got from building the reports. This would let you search into it. But every time you will add a new source or change the schema of the data, this will create inconsistencies in your warehouse.

On the other hand, you could build a data lake, where you put everything, with various schemas. Unfortunately, the lake will quickly become a dump. You're actually just building a drowned warehouse.

So let's start again, and let's not assume you need to put all the data in the same place. By keeping the data where it is, you offer more flexibility to the usages. You can add new services that will handle new type of data.

Mongo Aggregate is the service he wanted to tell us about. It is a new pipeline in mongoDB that lets you query data from various (external) sources. It is built to let you link several cohorts. For example knowing if the users that did something in system A, are also the same that did something else in system B. For example, are the users that read the release note the same that give a high NPS?

To do so, mongo has to actually query this external services. A lot. But because the whole process of getting a subset of users from one system and checking in another system if they share a common attribute is wrapped into mongo, they can optimize the query on several levels. How? I don't know, I think he only mentioned caching. I wonder how this handle third party failures.

I'm always highly skeptical about tools that are branded as "you can anything with it, we just take care of everything". Connection to third parties API is trivial, except when it fails. And it fails a lot, so having all this hidden inside mongoDB does not seem an attractive feature to me.

Eric Brewer

Last talk of the day was by Eric Brewer, VP of Infrastructure at Google. He talked about immutable containers.

Eric

For him, we should really not focus on the containers, but on the services they are running and exposing. We should not care about which machine is hosting a container, we should just care about the service it exposes.

He discusses grouping containers into pods, that also share a data volume. All containers of a pod are collocated on the same machine, so each knows about the IP address and exposed ports of the others.

You then start to build services on top of pods. The pods are stateless, they are self-containers, even if their internal containers are stateful. Now, you can just duplicate the pods and put a load-balancer in front of them. Pods are interchangeable, because they are immutable. I could just restart one at any time. And I name them based on the service they host, not their inner components.

If we go one level higher, a network of pods should also be immutable because each of its parts is immutable. Because of this, you could build your own infra and test that it works correctly before actually deploying it. When you build, you actually just build a graph of all your pods together.

This all looks well on paper, but I often hear that immutability is the answer to all software problems, like asynchronous code was a few years ago, so this just looks a bit too magical.

Anyway, he finished with a demo of how he could do a hot version update on hundred of pods without downtime, by replacing each non-used pod one after the other and replacing it with one with the new version.

He still finished on an interesting note about hard drive and datacenter reliability. In the future, we will store less and less data on our laptops and more and more data in the cloud. Drives in datacenters will become the main storage point, and those are already pretty reliable today. They are even often more reliable than the rest of the structure of the datacenter. So there will be no need to further improve their reliability; instead we should reduce their size, make them cheaper, make them use less energy. In the end they only need to be as reliable as the datacenter that is holding them.

Conclusion

And that's it, the end of the day. This was a pretty long day, especially considering that we also had the dotSecurity the Friday before. Next year, there will even have a dotAI on the day after. Not sure I'll be able to do the three days in a row.

Final

The main subjects were about containers and how they handle state. How going for fully stateless is not actually possible and containers should embrace their data, through pods or other grouping pattern. Merkle trees and git-like structure where also a hot topic, and one that got my attention.

Overall I have the same feeling than after dotJS. Lots of talks, but only a few really inspiring. Those talks are awesome, don't get me wrong, and they make the conference worth it, but the ratio is less interesting than at dotSecurity for example. I think I will come back next year, but not sure I'll go to dotAI.

dotSecurity 2016

First session of the dotSecurity conference this year. I've been to a lot of dotConferences in the past years, and they always took place in the most gorgeous venues. This one was no exception. As usual, their is also no wifi in the theater, and attendees are invited to not use their laptops.

Obviously, I did not follow that advice, otherwise you couldn't have been able to read this short recap.

Making HTTPS accessible

The first talk of the day was by Joona Hoikkala, about https, and the hassle it is to properly install and configure it to a server. The overhead is getting so low and the benefits so high in term of security and privacy, that the whole web should be https. It protects end-users from session hijacking, content injection and arbitrary censorship.

Joona

And yes, I get the irony that, at the time of writing, this very own website is not served through https. I'll fix that soon.

The other advantages are that http/2 will only work with https. Well, according to the specs, it could also run without it. But in practice, no current implementation handle this case, and all need https enabled. It is also said that it should give better SEO results, but as with all things SEO-related, I wouldn't put much faith in that.

Still, the issue today (and that will be my excuse for not having https on this very own website) is that it is hard and cumbersome (and expensive!) to get a certificate for each server we own. We could default to a wildcard server (something like *.pixelastic.com), but this will just lower the overall security because if one server got compromised, all are insecure.

The solution today is to use Let's Encrypt that gives easy and free certificates for everybody. It comes with a command line interface and ease the burden of creating, revoking and updating certificates. There are more than 2 millions certificates issued through Let's Encrypt since its creation.

Overall, I must say that this first talk was a bit slow to start. I took me a while to understand that the speaker actually worked for Let's Encrypt. If anything, it gives me one more nudge to force me to move my servers to https.

Life of a vulnerability

The second talk was by Filippo Valsorda, from Cloudflare. He explained in very simple terms how vulnerabilities (or vulns) are discovered, published and fixed. Really nice overview for something that I only hear from a distance without knowing how the process really works.

Filippo

He started by defining a vuln by being basically a state of software that lets user do something that they are not allowed to do. Most common forms include DDOS, leaks of data, SQL injections or remote code execution.

Not all vulns have a nice name, website, logo and stickers. Most of them don't, actually. When you discover a vulnerability, you have to report it. There are two ways to report them: full disclosure and responsible disclosure.

For full disclosure, you post the vulnerability to a public forum, letting everybody know about it. Such forums are often target of legal threats of companies that don't want their vulnerabilities publicly accessible. Other people can choose to sell them. Who actually buy this vulns is still a bit obscure for me, but it can be people buying them to resell them later, or use them for their own profit (either governments or mafias).

For responsible disclosure (also known as coordinated disclosure), you do not post publicly, but contact the website/owner of the service being impacted. You tell them about the issue, the risks and how to fix it. This gives them the time to fix the issue before it does too much damage. That is why it is very important on any website or open-source software to have a way to contact the team for security reason. It can be a simple security@domain.com email address, but it shows that the team will take the matter seriously.

Also, being publicly acknowledged has having found a vuln on major websites like Google or Facebook is very nice for exposure and building a reputation of security expert.

Once the vuln is identified, it should be assigned a unique number so everybody knows they are talking about the same issue, and can work in a coordinated effort to fix it. There are hundreds of vulnerabilities issued per month, some will target only a very specific version of a very specific framework while other can have a much bigger impact.

Such identifies vulns are publicly announced with the list of affected version of the software, the explanation of the issue and the real world impact of such issue. Alongside are displayed patches and/or possible fixes as well as credits.

Then, he ended his presentation with something I found a bit weird to do in such a conference. He did a live exploit of a known vulnerability on an old version of Ruby on Rails to allow remote code execution. He used MetaSploit and Shodan to perfectly show how easy it is to find an insecure machine and run a known exploit to it.

Sure, the machine he targeted was his own, but it wasn't really clear at first. I don't think it was such a good idea to show the audience how easy it is to get access to distant machine and start breaking them, without telling them about the legal implication of doing (like it could get you in jail in some countries). I would have appreciated a bit more context here.

Still, very good talk that taught me a lot.

Multi-factor authentication

Following talk was by Jacob Kaplan-Moss. He told us everything about multi-factor authentication (which is a fancy word to say two-factor authentication, I didn't get the difference).

Jacob

Login to a website simply through a password is no longer enough. We need a secondary channel to confirm the person doing the action is actually the one we think it is. Computers with password saved in cache can be stolen, or left unlocked.

For 2FA, we need another object that is owned by the user trying to do a sensitive action. Most of the time it is mobile phone with a 2FA application that display a small code that you have to input on the website, but there are other ways.

You could send a SMS to the user, or give him an automated phone call with a special number he need to input. This is called out of band communication, you request information from the user in channel A that was send to him through channel B. This way you make sure you're not discussing with a fake user impersonating the real one because he stole his laptop.

There are two type of tokens you can exchange that way. Soft tokens are the own that are generated through 2FA Apps like Google Authenticator or Authy. Hard tokens are physical objects like YubiKeys.

All those solutions offer various degrees of risks, UX and cost. When choosing a solution, you should not only focus on the risks. If you take only the less risky solution you often end-up with the most miserable of UX and users will in the end not use your ultra-secure solution because it just too damn hard to use.

When assessing risk, you have to check if the token can be intercepted, if they can be brute-forced, if you can notice if a token has been stolen. You also have to check how the apps or hardware is protected against malware.

When assessing costs, you have to take into account both the cost for the company (how much does it cost to give YubiKeys to everybody?) as well as the cost for the end-user (will it be cumbersome, will they lose time doing it, etc.).

For example an out of band communication through SMS is easy. Most of your users already have a phone, and they know how to use SMS. It is also quite easy to intercept SMS, and such a communication channel could easily fail in countries where the phone network coverage is bad.

For soft tokens through authentication apps, it is much harder to compromise tokens (even if it possible), but you don't need a network as it can work offline. Still, there is a UX cost because you need to download an external app and manually input the code from your phone.

For hard tokens, the real threat is that the master key that can validate the tokens for all physical keys can be stolen. There is no risk-free solution, and even if such an event is unlikely, if this happens, you need to build new keys for everybody and ship them to wherever they are in the world. Cost-wise, it can become really expensive. But in term of UX, this is the must, you just have to press a button to prove that you are you.

Whatever the method you chose, you have to ask for 2FA not only when they login, but whenever they request a sensitive action. This could be adding new members on a team, changing passwords or billing information. You should also monitor for weird behavior, like changes in connecting devices or geo-location.

You also need a backup plan. What happens when people lose their phone or keys. How can they connect to the service again? There is no perfect answer for that question. You could provide user with backup recovery code, but then the user must store them securely somewhere which just moves the burden of security to the average users. Most won't store them at all.

You could allow for a backup phone, but this only increase the attack surface, making everything less secure. You could allow the support team to help, but they will then become open to social engineering. Or you can simply state that there is no way to recover if you lost your authentication method. Whatever method you chose, you have to tell you users upfront or they will be really pissed off when they discover they have no way to get their account back.

The speaker suggestion was the following setup, that differentiate between public accounts and internal accounts:

For public accounts, 2FA apps are the best trade-off in terms of UX and security. Authy being better than Google Authenticator UX-wise, but more costly for the company. You also have to request the token on each sensitive action. For recovery, don't allow support to recover an account, and provide backup codes.

For internal accounts, you can greatly improve both security and UX by putting a bit more money on the table and giving a YubiKey to each employee for 2FA. This will prevent outsiders from taking over an employee account. Add behavior analysis, like detecting when a connection is not done from the known workplaces, and trigger a 2FA check. Also, only allow 2FA reset when asked face to face to the security team.

I really liked a quote of the speaker during the short Q&A session that came afterwards:

Security should not be handled by a security team. It should be for the day to day developer. Just like tests should no longer be in the hand of testing teams but embedded into development through things like TDD.

Lightning talks

After those talks and a break, we continued with a small commercial break, aka. Lightning talks.

The first one presented in the form of a git branch workflow what he called the classical "security audit workflow". You request a security audit on your code, so the security team creates a new security branch to start working on it. During that time, the main develop branch keeps growing because devs don't stop coding during the audit. Then it's time for the release, so we merge develop into release. But the security audit wasn't finished, so no security fixed was pushed to production. And once the security audit is done, the code on develop has changed so much that the audit is useless.

His solution to that was to do runtime checking of the code, through his company called Sqreen.

From my point of view, the whole premises where fucked up. He was giving a solution to the wrong problem. Why would you even start a security audit if you're not going to implement the results before shipping? This is the root cause of the issue. Either integrate security testing in the flow process, fix issues before release or open a public bounty program. There are so many other ways to avoid the absurd previous branching model shown on screen.

I'm not saying Sqreen product is bad (we even used it at some point), but that what it was trying to solve was not a tech issue but an organizational issue.

Then Ori did some Ori on stage, showing dead squirrels and demonstrating a slack bot to publicly shame sudo users.

Finally a guy from OpenCredo tried to explain something, but as he was constantly looking at the slides behind him, I couldn't hear what he was saying.

I'm a bit disappointed by the lightning talks of this edition. This looks too much like the bad sessions of the ParisTechTalk meetup, where you have talks that are only an excuse to sell a product or company. This is especially surprising because all the other talks of all the other dotConferences takes great care to avoid talking about their companies.

Content Security Policy

We then came back to the real talks with Scott Helme and the CSP (Content Security Policy). I discovered the CSP last October at ParisWeb, so nothing really new here. I still think it's something I should use more and it got me thinking about so many usages.

Scott

But first, let's explain what that is. Basically it's a new header (content-security-policy), that you add to your responses. Its value is a long string.

It will tell the browser what it can or can't do regarding loading of external sources. And the browser support is actually quite good already. You can tell the browser to only load assets from your specific CDN as well as your main domain, to protect you from XSS. You can fine-tune it to apply different rules for images, videos, scripts, styles, etc. You can prevent your page from being embedded in an iframe, etc.

By default it will block the execution of all inline script in your page, forcing you to load them from an external file that should be hosted on one of the servers in the whitelist. Another cool feature is to automatically disable the loading of http assets on an https page, avoiding the warning about mixed content. You can even force the browser to always upgrade http urls to https.

But the two most interesting meta features are that you can toggle a report mode that will not block anything, but only display warning in the console. This lets you test it to see what will break before really deploying it. The other feature is a way to send those reports to an external server instead of displaying them in the console, for later analysis.

I wonder how easy it is to test those CSP with statically generated websites (the Middleman/Jekyll type). I'm also wondering how much we could exploit this mechanism to send detailed information about the user, directly from the browser, without any third-party tracking script.

I mean, if I turn on report mode only and disable loading of all assets, I will receive the report of all the assets the user wanted to load. Doing so I'll be able to analyze for example which users have an adblocker (because they didn't got "blocked" while trying to load an ad). Could I also load a script from a valid source that will dynamically inject a custom script with all the user data (viewport, ip, user-agent, etc.) in its url to log it?

I'm sure there are really interesting data to get from this feature, as well as increase the overall security of apps of course!

Diogo Monica

The next talk was my personal favorite of the whole afternoon. It was made by Diogo Monica, head of security at Docker. I did not get everything from the talk to be honest, but it was explained in such a clear and concise way that I really liked it.

Diogo

At its core, Diogo said that updates were one of the most, if not the most, important parts of security. When an issue is found and fixed, you have to update your software to actually enjoy this fix. And how do you update? Through your package manager.

So package manager are actually a very important part of the security chain, and any security issue in the package manager itself can compromise any package it handles.

All package managers today download updates through a secure connection. This offer great protection against man-in-the-middle (MITM) attacks, but is not enough. If the source is compromised, you are still downloading an infected update through a secure channel.

The best way to protect sources is to sign them, usually through PGP. As a maintainer, you add your PGP "stamp" on the package, and you give your public key to anybody wanting to check your packages. When a user downloads the package, he just have to check (with the public key), that the package was actually packaged by the maintainer. If an attacker changes anything in the source, the key check will fail.

But this won't protect you against what is named a "downgrade attack". The PGP signature only lets you check that the package you've download actually comes from the maintainer.

Now imagine you're using a specific piece of software in v1. You've downloaded the binary from the source, checked that it's really coming from the official source and install it. Two weeks later, a vuln is found in this version. A v2 is released that fixes the issue. You install it through your package manager, it downloads it from the source, checks that it really comes from the maintainer, and installs it. You're safe!

Or maybe not. It only checked that the file downloaded came from the maintainer. Not that it actually was the v2. Maybe you just reinstalled the exact same software. Maybe one of the servers that offered the download was compromised and offered v2 instead of v1 and you didn't see it because the PGP protection doesn't check the version, only that it comes from the correct maintainer.

An easy way to check would be to have a look at your current installed version. It's quite easy, but it is not the job of PGP, and it is apparently quite difficult for package manager to check that the correct version is installed.

Diogo then talked about the way to delegate power through a hierarchy of keys. Having one master key (stored offline), that could delegate power to other keys that would themselves delegate power. Keys on the top of the hierarchy would have longer expiration time while keys at the bottom will have short expiration time.

He also talked about a way to sign a package with multiple keys, to validate for example that a specific binary has correctly passed the staging and preprod tests. I must say I did not get exactly how all this part was working, but it is all included in the way Docker does manage its packages.

Really nice talk, I strongly encourage you to watch the video once it will be available.

Anne Canteaut

Next talk was not a very technical one, but more mathematical oriented. It was done by a researcher on cryptography. She pointed out that most of the security breaches described in the previous talks were never about weakness found in cryptographic parts.

Anne

Does that mean that the crypto part of security is perfect? Far from it. If you listen to crypto analysts, seems like a lot of algorithms are breached on a regular basis. But what does that mean anyway? And should we worry? Cryptographers are known to be paranoid, so should we really listen to them?

Spoiler: yes.

If cryptographers says that something is broken, you should stop using right now. Even if there is no practical way to exploit the weakness today, there will be in the future. Those breaches never get better, they always get worse.

There are two ways to consider a hash algorithm to be breached. The first one is to find two inputs that gives the same hash output. This is called a collision and md5 is breached in that way. This is "quite easy" to do, but does not really reduce the security level in any significant way. It opens the way to other exploits, but is not too dangerous in itself.

The second way is far more dangerous. It is when you are able to find a preimage, this means when given a specific hashed output, you can craft your own input that will result in this output. This is slightly different from the collision in one important way: the output is arbitrary and already given to you. It's like the birthday paradox: it is easier in any group of person to find two person with the same birthday, than in the same group finding one with the same birthday as you.

If you know how to craft a preimage, you will then be able to create an infected version of an valid input, that will give the same output as the valid one. If this happens, the algo is breached, and you should stop using it.

The number one rule of cryptography is to always trust public analysis. A lot of people can come up with new algorithm to hash data in a secure way. But it requires a large number of eyes looking at how it's done and brains trying to find breaches to actually prove that one is secure or not.

One of the most famous, AES, has been initially published in 1998 and is still used today, without any breach. The explanation of why it is so resilient is because he is the winner of a publicly held competition of algorithm that took place for 5 years. The best way to make your algo win in such a competition is to find breaches in the algo of the others. Doing so, you're sure to have the brightest minds on the subject trying to breach into it.

It was nice to have a glimpse of the mathematical side of this world. The subject is still very far from what I do, but interesting. I am completely unable to assess if a cryptographic algorithm is good or not, so I trust the experts on the subject. And it's nice to see that even expert can't easily assess every algo, and so that they have to trust each other on public analysis.

Paul Mockapetris

We then had a quick Q&A session with Paul Mockapetris, inventor of the DNS. From this exchange I'll only remember the following quote:

Paul

Hardware is like milk, you want it the freshest you can find. Software is like wine, you want it with a bit of age.

Web Platform Security

The last talk I could attend was by Mike West. There was another talk after this one, but I had a train to catch so couldn't stay up to the end. You can find the slides in here.

Mike

Mike started by telling us the story of Ulysses and the sirens. The sirens were beautiful female creatures with charming and dangerous voices. If men hear them, they would become crazy and jump overboard and drown. Ulysses learned from Circe that the only way to protect against the siren singing was to either put wax in your ears, or to tie yourself to the mast.

Same goes in web security. You have to tie yourself to the mast, meaning you have to follow a principle of least privilege. That way, even if an attacker manages to take control of your account, he won't be able to do much. It is better to do it that way than to try to block every possible way an attacker could compromise an account.

It is especially true for browsers. Browsers are becoming more and more powerful, accessing banking websites, connecting to bluetooth and camera. And being used by average users everyday. Mike, working in the Chrome dev team, told us about the beta features they are working on in Chrome.

He told us about the CSP, as we've seen, and its limitations. If you have different applications running on different subdomains, you cannot whitelist the main domain, because it will expose all subdomains to a breach in any of them. You could host a malicious script in one infected subdomain and make it load in another. They are working on something that would allow better granularity on the scheme, host and port allowed.

They are also adding another way to check for loaded script by adding a checksum verification on script. You would call your <script integrity="{hash}" src="..."> and the browser will check that the checksum of the loaded file matches the specified hash before executing it.

Conclusion

Really nice first occurrence of dotSecurity. Another one will take place next year and I will surely go again. I'll try to be an ambassador this time as well.

The panel of talks was large, from generic knowledge about the security world, to more technical one. I would say the mix was perfect.

Next year I'd like to see talks explaining in more details the selling and buying of vulnerabilities. Who discover them? How? Why? Who do they sell them to? How much? What do buyer do with it, etc.

HumanTalks February 2016

This month's HumanTalks was hosted by Criteo. We had a hard time managing to correctly plug the laptops in to the projector and connect to the wifi so we started a bit late. This forced us to cut down the questions from 10mn to only 6 or 7.

Intro

Customer Solutions Engineer

The first talk was by Nicolas Baissas, my coworker at Algolia. He explained what the job of CSE really means. The job originated mostly in the Silicon Valley for SaaS companies, but is moving to all sort of companies.

The goal of a CSE is to make the customers successful, by making them happy to use the product, for the longest amount of time. This is especially important for SaaS companies where revenue is based on a monthly payment. You do not want your customers to leave your service, and the best way to keep them is to offer them a better and better service.

CSE

There will always be customers leaving, which is what is called a churn, but the goal of a CSE is to make sure the company has a negative churn. This means that the customers that stay compensate for those who left, because they now use more of the service than before. The CSE must ensure that less and less customers want to leave, by understanding what they are looking for while managing to bring more and more value of the product to the already happy customers.

The only way to do that is to try to be part on their team, showing that you're on their side, and not trying to push selling more and more. CSE are experts on their product, and they share that expertise with the customers by email, by Skype or whenever possible, by meeting them directly. This also happens before, during and after Algolia's deployment in their system. The CSE ensures that they use the service at its best, and that it is used in the best way possible to fit their needs.

Month after month, they have a regular look at all the past implementations of their customers and reach out with advices if they see things that could be improved. Today Algolia has more than 1000 customers but only 4 CSEs, so this has trouble scaling. So in parallel they also work on making it easy for all the customers they do not have time to talk to.

They write guides, explaining specific features in detail, with examples. They have already explained the same thing hundreds of times on calls, so they have experience on how to explain it clearly, then it's just a matter of writing it down. They also write detailed tech tutorials. They all have a tech background and know how to code, so they can really understand what it requires to implement Algolia in an existing system.

The goal is to automate most of the recurring work. They built a new tab in the UI to analyze the current configuration of a customer and suggest improvements. Those are the exact same improvements than they would have suggested during a one-to-one call, but because they have the experience and know how to code, they can just simply make it self-service for users directly.

Some of the features are too complex to be correctly grasped just by reading the theory, like geosearch. So they created a demo with real data, using all the best practices and letting users play with it to see how it was working. This worked really well and transformed a theoretical feature into a real life application and in turn generated signups to the service.

What Nicolas really stressed out is that the role of a CSE is to be as close to the customer as possible, in order to really understand, in a real-life scenario, what he wants to do, with the very specifics of its project. But as well be as close as possible to the product itself, part of the team that builds it so he can exactly know which features are ready and how they work. By doing both you can really bring your deep expertise of the service to the specific issues of the customer while helping build the service with real-life examples of real-life issues customers have.

A CSE's ultimate goal is to not be needed anymore, the documentation and self-service information should be enough for most of the users, and core developers of the service should be in direct contact with the users so they know how people really use their service.

JSweet

The second talk was about JSweet, a transpiler of Java to JavaScript. A transpiler is like a compiler, it transforms a language into another language, or even the same. There are Java-to-Java transpilers, that can refactor the code and there are already a lot of tools transpiling to JavaScript (e.g. CoffeeScript, Dart and TypeScript).

JSweet

Of these three, TypeScript seems to be the most popular today. It was originally created by Microsoft, but then Google started using it for Angular. TypeScript mostly adds a typed layer on top of JavaScript, but still lets you use vanilla JavaScript wherever you want.

There already were attempts at a Java-to-JavaScript transpiler in the past, namely with GWT, but it was not as promising as announced and carried many organic limitations. GWT is too much of a blackbox, and you couldn't use the generated JavaScript with regular JS APIs, so that it was quickly outdated and the promise of having all Java in JavaScript wasn't even fulfilled. Mostly, it was done for developers that did not want to learn JavaScript and just wanted their backend code to work in the frontend as well.

Later on, we saw the emergence of NodeJS, where one of the cool features was that you could use the same language on the backend and frontend. NodeJS being written in JavaScript, it meant that you had to ditch your old Java/Ruby/PHP backend and put NodeJS instead. JSweet follows the same logic of "same language" everywhere, but this time it lets you use JavaScript with the Java language.

TypeScript syntax being really close to Java, it is easy to transpile from one to the other. Then, TypeScript transpiling to JavaScript, you can transpile all the way from Java to JavaScript.

This lets you use all your JavaScript libraries (like underscore, moment, etc) directly in your Java code. And you can also write your frontend code with Java, letting you follow the paradigm of "one language everywhere". Internally your Java code will be transpiled to TypeScript, then to JavaScript. Not all of Java will be available in JSweet, though.

I never coded in Java so I am unsure how useful this really is, but it seemed like a nice project if you want to keep the same language from back to front.

Shadow IT

Next talk was about what is called the Shadow IT, by Lucas Girardin. The shadow IT encompass all the IT devices in a company that are invisible to the radar of the official IT departement. It includes all the employees cell phones that are used to check personal emails during the day, all the quantified self devices (FitBit, etc), Excel files filled with custom macros, personal dropbox account and even the external contracts with freelancers that are not approved by the IT department.

ShadowIT

Granted, this kind of issues only occur in big companies, where there are way too many layers between the real needs of employees and the top hierarchy trying to "rationalize" things. This talk gave a great echo to the first talk about CSE and reminded me why I quit consulting ;).

Anyway, the talk was still interesting. It started by explaining why these kinds of shadow developments appeared in the first place: mainly because the tools the employees have were not powerful enough to let them do their job in the best environment. And because employees are getting more and more tech-savvy, they found their own ways to bypass the restrictions. They expect to have the same level of features in their day job than at home or on their smartphone. If their company cannot provide it, they will find other ways.

Unfortunately, these ways are frowned upon by the IT departement. Maybe the Excel sheet the employee is creating is really useful, but maybe it is also illegal in regard to personal information storage. Or it will break as soon as the Excel version is changed, or be completely lost when the employee leave and no backup exists.

Then I started getting lost in the talk. Some of the concepts he talked about were too alien to what I experience everyday that I had trouble understanding what it was really about. In the end he suggested a way to still rationalize those various independant parts, by building a Platform that lets users build their own tools, even if they do not know how to code. This platform would get its data from a Referential that is accessible company-wide and hold the only real trustable source of data. And finally, the IT departement will build Middlewares that will help application A to communicate with application B.

In the end, the IT department will stop building custom applications for their employees, but simply provide the tools to help them build it themselves. Still, it will have to create the middlewares to let all those parts discuss.

I cannot help but think that this does not fix the initial issue but simply gives the IT department the feeling that it is in control again. As soon as the platform tools will be too limited for employees to really do what they want (this will happen really quickly), they will revert to using other, more powerful, tools and this will still be out of the IT departement reach. I fail to see how this is any different from what it was before, except that instead of building the application themselves, the IT departement now builds the tools so the employees can build the applications, but they are still needed to make them work together and will still be a bottleneck.

You can find the slides here

Criteo Hero Team

The last talk was presented by Michel Nguyen and was about the Criteo Hero Team.

Michel told us about his team, the Hero team, for Escalation, Release and Ops. He added a H in front of it to make it cooler. The team is in charge of all prod releases as well as dealing with incidents.

Hero team

They realized that whenever they put something in production, something else breaks. So they started coordinating the releases and having all the people that could understand the infra in the same team, to better understand where this could break.

The now use a 24/7 "follow the sun" schedule where teams in France and the US are always awake to follow the potential issues. They have an escalation system with two layers, that lets them deal with the minor issues without creating a bottleneck for the major ones. The Hero team is in charge of finding the root causes of the issues and if none can be found quickly enough, they will just find a temporary workaround and dig deeper later. Once the issue is found and fixed, they do a postmortem to share with everybody what went wrong, how they fixed it, and ways to prevent it from happening again.

They use Centreon and Nagios as part of their monitoring and check after each production release the state of the metrics they are following, to see if nothing extremely abnormal appeared. If too many metrics changed too widely, then we can assume something is not working correctly.

The current production environment of Criteo is about 15.000 servers, which weigh as much as 6 Airbuses and would be twice the height of the Empire State Building. They handle about ~1200 incidents per year and resolve about 90% of them in escalation. The last 10% are incidents that depends from third-parties, or one-shot incidents they never understood.

To be honest, even if the talk was interesting (and Michel is a very good speaker), it felt too much like a vanity metrics contest. I know Michel had to cut his talk from 30mn to 10mn in order to fit in the HumanTalks format, so I'd like to see the full version of it.

Conclusion

I did not felt like I was the target of the talks this time. I already knew everything about the CSE job because I work with them everyday, I never coded in Java, stopped working in companies big enough to have a Shadow IT issue and as I said, the last one left me hungry for more.

Still, nice talks and nice chats afterwards. Except for the small hiccup with the projector and wifi at the start, the room was perfect and very comfortable and the pizza + sushi buffet was great.

Next month we'll be at Viadeo. Hope to see you there!

HumanTalks January 2016

Note: I'm actually writing this blog post several months after the event, so my memory might not be the freshest here.

Anyway, we had 4 talks as usual for the HumanTalks, hosted at Deezer this time, with food courtesy of PaloIT.

Apache Zepellin

First talk was by Saad Ansari, with an introduction to Apache Zeppelin. Their use-case was that they had a huge amount of tech data (mostly logs) and they had no idea what to do with it.

They knew they should analyze it and extract relevant information from it, but they had so much data, in various forms, that they didn't really know where to start. Sorting them manually, even just to find which data was interesting and which was garbage was a very long task that weren't able to do.

So they simply pushed them to Zeppelin. It understand the global structure of the data and display it in tables and/or graphs. It basically expect CSV data as input and then lets you use a SQL-like syntax to do requests on it and display visual graphs. The UI even provides a drag'n'drop feature for easier refinement.

I was a bit confused as to who the target of such a tool was. Definitely it was not for any BigData expert, because the tool seem too basic. It wouldn't fit for someone not technical either because it still requires to write SQL queries. It's for the developer in between, to get an overall feeling of the data, without being too fine-grained. Nice to get a first overview of what to do with the data.

As the speaker put it, it's the Notepad++ of BigData. Just throw garbage CSV and logs in it, and then play with SQL and drag'n'drop to extract some meaning from it.

The Infinite, or why it is smaller that you may think

Next talk by Freddy Fadel was a lot more complex to follow. I actually stopped taking notes to focus on what the speaker was saying and trying to grasp the concepts.

It was about the mathematical definition of the infinity and what it implies, and how we can actually count it. Really I cannot explain what it was about, but it was still interesting.

At first I must say I really wondered what such a talk was doing in a meetup like the HumanTalks, and was expecting a big WTF moment. I was actually gladly surprised to enjoy the talk.

Why is it so hard to write APIs?

Then Alex Estela explained why it is so complex to build API, or rather, what are the main points that people are failing at?

First of all, REST is an exchange interface. It's main and sole purpose is to ease the exchange between two parties. The quality of the API will be as good as the communication there is between the various teams that are building it.

REST is just an interface, there is no standard and no specific language or infrastructure pattern to apply. This can be intimidating, and gives so many possible reasons to fail at building it.

Often people build REST API like they built everything else, thinking of SOAP, and exposing actions, not resources. Often, they build an API that only expose the internals of the system, without any wrapping logic. You also often see APIs that are too tailored for the specific needs of one application, or on the other hand that can let you do anything but built with no specific use-case in mind so you have to retro-engineer it yourself to get things done.

The technical aspect of building an API is not really an obstacle. It's just basic JSON over HTTP. Even HTTP/2 does not radically change things, it will just need a few adjustments here and there, but nothing too hard. The issue is the lack of standards, that give too many opportunities to do things badly. You can use specs like Swagger, RAML or Blueprint, they all are good choices with strength and weaknesses. Pick one, you cannot go wrong.

There is no right way to build an API in terms of methodology. The one and only rule you have to follow is to keep it close to the users. Once again, an API is a mean of communication between two parties. You should build it with at least one customer using it. Create prototypes, iterate on it, use real-world data and use-cases, deploy on real infrastructure. And take extra care of the Developer Experience. Write clear documentation, give examples on how to use it, give showcases of what you can build with it. Use it. Eat your own dog food. Exposing resources is not enough, you also have to consume them.

Make sure all teams that are building the API can easily talk to each other in the real world and collaborate. Collaboration is key here. All sides (producer and consumer) should give frequent feedback, as it comes.

To conclude, building an API is not really different than building any app. You have to learn a new vocabulary, rethink a bit the way you organize your actions and data, and learn to use HTTP, but it's not really hard.

What you absolutely need are users. Real-world users that will consume your API and use it to build things. Create prototypes, stay close to the users, get feedback early and make sure every actor of the project can communication with the others.

Why do tech people hate sales people ?

Last talk of the day was from Benjamin Digne, coworker of mine at Algolia. He explained in a funny presentation (with highly polished slides⸮) why dev usually hate sales people.

Being a sales person himself, the talk was much more interesting. Ben has always worked in selling stuff, from cheeseburgers to human beings (he used to be an hyperactive recruiter in a previous life).

But he realized that dealing with developers is very different from what he did before. This mostly come from the fact that the two worlds are actually speaking different languages. If you overly stereotype each part you'll see the extrovert salesman only driven by money and the introvert tech guy that spend his whole day in front of his computer.

Because those two worlds are so different, they do not understand each other. And when you do not understand something, you're afraid of it. This really does not help in building trust between the two parts.

But things are not so bleak, there are ways to create bridges between the two communities. First of all, one has to understand that historically sales people were the super rich superstars of the big companies. Techies were the nobodies locked up in a basement somewhere.

Things have changed, and the Silicon Valley culture is making superheroes out of developers. Still, mentalities did not switch overnight and we are still in an hybrid period where both sides have to understand what is going on.

Still, the two worlds haven't completely merged. Try to picture for a minute where the sales people office are located at your current company. And where the R&D is. Are they far apart, or are they working together?

At Algolia, we try to build those bridges. We first start by hiring only people with a tech background, no matter their position (sales, marketing, etc.), which makes speaking a common language easier. We also do what we call "Algolia Academies" where the tech team explain how some parts of the software are working to non-tech employees. On the other hand, we have "Sales classes" where the sales teams explain how they built their arguments and how a typical sales situation is. This helps each part better understand the job of the other part.

We also have a no-fixed-seats policy. We have one big open space, where every employees (including founders) are located. We have more desks than employees and everyone is given the opportunity to change desk at any time. Today we have a JavaScript develop sitting between our accountant and one of our recruiters, and a sales guy next to an op, and another one next to two PHP developers. Mixing teams like this really helps avoiding creating invisible walls.

Conclusion

The talks this time were kind of meta (building an API, sales/tech people, the infinity) and not really tech focused, but that's also what makes the HumanTalks so great. We do not only talk about code, but about everything that happens around the life of a developer. Thanks again to Deezer for hosting us and to all the speakers.

Paris Vim Meetup #11

I went to the 11th Paris Vim Meetup last week. As usual, it took place at Silex Labs (near Place de Clichy), and we were a small group of about 10 people.

Thomas started by giving a talk about vimscript, then I talked about Syntastic. And finally, we all discussed vim and plugins, and common issues and how to fix them.

Pictures

Do you really need a plugin?

This talk by Thomas was a way to explain what's involved in the writing of a vim plugin. It covers the vimscript aspect and all its quirks and was a really nice introduction for anyone wanting to jump in. Both Thomas and I learned vimscript from the awesome work of Steve Losh and his Learn Vimscript the Hard Way online book.

Before jumping into making your own plugin, there are a few questions you should ask yourself.

Isn't this feature already part of vim? Vim is such a huge piece of software that it is filled with features that you might not know. Do a quick Google search or skim through the :h to check if this isn't already included.

Isn't there already a plugin for that? Vim has a huge ecosystem of plugins (of varying quality). Check the GitHub mirror of vim-scripts.org for an easy to clone list.

And finally, ask yourself if your plugin is really useful. Don't forget that you can call any commandline tool from Vim, so maybe you do not have to code a whole plugin if an existing tool already does the job. I like to quote this vim koan on this subject:

A Markdown acolyte came to Master Wq to demonstrate his Vim plugin.

"See, master," he said, "I have nearly finished the Vim macros that translate Markdown into HTML. My functions interweave, my parser is a paragon of efficiency, and the results nearly flawless. I daresay I have mastered Vimscript, and my work will validate Vim as a modern editor for the enlightened developer! Have I done rightly?"

Master Wq read the acolyte-s code for several minutes without saying anything. Then he opened a Markdown document, and typed:

:%!markdown

HTML filled the buffer instantly. The acolyte began to cry.

Anybody can make a plugin

Once you know you need your plugin, it's time to start, and it's really easy. @bling, the author of vim-bufferline and vim-airline, two popular plugins, didn't known how to write vimscript before starting writing those two. Everybody has to start somewhere, so it is better to write a plugin that you would yourself use, this will give you more motivation into doing it.

A vim plugin can add almost any new feature to vim. It can be new motions or text objects, or even a wrapper on an existing commandline tool or even some syntax highlight.

The vimscript language is a bit special. I like to say that if you've ever had to write something in bash and did not like it, you will not like vimscript either. There are initiatives, like underscore.vim, to bring a bit more sanity to it, but it is still hackish anyway.

Vimscript basics

First thing first, the variables. You assign variables in vimscript with let a = 'foo';. If you ever want to change the value of a, you'll have to re-assign it, and using the let keyword again.

You add branching with if and endif and loops with for i in [] and endfor. Strings are concatenated using the . operator and list elements can be accessed through their index (it even understand ranges and negative indices). You can also use Dictionaries, that are a bit like hashes, where each key is named (but will be forced to a string, no matter what).

You can define functions with the function keyword, but vim will scream if you try to redefine a function that was already defined. To suppress the warning, just use function!, with a final exclamation mark. This is really useful when developping and sourcing the same file over and over again.

Variables in vimscript are scoped, and the scope is defined in the variable name. a:foo accesses the foo argument, while b:foo accesses the buffer variable foo. You also have w: for window and g: for global.

WTF Vimscript

And after all this basics, we start to enter the what the fuck territory.

If you try to concatenate strings with + instead of . (maybe because you're used to code in JavaScript), things will kind of work. + will actually force the variables to become integers. But in Vimscript, if a string starts with an integer, it will be casts as this integer. 123foo will become 123. While if it does not, it will simply be casts as 0. foo will become 0.

This can get tricky really quickly, for example if you want to react to the word under the cursor and do something only if it is an integer. You'll have a lot of false positives that you do not expect.

Another WTF‽ is that the equality operator == is actually dependent on the user ignorecase setting. If you :set ignorecase, then "foo" == "FOO" will be true, while it will stay false if the option is not set. Having default equality operators being dependent on the user configuration is... fucked up. Fortunatly, Vimscript also have the ==# operators that is always case-insensitive, so that's the one you should ALWAYS use.

Directory structure

Most Vim plugin packagers (Bundle, Vundle and Pathogen) expect you, as a plugin author, to put your files in specific directories based on what they do. Most of this structure is actually taken from the vim structure itself.

ftdetects will hold the code that is used to assign a specific filetype to files based on their name. ftplugin contains all the specific configuration to apply to a file based on its filetype (so those two usually works together).

For all the vim plugin writers out there, Thomas suggested using scriptease that provides a lot of debug tools.

Tips and tricks

Something you often see in plugin code is the execute "normal! XXXXX". execute lets you pass an argument that is the command to execute, as a string. This allows you to build the string yourself from variables. The normal! tells vim to execute the following set of keys just like when in normal mode. The ! at the end of normal is mandatory to override the user mappings. With everything wrapped in a execute you can even use special chars like <CR> to act as an Enter press.

Syntastic

After Thomas talk, I briefly talked about Syntastic, the syntax checker for vim.

I use syntasic a lot with various linters. Linters are commandline tools that analyze your code and output possible errors. The most basic ones only check for syntax correctness, but some can even warn you about unused variables, deprecated methods or even style violation (like camelCase vs snake_case naming).

I use linters a lot in my workflow, and every code I push goes through a linter on our Continuous Integration platform (TravisCI). Travis is awesome, but it is asynchronous, meaning I will receive an email a few minutes after my push if the build fails. And this kills my flow.

This is where syntastic comes in play. Syntastic lets you add instant linter feedback while you're in vim. The way I have it configured is to run the specified linters on the file I'm working whenever I save that file. If errors are found, they are displayed on screen, on the lines that contains the error, along with a small text message telling me what I did wrong.

It is then just a matter of fixing the issues until they all disappear. Because the feedback loop is so quick, I find it extremely useful when learning new languages. I recently started a project in python, a language I never used before.

The first thing I did was install pylint and configure syntastic for it. Everytime I saved my file, it was like having a personnal teacher telling me what I did wrong, warning me about deprecated methods and teaching me the best practices from the get go.

I really recommend adding a linter to your workflow as soon as possible. A linter is not something you add once you know the language, but something you use to learn the language.

Syntastic has support for more than a hundred language, so there's a great chance that yours is listed. Even if your language is not in the list, it is really easy to add a Syntastic wrapper to an existing linter. Without knowing much to Vimscript myself, I added 4 of them (Recess, Flog, Stylelint, Dockerfile_lint). All you need is a commandline linter that outputs the errors in a parsable format (json is preferred, but any text output could work).

Conclusion

After those two talks, we all gathered together to discuss vim in a more friendly way, exchanging tips and plugins. Thanks again to Silex Labs for hosting us, this meetup is a nice place to discover vim, whatever your experience with it.