Unstructured Data

Photo by Ricardo Viana on Unsplash

Have you ever wondered how YouTube stores our videos? Not only does it need to play them smoothly, it also needs to search them as fast as we type our keywords. This kind of data does not fit well with the existing approach for structured data, so we need a new way to store and retrieve it.

A Video Story

Using the same approach as for structured data would be challenging, because searching by a partial video name is computationally expensive. In addition, how can we store the video itself? A low file size limit will not give a great user experience. A huge size limit, on the other hand, is costly for the company, and if most videos are tiny compared with the maximum, we end up with a lot of unused space in our storage.

Another challenge is playing the video. Retrieving a full-size video takes a significant amount of time. And once we have it, where should we put it temporarily so the browser can play it well? Holding the whole file on the network or in the browser is not a good answer: the bigger the file, the more bandwidth and hard drive space it eats.

Say Hi to Hadoop

The challenges above force us to define a new way to store and retrieve the data. Instead of keeping the video as one full-size file, we store the important information, such as name, length, and resolution, in one file (the master). The video content itself is split into chunks. Each chunk holds a part of the content (e.g. one minute of video), its partition info (e.g. it can say: hey, I am the first part), and its owner (a pointer to the master file). Hadoop's distributed file system (HDFS) works along similar lines: it keeps file metadata separate from the data blocks.
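The master-and-chunks layout described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the names `MasterRecord`, `Chunk`, and `split_video` are hypothetical, and the split is by byte count rather than by minute of video, which keeps the example short.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class MasterRecord:
    """Small record with the searchable information about a video."""
    video_id: str
    name: str
    length_seconds: int
    resolution: str
    chunk_count: int


@dataclass
class Chunk:
    """One piece of the video content."""
    owner_id: str    # points back to the master record
    part_index: int  # 0 means "I am the first part"
    content: bytes


def split_video(video_id: str, name: str, length_seconds: int,
                resolution: str, data: bytes,
                chunk_size: int) -> Tuple[MasterRecord, List[Chunk]]:
    """Split raw video bytes into fixed-size chunks plus one master record."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    chunks = [
        Chunk(owner_id=video_id, part_index=index,
              content=data[offset:offset + chunk_size])
        for index, offset in enumerate(range(0, len(data), chunk_size))
    ]
    master = MasterRecord(video_id, name, length_seconds, resolution,
                          chunk_count=len(chunks))
    return master, chunks
```

Only the last chunk can be smaller than `chunk_size`, so a short video takes up only the space it needs instead of a slot sized for the maximum.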

The chunks are replicated and distributed across many servers around the globe, so that a backup is always available when a chunk cannot be reached and the workload stays balanced between the servers. The same applies to the master file.
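One simple way to picture replication is a round-robin placement, where each chunk is copied to a few different servers. The sketch below is an assumption for illustration, not how any particular system decides placement; `place_replicas`, `pick_available`, and the server names are hypothetical.

```python
from typing import Dict, List, Set


def place_replicas(chunk_count: int, servers: List[str],
                   replicas: int = 2) -> Dict[int, List[str]]:
    """Assign each chunk to `replicas` distinct servers, round-robin."""
    if replicas > len(servers):
        raise ValueError("not enough servers for the requested replicas")
    placement = {}
    for part in range(chunk_count):
        # Consecutive servers, wrapping around, so the copies spread evenly.
        placement[part] = [servers[(part + r) % len(servers)]
                           for r in range(replicas)]
    return placement


def pick_available(placement: Dict[int, List[str]], part: int,
                   down: Set[str]) -> str:
    """Return the first server holding this chunk that is not down."""
    for server in placement[part]:
        if server not in down:
            return server
    raise LookupError(f"no replica of chunk {part} is available")
```

With four servers and two replicas, every server holds the same number of chunks, and losing any single server still leaves one copy of each chunk reachable.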

See it in Action

When we search for a video, the system looks only into the master files and does not retrieve any video content yet. Once we click on the video we want, it asks: hey, please look for all the pieces of this video. Each chunk, knowing who its owner is, submits its part of the content. The master arranges the chunks in order, and because the chunks are fetched in parallel, the video can start playing before everything has arrived. This is why we can skip to some part of a video and it still plays without waiting for the previous parts to load.
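The flow above (search the masters first, then collect the chunks in parallel, optionally starting from a later part) can be sketched as follows. The in-memory dictionaries stand in for real servers, and every name here (`MASTERS`, `CHUNK_STORE`, `search`, `load_video`) is hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for the master files and the chunk servers.
MASTERS = {
    "v1": {"name": "Cat plays piano", "chunk_count": 3},
    "v2": {"name": "Dog surfing", "chunk_count": 2},
}
CHUNK_STORE = {
    ("v1", 0): b"AAA", ("v1", 1): b"BBB", ("v1", 2): b"CC",
    ("v2", 0): b"XX", ("v2", 1): b"Y",
}


def search(keyword):
    """Look only at the master records; no video content is touched."""
    keyword = keyword.lower()
    return sorted(video_id for video_id, master in MASTERS.items()
                  if keyword in master["name"].lower())


def fetch_chunk(video_id, part):
    """Pretend network call that returns one chunk with its part index."""
    return part, CHUNK_STORE[(video_id, part)]


def load_video(video_id, start_part=0):
    """Fetch chunks in parallel from `start_part` on, then put them in order."""
    parts = range(start_part, MASTERS[video_id]["chunk_count"])
    with ThreadPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(lambda p: fetch_chunk(video_id, p), parts))
    # pool.map already keeps input order; sorting by part index makes the
    # "master arranges the chunks" step explicit.
    results.sort(key=lambda pair: pair[0])
    return b"".join(content for _, content in results)
```

Calling `load_video("v1", start_part=2)` fetches only the last chunk, which mirrors skipping ahead in a video without loading the earlier parts.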

This method complements structured data technology. Each has its own strengths, and by combining both we get the massive benefits we enjoy in the world we live in today.

See you in the next post!
