PhD position: Efficient data transfer and streaming strategies for workflow-based Big Data processing

The goal of this Ph.D. proposal is to design and implement next-generation data processing, transfer and streaming models, specifically targeting applications that require general data orchestration, independent of any programming model. During this three-year Ph.D. position, the student will evaluate the limitations and bottlenecks of current NoSQL/MapReduce-based solutions for general workflow/streaming Big Data applications, formalize the corresponding requirements, and propose processing models and techniques for optimized data transfers between workflow nodes (inter- and intra-datacenter) and for efficient stream processing.

Description

In the past years, a subclass of Big Data, fast data (i.e., high-speed real-time and near-real-time data streams), has exploded in volume and availability. These specific data, often denoted as events, are typically characterised by a small unit size (in the order of kilobytes) but overwhelming collection rates. Examples of such data include sensor data streams, social network feeds (e.g. 4k tweets per second, 35k Facebook likes and comments per second) and stock-market updates. Numerous applications must process vast amounts of fast data collected at increasing rates from multiple sources, with minimal latency and high scalability. Enabling fast data transfers across geographically distributed sites allows such applications to manage continuous streams of events in real time and to react quickly to changes.

Traditional workflow processing engines often treat data resources as second-class citizens and support access to data only as a side effect of computation. Currently, workflow data handling is achieved either through application-specific overlays that map the output of one task to the input of another in a pipeline fashion or, more recently, by leveraging the MapReduce programming model, which clearly does not fit every scientific application. When a large-scale workflow is deployed across multiple datacenters, the geographically distributed computation is bottlenecked by data transfers, which incur high costs and significant latencies. Without appropriate design and management, these geo-diverse networks can raise the cost of executing scientific applications.
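The pipeline-style data handling described above can be illustrated with a minimal sketch (all names here are hypothetical, chosen for illustration only): each workflow task consumes the previous task's output, and the "overlay" is simply the wiring between them. Real workflow engines add scheduling, data transfer and fault handling on top of this basic pattern.

```python
# Illustrative sketch of pipeline-style workflow data handling:
# the output of one task is mapped to the input of the next.

def ingest(raw_events):
    # Task 1: parse raw "sensor:value" strings into (sensor_id, value) records.
    return [tuple(e.split(":")) for e in raw_events]

def transform(records):
    # Task 2: convert string values to floats.
    return [(sid, float(v)) for sid, v in records]

def aggregate(records):
    # Task 3: sum values per sensor.
    totals = {}
    for sid, v in records:
        totals[sid] = totals.get(sid, 0.0) + v
    return totals

def run_pipeline(raw_events, tasks):
    # The overlay: feed each task's output into the next task.
    data = raw_events
    for task in tasks:
        data = task(data)
    return data

result = run_pipeline(["s1:1.5", "s2:2.0", "s1:0.5"],
                      [ingest, transform, aggregate])
print(result)  # {'s1': 2.0, 's2': 2.0}
```

When such a pipeline spans multiple datacenters, every arrow between two tasks becomes a wide-area transfer, which is exactly the bottleneck this proposal targets.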


To apply, please email a cover letter, CV, contact details of at least two professional references and copies of degree certificates to Dr. Gabriel Antoniu. Incomplete applications will not be considered or answered.

Number of positions available: 1

Research Fields

Computer science

Career Stage

Early stage researcher or 0-4 yrs (Post graduate) 

Research Profiles

First Stage Researcher (R1) 

Benefits

Mobility Schedule: The candidate will be mainly hosted at Inria, Rennes (France).

After the first year, the candidate will join the Universidad Politécnica de Madrid for several months (possibly in multiple stays) to work under the supervision of his/her secondary advisor (Jesus Montes) to develop a set of models for the targeted data management techniques.

After the second year, the candidate is also expected to be hosted for a 3-month secondment by another partner of the consortium. We have identified IBM Research Dublin as a possible partner, where the candidate could validate the proposed models and platforms with real data from the Smart Cities application developed by IBM.

Comment/web site for additional job details

bigstorage.oeg-upm.net/jobs.html


Requirements

Required Languages
Language: English
Language Level: Excellent

Required Education Level
Degree: Master's Degree or equivalent
Degree Field: Computer science

Additional Requirements
- At the time of recruitment, the applicant must not have lived in France for more than 12 months in the previous 36 months
- An excellent Master's degree in computer science or equivalent
- Strong knowledge of computer networks and distributed systems
- Knowledge of storage and (distributed) file systems
- Ability and motivation to conduct high-quality research, including publishing the results in relevant venues
- Strong programming skills (e.g. C/C++, Java, Python)
- Working experience in the areas of Big Data management, Cloud computing or HPC is an advantage
- Very good communication skills in oral and written English
- Open-mindedness, strong integration skills and team spirit

Envisaged Job Starting Date

01/07/2015

Application Deadline

31/05/2015

Application e-mail

Gabriel.Antoniu@inria.fr