On the Web and in social networks, textual similarity search is a problem of the utmost practical interest.
Devising metrics for textual content (Web pages, tweets, ...) serves today's Web services: search engines, recommender systems, and online advertising, to name a few.
In this context, one of the popular strategies to overcome scalability issues is semantic hashing, which was proposed in the early 2000s [2, 4]. Semantic hashing aims at embedding data points from the high-dimensional spaces of traditional approaches into a Hamming hypercube of dimension n. Each data point is associated with a binary code of length n, so that the Hamming distance between two codes approximates the distance between the corresponding points in the original space. How to perform such an embedding is an active research topic in computer science. Many semantic hashing schemes rely on machine learning: using a corpus, binary classifiers are trained to satisfy different optimisation objectives [6, 12, 13, 11, 3, 8, 5].
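To make the role of the binary codes concrete, here is a minimal sketch of how similarity search proceeds once documents have been hashed. It assumes the codes are already available as integers; the document identifiers, the 8-bit code values and the helper names `hamming_distance` and `nearest_codes` are hypothetical and only serve the illustration. It does not show how the codes are learned, which is the actual research question.

```python
def hamming_distance(code_a: int, code_b: int) -> int:
    """Number of bit positions at which two binary codes differ."""
    # XOR leaves a 1 exactly where the two codes disagree.
    return bin(code_a ^ code_b).count("1")


def nearest_codes(query: int, codes: dict[str, int], k: int = 1) -> list[str]:
    """Return the k document ids whose codes are closest to the query."""
    ranked = sorted(codes, key=lambda doc_id: hamming_distance(query, codes[doc_id]))
    return ranked[:k]


# Hypothetical 8-bit codes (n = 8); a real scheme would produce them.
codes = {
    "doc_a": 0b10110010,
    "doc_b": 0b10110110,
    "doc_c": 0b01001101,
}

if __name__ == "__main__":
    print(hamming_distance(codes["doc_a"], codes["doc_b"]))
    print(nearest_codes(0b10110011, codes, k=2))
```

Comparing two codes costs one XOR and one bit count, whatever the dimension of the original feature space, which is where the scalability gain comes from.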
The main issues with these commonly accepted approaches are the following:
- data dependency : the entire corpus, or a sufficient portion of it, needs to be known in advance, which also makes the approach subject to the cold-start problem.
- concept insensitivity : as these processes rely on keyword feature spaces, two documents that are semantically similar but use different terms are not mapped to close binary codes.
- language dependency : while some of the most recent works consider hashing textual and multimedia items together, few works focus on multilingual corpora [10].
Instead of high-dimensional vector spaces, we consider in this thesis other document representations as candidates for hashing. In particular, we consider graphs as a document representation that can be tuned to be data-independent, concept-sensitive, and language-independent. We aim to explore further whether this representation is more suitable for semantic hashing. This line of research has been established in the institute over the last couple of years. Notably, in [9] we demonstrated that a graphical model is a suitable document representation for semantic hashing, as it offers a massive speed-up for pairwise similarity computation at the cost of a limited loss of semantic similarity with respect to high-dimensional models. In [1], we extended our model and showed that an external taxonomy used in the semantic hashing scheme provides concept sensitivity to our semantic hashing process. Candidates are strongly advised to read these two publications from the team.
In this thesis, we aim at improving this method and evaluating the performance and limitations of a semantic hashing scheme based on a graphical representation of documents.
We will focus on under-studied applications of semantic hashing : multilingual semantic hashing and the speed-up of natural language processing tasks.
Supervisors : Christophe Gravier and Julien Subercaze
Requirements:
-------------
The candidate MUST have :
1. a Master's degree in Computer Science,
2. a good mathematical background,
3. a strong background in programming (Java experience is a plus),
4. excellent English writing skills.
In addition, although speaking French is obviously a very useful skill in general, it is not mandatory to apply.
How to apply:
--------------
* Agenda
Submission deadline : May 7th 2015 11:59pm Paris time.
Interview, if selected : between 11 and 13 May 2015.
Notification : between 10 and 15 June.
* Application file
Your application file MUST contain the following items :
1. A curriculum vitae,
2. The Master's diploma as a PDF file,
3. The details of your marks for the last two years of the Master's,
4. Contact details of two associate professors or professors you have interacted with, as references.
Your application file MAY also contain any information or link you think is appropriate (e.g. scientific publications you have contributed to, online services you maintain, a link to your social coding account, ...).
You must send your application as a single, zipped, PDF file to : christophe.gravier@univ-st-etienne.fr
* Evaluation of applications
All applications will be reviewed by all the members of the Knowledge and Representation project, a subdivision of the Connected Intelligence group at the Hubert Curien laboratory.
A first selection will be made from your application file.
If selected, you will enter a second phase of selection based on three exercises :
- A motivation interview so that we can get to know each other. The interview includes the advisors, but also other researchers from the laboratory with a more remote perspective on the research.
- A technical assignment : a basic 2-hour Java (or C++) programming task.
- An English assignment.
The entire process will not exceed half a day, and all three exercises are held on the same day.
On that day, you are invited to come to the laboratory.
Application Deadline : 7 May 2015