Network and Information Technologies

Computer Vision, Machine Learning and Pattern Recognition

Available thesis proposals:

 

Thesis proposals Researchers Research Group

Explainable AI

Explainable artificial intelligence (AI) focuses on making the decision-making processes of AI systems, such as deep learning models, transparent and understandable to humans. As AI systems become increasingly powerful and widely used across multiple domains, their "black box" nature poses significant challenges for trust, accountability, and effective human oversight. The goal of explainable AI is to provide techniques that help us understand how AI systems represent information and how they use that information to produce their outputs.

The field of explainable AI emerged around 2014, shortly after the beginning of the deep learning revolution [1,2]. Since then, several techniques have been developed to improve model interpretability. Feature attribution methods, such as LIME (Local Interpretable Model-Agnostic Explanations) [3] and SHAP (SHapley Additive exPlanations) [4], estimate the contribution of each input feature to a model's prediction. Visualization methods, such as CAM [5], are post hoc explainability techniques for computer vision models that produce heatmaps highlighting the regions of the input image that contributed most to the output. Other techniques focus on revealing the internal representations learned by the model [6]. More recently, autonomous agents that actively probe models have emerged as a way of automating interpretability, enabling model analysis at scale. For instance, the Multimodal Automated Interpretability Agent (MAIA) [7] framework enables agents to perform iterative experimentation with specific multimodal tools, while the Self-Reflective Interpretability Agent (SAIA) [8] framework adds a self-reflection mechanism that allows the agent to revise faulty hypotheses based on experimental evidence, leading to more accurate and robust conclusions.
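As a concrete illustration of the visualization methods described above, the following minimal Python (PyTorch) sketch computes a CAM-style heatmap for a ResNet-18 image classifier. The choice of model, the hooked layer and the random placeholder input are illustrative assumptions, not part of the cited works.

import torch
import torch.nn.functional as F
from torchvision import models

# Minimal CAM sketch: weight the final convolutional feature maps by the
# classifier weights of the predicted class, then upsample to image size.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT).eval()

features = {}
model.layer4.register_forward_hook(lambda m, i, o: features.update(maps=o))

image = torch.randn(1, 3, 224, 224)  # placeholder input; use a real, normalized image in practice
with torch.no_grad():
    logits = model(image)
target_class = logits.argmax(dim=1).item()

fc_weights = model.fc.weight[target_class]                        # (512,)
cam = torch.einsum("c,chw->hw", fc_weights, features["maps"][0])  # class activation map (7, 7)
cam = F.relu(cam)
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
heatmap = F.interpolate(cam[None, None], size=image.shape[-2:], mode="bilinear")[0, 0]
print(heatmap.shape)  # torch.Size([224, 224]): relevance of each pixel to the predicted class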

The choice of technique depends on the purpose and context of the interpretability analysis, since different objectives, data modalities and stakeholders require different forms of explanation [9]. For example, in model debugging and validation, feature attribution may help developers identify spurious correlations in the model and the data. Revealing such spurious correlations is necessary to ensure alignment between model behaviour and human expectations, which is essential in sensitive application domains such as healthcare. In contrast, automated interpretability agents offer scalable solutions for continuous model monitoring and evaluation in complex and dynamic environments.

This research line on explainable AI will focus on two complementary directions:
1. Application of existing explainability techniques to diverse domains, such as medical image analysis or interpretability of multi-modal models to support clinical decisions.
2. Development of novel explainability methods that go beyond current approaches in interpretability, with the ultimate goal of ensuring robustness and alignment with human expectations.

Bibliography

[1] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. "Object Detectors Emerge in Deep Scene CNNs." International Conference on Learning Representations (ICLR). 2015.

[2] Q. Zhang and S.-C. Zhu, "Visual interpretability for deep learning: a survey", Frontiers of Information Technology & Electronic Engineering, 2018.

[3] M. T. Ribeiro, S. Singh, and C. Guestrin, ""Why Should I Trust You?": Explaining the Predictions of Any Classifier", in Proc. 22nd ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, 2016, pp. 1135-1144.

[4] S. M. Lundberg and S.-I. Lee, "A Unified Approach to Interpreting Model Predictions", Advances in Neural Information Processing Systems (NeurIPS), Vol. 30, 2017.

[5] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, A. Torralba. Learning Deep Features for Discriminative Localization, CVPR, 2016.

[6] D. Bau, J-Y Zhu, H. Strobelt, A. Lapedriza, B. Zhou, A. Torralba, "Understanding the role of individual units in a deep neural network", Proceedings of the National Academy of Sciences (PNAS), 2020.

[7] Tamar Rott Shaham, Sarah Schwettmann, Franklin Wang, Achyuta Rajaram, Evan Hernandez, Jacob Andreas, and Antonio Torralba. A multimodal automated interpretability agent. In Forty-first International Conference on Machine Learning, 2024.

[8] C. Li, J. Lopez-Camuñas, J. Thomas, J. Andreas, A. Lapedriza, A. Torralba, and T. Rott Shaham, "Automated Detection of Visual Attribute Reliance with a Self-Reflective Agent", Neural Information Processing Systems (NeurIPS), 2025.

[9] Ribera, Mireia, and Agata Lapedriza. "Can we do better explanations? A proposal of user-centered explainable AI." IUI Workshops. 2019.

Dr Àgata Lapedriza

Mail: alapedriza@uoc.edu

Dr David Masip

Mail: dmasipr@uoc.edu

AIWELL Lab
Perception of emotions based on facial expressions and prosodic speech patterns
 
Facial expressions are a key source of information for developing new technologies. Humans use facial cues to convey emotions, and psychologists have studied this since Charles Darwin's early work [1]. One of the most influential models is the Facial Action Coding System (FACS) [2], which defines action units (facial muscle movements) as the basis for six basic emotions: happiness, surprise, fear, anger, disgust and sadness. Understanding this near-universal language is a major focus in computer vision, with applications in human-computer interaction. However, human emotions go far beyond this basic set. Our research applies deep learning to enable computers to interpret emotions from facial and prosodic cues.
We emphasize applications in child psychology, educational tech and e-health, where multimodal data can support personalized learning and mental health diagnostics.

We explore innovative uses of advanced technology in mixed research designs with children in natural settings (e.g., schools). Our work includes developing ethical, efficient and non-intrusive methods to collect and process audiovisual data, as well as applying statistical and spatio-temporal models to analyse machine-generated data, including coding and compression techniques.

This PhD thesis will be conducted in collaboration with the Child Tech Lab and the AI for Human Well-being Lab at the UOC.

[1] Darwin, Charles (1872), The expression of the emotions in man and animals, London: John Murray.
[2] P. Ekman and W. Friesen. Facial Action Coding System: A Technique for the Measurement of Facial Movement. Consulting Psychologists Press, Palo Alto, 1978.

 

 

Dr Lucrezia Crescenzi

Mail: lcrescenzi@uoc.edu

Child Tech Lab
Deep-learning algorithms
 
In recent years, end-to-end learning algorithms have revolutionized many areas of research, such as computer vision [1], natural language processing [2], gaming [3], robotics and protein folding [4]. Deep learning techniques have achieved the highest levels of success in many of these tasks, given their astonishing capability to model both the features/filters and the classification rule.

Despite these advancements, there remains considerable potential for enhancing deep learning methodologies to better address real-world challenges. This research initiative will explore novel approaches in the following areas:

- Federated learning: a machine learning technique that enables training deep learning models on decentralized data sources. This approach addresses privacy concerns and logistical challenges associated with data sharing, making it particularly valuable in healthcare and other privacy-sensitive domains.

- Uncertainty estimation and conformal predictions: The output of a neural network is typically a score indicating a probability or a regression estimate for a user-defined label. However, real-world applications require explicit quantification of the uncertainty associated with these predictions. For instance, in a cancer diagnosis system, it is crucial to provide not only the type and probability of cancer but also the confidence level of the prediction [5]. To address this, we propose the development of novel deep learning methodologies that explicitly model predictive uncertainty (see the sketch after this list). Furthermore, we aim to leverage this uncertainty to facilitate active and semi-supervised learning on extensive unlabelled datasets.
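To make the conformal prediction idea concrete, the following minimal Python sketch applies split conformal prediction on top of an arbitrary classifier; the random Dirichlet "probabilities" stand in for a network's softmax outputs and the class count is an arbitrary assumption.

import numpy as np

# Split conformal prediction sketch: calibrate a score threshold on held-out data,
# then return prediction sets whose size reflects the model's uncertainty.
rng = np.random.default_rng(0)
n_cal, n_classes, alpha = 500, 5, 0.1          # alpha = 0.1 targets ~90% coverage

cal_probs = rng.dirichlet(np.ones(n_classes), size=n_cal)   # placeholder softmax outputs
cal_labels = rng.integers(0, n_classes, size=n_cal)

# Nonconformity score: 1 - probability assigned to the true class.
scores = 1.0 - cal_probs[np.arange(n_cal), cal_labels]
q_level = np.ceil((n_cal + 1) * (1 - alpha)) / n_cal
qhat = np.quantile(scores, q_level, method="higher")

# At test time, keep every class whose nonconformity score is below the threshold.
test_probs = rng.dirichlet(np.ones(n_classes), size=3)
prediction_sets = [np.where(1.0 - p <= qhat)[0] for p in test_probs]
print(prediction_sets)   # larger sets signal higher predictive uncertainty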

These algorithms will be applied to real computer vision problems in the fields of automated medical diagnosis in e-Health applications, and automated identification, tracking and population analysis in several citizen science projects. We collaborate with several hospitals in the Barcelona area and with two research centres where these methods are currently applied: VHIR (Vall d'Hebron Institut de Recerca) and CSIC-ICM (Marine Sciences Institute, https://www.icm.csic.es/en).

[1] A. Krizhevsky, et al. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.
[2] Sutskever, I., et al. Sequence to sequence learning with neural networks. In Advances in neural information processing systems. 2014
[3] Silver, D., et al. (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587), 484-489.
[4] Gal, Y., Ghahramani, Z.: Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In: ICML. (2016).
[5] Adhane, G., Dehshibi, M. M., and Masip, D. (2021). A Deep Convolutional Neural Network for Classification of Aedes Albopictus Mosquitoes. IEEE Access, 9, 72681-72690.

 

Dr David Masip

Mail: dmasipr@uoc.edu

AIWELL Lab
Emotional intelligence for human-computer interaction
 
We are already beginning to interact daily with intelligent machines – social robots, virtual agents and other autonomous systems – that are becoming part of our homes, workplaces and public spaces. These technologies are increasingly capable of sustaining long-term relationships with people [1]. Ongoing research and deployment efforts are demonstrating their potential in diverse domains, such as socially assistive systems that support healthcare [2], robots that aid and accompany older adults [3,4] and intelligent agents that enrich teaching and learning experiences [5].

For these intelligent machines to communicate fluently with people, a key requirement is the capacity to perceive expressions of emotions, preferences, needs or intents, and to react to those expressions in a socially and emotionally intelligent manner. This research line focuses on designing and implementing technologies that give intelligent machines this kind of simulated emotional intelligence, an essential ability for sustaining social interactions with humans.

More generally, this area of research includes other subareas related to the development of emotional intelligence in machines, such as emotion perception in the wild [6,7], analysis of emotions and social cues in dyadic interactions, text sentiment analysis [8], visual sentiment analysis [9] and emotion perception from audio [10]. In terms of applications, we focus on human assistance, human companionship, entertainment and emotional well-being.

References:

[1] C. Kidd and C. Breazeal (2008). "Robots at Home: Understanding Long-Term Human-Robot Interaction". Proceedings of the 2008 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2008). Nice, France.

[2] Breazeal, C. (2011, August). Social robots for health applications. In 2011 Annual international conference of the IEEE engineering in medicine and biology society (pp. 5368-5371). IEEE.

[3] Broekens, J., Heerink, M., & Rosendal, H. (2009). Assistive social robots in elderly care: a review. Gerontechnology, 8(2), 94-103.

[4] Camuñas, J.L., Bustos, C., Zhu, Y., Ros, R. and Lapedriza, A., 2025. Experimenting with Affective Computing Models in Video Interviews with Spanish-Speaking Older Adults. In Proceedings of the Winter Conference on Applications of Computer Vision (pp. 1-10).

[5] Belpaeme, T., Kennedy, J., Ramachandran, A., Scassellati, B., & Tanaka, F. (2018). Social robots for education: A review. Science Robotics, 3(21).

[6] R. Kosti, J.M. Álvarez, A. Recasens and A. Lapedriza, "Context based emotion recognition using emotic dataset", IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 2019.

[7] Cabacas-Maso, J., Ortega-Beltrán, E., Benito-Altamirano, I., & Ventura, C. (2025). Enhancing Facial Expression Recognition with LSTM through Dual-Direction Attention Mixed Feature Networks and CLIP. In Proceedings of the Computer Vision and Pattern Recognition Conference (pp. 5665-5671).

[8] Felbo, B., Mislove, A., Søgaard, A., Rahwan, I., & Lehmann, S. (2017). Using millions of emoji occurrences to learn any-domain representations for detecting sentiment, emotion and sarcasm. arXiv preprint arXiv:1708.00524.

[9] C. Bustos, C. Civit, B. Du, A. Sole-Ribalta, A. Lapedriza, "Leveraging Vision-Language models for Visual Sentiment Analysis: a study on CLIP", 11th International Conference on Affective Computing & Intelligent interactions (ACII), 2023.

[10] Ortega-Beltrán, E., Cabacas-Maso, J., Benito-Altamirano, I., & Ventura, C. (2024, September). Better Spanish Emotion Recognition In-the-wild: Bringing Attention to Deep Spectrum Voice Analysis. In European Conference on Computer Vision (pp. 335-348). Cham: Springer Nature Switzerland.

 
Dr Àgata Lapedriza

Mail: alapedriza@uoc.edu

Dr Carles Ventura

Mail: cventuraroy@uoc.edu
AIWELL Lab
Oculomics: systemic medical diagnosis using retinal imaging
 
Oculomics is the study of how the health of the eye can reflect the general health of the body, using the eye as a "window" to detect and understand systemic diseases. This field integrates advanced imaging technologies and artificial intelligence to identify biomarkers in the eye that may indicate the presence of diseases like diabetes, cardiovascular diseases [1,2], anaemia [3], kidney disease [4], Alzheimer's [5] and other conditions. The goal is to enable earlier (even before clinical symptoms appear), non-invasive and cost-effective diagnoses of diseases affecting the rest of the body, significantly reducing the economic burden for healthcare systems.

This research project will focus on analysing traditional (fundus, OCT or retinal angiography) and advanced (such as AOSLO) retinal imaging, developing deep learning models capable of early diagnosis to guide preventive treatment, thereby facilitating new, more personalized medicine approaches and prevention plans.
The three main challenges when dealing with these applications are:

1. Small sample sizes: the number of available cases (N) is usually small, which limits the generalization capabilities of the network. We will explore transfer learning and generative models for this purpose.

2. The resulting models should be explainable and easy to interpret. We will provide both a classification score and an explanation of that score, to make the early diagnosis more reliable and trustworthy.

3. Federated learning (FL). Due to the sensitivity of patient health data and the regulations surrounding it (such as HIPAA and GDPR), data sharing is difficult. The healthcare sector is increasingly adopting FL to overcome this: model parameters (knowledge) are shared instead of data, preserving privacy through a decentralized machine learning approach (see the sketch after this list).
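A minimal sketch of the parameter-averaging step at the heart of FedAvg-style training is shown below; the toy architecture, client count and dataset sizes are illustrative assumptions, and the local training loops are omitted.

import torch
from torch import nn

# FedAvg-style aggregation: each site trains a local copy of the model on its own
# data and only the weights are shared; the server averages them, weighted by data size.
def federated_average(client_states, client_sizes):
    total = sum(client_sizes)
    return {
        key: sum(state[key].float() * (size / total)
                 for state, size in zip(client_states, client_sizes))
        for key in client_states[0]
    }

def make_model():
    return nn.Sequential(nn.Linear(16, 8), nn.ReLU(), nn.Linear(8, 2))

clients = [make_model() for _ in range(3)]   # local copies held by each hospital
sizes = [1200, 300, 800]                     # illustrative local dataset sizes
global_state = federated_average([m.state_dict() for m in clients], sizes)

global_model = make_model()
global_model.load_state_dict(global_state)   # broadcast back for the next communication round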

The resulting methods will be transferred to hospitals in the Barcelona metropolitan area, and the research efforts will yield a strong social return.


[1] Gerrits et al. (2020). Age and sex affect deep learning prediction of cardiometabolic risk factors from retinal images. Scientific Reports, 10(1), 1-9.
[2] Barriada, R. G., Simó-Servat, O., Planas, A., Hernández, C., Simó, R., & Masip, D. (2022). Deep Learning of Retinal Imaging: A Useful Tool for Coronary Artery Calcium Score Prediction in Diabetic Patients. Applied Sciences, 12(3), 1401.
[3] Tham, Y. C., et al. (2020). Detection of anaemia from retinal images. Nature Biomedical Engineering, 4(1), 2-3.
[4] Sabanayagam, C., et al. (2020). A deep learning algorithm to detect chronic kidney disease from retinal photographs in community-based populations. The Lancet Digital Health, 2(6), e295-e302.
[5] McGrory et al. (2017). The application of retinal fundus camera imaging in dementia: a systematic review. Alzheimer's & Dementia: Diagnosis, Assessment & Disease Monitoring, 6, 91-1
 

Dr David Merino

Mail: dmerinoar@uoc.edu

Dr David Masip

Mail: dmasipr@uoc.edu

 
 
AIWELL Lab
Large deep learning multimodal models 
 
The rise of large vision-language models (VLMs), such as CLIP [1], ALIGN [2], Flamingo [3], GPT-4V [4] and Gemini [5], has moved computer vision from traditional classification tasks towards performing complex high-level vision-language reasoning tasks that were previously out of reach. These new VLMs can be used for video understanding, visual question answering, detailed image/video captioning, and even open-ended dialogue grounded in visual content.

This research line on large deep learning multimodal models aims to bridge vision systems with cognitive capabilities by leveraging and extending current VLMs and also by developing novel multimodal models that incorporate additional modalities beyond vision and text (e.g. audio or physiological signals). We focus on developing multimodal deep learning architectures to address high-level vision and language tasks, such as:

- Emotion perception: Given an image or a video, building models that can perceive emotions represented in the visual content [6] (see the sketch after this list).
- Visual question answering (VQA): Given an image or video and a question, building models that generate an appropriate answer [7].
- Image/video captioning: Automatically generating detailed and context-aware textual descriptions of visual content [8].
- Image/video segmentation: Identifying and segmenting specific regions or objects in an image or video based on a description [9].
- Inclusive VLMs: Analysing the extent to which current VLMs are inclusive across demographic, cultural and linguistic dimensions. This line of research investigates biases in data and model outputs, evaluates accessibility and representation in multimodal systems, and develops strategies for building more equitable and socially responsible VLMs (e.g. [10]).
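As an example of how an off-the-shelf VLM can be probed for one of these tasks, the sketch below performs zero-shot emotion/sentiment scoring with CLIP through the HuggingFace transformers API; the prompt wording, label set and image path are illustrative assumptions rather than the protocol used in [6].

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Zero-shot probing: score an image against natural-language emotion prompts with CLIP.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

labels = ["a photo that evokes joy", "a photo that evokes sadness",
          "a photo that evokes fear", "a neutral photo"]
image = Image.open("example.jpg")   # placeholder path

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    probs = model(**inputs).logits_per_image.softmax(dim=-1)[0]

for label, p in zip(labels, probs):
    print(f"{label}: {p:.3f}")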

Bibliography

[1] Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., ... & Sutskever, I. (2021, July). Learning transferable visual models from natural language supervision. In International conference on machine learning (pp. 8748-8763). PmLR.

[2] Masry, A., Rodriguez, J. A., Zhang, T., Wang, S., Wang, C., Feizi, A., ... & Rajeswar, S. (2025). Alignvlm: Bridging vision and language latent spaces for multimodal understanding. arXiv preprint arXiv:2502.01341.

[3] Alayrac, J. B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., ... & Simonyan, K. (2022). Flamingo: a visual language model for few-shot learning. Advances in neural information processing systems, 35, 23716-23736.

[4] OpenAI. Gpt-4v(ision) technical work and authors. 2023.

[5] Gemini Team, Google. Gemini: A Family of Highly Capable Multimodal Models. 2023. Available online: https://gemini.google.com

[6] C. Bustos, C. Civit, B. Du, A. Sole-Ribalta, A. Lapedriza, "Leveraging Vision-Language models for Visual Sentiment Analysis: a study on CLIP", 11th International Conference on Affective Computing & Intelligent interactions (ACII), 2023.

[7] Khan, Z., BG, V. K., Schulter, S., Yu, X., Fu, Y., & Chandraker, M. (2023). Q: How to specialize large vision-language models to data-scarce vqa tasks? A: Self-train on unlabeled images! In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 15005-15015).

[8] Hu, X., Gan, Z., Wang, J., Yang, Z., Liu, Z., Lu, Y., & Wang, L. (2022). Scaling up vision-language pre-training for image captioning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 17980-17989).

[9] Rasheed, H., Maaz, M., Shaji, S., Shaker, A., Khan, S., Cholakkal, H., ... & Khan, F. S. (2024). Glamm: Pixel grounding large multimodal model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 13009-13018).

[10] S. Dudy, I.S. Ahmad, R. Kitajima, A. Lapedriza, "Analyzing Cultural representations of Emotions in LLMs through Mixed Emotion Survey", 12th International Conference on Affective Computing & Intelligent interactions (ACII), 2024. (Best Paper Award).

 
Dr Àgata Lapedriza

Mail: alapedriza@uoc.edu

Dr Carles Ventura

Mail: cventuraroy@uoc.edu
 
AIWELL Lab
Leveraging deep learning algorithms for automated video editing

While deep learning has enhanced interactive video retrieval [1], automated multicamera editing remains underexplored. This research line builds on our recent work [2], which addresses automated classical concert editing by decomposing the problem into two key sub-tasks: "when to cut" (temporal segmentation) and "how to cut" (spatial shot selection).

For "when to cut," we developed a lightweight, multimodal convolutional-transformer (audio, time features, visual embeddings) that outperforms statistical baselines. For "how to cut," we improved shot selection by replacing older backbones with a CLIP-based encoder for better semantic alignment, trained on a pseudo-labelled dataset created using a hybrid pipeline including LLM-based confirmation (Gemini).

Future research will focus on:

- Extending the temporal model to a regression task to predict precise cut timestamps.
- Incorporating higher-level affective cues, like musical emotion or visual emotion, to guide shot selection.
- Developing prompt-based or user-in-the-loop editing systems.
- Applying these models to new domains (e.g., sports, stage plays, public speaking).
- Implementing long-term temporal modelling for globally coherent editing.

These developments underscore the potential for automated systems to handle complex multicamera scenarios, applying learned stylistic cues and viewer-centric strategies to produce engaging video content.
Dr Ismael Benito-Altamirano
Mail: ibenitoal@uoc.edu

Mail: cventuraroy@uoc.edu
AIWELL Lab
Colorimetry, traditional computer vision, and deep learning applied to sensors

The integration of colorimetry with computer vision, employing both traditional and deep learning methodologies, has provided significant advancements in transforming qualitative sensors into precise colorimetric measurement tools. This transformation is crucial for industrial and environmental applications, where accurate color detection can offer valuable information. In 2021, we presented a practical approach for printed sensor labels capable of detecting various gases in the ambient air through colorimetric changes, highlighting the potential of such tools for broader applications in different industries [1].
 
Machine-readable patterns, such as QR codes, can be used in colorimetric tasks as a means to encode and transmit color reference information or calibration data, thus enhancing the efficiency of image capture and analysis processes. We demonstrated how modified QR codes could be applied in challenging conditions to convey colorimetric data, ensuring robust performance in diverse environments [2, 3]. This work provided a foundational understanding of how traditional computer vision can be integrated with deep learning to handle complex imaging tasks effectively.
 
Key areas of ongoing research in this field include:
 
- Color Correction Algorithms: Leveraging deep learning to enhance color correction processes, leading to more consistent color reproduction across various sensors and devices [1] (a minimal classical baseline is sketched after this list).
- Enhanced Image Capture Techniques: Utilizing deep learning models, such as those applied in QR code reading, to improve the accuracy and reliability of color data extraction [4].
- Multimodal Image Fusion: Employing techniques that combine data from various imaging modalities (e.g., RGB-D, RGB-T, IR-RGB) to expand the capabilities of classification and segmentation networks, as demonstrated by advances in multispectral detection [5].
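To make the colour-correction idea concrete, the sketch below fits an affine correction from known reference patches (such as those embedded alongside a printed sensor or in a Color QR Code) to their measured values using least squares; the patch values and simulated distortion are placeholders.

import numpy as np

# Fit an affine colour correction from reference patches and apply it to new measurements.
reference = np.array([[255, 0, 0], [0, 255, 0], [0, 0, 255],
                      [255, 255, 255], [0, 0, 0], [128, 128, 128]], dtype=float)
measured = reference * 0.8 + 12.0 + np.random.default_rng(0).normal(0, 2, reference.shape)

X = np.hstack([measured, np.ones((len(measured), 1))])   # add a bias column: affine model
M, *_ = np.linalg.lstsq(X, reference, rcond=None)        # (4, 3) correction matrix

def correct(image_rgb):
    # Apply the fitted correction to an (H, W, 3) image.
    flat = image_rgb.reshape(-1, 3).astype(float)
    flat = np.hstack([flat, np.ones((len(flat), 1))]) @ M
    return np.clip(flat, 0, 255).reshape(image_rgb.shape)

print(np.abs(np.clip(X @ M, 0, 255) - reference).mean())  # small residual on the calibration patches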
 
These advancements signal a future where qualitative sensors can become powerful, reliable tools in colorimetric applications, empowering industries with accessible and precise measurement solutions.
 
Bibliography
 
1. Engel, L., Benito-Altamirano, I., Tarantik, K. R., Pannek, C., Dold, M., Prades, J. D., & Wöllenstein, J. Printed sensor labels for colorimetric detection of ammonia, formaldehyde and hydrogen sulfide from the ambient air. Sensors and Actuators B: Chemical, 2021.
2. Benito-Altamirano, I., Martínez-Carpena, D., Casals, O., Fàbrega, C., Waag, A., & Prades, J. D. Back-compatible Color QR Codes for colorimetric applications. Pattern Recognition, 133, 2023, 108981.
3. Benito-Altamirano, I., Martínez-Carpena, D., Lizarzaburu-Aguilar, H., Fàbrega, C., & Prades, J. D. Reading QR Codes on challenging surfaces using thin-plate splines. Pattern Recognition Letters, 2024.
4. Shi, B., Wang, X., Lyu, P., Yao, C., & Bai, X. Robust scene text recognition with automatic rectification. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4168-4176.
5. Qingyun, F., Dapeng, H., & Zhaokui, W. Cross-modality fusion transformer for multispectral object detection. arXiv preprint arXiv:2111.00273, 2021.
Dr Ismael Benito-Altamirano
Mail: ibenitoal@uoc.edu
AIWELL Lab
Raman spectroscopy as an ophthalmology diagnosis tool

Raman spectroscopy is a powerful non-destructive analytical technique relying on the inelastic scattering of monochromatic light, usually from a laser, by molecules in the sample. The interaction of the incident photons with the molecular vibrations causes a shift in the energy of the scattered light, known as the Raman shift. This shift provides a unique spectral fingerprint that is characteristic of the molecular structure, chemical composition and physical state of the material, making it an invaluable tool across various scientific disciplines, including chemistry, materials science and biomedical research.

This thesis proposal aims to develop novel deep learning algorithms for the automated, objective interpretation of spectral signatures [1]. Although it may have multiple applications, a primary application, currently supported by ongoing collaborations with clinicians in the Barcelona area, focuses on precise tumour classification and prognosis. This work aims to provide clinicians with a safe, convenient and easily interpretable method, thereby replacing subjective manual interpretation with objective analysis.
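As an illustration of a possible starting point for this analysis, the sketch below defines a small 1D convolutional classifier over Raman spectra; the layer sizes, spectrum length and two-class setup are illustrative assumptions, not the proposed method.

import torch
from torch import nn

# Small 1D CNN over the Raman-shift axis: convolutions extract local spectral features,
# global pooling summarizes them, and a linear layer predicts the class (e.g. tumour type).
class RamanClassifier(nn.Module):
    def __init__(self, n_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=9, padding=4), nn.ReLU(), nn.MaxPool1d(4),
            nn.Conv1d(16, 32, kernel_size=9, padding=4), nn.ReLU(), nn.MaxPool1d(4),
            nn.AdaptiveAvgPool1d(1),
        )
        self.classifier = nn.Linear(32, n_classes)

    def forward(self, x):
        # x: (batch, 1, n_wavenumbers) of baseline-corrected intensities
        return self.classifier(self.features(x).flatten(1))

model = RamanClassifier()
spectra = torch.randn(8, 1, 1024)   # placeholder batch of spectra
print(model(spectra).shape)         # torch.Size([8, 2])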

The group currently collaborates with a multidisciplinary network of experts, from ocular oncology clinicians (HUB-IDIBELL, HSJD) to biophotonics researchers (ICFO, HSJD), who will contribute technical knowledge and clinical data to support this research.

[1] Terán, M., Ruiz, J. J., Loza-Álvarez, P., Masip, D., & Merino, D. (2025). Open Raman spectral library for biomolecule identification. Chemometrics and Intelligent Laboratory Systems, 105476.

Dr David Merino

Mail: dmerinoar@uoc.edu

Dr David Masip

Mail: dmasipr@uoc.edu

AIWELL Lab
Retinal imaging for eye tracking

The eye is continuously in motion, and these movements play a very important role in the visual process and in how the brain encodes the electrical information sent by the photoreceptors. To simplify: if the eye did not move continuously, the image would fade and we would stop seeing within a matter of seconds.
There are different methods of measuring eye movement, such as pupil cameras, but their resolution is limited spatially and/or temporally.
In collaboration with the University of California, Berkeley, we have built a retinal camera that can measure eye movements on a very fine temporal and spatial scale.
This camera obtains videos that, when processed correctly, can provide information that will help decipher how the movements of the two eyes are coordinated. This information is important for understanding how we see.
We are currently working on adapting the algorithms that process these videos to improve the results obtained with this new equipment.
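As a simplified illustration of the underlying image-registration step (not the lab's actual tracking algorithm), the sketch below estimates the displacement of a synthetic frame with respect to a reference image by phase correlation.

import numpy as np
from skimage.registration import phase_cross_correlation

# Estimate the translation of a frame relative to a reference retinal image;
# repeating this for every frame yields an eye-motion trace over time.
rng = np.random.default_rng(0)
reference = rng.random((256, 256))
frame = np.roll(reference, shift=(3, -5), axis=(0, 1))   # synthetic frame with a known shift

shift, error, _ = phase_cross_correlation(reference, frame)
print(shift)   # approximately [-3.  5.]: the (row, col) displacement to re-align the frame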

1. Fast binocular retinal tracking using line scanning laser ophthalmoscopy, D Merino, F Feroldi, P Tiruveedhula, C Wang, A Roorda, Ophthalmic Technologies XXXIV, PC128241I
2. David Merino, Jacque L. Duncan, Pavan Tiruveedhula and Austin Roorda, "Observation of cone and rod photoreceptors in normal subjects and patients using a new generation adaptive optics scanning laser ophthalmoscope," Biomedical Optics Express 2, 2189-2201 (2011)

Dr David Merino

Mail: dmerinoar@uoc.edu

Dr David Masip

Mail: dmasipr@uoc.edu

AIWELL Lab
Mobile corneal topography

MoCoTo (Mobile Corneal Topographer) enables low-cost corneal topography using an add-on for a mobile phone or tablet, improving the fitting of contact lenses to the shape of the cornea.
The add-on illuminates the cornea with a specific light pattern, which is used to determine the shape of the cornea. This information can then be used to adapt the shape of contact lenses and improve the user's experience.

Dr David Merino

Mail: dmerinoar@uoc.edu
AIWELL Lab
Video and image analysis for eye and brain disease forecasting

The analysis of the back of the eye (fundus) is crucial to identify not only eye disease but also serious neurological conditions that can lead to blindness, brain injury and even death [1]. Determining these conditions is a common clinical challenge in ophthalmic, neurosurgical and neurological clinics, and their accurate assessment sometimes requires invasive testing or surgical procedures that carry moderate to severe risks of pain and injury. Given these risks, non-invasive alternatives are imperative [2,3,4,5].

At present, eye screening image and video data (fundal photography, OCT, etc.) require interpretation by an expert assessor, limiting their usefulness in non-specialist environments. AI-driven automated tests would help guide the need for urgent referral in community settings or the appropriate follow-up investigations and interventions in hospital settings.

We aim to develop automated, deep learning-enabled image and video detection systems to create novel tests for eye and brain disease diagnosis and forecasting. The study is conducted by an interdisciplinary and international team of ophthalmology, neurology and artificial intelligence experts.

Keywords: Deep learning, explainability, medical imaging, retina, vascular disease, brain disease.

[1] Ptito, M., Bleau, M., & Bouskila, J. (2021). The retina: a window into the brain. Cells, 10(12), 3269.
[2] Panahi, A., Rezaee, A., et al. (2023). Autonomous assessment of spontaneous retinal venous pulsations in fundus videos using a deep learning framework. Scientific Reports, 13(1), 14445.
[3] Cheung, C. Y., Wong, W. L. E., et al. P. (2022). Deep-learning retinal vessel calibre measurements and risk of cognitive decline and dementia. Brain Communications, 4(4), fcac212.
[4] Cheung, C. Y., Ran, A. R., et al. (2022). A deep learning model for detection of Alzheimer's disease based on retinal photographs: a retrospective, multicentre case-control study. The Lancet Digital Health, 4(11), e806-e815.
[5] Zee, B., Wong, Y., et al. (2021). Machine-learning method for localization of cerebral white matter hyperintensities in healthy adults based on retinal images. Brain Communications, 3(3), fcab124.
Dr Joan M Nuñez do Rio
Mail: jnunezdo@uoc.edu
AIWELL Lab
Generative AI to improve eye diagnosis

The variety of imaging devices and techniques used in ophthalmology departments hinders the implementation of deep learning systems (DLSs) in eye clinics, as these systems are not agnostic to the imaging device. The poor generalizability of deep learning models trained without access to large amounts of unbiased data is well documented, and this is often the situation in clinical practice, where the number of available pathological eyes is limited compared to controls.

Screening algorithms also fail to achieve clinically acceptable levels of performance when used on images of the same modality but acquired in less favourable conditions [1]. However, DLSs developed under such conditions can reach clinically acceptable performance if trained on the appropriate source data [2]. This poor generalizability and adaptability, together with the poor interpretability of the current models, limits the translation of successful DLSs from research to the eye clinic, where data are more variable and multimodal. This is a major issue that affects not just eye care but is common throughout frontline clinical practice.

A potential solution is to use generative models to synthesize or modify images from various modalities and conditions, which can in turn be used to train DLSs to detect diseases in a way that is invariant to imaging conditions and devices [3,4]. Moreover, by exploring the latent space of generative models, the factors driving model decisions can be identified and linked to comprehensible scene attributes, improving model interpretability.

This project aims to develop synthetic models that generate and modify retinal images to improve disease assessment, and also to explore the causality and interpretability of generative models to extract comprehensive knowledge about morphology and lesions.

Keywords: deep learning, GAN, diffusion, medical imaging, retina, vascular disease.

[1] Nunez do Rio, J. M., Nderitu P., et al. (2022). Evaluating a deep learning diabetic retinopathy grading system developed on mydriatic retinal images when applied to non-mydriatic community screening. Journal of Clinical Medicine, 11(3), 614.
[2] Nunez do Rio, J. M., Nderitu P., et al. (2023). Using deep learning to detect diabetic retinopathy on handheld non-mydriatic retinal images acquired by field workers in community settings. Scientific Reports, 13(1), 1392.
[3] Waisberg, E., Ong, J., et al. (2025). Generative artificial intelligence in ophthalmology. Survey of ophthalmology, 70(1), 1-11.
[4] Remtulla, R., Samet, A., et al. (2025). A Future Picture: A Review of Current Generative Adversarial Neural Networks in Vitreoretinal Pathologies and Their Future Potentials. Biomedicines, 13(2), 284.

Dr Joan M Nuñez do Rio
Mail: jnunezdo@uoc.edu
AIWELL Lab
Multimodal and multiview medical image analysis

A broad range of two- and three-dimensional imaging devices is available in hospital care and, increasingly, in the community [1]. Increasingly high-resolution and high-contrast imaging techniques, based on different acquisition technologies, provide practitioners with a varied spectrum of image data.

A comprehensive patient assessment involves more than evaluating patient characteristics alone: in most cases, multiple imaging modalities and views are used to arrive at the correct diagnosis [2,3,4]. Given this myriad of imaging devices, realizing the full potential of deep learning (DL) in healthcare requires multimodal DL approaches and feature integration. DL systems (DLSs) stand to benefit from access to pertinent characteristics that may be exclusive to specific imaging procedures but that, in aggregate, could significantly increase the accuracy of multimodal DLSs in detecting disease and predicting its progression (a minimal late-fusion sketch is given below).
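A minimal late-fusion sketch for two views or modalities follows; the toy encoders, input sizes and binary task are illustrative assumptions rather than a proposed architecture.

import torch
from torch import nn

# Late fusion: each view/modality has its own encoder; pooled features are concatenated
# before the classification head, so modality-specific cues are combined for the decision.
class TwoViewClassifier(nn.Module):
    def __init__(self, n_classes=2):
        super().__init__()
        def encoder():
            return nn.Sequential(
                nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
                nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            )
        self.enc_a, self.enc_b = encoder(), encoder()
        self.head = nn.Linear(64, n_classes)

    def forward(self, view_a, view_b):
        fused = torch.cat([self.enc_a(view_a), self.enc_b(view_b)], dim=1)
        return self.head(fused)

model = TwoViewClassifier()
view_a = torch.randn(4, 1, 128, 128)   # e.g. one imaging view (placeholder)
view_b = torch.randn(4, 1, 128, 128)   # e.g. a second view or modality (placeholder)
print(model(view_a, view_b).shape)     # torch.Size([4, 2])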

In this project we aim to develop deep learning-driven systems for the automated detection of medical conditions using different image and video data views and modalities.

Keywords: Deep learning, multiview, multimodal, MRI, CT, X-ray, mammography, echocardiography, PET.

[1] Abhisheka, B., Biswas, et al. (2024). Recent trend in medical imaging modalities and their applications in disease diagnosis: a review. Multimedia Tools and Applications, 83(14), 43035-43070.
[2] Liu, D., Gao, et al. (2022). Transfusion: multi-view divergent fusion for medical image segmentation with transformers. MICCAI (pp. 485-495). Cham: Springer Nature Switzerland.
[3] Zhao, Z., Hu, J., Zeng, Z., et al. (2022). Mmgl: Multi-scale multi-view global-local contrastive learning for semi-supervised cardiac image segmentation. In 2022 IEEE international conference on image processing (ICIP) (pp. 401-405). IEEE.
[4] Van Tulder, G., Tong, Y., & Marchiori, E. (2021). Multi-view analysis of unregistered medical images using cross-view transformers. MICCAI (pp. 104-113). Cham: Springer International Publishing.
Dr Joan M Nuñez do Rio
Mail: jnunezdo@uoc.edu
AIWELL Lab
Advanced MRI, sleep parameters and machine learning for tracking progression of Alzheimer's disease

Alzheimer's disease (AD) is the main neurodegenerative dementia and primarily affects memory, typically leading to progressive deficits in memory, thinking and daily functioning. The disease manifests in several stages, ranging from mild cognitive impairment due to Alzheimer's pathology to severe dementia. Studying AD progression remains challenging, particularly in the early stages. Early and accurate diagnosis, together with the study of disease progression, is essential for optimal patient management and care planning. Magnetic resonance imaging (MRI) plays a vital role in supporting diagnosis and monitoring disease progression by detecting characteristic patterns of brain atrophy. Sleep parameters have been increasingly recognized as both potential biomarkers and contributing factors in AD progression.

Our project aims to address this critical gap in medical research. Using advanced MRI data, we aim to study AD progression and how MRI measures interact with sleep parameters. Furthermore, we will explore the use of sleep parameters for the early diagnosis of AD and for studying its progression. We will use explainable machine learning techniques and normative models to track disease progression at the individual level and to explore different progression rates (a minimal normative-modelling sketch is given below).
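The sketch below illustrates the normative-modelling idea on synthetic data: a reference model of a brain measure as a function of age is fitted on controls, and individual visits are expressed as deviations (z-scores) from that norm; the variables and numbers are purely illustrative.

import numpy as np
from sklearn.linear_model import LinearRegression

# Fit the normative trajectory of a brain measure vs. age in healthy controls,
# then score patient visits as deviations from the age-matched expectation.
rng = np.random.default_rng(0)
age_controls = rng.uniform(55, 85, 200)
volume_controls = 4.5 - 0.02 * age_controls + rng.normal(0, 0.15, 200)   # arbitrary units

norm = LinearRegression().fit(age_controls[:, None], volume_controls)
residual_sd = np.std(volume_controls - norm.predict(age_controls[:, None]))

def deviation_z(age, volume):
    expected = norm.predict(np.array([[age]]))[0]
    return (volume - expected) / residual_sd   # z-score relative to the normative prediction

# Longitudinal visits of one patient: increasingly negative z-scores suggest faster-than-normal atrophy.
for age, vol in [(70, 3.05), (72, 2.90), (74, 2.70)]:
    print(age, round(deviation_z(age, vol), 2))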

This innovative research will be conducted in collaboration with the Alzheimer's Disease and Cognitive Disorders Group at Barcelona's IDIBAPS-Hospital Clínic, a world-renowned clinical institution. Together, we aim to make significant strides in understanding AD progression, ultimately improving patient outcomes.
Dr Agnès Pérez-Millán
Mail: aperezmill@uoc.edu
AIWELL Lab
Advanced MRI and machine learning for classification and tracking progression of frontotemporal dementia subtypes

Frontotemporal dementia (FTD) is a neurodegenerative dementia that primarily affects the frontal and temporal lobes of the brain, regions typically associated with personality, behaviour, language and decision-making, and it therefore leads to changes in these functions. FTD presents various subtypes, such as behavioural FTD (bvFTD) and primary progressive aphasia (PPA). Diagnosing FTD, however, remains a challenge due to the overlap of symptoms with other neurodegenerative and psychiatric conditions. Achieving an early and accurate diagnosis is crucial to improve patient care. Magnetic resonance imaging (MRI) plays an important role in diagnosing and tracking the disease's evolution.

Our project aims to address this critical gap in medical research. Using advanced data from structural MRI and diffusion tensor imaging (DTI), we aim to classify FTD patients with the help of statistical methods and machine learning algorithms. Combined with explainable machine learning techniques, this classification will allow us to identify potential biomarkers to help diagnose the various FTD subtypes. Moreover, by leveraging MRI and DTI data, we plan to create normative models to track disease progression on an individual basis and to explore the different progressions according to FTD subtype.

This innovative research will be conducted in collaboration with the Alzheimer's Disease and Cognitive Disorders Group at Barcelona's IDIBAPS-Hospital Clínic, a world-renowned clinical institution. Together, we aim to make significant strides in understanding and diagnosing FTD, ultimately contributing to better outcomes for patients.
Dr Agnès Pérez-Millán
Mail: aperezmill@uoc.edu
AIWELL Lab
Agentic artificial intelligence (agentic AI)

Agentic artificial intelligence (agentic AI) represents a new paradigm in artificial intelligence research. It focuses on developing systems that can autonomously perceive, reason, plan and act to perform complex tasks [1]. In contrast with traditional AI models, which are trained to perform a single task by replicating a given input-output behaviour, agentic AI systems have a degree of autonomy, proactivity and adaptability, and are characterized by goal-oriented behaviour and iterative decision-making. Agentic AI is reshaping the landscape of intelligent systems, offering the potential for adaptive solutions across multiple domains, including healthcare and education, among others.

This research line on agentic AI will explore architectures and applications of agentic AI, and has the following potential directions:

- Novel architectures: developing new mechanisms to integrate perception, reasoning, memory and planning into unified agent frameworks capable of long-term and adaptive behaviour.
- Self-reflective agents: developing systems that can evaluate their own reasoning processes, detect errors or biases, and refine their strategies [2,3] (see the sketch after this list).
- Human-AI collaboration [4]: designing agents that can interact and cooperate effectively with humans, adapting their communication style and level of autonomy.
- Domain-specific applications: applying agentic AI to critical domains such as medical decision support [5], or AI model interpretability and debugging [3,6].
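A minimal sketch of a perceive-reason-act loop with a self-reflection step is given below; call_llm, the tool set and the stubbed replies are hypothetical placeholders, not the API of any specific agent framework.

# Minimal agent loop with self-reflection; replace the stubs with real model and tool calls.
def call_llm(prompt: str) -> str:
    # Placeholder language-model call: returns canned replies so the sketch runs end to end.
    return "DONE" if "hypothesis supported" in prompt else "describe_image example.jpg"

def describe_image(path: str) -> str:
    return f"(stub) description of {path}"   # stand-in for a vision tool

TOOLS = {"describe_image": describe_image}

def run_agent(goal: str, max_steps: int = 5):
    history = []
    for _ in range(max_steps):
        # 1. Plan: choose the next experiment given the goal and the evidence so far.
        action = call_llm(f"Goal: {goal}\nHistory: {history}\nPropose the next action.")
        tool_name, _, arg = action.partition(" ")
        # 2. Act: run the chosen tool and record the observation.
        observation = TOOLS.get(tool_name, lambda a: f"unknown tool {tool_name}")(arg)
        history.append((action, observation))
        # 3. Reflect: decide whether the current hypothesis is supported or needs revision.
        if call_llm(f"Goal: {goal}\nHistory: {history}\nIs the hypothesis supported?").startswith("DONE"):
            break
    return history

print(run_agent("Which visual attributes does the classifier rely on?"))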

Bibliography

[1] Acharya, D.B., Kuppan, K. and Divya, B., 2025. Agentic ai: Autonomous intelligence for complex goals–a comprehensive survey. IEEE Access.

[2] N. Shinn, F. Cassano, A. Gopinath, K. Narasimhan, and S. Yao. Reflexion: Language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems, 36:8634–8652, 2023.

[3] C. Li, J. Lopez-Camuñas, J. Thomas, J. Andreas, A. Lapedriza, A. Torralba, and T. Rott Shaham, "Automated Detection of Visual Attribute Reliance with a Self-Reflective Agent", Neural Information Processing Systems (NeurIPS), 2025.

[4] Nicoletti, Bernardo, and Andrea Appolloni. "A digital twin framework for enhancing human–agentic AI–machine collaboration." Journal of Intelligent Manufacturing (2025): 1-17.

[5] Karunanayake, N., 2025. Next-generation agentic AI for transforming healthcare. Informatics and Health, 2(2), pp.73-83.

[6] J. Lopez-Camuñas, C. Li, T. Rott Shaham, A. Torralba, and A. Lapedriza, "OpenMAIA: a Multimodal Automated Interpretability Agent based on open-source models", Workshop on Mechanistic Interpretability at Neural Information Processing Systems (NeurIPS Workshops), 2025.

Dr Àgata Lapedriza

Mail: alapedriza@uoc.edu

Dr David Masip

Mail: dmasipr@uoc.edu

AIWELL Lab