B cell receptors are transmembrane proteins that are involved in the recognition of antigens and the subsequent production of antibodies by B cells. To ensure that they generate a sufficient variety of antibodies to inhibit an invading pathogen, the structure of the B cell receptor is highly plastic, frequently undergoing genetic recombination and somatic hypermutation. Analysis of the diversity of B cell receptors can indicate the health of an individual’s immune system: a healthy individual has a highly varied but relatively homogeneously distributed population, while an individual infected with a virus would exhibit a higher frequency of the relevant B cell receptor sequence.
In a paper recently uploaded to the preprint server bioRxiv* by Kim et al. (August 2, 2021), the B cell receptor repertoires of 1,060 COVID-19 patients are compared with those of an equal number of healthy volunteers in a novel manner. The comparison uses computational deep-learning-based protein embedding techniques, which demonstrate a resolution capable of tracking the changes in an individual’s immunity over the course of infection.
*Important notice: bioRxiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as conclusive, guide clinical practice/health-related behavior, or treated as established information.
How was the study performed?
Most B cell receptor repertoire studies aim to characterize the sequence motifs that are responsible for the generation of neutralizing antibodies, utilizing techniques such as fluorescent tagging and flow cytometry to identify these sites. Computational deep learning techniques can instead be used to recognize and classify amino acid sequence patterns within a protein, covering a much larger number of samples in much greater detail than would be possible with wet laboratory techniques. The group used the protein classification software ProtVec to assign each amino acid 3-mer (a run of three consecutive residues) a “biological word,” with several thousand words generated per sequence. These words are then converted into a vector, which is plotted in a two-dimensional vector space for easier visual comparison.
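The idea of turning a sequence into “biological words” and then into a single vector can be illustrated with a minimal Python sketch. This is not the authors’ code and does not use the trained ProtVec model: the function names are hypothetical, the word vectors are placeholder values derived from a hash rather than learned embeddings, and summing the word vectors is an assumption made for illustration.

```python
import hashlib


def overlapping_3mers(seq):
    """Return every overlapping 3-mer ("biological word") in a sequence."""
    return [seq[i:i + 3] for i in range(len(seq) - 2)]


def toy_word_vector(word, dim=4):
    """Hypothetical stand-in for a trained embedding lookup.

    Derives a deterministic pseudo-vector from a hash of the word.
    A real pipeline would look the word up in learned embeddings.
    """
    digest = hashlib.sha256(word.encode("ascii")).digest()
    return [byte / 255.0 - 0.5 for byte in digest[:dim]]


def embed_sequence(seq, dim=4):
    """Sum the word vectors of all 3-mers into one sequence vector (assumed scheme)."""
    words = overlapping_3mers(seq)
    if not words:
        raise ValueError("sequence must contain at least three residues")
    total = [0.0] * dim
    for word in words:
        for k, value in enumerate(toy_word_vector(word, dim)):
            total[k] += value
    return total


print(overlapping_3mers("CARDY"))  # ['CAR', 'ARD', 'RDY']
print(len(embed_sequence("CARDYW")))  # 4
```

With learned embeddings in place of the hash-based placeholder, sequences that share many 3-mers in similar contexts would end up close together in the vector space, which is what makes the later visual comparison possible.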
The system was first tested using five known protein structures obtained from publicly available sources. It was found that the computational method was able to distinguish between each based on the frequency and spacing of the biological words generated. When comparing the B cell receptor repertoires of healthy individuals to those of individuals with COVID-19, the group noted a major disparity in the homogeneity of sequences. Infected individuals exhibited a significantly greater number of unique sequences that were also expressed at very high levels and, in general, had a higher count for almost every sequence identified.
The ten most frequently occurring sequences from each of the 2,120 repertoires were selected and plotted in 2D space, though the group found no obvious difference between healthy individuals and those with COVID-19. However, sequences are only of relevance when occurring together, and the particular sequence that is upregulated in a diseased state can vary wildly, making correlation difficult. The researchers therefore combined the 100 most frequently occurring sequences in each repertoire into a single vector that could be plotted in 2D space. This generated a graphical representation in which individuals with more “COVID-like” repertoires formed a group that was easily distinguished from the healthy.
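Collapsing a repertoire into one vector can be sketched as follows. The article does not state how the 100 sequences were combined or how the result was projected into 2D, so this example assumes a simple average of per-sequence vectors and omits the projection step. The names `repertoire_vector` and `toy_embed` are hypothetical, and the toy embedding is a placeholder rather than a protein embedding model.

```python
from collections import Counter


def repertoire_vector(sequence_counts, embed, top_n=100):
    """Average the vectors of the top_n most frequent sequences (assumed scheme)."""
    top = Counter(sequence_counts).most_common(top_n)
    if not top:
        raise ValueError("repertoire is empty")
    vectors = [embed(seq) for seq, _count in top]
    dim = len(vectors[0])
    return [sum(vec[k] for vec in vectors) / len(vectors) for k in range(dim)]


def toy_embed(seq):
    """Hypothetical placeholder embedding: [length, fraction of alanine residues]."""
    return [float(len(seq)), seq.count("A") / len(seq)]


# Made-up counts for illustration only; not data from the study.
counts = {"CARDY": 50, "CAAAW": 30, "CTTTTTW": 1}
vec = repertoire_vector(counts, toy_embed, top_n=2)
print(vec)
```

Because each repertoire becomes a single point, repertoires from different individuals, or from the same individual at different times, can be placed on one plot once a 2D projection of choice is applied.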
The data utilized by the group had also been employed in an antibody study by another research team. That prior study noted that, in individuals with COVID-19 for whom data were available, the divergence of effective neutralizing antibodies varies over the course of infection, first being more widely varied and then becoming more specialized as the effective versions are produced in greater quantity. For these same individuals, who had multiple data points over time, the group observed that the 2D position of the generated vector migrated from the center to the COVID-19 region and then finally to the healthy region, reflecting the initial response to infection, the peak of infection, and the post-infection period, respectively.
The method developed in this paper allows researchers to characterize an individual’s B cell receptor repertoire and express it in a graphical form that can be compared to that of other individuals or with data collected from the same individual over time, allowing infection to be identified and tracked.