Enabling Cross-Lingual AMR Parsing with Transfer Learning Techniques

Abstract Meaning Representation (AMR) is a popular formalism of natural language that represents the meaning of a sentence as a semantic graph. It is agnostic about how to derive meanings from strings and for this reason it lends itself well to the encoding of semantics across languages. However, cross-lingual AMR parsing is a hard task, because training data are scarce in languages other than English and the existing English AMR parsers are not directly suited to being used in a cross-lingual setting. In this work we tackle these two problems so as to enable cross-lingual AMR parsing: we explore different transfer learning techniques for producing automatic AMR annotations across languages and develop a cross-lingual AMR parser, XL-AMR. This can be trained on the produced data and does not rely on AMR aligners or source-copy mechanisms as is commonly the case in English AMR parsing. The results of XL-AMR significantly surpass those previously reported in Chinese, German, Italian and Spanish. Finally we provide a qualitative analysis which sheds light on the suitability of AMR across languages. We release XL-AMR at github.com/SapienzaNLP/xl-amr.


Introduction
Abstract Meaning Representation (AMR) is a popular formalism for natural language (Banarescu et al., 2013). It represents sentences as rooted, directed and acyclic graphs in which nodes are concepts and edges are semantic relations among them. AMR unifies, in a single structure, a rich set of information coming from different tasks, such as Named Entity Recognition (NER), Semantic Role Labeling (SRL), Word Sense Disambiguation (WSD) and coreference resolution. Such representations are actively integrated in several Natural Language Processing (NLP) applications, inter alia, information extraction (Rao et al., 2017), text summarization (Hardy and Vlachos, 2018; Liao et al., 2018), paraphrase detection (Issa et al., 2018), spoken language understanding (Damonte et al., 2019), machine translation (Song et al., 2019b) and human-robot interaction (Bonial et al., 2020). It is therefore desirable to extend AMR semantic representations across languages along the lines of cross-lingual representations for grammatical annotation (de Marneffe et al., 2014), concepts (Conia and Navigli, 2020) and semantic roles (Akbik et al., 2015; Di Fabio et al., 2019). Furthermore, it could be especially useful to integrate cross-lingual semantic structures in multilingual applications of natural language understanding.
A peculiar feature of the AMR formalism is that it aims at abstracting away from word forms. AMR graphs are unanchored, i.e., the linkage between tokens in a sentence and nodes in the corresponding graph is not explicitly annotated. Hence, the feature of being agnostic about how to derive meanings from strings makes AMR particularly suitable for representing semantics cross-lingually. However, AMR was initially designed for encoding the meaning of English sentences. Owing to this, the available resources and modelling techniques focus mostly on English, while leaving cross-lingual AMR understudied. Some preliminary studies showed the limits of AMR as an interlingua, categorizing them as due to distinctions in the underlying ontologies or structural divergences among languages (Xue et al., 2014). More recent studies, instead, have provided evidence that AMR, or a simplified version of it, can be used as a formalism for cross-lingual semantic representation, showing that it is possible to overcome some of the structural linguistic divergences. The underlying idea of this paper is that AMR can be used to represent semantic information in different languages since there exist key linguistic features that are shared across languages, such as predicates, roles and conjunctions (Von Fintel and Matthewson, 2008). However, developing an AMR parser for multiple languages is hard because the existing annotated training resources that are sufficiently large are available in English only, and, moreover, acquiring semantic annotations for a large number of sentences is well-known to be a slow and expensive process in NLP (Zhang et al., 2018; Pasini, 2020). To this end, we aim at exploiting and developing the necessary tools and resources for enabling cross-lingual AMR parsing, i.e., the task of transducing a sentence in the source language into an AMR graph based on English.
We present XL-AMR, a cross-lingual AMR parser, and study different transfer learning techniques to enable its training: i) model transfer, which relies on language-independent features; ii) annotation projection, which relies on parallel corpora and available English AMR parsers; and iii) automatic translation of the training corpora, which guarantees gold AMR structures. We make the following contributions: • We develop and release XL-AMR, a cross-lingual AMR parser which dispenses with word aligners, i.e., both word-to-word and word-to-node, and surpasses the previously reported results on Chinese, German, Italian and Spanish by a large margin.
• We explore different techniques to create cross-lingual AMR training data, showing how it is possible to transfer semantic structure information across different languages.
• We create and release silver data of diverse quality for cross-lingual AMR parsing.
• We provide a qualitative analysis of the ability of XL-AMR to transfer semantic structures across languages, and of AMR to represent the meaning of sentences cross-lingually.

Related Work
Our work lies between two areas, namely, semantic parsing and cross-lingual transfer learning.
Semantic parsing Semantic parsing is a key task required to complete the puzzle of Natural Language Understanding (Navigli, 2018), and one which is receiving growing attention in the scientific community. Besides AMR, various different formalisms have been proposed over the years to encode semantic structures: Elementary Dependency Structures (Oepen and Lønning, 2006, EDS), Prague Tectogrammatical Graphs (Hajič et al., 2012, PTG), Universal Conceptual Cognitive Annotation (Abend and Rappoport, 2013, UCCA) and Universal Decompositional Semantics (White et al., 2016, UDS), inter alia. While some frameworks, such as UCCA and UDS, have been exploited in a cross-linguistic setting (Lyu et al., 2019; Zhang et al., 2018), cross-lingual AMR has mainly been studied within the scope of annotation analysis works (Xue et al., 2014). These works point out the limitations of AMR as an interlingua, and consider them partly due to the distinctions in the underlying ontologies and structural divergences among languages. Further analyses also evaluate the properties of AMR across languages and aim at simplifying this formalism in order to express only the essential semantic features of a sentence, such as predicate roles and linguistic relations. Cross-lingual AMR parsing, instead, has received relatively less attention. This is largely attributable to the lack of training data and evaluation benchmarks in languages other than English. To the best of our knowledge, only one cross-lingual AMR parser has been proposed to date and, moreover, its associated cross-lingual AMR evaluation benchmark has been released only very recently (Damonte and Cohen, 2020). The authors adapt a transition-based English AMR parser (Damonte et al., 2017) for cross-lingual AMR parsing, training it on silver annotated data.
However, the performance it achieves is not satisfactory in terms of Smatch score, mostly as a result of concept identification errors, which in turn are directly related to the usage of noisy word-to-node alignments projected from English. Throughout the literature, English AMR parsers commonly rely on AMR alignments which are automatically created using heuristics (Flanigan et al., 2014) or pretrained aligners (Pourdamghani et al., 2014), treated as latent variables of the model (Lyu and Titov, 2018), or implicitly modelled through source-copy mechanisms. These alignments, however, take advantage of the fact that AMR nodes and English words are highly related. This dependency is therefore not suitable for cross-lingual parsing, since similarity between words in the sentences and concepts in the graph does not hold in general. Our parser, instead, dispenses with explicit and implicit AMR alignments by using a seq2seq model for concept identification, and achieves significantly higher performance on all the tested languages. On the other hand, to account for data sparsity, XL-AMR employs several techniques that are common in the English AMR parsing literature (Konstas et al., 2017), such as anonymization and recategorization, expanding them across languages by relying on multilingual resources.

Transfer learning The idea behind this method is to leverage annotations available in one language, commonly English, to enable learning models that generalize to languages where labelled resources are scarce (Ruder et al., 2019). Techniques include annotation projection, machine translation and language-independent feature-based models. Extensive work in this direction exists, applied to different NLP tasks, e.g., WSD (Barba et al., 2020), SRL (Padó and Lapata, 2009; Kozhevnikov and Titov, 2013), Dependency Parsing (Tiedemann, 2015) and concept representation (Conia and Navigli, 2020).
In cross-lingual AMR parsing, annotation projection has been employed to produce cross-lingual silver AMR annotations by exploiting parallel sentences selected from the Europarl corpus (Koehn, 2005): English sentences are parsed using an English parser (Damonte et al., 2017, AMREAGER) and the resulting graphs are associated with the corresponding parallel sentences. However, the data on which AMREAGER was trained are very different from those used to produce the silver annotations, which affects the quality and reliability of the AMR graphs produced. Here we test two different techniques: we conduct experiments with annotation projection using Europarl for comparison and, in addition, we use translation techniques to produce better-quality training corpora. This leads to significant improvements and provides evidence that better-quality data, and models, allow for using AMR as an interlingua.

Cross-Lingual AMR
In what follows we first formalize the task (Section 3.1) and then detail our cross-lingual AMR parser (Section 3.2) and our proposed silver data creation methods (Section 3.3). Finally, we list the pre- and postprocessing cross-lingual techniques and resources we employ (Section 3.4).

[Figure 1: An example English sentence, "The city of Tel Aviv is fewer than 650 miles from Iranian territory.", with its translations, e.g., Spanish "La ciudad de Tel Aviv está a menos de 1.046 km del territorio iraní." (A); the identified concepts (B); and the shared AMR graph produced by relation identification (C).]

The Task
Cross-lingual AMR parsing is defined as the task of transducing a sentence in any language into the AMR graph of its English translation, whose nodes are either English words, PropBank framesets (Kingsbury and Palmer, 2002) or special AMR keywords. Breaking down this definition, given an English sentence and its translation T_L in a language L, their meaning representation is ideally formalized by the same AMR, G = (V, E), where V is a list of concept nodes and E is the set of semantic relations between them. Figure 1-A shows an example of a sentence in English, with its translations into Chinese, German, Italian and Spanish, which have the same meaning and therefore the same abstract representation (Figure 1-C). Following state-of-the-art models for English AMR parsing, we tackle cross-lingual AMR parsing as a two-stage approach, i.e., concept identification and relation identification, which we briefly overview here and later detail in Section 3.2. For concept identification, given the sequence T_L = (t_1, t_2, ..., t_j), with t_i a word in language L (i ∈ {1, ..., j}, L ∈ {EN, DE, ES, IT, ZH}), we train a neural network to generate the list of nodes V = (v_1, v_2, ..., v_n), where each v_i belongs to the union of English words, PropBank framesets and AMR keywords. In Figure 1-B we show the list of concepts that represent the words in the sentences of Figure 1-A. The relation identification procedure, instead, is inspired by the arc-factored approaches employed in dependency parsing (Kiperwasser and Goldberg, 2016), i.e., searching for the maximum-scoring connected subgraph over the concepts identified in the previous step. Thus, given the list of predicted nodes V = (v_1, v_2, ..., v_n) and a learned score for each candidate edge, we search for the highest-scoring spanning tree and then merge the duplicate nodes based on unique node indices (see Section 3.2) to restore the final AMR graph. Figure 1-C shows the AMR representing the shared semantics of the sentences in Figure 1-A.
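As a concrete (if toy) illustration, the graph G = (V, E) for the sentence I like travelling discussed in Section 5 can be written down directly; the variable names and dictionary layout below are our own choices, not the paper's implementation:

```python
# AMR for "I like travelling" (see Section 5):
# (l / like-01 :ARG0 (i / I) :ARG1 (t / travel :ARG0 i))
V = ["like-01", "i", "travel"]            # concepts: framesets and words
E = {("like-01", "i"): ":ARG0",           # I is the liker...
     ("like-01", "travel"): ":ARG1",
     ("travel", "i"): ":ARG0"}            # ...and also the traveller

# "i" is the target of two relations, i.e., it is a reentrant node
reentrant = [v for v in V if sum(tgt == v for _, tgt in E) > 1]
```

The reentrancy of `i` is exactly what forces the graph-to-tree conversion and node re-indexing described in Section 3.2.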

XL-AMR Model
XL-AMR is composed of two modules which are learned jointly: concept identification, modeled as a seq2seq problem, and relation identification, based on a biaffine attention classifier (Dozat and Manning, 2017). We use a seq2seq model to dispense with the need for an AMR alignment module. Lyu and Titov (2018) argue that alignments are important for injecting a useful inductive bias for AMR parsing and maintain that alignment-based parsers might be better than seq2seq for AMR parsing, owing to the relatively small amount of data available for AMR. However, aligning words to AMR nodes in cross-lingual parsing is challenging. The widely used AMR aligners are usually based on heuristics (Flanigan et al., 2014), or on the fact that AMR and English are highly cognate (Pourdamghani et al., 2014). Hence, these approaches would not be valid for cross-lingual alignment and, moreover, projecting the alignments across languages through English has been shown to be noisy and to affect parsing performance.
Concept identification At training time we obtain the list of nodes by first converting the graph into a tree, duplicating the nodes occurring in multiple relations, and then applying a pre-order traversal over the tree. To account for reentrancies we assign a unique index to each node during traversal, similarly to previous work. Following the attention-based encoder-decoder architecture proposed by Bahdanau et al. (2015), our concept identification module consists of a bidirectional RNN encoder and a decoder that attends to the source sentence at each concept decoding step.
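A minimal sketch of this graph-to-tree linearization, under the assumption that the graph is given as an adjacency map (the actual implementation may differ):

```python
def linearize(graph, root):
    """Pre-order traversal that duplicates reentrant nodes but reuses the
    index of their first occurrence, so duplicates can be merged back."""
    order, index_of, expanded = [], {}, set()

    def visit(node):
        idx = index_of.setdefault(node, len(index_of) + 1)
        order.append((node, idx))
        if node in expanded:      # reentrancy: emit the duplicate, don't expand
            return
        expanded.add(node)
        for _relation, child in graph.get(node, []):
            visit(child)

    visit(root)
    return order

# "The boy wants to go": want-01 :ARG0 boy :ARG1 (go-01 :ARG0 boy)
graph = {"want-01": [(":ARG0", "boy"), (":ARG1", "go-01")],
         "go-01": [(":ARG0", "boy")]}
nodes = linearize(graph, "want-01")
# → [("want-01", 1), ("boy", 2), ("go-01", 3), ("boy", 2)]
```

The shared index 2 on the two occurrences of `boy` is what allows the final merging step to restore the reentrant graph.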
The encoder employs an L-layer bidirectional RNN (Schuster and Paliwal, 1997) with LSTM cells (Hochreiter and Schmidhuber, 1997), i.e., a BiLSTM, which encodes the input token embeddings e_i into hidden states h_i. Each hidden state h_i is the concatenation of the forward and backward hidden states at timestep i. Similarly to previous work, the input token embedding e_i is a concatenation of contextualized embeddings, word embeddings, Part-of-Speech (PoS) embeddings, a token anonymization indicator and character-level embeddings. Each subsequent BiLSTM layer takes the hidden states of the previous layer as input.
The decoder also consists of L unidirectional recurrent layers with LSTM cells. The decoder embedding layer concatenates word embeddings, node index embeddings and character-level embeddings. Layer l of the decoder computes its hidden state d_t^l from d_t^{l-1}, the hidden state of the previous layer at timestep t, and d_{t-1}^l, that of the previous timestep; d_0^l is initialized with the concatenation of the encoder's last forward and backward hidden states. We follow the input-feeding approach of Luong et al. (2015), which concatenates the output of the decoder's embedding layer and the attentional vector computed at the previous timestep. We first compute the source attention distribution a_t using additive attention (Bahdanau et al., 2015):

a_{t,i} = softmax_i(v^T tanh(W_h h_i + W_s d_t + b_s)),    c_t = Σ_i a_{t,i} h_i,

where v, W_h, W_s and b_s are model parameters and c_t is the source context vector. We then compute the attentional vector

d̃_t = tanh(W_c [c_t; d_t] + b_c),

where W_c and b_c are model parameters. Previous work used the attentional vector to allow the decoder to copy nodes predicted in the previous steps (target-copy), rather than only generating a new node from the vocabulary. As there is empirical evidence that this is crucial for handling reentrancies, we employ the same target-copy approach and use the attentional vector d̃_t i) to feed a dense layer and softmax that produce a probability distribution over the vocabulary, P_vocab = softmax(W_vocab d̃_t + b_vocab); ii) to learn a target attention distribution â_t (analogous to the source attention distribution above); and iii) to calculate the probabilities p_copy and p_generate, which decide whether to copy one of the previously predicted nodes, by sampling a node from the target attention distribution â_t, or to generate a new node from the output vocabulary. Each newly generated node is assigned a unique index, while a copied node takes the index of the node it copies from the previously generated concepts. At prediction time, we employ beam search to decode the list of nodes based on the probability distribution computed above.
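The attention computations above can be sketched in a few lines of numpy; shapes and parameter names are illustrative, and the copy/generate step is omitted:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def additive_attention(h, d_t, W_h, W_s, v, b_s):
    # scores_i = v^T tanh(W_h h_i + W_s d_t + b_s), one per source position i
    scores = np.tanh(h @ W_h.T + d_t @ W_s.T + b_s) @ v   # shape (j,)
    a_t = softmax(scores)        # source attention distribution
    c_t = a_t @ h                # source context vector, shape (hidden,)
    return a_t, c_t

def attentional_vector(c_t, d_t, W_c, b_c):
    # d~_t = tanh(W_c [c_t; d_t] + b_c)
    return np.tanh(W_c @ np.concatenate([c_t, d_t]) + b_c)

rng = np.random.default_rng(0)
h, d_t = rng.normal(size=(4, 6)), rng.normal(size=6)    # 4 source positions
W_h, W_s = rng.normal(size=(5, 6)), rng.normal(size=(5, 6))
v, b_s = rng.normal(size=5), rng.normal(size=5)
a_t, c_t = additive_attention(h, d_t, W_h, W_s, v, b_s)
d_tilde = attentional_vector(c_t, d_t, rng.normal(size=(6, 12)), rng.normal(size=6))
```

In the full model, `d_tilde` would feed the vocabulary softmax and the copy/generate probabilities.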
Relation identification For this module we use a deep biaffine classifier inspired by Dozat and Manning (2017), which takes as input the decoder states and factorizes edge prediction into two components, predicting i) whether there is an edge between a pair of nodes, and ii) the edge label for each possible edge. We direct the reader to Dozat and Manning (2017) for technical details on the biaffine attention classifier. At prediction time, to ensure the validity of the tree, given the list of predicted nodes and the scores of the candidate edges, we search for the highest-scoring spanning tree using the Chu-Liu/Edmonds algorithm. We then merge the duplicate nodes based on the node indices to restore the final AMR graph. The model is trained to jointly minimize the loss over reference nodes and edges.
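A sketch of a biaffine edge scorer, following the general form used by Dozat and Manning (2017); the parameter names are our own:

```python
import numpy as np

def biaffine_score(h_i, h_j, U, W, b):
    """Score for a candidate edge i -> j between two decoder states:
    s(i, j) = h_i^T U h_j + W [h_i; h_j] + b."""
    return float(h_i @ U @ h_j + W @ np.concatenate([h_i, h_j]) + b)

# Tiny worked example with hand-picked parameters:
h_i, h_j = np.array([1.0, 0.0]), np.array([0.0, 1.0])
score = biaffine_score(h_i, h_j, np.eye(2), np.ones(4), 0.5)
```

In practice one would score every ordered node pair to obtain an edge score matrix and then run the Chu-Liu/Edmonds algorithm over it, as described above.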

Silver Training Data
In order to train cross-lingual AMR parsers and to evaluate the cross-lingual properties of AMR as an interlingua, we project existing AMR annotations for English sentences to target language sentences following two different approaches.
Parallel sentences - silver AMR graphs Following previous annotation projection work, we project AMR graphs from English sentences to target-language sentences through a parallel corpus. Differently from that work, we do not need word-to-word and word-to-node aligners for training the concept identification module. Instead, we directly pair a sentence in the target language with the AMR graph corresponding to its English counterpart. In this case, while the sentences are parallel, the AMR graphs are of silver-standard quality, i.e., the English sentences of the parallel corpus are parsed using an existing AMR parser. We refer to this method as PARSENTS-SILVERAMR.
Gold AMR graphs - silver translations In addition to pivoting through parallel sentences, we investigate whether human-annotated AMR graphs bring more benefits than system-produced ones. To this end, we make use of the existing gold-standard datasets for AMR parsing, i.e., English sentence-AMR graph pairs, and use machine translation systems to translate the training sentences into the target language. This choice is motivated by the existence of reliable machine translation systems for the languages of interest. Moreover, we validate the silver translations through a back-translation step (Sennrich et al., 2016). That is, we first translate the sentences from English to the target language and then, using the same neural translation model, translate the target-language translations back to English. To filter out less accurate translations we apply a 1-NN strategy based on the cosine similarity between the semantic embeddings of translations and source sentences, similarly to Artetxe and Schwenk (2019a). If the nearest neighbour of a translation corresponds to its source English sentence, we consider it a good translation; otherwise we discard it. We employ semantic similarity because the two-step automatic translation introduces lexical differences into the translations compared to the original sentence. Typical machine translation metrics, e.g., BLEU and METEOR, rely on lexical similarity, which could lead to good translations being discarded. In fact, we do not need the translation to be aligned word by word, but rather to preserve the meaning of the sentence, so we also consider valid those cases in which certain words are translated into synonyms or related words. We refer to this method as GOLDAMR-SILVERTRNS.
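The 1-NN filtering step can be sketched as follows, assuming the sentence embeddings (LASER in the paper) are given as rows of a matrix; the embeddings here are arbitrary placeholders:

```python
import numpy as np

def keep_translation(src_idx, back_emb, src_embs):
    """1-NN back-translation filter: keep the translation of source sentence
    src_idx iff the nearest neighbour (by cosine similarity) of its
    back-translation, among all source-sentence embeddings, is the source
    sentence itself."""
    sims = src_embs @ back_emb / (
        np.linalg.norm(src_embs, axis=1) * np.linalg.norm(back_emb))
    return int(np.argmax(sims)) == src_idx

# Three toy source embeddings; the back-translation is closest to source 1.
src_embs = np.array([[1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0],
                     [0.0, 0.0, 1.0]])
back_emb = np.array([0.1, 0.9, 0.1])
```

Because the decision is nearest-neighbour based rather than threshold based, a back-translation may differ lexically from its source (synonyms, paraphrases) and still be kept, which is exactly the motivation given above for avoiding BLEU or METEOR.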

Pre- and Postprocessing
AMR parsers in the literature rely on several pre- and postprocessing rules. We extend these rules to the cross-lingual AMR parsing task based on several multilingual resources, such as Wikipedia, BabelNet 4.0 (Navigli and Ponzetto, 2010) and the DBpedia Spotlight API (Daiber et al., 2013) for wikification. The preprocessing steps consist of: i) lemmatization, ii) PoS tagging, iii) NER, iv) re-categorization of entities and senses, and v) removal of wiki links and polarity attributes. The postprocessing steps consist of restoring i) anonymized subgraphs, ii) wikification, iii) senses, and iv) polarity attributes. We give full details on pre- and postprocessing in Appendix A.
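A toy sketch of the entity re-categorization (anonymization) round trip; the real pipeline uses NER, Wikipedia and BabelNet rather than a hand-provided entity list, and the placeholder scheme here is illustrative:

```python
def anonymize(tokens, entities):
    """Replace recognized entity spans with typed placeholders and keep a
    mapping so the postprocessing step can restore the subgraphs."""
    out, mapping, i = [], {}, 0
    while i < len(tokens):
        for span, etype in entities:
            if tuple(tokens[i:i + len(span)]) == span:
                placeholder = f"{etype}_{len(mapping)}"
                mapping[placeholder] = span
                out.append(placeholder)
                i += len(span)
                break
        else:
            out.append(tokens[i])
            i += 1
    return out, mapping

def deanonymize(tokens, mapping):
    """Restore the original spans (postprocessing step i)."""
    return [w for t in tokens for w in mapping.get(t, (t,))]

tokens = "The city of Tel Aviv is near".split()
anon, mapping = anonymize(tokens, [(("Tel", "Aviv"), "CITY")])
restored = deanonymize(anon, mapping)
```

Training the seq2seq model on placeholders such as `CITY_0` reduces the open vocabulary of named entities, which is the point of the re-categorization step.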

Experiments
We now present a set of experiments for cross-lingual AMR parsing using different training techniques and the silver data we created (see Section 3.3). We discuss the results of our multiple settings and compare with previous approaches to cross-lingual AMR parsing.
Test bed We evaluate on Abstract Meaning Representation 2.0 - Four Translations (Damonte and Cohen, 2020), a corpus containing translations of the 1371 sentences of the test split of LDC2017T10 (AMR 2.0) into Chinese (ZH), German (DE), Italian (IT) and Spanish (ES). This dataset is designed for cross-lingual AMR parsing and is available to all LDC subscribers.
For the first approach, i.e., PARSENTS-SILVERAMR, inspired by previous work and for comparison purposes, we choose Europarl as the parallel corpus and predict the silver AMR graphs with an existing English AMR parser. (Note that Stanza does not provide a NER model for Italian.)
For the second approach, instead, i.e., GOLDAMR-SILVERTRNS, we choose AMR 2.0 as the gold dataset and translate its sentences into Chinese, German, Italian and Spanish. For German, Italian and Spanish, for both translating and back-translating the sentences we use the machine translation models made available by Tiedemann and Thottingal (2020, OPUS-MT). For Chinese, instead, since OPUS-MT does not provide translation models, we employ the released MASS (Song et al., 2019a) supervised neural translation models. Then, to filter out less accurate translations, we compute the cosine similarity between dense semantic representations of the original English sentence and its back-translated counterpart. To embed the sentences we use LASER (Artetxe and Schwenk, 2019b), a state-of-the-art model for sentence embeddings. Details on the number of instances per language and for each silver data approach are shown in Table 1.
Training configurations We conduct experiments following different training approaches:
• Zero-shot - the model is trained on English sentences only, relying on multilingual features, and is evaluated on all the target languages (henceforth ∅-shot).
• Language-specific -the model is trained only on target language data, i.e., DE, ES, IT or ZH, and evaluated in the same language.
• Bilingual -the model is trained on English data and one of either DE, ES, IT or ZH, and evaluated in the target language.
• Multilingual -the model is trained on data from all available languages per setting and evaluated on the target languages.
Systems We denote the variations of XL-AMR, based on the above training configurations, as XL-AMR_data, where data ∈ {par, trans, amr}: par refers to the data produced with the PARSENTS-SILVERAMR approach, trans to the GOLDAMR-SILVERTRNS approach, and amr to the AMR 2.0 English gold standard, while data+ refers to combining par or trans with amr. The only existing cross-lingual AMR parser from the literature to date is AMREAGER; we do not compare against translate-test pipelines, which translate the test sentences from the target language into English and parse the translations with an English parser, because such a setup performs English AMR parsing. We provide details of our model hyperparameters in Appendix C.
Results In Table 2 we show the Smatch score (github.com/snowblink14/smatch) of the models. This metric computes the degree of overlap between two AMR graphs. We point out the low scores of the ∅-shot models, i.e., XL-AMR_amr^∅ and XL-AMR_par+^∅, which perform worse than AMREAGER, especially on Chinese. However, XL-AMR_par+^∅ noticeably improves over XL-AMR_amr^∅, which can be explained by the fact that seq2seq models require a large amount of data in order to generalize. This is confirmed by a fine-grained analysis showing lower accuracy of XL-AMR_amr^∅ compared to XL-AMR_par+^∅ in concept identification, which, we recall, is a seq2seq module.
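In spirit, Smatch scores the overlap of the two graphs' triples; the sketch below is a simplification that assumes node variables are already aligned, whereas the real metric also searches over one-to-one variable mappings:

```python
def triple_f1(gold, pred):
    """F1 over (source, relation, target) triples, with variables assumed
    aligned; real Smatch searches the best variable mapping first."""
    gold, pred = set(gold), set(pred)
    matched = len(gold & pred)
    if matched == 0:
        return 0.0
    p, r = matched / len(pred), matched / len(gold)
    return 2 * p * r / (p + r)

gold = {("l", ":instance", "like-01"), ("l", ":ARG0", "i"),
        ("i", ":instance", "i"), ("l", ":ARG1", "t")}
pred = {("l", ":instance", "like-01"), ("l", ":ARG0", "i"),
        ("i", ":instance", "i"), ("l", ":ARG1", "x")}   # one wrong triple
score = triple_f1(gold, pred)
```

Here three of four gold triples are matched with three of four predicted ones, so precision and recall are both 0.75.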
Interestingly, the language-specific XL-AMR_par, even if trained on fewer instances, outperforms the ∅-shot models by a large margin. Moreover, it also surpasses AMREAGER, which is trained on the same sentences from Europarl. The results are further improved when jointly training in multiple languages, i.e., when using the multilingual and bilingual configurations. We attribute this improvement to the ability of a seq2seq model to learn better when provided with a larger training set. The domain of the Europarl data is very specific, which does not enable the model to generalize to sentences from other domains. In fact, the XL-AMR_par+ models significantly improve over the XL-AMR_par bilingual and multilingual models. We attribute the higher performance of XL-AMR_par+ to i) the larger training dataset, ii) training on different domains, and iii) the better quality of the data (AMR 2.0 is human-annotated).
The XL-AMR_trans models perform best: we note that the language-specific variants outperform the multilingual XL-AMR_trans models, in contrast to the behaviour of the XL-AMR_par models, suggesting that the addition of silver data in other languages is not beneficial. This may be due to the fact that the AMR graphs of the translated sentences are the same, and as a consequence the model does not access extra information. Moreover, the inclusion of translated sentences in other languages slightly harms performance. This is confirmed by removing the most distant language from the training set, in the multilingual (-ZH) model, which in turn achieves around 2 F1 points more than the multilingual version including Chinese. This can be further explained by the linguistic differences between Chinese and the other languages, which prevent them from benefiting from the inclusion of Chinese instances in the training set. However, when adding the English gold AMR 2.0 data, i.e., XL-AMR_trans+, the model benefits from the better quality of this dataset. In fact, the bilingual version of XL-AMR_trans+ is the best performing across the board in German, Spanish and Italian, surpassing AMREAGER by at least 14 F1 points and both XL-AMR_par and XL-AMR_par+ by at least 5 F1 points in each language. Interestingly, the best results in Chinese are achieved by the language-specific XL-AMR_trans, surpassing AMREAGER by 8 F1 points and the ∅-shot models by more than 17 F1 points. This is once again explained by the linguistic differences of Chinese as compared to the other languages, which render the additional data non-beneficial. Table 3 shows the fine-grained evaluation of AMREAGER and of our best performing models for each data creation approach, for which we use the evaluation tools of Damonte et al. (2017). The fine-grained results for AMREAGER are not reported in the original work; we therefore run the evaluation using their released models.
Our best model outperforms AMREAGER in all subtasks except for Negations in German and Named Entities in Chinese, which are prone to heuristic string-matching errors in the pre- and postprocessing procedures of our models. XL-AMR_trans+ achieves significantly higher performance in Reentrancies, Concepts and SRL in all the tested languages, compared to AMREAGER, thus demonstrating the effectiveness of our parser and data creation approaches.
In summary, translating the gold-standard training data, i.e., GOLDAMR-SILVERTRNS, leads XL-AMR to achieve higher performance than training on parallel sentences associated with silver AMR graphs, i.e., PARSENTS-SILVERAMR.

Qualitative Analysis
We manually check the predictions of XL-AMR in order to establish the nature of the mistakes, based on the Smatch score between the gold and predicted AMR graphs, and to determine their severity. Then, we observe how XL-AMR handles translation divergences, i.e., linguistic distinctions that make transfer across languages difficult (Dorr, 1994).
Several cases with low Smatch score are due to inconsistent translations of test set sentences into the target language, even though, we recall, the test set has been manually translated. This could be due to translator choices, but can lead to divergent meaning structures, e.g., Ich kann verstehen, wie Du Dich fühlst (DE) (I can understand how you are feeling) whose original English sentence from which the AMR graph is projected is I know what you're feeling. The gold AMR graph is thus not appropriate for the German sentence, due to the sentence's different meaning. Thus these mistakes are not due to the parser, but to the translations.
An interesting cause of drop in the Smatch arises from the prediction of concepts that are synonyms of the corresponding concepts in the gold graph, e.g., say-01 → state-01, stop-01 → halt-01, best friend → best mate, demand-01 → urge-01, etc. We notice that the predicted concepts (to the left of the arrow) are less specific than the gold concepts, yet somehow preserve the meaning. These examples show that the parser captures a close meaning even when failing to predict the exact concept.
Translation divergences We investigate how XL-AMR deals with cases where there exist translation divergences, i.e., cases in which the source and target language have different syntactic ordering properties (Dorr, 1990), as classified by Dorr (1994) into the following seven categories: i) thematic, ii) promotional, iii) demotional, iv) structural, v) conflational, vi) categorial, vii) lexical. A thematic divergence happens when the argument-predicate structure differs across languages, e.g., I like travelling, where I is the subject, in Italian becomes Mi piace viaggiare, where Mi is now the object. XL-AMR overcomes this divergence and predicts the correct AMR, (l / like-01 :ARG0 (i / I) :ARG1 (t / travel :ARG0 i)).
Promotional and demotional divergences can be merged into the head switching macro-category. They arise when a modifier in one language is promoted to a main verb in the other, or vice versa, e.g., John usually goes home is Juan suele ir a casa (John is accustomed to go home) in Spanish. XL-AMR correctly parses the sentence into (g / go-01 :ARG0 (p / person :name (n / Juan)) :ARG4 (h / home) :mod (u / usual)).
A structural divergence exists when a verbal object is realized as a noun phrase (NP) in one language and as prepositional phrase (PP) in the other, e.g., I saw John where John is NP, is translated as Vi a Juan (I saw to John) in Spanish where a Juan is PP. This also is not a problem for our parser, which predicts the correct graph, (s / see-01 :ARG0 (i / I) :ARG1 (p / person :name (n / Juan))).
A conflational divergence refers to the translation of two or more words in one language into one word in the other. The above errors in German compound words fall into this category and our model does not handle them properly. However, in the other languages this problem is not common, e.g., I fear translates into Io ho paura (I have fear) in Italian and the parser correctly predicts the AMR graph, (f / fear-01 :ARG0 (i / I)).
A categorical divergence arises when the same meaning is expressed by different syntactic categories across languages, e.g., I agree, where agree is a verb, is expressed by a noun in Italian and Spanish, Sono d'accordo and Estoy de acuerdo. The parser correctly predicts the same AMR for both languages, (a / agree-01 :ARG0 (i / I)).
A lexical divergence arises when a verb in the source language is translated with a different lexical verb, e.g., John broke into the room becomes Juan forzó la entrada al cuarto, in which the English verb break is translated with the Spanish verb forzar (force). XL-AMR predicts (f / force-01 :ARG0 (p / person :name (n / Juan)) :ARG2 (e / enter-01 :ARG0 p :ARG1 (r / room))) for the Spanish sentence which, even though it is correctly parsed, does not overcome the lexical difference of the action; this results in different AMR graphs for the same meaning. This is partially due to the fact that AMR is bound to English lexical forms.
In summary, XL-AMR overcomes most of the foregoing structural divergences, with the exception of two cases: i) the conflational divergence in German, caused by the language's compound vocabulary, whose resolution could benefit from better preprocessing; ii) the lexical divergence, which persists even though the parser predicts a valid graph. The latter results in non-parallel structures for parallel meanings; we believe this might be tackled by integrating a unified ontology of synonyms or related meanings into the AMR formalism, along the lines of disjunctive AMR 12 (Banarescu et al., 2013). We leave the exploration of this approach to future work.

Conclusion
We explored transfer learning techniques to enable high-performance cross-lingual AMR parsing. We created silver data based on annotation projection through parallel sentences and machine translation, on which we trained XL-AMR, a cross-lingual AMR parser that achieves the highest results reported to date on Chinese, German, Italian and Spanish. A qualitative evaluation showed that XL-AMR is able to handle most of the structural divergences among languages. The performance of XL-AMR, together with the qualitative analysis, suggests that carefully modeling cross-lingual AMR parsing leads to the production of suitable AMR structures across languages. It would therefore be promising to extend this line of research to exploit larger multilingual semantic resources, in order to further improve parsing quality. These AMR representations could then be integrated into downstream cross-lingual tasks to investigate their added value.

A Postprocessing

[...] spans created during preprocessing. Then wiki links are restored using the DBpedia Spotlight API 14 (Daiber et al., 2013), commonly used in English AMR parsing (van Noord and Bos, 2017; Ge et al., 2019). It provides models for multiple languages, except Chinese, for which we use Babelfy (Moro et al., 2014). Since the wiki links identified by the DBpedia Spotlight API are language-specific to the text, we further use Wikipedia inter-language links to retrieve the corresponding wiki links for the English entities. We restore senses as the most frequent sense of the predicate in the training data (using -01 if unseen), similarly to Lyu and Titov (2018), and finally restore polarity attributes based on heuristic rules observed on the training data and linguistic rules specific to each language (included in the released code).
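The most-frequent-sense restoration step can be sketched as follows (function names are illustrative, not taken from the released code):

```python
# Restore predicate senses: pick the most frequent sense seen in the
# training data, falling back to the -01 sense for unseen predicates.
from collections import Counter
from typing import Dict, Iterable


def most_frequent_senses(training_predicates: Iterable[str]) -> Dict[str, str]:
    """Map each predicate lemma to its most frequent sense in the training data."""
    counts = Counter(training_predicates)  # e.g. {"go-01": 120, "go-02": 7}
    best: Dict[str, str] = {}
    for pred, n in counts.items():
        lemma = pred.rsplit("-", 1)[0]
        if lemma not in best or n > counts[best[lemma]]:
            best[lemma] = pred
    return best


def restore_sense(lemma: str, table: Dict[str, str]) -> str:
    """Unseen predicates default to the -01 sense, as described above."""
    return table.get(lemma, lemma + "-01")
```

The same table-lookup-with-default pattern extends naturally to the polarity heuristics, which the released code implements per language.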

B OpusMT Translation Models
For the translation and back-translation steps of the GOLDAMR-SILVERTRNS data creation approach, we use the pretrained models 15 from the Hugging Face transformers library, 16 listed in Table 4.
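The translation and back-translation steps can be sketched with the standard MarianMT API; the helper functions below are illustrative (only the "Helsinki-NLP/opus-mt-{src}-{tgt}" checkpoint naming scheme and the MarianMT interface are assumed):

```python
# Sketch of translation / back-translation with pretrained OpusMT models.
from typing import List


def marian_model_name(src: str, tgt: str) -> str:
    """Hub id of the pretrained OpusMT model for a language pair."""
    return f"Helsinki-NLP/opus-mt-{src}-{tgt}"


def translate(sentences: List[str], src: str, tgt: str) -> List[str]:
    """Translate a batch of sentences from src to tgt."""
    # Imported lazily so the sketch can be read without transformers installed.
    from transformers import MarianMTModel, MarianTokenizer

    name = marian_model_name(src, tgt)
    tokenizer = MarianTokenizer.from_pretrained(name)
    model = MarianMTModel.from_pretrained(name)
    batch = tokenizer(sentences, return_tensors="pt", padding=True)
    outputs = model.generate(**batch)
    return [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]


def back_translate(sentences: List[str], src: str, tgt: str) -> List[str]:
    """Round trip src -> tgt -> src, pairing silver sentences with gold AMRs."""
    return translate(translate(sentences, src, tgt), tgt, src)
```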

C Model Hyperparameters
The input features for all the models include: i) fixed mBERT 17 (Devlin et al., 2019) contextual embeddings (dim = 768); ii) ConceptNet Numberbatch 19.08 18 (Speer et al., 2017) multilingual static word embeddings (dim = 300), which we set as trainable except in the ∅-shot models; iii) trainable PoS embeddings (dim = 100), using the universal PoS tag set of Petrov et al. (2012); iv) trainable anonymization indicator embeddings (dim = 50); v) trainable character-level embeddings (dim = 100), i.e., CharCNN (Kim et al., 2016). The encoder and decoder of the node prediction module are composed of 2 layers of 512 and 1024 LSTM units, respectively. All models are trained with the Adam optimizer (Kingma and Ba, 2015) with learning rate 0.001 for 120 epochs, and the best model hyperparameters are chosen on the basis of development set accuracy. The models are trained on 1 GeForce GTX TITAN X GPU; full training takes around 48 hours for the models trained on the largest datasets, XL-AMR trans+ (∼84M trainable parameters) and XL-AMR par+ (∼86M trainable parameters). At prediction time we set the beam search size to 5.

Footnotes:
14 github.com/dbpedia-spotlight/spotlight-docker
15 6-layer Transformer-based models (Vaswani et al., 2017).
16 huggingface.co/transformers/model_doc/marian.html
17 bert-base-multilingual-cased: a contextualized embedding for a token is calculated as the average pooling of its subtoken embeddings.
18 github.com/commonsense/conceptnet-numberbatch
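The feature concatenation and encoder dimensions above can be sketched in PyTorch; class name, vocabulary sizes and the unidirectional LSTM are illustrative assumptions, and the mBERT, Numberbatch and CharCNN vectors are assumed to arrive precomputed:

```python
# Sketch of the input feature concatenation feeding the node-prediction encoder.
import torch
import torch.nn as nn


class NodeEncoder(nn.Module):
    def __init__(self, pos_vocab: int = 20):
        super().__init__()
        self.pos_emb = nn.Embedding(pos_vocab, 100)  # iii) PoS embeddings (dim 100)
        self.anon_emb = nn.Embedding(2, 50)          # iv) anonymization flag (dim 50)
        # i) mBERT 768 + ii) Numberbatch 300 + iii) 100 + iv) 50 + v) CharCNN 100
        self.encoder = nn.LSTM(768 + 300 + 100 + 50 + 100,
                               hidden_size=512, num_layers=2, batch_first=True)

    def forward(self, bert, numberbatch, pos_ids, anon_ids, char_feats):
        x = torch.cat([bert, numberbatch, self.pos_emb(pos_ids),
                       self.anon_emb(anon_ids), char_feats], dim=-1)
        out, _ = self.encoder(x)                     # (batch, seq, 512)
        return out
```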