Introduction

One of the core components of language comprehension is accessing the stored meanings of the individual words in a sentence or discourse and combining them to construct a global interpretation of the speaker’s message. Many critical questions about this process—questions about speed, automaticity, interactivity, and neural implementation—have been explored in the last 30 years through work with the ERP component known as the N400, which is modulated by a vast array of lexical and contextual factors thought to influence the processing of meaning (see [1], for review). This work has been so important and influential that the N400 response is familiar not only to specialists in the electrophysiology of language, but to many in the fields of cognitive psychology and cognitive neuroscience at large, appearing in broader reviews and textbook chapters. In this larger realm, the N400 is most often characterized as a response to ‘semantic anomaly’ [2], ‘semantic violation’ [3], ‘semantic incongruity’ [4] or ‘semantic mismatch’ [5] when the current word doesn’t ‘make sense’ in the context. In keeping with this, the example most often cited in textbooks comes from Kutas and Hillyard’s seminal 1980 work first reporting contextual modulation of the N400 [6], in which they compared the response to congruous and incongruous sentence endings such as It was his first day at work and He spread the warm bread with socks.

However, much subsequent work has shown that a different but strongly correlated factor has an independent impact on N400 amplitude: lexical or conceptual predictability [7]. At a broad level, three possible accounts have been pursued: (1) N400 effects of predictability and congruity might both be generated by a ‘combinatorial’ process like semantic integration [8, 9, 10]; (2) N400 effects of predictability and congruity might both be generated by an ‘access’ process like lexical/conceptual network activation [11, 12]; or (3) N400 effects of predictability might be generated by an access process and N400 effects of congruity might be generated by a combinatorial process (a ‘multiple generator’ account; [13, 14]). Discriminating between these accounts is of critical importance because the mechanism assumed to be driving N400 effects governs the conclusions about cognitive and neurocognitive models that are drawn on the basis of N400 results.

In the current work we report three ERP experiments that argue against the first idea, that N400 effects of predictability directly reflect a combinatorial process such as integration difficulty, and are instead consistent with the latter two accounts, in which N400 effects of predictability reflect some process distinct from the mechanisms invoked by semantic incongruity, such as long-term memory access. This conclusion converges with the results of many earlier studies that we review below. A broader consequence of all of these results is that the common characterization of the N400 as primarily a ‘semantic anomaly’ response is misleading and fundamentally incorrect. Although the current results do not show that the N400 response cannot be modulated by semantic incongruity alone, they indicate that the N400 response can be very strongly modulated by processes other than those elicited by semantic incongruity. Although this observation is not a new one, these results also serve as a useful reminder that the amplitude of the N400 should not be used by psycholinguists as a reliable and unambiguous indicator that comprehenders have computed a full message-level interpretation of the sentence and/or detected that such an interpretation violates semantic or world knowledge.

The N400 Response

The N400 is a negative deflection in the ERP that peaks at around 400 ms after stimulus presentation and is largest over centro-parietal sites, with a slightly rightward focus when visual presentation is used [15, 16]. This component first came to the attention of the field with Kutas and Hillyard’s study [6]. As noted above, their classic example sentences were:

(congruous) It was his first day at work.
(incongruous) He spread the warm bread with socks.

Kutas and Hillyard observed a large negative deflection in response to words that were incongruous in a sentence context, whereas the words that were congruous in the sentence context showed no negative deflection at all. Because of the strikingly anomalous interpretation associated with the incongruous sentences and the complete absence of a negative deflection for the congruous completions, Kutas and Hillyard’s first hypothesis was that ‘N400 is not a general response to all linguistic or meaningful stimuli … Rather, the N400 seems to reflect the interruption of ongoing sentence processing by a semantically inappropriate word’. There are various ways of realizing this intuitive description mechanistically—the increased neural activity reflected by the N400 might index a reanalysis process, an inference process, a simulation process—but all of them are computations operating on a combinatorial representation such as the sentence meaning or the discourse model which integrates multiple stored lexical meanings or concepts, and therefore we refer to this kind of account as the ‘integration’ account.

However, in another landmark study four years later, Kutas and Hillyard [7] reported evidence that seemed to cast serious doubt on this earlier hypothesis. They showed that large N400 responses could be observed for words that were completely congruous in their context (‘Don’t touch the wet dog’). The variation in N400 amplitude associated with different contexts appeared instead to be due to the degree to which the context predicted the target, as in this comparison:

(predicted) Don’t touch the wet paint
(not predicted) Don’t touch the wet dog

Kutas and Hillyard [7] observed that N400 amplitude appeared to vary parametrically with the degree of predictability, such that a negative deflection was observed in all but the most predicted condition, but where the size of this negative deflection was inversely related to predictability. They noted that predictability could also have explained their earlier results [6], as the congruous words were probably more predictable than the incongruous words. Therefore, Kutas and Hillyard [7] concluded that ‘These results are in agreement with the hypothesis that the N400 component reflects the extent to which a word is semantically primed, rather than its being a specific response to contextual violations’ (p. 163). In other words, these data opened up the possibility of an alternative to the integration account. Instead of N400 activity being driven by a process operating on combinatorial representations, N400 activity might rather reflect the neural processes engaged in simple memory access of words and/or concepts, such that the preceding combinatorial representation of the context would play the indirect role of priming some of those words or concepts ahead of time such that their memory access was facilitated. For example, when a word is encountered in isolation, many orthographic or phonological neighbors may also initially be activated as competitors before the correct word is selected (e.g., [17, 18]), and multiple senses of the word and associated conceptual features may also be activated, such that a relatively broad portion of the lexical/conceptual memory network is initially active. However, when a word is encountered in a predictive context, pre-activation may result in more selective activation of the correct lexical representation and the corresponding conceptual features that are appropriate to that context, such that a relatively small portion of the network is activated. We refer to this kind of account as the ‘access’ account.

Many subsequent studies replicated and extended these results. In one example, Federmeier and Kutas [19] found that sentence completions that were equally semantically incongruous showed differences in N400 amplitude that corresponded to how closely related the incongruous completion was to the contextually predictable ending (e.g. He caught the pass and scored another touchdown. There was nothing he enjoyed more than a good game of monopoly/baseball). Relatedly, Hoeks et al. [20] showed that sentence completions that were equally semantically incongruous showed N400 differences as a function of how related the incongruous word was to other words and events in the context (e.g. [Dutch] The bread has the baker baked/summarized). More generally, a large body of ERP language studies has shown that it is not only anomalous words that elicit a negative deflection peaking at 400 ms, but in fact almost all words show this response, whether they are presented in isolation or in context (for review see [21, 22]).

None of these results provided clear evidence that semantic integration processes do not also modulate the N400 independent of memory access processes, and in fact several studies have argued that they do [13, 23, 24, 25, 26]. These have motivated multiple generator accounts in which both access and integration processes contribute to the amplitude of the N400 response. However, others have gone further to argue that all the existing N400 data could still receive a unitary integration account. One might take the perspective that even a single word must be integrated into its (non-linguistic) context [27], and as Van Berkum et al. [10] pointed out, the factors that make a word less predictable might also make it more difficult to integrate with the context even if it is not incongruous.

Perhaps the best existing evidence against a unitary integration difficulty account of the N400 comes from a study by DeLong, Urbach, and Kutas [28], which showed differences in N400 amplitude corresponding to the predictability of the semantically empty a/an alternation conditioned on the predictability of the subsequent noun (e.g. The day was breezy so the boy went outside to fly a/an …), as it is difficult to argue that a should be easier to semantically integrate with the prior context than an. Similar work has examined the impact of gender or animacy morphemes on a neutral prenominal element that match or mismatch a predicted noun (e.g. in Spanish When the king died the prince could finally wear the-masc/fem crown-fem, from Wicha et al. [29]); some of these studies have similarly shown N400 effects on the prenominal element [30] although in other cases the prenominal predictability effect has a different polarity or distribution [10, 29, 31]. In a different kind of design, Lau et al. [32] show that the same semantic relation between a prime and a target (e.g. salt-pepper) results in a much larger N400 reduction when the experimental context encourages prediction than when it does not, which is also not obviously predicted by a unitary integration difficulty account.

Despite these kinds of results, the initial idea that the N400 primarily reflects the response to semantic integration difficulty as in semantic incongruity has had a strong and long-lasting impact on the field, with important theoretical consequences. In an early example, Fischler and colleagues [33] argued that sentential negation takes considerable time to process because A sparrow is not a bird demonstrated a smaller N400 than A sparrow is not a vehicle. However, if the N400 is at least partially driven by lexical/conceptual access processes, this result can easily be explained by semantic priming or contextual prediction [34]. More recently, the lack of an N400 difference between role-reversal sentences such as ‘The meal was devoured’ and ‘The meal was devouring’ has been argued by many to constitute evidence that readers do not immediately recognize the semantic violation in the second sentence [35, 36]. Again, however, if the N400 indexes lexical/conceptual access processes, this result could also be easily accounted for by the high semantic association between the argument(s) and the verb independent of argument structure [20, 37, 38, 39]. While these cases illustrate the incorrect conclusions that may be drawn if the N400 is erroneously taken as a simple index of semantic integration difficulty, another consequence is that, for many years, less effort was devoted to identifying other neural indices of sentence- and discourse-level integration and interpretation [38, 40, 41, 42].

The Current Study

The discussion above illustrates both the importance of establishing the processes that generate N400 effects and the challenges in doing so. One of the key open questions is the extent to which semantic incongruity modulates the N400 when unconfounded from predictability and semantic association, and whether the processes driving this semantic congruity effect are the same as the processes driving N400 effects of predictability. In the current study we investigated these questions using an adjective-noun phrase paradigm.

Constructing materials that unconfound predictability and congruity in full sentences is challenging because in order to minimize the effects of contextual predictability, incongruous completions must be compared with extremely low probability congruous completions. However, it is difficult to accurately estimate probabilities on the lower end of the probability range for sentences with standard materials norming using the Cloze sentence completion task [43], because this task encourages participants to respond with their most preferred completion. If individuals in fact maintain a probability distribution over possible endings, less probable endings may be underrepresented by the Cloze task (e.g., [44]). Longer sentence contexts could also contain semantic associates to the critical word. A recent sentence study by DeLong et al. [25] illustrates some of these challenges. While this study was targeted at dissociating frontal and posterior late positivities, DeLong et al. showed that the N400 response to implausible sentence continuations was slightly but significantly larger than the response to unpredictable sentence continuations. However, they also noted that a small but significant difference in cloze for the two sentence continuations could also have been responsible for this difference.

In the current study we directly investigated whether semantic incongruity would result in an N400 effect parallel to the predictability effect by using an adjective-noun paradigm that allowed us to more precisely estimate these probabilities with corpus counts [45]. In order to minimize effects of predictability on our congruity comparison, we compare incongruous adjective-noun combinations to congruous adjective-noun combinations in which the probability of the noun given the adjective is very low (p < .005) (note that it is the particular noun that is not predictable; in English, adjectives strongly predict that the word class of the following word is a noun). To create balanced congruous and incongruous sets, we crossed animate nouns and inanimate nouns with adjectives that must modify animate nouns and with adjectives that usually modify inanimate nouns, as in Table 1. Although it is likely impossible to create semantically congruous items that are not slightly more predictable or more semantically associated than completely incongruous items, the minimal contexts in this design allowed us to come reasonably close.

Table 1

Examples of materials in Experiments 1–3.


Manipulation      Condition                  Examples
Predictability    Congruous Predictable      runny nose; mashed potato
Predictability    Congruous Unpredictable    dainty nose; shredded potato
Congruity         Congruous Unpredictable    yellow bag; healthy cat
Congruity         Incongruous Unpredictable  innocent bag; empty cat

Previous authors (e.g., [46, 41]) have pointed out that ‘semantic anomaly’ or ‘semantic incongruity’ are vague terms that can be taken to refer to various properties, from violations of world knowledge (e.g., in our world, people don’t usually put socks on bread) to mismatches between formal semantic features (e.g. inanimate entities clash with predicates that require animacy). We agree, and are not committed to a particular ontology of semantic well-formedness. Our incongruous items almost certainly vary in the properties that lead them to feel anomalous (e.g. one might argue that a bag cannot be innocent, but that a cat could be empty, much like socks can be put on bread). Given the flexibility of language (e.g. the common use of metaphors, irony, and jokes; the ability to discuss fantastical or impossible worlds), it is difficult to construct grammatical examples that cannot be assigned some kind of coerced or accommodated interpretation. For example, in an appropriate ‘cartoon’ context, an animate predicate like falling in love is non-anomalous and perhaps even predictable given an inanimate subject like peanut [47]. However, we think this also holds for the kinds of incongruous sentences classically used in N400 paradigms (He spread the warm bread with socks) to the same extent as the adjective-noun sequences used here. While it could be the case that participants automatically and rapidly computed this kind of accommodation during the ERP experiment, this would be the kind of integration mechanism that the current experiment was designed to investigate. In the current study, our goal is to test the existing hypothesis that the N400 is impacted by semantic anomaly, and since this term has not received a precise definition in prior literature, we aimed to test items that would fit the conventional understanding of this term and which participants confirm to ‘not make sense’ in the absence of other context.

Several recent EEG and MEG studies have used adjective-noun manipulations to examine prediction and semantic combination mechanisms, as these designs make it possible to maintain tight control over the relevant (single-word) context being manipulated. In a series of MEG studies, Bemis and Pylkkänen (e.g. [48, 49]) have contrasted adjective-noun sequences such as red boat with noun-only sequences such as xkq boat in order to isolate the brain areas involved in semantic and syntactic combination. In another recent MEG experiment, Fruchter and colleagues use adjective-noun sequences varying in predictability to demonstrate neural activity associated with predicting the noun in the time-window prior to noun presentation [50]. These data confirm that single adjectives are enough to engender prediction of the subsequent noun, even outside of sentence contexts.

Most relevant for the current study, Molinaro and colleagues used EEG to examine the response to neutral, atypical, and anomalous noun-adjective sequences (monstruo solitario, monstruo hermoso, and monstruo geográfico, trans.: lonely monster, lovely monster, and geographical monster) embedded in sentences in Spanish [24]. Importantly, in all three conditions they used adjectives that elicited almost zero cloze in an offline sentence completion task, and yet still demonstrated a larger N400 response for the anomalous condition relative to the other two conditions. Therefore, these results appear to constitute evidence that N400 amplitude in fact does index semantic anomaly when predictability is controlled for. However, one potential caveat about this conclusion has to do with whether participants in the offline completion task made different predictions than the participants in the ERP study. Because adjectives are optional modifiers that come after the noun in Spanish, only 5% of offline completions contained any adjective at all; therefore the zero cloze for the critical adjectives reflected not that different adjectives were predicted but that no adjective was predicted. But in the ERP study, all of the experimental items and many of the fillers contained nouns modified by adjectives. While Molinaro et al. [24] report that participants did not report being conscious of any particular construction, participants may have nonetheless implicitly recognized this regularity [51] such that in the experiment they would expect that a noun would be followed by an adjective. If semantic features of the neutral and atypical adjectives were predicted slightly more often than features of the anomalous adjectives given the noun during the ERP experiment, then this could explain the difference in N400 amplitude observed in that study. In the current study, the word category order of English ensured that participants were very likely to predict some noun after the adjective, and presenting the adjective-noun sequences in isolation allowed us to estimate predictability purely on the basis of the adjective (at the cost of less naturalistic presentation).

Here we report the results of three ERP experiments. Experiment 1 was designed to confirm that N400 effects of predictability are observed in isolated adjective-noun sequences, where predictability is computed through corpus counts rather than offline cloze tasks. Experiment 2 investigated whether semantic incongruity alone could affect N400 amplitude when items were uniformly unpredictable, as in Table 1. Experiment 3 directly contrasted the predictability and semantic incongruity effects of Experiments 1 and 2 in a within-subjects design. Observing robust, parallel N400 effects of both predictability and semantic incongruity when they are independently manipulated would support an account in which all N400 effects are generated by a common semantic integration mechanism. Observing that N400 amplitude is primarily sensitive to contextual predictability and less so to semantic incongruity would be consistent with an account in which all N400 effects are generated by a common lexical/conceptual access mechanism, but it would also be consistent with a multiple generator account according to which semantic integration mechanisms are only operative in full sentence contexts. Importantly, however, such a pattern of results would argue against a unitary integration account (if integration were driving predictability effects and it only occurred in sentence contexts, predictability effects shouldn’t be observed in phrasal contexts).

Experiment 1

Materials

Adjective-noun pairs were selected from the Corpus of Contemporary American English (COCA; Davies, 2009 [52]). As a first step, we identified 120 highly constraining adjectives using the following procedure. We extracted all adjective-noun bigrams from the full list of bigrams in COCA (2012 version [52]), and next selected the subset of these bigrams for which p(noun | adjective) > .50 and which appeared at least 10 times in the corpus. Items that were judged unlikely to be familiar to our participant population (e.g. peroneal nerve, gordian knot), items that were too constraining (for which only one noun seemed felicitous, e.g. iodized salt), and items that contained repetitions of words used in other items were excluded from the set, resulting in 120 highly constraining adjectives.

120 high probability items were created by combining the strongly constraining adjectives (e.g. runny) with the noun with which they most frequently occurred (e.g. nose), such that p(noun | adjective) > .50. 120 low probability items were created by pairing these same nouns (e.g. nose) with weakly constraining adjectives (e.g. dainty) for which p(noun | adjective) < .02, and for which the maximum p(noun | adjective) across all nouns was less than .15. Candidate adjectives that satisfied these criteria were automatically identified for each item, and when more than one such adjective existed, one was selected by hand. The stimulus properties for this comparison are presented in Table 2. Materials from Experiment 1 are available in Supporting Information.
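The selection criteria described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the counts are invented toy values rather than COCA data, the function names are hypothetical, and p(noun | adjective) is assumed to be the bigram count divided by the adjective's total count across all adjective-noun bigrams.

```python
from collections import defaultdict

# Toy adjective-noun bigram counts (invented for illustration; not COCA data).
bigram_counts = {
    ("runny", "nose"): 65,
    ("runny", "egg"): 20,
    ("runny", "yolk"): 15,
    ("dainty", "nose"): 1,
}
# Spread the remaining 'dainty' tokens evenly over nine hypothetical nouns.
for noun in ["hand", "foot", "step", "cup", "lace", "dish", "bow", "girl", "flower"]:
    bigram_counts[("dainty", noun)] = 11


def conditional_probs(counts):
    """Return p(noun | adjective) for every (adjective, noun) bigram."""
    adjective_totals = defaultdict(int)
    for (adjective, _noun), count in counts.items():
        adjective_totals[adjective] += count
    return {
        (adjective, noun): count / adjective_totals[adjective]
        for (adjective, noun), count in counts.items()
    }


def adjective_constraint(probs, adjective):
    """Maximum p(noun | adjective) across all nouns following the adjective."""
    return max(p for (adj, _noun), p in probs.items() if adj == adjective)


def is_high_probability(pair, counts, probs):
    """High probability item: p(noun | adjective) > .50 and at least 10 tokens."""
    return probs[pair] > 0.50 and counts[pair] >= 10


def is_low_probability(pair, probs):
    """Low probability item: p(noun | adjective) < .02 and constraint < .15."""
    return probs[pair] < 0.02 and adjective_constraint(probs, pair[0]) < 0.15


probs = conditional_probs(bigram_counts)
print(is_high_probability(("runny", "nose"), bigram_counts, probs))
print(is_low_probability(("dainty", "nose"), probs))
```

With these toy counts, runny nose passes the high probability filter and dainty nose passes the low probability filter. The hand selection among multiple candidate adjectives is not modelled here.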

Table 2

Stimulus Properties for Experiment 1.

Property                                                     High Probability   Low Probability
p(noun | adjective)                                          0.650 (.14)        0.007 (.005)
p(adjective | noun)                                          0.073 (.09)        0.0005 (.0007)
bigram frequency                                             649 (1107)         3 (5)
adjective constraint (max p(noun | adjective) across nouns)  0.650 (.12)        0.074 (.034)
adjective frequency                                          961 (1534)         618 (983)
noun frequency                                               15013 (18329)      (same nouns)
adjective length                                             8.3 (2.1)          8.2 (1.8)
noun length                                                  5.5 (1.9)          (same nouns)

Stimulus properties for the 120 pairs of items used to instantiate the probability comparison in Experiment 1, all derived from the Corpus of Contemporary American English. Standard deviations are presented in parentheses. Simple frequencies represent the total number of occurrences of the lemma across all adjective-noun bigrams that occurred in COCA.

We note that in the current design, the bigram frequency of the high probability condition was much higher than that of the low probability condition. This was not a primary concern here because the probability comparison in the current experiment was mainly aimed at replicating the effects of probability that have been observed in sentence paradigms, where an analogous confound between lexical probability and probability of the overall event being described also holds; however, it does mean that this design cannot distinguish between activation and integration accounts of the N400, as discussed further below in the General Discussion.

The frequency of the adjectives in the preceding context also differed across conditions, such that adjectives in the high probability condition were more frequent than the adjectives in the low probability condition. Past authors have reported effects of lexical frequency on the N400 (e.g. [53, 54]). The stimulus-onset asynchrony between adjective and noun was long enough (600 ms) in the current study that frequency differences at the adjective appear unlikely to impact the time-windows of interest at the noun (which start 300 ms after noun onset, i.e. 900 ms after adjective onset). Nevertheless, we conducted pairwise comparisons in the 500–600 ms time-window following the adjective to confirm that this frequency difference on the adjective led to no baseline differences prior to the onset of the noun.

30 items from each of the two conditions were distributed across lists in a Latin Square design, and each list was presented in 4 different random orders for a total of 8 presentation lists. 260 additional low adjective-noun bigrams for which p(noun | adjective) < .50 were drawn from the subset of COCA bigrams and added to each set of 60 experimental items for a total of 320 items per list. Since these additional items are not relevant to the current question of interest, they are not reported here. All items were designed to be semantically congruous, and no participant saw any word more than once in an experimental session.

Participants

Participants were University of Maryland students who participated in the study for monetary compensation. Prior written consent was obtained from all participants according to the established guidelines of the Institutional Review Board of the University of Maryland. All participants were right-handed as assessed by the Edinburgh Handedness Inventory [55]. In total, 38 participants took part in the study, but two datasets were excluded due to excessive artifact, two datasets were excluded due to low accuracy (less than 60%) on the concurrent behavioral measure, and six datasets were excluded because after data collection it was discovered that the participants had significant exposure to a language other than English prior to the age of 5. Of the 28 participants whose datasets were included in the study, 19 were females and 9 were males, with a mean age of 21.3 years.

Procedure

Each experimental session was divided into 4 blocks, with 15 target items and 65 other items presented in each block. Participants were asked to complete a memory recognition test administered on paper after every block. This quiz consisted of 20 bigrams, of which 10 had appeared in the preceding block and 10 were mismatched adjective-noun pairs from the stimulus set. Participants were asked to circle the phrases that they remembered seeing during the previous experimental block.

During the experiment, participants were seated in a chair in a dimly lit room. Stimuli were visually presented on a computer monitor in white 24-point case Arial font on a black background. Each trial began with a fixation cross presented at the center of the screen for 700 ms, followed by a 200 ms blank screen. An adjective was then presented for 500 ms, followed by a 100 ms blank screen, then a noun was presented for 900 ms, followed by another 100 ms blank screen. Each participant began the experiment with a short practice session and was offered the opportunity to take a break between each testing block. In total, the stimulus presentation portion of the experiment lasted 20–25 minutes.
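The trial timing described above can be laid out as simple arithmetic. The sketch below is illustrative only: the event labels are hypothetical names, while the durations are those given in the Procedure. It derives the adjective-to-noun stimulus-onset asynchrony and expresses the 300–500 ms noun window relative to adjective onset.

```python
# Trial events and durations in ms, taken from the Procedure description.
# The event labels themselves are hypothetical names used for illustration.
TRIAL_EVENTS = [
    ("fixation", 700),
    ("blank_1", 200),
    ("adjective", 500),
    ("blank_2", 100),
    ("noun", 900),
    ("blank_3", 100),
]


def event_onsets(events):
    """Return the onset time (ms from trial start) of each event."""
    onsets = {}
    elapsed = 0
    for name, duration in events:
        onsets[name] = elapsed
        elapsed += duration
    onsets["trial_end"] = elapsed
    return onsets


onsets = event_onsets(TRIAL_EVENTS)

# Stimulus-onset asynchrony between adjective and noun.
soa = onsets["noun"] - onsets["adjective"]

# The 300-500 ms window at the noun, expressed relative to adjective onset.
n400_window_from_adjective = (soa + 300, soa + 500)

print(soa, n400_window_from_adjective, onsets["trial_end"])
```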

Electrophysiological Recording

Twenty-nine tin electrodes were held in place on the scalp by an elastic cap (Electro-Cap International, Inc., Eaton, OH) in a 10–20 configuration (O1, Oz, O2, P7, P3, Pz, P4, P8, TP7, CP3, CPz, CP4, TP8, T7, C3, Cz, C4, T8, FT7, FC3, FCz, FC4, FT8, F7, F3, Fz, F4, F8, FP1). Bipolar electrodes were placed above and below the left eye and at the outer canthus of the right and left eyes to monitor vertical and horizontal eye movements. Additional electrodes were placed over the left and right mastoids. Scalp electrodes were referenced online to the left mastoid and re-referenced offline to the average of left and right mastoids. Impedances were maintained at less than 5 kΩ for all scalp electrode sites, less than 2 kΩ for mastoid sites, and less than 10 kΩ for ocular electrodes. The EEG signal was amplified by a NeuroScan SynAmps® Model 5083 (NeuroScan, Inc., Charlotte, NC) with a bandpass of 0.05–100 Hz and was continuously sampled at 500 Hz by an analog-to-digital converter.

Analysis

Averaged ERPs time-locked to adjectives and nouns were formed off-line from trials free of ocular and muscular artifact using preprocessing routines from the EEGLAB [56] and ERPLAB [57] toolboxes. Across the 28 participants included in the analysis, approximately 12% of the trials were rejected because of artifact. A 100-ms prestimulus baseline was subtracted from all waveforms before statistical analysis, and a 40-Hz low-pass filter was applied to the ERPs offline. ERP data for this and the other two experiments in this paper are publicly available on the first author’s website (http://ling.umd.edu/~ellenlau/public_data_archive/N400_plaus_pred/).

Analyses were conducted on mean ERP amplitudes for the critical nouns in the 300–500 ms time-window in which the N400 effect is usually observed. In order to quantify the topography of the effects observed, we focused on a subset of 16 electrodes (left anterior: F7, F3, FT7, FC3; right anterior: F4, F8, FC4, FT8; left posterior: TP7, CP3, P7, P3; right posterior: CP4, TP8, P4, P8) and used R (R Development Core Team, 2010) to conduct a quadrant analysis consisting of a 2 × 2 × 2 (probability × anteriority × hemisphere) Type III SS repeated-measures ANOVA. Because N400 effects often peak at midline electrodes, we also conducted a 2 × 2 ANOVA (probability × anteriority) on the 6 midline electrodes (Fz, FCz, Cz, CPz, Pz, Oz). As adjectives differed across conditions, analyses were also conducted on mean ERP amplitudes for the preceding adjectives in the 500–600 ms time-window (baselined to the 100 ms prior to the adjective) in order to rule out baseline differences in the responses to the adjectives prior to presentation of the critical word.
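The dependent measure described above (baseline-corrected mean amplitude in the 300–500 ms window, averaged within a quadrant) can be sketched as follows. This is a minimal illustration, not the EEGLAB/ERPLAB pipeline used in the study: the data are synthetic, the function names are hypothetical, the epoch is assumed to start 100 ms before stimulus onset, and the window is treated as half-open (start included, end excluded), which is an assumption.

```python
# Electrode quadrants as listed in the Analysis section.
QUADRANTS = {
    "left_anterior": ["F7", "F3", "FT7", "FC3"],
    "right_anterior": ["F4", "F8", "FC4", "FT8"],
    "left_posterior": ["TP7", "CP3", "P7", "P3"],
    "right_posterior": ["CP4", "TP8", "P4", "P8"],
}
SAMPLE_STEP_MS = 2     # 500 Hz sampling rate
EPOCH_START_MS = -100  # assumed epoch start (100 ms prestimulus baseline)


def window_mean(epoch, start_ms, end_ms):
    """Mean of the samples whose time t satisfies start_ms <= t < end_ms."""
    values = [
        value
        for i, value in enumerate(epoch)
        if start_ms <= EPOCH_START_MS + i * SAMPLE_STEP_MS < end_ms
    ]
    return sum(values) / len(values)


def baseline_correct(epoch):
    """Subtract the mean of the 100 ms prestimulus interval from every sample."""
    baseline = window_mean(epoch, -100, 0)
    return [value - baseline for value in epoch]


def quadrant_mean(epochs_by_electrode, quadrant, start_ms=300, end_ms=500):
    """Average the windowed, baseline-corrected amplitude over a quadrant."""
    means = [
        window_mean(baseline_correct(epochs_by_electrode[name]), start_ms, end_ms)
        for name in QUADRANTS[quadrant]
    ]
    return sum(means) / len(means)


def toy_epoch(offset_uv, effect_uv):
    """Synthetic averaged ERP: constant offset plus an effect in 300-500 ms."""
    epoch = []
    for i in range(500):  # covers -100 ms to 898 ms
        t = EPOCH_START_MS + i * SAMPLE_STEP_MS
        epoch.append(offset_uv + (effect_uv if 300 <= t < 500 else 0.0))
    return epoch


toy_data = {name: toy_epoch(5.0, -2.0) for name in QUADRANTS["left_posterior"]}
print(quadrant_mean(toy_data, "left_posterior"))
```

Artifact rejection, filtering, and averaging over trials are deliberately left out; each per-participant, per-condition value produced this way would then enter the repeated-measures ANOVA.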

Results and Discussion

Total mean accuracy on the memory tests was 71.2% (mean d’ = 1.96). ERP waveforms are presented in Figure 1, and the scalp map in Figure 2(A) illustrates the topographical distribution of the probability effect in the N400 time-window.
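For reference, d′ is the signal detection sensitivity index, computed as z(hit rate) minus z(false alarm rate). The sketch below shows one common way to compute it for a quiz with old and new items. The text does not say how hit or false alarm rates of 0 or 1 were handled, so the log-linear correction used here is an assumption, and the function names are hypothetical.

```python
from statistics import NormalDist


def d_prime(hits, misses, false_alarms, correct_rejections):
    """Sensitivity index d' = z(hit rate) - z(false alarm rate).

    A log-linear correction (add 0.5 to each count, 1 to each total) keeps
    the rates away from 0 and 1; this correction is an assumption, not
    something reported in the text.
    """
    hit_rate = (hits + 0.5) / (hits + misses + 1)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1)
    z = NormalDist().inv_cdf  # inverse of the standard normal CDF
    return z(hit_rate) - z(fa_rate)


def accuracy(hits, misses, false_alarms, correct_rejections):
    """Proportion of quiz items classified correctly."""
    total = hits + misses + false_alarms + correct_rejections
    return (hits + correct_rejections) / total


# Hypothetical participant on one 20-item quiz (10 old items, 10 new items).
print(accuracy(8, 2, 2, 8))
print(round(d_prime(8, 2, 2, 8), 2))
```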

Figure 1 

ERP waveforms for Experiment 1. ERP waveforms for 28 scalp electrodes for the predictability contrast in Experiment 1. ERPs are time-locked to the onset of the critical noun.

Figure 2 

Topographical distributions for Experiments 1 and 2. Scalp maps and selected electrode waveforms demonstrating measured N400 effects of predictability and congruity during the 300–500 ms time-window in Experiments 1 and 2. Scalp maps illustrate the mean difference between low probability and high probability conditions in Experiment 1 (A) and between semantically incongruous and congruous conditions in Experiment 2 (B).

No significant effects of condition were observed in the 100 ms time-window prior to noun onset (ps > .1), suggesting that responses were well-matched prior to the presentation of the critical noun. In the 300–500 ms time-window the high probability items showed a reduced N400 response relative to the low probability items, as revealed by a main effect of probability (F(1,27) = 10.3, MSE = 7.4, p < .05; mean high probability = .77μV, mean low probability = –.39μV). This effect was largest over posterior electrodes, resulting in a significant interaction between probability and anterior-posterior distribution (F(1,27) = 12.6, MSE = 1.2, p < .05). We followed up with probability × hemisphere ANOVAs in anterior electrodes and posterior electrodes separately. We observed a significant main effect of probability in posterior electrodes (F(1,27) = 22.4, MSE = 3.6, p < .05; mean high probability = .98μV, mean low probability = –.71μV) but no significant effects involving probability in anterior electrodes (ps > .1). Similarly, in midline electrodes we observed a main effect of probability (F(1,27) = 15.2, MSE = 6.7, p < .05) and a significant interaction between probability and anteriority (F(1,27) = 7.3, MSE = 1.0, p < .05).

These results confirm that, just as in sentence paradigms in which contexts are more or less predictive of an upcoming word (as assessed by offline completions), N400 amplitude is strongly modulated by probability (as assessed by corpus counts) in a paradigm in which noun phrases are presented in isolation. In Experiment 2, we investigated whether N400 amplitude in this paradigm would similarly be modulated by semantic incongruity when probability (as assessed by corpus counts) was held relatively constant.

Experiment 2

Materials

The materials for Experiment 2 were also drawn from the Corpus of Contemporary American English (COCA; [52]); however, all congruous nouns had a low probability (p < .005) in their adjective context, and all adjectives were relatively unconstraining. Each noun was paired with a “well-fitted” adjective to create 80 semantically/pragmatically congruent phrases and a “poorly-fitted” adjective to create 80 semantically incongruent phrases. Specifically, this was accomplished by selecting 40 animate nouns, 40 inanimate nouns, 40 adjectives that seemed to us to describe a property that conventionally requires animacy (e.g. innocent), and 40 adjectives that seemed to us to describe a property that conventionally applies to inanimate objects (e.g. striped), and pairing each noun with one congruous adjective and one incongruous adjective. Items were distributed across two lists so that each participant saw each word exactly once. The stimulus properties are presented in Table 3. Note that a few of our incongruous items did actually occur in the corpus (e.g. yellow boy), such that the mean bigram frequency for the incongruous condition was not zero as might have been otherwise expected. Materials from Experiment 2 are available in Supporting Information.

Table 3

Stimulus Properties for Experiment 2.

Property                                                          Congruous      Incongruous
p(noun | adjective)                                               .002 (.001)    3e-5 (1e-4)
p(adjective | noun)                                               .007 (.023)    3e-5 (1e-4)
bigram frequency                                                  9 (6)          .3 (1.3)
adjective constraint (max p(noun | adjective) across all nouns)   .08 (.08)
adjective frequency                                               6475 (7069)
noun frequency                                                    9458 (17824)
adjective length                                                  6.9 (1.8)
noun length                                                       5.8 (1.9)

Mean stimulus properties for the 80 pairs of items used to instantiate the congruity comparison in Experiment 2, all derived from the Corpus of Contemporary American English. Standard deviations are presented in parentheses. Simple frequencies represent the total number of occurrences of the lemma across all adjective-noun bigrams that occurred in COCA. Note that only one value is provided for adjective frequency because the adjectives used in the congruous items for half of the participants were used in the incongruous items for the other half of participants.
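The corpus-derived measures in the table can be sketched as follows. The sketch assumes, following the table note, that each word's total frequency is summed over adjective-noun bigrams only, so that p(noun | adjective) is the bigram count divided by the adjective's total. The function names are hypothetical and the counts are invented for illustration; they are not COCA values.

```python
from collections import Counter

def conditional_probabilities(bigram_counts):
    """Return p(noun | adjective) and p(adjective | noun) for each bigram.

    bigram_counts maps (adjective, noun) -> corpus count. Word totals are
    summed over adjective-noun bigrams only (an assumption, see lead-in).
    """
    adj_total, noun_total = Counter(), Counter()
    for (adj, noun), count in bigram_counts.items():
        adj_total[adj] += count
        noun_total[noun] += count
    p_noun = {b: c / adj_total[b[0]] for b, c in bigram_counts.items()}
    p_adj = {b: c / noun_total[b[1]] for b, c in bigram_counts.items()}
    return p_noun, p_adj

def adjective_constraint(bigram_counts):
    """Largest p(noun | adjective) over all nouns, for each adjective."""
    p_noun, _ = conditional_probabilities(bigram_counts)
    constraint = {}
    for (adj, _noun), p in p_noun.items():
        constraint[adj] = max(constraint.get(adj, 0.0), p)
    return constraint

# Invented counts, for illustration only.
counts = {
    ("striped", "shirt"): 6,
    ("striped", "bass"): 3,
    ("striped", "tie"): 1,
    ("innocent", "bystander"): 8,
    ("innocent", "shirt"): 2,
}
p_noun, p_adj = conditional_probabilities(counts)
constraint = adjective_constraint(counts)
```

With these toy counts, "striped" is followed by "shirt" in 6 of its 10 occurrences, so p(shirt | striped) is .6, which is also the constraint of "striped".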

In order to confirm that our items did indeed differ in semantic congruity, we conducted an offline rating study with participants recruited through Amazon Mechanical Turk. Thirty-two participants were asked to rate noun phrases on a scale from 1 to 7 according to the degree to which they ‘made sense’. Each participant saw items from only one of the two lists, so that each participant saw each word exactly once. As expected, our congruous items were rated much higher (mean = 6.59, s.d. = .35, range of item means = 4.3–7.0) than our incongruous items (mean = 1.75, s.d. = .5, range of item means = 1.1–3.7), and the ratings differed significantly (t(1,31) = 43.7, MSE = .2, p < .05).

Participants

As in Experiment 1, participants were University of Maryland students who participated in the study for monetary compensation. Prior written consent was obtained from all participants. Participants adhered to the eligibility guidelines applied in Experiment 1, and had not participated in any prior studies using the same materials. In total, 38 participants took part in the study, but 10 datasets were excluded. Of the 10 datasets excluded, 4 were excluded due to excessive artifact, and 6 were excluded for accuracy below 60%. Of the 28 participants whose datasets were included in the study, 16 were females and 12 were males, with a mean age of 21.9 years.

Procedure

Because this experiment contained only 80 items, the materials were presented in one block. Participants were asked to complete a memory recognition test directly following the conclusion of the block. This quiz consisted of 20 bigrams, of which 10 had appeared in the preceding block and 10 were mismatched adjective-noun pairs from the stimulus set. For this experiment, we implemented the memory quiz electronically rather than on pencil and paper for easier post-processing. Participants were asked to respond with button presses on a keyboard to indicate whether or not they saw the phrases during the previous experimental block. The current experiment was presented as the second part of an EEG session in which the first part consisted of a sentence experiment described in Chow, Smith, Lau and Phillips [39]. The sentence experiment contained congruous and incongruous sentences and required participants to make congruity judgments.

The same presentation parameters were used as in Experiment 1. In total, the stimulus presentation portion of the experiment lasted 10–15 minutes.
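For readers unfamiliar with the sensitivity measure reported for the memory quiz, the sketch below shows one common way to score an old/new recognition test as accuracy and d′. The log-linear correction (adding 0.5 to each count) is an assumption made here so that perfect scores on a 10-item set do not yield infinite values; the text does not state which correction, if any, was applied.

```python
from statistics import NormalDist

def d_prime(hits, misses, false_alarms, correct_rejections):
    """d' = z(hit rate) - z(false-alarm rate), with a log-linear correction.

    Adding 0.5 to each count keeps both rates away from 0 and 1, where z is
    undefined. This correction is an assumption, not taken from the text.
    """
    hit_rate = (hits + 0.5) / (hits + misses + 1)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1)
    z = NormalDist().inv_cdf
    return z(hit_rate) - z(fa_rate)

def accuracy(hits, misses, false_alarms, correct_rejections):
    """Proportion of quiz items answered correctly."""
    total = hits + misses + false_alarms + correct_rejections
    return (hits + correct_rejections) / total

# Hypothetical participant on a 20-item quiz (10 old and 10 new phrases):
# 8 old phrases recognized, 3 new phrases wrongly accepted.
acc = accuracy(8, 2, 3, 7)
dp = d_prime(8, 2, 3, 7)
```

A participant responding at chance (equal hit and false-alarm rates) obtains d′ = 0 under this scoring.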

Electrophysiological Recording and Analysis

Experiment 2 used the same recording procedure as described in Experiment 1, with the exception that one dataset was erroneously sampled at 1000 Hz and resampled offline to 500 Hz with the EEGLAB toolbox.

Data were preprocessed following the same procedure described for Experiment 1. Four participants with fewer than 60% of trials surviving the artifact rejection procedure were excluded from further analysis and are not included in the 28-participant data set presented here. Across the 28 participants included in the analysis, approximately 7.3% of the trials were rejected because of artifact. A 100-ms prestimulus baseline was subtracted from all waveforms and a 40-Hz low-pass filter was applied to the data before statistical analysis.

As in Experiment 1, quadrant and midline analyses were conducted on mean ERP amplitudes for the critical nouns in the 300–500 ms time-window in which the N400 effect is usually observed. We also conducted analyses in the 600–800 ms time-window in which a late positivity is often observed to incongruous sentence continuations [58].

Results and Discussion

Total mean accuracy on the memory tests was 72.6% (mean d’ = 1.22). ERP waveforms are presented in Figure 3, and the scalp map in Figure 2(B) illustrates the topographical distribution of the congruity effect in the N400 time-window.

Figure 3 

ERP waveforms for Experiment 2. ERP waveforms for 28 scalp electrodes for the semantic congruity contrast in Experiment 2. ERPs are time-locked to the onset of the critical noun.

In the 300–500 ms time-window the incongruous items were more negative than the congruous items, resulting in a significant effect of congruity (F(1,27) = 4.0, MSE = 7.4, p = .05; mean congruous = –.64μV, mean incongruous = –1.36μV). However, the topographical distribution of this effect was more widespread than the probability effect of Experiment 1, such that there was no interaction between congruity and anterior/posterior distribution (p > .5). No other contrasts including congruity yielded significant effects. For comparison with Experiment 1 we followed up with congruity × hemisphere ANOVAs in anterior electrodes and posterior electrodes separately. These showed a significant main effect of congruity in anterior electrodes (F(1,27) = 4.6, MSE = 2.9, p < .05; mean congruous = –.79 μV, mean incongruous = –1.47 μV) and a marginal main effect of congruity in posterior electrodes (F(1,27) = 3.0, MSE = 5.5, p = .09; mean congruous = –.49 μV, mean incongruous = –1.26 μV), again suggesting that the effect of congruity was broadly distributed and if anything more robust in anterior electrodes. Similarly, in midline electrodes we observed a marginal main effect of congruity (F(1,27) = 3.4, MSE = 17.0, p = .08) and no interaction between congruity and anteriority (p > .1). In the 600–800 ms time-window there were no significant effects of congruity (ps > .1).

In summary, in Experiment 2 we examined the effect of semantic congruity on the amplitude of the N400 when probability was held relatively constant, such that the incongruous noun phrases that mostly did not occur in the corpus were compared with congruous noun phrases in which the probability of the noun given the adjective in the corpus was extremely low (< .005). We observed a small but significant effect of congruity in the N400 time-window. However, this effect had a different topographical distribution from what has classically been reported for the N400 effect and from the predictability effect observed in Experiment 1, as the difference between conditions was equally prominent in anterior and posterior electrodes. In the posterior electrodes where the N400 effect is typically strongest, the congruity effect size in Experiment 2 (~.75μV) was a little less than half as large as the probability effect size in Experiment 1 (~1.7μV).
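The comparison of effect sizes can be checked against the posterior-electrode condition means reported in the Results sections of Experiments 1 and 2. The short sketch below simply recomputes the two differences and their ratio from those rounded means; the variable names are ours, and small departures from the approximate values quoted above presumably reflect rounding.

```python
# Posterior-quadrant condition means (in microvolts) as reported in the text.
exp1_high, exp1_low = 0.98, -0.71                 # Experiment 1: high vs. low probability
exp2_congruous, exp2_incongruous = -0.49, -1.26   # Experiment 2: congruous vs. incongruous

probability_effect = exp1_high - exp1_low              # high minus low probability
congruity_effect = exp2_congruous - exp2_incongruous   # congruous minus incongruous
ratio = congruity_effect / probability_effect

print(f"probability effect: {probability_effect:.2f} uV")   # 1.69 uV
print(f"congruity effect:   {congruity_effect:.2f} uV")     # 0.77 uV
print(f"ratio:              {ratio:.2f}")                   # 0.46
```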

These results appear most consistent with a multiple generator account of the N400, in which some N400 effects are driven by access processes and some are driven by integration processes. If N400 effects of predictability and incongruity were both driven by integration difficulty, it’s not clear why one would expect the incongruity effects to be smaller, since incongruity is typically presented as the hallmark case of semantic integration difficulty. The difference in topographical distribution would also be unexplained. On the other hand, if all N400 effects were driven by access processes, even small N400 effects of incongruity should not be expected when predictability is effectively controlled. However, a multiple generator account of the N400 could easily capture the observed differences in effect size and distribution.

One concern about interpreting these results with respect to the results of Experiment 1 is that the two experiments used different sets of participants, who may have had inherent differences in the amplitude or topography of the N400 effect. In Experiment 3, we included both the probability manipulation and the semantic congruity manipulation in the same participants, in order to more directly compare the two effects and to attempt to replicate the marginally significant negativity of Experiment 2.

Experiment 3

Materials

All 80 item sets from Experiment 2 were included in Experiment 3. Because Experiment 1 included 120 item sets, a subset of 80 item sets was selected for Experiment 3. The stimulus properties for this subset of items are presented in Table 4. Materials from Experiment 3 are available in Supporting Information. Items were randomized (such that Experiment 1 and Experiment 2 items were completely mixed) and distributed across four lists in a Latin Square design so that each participant saw each word exactly once.

Table 4

Mean Stimulus Properties for Experiment 3.

Property                                                      High Probability   Low Probability
p(noun | adjective)                                           .664 (.12)         .006 (.004)
p(adjective | noun)                                           .091 (.10)         .0005 (.0008)
bigram frequency                                              812 (1299)         4 (6)
adjective constraint (max p(noun | adjective) across nouns)   .664 (.12)         .075 (.03)
adjective frequency                                           1188 (1786)        734 (1152)
noun frequency                                                14253 (19828)
adjective length                                              8.1 (1.9)          8.1 (1.8)
noun length                                                   5.6 (1.8)

Mean stimulus properties for the 80 pairs of items used to instantiate the probability comparison in Experiment 3, all derived from the Corpus of Contemporary American English. Standard deviations are presented in parentheses. Simple frequencies represent the total number of occurrences of the lemma across all adjective-noun bigrams that occurred in COCA. Note that the stimulus properties for the congruity comparison in Experiment 3 were identical to those in Experiment 2.

Participants

As in Experiments 1 and 2, participants were University of Maryland students who participated in the study for monetary compensation. Prior written consent was obtained from all participants. Participants adhered to the eligibility guidelines applied in Experiments 1 and 2, and had not participated in any prior studies using the same materials. In total, 38 participants took part in the study, but 10 datasets were excluded. Of the 10 datasets excluded, 7 were excluded due to excessive artifact, and 3 were excluded for accuracy below 60%. Of the 28 participants whose datasets were included in the study, 16 were females and 12 were males, with a mean age of 22.0 years.

Procedure

The materials were presented in two blocks of 80 items each. Participants were asked to complete a memory recognition test directly following the conclusion of each block. This quiz consisted of 20 bigrams, of which 10 had appeared in the preceding block and 10 were mismatched adjective-noun pairs from the stimulus set. Participants were asked to respond with button presses on a keyboard to indicate whether or not they saw the phrases during the previous experimental block. The current experiment was presented as the second part of an EEG session in which the first part consisted of a sentence experiment. The sentence experiment contained congruous and incongruous sentences and required participants to answer comprehension questions.

The same presentation parameters were used as in Experiments 1 and 2. In total, the stimulus presentation portion of the experiment lasted 10–15 minutes.

Electrophysiological Recording and Analysis

Experiment 3 used the same recording procedure as described in Experiments 1 and 2, except that data were recorded using a slightly different electrode configuration in which electrode Oz was replaced by electrode FP2. Data were preprocessed following the same procedure described for Experiments 1 and 2. Seven participants with fewer than 60% of trials surviving the artifact rejection procedure were excluded from further analysis and are not included in the 28-participant data set presented here. Across the 28 participants included in the analysis, approximately 20% of the trials were rejected because of artifact. A 100-ms prestimulus baseline was subtracted from all waveforms and a 40-Hz low-pass filter was applied to the data before statistical analysis.

As in Experiment 2, quadrant and midline analyses were conducted on mean ERP amplitudes for the critical nouns in the 300–500 ms time-window in which the N400 effect is usually observed and in the 600–800 ms time-window. Since Experiment 3 examined both probability and congruity manipulations within-subjects, we conducted an omnibus 2 × 2 × 2 × 2 (manipulation × contextual support × hemisphere × anteriority) analysis of variance on all four conditions and followed up with separate analyses for each manipulation, which were identical to the analyses conducted in Experiments 1 and 2, respectively. Since this electrode configuration had only 5 midline electrodes, we conducted the midline analysis on 4 of the 5 in order to maintain equal numbers of anterior/posterior observations (Fz, FCz, CPz, Pz).

Results and Discussion

Total mean accuracy on the memory tests was 74.8% (mean d’ = 1.37). ERP waveform plots depicting the two pairwise-comparisons are presented in Figures 4 and 5; Figure 6 illustrates the topographical distribution of the probability effect and the congruity effect in the N400 time-window. Planned quadrant ANOVAs in the time-window that served as the baseline for the critical noun ERP, 500–600 ms following the presentation of the adjective, showed no significant effects involving congruity or probability (ps > .1). However, as pointed out by a reviewer, there appeared to be numerical differences between the two congruity conditions prior to the critical noun, such that the response to the adjective in the incongruous condition was more negative than in the congruous condition. Because the same adjectives appeared in both conditions of the congruity manipulation (counterbalanced across lists), this numerical difference must be spurious. However, to the extent that this spurious difference contaminates the baseline for the noun, it would act to reduce the amplitude of any N400 effects of incongruity. Therefore, in the interest of maximizing the chance of observing true N400 effects of incongruity, we chose to re-baseline all four conditions using a post-stimulus time-window of 50–150 ms in the analyses presented below. We note that the results were essentially the same using the original pre-stimulus baseline, although the N400 effect of incongruity was indeed numerically larger using the post-stimulus baseline.

Figure 4 

ERP waveforms for Predictability Contrast. ERP waveforms for 29 scalp electrodes for the predictability contrast in Experiment 3. ERPs are time-locked to the onset of the critical noun.

Figure 5 

ERP waveforms for Congruity Contrast. ERP waveforms for 29 scalp electrodes for the semantic congruity contrast in Experiment 3. ERPs are time-locked to the onset of the critical noun.

Figure 6 

Topographical distribution for Experiment 3. Scalp maps and selected electrode waveforms demonstrating measured N400 effects of predictability and congruity during the 300–500 ms time-window in Experiment 3. Scalp maps illustrate the mean difference between low probability and high probability conditions (A) and between semantically incongruous and congruous conditions (B).

Consistent with Experiments 1 and 2, in the N400 time-window we observed a large effect of probability with a central-posterior distribution, and a small effect of congruity with a leftward distribution (here without the anterior focus of Experiment 2). These differences were supported by a significant interaction between manipulation and contextual support in the omnibus quadrant analysis (F(1,27) = 8.1, MSE = 4.9, p < .05). The 3-way interaction between manipulation, contextual support and hemisphere was also significant (F(1,27) = 10.4, MSE = .8, p < .05). In the midline analysis we similarly observed a significant interaction between manipulation and contextual support (F(1,27) = 13.0, MSE = 4.7, p < .05). These significant interactions continued into the 600–800 ms time-window (all ps < .05).

We subsequently conducted analyses of variance in the probability and congruity manipulations separately, in parallel to Experiments 1 and 2. In the probability comparison, we again observed that N400 amplitude was reduced for highly probable items compared to less probable items, and that this effect had the central-posterior distribution characteristic of the N400. This resulted in a significant main effect of probability in the 300–500 ms time-window in the quadrant analysis (F(1,27) = 20.8, MSE = 6.3, p < .05) and a significant interaction between probability and anteriority (F(1,27) = 5.1, MSE = 1.2, p < .05). In the congruity comparison, the incongruous items were slightly more negative than congruous items in right hemisphere electrodes and slightly more positive than congruous items in left hemisphere electrodes, resulting in a significant interaction between congruity and hemisphere in the quadrant analysis (F(1,27) = 8.9, MSE = .8, p < .05). Although ERPs to incongruous items were numerically larger than those to congruous items in a number of posterior electrodes (e.g. O1, O2, Pz), no other contrasts including congruity yielded significant effects. In midline electrodes we observed a significant effect of probability (F(1,27) = 35.9, MSE = 5.1, p < .05) and a marginal interaction between probability and anteriority (F(1,27) = 3.6, MSE = .9, p = .07), but no significant effects involving congruity (ps > .1). In the 600–800 ms time-window, there were no significant effects involving probability in the quadrant analysis, and the only effect of congruity was a continuation of the interaction between congruity and hemisphere (F(1,27) = 6.08, MSE = 1.5, p < .05). In the midline analysis there was a significant effect of probability (F(1,27) = 7.4, MSE = 3.4, p < .05) reflecting a continuation of the N400 effect, but no significant effects involving congruity (ps > .1).

Finally, we note that based on visual inspection, the N400 difference between the high and low probability conditions appeared to onset by around 200 ms, while the N400 difference due to congruity appeared to onset substantially later. Because of the morphology of the ERP to visually presented words, N400 differences between 200–300 ms appear as an increased positive deflection in the predictable condition relative to the pre-stimulus baseline, and therefore some authors have suggested that differences in this early time-window are due to different neural generators than the later time-window, such as prediction vs. integration (for discussion see [59, 60, 61]). This might predict different onset latencies for predictability and congruity effects, and therefore based on a reviewer suggestion we conducted an exploratory 2 × 2 × 2 × 2 (manipulation × contextual support × hemisphere × anteriority) quadrant analysis in the 200–300 ms time-window. We did observe a significant interaction between manipulation and contextual support (F(1,27) = 6.4, MSE = 4.0, p < .05), which appeared to be driven by the early onset of the N400 effect in the probability manipulation and a reversal in the opposite direction (congruous more negative than incongruous) in this time-window in the congruity manipulation. However, although this result is suggestive, we do not think it can be taken as clear evidence for true differences in N400 effect onset latency because the amplitude of the congruity effect in this experiment was so much smaller than the amplitude of the predictability effect; for this a manipulation in which both effects were of similar size would be needed.

To summarize our main findings, in Experiment 3 we observed that when both the probability manipulation and the congruity manipulation were presented in the same participants as part of the same session, the pattern mirrored what was observed in separate experiments: a robust N400 effect of probability with a classic central posterior focus, and a small N400 effect of incongruity with a notable leftward shift. The distributional differences were supported by a significant 3-way interaction between manipulation, contextual support, and hemisphere. This pattern argues against an account in which all N400 effects are due to integration difficulty, but is consistent with multiple generator accounts in which N400 effects of predictability and incongruity reflect different underlying mechanisms.

General Discussion

Despite much prior evidence to the contrary, the idea that the N400 is primarily sensitive to semantic anomaly persists outside of research on language in much of cognitive psychology and cognitive neuroscience. In the current study, we used an adjective-noun paradigm that allowed us to contrast the effects of contextual predictability and semantic congruity on the N400 while precisely controlling predictability through the use of corpus counts. Across three ERP experiments, we found that predictability had a massive and reliable effect on the amplitude of the N400 component, and that semantic congruity had a much smaller effect with a somewhat different distribution. These results are inconsistent with the hypothesis that N400 effects of predictability and incongruity reflect a common underlying mechanism of semantic integration difficulty. Instead, these results are consistent with the hypothesis that N400 effects of predictability are due to lexical or conceptual access mechanisms [7, 11], and indicate that to the extent that N400 effects of incongruity are observed, they are due to different mechanisms.

Although many studies have observed larger N400 responses for semantically incongruous words relative to semantically congruous words (e.g. [6, 62, 63]), these studies have tended to systematically confound congruity and predictability, such that the congruous endings were also highly predicted by the context (see [7, 11, 40], for relevant discussion). Avoiding this confound is challenging in sentence contexts on the assumption that predictions can be distributed across several options, because even in large corpora, tokens of full sentence contexts are too sparse to yield robust corpus estimates of predictability unless numerous assumptions are made about the language model. At the same time, offline completion norming tasks are likely to return the most predictable word for a given context but may not provide accurate information about somewhat lower probability words that are still predictable enough to impact N400 amplitudes. Using a single adjective context for the critical noun in the congruity comparison here allowed us to ensure that the nouns in the semantically congruous items were in fact quite unpredictable, although they were necessarily slightly more predictable than the nouns in the semantically incongruous items. When predictability was controlled in this way, we observed relatively weak N400 effects of congruity with a somewhat more left-lateralized distribution than ‘standard’ N400 effects in the visual modality. These results are similarly consistent with recent work by DeLong et al. [25] demonstrating only a small N400 effect of semantic congruity in sentence materials for which the incongruous continuations were slightly but significantly lower in cloze probability.

One interpretation of the weak N400 effects of semantic incongruity observed here is that N400 effects are primarily driven by access processes and not integration processes, and therefore most apparent N400 responses to semantic incongruity in previous work are due instead to predictive facilitation or semantic association in the semantically congruous conditions. Even the weak N400 effects observed here could be due to the small residual differences in probability between the congruous and incongruous adjective-noun phrases. We do believe that it is extremely challenging to completely unconfound semantic congruity from these other factors, and that a number of the effects that have been attributed to integration difficulty in the prior literature may in fact have been due to facilitated access. Future work may begin to address this challenge both by systematically examining the shape of the N400-predictability function at the lower end of the predictability scale and by refining computational models of predictability in larger contexts (e.g., [64]). However, it would be too strong to conclude from our results that N400 amplitude does not or cannot be modulated by semantic incongruity, because our phrasal paradigm may not have fully engaged semantic integration processes. Although in behavioral norming we found that participants quite clearly distinguished between the congruous and incongruous phrases in judgments, in the ERP experiment with a memory probe task participants may not have interpreted the phrasal meaning. Therefore, N400 effects of incongruity might be more robust in a full-sentence paradigm that more strongly encourages semantic integration.

What our results do appear to rule out are accounts in which N400 effects of predictability are due to semantic integration difficulty. In Experiment 3, the same participants in the same session showed robust N400 effects of predictability and only weak N400 effects of semantic incongruity. One way to explain this pattern is to assume that the N400 primarily reflects access processes, which are more strongly modulated by predictability than incongruity. Alternatively, the weak effects of incongruity might be attributed to participants’ failure to integrate at all in this paradigm, but if this were true then differences in integration cannot explain the robust effects of predictability. Therefore, these data support accounts in which N400 effects of predictability are due to access mechanisms1, and leave open the possibility that other kinds of N400 effects may be due to other mechanisms.

The conclusion that N400 effects sometimes reflect access processes rather than integration processes has important consequences, because the N400 amplitude has been used by many investigators exactly in order to assess whether comprehenders have correctly integrated the input such that they recognize semantic anomalies in particular contexts. One important example is work arguing that comprehenders are ‘attracted’ to a sentence-level meaning inconsistent with the syntax in ‘role-reversal’ sentences such as The hearty meal was devouring or The cop that the thief arrested … because no N400 effect is observed when comparing these sentences with their congruous controls The hearty meal was devoured or The thief that the cop arrested … [35, 36]. The current work suggests that N400 amplitude can be determined primarily by ease of lexical/conceptual access rather than whether a semantically incongruous meaning has been computed or not. Consistent with this, other investigators have noted that (a) in many of these paradigms the critical word is likely to be primed by lexical items or event schemas, creating the potential for a floor effect on N400 amplitude [37, 38, 65, 66] and (b) N400 effects re-emerge when more time is available for predictions to be computed [39]. Although these results do not indicate that syntax-independent interpretations are not constructed, they suggest that the presence or absence of N400 effects cannot be taken to support this hypothesis.

As noted above, the current results do not resolve the question of whether semantic incongruity impacts the N400 response independent of predictability and semantic association. However, a number of previous studies using sentence contexts have reported N400 effects of semantic integration difficulty when cloze probability was relatively well-matched [23, 24, 25, 26, 67, 68, 69]. Although in each individual case one might argue that predictability or association was not perfectly controlled, taken together the results lend support to the idea that integration difficulty does modulate the N400 to some extent. If these results are taken at face value, we can consider how the two remaining accounts of the N400 (the access account and the multiple generator account) might explain them.

An account of the N400 in which it is generated only by access mechanisms would have to in some way translate differences in congruity into differences in access of stored memory representations. As an example, N400 amplitude might index only the activation of stored lexical and conceptual representations, but the process of semantic integration sometimes modulates the activation of these stored representations, for example when activating the rarely used ‘floatable’ feature of basketball is necessary to verify the sentence The child survived in the water by clinging to a basketball [40, 70, 71]. If integration processes only impact N400 amplitude through this indirect route, one might expect the impact of congruity manipulations on the N400 to be quite subtle and dependent on the extent to which the incongruous materials encourage attempts to activate less commonly used conceptual features of the word. This would suggest that further exploration of the effects of congruity on the N400 may provide new insights into how long-term memory representations are manipulated by combinatorial processes.

A multiple generator account of the N400 could simply assume that semantic incongruity effects reflect a different N400 generator from predictability or semantic association effects. This would explain the subtle differences in topographical distribution that we observed in the current study, and might also explain some of the inconsistency and variability in attempts to localize the generators of the N400 (see [12, 16] for review). Some authors have argued that the relatively long-lasting difference in the ERP response to different conditions that is traditionally referred to as the N400 effect (often ranging from ~200–550 ms) reflects the composite effect of multiple computations pushing the ERPs in different directions in different parts of the timecourse. For example, because the early part of the N400 amplitude difference appears as an increased positive deflection (relative to the pre-stimulus baseline) in the predictable condition, whereas the later part of the N400 amplitude difference appears as an increased negative deflection in the less predictable condition, it has been suggested that these differences are due to different neural generators [59, 60, 61]. Because ERPs have poor spatial resolution and it is not possible to draw strong conclusions from the absolute position of peaks in the ERP response [72], testing multiple generator accounts will require careful comparison of well-controlled N400 ‘access’ and ‘integration’ manipulations using principal components analysis or spatially sensitive techniques such as MEG and fMRI; see Helenius et al. [73] and Baumgaertner et al. [74] for some early attempts.

In conclusion, the current study showed weak effects of semantic incongruity on the N400 when predictability was held relatively constant, but showed very large N400 effects of predictability in the same paradigm when congruity was held constant. These results argue that N400 effects of predictability do not reflect integration difficulty of the kind associated with semantic incongruity, and are thus consistent with much prior work in suggesting that characterizing the N400 as a simple ‘semantic anomaly’ response is misleading. More work is needed to determine whether the N400 response directly reflects only neural activity associated with access mechanisms or whether it is a hybrid response to which integration mechanisms also contribute in appropriate contexts.